Posted on September 26, 2017 under SAP Lumira, SAP HANA

By:
Arman Avetisyan
Sr. SAP Basis/BI & Software Developer
aavetisyan@comerit.com

 

SAP HANA is still very expensive for big data, so many organizations are trying to leverage Hadoop in their landscape, because it runs on commodity hardware and can store huge volumes of data.

Rather than simply archiving HANA’s historical data, we can use a multi-node Hadoop cluster to store historical data, analyze it, build applications, perform machine learning, and more.

In this tutorial, we are going to look at the differences between Hadoop and SAP HANA and at how to leverage the strengths of both platforms.

We will learn how to move data from SAP HANA to Hadoop and perform data visualizations with SAP Lumira.

Finally, we will review how to use Apache Spark (PySpark) to create applications using the data located in Hadoop.

Overview
  • What is Hadoop?
  • What is Hive & Sqoop?
  • Hadoop vs SAP HANA
  • Hadoop with SAP HANA & SAP Lumira
  • Uploading and transforming data from SAP HANA to a Multi-node Hadoop cluster
  • Importing Data From SAP HANA to Hadoop HDFS using Sqoop
  • Hive as metastore DB for HDFS -> create structured data from unstructured data located in HDFS
  • Moving data from HDFS to Hive
  • Connecting SAP Lumira to Hive for Big Data Analysis
  • Connecting PySpark to Hive for Application development, Analysis, Machine Learning and more.
 
What is Hadoop?

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application runs in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.
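To make the "simple programming models" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps that Hadoop MapReduce spreads across a cluster. It is plain Python rather than Hadoop API code, and the function names and sample records are made up for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, value) pairs: here, (country, 1) for each customer record."""
    for record in records:
        yield record["country"], 1


def shuffle_phase(pairs):
    """Group values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values collected for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(records):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the demo system.
    sample = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]
    print(count_by_country(sample))
```

On a real cluster the map and reduce functions run in parallel on the nodes that hold the data, which is what lets the same model scale from one server to many.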

 
What is Hive & Sqoop?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority.

Sqoop is a tool designed to transfer data between Hadoop and relational database servers, including SAP HANA. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.

 
Hadoop vs SAP HANA
[Figure: SAP HANA vs Hadoop]

 

Using Hadoop with SAP HANA & Lumira
  • Import data from HANA to Hadoop HDFS using Sqoop.
  • Connect SAP Lumira to Hadoop for data visualization using Hive.
  • Use PySpark & Hive to create applications on Hadoop.

[Figure: Using Hadoop with SAP HANA & Lumira]

 
Importing Data From HANA to HDFS using Sqoop

sqoop import --username <USERNAME> --password <PASSWORD> --connect jdbc:sap://<host address>:<port>/?currentschema=<SCHEMA_NAME> --driver com.sap.db.jdbc.Driver --table <TABLE_NAME> --split-by <Column Name>

[Figure: Importing data from SAP HANA to HDFS using Sqoop]
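When the import has to be scripted, it can help to assemble the Sqoop arguments in code instead of editing one long command line. The sketch below builds the same command as above as an argument list. The function name and the sample host, port, schema, table, and user values are placeholders for illustration, and the optional --target-dir argument is an addition that the command above does not use.

```python
import shlex


def build_sqoop_import(host, port, schema, table, split_by,
                       username, password, target_dir=None):
    """Return the sqoop import command as a list of arguments."""
    args = [
        "sqoop", "import",
        "--username", username,
        "--password", password,
        "--connect", f"jdbc:sap://{host}:{port}/?currentschema={schema}",
        "--driver", "com.sap.db.jdbc.Driver",
        "--table", table,
        "--split-by", split_by,
    ]
    if target_dir is not None:
        # Optional: write the import to an explicit HDFS directory.
        args += ["--target-dir", target_dir]
    return args


if __name__ == "__main__":
    # All values below are placeholders, not settings from the demo system.
    cmd = build_sqoop_import("hana.example.com", 30015, "MY_SCHEMA",
                             "CUSTOMER", "CUSTOMER_NUMBER",
                             "MY_USER", "MY_PASSWORD")
    # Print a shell-safe version of the command for review.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run on a machine where Sqoop and the SAP HANA JDBC driver are installed. Keeping the arguments as a list avoids shell quoting problems with the connection string.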

 

Check to see if the data was uploaded to HDFS

hadoop fs -ls /user/hduser/

 
Hive as metastore DB for HDFS -> create structured data from unstructured data located in HDFS
  • Open Hive
  • Create table
  • Move Data from HDFS to Hive
 
Creating a Table in Hive

Execute the command below in Hive to create the “customer” table.

CREATE TABLE customer (
customer_number INT,
customer_name STRING,
city STRING,
valid_to STRING,
sales_organization STRING,
country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
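The table definition tells Hive how to read each line of the imported text files: split on commas, then interpret the pieces as the declared column types. The sketch below mimics that step in plain Python for a single line. It is only an approximation of Hive's behavior (values that cannot be read as an integer, or columns missing at the end of a line, come back as None, much as Hive returns NULL), and the sample row is invented.

```python
from typing import NamedTuple, Optional


class Customer(NamedTuple):
    """One row of the customer table, in the column order of the DDL."""
    customer_number: Optional[int]
    customer_name: Optional[str]
    city: Optional[str]
    valid_to: Optional[str]
    sales_organization: Optional[str]
    country: Optional[str]


def parse_customer_line(line, field_sep=","):
    """Split one delimited line and map it onto the customer columns."""
    fields = line.rstrip("\n").split(field_sep)
    # Pad missing trailing columns with None and ignore any extra fields.
    fields = (fields + [None] * 6)[:6]
    try:
        number = int(fields[0])
    except (TypeError, ValueError):
        # A value that is not an integer is treated like a NULL.
        number = None
    return Customer(number, *fields[1:])


if __name__ == "__main__":
    # Invented sample line in the comma-delimited layout described above.
    print(parse_customer_line("1001,Acme Corp,Boston,99991231,1000,US\n"))
```

Note that a simple split like this breaks if a field value itself contains a comma, which is also a limitation of the delimited row format used in the table definition.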

 
Moving data from HDFS to Hive

LOAD DATA INPATH '<filepath>' INTO TABLE <tablename>;

Example:

LOAD DATA INPATH '/user/hduser/COMERIT_DEMO/customer/part*' INTO TABLE customer;
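The part* pattern in the example selects only the data files in the import directory. The short sketch below shows the idea with Python's fnmatch module, which behaves like the HDFS glob for a simple pattern such as this one. The file names in the listing are typical of what a Sqoop import writes (part-m-00000 and so on, plus a _SUCCESS marker), but they are illustrative and were not captured from the demo cluster.

```python
from fnmatch import fnmatchcase


def files_matching(names, pattern="part*"):
    """Return the file names that the glob pattern would select."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Illustrative directory listing, assumed rather than observed.
    listing = ["_SUCCESS", "part-m-00000", "part-m-00001"]
    print(files_matching(listing))
```

Keep in mind that LOAD DATA INPATH moves the matched files into the Hive warehouse directory, so they no longer appear at the original HDFS path afterwards.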

 
Run Hive Server

Start HiveServer2 so that clients outside the Hadoop host, such as SAP Lumira, can access the Hive tables.

$HIVE_HOME/bin/hive --service hiveserver2
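Before pointing a client at the server, it is worth confirming that HiveServer2 is actually accepting connections. The sketch below is a generic TCP check written for this purpose, not part of Hive. It assumes the default HiveServer2 Thrift port, 10000, and "localhost" stands in for the real server address.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # 10000 is the default HiveServer2 port; adjust if it was changed.
    print(is_port_open("localhost", 10000))
```

A True result only shows that something is listening on the port. It does not verify credentials or that the Hive tables are readable.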

 
 
Connect SAP Lumira to Hive Server for Big Data Analysis

[Figure: Connecting SAP Lumira to Hive server for big data analysis]

 
Connect Spark to Hive for Application development, Analysis, Machine Learning and more (with Python)

This is a small console application built in Python using PySpark.

The app takes a “Product ID” as input and returns similar products from the same category that have a higher “Sold Quantity” count.
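The downloadable source is not reproduced here, so the following is only a sketch of how such a lookup could be written, not the author's implementation. It reads "similar" as "same category, with a sold quantity higher than the input product's". The table name (products) and the column names (product_id, category, sold_quantity) are assumptions, as is the choice to treat product IDs as strings.

```python
def similar_products(rows, product_id):
    """Return same-category products that sold more than product_id."""
    target = next((r for r in rows if r["product_id"] == product_id), None)
    if target is None:
        return []
    matches = [
        r for r in rows
        if r["category"] == target["category"]
        and r["product_id"] != product_id
        and r["sold_quantity"] > target["sold_quantity"]
    ]
    # Best sellers first.
    return sorted(matches, key=lambda r: r["sold_quantity"], reverse=True)


def load_rows_from_hive(table_name="products"):
    """Read a Hive table through Spark; needs pyspark and a Hive metastore."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = (SparkSession.builder
             .appName("similar-products")
             .enableHiveSupport()
             .getOrCreate())
    # collect() pulls every row to the driver: fine for a demo table only.
    return [row.asDict() for row in spark.table(table_name).collect()]


def main():
    """Console loop: ask for a product ID and print the matches."""
    rows = load_rows_from_hive()
    wanted = input("Product ID: ")  # assumes IDs are stored as strings
    for r in similar_products(rows, wanted):
        print(r["product_id"], r["category"], r["sold_quantity"])
```

Calling main() on a machine with PySpark configured against the Hive metastore runs the console prompt. For a large table, the filtering would be better expressed as a Spark query so that it runs on the cluster instead of on the driver.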

Download the source code

Written by Arman Avetisyan

Arman is an experienced SAP BI, Big Data, and software developer whose combined development and SAP skills allow him to offer unique value to his clients. Comerit has been proud to have Arman as a part of our team since 2015.
