Sr. SAP Basis/BI & Software Developer
SAP HANA is still very expensive for big data, so many organizations are trying to leverage Hadoop in their landscape because it runs on commodity hardware and can store huge volumes of data.
Rather than simply archiving HANA’s historical data, we can use a multi-node Hadoop cluster to store historical data, analyze it, build applications, perform machine learning and more.
In this tutorial, we are going to look at the differences between Hadoop and SAP HANA and at how to leverage the strengths of both the Apache Hadoop and SAP HANA platforms.
We will learn how to move data from SAP HANA to Hadoop and perform data visualizations with SAP Lumira.
Finally, we will review how to use Apache Spark (PySpark) to create applications using the data located in Hadoop.
- What is Hadoop?
- What is Hive & Sqoop?
- Hadoop vs SAP HANA
- Hadoop with SAP HANA & SAP Lumira
- Uploading and transforming data from SAP HANA to a Multi-node Hadoop cluster
- Importing Data From SAP HANA to Hadoop HDFS using Sqoop
- Hive as metastore DB for HDFS -> create structured data from unstructured data located in HDFS
- Moving data from HDFS to Hive
- Connecting SAP Lumira to Hive for Big Data Analysis
- Connecting PySpark to Hive for Application development, Analysis, Machine Learning and more.
What is Hadoop?
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. An application built on the Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
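To make the "simple programming models" idea concrete, here is a minimal sketch of the MapReduce pattern that Hadoop is best known for, written in plain Python and run on a single machine. It only imitates the map, shuffle and reduce steps; the function names and the sample lines are made up for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up sample input; on a real cluster each line could live on a different node.
counts = word_count(["hana hadoop hive", "hadoop sqoop", "hive hadoop"])
print(counts)
```

On a real cluster the map and reduce functions run in parallel on the nodes that hold the data, which is what lets the same small program scale to very large datasets.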
What is Hive & Sqoop?
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority.
Sqoop is a tool designed to transfer data between Hadoop and relational database servers, including SAP HANA. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
Hadoop vs SAP HANA
Using Hadoop with SAP HANA & Lumira
- Import data from HANA to Hadoop HDFS using Sqoop.
- Connect SAP Lumira to Hadoop for data visualization using Hive.
- Use PySpark & Hive to create applications on Hadoop.
Importing Data From HANA to HDFS using Sqoop
sqoop import --username <USERNAME> --password <PASSWORD> --connect jdbc:sap://<host address>:<porthadoop>/?currentschema=<SCHEMA_NAME> --driver com.sap.db.jdbc.Driver --table <TABLE_NAME> --split-by <Column Name>
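If the import is launched from a script, the command above can be assembled as an argument list instead of a hand-typed string, which avoids quoting problems with the `?` in the JDBC URL. The sketch below is one possible way to do that in Python. The helper name `build_sqoop_import_args` and all sample values (host, port, schema, table, column, credentials) are hypothetical; only the flags already shown in the command above are used.

```python
import shlex


def build_sqoop_import_args(host, port, schema, table, split_by, username, password):
    """Return the sqoop import command as a list suitable for subprocess.run()."""
    connect_url = f"jdbc:sap://{host}:{port}/?currentschema={schema}"
    return [
        "sqoop", "import",
        "--username", username,
        "--password", password,
        "--connect", connect_url,
        "--driver", "com.sap.db.jdbc.Driver",
        "--table", table,
        "--split-by", split_by,
    ]


# Hypothetical example values; replace them with those of your own landscape.
args = build_sqoop_import_args(
    host="hana.example.com",
    port=30015,
    schema="DEMO",
    table="CUSTOMER",
    split_by="CUSTOMER_ID",
    username="hana_user",
    password="secret",
)

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(args))
```

To actually run the import, the list could be passed to `subprocess.run(args, check=True)` on a machine where Sqoop and the SAP HANA JDBC driver are installed.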
Check to see if the data was uploaded to HDFS
hadoop fs -ls /user/hduser/
Hive as metastore DB for HDFS -> create structured data from unstructured data located in HDFS
- Open Hive
- Create table
- Move Data from HDFS to Hive
Creating a Table in Hive
Execute the command below in Hive to create the “customer” table.
CREATE TABLE customer (
<column definitions>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
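When several tables are imported, the table definition can be generated rather than typed by hand. The following is a minimal sketch of that idea in Python. The helper `build_create_table` and the three example columns are assumptions for illustration only, not the actual layout of the customer table. The comma and newline delimiters match the statement above and also Sqoop's default text output.

```python
def build_create_table(table, columns, field_sep=",", line_sep="\\n"):
    """Build a Hive CREATE TABLE statement for a delimited text table.

    columns is a list of (column_name, hive_type) tuples.
    """
    column_sql = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} (\n{column_sql}\n)\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{field_sep}'\n"
        f"LINES TERMINATED BY '{line_sep}';"
    )


# Hypothetical columns; use the real column list of the table imported from HANA.
ddl = build_create_table(
    "customer",
    [("customer_id", "INT"), ("name", "STRING"), ("city", "STRING")],
)
print(ddl)
```

The printed statement can be pasted into the Hive shell. The column order has to match the field order in the files that Sqoop wrote to HDFS.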
Moving data from HDFS to Hive
LOAD DATA INPATH '<filepath>' INTO TABLE <tablename>;
LOAD DATA INPATH '/user/hduser/COMERIT_DEMO/customer/part*' INTO TABLE customer;
Run Hive Server
Run HiveServer2 so that clients outside the Hadoop machine can access the Hive tables.
$HIVE_HOME/bin/hive --service hiveserver2
Connect SAP Lumira to Hive Server for Big Data Analysis
Connect Spark to Hive for Application development, Analysis, Machine Learning and more (with Python)
This is a small console application built in Python using PySpark.
The app takes a “Product ID” as input and returns similar products from the same category that have a higher “Sold Quantity” count.
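The selection rule described above can be sketched in a few lines. The example below is a plain-Python stand-in rather than the PySpark application itself: it works on an in-memory list of dictionaries so it runs without a cluster. The field names (`product_id`, `category`, `sold_quantity`) and the sample catalog are invented for illustration.

```python
def similar_products(products, product_id):
    """Return products from the same category that sold more than the given product.

    Results are sorted by sold quantity, highest first. An unknown product ID
    yields an empty list.
    """
    target = next((p for p in products if p["product_id"] == product_id), None)
    if target is None:
        return []
    matches = [
        p for p in products
        if p["category"] == target["category"]
        and p["sold_quantity"] > target["sold_quantity"]
    ]
    return sorted(matches, key=lambda p: p["sold_quantity"], reverse=True)


# Made-up sample data standing in for the Hive table.
catalog = [
    {"product_id": 1, "category": "laptops", "sold_quantity": 40},
    {"product_id": 2, "category": "laptops", "sold_quantity": 75},
    {"product_id": 3, "category": "laptops", "sold_quantity": 10},
    {"product_id": 4, "category": "phones", "sold_quantity": 90},
    {"product_id": 5, "category": "laptops", "sold_quantity": 55},
]

result = similar_products(catalog, 1)
print([p["product_id"] for p in result])  # prints [2, 5]
```

With PySpark, the same two conditions could be expressed as DataFrame filters on a table read with `spark.table(...)`, using a `SparkSession` created with `enableHiveSupport()` so that the Hive tables are visible to Spark.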