Analysts, software companies and even some end users are convinced that each microgram of data counts. It must be retrieved from wherever it “lives”, no matter what it looks like. Then it has to be housed for the long term in a place that guarantees the best conditions for analysis.
There are, of course, several real life targets for that, such as Smart City, Smart Factory, Predictive Maintenance, Intelligent Vehicle, and Precision Agriculture. Some may sound futuristic whereas others are already part of our lives.
From a terminology perspective we have to deal with: Natural Language Processing, Sentiment Analysis, IoT, Data Streaming, Event Processing, Machine Learning, Predictive Analytics… and of course, “Big Data”.
To achieve that, on the technical side we need tools and storage. In the SAP world we have both, of course: HANA (and IQ). There is nothing to fault in the quality of the tools, but the cost of storage itself is daunting: as you remember, we work (mostly) in memory! So the idea is to put the huge amounts of data into a cheaper place that is tightly integrated with the SAP world. This place is Hadoop.
In short, Hadoop is an open-source (~free) Apache project for storing and processing large amounts of data. The technical inspiration came from Google, which published its MapReduce paper in 2004. As for the less technical part, namely how the name came to be: at the time, the son of Doug Cutting (Hadoop’s creator) had a toy elephant named Hadoop.
SAP use cases
SAP supports Hadoop in some technical and business scenarios.
The DWF v1.0 (Data Warehousing Foundation) is a set of tools for data management. Among these tools, the DLM (Data Lifecycle Manager) is able to move the data according to its “temperature”. Cold data, i.e. infrequently used, can be stored according to predefined rules in a “cold store” that can be IQ, HANA Dynamic Tiering engine, as well as Hadoop.
The red arrows and forms show the data move which is customized (“what to move and where”) and scheduled (“when to move”) in HANA in an application running on top of the internal application server (the XS engine). In this example data is moved from HANA to Hadoop.
The black arrows show the path of an SQL query fetching data. This happens through a “Union View” which makes the underlying physical structures transparent for the application.
This mechanism is purely database-oriented. In other words, it is a table-oriented tool with no application-level intelligence, meaning that it is not possible to ship purchase orders (for example) from HANA to Hadoop. A purchase order is a business object spread over a number of tables. To be able to ship it from one place to another, we need knowledge similar to that of the archiving process (relationships between tables and even other related objects).
Hopefully the data aging process of S/4HANA will be able to use it in future versions.
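The table-level move and the “Union View” described above can be pictured with a small sketch. It is only an analogy, assuming two SQLite tables stand in for the hot store (HANA) and the cold store (Hadoop); the table names, the `move_cold_rows` helper and the year-based rule are hypothetical and are not part of DLM.

```python
import sqlite3

# Two in-memory tables stand in for the hot and cold stores (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_hot  (id INTEGER PRIMARY KEY, sale_year INTEGER, amount REAL);
    CREATE TABLE sales_cold (id INTEGER PRIMARY KEY, sale_year INTEGER, amount REAL);
    -- The union view hides the physical split from the application.
    CREATE VIEW sales_all AS
        SELECT id, sale_year, amount FROM sales_hot
        UNION ALL
        SELECT id, sale_year, amount FROM sales_cold;
""")
conn.executemany(
    "INSERT INTO sales_hot VALUES (?, ?, ?)",
    [(1, 2012, 10.0), (2, 2013, 20.0), (3, 2016, 30.0), (4, 2017, 40.0)],
)


def move_cold_rows(conn, cutoff_year):
    """Relocate rows older than cutoff_year from the hot to the cold table."""
    with conn:  # copy and delete inside a single transaction
        conn.execute(
            "INSERT INTO sales_cold SELECT * FROM sales_hot WHERE sale_year < ?",
            (cutoff_year,),
        )
        cursor = conn.execute(
            "DELETE FROM sales_hot WHERE sale_year < ?", (cutoff_year,)
        )
    return cursor.rowcount


total_before = conn.execute("SELECT SUM(amount) FROM sales_all").fetchone()[0]
moved = move_cold_rows(conn, cutoff_year=2015)
total_after = conn.execute("SELECT SUM(amount) FROM sales_all").fetchone()[0]
hot_rows = conn.execute("SELECT COUNT(*) FROM sales_hot").fetchone()[0]
```

The query against `sales_all` returns the same total before and after the move, which is the point of the view: the application does not need to know where a row physically lives. Note that the rule works row by row on one table, with no notion of a business object.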
SAP CAR (Customer Activity Repository), an application aimed at the retail industry, is also among the processes and applications for which SAP has documented a way to couple it to Hadoop. One of the functions of SAP CAR is to integrate the sales data from the stores for aggregation and then to send the aggregated results to the ERP (sales orders, goods movements, and so on).
The non-aggregated remaining data could certainly be reused, for example for behavior analysis and forecasting purposes. The problem is of course the volume. And here comes Hadoop again.
SAP explains that there are two ways to ship and hold CAR data on Hadoop:
- Transfer data from CAR to Hadoop with an SAP-provided report. Currently, this option (table content aging report) does not have a lot of documentation, except this explanation on the SAP website.
- Use SDA (Smart Data Access) and create, for example, the TLOGF table (one of the biggest) on Hadoop. More on that can be found in the “Quickstart HDA for CAR 2.0 FP2” guide.
Of course, all other trendy processes and applications dealing with big data and running on SAP HANA or other SAP engines can take advantage of Hadoop. Below is an example combining the Complex Event Processing platform with its sources (IoT, Sentiment Analysis) and outputs.
Almost every data stream, whatever its nature, can go through an event processing engine for real-time analysis (for computing KPIs & raising Alerts for example). This can be sensor data from factories, trains, football players, or it can be discussion flows on Twitter.
Here too, the question is the same as for retailers with CAR: what to do with this data, which could potentially contain important information? And the answer could be the same: Hadoop.
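To make the idea of computing a KPI and raising alerts on a stream more concrete, here is a minimal sketch in plain Python. It does not use SAP CEP or any streaming product; the `monitor` function, the window size, the threshold and the sample readings are all hypothetical values chosen for illustration.

```python
from collections import deque


def monitor(stream, window_size, threshold):
    """Yield (timestamp, kpi) whenever the moving average exceeds threshold."""
    window = deque(maxlen=window_size)  # keeps only the most recent values
    for timestamp, value in stream:
        window.append(value)
        kpi = sum(window) / len(window)
        # Only evaluate the rule once the window is completely filled.
        if len(window) == window_size and kpi > threshold:
            yield timestamp, kpi


# Hypothetical (timestamp, temperature) events from a single sensor.
readings = [(0, 70.0), (1, 72.0), (2, 95.0), (3, 99.0), (4, 101.0), (5, 71.0)]
alerts = list(monitor(readings, window_size=3, threshold=90.0))
alert_times = [timestamp for timestamp, _ in alerts]
```

The engine only keeps the current window in memory. Everything that has already passed through it is lost unless it is persisted somewhere, which is where a store such as Hadoop comes in.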
What is Hadoop (more than a Google smelly toy)?
The heart of Hadoop has four major components:
- Entry level servers.
- A distributed filesystem (HDFS) spreading data across the cluster in a redundant way.
- A resource negotiator (YARN) handling the workload distribution on the resources.
- A Java framework (MapReduce) enabling application development on top of the Hadoop cluster.
The next picture depicts that minimalistic Hadoop landscape from above in the center of a large ecosystem. It is not possible to represent all the members but here are some major ones:
It is important to note that most of these tools have a functional counterpart in the SAP world (e.g. Graph Processing and Machine Learning are integrated into HANA, as are a workflow engine and a scheduler; other tools compete with SAP CEP, Data Services and Lumira).
Other “semi-free” products like Pentaho can also cover several aspects around Hadoop like data integration and analytics as well as act like a bridge to other ecosystems (SAP, MongoDB…). It is “semi-free” because some of the tools need to be purchased on a subscription model (see for example the Pentaho Wikipedia page).
All of these tools are covered by plenty of literature and are Apache (sub-)projects in their own right, so we won’t talk about each of them. YouTube and Wikipedia are good entry points to learn more about them. However, we cannot talk about Hadoop without saying a few words about the frameworks, and MapReduce in particular (all the more so since the goal is to discuss SAP Vora later on). Let’s try to understand its mechanism using an example.
The most common MapReduce example is the Word Count program. It is shipped among the example programs when you install Hadoop. The word count program tells you—for a given input file—how many times you see each word.
Here is how it works:
This is a simplified version because words were replaced by single letters and some intermediate steps (sort & shuffle) are not represented.
Programs running in the MapReduce framework have two procedures: map() and reduce(). These procedures are distributed and run in parallel in the Hadoop cluster:
- The Map procedures will filter and sort the input and write output files containing “key,value pairs”. Here it associates each letter with 1. These Map output files are used by Reduce procedures as input files.
- The Reduce procedures will aggregate the Map output and produce a result file.
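The two procedures can be sketched in a few lines of plain Python. This is a single-process illustration of the principle, not the Java program shipped with Hadoop; the function names and the three input splits are made up for the example, and the shuffle step that the simplified picture leaves out is shown explicitly.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (key, 1) pair for every letter found in one input split."""
    return [(letter, 1) for letter in split.split()]


def shuffle(pairs):
    """Shuffle & sort: group the values of all map outputs by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: aggregate all the values collected for one key."""
    return key, sum(values)


# Each string plays the role of one input split handled by one mapper.
splits = ["a b c", "c a a", "b c c"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster each call to `map_phase` and `reduce_phase` would run on a different node, and the lists passed between the phases would be files.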
What does not appear in the diagram is that all the intermediate results are written to and read from files, which makes MapReduce programs I/O intensive. One answer to that problem is Spark, another framework that also runs on top of Hadoop (HDFS & YARN).
- Spark uses memory rather than files for intermediate storage. Where MapReduce defines a “small” (~100 MB) buffer for intermediate storage and writes to file when the buffer overflows, Spark relies on operating system memory management mechanisms. Data is written to virtual memory, meaning that the operating system decides whether to put it into RAM or swap.
- MapReduce systematically reads all the data from the input file and then starts working. Spark starts processing only once it knows what kind of result is expected, so for example it can filter the input file and fetch only the relevant lines.
- Spark is not bound to YARN & HDFS. It supports other cluster managers (Mesos) and also has its own. The same goes for the distributed filesystem: you can also use Amazon S3, for example.
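The “process only once the expected result is known” behaviour is lazy evaluation. The sketch below imitates it with plain Python generators rather than the Spark API, so it is an analogy only; the `counted_reader` helper and the sample data are hypothetical.

```python
from itertools import islice

stats = {"lines_read": 0}


def counted_reader(lines):
    """Yield lines one by one while counting how many were actually read."""
    for line in lines:
        stats["lines_read"] += 1
        yield line


# Hypothetical input file: 1000 lines of "sensor<id>,<value>".
data = [f"sensor{i % 3},{i}" for i in range(1000)]

# "Transformation": describing the filter reads nothing yet.
pipeline = (line for line in counted_reader(data) if line.startswith("sensor1"))
read_before_action = stats["lines_read"]

# "Action": asking for two results pulls only as many lines as needed.
first_two = list(islice(pipeline, 2))
read_after_action = stats["lines_read"]
```

Only a handful of the 1000 lines are touched, because the work is driven by the result that was requested rather than by the size of the input.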
On the “functional” side, Spark differs also from MapReduce:
- Spark natively includes some functions that have to be installed separately in a MapReduce context, such as Machine Learning, Streaming and Graph Processing. Graph processing, for example, benefits from running in Spark in terms of I/O, because this kind of processing produces a lot of intermediate results. Better to keep them in memory.
- The initial Spark user interface is a shell (three are available: Scala, Python & R), which means that you don’t have to undertake complex development. The same spirit can be found on the MapReduce side if you decide to install and use Hive.
- Of course some tools can work with both: Oozie, Avro, Parquet.
When SAP enters the ring
For a couple of years (2001-2002), the trend was e-commerce and the underlying J2EE application servers. SAP acquired one of these Java application server vendors: IN-Q-MY (with CEO Shai Agassi – do you remember?). The interesting part was when SAP “opened” the J2EE server and modified it (by developing closer integration with the ABAP engine and adding table buffering capabilities) and named its whole technical layer (ABAP & Java) the Web Application Server (WAS).
Similarly to this “Java adventure”, now that the trend is big data & IoT, SAP is on its way towards Hadoop and comes with “Vora” in its luggage. Or, according to SAP AG, “HANA Vora” even if it does not sit on HANA but is integrated into Hadoop Spark. Vora contains SAP developments and even third party tools since Vora 1.2: HashiCorp Consul replacing (?) Zookeeper functionalities from earlier releases. SAP modelled the Java engine to fit its needs and now the same is happening to Hadoop.
Here is a high-level view of HANA and Vora.
Note: The arrow depicting the relationship goes in both directions:
- Vora can access HANA (HANA is seen as a data source accessible using Spark SQL); and
- HANA can access Vora (via Spark Adapter/Controller).
For a complete overview of HANA <-> Hadoop integration, here is a link to SAP Online Help.
See also the Vora developers guide for more details on how to access HANA data from Vora.
What can I do with Vora? The answer in version 1.2 is: “SAP HANA Vora enables OLAP analysis of Hadoop data, through data hierarchy enhancements in SparkSQL.”
Here is an example on the OLAP-way of seeing data when it is organized in hierarchies, thanks to Vora and HANA (the business related technical terms are in French, but it isn’t critically important for general understanding).
- On the left hand side, we have a train (most probably the common ancestor of TGV & Shinkansen) with sensors sending raw data to Hadoop.
- On the right hand side, there is an application running in HANA which has the knowledge of the Bill of Material of our train. Could be an MRO application.
- Both of these worlds have to be combined (joined) to obtain useful information. With hierarchical enrichment of sensor data we are able to:
- Raise an alert only if both thermometers are giving extreme values, because we know they are on the same hierarchy level. If only one shows a critical value, we have to investigate further to find out whether the problem lies with one thermometer, and maybe also with the train.
- Anticipate which parent components will fail in case of a child failure, or the other way round.
- And more.
This is possible because Vora knows how to deal with hierarchies. They are integrated in a normalized form to Vora (cf. table in the picture) and can be queried with special functions like level(u), is_root(u), is_child(u,v), is_ancestor(u,v)… More on that in the developers guide.
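To give a feel for what such functions compute, here is a plain-Python sketch over a normalized parent table. It is not Vora code: the bill of material, the semantics given to each function (for example the root counted as level 1, `is_child(u, v)` meaning “u is a direct child of v”) and the `should_alert` helper are assumptions made for the illustration; the developers guide remains the reference for the real definitions.

```python
# Hypothetical bill of material stored as node -> parent (None marks the root).
PARENT = {
    "train": None,
    "car_1": "train",
    "bogie_1": "car_1",
    "thermo_a": "bogie_1",
    "thermo_b": "bogie_1",
}


def is_root(u):
    """True if u has no parent."""
    return PARENT[u] is None


def level(u):
    """Depth of u, counting the root as level 1 (assumed convention)."""
    depth = 1
    while PARENT[u] is not None:
        u = PARENT[u]
        depth += 1
    return depth


def is_child(u, v):
    """True if u is a direct child of v."""
    return PARENT[u] == v


def is_ancestor(u, v):
    """True if u lies on the path from v's parent up to the root."""
    node = PARENT[v]
    while node is not None:
        if node == u:
            return True
        node = PARENT[node]
    return False


def should_alert(readings, threshold):
    """Alert only if all sensors share one parent and all exceed threshold."""
    parents = {PARENT[sensor] for sensor in readings}
    return len(parents) == 1 and all(v > threshold for v in readings.values())
```

With this table, `should_alert({"thermo_a": 95.0, "thermo_b": 97.0}, 90.0)` is true, while a single extreme thermometer does not trigger the alert, which mirrors the rule described for the train example.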
Check also videos on this subject from the HANA Academy, available on YouTube. Here is the first in a series of 3 videos.
According to the “DMM200 – SAP HANA Vora: Overview, Architecture, Use Cases, and Roadmap” session at TECHED 2016, release 1.3 of Vora should also incorporate a time series engine and a graph engine, among other things. These features already exist in HANA, so when running both HANA and Vora, the question will be: “On which side should I run my calculations?”
Vora from a technician’s perspective (install, operate, size…)
Installation. Vora is an SAP product. In the SAP context, installation tasks have strong guidelines. Currently, three Hadoop distributions are certified by SAP: MapR, Hortonworks and Cloudera. Check note 2213226 - Prerequisites for installing SAP HANA Vora: Operating Systems and Hadoop Components.
Operations. Administration of the Hadoop cluster is done with tools like Ambari. For monitoring, Ganglia is a good candidate.
Data backup and safeguarding. Production data in a Hadoop cluster can have the same criticality as in an ERP. Traditional backup tools exist with Hadoop agents (e.g. Commvault). Another solution to safeguard data is to create a replicated cluster and use the Hadoop native Apache DistCP2 tool.
Development. In a Hadoop environment, development has the same importance as anywhere else in the IT world. This means that versioning, deployment and overall organization must rely on robust processes and tools. Here we are talking about tools like Git, Jenkins & Maven as well as home-grown scripts.
Regarding landscape and sizing, SAP as well as each component comes with recommendations.
The figures here are initial recommendations taken from installation and sizing guides. In addition to the initial guidelines, SAP gives also formulas for more precise estimation.
Here is a starting point:
In the SAP world we are used to getting fairly accurate sizing from the SAP Quicksizer tool. There is no such tool in the Hadoop world. The best recommendation is to run sizing benchmarks with a significant amount of data, in order to make the best possible extrapolations.
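As an illustration of extrapolating from benchmarks, the sketch below fits a straight line through a few measurements and projects it to a target volume. Everything in it is hypothetical: the benchmark figures are invented, the `fit_line` helper is not an SAP formula, and a linear model is only an assumption that real measurements would have to confirm.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented benchmark results: (source data in GB, disk used in the cluster in GB).
benchmarks = [(100, 320), (200, 620), (400, 1220)]

slope, intercept = fit_line(benchmarks)
target_data_gb = 10_000
estimated_disk_gb = slope * target_data_gb + intercept
```

The larger and more representative the benchmark volumes, the less the projection depends on the assumption of linearity.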
Here are three examples of hardware hosting Hadoop clusters:
Hadoop runs well on Raspberry Pi. I have some doubts regarding Spark & Vora.
Christian Lindholm is a leading Technical SAP Architect at oXya’s headquarters in Paris, France. Joining oXya in 2008, Christian has nearly 20 years of experience in technical SAP roles, starting as an SAP Basis Admin and progressing to one of oXya’s leading Technical SAP Architects around the world. In his role, Christian works with SAP customers around the world to design, optimize and implement complex solutions, to serve customers’ unique needs.
oXya, a Hitachi Group Company, is a technical SAP consulting company, established in 1998. oXya provides ongoing managed services (outsourcing) for enterprises around the world. In addition, oXya helps customers that run SAP with various projects, such as upgrades and migrations, including migrations to SAP HANA. oXya currently employs ~700 SAP experts, who service more than 330 enterprise customers with hundreds of thousands of SAP users around the world.