Hu Yoshida

Hyper-converged Meets Big Data Analytics

Blog post created by Hu Yoshida on Feb 3, 2016

Hyper-converged Infrastructure and Big Data Analytics are two of the biggest trends in modern data architecture today. However, both are challenged by data silos. Hitachi Data Systems can now address those challenges by merging data silos with its next generation Hyper Scale-out Platform (HSP), which now offers native integration with the Pentaho Enterprise Platform to form a turnkey big data appliance. To better understand the significance of this release, let’s examine both architectures.



Hyper-converged Infrastructure

Hyper-converged infrastructure is a software-centric architecture that tightly integrates compute, storage, networking and virtualization resources in a commodity hardware box (appliance) supported by a single vendor. IDC names vendors in this space such as Nutanix and SimpliVity, and many vendors support VMware’s EVO:RAIL; Hitachi Data Systems offers EVO:RAIL on the UCP 1000. In some cases, a distributed file system is used to manage the data, cluster multiple nodes, and scale out the shared data resources.


Big Data Analytics

Big data analytics requires processing large data sets that contain a variety of data types to discover hidden patterns or correlations that provide actionable insights for business advantage. In search of faster ways to process large amounts of data, customers are looking to new technologies like Hadoop, an open source programming framework that supports the processing of large data sets in a distributed compute environment. The Hadoop file system can support hundreds or thousands of nodes with rapid data transfer between them. Hadoop is a faster way to manage the data and offer analytics-based insight. However, it requires data to be loaded into the Hadoop file system, so if you want to analyze data that was generated on a hyper-converged infrastructure, you need to extract the data, transform it, and load it into Hadoop.
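To make the distributed-processing idea concrete, here is a minimal sketch of the MapReduce pattern that Hadoop runs across many nodes, written as plain Python. The function names and sample data are illustrative only, not part of any Hadoop API; in a real cluster the map and reduce phases would execute on different machines against data in the Hadoop file system.

```python
# Minimal MapReduce-style word count -- a sketch of the pattern Hadoop
# distributes across nodes. Names and sample data are illustrative only.
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs; Hadoop would farm this out to many nodes."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum counts per word; runs after the pairs are shuffled by key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data meets big analytics", "data lakes hold big data"]
counts = reduce_phase(map_phase(lines))
print(counts)
```

The point of the framework is that each phase can run in parallel close to the data, which is exactly why the data must first live in a file system Hadoop can reach.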


Hitachi Hyper Scale-out Platform

The Hitachi Hyper Scale-Out Platform (HSP) offers a unique architecture that lets customers do both. HSP is a common hyper-converged platform that can rapidly ingest data at scale, spin up VMs on demand to run applications that share data across all VMs, and execute Hadoop jobs directly at the data source alongside those VMs. The offerings described earlier can do one or the other, not both, which sets the Hitachi Hyper Scale-Out Platform apart. HSP is a turnkey, hyper-converged system designed to support simplified management of Hadoop and other popular big data analytics frameworks, such as Apache Spark, Cassandra and OpenStack.


Its hyper-converged packaging provides an easy-to-install platform that eliminates the complexity and setup time of big data projects. Virtualization is based on open source KVM, which enables applications to be brought to the data in a virtual machine on an HSP node. This removes the need to move data to the applications, and enables applications and analytics on different nodes to run against the same data.


HSP doesn't use the standard Hadoop Distributed File System (HDFS). Instead it uses a scale-out file system that supports the HDFS API as well as a POSIX-compliant interface. Avoiding HDFS eliminates the bottleneck and single point of failure of the NameNode, which tracks the allocation of data across the nodes. The HSP file system distributes metadata to every node so that each node knows where all the data is, and through the HDFS API the HSP file system is transparent to Hadoop. A standard POSIX-compliant file system means that analytic tools other than Hadoop can access the data stored in HSP, with no need to extract the data and move it to another storage system.
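The practical consequence of POSIX compliance can be sketched in a few lines: ordinary applications read and write the same files that a Hadoop job would address through the HDFS API, with no extract-and-load step. The mount point below is hypothetical (a temporary directory stands in for it here), and the snippet uses only standard POSIX file I/O.

```python
# Sketch: with a POSIX-compliant file system, any ordinary tool can work
# on the same files a Hadoop job sees via the HDFS API.
# A temp directory stands in for a hypothetical HSP mount point.
import os
import tempfile

hsp_mount = tempfile.mkdtemp()  # stand-in for a mount like /mnt/hsp (hypothetical)
path = os.path.join(hsp_mount, "events.log")

# Any POSIX application can write directly -- no separate ingest cluster.
with open(path, "w") as f:
    f.write("sensor-1,42\n")

# And read the data back with standard calls.
with open(path) as f:
    record = f.read().strip()

# A Hadoop job would reach the same bytes through the HDFS API
# (an hdfs:// URI mapped to this file) without copying the data anywhere.
print(record)
```

This is the "no ETL into Hadoop" argument in miniature: one copy of the data, two access paths.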


What’s New in the Next-Gen HSP? Pentaho Integration.

Our next-gen HSP offers native integration with Pentaho’s open source-based data integration and business analytics platform in a turnkey big data appliance. Pentaho provides a data integration toolset (PDI) that enables a wide variety of data sources, including transactional and unstructured data, to be transformed and “blended” together.
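To illustrate what “blending” means here, the following is a conceptual sketch of joining a transactional record set with semi-structured event data, the kind of step a PDI transformation performs visually. The data, field names, and logic are invented for illustration; this is not the PDI API, just the underlying idea in plain Python.

```python
# Conceptual sketch of "blending" transactional and semi-structured data.
# Field names and records are invented for illustration; not the PDI API.
import json

# Transactional side: structured rows, e.g. from a data warehouse.
transactions = [
    {"customer": "C1", "amount": 120.0},
    {"customer": "C2", "amount": 75.5},
]

# Semi-structured side: JSON clickstream events.
events_json = '[{"customer": "C1", "page": "/pricing"}]'
events = json.loads(events_json)
visited_pricing = {e["customer"] for e in events}

# Blend: enrich each transaction with a flag derived from the event stream.
blended = [
    dict(t, saw_pricing=(t["customer"] in visited_pricing))
    for t in transactions
]
print(blended)
```

A PDI transformation expresses the same flow as drag-and-drop steps (input, lookup, output) rather than code, across many more source types.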

James Dixon, Pentaho’s chief technology officer summed it up well in our announcement: “The HSP-Pentaho appliance gives customers an affordable, enterprise-class option to unify all their disparate datasets and workloads—including legacy applications and data warehouses—via a modern, scalable and hyper-converged platform that eliminates complexity. We’re pleased to be working with HDS to deliver a simplified, all-in-the-box solution that combines compute, analytics and data management functions in a plug-and-play, future-ready architecture. The Hitachi Hyper Scale-Out Platform 400 is a great first-step in simplifying the entire analytic process.”