Chapter 1. Introduction

Table of Contents

1.1. Overview
1.2. Key Features
1.3. Supported Hadoop and HBase Versions

This chapter describes the concepts and features of the Tibero Hadoop Connector.

1.1. Overview

Hadoop is an open-source framework from the Apache Software Foundation for storing and processing large amounts of data in a distributed, parallel manner.

Hadoop includes the following software components.

  • HDFS (Hadoop Distributed File System)

    A distributed file system that provides failure recovery and high availability through data block replication across nodes.

  • MapReduce

    A distributed programming framework that automatically parallelizes tasks expressed as Map and Reduce phases. It distributes work across the resources of multiple nodes and provides failure recovery.
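    The Map/Reduce model can be sketched in plain Python. This is an illustration of the programming model only, not Hadoop code; a real Hadoop job is written against the Hadoop MapReduce API, typically in Java, and the function names below are invented for the example:

    ```python
    from itertools import groupby
    from operator import itemgetter

    def map_phase(record):
        # Map: emit a (word, 1) pair for each word in one line of input.
        for word in record.split():
            yield (word.lower(), 1)

    def reduce_phase(key, values):
        # Reduce: sum all counts collected for a single word.
        return (key, sum(values))

    def run_job(records):
        # Single-process stand-in for Hadoop's shuffle-and-sort step:
        # gather all mapper output, group by key, reduce each group.
        mapped = [pair for record in records for pair in map_phase(record)]
        mapped.sort(key=itemgetter(0))
        return dict(
            reduce_phase(key, (count for _, count in group))
            for key, group in groupby(mapped, key=itemgetter(0))
        )

    counts = run_job(["big data", "big cluster"])
    # counts == {"big": 2, "data": 1, "cluster": 1}
    ```

    Even for a job this simple, the logic must be written as code and submitted to the cluster, which is the programming burden that an SQL interface removes.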

  • HBase

    A NoSQL database that supports a semi-structured data model, allowing an unlimited number of columns to be added. HBase uses column-oriented storage and in-memory tables to enhance data storage and query performance. As a logical layer built on top of HDFS, it also provides protection against data loss and failure recovery.

In summary, Hadoop is a system for high-performance data storage and processing. The number of businesses using Hadoop is increasing rapidly as data volumes grow exponentially. However, Hadoop requires a MapReduce program for data processing, which imposes a heavy programming burden when creating the various queries needed for data analysis. It also lacks an interactive SQL interface with immediate feedback, so users must write code to obtain the desired results.

There are also many cases that require multiple types of data sources to store various data formats. In such cases, unstructured data is stored in HDFS, semi-structured data in HBase, and structured data in an existing RDBMS. Combining these heterogeneous data sources to analyze a legacy database and big data together dramatically increases data processing complexity.

1.2. Key Features

Tibero Hadoop Connector is a solution that satisfies big data processing requirements as well as the need for heterogeneous data source integration and a convenient interface.

The following Tibero Hadoop Connector features are provided to supplement Hadoop.

  • Provides an External Table interface to process data in HDFS and HBase together with data in RDBMS tables.

  • The External Table interface reduces the inconvenience of data migration.

  • Supports all query functions of Tibero.

  • Supports data integration functions such as table joins between HBase and Tibero tables.

  • Supports DML on HBase tables.

Data in Hadoop can be combined with data in Tibero in a single query using ANSI SQL. The access interface between Tibero and Hadoop HDFS or HBase is unified in SQL, which reduces the burden of using heterogeneous data sources. Using SQL to express various queries as fast-changing data analysis needs arise enables a rapid data analysis process.

Tibero Hadoop Connector uses the External Table function to access data, so queries can be performed on various data formats just as on structured data. Various functions, including the query processing functions provided by Tibero InfiniData, can also be used with data in Hadoop. For HBase, the Tibero Hadoop Connector supports creating a semi-structured data schema with an unlimited number of columns, which is not provided in a legacy RDBMS, and supports SELECT queries as well as data updates through DML.
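The kind of single-query integration described above can be illustrated with plain ANSI SQL. The sketch below uses Python's built-in sqlite3 module purely as a runnable stand-in for a SQL engine; the table and column names are invented for the example. In a real deployment, `web_logs` would be an External Table mapped onto HDFS or HBase data, `customers` would be an ordinary Tibero table, and the exact External Table DDL would follow the Tibero manual rather than this sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical stand-ins: in Tibero, web_logs would be an External Table
# over HDFS/HBase data, and customers a regular RDBMS table.
cur.execute("CREATE TABLE web_logs (customer_id INTEGER, url TEXT)")
cur.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO web_logs VALUES (?, ?)",
                [(1, "/home"), (1, "/cart"), (2, "/home")])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Kim"), (2, "Lee")])

# One ANSI SQL query joining both sources, with no MapReduce code.
cur.execute("""
    SELECT c.name, COUNT(*) AS visits
    FROM web_logs l
    JOIN customers c ON c.customer_id = l.customer_id
    GROUP BY c.name
    ORDER BY c.name
""")
rows = cur.fetchall()
# rows == [("Kim", 2), ("Lee", 1)]
```

The point of the sketch is that the join, grouping, and aggregation are expressed declaratively in one statement, regardless of where each table's data physically resides.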

In summary, Tibero Hadoop Connector enables easy integrated analysis of data in Hadoop and RDBMS. Such agile big data analysis functionality can help to quickly respond to the rapidly changing business environment.

1.3. Supported Hadoop and HBase Versions

Tibero HDFS Connector and HBase Connector support only the Linux OS.

The HDFS Connector supports Hadoop 1.2.X versions, and the HBase Connector supports HBase 0.94.