Chapter 2. HDFS Connector

Table of Contents

2.1. HDFS External Table Creation
2.2. Querying with HDFS Connector

This chapter describes how to use the Tibero HDFS Connector.

2.1. HDFS External Table Creation

The syntax for creating an external table using the HDFS Connector is the same as the general syntax for creating an external table.

hdfs://[HDFS Namenode IP or hostname][:Port]/[File Path]

In the previous syntax, Port is optional and 8020 is its default value.

The following is an example of defining an external table using the HDFS Connector.

CREATE DIRECTORY TMP_DIR AS '/home/tibero/';

CREATE TABLE HDFS_EXT (
    ID NUMBER(10),
    TEXT VARCHAR(64)
) ORGANIZATION EXTERNAL (
    DEFAULT DIRECTORY TMP_DIR
    ACCESS PARAMETERS (
        LOAD DATA INTO TABLE HDFS_EXT
        FIELDS TERMINATED BY '|'
        (ID, TEXT)
    ) LOCATION (
        'hdfs://hadoop_name/user/tibero/f0.txt',
        'hdfs://hadoop_name/user/tibero/f1.txt',
        'hdfs://hadoop_name/user/tibero/f2.txt'
    )
);

The previous example creates three HDFS files for the external table. The host name of the HDFS Namenode is 'hadoop_name', and the port number defaults to 8020. Next, it reads the three files, 'f0.txt', 'f1.txt', and 'f2.txt', from the '/user/tibero' directory below HDFS.

The directory object is only needed for the DDL syntax to create the external table, and it has no effect on the HDFS file path. Local files as well as HDFS files can be used to create an external table, and all functions available to external tables can be used in the same way.

Note

For detailed information about external table creation syntax, refer to Tibero SQL Reference Guide.

2.2. Querying with HDFS Connector

To query with HDFS Connector, create an external table and execute queries against it. External Table interface enables the use of all query functions provided by Tibero, join operations with a Tibero table, various aggregate functions, UDF, etc.

The following is an example of using the HDFS Connector to execute a query.

SELECT COUNT(*), AVG(PEOPLE.AGE) 
FROM HDFS_EXT, PEOPLE
WHERE HDFS_EXT.ID = PEOPLE.ID
GROUP BY HDFS_EXT.TEXT;

The /*+ parallel */ hint can also be used for parallel execution as with general tables.

SELECT /*+ parallel (8) */ COUNT(*), AVG(PEOPLE.AGE) 
FROM HDFS_EXT, PEOPLE
WHERE HDFS_EXT.ID = PEOPLE.ID
GROUP BY HDFS_EXT.TEXT;

Parallel execution can enhance performance since it uses parallel table scans by dividing HDFS files into HDFS block units. However, DML cannot be executed against external tables of HDFS Connector.