Hadoop connector

Parent document: Connectors

Main function

The Hadoop connector can be used to read HDFS files in batch scenarios. Its main features include:

  • Support for reading files from multiple HDFS directories at the same time
  • Support for reading HDFS files in various formats

Maven dependency

<dependency>
   <groupId>com.bytedance.bitsail</groupId>
   <artifactId>bitsail-connector-hadoop</artifactId>
   <version>${revision}</version>
</dependency>

Supported data types

  • Basic data types supported by Hadoop connectors:
    • Integer type:
      • short
      • int
      • long
      • biginteger
    • Float type:
      • float
      • double
      • bigdecimal
    • Time type:
      • timestamp
      • date
      • time
    • String type:
      • string
    • Bool type:
      • boolean
    • Binary type:
      • binary
  • Composited data types supported by Hadoop connectors:
    • map
    • list
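For illustration, a `columns` definition covering several of the types above might look like the sketch below. The field names are hypothetical, and the `list<...>`/`map<...>` syntax for composite types is an assumption about how they are declared:

```json
"columns": [
  { "name": "id",         "type": "long" },
  { "name": "price",      "type": "double" },
  { "name": "created_at", "type": "timestamp" },
  { "name": "user_name",  "type": "string" },
  { "name": "is_active",  "type": "boolean" },
  { "name": "tags",       "type": "list<string>" },
  { "name": "properties", "type": "map<string,string>" }
]
```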

Parameters

The following parameters should be added to the job.reader block when used, for example:

{
  "job": {
    "reader": {
      "path_list": "hdfs://test_path/test.csv"
    }
  }
}

Necessary parameters

| Param name | Required | Optional value | Description |
| ---------- | -------- | -------------- | ----------- |
| class | Yes | | Class name of the Hadoop connector: com.bytedance.bitsail.connector.hadoop.source.HadoopInputFormat |
| path_list | Yes | | Path(s) of the file(s) to read. Multiple paths can be specified, separated by ',' |
| content_type | Yes | JSON<br/>CSV | Format of the file(s) to read. For details, refer to Supported formats |
| columns | Yes | | Describes the fields' names and types |
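Putting the required parameters together, a minimal `job.reader` block might look like the following sketch. The HDFS paths and column definitions are illustrative, and the lowercase `"csv"` value for `content_type` is an assumption:

```json
{
  "job": {
    "reader": {
      "class": "com.bytedance.bitsail.connector.hadoop.source.HadoopInputFormat",
      "path_list": "hdfs://test_path/dir_a/test.csv,hdfs://test_path/dir_b/test.csv",
      "content_type": "csv",
      "columns": [
        { "name": "id",   "type": "int" },
        { "name": "name", "type": "string" }
      ]
    }
  }
}
```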

Optional parameters

| Param name | Required | Optional value | Description |
| ---------- | -------- | -------------- | ----------- |
| hadoop_conf | No | | Hadoop read configuration, specified as a standard JSON-format string |
| reader_parallelism_num | No | | Reader parallelism |
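For example, `hadoop_conf` takes the Hadoop configuration as an embedded JSON string. The sketch below uses a generic HDFS client option as the payload; the option shown is a standard Hadoop setting used for illustration, not something this connector requires:

```json
{
  "job": {
    "reader": {
      "hadoop_conf": "{\"dfs.client.use.datanode.hostname\":\"true\"}",
      "reader_parallelism_num": 4
    }
  }
}
```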

Supported formats

The following formats are supported:

JSON

It supports parsing text files in JSON format. Each line is required to be a standard JSON string.

The following parameters can be used to adjust the JSON parsing style:

| Parameter name | Default value | Description |
| -------------- | ------------- | ----------- |
| job.common.case_insensitive | true | Whether keys in the JSON fields are matched case-insensitively |
| job.common.json_serializer_features | | Modes used when parsing with FastJsonUtil, specified as a ','-separated string, for example "QuoteFieldNames,UseSingleQuotes" |
| job.common.convert_error_column_as_null | false | Whether to set fields with parsing errors to null |
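For example, to make key matching case-insensitive and have unparsable fields read as null instead of failing the record, the options above would be set roughly as in this sketch (only the two options shown come from the table above):

```json
{
  "job": {
    "common": {
      "case_insensitive": true,
      "convert_error_column_as_null": true
    }
  }
}
```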

CSV

It supports parsing text files in CSV format. Each line is required to be a standard CSV string.

The following parameters can be used to adjust the CSV parsing style:

| Parameter name | Default value | Description |
| -------------- | ------------- | ----------- |
| job.common.csv_delimiter | ',' | CSV delimiter |
| job.common.csv_escape | | Escape character |
| job.common.csv_quote | | Quote character |
| job.common.csv_with_null_string | | The string value to convert to null. No conversion is performed by default |
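As an illustration, reading a semicolon-delimited file in which the literal string \N represents null might use a configuration like this sketch (the delimiter and null marker are example values):

```json
{
  "job": {
    "common": {
      "csv_delimiter": ";",
      "csv_quote": "\"",
      "csv_with_null_string": "\\N"
    }
  }
}
```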

Configuration examples: Hadoop connector example