Hadoop connector
Parent document: Connectors
Main function
The Hadoop connector can be used to read HDFS files in batch scenarios. Its main features include:
- Support for reading files from multiple HDFS directories at the same time
- Support for reading HDFS files in various formats
Maven dependency
```xml
<dependency>
    <groupId>com.bytedance.bitsail</groupId>
    <artifactId>bitsail-connector-hadoop</artifactId>
    <version>${revision}</version>
</dependency>
```
Supported data types
- Basic data types supported by the Hadoop connector:
- Integer type:
- short
- int
- long
- biginteger
- Float type:
- float
- double
- bigdecimal
- Time type:
- timestamp
- date
- time
- String type:
- string
- Bool type:
- boolean
- Binary type:
- binary
- Composite data types supported by the Hadoop connector:
- map
- list
Parameters
The following parameters should be added to the job.reader block when used, for example:

```json
{
  "job": {
    "reader": {
      "path_list": "hdfs://test_path/test.csv"
    }
  }
}
```
Necessary parameters
Param name | Required | Optional value | Description |
---|---|---|---|
class | Yes | | Class name of the Hadoop connector: com.bytedance.bitsail.connector.hadoop.source.HadoopInputFormat |
path_list | Yes | | Specifies the paths of the files to read. Multiple paths can be specified, separated by ',' |
content_type | Yes | JSON, CSV | Specifies the format of the files to read. For details, refer to the supported formats section below |
columns | Yes | | Describes the fields' names and types |
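Putting the required parameters together, a reader block might look like the following sketch. The path and column definitions are hypothetical placeholders:

```json
{
  "job": {
    "reader": {
      "class": "com.bytedance.bitsail.connector.hadoop.source.HadoopInputFormat",
      "path_list": "hdfs://test_path/a.csv,hdfs://test_path/b.csv",
      "content_type": "csv",
      "columns": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"}
      ]
    }
  }
}
```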
Optional parameters
Param name | Required | Optional value | Description |
---|---|---|---|
hadoop_conf | No | | Specifies the Hadoop read configuration as a standard JSON-format string |
reader_parallelism_num | No | | Reader parallelism |
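As a sketch, the optional parameters can be added alongside the required ones. The hadoop_conf value below is a hypothetical example of a JSON-format configuration string:

```json
{
  "job": {
    "reader": {
      "hadoop_conf": "{\"fs.defaultFS\":\"hdfs://test_path\"}",
      "reader_parallelism_num": 4
    }
  }
}
```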
Supported formats
The following formats are supported:
JSON
It supports parsing text files in JSON format. Each line is required to be a standard JSON string.
The following parameters can be used to adjust the JSON parsing style:
Parameter name | Default value | Description |
---|---|---|
job.common.case_insensitive | true | Whether to ignore the case of keys in JSON fields |
job.common.json_serializer_features | | Specifies the parsing modes of 'FastJsonUtil' as a ','-separated string, for example "QuoteFieldNames,UseSingleQuotes" |
job.common.convert_error_column_as_null | false | Whether to set fields with parsing errors to null |
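The JSON parsing parameters above go into the job.common block. A minimal sketch, with illustrative values:

```json
{
  "job": {
    "common": {
      "case_insensitive": true,
      "json_serializer_features": "QuoteFieldNames,UseSingleQuotes",
      "convert_error_column_as_null": false
    }
  }
}
```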
CSV
It supports parsing text files in CSV format. Each line is required to be a standard CSV string.
The following parameters can be used to adjust the CSV parsing style:
Parameter name | Default value | Description |
---|---|---|
job.common.csv_delimiter | ',' | CSV delimiter |
job.common.csv_escape | | Escape character |
job.common.csv_quote | | Quote character |
job.common.csv_with_null_string | | Specifies the value to which null fields are converted; null fields are not converted by default |
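The CSV parsing parameters above also go into the job.common block. A sketch with illustrative values (a ';'-delimited file using backslash as the escape character and "\N" as the null marker):

```json
{
  "job": {
    "common": {
      "csv_delimiter": ";",
      "csv_escape": "\\",
      "csv_quote": "\"",
      "csv_with_null_string": "\\N"
    }
  }
}
```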
Related documents
Configuration examples: Hadoop connector example