Hadoop connector
Parent document: Connectors
Main function
The Hadoop connector can be used to read HDFS files in batch scenarios. Its main features include:
- Support for reading files from multiple HDFS directories at the same time
- Support for reading HDFS files in various formats
Maven dependency
```xml
<dependency>
    <groupId>com.bytedance.bitsail</groupId>
    <artifactId>bitsail-connector-hadoop</artifactId>
    <version>${revision}</version>
</dependency>
```
Supported data types
- Basic data types supported by the Hadoop connector:
- Integer type:
- short
- int
- long
- biginteger
- Float type:
- float
- double
- bigdecimal
- Time type:
- timestamp
- date
- time
- String type:
- string
- Bool type:
- boolean
- Binary type:
- binary
- Composite data types supported by the Hadoop connector:
- map
- list
Parameters
The following parameters should be added to the job.reader block when used, for example:

```json
{
  "job": {
    "reader": {
      "path_list": "hdfs://test_path/test.csv"
    }
  }
}
```
Necessary parameters
Param name | Required | Optional value | Description |
---|---|---|---|
class | Yes | | Class name of the Hadoop connector: com.bytedance.bitsail.connector.hadoop.source.HadoopInputFormat |
path_list | Yes | | Specifies the paths of the files to read. Multiple paths can be specified, separated by ',' |
content_type | Yes | JSON, CSV | Specifies the format of the files to read. For details, refer to the supported formats section below |
columns | Yes | | Describes the fields' names and types |
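Putting the required parameters together, a reader block might look like the following sketch. The path and column definitions are hypothetical placeholders:

```json
{
  "job": {
    "reader": {
      "class": "com.bytedance.bitsail.connector.hadoop.source.HadoopInputFormat",
      "path_list": "hdfs://test_path/a.csv,hdfs://test_path/b.csv",
      "content_type": "csv",
      "columns": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"}
      ]
    }
  }
}
```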
Optional parameters
Param name | Required | Optional value | Description |
---|---|---|---|
hadoop_conf | No | | Specifies the Hadoop read configuration as a standard JSON-format string |
reader_parallelism_num | No | | Reader parallelism |
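As a sketch, the optional parameters can be added alongside the required ones. The hadoop_conf value below is a hypothetical example of a JSON-format configuration string:

```json
{
  "job": {
    "reader": {
      "hadoop_conf": "{\"fs.defaultFS\":\"hdfs://test_path\"}",
      "reader_parallelism_num": 4
    }
  }
}
```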
Supported formats
The following formats are supported:
JSON
It supports parsing text files in JSON format. Each line is required to be a standard JSON string.
The following parameters can be used to adjust the JSON parsing style:
Parameter name | Default value | Description |
---|---|---|
job.common.case_insensitive | true | Whether to ignore the case of keys in JSON fields |
job.common.json_serializer_features | | Specifies the parsing modes of 'FastJsonUtil' as a ','-separated string, for example "QuoteFieldNames,UseSingleQuotes" |
job.common.convert_error_column_as_null | false | Whether to set fields with parsing errors to null |
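The JSON parsing parameters above go into the job.common block. A minimal sketch, with illustrative values:

```json
{
  "job": {
    "common": {
      "case_insensitive": true,
      "json_serializer_features": "QuoteFieldNames,UseSingleQuotes",
      "convert_error_column_as_null": false
    }
  }
}
```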
CSV
It supports parsing text files in CSV format. Each line is required to be a standard CSV string.
The following parameters can be used to adjust the CSV parsing style:
Parameter name | Default value | Description |
---|---|---|
job.common.csv_delimiter | ',' | CSV delimiter |
job.common.csv_escape | | Escape character |
job.common.csv_quote | | Quote character |
job.common.csv_with_null_string | | Specifies the value to which null fields are converted; null fields are not converted by default |
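The CSV parsing parameters above also go into the job.common block. A sketch with illustrative values (a ';'-delimited file using backslash as the escape character and "\N" as the null marker):

```json
{
  "job": {
    "common": {
      "csv_delimiter": ";",
      "csv_escape": "\\",
      "csv_quote": "\"",
      "csv_with_null_string": "\\N"
    }
  }
}
```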
Related documents
Configuration examples: Hadoop connector example