StreamingFile connector



The StreamingFile connector is mainly used in streaming scenarios. It supports writing to both HDFS and Hive with exactly-once semantics, providing reliable guarantees for real-time data warehouses.

Features

  • Exactly-once support.
  • Multiple committer strategies, covering different requirements such as data integrity or high timeliness.
  • Data trigger, which effectively handles delayed or out-of-order data.
  • Automatic Hive DDL detection, reducing the manual work of aligning schemas with Hive.

Supported data types

  • HDFS
    • No type handling is needed; byte arrays are written directly.
  • HIVE
    • Basic data types.
      • TINYINT
      • SMALLINT
      • INT
      • BIGINT
      • BOOLEAN
      • FLOAT
      • DOUBLE
      • STRING
      • BINARY
      • TIMESTAMP
      • DECIMAL
      • CHAR
      • VARCHAR
      • DATE
    • Complex data types.
      • Array
      • Map

Parameters

Common Parameters

| Name | Required | Default Value | Enumeration Value | Comments |
| ---- | -------- | ------------- | ----------------- | -------- |
| class | Yes | - | com.bytedance.bitsail.connector.legacy.streamingfile.sink.FileSystemSinkFunctionDAGBuilder | - |
| dump.format.type | Yes | - | hdfs<br>hive | Whether to write to HDFS or Hive. |
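Taken together, a minimal writer block using the common parameters might look like the sketch below. The surrounding job structure is illustrative only and may vary by BitSail version; see the configuration examples linked in the reference docs for complete jobs.

```json
{
  "job": {
    "writer": {
      "class": "com.bytedance.bitsail.connector.legacy.streamingfile.sink.FileSystemSinkFunctionDAGBuilder",
      "dump.format.type": "hdfs"
    }
  }
}
```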

Advanced Parameters

| Name | Required | Default Value | Enumeration Value | Comments |
| ---- | -------- | ------------- | ----------------- | -------- |
| enable_event_time | No | False | - | Whether to enable event time. |
| event_time_fields | No | - | - | When event time is enabled, specifies the name of the event-time field in the record. |
| event_time_pattern | No | - | - | When event time is enabled: if this parameter is null, the value of event_time_fields is parsed as a Unix timestamp; otherwise it is parsed with this pattern, e.g. "yyyy-MM-dd HH:mm:ss". |
| event_time.tag_duration | No | 900000 | - | Unit: milliseconds. Maximum wait time for the event-time trigger. The trigger fires when event_time - pending_commit_time > event_time.tag_duration. Example: current event time = 9:45, tag_duration = 40 min, pending trigger commit time = 8:00; then 9:45 - (8:00 + 60 min) = 45 min > 40 min, so the event time is triggered. |
| dump.directory_frequency | No | dump.directory_frequency.day | dump.directory_frequency.day<br>dump.directory_frequency.hour | Used when writing to HDFS. dump.directory_frequency.day: /202201/xx_data; dump.directory_frequency.hour: /202201/01/data |
| rolling.inactivity_interval | No | - | - | Inactivity interval that triggers file rolling. |
| rolling.max_part_size | No | - | - | Maximum file size that triggers file rolling. |
| partition_strategy | No | partition_last | partition_first<br>partition_last | Committer strategy. partition_last: wait until all data is ready, then add the Hive partition to the metastore. partition_first: add the partition first. |
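As an illustrative sketch of the event-time parameters above, the fragment below enables event time on a hypothetical `create_time` field, parses it with a date pattern, and sets the tag duration to 40 minutes (2400000 ms), matching the worked example in the table:

```json
{
  "enable_event_time": true,
  "event_time_fields": "create_time",
  "event_time_pattern": "yyyy-MM-dd HH:mm:ss",
  "event_time.tag_duration": 2400000
}
```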

HDFS Parameters

| Name | Required | Default Value | Enumeration Value | Comments |
| ---- | -------- | ------------- | ----------------- | -------- |
| dump.output_dir | Yes | - | - | The HDFS output location. |
| hdfs.dump_type | Yes | - | hdfs.dump_type.text<br>hdfs.dump_type.json<br>hdfs.dump_type.msgpack<br>hdfs.dump_type.binary | How to parse the record (e.g. to extract the event time). hdfs.dump_type.binary: protobuf records; must be used together with proto.descriptor and proto.class_name. |
| partition_infos | Yes | - | - | Partition layout of the HDFS directory. For HDFS this can only be: [{"name":"date","value":"yyyyMMdd","type":"TIME"},{"name":"hour","value":"HH","type":"TIME"}] |
| hdfs.replication | No | 3 | - | HDFS replication factor. |
| hdfs.compression_codec | No | None | - | HDFS file compression codec. |
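A sketch of an HDFS writer fragment combining the parameters above could look like the following. The output path is illustrative, and whether partition_infos is passed as a JSON array or as a string depends on your job-file conventions:

```json
{
  "dump.output_dir": "hdfs://namenode:8020/warehouse/demo_output",
  "hdfs.dump_type": "hdfs.dump_type.json",
  "partition_infos": [
    {"name": "date", "value": "yyyyMMdd", "type": "TIME"},
    {"name": "hour", "value": "HH", "type": "TIME"}
  ],
  "hdfs.replication": 3
}
```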

Hive Parameters

| Name | Required | Default Value | Enumeration Value | Comments |
| ---- | -------- | ------------- | ----------------- | -------- |
| db_name | Yes | - | - | Hive database name. |
| table_name | Yes | - | - | Hive table name. |
| metastore_properties | Yes | - | - | Hive metastore configuration, e.g.: {"metastore_uris":"thrift://localhost:9083"} |
| source_schema | Yes | - | - | Source schema, e.g.: [{"name":"id","type":"bigint"},{"name":"user_name","type":"string"},{"name":"create_time","type":"bigint"}] |
| sink_schema | Yes | - | - | Sink schema, e.g.: [{"name":"id","type":"bigint"},{"name":"user_name","type":"string"},{"name":"create_time","type":"bigint"}] |
| partition_infos | Yes | - | - | Hive partition definition, e.g.: [{"name":"date","type":"TIME"},{"name":"hour","type":"TIME"}] |
| hdfs.dump_type | Yes | - | hdfs.dump_type.text<br>hdfs.dump_type.json<br>hdfs.dump_type.msgpack<br>hdfs.dump_type.binary | How to parse the record. hdfs.dump_type.binary: protobuf records; must be used together with proto.descriptor and proto.class_name. |
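Putting the Hive parameters together, a hypothetical Hive writer fragment could look like the sketch below. The database and table names are illustrative, the schema values reuse the examples from the table, and whether the schema and metastore fields are passed as JSON-encoded strings (as shown) or as nested objects depends on your job-file conventions:

```json
{
  "db_name": "test_db",
  "table_name": "test_table",
  "metastore_properties": "{\"metastore_uris\":\"thrift://localhost:9083\"}",
  "source_schema": "[{\"name\":\"id\",\"type\":\"bigint\"},{\"name\":\"user_name\",\"type\":\"string\"},{\"name\":\"create_time\",\"type\":\"bigint\"}]",
  "sink_schema": "[{\"name\":\"id\",\"type\":\"bigint\"},{\"name\":\"user_name\",\"type\":\"string\"},{\"name\":\"create_time\",\"type\":\"bigint\"}]",
  "partition_infos": "[{\"name\":\"date\",\"type\":\"TIME\"},{\"name\":\"hour\",\"type\":\"TIME\"}]",
  "hdfs.dump_type": "hdfs.dump_type.json"
}
```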

Reference docs

Configuration examples: StreamingFile connector example