HBase connector


The BitSail HBase connector reads from and writes to HBase tables. Its main features are:

  • Scanning HBase tables.
  • Setting the RowKey when writing to HBase tables.

Maven dependency

<dependency>
   <groupId>com.bytedance.bitsail</groupId>
   <artifactId>bitsail-connector-hbase</artifactId>
   <version>${revision}</version>
</dependency>

HBase reader

Supported data types

The HBase reader can convert binary data from HBase into the following data types:

  • string
  • boolean
  • int
  • short
  • long
  • bigint
  • double
  • float
  • date
  • bytes

Parameters

Add the following parameters to the job.reader block, for example:

{
  "job": {
    "reader": {
      "class": "com.bytedance.bitsail.connector.hbase.source.HBaseInputFormat",
      "table": "test_table",
      "hbase_conf":{
        "hbase.zookeeper.quorum":"127.0.0.1",
        "hbase.zookeeper.property.clientPort":"2181",
        "hbase.mapreduce.splittable": "test_table"
      },
      "columns": [
        {
          "index": 0,
          "name": "cf1:str1",
          "type": "bigint"
        },
        {
          "index": 1,
          "name": "cf1:int1",
          "type": "string"
        }
      ]
    }
  }
}

Necessary parameters

| Param name | Required | Optional value | Description |
|---|---|---|---|
| class | Yes | | HBase reader class name, com.bytedance.bitsail.connector.legacy.hbase.source.HBaseInputFormat |
| table | Yes | | Target HBase table to read. |
| columns | Yes | | Describes the fields' names and types. Each name should have the format columnFamily:columnName. |
| hbase_conf | Yes | | Configurations for creating the HBase connection. |

Optional parameters

| Param name | Required | Optional value | Description |
|---|---|---|---|
| reader_parallelism_num | No | | Reader parallelism. |
| encoding | No | | The encoding used when decoding binary data from HBase. Default utf-8. |

HBase writer

Supported data types

The HBase writer can convert the following data types into binary data:

  • varchar
  • string
  • boolean
  • short
  • int
  • long
  • bigint
  • double
  • decimal
  • float
  • date
  • timestamp
  • binary

Parameters

Add the following parameters to the job.writer block, for example:

{
  "job": {
    "writer": {
      "class": "com.bytedance.bitsail.connector.hbase.sink.HBaseOutputFormat",
      "table": "test_table",
      "hbase_conf":{
        "hbase.zookeeper.quorum":"127.0.0.1",
        "hbase.zookeeper.property.clientPort":"2181"
      },
      "row_key_column": "id_$(cf1:str1)",
      "columns": [
        {
          "index": 0,
          "name": "cf1:str1",
          "type": "string"
        },
        {
          "index": 1,
          "name": "cf1:int1",
          "type": "bigint"
        }
      ]
    }
  }
}

Necessary parameters

| Param name | Required | Optional value | Description |
|---|---|---|---|
| class | Yes | | HBase writer class name, com.bytedance.bitsail.connector.legacy.hbase.sink.HBaseOutputFormat |
| table | Yes | | Target table to write. |
| columns | Yes | | Describes the fields' names and types. Each name should have the format columnFamily:columnName. |
| hbase_conf | Yes | | Configurations for creating the HBase connection. |
| row_key_column | Yes | | Determines the RowKey for each row. |

The format of row_key_column is as follows:

  • $(XX) is replaced by the value of the column XX defined in columns.
  • md5(...) applies the MD5 operation to the enclosed expression.

For example: $(cf:name)_md5($(cf:id)_split_$(cf:age))
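
To make the template rules concrete, here is a minimal Python sketch of how such a template could be expanded for one row. It is not BitSail's implementation: the function name build_row_key is hypothetical, and it assumes that md5(...) yields the lowercase hex digest of the UTF-8 bytes and that column values contain no parentheses.

```python
import hashlib
import re

# Matches a $(family:qualifier) placeholder.
PLACEHOLDER = re.compile(r"\$\(([^()]+)\)")
# Matches an innermost md5(...) call whose argument has no parentheses left.
MD5_CALL = re.compile(r"md5\(([^()]*)\)")


def build_row_key(template: str, row: dict) -> str:
    """Expand a row_key_column template for one row (illustrative only)."""

    def substitute(match):
        name = match.group(1)
        if name not in row:
            # The template may only reference columns listed in `columns`.
            raise KeyError(f"{name} is not defined in columns")
        return str(row[name])

    # Step 1: replace every $(XX) with the value of column XX.
    key = PLACEHOLDER.sub(substitute, template)

    # Step 2: evaluate md5(...) calls from the innermost one outwards.
    # Assumption: md5 yields the lowercase hex digest of the UTF-8 bytes.
    while True:
        key, count = MD5_CALL.subn(
            lambda m: hashlib.md5(m.group(1).encode("utf-8")).hexdigest(), key
        )
        if count == 0:
            return key


row = {"cf:name": "alice", "cf:id": 7, "cf:age": 30}
print(build_row_key("$(cf:name)_md5($(cf:id)_split_$(cf:age))", row))
```

Under these assumptions the example prints "alice_" followed by the MD5 hex digest of "7_split_30". Referencing a column that is not in the row raises an error, which mirrors the requirement that XX be defined in columns.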

Optional parameters

| Param name | Required | Optional value | Description |
|---|---|---|---|
| writer_parallelism_num | No | | Writer parallelism. Must not exceed the number of regions of the table. |
| encoding | No | | The encoding used when encoding data. Default utf-8. |
| null_mode | No | "skip", "empty" | How to process null values. "skip" means the value is not inserted. "empty" means the value is written as empty bytes. Default "empty". |
| wal_flag | No | | Whether to enable write-ahead logging. Default false. |
| write_buffer_size | No | | The buffer size of mutate operations. Default 8MB. |
| version_column | No | | Determines the timestamp of inserted rows. |

version_column is used as follows:

  • If it is not set, the runtime timestamp is used as the cells' timestamp.
  • If "version_column" = {"index":x}, the value of the x-th column (counting from 0) is used as the cells' timestamp. For example, to use the second column defined in job.writer.columns:
      {
        "version_column": {
            "index": 1
        }
      }
    
  • If "version_column" = {"value":xxx} or "version_column" = {"value":"xxx"}, the given value is used as the cells' timestamp. For example, to set the cells' timestamp to 1234567890:
      {
        "version_column": {
            "value": "1234567890"
        }
      }
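
The null_mode and version_column rules described above can be summarized in a short Python sketch. It is only an illustration of the documented behavior, not the connector's code: the helper names resolve_timestamp and encode_cells are hypothetical, the runtime timestamp is assumed to be in milliseconds, and non-null values are simply encoded as strings instead of using type-specific binary encodings.

```python
import time


def resolve_timestamp(version_column, values):
    """Pick the cell timestamp for one row, following the three cases above."""
    if not version_column:
        # Not set: use the runtime timestamp (assumed here to be milliseconds).
        return int(time.time() * 1000)
    if "index" in version_column:
        # {"index": x}: take the value of the x-th column, counting from 0.
        return int(values[version_column["index"]])
    if "value" in version_column:
        # {"value": xxx} or {"value": "xxx"}: use the fixed value.
        return int(version_column["value"])
    raise ValueError("version_column needs either 'index' or 'value'")


def encode_cells(columns, values, null_mode="empty", encoding="utf-8"):
    """Map column names to bytes, applying null_mode to null values."""
    cells = {}
    for column, value in zip(columns, values):
        if value is None:
            if null_mode == "skip":
                continue  # "skip": the value is not inserted at all
            cells[column["name"]] = b""  # "empty": write empty bytes
        else:
            # Simplification: every non-null value is encoded as a string.
            cells[column["name"]] = str(value).encode(encoding)
    return cells


columns = [{"name": "cf1:str1"}, {"name": "cf1:int1"}]
print(encode_cells(columns, ["a", None], null_mode="skip"))
print(resolve_timestamp({"value": "1234567890"}, ["a", None]))
```

With null_mode set to "skip", the second column is left out of the result entirely, while the default "empty" mode would write it as empty bytes.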
    

Configuration examples: HBase connector example