Closed Bug 1246425 Opened 9 years ago Closed 9 years ago

Import Parquet datasets in Hive metastore.

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: rvitillo)

References

Details

Declaring the Parquet datasets in the Hive metastore is needed so that various processing engines can query our data. We need a tool that, given the location of a Parquet dataset on S3, imports it into the Hive metastore, as Hive doesn't yet support that functionality [1]. There are several steps involved:

1) Discover the partitioning scheme. A Parquet dataset can be partitioned by one or more columns. For instance, if dataset "foo" is partitioned by columns A and B, the Parquet files would have the following layout:

  s3://bucket/path/to/dataset/foo/A=10/B="here"/XYX.parquet
  s3://bucket/path/to/dataset/foo/A=11/B="there"/ABC.parquet
  ...

2) Parse the Avro schema of the Parquet dataset. parquet-tools [2] can be used to print the schema of a Parquet file:

  java -jar parquet-tools.jar meta ABC.parquet | grep parquet.avro.schema | awk '{print $4}'

3) Generate a HiveQL DDL statement that links to the Parquet dataset [3]. Note that partition columns must not be repeated in the regular column list:

  create external table foo (C string, D string)
  partitioned by (A int, B string)
  stored as parquet
  location 's3://bucket/path/to/dataset/foo';

4) Execute the DDL statement and add the partitions to the Hive metastore:

  hive -e "create external table foo..."
  hive -e "msck repair table foo"

Note that even though there is a way to import a Parquet dataset into Hive without performing some of the above steps [4], the resulting tables don't appear to be usable by Presto, and possibly by other tools we might want to use in the future.

[1] https://issues.apache.org/jira/browse/HIVE-10593
[2] http://198.11.219.187:8081/nexus/content/groups/public/com/twitter/parquet-tools/1.6.0-IBM-7/parquet-tools-1.6.0-IBM-7.jar
[3] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
[4] https://stackoverflow.com/questions/34202743/create-hive-table-to-read-parquet-files-from-parquet-avro-schema/34207923#34207923
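Steps 1 and 3 above can be sketched as follows. This is a hypothetical helper (not the tool that actually landed for this bug): partition columns are inferred from the name=value components of the S3 keys, and partition columns are assumed to be string-typed.

```python
import re

def discover_partition_columns(keys):
    """Infer partition column names from S3 key layouts like
    .../foo/A=10/B=here/part-0.parquet (step 1)."""
    for key in keys:
        # Collect name=value path components, in order of appearance.
        cols = re.findall(r"/([^/=]+)=[^/]*(?=/)", key)
        if cols:
            return cols
    return []

def build_ddl(table, columns, partition_columns, location):
    """Generate the external-table DDL from step 3. `columns` is a list of
    (name, hive_type) pairs for the non-partition columns; partition
    columns are declared as string, per the assumption noted in this bug."""
    col_list = ", ".join(f"{name} {typ}" for name, typ in columns)
    part_list = ", ".join(f"{name} string" for name in partition_columns)
    return (
        f"create external table {table} ({col_list}) "
        f"partitioned by ({part_list}) "
        f"stored as parquet location '{location}'"
    )
```

After executing the generated statement, a "msck repair table foo" (step 4) would register the discovered partitions.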
Depends on: 1246420
Blocks: 1246426
Assignee: nobody → azhang
Note: we can assume that partitioning columns have type string.
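While partition columns can be assumed to be strings, the non-partition column types come from the Avro schema parsed in step 2. A sketch of that mapping, assuming a flat Avro record schema; the avro-to-hive type table and function names are illustrative, not the shipped tool:

```python
import json

# Minimal mapping of Avro primitive types to Hive column types.
AVRO_TO_HIVE = {
    "string": "string",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
}

def hive_columns(avro_schema_json):
    """Turn a flat Avro record schema into (name, hive_type) pairs,
    unwrapping nullable unions like ["null", "long"]."""
    schema = json.loads(avro_schema_json)
    cols = []
    for field in schema["fields"]:
        typ = field["type"]
        if isinstance(typ, list):  # nullable union, e.g. ["null", "long"]
            typ = next(t for t in typ if t != "null")
        cols.append((field["name"], AVRO_TO_HIVE.get(typ, "string")))
    return cols
```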
Assignee: azhang → rvitillo
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard