Bug 1246425 (Closed): Import Parquet datasets into the Hive metastore
Opened 9 years ago
Closed 9 years ago
Categories: Cloud Services Graveyard :: Metrics: Pipeline (defect)
Tracking: not tracked
Status: RESOLVED FIXED
People: Reporter: rvitillo; Assignee: rvitillo
Declaring the Parquet datasets in the Hive metastore is needed to use various processing engines to query our data.
We need a tool that, given the location of a Parquet dataset on S3, imports it into the Hive metastore, as Hive doesn't yet support that functionality itself [1].
There are several steps involved:
1) Discover the partitioning scheme.
A Parquet dataset can be partitioned by one or more columns. For instance, if dataset "foo" is partitioned by columns A and B, the Parquet files would have the following layout:
s3://bucket/path/to/dataset/foo/A=10/B="here"/XYX.parquet
s3://bucket/path/to/dataset/foo/A=11/B="there"/ABC.parquet
...
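The partition columns can be recovered by scanning the S3 key names for Hive-style `name=value` path segments. A minimal sketch in Python (the helper name `partition_columns` is illustrative, not part of any existing tool):

```python
def partition_columns(keys):
    """Infer partition column names from Hive-style S3 key paths.

    Each path segment of the form "name=value" contributes one
    partition column; column order follows the directory order.
    Assumes "=" does not appear in ordinary directory or file names.
    """
    for key in keys:
        cols = [seg.split("=", 1)[0]
                for seg in key.split("/")
                if "=" in seg]
        if cols:
            return cols
    return []

keys = [
    "path/to/dataset/foo/A=10/B=here/XYX.parquet",
    "path/to/dataset/foo/A=11/B=there/ABC.parquet",
]
print(partition_columns(keys))  # ['A', 'B']
```

In practice the keys would come from an S3 listing of the dataset prefix; only the first key with partition segments needs to be inspected.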
2) Parse the Avro schema of the Parquet dataset
parquet-tools [2] can be used to print the schema of a Parquet file:
java -jar parquet-tools.jar meta ABC.parquet | grep parquet.avro.schema | awk '{print $4}'
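Once the Avro schema JSON has been extracted, its field types need to be mapped to Hive column types. A rough sketch of that mapping (covering only Avro primitive types and nullable unions; the function name and type table are assumptions, not an existing API):

```python
import json

# Avro primitive type -> Hive column type (non-exhaustive).
AVRO_TO_HIVE = {
    "string": "string", "int": "int", "long": "bigint",
    "float": "float", "double": "double", "boolean": "boolean",
}

def hive_columns(avro_schema_json):
    """Turn a flat Avro record schema into (name, hive_type) pairs.

    Nullable Avro fields are unions like ["null", "int"]; the
    non-null branch determines the Hive type.
    """
    schema = json.loads(avro_schema_json)
    cols = []
    for field in schema["fields"]:
        t = field["type"]
        if isinstance(t, list):  # union, e.g. ["null", "int"]
            t = next(x for x in t if x != "null")
        cols.append((field["name"], AVRO_TO_HIVE[t]))
    return cols

schema = ('{"type": "record", "name": "foo", "fields": ['
          '{"name": "C", "type": "string"},'
          '{"name": "D", "type": ["null", "int"]}]}')
print(hive_columns(schema))  # [('C', 'string'), ('D', 'int')]
```

Nested records, arrays, and logical types would need additional handling, but flat records cover the common telemetry case.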
3) Generate a HiveQL DDL statement that links to the Parquet dataset [3]
create external table foo (C string, D string)
partitioned by (A int, B string)
stored as parquet
location 's3://bucket/path/to/dataset/foo';
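Putting the discovered columns and partitions together, the DDL could be generated along these lines (a sketch with illustrative names; partition columns are rendered as string per the note in comment 1):

```python
def create_table_ddl(table, columns, partitions, location):
    """Render the external-table DDL for a Parquet dataset.

    columns and partitions are lists of (name, hive_type) pairs.
    Partition columns must not repeat in the regular column list,
    as Hive rejects such statements.
    """
    col_list = ", ".join(f"{n} {t}" for n, t in columns)
    part_list = ", ".join(f"{n} {t}" for n, t in partitions)
    return (f"create external table {table} ({col_list}) "
            f"partitioned by ({part_list}) "
            f"stored as parquet location '{location}'")

ddl = create_table_ddl(
    "foo",
    [("C", "string"), ("D", "string")],
    [("A", "string"), ("B", "string")],
    "s3://bucket/path/to/dataset/foo",
)
print(ddl)
```

No quoting or identifier escaping is attempted here; a real tool would need to guard against unusual column names.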
4) Execute the DDL statement and add partitions to the Hive metastore
hive -e "create external table foo..."
hive -e "msck repair table foo"
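The two invocations above could be driven from a small wrapper. This sketch only assembles the argument vectors (so it runs without a Hive installation); a real driver would pass each to subprocess.run:

```python
def hive_commands(ddl, table):
    """Return the shell commands that register the external table
    and then sync its partitions into the metastore."""
    return [
        ["hive", "-e", ddl],
        ["hive", "-e", f"msck repair table {table}"],
    ]

cmds = hive_commands("create external table foo ...", "foo")
for cmd in cmds:
    print(" ".join(cmd))
```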
Note that even though there is a way to import a Parquet dataset into Hive without performing some of the steps above [4], the resulting tables don't appear to be usable from Presto, and possibly not from other tools we might want to adopt in the future.
[1] https://issues.apache.org/jira/browse/HIVE-10593
[2] http://198.11.219.187:8081/nexus/content/groups/public/com/twitter/parquet-tools/1.6.0-IBM-7/parquet-tools-1.6.0-IBM-7.jar
[3] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
[4] https://stackoverflow.com/questions/34202743/create-hive-table-to-read-parquet-files-from-parquet-avro-schema/34207923#34207923
Updated 9 years ago
Assignee: nobody → azhang
Comment 1 (Assignee) • 9 years ago
Note: we can assume that partitioning columns have type string.
Updated 9 years ago
Assignee: azhang → rvitillo
Comment 2 (Assignee) • 9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated 6 years ago
Product: Cloud Services → Cloud Services Graveyard