Closed Bug 1313182 Opened 8 years ago Closed 8 years ago

Create notebook that tests schema evolution in Spark (using sqlContext.read) and Presto (using p2h)

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: frank, Assigned: frank)

References

(Blocks 1 open bug)

Details

Frank Bertsch [:frank]

Assignee

Description

•

8 years ago

It's been difficult to predict how the two systems will react to changes in schema. Having a notebook will give us a definitive source that we can rerun when needed to see how schema changes will affect datasets.

Frank Bertsch [:frank]

Assignee

Updated

•

8 years ago

Assignee: nobody → fbertsch

Priority: -- → P1

Frank Bertsch [:frank]

Assignee

Comment 1

•

8 years ago

I've uploaded an initial notebook [0]. There are basically three modes of reading Parquet:

1. Spark
2. Spark with mergeSchema = True
3. parquet2hive + Presto

I've tested all of these modes for each scenario. The scenarios I've covered are:

1. Adding a column
2. Removing a column
3. Renaming a column
5. Replacing a column (new type)
6. Transposing columns
7. Transposing with add and remove

The biggest takeaway is that parquet.column.index.access=true is set for our Hive metastore. With this in mind, many of these results make sense: for example, a renamed column still returns the data written under the old column name, and column replacement raises a DatabaseError (it's trying to read two different types).

[0] https://gist.github.com/fbertsch/1df69a191c2b4535b2df5bf64e57897d
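To make the index-access takeaway concrete, here is a minimal pure-Python sketch of positional column resolution. It is a toy model, not the actual Hive or Presto reader: the `read_by_index` helper, the dict-based "file" layout, the sample data, and the use of TypeError as a stand-in for the DatabaseError are all assumptions made for illustration.

```python
# Toy model of index-based (positional) column resolution.
# Hypothetical helper and data; this is not the real Hive/Presto code path.

def read_by_index(table_schema, parquet_file):
    """Resolve each table column by its position in the file's own schema."""
    file_schema = parquet_file["schema"]
    records = []
    for row in parquet_file["rows"]:
        record = {}
        for i, (name, expected_type) in enumerate(table_schema):
            if i >= len(file_schema):
                # The table has more columns than this file: treat as missing.
                record[name] = None
                continue
            _, file_type = file_schema[i]
            if file_type != expected_type:
                # Stand-in for the DatabaseError seen on column replacement.
                raise TypeError(
                    f"column {name!r}: table expects {expected_type}, "
                    f"file has {file_type}"
                )
            record[name] = row[i]
        records.append(record)
    return records


# A file written before any schema change.
old_file = {
    "schema": [("id", "bigint"), ("old_name", "varchar")],
    "rows": [(1, "a"), (2, "b")],
}

# Scenario: rename. Position 1 still matches, so the old data comes back
# under the new column name.
renamed_schema = [("id", "bigint"), ("new_name", "varchar")]
renamed_rows = read_by_index(renamed_schema, old_file)

# Scenario: replace with a new type. Position 1 now has a conflicting type.
replaced_schema = [("id", "bigint"), ("old_name", "bigint")]
try:
    read_by_index(replaced_schema, old_file)
    replace_error = None
except TypeError as exc:
    replace_error = str(exc)
```

In this model, `renamed_rows` carries the values written under `old_name`, which mirrors the rename behaviour described above, and the replacement scenario fails because the same position holds two different types.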

Frank Bertsch [:frank]

Assignee

Comment 2

•

8 years ago

I'm currently trying to run with 'parquet.column.index.access=false', but the configuration property is not getting set by Hive. Interestingly enough, it might actually be false by default; see [0]. [0] https://github.com/apache/hive/blob/41fbe7bb7d4ad1eb0510a08df22db59e7a81c245/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L350

Frank Bertsch [:frank]

Assignee

Comment 3

•

8 years ago

I've now run the notebook with parquet.column.index.access=false and with parquet.column.index.access=true. Both produced exactly the same output. I'm trying to work out whether the configuration property somehow wasn't set for the actual queries that came through. It's hard to test because you can't set the configuration directly before the query: the configuration belongs to Hive (the catalog), while the query runs in Presto.

Frank Bertsch [:frank]

Assignee

Comment 4

•

8 years ago

I've uploaded notebooks and explanations [0]. Basically, we get the expected results by setting hive.parquet.use-column-names=true in the hive.properties file. With this setting, a renamed column no longer reads the values written under the old name. We should probably set this property on our cluster. The only downside concerns column renaming: it's effectively not possible, since data in the old column cannot be read in as the new one. But this is how Spark reads it as well, so parity in that regard seems good for us. [0] https://github.com/fbertsch/schema_evolution_exploration
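As a rough illustration of what name-based resolution means for renames, here is a small pure-Python sketch. It is a toy model under stated assumptions, not Presto's or Spark's reader: the `read_by_name` helper, the dict-based "file" layout, and the sample data are all hypothetical.

```python
# Toy model of name-based column resolution.
# Hypothetical helper and data; this is not the real Presto/Spark code path.

def read_by_name(table_schema, parquet_file):
    """Resolve each table column by name; absent columns come back as None."""
    file_columns = [name for name, _ in parquet_file["schema"]]
    records = []
    for row in parquet_file["rows"]:
        values = dict(zip(file_columns, row))
        records.append({name: values.get(name) for name, _ in table_schema})
    return records


# Written before the rename: the second column is still called "old_name".
old_file = {
    "schema": [("id", "bigint"), ("old_name", "varchar")],
    "rows": [(1, "a"), (2, "b")],
}

# Written after the rename, with the columns also transposed.
new_file = {
    "schema": [("new_name", "varchar"), ("id", "bigint")],
    "rows": [("c", 3)],
}

table_schema = [("id", "bigint"), ("new_name", "varchar")]
rows = read_by_name(table_schema, old_file) + read_by_name(table_schema, new_file)
```

In this model the rows from `old_file` have `new_name` set to `None`, because nothing in that file carries the new name, while the transposed `new_file` is read correctly since column order no longer matters.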

Frank Bertsch [:frank]

Assignee

Comment 5

•

8 years ago

I'm going to mark this as resolved. If we feel we need more, we can continue using the notebooks for future investigation.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Cloud Services → Cloud Services Graveyard


Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
