Closed Bug 1313182 Opened 8 years ago Closed 8 years ago

Create notebook that tests schema evolution in Spark (using sqlContext.read) and Presto (using p2h)

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: frank)

References

(Blocks 1 open bug)

Details

It's been difficult to predict how the two systems will react to changes in schema. Having a notebook will give us a definitive source that we can rerun if needed to see how schema changes will affect datasets.
Assignee: nobody → fbertsch
Priority: -- → P1
I've uploaded an initial notebook [0]. There are basically three modes of reading Parquet:

1. Spark
2. Spark with mergeSchema = True
3. parquet2hive + Presto

I've tested all three modes in each scenario. The scenarios I've covered are:

1. Adding a column
2. Removing a column
3. Renaming a column
5. Replacing a column (new type)
6. Transposing columns
7. Transposing with add and remove

The biggest takeaway is that parquet.column.index.access=true is set for our Hive metastore, so columns are matched by position rather than by name. With this in mind, many of these results make sense; e.g. a renamed column still includes the data written under the old column name, and column replacement raises a DatabaseError (it's trying to read two different types).

[0] https://gist.github.com/fbertsch/1df69a191c2b4535b2df5bf64e57897d
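
For reference, the two Spark read modes above could be sketched roughly as follows. This is not the notebook's code: the helper names read_plain and read_merged and the example path are hypothetical, and the sketch assumes a sqlContext like the one a PySpark notebook provides. The only Spark-specific piece is the Parquet reader's "mergeSchema" option.

```python
def read_plain(sql_context, path):
    """Mode 1: plain Parquet read; Spark picks the schema without merging."""
    return sql_context.read.parquet(path)


def read_merged(sql_context, path):
    """Mode 2: ask Spark to merge the schemas found across all Parquet files."""
    return sql_context.read.option("mergeSchema", "true").parquet(path)


# Hypothetical usage inside a notebook where `sqlContext` already exists:
#   df_plain = read_plain(sqlContext, "s3://some-bucket/some-dataset/")
#   df_merged = read_merged(sqlContext, "s3://some-bucket/some-dataset/")
#   df_merged.printSchema()
```

Comparing printSchema() output from the two helpers after each schema change is one way to see which columns each mode picks up.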
Currently trying to run with 'parquet.column.index.access=false', but the configuration property is not getting set by Hive. Interestingly enough, it might actually be false by default; see [0].
[0] https://github.com/apache/hive/blob/41fbe7bb7d4ad1eb0510a08df22db59e7a81c245/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L350
I've now run the notebook with both parquet.column.index.access=false and parquet.column.index.access=true. Both runs produced exactly the same output. I'm trying to work out whether the configuration property was actually set for the queries that came through. It's hard to test because you can't set the configuration directly before the query: the configuration belongs to Hive (the catalog), while the query runs in Presto.
I've uploaded notebooks and explanations [0]. Basically, we get the expected results by setting hive.parquet.use-column-names=true in the hive.properties file. With it set, a renamed column no longer reads the old column's values. We should probably set this property on our cluster. The only downside concerns column renaming: it's effectively not possible, since the old column cannot be read in as the new one. But Spark reads it the same way, so parity in that regard seems good for us.
[0] https://github.com/fbertsch/schema_evolution_exploration
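
To make the rename behaviour concrete, here is a small pure-Python simulation of position-based versus name-based column resolution. It is only an illustration of the idea, not how Presto or Hive actually implement it; the file contents, column names, and helper functions are all made up.

```python
# Each "file" is a list of (column_name, values) pairs in on-disk order.
old_file = [("id", [1, 2]), ("name", ["a", "b"])]
new_file = [("id", [3]), ("label", ["c"])]  # "name" was renamed to "label"
table_schema = ["id", "label"]              # current table definition


def num_rows(file_cols):
    return len(file_cols[0][1]) if file_cols else 0


def read_by_index(file_cols, schema):
    """Match table columns to file columns by position."""
    n = num_rows(file_cols)
    result = {}
    for position, column in enumerate(schema):
        if position < len(file_cols):
            result[column] = list(file_cols[position][1])
        else:
            result[column] = [None] * n
    return result


def read_by_name(file_cols, schema):
    """Match table columns to file columns by name; missing names read as None."""
    n = num_rows(file_cols)
    by_name = dict(file_cols)
    return {column: list(by_name.get(column, [None] * n)) for column in schema}


def scan(files, schema, reader):
    """Read every file with the given resolution strategy and concatenate rows."""
    combined = {column: [] for column in schema}
    for file_cols in files:
        part = reader(file_cols, schema)
        for column in schema:
            combined[column].extend(part[column])
    return combined


by_index = scan([old_file, new_file], table_schema, read_by_index)
by_name = scan([old_file, new_file], table_schema, read_by_name)

print(by_index["label"])  # old "name" values show up under "label"
print(by_name["label"])   # old rows read as None for "label"
```

With position-based matching the renamed column silently picks up the data written under the old name, whereas name-based matching returns nulls for the older files, which is the trade-off described above.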
I'm going to mark this as resolved. If we feel we need more, we can continue using the notebooks for future investigation.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard