Bug 1286305 (Closed) Opened 8 years ago, Closed 8 years ago

Add new system.os fields to longitudinal dataset

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: harter)

References

()

Details

Attachments

(1 file)

Since bug 1255472, three more Windows-only fields have been added to the telemetry ping environment's system.os section: windowsBuildNumber, windowsUBR, and installYear. Please add these to the longitudinal dataset.
Points: --- → 1
Priority: -- → P2
Blocks: 1286262
Assignee: nobody → rharter
Status: NEW → ASSIGNED
Priority: P2 → P1
Attachment #8773106 - Flags: review?(rvitillo)
An update on my progress. Roberto, there's nothing actionable here for you, but please make sure I'm not chasing my tail / doing something silly:

After generating an example longitudinal dataset, I ran the following analysis to test whether the new fields were flowing through properly. Unfortunately, my test shows that windowsBuildNumber (res1) and windowsUBR (res2) are both uniformly null. However, installYear (res3) appears to be working just fine.

> sbt assembly
> spark-submit --master yarn-client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.10/telemetry-batch-view-*.jar --from 20160101 --to 20160101 --bucket telemetry-test-bucket

> val testData = sqlContext.parquetFile("s3://telemetry-test-bucket/longitudinal/v20160101/")
> testData.select("system_os").map(x => x(0).toString().split(",")(5)).distinct.collect
res1: Array[String] = Array(null)

> testData.select("system_os").map(x => x(0).toString().split(",")(6)).distinct.collect
res2: Array[String] = Array(null)

> testData.select("system_os").map(x => x(0).toString().split(",")(7)).distinct.collect
res3: Array[String] = Array(2079, 1980, 1981, [...]

Side note:

The anonymous function I use in the above map seems too complicated for the task at hand. However, when I try to access the array of results stored in the "system_os" column of the testData dataframe directly, I get the following error:

scala> testData.select("system_os").take(1)(0)(0)(0)
<console>:28: error: Any does not take parameters
       testData.select("system_os").take(1)(0)(0)(0)
                                           ^
Flags: needinfo?(rvitillo)
(In reply to Ryan Harter [:harter] from comment #2)
> After generating an example longitudinal dataset, I ran the following
> analysis to test whether the new fields were flowing through properly.
> Unfortunately, my test shows that windowsBuildNumber (res1) and windowsUBR
> (res2) are both uniformly null. However, installYear (res3) appears to be
> working just fine.

You are using a very old submission date (20160101), which is probably why you don't see any values for windowsBuildNumber and windowsUBR.

> scala> testData.select("system_os").take(1)(0)(0)(0)
> <console>:28: error: Any does not take parameters
>        testData.select("system_os").take(1)(0)(0)(0)
>                                            ^

That's expected, as the apply method of Row returns Any; see [1] for the methods of Row that don't return Any. The DataFrame API isn't very pleasant to work with, unfortunately. The Dataset API [2] fixes many of its shortcomings.

[1] https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html
[2] https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#datasets
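(Editor's note) A minimal sketch of what [1] suggests: instead of splitting the struct's string form on commas, pull typed values out of the Row. This is untested against the real longitudinal schema, and the struct field names used below (windows_build_number, windows_ubr) are assumptions, not verified:

```scala
import org.apache.spark.sql.Row

// system_os is an array of structs, one entry per ping; getSeq avoids the
// Any returned by Row.apply. Field names here are assumed, not verified.
val firstRow: Row = testData.select("system_os").first()
val osRecords: Seq[Row] = firstRow.getSeq[Row](0)
val buildNumber = osRecords.head.getAs[Integer]("windows_build_number")
val ubr = osRecords.head.getAs[Integer]("windows_ubr")
```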
Flags: needinfo?(rvitillo)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> You are using a very old submission date (20160101), which is probably why
> you don't see any values for windowsBuildNumber and windowsUBR.

Yep, rerunning the analysis for 2016-07-01 shows a lot of new values. Everything appears to be working.

> That's expected, as the apply method of Row returns Any; see [1] for the
> methods of Row that don't return Any. The DataFrame API isn't very pleasant
> to work with, unfortunately. The Dataset API [2] fixes many of its
> shortcomings.

I was hoping there was an easier way to do this. I'll read through that guide. Thanks for the pointer.
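(Editor's note) The Dataset approach from [2] looks roughly like the sketch below. This is untested against Spark 1.6, and the case class fields are hypothetical; they would need to match the actual system_os struct in the parquet schema:

```scala
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

// Hypothetical case class; field names must match the system_os struct.
case class OsRecord(name: String, version: String)

val ds = testData
  .select(explode($"system_os").as("os"))  // one row per array entry
  .select("os.*")                          // flatten the struct into columns
  .as[OsRecord]                            // typed Dataset, no Row.apply/Any

ds.map(_.version).distinct.collect()
```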
This is complete and the longitudinal dataset has been recalculated.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Attachment #8773106 - Flags: review?(rvitillo) → review+
Product: Cloud Services → Cloud Services Graveyard