Bug 1286305: Add new system.os fields to longitudinal dataset
Opened 8 years ago; closed 8 years ago
Product/Component: Cloud Services Graveyard :: Metrics: Pipeline (defect, P1)
Status: RESOLVED FIXED
Reporter: chutten; Assignee: harter
Attachments: 1 file
Since bug 1255472, three more Windows-only fields have been added to the telemetry ping environment's system.os section: windowsBuildNumber, windowsUBR, and installYear.
Please add these to the longitudinal dataset.
Updated•8 years ago
Points: --- → 1
Priority: -- → P2
Updated•8 years ago
Assignee: nobody → rharter
Updated•8 years ago
Status: NEW → ASSIGNED
Updated•8 years ago
Priority: P2 → P1
Comment 1•8 years ago
Updated•8 years ago
Attachment #8773106 - Flags: review?(rvitillo)
Comment 2•8 years ago
An update on my progress. Roberto, there's nothing actionable here for you, but please make sure I'm not chasing my tail / doing something silly:
After generating an example longitudinal dataset, I ran the following analysis to test whether the new fields were flowing through properly. Unfortunately, my test shows that windowsBuildNumber (res1) and windowsUBR (res2) are both uniformly null. However, installYear (res3) appears to be working just fine.
> sbt assembly
> spark-submit --master yarn-client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.10/telemetry-batch-view-*.jar --from 20160101 --to 20160101 --bucket telemetry-test-bucket
> val testData = sqlContext.parquetFile("s3://telemetry-test-bucket/longitudinal/v20160101/")
> testData.select("system_os").map(x => x(0).toString().split(",")(5)).distinct.collect
res1: Array[String] = Array(null)
> testData.select("system_os").map(x => x(0).toString().split(",")(6)).distinct.collect
res2: Array[String] = Array(null)
> testData.select("system_os").map(x => x(0).toString().split(",")(7)).distinct.collect
res3: Array[String] = Array(2079, 1980, 1981, [...]
Side note:
The anonymous function I use in the above map seems too complicated for the task at hand. However, when I try to index directly into the array of results stored in the "system_os" column of the testData DataFrame, I get the following error:
scala> testData.select("system_os").take(1)(0)(0)(0)
<console>:28: error: Any does not take parameters
testData.select("system_os").take(1)(0)(0)(0)
^
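For what it's worth, the distinct-values check above can be done without splitting the Row's string form on commas, by extracting the struct field directly. A hedged sketch (Spark 1.6-era API; the snake_case field name `windows_build_number` is an assumption about the longitudinal schema, not confirmed here):

```scala
import org.apache.spark.sql.functions.explode

// system_os is an array of structs (one entry per ping), so selecting a
// field of it yields, per client, the array of that field's values.
// Exploding that array and taking distinct gives the value inventory.
val builds = testData
  .select(explode(testData("system_os.windows_build_number")).as("build"))
  .distinct
builds.collect()
```

This avoids depending on comma positions in the Row's toString output, which shift whenever the schema changes.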
Flags: needinfo?(rvitillo)
Comment 3•8 years ago
(In reply to Ryan Harter [:harter] from comment #2)
> After generating an example longitudinal dataset, I ran the following
> analysis to test whether the new fields were flowing through properly.
> Unfortunately, my test shows windowsBuildNumber (res1) windowsUBR (res2) are
> both uniformly null. However, installDate appears to be working just fine.
You are using a very old submission date (20160101) which is probably why you don't see any values for windowsBuildNumber and windowsUBR.
>
> > sbt assembly
> > spark-submit --master yarn-client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.10/telemetry-batch-view-*.jar --from 20160101 --to 20160101 --bucket telemetry-test-bucket
>
> > val testData = sqlContext.parquetFile("s3://telemetry-test-bucket/longitudinal/v20160101/")
> > testData.select("system_os").map(x => x(0).toString().split(",")(5)).distinct.collect
> res1: Array[String] = Array(null)
>
> > testData.select("system_os").map(x => x(0).toString().split(",")(6)).distinct.collect
> res2: Array[String] = Array(null)
>
> > testData.select("system_os").map(x => x(0).toString().split(",")(7)).distinct.collect
> res3: Array[String] = Array(2079, 1980, 1981, [...]
>
>
> Side note:
>
> The anonymous function I use in the above map seems too complicated for the
> task at hand. However, when I try to access the array of results stored in
> the “system_os” column of the testData dataframe, I get the following errors:
>
> scala> testData.select("system_os").take(1)(0)(0)(0)
> <console>:28: error: Any does not take parameters
> testData.select("system_os").take(1)(0)(0)(0)
> ^
That's expected, as the apply method of Row returns Any; see [1] for methods of Row that don't return Any. The DataFrame API isn't very pleasant to work with, unfortunately. The Dataset API [2] fixes many of its shortcomings.
[1] https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html
[2] https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#datasets
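To illustrate the typed accessors from [1], here is a hedged sketch (Spark 1.6; the element type and the snake_case field name `windows_build_number` are assumptions about the longitudinal schema):

```scala
import org.apache.spark.sql.Row

// Row.apply returns Any, so chained indexing fails to compile.
// getAs pins down the element type so the value can be traversed.
val firstRow: Row = testData.select("system_os").take(1)(0)
val osEntries = firstRow.getAs[Seq[Row]](0)  // the array of os structs
val firstBuild = osEntries.head.getAs[Any]("windows_build_number")
```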
Flags: needinfo?(rvitillo)
Comment 4•8 years ago
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> (In reply to Ryan Harter [:harter] from comment #2)
>
> > After generating an example longitudinal dataset, I ran the following
> > analysis to test whether the new fields were flowing through properly.
> > Unfortunately, my test shows windowsBuildNumber (res1) windowsUBR (res2) are
> > both uniformly null. However, installDate appears to be working just fine.
>
> You are using a very old submission date (20160101) which is probably why
> you don't see any values for windowsBuildNumber and windowsUBR.
Yep, rerunning the analysis for 2016-07-01 shows a lot of new values. Everything appears to be working.
> > [...]
>
> That's expected as the apply method of Row returns Any; see [1] for methods
> of Row that don't return Any. The Dataframe API isn't very pleasant to work
> with unfortunately. The Dataset API fixes many of its shortcomings.
>
> [1]
> https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html
> [2] https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#datasets
I was hoping there was an easier way to do this. I'll read through that guide. Thanks for the pointer.
Comment 5•8 years ago
This is complete and the longitudinal dataset has been recalculated.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•8 years ago
Attachment #8773106 - Flags: review?(rvitillo) → review+
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard