Bug 1286305 (Closed) Opened 8 years ago, Closed 8 years ago

Add new system.os fields to longitudinal dataset

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: harter)

References

()

Details

Attachments

(1 file)

Since bug 1255472, three more Windows-only fields have been added to the telemetry ping environment's system.os section: windowsBuildNumber, windowsUBR, and installYear. Please add these to the longitudinal dataset.
Points: --- → 1
Priority: -- → P2
Blocks: 1286262
Assignee: nobody → rharter
Status: NEW → ASSIGNED
Priority: P2 → P1
Attachment #8773106 - Flags: review?(rvitillo)
An update on my progress. Roberto, there's nothing actionable here for you, but please make sure I'm not chasing my tail / doing something silly:

After generating an example longitudinal dataset, I ran the following analysis to test whether the new fields were flowing through properly. Unfortunately, my test shows that windowsBuildNumber (res1) and windowsUBR (res2) are both uniformly null. However, installYear (res3) appears to be working just fine.

> sbt assembly
> spark-submit --master yarn-client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.10/telemetry-batch-view-*.jar --from 20160101 --to 20160101 --bucket telemetry-test-bucket

> val testData = sqlContext.parquetFile("s3://telemetry-test-bucket/longitudinal/v20160101/")
> testData.select("system_os").map(x => x(0).toString().split(",")(5)).distinct.collect
res1: Array[String] = Array(null)

> testData.select("system_os").map(x => x(0).toString().split(",")(6)).distinct.collect
res2: Array[String] = Array(null)

> testData.select("system_os").map(x => x(0).toString().split(",")(7)).distinct.collect
res3: Array[String] = Array(2079, 1980, 1981, [...]

Side note:

The anonymous function I use in the above map seems too complicated for the task at hand. However, when I try to access the array of results stored in the "system_os" column of the testData dataframe directly, I get the following error:

scala> testData.select("system_os").take(1)(0)(0)(0)
<console>:28: error: Any does not take parameters
       testData.select("system_os").take(1)(0)(0)(0)
                                           ^
Flags: needinfo?(rvitillo)
(In reply to Ryan Harter [:harter] from comment #2)
> After generating an example longitudinal dataset, I ran the following
> analysis to test whether the new fields were flowing through properly.
> Unfortunately, my test shows that windowsBuildNumber (res1) and windowsUBR
> (res2) are both uniformly null. However, installYear (res3) appears to be
> working just fine.

You are using a very old submission date (20160101), which is probably why you don't see any values for windowsBuildNumber and windowsUBR.

> scala> testData.select("system_os").take(1)(0)(0)(0)
> <console>:28: error: Any does not take parameters
>        testData.select("system_os").take(1)(0)(0)(0)
>                                            ^

That's expected, as the apply method of Row returns Any; see [1] for the methods of Row that don't return Any. The DataFrame API isn't very pleasant to work with, unfortunately. The Dataset API [2] fixes many of its shortcomings.

[1] https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html
[2] https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#datasets
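(Editor's note) A minimal sketch of what [1] suggests: instead of splitting the struct's string form on commas, pull typed values out of the Row. This is untested against the real longitudinal schema, and the struct field names used below (windows_build_number, windows_ubr) are assumptions, not verified:

```scala
import org.apache.spark.sql.Row

// system_os is an array of structs, one entry per ping; getSeq avoids the
// Any returned by Row.apply. Field names here are assumed, not verified.
val firstRow: Row = testData.select("system_os").first()
val osRecords: Seq[Row] = firstRow.getSeq[Row](0)
val buildNumber = osRecords.head.getAs[Integer]("windows_build_number")
val ubr = osRecords.head.getAs[Integer]("windows_ubr")
```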
Flags: needinfo?(rvitillo)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> You are using a very old submission date (20160101), which is probably why
> you don't see any values for windowsBuildNumber and windowsUBR.

Yep, rerunning the analysis for 2016-07-01 shows a lot of new values. Everything appears to be working.

> That's expected, as the apply method of Row returns Any; see [1] for the
> methods of Row that don't return Any. The DataFrame API isn't very pleasant
> to work with, unfortunately. The Dataset API [2] fixes many of its
> shortcomings.

I was hoping there was an easier way to do this. I'll read through that guide. Thanks for the pointer.
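(Editor's note) The Dataset approach from [2] looks roughly like the sketch below. This is untested against Spark 1.6, and the case class fields are hypothetical; they would need to match the actual system_os struct in the parquet schema:

```scala
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

// Hypothetical case class; field names must match the system_os struct.
case class OsRecord(name: String, version: String)

val ds = testData
  .select(explode($"system_os").as("os"))  // one row per array entry
  .select("os.*")                          // flatten the struct into columns
  .as[OsRecord]                            // typed Dataset, no Row.apply/Any

ds.map(_.version).distinct.collect()
```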
This is complete and the longitudinal dataset has been recalculated.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Attachment #8773106 - Flags: review?(rvitillo) → review+
Product: Cloud Services → Cloud Services Graveyard