Closed Bug 608347 Opened 14 years ago Closed 14 years ago

create AMO collection mode in collect.py

Categories

(Webtools Graveyard :: Graph Server, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: rhelmer)

References

Details

Attachments

(8 files, 1 obsolete file)

Blocks: 599169
Assignee: nobody → ianb
This is implemented for the graphs code in: http://hg.mozilla.org/users/ibicking_mozilla.com/graphs/rev/043b9d1e4e38 This hasn't been deployed, and may need to be backported to deploy into production. The core code is here: http://hg.mozilla.org/users/ibicking_mozilla.com/graphs/file/tip/server/pyfomatic/collect_amo.py (with tests here: http://hg.mozilla.org/users/ibicking_mozilla.com/graphs/file/tip/tests/), along with a small change to server/pyfomatic/collect.py that would need to be backported (http://hg.mozilla.org/users/ibicking_mozilla.com/graphs/diff/043b9d1e4e38/server/pyfomatic/collect.py) Also server/graphsdb.py will have to be updated to have a amo_db connection object that points to the AMO database.
One issue: the database schema in at https://github.com/jbalogh/zamboni/blob/master/migrations/91-adjust-perf-tables.sql#L29 removing the perf_osversions.version column, as graphs does not contain any specific OS version information (except embedded in the OS name I guess). Or... maybe we were going to parse the OS name into two parts, I wasn't sure.
Thanks! If you could parse it out, it would be easier to use on AMO since we'll want them separated.
Thanks. CCing alice to get this deployed on staging
I'll get this onto staging this week. Thanks for all the hard work, Ian.
Depends on: 614400
Attached patch corrected os_id to osversion_id (deleted) — Splinter Review
Attachment #493796 - Flags: review?
Attachment #493796 - Flags: review? → review?(ianb)
Attachment #493796 - Flags: review?(ianb) → review+
This is on stage and working, thanks everyone. Two things so far: 1) The exponential numbers. Several averages are coming through as things like 8.16079e+07. Technically, this is fine, but is something wrong there? Those numbers are way different than most. 2) It looks like everything is getting inserted into the database 4 times, sometimes with different averages. There should be 1 row per add-on, per app, per os, per test (in this case, always ts). You can see the current data as of this post at http://pastebin.mozilla.org/874871
Looks like there is some poor name matching going on: http://pastebin.mozilla.org/874936 You can see that I'm sending results for a bunch of metrics, but they are all going into the 'ts' bucket. Anything that isn't called 'ts' or 'tp4' should be ignored.
Attached patch handle Null addon_id case (deleted) — Splinter Review
Attachment #496305 - Flags: review?(ianb)
Attachment #496305 - Flags: review?(ianb) → review+
Looking over the code, the average that you use is a straight average. The rest of talos world always excludes the max values before calculating an average (talos average = total - max_result/ runs-1). Can you fix up the calculation?
(In reply to comment #11) > Looking over the code, the average that you use is a straight average. The > rest of talos world always excludes the max values before calculating an > average (talos average = total - max_result/ runs-1). Can you fix up the > calculation? Pushed to http://hg.mozilla.org/users/ibicking_mozilla.com/graphs/rev/911b07ebff80 and also now live on staging.
There are now quadruplicate entries for addon_id = NULL: http://pastebin.mozilla.org/882712
A couple of fixes in this patch: - correctly calculate average - handle 'NULL' addon_id so as not to create quadruplicate entries
Attachment #497921 - Flags: review?(ianb)
Take the working collect_amo.py and add it to the currently used graph server front end.
Attachment #497922 - Flags: review?(ianb)
Attachment #497921 - Flags: review?(ianb) → review+
Attachment #497922 - Flags: review?(ianb) → review+
Comment on attachment 497922 [details] [diff] [review] [checked in]backport new amo collection mode to old graph server front end changeset: 326:6360c723794c
Attachment #497922 - Attachment description: backport new amo collection mode to old graph server front end → [checked in]backport new amo collection mode to old graph server front end
Depends on: 620015
Comment on attachment 497921 [details] [diff] [review] [checked in] final fixes for ian's collect_amo.py Pushed to ian's repo, ce1940d12c17
Attachment #497921 - Attachment description: final fixes for ian's collect_amo.py → [checked in] final fixes for ian's collect_amo.py
This patch was pushed live in bug 620015 and the AMO DB being unavailable caused a tree closure (bug 620572) Bug 620596 is about putting a queue in place, which is a good long-term solution but will take some time to implement. For right now, let's throw the AMO data away if we cannot contact the database (per clouserw). Reassigning this to me (Ian is now in Labs and no longer works on graph server).
Assignee: ianb → robert
Status: NEW → ASSIGNED
I've set up a Hudson job to run the unit tests for this feature: https://hudson.mozilla.org/job/graphs I see two failures right now, they both look like rounding errors. Will take a look at these before closing this bug.
Attached patch catch and report mysql connection problems (obsolete) (deleted) — Splinter Review
Not tested for real yet, but introduces no new errors in the unit test.
Attachment #500853 - Flags: review?(anodelman)
Erg - just realized that that is a patch for the 2.0 graph server branch. We need one for 1.0 so that this can be rolled out to the current production graph server. I can hack something together.
This patch is against hg.m.o/graphs branch 1.0 After giving it some more thought I think catching Exception might be the right thing to do here, since we really honestly don't care what happens in the "try" block and want to continue processing if at all possible. Alice, I don't know if "RETURN\tfail (...)" is going to do be useful to talos, feel free to change this.
Attachment #500853 - Attachment is obsolete: true
Attachment #500972 - Flags: review?(anodelman)
Attachment #500853 - Flags: review?(anodelman)
Comment on attachment 500972 [details] [diff] [review] [checked in]catch and report mysql connection problems This patch covers the issue in collect.py, but there is still connection problems in graphsdb.py. If the amo_db is not able to connect then regular talos sends also fail, as when graphsdb is called it attempts to set up both the regular db connection and the amo_db connection. graphdb.py should be changed to hold the db credentials, actual db connections should be set up elsewhere on an as-needed basis.
Attachment #500972 - Flags: review?(anodelman) → review+
This handles the connection problem described in comment 23. The actual connection is now done inside collect_amo.py, which is wrapped in a try/except by collect.py I'd really like to move the regular "db" connect out of graphsdb.py too, but it'd make this patch noisy and there isn't any real benefit right now. Stuck a FIXME comment in there, we can do this more consistently on the 2.0 branch.
Attachment #501375 - Flags: review?(anodelman)
Attachment #501375 - Flags: review?(anodelman) → review+
Solution tested on staging and regular talos data passing to graph server is successful even if amo_db is down. This effectively removes the amo_db dependency and clears the way for this to be rolled out.
There's an issue in collect.py (1.0 branch) is the spacing, this fixes it. In the old state the VALUES/AVERAGE code was called even if AMO data was sent. By adding the correct depth to the if statement when AMO data is sent the VALUES/AVERAGE code is no longer executed. Tested green on staging.
Attachment #501452 - Flags: review?(rhelmer)
Attachment #501452 - Flags: review?(rhelmer) → review+
Comment on attachment 501452 [details] [diff] [review] [checked in]fix spacing in collect.py changeset: 329:e69875df700d
Attachment #501452 - Attachment description: fix spacing in collect.py → [checked in]fix spacing in collect.py
Comment on attachment 500972 [details] [diff] [review] [checked in]catch and report mysql connection problems changeset: 330:c8063e009870
Attachment #500972 - Attachment description: catch and report mysql connection problems → [checked in]catch and report mysql connection problems
Comment on attachment 501375 [details] [diff] [review] [checked in]move amo_db connection into collect_amo.py changeset: 331:91bdf3d3cf29
Attachment #501375 - Attachment description: move amo_db connection into collect_amo.py → [checked in]move amo_db connection into collect_amo.py
Depends on: 623386
This has been pushed to production in bug 623386, and we tested that the DB being unavailable (bad auth, bad IP) does not impact normal data collection. During the downtime, we were not able to confirm normal operation using an example from: https://wiki.mozilla.org/Buildbot/Talos/DataFormat#AMO on prod or staging, although it tested green on talos staging after the change in comment 29 landed (by anode). I don't have access or visibility into talos staging; however re-testing now I see different behavior on prod and stage: $ curl -F filename=@/tmp/test http://graphs-stage.mozilla.org/server/collect.cgi RETURN fail unsubscriptable object $ curl -F filename=@/tmp/test http://graphs.mozilla.org/server/collect.cgi RETURN fail global name 'db' is not defined It would be good to test this in Talos staging (I have no visibility into or access to this), if it is not working then we probably need to change something in the environment on prod, and/or ensure we've got the same code in both prod and stage for graphserver. clouserw is following up on this testing, as I am out this week.
(In reply to comment #30) > This has been pushed to production in bug 623386, and we tested that the DB > being unavailable (bad auth, bad IP) does not impact normal data collection. > > During the downtime, we were not able to confirm normal operation using an > example from: > https://wiki.mozilla.org/Buildbot/Talos/DataFormat#AMO on prod or staging, > although it tested green on talos staging after the change in comment 29 landed > (by anode). > > I don't have access or visibility into talos staging; however re-testing now I > see different behavior on prod and stage: > > $ curl -F filename=@/tmp/test > http://graphs-stage.mozilla.org/server/collect.cgi > RETURN fail unsubscriptable object > > $ curl -F filename=@/tmp/test http://graphs.mozilla.org/server/collect.cgi > RETURN fail global name 'db' is not defined > > It would be good to test this in Talos staging (I have no visibility into or > access to this), if it is not working then we probably need to change something > in the environment on prod, and/or ensure we've got the same code in both prod > and stage for graphserver. > > clouserw is following up on this testing, as I am out this week. Bhearsum, no one on my team (except Alice, who's away) knows how to get into graphs-staging. Is this something you guys (RelEng) can verify? Or teach us(me, harth, jmaher) how to verify? Who actually owns this piece of the puzzle? Thanks for the help.
I tried running this in staging and found that we don't have the newly required "AddonTester" branch in the staging or production database. I did do a run through of bug 599169 in staging production against the Tryserver branch, however. In staging, I got this error when it tried to submit: DEBUG: process_Request line: fail (1452, 'Cannot add or update a child row: a foreign key constraint fails (`addons_reskin/perf_results`, CONSTRAINT `perf_results_addon_id_key` FOREIGN KEY (`addon_id`) REFERENCES `addons` (`id`))') DEBUG: process_Request line: In production, I got the same error rhelmer did in his test: DEBUG: process_Request line: fail global name 'db' is not defined DEBUG: process_Request line:
> I did do a run through of bug 599169 in staging production against the > Tryserver branch, however. In staging, I got this error when it tried to > submit: > DEBUG: process_Request line: fail (1452, 'Cannot add or update a child row: a > foreign key constraint fails (`addons_reskin/perf_results`, CONSTRAINT > `perf_results_addon_id_key` FOREIGN KEY (`addon_id`) REFERENCES `addons` > (`id`))') > DEBUG: process_Request line: What data were you submitting? It sounds like you were trying to insert data for an add-on that doesn't exist.
Waiting to get my access fixed on graphs-stage (bug 613267), catlee was kind enough to look on graphs-stage for local patches and we found this one. I am sure that this is causing the exception in comment 30 that we can't repro on staging. Just a missing import. We should remove any unnecessary local patches on staging and have a good test case before pushing this to prod though, so we don't keep spinning our wheels on this one (comment 32 and comment 33 are looking close to a good test).
Attachment #512852 - Flags: review?(catlee)
Attachment #512852 - Attachment is patch: true
Attachment #512852 - Attachment mime type: application/octet-stream → text/plain
Attachment #512852 - Flags: review?(catlee) → review+
(In reply to comment #33) > > I did do a run through of bug 599169 in staging production against the > > Tryserver branch, however. In staging, I got this error when it tried to > > submit: > > DEBUG: process_Request line: fail (1452, 'Cannot add or update a child row: a > > foreign key constraint fails (`addons_reskin/perf_results`, CONSTRAINT > > `perf_results_addon_id_key` FOREIGN KEY (`addon_id`) REFERENCES `addons` > > (`id`))') > > DEBUG: process_Request line: > > What data were you submitting? It sounds like you were trying to insert data > for an add-on that doesn't exist. I _think_ this is what's being sent, possibly constructed differently: START AMO Firefox,3.6.15pre,https://addons.mozilla.org/en-US/firefox/downloads/latest/4925 talos-r3-snow-001,ts,Tryserver,baaf2315445d,20110216033208,1297952166 0,1267.00,NULL 1,958.00,NULL END Do we have to manually add entries for all the add-ons that we'll be testing, or is that done on demand?
Comment on attachment 512852 [details] [diff] [review] [checked in] import db (was patched locally on stage) Can someone please land this in hg.m.o/graphs ? I don't have access to push there (filed bug 635119 for that).
ping? We still need the "import db" patch landed, and to figure out why submissions are failing because of foreign key constraints.
(In reply to comment #37) > ping? We still need the "import db" patch landed, and to figure out why > submissions are failing because of foreign key constraints. I am waiting on getting access to hg.m.o/graphs (set it blocking), maybe someone with access can land this sooner (catlee?) I can take a look at the foreign key constraints problem tomorrow (will probably need help from the AMO team on this).
Depends on: 635119
(In reply to comment #35) > I _think_ this is what's being sent, possibly constructed differently: > START > AMO > Firefox,3.6.15pre,https://addons.mozilla.org/en-US/firefox/downloads/latest/4925 > talos-r3-snow-001,ts,Tryserver,baaf2315445d,20110216033208,1297952166 > 0,1267.00,NULL > 1,958.00,NULL > END OK this works if we pass just the id (4925) and not the whole URL on the first line after AMO (per https://wiki.mozilla.org/Buildbot/Talos/DataFormat#AMO) Just did a successful test: """ [rhelmer@bm-buildgraph01 ~]$ cat amo.txt START AMO Firefox,3.6.15pre,4925 talos-r3-snow-001,ts,Tryserver,baaf2315445d,20110216033208,1297952166 0,1267.00,NULL 1,958.00,NULL END [rhelmer@bm-buildgraph01 ~]$ curl -F filename=@amo.txt 'http://graphs-stage.mozilla.org/server/collect.cgi' RETURN success """ clouserw verified that "958" was inserted for that addon ID in the perf_appversions table, for Firefox 3.6.15pre OS X 10.6.6 (the way averaging works in talos I believe is to throw out the first result and do a mean average on the rest, with only one value I think this is expected behavior). Conclusion - looks like it works ok, we just need that little "import db" patch landed. I'll prod about my access today, if someone else can land it in the meantime that would be grand.
Comment on attachment 512852 [details] [diff] [review] [checked in] import db (was patched locally on stage) aki landed this for us: http://hg.mozilla.org/graphs/rev/092eb6cef1ea
Attachment #512852 - Attachment description: import db (was patched locally on stage) → [checked in] import db (was patched locally on stage)
OK we are ready to go. Should I file an IT bug to get this pushed to prod? Do we need a downtime window?
No longer depends on: 635119
(In reply to comment #41) > OK we are ready to go. Should I file an IT bug to get this pushed to prod? Do > we need a downtime window? Went ahead and did this, bug 636851.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Thanks!
Product: Webtools → Webtools Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: