Closed Bug 600859 Opened 14 years ago Closed 14 years ago

Push Socorro 1.7.4 to production

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: minor

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Assigned: jabba)

Details

(Whiteboard: 10/07/2010)

Attachments

(1 file)

Just seeking a release window for this at present, so I'm filing early as per our postmortem discussions.

Overview: We have two changes, bug 596689 and bug 600246, which are needed for Firefox beta 7 according to chofmann. These changes only affect CSV file generation (a cron job). (We had planned on a further 4 bugs, but those will not be ready in time for beta 7, so there will be a 1.7.5 for them, ready in approximately two weeks.)

User-facing downtime: none

Risk/rollback plan: If there is any problem with the CSVs generated, we can roll back the cron job to 1.7.3 and run it again, so risk is very low.

Timeline: The changes have been tested locally. Staging these bugs is blocked by bug 598752, which I expect to clear up today. (Waiting on a mod_python box.)

Suggested window: 10/5 or 10/7 in the regular maintenance window.
Suggesting 10/07 only because there's a SUMO push on Tuesday. I don't know whether there are conflicting resources for that, though. aravind/jabba?
Flags: needs-downtime+
Whiteboard: 10/07/2010
Just a quick comment... in the future it'd be nice to get diffs attached to the bugs so those of us not familiar with the code can see how invasive the changes being pushed are. 10/7 would be fine with me, as I can rely on the web interface if something goes wrong with csv generation (and I am pretty sure the fixes are fairly trivial).
Here's a patch that shows the minor differences between 1.7.3 and 1.7.4.
It's not really clear to me why this needs to go in a regular maintenance window. It seems like a really good change to decouple from any other work, since it only involves the .csv files that a handful of people and I use each day.

laura did a good job of outlining the minimal risk, *zero* downtime, and rollback plan in comment 0. Our history of doing these .csv changes indicates that detaching them from other Socorro updates actually works out best, since the problems we have run into in the past come from forgetting to restart the cron jobs, or from delays in the changes going into effect because of regressions from other web, database, or report-processing load issues.

Let's spend the two minutes to push the script changes and have them get picked up in the next cron job run, so I can start analyzing more of the data we need to ship a good Firefox 4 this weekend.
chofmann: the only thing is that the changes aren't tested/staged yet.
The best test is to run the script against the production db. Think of this more like "hey aravind, can you run some sql so we can look at some data from the database" rather than a "release"; the only difference beyond that is that this script runs every night on a cron job.
(In reply to comment #4)
> It's not really clear to me why this needs to go in a regular maintenance
> window. It seems like a really good change to decouple from any other work,
> since it only involves the .csv files that a handful of people and I use
> each day.
>
> laura did a good job of outlining the minimal risk, *zero* downtime, and
> rollback plan in comment 0.

Until a site has a good track record of doing risk-free pushes, I want these in scheduled, announced windows. Everyone in IT knows to be around during the Tuesday/Thursday times in case they need to be called in.
(In reply to comment #7)
> Until a site has a good track record of doing risk-free pushes, I want these
> in scheduled, announced windows. Everyone in IT knows to be around during the
> Tuesday/Thursday times in case they need to be called in.

I would agree for changes that are integral to the operation of the site. This one is not. I'm not sure it really falls into the bucket of something I'd call a "push" or a revision to the operation of the site. It's just a script that runs some sql on a cron job.

Can we just move this update to the script over to the production side, run it by hand, and produce some csv files that I can start looking at? That would be fine to suit my initial needs of getting a better handle on the video cards, drivers, OSes, and .dll's coming into play in 3d, direct 2d, and hardware-accelerated graphics related crashes as soon as possible. Then we could wait to push the new script update into place, where the cron job runs it, at the next announced window.
The changes to the Socorro system to accommodate this enhancement touch two apps: the dailyUrl cron (which should probably be renamed to dailyCsv), as well as the processor. The processor change is very minor, but I think that is what pushes this change over the threshold to require a release.

The change to the processor is necessary to meet the requirement of getting the number of cores into the 'cpu_info' field. The number of cores, while available in the metadata, has not, until now, been recorded in the database. Rather than adding a new database column, I modified the processor to read that data and merge it with the 'cpu_info' database column. So the change to the processor is minor. If you can do without the number-of-cores information, the scope of the problem shrinks to affecting only the 'dailyUrl' cron, and then I would advocate that it would not require a full-blown release to get into production.

I regret that this project has lost some of the agility that it had a year and a half ago, when changes like this could go from concept to production in a matter of hours. The introduction of formal procedures will inevitably slow things down compared to my "cowboy" actions of the past. The goal is, of course, to improve predictability and reliability; the downside is the reduction in agility.
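Purely for illustration, a minimal sketch of the kind of merge described above, assuming the processor already has the cpu_info string in hand and can read a core count from the submitted metadata. The metadata key and the output format below are placeholders, not the actual Socorro identifiers:

def merge_core_count(cpu_info, raw_metadata):
    """Append the core count from the raw crash metadata to cpu_info.

    'number_of_cores' is a hypothetical key standing in for wherever the
    processor actually reads the count; the separator is illustrative.
    """
    cores = raw_metadata.get("number_of_cores")
    if cores is None:
        # No core count submitted: leave the column value unchanged.
        return cpu_info
    # Suffix the count onto the existing string instead of adding a new
    # database column.
    return "%s | %s" % (cpu_info, cores)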
OK, I buy the argument that changes to the processor cross the line and make this change "release worthy." I should have looked closer at the diffs to see that that's the approach that was taken.

> I regret that this project has lost some of the agility it had...

Me too. I think there is a middle ground here that we need to be striving for. We definitely need to get *much* more agile at pushing reports and tools used to analyze the data! Problems with recent releases show that we need to get more systematic (and accept the correspondingly reduced agility) around changes to configuration, system loading, incoming report processing, and data management. I hope I can convince everyone not to conflate the two.

I'd like to see us push for a quarterly goal of shipping 25 or 50 new reporting and analysis tool enhancements that would help us use the data we have much more effectively. Many of these have been delayed or deferred as part of the work to enhance the backend processing and data storage. I keep hoping we can shift the balance soon. Does this seem achievable?

I'd really like to see this get pushed on Tuesday. Is there any way this can be done in parallel with the unrelated SUMO change Matt talked about in comment 1, or is there a place to arbitrate the competing push priorities between projects?
Staging and testing are now complete. The tag is:
http://socorro.googlecode.com/svn/tags/releases/1.7.4_r2552_20101005/

This release is ready to go into the starting gate. mrz, any chance tonight, or are we looking at Thursday?
Thursday's preferable.
Assignee: server-ops → jdow
Are we *100%* confident in our rollback plan (and would we lose any data if it's executed)? Is it documented? How much downtime is acceptable before we initiate the backout plan?
To clarify, here's what I need before 4pm:

1. Deployment plan. What steps does Ops run?
2. Rollback steps. What steps does Ops run?
3. How do we know this push was a success? If it's not, roll back in 30 mins?
1. Update (and restart where needed) each Socorro component. (There are only changes to the processor and one cron job, but we like to keep the codebase in sync.) The tag is http://socorro.googlecode.com/svn/tags/releases/1.7.4_r2552_20101005/ Update by switching to this tag. In detail:
- Update and restart collectors
- Update and restart monitor
- Update and restart processors
- Update and restart web services
- Update cron jobs
- Update web app and purge caches (memcache, lb)

2. To roll back, perform the same steps but using the tag http://socorro.googlecode.com/svn/tags/releases/1.7.3_20100916/

3. This release only affects the output from the dailyUrl cron. After it has run (not sure when it's scheduled), if there are any problems with the output (chofmann is the main user, so he will confirm), perform the steps in 2. and manually re-execute the cron job.
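As a rough illustration of step 1 and the rollback in step 2, a minimal sketch of switching each checkout between the two tags. The component paths are hypothetical placeholders (the real production install locations aren't given here), and the per-component restarts and cache purges listed above still have to happen separately:

import subprocess

RELEASE_TAG = "http://socorro.googlecode.com/svn/tags/releases/1.7.4_r2552_20101005/"
ROLLBACK_TAG = "http://socorro.googlecode.com/svn/tags/releases/1.7.3_20100916/"

# Hypothetical checkout locations; substitute the real production paths.
COMPONENT_DIRS = [
    "/data/socorro/collector",
    "/data/socorro/monitor",
    "/data/socorro/processor",
    "/data/socorro/webservices",
    "/data/socorro/cron",
    "/data/socorro/webapp",
]

def switch_all(tag):
    # 'svn switch' repoints each working copy at the given tag.
    for path in COMPONENT_DIRS:
        subprocess.check_call(["svn", "switch", tag, path])

# switch_all(RELEASE_TAG)   # deploy 1.7.4
# switch_all(ROLLBACK_TAG)  # roll back to 1.7.3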
We also need to monitor the processors a bit more closely after the update and restart, since the change there has a small chance of impacting performance with the code that creates a new table (that's why this qualifies as a "release").
Point of clarification: we will monitor the processor immediately after restart. We can detect success nearly instantly by looking in the database at any newly processed report: data in the 'cpu_info' column will contain the number of cores suffixed to the end. chofmann is not correct about the addition of a new table; it is the processing of that one column that is different. Should the whole process "go pear shaped" and the processor fail to work at all (extremely unlikely), we will know immediately, and data loss would be confined to a very small number: sum(aProcessor.numberOfThreads for aProcessor in ListOfAllProcessors).
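A minimal sketch of that spot check, assuming a PostgreSQL reports table with 'uuid', 'cpu_info', and 'completed_datetime' columns (the table/column names and connection settings are assumptions about the Socorro schema, not taken from this bug):

import psycopg2

# Placeholder connection settings; the production credentials differ.
conn = psycopg2.connect(host="localhost", dbname="breakpad", user="monitor")
cur = conn.cursor()

# Spot-check the most recently processed reports: with the new processor
# live, cpu_info should end in the core count (assumed here to follow a
# '|' separator; adjust to whatever format the processor actually writes).
cur.execute(
    "SELECT uuid, cpu_info FROM reports "
    "ORDER BY completed_datetime DESC LIMIT 5"
)
for uuid, cpu_info in cur.fetchall():
    suffix = (cpu_info or "").split("|")[-1].strip()
    status = "ok" if suffix.isdigit() else "no core count"
    print("%s  %s  [%s]" % (uuid, cpu_info, status))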
The update is done and lars and laura confirmed over irc that everything is working properly.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Webapp, collection, processing all working as expected. chofmann: please verify the csvs are as expected.
Initial look at the expanded cpu info and app notes additions to the .csv file looks great and will really help!

In the previous day's .csv we got limited cpu info:

awk -F\t '{print $13}' 20101006* | sort | uniq -c | more
   5385 \N
    377 amd64
      1 cpu_name
   3061 ppc
 362027 x86

Now we get expanded info:

awk -F\t '{print $13}' 20101007* | sort | uniq -c | sort -nr | more
  41703 x86 | GenuineIntel family 6 model 23 stepping 10
  36847 x86 | GenuineIntel family 6 model 15 stepping 13
  13767 x86 | GenuineIntel family 6 model 23 stepping 6
  10817 x86 | GenuineIntel family 15 model 4 stepping 1
  10342 x86 | GenuineIntel family 15 model 2 stepping 9
  10329 x86 | GenuineIntel family 6 model 15 stepping 11
   9855 x86 | AuthenticAMD family 15 model 107 stepping 2
   9829 x86 | GenuineIntel family 15 model 4 stepping 9
   8820 x86 | GenuineIntel family 6 model 15 stepping 6
   8127 \N
<long tail snipped>

For the new app_notes entries, we see that 22659 of 380041 reports had app notes:

awk -F\t '$26 !~ /\N/ {print $26}' 20101007* | sort | uniq -c | sort -nr | more
    798 AdapterVendorID: 0000, AdapterDeviceID: 0000
    563 AdapterVendorID: 8086, AdapterDeviceID: 2a42
    476 AdapterVendorID: 8086, AdapterDeviceID: 2a02
    403 AdapterVendorID: 1106, AdapterDeviceID: 3344
    360 AdapterVendorID: 1039, AdapterDeviceID: 6330
    355 AdapterVendorID: 8086, AdapterDeviceID: 29c2
    329 AdapterVendorID: 10de, AdapterDeviceID: 03d0
    329 AdapterVendorID: 10de, AdapterDeviceID: 0322
    319 AdapterVendorID: 0000, AdapterDeviceID: 0000\n
    310 AdapterVendorID: 8086, AdapterDeviceID: 2772
    307 AdapterVendorID: 8086, AdapterDeviceID: 0046
    299 AdapterVendorID: 8086, AdapterDeviceID: 2a42\n
    245 AdapterVendorID: 10de, AdapterDeviceID: 0622
    224 AdapterVendorID: 1002, AdapterDeviceID: 5975
    215 AdapterVendorID: 10de, AdapterDeviceID: 0402
    212 AdapterVendorID: 10de, AdapterDeviceID: 0326
    171 AdapterVendorID: 1106, AdapterDeviceID: 3371
    171 AdapterVendorID: 10de, AdapterDeviceID: 0641
    162 AdapterVendorID: 10de, AdapterDeviceID: 0640
<long tail snipped>

It will take more work to do some statistical analysis of how these generations of cpu families (and their age) and graphics cards correlate with various signatures and general "crashiness", but now we have some data to chew on.

One small cosmetic bug that might affect blcary's and others' parsing of the .csv file: the column header for the cpu_name field has changed to ?column?. We can fix that in a follow-up bug.
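For anyone repeating this tally without awk, a rough Python equivalent under the same assumptions as the commands above (tab-separated daily files, cpu info in column 13, app_notes in column 26, NULLs written as \N); this is only an analysis helper, not part of the release:

import csv
import glob
from collections import Counter

cpu_counts = Counter()
adapter_counts = Counter()

for path in glob.glob("20101007*"):
    with open(path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 26:
                continue
            cpu_counts[row[12]] += 1         # column 13: cpu info
            if row[25] != r"\N":             # column 26: app_notes, skip NULLs
                adapter_counts[row[25]] += 1

for value, count in cpu_counts.most_common(10):
    print("%7d %s" % (count, value))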
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard