Closed Bug 1286358 Opened 8 years ago Closed 7 years ago

Buildbot to consume data from Treeherder's SETA

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: kmoir, Assigned: kmoir)

References

Details

Attachments

(3 files)

Instead of the flaky service on ouija.allizom.org, here are the relevant URLs. <armenzg> here's the Flask app for the Heroku version: https://github.com/mozilla/ouija/blob/master/src/server.py and this is the list of jobs that should be run: http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1

Today we consume a list of jobs that should not be run and create a dict of that in buildbot to skip jobs in the scheduler.
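As a rough illustration of that consumption model, here is a minimal sketch (not the actual buildbot code); it assumes the endpoint returns a "jobtypes" mapping of dates to builder names, which matches the shape quoted later in this bug:

    # Sketch only: fetch the SETA job list and build a lookup set the
    # scheduler could use to skip builders.  The response shape and the
    # helper name are assumptions, not buildbot's real implementation.
    import json
    import urllib2

    SETA_URL = "http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1"

    def fetch_seta_builders(url=SETA_URL):
        """Return the set of builder names listed by the SETA endpoint."""
        response = urllib2.urlopen(url, timeout=30)
        data = json.loads(response.read())
        builders = set()
        # Assumes a {"jobtypes": {"<date>": ["<builder name>", ...]}} payload.
        for names in data.get("jobtypes", {}).values():
            builders.update(names)
        return builders

    skip_builders = fetch_seta_builders()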
Assignee: nobody → kmoir
Blocks: 1176784
jmaher: what do you think is needed before we can start consuming the data from the Heroku instance?
While this hasn't been a top priority for me, I will say that the Heroku instance isn't running all the services that we run; I believe that is due to needing to pay for larger databases and computing power. A good next step is to figure out what we need and how to pay for it, then get things truly running in parallel. MikeLing, can you give us details of what we are running now on Heroku and what we could do if we paid for the service?
Flags: needinfo?(sabergeass)
(In reply to Joel Maher ( :jmaher ) from comment #2)
> MikeLing, can you give us details of what we are running now on Heroku and
> what we could do if we paid for the service?

Sure! For now, I can make sure that SETA (not Ouija) runs on Heroku just like the alertmanager server. But, due to the database size limit on Heroku, we still need to use part of the data from the alertmanager server right now [1], and we can only support four or five days' worth of data for querying on Heroku, also because of the database size. So, after we pay for the service, we could support querying data for each day and would no longer need to retrieve data from the alertmanager server. Furthermore, we could write a script to automatically run failures.py and updated.py to get the latest data and make our results more reliable :)

[1] https://github.com/MikeLing/ouija/blob/ouija-rewrite/tools/failures.py#L34
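A rough sketch of what such an automation wrapper might look like; the script paths and the lack of arguments are assumptions, not the actual ouija tooling:

    # Hypothetical wrapper to refresh SETA data on a schedule (cron or the
    # Heroku scheduler).  Script names are taken from the comment above;
    # their locations and arguments are assumptions.
    import subprocess
    import sys

    def refresh_seta_data():
        for script in ("tools/failures.py", "tools/updated.py"):
            ret = subprocess.call([sys.executable, script])
            if ret != 0:
                print("%s exited with %d" % (script, ret))
                return ret
        return 0

    if __name__ == "__main__":
        sys.exit(refresh_seta_data())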
Flags: needinfo?(sabergeass)
I'm an admin and I can upgrade this. Is it the add-on that needs upgrading or the dyno? If the add-on, are these the instructions? [1] If so, would you like me to go ahead and upgrade, or do you want me to wait?

Can we also rename the app to "seta" instead of seta-dev?

[1] https://devcenter.heroku.com/articles/upgrading-heroku-postgres-databases
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #4)
> Is it the add on that needs upgrading or the dyno?

I'm not sure about this. For now, I think we only need to upgrade the database size, but it would be great to have more dynos (I don't know what we would do with them yet).

> If the add on, are these the instructions? [1]

Yeah, I think so.

> If so, would you like me to go ahead and upgrade? or do you want me to wait?
>
> Can we also rename the app to "seta"? Instead of seta-dev.

My opinion is to please go ahead and upgrade it, so we can do more things with it :) And I'm totally OK with either seta or seta-dev. Thank you!

BTW, I don't have enough access to use 'fork' to add a staging server for seta [1].

[1] https://devcenter.heroku.com/articles/multiple-environments#starting-from-an-existing-app
Depends on: 1292560
I've upgraded the DB. I've also upgraded the dyno so we can have metrics. I've also forked seta-dev to seta and pointed seta to armenzg/ouija instead of your repo (I will eventually point it to mozilla/ouija when I have permission). I've also created a seta pipeline with the seta app as production and seta-dev as staging; PRs will eventually autodeploy new versions of SETA to be tested against.

You can visit it here: https://seta.herokuapp.com
Summary: see if we can consume data from heroku app w seta data → Consume data from the Heroku app seta-dev
The seta app now deploys automatically from mozilla/ouija:master. seta-dev is based on mikeling's repo (manual deployments).
armenzg: what is the status of the new SETA deployment now that the GSoC term has completed? I notice https://seta.herokuapp.com doesn't seem to work anymore.
Flags: needinfo?(armenzg)
We have seta and seta-dev (staging server) on Heroku. There are a few things required to get this done:
1) migrate old data
2) create an endpoint for taskcluster to use
3) land in-tree code for taskcluster
4) when taskcluster is a-ok, migrate buildbot to the new server

Right now we are working on 1 and 2 via pull requests/issues (https://github.com/mozilla/ouija/pulls). I have personally been working on #1; it required cleaning up a lot of data and fixing a major loss of data from ~6 weeks ago, when a Treeherder API changed and we stopped getting important data for SETA. The data is fine now; I am testing data migration and hopefully early next week we can call #1 done.

Regarding #2, there is much discussion on the PRs for the endpoint, including example code. Possibly when #1 is done we can all focus more heavily on #2 and resolve it the week after next. That would mean SETA + taskcluster would be running by the end of the month, and a couple of weeks later we could look at migrating off of buildbot.
Flags: needinfo?(armenzg)
Armen, is this the new endpoint we need to consume? I saw your note to the releng list and looked at the pull request.
http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1&date=2016-12-21
Flags: needinfo?(armenzg)
Depends on: 1306709
Flags: needinfo?(armenzg)
Summary: Consume data from the Heroku app seta-dev → Consume data from Treeherder's SETA
I also see it being called with &inactive=1. Do you ever call it without &inactive=1? I want to know when the API is called in different ways.
https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config_seta.py#75
https://dxr.mozilla.org/build-central/source/tools/buildfarm/maintenance/update_seta.py#45
No, I don't call it without inactive. I just want the list of tests to run at a less frequent interval, and I construct the dictionary of values for skipconfig from that data.
I will tackle this in the first week of January. Any other reviewer while kmoir is on PTO?
Assignee: kmoir → armenzg
Blocks: 1325404
I've landed on master what I believe are the sufficient changes to switch Buildbot over. I will review this bug and the output from both systems to understand whether any issues remain.

At the beginning of this bug we wanted to switch over from:
http://alertmanager.allizom.org/data/setadetails
to:
http://seta.herokuapp.com/data/setadetails

For now we have:
https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/
but the final URL will be (we're waiting for a deployment this week):
https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/

From my understanding of this bug, we focused on making SETA work for TaskCluster first. We also reached a good point with MikeLing's work on Buildbot support; however, we decided to wait a bit longer and consume from Treeherder directly. By doing this, we would only switch over once rather than twice. Thanks MikeLing for your hard work and for making my work easier!

As kmoir describes:
> Today we consume a list of jobs that should *not* be run and create a dict of that in buildbot
> to skip jobs in the scheduler.

Changes since the original implementation:
* We don't need to specify a date when calling the API (&date=2017-01-09)
* &inactive=1 is no more. We now use priority=5 (meaning low value jobs)
* We now use &build_system_type={buildbot,taskcluster} instead of &buildbot=1

Each endpoint seems to return different values:
* Alertmanager - 846 builders [1]
* seta-dev - 1014 builders [2]
* seta - 1199 builders [3]
* TH's API - 1249 builders [4]

In any case, the new TH SETA API seems to show reasonable values for priority=1 [5]: 29 builders. This includes 28 Talos builders from preseed.json + 1 builder from analyzing failures [6].

We're going to have to wait a few days until the SETA changes make it into production. Will needs to deploy some major changes and do some manual DB work this week. I would like to see the number of builders we get with priority=1 once we have a lot more failures-fixed-by-commit data.

I will still prepare the Buildbot patches for what I believe are the currently required changes.

jmaher, kmoir: could you please review this comment and see if it makes sense to you?

[1] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-09&buildbot=1&branch=mozilla-inbound&active=0
[2] http://seta-dev.herokuapp.com/data/setadetails/?buildbot=1&priority=5
[3] http://seta.herokuapp.com/data/setadetails/?buildbot=1&priority=5
[4] https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
[5] https://treeherder.allizom.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=1&format=json
[6] https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/?format=json
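To make the parameter changes concrete, here is a small, hypothetical helper for building the new-style query URL; the parameter names and values come from the list above, but the helper itself is not real code from buildbot or Treeherder:

    # Sketch of the query change: build_system_type replaces buildbot=1,
    # priority=5 replaces inactive=1, and no &date= parameter is needed.
    import urllib

    TH_SETA = "https://treeherder.allizom.org/api/project/%s/seta/job-priorities/"

    def seta_url(project="mozilla-inbound", build_system_type="buildbot", priority=5):
        params = {
            "build_system_type": build_system_type,  # replaces buildbot=1
            "priority": priority,                     # 5 = low value jobs; replaces inactive=1
            "format": "json",
        }
        return (TH_SETA % project) + "?" + urllib.urlencode(params)

    print(seta_url())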
Summary: Consume data from Treeherder's SETA → Buildbot to consume data from Treeherder's SETA
That's all it will take. I will have to wait until the Treeherder changes from 'master' make it into 'production'.
kmoir, jmaher: do you have any comments wrt comment 17?
Flags: needinfo?(kmoir)
Flags: needinfo?(jmaher)
No, I'm just writing a patch so we can consume the new endpoint. The data is served via https instead of http, so I'm cleaning up the buildbot code to consume that.
Flags: needinfo?(kmoir)
One major difference between the old SETA and the new one is that we don't consider fixed-by-commit jobs that were tagged with an empty field. This can definitely affect which builders are determined to be low value jobs.
https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/
vs.
alertmanager.allizom.org/data/seta/?startDate=2017-01-01&endDate=2017-01-09 (it's been loading for a while, but I'm sure this is the endpoint)
I am concerned with the 1 job to run via analysis: is this with 90 days of history, or with what is done on staging/locally? We should have a representative number of builders as p1 vs p5. I fully understand there are differences between alertmanager, heroku, and treeherder; some of this is that we are slightly changing the sources. For example, I believe alertmanager is dealing with desktop-test/android-test and not the new taskcluster names, so treeherder is the winner in that case.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #22)
> I am concerned with the 1 job to run via analysis- is this with 90 days of
> history, or with what is done on staging/locally? We should have a
> representative amount of builders as p1 vs p5.

This is staging failures data (3 revisions): https://treeherder.allizom.org/api/seta/failures-fixed-by-commit/?format=json
I want to see the production output since staging adds no real value.

FYI, it's 4 months: Treeherder expires data after 4 months.
So I have been testing patches to consume the new data. As I mentioned to Armen yesterday on irc, I ran into some problems since the SETA data is now provided via https vs the previous http.

I wrote a script like this to test:

    import httplib
    import json

    host = "treeherder.allizom.org"
    path = "/api/project/mozilla-inbound/seta/job-priorities/"

    try:
        port = int(443)
        conn = httplib.HTTPSConnection(host, port)
        conn.request("GET", path)
        r1 = conn.getresponse()
        if r1.status == 200:
            data = json.loads(r1.read())
            print data
    except ValueError, e:
        print("JSON parsing error %s: %s" % (path, str(e)))

which works with a more current version of Python, but not the version used to run buildbot (2.7.3). With Python 2.7.3 we get the error:

ssl.SSLError: [Errno 1] _ssl.c:504: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error

Investigation has revealed that this is because the SNI libraries were only backported to Python versions newer than 2.7.3. One suggestion is to use requests, but this is not on the masters either, so I'm continuing to investigate. We don't really want to upgrade Python on all the buildbot masters now given that their demise is imminent later this year.
If we switch to BBB, would we get SETA support via TaskCluster? If so, we could wait until we run everything via TaskCluster/BBB and get SETA that way. Alternatively, could we make a call to curl or wget?

In any case, here's our current comparison of builder lists between systems:
* low value jobs, TH - 124 builders [1]
* high value jobs, TH - 1154 builders [2]
* low value jobs, Allizom - 846 builders [3]
* high value jobs, Allizom - 62 builders [4]

It is rather disappointing to see those numbers; maybe I'm being too harsh on myself, but I would have expected them to be at least somewhat close to each other. I know that in Ouija we consider jobs marked with a blank fixed-by-commit string, while in TH we don't.

If we add the builders from both endpoints we get:
* TH - 1278 builders
* Allizom - 968 builders

If bug 1330354 were fixed we could at least try to play with production data to determine whether there are any issues in the logic.

[1] https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
[2] https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=1&format=json
[3] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-11&buildbot=1&branch=mozilla-inbound&active=0
[4] http://alertmanager.allizom.org/data/setadetails/?date=2017-01-11&buildbot=1&branch=mozilla-inbound&active=1
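For the curl idea mentioned above, a minimal sketch of how it might look from buildbot's side, assuming curl is available on the masters (the helper name is made up):

    # Sketch of the "shell out to curl" idea, sidestepping the old Python's
    # missing SNI support.  Assumes curl is installed on the masters.
    import json
    import subprocess

    def fetch_via_curl(url):
        output = subprocess.check_output(["curl", "-sSfL", url])
        return json.loads(output)

    data = fetch_via_curl(
        "https://treeherder.mozilla.org/api/project/mozilla-inbound/"
        "seta/job-priorities/?build_system_type=buildbot&priority=5&format=json")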
If I knew we were going to be running everything via BBB in <4 weeks, I would say "yes, let's just wait for BBB and do SETA on everything there". Unfortunately I think we are not going to be doing that.

It's odd that we seem to have flipped low value/high value. When running 'failures.py', how many 'fixed by commit' regressions are we working with in TH? In allizom, we have 628 failures over 90 days.
So, the SSL issue isn't a problem anymore: I switched to the treeherder URL and it works. It was probably something to do with a self-signed cert on allizom. In any case, I found another problem. The data here
https://treeherder.mozilla.org/api/project/autoland/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
specifies mozilla-inbound in the builder names even though the URL points to autoland, like this:
"jobtypes":{"2017-01-11":["Rev7 MacOSX Yosemite 10.10.5 mozilla-inbound talos chromez-e10s"
Is this expected? The same goes for
https://treeherder.mozilla.org/api/project/graphics/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json
I assume the data in all three links could be different, which is why my scripts parse on the branch name.
Flags: needinfo?(armenzg)
Thanks Kim! I will look into it.

(In reply to Joel Maher ( :jmaher) from comment #26)
> Odd that we seem to have flipped low value/high value. When running
> 'failures.py', how many 'fixed by commit' regressions are we working with in
> TH? In allizom, we have 628 failures over 90 days.

I can't tell because of bug 1330354. Enough for the MySQL operations to take over 49 seconds.
Flags: needinfo?(armenzg)
kmoir: I'm returning the bug to you as I will be gone after today. If you find any more issues please chat with rwood/jmaher.
Assignee: armenzg → kmoir
Depends on: 1330652
Attached file compare_end_points.py (deleted) —
Here's a script I started in order to compare endpoints. A feature I wanted to add is sorting the data before writing it to disk; this would allow using diffing tools like vimdiff or diff. Good luck with the bug!
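The attachment body is not shown here; as an illustration only, a comparison script along those lines might look roughly like this (endpoints taken from earlier comments, response shape assumed; this is not the attached compare_end_points.py):

    # Sketch: dump each endpoint's builder list, sorted, so plain
    # diff/vimdiff can compare them.
    import json
    import urllib2

    ENDPOINTS = {
        "alertmanager": "http://alertmanager.allizom.org/data/setadetails/?date=2017-01-11&buildbot=1&branch=mozilla-inbound&active=0",
        "treeherder": "https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot&priority=5&format=json",
    }

    for name, url in ENDPOINTS.items():
        data = json.loads(urllib2.urlopen(url, timeout=60).read())
        # Flatten whatever list-of-builders shape the endpoint returns;
        # the "jobtypes" key is an assumption based on earlier comments.
        builders = sorted(set(sum(data.get("jobtypes", {}).values(), [])))
        with open("%s.txt" % name, "w") as f:
            f.write("\n".join(builders) + "\n")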
Attachment #8826329 - Attachment mime type: text/x-python-script → text/plain
I am looking at the differences in data and trying to figure out how we get data for fixed_by_commit in treeherder. I believe we have an issue here:
https://github.com/mozilla/treeherder/blob/edc3d7ad112c7c60e341fab7c1485c0e41408036/treeherder/etl/seta.py#L95

I have a query that I believe replicates the fixed_by_commit logic:
https://sql.telemetry.mozilla.org/queries/2517

The concern here is that the option_collection_hash handling might be incorrect. What I see is that we are splitting the name on '-{option}', where option is one of [opt, debug, pgo, asan], when we should be doing:

    if platform_option in ['pgo', 'asan']:
        platform_option = 'opt'
    job_type_name.split('{buildtype}-'.format(buildtype=platform_option))[-1]

I think that would solve things, but I would need to debug it a bit more. :rwood, this is probably the next level of debugging to do.
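Read literally, the proposed logic would be something like the sketch below; the function name and the example job name are made up for illustration and are not Treeherder's actual code:

    # Sketch of the proposed splitting: collapse pgo/asan to opt before
    # stripping the "<option>-" prefix from the job type name.
    def strip_option_prefix(job_type_name, platform_option):
        if platform_option in ('pgo', 'asan'):
            platform_option = 'opt'
        return job_type_name.split('{buildtype}-'.format(buildtype=platform_option))[-1]

    # Hypothetical example: a pgo job whose name carries an "opt-" segment.
    print(strip_option_prefix('test-linux64-pgo/opt-mochitest-1', 'pgo'))  # -> 'mochitest-1'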
Flags: needinfo?(rwood)
Flags: needinfo?(rwood)
Hmm, it seems I'm having SSL errors again when trying to fetch the URL from treeherder.mozilla.org using the version of Python that buildbot uses.

    Traceback (most recent call last):
      File "test.py", line 11, in <module>
        conn.request("GET", path)
      File "/tools/python27/lib/python2.7/httplib.py", line 958, in request
        self._send_request(method, url, body, headers)
      File "/tools/python27/lib/python2.7/httplib.py", line 992, in _send_request
        self.endheaders(body)
      File "/tools/python27/lib/python2.7/httplib.py", line 954, in endheaders
        self._send_output(message_body)
      File "/tools/python27/lib/python2.7/httplib.py", line 814, in _send_output
        self.send(msg)
      File "/tools/python27/lib/python2.7/httplib.py", line 776, in send
        self.connect()
      File "/tools/python27/lib/python2.7/httplib.py", line 1161, in connect
        self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
      File "/tools/python27/lib/python2.7/ssl.py", line 381, in wrap_socket
        ciphers=ciphers)
      File "/tools/python27/lib/python2.7/ssl.py", line 143, in __init__
        self.do_handshake()
      File "/tools/python27/lib/python2.7/ssl.py", line 305, in do_handshake
        self._sslobj.do_handshake()
    ssl.SSLError: [Errno 1] _ssl.c:504: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error
I was thinking of using the non-buildbot version of Python already installed to run a cron job that fetches the JSON, and then changing the buildbot configs to parse the local copy upon reconfig.
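A sketch of that cron approach, with made-up paths and assuming the system Python has working SNI support:

    # Sketch: fetch the SETA JSON with a newer system Python and let the
    # buildbot configs read the local copy on reconfig.  Paths, the Python
    # binary location, and the schedule are assumptions.
    #
    #   */30 * * * * /usr/bin/python2.7 /path/to/fetch_seta.py   (example crontab)
    import json
    import urllib2

    URL = ("https://treeherder.mozilla.org/api/project/mozilla-inbound/"
           "seta/job-priorities/?build_system_type=buildbot&priority=5&format=json")
    LOCAL_COPY = "/path/to/seta_job_priorities.json"  # hypothetical location

    data = json.loads(urllib2.urlopen(URL, timeout=60).read())
    with open(LOCAL_COPY, "w") as f:
        json.dump(data, f, indent=2, sort_keys=True)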
I don't think this bug is relevant anymore, since we can see the end of the road for the TC migration and SETA only runs on certain trunk branches.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Could someone file the decom bug for Heroku seta and make it depend on the necessary bb->tc bugs?
I think we will need to decom alertmanager's SETA or remove the logic in Buildbot that consumes from it. We deleted the SETA Heroku apps not long ago. TC uses Treeherder's SETA. Please correct me if needed.
Flags: needinfo?(jmaher)
Armen, that is all correct
Flags: needinfo?(jmaher)
we can do the work in bug 1383863
That sounds great - thank you!
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard