Bug 1127532 (Closed) - host "snappy" server (symbolapi.m.o)
Opened 10 years ago, closed 9 years ago
Categories: Socorro :: Infra, task
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: rhelmer, Assigned: rhelmer
There's currently a bespoke VM hosting symbolapi.m.o which runs https://github.com/vdjeric/Snappy-Symbolication-Server/
Snappy just needs access to our S3 symbols; it should not need access to any other services.
Assignee
Comment 1•10 years ago
Should we put this on an EC2 node?
Currently this only needs access to public data (the S3 symbols-public bucket which is totally open now), so this might be a good candidate for Heroku. Especially as we want vladan to be able to get access to it for debugging if necessary, and we want it to auto-deploy.
Flags: needinfo?(dmaher)
Assignee
Comment 2•10 years ago
(In reply to Robert Helmer [:rhelmer] from comment #1)
> Should we put this on an EC2 node?
>
> Currently this only needs access to public data (the S3 symbols-public
> bucket which is totally open now), so this might be a good candidate for
> Heroku. Especially as we want vladan to be able to get access to it for
> debugging if necessary, and we want it to auto-deploy.
Maybe answering my own question, but this seems to be too slow for Heroku which times out web requests after 30 seconds.
Comment 3•10 years ago
It times out if 30 seconds pass before the first byte is sent, but then there's a rolling 55 second window to accommodate streaming things. Are you worried about the initial byte?
Assignee
Comment 4•10 years ago
(In reply to Chris Lonnen :lonnen from comment #3)
> It times out if 30 seconds pass before the first byte is sent, but then
> there's a rolling 55 second window to accommodate streaming things. Are you
> worried about the initial byte?
Snappy downloads and parses (potentially multiple) symbol files per incoming request, and it does this before sending the first byte.
It *was* taking over 30s to return a response, before I realized how different the defaults are from the sample.ini; also, the default path for the MRU cache wasn't working on Heroku.
It seems faster now:
curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":4}' https://murmuring-waters-3757.herokuapp.com/
Vladan, this is just a temporary URL (it's running on my own Heroku account) but would you mind checking to see if it seems to be working correctly and is fast enough ^?
Flags: needinfo?(vdjeric)
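For reference, the same toy request from comment 4 could also be issued from Python. This is only a sketch using the standard library; the URL is the temporary Heroku instance mentioned above and should be swapped for whichever server is being tested.

# Sketch: POST the toy symbolication request from comment 4 (stdlib only).
import json
import urllib.request

URL = "https://murmuring-waters-3757.herokuapp.com/"  # temporary test instance

body = {
    # (module index, module offset) pairs for each frame of each stack
    "stacks": [[[0, 11723767], [1, 65802]]],
    # (debug file, debug id) for each module referenced by the stacks
    "memoryMap": [
        ["xul.pdb", "44E4EC8C2F41492B9369D6B9A059577C2"],
        ["wntdll.pdb", "D74F79EB1F8D4A45ABCD2F476CCABACC2"],
    ],
    "version": 4,
}

req = urllib.request.Request(URL, json.dumps(body).encode("utf-8"),
                             {"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))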
Comment 5•10 years ago
- We don't need HTTPS for symbolication
- I tried your curl command, and it took 20 seconds to satisfy that toy request. The result is cached afterwards.
* By contrast, the current symbolication server (using NFS symbol mount) answers a similar toy request almost instantaneously.
* It's a developer tool, but I'd like it to perform faster. I think when we ran it on Mark Reid's EC2 node (also uses symbols from S3), it was much faster.
- I tried symbolicating a brief profile captured with the profiler extension against that herokuapp URL. It took it 40 seconds to return a response.
* The profiler extension couldn't parse the response: "profiler.filteredThreadSamples is undefined"
* Ted, do you know what this error is about?
Flags: needinfo?(vdjeric) → needinfo?(ted)
Assignee
Comment 6•10 years ago
(In reply to Vladan Djeric (:vladan) from comment #5)
> - We don't need HTTPS for symbolication
>
> - I tried your curl command, and it took 20 seconds to satisfy that toy
> request. It caches it afterwards.
> * By contrast, the current symbolication server (using NFS symbol mount)
> answers a similar toy request almost instantaneously.
> * It's a developer tool, but I'd like it to be perform faster. I think when
> we ran it on Mark Reid's EC2 node (also uses symbols from S3), it was much
> faster.
>
> - I tried symbolicating a brief profile captured with the profiler extension
> against that herokuapp URL. It took it 40 seconds to return a response.
I spun up an EC2 micro instance in the same region as the symbols S3 bucket. Heroku is in
a different AWS region, and may be slower for other reasons.
How does this compare:
http://ec2-54-191-238-159.us-west-2.compute.amazonaws.com
Comment 7•10 years ago
(In reply to Robert Helmer [:rhelmer] from comment #2)
> (In reply to Robert Helmer [:rhelmer] from comment #1)
> > Should we put this on an EC2 node?
> >
> > Currently this only needs access to public data (the S3 symbols-public
> > bucket which is totally open now), so this might be a good candidate for
> > Heroku. Especially as we want vladan to be able to get access to it for
> > debugging if necessary, and we want it to auto-deploy.
>
> Maybe answering my own question, but this seems to be too slow for Heroku
> which times out web requests after 30 seconds.
I have no strong opinion either way. If we host it on EC2 then we're responsible for the infrastructure, so that will need to be config managed (etc), which is fine. If we go Heroku, then we don't manage the infra, which is also fine. Either way, however, we still need to manage things like deployment, monitoring, asset management, and so forth. We don't currently have a good policy in place for managing not-AWS, so that could be a consideration, though we should probably think about the best way to do that going forward. :)
Flags: needinfo?(dmaher)
Comment 8•10 years ago
(In reply to Vladan Djeric (:vladan) from comment #5)
> - I tried symbolicating a brief profile captured with the profiler extension
> against that herokuapp URL. It took it 40 seconds to return a response.
> * The profiler extension couldn't parse the response:
> "profiler.filteredThreadSamples is undefined"
> * Ted, do you know what this error is about?
I don't know what this error is, sorry. The responses from my test queries look the same from both the existing server and the test servers. I used the profiler extension to profile against a local server running the new code when I wrote those patches and it worked fine here.
In terms of timing:
luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real 0m0.659s
user 0m0.007s
sys 0m0.003s
luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real 0m0.381s
user 0m0.005s
sys 0m0.009s
luser@eye7:/build$ time curl -d '{"stacks":[[[0,11723767],[1, 65802]]],"memoryMap":[["xul.pdb","44E4EC8C2F41492B9369D6B9A059577C2"],["wntdll.pdb","D74F79EB1F8D4A45ABCD2F476CCABACC2"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_mainRun() (in xul.pdb)", "KiUserCallbackDispatcher (in wntdll.pdb)"]]
real 0m0.392s
user 0m0.003s
sys 0m0.006s
All 3 servers seem to be about as fast. However, the Heroku server did take a long time to respond to my first request (which I didn't think to time, of course). I think Heroku might spin down the service when it's not in use. If that's true that's probably bad for server performance, as it caches parsed symbols in memory, so restarting means it loses its entire cache. In fact, it'd be doubly bad because as an optimization it attempts to reload previously-used symbols on startup, so if your request caused the server to be spun up you'd have to wait extra long while it parsed all those symbols.
Flags: needinfo?(ted)
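As an illustration of the behavior described in comment 8 (a sketch only, not the actual Snappy-Symbolication-Server code): parsed symbols live in an in-memory cache, the most-recently-used list is persisted to disk, and a restarted process has to re-fetch and re-parse everything on that list before it is warm again.

# Sketch only, not the Snappy-Symbolication-Server implementation. It shows
# why a dyno restart hurts: the in-memory cache of parsed symbol files dies
# with the process, and the startup warm-up re-fetches and re-parses
# everything on the persisted MRU list before requests are fast again.
import json
import os

MRU_PATH = "/tmp/snappy-mru-symbols.json"  # path mentioned later in this bug

class SymbolCache:
    def __init__(self):
        self.parsed = {}  # (debug_file, debug_id) -> parsed symbol table

    def get(self, key, fetch_and_parse):
        if key not in self.parsed:            # slow path: S3 fetch + parse
            self.parsed[key] = fetch_and_parse(key)
        return self.parsed[key]

    def save_mru(self):
        with open(MRU_PATH, "w") as f:
            json.dump([list(k) for k in self.parsed], f)

    def warm_from_mru(self, fetch_and_parse):
        # Runs at startup; this is the "wait extra long" case after a restart.
        if os.path.exists(MRU_PATH):
            with open(MRU_PATH) as f:
                for key in json.load(f):
                    self.get(tuple(key), fetch_and_parse)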
Assignee
Comment 9•10 years ago
(In reply to Daniel Maher [:phrawzty] from comment #7)
> (In reply to Robert Helmer [:rhelmer] from comment #2)
> > (In reply to Robert Helmer [:rhelmer] from comment #1)
> > > Should we put this on an EC2 node?
> > >
> > > Currently this only needs access to public data (the S3 symbols-public
> > > bucket which is totally open now), so this might be a good candidate for
> > > Heroku. Especially as we want vladan to be able to get access to it for
> > > debugging if necessary, and we want it to auto-deploy.
> >
> > Maybe answering my own question, but this seems to be too slow for Heroku
> > which times out web requests after 30 seconds.
>
> I have no strong opinion either way. If we host it on EC2 then we're
> responsible for the infrastructure, so that will need to be config managed
> (etc), which is fine. If we go Heroku, then we don't manage the infra,
> which is also fine. Either way, however, we still need to manage things
> like deployment, monitoring, asset management, and so forth. We don't
> currently have a good policy in place for managing not-AWS, so that could be
> a consideration, though we should probably think about the best way to do
> that going forward. :)
From the testing Ted and Vladan have done in this bug, this sounds like a better fit for EC2 in any case.
Comment 10•10 years ago
I retested with the profiler extension. I pointed the profiler.symbolicationUrl value to the 3 servers and timed how long it took to fetch symbols for a startup profile.
Heroku: ~45 seconds for the first (uncached) request, which came back as an invalid response ("profiler.filteredThreadSamples is undefined"), maybe due to a timeout. After that, it took about 10 seconds to respond (correctly).
EC2: ~60 seconds for the first (uncached) request, but it did return a correct response. About 10 seconds thereafter.
symbolapi.mozilla.org: 10 seconds for all requests. I think symbolapi.mozilla.org had the advantage of pre-fetching, or of having already been used by others for symbolicating today's Nightly.
In any case, I agree we should probably host on EC2. I don't think a micro instance is sufficient; we want more RAM for caching and a faster CPU for faster parsing of the .sym files. Maybe a c4.xlarge? Also, the server's cache size config should be adjusted for the amount of RAM available.
Eventually, we should restore the pre-fetching functionality somehow but I'm ok with leaving that for later.
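A rough back-of-the-envelope sketch of that cache sizing (the per-file footprint and the RAM fraction are assumptions, not measured Snappy values):

# Rough sizing sketch; avg_parsed_sym_mb and reserve_fraction are assumptions.
instance_ram_gb = 7.5      # a c4.xlarge has 7.5 GiB of RAM
reserve_fraction = 0.75    # leave headroom for the OS and the server process
avg_parsed_sym_mb = 150    # assumed in-memory size of a parsed xul.sym

cache_budget_mb = instance_ram_gb * 1024 * reserve_fraction
max_cached_files = int(cache_budget_mb // avg_parsed_sym_mb)
print("cache budget: %d MB, roughly %d parsed symbol files"
      % (cache_budget_mb, max_cached_files))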
Comment 11•10 years ago
Note that it would be useful to know where the time is being spent during that first uncached request, whether it's in fetching or in processing the fetched .sym files. A few printfs ought to be enough. I'd appreciate it if someone could get those numbers, I don't have time to do that this week as I'm helping with the Telemetry/FHR unification which is on a pretty tight schedule, and a few other things as well.
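A sketch of the kind of instrumentation comment 11 asks for; fetch_sym_from_s3() and parse_sym_file() are hypothetical stand-ins for the server's real fetch and parse steps, not actual Snappy APIs.

# Sketch: time the fetch and parse phases separately for each symbol file.
import time

def timed_load(debug_file, debug_id, fetch_sym_from_s3, parse_sym_file):
    t0 = time.time()
    raw = fetch_sym_from_s3(debug_file, debug_id)   # network: S3 download
    t1 = time.time()
    table = parse_sym_file(raw)                     # CPU: parse the .sym text
    t2 = time.time()
    print("%s/%s: fetch %.2fs (%d bytes), parse %.2fs"
          % (debug_file, debug_id, t1 - t0, len(raw), t2 - t1))
    return table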
Assignee
Comment 12•10 years ago
(In reply to Vladan Djeric (:vladan) from comment #11)
> Note that it would be useful to know where the time is being spent during
> that first uncached request, whether it's in fetching or in processing the
> fetched .sym files. A few printfs ought to be enough. I'd appreciate it if
> someone could get those numbers, I don't have time to do that this week as
> I'm helping with the Telemetry/FHR unification which is on a pretty tight
> schedule, and a few other things as well.
I am curious about this too, I'll work on it this week.
Assignee
Comment 13•10 years ago
(In reply to Vladan Djeric (:vladan) from comment #10)
> In any case, I agree we should probably host on EC2. I don't think a micro
> instance is sufficient; we want more RAM for caching and a faster CPU for
> faster parsing of the .sym files. Maybe a c4.xlarge? Also, the server's cache
> size config should be adjusted for the amount of RAM available.
>
> Eventually, we should restore the pre-fetching functionality somehow but I'm
> ok with leaving that for later.
Sounds like a plan. Thanks!
Assignee
Comment 14•10 years ago
(In reply to Robert Helmer [:rhelmer] from comment #12)
> (In reply to Vladan Djeric (:vladan) from comment #11)
> > Note that it would be useful to know where the time is being spent during
> > that first uncached request, whether it's in fetching or in processing the
> > fetched .sym files.
One thing I notice right away is that fetching a symbol file with e.g. curl on the same box takes almost no time, so presumably the problem is elsewhere. Real numbers coming soon.
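For example, timing only the download of one .sym file (a sketch; the bucket URL below is a placeholder, and the debug_file/debug_id/name.sym path is the usual Breakpad symbol-store layout):

# Sketch: measure just the network fetch of a single .sym file so download
# time can be separated from parse time. BASE_URL is a placeholder, not the
# real public symbols bucket address.
import time
import urllib.request

BASE_URL = "https://example-symbols-bucket.s3.amazonaws.com"
path = "xul.pdb/44E4EC8C2F41492B9369D6B9A059577C2/xul.sym"

t0 = time.time()
data = urllib.request.urlopen("%s/%s" % (BASE_URL, path)).read()
print("fetched %d bytes in %.2fs" % (len(data), time.time() - t0))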
Comment 15•10 years ago
My gut feeling is that it's what I described in comment 8--the Heroku dyno is being put to sleep (described here: https://devcenter.heroku.com/articles/dynos#dyno-sleeping ) which means the server process is terminated. When you make a new request it has to start again, and on startup it attempts to re-fill the memory cache using its MRU list from the last run, which adds extra overhead.
Comment 16•10 years ago
Just so we're comparing apples to apples here I picked two symbol files that symbolapi is extremely unlikely to have precached--xul.pdb from the Firefox 25 and 26 releases:
luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real 0m2.027s
user 0m0.005s
sys 0m0.005s
luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real 0m23.855s
user 0m0.011s
sys 0m0.004s
luser@eye7:/build$ time curl -d '{"stacks":[[[0,10047540]]],"memoryMap":[["xul.pdb","6F63BDA0293544FBA56816979837A3882"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real 0m11.729s
user 0m0.005s
sys 0m0.005s
luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://symbolapi.mozilla.org/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real 0m2.316s
user 0m0.010s
sys 0m0.000s
luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' https://murmuring-waters-3757.herokuapp.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real 0m27.827s
user 0m0.005s
sys 0m0.009s
luser@eye7:/build$ time curl -d '{"stacks":[[[0,47993]]],"memoryMap":[["xul.pdb","BC74C9A206B443FA82D5A64B9865E2601"]],"version":3,"osName":"Windows","appName":"Firefox"}' http://ec2-54-244-178-234.us-west-2.compute.amazonaws.com/; echo
[["XREMain::XRE_main(int,char * * const,nsXREAppData const *) (in xul.pdb)"]]
real 0m9.889s
user 0m0.000s
sys 0m0.010s
The timing is fairly consistent there with the existing symbolapi taking ~2s, the Heroku dyno taking 20+s and the EC2 instance taking ~10s.
Comment 17•10 years ago
To repeat these experiments usefully you'd need to pick a new xul.pdb file that's unlikely to be cached. Old Firefox releases (or Thunderbird or whatever) are good candidates. If you're spinning up a new server you can just repeat one of those commands with your new URL, since a new server will have an empty cache. (Note that if you stop and restart a server it'll prefill its cache so that's not valid.)
Assignee
Comment 18•10 years ago
So we could make Heroku faster by using higher-CPU dynos, but at that price it's probably not worth it.
Repeating the experiment in comment 16, first hit with a few different instance types:
* t2.medium - ~5s
* c4.xlarge - ~3s
I stopped the server, removed /tmp/snappy-mru-symbols.json, started it again and repeated several times; the results seem pretty consistent.
Comment 19•10 years ago
For a more realistic workload here's a request body packet captured from the profiler extension:
https://gist.github.com/luser/7054086aac022ab0ac01
Assignee
Comment 20•10 years ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> For a more realistic workload here's a request body packet captured from the
> profiler extension:
> https://gist.github.com/luser/7054086aac022ab0ac01
This is great, thanks! I'll use this to compare the current symbolapi.m.o against the instance types we're considering, I want to make sure this service stays fast.
Comment 21•10 years ago
So is Snappy on EC2 now?
Assignee
Comment 22•10 years ago
(In reply to Vladan Djeric (:vladan) from comment #21)
> So is Snappy on EC2 now?
Not yet, we're working on building out our AWS infra this quarter.
Assignee
Comment 24•10 years ago
(In reply to Vladan Djeric (:vladan) from comment #23)
> When are the NFS-mounted symbols going away?
I don't think we have a specific date yet, but we won't shut it down until the replacement in AWS is ready. We're trying to get all of symbols and all of crash-stats done this quarter.
Flags: needinfo?(rhelmer)
Comment 25•10 years ago
Right, once we fix all the deps of bug 1071724 we'll no longer be using the NFS mount for anything. Once we've switched everyone to using the upload API (bug 1085557, bug 1130138) and land bug 1085530 we'll no longer be storing new symbols in NFS, so S3 will be the repository of record. Actually turning off the NFS mount is just a formality at that point, as we'll no longer be putting any new data into it.
Obviously we should finalize the Snappy migration before we stop putting new symbols in NFS.
Assignee
Comment 26•10 years ago
:phrawzty packaged this up for us, and we will spin up an EC2 node for it next:
https://github.com/mozilla/Snappy-Symbolication-Server/commit/172344b7b141a3385ef2382678dbbbd5ab8b86d6
Assignee
Comment 27•9 years ago
We have this working fine in staging:
http://symbolapi.mocotoolsstaging.net/
Going to bring up the prod version soon; the last step will be transitioning DNS.
We need to be done with our AWS move first, flipping dependencies around.
Assignee
Comment 28•9 years ago
Prod infra is up as well:
http://symbolapi.mocotoolsprod.net/
We need to do some more testing of our AWS env, the last step here is going to be switching the symbolapi.mozilla.org DNS to point to the above.
Assignee
Updated•9 years ago
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED