Closed Bug 484032 Opened 16 years ago Closed 16 years ago

Socorro Server: rework storage of dumps outside of the database

Categories

(Socorro :: General, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Assigned: lars)

References

Details

As discussed with the PostgreSQL consultants during the week of March 9, 2009, the processed dumps should be moved out of the database and into separate file system storage. Questions:

1 - Feasibility: assuming a radix-style storage hierarchy similar to the technique used in JsonDumpStorage, what are the storage and inode requirements? (See the rough estimate below.)
2 - Feasibility: can the client do the gzip decompression of the dump to show the tab view in the SocorroUI?
3 - Data migration and cutover: will there be a period where we'll need to save the data in both places?
4 - How do we manage data retirement in this system? How important is it that the database and the file system are perfectly synchronized in time?
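[A rough back-of-the-envelope estimate for question 1, assuming a two-level hex radix layout (256 directories per level) and one gzipped JSON file per dump. Every constant here (daily volume, average dump size, retention) is a hypothetical placeholder, not a measurement.]

# Illustrative inode/storage estimate for a two-level radix store like JsonDumpStorage.
DIRS_PER_LEVEL = 256                  # "00" .. "ff"
DUMPS_PER_DAY = 50_000                # hypothetical daily crash volume
AVG_DUMP_SIZE_KB = 20                 # hypothetical gzipped .jsonz size
RETENTION_DAYS = 120

directory_inodes = DIRS_PER_LEVEL + DIRS_PER_LEVEL ** 2      # 256 + 65536 directories
file_inodes = DUMPS_PER_DAY * RETENTION_DAYS                 # one inode per dump file
storage_gb = file_inodes * AVG_DUMP_SIZE_KB / (1024 * 1024)

print("directory inodes: %d" % directory_inodes)
print("file inodes:      %d" % file_inodes)
print("storage:          %.1f GB" % storage_gb)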
Q2: The wiki has the details. I've tested a prototype on IE 6, Opera 9, Safari 3, and FF. Q3: Since dumps are only used in display, maybe it wouldn't be that bad to have a transition period where the "view crash report" page doesn't have frames/modules/dump for this window of data between the migration and cutover? After we ship, then we can rerun the migration script on this window.
The current prod-style radix storage that's two levels deep should work fine. The store will be an NFS store like prod; I am assuming it will need 300 GB of space.

Regarding the dumps table: do we really need to migrate the table? Why can't we live with it for some time (3 months) and then just drop it? For those three months, we should have code look in both places for the dump data. After a certain date, any future processed jobs will store their dump files in the new NFS store.

Here is an alternate suggestion: for any old jobs, we don't migrate things right away. When an old uuid is requested, we just note its uuid in a "migrate_dumps" table, and the monitor runs a background job that scans this table and moves the reports as needed into the NFS store. (A rough sketch of that lazy-migration pass is below.)
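[A minimal sketch of the lazy-migration pass suggested above, assuming a hypothetical migrate_dumps table with a single uuid column and a caller-supplied move_dump_to_nfs() helper; this is not actual monitor code.]

import psycopg2  # assumed driver; the DSN is a placeholder

def migrate_requested_dumps(dsn, move_dump_to_nfs):
    """Move any dumps queued in the hypothetical migrate_dumps table.

    move_dump_to_nfs(uuid) is a placeholder for code that copies the dump
    out of the dumps table and into the NFS radix store.
    """
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute("select uuid from migrate_dumps")
        for (uuid,) in cur.fetchall():
            move_dump_to_nfs(uuid)  # copy the dump into the NFS store
            cur.execute("delete from migrate_dumps where uuid = %s", (uuid,))
        conn.commit()
    finally:
        conn.close()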
(In reply to comment #2)
> Q3: Since dumps are only used in display, maybe it wouldn't be that bad to have
> a transition period where the "view crash report" page doesn't have
> frames/modules/dump for this window of data between the migration and cutover?
> After we ship, then we can rerun the migration script on this window.

Not acceptable to have a period where this data isn't available for some reports. Note that the database doesn't store the frames at all anymore, so that would mean a complete lack of frames for crash reports, which are vital to crash reporting.
Here are some thoughts:

1 - Alter the PHP app so that when it needs a specific dump, it first tries to get it from the file system via some API. If successful, the dump is provided from the file system as a compressed (gz) JSON file. The dump information is decompressed on the client side and arranged aesthetically. If the lookup from the file system is unsuccessful, it tries to get it from the database. Alternately, see Option 4 below.

2 - Alter Processor to parse the dump and save the dump information in the file system in a compressed format, as a JSON file, instead of saving it to the database. While it is at it, the processor should also save into the JSON file the additional information that is stored in the reports table. This would allow future analysis programs to get a complete picture of a crash without having to look anything else up in the database. What of the 'restricted' data fields like URL? It would save that information too, relying on the server-side code to filter out fields inappropriate for unauthorized eyes. See Option 4 below.

3 - Create a temporary process that will walk the database dump tables, saving the dumps into the file system. Once it has completed a dump table partition, it is allowed to delete that partition.

4 (optional) - Create a private HTTP service that will return dump information when passed a uuid. It would first look to the file system; if unsuccessful, it would look to the database. It always returns information in a JSON format. Having this service would eliminate the need for the PHP app to understand the radix file system format or know how to fetch the dump from the database. The PHP app on the server side would just make the HTTP request along with some authentication information and get back a JSON file. The server would use the authentication information to know what fields it can or cannot include within the JSON file. This service would leverage the existing Python code and eliminate the need to keep parallel code tracks in Python and PHP. In addition, if we decide in the future to use some sort of Hadoop-style processing, our data is already complete in a flat JSON format in a file system, ready to go. Because the service can look to both the file system and the database for a dump, there is no need for down time: as the migration of dumps from the database to the file system proceeds, lookups hitting the database will dwindle, and eventually the secondary lookup code could be eliminated. There is never a need for downtime for migration. (A rough sketch of this file-system-first lookup is below.)
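[A minimal sketch of the Option 4 lookup order: file system first, database fallback. The callables dump_path_for_uuid() and fetch_dump_from_database() are hypothetical placeholders, not existing Socorro APIs.]

import gzip
import json
import os

def get_processed_dump(uuid, dump_path_for_uuid, fetch_dump_from_database):
    """Return the processed dump for a uuid as a Python dict.

    dump_path_for_uuid(uuid) is assumed to map a uuid to its .jsonz path in
    the radix store; fetch_dump_from_database(uuid) is the legacy fallback
    against the dumps table during the migration window.
    """
    path = dump_path_for_uuid(uuid)
    if path and os.path.exists(path):
        # primary source: gzipped JSON file in the radix file system store
        with gzip.open(path, 'rt') as f:
            return json.load(f)
    # fallback: the database still has this dump
    return fetch_dump_from_database(uuid)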
I like the JSON file idea / no database hits for the most common case of looking up a single crash report. I think sensitive info should be stripped out. Apache is very efficient at serving up flat files off of disk. An AJAX call for the extra info can be made, assuming the person is LDAP authenticated, and this will hit the reports db table.

Option 4 - This is great, but I would put this behind the PHP layer. It's a uuid-to-filepath service. This way you get the efficiency of Apache, the privacy feature, and a single point of knowledge for the radix file system format.

Downtime for migration is a short-term issue; I don't think it should affect the ultimate design. I agree with Ted's comment #4 above, but what is the worst-case scenario for downtime? I don't think we've sketched out how long it would take to convert the thin slice of data, which is the dump table of crashes that come in during this process (we would pre-process old dumps beforehand and process new dumps once we ship).
OK, I'll agree with the efficiency issue of serving files directly from a file system. I'm re-examining my willingness to engineer solutions that eliminate downtime for migrations at the expense of more complex code.

The next question would be about the proposed JSON format for the dump file. What would be the most useful?

1 - save the file just as it is now in the dumps table
2 - wrap the whole dump into a single large element in a JSON file
3 - wrap each line into a string element within an ordered list in a JSON file
4 - break each line into a set of named elements within an ordered list of lines in a JSON file
Random drive-by, but it would be cool to be able to get the raw JSON for an individual report.
(In reply to comment #7)
> The next question would be about the proposed JSON format for the dump file.
> What would be the most useful?
>
> 1 - save the file just as it is now in the dumps table
> 2 - wrap the whole dump into a single large element in a JSON file
> 3 - wrap each line into a string element within an ordered list in a JSON file
> 4 - break each line into a set of named elements within an ordered list of
> lines in a JSON file

#2 +1. Pseudo JSON snippet:

{"signature": "js_ValueToString",
 "UUID": "e191213d-f39d-4ac0-bbe5-7f5f20081119",
 "etc": "etc",
 "dump": "OS|Mac OS X|10.5.0 9A581
          CPU|x86|GenuineIntel family 6 model 15 stepping 10|2
          Crash|EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE|0x14889c|0"}

So the JSON will have Details and Raw Dump, and we'll prepare the Frames and Modules tabs on the client side. "dump" will continue to be a newline-delimited text blob.
Blocks: 487474
Rather than having a uuid-to-path service, let's use mod_rewrite. We could set up a rule to change a uuid into the appropriate path automatically:

http://whatever.com/uuid/AABBCCDDEEFFGGHHIIJJKKLLM2090308.json.gz

becomes

http://whatever.com/uuid/AA/BB/AABBCCDDEEFFGGHHIIJJKKLLM2090308.json.gz

It would be faster than a service and ease some coding requirements. (A sketch of the equivalent path mapping is below.)
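[A small sketch of the path mapping such a rule would perform; the function name and the example RewriteRule in the comment are illustrative assumptions, not the actual deployed configuration.]

def uuid_to_radix_url(uuid_filename):
    """Map <name>.json.gz to /uuid/<first two chars>/<next two chars>/<name>.json.gz."""
    # Roughly what a mod_rewrite rule along these lines would do (illustrative only):
    #   RewriteRule ^/uuid/((..)(..).*\.json\.gz)$ /uuid/$2/$3/$1 [L]
    return "/uuid/%s/%s/%s" % (uuid_filename[:2], uuid_filename[2:4], uuid_filename)

# e.g. uuid_to_radix_url("AABBCCDDEEFFGGHHIIJJKKLLM2090308.json.gz")
# -> "/uuid/AA/BB/AABBCCDDEEFFGGHHIIJJKKLLM2090308.json.gz"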
No longer blocks: 487474
Depends on: 487474
Depends on: 487635
Depends on: 488334
After seeing a basic layout of the dumps today, I'd like to propose a change. Lars is currently storing the dumps in xx/xx/dump format. I'd like to change this to the old style of having a date and a name directory. I'd also like a symlink from the date side pointing to the name side. This will help speed up the cleanup scripts significantly.
Per comment #11, I have added code in processedDumpStorage.py that handles a date branch that has directories root/date/YYYY/mm/dd/HH/M5/ that hold symbolic links: ooid -> root/name/oo/id/ooid.jsonz By default, the date used is 'now()', but you may pass a timestamp parameter to the appropriate method (either newEntry or putDumpToFile) if needed.
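[A minimal Python 3 sketch of the layout described above, not the actual processedDumpStorage.py code; the function name and the interpretation of "M5" as a 5-minute bucket are assumptions.]

import os
from datetime import datetime

def link_date_to_name(root, ooid, timestamp=None):
    """Create the name-branch directory and a date-branch symlink for one ooid.

    Name branch: root/name/oo/id/           (first two radix levels of the ooid)
    Date branch: root/date/YYYY/mm/dd/HH/M5/ooid -> symlink to the name-branch dir
    """
    timestamp = timestamp or datetime.now()   # default is "now", as described above
    name_dir = os.path.join(root, 'name', ooid[:2], ooid[2:4])
    five_min = '%02d' % (timestamp.minute - timestamp.minute % 5)  # assumed "M5" bucket
    date_dir = os.path.join(root, 'date', timestamp.strftime('%Y/%m/%d/%H'), five_min)
    os.makedirs(name_dir, exist_ok=True)
    os.makedirs(date_dir, exist_ok=True)
    link_path = os.path.join(date_dir, ooid)
    if not os.path.islink(link_path):
        # relative symlink named after the ooid, pointing from the date side to the name side
        os.symlink(os.path.relpath(name_dir, date_dir), link_path)
    return name_dir, link_path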
I call this done!
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro