Closed Bug 1103945 Opened 10 years ago Closed 10 years ago

[mozlog] Create a command line tool to merge multiple raw structured logs

Categories

(Testing :: Mozbase, defect)

defect
Not set
normal

Tracking

(firefox38 fixed)

RESOLVED FIXED
mozilla38
Tracking Status
firefox38 --- fixed

People

(Reporter: davehunt, Assigned: makzmaciek, Mentored)

Details

(Whiteboard: [lang=py][good first bug])

Attachments

(2 files, 4 obsolete files)

We'd like a command line tool that accepts paths to multiple raw structured logs and combines them into a single log. This would be particularly useful for getting a unified log where we chunk the tests. We could take the earliest suite-start and latest suite-end for use in the combined log. This would allow us to show time saved when chunks are run in parallel. The user should be able to specify the format for the combined log file (or perhaps multiple).
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [lang=py][good first bug]
Hi James, I am interested in working on this script. Dave mentioned specifying a format for the combined log file. Is this the format of the file, i.e. txt, etc.? -Akshay
Flags: needinfo?(james)
Thanks for looking at this! So I suggest that in the first instance you make a tool that takes N structured log files and returns a single structured log that's (approximately) the concatenation of all the different structured logs. The output format is then fixed; it's the same line-delimited JSON as the input. There are a couple of tricks you might need here. The first thing is that you want the output to contain a *single* suite-start and suite-end message. The suite-start message should contain the combined list of all tests from the set of input structured log files. I guess to do this you want to read each file until you reach a suite_start message, grab the data you want and then read the rest of each file once you have constructed the output suite-start message. You probably also want to check that the run_info for each input file is the same and fail if it isn't. I can't think of a time where it makes a lot of sense to concatenate runs from different platforms. I don't think there are any fields specific to other things that could legitimately vary (e.g. the exact name of the field that the tests ran on), but if there are we may have to whitelist the run_info properties that make the output. Hopefully that makes some sense. I can give more detailed pointers if you need them.
Flags: needinfo?(james)
Hii,I am new to Open Source.Is there anything I can do here.
I would like to work on it. Would you attach a sample log file which you want to concatenate.
Attached file Log concatenation script (deleted) —
Pass path to log files as command line arguments. A file will be generated in same directory as script named as output.txt containing the concatenated log.
Attachment #8540628 - Flags: review?(james)
Attachment #8540628 - Flags: review?(james) → feedback?(james)
(In reply to Ankit Goyal from comment #6) > Created attachment 8540628 [details] > Log concatenation script > > Pass path to log files as command line arguments. > A file will be generated in same directory as script named as output.txt > containing the concatenated log. Thanks for looking at this, and sorry for the tardiness of my reply. Unfortunately the attached patch doesn't quite meet the requirements. The concatenation we want to do here has to be aware of the semantics of a structured log file. In particular there should be exactly one suite_start and one suite_end message in the concatenated file, and the result should update the contents of those entries to reflect the actual data in the concatenated file. For example imagine two input files (in some made up syntax): SUITE_START [test1] TEST_START test1 TEST_END test1 PASS SUITE_END SUITE_START [test2] TEST_START test2 TEST_END test2 FAIL SUITE_END The result of concatenating these files should be: SUITE_START [test1, test2] TEST_START test1 TEST_END test1 PASS TEST_START test2 TEST_END test2 FAIL SUITE_END Does that make more sense?
Attachment #8540628 - Flags: feedback?(james) → feedback-
Hi! I would like to work on this issue. Can you assign it to me? Could you explain couple more things please? Should I discard events before suite_start and after suite_end? How other fields (besides run_info, time and tests) should be merged in suite_{start, end}? Should merged events be sorted or just appended file by file? -Maciek
> Could you explain couple more things please? > Should I discard events before suite_start and after suite_end? Let's keep them in the logs, before the suite_start, or after the suite_end, message. > How other fields (besides run_info, time and tests) should be merged in > suite_{start, end}? I think it should likely validate that run_info and device_info are the same in each log file (possibly with an option to override this heck, in which case those fields should be ommitted from the output). For the common fields, I think it makes sense to emit data that is accurate for the combining process: time - The current timestamp thread - name of the current thread pid - id of the current python process source - I think this will typically be the same for all the input files, so use that component - "log concatenation" or something? > Should merged events be sorted or just appended file by file? The structure I suggest is [all pre-suite_start events, unsorted] <suite_start> [all intermediate events, unsorted] <suite_end> [all post-suite_end events] That is, I think a valid implementation strategy would be to open log_1..log_N and for each in turn output each event before suite_start is reached. Then output a single suite_start, then from each file in turn output each event before suite_end is reached, output suite_end, and then the same again for the events after suite_end.
Assignee: nobody → makzmaciek
Attached file merge_logs.py - proposed solution (obsolete) (deleted) —
Please see proposed solution attached. What do you think about it? Should I create a patch to a repo? (which one?)
Comment on attachment 8549897 [details] merge_logs.py - proposed solution This looks like a great start, thanks! I think we want this script to live in mozilla-central under testing/mozbase/mozlog/mozlog/structured/scripts/ There are also some utility functions in there that you should be able to use for things like reading the log files. Also note that having code at the top level is not ideal (it seems likely that we will want to be able to import this code and use it as a module). For instructions on creating a patch and asking for review, see [1]. I appreciate that's a bit of a wall of text, so if you have more specific questions please ask. [1] https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch
Attachment #8549897 - Flags: feedback+
Please find the patch attached. Should I request a review now? I've refactored a bit the script to follow standards from other scripts in the same directory.
Attachment #8549897 - Attachment is obsolete: true
This looks good! Please go ahead and request a review.
Attachment #8552018 - Flags: review?(james)
Comment on attachment 8552018 [details] [diff] [review] Command_line_tool_to_merge_multiple_raw_structured_logs.patch Review of attachment 8552018 [details] [diff] [review]: ----------------------------------------------------------------- This looks like a really good start; just some minor changes required. Please add the script to the commands dict in testing/mozbase/mozlog/mozlog/structured/scripts/__init__.py so that it is callable using the structlog command. It also looks like your diff is against the wrong base; it makes sense to use the hg or git commands to produce a diff. Thanks for making the patch! ::: structured/scripts/logmerge.py @@ +15,5 @@ > +def fill_process_info(event): > + event["time"] = int(round(time.time() * 1000)) > + event["thread"] = current_thread().name > + event["pid"] = os.getpid() > + event["component"] = "logmerging" Let's just leave this as-is. @@ +20,5 @@ > + return event > + > + > +def process_until(f, output, action): > + entry = json.loads(f.readline()) So, rather than doing this with json directly it would be great if we could use the reader module in mozlog. This obviously requires some changes in other places too, but I don't think it's too substantial a change. @@ +22,5 @@ > + > +def process_until(f, output, action): > + entry = json.loads(f.readline()) > + while entry['action'] != action: > + json.dump(entry, output) This looks like dump_entry? @@ +40,5 @@ > +def validate_start_events(events): > + for start in events: > + if not start['run_info'] == events[0]['run_info']: > + print("Error: different run_info entries", file=sys.stderr) > + sys.exit(-1) I think this exit code ought to be positive. @@ +51,5 @@ > + > + > +def get_parser(add_help=True): > + parser = argparse.ArgumentParser("logmerge", description='Merge multiple log files.', add_help=add_help) > + parser.add_argument('-o', dest='output_filename', default="output.json", help='output file') Can we default the output to sys.stdout, please?
Attachment #8552018 - Flags: review?(james)
Attached patch Implemented fixes after code-review (obsolete) (deleted) — Splinter Review
Thanks for your feedback! Please see updated patch.
Attachment #8552018 - Attachment is obsolete: true
Attachment #8554863 - Flags: review?(james)
Comment on attachment 8554863 [details] [diff] [review] Implemented fixes after code-review Review of attachment 8554863 [details] [diff] [review]: ----------------------------------------------------------------- Just a couple of very minor fixups, otherwise looks good. ::: testing/mozbase/mozlog/mozlog/structured/scripts/logmerge.py @@ +16,5 @@ > +def fill_process_info(event): > + event["time"] = int(round(time.time() * 1000)) > + event["thread"] = current_thread().name > + event["pid"] = os.getpid() > + event["component"] = "logmerging" Let's just delete this line. @@ +21,5 @@ > + return event > + > + > +def process_until(f, output, action): > + for entry in reader.read(f): Being picky now, but I think this will be more obvious if you create the reader in main() and pass that around rather than the file object.
Attachment #8554863 - Flags: review?(james)
Attached patch Proposed solution (obsolete) (deleted) — Splinter Review
Thanks again for review, please see another patch attached.
Attachment #8554863 - Attachment is obsolete: true
Attachment #8555465 - Flags: review?(james)
Comment on attachment 8555465 [details] [diff] [review] Proposed solution Review of attachment 8555465 [details] [diff] [review]: ----------------------------------------------------------------- Looks good! Thanks for doing this.
Attachment #8555465 - Flags: review?(james) → review+
I notice this patch doesn't have author information, so I can't check it in as-is. I don't know exactly what workflow you are using, bit with git "git show -p ref" should generate a patch for ref and with mercurial "hg export -r rev" should do it. Note that in either case you will need to configure the VCS with your name/email. Whilst you are doing that, you should probably set a commit message in the standard Mozilla format: Bug 1103945 - Add command line tool to merge multiple raw structured logs, r=jgraham But of course I can do that part more easily. let me know if you need any help with this.
Flags: needinfo?(makzmaciek)
Attached patch Final solution (deleted) — Splinter Review
Thanks for review. Hopefully final patch attached.
Attachment #8555465 - Attachment is obsolete: true
Flags: needinfo?(makzmaciek)
Attachment #8558043 - Flags: review?(james)
Attachment #8558043 - Flags: review?(james) → review+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: