Closed Bug 534764 Opened 15 years ago Closed 14 years ago

Make NSPR_LOG_FILE work for multiple processes

Categories

(Core :: IPC, defect)

Other Branch
x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla5
Tracking Status
blocking2.0 --- -

People

(Reporter: jduell.mcbugs, Assigned: jduell.mcbugs)

References

Details

(Keywords: dev-doc-needed)

Attachments

(1 file, 16 obsolete files)

(deleted), patch
wtc
: review+
Details | Diff | Splinter Review
Right now if NSPR_LOG_FILE is set, each process in an electrolysis job winds up trying to exclusively open the given file name. Only one wins, and you only see its output in the log file. We should implement something so that each process gets its own log file. The most obvious way is to change the environment so that non-chrome processes check NSPR_LOG_FILE, and if set, modify it to something like "logname.N", where N is the pid of the child process. I've verified that this works, but am not sure where to put the logic.
I've worked around this myself by manually setting a different NSPR_LOG_FILE value for the child in my xpcshell setup files, but it'd be nice to have an actual fix, as it's yet another thing I have to inform developers of e10s necko about before they can be productive. Any ideas on where this should be fixed? I'm assuming it's not actually in NSPR, but CC-ing WTC in case it is.
Blocks: fennecko
I'm attaching the patch I currently use to manually set NSPR_LOG_FILE in the child for xpcshell tests. This also turns on necko e10s HTTP, i.e. sets NECKO_E10S_HTTP. The patch obviously depends on my patch for Bug 530952 being either applied, or landed on the e10s branch (soon, hopefully).
Jason: you can do this in NSPR, in the file mozilla/nsprpub/pr/src/io/prlog.c. You can add a new environment variable for this purpose, say NSPR_LOG_FILE_PID. Another solution is to make NSPR honor a magic token, say $PID, in NSPR_LOG_FILE. Then you can just specify NSPR_LOG_FILE as "logname.$PID". You can also solve this problem outside NSPR by calling the NSPR function PR_SetLogFile() in each child process. Then you can use your own mechanism to control the argument passed to PR_SetLogFile().
Attached patch A rough outline of probable future patch (obsolete) (deleted) — Splinter Review
Hi, I had spoken to Jason Duell about a starter's bug in e10s necko and as per his suggestion I am taking this up. From WTC's comment, I thought of using the magic token idea, replacing the token name from $PID to #PID ($ causes problem in bash for sure, and most likely other shells too). So NSPR_LOG_FILE will be set to <actual file name>.#PID. On finding the #PID token we append the pid to the file name. I choose this method considering that nspr could be used in a single process app. I am attaching a very rough patch explaining my approach and containing a few questions and comments. WTC/jduell, can you please look at the attachment and tell me if the approach is ok and also help me find answers for a few questions put up in the patch? I am not asking for a review as the attachment is nowhere near to be called as a patch :) -- brahmana
Assignee: nobody → om.brahmana
Status: NEW → ASSIGNED
Attachment #427057 - Attachment description: A rough outline of probably patch → A rough outline of probable future patch
Comment on attachment 427057 [details] [diff] [review] A rough outline of probable future patch Asking for a review just to get this under the radar.
Attachment #427057 - Flags: review?(wtc)
pid's (while the choice of strace -ff) are a terrible solution, on some platforms they're recycled frequently. might i suggest timestamps instead?
(In reply to comment #6) > pid's (while the choice of strace -ff) are a terrible solution, on some > platforms they're recycled frequently. > > might i suggest timestamps instead? then two processes opened at the same time run into a problem. I think the problem that is being tried to solve here is mostly that there aren't multiple processes accessing the same file at the same time, and as two processes can't have the same PID at the same time, that should work fine. Also, I don't really think they are recycled anywhere fast enough that we run into that problem of overwriting one log we're interested in with another.
pids for plugins which can come and go repeatedly can be recycled on certain platforms. you aren't going to get two processes with the same exact start time.
Comment on attachment 427057 [details] [diff] [review] A rough outline of probable future patch >+ /** >+ * Here goes the code to find the pattern "#PID" (case sensitive?). >+ * If found we do hasPidToken = PR_TRUE; >+ */ Nit: Don't use javadoc-style comment delemiter /**. Let's use "$PID" instead of "#PID". It's fine for this special token to be case-sensitive. There is no NSPR function for getting the PID. Just call the platform-specific functions. You can call PR_Malloc/PR_Free instead of malloc/free. Calling strlen is fine. >+ PRUint32 len = strlen(ev); >+ char *newLogFile = malloc(len + 35); >+ newLogFile = strcpy(newLogFile, ev); >+ char pidString[33]; >+ sprintf(pidString, ".%u", pid); >+ newLogFile = strcat(newLogFile, pidString); >+ free(ev); // A nspr specific free function here? >+ ev = newLogFile; Your code assumes that the PID special token is at the end. This is not necessarily true. You need to replace the special token by the pid. Call PR_snprintf instead of sprintf.
I'm fine with using '$PID' too. Re: platform-specific functions for getting the pid. On all unices and OSX, that's getpid(), and then '_getpid()' on Windows, right? Any other cases to worry about? We have another issue here. Ideally, we'd like for the behavior to work as follows: 1) The user sets NSPR_LOG_FILE with '$PID' somewhere in it, and each process gets its own log; or 2) The user doesn't use '$PID', and so all processes' output is interleaved into a single log file. The problem is that #2 is slightly broken: individual log file writes appear to interleave sanely, i.e. each line of output emits atomically, at least on Linux (perhaps I'm just getting lucky?). But since each process opens the same log file with PR_TRUNC, the last process to call PR_Init blows away whatever log entries previous processes have made up to that point. It would be nice to have a way to tell a process to not pass PR_TRUNC during log init (unless '$PID' is set), i.e. "I'm part of a job that has already initialized the log file, so don't blow it away". This function would need to be called before most any other NSPR function, though, as anything that calls _PR_ImplicitInitialization() winds up blowing away the log. WTC, does that sound feasible? brahmana: if we do add such a function to NSPR, the place to call it in Firefox would be in ipc/app/MozillaRuntimeMain.cpp, in main().
the problem with pid is pid rollover. note that on windows you can already do 2 relatively safely by using the windebug log file (i can't remember the spelling).
Jason: it's feasible to tell a process to not open the log file with PR_TRUNC. We've been using special log module names for these purposes, so we can add a "notrunc" special log module.
> so we can add a "notrunc" special log module I'd rather not have to rename the HTTP, etc., log module names and re-educate all my developers. Hopefully what you're talking about doesn't require that (but I don't see how it wouldn't). I just poked around in prlog.c, and I don't see the 'notrunc' logic you mention. There's just one call to PR_Open, and it uses PR_TRUNC.
What I meant is to add a new magic log module name "notrunc". It'll be handled as a directive rather than a real log module. Search in prlog.c for "sync", "bufsize", and "timestamp", which are the current magic log module names.
(In reply to comment #14) > What I meant is to add a new magic log module name "notrunc". > It'll be handled as a directive rather than a real log module. I think this would defeat the purpose. From what I understand Jason is trying to find a way in which the users do nothing extra, as compared to what they are doing now, to get log data from all processes. If we expect the user to put a "notrunc" in the log-modules list, we might, instead, expect them to use the $PID itself. So the most ideal thing would be to check for $PID and if it is present we will have different files. But if it is not there then we open the logfile without PR_TRUNCATE. This is sort of making nspr-logging "multiprocess ready" by default. Am i right Jason? Two more questions: 1) If we use $PID, wouldn't the shell (bash for sure) replace it with an env var named PID (which is empty in most cases)? This is why I had put in #PID. 2) What thoughts on timeless'timestamp suggestion? Will the PID reuse affect us that bad (in case of plugin processes specially)? Is PID itself better as it might be easier to map between a process and its log file?
The "notrunc" magic log module name allows you to use a single log file for all the processes, with their logging output interleaved (option #2 in Jason's comment 10). That may be preferrable to having multiple log files (option #1 in comment 10).
(In reply to comment #16) > The "notrunc" magic log module name allows you to use a single log > file for all the processes, with their logging output interleaved > (option #2 in Jason's comment 10). I think the idea here is to have this behavior without the user having to specify an extra log module. Assuming we are dropping the PID approach, I see two options here: 1) Make this (no PR_TRUNC) the default NSPR logging behavior 2) Provide an nspr api that the e10s mozilla can call to tell nspr to not use PR_TRUNC
WTC: So if I understand you correctly, I would have to re-educate my users to no longer use "NSPR_LOG_MODULES=nsHttp:5", but instead something like "NSPR_LOG_MODULES=nsHttp:5,notrunc"? And they'd need to remember to pass "notrunc" only if they aren't passing "$PID" in NSPR_LOG_FILE? I'd rather not make things that complicated to remember. Things will be less education to do if the current documented method (set "NSPR_LOG_MODULES=nsHttp:5") still works without modification (creating an interleaved file), and if users want per-process logging, they use a log name with $PID in it. To make that work, the easiest path seems to be to have an NSPR function that tells SetLogFile to not use PR_TRUNC, and I can arrange for the mozilla runtime to call it when 1) we know we're a child process and 2) there's no $PID in the log file name. (Note: we'll need to use getenv/_getenv directly to determine that, instead of PR_GetEnv, because the latter would truncate the log file as we read NSPR_LOG_FILE). Does that sound right? Other notes: I'm not worried about the pid reuse issue. We're not creating processes so frequently that there's any realistic chance that we'll wind up overwriting one of our logs with another processes'. It's much easier to figure out which log belongs to which process from the pid than a timestamp. > 1) Make this (no PR_TRUNC) the default NSPR logging behavior That would break current semantics. User's expect a new run of their app to create a new log file, not append to an existing one. This would cause lots of confusion. We have to make sure that each log file we use in a run gets called with PR_TRUNC once (and only once). > 1) If we use $PID, wouldn't the shell (bash for sure) replace it with an env var named PID (which is empty in most cases)? This is why I had put in #PID. This is true, unless users set the environment variable with single quotes around the file name. I'm sympathetic to using something else, like #PID or whatever. Do anyone have any idea if there's a character that's used most frequently for this by other programs?
You can try adding an NSPR function. I'm worried that NSPR starts logging before you get a chance to call that function.
Yes, we'll need to call the function I'm thinking of right away in main(), before calling any other NSPR function. I'm thinking of something like the following pair of functions: /* If used, this function must be called before any other NSPR function. * * If called, the log file specified by $NSPR_LOG_FILE will be opened in * append mode, i.e. without PR_TRUNC. * * This function is intended primarily for multiple-process jobs that share * the same log file, so that the 2nd and following processes do not truncate * the log at startup. */ void PR_PreInit_Logfile_NoTrunc(); /* If used, this function must be called before any other NSPR function. * * If 'pidstr' is a non-empty string, it will be searched for in NSPR_LOG, and * if found, replaced with the process id (pid) of the current process. * * This function is intended primarily for multiple-process jobs that wish * to use separate log files for each process. */ void PR_PreInit_Logfile_Pid(PRBool notrunc, const char *pidstr); There are other ways to accomplish the same thing: we could hard-wire in recognizing the '$PID' string to PR_SetLogFile, for instance. I'm fine with whatever, so long as we can get the following two semantics: 1) single log file with no_trunc 2) log file with $PID with trunc. I'm still going back-and-forth as to alternatives to '$PID' ("%PID"? Will that confuse the windows shell?), which is part of why I figured we could actually make this a parameter instead of hard-coding it.
If a C++ static object constructor calls NSPR, NSPR may start logging before the main() function starts. This is why I'm worried that calling a function may not work. You need to test it. Another approach is to use a magic token in the log file name to indicate it should be opened without PR_TRUNC. We should implement the $PID solution first.
We definitely call NSPR functions and log constructors in particular from static constructors via many `static gLogModuleFoo = PR_NewLogModule("foo");` calls scattered throughout the tree.
Attached patch patch_v0.1 with only #PID support (obsolete) (deleted) — Splinter Review
This patch implements the support for #PID taking into consideration the previous comments. I made it #PID as "$" results in value substitution in bash shell and "%" in Windows cmd shell. I have not put in any comments thinking the code is pretty much self-explanatory. Kindly let me know if it is not so. I tested it on my machine - Ubuntu - 8.04 x86_64 and it worked fine with and without dom.ipc.tabs.enabled.
Attachment #427057 - Attachment is obsolete: true
Attachment #430348 - Flags: review?(wtc)
Attachment #427057 - Flags: review?(wtc)
OK, so it looks like a solution for this going forward is the existing patch, and then we can add some logic in firefox to modify $NSPR_LOG_MODULES to "NSPR_LOG_MODULES=OLDVAL,notrunc" if and only if NSPR_LOG_FILE does NOT contain '#PID', and then add the 'notrunc' logic to NSPR. That'll give us a single log file without child processes truncating it, or a (truncated at startup) log per process. Sound about right, Wan-Teh? Do you have time to review Srirang's patch?
So we (firefox) apparently don't have cross-platform code for starting the child process with a different environment than the parent. But Ben Turner tells me the child inherits the parent's environment. So I propose the following hack: - in the parent/chrome process, we change NSPR_LOG_MODULES to the value we want for the child (i.e. with 'notrunc' appended) at some point before we launch the child process. We then change it back after we've launched the child. As far as I can see, NSPR doesn't look at NSPR_LOG_MODULES after startup (take a look to make sure, though), so this shouldn't actually break any behavior in the parent, even if we never changed NSPR_LOG_MODULES back. But let's change it back, just in case. It looks like one way to do this would be to change GeckoChildProcessHost::AsyncLaunch() so that you pass a wrapper function instead of PerformAsyncLaunch; the wrapper could set NSPR_LOG_MODULES, call PerformAsyncLaunch, and then change the env var back. http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#109
Attached patch patch_v0.2 with #PID and notrunc support (obsolete) (deleted) — Splinter Review
Implements support for : 1) #PID token in the log file name, which will be replaced by the pid of the process. 2) notrunc log module which instructs that the log file should not be truncated and that all logging data be appended to the current contents of the file. A separate bug will be created to deal with changing of the env variable only for the child process as mentioned in comment #25. W.r.t electrolysis, users will not actually set the notrunc log module at all. Firefox will take care of setting that only for the child processes if there is no #PID token in the log filename. This way the log file is cleaned once at the start-up by the chrome process and subsequent processes just append their data. That being said, I have a question. In case of a single file being used are we safeguarded against multiple processes trying to write to a single file? Do we need something extra to make sure there are no race conditions or such?
Attachment #430348 - Attachment is obsolete: true
Attachment #437637 - Flags: review?(wtc)
Attachment #430348 - Flags: review?(wtc)
> are we safeguarded against multiple processes trying to write to a single file? I don't know if we're safe at a C/C++/POSIX spec level, but on Linux at least logging seems to be atomic with respect to printfs of log lines, so no intermingled garbage lines. If we see that elsewhere we can open yet another bug :)
Blocks: 558617
No longer blocks: fennecko
Pinging to see if wtc or anyone else can do the review?
Interesting... I could take this bug. If you are not against hacking NSPR then we can have PID based file names (which I would also like to have). Other option is to just log the PID to the file.
Assignee: om.brahmana → honzab.moz
Attachment #420443 - Attachment is obsolete: true
Comment on attachment 481578 [details] [diff] [review] Other way to have separate logging, if handy for anyone... >+ // If there is no NSPR_LOG_FILE_CHILD then drop the NSPR_LOG_FILE env >+ // to prevent child processes from interactint with the chrome log >+ PR_SetEnv(PromiseFlatCString(nsprLogFileEnv).Data()); I don't think we want chrome-only logs by default. And I think we want a way (if possible) to allow both separate logs or a single, interleaved log (it makes everything show up in at least semi-chronological order, which can be handy when debugging). I could have sworn interleaving used to mostly work (with any bits written by chrome before the child got init'd getting blown away by the child's open(TRUNC)). But I just tried with fennec, and I see chrome's log msgs at first, until I type in a URL to the browser, after which the file contains *only* child-side messages. I'm not sure what's happening to the chrome LOG() calls after that. Weird. Both chrome/child should have the same file open. I'll poke around at this some more soon, unless Srirang has time to look at it. Assuming we can get interleaving working, we'll want a patch that combines Srirangs NSPR notrunc with a modified version of this patch that sets the appropriate ENV in chrome before spawning the child. If we can't get interleaving working, I'd vote that we just generate one log per process, with '.child.PID' appended. I don't see any reason to ever only generate a chrome log. Let me know if that sounds wrong.
Attachment #481578 - Flags: review-
Re: "I don't see any reason to ever only generate a chrome log": Having chrome and child process(es) all writing to one file never worked for me under Windows. That's why I created my little patch. Using 'timestamp' as one of the modules in NSPR_LOG_MODULES can be used to sync the logs if you open them in two windows side by side. However, having interleaving would be really nice, but how can we ensure (again, mainly on Windows) that writing one line of the log will be atomic? Also, how do I recognize what log comes from what process when interleaved?
Comment on attachment 481578 [details] [diff] [review] Other way to have separate logging, if handy for anyone... Don't remember I would ask for a review here.
Attachment #481578 - Flags: review-
Just a note that I tried to have an interleaved logging (based on Srirang's patch with notrunc) and surround writings to the log file with PR_(Un)LockFile(PRFileDesc*) that guards multiprocess concurrency. But there are following issues: - it's prevented to be used on win platform, as we are using stdio there to do logging (we have FILE*, not PRFileDesc) - when using stdio (no guarding), child output is often completely lost or overwritten with parent output - when switching to PR io, the child is unable to open the file (I haven't time to figure out why) Maybe solving the last issue may give us something useful for the time until we turn to a better logging facility.
It sounds like we should scrap interleaved logs for now and just turn this bug into generating a log per process, with "logfile.pid' used for child processes. That should just be a matter of taking the $PID bits from Srirang's patch for NSPR (ditch notrunc), and then changing Honza's patch to *always* append '.$PID' to $NSPR_LOG_FILE if it has been set.
I will try to review Srirang's patch. Sorry about the delay...
Comment on attachment 437637 [details] [diff] [review] patch_v0.2 with #PID and notrunc support We should rip the "notrunc" out of this patch.
Attachment #437637 - Flags: review?(wtc) → review-
Comment on attachment 437637 [details] [diff] [review] patch_v0.2 with #PID and notrunc support Srirang: Sorry about repeatedly forgetting to review your patch. Here are my review comments. 1. The "notrunc" changes can be cleanly separated from the "#PID" changes. I will create a separate patch for your "notrunc" changes for future reference. >+ char *delim = NULL; 'delim' should have the type 'const char *'. A better name for this variable would be 'pidVar', because "#PID" isn't a delimiter. I prefer "$PID" because '$' is used in many languages for variables. I understand that's exactly why you don't want to use '$' (to avoid having to escape '$' in Unix shells). >+ if(delim) { Nit: add a space after "if". Please search for "if(" throughout this file and fix them all. >+ char *prePidPart = strndup(ev, (delim - ev)); >+ char *postPidPart = strdup(delim + 4); Since the origial string is null terminated, we definitely do not need to duplicate postPidPart. You can just say const char *postPidPart = delim + 4; and do not free(postPidPart) later. >+ if(!prePidPart || !postPidPart) { With my suggested change above, you only need to test !prePidPart here. >+ PRUint32 pid = -1; Use 'int' as the type of 'pid'. Otherwise your pid < 0 test below will never be false. Also, move this variable declaration to the top of this block. Some C compilers still require that. >+#ifdef XP_WIN >+ pid = _getpid(); >+#else >+ pid = getpid(); >+#endif You can just call getpid() on Windows, without the ifdef. We suppress the compiler warning about not using the function name with a leading underscore '_'. > #ifdef XP_PC >- char* str = PR_smprintf("Unable to create nspr log file '%s'\n", ev); >+ char* str = PR_smprintf("Unable to get PID \n"); >+ if (str) { >+ OutputDebugStringA(str); >+ PR_smprintf_free(str); >+ } >+#else >+ fprintf(stderr, "Unable to get PID \n"); >+#endif Please create a static function for this kind of code, which you used at least 2-3 times in the patch. (In some places you only have a fprintf(stderr) statement inside #ifndef XP_PC, which should also be replaced by the new function.) >+ return; >+ } else { >+ logFileName = NULL; >+ logFileName = PR_smprintf("%s%d%s", prePidPart, pid, postPidPart); Remove "else", because the "if" block ends in a return statement. Remove the useless "logFileName = NULL;" line.
I extracted this patch from Srirang's patch in comment 26, for future reference.
Thank you Wan-Teh. Honza, Are you working on Wan-Teh's review comments or do you want me to look into that?
Srirang, I don't agree much with this approach (to have log files with pids in the name) so please go ahead and finish your patch yourself. Thanks. BTW, IMO, would be good to make the PID being part of the name optional.
Honza: what's your preferred approach? Interleaving? (I read your comments in this bug report but didn't see a description of your preference.) Thanks.
OS: Linux → Windows CE
OS: Windows CE → Linux
(In reply to comment #43) > Honza: what's your preferred approach? Interleaving? Yes, I would love to have it. I got used to have a single log file, with always the same name, that contains all the info in a chronological order from the current or last run. I can filter it, tail etc. It would be perfect to have this also for multiprocess case. I did some tests around this as I needed to have logging in Fennec. I didn't finish the work on it, but I will probably return to it sooner or later. Please read comment 35 to see where the work stopped.
I would love to see a single interleaved log working, too. If someone has time to look into making it work on Linux/win32/osx, that'd be great. I'm not sure when I'll have cycles for that. If it's not going to happen reasonably soon (before fennec ships?), we may have to give up and use separate logs for now, and do interleaved as a followup (if ever). Notes: 1) Have we tried setting 'sync' in NSPR_LOG_MODULES? Might make a difference. 2) We could always write a utility that slurps the separate logfiles into a single interleaved one when that's useful for debugging. There are a couple of issues with this: i) it would require users to set 'timestamp' in NSPR_LOG_MODULES, which would be an educational challenge; ii) the timestamp is pretty verbose right now, so the logs would be bigger (and uglier to read in their raw format). This might work better if there were a way to have NSPR write unix-style date timestamps (i.e. just a number), and it didn't require users to set an environment variable (instead we'd have some new function to turn it on). WTC, does that sound feasible?
jduell: I think so.
Attached patch Interleaving preview 1 patch (obsolete) (deleted) — Splinter Review
This is patch that for me on a windows (7) box produces a perfect interleaved log while running Fennec. I forbid using stdio on windows and os2 (see the comment why it is used). For the final patch I plan to make this a run-time option (no stdio when using ipc loggin). Also the seek to end has to be also Feedback appreciated!
Attachment #496689 - Flags: feedback?(jduell.mcbugs)
Attachment #496689 - Flags: feedback?(jduell.mcbugs) → feedback?(wtc)
Would this also make it possible for two instances of regular (non-electrolysis) Firefox to write to separate logs?
(In reply to comment #48) > Would this also make it possible for two instances of regular > (non-electrolysis) Firefox to write to separate logs? You can always set a different NSPR_LOG_FILE value for each of the processes.
Blocks: 632007
Blocks: 628563
I think we should mark this a softblocker now that we've established that this is 1) causing not just truncated logs, but logs with confusing null bytes in them; and 2) happening not just for fennec, but for Firefox (due to plugins being out of process). I'm sympathetic to an interleaved approach, but am concerned that it could be either fragile (requiring no stdio usage: how confident are we that there's no more in the tree, or in extensions?), and/or might have a performance hit (the comment for the #define you comment out in your patch seems to indicate stdio was being used for perf reasons: " On NT, we can't define _PUT_LOG as PR_Write or _PR_MD_WRITE, because every asynchronous file io operation leads to a fiber context switch"). For FF 4 we might want to play it conservative and have separate log files.
blocking2.0: --- → ?
This sucks, but I'm not going to block on it. Get a patch up and I'll consider approving it.
blocking2.0: ? → -
This patch creates a separate logfile for each child (with name LOGNAME.child-1, 2, 3, etc.). This patch could be a standalone solution: but to get a single, interleaved log we'd need to 1) have users add ",timestamp" to NSPR_LOG_MODULES, and 2) write a script that merges the various logs into one, using timestamp order.
Assignee: honzab.moz → jduell.mcbugs
Attachment #513988 - Flags: review?(benjamin)
This patch adds an ',epoch' parameter to NSPR_LOG_MODULES, causing a shorter and simpler timestamp (microseconds since 1970) to be used. With this, multiple logs can be merged more simply (with "sort -n", instead of a script), and log lines are shorter.
Attachment #513989 - Flags: review?(wtc)
NSPR part of final patch: allow original/parent process to turn on epoch logging after startup. This will allow us to get timestamped logs even if the user hasn't passed ',epoch' in the modules list.
Attachment #513991 - Flags: review?(wtc)
Gecko part of final patch: turn on epoch timestamps in both parent and child if the user hasn't turned them on already. This is imperfect, in that there will already be some unstamped entries in the parent's log by the time this runs. But all entries that happen while multiple processes exist will be stamped, which is the important part--a script can deal with the details of merging. The benefit of the last two patches is that they don't require us to educate users to pass ',epoch' (or ,timestamp) in NSPR_LOG_MODULES. If we don't think that benefit is a big deal, we can just change the instructions on HTTP logging on MDC and deal with unstamped logs by asking users to resubmit them with stamps. I go back and forth on the right solution: thoughts appreciated.
Attachment #513995 - Flags: review?(benjamin)
Regarding interleaved logs: I think we should open a new bug for them once FF 4 ships and see if we can establish if we can get them portably/reliably/performantly. Meanwhile you can force interleaved logs in my patches by setting GECKO_INTERLEAVE_NSPR_LOG=1 (not sure if that's the right naming convention: it's a weird case, where the logic isn't part of NSPR, hence the GECKO prefix). Perhaps that should have been LOGS, not LOG, too.
Comment on attachment 513989 [details] [diff] [review] FF 4 solution, patch 2: NSPR change to add simple "epoch" timestamps PR_Now() has very low resolution on windows. In tens of milliseconds or so.
Attachment #513989 - Attachment is obsolete: true
Attachment #514158 - Flags: review?(wtc)
Attachment #513989 - Flags: review?(wtc)
Attachment #513991 - Attachment is obsolete: true
Attachment #514159 - Flags: review?(wtc)
Attachment #513991 - Flags: review?(wtc)
Attachment #513995 - Attachment is obsolete: true
Attachment #514160 - Flags: superreview?(benjamin)
Attachment #514160 - Flags: review?(honzab.moz)
Attachment #513995 - Flags: review?(benjamin)
Attachment #513988 - Flags: superreview?(benjamin)
Attachment #513988 - Flags: review?(honzab.moz)
Attachment #513988 - Flags: review?(benjamin)
OK, the patches now use PR_IntervalNow, which NSPR "guarantees" to have resolution of at least 1000 ticks/second. That isn't fabulous, but it should be good enough for most of our logging purposes. If there's a better portable timestamp out there I should be using, let me know. Obviously these need to get moved on quickly if we want them in FF 4. I'm not sure what our policy is on making changes to the mozilla-central copy of NSPR if it's not possible to get this into the NSPR repo in time.
Attachment #514160 - Flags: superreview?(benjamin) → superreview?(jones.chris.g)
Attachment #513988 - Flags: superreview?(benjamin) → superreview?(jones.chris.g)
PR_IntervalNow() suffers from a similar problem. It jumps by 15-16 ms on my Win7 on Intel based system. I had the same experience on WinXP on AMD as well. On windows specifically it is good to use QueryPerformanceCounter and QueryPerformanceFrequency functions. It is supported since Windows 2000 Professional and Windows 2000 Server. See google://QueryPerformanceCounter if interested. MSDN links obsolete fast... [ BTW, add support for URLs like these would be very very cool! :) ]
Ugh. How sure are we that QPC works reliably as a cross-CPU counter on post XP systems? Poking around at google, MDSN still seems to note that the timing is not necessarily synchronized across CPU issues due to hardware/bios "errors": http://msdn.microsoft.com/en-us/library/ms644904(v=vs.85).aspx This post makes it sound like AMD desktop chips don't provide a synchronized QPC as of 2008: http://blogs.msdn.com/b/oldnewthing/archive/2008/09/08/8931563.aspx#8935926 Other posts say this is fixed for post-XP systems: http://channel9.msdn.com/Forums/TechOff/152339-QueryPerformanceCounter-AthlonX2/d84d4d0574b44bacb2d79dea0134f1ec Given how little time we've got before FF 4, I'm wondering if the "right" solution for now is just to 1) leave interleaving on by default; 2) Add a GECKO_SEPARATE_NSPR_LOGS option, and tell users to set it and ',timestamp' if their interleaved log is not useful. This will--with minimal changes--give us 1) interleaved logs on platforms where it works out of the box, and 2) a way to get at least get some roughly timestamped logs on Windows. If WTC is up for adding ',epoch', we could detect post-XP Window and use QPC for the counter there. This is assuming that most/much of the time, interleaving produces something useful on windows. (Honza, is that the case, or are you always seeing things like the all of the child's output getting lost? Some Null bytes we can live with for now). If interleaving is totally broken on windows, we could make separate logs the default only on Windows. I'll have access to an SMP windows 7 box tomorrow, and can see what things look like there. And I can put up a patch with the above logic if we agree it's the best triage for this for now (long term I want to see what Chromium is doing, and/or have some sort of logging service or something).
OK, here's a patch that just allows separate NSPR logs if the user sets GECKO_SEPARATE_NSPR_LOGS=1 in their environment. Child logs are just named with a '.child-{1,2,3...}' appended to the parent's log name. This is simple, and low-risk, and about all I think we can try to get into FF 4. But I think it's worth getting in, so we have a fallback in case we lose log input in some cases when the default interleaving is used. We can tell users to also set timestamps on, and while the granularity isn't great, it should be good enough to puzzle out the parent/child log ordering for most purposes.
Attachment #437637 - Attachment is obsolete: true
Attachment #481578 - Attachment is obsolete: true
Attachment #485332 - Attachment is obsolete: true
Attachment #496689 - Attachment is obsolete: true
Attachment #513988 - Attachment is obsolete: true
Attachment #514158 - Attachment is obsolete: true
Attachment #514159 - Attachment is obsolete: true
Attachment #514160 - Attachment is obsolete: true
Attachment #515651 - Flags: review?(honzab.moz)
Attachment #515651 - Flags: approval2.0?
Attachment #496689 - Flags: feedback?(wtc)
Attachment #513988 - Flags: superreview?(jones.chris.g)
Attachment #513988 - Flags: review?(honzab.moz)
Attachment #514160 - Flags: superreview?(jones.chris.g)
Attachment #514160 - Flags: review?(honzab.moz)
Attachment #514158 - Flags: review?(wtc)
Attachment #514159 - Flags: review?(wtc)
Comment on attachment 515651 [details] [diff] [review] Allow separate NSPR logs if $GECKO_SEPARATE_NSPR_LOGS is set No approvals before reviews. (Need to understand risk, that understanding comes, in part, from reviews.)
Attachment #515651 - Flags: approval2.0?
Comment on attachment 515651 [details] [diff] [review] Allow separate NSPR logs if $GECKO_SEPARATE_NSPR_LOGS is set >+ const char* gpnl = PR_GetEnv("GECKO_SEPARATE_NSPR_LOGS"); gsnl ?
Attachment #515651 - Flags: review?(honzab.moz) → review+
Comment on attachment 513989 [details] [diff] [review] FF 4 solution, patch 2: NSPR change to add simple "epoch" timestamps Jason: thanks for the patch. I suggest two changes. 1. "epoch" is a poor name because in NSPR it means 1970-01-01 00:00:00 UTC. It may be better to just call this variable "timestamp2". After all, the only difference from "timestamp" is how it's printed. It should be documented if both "timestamp" and "epoch" are specified, which one has precedence, unless you actually want to print the timestamp in both formats. 2. >+ nb_tid = PR_snprintf(line, sizeof(line)-1, "%Ld: ", PR_Now()); The documented format specifier for PRInt64 is %lld: http://www.mozilla.org/projects/nspr/reference/html/prprf.html#23299 I'm not sure if %Ld also works.
Comment on attachment 513989 [details] [diff] [review] FF 4 solution, patch 2: NSPR change to add simple "epoch" timestamps Another idea is to use the logging level for the existing "timestamp" special log module to control the format of the printed timestamps.
Comment on attachment 513991 [details] [diff] [review] FF 4 solution, patch 3: allow epoch timestamps to be turned on at runtime This patch should be fine. To prevent a proliferation of such functions, perhaps we should add a NSPR_API(void) PR_SetSpecialLogModule(const char *moduleName, int level);
Comment on attachment 514158 [details] [diff] [review] 513989: FF 4 solution, patch 2 v2: implement 'mergestamps' using PR_IntervalNow instead of PR_Now The problem with merging PR_IntervalNow() timestamps from multiple processes is that each process may have a different "origin" (starting point) of the interval time. So they can't be merged in general. In contrast, the origin of PR_Now() timestamps is defined to be 1970-01-01 00:00:00 UTC, which is the same across processes.
Attachment #514158 - Flags: review-
Attachment #514159 - Flags: review-
Comment on attachment 496689 [details] [diff] [review] Interleaving preview 1 patch Honza: thank you for the patch. The ideas in this patch are good. For example, the "notrunc" special log module was requested before in bug 342368. This patch turns off _PR_USE_STDIO_FOR_LOGGING for Win32 and OS/2. I suspect this is because PR_LockFile doesn't work on stdio's FILE. For Win32, we can use the _locking and _fileno functions to lock a FILE: http://msdn.microsoft.com/en-us/library/zs6wbdhx(v=VS.71).aspx http://msdn.microsoft.com/en-us/library/8054ew2f(v=VS.71).aspx We may not have a solution for OS/2, but that's fine. We cannot support OS/2 forever. _PR_USE_STDIO_FOR_LOGGING is useful to Win32 for two reasons: 1. It is required for the WINNT build configuration (not used by Mozilla clients). 2. It converts LF to CRLF. So we should try to preserve _PR_USE_STDIO_FOR_LOGGING. The PR_Seek and _PR_MD_SEEK calls should be unnecessary if we open the log file in append mode. >- newLogFile = PR_Open(file, PR_WRONLY|PR_CREATE_FILE|PR_TRUNCATE, 0666); >+ PRIntn flags = PR_WRONLY|PR_CREATE_FILE; >+ if (truncateLogFile) >+ flags |= PR_TRUNCATE; >+ else >+ flags |= PR_APPEND; >+ >+ newLogFile = PR_Open(file, flags, 0666); It seems that we should always specify PR_APPEND. >+#ifdef XP_WIN >+ pid = _getpid(); >+#else >+ pid = getpid(); >+#endif NSPR doesn't use XP_WIN, so you're actually using getpid() on Windows here. NSPR allows the use of the "deprecated" POSIX function names on Windows, so you can simply remove the ifdef here. Since this patch is a preview patch, I won't comment on the coding style nits.
Attachment #496689 - Flags: feedback-
Comment on attachment 515651 [details] [diff] [review] Allow separate NSPR logs if $GECKO_SEPARATE_NSPR_LOGS is set Potential issues: 1. The code is not thread-safe. I just want to make sure this is intentional. 2. Please test the code on Windows. Windows has two sets of environment variables -- the native environment variables and the CRT environment variables. PR_SetEnv calls putenv, which only changes the CRT environment variables. Suggested changes: 1. Just use strcpy instead of PL_strcpy. You can even replace the malloc+strcpy sequence with strdup, which is available on all platforms. 2. I would probably rename gpnl to separateLogs. 3. Can aExtraOpts be declared as const std::vector<std::string>& ? The current type, std::vector<std::string>, requires copying. I understand it allows the default value std::vector<std::string>() to work. 4. Document PerformAsyncLaunch1 and PerformAsyncLaunch2 in GeckoChildProcessHost.h. Are these function names OK? The "1" and "2" look strange.
I have tested the patch on windows and it works well.
Attached patch Allow separate NSPR logs v2: Comments addressed (obsolete) (deleted) — Splinter Review
Comments addressed. Carrying +r forward. Note: JST tells me this is unlikely to make FF 4 (since this only affects logging by plugins), but might be worth landing for fennec. But maybe not even there, since we're not using Windows for fennec, and that seems to be the problematic platform for interleaving. > rename gpnl to separateLogs. done. > The code is not thread-safe. Added lock around code that sets the static variable. Thanks for spotting this, though I don't think the code gets called re-entrantly. I'll ask cjones. > replace the malloc+strcpy sequence with strdup done. > Document PerformAsyncLaunch1 and PerformAsyncLaunch2 in > GeckoChildProcessHost.h. Are these function names OK? > The "1" and "2" look strange. There's no great naming convention I know of for this (a function we've split up for implementation reasons). I changed PerformAsyncLaunch1 back to PerformAsyncLaunch, and changed PerformAsyncLaunch2 to PerformAsyncLaunchImpl, which seems to be a common convention in our codebase. > 3. Can aExtraOpts be declared as const std::vector<std::string>& ? > The current type, std::vector<std::string>, requires copying. > I understand it allows the default value std::vector<std::string>() > to work. In this patch I made my new function (PerformAsyncLaunchImpl) private, and have it take a ref instead of a copy. PerformAsyncLaunch needs the copy since it's getting the args from a different thread.
Attachment #515651 - Attachment is obsolete: true
Attachment #516270 - Flags: review+
Attachment #516270 - Flags: approval2.0?
Comment on attachment 516270 [details] [diff] [review] Allow separate NSPR logs v2: Comments addressed Jason: this patch differs in nontrivial ways from the previous version. It may not be a good idea to carry over the review+ without a new review by Honza. I suggest some changes. 1. Nit: you don't need to include "plstr.h" now. 2. If the new code needs to be thread-safe, then you also need to increment GeckoChildProcessHost::mChildCounter either with PR_AtomicIncrement or under the protection of the Mutex object. 3. I am not sure if the way you initialize the function static variables restoreOrigLogName and strdupLock is thread-safe. Please have someone who's familiar with the Mutex class to review that part of the patch. 4. Unless the Foo vs. FooImpl distinction is a well-known convention in the Mozilla code base, I'd document that in the header file (as opposed to the .cpp file). >+// >+// Wrapper function for handling GECKO_SEPARATE_NSPR_LOGS >+// > bool > GeckoChildProcessHost::PerformAsyncLaunch(std::vector<std::string> aExtraOpts, base::ProcessArchitecture arch) The comment should note that PerformAsyncLaunch does the actual work. 5. I'm not sure if you've eliminated all the copying of the aExtraOpts arguments. Please doublecheck that.
Attachment #516270 - Flags: approval2.0? → approval2.0-
My previous comment has a typo. (In reply to comment #76) > > 4. Unless the Foo vs. FooImpl distinction is a well-known convention > in the Mozilla code base, I'd document that in the header file (as > opposed to the .cpp file). > > >+// > >+// Wrapper function for handling GECKO_SEPARATE_NSPR_LOGS > >+// > > bool > > GeckoChildProcessHost::PerformAsyncLaunch(std::vector<std::string> aExtraOpts, base::ProcessArchitecture arch) > > The comment should note that PerformAsyncLaunch does the actual work. The last sentence should read: The comment should note that PerformAsyncLaunchImpl does the actual work.
> static Mutex strdupLock Isn't this a static initializer that we are trying to get rid of? We may rather use existing mMonitor of the class to protect. Entering it just ends up acquiring a normal PR_Lock, AFAIK. Are we sure this needs to be thread safe? What is the result of XRE_GetIOMessageLoop()? Is it always the same thread loop or might differ? > PerformAsyncLaunchImpl I would prefer PerformAsyncLaunchInternal.
(In reply to comment #78) > Isn't this a static initializer that we are trying to get rid of? Yeah, you can't use static XPCOM stuff.
OK, this hacky fix is not happening for FF4, and unless there's a Fennec platform that interleaves logs badly, it won't happen for Fennec either. I've opened a bug for moving to a more Chrome-like logging system. (bug 638375), which is what we should really shoot for.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
Here's a final version of the patch in case we wind up wanting a hack in a hurry at some point (might be worth having when the tree opens for Windows-based Fennec developers?) After talking with cjones, I'm confident we don't need a lock around the static variable modification and counter--they're only touched by the IO thread non-reentrantly. Contains the other suggested fixes in comment 76 and comment 78.
Attachment #516270 - Attachment is obsolete: true
Unless you are planning on fixing bug 638375 in the next 3 weeks, I'd really like this patch to land for Firefox 5, and we can take time to totally rework logging.
Oh, it's bad we won't fix this :( I then always have to push a log-fix patch of my own as I always did before. I hoped for at least some fix, landed right after FX4, even non-interleaved to make logging usable at all... But 638375 will take time before gets fixed...
Comment on attachment 516528 [details] [diff] [review] Allow separate NSPR logs v3: fixed version in case we need it r=wtc. You can reopen this bug and fix it as soon as mozilla-central reopens for Firefox 5 development. >+ // Append child-specific postfix to name >+ setChildLogName.AppendLiteral(".child-"); >+ setChildLogName.AppendInt(++mChildCounter); If you want the child log files to start from .child-0, you need to post-increment mChildCounter. Your call. >+ // Does the actual work for AsyncLaunch, on the IO thread. Typo: AsyncLaunch => PerformAsyncLaunch
Attachment #516528 - Flags: review+
(In reply to comment #72) RE: feedback on the interleaving patch, the following seems to me worth to note: > This patch turns off _PR_USE_STDIO_FOR_LOGGING for Win32 and OS/2. > I suspect this is because PR_LockFile doesn't work on stdio's FILE. That is one reason. The other was that stdio on win32 is not able to share the file for writing simply. Opening the same file by a child process caused weird corruption of the file content (null bytes etc.) appearing like a win32/stdio bug. But it might be caused by lack of locking over the file, or we just have to set some flag or call some function somewhere to make it work or it is simply a known problem.
Reopened for post FF4. Now we get to haggle over the right thing to implement. I'd suggest automatically turning on timestamps, perhaps with WTC's suggestion to create either a 'timestamp2' module (which I had called 'epoch'), or some sort of timestamp formatting. The former is easier for our purposes, so it gets my vote. We also still have the issue of whether we can get better timestamp resolution on Windows than provided by PR_Now. So far it doesn't look promising, but maybe there's some API I haven't heard of yet.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Honza: opening the log file in append mode and locking the log file may fix the weird corruption. Jason: using the logging level of the 'timestamp' module to control timestamp formatting is also worth considering. (I'm not sure if it'll work.) But sharing the log file is the most natural solution.
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.2
Blocks: 660158
Sheppy, I had a brief look for where to properly document this, but wasn't sure where the best place was. The implemented solution here added a GECKO_SEPARATE_NSPR_LOGS environment variable that causes .child-1234 (1234 being the child's process ID) to be appended to NSPR_LOG_FILE for any content/plugin/jetpack process launched. However, since this isn't an NSPR solution, it doesn't really fit in the docs for NSPR logging, so I couldn't find a place that satisfied me.
Keywords: dev-doc-needed
I added docs for HTTP logging: https://developer.mozilla.org/en/HTTP_Logging#Creating_separate_logs_for_child_processes_(electrolysis-only) I suppose we can just document it in other logging pages as needed?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: