Closed Bug 936987 Opened 11 years ago Closed 8 years ago

Catch and propagate PR_Close() error in CopyToNative()

Categories

(Firefox :: File Handling, defect)

All
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED

People

(Reporter: ishikawa, Assigned: ishikawa)

References

(Blocks 2 open bugs)

Details

(Keywords: dataloss, Whiteboard: [KEEP OPEN for another month or so while the clean up in Bug 936990 goes on, especially for initial feedback collection.])

Attachments

(1 file, 1 obsolete file)

In Bug 931720, a fix was applied to catch properly write()/read() and then return meaningful error. During the fix, I noticed PR_Close() was not handled. This can result in a data-loss error since PR_Close() can fail when the underlying system is a remote file system (CIFS and possibly NFS). I verified that a file mounted on a CIFS-share can cause PR_Close( basically a wrapper for close()) to return EHOSTDOWN error when I simulated the network outage by disabling network interface of a VM (VMPlayer running on Win7 host) in which linux is running. I am uploading a patch in the next post. Before the patch, when the network is disconnected, and the local write buffer/cache is flushed to the remote server upon close(), the error is noticed. write() did not produce an error. So, without the patch to check for the error of PR_Close(), the user is led to believe that the file move/copy done by CopyToNative() is successful while it may not. TIA
This is the patch. The original problem was noted by Bug 842267 - copy or move messages from IMAP folders to Local Folders on Linux Thunderbird 17.0.2 results in bad index file, fixed by Repair But I think this patch fixes only a part of the problem in 842267. Especially if other software has no problem in moving large files on CIFS, I think there are other issues other than this failure of checking the return value of PR_Close() by thunderbird alone. But this is a valid start. It fixes a valid data-loss case. TIA PS: BTW, I almost fainted when I realized that many routines failed to check the return value of PR_Close(). There are places where these are checked. There are places where these are NOT checked, and I am afraid that there are places where errors can certainly occur (as in CIFS, NFS, etc.) and so we should check the value and probably better off dumping the error code to alert the user of the flakey network condition.)
Attachment #829997 - Flags: review?(benjamin)
Assignee: nobody → ishikawa
Blocks: 936990
Severity: normal → critical
Keywords: dataloss
Blocks: 842267
Depends on: 931720
Comment on attachment 829997 [details] [diff] [review] Fix to handle the error from PR_Close() and return them properly. I'm supposing there isn't a simple way to automatedly test this? This method is rather complicated and is getting more complicated for this edge case.
Attachment #829997 - Flags: review?(benjamin) → review?(nfroyd)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #2) > Comment on attachment 829997 [details] [diff] [review] > Fix to handle the error from PR_Close() and return them properly. > > I'm supposing there isn't a simple way to automatedly test this? This method > is rather complicated and is getting more complicated for this edge case. No ,there is not a simple automated test case unless |make mozmill| test for TB, xpishell-tests, and other test harness for mozilla software have an easy way to cause or simulate network error conditions, which would cause close() to fail (against a file on a remote file share). I noticed that sqlite3 library source code has debug code to simulate I/O errors built in. I mean there are codes that simulates I/O errors by intentionally return error codes from the low-level routines that handles read/write/seek, etc. or return bogus data(and indeed some routines intentionally seek to a wrong position (off by one) for debug/testing purposes, for example) so that the developers can test the error handling, etc. of the sqlite3 library and application in the case of misbehaving storage devices and such. If the rock-solid characteristics are desired something like the approach of sqlite3 may need be pursued. (I recall linux has a purely software-written SCSI disk simulator driver that can be plugged into the kernel and accessed: the developer can simulate reliably various infrequent error conditions of a SCSI disk that happen only and its controller so that the block device driver can be tested against SCSI bus error conditions: timeout of command response, read/write failure due to bad block, etc. This is how the SCSI subsystem was tested in parallel with real-world devices that may show out-of-specification behaviors.) TB and FF OS need to handle locally created user data, and the data is precious. Users expect it to survive system failure and if the software fails, wants to know the occurence of the error immediately To protect data from failure, the code does get ugly sometimes, but I am afraid there is no other way out. Real hardware and network errors are ugly, and lowl-level routines ought to handle them. The patch as it stands is the minimal patch to handle the close() error case: I am merely trying to make sure that the error is caught and error code is returned. Currently, the error is not caught, and the user is led to believe the file copy succeeded but indeed may have failed. (If I inject code to re-try system calls in the face of EINTR and EAGAIN, it gets a little more complex. But I suspect that network-related libraries in mozilla does this already.) It is the nature of remote network file system that lets these network errors creeps into otherwise ordinary read/write/close system calls et all that trips naive coding style. TIA.
Comment on attachment 829997 [details] [diff] [review] Fix to handle the error from PR_Close() and return them properly. Review of attachment 829997 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, just some small fixes. r=me with the changes below. ::: xpcom/io/nsLocalFileUnix.cpp @@ +865,5 @@ > totalRead, totalWritten); > #endif > > + // DONE: Errors of close can occur. Read man page of > + // close(2); Nit: please delete this trailing whitespace. @@ +877,5 @@ > + if (PR_Close(newFD) < 0) { > + saved_write_close_error = NSRESULT_FOR_ERRNO(); > +#if DEBUG > + // This error merits printing. > + fprintf(stderr,"ERROR: PR_Close(newFD) returned error. errno = %d\n", errno); Nit: space between the comma and "ERROR: ...". @@ +884,5 @@ > + > + if (PR_Close(oldFD) < 0) { > + saved_read_close_error = NSRESULT_FOR_ERRNO(); > +#if DEBUG > + fprintf(stderr,"ERROR: PR_Close(oldFD) returned error. errno = %d\n", errno); Nit: space between the comma and "ERROR: ..." @@ +888,5 @@ > + fprintf(stderr,"ERROR: PR_Close(oldFD) returned error. errno = %d\n", errno); > +#endif > + } > + > + // Report read/write error first, then close error. Please remove this comment. @@ +903,5 @@ > MOZ_ASSERT(0); > #endif > } > + > + // Report close error if any. Report write close error first if any. Please remove this comment. @@ +907,5 @@ > + // Report close error if any. Report write close error first if any. > + if (saved_write_close_error != NS_OK) > + return saved_write_close_error; > + if (saved_read_close_error != NS_OK) > + return saved_read_close_error; Nit: please delete this trailing whitespace.
Attachment #829997 - Flags: review?(nfroyd) → review+
Thank you for the review. I reflected the issues in your comments. But while doing so, I noticed a few trailing blank: in the first line of the file (!), and a few other places, and so removed them. I hope you don't mind. I will put the checkin-needed keyword. But I will keep this bug open by putting in a request for another month or so, so that we can, as a mozilla community, find a general guideline for handling this type of errors. - The error code to return: it is clear in this particular case. - The manner to show the error: we depend on the upper layer routine to handles this with GUI interaction. (But I added fprintf() inside #if DEBUG, #endif because knowing the exact error during testing with DEBUG BUILD is very handy and important). There are other cases which BUG 936990 depends, which show different requirements, and so I want to see from a dozen examples to pick up the common requirements (yes, I know the number is small, but I can go on collecting others and change courses). I need some initial guesstimate for the would-be-guideline. Thank you again for the review.
Attachment #8334544 - Flags: review+
Attachment #829997 - Attachment is obsolete: true
Attachment #8334544 - Attachment description: Fix to handle the error from PR_Close() and return them properly. (Take 2) r=nfroyd+ → Fix to handle the error from PR_Close() and return them properly. (Take 2) carrying over r=nfroyd+
Attachment #8334544 - Flags: review+
Keywords: checkin-needed
Whiteboard: [KEEP OPEN for another month or so while the clean up in Bug 936990 goes on, especially for initial feedback collection.]
Product: Core → Firefox
Can this bug be resolved? I think it got lost a few years ago.
Flags: needinfo?(ishikawa)
Sorry I missed closing this. The proposed patch went in, and there was no negative feedback while this was left open after the patch went in. I think the code is better in catching the file system error when file attachment is saved, etc. now. I have made the status resolved/fixed. TIA
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(ishikawa)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: