Closed Bug 551344 Opened 15 years ago Closed 15 years ago

HTML5 parser review

Categories

(Core :: DOM: HTML Parser, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: sicking, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(9 files, 7 obsolete files)

(deleted), patch
sicking
: review+
Details | Diff | Splinter Review
(deleted), patch
sicking
: review+
Details | Diff | Splinter Review
(deleted), patch
sicking
: review+
Details | Diff | Splinter Review
(deleted), patch
sicking
: review+
Details | Diff | Splinter Review
(deleted), patch
sicking
: review+
Details | Diff | Splinter Review
(deleted), patch
sicking
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
sicking
: review+
Details | Diff | Splinter Review
(deleted), patch
sicking
: review+
Details | Diff | Splinter Review
Will put my review comments of the HTML5 parser in this bug.
nsHtml5Parser.cpp: nsHtml5Parser::ContinueInterruptedParsing() 184 if (mExecutor->IsFlushing()) { 185 // A nested event loop dequeued the continue event and there aren't 186 // scripts executing. What's currently causing the flush is running to 187 // completion or there will be a script later and the script will cause 188 // another continue event. 189 return NS_OK; 190 } Could this not also happen if for example a timeout causes flushing? nsHtml5Parser::Parse 265 // Return early if the parser has processed EOF 266 if (!mExecutor->HasStarted()) { 267 NS_ASSERTION(!mStreamParser, 268 "Had stream parser but document.write started life cycle."); 269 mExecutor->SetParser(this); 270 mTreeBuilder->setScriptingEnabled(mExecutor->IsScriptEnabled()); 271 mTokenizer->start(); 272 mExecutor->Start(); 273 /* 274 * If you move the following line, be very careful not to cause 275 * WillBuildModel to be called before the document has had its 276 * script global object set. 277 */ 278 mExecutor->WillBuildModel(eDTDMode_unknown); 279 } The initial comment doesn't seem to be related to what happens inside the if-statment. 294 NS_PRECONDITION(IsInsertionPointDefined(), 295 "Document.write called when insertion point not defined."); 296 297 NS_PRECONDITION(!(mStreamParser && !aKey), "Got a null key in a non-script-created parser"); Don't use NS_PRECONDITION for tests that happen in the middle of the function. NS_ASSERTION is more appropriate here. Also, shouldn't the first assertion say that nsHtml5Parser::Parse was called. document.write being called is not assertion-worthy since web pages can cause that. There's quite a few lines longer than 80 chars in this file. 303 nsRefPtr<nsHtml5UTF16Buffer> buffer = new nsHtml5UTF16Buffer(aSourceBuffer.Length()); 304 memcpy(buffer->getBuffer(), aSourceBuffer.BeginReading(), aSourceBuffer.Length() * sizeof(PRUnichar)); 305 buffer->setEnd(aSourceBuffer.Length()); It's very unfortunate to allocate a new buffer and copy the memory every time document.write is called. We should really only do that when needed, which is only when a <script> is written, right? Since the old parser does that maybe leave this as is for now, but please do file a bug on it. 317 if (aKey) { // after document.open, the first level of document.write has null key 318 while (searchBuf != mLastBuffer) { 319 if (searchBuf->key == aKey) { 320 // found a key holder 321 // now insert the new buffer between the previous buffer 322 // and the key holder. 323 buffer->next = searchBuf; 324 if (prevSearchBuf) { 325 prevSearchBuf->next = buffer; 326 } else { 327 mFirstBuffer = buffer; 328 } 329 break; 330 } 331 prevSearchBuf = searchBuf; 332 searchBuf = searchBuf->next; 333 } 334 if (searchBuf == mLastBuffer) { 335 // key was not found 336 nsHtml5UTF16Buffer* keyHolder = new nsHtml5UTF16Buffer(aKey); 337 keyHolder->next = mFirstBuffer; 338 buffer->next = keyHolder; 339 mFirstBuffer = buffer; 340 } Ugh, the fact that this list is a mix of a magic mLastBuffer buffer, special key-holding buffers, and normal text-holding buffers is pretty ugly. It took me a while to figure out what's going on. Why not make the list just hold normal buffers that all point to a key, and make the list terminate with nsnull? Also, is there a reason these buffers are refcounted? It seems like they are always owned by the list. 356 if (!mBlocked) { 357 // mExecutor->WillResume(); 358 while (buffer->hasMore()) { 359 buffer->adjust(mLastWasCR); 360 mLastWasCR = PR_FALSE; 361 if (buffer->hasMore()) { You can rewrite the loop to |while (!mBlocked && buffer->hasMore()) {| right? Can mLastWasCR ever be true other than before the first iteration? If not just move the adjust call to outside the loop. And please add a comment stating why WillResume/WillInterrupt calls aren't needed, and remove the commented out calls. 368 // we aren't the root context, so save the line number on the 369 // *stack* so that we can restore it. 370 lineNumberSave = mTokenizer->getLineNumber(); Don't we need to put the tokenizer in a mode where it doesn't increase the line number at all? Otherwise you'll end up with a weird line number for document.written stuff? 395 if (!mBlocked) { // buffer was tokenized to completion 396 NS_ASSERTION(!buffer->hasMore(), "Buffer wasn't tokenized to completion?"); 397 // Scripting semantics require a forced tree builder flush here 398 mTreeBuilder->flushCharacters(); // Flush trailing characters 399 mTreeBuilder->Flush(); // Move ops to the executor Please make Flush() call flushCharacters(). Even if might result in a couple of extra calls to flushCharacters it's unlikely to matter perf wise. And it'll reduce the risk of people forgetting to call flushCharacters causing bugs. 416 nsHtml5Parser::Terminate(void) Please remove the 'void'. We don't use that convention anywhere else. nsHtml5Parser::ParseFragment: 469 mExecutor->SetNodeInfoManager(target->GetOwnerDoc()->NodeInfoManager()); Just use doc->NodeInfoManager() 461 nsCOMPtr<nsISupports> container = doc->GetContainer(); 462 // Can be null if owner document is synthetic I have a feeling that we want to just use a null container for fragment parsing. 476 mTreeBuilder->setScriptingEnabled(mExecutor->IsScriptEnabled()); 477 mTokenizer->start(); 478 mExecutor->Start(); // Don't call WillBuildModel in fragment cas Please consistently use FirstCharUpperCase convention when naming functions. 479 if (!aSourceBuffer.IsEmpty()) { 480 PRBool lastWasCR = PR_FALSE; 481 nsHtml5UTF16Buffer buffer(aSourceBuffer.Length()); 482 memcpy(buffer.getBuffer(), aSourceBuffer.BeginReading(), aSourceBuffer.Length() * sizeof(PRUnichar)); 483 buffer.setEnd(aSourceBuffer.Length()); 484 while (buffer.hasMore()) { 485 buffer.adjust(lastWasCR); 486 lastWasCR = PR_FALSE; 487 if (buffer.hasMore()) { 488 lastWasCR = mTokenizer->tokenizeBuffer(&buffer); 489 } 490 } 491 } When can we ever return from the tokenizer without consuming the whole buffer? Scripts getting parsed shouldn't stop the parser here, no? 535 nsHtml5Parser::CanInterrupt() 536 { 537 return !mExecutor->IsFragmentMode(); 538 } Shouldn't this also return false during document.write? nsHtml5Parser::ParseUntilBlocked 580 if (mBlocked) { 581 return; 582 } 583 584 if (mExecutor->IsComplete()) { 585 return; 586 } Combine these. 597 if (mDocumentClosed) { 598 NS_ASSERTION(!mStreamParser, 599 "This should only happen with script-created parser."); 600 mTokenizer->eof(); 601 mTreeBuilder->StreamEnded(); 602 mTreeBuilder->Flush(); 603 mExecutor->Flush(PR_TRUE); 604 mTokenizer->end(); 605 return; 606 } else { else-after-return 607 // never release the last buffer. 608 NS_ASSERTION(!mLastBuffer->getStart(), 609 "Sentinel buffer had its indeces changed."); 610 NS_ASSERTION(!mLastBuffer->getEnd(), 611 "Sentinel buffer had its indeces changed."); Combine these 621 } 622 return; // no more data for now but expecting more 623 } 624 } else { Once you fix the above else-after-return, then this will be an else-after-return too. 634 // now we have a non-empty buffer 635 mFirstBuffer->adjust(mLastWasCR); 636 mLastWasCR = PR_FALSE; 637 if (mFirstBuffer->hasMore()) { 638 PRBool inRootContext = (!mStreamParser && (mFirstBuffer->key == mRootContextKey)); 639 if (inRootContext) { 640 mTokenizer->setLineNumber(mRootContextLineNumber); 641 } Don't you need the 'else' part for the line number handling that we have in nsHtml5Parser::Parse?
(line numbers based on revision 6f8f9de6efb7, i think)
(In reply to comment #1) > nsHtml5Parser.cpp: > > > nsHtml5Parser::ContinueInterruptedParsing() > > 184 if (mExecutor->IsFlushing()) { > 185 // A nested event loop dequeued the continue event and there aren't > 186 // scripts executing. What's currently causing the flush is running to > 187 // completion or there will be a script later and the script will cause > 188 // another continue event. > 189 return NS_OK; > 190 } > > Could this not also happen if for example a timeout causes flushing? As far as I can tell, a timeout can't cause flushing of the HTML5 parser, because document.write() from a timeout implies document.open() and there are no content-initiated flushes in the HTML5 parser. (Note that this code is getting rearranged in bug 543458.) > nsHtml5Parser::Parse > > 265 // Return early if the parser has processed EOF > 266 if (!mExecutor->HasStarted()) { > 267 NS_ASSERTION(!mStreamParser, > 268 "Had stream parser but document.write started life > cycle."); > 269 mExecutor->SetParser(this); > 270 mTreeBuilder->setScriptingEnabled(mExecutor->IsScriptEnabled()); > 271 mTokenizer->start(); > 272 mExecutor->Start(); > 273 /* > 274 * If you move the following line, be very careful not to cause > 275 * WillBuildModel to be called before the document has had its > 276 * script global object set. > 277 */ > 278 mExecutor->WillBuildModel(eDTDMode_unknown); > 279 } > The initial comment doesn't seem to be related to what happens inside the > if-statment. Moved the comment to the right place. > 294 NS_PRECONDITION(IsInsertionPointDefined(), > 295 "Document.write called when insertion point not > defined."); > 296 > 297 NS_PRECONDITION(!(mStreamParser && !aKey), "Got a null key in a > non-script-created parser"); > > Don't use NS_PRECONDITION for tests that happen in the middle of the function. > NS_ASSERTION is more appropriate here. Changed to NS_ASSERTION. > Also, shouldn't the first assertion say > that nsHtml5Parser::Parse was called. document.write being called is not > assertion-worthy since web pages can cause that. The point is that it's assertion-worthy if document.write() reaches all the way to the backing Parse() method in that case. Changed the wording. > There's quite a few lines longer than 80 chars in this file. Fixed in this file. Also filed bug 551939. > 303 nsRefPtr<nsHtml5UTF16Buffer> buffer = new > nsHtml5UTF16Buffer(aSourceBuffer.Length()); > 304 memcpy(buffer->getBuffer(), aSourceBuffer.BeginReading(), > aSourceBuffer.Length() * sizeof(PRUnichar)); > 305 buffer->setEnd(aSourceBuffer.Length()); > > It's very unfortunate to allocate a new buffer and copy the memory every time > document.write is called. We should really only do that when needed, which is > only when a <script> is written, right? Only when an external script is written, but yeah. > Since the old parser does that maybe > leave this as is for now, but please do file a bug on it. Filed bug 551942. > 317 if (aKey) { // after document.open, the first level of document.write has > null key > 318 while (searchBuf != mLastBuffer) { > 319 if (searchBuf->key == aKey) { > 320 // found a key holder > 321 // now insert the new buffer between the previous buffer > 322 // and the key holder. > 323 buffer->next = searchBuf; > 324 if (prevSearchBuf) { > 325 prevSearchBuf->next = buffer; > 326 } else { > 327 mFirstBuffer = buffer; > 328 } > 329 break; > 330 } > 331 prevSearchBuf = searchBuf; > 332 searchBuf = searchBuf->next; > 333 } > 334 if (searchBuf == mLastBuffer) { > 335 // key was not found > 336 nsHtml5UTF16Buffer* keyHolder = new nsHtml5UTF16Buffer(aKey); > 337 keyHolder->next = mFirstBuffer; > 338 buffer->next = keyHolder; > 339 mFirstBuffer = buffer; > 340 } > > Ugh, the fact that this list is a mix of a magic mLastBuffer buffer, special > key-holding buffers, and normal text-holding buffers is pretty ugly. It took me > a while to figure out what's going on. Yeah, document.write() order management isn't pretty. :-( > Why not make the list just hold normal > buffers that all point to a key, My recollection is that I had that design first. Then to fix a problem with multiple document.write() calls and nested document.write()s, I decided I needed keyholders to mark points between the actual buffers. Unfortunately, I can't recall what problem exactly I was solving, so I can't say if a simpler solution exists. I used the same buffer class for the key holders, because the code that iterates forward in the list conveniently eats up the key holders when they appear as empty buffers. > and make the list terminate with nsnull? With the sentinel buffer, there's a place for the root context key should it ever be non-null. Having a sentinel object also keeps the queue management code more similar to the queue management code in nsHtml5StreamParser, which puts data in the last buffer but recycles the buffer without releasing it. > Also, is there a reason these buffers are refcounted? It seems like they are > always owned by the list. They used to be non-refcounted. The same buffer class is used by nsStreamParser and there they do need to be refcounted to keep data around as long as there's a pending speculation that points to the buffer list in case the speculation fails and the stream has to be rewound. > 356 if (!mBlocked) { > 357 // mExecutor->WillResume(); > 358 while (buffer->hasMore()) { > 359 buffer->adjust(mLastWasCR); > 360 mLastWasCR = PR_FALSE; > 361 if (buffer->hasMore()) { > > You can rewrite the loop to |while (!mBlocked && buffer->hasMore()) {| right? Yes. Fixed. > Can mLastWasCR ever be true other than before the first iteration? If not just > move the adjust call to outside the loop. Yes, mLastWasCR changing during the loop is a crucial aspect of the loop. The line mLastWasCR = mTokenizer->tokenizeBuffer(buffer); sets it inside the loop when the tokenizer finds a CR and needs the outer code to skip over an LF immediately after. The LF could be in different buffer, so it makes sense to do the LF skipping outside the tokenizer. (CRLF is such a pain.) > And please add a comment stating why WillResume/WillInterrupt calls aren't > needed, and remove the commented out calls. These are going away in bug 543458. > 368 // we aren't the root context, so save the line number on the > 369 // *stack* so that we can restore it. > 370 lineNumberSave = mTokenizer->getLineNumber(); > > Don't we need to put the tokenizer in a mode where it doesn't increase the line > number at all? Otherwise you'll end up with a weird line number for > document.written stuff? The old parser ends up with weird line numbers, too, so this isn't any worse than the status quo. The correct fix would be making the JS engine pass the line number of the document.write() call to the parser and make the parser associate document.written scripts and styles with that line number (and file). I haven't introduced a new tokenizer mode, because the correct fix wouldn't need it. > 395 if (!mBlocked) { // buffer was tokenized to completion > 396 NS_ASSERTION(!buffer->hasMore(), "Buffer wasn't tokenized to > completion?"); > 397 // Scripting semantics require a forced tree builder flush here > 398 mTreeBuilder->flushCharacters(); // Flush trailing characters > 399 mTreeBuilder->Flush(); // Move ops to the executor > > Please make Flush() call flushCharacters(). Even if might result in a couple of > extra calls to flushCharacters it's unlikely to matter perf wise. And it'll > reduce the risk of people forgetting to call flushCharacters causing bugs. Fixed. > 416 nsHtml5Parser::Terminate(void) > > Please remove the 'void'. We don't use that convention anywhere else. Fixed. (Copied from nsParser, FWIW.) > nsHtml5Parser::ParseFragment: > > 469 mExecutor->SetNodeInfoManager(target->GetOwnerDoc()->NodeInfoManager()); > > Just use doc->NodeInfoManager() Fixed. > 461 nsCOMPtr<nsISupports> container = doc->GetContainer(); > 462 // Can be null if owner document is synthetic > > I have a feeling that we want to just use a null container for fragment > parsing. Changed. > 476 mTreeBuilder->setScriptingEnabled(mExecutor->IsScriptEnabled()); > 477 mTokenizer->start(); > 478 mExecutor->Start(); // Don't call WillBuildModel in fragment cas > > Please consistently use FirstCharUpperCase convention when naming functions. We decided long ago it wasn't worthwhile to make the translator munge Java naming conventions to Mozilla C++ conventions. start() on the tokenizer is translated from Java. > 479 if (!aSourceBuffer.IsEmpty()) { > 480 PRBool lastWasCR = PR_FALSE; > 481 nsHtml5UTF16Buffer buffer(aSourceBuffer.Length()); > 482 memcpy(buffer.getBuffer(), aSourceBuffer.BeginReading(), > aSourceBuffer.Length() * sizeof(PRUnichar)); > 483 buffer.setEnd(aSourceBuffer.Length()); > 484 while (buffer.hasMore()) { > 485 buffer.adjust(lastWasCR); > 486 lastWasCR = PR_FALSE; > 487 if (buffer.hasMore()) { > 488 lastWasCR = mTokenizer->tokenizeBuffer(&buffer); > 489 } > 490 } > 491 } > > When can we ever return from the tokenizer without consuming the whole buffer? > Scripts getting parsed shouldn't stop the parser here, no? The tokenizer returns early if it saw a CR or if the tree builder saw a </script>. Returning from the tokenizer on CR and making it the responsibility of the outer code to call adjust() on the buffer makes CRLF coalescing work right across buffer boundaries including crazy document.write boundaries and makes the tokenizer internals much nicer that they'd be if the tokenizer had to coalesce CRLF on its own. For consistency and simplicity, the tree builder makes the tokenizer return on </script> even in the fragment case. This is harmless, since the while loop just calls the tokenizer again. > 535 nsHtml5Parser::CanInterrupt() > 536 { > 537 return !mExecutor->IsFragmentMode(); > 538 } > > Shouldn't this also return false during document.write? In the revision you read, it doesn't matter what this returns. With the patch for bug 543458, this could always return PR_TRUE, since the value affects the behavior of nsContentSink::DidProcessATokenImpl() which isn't called by the HTML5 parser in the uninterruptible cases. Changed to return PR_TRUE always. > nsHtml5Parser::ParseUntilBlocked > > 580 if (mBlocked) { > 581 return; > 582 } > 583 > 584 if (mExecutor->IsComplete()) { > 585 return; > 586 } > > Combine these. Can't combine these any longer after the patch for bug 543458, because the "IsComplete" case now does more stuff than the blocked case. In bug 543458, there are three return conditions that could be combined, but keeping them separate makes it nicer to have distinct comments for each condition. (I'm expecting the compiler to produce the same opt code either way.) > 597 if (mDocumentClosed) { > 598 NS_ASSERTION(!mStreamParser, > 599 "This should only happen with script-created > parser."); > 600 mTokenizer->eof(); > 601 mTreeBuilder->StreamEnded(); > 602 mTreeBuilder->Flush(); > 603 mExecutor->Flush(PR_TRUE); > 604 mTokenizer->end(); > 605 return; > 606 } else { > > else-after-return Fixed. > 607 // never release the last buffer. > 608 NS_ASSERTION(!mLastBuffer->getStart(), > 609 "Sentinel buffer had its indeces changed."); > 610 NS_ASSERTION(!mLastBuffer->getEnd(), > 611 "Sentinel buffer had its indeces changed."); > > Combine these Fixed. > 621 } > 622 return; // no more data for now but expecting more > 623 } > 624 } else { > > Once you fix the above else-after-return, then this will be an > else-after-return too. Fixed. > 634 // now we have a non-empty buffer > 635 mFirstBuffer->adjust(mLastWasCR); > 636 mLastWasCR = PR_FALSE; > 637 if (mFirstBuffer->hasMore()) { > 638 PRBool inRootContext = (!mStreamParser && (mFirstBuffer->key == > mRootContextKey)); > 639 if (inRootContext) { > 640 mTokenizer->setLineNumber(mRootContextLineNumber); > 641 } > > Don't you need the 'else' part for the line number handling that we have in > nsHtml5Parser::Parse? The line numbers are so bogus in that case anyway that it's not worthwhile to make them bogus but in a different way.
Depends on: 539887, 543458
Comment on attachment 432131 [details] [diff] [review] Address review comments in nsHtml5Parser.cpp Oops. I need to tweak the patch a bit.
Attachment #432131 - Attachment is obsolete: true
Attachment #432131 - Flags: review?(jonas)
nsHtml5TreeOpExecutor.cpp Again there are lots of lines over 80 chars wide. I'd say for now lets leave that as it's harder to review patches that contain whitespace cleanup together with actual changes. But we should at some point go through and fix up this. And please make sure to not use over long lines in new code. nsHtml5TreeOpExecutor::InitializeStatics 101 if (sTreeOpQueueMinLength <= 0) { 102 sTreeOpQueueMinLength = 200; 103 } 104 if (sTreeOpQueueLengthLimit < sTreeOpQueueMinLength) { 105 sTreeOpQueueLengthLimit = sTreeOpQueueMinLength; 106 } 107 if (sTreeOpQueueMaxLength < sTreeOpQueueMinLength) { 108 sTreeOpQueueMaxLength = sTreeOpQueueMinLength; 109 } 110 if (sTreeOpQueueMaxTime <= 0) { 111 sTreeOpQueueMaxTime = 200; 112 } These adjustments won't help very much given that if the pref is changed they won't be rerun, right? Maybe you should install a observer for "html5.opqueue" and always update the prefs and run these adjustments if anything under that changes. nsHtml5TreeOpExecutor::DidBuildModel 139 if (!aTerminated) { 140 // Break out of update batch if we are in one 141 // and aren't forcibly terminating 142 EndDocUpdate(); For my own understanding, why should we not exit the current patch if we are forcably terminated? When can we get terminated during a bach, aren't we preventing scripts from running during a batch? 153 // This is comes from nsXMLContentSink Remove the "is". 180 nsHtml5TreeOpExecutor::WillInterrupt() 181 { 182 return WillInterruptImpl(); 183 } So it sounds like all the WillInterrupt stuff is going away from html5 parsing, is this correct? nsHtml5TreeOpExecutor::SetDocumentCharset 207 if (mDocShell) { 208 // the following logic to get muCV is copied from 209 // nsHTMLDocument::StartDocumentLoad 210 // We need to call muCV->SetPrevDocCharacterSet here in case 211 // the charset is detected by parser DetectMetaTag 212 nsCOMPtr<nsIMarkupDocumentViewer> muCV; Please call this variable 'mucv' or some such. As it's named now it looks confusingly similar to an mFoo member variable. nsHtml5TreeOpExecutor::UpdateStyleSheet This function should probably be renamed to HandleLinkElement or some such. 277 // Break out of the doc update created by Flush() to zap a runnable 278 // waiting to call UpdateStyleSheet without the right observer This comment doesn't make sense to me. Sounds like someone is intentionally waiting to "call UpdateStyleSheet without the right observer". nsHtml5TreeOpExecutor::Flush 380 if (aForceWholeQueue) { 381 if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) { 382 flushStart = PR_IntervalNow(); // compute averages only if enough ops 383 } 384 } else { 385 if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) { 386 flushStart = PR_IntervalNow(); // compute averages only if enough ops 387 if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueLengthLimit) { 388 numberOfOpsToFlush = (PRUint32)sTreeOpQueueLengthLimit; 389 reflushNeeded = PR_TRUE; 390 } 391 } 392 } You can simplify this to: if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) { flushStart = PR_IntervalNow(); // compute averages only if enough ops if (!aForceWholeQueue && numberOfOpsToFlush > (PRUint32)sTreeOpQueueLengthLimit) { numberOfOpsToFlush = (PRUint32)sTreeOpQueueLengthLimit; reflushNeeded = PR_TRUE; } } Though see below. Editorial: There is whitespace at the end of line 383 here. 398 for (nsHtml5TreeOperation* iter = (nsHtml5TreeOperation*)start; iter < end; ++iter) { 399 if (NS_UNLIKELY(!mParser)) { 400 // The previous tree op caused a call to nsIParser::Terminate(); 401 break; 402 } How could we end up terminating here? Isn't the idea that we won't be running script while inside the doc-update? Is this due to charset switching or something else? 413 if (flushStart) { 414 PRUint32 delta = PR_IntervalToMilliseconds(PR_IntervalNow() - flushStart); 415 sTreeOpQueueLengthLimit = delta ? 416 (PRUint32)(((PRUint64)sTreeOpQueueMaxTime * (PRUint64)numberOfOpsToFlush) 417 / delta) : 418 sTreeOpQueueMaxLength; // if the delta is less than one ms, use max 419 if (sTreeOpQueueLengthLimit < sTreeOpQueueMinLength) { 420 // both are signed and sTreeOpQueueMinLength is always positive, so this 421 // also takes care of the theoretical overflow of sTreeOpQueueLengthLimit 422 sTreeOpQueueLengthLimit = sTreeOpQueueMinLength; 423 } 424 if (sTreeOpQueueLengthLimit > sTreeOpQueueMaxLength) { 425 sTreeOpQueueLengthLimit = sTreeOpQueueMaxLength; 426 } 427 #ifdef DEBUG_NS_HTML5_TREE_OP_EXECUTOR_FLUSH 428 printf("FLUSH DURATION (millis): %d\n", delta); 429 printf("QUEUE NEW MAX LENGTH: %d\n", sTreeOpQueueLengthLimit); 430 #endif 431 } I'm definitely nervous about changing the way we back off to the main event loop. Specifically things like checking if there are pending user events seems like a good idea but are no longer happening. While we can't duplicate exactly what the old parser does, since we can't have callbacks for each parsed token, I don't see why we couldn't keep approximately the same strategy. I.e. start measuring time when entering nsHtml5TreeOpExecutor::Flush or some such, then treat iteration through the "perform" loop do what we currently do in nsContentSink::DidProcessATokenImpl. I need to review flushing more in detail to see how well this matches. I definitely agree that we likely will want to change things there, but I'd rather do that orthogonally to changing the parser so that we don't risk improvements in one patch masking regressions in another. We also don't have a good way to measure responsiveness automatically, so messing around with this stuff means that we need to put in a lot more manual QAing. 441 if (scriptElement) { 442 NS_ASSERTION(!reflushNeeded, "Got scriptElement when queue not fully flushed."); 443 RunScript(scriptElement); // must be tail call when mFlushState is eNotFlushing What's preventing that assertion from firing? I.e. what is preventing finding a script to execute after we've set reflushNeeded to true. nsHtml5TreeOpExecutor::ProcessBASETag(nsIContent* aContent) 459 if (mHasProcessedBase) { 460 return NS_OK; 461 } 462 mHasProcessedBase = PR_TRUE; Hmm.. won't this mean that: <base target="..."> <base href="..."> won't work? I.e. we'll ignore the second base element here, despite that being the first one with a href. Steps 3 and 4 of http://www.whatwg.org/specs/web-apps/current-work/#document-base-url seems to indicate otherwise. Same if the elements are reversed but for the target attribute. nsHtml5TreeOpExecutor::DocumentMode Maybe call this SetDocumentMode or UpdateDocumentMode or some such.
(In reply to comment #7) > nsHtml5TreeOpExecutor.cpp > > Again there are lots of lines over 80 chars wide. I'd say for now lets leave > that as it's harder to review patches that contain whitespace cleanup together > with actual changes. But we should at some point go through and fix up this. > And please make sure to not use over long lines in new code. OK. (FWIW, there's also bug 523334.) > nsHtml5TreeOpExecutor::InitializeStatics > > 101 if (sTreeOpQueueMinLength <= 0) { > 102 sTreeOpQueueMinLength = 200; > 103 } > 104 if (sTreeOpQueueLengthLimit < sTreeOpQueueMinLength) { > 105 sTreeOpQueueLengthLimit = sTreeOpQueueMinLength; > 106 } > 107 if (sTreeOpQueueMaxLength < sTreeOpQueueMinLength) { > 108 sTreeOpQueueMaxLength = sTreeOpQueueMinLength; > 109 } > 110 if (sTreeOpQueueMaxTime <= 0) { > 111 sTreeOpQueueMaxTime = 200; > 112 } > > These adjustments won't help very much given that if the pref is changed they > won't be rerun, right? Maybe you should install a observer for "html5.opqueue" > and always update the prefs and run these adjustments if anything under that > changes. This code goes away in bug 543458. > nsHtml5TreeOpExecutor::DidBuildModel > > 139 if (!aTerminated) { > 140 // Break out of update batch if we are in one > 141 // and aren't forcibly terminating > 142 EndDocUpdate(); > > For my own understanding, why should we not exit the current patch if we are > forcably terminated? This is needed to avoid unblocking loads too many times on one hand and on the other hand to avoid destroying the frame constructor from within an update batch. See bug 537683. Rewrote the comment to say so. > When can we get terminated during a bach, aren't we > preventing scripts from running during a batch? In the normal EOF case, we have aTerminate set to false and we are in a batch here. <rant>As for aTerminate set to true, the immediate termination semantics of nsIParser::Terminate() have been a huge pain. At least the HTML5 app cache stuff can decide to terminate the parser mid-flight (I don't remember in what situation exactly). Bug 531106 and bug 537683 have interensting re-entrancy scenarios.</rant> > 153 // This is comes from nsXMLContentSink > > Remove the "is". Fixed. > 180 nsHtml5TreeOpExecutor::WillInterrupt() > 181 { > 182 return WillInterruptImpl(); > 183 } > > So it sounds like all the WillInterrupt stuff is going away from html5 parsing, > is this correct? Yes. This has become NS_IMETHODIMP nsHtml5TreeOpExecutor::WillInterrupt() { NS_NOTREACHED("Don't call. For interface compat only."); return NS_ERROR_NOT_IMPLEMENTED; } in bug 543458. > nsHtml5TreeOpExecutor::SetDocumentCharset > > 207 if (mDocShell) { > 208 // the following logic to get muCV is copied from > 209 // nsHTMLDocument::StartDocumentLoad > 210 // We need to call muCV->SetPrevDocCharacterSet here in case > 211 // the charset is detected by parser DetectMetaTag > 212 nsCOMPtr<nsIMarkupDocumentViewer> muCV; > > Please call this variable 'mucv' or some such. As it's named now it looks > confusingly similar to an mFoo member variable. This comes from nsHTMLDocument::StartDocumentLoad, but OK. Changed. > nsHtml5TreeOpExecutor::UpdateStyleSheet > > This function should probably be renamed to HandleLinkElement or some such. UpdateStyleSheet is consistent with the old sinks and the XSLT code. > 277 // Break out of the doc update created by Flush() to zap a runnable > 278 // waiting to call UpdateStyleSheet without the right observer > > This comment doesn't make sense to me. Sounds like someone is intentionally > waiting to "call UpdateStyleSheet without the right observer". Indeed, code on the content tree side has questionable behavior that relies on the parser extinguishing a useless runnable here. I'd like to fix the content tree side eventually but with the content tree side the way it is, this hack is needed to avoid parsing <style> contents twice. > nsHtml5TreeOpExecutor::Flush > > 380 if (aForceWholeQueue) { > 381 if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) { > 382 flushStart = PR_IntervalNow(); // compute averages only if enough ops > 383 } > 384 } else { > 385 if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) { > 386 flushStart = PR_IntervalNow(); // compute averages only if enough ops > 387 if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueLengthLimit) { > 388 numberOfOpsToFlush = (PRUint32)sTreeOpQueueLengthLimit; > 389 reflushNeeded = PR_TRUE; > 390 } > 391 } > 392 } > > You can simplify this to: > > if (numberOfOpsToFlush > (PRUint32)sTreeOpQueueMinLength) { > flushStart = PR_IntervalNow(); // compute averages only if enough ops > if (!aForceWholeQueue && > numberOfOpsToFlush > (PRUint32)sTreeOpQueueLengthLimit) { > numberOfOpsToFlush = (PRUint32)sTreeOpQueueLengthLimit; > reflushNeeded = PR_TRUE; > } > } > > Though see below. > > Editorial: There is whitespace at the end of line 383 here. This code went away in bug 543458. > 398 for (nsHtml5TreeOperation* iter = (nsHtml5TreeOperation*)start; iter < > end; ++iter) { > 399 if (NS_UNLIKELY(!mParser)) { > 400 // The previous tree op caused a call to nsIParser::Terminate(); > 401 break; > 402 } > > How could we end up terminating here? Isn't the idea that we won't be running > script while inside the doc-update? Is this due to charset switching or > something else? At least the HTML5 app cache can terminate. That said, nsIParser::Terminate() has been a source of bugs that felt endless, so I can't remember what else could call nsIParser::Terminate() here. (I failed to record all the cases I had seen.) > 413 if (flushStart) { > 414 PRUint32 delta = PR_IntervalToMilliseconds(PR_IntervalNow() - > flushStart); > 415 sTreeOpQueueLengthLimit = delta ? > 416 (PRUint32)(((PRUint64)sTreeOpQueueMaxTime * > (PRUint64)numberOfOpsToFlush) > 417 / delta) : > 418 sTreeOpQueueMaxLength; // if the delta is less than one ms, use max > 419 if (sTreeOpQueueLengthLimit < sTreeOpQueueMinLength) { > 420 // both are signed and sTreeOpQueueMinLength is always positive, so > this > 421 // also takes care of the theoretical overflow of > sTreeOpQueueLengthLimit > 422 sTreeOpQueueLengthLimit = sTreeOpQueueMinLength; > 423 } > 424 if (sTreeOpQueueLengthLimit > sTreeOpQueueMaxLength) { > 425 sTreeOpQueueLengthLimit = sTreeOpQueueMaxLength; > 426 } > 427 #ifdef DEBUG_NS_HTML5_TREE_OP_EXECUTOR_FLUSH > 428 printf("FLUSH DURATION (millis): %d\n", delta); > 429 printf("QUEUE NEW MAX LENGTH: %d\n", sTreeOpQueueLengthLimit); > 430 #endif > 431 } > > I'm definitely nervous about changing the way we back off to the main event > loop. Specifically things like checking if there are pending user events seems > like a good idea but are no longer happening. > > While we can't duplicate exactly what the old parser does, since we can't have > callbacks for each parsed token, I don't see why we couldn't keep approximately > the same strategy. I.e. start measuring time when entering > nsHtml5TreeOpExecutor::Flush or some such, then treat iteration through the > "perform" loop do what we currently do in nsContentSink::DidProcessATokenImpl. > I need to review flushing more in detail to see how well this matches. > > I definitely agree that we likely will want to change things there, but I'd > rather do that orthogonally to changing the parser so that we don't risk > improvements in one patch masking regressions in another. We also don't have a > good way to measure responsiveness automatically, so messing around with this > stuff means that we need to put in a lot more manual QAing. Bug 543458 makes all the above code go away and uses nsContentSink::DidProcessATokenImpl instead. > 441 if (scriptElement) { > 442 NS_ASSERTION(!reflushNeeded, "Got scriptElement when queue not fully > flushed."); > 443 RunScript(scriptElement); // must be tail call when mFlushState is > eNotFlushing > > What's preventing that assertion from firing? I.e. what is preventing finding a > script to execute after we've set reflushNeeded to true. reflushNeeded goes away in bug 543458. Sorry about not advertising bug 543458 to you in advance. I thought you'd review the unreviewed Java-based parser core first instead of re-reviewing the hand-written C++ code. As for reviewing the Java-based code, I'd like to highlight bug 489820 and bug 483209 as having a pending patch (that I'm provisionally treating as rs=sicking). Other than that, P3, P4 and P5 "[HTML5]" bugs document a lot of known things that could benefit from improvement eventually. > nsHtml5TreeOpExecutor::ProcessBASETag(nsIContent* aContent) > > 459 if (mHasProcessedBase) { > 460 return NS_OK; > 461 } > 462 mHasProcessedBase = PR_TRUE; > > Hmm.. won't this mean that: > > <base target="..."> > <base href="..."> > > won't work? I.e. we'll ignore the second base element here, despite that being > the first one with a href. Steps 3 and 4 of > http://www.whatwg.org/specs/web-apps/current-work/#document-base-url seems to > indicate otherwise. Same if the elements are reversed but for the target > attribute. In the Java code, I have attempted to conform to HTML5. In the hand-written Gecko integration C++ code, I have been trying to retain compatibility with existing Gecko behavior until the HTML5 parser is on by default. In this case, I have copied the behavior of the XML content sink without trying to change it to be HTML5-compliant. > nsHtml5TreeOpExecutor::DocumentMode > > Maybe call this SetDocumentMode or UpdateDocumentMode or some such. Changed to SetDocumentMode.
Attachment #432525 - Flags: review?(jonas)
I realized that it would be nice to put all this code into a mozilla::html5 parser namespace. That should should reduce concerns about long identifiers a bit. Definitely not a requirement though. Why do we need nsHtml5DocumentMode? nsCompatibility.h has an equivalent enum. nsHtml5ElementName and nsHtml5AttributeName are very hard to read given the big tables they contain. We really should have those broken out into #include files similarly to how we do atom lists. nsHtml5MetaScanner.cpp Have you looked at if this code shows up in profiles at all? In general there are a fair number of optimizations that you could do here. Such as not feeding the scanner asynchronously (which there really is no reason to do as we don't display anything until we're done scanning, right?), and instead implementing this in a much less "switchy" way. Another thing is to get rid of the the reconsume stuff, it should be faster to rewind the reading iterator one char. This various members need descriptions. I couldn't find any in the java source either.
681 nsHtml5MetaScanner::tryCharset() 682 { 683 if (metaState != NS_HTML5META_SCANNER_A || !(contentIndex == 6 || charsetIndex == 6)) { 684 return PR_FALSE; 685 } 686 nsString* attVal = nsHtml5Portability::newStringFromBuffer(strBuf, 0, strBufLen); Ugh, this is really ugly, and scary from a leaks point of view. We need to find a solution that allows us to generate nice C++ code here. I.e. with a normal string object living on the stack. Don't expect there is anything we can do about it in this bug (probably a bigger project), but wanted to note it so that we don't loose track of it.
(In reply to comment #9) > I realized that it would be nice to put all this code into a mozilla::html5 > parser namespace. That should should reduce concerns about long identifiers a > bit. Definitely not a requirement though. Yeah, the mozilla::* stuff started being common and wanted after the nsHtml5* naming was already in place. This is bug 497820. > Why do we need nsHtml5DocumentMode? nsCompatibility.h has an equivalent enum. The Java code has a DocumentMode enum independently of Gecko. I could add more special casing to the translator and make the DocumentMode enum translate directly to nsCompatibility.h if adding complexity to the translator is preferred. (In the big picture, though, this is nothing new: the old parser also had its own enum distinct from the layout enum for this purpose.) > nsHtml5ElementName and nsHtml5AttributeName are very hard to read given the big > tables they contain. We really should have those broken out into #include files > similarly to how we do atom lists. The main difference is that those big tables are translated from the Java code (and Java doesn't have include files) while the atom list is generated by the C++ translator. I could special-case nsHtml5ElementName and nsHtml5AttributeName to be split across multiple files. Do you consider this a blocker for turning the parser on by default? > nsHtml5MetaScanner.cpp > > Have you looked at if this code shows up in profiles at all? I've noticed meta prescan-related stuff on a profile, but everything that happens on the parser thread on tp4 is below 1% of the total time, so any optimizations to the parser core would be squeezing below 1% time to even more below 1%. > In general there > are a fair number of optimizations that you could do here. Such as not feeding > the scanner asynchronously (which there really is no reason to do as we don't > display anything until we're done scanning, right?), and instead implementing > this in a much less "switchy" way. > > Another thing is to get rid of the the reconsume stuff, it should be faster to > rewind the reading iterator one char. The meta scanner uses an old approach to implementing the states. I want to change the meta scanner to use the same patterns as the main tokenizer. However, I've prioritized mochitest failures and user-visible bugs. Once the meta scanner uses the same patterns as the tokenizer, fixing bug 482920 would speed up both. I filed this as bug 552900. > This various members need descriptions. I couldn't find any in the java source > either. OK. (In reply to comment #10) > 681 nsHtml5MetaScanner::tryCharset() > 682 { > 683 if (metaState != NS_HTML5META_SCANNER_A || !(contentIndex == 6 || > charsetIndex == 6)) { > 684 return PR_FALSE; > 685 } > 686 nsString* attVal = nsHtml5Portability::newStringFromBuffer(strBuf, 0, > strBufLen); > > Ugh, this is really ugly, and scary from a leaks point of view. We need to find > a solution that allows us to generate nice C++ code here. I.e. with a normal > string object living on the stack. > > Don't expect there is anything we can do about it in this bug (probably a > bigger project), but wanted to note it so that we don't loose track of it. The cases where the HTML5 parser could benefit from on-stack strings are so few that it's probably not worth it. Most strings are attribute values and those need to travel across threads on the heap anyway. However, it's kinda pointless to have nsString* when nsStringBuffer* would have the desired semantics of java.lang.String. But nsStringBuffer* is supposed to be a hidden impl. detail. :-(
So having looked at the entity code also, I think we shoud switch to having specialized code in the translator that output "macro files", and then hand write cpp files that use those to populate whatever data structure we think is appropriate. This is especially nice for the entity map as we can get rid of the high-lo stuff. Not necessarily a requirement for turning on by default, I'd say most things in this bug aren't, but something that needs to be done for sure.
(In reply to comment #11) > (In reply to comment #9) > > This various members need descriptions. I couldn't find any in the java source > > either. > > OK. I landed some more JavaDocs, though in most cases I couldn't think of much to say. (In reply to comment #9) > nsHtml5ElementName and nsHtml5AttributeName are very hard to read given the big > tables they contain. We really should have those broken out into #include files > similarly to how we do atom lists. I realized we're repeating bug 487949 comment 42 and bug 487949 comment 49. (In reply to comment #12) > So having looked at the entity code also, I think we shoud switch to having > specialized code in the translator that output "macro files", and then hand > write cpp files that use those to populate whatever data structure we think is > appropriate. I don't really see the benefit of having the generator generate macros and includes that at compile time generate the desired structure. It seems simpler to let the generator generate the desired structure. (Except for the enum trick from bug 501082 that makes compile times faster.) > This is especially nice for the entity map as we can get rid of > the high-lo stuff. I filed bug 555941 about splitting the HILO_ACCEL table into a separate class, but I don't see positive value in having the C preprocessor compute the contents of HILO_ACCEL. Seems like an unnecessary complication to me.
> (In reply to comment #12) > > So having looked at the entity code also, I think we shoud switch to having > > specialized code in the translator that output "macro files", and then hand > > write cpp files that use those to populate whatever data structure we think > > is appropriate. > > I don't really see the benefit of having the generator generate macros and > includes that at compile time generate the desired structure. It seems simpler > to let the generator generate the desired structure. (Except for the enum > trick from bug 501082 that makes compile times faster.) The idea is that we can hand write C++ code to use and implement the datastructures that work the best for our code. So for example in C++ it's much more efficient and clean to get rid of the high-lo stuff entirely and simply use an nsTHashtable mapping the full entity string to an entity value. Or, if we want more efficiency, to build a tree structure which allows us to walk one level down the tree for each char consumed, and which at the end gives the entity value that maps to. I completely agree that simply using preprocessor macros to generate the exact same code we have now doesn't buy us much. The goal here is simpler and more maintainable C++ code.
(In reply to comment #14) > The idea is that we can hand write C++ code to use and implement the > datastructures that work the best for our code. So for example in C++ it's much > more efficient and clean to get rid of the high-lo stuff entirely and simply > use an nsTHashtable mapping the full entity string to an entity value. A hashtable alone wouldn't do the right thing, because named characters need a longest prefix match. > Or, if > we want more efficiency, to build a tree structure which allows us to walk one > level down the tree for each char consumed, and which at the end gives the > entity value that maps to. The hilo table encodes the first two levels of trie-equivalent lookup more efficiently than a tree structure would. Using linear search for the tail of the names makes the data structure have a smaller overall footprint than a prefix tree would and makes it easier to arrange the entire data structure ahead of run time. The only part of the hilo solution that is less nice in C++ because of Java's restrictions is that two 16-bit integers are manually shifted into a 32-bit space instead of using a struct of two shorts as a syntactically nicer way to use the same data layout in C++. I didn't choose the hilo solution only to work around cross-language limitations even though the solution has the nice property of working in both Java and C++ without language-specific classes. I chose it to balance the the performance characteristics of a trie with the compactness and data segment-friendliness of arrays.
I still think we need to change how we're doing element/attribute/entity names, however I'll take that to a separate bug. More review comments: What is the purpose of nsHtml5RefPtr? Why not just use nsRefPtr? We need to get rid of nsHtml5AtomTable.cpp and the "fake" nsIAtoms. It's very confusing that we in part of the code have atom like things that aren't really atoms. If you don't need these things to be unique (which my understanding is that you don't?), then simply using pointers to nsStrings or even nsStringBuffers seems like a cleaner solution. We could use a class similar to nsAttrName that wraps either an nsStringBuffer or an nsIAtom depending on if the low-bit is set. Or some such. This can be done after turning on by default though. nsHtml5StateSnapshot: Might be cleaner to just make all those members public and avoid all the getter methods. Would be nice if getListLength had something about formatting elements in its name (unless you remove it entirely, see previous point). 157 if (!!listOfActiveFormattingElements[i]) { No need for the !! nsHtml5StreamParser: 197 nsHtml5StreamParser::~nsHtml5StreamParser() 198 { 199 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); 200 mTokenizer->end(); 201 mRequest = nsnull; 202 mObserver = nsnull; 203 mUnicodeDecoder = nsnull; 204 mSniffingBuffer = nsnull; 205 mMetaScanner = nsnull; 206 mFirstBuffer = nsnull; 207 mExecutor = nsnull; 208 mTreeBuilder = nsnull; 209 mTokenizer = nsnull; 210 mOwner = nsnull; 211 if (mFlushTimer) { 212 mFlushTimer->Cancel(); 213 mFlushTimer = nsnull; 214 } 215 } Is there a reason to null all these things out? 238 nsHtml5StreamParser::SetupDecodingAndWriteSniffingBufferAndCurrentSegment(const PRUint8* aFromSegment, // can be null 239 PRUint32 aCount, 240 PRUint32* aWriteCount) 241 { 242 NS_ASSERTION(IsParserThread(), "Wrong thread!"); 243 nsresult rv = NS_OK; 244 nsCOMPtr<nsICharsetConverterManager> convManager = do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv); 245 NS_ENSURE_SUCCESS(rv, rv); 246 rv = convManager->GetUnicodeDecoder(mCharset.get(), getter_AddRefs(mUnicodeDecoder)); I don't think that the charset converter manager is threadsafe yet. Nor are some of the decoders I looked at. I suspect the simplest fix is to make them so, it shouldn't be too hard. This we need to fix before turning it on by default as to avoid random crashes. 248 mCharset.Assign("windows-1252"); // lower case is the raw form Use AssignLiteral (This might apply elsewhere too) 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch() 980 { 981 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); 982 mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); 983 Uninterrupt(); 984 nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this); 985 if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) { 986 NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation"); 987 } 988 } Is there a reason to uninterrupt from the main thread? Why not do it from the parser thread as to avoid blocking the main thread. nsHtml5StreamParser::ContinueAfterScripts 864 if (mSpeculations.IsEmpty()) { 865 // Not quite sure how exactly this happens... 866 // Maybe an artifact of defer scripts? 867 return; 868 } Sounds like an assertion would be good. 902 mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); 903 #ifdef DEBUG 904 nsCOMPtr<nsIThread> mainThread; 905 NS_GetMainThread(getter_AddRefs(mainThread)); 906 mAtomTable.SetPermittedLookupThread(mainThread); 907 #endif Please don't indent unless there is an actual block. Might actually be good to add a block to avoid mainThread existing only in debug builds.
(In reply to comment #16) > What is the purpose of nsHtml5RefPtr? Why not just use nsRefPtr? nsHtml5RefPtr calls Release on the main thread. The only way to achieve the same thing without modifying nsRefPtr would be to make sure the refcounted type overrides Release and does the main-thread dispatching itself, which is inconvenient. What really needs to happen here is to modify nsRefPtr to accept a template parameter that controls the Release behavior. I have a patch to do just that, and also get rid of the duplicated code in parser/html/nsHtml5RefPtr.h.
You're also using the charset detector off the main thread, which I suspect isn't safe. In fact, since you are creating it on the main thread, and then calling it from the parser thread, I'm surprised this doesn't assert as it doesn't seem to be implemented using threadsafe refcounting macros. This is also something we need to fix before we can turn on by default. Do you want me to file a separate bug on this charset threadsafety stuff?
nsHtml5StreamParser::SniffStreamBytes 340 for (PRUint32 i = 0; i < aCount; i++) { 341 switch (mBomState) { ... 390 case SEEN_UTF_8_SECOND_BYTE: 391 if (aFromSegment[i] == 0xBF) { 392 rv = SetupDecodingFromBom("UTF-8", "UTF-8"); // upper case is the raw form 393 NS_ENSURE_SUCCESS(rv, rv); 394 PRUint32 count = aCount - (i + 1); 395 rv = WriteStreamBytes(aFromSegment + (i + 1), count, &writeCount); 396 NS_ENSURE_SUCCESS(rv, rv); 397 *aWriteCount = writeCount + (i + 1); 398 return rv; 399 } 400 mBomState = BOM_SNIFFING_OVER; 401 break; 402 default: 403 goto bom_loop_end; 404 } 405 } 406 // if we get here, there either was no BOM or the BOM sniffing isn't complete yet 407 bom_loop_end: Change the loop-condition to be |i < aCount && mBomState != BOM_SNIFFING_OVER| and change the default condition to set |mBomState = BOM_SNIFFING_OVER| and get rid of the goto? http://xkcd.com/292/ :) 418 if (mUnicodeDecoder) { 419 mUnicodeDecoder->SetInputErrorBehavior(nsIUnicodeDecoder::kOnError_Recover); It seems like the responsibility for this call varies with where we instantiate mUnicodeDecoder. Would be nice to make it consistent and always make it the responsibility of whoever created the decoder. Or do we need to set it to different ™ nsHtml5StreamParser::WriteStreamBytes 475 if (NS_FAILED(convResult)) { Could you add a comment describing what we do on failure. Took a while for me to figure out. 476 if (totalByteCount < aCount) { // mimicking nsScanner even though this seems wrong 477 ++totalByteCount; 478 ++aFromSegment; Could you assert here? I agree this seems wrong. 479 } 480 mLastBuffer->getBuffer()[end] = 0xFFFD; 481 ++end; 482 mLastBuffer->setEnd(end); 483 if (end == NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE) { 484 mLastBuffer = (mLastBuffer->next = new nsHtml5UTF16Buffer(NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE)); No need for the parenthesis. a = b = c; is ok. 485 } 486 mUnicodeDecoder->Reset(); 487 if (totalByteCount == aCount) { 488 *aWriteCount = totalByteCount; 489 return NS_OK; 490 } 491 } else if (convResult == NS_PARTIAL_MORE_OUTPUT) { 492 mLastBuffer = (mLastBuffer->next = new nsHtml5UTF16Buffer(NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE)); 493 NS_ASSERTION(totalByteCount < aCount, "The Unicode decoder has consumed too many bytes."); Same here nsHtml5StreamParser::internalEncodingDeclaration The name of this function seems very strange to me. Also, it's not documented anywhere. Is this called any time we find a meta-charset? If so, would calling it something like MetaCharsetFound make sense? 700 NS_ASSERTION(IsParserThread(), "Wrong thread!"); 701 if (mCharsetSource >= kCharsetFromMetaTag) { // this threshold corresponds to "confident" in the HTML5 spec 702 return; 703 } 704 705 // The encodings are different. This comment threw me off for a while. It doesn't appear to be correct, we haven't done any equality checks at that point, have we. Shouldn't it be moved until after the calias->Equals(newEncoding, mCharset, &eq); call? 739 // we still want to reparse 740 mTreeBuilder->NeedsCharsetSwitchTo(preferred); 741 mTreeBuilder->Flush(); The 'still' here threw me off a little bit too. Isn't this the first point where we've determined that we want to reparse? nsHtml5StreamParser::ParseAvailableData 762 if (!mFirstBuffer->hasMore()) { 763 if (mFirstBuffer == mLastBuffer) { 764 switch (mStreamState) { 765 case STREAM_BEING_READ: 766 // never release the last buffer. 767 if (!mSpeculating) { 768 // reuse buffer space if not speculating 769 mFirstBuffer->setStart(0); 770 mFirstBuffer->setEnd(0); 771 } 772 mTreeBuilder->FlushLoads(); 773 // Dispatch this runnable unconditionally, because the loads 774 // that need flushing may have been flushed earlier even if the 775 // flush right above here did nothing. 776 if (NS_FAILED(NS_DispatchToMainThread(mLoadFlusher))) { 777 NS_WARNING("failed to dispatch load flush event"); 778 } 779 return; // no more data for now but expecting more 780 case STREAM_ENDED: 781 if (mAtEOF) { 782 return; 783 } 784 mAtEOF = PR_TRUE; 785 mTokenizer->eof(); 786 mTreeBuilder->StreamEnded(); 787 mTreeBuilder->Flush(); 788 if (NS_FAILED(NS_DispatchToMainThread(mExecutorFlusher))) { 789 NS_WARNING("failed to dispatch executor flush event"); 790 } 791 return; // no more data and not expecting more 792 default: 793 NS_NOTREACHED("It should be impossible to reach this."); 794 return; 795 } 796 } else { 797 mFirstBuffer = mFirstBuffer->next; 798 continue; 799 } 800 } The else on line 796 is actually an else-after-return. Would probably be clearer to change those 'return's into 'break's and put a return after the switch statement and remove the else. The optimizer should turn it into the same flow anyway. The timer stuff is very strange. It seems like it keeps refiring on the main event loop, weather we've parsed anything or not, until we happen to be speculating at the time when the timer fires, or we finish parsing. (This means that the "just in case" comments at the various points when we cancel and restart the timer isn't really true, as we have no idea if the timer is currently pending or not.) It seems to me much better to have the timer running on the parser thread first of all. Also, to save batteries, only start the timer if we have pending things that need flushing. I.e. when we reach the end of a network buffer without getting blocked by a script (and maybe when we get unblocked with a successful speculation, or do we always flush what we have parsed in that case?) I think I'd like to have this timer situation solved before turning the parser on by default.
(In reply to comment #9) > Another thing is to get rid of the the reconsume stuff, it should be faster to > rewind the reading iterator one char. I refreshed my memory. The reason why the reconsume pattern exists in the tokenizer is that in the error reporting mode (that the validator uses but Gecko doesn't), reading a character isn't idompotent, because reading a character can have the side effect of emitting an error. In the case of the meta scanner, the data source abstraction currently in use doesn't provide a way to rewind. In both cases, the reconsume pattern will go away in C++ on bug 482920 is fixed. I hope this isn't considered a blocker to turning the parser on by default. (In reply to comment #16) > I still think we need to change how we're doing element/attribute/entity names, > however I'll take that to a separate bug. FWIW, I wanted to put element and attribute names in the data segment of the shared lib but couldn't because static atoms aren't in the data segment and, therefore, don't have memory addresses at link time. > More review comments: > > What is the purpose of nsHtml5RefPtr? Why not just use nsRefPtr? I see that bnewman answered this question already. > We need to get rid of nsHtml5AtomTable.cpp and the "fake" nsIAtoms. Do you mean that we just need to make the fake nsIAtoms not implement nsIAtom or do you want to get rid of the nsHtml5AtomTable mechanism more generally? (I'm OK with the former but want to avoid the latter.) > It's very > confusing that we in part of the code have atom like things that aren't really > atoms. If you don't need these things to be unique (which my understanding is > that you don't?), Even dynamically allocated local names need to be "unique" as far as an nsHtml5Tokenizer instance or an nsHtml5TreeBuilder instance can tell: to deal with duplicate attribute names and to make end tag names match start tag names. > then simply using pointers to nsStrings or even > nsStringBuffers seems like a cleaner solution. We could use a class similar to > nsAttrName that wraps either an nsStringBuffer or an nsIAtom depending on if > the low-bit is set. Or some such. This can be done after turning on by default > though. This is bug 539433. > nsHtml5StateSnapshot: > Might be cleaner to just make all those members public and avoid all the getter > methods. The getters are there so that the getters can be part of a Java interface / nsAHtml5TreeBuilderState that both the state snapshot and the tree builder implement. > Would be nice if getListLength had something about formatting elements in its > name (unless you remove it entirely, see previous point). I'll address this in a follow-up patch. > 157 if (!!listOfActiveFormattingElements[i]) { > > No need for the !! That's generated code. The translator makes the C++ code avoid !== and == expressions where one of the operands is zero. However, it doesn't check if the resulting expression is used in a context where the potentially resulting !! is useless. The translator uses the visitor pattern, so the visits for the !== and == expressions don't know about the context of the expression. In order to avoid spending excessive time on making the translator more elegant than it needs to be to work, I haven't tried to get rid of useless but harmless !!. (There's even |if (!!this)| in one generated file...) > nsHtml5StreamParser: > 197 nsHtml5StreamParser::~nsHtml5StreamParser() > 198 { > 199 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > 200 mTokenizer->end(); > 201 mRequest = nsnull; > 202 mObserver = nsnull; > 203 mUnicodeDecoder = nsnull; > 204 mSniffingBuffer = nsnull; > 205 mMetaScanner = nsnull; > 206 mFirstBuffer = nsnull; > 207 mExecutor = nsnull; > 208 mTreeBuilder = nsnull; > 209 mTokenizer = nsnull; > 210 mOwner = nsnull; > 211 if (mFlushTimer) { > 212 mFlushTimer->Cancel(); > 213 mFlushTimer = nsnull; > 214 } > 215 } > > Is there a reason to null all these things out? Mainly to distinguish an already-deleted object more easily in a debugger. Shall I make these #ifdef DEBUG? > 238 > nsHtml5StreamParser::SetupDecodingAndWriteSniffingBufferAndCurrentSegment(const > PRUint8* aFromSegment, // can be null > 239 > PRUint32 aCount, > 240 > PRUint32* aWriteCount) > 241 { > 242 NS_ASSERTION(IsParserThread(), "Wrong thread!"); > 243 nsresult rv = NS_OK; > 244 nsCOMPtr<nsICharsetConverterManager> convManager = > do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv); > 245 NS_ENSURE_SUCCESS(rv, rv); > 246 rv = convManager->GetUnicodeDecoder(mCharset.get(), > getter_AddRefs(mUnicodeDecoder)); > > I don't think that the charset converter manager is threadsafe yet. I believe I made the part of the charset converter manager that the HTML5 parser uses thread-safe in the off-the-main-thread parser landing. https://hg.mozilla.org/mozilla-central/rev/7cda86954b4c > Nor are > some of the decoders I looked at. I suspect the simplest fix is to make them > so, it shouldn't be too hard. In the changeset referenced above, I made the instantiation of nsOneByteDecoderSupport objects protected by a mutex. Once a given one-byte converter has loaded its tables, it has no mutable state and is, therefore, thread-safe. When I looked at the multibyte decoders, it seemed to me that each instance was safe for using from a single thread and there was no issue with loading data tables because the data tables were baked into code ahead of compile time. > This we need to fix before turning it on by default as to avoid random crashes. Are there particular decoders that I have missed? > 248 mCharset.Assign("windows-1252"); // lower case is the raw form > > Use AssignLiteral (This might apply elsewhere too) Fixed. (Both occurrences.) > 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch() > 980 { > 981 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > 982 mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); > 983 Uninterrupt(); > 984 nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this); > 985 if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) { > 986 NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation"); > 987 } > 988 } > > Is there a reason to uninterrupt from the main thread? Why not do it from the > parser thread as to avoid blocking the main thread. When calling Uninterrupt(), the main thread already owns mTokenizerMutex and Uninterrupt() doesn't acquire an additional mutex, so no additional blocking occurs. (See the comment in the implementation of Uninterrupt().) > nsHtml5StreamParser::ContinueAfterScripts > > 864 if (mSpeculations.IsEmpty()) { > 865 // Not quite sure how exactly this happens... > 866 // Maybe an artifact of defer scripts? > 867 return; > 868 } > > Sounds like an assertion would be good. I put an NS_WARNING here. > 902 mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); > 903 #ifdef DEBUG > 904 nsCOMPtr<nsIThread> mainThread; > 905 NS_GetMainThread(getter_AddRefs(mainThread)); > 906 mAtomTable.SetPermittedLookupThread(mainThread); > 907 #endif > > Please don't indent unless there is an actual block. Might actually be good to > add a block to avoid mainThread existing only in debug builds. Added a block. (In reply to comment #18) > You're also using the charset detector off the main thread, which I suspect > isn't safe. In fact, since you are creating it on the main thread, and then > calling it from the parser thread, I'm surprised this doesn't assert as it > doesn't seem to be implemented using threadsafe refcounting macros. > > This is also something we need to fix before we can turn on by default. Do you > want me to file a separate bug on this charset threadsafety stuff? Yes, a separate bug would be good to have if there really is a thread-safety problem left. As far as I can tell, there is no problem: After bug , the detector is refcounted only from the main thread. The sniffing is run only from the parser thread. The sniffing seems to affect only the internal state of the object. The data tables shared by the instances are baked into code ahead of compile time. (In reply to comment #19) > nsHtml5StreamParser::SniffStreamBytes > > 340 for (PRUint32 i = 0; i < aCount; i++) { > 341 switch (mBomState) { > ... > 390 case SEEN_UTF_8_SECOND_BYTE: > 391 if (aFromSegment[i] == 0xBF) { > 392 rv = SetupDecodingFromBom("UTF-8", "UTF-8"); // upper case is the > raw form > 393 NS_ENSURE_SUCCESS(rv, rv); > 394 PRUint32 count = aCount - (i + 1); > 395 rv = WriteStreamBytes(aFromSegment + (i + 1), count, > &writeCount); > 396 NS_ENSURE_SUCCESS(rv, rv); > 397 *aWriteCount = writeCount + (i + 1); > 398 return rv; > 399 } > 400 mBomState = BOM_SNIFFING_OVER; > 401 break; > 402 default: > 403 goto bom_loop_end; > 404 } > 405 } > 406 // if we get here, there either was no BOM or the BOM sniffing isn't > complete yet > 407 bom_loop_end: > > Change the loop-condition to be |i < aCount && mBomState != BOM_SNIFFING_OVER| > and change the default condition to set |mBomState = BOM_SNIFFING_OVER| and get > rid of the goto? http://xkcd.com/292/ :) Changed. I wish C++ had break with label. > 418 if (mUnicodeDecoder) { > 419 > mUnicodeDecoder->SetInputErrorBehavior(nsIUnicodeDecoder::kOnError_Recover); > > It seems like the responsibility for this call varies with where we instantiate > mUnicodeDecoder. Would be nice to make it consistent and always make it the > responsibility of whoever created the decoder. Or do we need to set it to > different ™ Good catch. Fixed. > nsHtml5StreamParser::WriteStreamBytes > > 475 if (NS_FAILED(convResult)) { > > Could you add a comment describing what we do on failure. Took a while for me > to figure out. Added comment. > 476 if (totalByteCount < aCount) { // mimicking nsScanner even though > this seems wrong > 477 ++totalByteCount; > 478 ++aFromSegment; > > Could you assert here? I agree this seems wrong. I added an NS_NOTREACHED in an else branch instead of merely asserting to avoid a buffer overrun if a decoder misbehaves. FWIW, the docs at http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/intl/uconv/public/nsIUnicodeDecoder.h#123 are not good enough. > 479 } > 480 mLastBuffer->getBuffer()[end] = 0xFFFD; > 481 ++end; > 482 mLastBuffer->setEnd(end); > 483 if (end == NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE) { > 484 mLastBuffer = (mLastBuffer->next = new > nsHtml5UTF16Buffer(NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE)); > > No need for the parenthesis. a = b = c; is ok. Changed. > 485 } > 486 mUnicodeDecoder->Reset(); > 487 if (totalByteCount == aCount) { > 488 *aWriteCount = totalByteCount; > 489 return NS_OK; > 490 } > 491 } else if (convResult == NS_PARTIAL_MORE_OUTPUT) { > 492 mLastBuffer = (mLastBuffer->next = new > nsHtml5UTF16Buffer(NS_HTML5_STREAM_PARSER_READ_BUFFER_SIZE)); > 493 NS_ASSERTION(totalByteCount < aCount, "The Unicode decoder has > consumed too many bytes."); > > Same here Changed. > nsHtml5StreamParser::internalEncodingDeclaration > > The name of this function seems very strange to me. Also, it's not documented > anywhere. The documentation is over here: http://hg.mozilla.org/projects/htmlparser/file/7df2f945c5b7/src/nu/validator/htmlparser/common/EncodingDeclarationHandler.java#l36 > Is this called any time we find a meta-charset? Yes, it is. > If so, would calling it something like MetaCharsetFound make sense? "Meta charset" is not HTML5 spec terminology. "[character] encoding declaration" is spec terminology. That's why the method is named the way it is named. In general, I have tried to stick to spec terminology when naming things even when the spec terminology deviates from colloquial usage. (Consider "entities" vs. "named character refernces".) > 700 NS_ASSERTION(IsParserThread(), "Wrong thread!"); > 701 if (mCharsetSource >= kCharsetFromMetaTag) { // this threshold > corresponds to "confident" in the HTML5 spec > 702 return; > 703 } > 704 > 705 // The encodings are different. > > This comment threw me off for a while. It doesn't appear to be correct, we > haven't done any equality checks at that point, have we. Shouldn't it be moved > until after the calias->Equals(newEncoding, mCharset, &eq); call? Removed the comment. > 739 // we still want to reparse > 740 mTreeBuilder->NeedsCharsetSwitchTo(preferred); > 741 mTreeBuilder->Flush(); > > The 'still' here threw me off a little bit too. Isn't this the first point > where we've determined that we want to reparse? My thinking here was that we presumptively want to reparse but decide not to if the new encoding is an alias for the current encoding. Removed the comment. > nsHtml5StreamParser::ParseAvailableData > > 762 if (!mFirstBuffer->hasMore()) { > 763 if (mFirstBuffer == mLastBuffer) { > 764 switch (mStreamState) { > 765 case STREAM_BEING_READ: > 766 // never release the last buffer. > 767 if (!mSpeculating) { > 768 // reuse buffer space if not speculating > 769 mFirstBuffer->setStart(0); > 770 mFirstBuffer->setEnd(0); > 771 } > 772 mTreeBuilder->FlushLoads(); > 773 // Dispatch this runnable unconditionally, because the loads > 774 // that need flushing may have been flushed earlier even if the > 775 // flush right above here did nothing. > 776 if (NS_FAILED(NS_DispatchToMainThread(mLoadFlusher))) { > 777 NS_WARNING("failed to dispatch load flush event"); > 778 } > 779 return; // no more data for now but expecting more > 780 case STREAM_ENDED: > 781 if (mAtEOF) { > 782 return; > 783 } > 784 mAtEOF = PR_TRUE; > 785 mTokenizer->eof(); > 786 mTreeBuilder->StreamEnded(); > 787 mTreeBuilder->Flush(); > 788 if (NS_FAILED(NS_DispatchToMainThread(mExecutorFlusher))) { > 789 NS_WARNING("failed to dispatch executor flush event"); > 790 } > 791 return; // no more data and not expecting more > 792 default: > 793 NS_NOTREACHED("It should be impossible to reach this."); > 794 return; > 795 } > 796 } else { > 797 mFirstBuffer = mFirstBuffer->next; > 798 continue; > 799 } > 800 } > > The else on line 796 is actually an else-after-return. Would probably be > clearer to change those 'return's into 'break's and put a return after the > switch statement and remove the else. The optimizer should turn it into the > same flow anyway. Removed the else. I tried changing the returns into breaks, but when I looked at the result, it looked less obvious than just returning instead of breaking first. > The timer stuff is very strange. It seems like it keeps refiring on the main > event loop, weather we've parsed anything or not, until we happen to be > speculating at the time when the timer fires, or we finish parsing. The timer is supposed to keep refiring whenever the parser is not speculating. When the parser is not speculating, it's useful to move data to the main thread every now and then even when there's no </script> in sight. > (This means that the "just in case" comments at the various points when we > cancel and restart the timer isn't really true, as we have no idea if the timer > is currently pending or not.) Removed the comments. > It seems to me much better to have the timer running on the parser thread first > of all. I put the timer on the main thread, because I couldn't find a timer implementation that'd fire on a non-main thread. Also, initially I wanted also to check if the document is in the frontmost tab when the timer fires, but I gave up on that idea. > Also, to save batteries, only start the timer if we have pending things > that need flushing. I.e. when we reach the end of a network buffer without > getting blocked by a script For practical purposes, this is approximated by sTimerStartDelay being longer than the subsequent delays. > (and maybe when we get unblocked with a successful > speculation, or do we always flush what we have parsed in that case?) The parser always flushes when a speculation succeeds. The timer is strictly about non-speculative parsing. > I think I'd like to have this timer situation solved before turning the parser > on by default. That seems problematic if we don't already have off-the-main-thread timer infrastructure in place.
Attachment #437849 - Flags: review?(jonas)
> After bug , the I forgot to paste the bug number. I meant bug 537557.
Depends on: 558775
(In reply to comment #20) > Created an attachment (id=437849) [details] > Address review comments in nsHtml5StreamParser.cpp > > (In reply to comment #9) > > Another thing is to get rid of the the reconsume stuff, it should be faster to > > rewind the reading iterator one char. > > I refreshed my memory. The reason why the reconsume pattern exists in the > tokenizer is that in the error reporting mode (that the validator uses but > Gecko doesn't), reading a character isn't idompotent, because reading a > character can have the side effect of emitting an error. In the case of the > meta scanner, the data source abstraction currently in use doesn't provide a > way to rewind. > > In both cases, the reconsume pattern will go away in C++ on bug 482920 is > fixed. I hope this isn't considered a blocker to turning the parser on by > default. Definitely not a blocker for turning on. > (In reply to comment #16) > > I still think we need to change how we're doing element/attribute/entity names, > > however I'll take that to a separate bug. > > FWIW, I wanted to put element and attribute names in the data segment of the > shared lib but couldn't because static atoms aren't in the data segment and, > therefore, don't have memory addresses at link time. What we've done in the past is to create static pointers to pointers to atoms. I.e. things like &nsGkAtoms::foo. Dunno if that will work here. In any case, I'll file a separate bug on this stuff. > > We need to get rid of nsHtml5AtomTable.cpp and the "fake" nsIAtoms. > > Do you mean that we just need to make the fake nsIAtoms not implement nsIAtom > or do you want to get rid of the nsHtml5AtomTable mechanism more generally? > (I'm OK with the former but want to avoid the latter.) The former. > > It's very > > confusing that we in part of the code have atom like things that aren't really > > atoms. If you don't need these things to be unique (which my understanding is > > that you don't?), > > Even dynamically allocated local names need to be "unique" as far as an > nsHtml5Tokenizer instance or an nsHtml5TreeBuilder instance can tell: to deal > with duplicate attribute names and to make end tag names match start tag names. Ah, didn't realize this. Good to know. > > then simply using pointers to nsStrings or even > > nsStringBuffers seems like a cleaner solution. We could use a class similar to > > nsAttrName that wraps either an nsStringBuffer or an nsIAtom depending on if > > the low-bit is set. Or some such. This can be done after turning on by default > > though. > > This is bug 539433. Awesome, thanks. > > nsHtml5StreamParser: > > 197 nsHtml5StreamParser::~nsHtml5StreamParser() > > 198 { > > 199 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > > 200 mTokenizer->end(); > > 201 mRequest = nsnull; > > 202 mObserver = nsnull; > > 203 mUnicodeDecoder = nsnull; > > 204 mSniffingBuffer = nsnull; > > 205 mMetaScanner = nsnull; > > 206 mFirstBuffer = nsnull; > > 207 mExecutor = nsnull; > > 208 mTreeBuilder = nsnull; > > 209 mTokenizer = nsnull; > > 210 mOwner = nsnull; > > 211 if (mFlushTimer) { > > 212 mFlushTimer->Cancel(); > > 213 mFlushTimer = nsnull; > > 214 } > > 215 } > > > > Is there a reason to null all these things out? > > Mainly to distinguish an already-deleted object more easily in a debugger. > Shall I make these #ifdef DEBUG? Please do. > > 238 > > nsHtml5StreamParser::SetupDecodingAndWriteSniffingBufferAndCurrentSegment(const > > PRUint8* aFromSegment, // can be null > > 239 > > PRUint32 aCount, > > 240 > > PRUint32* aWriteCount) > > 241 { > > 242 NS_ASSERTION(IsParserThread(), "Wrong thread!"); > > 243 nsresult rv = NS_OK; > > 244 nsCOMPtr<nsICharsetConverterManager> convManager = > > do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv); > > 245 NS_ENSURE_SUCCESS(rv, rv); > > 246 rv = convManager->GetUnicodeDecoder(mCharset.get(), > > getter_AddRefs(mUnicodeDecoder)); > > > > I don't think that the charset converter manager is threadsafe yet. > > I believe I made the part of the charset converter manager that the HTML5 > parser uses thread-safe in the off-the-main-thread parser landing. > https://hg.mozilla.org/mozilla-central/rev/7cda86954b4c Yay!! Awesome. I would have made the one-byte table just initialize in the ctor to avoid the mutex, but that's a separate issue. Didn't know you had done this, so things look good on the charset converter side! > > 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch() > > 980 { > > 981 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > > 982 mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); > > 983 Uninterrupt(); > > 984 nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this); > > 985 if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) { > > 986 NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation"); > > 987 } > > 988 } > > > > Is there a reason to uninterrupt from the main thread? Why not do it from the > > parser thread as to avoid blocking the main thread. > > When calling Uninterrupt(), the main thread already owns mTokenizerMutex and > Uninterrupt() doesn't acquire an additional mutex, so no additional blocking > occurs. This doesn't appear to be the case. You're here explicitly grabbing the lock in order to be able to safely call Uninterrupt. If you instead do that while on the parser thread, there is no need for the main thread to block. > > nsHtml5StreamParser::ContinueAfterScripts > > > > 864 if (mSpeculations.IsEmpty()) { > > 865 // Not quite sure how exactly this happens... > > 866 // Maybe an artifact of defer scripts? > > 867 return; > > 868 } > > > > Sounds like an assertion would be good. > > I put an NS_WARNING here. Why not NS_ASSERTION? > (In reply to comment #18) > > You're also using the charset detector off the main thread, which I suspect > > isn't safe. In fact, since you are creating it on the main thread, and then > > calling it from the parser thread, I'm surprised this doesn't assert as it > > doesn't seem to be implemented using threadsafe refcounting macros. > > > > This is also something we need to fix before we can turn on by default. Do you > > want me to file a separate bug on this charset threadsafety stuff? > > Yes, a separate bug would be good to have if there really is a thread-safety > problem left. As far as I can tell, there is no problem: After bug , the > detector is refcounted only from the main thread. The sniffing is run only from > the parser thread. The sniffing seems to affect only the internal state of the > object. The data tables shared by the instances are baked into code ahead of > compile time. I'll file a separate bug. If we're lucky (which it sounds like based on your description) all we need to do is to move the creation of the detector to the thread where its used. > > 476 if (totalByteCount < aCount) { // mimicking nsScanner even though > > this seems wrong > > 477 ++totalByteCount; > > 478 ++aFromSegment; > > > > Could you assert here? I agree this seems wrong. > > I added an NS_NOTREACHED in an else branch instead of merely asserting to avoid > a buffer overrun if a decoder misbehaves. > > FWIW, the docs at > http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/intl/uconv/public/nsIUnicodeDecoder.h#123 > are not good enough. Would be great if you could fix them. > > If so, would calling it something like MetaCharsetFound make sense? > > "Meta charset" is not HTML5 spec terminology. "[character] encoding > declaration" is spec terminology. That's why the method is named the way it is > named. In general, I have tried to stick to spec terminology when naming things > even when the spec terminology deviates from colloquial usage. (Consider > "entities" vs. "named character refernces".) How about EncodingDeclarationFound? Function names that are simple nouns sound like getters to me. > > It seems to me much better to have the timer running on the parser thread first > > of all. > > I put the timer on the main thread, because I couldn't find a timer > implementation that'd fire on a non-main thread. Also, initially I wanted also > to check if the document is in the frontmost tab when the timer fires, but I > gave up on that idea. Timers by default fire on the thread that created them. So if you simply use the parser thread to create the timer, then the timer will fire using the parser thread. So you use the same API on any thread, as you are on the main thread. Worst case you can simply set the 'target' property on the nsITimer if for some reason you really can't create the timer using the parser thread. > > Also, to save batteries, only start the timer if we have pending things > > that need flushing. I.e. when we reach the end of a network buffer without > > getting blocked by a script > > For practical purposes, this is approximated by sTimerStartDelay being longer > than the subsequent delays. I don't see how this approximates what we want. The point is that it's unnecessary to have the timer repeatedly fire if no network data is coming in. Once the timer has fired and we flush, there is no reason to restart the timer until we again have pending unflushed tree ops.
Tokenizer comments nsHtml5Tokenizer::setContentModelFlag 111 nsHtml5Tokenizer::setContentModelFlag(PRInt32 contentModelFlag, nsIAtom* contentModelElement) Please use an enum for the 'contentModelFlag' argument, that way you can also scope it to the tokenizer and given it shorter names. Also, is "tokenizerState" or "stateFlag" a better name? I take it this what the spec refers to as the 'state' of the tokenizer? nsHtml5Tokenizer::contentModelElementToArray 175 default: { 176 177 return; 178 } Shouldn't you clear contentModelElementNameAsArray here? Or if it should never reach the default, please assert. 193 194 void 195 nsHtml5Tokenizer::clearStrBufAndAppendCurrentC(PRUnichar c) 196 { 197 strBuf[0] = c; 198 strBufLen = 1; 199 } 200 201 void 202 nsHtml5Tokenizer::clearStrBufAndAppendForceWrite(PRUnichar c) 203 { 204 strBuf[0] = c; 205 strBufLen = 1; 206 } Why two functions tht do the same thing? 208 void 209 nsHtml5Tokenizer::clearStrBufForNextState() 210 { 211 strBufLen = 0; 212 } Why the 'ForNextState' part? nsHtml5Tokenizer::emitStrBuf 241 if (strBufLen > 0) { 242 tokenHandler->characters(strBuf, 0, strBufLen); 243 } Is the tokenHandler ever anything other than the treebuilder? If not, we should rename the member to 'treeBuilder'. 246 void 247 nsHtml5Tokenizer::clearLongStrBufForNextState() 248 { 249 longStrBufLen = 0; 250 } 251 252 void 253 nsHtml5Tokenizer::clearLongStrBuf() 254 { 255 longStrBufLen = 0; 256 } 257 258 void 259 nsHtml5Tokenizer::clearLongStrBufAndAppendCurrentC(PRUnichar c) 260 { 261 longStrBuf[0] = c; 262 longStrBufLen = 1; 263 } 264 265 void 266 nsHtml5Tokenizer::clearLongStrBufAndAppendToComment(PRUnichar c) 267 { 268 longStrBuf[0] = c; 269 longStrBufLen = 1; 270 } Lots of duplication here, is there really a need for that? strBuf and longStrBuf is somewhat poor naming. It's unclear when to use which. Maybe identifierBuf and charBuf is better? Though why do we need both? Is the idea that we'll pass out the long buffer instead of copying it out? That doesn't appear to be happening currently. 285 nsHtml5Tokenizer::appendSecondHyphenToBogusComment() 286 { 287 appendLongStrBuf('-'); 288 } 289 290 void 291 nsHtml5Tokenizer::adjustDoubleHyphenAndAppendToLongStrBufAndErr(PRUnichar c) 292 { 293 294 appendLongStrBuf(c); 295 } These functions seem excessive. nsHtml5Tokenizer::flushChars 342 cstart = 0x7fffffff; PR_INT32_MAX? nsHtml5Tokenizer::emitCurrentTagToken 363 nsHtml5HtmlAttributes* attrs = (!attributes ? nsHtml5HtmlAttributes::EMPTY_ATTRIBUTES : attributes); 364 if (endTag) { 365 366 tokenHandler->endTag(tagName); 367 delete attributes; 368 } else { 369 tokenHandler->startTag(tagName, attrs, selfClosing); 370 } Move 'attrs' into the 'else'. You could possibly even get rid of the temporary. 362 stateSave = NS_HTML5TOKENIZER_DATA; ... 374 return stateSave; It seems like this function always returns the same value. If so, just get rid of the return value and use the value inline where needed. If not, add a comment explaining what happens. 391 void 392 nsHtml5Tokenizer::addAttributeWithoutValue() 393 { 394 395 if (!!attributeName) { 396 attributes->addAttribute(attributeName, nsHtml5Portability::newEmptyString()); 397 attributeName = nsnull; 398 } 399 } 400 401 void 402 nsHtml5Tokenizer::addAttributeWithValue() 403 { 404 if (!!attributeName) { 405 nsString* val = longStrBufToString(); 406 attributes->addAttribute(attributeName, val); 407 attributeName = nsnull; 408 } 409 } No need for the '!!'. Also, can these functions really be called without an attribute name? If that shouldn't happen, just assert and assume it's there. 411 void 412 nsHtml5Tokenizer::startErrorReporting() 413 { 414 } Remove this function? nsHtml5Tokenizer::tokenizeBuffer 460 if (pos == buffer->getEnd()) { 461 buffer->setStart(pos); 462 } else { 463 buffer->setStart(pos + 1); 464 } Why is this needed? Shouldn't we treat buffers as empty if the start == end? 3427 void 3428 nsHtml5Tokenizer::rememberAmpersandLocation(PRUnichar add) 3429 { 3430 additional = add; 3431 } This doesn't seem to remember a location at all, but rather a character. I guess this makes more sense in the validator code :( 3447 void 3448 nsHtml5Tokenizer::emitOrAppendStrBuf(PRInt32 returnState) 3449 { 3450 if ((returnState & (~1))) { 3451 appendStrBufToLongStrBuf(); 3452 } else { 3453 emitStrBuf(); 3454 } 3455 } Please create a #define for the '~1' and put it together with the definitions for the values that can be passed in here. That should reduce the risk that someone changes the flag values without updating this value. 3867 void 3868 nsHtml5Tokenizer::requestSuspension() 3869 { 3870 shouldSuspend = PR_TRUE; 3871 } 3872 3873 void 3874 nsHtml5Tokenizer::becomeConfident() 3875 { 3876 confident = PR_TRUE; 3877 } 3878 3879 PRBool 3880 nsHtml5Tokenizer::isNextCharOnNewLine() 3881 { 3882 return PR_FALSE; 3883 } 3884 3885 PRBool 3886 nsHtml5Tokenizer::isPrevCR() 3887 { 3888 return lastCR; 3889 } 3890 3891 PRInt32 3892 nsHtml5Tokenizer::getLine() 3893 { 3894 return -1; 3895 } 3896 3897 PRInt32 3898 nsHtml5Tokenizer::getCol() 3899 { 3900 return -1; 3901 } These all seem worthy of removal. More effects of validator code I guess :(
nsHtml5TreeBuilder.cpp The big thing here is that I'm worried that all the switch statements with loads of fallthroughs and gotos are going to make it incredibly hard to work with this code in the future. Is there any evidence that the fallthroughs are saving any cycles? If not I think I'd prefer to keep the code sane so that we can modify it in the future when needed. The gotos are harder due to the deep nesting of switch statements and loops making 'break' largely useless. I think there is a general over use of switch statements though. In many cases switch statements could be converted to normal if-else statements. In some cases the 'else' part isn't even needed. I've commented on a few specific instances below, but it's a more general problem. I do like the function names in this class though, in general much easier to understand what's going on without reading all the callers and/or function implementations. nsHtml5TreeBuilder::doctype 124 needToDropLF = PR_FALSE; 125 for (; ; ) { 126 switch(foreignFlag) { 127 case NS_HTML5TREE_BUILDER_IN_FOREIGN: { 128 goto doctypeloop_end; 129 } 130 default: { 131 switch(mode) { 132 case NS_HTML5TREE_BUILDER_INITIAL: { 133 nsString* emptyString = nsHtml5Portability::newEmptyString(); 134 appendDoctypeToDocument(!name ? nsHtml5Atoms::emptystring : name, !publicIdentifier ? emptyString : publicIdentifier, !systemIdentifier ? emptyString : systemIdentifier); 135 nsHtml5Portability::releaseString(emptyString); 136 if (isQuirky(name, publicIdentifier, systemIdentifier, forceQuirks)) { 137 138 documentModeInternal(QUIRKS_MODE, publicIdentifier, systemIdentifier, PR_FALSE); 139 } else if (isAlmostStandards(publicIdentifier, systemIdentifier)) { 140 141 documentModeInternal(ALMOST_STANDARDS_MODE, publicIdentifier, systemIdentifier, PR_FALSE); 142 } else { 143 documentModeInternal(STANDARDS_MODE, publicIdentifier, systemIdentifier, PR_FALSE); 144 } 145 mode = NS_HTML5TREE_BUILDER_BEFORE_HTML; 146 return; 147 } 148 default: { 149 goto doctypeloop_end; 150 } 151 } 152 } 153 } 154 } 155 doctypeloop_end: ; 156 157 return; What is the purpose of the loop here? It doesn't seem like we actually ever loop. Also, why the 'goto's rather than simply returning? nsHtml5TreeBuilder::comment 164 for (; ; ) { 165 switch(foreignFlag) { 166 case NS_HTML5TREE_BUILDER_IN_FOREIGN: { 167 goto commentloop_end; 168 } 169 default: { 170 switch(mode) { 171 case NS_HTML5TREE_BUILDER_INITIAL: 172 case NS_HTML5TREE_BUILDER_BEFORE_HTML: 173 case NS_HTML5TREE_BUILDER_AFTER_AFTER_BODY: 174 case NS_HTML5TREE_BUILDER_AFTER_AFTER_FRAMESET: { 175 appendCommentToDocument(buf, start, length); 176 return; 177 } 178 case NS_HTML5TREE_BUILDER_AFTER_BODY: { 179 flushCharacters(); 180 appendComment(stack[0]->node, buf, start, length); 181 return; 182 } 183 default: { 184 goto commentloop_end; 185 } 186 } 187 } 188 } 189 } 190 commentloop_end: ; Same here. It seems much cleaner written as if (foreignFlag != NS_HTML5TREE_BUILDER_IN_FOREIGN) { switch(mode) { case NS_HTML5TREE_BUILDER_INITIAL: case NS_HTML5TREE_BUILDER_BEFORE_HTML: case NS_HTML5TREE_BUILDER_AFTER_AFTER_BODY: case NS_HTML5TREE_BUILDER_AFTER_AFTER_FRAMESET: { appendCommentToDocument(buf, start, length); return; } case NS_HTML5TREE_BUILDER_AFTER_BODY: { flushCharacters(); appendComment(stack[0]->node, buf, start, length); return; } default: { } } } nsHtml5TreeBuilder::eof 436 switch(foreignFlag) { 437 case NS_HTML5TREE_BUILDER_IN_FOREIGN: { 438 439 while (stack[currentPtr]->ns != kNameSpaceID_XHTML) { 440 popOnEof(); 441 } 442 foreignFlag = NS_HTML5TREE_BUILDER_NOT_IN_FOREIGN; 443 } 444 default: 445 ; // fall through 446 } Doesn't seem to be any point in using a switch here. An if-statement would be much more readable. 464 case NS_HTML5TREE_BUILDER_IN_HEAD: { 465 if (currentPtr > 1) { 466 467 } 468 while (currentPtr > 0) { 469 popOnEof(); 470 } Remove the if-statement. nsHtml5TreeBuilder::startTag 764 if (!isCurrent(nsHtml5Atoms::table)) { 765 766 } Same here 831 if (currentPtr != eltPos) { 832 833 } There seems to be many more of these, I haven't commented on any more of them. This finishes the review of the bulk of the code. There are a few specific items I've got left to review, such as the flushing policy and notification batches. There's also a bunch more places that need more comments, even in the java code. In general every class function and member needs a description IMHO, though in some cases groups of functions/members can be described together. This is especially true in the tokenizer and tree builder which by nature implements very complicated algorithms, which makes it very hard to figure out what a variable is used for by looking at the code. We also need the comments in the java code to be carried over to the Cpp code, though I think there's a separate bug on that. I haven't found any real blockers for turning the parser on by default though, other than the ones I've explicitly noted already.
(In reply to comment #20) > > Would be nice if getListLength had something about formatting elements in its > > name (unless you remove it entirely, see previous point). > > I'll address this in a follow-up patch. Addressed by this patch.
Attachment #438967 - Flags: review?(jonas)
(In reply to comment #22) > What we've done in the past is to create static pointers to pointers to atoms. > I.e. things like &nsGkAtoms::foo. Dunno if that will work here. In any case, > I'll file a separate bug on this stuff. In that case, the type of the field would be nsIAtom** instead of nsIAtom*, which would mess up types in the code produced by the translator. Worse, simple pointer compares would be lost and be turned into a pointer dereference plus compare. > > > nsHtml5StreamParser: > > > 197 nsHtml5StreamParser::~nsHtml5StreamParser() > > > 198 { > > > 199 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > > > 200 mTokenizer->end(); > > > 201 mRequest = nsnull; > > > 202 mObserver = nsnull; > > > 203 mUnicodeDecoder = nsnull; > > > 204 mSniffingBuffer = nsnull; > > > 205 mMetaScanner = nsnull; > > > 206 mFirstBuffer = nsnull; > > > 207 mExecutor = nsnull; > > > 208 mTreeBuilder = nsnull; > > > 209 mTokenizer = nsnull; > > > 210 mOwner = nsnull; > > > 211 if (mFlushTimer) { > > > 212 mFlushTimer->Cancel(); > > > 213 mFlushTimer = nsnull; > > > 214 } > > > 215 } > > > > > > Is there a reason to null all these things out? > > > > Mainly to distinguish an already-deleted object more easily in a debugger. > > Shall I make these #ifdef DEBUG? > > Please do. Done. > > > 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch() > > > 980 { > > > 981 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > > > 982 mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); > > > 983 Uninterrupt(); > > > 984 nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this); > > > 985 if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) { > > > 986 NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation"); > > > 987 } > > > 988 } > > > > > > Is there a reason to uninterrupt from the main thread? Why not do it from the > > > parser thread as to avoid blocking the main thread. > > > > When calling Uninterrupt(), the main thread already owns mTokenizerMutex and > > Uninterrupt() doesn't acquire an additional mutex, so no additional blocking > > occurs. > > This doesn't appear to be the case. You're here explicitly grabbing the lock in > order to be able to safely call Uninterrupt. If you instead do that while on > the parser thread, there is no need for the main thread to block. Oh, right. In this case, that's true. I changed it the way you suggested. However, I'm nervous that this optimization for a rare case (the case where the shell is unable to fulfill a request to switch charsets fails) comes back to bite us later, because previously Interrupt() and Uninterrupt() were both called from the same thread so Uninterrupt() was guaranteed to cancel one Interrupt(). Now, Interrupt() is, as far as I can tell, called only in places that can't start rerunning until the parser thread has uninterrupted itself and posted more data to the main thread. However, if we later change how Interrupt() is called, it would be easy to introduce a situation where the main thread first posts a runnable to the parser thread, then interrupts and then the runnable cancels that new interrupt when it gets a chance to run on the parser thread. > > > nsHtml5StreamParser::ContinueAfterScripts > > > > > > 864 if (mSpeculations.IsEmpty()) { > > > 865 // Not quite sure how exactly this happens... > > > 866 // Maybe an artifact of defer scripts? > > > 867 return; > > > 868 } > > > > > > Sounds like an assertion would be good. > > > > I put an NS_WARNING here. > > Why not NS_ASSERTION? Because I'm not sure if there's a legitimate situation where that condition holds true. (I realize that not being sure about that isn't exactly good.) > > (In reply to comment #18) > > > You're also using the charset detector off the main thread, which I suspect > > > isn't safe. In fact, since you are creating it on the main thread, and then > > > calling it from the parser thread, I'm surprised this doesn't assert as it > > > doesn't seem to be implemented using threadsafe refcounting macros. > > > > > > This is also something we need to fix before we can turn on by default. Do you > > > want me to file a separate bug on this charset threadsafety stuff? > > > > Yes, a separate bug would be good to have if there really is a thread-safety > > problem left. As far as I can tell, there is no problem: After bug , the > > detector is refcounted only from the main thread. The sniffing is run only from > > the parser thread. The sniffing seems to affect only the internal state of the > > object. The data tables shared by the instances are baked into code ahead of > > compile time. > > I'll file a separate bug. If we're lucky (which it sounds like based on your > description) all we need to do is to move the creation of the detector to the > thread where its used. What would be the benefit except being able to avoid the detector creation in some cases and being able to delete it earlier in others? It seems to me that it's not a thread/ problem that it's allocated on one thread, used on another and destroyed on the creator thread but only after the other thread isn't going to use the object anyway. Moving the destruction of the detector off-the-main-thread would be annoying in the nsIParser::Terminate and cycle collector cases. Wouldn't those then require creating a runnable solely to clean up the detector? > > > 476 if (totalByteCount < aCount) { // mimicking nsScanner even though > > > this seems wrong > > > 477 ++totalByteCount; > > > 478 ++aFromSegment; > > > > > > Could you assert here? I agree this seems wrong. > > > > I added an NS_NOTREACHED in an else branch instead of merely asserting to avoid > > a buffer overrun if a decoder misbehaves. > > > > FWIW, the docs at > > http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/intl/uconv/public/nsIUnicodeDecoder.h#123 > > are not good enough. > > Would be great if you could fix them. I tried to fix the docs. I hope I wrote the right things! > > > If so, would calling it something like MetaCharsetFound make sense? > > > > "Meta charset" is not HTML5 spec terminology. "[character] encoding > > declaration" is spec terminology. That's why the method is named the way it is > > named. In general, I have tried to stick to spec terminology when naming things > > even when the spec terminology deviates from colloquial usage. (Consider > > "entities" vs. "named character refernces".) > > How about EncodingDeclarationFound? Function names that are simple nouns sound > like getters to me. The other callback names, for better or worse, use SAX-style naming and SAX uses spec-wise correct nouns, so having this one not be a noun would be inconsistent with the naming of the rest of the callbacks. > > I put the timer on the main thread, because I couldn't find a timer > > implementation that'd fire on a non-main thread. Also, initially I wanted also > > to check if the document is in the frontmost tab when the timer fires, but I > > gave up on that idea. > > Timers by default fire on the thread that created them. So if you simply use > the parser thread to create the timer, then the timer will fire using the > parser thread. So you use the same API on any thread, as you are on the main > thread. > > Worst case you can simply set the 'target' property on the nsITimer if for some > reason you really can't create the timer using the parser thread. I'm amazed that I've managed to overlook the target property. :-( I moved the timer to the parser thread. I have a question about the thread-safety of nsITimer::Cancel on multicores, though: nsTimerImpl::Cancel assigns to mCanceled without acquiring a mutex and nsTimerImpl::Fire() reads mCanceled without acquiring a mutex. However, before nsTimerImpl::Cancel returns, it calls TimerThread::RemoveTimer which acquires a mutex. Is that mutex acquisition enough to invalidate caches on other cores so that when nsTimerImpl::Cancel returns on one thread, another thread sees mCanceled as true in nsTimerImpl::Fire()? Or is nsTimerImpl even supposed to be thread-safe in the sense that you are allowed to Cancel from non-target thread? If Cancel isn't actually thread-safe, I need to do something more complicated with a runnable for canceling the timer when the parser is released early due to nsIParser::Terminate and the timer is armed. (In reply to comment #23) > Tokenizer comments > > nsHtml5Tokenizer::setContentModelFlag > > 111 nsHtml5Tokenizer::setContentModelFlag(PRInt32 contentModelFlag, nsIAtom* > contentModelElement) > > Please use an enum for the 'contentModelFlag' argument, that way you can also > scope it to the tokenizer and given it shorter names. I realize that it would be better code style to use enums both in Java and in C++. I actually used to have an enum there before the C++ porting effort even started. However, the switching on the tokenizer state is very performance-critical to the parser and the performance characteristics of enums in Java are surprisingly bad (well, at least of the PowerPC port of the HotSpot client VM that I benchmarked on back then). http://krijnhoetmer.nl/irc-logs/whatwg/20080427#l-256 So for perf reasons, I don't want to go back to enums on the Java side. As for the C++ side, I've viewed the generated code as a way of talking to the C++ compiler--not as much a way of expressing things to human readers--and when talking to the compiler, it shouldn't matter either way. Therefore, in the interest of not pouring effort excessively into the translator, I haven't tried to make the translator turn groups of ints into enums in the translation. > Also, is "tokenizerState" > or "stateFlag" a better name? I take it this what the spec refers to as the > 'state' of the tokenizer? When I wrote that piece of code, the spec had a concept of "content model flag". The concept has since been removed from the spec. Renamed various things in this area. > nsHtml5Tokenizer::contentModelElementToArray > > 175 default: { > 176 > 177 return; > 178 } > > Shouldn't you clear contentModelElementNameAsArray here? Or if it should never > reach the default, please assert. The Java code already asserts there. (That's why you see the empty line there.) Carrying the assertions over to C++ is bug 503190. I added a message to the assertion, though. > 193 > 194 void > 195 nsHtml5Tokenizer::clearStrBufAndAppendCurrentC(PRUnichar c) > 196 { > 197 strBuf[0] = c; > 198 strBufLen = 1; > 199 } > 200 > 201 void > 202 nsHtml5Tokenizer::clearStrBufAndAppendForceWrite(PRUnichar c) > 203 { > 204 strBuf[0] = c; > 205 strBufLen = 1; > 206 } > > Why two functions tht do the same thing? These are artifacts of an earlier attempt to boost performance by using the input buffer as the backing buffer for things that now go into strBuf and copying data into strBuf lazily if an input buffer boundary fell inside a name. It turned out that checking if it's necessary to copy data was slower than just unconditionally copying it, so the overall design didn't stay. However, I didn't clean up the redundant methods immediately in order to make it easier to experiment more later. I guess the experimentation isn't ever going to happen. Flattened these into one. > 208 void > 209 nsHtml5Tokenizer::clearStrBufForNextState() > 210 { > 211 strBufLen = 0; > 212 } > > Why the 'ForNextState' part? If the buffer were used in the current state, the variant that puts a single PRUnichar into the buffer would be appropriate. Renamed anyway. Also inlined methods like this. > nsHtml5Tokenizer::emitStrBuf > 241 if (strBufLen > 0) { > 242 tokenHandler->characters(strBuf, 0, strBufLen); > 243 } > > Is the tokenHandler ever anything other than the treebuilder? If not, we should > rename the member to 'treeBuilder'. In the Java test harness for the tokenizer, it's not the tree builder. I expect it not to be always be the tree builder even in C++ once I've fixed bug 482909. When the Java version has interfaces but the C++ doesn't, I've made the translator flatten the types but not rename the members. > 246 void > 247 nsHtml5Tokenizer::clearLongStrBufForNextState() > 248 { > 249 longStrBufLen = 0; > 250 } > 251 > 252 void > 253 nsHtml5Tokenizer::clearLongStrBuf() > 254 { > 255 longStrBufLen = 0; > 256 } > 257 > 258 void > 259 nsHtml5Tokenizer::clearLongStrBufAndAppendCurrentC(PRUnichar c) > 260 { > 261 longStrBuf[0] = c; > 262 longStrBufLen = 1; > 263 } > 264 > 265 void > 266 nsHtml5Tokenizer::clearLongStrBufAndAppendToComment(PRUnichar c) > 267 { > 268 longStrBuf[0] = c; > 269 longStrBufLen = 1; > 270 } > > Lots of duplication here, is there really a need for that? Not any longer as mentioned above. Removed redundancy and marked as inline. > strBuf and longStrBuf is somewhat poor naming. It's unclear when to use which. > Maybe identifierBuf and charBuf is better? Though why do we need both? Is the > idea that we'll pass out the long buffer instead of copying it out? That > doesn't appear to be happening currently. We don't need two buffers that are capable of growing without bound. (A second buffer whose length is capped at the length of the longest named name (plus 1?) would be practical to have around, though, now that the named character name tables no longer contain the full names.) I thought I had filed a bug about this long ago, but now I can't find it, so I filed bug 559303. > 285 nsHtml5Tokenizer::appendSecondHyphenToBogusComment() > 286 { > 287 appendLongStrBuf('-'); > 288 } > 289 > 290 void > 291 nsHtml5Tokenizer::adjustDoubleHyphenAndAppendToLongStrBufAndErr(PRUnichar > c) > 292 { > 293 > 294 appendLongStrBuf(c); > 295 } > > These functions seem excessive. These methods are part of the XML Infoset coercion feature of the Java version of the parser. The latter is also part of error reporting. I marked them inline in C++ to make their invocation overhead compile away for sure. > nsHtml5Tokenizer::flushChars > > 342 cstart = 0x7fffffff; > > PR_INT32_MAX? Curious. I though I was already using that there. I had translator support and everything already in place. Fixed. > nsHtml5Tokenizer::emitCurrentTagToken > > 363 nsHtml5HtmlAttributes* attrs = (!attributes ? > nsHtml5HtmlAttributes::EMPTY_ATTRIBUTES : attributes); > 364 if (endTag) { > 365 > 366 tokenHandler->endTag(tagName); > 367 delete attributes; > 368 } else { > 369 tokenHandler->startTag(tagName, attrs, selfClosing); > 370 } > > Move 'attrs' into the 'else'. You could possibly even get rid of the temporary. 'attrs' is used on the blank line of the 'if' branch but only when error reporting is enabled. > 362 stateSave = NS_HTML5TOKENIZER_DATA; > ... > 374 return stateSave; > > It seems like this function always returns the same value. If so, just get rid > of the return value and use the value inline where needed. If not, add a > comment explaining what happens. The method calls into the token handler, and the token handler can call back into the tokenizer and change stateSave. Added comment (though it doesn't show up in C++ because comments aren't copied over yet). > 391 void > 392 nsHtml5Tokenizer::addAttributeWithoutValue() > 393 { > 394 > 395 if (!!attributeName) { > 396 attributes->addAttribute(attributeName, > nsHtml5Portability::newEmptyString()); > 397 attributeName = nsnull; > 398 } > 399 } > 400 > 401 void > 402 nsHtml5Tokenizer::addAttributeWithValue() > 403 { > 404 if (!!attributeName) { > 405 nsString* val = longStrBufToString(); > 406 attributes->addAttribute(attributeName, val); > 407 attributeName = nsnull; > 408 } > 409 } > > No need for the '!!'. See comment 20. > Also, can these functions really be called without an > attribute name? If that shouldn't happen, just assert and assume it's there. They can in the end tag case. > 411 void > 412 nsHtml5Tokenizer::startErrorReporting() > 413 { > 414 } > > Remove this function? Removed. > nsHtml5Tokenizer::tokenizeBuffer > > 460 if (pos == buffer->getEnd()) { > 461 buffer->setStart(pos); > 462 } else { > 463 buffer->setStart(pos + 1); > 464 } > > Why is this needed? Shouldn't we treat buffers as empty if the start == end? The first is about getting ready to return because the tokenizer loop hit the end of the buffer. The else branch is about two things: Either the parser is suspending after a </script> or a charset <meta> and pos is the index of '>' or the tokenizer is burping at a CR and pos is the index of the CR. In either of those cases, when the buffer is examined the next time, the position should point to the character after the '>' or the CR (which may actually mean that the buffer gets marked as empty if pos pointed to the last character). > 3427 void > 3428 nsHtml5Tokenizer::rememberAmpersandLocation(PRUnichar add) > 3429 { > 3430 additional = add; > 3431 } > > This doesn't seem to remember a location at all, but rather a character. I > guess this makes more sense in the validator code :( I remembers the ampersand location in the Java code. The argument 'add' was introduced to align with a later spec change and I failed to make the name of the method more elaborate at that point. Renamed. > 3447 void > 3448 nsHtml5Tokenizer::emitOrAppendStrBuf(PRInt32 returnState) > 3449 { > 3450 if ((returnState & (~1))) { > 3451 appendStrBufToLongStrBuf(); > 3452 } else { > 3453 emitStrBuf(); > 3454 } > 3455 } > > Please create a #define for the '~1' and put it together with the definitions > for the values that can be passed in here. That should reduce the risk that > someone changes the flag values without updating this value. Fixed. > 3867 void > 3868 nsHtml5Tokenizer::requestSuspension() > 3869 { > 3870 shouldSuspend = PR_TRUE; > 3871 } > 3872 > 3873 void > 3874 nsHtml5Tokenizer::becomeConfident() > 3875 { > 3876 confident = PR_TRUE; > 3877 } > 3878 > 3879 PRBool > 3880 nsHtml5Tokenizer::isNextCharOnNewLine() > 3881 { > 3882 return PR_FALSE; > 3883 } > 3884 > 3885 PRBool > 3886 nsHtml5Tokenizer::isPrevCR() > 3887 { > 3888 return lastCR; > 3889 } > 3890 > 3891 PRInt32 > 3892 nsHtml5Tokenizer::getLine() > 3893 { > 3894 return -1; > 3895 } > 3896 > 3897 PRInt32 > 3898 nsHtml5Tokenizer::getCol() > 3899 { > 3900 return -1; > 3901 } > > These all seem worthy of removal. More effects of validator code I guess :( requestSuspension() is used in the </script> and charset <meta> cases in Gecko. The others are Java-only, so I removed them from C++. (In reply to comment #24) > nsHtml5TreeBuilder.cpp > > The big thing here is that I'm worried that all the switch statements with > loads of fallthroughs and gotos are going to make it incredibly hard to work > with this code in the future. Is there any evidence that the fallthroughs are > saving any cycles? In the tokenizer the fallthroughs are an attempt at saving cycles in the Java case (saving cycles in the C++ case is bug 482920). In the tree builder the fallthroughs are there mainly to avoid code duplication while keeping the structure of the spec recognizable in the code structure. The spec has a lot of fall throughs to other modes, so they are implemented as fall throughs in the code. Given the structure of the spec, I think the fallthroughs actually make things maintainable in the first place if switch is used at all. Whether it's a good idea to use switch instead of having a class per mode inheriting from a pure-virtual class is another question, but I'd be very, very reluctant to change the overall structure of the tree builder that radically at this point in time. I do realize that fixing bug 552908 in the current architecture is going to lead to ugly code, but that spec feature didn't exist when I designed the architecture. Finding out how a bunch of virtual calls would compare to huge switches from the perf POV would involve refactoring the whole thing to use the other way and then benchmarking. As for the gotos, they are all breaks and continues in the Java code. They aren't used in random ways--they are used in 3 conceptually simple ways, so they should be fine for maintenance. Unfortunately, C++ doesn't have the feature of break and continue by label, so I had to make the translator emulate those with goto. The three concepts are: 1) Done processing this token: break starttagloop; 2) Reprocess the token: continue starttagloop; 3) Process the token according to the rules of another mode (later in the order of modes): break statenameloop; > If not I think I'd prefer to keep the code sane so that we > can modify it in the future when needed. I believe the code is quite modifiable by modifying the Java source and rerunning the translator. > The gotos are harder due to the deep nesting of switch statements and loops > making 'break' largely useless. I think the inability to break the a from a switch inside the loop is a defect in C++. Anyway, the place to make modifications is the TreeBuilder.java and nsHtml5TreeBuilder.cpp is just an intermediate file for feeding it into the C++ compiler. > I think there is a general over use of switch statements though. In many cases > switch statements could be converted to normal if-else statements. In some > cases the 'else' part isn't even needed. I've commented on a few specific > instances below, but it's a more general problem. For consistency, when certain variables are switched on in the startTag method, I've also switched on them when handling other tokens. Maybe that wasn't a great idea in retrospect. It does make the different token handler methods more consistent among themselves, though. > nsHtml5TreeBuilder::doctype > > 124 needToDropLF = PR_FALSE; > 125 for (; ; ) { > 126 switch(foreignFlag) { > 127 case NS_HTML5TREE_BUILDER_IN_FOREIGN: { > 128 goto doctypeloop_end; > 129 } > 130 default: { > 131 switch(mode) { > 132 case NS_HTML5TREE_BUILDER_INITIAL: { > 133 nsString* emptyString = nsHtml5Portability::newEmptyString(); > 134 appendDoctypeToDocument(!name ? nsHtml5Atoms::emptystring : > name, !publicIdentifier ? emptyString : publicIdentifier, !systemIdentifier ? > emptyString : systemIdentifier); > 135 nsHtml5Portability::releaseString(emptyString); > 136 if (isQuirky(name, publicIdentifier, systemIdentifier, > forceQuirks)) { > 137 > 138 documentModeInternal(QUIRKS_MODE, publicIdentifier, > systemIdentifier, PR_FALSE); > 139 } else if (isAlmostStandards(publicIdentifier, > systemIdentifier)) { > 140 > 141 documentModeInternal(ALMOST_STANDARDS_MODE, publicIdentifier, > systemIdentifier, PR_FALSE); > 142 } else { > 143 documentModeInternal(STANDARDS_MODE, publicIdentifier, > systemIdentifier, PR_FALSE); > 144 } > 145 mode = NS_HTML5TREE_BUILDER_BEFORE_HTML; > 146 return; > 147 } > 148 default: { > 149 goto doctypeloop_end; > 150 } > 151 } > 152 } > 153 } > 154 } > 155 doctypeloop_end: ; > 156 > 157 return; > > What is the purpose of the loop here? It doesn't seem like we actually ever > loop. The purpose is to be able to be break of it to jump forward. > Also, why the 'goto's rather than simply returning? In the error reporting cases, there's the line err("Stray doctype."); before the return where you see a blank line in C++. To use return directly in all the cases that now break out of the loop, the error emission line would have to be duplicated everywhere. > nsHtml5TreeBuilder::comment > > 164 for (; ; ) { > 165 switch(foreignFlag) { > 166 case NS_HTML5TREE_BUILDER_IN_FOREIGN: { > 167 goto commentloop_end; > 168 } > 169 default: { > 170 switch(mode) { > 171 case NS_HTML5TREE_BUILDER_INITIAL: > 172 case NS_HTML5TREE_BUILDER_BEFORE_HTML: > 173 case NS_HTML5TREE_BUILDER_AFTER_AFTER_BODY: > 174 case NS_HTML5TREE_BUILDER_AFTER_AFTER_FRAMESET: { > 175 appendCommentToDocument(buf, start, length); > 176 return; > 177 } > 178 case NS_HTML5TREE_BUILDER_AFTER_BODY: { > 179 flushCharacters(); > 180 appendComment(stack[0]->node, buf, start, length); > 181 return; > 182 } > 183 default: { > 184 goto commentloop_end; > 185 } > 186 } > 187 } > 188 } > 189 } > 190 commentloop_end: ; > > Same here. It seems much cleaner written as > > if (foreignFlag != NS_HTML5TREE_BUILDER_IN_FOREIGN) { > switch(mode) { > case NS_HTML5TREE_BUILDER_INITIAL: > case NS_HTML5TREE_BUILDER_BEFORE_HTML: > case NS_HTML5TREE_BUILDER_AFTER_AFTER_BODY: > case NS_HTML5TREE_BUILDER_AFTER_AFTER_FRAMESET: { > appendCommentToDocument(buf, start, length); > return; > } > case NS_HTML5TREE_BUILDER_AFTER_BODY: { > flushCharacters(); > appendComment(stack[0]->node, buf, start, length); > return; > } > default: { > } > } > } Yeah, that switch on foreignFlag is mainly about being pointlessly consistent with startTag. Fixed. > nsHtml5TreeBuilder::eof > > 436 switch(foreignFlag) { > 437 case NS_HTML5TREE_BUILDER_IN_FOREIGN: { > 438 > 439 while (stack[currentPtr]->ns != kNameSpaceID_XHTML) { > 440 popOnEof(); > 441 } > 442 foreignFlag = NS_HTML5TREE_BUILDER_NOT_IN_FOREIGN; > 443 } > 444 default: > 445 ; // fall through > 446 } > > Doesn't seem to be any point in using a switch here. An if-statement would be > much more readable. Same here. Fixed. > 464 case NS_HTML5TREE_BUILDER_IN_HEAD: { > 465 if (currentPtr > 1) { > 466 > 467 } > 468 while (currentPtr > 0) { > 469 popOnEof(); > 470 } > > Remove the if-statement. > > nsHtml5TreeBuilder::startTag > 764 if (!isCurrent(nsHtml5Atoms::table)) { > 765 > 766 } > > Same here > > 831 if (currentPtr != eltPos) { > 832 > 833 } > > There seems to be many more of these, I haven't commented on any more of them. In all these cases, there's a method call for reporting an error inside the if. With the exception of the one that calls isCurrent(), it should be safe to assume that the C++ compiler optimizes these away. Apart from relying on the C++ compiler's optimizer, I haven't tried to put effort into making the translator produce more elegant output here, because I have been assumming we might actually want to report tree builder errors to the error console eventually (bug 512229). > This finishes the review of the bulk of the code. Thank you! I'd appreciate it if you could r+ the patches on this bug, too, to get them landed in order to avoid merge conflicts. > There's also a bunch more places that need more comments, even in the java > code. In general every class function and member needs a description IMHO, OK. I'll write JavaDocs for every member. > We also need the comments in the java code to be carried over to the Cpp code, > though I think there's a separate bug on that. There was only bug 539213. I filed bug 559547. > I haven't found any real blockers for turning the parser on by default though, > other than the ones I've explicitly noted already. That is, the flush timer and chardet creation?
Attachment #439222 - Flags: review?(jonas)
> > > > 979 nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch() > > > > 980 { > > > > 981 NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > > > > 982 mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); > > > > 983 Uninterrupt(); > > > > 984 nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this); > > > > 985 if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) { > > > > 986 NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation"); > > > > 987 } > > > > 988 } > > > > > > > > Is there a reason to uninterrupt from the main thread? Why not do it from the > > > > parser thread as to avoid blocking the main thread. > > > > > > When calling Uninterrupt(), the main thread already owns mTokenizerMutex and > > > Uninterrupt() doesn't acquire an additional mutex, so no additional blocking > > > occurs. > > > > This doesn't appear to be the case. You're here explicitly grabbing the lock in > > order to be able to safely call Uninterrupt. If you instead do that while on > > the parser thread, there is no need for the main thread to block. > > Oh, right. In this case, that's true. > > I changed it the way you suggested. However, I'm nervous that this optimization > for a rare case (the case where the shell is unable to fulfill a request to > switch charsets fails) comes back to bite us later, because previously > Interrupt() and Uninterrupt() were both called from the same thread so > Uninterrupt() was guaranteed to cancel one Interrupt(). Now, Interrupt() is, as > far as I can tell, called only in places that can't start rerunning until the > parser thread has uninterrupted itself and posted more data to the main thread. > However, if we later change how Interrupt() is called, it would be easy to > introduce a situation where the main thread first posts a runnable to the > parser thread, then interrupts and then the runnable cancels that new interrupt > when it gets a chance to run on the parser thread. I think i'm following you, but not completely sure :) I agree that if things get too complicated and we start getting concerned that a call to Uninterrupt might start Uninterrupting an different Interript than it was scheduled for, then we might need to roll back this change. Ideally we won't get there though, but if we do, lets cross that bridge then. > > > > nsHtml5StreamParser::ContinueAfterScripts > > > > > > > > 864 if (mSpeculations.IsEmpty()) { > > > > 865 // Not quite sure how exactly this happens... > > > > 866 // Maybe an artifact of defer scripts? > > > > 867 return; > > > > 868 } > > > > > > > > Sounds like an assertion would be good. > > > > > > I put an NS_WARNING here. > > > > Why not NS_ASSERTION? > > Because I'm not sure if there's a legitimate situation where that condition > holds true. (I realize that not being sure about that isn't exactly good.) That's fine. If the assertion starts firing in legitimate situations we'll see what that situation is and can replace the current comment with a description of why this is needed. Assertions aren't in release builds so no risk that millions of users will be affected. > > > (In reply to comment #18) > > > > You're also using the charset detector off the main thread, which I suspect > > > > isn't safe. In fact, since you are creating it on the main thread, and then > > > > calling it from the parser thread, I'm surprised this doesn't assert as it > > > > doesn't seem to be implemented using threadsafe refcounting macros. > > > > > > > > This is also something we need to fix before we can turn on by default. Do you > > > > want me to file a separate bug on this charset threadsafety stuff? > > > > > > Yes, a separate bug would be good to have if there really is a thread-safety > > > problem left. As far as I can tell, there is no problem: After bug , the > > > detector is refcounted only from the main thread. The sniffing is run only from > > > the parser thread. The sniffing seems to affect only the internal state of the > > > object. The data tables shared by the instances are baked into code ahead of > > > compile time. > > > > I'll file a separate bug. If we're lucky (which it sounds like based on your > > description) all we need to do is to move the creation of the detector to the > > thread where its used. > > What would be the benefit except being able to avoid the detector creation in > some cases and being able to delete it earlier in others? It seems to me that > it's not a thread/ problem that it's allocated on one thread, used on another > and destroyed on the creator thread but only after the other thread isn't going > to use the object anyway. The problem with the current situation is that if someone changes the implementation of the detector to do an extra addref/release inside one of the functions that you are calling on the separate thread, then that person will start hitting a bunch of threadsafty warnings. Despite that person not doing anything wrong, or there really being a problem anywhere. However that person won't be able to check in his/her changes until they've figured out how the HTML5 parser works and how to change it to not fire the assertions. > Moving the destruction of the detector > off-the-main-thread would be annoying in the nsIParser::Terminate and cycle > collector cases. Wouldn't those then require creating a runnable solely to > clean up the detector? One solution would be to only do this in debug builds, as that is the only place where we'll hit the refcounting assertions. > > > > If so, would calling it something like MetaCharsetFound make sense? > > > > > > "Meta charset" is not HTML5 spec terminology. "[character] encoding > > > declaration" is spec terminology. That's why the method is named the way it is > > > named. In general, I have tried to stick to spec terminology when naming things > > > even when the spec terminology deviates from colloquial usage. (Consider > > > "entities" vs. "named character refernces".) > > > > How about EncodingDeclarationFound? Function names that are simple nouns sound > > like getters to me. > > The other callback names, for better or worse, use SAX-style naming and SAX > uses spec-wise correct nouns, so having this one not be a noun would be > inconsistent with the naming of the rest of the callbacks. My impression is that right now it's sort of a mix, which is what makes things hard to understand. Yes, some "interfaces" use SAX-style naming, but they are implemented on classes which implemented both sax-style naming as well as other interfaces, thus creating the mix. The main problem was that it was very hard to find docs. One solution is to group functions together based on which interface they implement and then create clear headers showing which interface is being implemented. That makes it easier to find your way to the interface where the function should be documented. For example like: http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsGenericElement.h#351 > I have a question about the thread-safety of nsITimer::Cancel on multicores, > though: > nsTimerImpl::Cancel assigns to mCanceled without acquiring a mutex and > nsTimerImpl::Fire() reads mCanceled without acquiring a mutex. However, before > nsTimerImpl::Cancel returns, it calls TimerThread::RemoveTimer which acquires a > mutex. Is that mutex acquisition enough to invalidate caches on other cores so > that when nsTimerImpl::Cancel returns on one thread, another thread sees > mCanceled as true in nsTimerImpl::Fire()? > > Or is nsTimerImpl even supposed to be thread-safe in the sense that you are > allowed to Cancel from non-target thread? If Cancel isn't actually thread-safe, > I need to do something more complicated with a runnable for canceling the timer > when the parser is released early due to nsIParser::Terminate and the timer is > armed. I'm not sure if aquiring a mutex will flush all cores or just the current one. However in any case, when calling Cancel() you're always running the risk that the other thread is already passed the are-we-cancelled check and thus will fire. In fact, it might already be in the midst of firing already. So if you just call cancel, you need to be aware that the parser might already be currently flushing. Not sure if that's a problem yet. Haven't yet looked at the patch. > So for perf reasons, I don't want to go back to enums on the Java side. As for > the C++ side, I've viewed the generated code as a way of talking to the C++ > compiler--not as much a way of expressing things to human readers--and when > talking to the compiler, it shouldn't matter either way. Therefore, in the > interest of not pouring effort excessively into the translator, I haven't tried > to make the translator turn groups of ints into enums in the translation. I think this is where we differ in view :) I think the C++ code needs to be readable to humans too, for a few reasons. The most obvious reason is that during profiling and debugging, it's the C++ code gecko developers will be looking at. Also, I wouldn't be surprised if we'll eventually end up forking the java and C++ code. Though by no means I want that to happen anytime soon. If that happens we'll be working with the C++ code directly. I am however fine with using #defines until we deem forking necessary, if that ever happens. > > nsHtml5Tokenizer::tokenizeBuffer > > > > 460 if (pos == buffer->getEnd()) { > > 461 buffer->setStart(pos); > > 462 } else { > > 463 buffer->setStart(pos + 1); > > 464 } > > > > Why is this needed? Shouldn't we treat buffers as empty if the start == end? > > The first is about getting ready to return because the tokenizer loop hit the > end of the buffer. The else branch is about two things: Either the parser is > suspending after a </script> or a charset <meta> and pos is the index of '>' or > the tokenizer is burping at a CR and pos is the index of the CR. In either of > those cases, when the buffer is examined the next time, the position should > point to the character after the '>' or the CR (which may actually mean that > the buffer gets marked as empty if pos pointed to the last character). Ah, so pos always points to the last character consumed? > (In reply to comment #24) > > nsHtml5TreeBuilder.cpp > > > > The big thing here is that I'm worried that all the switch statements with > > loads of fallthroughs and gotos are going to make it incredibly hard to work > > with this code in the future. Is there any evidence that the fallthroughs are > > saving any cycles? > > In the tokenizer the fallthroughs are an attempt at saving cycles in the Java > case (saving cycles in the C++ case is bug 482920). > > In the tree builder the fallthroughs are there mainly to avoid code duplication > while keeping the structure of the spec recognizable in the code structure. The > spec has a lot of fall throughs to other modes, so they are implemented as fall > throughs in the code. Given the structure of the spec, I think the fallthroughs > actually make things maintainable in the first place if switch is used at all. Ok, if the fallthroughs follow the spec then that makes things better I agree. > Whether it's a good idea to use switch instead of having a class per mode > inheriting from a pure-virtual class is another question, but I'd be very, very > reluctant to change the overall structure of the tree builder that radically at > this point in time. I do realize that fixing bug 552908 in the current > architecture is going to lead to ugly code, but that spec feature didn't exist > when I designed the architecture. Finding out how a bunch of virtual calls > would compare to huge switches from the perf POV would involve refactoring the > whole thing to use the other way and then benchmarking. 100% agreed. We definitely shouldn't do any such major refactorings right now. > As for the gotos, they are all breaks and continues in the Java code. They > aren't used in random ways--they are used in 3 conceptually simple ways, so > they should be fine for maintenance. Unfortunately, C++ doesn't have the > feature of break and continue by label, so I had to make the translator emulate > those with goto. > > The three concepts are: > 1) Done processing this token: break starttagloop; > 2) Reprocess the token: continue starttagloop; > 3) Process the token according to the rules of another mode (later in the > order of modes): break statenameloop; I agree, the lack of labeled breaks in C++ sucks. But the limitation is there and we have to live with it. Like I said above, I think the C++ code needs to be readable to a human. I don't have a suggestion for how to improve the current situation, other than trying to use switch-statements less and if-statements more. There are quite a few switch statements with only one or two cases other than the default. In these situations if-statements would likely be cleaner and might make it possible to use plain breaks in a few more places. I think this is something that we can deal with separately though. > > nsHtml5TreeBuilder::doctype > > > > 124 needToDropLF = PR_FALSE; > > 125 for (; ; ) { > > 126 switch(foreignFlag) { > > 127 case NS_HTML5TREE_BUILDER_IN_FOREIGN: { > > 128 goto doctypeloop_end; > > 129 } > > 130 default: { > > 131 switch(mode) { > > 132 case NS_HTML5TREE_BUILDER_INITIAL: { > > 133 nsString* emptyString = nsHtml5Portability::newEmptyString(); > > 134 appendDoctypeToDocument(!name ? nsHtml5Atoms::emptystring : > > name, !publicIdentifier ? emptyString : publicIdentifier, !systemIdentifier ? > > emptyString : systemIdentifier); > > 135 nsHtml5Portability::releaseString(emptyString); > > 136 if (isQuirky(name, publicIdentifier, systemIdentifier, > > forceQuirks)) { > > 137 > > 138 documentModeInternal(QUIRKS_MODE, publicIdentifier, > > systemIdentifier, PR_FALSE); > > 139 } else if (isAlmostStandards(publicIdentifier, > > systemIdentifier)) { > > 140 > > 141 documentModeInternal(ALMOST_STANDARDS_MODE, publicIdentifier, > > systemIdentifier, PR_FALSE); > > 142 } else { > > 143 documentModeInternal(STANDARDS_MODE, publicIdentifier, > > systemIdentifier, PR_FALSE); > > 144 } > > 145 mode = NS_HTML5TREE_BUILDER_BEFORE_HTML; > > 146 return; > > 147 } > > 148 default: { > > 149 goto doctypeloop_end; > > 150 } > > 151 } > > 152 } > > 153 } > > 154 } > > 155 doctypeloop_end: ; > > 156 > > 157 return; > > > > What is the purpose of the loop here? It doesn't seem like we actually ever > > loop. > > The purpose is to be able to be break of it to jump forward. But the switch statement is right inside it, so any break will break out of that. Also, there aren't actually any break statements in the loop/switch. > Yeah, that switch on foreignFlag is mainly about being pointlessly consistent > with startTag. > > Fixed. For what it's worth, the switch in startTag seems unneeded too and better written as an if-statement. > > 464 case NS_HTML5TREE_BUILDER_IN_HEAD: { > > 465 if (currentPtr > 1) { > > 466 > > 467 } > > 468 while (currentPtr > 0) { > > 469 popOnEof(); > > 470 } > > > > Remove the if-statement. > > > > nsHtml5TreeBuilder::startTag > > 764 if (!isCurrent(nsHtml5Atoms::table)) { > > 765 > > 766 } > > > > Same here > > > > 831 if (currentPtr != eltPos) { > > 832 > > 833 } > > > > There seems to be many more of these, I haven't commented on any more of them. > > In all these cases, there's a method call for reporting an error inside the if. > With the exception of the one that calls isCurrent(), it should be safe to > assume that the C++ compiler optimizes these away. Apart from relying on the > C++ compiler's optimizer, I haven't tried to put effort into making the > translator produce more elegant output here, because I have been assumming we > might actually want to report tree builder errors to the error console > eventually (bug 512229). I really don't think we'll want to do that anytime soon. Netscape tried this back in the days and the result was a ton of added complexity to the parser which we're still paying for a decade later and yet have never taken advantage of. I wouldn't be surprised if some of the quirky parsing behaviors in gecko which have restricted HTML5 was also a result of this. Code is changeable, if we'll ever want to do error reporting for parse errors we can modify the code to deal with that then. > > I haven't found any real blockers for turning the parser on by default though, > > other than the ones I've explicitly noted already. > > That is, the flush timer and chardet creation? Yeah. Reviewing the patches here now.
Attachment #432525 - Flags: review?(jonas) → review+
Comment on attachment 437849 [details] [diff] [review] Address review comments in nsHtml5StreamParser.cpp >- if (totalByteCount < aCount) { // mimicking nsScanner even though this seems wrong >+ // There's an illegal byte in the input. It's now the responsibility >+ // of this calling code to output a U+FFFD REPLACEMENT CHARACTER and >+ // reset the decoder. >+ >+ if (totalByteCount < aCount) { >+ // advance over the bad byte > ++totalByteCount; > ++aFromSegment; > } >+#ifdef DEBUG >+ else { >+ NS_NOTREACHED("The decoder signaled an error but consumed all input."); >+ } >+#endif Simply putting NS_ASSERTION(totalByteCount < aCount, "...") before the if statement would avoid the #ifdef r=me either way.
Attachment #437849 - Flags: review?(jonas) → review+
Comment on attachment 439220 [details] [diff] [review] Address more review comments in C++ only parts > However in any case, when calling Cancel() you're always running the risk that > the other thread is already passed the are-we-cancelled check and thus will > fire. In fact, it might already be in the midst of firing already. OK. Not just me thinking Cancel/Fire don't look right to be thread-safe, then. (I heard nsITimer was thread-safe, tried to verify it, it didn't look right, so I started doubting myself and thinking it was thread-safe in some non-obvious way.)
Attachment #439220 - Flags: review?(jonas)
(In reply to comment #29) > Simply putting NS_ASSERTION(totalByteCount < aCount, "...") before the if > statement would avoid the #ifdef > > r=me either way. Thanks. I made that change. http://hg.mozilla.org/mozilla-central/rev/47327da4d134 http://hg.mozilla.org/mozilla-central/rev/a8327985f6e4
Attached patch Address more review comments in C++ only parts (obsolete) (deleted) — Splinter Review
This patch works around the thread-unsafety of nsITimer::Cancel.
Attachment #439220 - Attachment is obsolete: true
Attachment #439510 - Flags: review?(jonas)
Comment on attachment 432403 [details] [diff] [review] Address review comments in nsHtml5Parser.cpp Looks good. Though please include a diff created with -w as well as the full patch when you're doing indentation changes.
Attachment #432403 - Flags: review?(jonas) → review+
Attachment #438967 - Flags: review?(jonas) → review+
Attachment #439222 - Flags: review?(jonas) → review+
Comment on attachment 439222 [details] [diff] [review] Address more review comments in Java parts >@@ -793,39 +793,33 @@ >+ if (foreignFlag != IN_FOREIGN) { >+ switch (mode) { >+ case INITIAL: >+ case BEFORE_HTML: >+ case AFTER_AFTER_BODY: >+ case AFTER_AFTER_FRAMESET: Indentation here is a bit strange, elsewhere you've indented the cases. Also, there's tabs here, which isn't used elsewhere. >+ /* >+ * A comment token Append a Comment node to the Document object >+ * with the data attribute set to the data given in the comment >+ * token. >+ */ >+ appendCommentToDocument(buf, start, length); >+ return; >+ case AFTER_BODY: >+ /* >+ * A comment token Append a Comment node to the first element in >+ * the stack of open elements (the html element), with the data >+ * attribute set to the data given in the comment token. >+ */ >+ flushCharacters(); >+ appendComment(stack[0].node, buf, start, length); >+ return; >+ default: >+ break; This used to break out differently. Was that change intentional or am I reading it wrong? Looks good otherwise.
Comment on attachment 439510 [details] [diff] [review] Address more review comments in C++ only parts >+class nsHtml5TimerKungFu : public nsRunnable >+{ >+private: >+ nsHtml5RefPtr<nsHtml5StreamParser> mStreamParser; >+public: >+ nsHtml5TimerKungFu(nsHtml5StreamParser* aStreamParser) >+ : mStreamParser(aStreamParser) >+ {} >+ NS_IMETHODIMP Run() >+ { >+ if (mStreamParser->mFlushTimer) { >+ mStreamParser->mFlushTimer->Cancel(); >+ mStreamParser->mFlushTimer = nsnull; >+ } >+ return NS_OK; >+ } >+}; >+ >+void >+nsHtml5StreamParser::DropTimer() >+{ >+ if (mFlushTimer) { >+ nsCOMPtr<nsIRunnable> event = new nsHtml5TimerKungFu(this); >+ if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) { >+ NS_WARNING("Failed to dispatch TimerKungFu event"); >+ } >+ } >+} I'm not sure I understand the purpose of canceling the timer on the parser thread. The problem that was brought up earlier was that calling cancel from the main thread might not prevent the timer from just having started firing. However this solution has the same problem. The timer might fire between the call to nsHtml5StreamParser::DropTimer and the call to nsHtml5TimerKungFu::Run. For that matter, the problem exists in the code that is in the tree already. By the time we cancel the main-thread-only timer, nsHtml5StreamParserTimerFlusher::Run might have just started running on the parser thread. Also, it would be great if you could diff patches with -P. Not a huge deal though.
(In reply to comment #35) > I'm not sure I understand the purpose of canceling the timer on the parser > thread. The problem that was brought up earlier was that calling cancel from > the main thread might not prevent the timer from just having started firing. > > However this solution has the same problem. The timer might fire between the > call to nsHtml5StreamParser::DropTimer and the call to nsHtml5TimerKungFu::Run. The problem being addressed is the timer firing after the nsHtml5StreamParser instance it is associated with has been deleted. The timer doesn't hold a strong reference back to the nsHtml5StreamParser for two reasons: 1) chardet already takes nsHtml5StreamParser::Notify and 2) nsHtml5StreamParser is a CC participant and can't be refcounted off the main thread. nsHtml5TimerKungFu addrefs nsHtml5StreamParser (hence kungFu) so it stays alive until ::Run cancels the timer on the parser thread. This way, the timer may fire after the stream parser has been terminated but not after the stream parser has been deleted.
(In reply to comment #34) > (From update of attachment 439222 [details] [diff] [review]) > >@@ -793,39 +793,33 @@ > >+ if (foreignFlag != IN_FOREIGN) { > >+ switch (mode) { > >+ case INITIAL: > >+ case BEFORE_HTML: > >+ case AFTER_AFTER_BODY: > >+ case AFTER_AFTER_FRAMESET: > > Indentation here is a bit strange, elsewhere you've indented the cases. Also, > there's tabs here, which isn't used elsewhere. Fixed. I had failed to import my old Eclipse code formatter settings when I migrated from Mac to Ubuntu. > >+ /* > >+ * A comment token Append a Comment node to the Document object > >+ * with the data attribute set to the data given in the comment > >+ * token. > >+ */ > >+ appendCommentToDocument(buf, start, length); > >+ return; > >+ case AFTER_BODY: > >+ /* > >+ * A comment token Append a Comment node to the first element in > >+ * the stack of open elements (the html element), with the data > >+ * attribute set to the data given in the comment token. > >+ */ > >+ flushCharacters(); > >+ appendComment(stack[0].node, buf, start, length); > >+ return; > >+ default: > >+ break; > > This used to break out differently. Was that change intentional or am I reading > it wrong? There's no intentional change. After re-reading this several times, I fail to see the problem. The non-default cases return in both old and new version, so no change there. The default case in the old case jumps out of the pointless loop. In the new version, there's no loop, so the default case has a no-op break just to silence compiler warnings. > Looks good otherwise. Thanks. http://hg.mozilla.org/mozilla-central/rev/2ba94ee301bf http://hg.mozilla.org/mozilla-central/rev/715e63b786d4 http://hg.mozilla.org/mozilla-central/rev/46be92d24873
(In reply to comment #35) > Also, it would be great if you could diff patches with -P. Not a huge deal > though. Did you mean -p? Here's the same patch with -p from hg diff. The previous patch was from hg export. (It would be so nice if hg export formatted the patches according to the hg diff config in .hgrc...)
Attachment #439510 - Attachment is obsolete: true
Attachment #439894 - Flags: review?(jonas)
Attachment #439510 - Flags: review?(jonas)
Attached patch Address even more review comments (vol6) (obsolete) (deleted) — Splinter Review
This patch is diffed with -p. Unfortunately, startTag() became mostly unreviewable due to formatting changes. I tried diffing with various whitespace-related flags, but the result is just as bad. I'll attach the changed .java file, too. (In reply to comment #28) > > Moving the destruction of the detector > > off-the-main-thread would be annoying in the nsIParser::Terminate and cycle > > collector cases. Wouldn't those then require creating a runnable solely to > > clean up the detector? > > One solution would be to only do this in debug builds, as that is the only > place where we'll hit the refcounting assertions. Do you actually want to hit refcounting assertions, though? What about making chardet implement addref/release using the thread-safe variants? > > > > I put an NS_WARNING here. > > > > > > Why not NS_ASSERTION? > > > > Because I'm not sure if there's a legitimate situation where that condition > > holds true. (I realize that not being sure about that isn't exactly good.) > > That's fine. If the assertion starts firing in legitimate situations we'll see > what that situation is and can replace the current comment with a description > of why this is needed. > > Assertions aren't in release builds so no risk that millions of users will be > affected. Changed to NS_NOTREACHED. > > > > > If so, would calling it something like MetaCharsetFound make sense? > > > > > > > > "Meta charset" is not HTML5 spec terminology. "[character] encoding > > > > declaration" is spec terminology. That's why the method is named the way it is > > > > named. In general, I have tried to stick to spec terminology when naming things > > > > even when the spec terminology deviates from colloquial usage. (Consider > > > > "entities" vs. "named character refernces".) > > > > > > How about EncodingDeclarationFound? Function names that are simple nouns sound > > > like getters to me. > > > > The other callback names, for better or worse, use SAX-style naming and SAX > > uses spec-wise correct nouns, so having this one not be a noun would be > > inconsistent with the naming of the rest of the callbacks. > > My impression is that right now it's sort of a mix, which is what makes things > hard to understand. Yes, some "interfaces" use SAX-style naming, but they are > implemented on classes which implemented both sax-style naming as well as other > interfaces, thus creating the mix. > > The main problem was that it was very hard to find docs. One solution is to > group functions together based on which interface they implement and then > create clear headers showing which interface is being implemented. That makes > it easier to find your way to the interface where the function should be > documented. For example like: > > http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsGenericElement.h#351 I've actually used this pattern in this particular case, but in this case, it didn't help locate the documentation in the .java files. Anyway, I'd rather add more comments pointing to the docs than rename stuff inconsistently. > > So for perf reasons, I don't want to go back to enums on the Java side. As for > > the C++ side, I've viewed the generated code as a way of talking to the C++ > > compiler--not as much a way of expressing things to human readers--and when > > talking to the compiler, it shouldn't matter either way. Therefore, in the > > interest of not pouring effort excessively into the translator, I haven't tried > > to make the translator turn groups of ints into enums in the translation. > > I think this is where we differ in view :) > > I think the C++ code needs to be readable to humans too, for a few reasons. The > most obvious reason is that during profiling and debugging, it's the C++ code > gecko developers will be looking at. > > Also, I wouldn't be surprised if we'll eventually end up forking the java and > C++ code. Though by no means I want that to happen anytime soon. If that > happens we'll be working with the C++ code directly. > > I am however fine with using #defines until we deem forking necessary, if that > ever happens. Added #defines. > > > nsHtml5Tokenizer::tokenizeBuffer > > > > > > 460 if (pos == buffer->getEnd()) { > > > 461 buffer->setStart(pos); > > > 462 } else { > > > 463 buffer->setStart(pos + 1); > > > 464 } > > > > > > Why is this needed? Shouldn't we treat buffers as empty if the start == end? > > > > The first is about getting ready to return because the tokenizer loop hit the > > end of the buffer. The else branch is about two things: Either the parser is > > suspending after a </script> or a charset <meta> and pos is the index of '>' or > > the tokenizer is burping at a CR and pos is the index of the CR. In either of > > those cases, when the buffer is examined the next time, the position should > > point to the character after the '>' or the CR (which may actually mean that > > the buffer gets marked as empty if pos pointed to the last character). > > Ah, so pos always points to the last character consumed? Yes. > > > nsHtml5TreeBuilder::doctype ... > > > What is the purpose of the loop here? It doesn't seem like we actually ever > > > loop. > > > > The purpose is to be able to be break of it to jump forward. > > But the switch statement is right inside it, so any break will break out of > that. Also, there aren't actually any break statements in the loop/switch. Simplified. > > Yeah, that switch on foreignFlag is mainly about being pointlessly consistent > > with startTag. > > > > Fixed. > > For what it's worth, the switch in startTag seems unneeded too and better > written as an if-statement. Whoa. I wonder if this code looked different when I put the switch in there. Simplified. > > In all these cases, there's a method call for reporting an error inside the if. > > With the exception of the one that calls isCurrent(), it should be safe to > > assume that the C++ compiler optimizes these away. Apart from relying on the > > C++ compiler's optimizer, I haven't tried to put effort into making the > > translator produce more elegant output here, because I have been assumming we > > might actually want to report tree builder errors to the error console > > eventually (bug 512229). > > I really don't think we'll want to do that anytime soon. Netscape tried this > back in the days and the result was a ton of added complexity to the parser > which we're still paying for a decade later and yet have never taken advantage > of. I wouldn't be surprised if some of the quirky parsing behaviors in gecko > which have restricted HTML5 was also a result of this. > > Code is changeable, if we'll ever want to do error reporting for parse errors > we can modify the code to deal with that then. I've changed the translator to output nicer code in this case. TANSTAAFL, though: The translator is now uglier.
Attachment #440199 - Flags: review?(jonas)
(In reply to comment #28) > > > I'll file a separate bug. If we're lucky (which it sounds like based on your > > > description) all we need to do is to move the creation of the detector to the > > > thread where its used. > > > > What would be the benefit except being able to avoid the detector creation in > > some cases and being able to delete it earlier in others? It seems to me that > > it's not a thread/ problem that it's allocated on one thread, used on another > > and destroyed on the creator thread but only after the other thread isn't going > > to use the object anyway. > > The problem with the current situation is that if someone changes the > implementation of the detector to do an extra addref/release inside one of the > functions that you are calling on the separate thread, then that person will > start hitting a bunch of threadsafty warnings. Despite that person not doing > anything wrong, or there really being a problem anywhere. However that person > won't be able to check in his/her changes until they've figured out how the > HTML5 parser works and how to change it to not fire the assertions. I recalled why I didn't just make the chardet addref/release thread-safe: The detector addrefs/releases its observer. The observer is the stream parser, which is a cycle collection participant and, therefore, can't be addrefed or released off the main thread. Since your problem scenario is someone changing the detector, since there isn't a current thread-safety problem and since fixing the anticipated problem ahead of time is complex due to cycle collection, can we consider this a bridge we cross when someone actually wants to change the detector? The detector is about as stable as code in Gecko gets. > > In all these cases, there's a method call for reporting an error inside the if. > > With the exception of the one that calls isCurrent(), it should be safe to > > assume that the C++ compiler optimizes these away. Apart from relying on the > > C++ compiler's optimizer, I haven't tried to put effort into making the > > translator produce more elegant output here, because I have been assumming we > > might actually want to report tree builder errors to the error console > > eventually (bug 512229). > > I really don't think we'll want to do that anytime soon. Netscape tried this > back in the days and the result was a ton of added complexity to the parser > which we're still paying for a decade later and yet have never taken advantage > of. I wouldn't be surprised if some of the quirky parsing behaviors in gecko > which have restricted HTML5 was also a result of this. Adding error reporting to the parser wouldn't add a ton of complexity. Since the Java version already supports error reporting, the required complexity is already known. I'd expect the perf cost in the tree builder to be tiny. In the tokenizer, I'd expect the perf cost to be notable but smaller than the cost of supporting view source highlighting. My current plan is to generate a second copy of the main tokenizer loop for view source. One option would be putting tokenizer error reporting only in that copy and reporting errors only if the user views source (possibly by baking it into the source highlighting and not reporting to the console at all).
Comment on attachment 440481 [details] [diff] [review] Address more review comments in C++ only parts, with an added null check >+nsHtml5StreamParser::FlushTreeOpsAndDisarmTimer() >+{ Please add an assertion to check that this is called on the right (main?) thread. Those assertions were awesome to have when reviewing and trying to grokk the code. > nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch() > { > NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); >- Uninterrupt(); > nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this); > if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) { > NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation"); > } > } Is grabbing the mutex still needed? >+class nsHtml5TimerKungFu : public nsRunnable >+{ >+private: >+ nsHtml5RefPtr<nsHtml5StreamParser> mStreamParser; >+public: >+ nsHtml5TimerKungFu(nsHtml5StreamParser* aStreamParser) >+ : mStreamParser(aStreamParser) >+ {} >+ NS_IMETHODIMP Run() >+ { >+ if (mStreamParser->mFlushTimer) { >+ mStreamParser->mFlushTimer->Cancel(); >+ mStreamParser->mFlushTimer = nsnull; >+ } >+ return NS_OK; >+ } >+}; Ok, now I see what is happening. The important part here isn't that the cancel call is happening on any particular thread. But that you addref the stream parser on the main thread, and then post a message to the parser thread, and then post a message back again to the main thread and only then release the stream parser. This ensures that the stream parser is being kept alive until after we're done processing the timer, or cancelling it. And the nsHtml5RefPtr is what's making it all happen. Is this correct? And it seems like the same setup is being used to keep the stream parser alive when sending necko data to it? Please document this thoroughly. Also document the fact that while we're calling various functions on the stream parser on the parser thread, it's critical that it's never addreffed or released there. r=me with that.
Attachment #440481 - Flags: review?(jonas) → review+
Comment on attachment 440199 [details] [diff] [review] Address even more review comments (vol6) Ugh, yeah, this patch unreadable :( I sort of skimmed the resulting java file, but if there was anything in particular you'd like me to look at let me know. I still see a few overuses of switch statements, for example when looking for whitespace and in generateImpliedEndTagsExceptFor and generateImpliedEndTags. But we can always deal with that over time.
Attachment #440199 - Flags: review?(jonas) → review+
(In reply to comment #45) > (From update of attachment 440481 [details] [diff] [review]) > >+nsHtml5StreamParser::FlushTreeOpsAndDisarmTimer() > >+{ > > Please add an assertion to check that this is called on the right (main?) > thread. Those assertions were awesome to have when reviewing and trying to > grokk the code. Added NS_ASSERTION(IsParserThread(), "Wrong thread!"); > > nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch() > > { > > NS_ASSERTION(NS_IsMainThread(), "Wrong thread!"); > > mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex); > >- Uninterrupt(); > > nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this); > > if (NS_FAILED(mThread->Dispatch(event, nsIThread::DISPATCH_NORMAL))) { > > NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation"); > > } > > } > > Is grabbing the mutex still needed? No. Removed. > >+class nsHtml5TimerKungFu : public nsRunnable > >+{ > >+private: > >+ nsHtml5RefPtr<nsHtml5StreamParser> mStreamParser; > >+public: > >+ nsHtml5TimerKungFu(nsHtml5StreamParser* aStreamParser) > >+ : mStreamParser(aStreamParser) > >+ {} > >+ NS_IMETHODIMP Run() > >+ { > >+ if (mStreamParser->mFlushTimer) { > >+ mStreamParser->mFlushTimer->Cancel(); > >+ mStreamParser->mFlushTimer = nsnull; > >+ } > >+ return NS_OK; > >+ } > >+}; > > Ok, now I see what is happening. The important part here isn't that the cancel > call is happening on any particular thread. But that you addref the stream > parser on the main thread, and then post a message to the parser thread, and > then post a message back again to the main thread and only then release the > stream parser. > > This ensures that the stream parser is being kept alive until after we're done > processing the timer, or cancelling it. And the nsHtml5RefPtr is what's making > it all happen. > > Is this correct? Correct. > And it seems like the same setup is being used to keep the stream parser alive > when sending necko data to it? Yes. > Please document this thoroughly. Also document the fact that while we're > calling various functions on the stream parser on the parser thread, it's > critical that it's never addreffed or released there. > > r=me with that. Thanks. I'll land the patch with these changes tomorrow so that I can watch the tinderbox.
(In reply to comment #46) > (From update of attachment 440199 [details] [diff] [review]) > Ugh, yeah, this patch unreadable :( > > I sort of skimmed the resulting java file, but if there was anything in > particular you'd like me to look at let me know. There was a bug in the unreviewable part. I've uploaded a corrected patch here, but it's unreviewable, too. And the interdiff is unreviewable, so I just say what changed since the previous patch: The big switch in startTag was in the |else| branch of |if (inForeign)|. In this patch, there's no |else| branch and the big switch is just after the |if|. Sorry about bothering you a second time with this patch. This didn't get caught earlier, because we don't have full html5lib tree builder test mochitest coverage (bug 559023) and because when I ran the full test suite with the Java version of the parser, I failed to notice that the test harness got stuck in an infinite loop instead of stopping running.
Attachment #440199 - Attachment is obsolete: true
Attachment #440200 - Attachment is obsolete: true
Attachment #440201 - Attachment is obsolete: true
Attachment #440769 - Flags: review?(jonas)
This patch (on top of attachment 440481 [details] [diff] [review]) should make the case where the cycle collector unlinks nsHtml5Parser before nsHtml5StreamParser safer.
Attachment #440978 - Flags: review?(jonas)
Attachment #440769 - Flags: review?(jonas) → review+
Attachment #440978 - Flags: review?(jonas) → review+
Thanks for the reviews. Pushed two patches: http://hg.mozilla.org/mozilla-central/rev/9732d2b16662 http://hg.mozilla.org/mozilla-central/rev/a1c53e4881ca The "vol 6" is the only one unlanded patch on this bug. I didn't land it in the hopes that I might get r+ on bug 548232 soon, which would avoid merge conflicts from reordering the "vol 6" patch and the patch from bug 548232.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: