Closed Bug 1338217 Opened 7 years ago Closed 7 years ago

Wasm: Allow instances to share code

Categories

(Core :: JavaScript Engine: JIT, defect, P1)


Tracking


RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: lth, Assigned: lth)

References

Details

Attachments

(7 files, 14 obsolete files)

Currently each instance gets a copy of the code, in order to allow the code to be patched.  We should keep the code invariant after Module creation.  This means some changes throughout; among other things, x86 code cannot embed the heap pointer and heap length.  Things will generally move into the Tls.

We'll want this to simplify the implementation of tiering.

Luke mentions that he has some patches pending (bug 1334504) that will also be necessary because right now we assume that we can get to an instance from the PC, but when we share code that won't be possible.
A quick check suggests this will cost an 8% performance loss on the zlib benchmark just for the heap pointer.  This assumes other performance issues are addressed; otherwise I guess it would be a smaller percentage and get lost in the noise a little.  Such a performance loss would seem fine if it were an option, but I just don't see any provision for making it optional, and it would be disappointing to see more performance sacrificed in general.
As usual I'm interested in knowing how you arrived at that number.  (I'm willing to believe 8%, I just want to know why 8%.)  From preliminary investigation:

- The inputs for a load are a base pointer, an offset, and the tls pointer.  The base pointer may be
  a constant or some unknown value.  The offset is known.  The tls pointer is unknown.  Assume that if
  the base pointer is unknown it is in a register, and that if it is a constant the offset is zero.

In any case the interesting scenario is when the base pointer is unknown and the offset is smallish, so assume that in the following.  Straightforward code:

- The Tls pointer has to be rematerialized if it is not already in a register; this may require spilling
  a register and will require loading the Tls value from the stack frame.

- If range checking is necessary we should now be able to check the base against the (constant)
  memoryLimit using a compare-with-memory via the tls + known offset; no extra register pressure.  We
  can ignore our offset parameter here.  (Measurements from last summer suggest about 75% of range
  checks - dynamic measurements - can be removed by Ion.)

- The memoryBase will have to be loaded into a register; this could be the Tls register if the Tls
  value will not be needed shortly thereafter.

- We then need to add the base to the memoryBase to compute the effective base for the deref, if the
  offset is not zero.  (If it is zero we can use reg+reg addressing and skip this step.  I don't have
  data on the distribution here, in reasonable code.)

- And then load the value into a register; this can be the same register we used to compute the
  effective base address, at least for int32 loads.

In sum, for a load there is extra register pressure if the Tls occupies a register that we can't reuse for the computation.  For a store there is extra register pressure for sure.  There is at least one extra memory access per heap access (loading memoryBase), sometimes two (the range check); in principle the former can be cached in a register if there are registers to spare, but this is x86, so probably not.

Another trick is to cache the memoryBase in the stack frame if we know it's not changing, i.e., load it into a register early if we know we'll need it and rely on spilling.  An unchanging memoryBase is guaranteed for shared memory, guaranteed between calls, and may be guaranteed for certain implementations that never move the heap array.

In loops that stay in registers / the stack frame the Tls will be spilled (as it is now) and will not incur an extra cost.
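
Schematically, in plain C++ rather than generated code (field and function names here are illustrative, not the actual SpiderMonkey ones), a bounds-checked load then computes something like:

  #include <cstdint>
  #include <cstring>
  #include <stdexcept>

  // Illustrative TLS block: the values the generated code reads instead of
  // patched-in constants.
  struct TlsData {
      uint8_t*  memoryBase;        // base of the instance's linear memory
      uintptr_t boundsCheckLimit;  // current memory length, in bytes
  };

  // One extra load for the limit (the range check) and one for memoryBase,
  // plus the register pressure of keeping the tls pointer live.
  uint32_t wasmLoadI32(const TlsData* tls, uint32_t index, uint32_t offset) {
      uint64_t ea = uint64_t(index) + offset;
      if (ea + sizeof(uint32_t) > tls->boundsCheckLimit)
          throw std::out_of_range("wasm trap: heap access out of bounds");
      uint32_t value;
      std::memcpy(&value, tls->memoryBase + ea, sizeof(value));
      return value;
  }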
The general form of the addressing modes on the x86 is base+index*scale+memory_base+offset where the memory_base can be folded into the offset.

The 8% is obtained by taking reasonably performing compiled code in which the memory_base is a compile-time constant folded into the offset, and comparing it to code in which the memory_base is a runtime constant, leaving the register allocator to do what it can.

The x86 is register starved, so it makes sense that using another register for the memory_base is going to cost some performance.

I don't have answers for the bounds checking; my focus is on pointer masking.  But that tls pointer might cost more too - I suspect the tls pointer is likely to still be live after a memory access, so it will cost something to materialize it again.
I think it's worth taking a small regression on x86 given all the other benefits of code sharing.
lth: for pointer masking, moving the mask to thread local storage, in addition to the memory base, is giving an 11% increase in run times for the zlib benchmark. The tls values are referenced by a segment but cached in registers within the function so most references are using registers.

luke: I don't dispute the benefits of code sharing, or of other use cases that cause performance regressions, such as lots-of-small-modules or quick compilers, but there does not appear to be any give or accommodation for the performance use cases.  These decisions seem to be cast as absolute, the larger design then bakes in these absolute decisions, and I believe that is unfortunate.

I can't see how we can all get along with such absolute positions. It does not seem possible to optimise for these use cases while letting other people optimise for the performance use case when the positions are absolute; it seems a wedge decision. I don't think this needs to be the case; a more accommodating position is likely possible.

Users wanting to play a big monolithic game will want that extra performance or battery life, and don't care if they can only play one instance at a time, as that might be all the resources will handle anyway. Devices using x86 are often constrained tablets etc. that are already starved for performance and battery life.

Users with a few long-running instances of an app that fit easily in device memory may well want the extra performance and battery life even if each instance needs to be compiled specifically for its context.

Multi-threaded apps using the same linear memory can share code without the extra indirections in this issue.

Much faster compilers appear possible; it might be a much smaller burden to re-compile in future, making it more compelling to just re-compile to get good performance. The web browser could cache meta-information, such as the validation state of code and optimization hints that were expensive to derive, allowing for even faster re-compilation.
There are a lot of competing constraints, and fortunately this is an implementation detail that can change over time based on experience and trends in x86 usage.  At present, the duplication would significantly complicate baseline tiering, so I think we should start here, assuming the perf regression isn't too substantial.
My concerns are entirely about how this fits in with larger compilation models and languages, not with exploring code sharing at some significant performance loss as such. I raised concerns in issues such as https://github.com/WebAssembly/design/issues/972 at higher levels with no answer, and if this issue is the answer then taking the performance loss to fit seems very unfortunate. If accommodating the performance use cases does not fit the tiering model then that seems like a problem with the tiering model, because giving up on the performance use cases would be very unfortunate.

lth: Have a look at where this is going. Wasm could be an SSA encoding: you could get SSA out of decoding wasm (faster and cleaner than now), and perhaps also get the last use of definitions along a path (or be able to derive these per block quickly), and then it would appear to just be a matter of BCE, register allocation, and an assembler. The quality of BCE and register allocation might be varied to implement the different tiers, but otherwise perhaps the compiler could be largely shared. (Some work is also needed on loops.) Could you take the experience gained with the baseline wasm compiler, refocus on SSA input (avoiding a lot of complexity and poor code related to local variables), and perhaps team up with the Cretonne work? That would appear to have the potential to make a significant contribution to the community; perhaps you are near the lead in this area and could get there first, and the feedback from such experience might help settle design issues with wasm.
Let me try to collect the threads a little bit.  This bug is about the work to remove patching on x86 and ARM (and probably MIPS, from the looks of it) in SpiderMonkey for standard WebAssembly.  I don't know where the pointer-masking use case comes into this; SpiderMonkey will necessarily support standard WebAssembly, which means bounds checking.  We'll have some performance numbers for this patch before we land, and if the regressions are bad I expect we'll not land.  (Yes, our selection of benchmarks is not great.)  But x86 is slowly becoming a marginal platform for us, and x64 will not be affected by this work.

Luke's point about this being an implementation matter strikes me as particularly important.  We want to get rid of patching now (if we can afford it) because it simplifies other things that we must do (tiering, threading).  If the cost is unpleasantly high we'll need to look for ways to undo the damage again.  I don't see why this should be controversial.  I agree that supporting the single-instance case with patching could be attractive.

It'll be useful to look at some actual numbers here.  I have a patch queue that removes the patching of the bounds check (only) for x86 and ARM by instead comparing against a value loaded from the Tls.  Consequently this requires Ion to keep the Tls available, possibly slightly increasing register pressure, but probably not by as much as keeping the memoryBase in the Tls and loading it when we need it will.

With embenchen's wasm benchmarks on x86 (2013 MacBook Pro, not completely rigorous conditions but the system was more or less idle):

The benchmark results for Baseline are as expected (the baseline compiler does not remove a significant number of bounds checks on most programs): the trend is negative across the board with representative performance loss in the 5-10% range.  For example, 5% loss on box2d, 8% on bullet, 10% on linpack, 8% on zlib.  Some of the microbenchmarks drop by more than that.  But baseline is not all that important so we can live with this if we have to, and in particular, baseline does not attempt to keep Tls in a register now and we can improve that.

For Ion, there's a 3% loss on box2d, but 11% gain (no clue why) on linpack.  zlib is holding steady - no change.  fannkuch falls by 15% but is a microbenchmark known to be a little noisy.  Most of the rest are so far a couple points up or down - looks like noise until proven otherwise.

To be continued.
lth: It is not clear that this is just an implementation matter; that is my main concern. If the wasm design bakes in an assumption that code can be shared then implementations will, in practice, no longer have the choice of offering both slower shared code and faster code specialized to a context.

Let's say that wasm bakes in a design decision to depend on shared code: then how will it be an implementation choice to specialize the compiled code for the context? If copying the code were OK then this issue would not even need to be addressed, would it, and there would be no need to take the performance hit? If it were an implementation choice then this implementation could choose to copy the code and this issue would not be needed? See, it's not just an implementation choice; it's more complex than that. There are compelling use cases for sharing code even at some performance loss, and compelling use cases for higher-performance code specialized to the context, and the wasm design does not appear to be taking that on board.

On the BCE, it was my hope that wasm would open some new opportunities since it traps on a bounds check failure. I was aware that simply hoisting the bounds check over a series of base+offset accesses gives limited performance gain in practice, which was surprising at first since it eliminates a lot of the bounds checks, but it does at least reduce code size. As noted, more work is needed on loops. sunfish noted long ago using the loop stride and a guard page, and I have since discovered other approaches too, but these practically need the loop stride made explicit in the encoding so that the compiler can avoid having to do loop analysis.

Anyway, it will be interesting to see the results, take a look at the compiled code, and compare them to what I am seeing.
I'm fairly sure that at present the only way to discover that code is being shared is by looking at the memory footprint (creating 10,000 instances of a large program will not cause OOM); it is not exposed anywhere, nor is there any kind of guarantee in wasm semantics, to my knowledge, that we will share code.  I'm doing this work not because it's semantically necessary but because it came up as a work item while implementing tiering: the way we're currently handling specialization is in the way of making tiering work, and will be in the way of the simple threading use case.

(I happen to think that sharing is important because it encourages the building of reusable libraries without a lot of duplicated machine code and extra compile time, but as you say there are other considerations.)

For what will probably be the initial attempt at threading, using the asm.js model with workers and a SharedArrayBuffer, sharing code will be important to control the memory footprint.   But nothing should prevent us from specializing a piece of code to a particular shared memory array and then sharing that code among workers, if that's what we want to do.  And then, for the eventual threads-in-wasm use case, our needs may change again, in ways that are hard to see now.

It may be that, just as we're now talking about ways in which a web server can signal that a page should expect to create a large allocation for a wasm heap, there needs to be ways to talk about what a piece of code is and how it will be used, in order to select a compilation / specialization strategy.  Nothing I've seen prevents that from happening.
The quick performance impact experiment did suggest that it would also make a visible performance difference!

There is a wasm design proposed for 'sharing' code: compiling the code in one context, storing it in IndexedDB, and posting it to other contexts. It seems reasonable that the intent is to avoid repeated compilation and avoid duplicated code. If the x86 code needs to be copied then that blocks one of these intended uses, and if it needs to be re-compiled then that blocks the other.

A 'server' in one context cannot know the use in another context, and certainly not before it compiles the code. There appears to be no provision in wasm for this, which is my concern.

The initial threading use case mentioned can be addressed by making one copy of the code and sharing it with all the threads from the start, because they all share the same array buffer (assuming that has been designed for), so why not create a higher-performance solution by design.

There is more to the larger design, such as accommodating a custom translation stage that has inputs such as the linear memory size and that decompresses the code. The current wasm design fails here too, because it is not just a matter of 'specializing': it is not just a matter of keeping the wasm code around to re-compile (and the code might be very large decompressed, too).

So how do you land this without regressing performance, and if it does regress performance then what is the plan to recover that performance?
A rollup patch of my queue that removes all memory patching from x86 (bounds checking and access) and ARM (bounds checking).  This does not remove the actual duplication of JIT code upon instantiation; that's for a subsequent patch but should be fairly easy now.

Passes testing on x86, ARM, and x64.  (MIPS is hopelessly broken by this patch, but it was already pretty broken; it needs a full-time maintainer given our current pace of modifications.)

On x86, bounds checking is performed by comparing the index ("base") to the memoryLength field in the TLS, thus requiring the TLS pointer.  The memoryLength in the TLS is updated whenever the memory is resized.  On ARM, bounds checking loads the memoryLength from the TLS into a scratch register and then performs the compare; we can use the standard scratch.

On x86, access is performed by adding the memoryBase from the TLS into the index before going ahead with the access as before; this destroys the index register, so if that were to be used subsequently the regalloc will make a copy of it.  Constant-address accesses will require the address to be loaded into a register where previously it was not.

Performance measurements to follow.
lth: quick work, thanks. Here's what I see for the zlib benchmark, x86 32-bit:

gcc native: 9.6 sec
vip compiler using pointer masking: 11.2 sec
ff wasm: 18.4 sec
ff wasm (patched): 22.1 sec

So this patch increases the run time by almost 20% on the already slow ff wasm implementation. If other performance issues were addressed then this change may well make an even more significant impact; the 11.2 sec result I currently see is offered as potentially achievable, and there is still more to do there.

As noted, I do not dispute the compelling use case for this change, but the performance use case is also compelling, and I would like to see a design accommodate both. I also like the idea of having a description of how to build the code that might include hints to bias the compilation, and would like to see the memory requirements included too.

One thought on the BCE: I wonder if the code could be instrumented to note the origin of indexes, to see whether an index is used multiple times (where a bounds check could be hoisted), or is based on a loop induction variable (where the stride and barrier style strategies might be useful), or has a single use for which the bounds check cannot be eliminated, and to see the frequency of each of these cases. I might try to get such data, as it might help understand how much potential there is for the loop optimizations, or whether code just has a lot of bounds checks that cannot be eliminated (e.g. walking pointer lists).
Completely idle 2013 MacBook Pro.  Embenchen benchmarks.  Lower is better (linpack and scimark figures are inverted and scaled).  Median of 11 runs.  "Before" is without the patch, "After" with the patch.

               Before  After    Before/After

box2d           1114    1340     .831
bullet          1386    1781     .778
conditionals     381     379    1.005
copy             589     584    1.008
corrections      956     968     .987
fannkuch        2516    4003     .628
fasta           1052    1166     .902
ifs              207     244     .848
linpack         9693    9980     .971
lua_binarytrees 4274    5086     .840
lua_scimark     4424    5743     .770
memops           889    1189     .747
primes          1085    1122     .967
skinning        1226    1765     .694
zlib            1530    1868     .819

Ignoring the microbenchmarks we see slowdowns in the 15%-20% range.  These results are fairly robust; I see similar slowdowns on an AMD FX4100 system (quite a bit older), though there bullet and zlib slow down even more; and I see similar slowdowns with fewer runs.  Baseline follows the same trend, roughly.
From IRC, there are other promising routes that should hopefully have much lower overhead by hoisting loads from TLS.  Agreed 15-20% overhead is too much.
Second attempt.  This patch (uploaded subsequently) instead keeps the heap base and the heap limit in MIR nodes and lets GVN common them.  Here the "Before" program is again mozilla-inbound, i.e., with patching.  Empirically, noise adds +/- 1%.  Same system as above; observe that the "Before" times are virtually identical to those given above.

Ignoring microbenchmarks we see significant slowdowns for bullet, scimark, and zlib, and a significant speedup for linpack.  I also tested my ray tracer, which shows a ratio of .887, "--".  There could be some noise here from loops becoming aligned or misaligned, but the trend is clearly still negative.  Microbenchmarky loops such as memops and fannkuch suffer significantly.

On the positive side, *all* programs do significantly better than with the previous patch.  :nbp says more can be done to improve GVN (he notes that it does not currently hoist out of "if"s, for example).

A final note: as implemented in the current patch, GVN does not invalidate the heap bounds and base at a call site nor at grow_memory.  This is realistic, since the base and bound will not change if a heap maximum is provided, which it frequently will be.  But the eventual patch would have to take that into account.

              Before  After   Before/After

# mode=IonVsIon, runs=11, problem size=
box2d           1175    1184     .992
bullet          1374    1436     .956  (-)
conditionals     381     288    1.322  (++)
copy             588     584    1.006
corrections      957     943    1.014
fannkuch        2518    3422     .735  (--)
fasta           1056    1082     .975  (-)
ifs              209     196    1.066  (+)
linpack         9670    9163    1.055  (+)
lua_binarytrees 4286    4707     .910  (-)
lua_scimark     4424    5181     .853  (--)
memops           890    1122     .793  (--)
primes          1086    1037    1.047  (+)
skinning        1227    1250     .981  (-)
zlib            1539    1646     .934  (-)

Baseline performance is not a focus, but here are the full data, for the record.  Here we load the TLS from the frame and then load the memory limit and memory base from the TLS, for every access.  The TLS is loaded from the frame only once per heap access even if it is needed twice.  We can do better than this should we want to.

# mode=BaselineVsBaseline, runs=5, problem size=
box2d            1717    2073    .828
bullet           2049    2800    .731
conditionals      743     782    .950
copy             1197    1181   1.013
corrections      1445    1444   1.000
fannkuch         4774    5653    .844
fasta            2124    2294    .925
ifs               225     300    .750
linpack         11816   14997    .787
lua_binarytrees  6117    7571    .807
lua_scimark      7304    8936    .817
memops           3504    3351   1.045
primes           1132    1047   1.081
skinning         2380    3392    .701
zlib             1993    2607    .764
Attached patch bug1338217-wasm-heap-v2-wip.patch (obsolete) (deleted) — Splinter Review
Second attempt.  This is not yet clean: It is x86 only; there's a hack in the lowering for Int64 to work around a problem caused by useRegisterAtStart; GVN does not invalidate the memory bounds and memory limit at grow_memory and calls, which it needs to do if there's not a heap maximum; and arguably there's no need to introduce two new MIR node types here, one ought to be enough (with a constant offset field, which would play into value numbering).
Second attempt, adapted to ARM (bounds checking only since there is a dedicated HeapReg here).  ARM has more available registers as well.  By and large the trend is negative but slowdowns are slight except for zlib.  Microbenchmarks do not suffer notably, except binarytrees.  (3 runs only but this system is quite stable.)

              Original  New   New/Original

# mode=IonVsIon, runs=3, problem size=
box2d            3808    3827    .995
bullet           5600    5685    .985 (-)
conditionals     1914    1902   1.006
copy             3509    3581    .979 (-)
corrections      1634    1646    .992
fannkuch         6935    6967    .995
fasta            3563    3587    .993
ifs               476     462   1.030 (+)
linpack         16035   16096    .996
lua_binarytrees 12806   13274    .964 (-)
lua_scimark     15290   15625    .978 (-)
memops           3059    3134    .976 (-)
primes           1853    1853   1.000
skinning         3884    3915    .992
zlib             3674    3835    .958 (--)

Baseline slowdowns are again problematic, even if less dramatic than x86; but likely we can do much by caching the tls in a register - not difficult - or caching the memory limit in the frame between calls.

# mode=BaselineVsBaseline, runs=1, problem size=
box2d            6345    6738    .941 (-)
bullet           8533   10543    .809 (--)
conditionals     1969    1974    .997
copy             2291    2291   1.000
corrections      2432    2430   1.000
fannkuch        14675   15479    .948 (-)
fasta            5221    5825    .896 (--)
ifs               625     624   1.001
linpack         34753   42124    .825 (--)
lua_binarytrees 19718   21905    .900 (--)
lua_scimark     24752   25380    .975 (-)
memops           8110    8640    .938 (-)
primes           1860    1861    .999
skinning         7793    8915    .874 (--)
zlib             5826    6423    .907 (--)
Oops, the headings are wrong in comment 18, they should be as in comment 16.
As comment 17, and decidedly not any cleaner, but with support for ARM and with a change to take into account the more subtle memory growth observer mechanics in current m-i.
Attachment #8839075 - Attachment is obsolete: true
Just to round out that thread: on the ray tracer, the shell that keeps the heap limit in the tls is fractionally faster on ARM than the shell that uses patching.  (Actually it's in the noise, but there's a discernible trend in the data in favor of the new code.)  This is slightly surprising since the patched heap limit on ARM requires only a compare-shifted-immediate ; branch sequence, whereas the loaded heap limit will sometimes require a load ; compare ; branch sequence.  Still, we won't need the tls pointer for anything else, and it may be that the heap limit is simply hoisted into a register in code regions where it's hot.

(The ARM system is my standard Jetson TK1 quad-core Tegra.)
Depends on: 1329676
This is a large patch but most of it is pretty mechanical.  Currently this sits on top of my patch for bug 1329676, in case the diffs confuse you (WasmBCE, Lowering).

Baldr:

We introduce a single new MIR node, MWasmHeapMeta, which is used to load the memoryBase and boundsCheckLimit from the TlsData.  We generate one of these per bounds check, and one per heap load on x86.  Lowering and code generation for this new node is platform-independent.

We introduce a new alias class for the Wasm heap metadata, to allow GVN to common the nodes but to kill live values across calls (if there's not a heap maximum).  I've verified manually that this works properly, and we do have some test cases that fail if the alias analysis is not done correctly.
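
In outline, the node looks roughly like this (an abbreviated sketch with boilerplate omitted; the review comments below quote the actual hunks):

  class MWasmHeapMeta
    : public MUnaryInstruction,
      public NoTypePolicy::Data
  {
      uint32_t offset_;    // offsetof(wasm::TlsData, memoryBase) or boundsCheckLimit
      bool hasMaxMemory_;  // if true, the loaded value can never change

      MWasmHeapMeta(MDefinition* tlsPointer, uint32_t offset, bool hasMaxMemory)
        : MUnaryInstruction(tlsPointer),
          offset_(offset),
          hasMaxMemory_(hasMaxMemory)
      {
          setMovable();
          setResultType(MIRType::Pointer);
      }

    public:
      static MWasmHeapMeta* New(TempAllocator& alloc, MDefinition* tlsPointer,
                                uint32_t offset, bool hasMaxMemory);

      uint32_t offset() const { return offset_; }

      // With a heap maximum the base and limit never change, so GVN may hoist
      // and common these loads freely; otherwise they alias the WasmHeapMeta
      // set and are killed at calls (which may grow memory).
      AliasSet getAliasSet() const override {
          if (hasMaxMemory_)
              return AliasSet::None();
          return AliasSet::Load(AliasSet::WasmHeapMeta);
      }
  };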


Rabaldr:

For now we load the tls once before each heap access if it will be needed for bounds checking or address computation; the memoryBase is just added into the pointer value.


Performance:

For the most part, with all the bits in place and the constant-pointer BCE from bug 1329676 also applied, the performance is as in comment 14, except that a couple of microbenchmarks have improved markedly.  Here are the x86 data:

# mode=IonVsIon, runs=11, problem size=

              Original  New   New/Original

box2d           1168    1176     .993 *
bullet          1365    1428     .955 * (-)
conditionals     389     293    1.327   (+++++++)
copy             584     584    1.000
corrections      967     946    1.022   (+)
fannkuch        2917    3544     .823   (----)
fasta           1055    1091     .967   (-)
ifs              226     190    1.189   (++++)
linpack         9663    9104    1.061 * (++)
lua_binarytrees 4262    4666     .913 * (--)
lua_scimark     4440    5173     .858 * (---)
memops           890     895     .994
primes          1125    1037    1.084   (++)
skinning        1235    1238     .997
zlib            1535    1666     .921 * (--)

Each * represents a "significant" benchmark, a pretty subjective
measure.

Each + or - represents up to a .05 change in the ratio outside the
margin of error, with 1% considered a normal margin.

My ray tracer also slows down quite a lot; this is the median of three runs:

Raybench        7375    8269     .892 * (--)


Testing:

This passes everything I've tried, including AngryBots in a 32-bit Linux build.
Attachment #8843332 - Flags: review?(luke)
Comment on attachment 8843332 [details] [diff] [review]
bug1338217-no-more-patching.patch

Review of attachment 8843332 [details] [diff] [review]:
-----------------------------------------------------------------

Great, that's pretty simple.  Glad to see the AliasSet matches what we need here.  Just a few nits:

::: js/src/jit/MIR.h
@@ +13623,5 @@
> +  : public MUnaryInstruction,
> +    public NoTypePolicy::Data
> +{
> +    uint32_t offset_;
> +    bool hasMaxMemory_;

I'd like to use this for table base/length before long.  Could we generalize MWasmHeapMeta to MWasmLoadTls and hasMaxMemory_ to be an AliasSet passed to the ctor?

@@ +13631,5 @@
> +        offset_(offset),
> +        hasMaxMemory_(hasMaxMemory)
> +    {
> +        setMovable();
> +        setResultType(MIRType::Pointer);

For boundsCheckLimit, the type here should be MIRType::Int32.  Now this currently doesn't happen b/c 64-bit won't ever load the bounds-check, but it will for table lengths.  So I think the constructor needs a MIRType argument.

@@ +13654,5 @@
> +
> +    AliasSet getAliasSet() const override {
> +        if (hasMaxMemory_)
> +            return AliasSet::None();
> +        return AliasSet::Load(AliasSet::WasmHeapMeta);

So to check my understanding: grow_memory/current_memory turn into MWasmCalls and MWasmCall aliases everything, so this too?

::: js/src/wasm/WasmIonCompile.cpp
@@ +728,5 @@
>      {
>          if (inDeadCode())
>              return nullptr;
>  
>          MInstruction* load = nullptr;

nit: could you move this decl of 'load' to be right before the 'if' below, with a \n after the 'curBlock_->add(memoryBase)'?

@@ +729,5 @@
>          if (inDeadCode())
>              return nullptr;
>  
>          MInstruction* load = nullptr;
> +        bool hasMaxMemory = bool(env_.maxMemoryLength);

nit: could you use env_.maxMemoryLength.isSome() (here and below x4)

@@ +732,5 @@
>          MInstruction* load = nullptr;
> +        bool hasMaxMemory = bool(env_.maxMemoryLength);
> +        MWasmHeapMeta* memoryBase = MWasmHeapMeta::New(alloc(), tlsPointer_,
> +                                                       offsetof(wasm::TlsData, memoryBase),
> +                                                       hasMaxMemory);

What's interesting here is we're generating heap-base-load MIR nodes even on x64/ARM, but, presumably, 100% of them die b/c x64 lowering of load/store ignore them.  It'd be nice to confirm this, on some big workload with a MOZ_CRASH() in the codegen for MWasmHeapMeta.  (Although soon hopefully we can kill un-pin HeapReg and have x64 take memoryBase as well.)

@@ +755,5 @@
>      {
>          if (inDeadCode())
>              return;
>  
>          MInstruction* store = nullptr;

same nit as for 'load' above

::: js/src/wasm/WasmTypes.h
@@ +1151,5 @@
>      // Pointer to the base of the default memory (or null if there is none).
>      uint8_t* memoryBase;
>  
> +    // Bounds check limit of memory, in bytes (or zero if there is no memory).
> +    uintptr_t boundsCheckLimit;

Could you make this #ifndef WASM_HUGE_MEMORY?
Attachment #8843332 - Flags: review?(luke) → review+
(In reply to Luke Wagner [:luke] from comment #25)
> Comment on attachment 8843332 [details] [diff] [review]
> bug1338217-no-more-patching.patch
> 
> @@ +13654,5 @@
> > +
> > +    AliasSet getAliasSet() const override {
> > +        if (hasMaxMemory_)
> > +            return AliasSet::None();
> > +        return AliasSet::Load(AliasSet::WasmHeapMeta);
> 
> So to check my understanding: grow_memory/current_memory turn into
> MWasmCalls and MWasmCall aliases everything, so this too?

Yes.  In truth, I realized after posting the patch that I had forgotten to check this, but I expected it to be OK because we have test cases that fail if grow_memory does not alias everything.  And it is OK.

> @@ +732,5 @@
> >          MInstruction* load = nullptr;
> > +        bool hasMaxMemory = bool(env_.maxMemoryLength);
> > +        MWasmHeapMeta* memoryBase = MWasmHeapMeta::New(alloc(), tlsPointer_,
> > +                                                       offsetof(wasm::TlsData, memoryBase),
> > +                                                       hasMaxMemory);
> 
> What's interesting here is we're generating heap-base-load MIR nodes even on
> x64/ARM, but, presumably, 100% of them die b/c x64 lowering of load/store
> ignore them.

As it happens they don't die; thanks for spotting this.  Since they are inputs to the WasmLoad node and the WasmLoad node is not dead, they are lowered unconditionally, and code is generated for them.  I think my code here dates to when I expected to specialize most nodes by platform and expected the x64 node to discard these inputs and DCE to remove them; because I did not test before/after performance on x64, I did not see the problem.

After talking to nbp about this I'm inclined to try to handle it without specializing MIR nodes by platform, and instead make the nodes in question variadic (optional memoryBase, and, for asm.js nodes, boundsCheckLimit).  Indeed the variadicity will be reduced later if we remove HeapReg.

> ::: js/src/wasm/WasmTypes.h
> @@ +1151,5 @@
> >      // Pointer to the base of the default memory (or null if there is none).
> >      uint8_t* memoryBase;
> >  
> > +    // Bounds check limit of memory, in bytes (or zero if there is no memory).
> > +    uintptr_t boundsCheckLimit;
> 
> Could you make this #ifndef WASM_HUGE_MEMORY?

Yes and no.  Currently WasmIonCompile accesses it on all platforms.  But with the fix for the problem above that may change.
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7ebcd45634ee
Backed out changeset b7bcda1f007a for bustage on a CLOSED TREE
https://hg.mozilla.org/mozilla-central/rev/1a9059a55ce0
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
No longer depends on: 1329676
Getting rid of patching was just one aspect; the other aspect is not to duplicate the code segments, but for that we're waiting on bug 1334504.
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Attached patch bug1338217-share-code.patch (obsolete) (deleted) — Splinter Review
Probably this is sufficient to share instantiated code, once bug 1329676 lands.
Attached patch bug1338217-rm-lookupInstanceDeprecated.patch (obsolete) (deleted) — Splinter Review
I'm thinking about something like this for getting rid of lookupInstanceDeprecated.  This fails the timeout tests, which suggests that the code in RedirectJitCodeToInterruptCheck is wrong.  (On the other hand, getting the Tls via the FP is also wrong there, at least sometimes.)
Attachment #8850505 - Flags: feedback?(luke)
Comment on attachment 8850505 [details] [diff] [review]
bug1338217-rm-lookupInstanceDeprecated.patch

Review of attachment 8850505 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for starting this!  There's actually a few subtleties here; probably easier for me to write the patch than writing it via comments :)

::: js/src/jit/arm/Simulator-arm.cpp
@@ +1550,5 @@
> +static Instance*
> +LookupInstance(uintptr_t* fp)
> +{
> +    wasm::TlsData* tls = reinterpret_cast<wasm::TlsData*>(fp[0]);
> +    return tls->instance;

Probably want to reuse WasmSignalHandlers.cpp, calling it
  Instance* LookupInstanceAtFault(WasmActivation* activation, void* fp);

::: js/src/wasm/WasmSignalHandlers.cpp
@@ +777,5 @@
> +
> +static Instance*
> +LookupInstance(WasmActivation* activation)
> +{
> +    wasm::FrameIterator iter(activation);

I think the logic we need is:

  Code* code = activation->compartment()->wasm.lookupCode(pc);
  if (!code) return null;
  CodeRange* codeRange = code->lookupRange(pc);
  if (!codeRange || !codeRange->isFunction()) return null;
  void* fp = ContextToFP(context);  // mirroring how we do ContextToPC() atm
  return reinterpret_cast<wasm::Frame*>(fp)->tls->instance;

@@ +1266,5 @@
>  #else
>          uint8_t** ppc = ContextToPC(context);
>          uint8_t* pc = *ppc;
>  
> +        const Instance* instance = LookupInstance(activation);

For these two calls and the one below in IsPCInWasmCode, you can use the simpler/cheaper activation->compartment()->wasm.lookupCode().
Attachment #8850505 - Flags: feedback?(luke)
Attached patch change-frame-layout (deleted) — Splinter Review
This patch switches around the order of callerFP and tls in the Frame so that callerFP is popped to fp first.  The reason is that, with the current order, right after tls is popped that stack memory can be corrupted by a signal handler while fp still points to the whole Frame, so if you interrupt at exactly that instruction you can view a trashed field.  I accidentally regressed this in bug 1334504 (despite the comment I wrote the last time I made this bug :/ ).
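
Schematically, after the patch (other fields and the exact types elided):

  struct Frame
  {
      // The caller's Frame*.  Must come first so that it is popped back into fp
      // before any other field of this frame is released; otherwise a signal
      // handler walking the stack via fp could observe a popped field.
      uint8_t* callerFP;

      // The callee instance's TlsData*, popped only after fp has been repointed
      // at the caller's frame.
      TlsData* tls;
  };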
Attachment #8850587 - Flags: review?(bbouvier)
Attached patch rm-lookup-instance-deprecated (obsolete) (deleted) — Splinter Review
Attachment #8850505 - Attachment is obsolete: true
Attachment #8850588 - Flags: review?(lhansen)
Attached patch rm-lookup-instance-deprecated (obsolete) (deleted) — Splinter Review
(Oops, forgot to include the hunk in InstanceComparator from Lars's original patch.)
Attachment #8850588 - Attachment is obsolete: true
Attachment #8850588 - Flags: review?(lhansen)
Attachment #8850589 - Flags: review?(lhansen)
Attached patch rm-lookup-instance-deprecated (deleted) — Splinter Review
Now with build fixes for OSX/ARM.
Attachment #8850589 - Attachment is obsolete: true
Attachment #8850589 - Flags: review?(lhansen)
Attachment #8850762 - Flags: review?(lhansen)
Comment on attachment 8850762 [details] [diff] [review]
rm-lookup-instance-deprecated

Review of attachment 8850762 [details] [diff] [review]:
-----------------------------------------------------------------

Short and sweet, thanks!

(Also passes local testing with my code sharing patch.)
Attachment #8850762 - Flags: review?(lhansen) → review+
Comment on attachment 8850587 [details] [diff] [review]
change-frame-layout

Review of attachment 8850587 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you for the explanations on irc, this is definitely tricky.

::: js/src/wasm/WasmFrameIterator.cpp
@@ +367,5 @@
>  
> +    // There is an important ordering constraint here: fp must be repointed to
> +    // the caller's frame before any field of the current frame is popped:
> +    // asynchronous signal handlers could otherwise corrupt popped-but-
> +    // accessible fields of fp.

There's a metonymy between fp and the actual frame here that confused me a bit. This is probably standard and just me not being used to it, but I'd find it clearer if we replaced `fields of fp` with `fields of the frame pointed to by fp`. Would it also make sense to state that this is an issue because these frame fields can also be read from the signal handler?

::: js/src/wasm/WasmTypes.h
@@ +1454,5 @@
>  
>  struct Frame
>  {
> +    // The caller's Frame*.
> +    uint8_t* callerFP;

It might make sense to add a comment saying not to change the order of fields here, referring to the other new comment in WasmFrameIterator.cpp?
Attachment #8850587 - Flags: review?(bbouvier) → review+
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0521656c88ca
Baldr: change frame layout to pop fp first (r=bbouvier)
https://hg.mozilla.org/integration/mozilla-inbound/rev/f6b17132648c
Baldr: remove wasm::Compartment::lookupInstanceDeprecated (r=lth)
Attached patch bug1338217-share-code.patch (obsolete) (deleted) — Splinter Review
Minimally we need to do this to share code, for now: Code becomes shareable, and a CodeSegment still belongs to a specific Code.  Module caches a pointer to a Code.  A fresh Code is created when debugging is enabled, and that code is not sharable.

One can do more here.  Debugging state can be broken out of Code and linked from an Instance that needs it.  (Code remains mutable, however.)  I'm a little on the fence about that; it's not a simplification but it may be good hygiene.  Feedback desired.
Attachment #8849581 - Attachment is obsolete: true
Attachment #8851568 - Flags: feedback?(luke)
Comment on attachment 8851568 [details] [diff] [review]
bug1338217-share-code.patch

Review of attachment 8851568 [details] [diff] [review]:
-----------------------------------------------------------------

Moving in the right direction, just need changes discussed earlier.

::: js/src/wasm/WasmCode.h
@@ +510,5 @@
>  typedef UniquePtr<GeneratedSourceMap> UniqueGeneratedSourceMap;
>  typedef HashMap<uint32_t, uint32_t, DefaultHasher<uint32_t>, SystemAllocPolicy> StepModeCounters;
>  typedef HashMap<uint32_t, WasmBreakpointSite*, DefaultHasher<uint32_t>, SystemAllocPolicy> WasmBreakpointSiteMap;
>  
> +// Code objects own executable code and the metadata that describes it.

Could you add a comment explaining how Code can be shared between instances on different threads via Module.

@@ +520,5 @@
>      const SharedBytes        maybeBytecode_;
>      UniqueGeneratedSourceMap maybeSourceMap_;
>  
>      // Mutated at runtime:
>      CacheableCharsVector     profilingLabels_;

Both maybeSourceMap_ and profilingLabels_ should be ExclusiveData<> (so each protected by a lock).  Earlier I was suggesting some ref-count to release when no thread is profiling, but that's probably not necessary: we could just leave the profilingLabels_ allocated once created, so that they are simply lazily-initialized, just like the maybeSourceMap_.  It'd be good to group these two fields with a comment saying "Lazily-initialized and locked:".

::: js/src/wasm/WasmInstance.h
@@ +69,5 @@
>  class Instance
>  {
>      JSCompartment* const            compartment_;
>      ReadBarrieredWasmInstanceObject object_;
> +    MutableCode                     code_;

I think this could stay 'const' though.

::: js/src/wasm/WasmModule.cpp
@@ +896,5 @@
>          maybeBytecode = bytecode_.get();
>      }
>  
> +    MutableCode code(cachedSharedCode_);
> +    if (!code || metadata_->debugEnabled) {

Since JS::WasmModule is already shareable between threads via postMessage(), I think this lazy initialization would be racy w/o a lock.  However, if Module eagerly creates the Code as we discussed earlier today, then I think this problem goes away.
Attachment #8851568 - Flags: feedback?(luke)
(In reply to Luke Wagner [:luke] from comment #46)
> Comment on attachment 8851568 [details] [diff] [review]
> bug1338217-share-code.patch
> 
> Review of attachment 8851568 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/wasm/WasmInstance.h
> @@ +69,5 @@
> >  class Instance
> >  {
> >      JSCompartment* const            compartment_;
> >      ReadBarrieredWasmInstanceObject object_;
> > +    MutableCode                     code_;
> 
> I think this could stay 'const' though.

Only if the debug data is filtered out of the WasmCode and moved into the instance, at a minimum.  We seem to be going back and forth on that.  I'll see what I can do.
It's not completely obvious how to store a Code with the Module instead of the Bytes.  Creating a Code requires a CodeSegment, which requires a JSContext, and it looks like it probably needs to happen on the main thread.  This is a problem for both the module generator and the deserializer (and indeed I had previously run into the same problem with the tiered compiler).

It's plausible that some of the actions (like OOM reporting) that require cx for the CodeSegment allocation can be delayed until later, and for some of the rest I may question whether they should be there in the first place.  For example, there's a sensible fallback for failing to allocate a code segment (triggers a GC), but none for failing to allocate the Bytes buffer for pasteup in the first place.

In summary, it is somewhat likely that we can at least allocate the code blob off-thread with no GC fallback, and then reference this from the module, and the module can then lazily create a CodeSegment and Code to wrap that blob, and also handle and report allocation errors for the code.
Depends on: 1352955
Attached patch bug1338217-share-code-v2.patch (obsolete) (deleted) — Splinter Review
This is still WIP but nearly done.  It performs the allocation of the code memory off the main thread (in the compiler or in the deserializer), and then performs instantiation on the main thread.  In the common case (after the masm is deleted) there is exactly one copy of the code, and this code is shared.  In the debugging case, there is one copy of the code per debugger, so that it can be patched.  Serialization needs to make a separate copy, but following the copy the code is unlinked in-place so there is only the one extra copy.

This factors out the DebugState and wraps the profiling info in an ExclusiveData and I *think* that the Module and the Code and the CodeSegment are now immutable and sharable cross-thread, but that's one of the things I'm still working on verifying and cleaning up.
Attachment #8851568 - Attachment is obsolete: true
Meant to add: the one thing that is in a sense missing here is the last-ditch GC that used to be triggered upon allocating the code memory, if the allocation failed.  I'm currently a little vague on whether this can happen off-thread at all, but for now it looks like it can (deserialization might recompile the code, for one thing), and with tiering I'd like for it to be possible (because the tier2 compilation will be running on a separate thread).

In that eventuality we have some options:

a. ignore the problem, but this is risky since code memory is not normal memory and has limited availability
b. always do a last-ditch gc before wasm compilation, but this is draconian and a little silly
c. try to guess whether a last-ditch gc is needed before a compilation, this is somewhat plausible
d. send a signal from the allocating thread to the main thread to do the gc, and then a signal back, and then do the
   allocation, but this seems tricky
e. accept a temporary copy of the memory after all, it does not have to be a contiguous buffer and we can ditch it as
   soon as we've allocated code memory
f. react to the allocation failure on the main thread, do a last-ditch gc, and then redo the compilation, but this
   risks getting into a loop and seems generally hard
g. fix the last-ditch machinery to be invokable off-thread but this seems like a lot of work

Personally I favor (a) or (e), or perhaps (a) then (e) - land soon without a fix with the expectation that we can implement something workable.  I note that for (e), if we can abstract the representation of the copied bytes we may be able to just transmit the masm instance from the compiling thread to the main thread and not allocate anything.
Luke notes that he thinks that there is nothing major to stop (g) from happening, so maybe (a) then (g) is the way to go.
Ok, looks like my assumptions were wrong; the notification has to happen on the main thread (workers simply don't do the memory notification).  Option (e) sounds like the simplest, and it could actually be part of a three-birds-one-stone fix:

Currently, we are doing a big spurious copy in MacroAssembler.asmMergeWith() by copying function batches into one big masm that just gets copied out to executable memory.  dmajor found a significant amount of main-thread time (1/3, at some point in the past) was due to this growing/copying.  What if instead we killed asmMergeWith() and just extracted the raw code and metadata from the MacroAssembler directly and had this <code,metadata> pair be the result returned from helper threads to the generator thread in CompileTask (CompileTask::masm_ would be removed and become a plain auto variable on the helper thread stack).

The last bird killed with this stone is the fact that, since Cretonne doesn't use MacroAssembler, we need to move away from using MacroAssembler as the result of compilation, and toward some abstract <code,metadata> structure that Cretonne can produce.

So we could go faster, be more general, and address this last-ditch-GC issue.  WDYT?
That seems fine.  I may try to turn that into a separate patch, but we'll see.
Attached patch bug1338217-share-code-v3.patch (obsolete) (deleted) — Splinter Review
As before, but now the code+metadata are transmitted from helper threads (deserialization, compilation) to the main thread in an intermediate format (a "ProtoModule") where the code has the form of a "CodeBytes" object; currently this just wraps a Bytes array, but that is not really exposed and we can go to work on that.  With this, Code is properly immutable and Module almost so (and I think fully immutable is within reach).

The thing that is missing here is that it is now not possible for JS::DeserializeWasmModule to return a complete Module, because a complete Module can only be constructed on the main thread, when we have a JSContext available (to do the last-ditch GC in case executable memory allocation fails).  So what will DeserializeWasmModule return?  I'm thinking WasmModule will turn into something that is effectively a Variant holding either a Module or a ProtoModule; certainly that is serviceable if not exactly elegant.  Still thinking.
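
Very roughly, the intermediate form has this shape (illustrative names and layout only; the real classes carry much more):

  // Raw, not-yet-executable machine code produced off the main thread.
  struct CodeBytes {
      Bytes bytes;   // linked code, waiting to be copied into executable memory
  };

  // Everything needed to finish building a Module on the main thread, where
  // executable memory can be allocated (and OOM handled with a JSContext).
  struct ProtoModule {
      UniquePtr<CodeBytes> code;
      // ... metadata, link data, optional bytecode, etc.
  };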
Attachment #8853962 - Attachment is obsolete: true
(In reply to Lars T Hansen [:lth] from comment #54)
Arg, I forgot about the DeserializeWasmModule() case!  I had thought this was a three-birds-with-one-stone opportunity, but if it's still gunking up the design of these central classes, it's probably worth it to just make LargeAllocationFailureCallback work on any thread by dispatching a runnable to the main thread and blocking on a condvar for its completion, as mccr8 suggested.  This would actually remove the void* parameter (so the callback would be nullary) and make the callback work for workers, so probably this is the right fix.  I can try to make a patch today.
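
The shape of that pattern, in plain std:: C++ for illustration (the actual patch uses Gecko's runnable and monitor machinery; see the review excerpts below):

  #include <condition_variable>
  #include <functional>
  #include <mutex>
  #include <thread>

  // Stand-in for posting a runnable to the main thread's event loop; here it
  // just runs the task on another thread so the sketch is self-contained.
  static void DispatchToMainThread(std::function<void()> task) {
      std::thread(std::move(task)).detach();
  }

  // An off-main-thread caller posts the large-allocation-failure notification
  // and blocks until it has run, so the observer can do main-thread-only work
  // (e.g. a memory-pressure GC) before the allocation is retried.
  void NotifyLargeAllocationFailureAndWait(std::function<void()> onMainThread) {
      std::mutex mutex;
      std::condition_variable cond;
      bool done = false;

      DispatchToMainThread([&] {
          onMainThread();
          std::lock_guard<std::mutex> lock(mutex);
          done = true;
          cond.notify_one();  // the waiter may not have started waiting yet,
                              // hence the predicate in wait() below
      });

      std::unique_lock<std::mutex> lock(mutex);
      cond.wait(lock, [&] { return done; });
  }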
(In reply to Luke Wagner [:luke] from comment #55)

Sure, you can try that.  That said, I did run into some similar problems when implementing tiering - ie, no JSContext* available - and while it is possible that those could be avoided some other way that I did not see at the time it may well be that it's better to try to make the decoupling work.  Looking through that patch queue I can't find the troublesome code now, so no more solid information, sorry.  I seem to remember it had to do particularly with the deserialization code having to fall back on a recompilation.
Attached patch improve-large-alloc-callback (obsolete) (deleted) — Splinter Review
Pretty simple patch, actually.  And now, as a bonus, the large-allocation-callback now actually does something on workers.
Attachment #8855946 - Flags: review?(continuation)
Attached patch improve-large-alloc-callback (obsolete) (deleted) — Splinter Review
Oops, forgot to qref to remove testing code.
Attachment #8855946 - Attachment is obsolete: true
Attachment #8855946 - Flags: review?(continuation)
Attachment #8855948 - Flags: review?(continuation)
Comment on attachment 8855948 [details] [diff] [review]
improve-large-alloc-callback

Review of attachment 8855948 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/CycleCollectedJSContext.cpp
@@ +1013,5 @@
> +  void
> +  BlockUntilDone()
> +  {
> +    MOZ_ASSERT(!NS_IsMainThread());
> +    MOZ_ASSERT(mWaiting, "Can only call once");

Is this assertion valid? I think there might be a thread interleaving where the dispatch goes off and runs immediately on the main thread, before we run BlockUntilDone. I don't know how scheduling of events on the event loops of two threads can go.

@@ +1030,2 @@
>  {
> +  if (NS_IsMainThread()) {

Rather than check this dynamically, please split this into two separate callbacks, one for WorkerJSContext and one for XPCJSContext, and then register the appropriate one in their respective initialize methods. Then LargeAllocationFailureRunnable could live with the worker code, which I think makes more sense.

You should also get a review from a DOM workers peer for the LargeAllocationFailureRunnable stuff, because I'm not really sure if this is okay or not. If the worker thread is waiting on the condition variable in BlockUntilDone, and the main thread sends it a memory pressure observer message, can the worker get that message and actually GC? (Though maybe it is okay if it doesn't...)

Otherwise this patch looks fine to me.
Attachment #8855948 - Flags: review?(continuation) → review-
(In reply to Andrew McCreight [:mccr8] from comment #59)
> > +    MOZ_ASSERT(!NS_IsMainThread());
> > +    MOZ_ASSERT(mWaiting, "Can only call once");
> 
> Is this assertion valid? I think there might be a thread interleaving where
> the dispatch goes off and runs immediately on the main thread, before we run
> BlockUntilDone. I don't know how scheduling of events on the event loops of
> two threads can go.

Oh duh, you're right.  The while(mWaiting){} loop anticipates this interleaving below, but here I had just meant to assert that mWaiting is initialized to true.  I'll remove it.

> Rather than check this dynamically, please split this into two separate
> callbacks, one for WorkerJSContext and one for XPCJSContext, and then
> register the appropriate one in their respective initialize methods. Then
> LargeAllocationFailureRunnable could live in with the worker code, which I
> thinks makes more sense.

There's actually a third case that I'm primarily concerned about which is SM-internal helper threads calling this callback.  Since this can occur for both main and worker JSContexts, I think this belongs in the shared CycleCollectedRuntime path, yes?

> You should also get a review from a DOM workers peer for the
> LargeAllocationFailureRunnable stuff, because I'm not really sure if this is
> okay or not. If the worker thread is waiting on the condition variable in
> BlockUntilDone, and the main thread sends it a memory pressure observer
> message, can the worker get that message and actually GC? (Though maybe it
> is okay if it doesn't...)

Yeah, that's a good question.  In general, I was assuming that the main thread shouldn't be synchronously waiting on any non-main worker or JS helper thread.  But good to have explicit review on this; who would be the best person nowadays?
Flags: needinfo?(continuation)
(In reply to Luke Wagner [:luke] from comment #60)
> There's actually a third case that I'm primarily concerned about which is
> SM-internal helper threads calling this callback.  Since this can occur for
> both main and worker JSContexts, I think this belongs in the shared
> CycleCollectedRuntime path, yes?

Oh, right, I forgot about that. So I guess you have to leave it like it is.
 
> Yeah, that's a good question.  In general, I was assuming that the main
> thread shouldn't be synchronously waiting on any non-main worker or JS
> helper thread.  But good to have explicit review on this; who would be the
> best person nowadays?

baku probably.
Flags: needinfo?(continuation)
Attached patch improve-large-alloc-callback (obsolete) (deleted) — Splinter Review
baku: Could you review this from a worker's point of view, given mccr8's question in the second half of comment 59 (and my followup at the end of comment 60)?
Attachment #8855948 - Attachment is obsolete: true
Attachment #8855974 - Flags: review?(amarchesini)
Attached patch improve-large-alloc-callback (obsolete) (deleted) — Splinter Review
(Oops, forgot a 'return NS_OK'.)
Attachment #8855974 - Attachment is obsolete: true
Attachment #8855974 - Flags: review?(amarchesini)
Attachment #8855976 - Flags: review?(amarchesini)
Oh, and if you could add a comment somewhere about how the callback could be triggered from a JS worker thread as well as a DOM worker thread, that would be good, as that's not something you usually think about in Gecko code outside of SM.
Attachment #8855976 - Flags: review+
Attached patch bug1338217-share-code-v4.patch (obsolete) (deleted) — Splinter Review
Time to get some eyes on this.  Basically this makes Code immutable and Module nearly so (after proper initialization; see the comments and the notes below).  We avoid intermediate copies of executable code and share code when possible (i.e., when not debugging).  The DebugState type is new, factored out from Code, and owned by the Instance.  There's a simple mechanism to manage sharing, wrapped up in a variable called codeAvailable_ in the Module; this will doubtless change a little once we do in fact start sharing code.

The main complication here is that when we create a Module, we can't quite finish creating it because we don't have a context and the TlsContext won't do.  So we create it, but the executable code still needs to be wrapped in a CodeSegment and a Code and accounted for.  This is accomplished by Module::completeCompilation, which is called in a couple of places to ensure that the Module can be used.

Once the last-ditch GC machinery is in place we'll call that from AllocateCodeBytes, but I've left that code out for now.
Attachment #8855301 - Attachment is obsolete: true
Attachment #8857453 - Flags: review?(luke)
Comment on attachment 8857453 [details] [diff] [review]
bug1338217-share-code-v4.patch

Review of attachment 8857453 [details] [diff] [review]:
-----------------------------------------------------------------

Nice job with DebugState and the codeAvailable_ scheme!

So from some preliminary IRC with baku, I think the largeAlloc patch is good.  But we'll still need to get the LargeAllocationFailureCallback pointer value to the background thread.  The challenge is the JS::DeserializeWasmModule() case, where we have no JSRuntime and we're off the main thread.  I realized, though, that we can simply make the callback process-wide.  I can update the patch to do that.

So given that change, I think CodeBytes, Module::codeBytes_ and Module::completeCompilation() can be completely removed, and ModuleGenerator::finish() would simply call CodeSegment::create()?  I commented on the remaining uses of 'cx' below in this patch.

Unfortunately, bug 1340219 just added a new 'cx' to CodeSegment::create(): there is a runtime-wide cache of thunks to C++ builtins.  Embarrassingly, I didn't make the connection to this patch while reviewing.  I'll write a patch real quick to make that cache locked and process-wide; there are a small, finite number of builtins and we never need to release them, so it's pretty simple in theory.
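
(For illustration, a process-wide callback can be as simple as an atomic function pointer that any thread may read; this is only a sketch with made-up names, not the actual API:)

    #include <atomic>

    // Hypothetical process-wide hook: settable once at startup, readable from any
    // thread (main thread, DOM worker, or JS helper thread with no JSContext).
    using LargeAllocFailureCallback = void (*)();

    static std::atomic<LargeAllocFailureCallback> sLargeAllocFailureCallback{nullptr};

    void SetProcessWideLargeAllocFailureCallback(LargeAllocFailureCallback cb) {
        sLargeAllocFailureCallback.store(cb, std::memory_order_release);
    }

    // Called on large-allocation failure, before retrying the allocation.
    bool NotifyLargeAllocationFailure() {
        if (LargeAllocFailureCallback cb =
                sLargeAllocFailureCallback.load(std::memory_order_acquire)) {
            cb();  // e.g. trigger a memory-pressure GC and wait for it
            return true;
        }
        return false;
    }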

::: js/src/wasm/WasmCode.cpp
@@ +188,5 @@
>          uintptr_t end = uintptr_t(cs.base() + codeRange.end());
>          uintptr_t size = end - start;
>  
>          UTF8Bytes name;
> +        if (!metadata.getFuncName(cs.base(), cs.length(), codeRange.funcIndex(), &name))

Oh wow, this is totally a pre-existing facepalm: getFuncName() wants to take *wasm bytecode* here, not machine code.  (getFuncName() potentially pulls the utf8 names out of the names section from bytecode.)  The CodeSegment::create() 'bytecode' parameter is completely misnamed.  Could you keep the getFuncName() signature as-is and pass the actual bytecode (no maybe, b/c we always have it) as a new parameter to CodeSegment::create()?

@@ +239,3 @@
>  /* static */ UniqueCodeSegment
>  CodeSegment::create(JSContext* cx,
> +                    UniqueCodeBytes bytecode,

Probably the easiest thing is to pass the MacroAssembler& as the parameter to create() so you can executableCopy() directly into the allocated memory.

@@ +266,5 @@
>  
>      {
>          JitContext jcx(CompileRuntime::get(cx->compartment()->runtimeFromAnyThread()));
>          AutoFlushICache afc("CodeSegment::create");
> +        AutoFlushICache::setRange(uintptr_t(codeBase), codeLength);

You can avoid the 'cx' arg here by removing the AutoFlushICache/JitContext entirely and calling masm.executableCopy(dst, /* flushICache = */ false) and then calling ExecutableAllocator::cacheFlush() explicitly.

::: js/src/wasm/WasmCode.h
@@ +470,5 @@
>  
> +class Code : public ShareableBase<Code>
> +{
> +    const UniqueConstCodeSegment              segment_;
> +    const SharedMetadata                      metadata_;

Could Metadata now be uniquely owned (and cease to derive ShareableBase)?  IIRC, it's only shareable now b/c all Instances have copies of the same machine code and so the same Metadata can be shared, but now they'll simply share Codes.

::: js/src/wasm/WasmInstance.h
@@ +75,1 @@
>      const UniqueGlobalSegment       globals_;

nit: can you make it `const UniqueDebugState`?

::: js/src/wasm/WasmModule.cpp
@@ +148,5 @@
> +static const ShareableBytes*
> +MaybeBytecode(JSContext* cx, const Metadata* metadata, SharedBytes bytecode)
> +{
> +    if (cx->compartment()->isDebuggee() || metadata->debugEnabled ||
> +        !metadata->funcNames.empty())

To remove this dependency on 'cx', I think we can move Code::maybeBytecode_ to be Instance::maybeBytecode_ (and/or non-maybe DebugState::bytecode_)?  Any Code method needing bytecode could be moved to DebugState/Instance or instead take a Bytecode& from the containing Instance/Module.

@@ +173,5 @@
> +    auto code = js_new<Code>(Move(codeSegment), metadata, maybeBytecode);
> +    if (!code)
> +        return nullptr;
> +
> +    AccountForCodeBytes(cx, codeLength);

So for this I'd just do the accounting directly in WasmModuleObject::create() (w/o trying to do it only once per wasm::Module); it's just a heuristic to make sure we GC eagerly enough if Modules are churning.  Actually, I think that's the right behavior: if you imagine a pipeline of modules being created in one actor and postMessage()ed to another where the last reference is dropped, you need the destination actor to know to GC, hence the accounting bump in both.

::: js/src/wasm/WasmModule.h
@@ +112,5 @@
> +    // `codeAvailable_` is set to true when `code_` has an instance, and then to
> +    // false when `code_` is already being used for an instance and can't be
> +    // shared because it may be patched by the debugger.
> +
> +    mutable mozilla::Atomic<bool> codeAvailable_;

IIUC, with other changes mentioned, this would only apply to ensuring debug-enabled Modules get cloned.  Could it then be inverted and renamed to debugNeedsCodeClone_ and only set to true when debug-enabled after the first instance?
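
(To spell out the suggested inversion, a rough sketch with std::atomic standing in for mozilla::Atomic and a trivial Code stand-in; these are not the actual signatures:)

    #include <atomic>
    #include <memory>

    struct Code { /* immutable machine code + metadata (stand-in) */ };

    struct ModuleSketch {
        std::shared_ptr<const Code> code_;             // the eagerly created Code
        std::atomic<bool> debugNeedsCodeClone_{false};

        // Debug-enabled instantiation: the first instance claims code_ as-is;
        // later instances get a private copy they are free to patch (the real
        // patch would also relink the copy via the LinkData).
        std::shared_ptr<const Code> codeForDebugInstance() {
            bool expected = false;
            if (debugNeedsCodeClone_.compare_exchange_strong(expected, true))
                return code_;                          // first debug instance
            return std::make_shared<Code>(*code_);     // subsequent instances: clone
        }
    };
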
Attachment #8857453 - Flags: review?(luke)
Comment on attachment 8855976 [details] [diff] [review]
improve-large-alloc-callback

I should probably split the additional work on this patch (mentioned above) into a separate dependent bug.
Attachment #8855976 - Attachment is obsolete: true
Attachment #8855976 - Flags: review?(amarchesini)
Depends on: 1356631
Depends on: 1356680
Depends on: 1358045
Depends on: 1358396
(In reply to Luke Wagner [:luke] from comment #69)
> Comment on attachment 8857453 [details] [diff] [review]
> bug1338217-share-code-v4.patch
> 
> 
> ::: js/src/wasm/WasmCode.h
> @@ +470,5 @@
> >  
> > +class Code : public ShareableBase<Code>
> > +{
> > +    const UniqueConstCodeSegment              segment_;
> > +    const SharedMetadata                      metadata_;
> 
> Could Metadata now be Uniquely owned (and cease to derive ShareableBase). 
> IIRC, it's only shareable now b/c all Instances have copies of the same
> machine code and so the same Metadata can be shared, but now they'll simply
> share Codes.

Not quite.  It is also shared with WasmGenerator.  It's easy enough for Module and Instance to go via code, but I'm not 100% sure that we can move it from the WasmGenerator into the Code when we need to.  In any event, there's a fair amount of code churn to make this change, so I'm going to factor it out into its own work item and not do it here.
Attached patch bug1338217-immutable-data.patch (deleted) — Splinter Review
I decided to split this patch into two.  This is the first part:

Make Code and CodeSegment const, fix the facepalm in the use of getFuncName, and do a little additional cleanup.  In particular, MutableCode -> SharedCode, UniqueCodeSegment -> UniqueConstCodeSegment.

(I see a couple more local cleanup matters drifted into the following patch but I won't try to disentangle those now.)
Attachment #8857453 - Attachment is obsolete: true
Attachment #8860898 - Flags: review?(luke)
Attached patch bug1338217-share-code-v6.patch (obsolete) (deleted) — Splinter Review
And this is the second part:

Actually share code, with support for clone-on-debugging, serialization, and accounting.

What I am most uncertain about here is the unlinking step.  At the moment we unlink the copied code after a clone and after serialization, and unlinking has two steps: unpatch any debugger traps, and undo whatever StaticallyLink did.  My concerns are whether this is correct (I think it is) and whether it is sufficient (less sure about that).  I have a test case (not submitted yet) that actually triggers the code cloning, so I know that works at least superficially.

(I suppose unpatching could be conditional on debugging being enabled, right now it is unconditional.)
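
(In outline, the unlinking step amounts to the following; the two callees carry the names used in the patch (DebugState::UnpatchCodeCopy, here shown as a free function, and StaticallyUnlink), while the wrapper and the declarations are only stand-ins for illustration:)

    #include <cstdint>

    struct Metadata;  // carries the debug trap/call sites (declaration only)
    struct LinkData;  // records what StaticallyLink patched (declaration only)

    // Provided by the patch (DebugState::UnpatchCodeCopy and StaticallyUnlink).
    void UnpatchCodeCopy(uint8_t* codeCopy, const Metadata& metadata);
    void StaticallyUnlink(uint8_t* codeCopy, const LinkData& linkData);

    // Applied to a fresh copy of the machine code before it is handed to a new
    // debug instance or written out by serialization.
    void UnlinkCodeCopy(uint8_t* codeCopy, const Metadata& metadata, const LinkData& linkData)
    {
        UnpatchCodeCopy(codeCopy, metadata);   // step 1: undo any debugger trap patching
        StaticallyUnlink(codeCopy, linkData);  // step 2: undo whatever StaticallyLink did
    }
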
Attachment #8860899 - Flags: review?(luke)
Blocks: 1359027
(In reply to Lars T Hansen [:lth] from comment #71)
> (In reply to Luke Wagner [:luke] from comment #69)
> > Comment on attachment 8857453 [details] [diff] [review]
> > bug1338217-share-code-v4.patch
> > 
> > 
> > Could Metadata now be Uniquely owned (and cease to derive ShareableBase). 
> > IIRC, it's only shareable now b/c all Instances have copies of the same
> > machine code and so the same Metadata can be shared, but now they'll simply
> > share Codes.
> 
> Not quite.  It is also shared with WasmGenerator.  It's easy enough for
> Module and Instance to go via code, but I'm not 100% sure that we can move
> it from the WasmGenerator into the Code when we need to.  In any event,
> there's a fair amount of code churn to make this change, so I'm going to
> factor it out into its own work item and not do it here.

In addition to that, AsmJSMetadata subclasses Metadata, is itself shared, and looks like it depends on Metadata providing the refcounting machinery.  I think that needs to be broken up a bit for bug 1359027 anyway, so it's premature to dig into it in this bug.
Comment on attachment 8860898 [details] [diff] [review]
bug1338217-immutable-data.patch

Review of attachment 8860898 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/wasm/WasmCode.h
@@ +34,5 @@
>  struct Metadata;
>  class FrameIterator;
>  
> +// ShareableBytes is a ref-counted vector of bytes which are incrementally built
> +// during compilation and then immutably shared.

pre-existing: this comment is out of date; can you change it to just say "Reference-counted Vector of bytes"?

::: js/src/wasm/WasmModule.cpp
@@ +895,5 @@
>      {
>          maybeBytecode = bytecode_.get();
>      }
>  
> +    auto codeSegment = CodeSegment::create(cx, code_, maybeBytecode, linkData_, *metadata_);

Could you pass bytecode_ instead and change create() to take a & instead of * (and remove 'maybe' from parameter name)?
Attachment #8860898 - Flags: review?(luke) → review+
Comment on attachment 8860899 [details] [diff] [review]
bug1338217-share-code-v6.patch

Review of attachment 8860899 [details] [diff] [review]:
-----------------------------------------------------------------

This is looking really close, just a few non-nits below I'd like to have one more look at.

::: js/src/wasm/WasmCode.cpp
@@ +236,5 @@
> +
> +    // Zero the padding.
> +    memset(codeBase + bytesNeeded, 0, padding);
> +
> +    return Move(CodeSegment::create(codeBase, codeLength, maybeBytecode, linkData, metadata));

nit: don't need Move() here since return value of create() is an rvalue already.

@@ +569,5 @@
>  
> +size_t
> +Code::sizeOfExcludingThis(mozilla::MallocSizeOf mallocSizeOf) const
> +{
> +    return segment().length();

For memory accounting, Code already has Code::addSizeOfMisc().  I see this is called by ShareableBase::sizeOfIncludingThisIfNotSeen().  Instead, we should nix this method and morph Code::addSizeOfMisc() into Code::addSizeOfMiscIfNotSeen() and copy over the logic from ShareableBase.
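
(The "if not seen" logic being referred to is the usual memory-reporting pattern of counting a shared object only on first encounter; a generic sketch, not the SpiderMonkey signatures:)

    #include <cstddef>
    #include <unordered_set>

    struct SharedCodeSketch {
        size_t codeBytes = 0;
        size_t metadataBytes = 0;

        // The caller threads a seen-set through the traversal; a shared object
        // contributes its size only the first time it is visited, so Instances
        // and Modules sharing one Code don't double-count it.
        void addSizeOfMiscIfNotSeen(std::unordered_set<const SharedCodeSketch*>* seen,
                                    size_t* code, size_t* data) const
        {
            if (!seen->insert(this).second)
                return;  // already accounted for via another owner
            *code += codeBytes;
            *data += metadataBytes;
        }
    };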

@@ +579,5 @@
> +    return sizeof(uint32_t) + segment().length();
> +}
> +
> +uint8_t*
> +Code::serialize(uint8_t* cursor, const LinkData& oldLinkData) const

nit: Is it really *old* link data?  (I don't see what the new link data is.)

@@ +582,5 @@
> +uint8_t*
> +Code::serialize(uint8_t* cursor, const LinkData& oldLinkData) const
> +{
> +    uint32_t len = segment().length();
> +    cursor = WriteBytes(cursor, &len, sizeof(len));

nit: you can use WriteScalar<uint32_t>(cursor, segment().length()) instead

@@ +585,5 @@
> +    uint32_t len = segment().length();
> +    cursor = WriteBytes(cursor, &len, sizeof(len));
> +    uint8_t* base = cursor;
> +    cursor = WriteBytes(cursor, segment().base(), len);
> +    DebugState::UnpatchCodeCopy(base, *metadata_);

Debug-enabled code is never serialized (only the bytecode is, so it will be recompiled from scratch on deserialize, as if by build-id/cpu-id change), so you can MOZ_RELEASE_ASSERT(!metadata_->debugEnabled) at the top of the method (which we do in some other serialize methods).

::: js/src/wasm/WasmCode.h
@@ +74,5 @@
> +  public:
> +    CodeSegment(const CodeSegment&) = delete;
> +    CodeSegment(CodeSegment&&) = delete;
> +    void operator=(const CodeSegment&) = delete;
> +    void operator=(CodeSegment&&) = delete;

nit: implicit && are automatically deleted when const& are deleted so only two are needed.

@@ +383,5 @@
>      const ExclusiveData<CacheableCharsVector> profilingLabels_;
>  
>    public:
>      Code(UniqueConstCodeSegment segment,
>           const Metadata& metadata,

From IRL discussion, I think wasm::Code can uniquely own Metadata (moved from ModuleGenerator::finish()).

::: js/src/wasm/WasmInstance.h
@@ +63,5 @@
>  // direct reference to its source Module which allows a Module to be destroyed
>  // while it still has live Instances.
> +//
> +// The instance's code may be shared among multiple instances provided none of
> +// those instances are being debugged.  Instances that are being debugged own

ubernit: a general SM comment style is single space after period; could you grep this patch for `.  ` and change to `. `?

::: js/src/wasm/WasmModule.cpp
@@ +278,5 @@
>      MOZ_RELEASE_ASSERT(cursor == compiledBegin + compiledSize);
>      MOZ_RELEASE_ASSERT(!!maybeMetadata == metadata->isAsmJS());
>  
> +    onFailure.release();
> +    UniqueConstCodeSegment codeSegment = CodeSegment::create(codeBytes, codeLength, bytecode, linkData, *metadata);

To make CodeSegment follow the more usual serialization pattern, could you have it WASM_DECLARE_SERIALIZABLE(CodeSegment), have deserialize() just allocate+copy the code (not statically link), and then here call codeSegment->staticallyLink(...)?  I think this would also simplify the resource ownership situation for codeBytes since CodeSegment::deserialize() would immediately store into a member which would be freed by CodeSegment's dtor.

@@ +917,5 @@
> +
> +    if (metadata_->debugEnabled) {
> +        // Claim the code, but if it has already been claimed, make a clone instead.
> +        if (!mustCloneCodeForDebugging_.compareExchange(false, true)) {
> +            code = code_->clone(linkData_);

I hadn't realized earlier that we're racily cloning here.  While I can't think of a precise problem, since in theory the patching should only be mutating the bytes which we're about to clobber with a patchCallToNop() *anyway*... it certainly makes me uncomfortable and seems future-hazardous.  Also, this is a less-tested path, making me want to keep it as simple as possible.  What if we continued to store the unlinked bytes in a Bytes in Module, but only in debug mode (so renamed to debugCode_), and continued to make copies on instantiation?  I think the extra memory consumption shouldn't be too bad: the debugCode_ goes away when the Module goes away, which should be soon after instantiation; also it's non-executable memory so not subject to the executable quota.
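
(A sketch of that alternative, with plain std:: types standing in for Bytes and the executable-memory plumbing elided:)

    #include <cstdint>
    #include <vector>

    // Instead of racily cloning executable code, a debug-enabled Module keeps the
    // *unlinked*, non-executable code bytes, and each instantiation makes its own
    // executable copy, which that instance may then patch freely.
    struct DebugModuleSketch {
        std::vector<uint8_t> debugCode_;  // present only when debugEnabled

        std::vector<uint8_t> copyForNewInstance() const {
            // The real code would allocate executable memory, copy into it,
            // statically link, and flush the icache; here we just copy.
            return debugCode_;
        }
    };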

::: js/src/wasm/WasmModule.h
@@ +85,5 @@
>  //
> +// Fully linked-and-instantiated code (represented by Code and CodeSegment) can
> +// be shared between instances, provided none of those instances are being
> +// debugged; if patchable code is needed then each instance must have its own
> +// Code/CodeSegment pair.  Module eagerly creates a new Code/CodeSegment pair

nit: in this comment, I'd only mention Code (CodeSegment being a uniquely-owned sub-part of Code).

@@ +120,5 @@
>                        const ValVector& globalImports) const;
>  
>    public:
>      Module(Assumptions&& assumptions,
> +           SharedCode& code,

Can you pass `const Code&` to match below parameters?
https://hg.mozilla.org/integration/mozilla-inbound/rev/d81d2d6fc83f2d11cf4fa53689e14fcbfee0550b
Bug 1338217 - Make data structures immutable in preparation for sharing Wasm code. r=luke
(In reply to Luke Wagner [:luke] from comment #76)
> Comment on attachment 8860899 [details] [diff] [review]
> bug1338217-share-code-v6.patch
> 
> Review of attachment 8860899 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> @@ +383,5 @@
> >      const ExclusiveData<CacheableCharsVector> profilingLabels_;
> >  
> >    public:
> >      Code(UniqueConstCodeSegment segment,
> >           const Metadata& metadata,
> 
> From IRL discussion, I think wasm::Code can uniquely own Metadata (moved
> from ModuleGenerator::finish()).

As discussed, I'd like to leave this churn for a subsequent patch, but I won't forget.

> ::: js/src/wasm/WasmInstance.h
> @@ +63,5 @@
> >  // direct reference to its source Module which allows a Module to be destroyed
> >  // while it still has live Instances.
> > +//
> > +// The instance's code may be shared among multiple instances provided none of
> > +// those instances are being debugged.  Instances that are being debugged own
> 
> ubernit: a general SM comment style is single space after period; could you
> grep this patch for `.  ` and change to `. `?

That pattern is common (and not just in code I've written) and I don't think I'll address that here.

> ::: js/src/wasm/WasmModule.cpp
> @@ +278,5 @@
> >      MOZ_RELEASE_ASSERT(cursor == compiledBegin + compiledSize);
> >      MOZ_RELEASE_ASSERT(!!maybeMetadata == metadata->isAsmJS());
> >  
> > +    onFailure.release();
> > +    UniqueConstCodeSegment codeSegment = CodeSegment::create(codeBytes, codeLength, bytecode, linkData, *metadata);
> 
> To make CodeSegment follow the more usual serialization pattern, could you
> have it WASM_DECLARE_SERIALIZABLE(CodeSegment), have deserialize() just
> allocate+copy the code (not statically link), and then here call
> codeSegment->staticallyLink(...)?  I think this would also simplify the
> resource ownership situation for codeBytes since CodeSegment::deserialize()
> would immediately store into a member which would be freed by CodeSegment's
> dtor.

I don't think I can quite do that.  serialize() will need to take the linkData as an argument, for one thing, because we want to unlink after copying the code bytes but we only have the pointer into the serialization buffer inside serialize().  So the macro won't be directly applicable here.  But in general I agree that pushing the logic from Code into CodeSegment is worth trying.
Attached patch bug1338217-share-code-v7.patch (deleted) — Splinter Review
Updated patch.  This addresses all the remarks on the previous patch.  I even made Code own the Metadata here because that made some accounting cleanup easier.

A few things worth pointing out:

- Metadata is still a RefPtr type.  I could have fixed that but I did not
  because I believe that the Metadata segregation for tiering will bring back
  the RefPtr for the metadata that are common to the tiers.

- I added accounting for DebugState, but I was a little in doubt about whether
  that's desirable or not.

- I couldn't quite make serialization as clean as you suggested but I'm pretty
  happy with how this turned out anyhow.  Both serialization and deserialization
  of code really want to have access to sundry metadata during those
  processes; it's not natural to deserialize and then apply those metadata, say.

I'd also appreciate it if you would take a particularly close look at the accounting code; it looks OK to me, but I had a situation where about:memory threw up a warning about inconsistent numbers with the debugger activated.  I haven't been able to repro that, so I don't know if this patch is to blame or not.
Attachment #8860899 - Attachment is obsolete: true
Attachment #8860899 - Flags: review?(luke)
Attachment #8862886 - Flags: review?(luke)
Comment on attachment 8862886 [details] [diff] [review]
bug1338217-share-code-v7.patch

Review of attachment 8862886 [details] [diff] [review]:
-----------------------------------------------------------------

Awesome, thanks!

::: js/src/wasm/WasmCode.cpp
@@ +230,5 @@
> +    uint8_t* codeBase = AllocateCodeBytes(codeLength);
> +    if (!codeBase)
> +        return nullptr;
> +
> +    // We'll flush the icache after static linking, below.

nit: s/below/in initialize()/

@@ +349,5 @@
> +    cursor = WriteScalar<uint32_t>(cursor, length_);
> +    uint8_t* base = cursor;
> +    cursor = WriteBytes(cursor, bytes_, length_);
> +    if (base)
> +        StaticallyUnlink(base, linkData);

I don't see how 'base' could ever be null

@@ +360,5 @@
> +{
> +    uint32_t length;
> +    cursor = ReadScalar<uint32_t>(cursor, &length);
> +    if (!cursor)
> +        return nullptr;

nit: can you put \n after all the failure returns in this function?

@@ +678,5 @@
> +        return nullptr;
> +
> +    segment_ = UniqueConstCodeSegment(codeSegment.release());
> +    metadata_ = metadata;
> +    maybeBytecode_ = bytecode;

Setting `maybeBytecode_` unconditionally here will mean that bytecode will be kept alive by instances always (not just when debugging is enabled).  Should only set if metadata->debugEnabled.

::: js/src/wasm/WasmCode.h
@@ +88,5 @@
> +        unalignedAccessCode_(nullptr)
> +    {}
> +
> +    static UniqueConstCodeSegment create(jit::MacroAssembler& masm,
> +                                         const SharedBytes& bytecode,

SharedBytes is the RefPtr type; I think you want ShareableBytes here (and below x3).  Incidentally, one of the callers is passing &bytecode, which constructs a temporary RefPtr to bind to the const&.
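
(The refcount churn mentioned here, reduced to a self-contained example; RefPtrSketch and ShareableBytesSketch are stand-ins for mozilla's RefPtr and the ref-counted byte vector:)

    struct ShareableBytesSketch { /* ref-counted byte vector (stand-in) */ };

    template <typename T>
    struct RefPtrSketch {
        RefPtrSketch(T* p) : ptr(p) { /* AddRef(p) */ }   // implicit, like RefPtr
        ~RefPtrSketch() { /* Release(ptr) */ }
        T* ptr;
    };

    // A `const RefPtr&` parameter binds to a temporary RefPtr constructed from a
    // raw pointer, so create(&bytecode) compiles but does a hidden AddRef/Release
    // on every call.
    void createTakingRefPtr(const RefPtrSketch<ShareableBytesSketch>& bytecode) {}

    // Taking `const ShareableBytes&` avoids the temporary entirely.
    void createTakingRef(const ShareableBytesSketch& bytecode) {}

    void caller() {
        ShareableBytesSketch bytecode;
        createTakingRefPtr(&bytecode);  // constructs and destroys a temporary RefPtrSketch
        createTakingRef(bytecode);      // no refcount traffic
    }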

@@ +100,5 @@
> +
> +    // codeBytes must be executable memory.
> +    // This assumes ownership of the codeBytes, and deletes them in the event of error.
> +    static UniqueConstCodeSegment create(uint8_t* codeBytes,
> +                                         uint32_t codeLength,

Could you make this create() private?

::: js/src/wasm/WasmDebug.cpp
@@ +77,5 @@
>  
> +size_t
> +GeneratedSourceMap::sizeOfExcludingThis(mozilla::MallocSizeOf mallocSizeOf) const
> +{
> +    size_t size = SizeOfVectorExcludingThis(exprlocs_, mallocSizeOf);

This can be exprlocs_.sizeOfExcludingThis() and then you don't need the 'return 0' sizeOfExcludingThis method in ExprLoc.

::: js/src/wasm/WasmGenerator.cpp
@@ +1192,5 @@
> +        if (!maybeDebuggingBytes)
> +            return nullptr;
> +    }
> +
> +    SharedCode code = js_new<Code>(Move(codeSegment), *metadata_, &bytecode);

Similar to above, should only pass bytecode if metadata->debugEnabled.

::: js/src/wasm/WasmModule.cpp
@@ +153,2 @@
>          *maybeCompiledSize = assumptions_.serializedSize() +
> +                             code_->serializedSize() +

Could you move code_->serializedSize() to the end to match order in (de)serialize?

::: js/src/wasm/WasmModule.h
@@ +87,5 @@
> +// CodeSegment) can be shared between instances, provided none of those
> +// instances are being debugged; if patchable code is needed then each instance
> +// must have its own Code. Module eagerly creates a new Code and gives it to
> +// the first instance; it then clones (copies and relinks) this code for
> +// subsequent instances if debugging is enabled.

Last sentence of comment needs to be updated.
Attachment #8862886 - Flags: review?(luke) → review+
(In reply to Luke Wagner [:luke] from comment #82)
> Comment on attachment 8862886 [details] [diff] [review]
> bug1338217-share-code-v7.patch
> 
> Review of attachment 8862886 [details] [diff] [review]:
> -----------------------------------------------------------------

An interesting problem.  Consider:

> @@ +678,5 @@
> > +        return nullptr;
> > +
> > +    segment_ = UniqueConstCodeSegment(codeSegment.release());
> > +    metadata_ = metadata;
> > +    maybeBytecode_ = bytecode;
> 
> Setting `maybeBytecode_` unconditionally here will mean that bytecode will
> be kept alive by instances always (not just when debugging is enabled). 
> Should only set if metadata->debugEnabled.

and

> ::: js/src/wasm/WasmGenerator.cpp
> @@ +1192,5 @@
> > +        if (!maybeDebuggingBytes)
> > +            return nullptr;
> > +    }
> > +
> > +    SharedCode code = js_new<Code>(Move(codeSegment), *metadata_, &bytecode);
> 
> Similar to above, should only pass bytecode if metadata->debugEnabled.

Changing this as suggested causes test suite failures (assertions).  The reason is in getFuncName() in WasmCode.cpp, which requires the bytecode to be present.  I can add a guard there on the presence of bytecode, in which case it returns a synthesized name, but then the tests fail because the name is not what's expected.  The test that fails is wasm/binary.js, in runStackTraceTest.

I'll land the patch without the conditional saving of the bytecode, but we should discuss what's at fault here, be it code or test suite, and come to an agreement on how to proceed.
Ah, I forgot the names case: the bytecode should be saved if debugEnabled || !funcNames.empty() (which I think was the logic before).  rs=me with that change.
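
(Roughly, with stand-in types; this just restates the condition, it isn't the landed followup:)

    #include <string>
    #include <vector>

    struct MetadataSketch {                  // stand-in for wasm::Metadata
        bool debugEnabled = false;
        std::vector<std::string> funcNames;  // stand-in for the names-section entries
    };
    struct ShareableBytesSketch {};          // stand-in for the ref-counted bytecode

    // Keep the bytecode alive only when something can actually consume it:
    // the debugger, or getFuncName() reading the names section for stack traces.
    const ShareableBytesSketch*
    MaybeBytecode(const MetadataSketch& metadata, const ShareableBytesSketch& bytecode)
    {
        if (metadata.debugEnabled || !metadata.funcNames.empty())
            return &bytecode;
        return nullptr;
    }
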
(In reply to Luke Wagner [:luke] from comment #85)
> Ah, I forgot the names case: the bytecode should be saved if debugEnabled ||
> !funcNames.empty() (which was I think the logic before).  rs=me with that
> change.

Oh sure, that song and dance.  I'll push a small followup tomorrow morning for this.
Hmm, found a similar crash just now on autoland: bug 1361529
Backout by kwierso@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/a748acbebbde
Backed out changeset d3197ffef609 for failures in test_webassembly_compile.html on at least Windows VM debug a=backout a=merge
Evidence points to a mutex ordering problem: ~Code destroys the profiling labels, which are under a lock, which it must acquire.  And ~Code can be invoked by the GC, which is ultimately triggered by the shutdown code in WorkerThreadPrimaryRunnable::Run() in dom/workers/RuntimeService.cpp.  The problem arises when the GC calls purgeRuntime(), which purges the promise tasks.  To do so, it does "rt->promiseTasksToDestroy.lock()->clear()".  However, the promiseTasksToDestroy object has the same Mutex priority as the profiling labels, and we MOZ_CRASH as observed.
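
(A reduced model of the ordering rule that fires here, assuming Mozilla-style ordered mutexes where acquiring a lock whose priority is not strictly greater than one already held is fatal; the names, priorities, and the OrderedMutexSketch type are placeholders, not the real Mutex implementation:)

    #include <cassert>
    #include <mutex>

    thread_local int highestHeldOrder = 0;

    struct OrderedMutexSketch {
        int order;
        std::mutex m;
        explicit OrderedMutexSketch(int o) : order(o) {}
        void lock() {
            // This is the check that aborts (MOZ_CRASH) in the crash described above.
            assert(order > highestHeldOrder && "mutex ordering violation");
            m.lock();
            highestHeldOrder = order;
        }
    };

    int main() {
        OrderedMutexSketch promiseTasksToDestroy(500);  // locked while the GC purges promise tasks
        OrderedMutexSketch profilingLabels(500);        // same priority; locked by ~Code

        promiseTasksToDestroy.lock();  // purgeRuntime() clears the promise task vector...
        profilingLabels.lock();        // ...which drops the last ref to a Code; the same-priority
                                       // acquisition trips the ordering check here
    }
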
Flags: needinfo?(lhansen)
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/74a0a0207e08
Share code between Wasm instances.  r=luke
https://hg.mozilla.org/integration/mozilla-inbound/rev/feb47dd8a60b
Only preserve bytecode when it's needed, rs=luke
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/31c88a4764ca
Backed out changeset feb47dd8a60b 
https://hg.mozilla.org/integration/mozilla-inbound/rev/5813ddb39f5a
Backed out changeset 74a0a0207e08 for linux failures
Silly error in the "simple" patch, using the wrong metadata value.
Flags: needinfo?(lhansen)
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/81163555cc77
Share code between Wasm instances.  r=luke
https://hg.mozilla.org/integration/mozilla-inbound/rev/31d1d042e6fd
Only preserve bytecode when it's needed, rs=luke
Keywords: leave-open
So, :tomcat will back these patches out again because there are other failures similar to the one in comment 91.  In the new case, the problem is that the lock priority for ProcessExecutableRegion equals that of PromiseTaskPtrVector, and then we're into the same failure mode, as we may synchronously deallocate executable memory from a GC at worker shutdown.  See bug 1361980 for pointers to some stacks.  Again, Windows-only, and intermittent to boot.

It's easy enough to change the priority but I don't understand why this bug has not bitten us before.  Of course the sharing patterns introduced by the patches on this bug will lead to different memory consumption patterns which will drive the GC in different ways, and so maybe it's a bug that has been there all along but is only now found.  But it could also be that there's something else that needs to be fixed.
Definitely good to understand why this didn't hit before.  I think the cause is that before, code memory would be deleted by the Instance finalizer, which would happen during normal GC sweeping.  In the new case found in bug 1361980, the last ref is being dropped by the *Module* (which before didn't hold any executable code) and not during normal GC sweeping but during the special purgeRuntime phase where we lock the promise task vector.  So I think bumping the priority is the right fix and we're not papering over a bug.
Right, I just reached that conclusion myself after staring at a lot of code: Previously the Module held a Bytes, which it could delete just fine without this locking; now it holds a SharedCode, which requires the lock.
Pushed a fix for bug 1361980.
https://hg.mozilla.org/mozilla-central/rev/81163555cc77
https://hg.mozilla.org/mozilla-central/rev/31d1d042e6fd
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED