Closed Bug 989718 Opened 10 years ago Closed 7 years ago

ARM compiler mis-optimizes mozilla::Abs applied to the most negative value

Categories

(Core :: MFBT, defect, P2)

ARM
Android
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: xidorn, Unassigned)

References

Details

Attachments

(5 files)

Attached file test.cpp (deleted) —
Please see the test program in the attachment. If you input "-2147483648" to the program, it will show the absolute value of the number is valid for int32_t, which is completely incorrect.

It seems to be an upstream bug, but I haven't located the precise condition of it.

In my experiment, if you assign the number directly instead of reading it from outside, this bug won't be triggered. In this program, if you add an output of the result of "a <= uint32_t(int32_t(~(uint32_t(1) << 31)))" which is the final expression after all template expansion, you can find it incorrectly returns true. However, if you remove all CheckedInt-related code, this expression magically returns false.
Blocks: 966166
When compiling using NDK with GCC 4.6 and 4.8, the problem exists. However, if using clang as toolchain, either 3.2 or 3.3, the problem is solved.
I can't reproduce this with gcc 4.4, 4.5, 4.6, 4.8, or clang r198563 with a Linux 64-bit build.  This together with comment 1 make me think this is an NDK gcc problem.  :-\

Seeing as you have the compiler here and I don't, any chance you could look at what the generated code is for the CheckedInt constructor call, to figure out what's up?  And does this appear with debug builds, or only with optimized builds?  We can do toolchain-specific disabling of optimizations if needed here.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #2)
> I can't reproduce this with gcc 4.4, 4.5, 4.6, 4.8, or clang r198563 with a
> Linux 64-bit build.  This together with comment 1 make me think this is an
> NDK gcc problem.  :-\

Sorry for not clarifying the problem. This happens only on ARM platforms, including Android and B2G.

> Seeing as you have the compiler here and I don't, any chance you could look
> at what the generated code is for the CheckedInt constructor call, to figure
> out what's up?  And does this appear with debug builds, or only with
> optimized builds?  We can do toolchain-specific disabling of optimizations
> if needed here.

I checked the asm code, and found that the whole condition is optimized out to constant true. Lowering the optimization level to -O1 could solve this problem, but I don't know what flag affects it exactly.
Summary: CheckedInt returns wrong result → CheckedInt returns wrong result on ARM platforms
In addition, you can check this try server result: https://tbpl.mozilla.org/?tree=Try&rev=84426ba00c88

All reftests failed in it is due to this bug.
I got the exact flag which causes this bug: -fstrict-overflow
Add -fno-strict-overflow to CPP_FLAGS could solve this problem. Please add this flag when platform is ARM and compiler is GCC.
Attached patch patch (deleted) — Splinter Review
This patch seems to be able to fix the problem: https://tbpl.mozilla.org/?tree=Try&rev=4cd5309e520a
Attachment #8402245 - Flags: review?(jwalden+bmo)
Comment on attachment 8402245 [details] [diff] [review]
patch

No, this can't be correct.  The flag -fno-strict-overflow makes gcc have better-defined behavior *only* in cases where undefined signed integer overflow semantics are relied upon.  CheckedInt's explicit goal is to not invoke any such case.  So this is definitely not what is wanted.

What happens if, with the compile options that trigger the unwanted behavior, you add -Wstrict-overflow when compiling?  That's supposed to trigger warnings when the compiler does something that depends on signed overflow being undefined.  Upload a copy of the compiler spew, and with that we can hopefully spot exactly what ARM gcc doesn't like here.
Attachment #8402245 - Flags: review?(jwalden+bmo) → review-
It points to the last line (which is only the closing curly bracket), and states:
warning: assuming signed overflow does not occur when assuming abs (x) >= 0 is true [-Wstrict-overflow]

My current testing code is like this:
  int32_t ordinal;
  std::cin >> ordinal;
  uint32_t a = Abs(ordinal);
  CheckedInt<int32_t> absolute(a);
1 OUTPUT(absolute.isValid());
  OUTPUT(a);
  OUTPUT(uint32_t(int32_t(~(uint32_t(1) << 31))));
2 OUTPUT(a <= uint32_t(int32_t(~(uint32_t(1) << 31))));

When the bug occurs, both the line marked 1 and 2 outputs true. However, if I comment out the line with 1, the 2 correctly outputs false.
Attached file test program (deleted) —
I have made this very brief test which you can test directly with android-ndk installed. Just unpack it and run `ndk-build` in the directory, and push it to an ARM-based Android device or emulator.
(In reply to Xidorn Quan from comment #8)
> It points to the last line (which is only the closing curly bracket), and
> states:
> warning: assuming signed overflow does not occur when assuming abs (x) >= 0
> is true [-Wstrict-overflow]

Strange.  There shouldn't be any call to abs() in that code at all.  (Unless this is just the compiler's pseudo-code to say what it's relying on.)  Abs() uses a custom bitwise algorithm precisely because abs(i) isn't defined for all possible |i|.

Do you get any finer-grained warnings if you generate the preprocessed source (gcc's -E flag will do this), remove all the "# foo" lines in it, and compile that?  (This might be tricky to do, to be sure -- it's just what I'd do if I were encountering this on desktop.  I don't have an NDK build system set up, myself, so testing that myself would require considerable time investment, for -- frankly -- not enough gain.)
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #10)
> Do you get any finer-grained warnings if you generate the preprocessed
> source (gcc's -E flag will do this), remove all the "# foo" lines in it, and
> compile that?  (This might be tricky to do, to be sure -- it's just what I'd
> do if I were encountering this on desktop.  I don't have an NDK build system
> set up, myself, so testing that myself would require considerable time
> investment, for -- frankly -- not enough gain.)

Nothing more. Even for the preprocessed source, it still points to the last line and output the same warning message when -Wstrict-overflow is set.

Maybe we should ask some build system peer to have a look at this problem?
It's not a build system peer sort of issue.  Definitely, changing overflow checking for the entire project is not the right solution.

Someone familiar with gcc internals and/or its diagnostic modes, perhaps able to debug this a little, is maybe the most reasonable person to investigate.  I don't know if we have such a person around.

Alternatively, someone familiar with C++ undefined behavior, and integer overflow semantics, might be good.  Hmm.  I believe I'd satisfy the latter, if I had a build system set up.  Maybe it's time to do that.  Next time I'm at the Mountain View office I'll look into that, via a VM on my Windows box there.  (I run Fedora normally, which I understand doesn't play well with NDK stuff expecting a Debian-like system.  Sigh.)  I think I'll be in SF tomorrow, so that'll be Tuesday.

In the meantime, if you have patches blocked here, depending on the complexity of their dependence on CheckedInt, it might be worth manually coding CheckedInt-like bits for now, with a followup bug (depending on this one) to fix back to CheckedInt at the right time.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #12)
> It's not a build system peer sort of issue.  Definitely, changing overflow
> checking for the entire project is not the right solution.
> 
> Someone familiar with gcc internals and/or its diagnostic modes, perhaps
> able to debug this a little, is maybe the most reasonable person to
> investigate.  I don't know if we have such a person around.
> 
> Alternatively, someone familiar with C++ undefined behavior, and integer
> overflow semantics, might be good.  Hmm.  I believe I'd satisfy the latter,
> if I had a build system set up.  Maybe it's time to do that.  Next time I'm
> at the Mountain View office I'll look into that, via a VM on my Windows box
> there.  (I run Fedora normally, which I understand doesn't play well with
> NDK stuff expecting a Debian-like system.  Sigh.)  I think I'll be in SF
> tomorrow, so that'll be Tuesday.

I think it is definitely a GCC bug, and I'd like to have a look into its source code when I have time (possibly next week). Maybe you can test whether other GCC-based ARM cross-compiling toolchain has the same problem. Fedora should have such packages for that.

> In the meantime, if you have patches blocked here, depending on the
> complexity of their dependence on CheckedInt, it might be worth manually
> coding CheckedInt-like bits for now, with a followup bug (depending on this
> one) to fix back to CheckedInt at the right time.

Don't mind the patches. They are still far from landing.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #12)
> Someone familiar with gcc internals and/or its diagnostic modes, perhaps
> able to debug this a little, is maybe the most reasonable person to
> investigate.  I don't know if we have such a person around.

I have entirely too much familiarity with GCC internals.

Xidorn, if you are able to compile the test case with the additional options -fdump-tree-all -da -dP, the compilation will produce numerous ${FILENAME}.cpp.${STUFF} files.  Please zip all those files up, and I will take a look to see if anything obvious is going wrong.

If you can produce separate sets of those files for different GCC versions, that is also helpful.

Reporting such a bug on the GCC bugtracker (http://gcc.gnu.org/bugzilla/) with preprocessed source might also be helpful.
Flags: needinfo?(quanxunzhen)
Attached file dump tree (deleted) —
This is the dump tree info.
mytest.bad is the version with problem, while mytest.good is the version compiled with -fno-strict-overflow set which does not have this problem.
Flags: needinfo?(quanxunzhen)
(In reply to Xidorn Quan from comment #15)
> This is the dump tree info.
> mytest.bad is the version with problem, while mytest.good is the version
> compiled with -fno-strict-overflow set which does not have this problem.

Excellent, thank you!

It looks like this problem creeps in very early.  GCC transforms mozilla::Abs into the "natural" implementation that doesn't deal correctly with the minimum possible value.  You can see this in the mytest.cpp.004t.gimple dump if you like.

Surprisingly, this doesn't really cause problems until mytest.cpp.196r.combine, where we have something like:

  r120 = abs(r110)
  r133 = ~r120
  r126 = r133 >> 31 (logical shift, not arithmetic)

The -fno-strict-overflow keeps this code, whereas the -fstrict-overflow version naturally transforms this into:

  r126 = 1

and we then have problems.

The reduced testcase, then, would be something like:

uint32_t
Abs(int32_t x)
{
  return x >=0 ? x : ~uint32_t(x) + 1;
}

This shouldn't get transformed into a negation.
(In reply to Nathan Froyd (:froydnj) from comment #16)
> (In reply to Xidorn Quan from comment #15)
> > This is the dump tree info.
> > mytest.bad is the version with problem, while mytest.good is the version
> > compiled with -fno-strict-overflow set which does not have this problem.
> 
> Excellent, thank you!
> 
> It looks like this problem creeps in very early.  GCC transforms
> mozilla::Abs into the "natural" implementation that doesn't deal correctly
> with the minimum possible value.  You can see this in the
> mytest.cpp.004t.gimple dump if you like.

I find it even earlier. You can see in mytest.cpp.003t.original, mozilla::Abs has been converted into the unsafe implementation, which use "-U(t)" instead of "~U(t) + 1". However, it is strange that, when I compile it with my native gcc toolchain whose result is correct, the tree dump shows my native gcc do the same tranformation as well. So the problem might not be here (or this itself is a bug, but my native gcc has another bug which make it not successfully optimize the code).
Attached file dump tree 2 (deleted) —
In this pack, mytest.question is generated by the same gcc as the two dump trees before, but with the modification to code mentioned in comment 8. mytest.native is generated by my native gcc 4.8 compiler. Both of them have the expected result.
Summary: CheckedInt returns wrong result on ARM platforms → ARM compiler mis-optimizes mozilla::Abs applied to the most negative value
Blocks: 1135426
It looks like this is going to be fixed by bug 1163171.
Depends on: 1163171
All relevant test annotations have been removed in https://hg.mozilla.org/mozilla-central/rev/6afa42ff81ef
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: