The current implementation relies on setting the rounding mode for
different calculations (first to FE_TONEAREST and then to FE_TOWARDZERO)
to obtain correctly rounded results. For most CPUs, this adds a significant
performance overhead since it requires executing a typically slow
instruction (to get/set the floating-point status), it necessitates
flushing the pipeline, and breaks some compiler assumptions/optimizations.
This patch introduces a new implementation originally written by Szabolcs
for musl, which utilizes mostly integer arithmetic. Floating-point
arithmetic is used to raise the expected exceptions, without the need for
fenv.h operations.
I added some changes compared to the original code:
* Fixed some signaling NaN issues when the 3-argument is NaN.
* Use math_uint128.h for the 64-bit multiplication operation. It allows
the compiler to use 128-bit types where available, which enables some
optimizations on certain targets (for instance, MIPS64).
* Fixed an arm32 issue where the libgcc routine might not respect the
rounding mode [1]. This can also be used on other targets to optimize
the conversion from int64_t to double.
* Use -fexcess-precision=standard on i686.
I tested this implementation on various targets (x86_64, i686, arm, aarch64,
powerpc), including some by manually disabling the compiler instructions.