vect: use wider precision type for generating early break scalar IV [PR123089]
In the PR we see that the new scalar IV tricks other passes into thinking
there's an overflow, due to the use of a signed counter:
The loop is known to iterate 8191 times, we have a VF of 8, and it starts at 2.
The codegen out of the vectorizer is the same as before, except we now have a
scalar variable counting the scalar iteration count instead of a vector one,
i.e. we have

  _45 = _39 + 8;

instead of

  _46 = _45 + { 16, 16, 16, 16, ... }

We pick a lower VF now since costing allows it, but that's not important.
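
For reference, the loop involved has roughly this shape (a hypothetical
reconstruction for illustration, not the exact reproducer from the PR):

  /* Hypothetical reconstruction of the kind of loop in the PR, not the
     exact reproducer: a search loop with an early break that iterates
     8191 times.  */
  int
  find (char *a)
  {
    for (int i = 2; i < 8193; i++)  /* 8191 iterations, starting at 2.  */
      if (a[i] == 124)              /* early break out of the loop.  */
        return i;
    return -1;
  }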
When we get to cunroll, since the value is now scalar, it sees that 8 * 8191
would overflow a signed short.  It therefore changes the loop bound to the
largest possible signed value and uses this to elide the ivtmp_50 < 8191 check
as always true, and so you get an infinite loop:
  Analyzing # of iterations of loop 1
    exit condition [1, + , 1](no_overflow) < 8191
    bounds on difference of bases: 8190 ... 8190
    result:
      # of iterations 8190, bounded by 8190
  Statement (exit)if (ivtmp_50 < 8191)
   is executed at most 8190 (bounded by 8190) + 1 times in loop 1.
  Induction variable (signed short) 8 + 8 * iteration does not wrap in statement
  _45 = _39 + 8;
   in loop 1.
  Statement _45 = _39 + 8;
   is executed at most 4094 (bounded by 4094) + 1 times in loop 1.
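
The 4094 figure is just the no-wrap bound on the signed short IV: the largest
n for which 8 + 8 * n still fits in a signed short is

  floor ((32767 - 8) / 8) = 4094

so under the assumption that signed overflow cannot happen, the increment
statement can execute at most 4094 + 1 times, contradicting the 8190 + 1
bound derived from the exit condition.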
The signed type was originally chosen because of the negative offset we use
when adjusting for peeling for alignment with masks.  However, this then
introduces issues with signed overflow, as we see here.  This patch instead
determines the smallest possible type for the scalar IV such that overflow
won't happen once we include the extra bit for the sign.  i.e. if the scalar
IV is an unsigned 8-bit value we pick a signed 16-bit type, but if it is a
signed 8-bit value we pick an unsigned 8-bit type.
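
In other words, rounding up to the next standard precision: an unsigned 8-bit
IV has 8 value bits, and 8 + 1 = 9 rounds up to 16, hence a signed 16-bit
type; a signed 8-bit IV has only 7 value bits, and 7 + 1 = 8 still fits in
8 bits.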
We use the initial niters value to determine the smallest size possible, to
prevent cases such as a 64-bit IV in the code needing a TImode counter.  I
also only require the additional bit when I know we'll be generating the SMAX.
I've now moved this to vectorizable_early_exit so that if we do end up needing
something like TImode we don't vectorize when the target doesn't support it.
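
As a self-contained sketch of that sizing (the helper name and the hard-coded
64-bit support limit are stand-ins, not the real checks in
vectorizable_early_exit):

  #include <stdbool.h>

  /* Hypothetical sketch, not the committed code: compute the IV
     precision from the maximum iteration count, reserving a sign bit
     only when the SMAX adjustment will be generated, and report
     failure when the result would need an unsupported mode.  */
  static unsigned
  early_break_iv_precision (unsigned long long max_niters, bool need_smax)
  {
    unsigned bits = 1;
    while (bits < 64 && (max_niters >> bits) != 0)
      bits++;                       /* value bits needed for max_niters  */
    if (need_smax)
      bits++;                       /* extra bit for the sign  */
    unsigned prec = 8;
    while (prec < bits)
      prec *= 2;                    /* round up to a standard precision  */
    return prec <= 64 ? prec : 0;   /* 128 bits would mean TImode  */
  }

A zero return here would correspond to refusing to vectorize when the target
lacks the required mode.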
I've also added some testcases for masking around the boundary values. I've
only added them for char to reduce the runtime of the tests.
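
The runtime tests have roughly this shape (a hypothetical sketch, not the
committed peel_ind_*_run.c files):

  #include <stdlib.h>

  /* Hypothetical sketch of the boundary tests: place the match right
     at an 8-bit boundary value and check the early break still fires
     at the correct index.  */
  #define N 260

  unsigned char a[N];

  int __attribute__ ((noipa))
  find (void)
  {
    for (int i = 0; i < N; i++)
      if (a[i] == 1)
        return i;
    return -1;
  }

  int
  main (void)
  {
    a[255] = 1;
    if (find () != 255)
      abort ();
    return 0;
  }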
gcc/ChangeLog:
PR tree-optimization/123089
* tree-vect-loop.cc (vect_update_ivs_after_vectorizer_for_early_breaks):
Add conversion if required.  Note that if we did truncate, the original
scalar loop had an overflow here anyway.
(vect_get_max_nscalars_per_iter): Expose.
* tree-vect-stmts.cc (vect_compute_type_for_early_break_scalar_iv): New.
(vectorizable_early_exit): Find smallest type where we won't have UB in
the signed IV and store it.
* tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_IV_TYPE): New.
(class _loop_vec_info): Add early_break_iv_type.
(vect_min_prec_for_max_niters): New.
* tree-vect-loop-manip.cc (vect_do_peeling): Use it.
gcc/testsuite/ChangeLog:
PR tree-optimization/123089
* gcc.dg/vect/vect-early-break_141-pr123089.c: New test.
* gcc.target/aarch64/sve/peel_ind_14.c: New test.
* gcc.target/aarch64/sve/peel_ind_14_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_15.c: New test.
* gcc.target/aarch64/sve/peel_ind_15_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_16.c: New test.
* gcc.target/aarch64/sve/peel_ind_16_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_17.c: New test.
* gcc.target/aarch64/sve/peel_ind_17_run.c: New test.