I happened to look more closely at the generated assembly for some code that uses align_offset, and noticed... align_offset does not compile as efficiently as one might hope for the case of "align pointer to size_of::<usize>()".
For example (ignore that I omit handling of align_offset's error return value):
pub unsafe fn align_offset(p: *const u8) -> *const u8 {
    p.add(p.align_offset(core::mem::size_of::<usize>()))
}
compiles to
example::align_offset:
movl %edi, %ecx
andl $7, %ecx
movl $8, %eax
subq %rcx, %rax
testq %rcx, %rcx
cmoveq %rcx, %rax
addq %rdi, %rax
retq
Whereas performing the same alignment manually (forgive my convoluted way of doing this; my usual pattern is slightly different, and I don't have a memorized idiom for this that doesn't go through usize, since, well, I figured I'd just use align_offset):
const USIZE_SIZE: usize = core::mem::size_of::<usize>();

pub unsafe fn manual_align_offset(p: *const u8) -> *const u8 {
    // IIRC just doing `arithmetic(p as usize) as *const u8` makes LLVM sad
    let aligned = ((p as usize + USIZE_SIZE - 1) & !(USIZE_SIZE - 1)) as isize;
    p.offset(aligned - (p as isize))
}
compiles to
example::manual_align_offset:
leaq 7(%rdi), %rax
andq $-8, %rax
retq
Which is substantially better along a variety of metrics, including but not limited to actual runtime.
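For what it's worth, the two lowerings compute the same addresses away from the top of the address space. A quick sanity sketch over plain usize addresses (the helper names here are mine, not the standard library's; only the 8-byte-usize constant comes from the snippets above):

```rust
const USIZE_SIZE: usize = core::mem::size_of::<usize>();

// Offset-based rounding, mirroring p.add(p.align_offset(USIZE_SIZE)).
fn align_via_offset(addr: usize) -> usize {
    let misalignment = addr % USIZE_SIZE;
    let offset = if misalignment == 0 { 0 } else { USIZE_SIZE - misalignment };
    addr + offset
}

// Mask-based rounding, mirroring manual_align_offset.
fn align_via_mask(addr: usize) -> usize {
    (addr + USIZE_SIZE - 1) & !(USIZE_SIZE - 1)
}

fn main() {
    // The two formulations agree on every small address.
    for addr in 0..1024usize {
        assert_eq!(align_via_offset(addr), align_via_mask(addr));
    }
    println!("ok");
}
```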
Taking a look at the source for align_offset reveals that it, uh, well, it does some stuff.
rust/library/core/src/ptr/mod.rs
Lines 1166 to 1271 in ac48e62
/// Any questions go to @nagisa.
#[lang = "align_offset"]
pub(crate) unsafe fn align_offset<T: Sized>(p: *const T, a: usize) -> usize {
    /// Calculate multiplicative modular inverse of `x` modulo `m`.
    ///
    /// This implementation is tailored for align_offset and has following preconditions:
    ///
    /// * `m` is a power-of-two;
    /// * `x < m`; (if `x ≥ m`, pass in `x % m` instead)
    ///
    /// Implementation of this function shall not panic. Ever.
    #[inline]
    fn mod_inv(x: usize, m: usize) -> usize {
        /// Multiplicative modular inverse table modulo 2⁴ = 16.
        ///
        /// Note, that this table does not contain values where inverse does not exist (i.e., for
        /// `0⁻¹ mod 16`, `2⁻¹ mod 16`, etc.)
        const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
        /// Modulo for which the `INV_TABLE_MOD_16` is intended.
        const INV_TABLE_MOD: usize = 16;
        /// INV_TABLE_MOD²
        const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;

        let table_inverse = INV_TABLE_MOD_16[(x & (INV_TABLE_MOD - 1)) >> 1] as usize;
        if m <= INV_TABLE_MOD {
            table_inverse & (m - 1)
        } else {
            // We iterate "up" using the following formula:
            //
            // $$ xy ≡ 1 (mod 2ⁿ) → xy (2 - xy) ≡ 1 (mod 2²ⁿ) $$
            //
            // until 2²ⁿ ≥ m. Then we can reduce to our desired `m` by taking the result `mod m`.
            let mut inverse = table_inverse;
            let mut going_mod = INV_TABLE_MOD_SQUARED;
            loop {
                // y = y * (2 - xy) mod n
                //
                // Note, that we use wrapping operations here intentionally – the original formula
                // uses e.g., subtraction `mod n`. It is entirely fine to do them `mod
                // usize::MAX` instead, because we take the result `mod n` at the end
                // anyway.
                inverse = inverse.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(inverse)));
                if going_mod >= m {
                    return inverse & (m - 1);
                }
                going_mod = going_mod.wrapping_mul(going_mod);
            }
        }
    }

    let stride = mem::size_of::<T>();
    let a_minus_one = a.wrapping_sub(1);
    let pmoda = p as usize & a_minus_one;

    if pmoda == 0 {
        // Already aligned. Yay!
        return 0;
    }

    if stride <= 1 {
        return if stride == 0 {
            // If the pointer is not aligned, and the element is zero-sized, then no amount of
            // elements will ever align the pointer.
            !0
        } else {
            a.wrapping_sub(pmoda)
        };
    }

    let smoda = stride & a_minus_one;
    // SAFETY: a is power-of-two so cannot be 0. stride = 0 is handled above.
    let gcdpow = unsafe { intrinsics::cttz_nonzero(stride).min(intrinsics::cttz_nonzero(a)) };
    let gcd = 1usize << gcdpow;

    if p as usize & (gcd.wrapping_sub(1)) == 0 {
        // This branch solves for the following linear congruence equation:
        //
        // ` p + so = 0 mod a `
        //
        // `p` here is the pointer value, `s` - stride of `T`, `o` offset in `T`s, and `a` - the
        // requested alignment.
        //
        // With `g = gcd(a, s)`, and the above asserting that `p` is also divisible by `g`, we can
        // denote `a' = a/g`, `s' = s/g`, `p' = p/g`, then this becomes equivalent to:
        //
        // ` p' + s'o = 0 mod a' `
        // ` o = (a' - (p' mod a')) * (s'^-1 mod a') `
        //
        // The first term is "the relative alignment of `p` to `a`" (divided by the `g`), the second
        // term is "how does incrementing `p` by `s` bytes change the relative alignment of `p`" (again
        // divided by `g`).
        // Division by `g` is necessary to make the inverse well formed if `a` and `s` are not
        // co-prime.
        //
        // Furthermore, the result produced by this solution is not "minimal", so it is necessary
        // to take the result `o mod lcm(s, a)`. We can replace `lcm(s, a)` with just a `a'`.
        let a2 = a >> gcdpow;
        let a2minus1 = a2.wrapping_sub(1);
        let s2 = smoda >> gcdpow;
        let minusp2 = a2.wrapping_sub(pmoda >> gcdpow);
        return (minusp2.wrapping_mul(mod_inv(s2, a2))) & a2minus1;
    }

    // Cannot be aligned at all.
    usize::MAX
}
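The mod_inv helper is the interesting part of the above: it seeds a 4-bit inverse from a lookup table and then lifts it with the Newton-style step y' = y * (2 - x*y), which doubles the number of correct low bits per iteration. A standalone sketch of the same idea (restructured by me into a plain while loop, so treat it as illustrative rather than the exact library code):

```rust
// Multiplicative inverse of odd `x` modulo a power-of-two `m`, seeded from a
// 4-bit table and lifted with y' = y * (2 - x*y), as in the snippet above.
fn mod_inv(x: usize, m: usize) -> usize {
    debug_assert!(m.is_power_of_two() && x % 2 == 1);
    // Inverses of 1, 3, 5, ..., 15 modulo 16.
    const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
    let mut inv = INV_TABLE_MOD_16[(x & 15) >> 1] as usize;
    // `inv` is currently correct modulo `good_mod`; each lift squares `good_mod`.
    let mut good_mod = 16usize;
    while good_mod < m {
        inv = inv.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(inv)));
        good_mod = good_mod.wrapping_mul(good_mod);
    }
    inv & (m - 1)
}

fn main() {
    // 3 * 171 = 513 = 2 * 256 + 1, so 171 is the inverse of 3 modulo 256.
    assert_eq!(mod_inv(3, 256), 171);
    // Spot-check the defining property for a larger modulus.
    assert_eq!(mod_inv(5, 1024).wrapping_mul(5) & 1023, 1);
    println!("ok");
}
```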
(Note that the two functions aren't strictly equivalent: the manual version misbehaves if p + USIZE - 1 wraps around the address space but the aligned value wouldn't.)
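For concreteness, here is what the mask formula computes when the intermediate addition wraps (using wrapping_add so the sketch doesn't simply panic in a debug build; the specific address is hypothetical, chosen to sit near the top of the address space):

```rust
fn main() {
    const ALIGN: usize = 8;
    // An address within ALIGN - 1 bytes of the top of the address space.
    let addr: usize = usize::MAX - 2;
    // `addr + ALIGN - 1` overflows, so the wrapping equivalent shows what the
    // mask-based formula actually produces in that regime.
    let rounded = addr.wrapping_add(ALIGN - 1) & !(ALIGN - 1);
    // The addition wrapped, so the "aligned" result is 0 rather than an
    // address at or above `addr`.
    assert_eq!(rounded, 0);
    println!("ok");
}
```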
Anyway IIUC align_offset is really considered the way forward for all pointer aligning, as miri will throw your code straight into the trash if it catches it dereferencing a pointer that you manually aligned... (I have Opinions on this but I'll spare you from a rant).
So, for that and a lot of other reasons, I think we'd like the codegen for align_offset to look a lot closer to what I provided at opt-level=3, even if it means special-casing size_of::<T>() == 1... (I mean, I'd also love for it not to generate the whole code mountain in debug builds, but one thing at a time, I guess.)
Anyway, the function's documentation comment tells me that @nagisa has taken up the sacred burden of "keeper of align_offset's secrets"... I have questions: is this fixable? And if not, is there an interface we can provide that lets us produce good code here? Am I missing something?
P.S. Godbolt link: https://godbolt.org/z/388Enf