add core::hint::prefetch_{read, write}_{data, instruction}
#146948
Conversation
```rust
/// Passing a dangling or invalid pointer is permitted: the memory will not
/// actually be dereferenced, and no faults are raised.
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_instruction<T>(ptr: *const T, locality: Locality) {
```
Shouldn't this be `ptr: unsafe fn()` or something, since some platforms have different data and instruction pointer sizes?
On some platforms a function pointer doesn't point directly to the instruction bytes, but rather to a function descriptor, which consists of a pointer to the first instruction and some value that needs to be loaded into a register. On these platforms using unsafe fn() would be incorrect. Itanium is an example, but I know there are more architectures that do this.
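For illustration, a hedged sketch of what such a function descriptor conceptually contains (the struct and field names here are made up; the exact layout is ABI-specific and not part of this PR):

```rust
// Conceptual only, not any real ABI's definition: on descriptor-based ABIs
// (e.g. Itanium, 64-bit PowerPC ELFv1), an `unsafe fn()` value points at a
// structure like this rather than at the instructions themselves.
#[repr(C)]
struct FnDescriptor {
    entry: *const u8, // address of the first instruction
    gp: usize,        // extra value (e.g. global pointer / TOC) loaded into a register at call time
}
```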
OK, but that doesn't mean `*const T` is correct.
Ultimately all you need is an address, so `*const T` seemed the simplest way of achieving that.
But `*const T` may be too small, e.g. on 16-bit x86 in the medium model a data pointer is 16 bits but an instruction pointer is 32 bits.
There are also some AVR CPUs (not currently supported by Rust?) which need more than 16 bits for instruction addresses but not for data, so they might have the same issue: https://en.wikipedia.org/wiki/Atmel_AVR_instruction_set#:~:text=Rare)%20models%20with,zero%2Dextended%20Z.)
Does that ACP actually use the LLVM address spaces? It's not really clear from the design. Also it looks like it was never actually nominated for T-lang?
LLVM address space usage is dictated by the target. That ACP doesn't use non-default address spaces because, for all existing targets, a `NonNull<Code>` is sufficient for function addresses (AVR just uses 16-bit pointers for both code and data, and AFAIK LLVM doesn't currently support >16-bit pointers). However, the plan is to add a type `BikeshedFnAddr` and switch to using that whenever we add a target where that's insufficient.
AVR does use ptr addrspace(1) for function pointers: https://rust.godbolt.org/z/3hGPfKvfG
@programmerjake do you see that ACP moving forward? Maybe I should remove the instruction prefetching for now here and add it when there is progress?
If you need just `Code`, you can probably get away with adding just that extern type for now under the tracking issue I just created for that ACP (#148768), and let whoever implements the rest of that ACP use `Code`. You can add them all now and wait on that tracking issue for stabilization. If it takes too long, this feature can be partially stabilized, leaving the code-prefetch stabilization for later.
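For illustration only, a minimal sketch of what such an opaque code type could look like (the name `Code` follows the discussion above; the nightly `extern_types` feature and the exact shape are assumptions, not the ACP's final design):

```rust
#![feature(extern_types)] // nightly-only; assumed for this sketch

extern "C" {
    /// Opaque, unsized type standing in for "bytes of machine code".
    /// Only pointers to it (e.g. `*const Code`) are ever formed; it can
    /// never be dereferenced directly or moved by value.
    type Code;
}

// The instruction-prefetch hint could then take `*const Code` instead of a
// generic `*const T`, matching the direction of the ACP discussed above.
pub fn prefetch_read_instruction(_ptr: *const Code) {
    // body omitted: would forward to a compiler intrinsic
}
```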
After thinking about this for a bit, I would rework the API to something like this:

```rust
#[non_exhaustive]
pub enum Locality {
    L1,
    L2,
    L3,
}

#[non_exhaustive]
pub enum Retention {
    Normal,
    NonTemporal,
}

pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality, retention: Retention);
```

Even though not all of these map to the underlying LLVM intrinsic today, they may do so in the future.
library/core/src/hint.rs
Outdated
```rust
/// Passing a dangling or invalid pointer is permitted: the memory will not
/// actually be dereferenced, and no faults are raised.
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_write_data<T>(ptr: *mut T, locality: Locality) {
```
Maybe make Locality a const generic?
Enums cannot be const-generic parameters at the moment (on stable, anyway). We model the API here after atomic operations, where the ordering parameter behaves similarly.

Maybe my understanding of non-temporal is wrong? It can be implemented (we'd just ignore weird/invalid combinations, I guess), but from an API perspective it seems weird.
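As a side note, a minimal illustration of the atomics analogy mentioned above (this is the existing standard-library API, nothing from this PR): the ordering is an ordinary enum argument that the implementation dispatches on internally, which is the same shape proposed for `Locality`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The memory ordering is a plain runtime enum parameter, just like the
// proposed `Locality` argument; the implementation matches on it and calls
// the appropriate intrinsic.
fn read_counter(counter: &AtomicUsize, relaxed: bool) -> usize {
    let ordering = if relaxed { Ordering::Relaxed } else { Ordering::Acquire };
    counter.load(ordering)
}

fn main() {
    let c = AtomicUsize::new(42);
    assert_eq!(read_counter(&c, true), 42);
}
```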
No, non-temporal is a hint that the data is likely only going to be accessed once. Essentially, if you have data that you're only reading once, then you'll want to prefetch it all the way to L1, but then mark that cache line as the first that should be evicted if needed, since you know it won't be needed in the future. See https://stackoverflow.com/questions/53270421/difference-between-prefetch-and-prefetchnta-instructions for details of how this works on x86 CPUs.
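For concreteness, a hedged sketch of the two x86 prefetch flavours described in that answer, using the existing arch-specific intrinsics in `core::arch` rather than anything from this PR (the exact const-generic form of `_mm_prefetch` may differ slightly between toolchain versions):

```rust
#[cfg(target_arch = "x86_64")]
fn prefetch_both_ways(p: *const u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_NTA, _MM_HINT_T0};
    unsafe {
        // prefetcht0: pull the cache line into all levels of the hierarchy,
        // for data that will be reused.
        _mm_prefetch::<_MM_HINT_T0>(p.cast::<i8>());
        // prefetchnta: pull the line close to the core but mark it for early
        // eviction, for data that will be read exactly once.
        _mm_prefetch::<_MM_HINT_NTA>(p.cast::<i8>());
    }
}
```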
So that means something like this?

```rust
#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality, retention: Retention) {
    match retention {
        Retention::NonTemporal => {
            return intrinsics::prefetch_read_data::<T, { Retention::NonTemporal as i32 }>(ptr);
        }
        Retention::Normal => { /* fall through */ }
    }
    match locality {
        Locality::L3 => intrinsics::prefetch_read_data::<T, { Locality::L3 as i32 }>(ptr),
        Locality::L2 => intrinsics::prefetch_read_data::<T, { Locality::L2 as i32 }>(ptr),
        Locality::L1 => intrinsics::prefetch_read_data::<T, { Locality::L1 as i32 }>(ptr),
    }
}
```

This is really tricky to document: users basically have to look at the implementation to see what happens exactly. Also, every call getting the additional retention parameter is kind of unfortunate.
My main concern is that the cache level to prefetch into should not be mixed with the retention hint. It should be a separate parameter or a separate function altogether.
In that case I think, given current hardware support at least, that a separate function would be better:

```rust
#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality) {
    match locality {
        Locality::L3 => intrinsics::prefetch_read_data::<T, { Locality::L3 as i32 }>(ptr),
        Locality::L2 => intrinsics::prefetch_read_data::<T, { Locality::L2 as i32 }>(ptr),
        Locality::L1 => intrinsics::prefetch_read_data::<T, { Locality::L1 as i32 }>(ptr),
    }
}

#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data_nontemporal<T>(ptr: *const T) {
    intrinsics::prefetch_read_data::<T, { Retention::NonTemporal as i32 }>(ptr)
}
```

That does potentially close some doors for weird future hardware designs, but as a user I think separate functions are simpler.
The reason I argued for a separate argument is that it's possible LLVM will add support for specifying a cache level for non-temporal prefetches in the future. It also makes the API more symmetrical. Alternatively, we could also decide to only expose prefetch hints with no extra arguments and point people to the platform-specific hints in `core::arch`.
How heavily should we weigh potential future LLVM additions? Apparently no current architecture provides the fine-grained control of picking the cache level for non-temporal reads. So we're trading additional complexity for everyone versus a hypothetical future CPU capability. Also, we've gotten this far without prefetching at all. I suspect that in practice the vast majority of uses will just be "load into L1", perhaps with some "load into L2". The heavily specialized stuff can probably just be left to `core::arch`.

The current implementation of this PR is to have:

```rust
pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality);
pub const fn prefetch_read_data_nontemporal<T>(ptr: *const T);
```

I've left out the non-temporal variants for now.
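As a usage illustration (a hedged sketch only: it uses the unstable names currently in this PR, and `PREFETCH_DISTANCE` is an arbitrary placeholder for the example, not a recommendation), a simple lookahead loop might look like:

```rust
#![feature(hint_prefetch)] // unstable feature from this PR

use core::hint::{prefetch_read_data, Locality};

// Prefetch distance is workload- and hardware-dependent; 16 is a placeholder.
const PREFETCH_DISTANCE: usize = 16;

// Sum a large slice while prefetching a fixed distance ahead into L1.
fn sum(data: &[u64]) -> u64 {
    let mut total: u64 = 0;
    for (i, &x) in data.iter().enumerate() {
        if let Some(ahead) = data.get(i + PREFETCH_DISTANCE) {
            // Dangling/invalid pointers are permitted by the API, but here the
            // pointer is always in-bounds anyway.
            prefetch_read_data(ahead as *const u64, Locality::L1);
        }
        total = total.wrapping_add(x);
    }
    total
}

fn main() {
    let v: Vec<u64> = (0..1_000).collect();
    assert_eq!(sum(&v), 499_500);
}
```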
For streaming writes where you're unlikely to access the written data again in the near future, a non-temporal write prefetch would still be useful.
Also, for naming, IMO we should leave out …
AArch64 has this capability, see https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/PRFM--immediate---Prefetch-Memory--immediate--
Can't you just do the non-temporal store? what benefit does a prefetch provide here?
Yeah I had been thinking that too, I'll change that.
You can encode it in the instruction; I haven't been able to figure out whether it actually does anything in practice. We can add the locality argument to the non-temporal function(s) though; I'd be OK with that, given that non-temporal is even more niche than standard prefetches.
Non-temporal stores break the memory model on x86: llvm/llvm-project#64521 and #114582
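For context, a hedged sketch of what an explicit non-temporal store looks like through the arch-specific intrinsics, and the fence it needs; the linked issues discuss how omitting the fence breaks the ordering guarantees the Rust memory model expects:

```rust
#[cfg(target_arch = "x86_64")]
fn stream_fill(dst: &mut [i32], value: i32) {
    use core::arch::x86_64::{_mm_sfence, _mm_stream_si32};
    for slot in dst.iter_mut() {
        // movnti: a non-temporal store that bypasses the caches.
        unsafe { _mm_stream_si32(slot as *mut i32, value) };
    }
    // Non-temporal stores are weakly ordered; without this sfence, later
    // release stores do not publish them in the expected order.
    unsafe { _mm_sfence() };
}
```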
And then the idea is that a non-temporal prefetch write hint plus a standard write will in effect create a well-behaved non-temporal store?
Maybe, depending on the arch? It at least won't break the memory model.
I'm not too concerned about the pointer type for `prefetch_read_instruction`. I'm much more concerned about the locality and retention arguments, and increasingly think we should not accept those parameters on the generic intrinsic, given that proper use of them requires a lot of hardware-specific knowledge. I would prefer to point people to the arch-specific intrinsics in stdarch if they need this level of precision.
I'd agree for retention, but I think locality can be exposed. At least in the zstd case, both (what we here call) `L1` and `L2` prefetches are used.
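To make that concrete, a hedged sketch of the kind of two-level pattern being referred to (not zstd's actual code; it uses this PR's unstable API, and the `NEAR`/`FAR` distances are arbitrary placeholders): prefetch the element needed soon into L1 and the one further out into L2.

```rust
#![feature(hint_prefetch)] // unstable feature from this PR

use core::hint::{prefetch_read_data, Locality};

// Illustrative distances only; real values are workload-dependent.
const NEAR: usize = 4;
const FAR: usize = 16;

// Walk a table through an index list, issuing a near prefetch into L1 and a
// far prefetch into L2 before each access.
fn gather(table: &[u32], indices: &[usize]) -> u64 {
    let mut acc = 0u64;
    for (i, &idx) in indices.iter().enumerate() {
        if let Some(&near_idx) = indices.get(i + NEAR) {
            if let Some(slot) = table.get(near_idx) {
                prefetch_read_data(slot as *const u32, Locality::L1);
            }
        }
        if let Some(&far_idx) = indices.get(i + FAR) {
            if let Some(slot) = table.get(far_idx) {
                prefetch_read_data(slot as *const u32, Locality::L2);
            }
        }
        acc += u64::from(table[idx]);
    }
    acc
}

fn main() {
    let table: Vec<u32> = (0..100).collect();
    let indices: Vec<usize> = (0..100).rev().collect();
    assert_eq!(gather(&table, &indices), 4950);
}
```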
library/core/src/hint.rs
Outdated
```rust
    L3 = 1,
    /// Data is expected to be reused in the near future.
    ///
    /// Typically prefetches into L2 cache.
    L2 = 2,
    /// Data is expected to be reused very soon.
    ///
    /// Typically prefetches into L1 cache.
    L1 = 3,
```
Could you remove the integer values from the enum? These are internal implementation details and not something that we want to publicly expose.
Are CPUs consistent enough across vendors and generations that using the same hints in the same places is useful on all of them? Naively I'd expect that with different memory latencies, pipeline depths, different cache-line sizes, and prefetcher differences, they'd need bespoke optimizations. Has anyone benchmarked that?
Well, we don't expose `prefetch_write_instruction`; that one doesn't really make sense in practice.
@bors r+ |
☀️ Test successful - checks-actions |
What is this? This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR. Comparing 5f1173b (parent) -> 2e667b0 (this PR).

Test differences: 2 doctest diffs were found. These are ignored, as they are noisy.

Test dashboard: run

```
cargo run --manifest-path src/ci/citool/Cargo.toml -- \
    test-dashboard 2e667b0c6491678642a83e3aff86626397360af5 --output-dir test-dashboard
```

and then open the output in the `test-dashboard` directory.

Job duration changes: job durations can vary a lot, based on the actual runner instance.
Finished benchmarking commit (2e667b0): comparison URL.

Overall result: ✅ improvements - no action needed
@rustbot label: -perf-regression

Instruction count: our most reliable metric, used to determine the overall result above. However, even this metric can be noisy.

Max RSS (memory usage): results (primary 0.6%, secondary -2.2%). A less reliable metric; may be of interest, but not used to determine the overall result above.

Cycles: results (primary 2.5%, secondary -0.2%). A less reliable metric; may be of interest, but not used to determine the overall result above.

Binary size: this benchmark run did not return any relevant results for this metric.

Bootstrap: 471.137s -> 469.7s (-0.31%)
tracking issue: #146941
acp: rust-lang/libs-team#638
The implementation is straightforward; the docs can probably use some tweaks. Especially for the instruction version it's a little awkward.
r? @Amanieu