Conversation

@folkertdev
Contributor

tracking issue: #146941
acp: rust-lang/libs-team#638

Note that we don't expose prefetch_write_instruction; that one doesn't really make sense in practice.

The implementation is straightforward; the docs can probably use some tweaks. The instruction version in particular is a little awkward to document.

r? @Amanieu

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 23, 2025
@rust-log-analyzer

This comment has been minimized.

/// Passing a dangling or invalid pointer is permitted: the memory will not
/// actually be dereferenced, and no faults are raised.
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_instruction<T>(ptr: *const T, locality: Locality) {
Member

shouldn't this be ptr: unsafe fn() or something since some platforms have different data and instruction pointer sizes?

Member

On some platforms a function pointer doesn't point directly to the instruction bytes, but rather to a function descriptor, which consists of a pointer to the first instruction and some value that needs to be loaded into a register. On these platforms using unsafe fn() would be incorrect. Itanium is an example, but I know there are more architectures that do this.
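
A small illustration of the point about descriptor ABIs (hedged sketch; which layout a target uses is ABI-dependent, and the descriptor contents named in the comment follow the PowerPC64 ELFv1 convention):

```rust
// On most targets, casting a function pointer yields the address of its first
// instruction, so prefetching that address would touch code. On descriptor
// ABIs (Itanium, PowerPC64 ELFv1, AIX), the pointer instead refers to a
// function descriptor (code address plus TOC/gp value) stored in *data*
// memory, so prefetching it would not fetch any instructions.
fn target() -> u32 {
    42
}

fn main() {
    let addr = target as usize;
    // All we can portably assert is that the address is non-null; whether it
    // points at code or at a descriptor depends on the ABI.
    assert_ne!(addr, 0);
    println!("function (or descriptor) address: {:#x}", addr);
    println!("call result: {}", target());
}
```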

Member

ok, but that doesn't mean *const T is correct.

Contributor Author

ultimately all you need is an address, so *const T seemed the simplest way of achieving that.

Member

programmerjake commented Sep 24, 2025

but *const T may be too small: e.g. on 16-bit x86 in the medium memory model, a data pointer is 16 bits but an instruction pointer is 32 bits.

there are some AVR CPUs (not currently supported by Rust?) which need more than 16 bits for instruction addresses but not for data addresses, so they might have the same issue: https://en.wikipedia.org/wiki/Atmel_AVR_instruction_set#:~:text=Rare)%20models%20with,zero%2Dextended%20Z.)

Contributor Author

folkertdev commented Sep 25, 2025

Does that ACP actually use the LLVM address spaces? It's not really clear from the design. Also it looks like it was never actually nominated for T-lang?

Member

LLVM address space usage is dictated by the target. That ACP doesn't use non-default address spaces because, for all existing targets, a NonNull<Code> is sufficient for function addresses (AVR just uses 16-bit pointers for both code and data, and AFAIK LLVM doesn't currently support >16-bit pointers). However, the plan is to add a type BikeshedFnAddr and switch to using that whenever we add a target where NonNull<Code> is insufficient.

Member

AVR does use ptr addrspace(1) for function pointers: https://rust.godbolt.org/z/3hGPfKvfG

Contributor Author

@programmerjake do you see that ACP moving forward? Maybe I should remove the instruction prefetching for now here and add it when there is progress?

Member

if you need just Code, you can probably get away with adding just that extern type for now, under the tracking issue I just created for that ACP (#148768), and let whoever implements the rest of that ACP use Code. You can add them all now and wait on that tracking issue for stabilization. If that takes too long, this feature can be partially stabilized, leaving the code-prefetch stabilization for later.

@Amanieu
Member

Amanieu commented Sep 24, 2025

After thinking about this for a bit, I believe NonTemporal should be split out into a separate Retention enum, since it is orthogonal to the locality at which to prefetch the data. Specifically:

  • "locality" refers to how soon we are going to need this data. This corresponds to the cache level into which we are prefetching.
  • "retention" refers to how long the data should be kept in cache. A non-temporal access includes a hint to the cache that the line should be evicted before any other cache lines. Non-temporal hints indicate memory that is accessed only once, after which it should not be kept in the cache any more.

So I would rework the API to something like this:

#[non_exhaustive]
pub enum Locality {
    L1,
    L2,
    L3,
}

#[non_exhaustive]
pub enum Retention {
    Normal,
    NonTemporal,
}

pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality, retention: Retention);

Even though not all of these map to the underlying LLVM intrinsic today, they may do so in the future.
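
As a rough sketch of how the two orthogonal parameters would compose at call sites (the prefetch_read_data here is a stand-in stub that records the hint rather than emitting a prefetch, since the real function is unstable; the enums mirror the proposal above):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Locality { L1, L2, L3 }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Retention { Normal, NonTemporal }

// Stand-in for the proposed std::hint function: returns the hint it would
// have issued instead of emitting a prefetch instruction.
fn prefetch_read_data<T>(
    _ptr: *const T,
    locality: Locality,
    retention: Retention,
) -> (Locality, Retention) {
    (locality, retention)
}

fn main() {
    let buf = [0u8; 4096];
    // Hot data we will reuse: pull it close and keep it around.
    let a = prefetch_read_data(buf.as_ptr(), Locality::L1, Retention::Normal);
    // Streaming data we read once: pull it close, but mark it for early eviction.
    let b = prefetch_read_data(buf.as_ptr(), Locality::L1, Retention::NonTemporal);
    assert_eq!(a, (Locality::L1, Retention::Normal));
    assert_eq!(b, (Locality::L1, Retention::NonTemporal));
    println!("ok");
}
```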

/// Passing a dangling or invalid pointer is permitted: the memory will not
/// actually be dereferenced, and no faults are raised.
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_write_data<T>(ptr: *mut T, locality: Locality) {
Member

bjorn3 commented Sep 24, 2025

Maybe make Locality a const generic?

Contributor Author

Enums cannot be const-generic parameters at the moment (on stable, anyway). We model the API here after atomic operations where the ordering parameter behaves similarly.

@folkertdev
Contributor Author

Even though not all of these map to the underlying LLVM intrinsic today, they may do so in the future.

Maybe my understanding of NonTemporal is wrong, but I believe it means that the cache hierarchy should be skipped entirely. So then combining that with a Locality is completely meaningless, right?

It can be implemented (we'd just ignore weird/invalid combinations, I guess) but from an API perspective it seems weird.

@Amanieu
Member

Amanieu commented Sep 24, 2025

No, non-temporal is a hint that the data is likely only going to be accessed once. Essentially if you have data that you're only reading once then you'll want to prefetch it all the way to L1, but then mark that cache line as the first that should be evicted if needed since you know it won't be needed in the future. See https://stackoverflow.com/questions/53270421/difference-between-prefetch-and-prefetchnta-instructions for details of how this works on x86 CPUs.
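
The read-once pattern described here can be sketched as follows (the prefetch_read_nontemporal stub is a hypothetical stand-in for the unstable hint; on x86 one would issue prefetchnta, e.g. via core::arch::x86_64::_mm_prefetch with _MM_HINT_NTA, and the prefetch distance is a made-up number that would need per-CPU tuning):

```rust
// Sum a large buffer that is read exactly once. We prefetch a fixed distance
// ahead with a non-temporal hint: the line arrives in L1 in time for the
// read, but is marked for early eviction since we won't touch it again.
const PREFETCH_DISTANCE: usize = 256; // bytes ahead; tune per CPU

#[inline(always)]
fn prefetch_read_nontemporal<T>(_ptr: *const T) {
    // no-op stand-in; real code would issue the arch's NT read-prefetch here
}

fn sum_once(data: &[u8]) -> u64 {
    let mut total = 0u64;
    for i in 0..data.len() {
        if i + PREFETCH_DISTANCE < data.len() {
            prefetch_read_nontemporal(&data[i + PREFETCH_DISTANCE]);
        }
        total += data[i] as u64;
    }
    total
}

fn main() {
    let data = vec![1u8; 100_000];
    assert_eq!(sum_once(&data), 100_000);
    println!("sum ok");
}
```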

@folkertdev
Contributor Author

So that means something like this?

#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality, retention: Retention) {
    match retention {
        Retention::NonTemporal => {
            return intrinsics::prefetch_read_data::<T, { Retention::NonTemporal as i32 }>(ptr);
        }
        Retention::Normal => { /* fall through */ }
    }

    match locality {
        Locality::L3 => intrinsics::prefetch_read_data::<T, { Locality::L3 as i32 }>(ptr),
        Locality::L2 => intrinsics::prefetch_read_data::<T, { Locality::L2 as i32 }>(ptr),
        Locality::L1 => intrinsics::prefetch_read_data::<T, { Locality::L1 as i32 }>(ptr),
    }
}

This is really tricky to document: users basically have to look at the implementation to see what happens exactly. Also, every call getting the additional retention parameter is kind of unfortunate.

@Amanieu
Member

Amanieu commented Sep 26, 2025

My main concern is that the cache level to prefetch into should not be mixed with the retention hint. It should be a separate parameter or a separate function altogether.

@folkertdev
Contributor Author

In that case I think, given current hardware support at least, that a separate function would be better:

#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality) {
    match locality {
        Locality::L3 => intrinsics::prefetch_read_data::<T, { Locality::L3 as i32 }>(ptr),
        Locality::L2 => intrinsics::prefetch_read_data::<T, { Locality::L2 as i32 }>(ptr),
        Locality::L1 => intrinsics::prefetch_read_data::<T, { Locality::L1 as i32 }>(ptr),
    }
}

#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data_nontemporal<T>(ptr: *const T) {
    return intrinsics::prefetch_read_data::<T, { Retention::NonTemporal as i32 }>(ptr);
}

that does potentially close some doors for weird future hardware designs, but as a user I think separate functions are simpler.

@rustbot

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@Amanieu
Member

Amanieu commented Oct 5, 2025

The reason I argued for a separate argument is that it's possible LLVM will add support for specifying a cache level for non-temporal prefetches in the future. It also makes the API more symmetrical.

Alternatively, we could also decide to only expose prefetch hints with no extra arguments and point people to platform-specific hints in std::arch for more detailed hints.

@folkertdev
Contributor Author

How heavily should we weigh potential future LLVM additions? Apparently no current architecture provides the fine-grained control of picking the cache level for non-temporal reads. So we're trading additional complexity for everyone versus a hypothetical future CPU capability.

Also, we've gotten this far without prefetching at all. I suspect that in practice the vast majority of uses will just be "load into L1", perhaps with some "load into L2". The heavily specialized stuff can probably just be left to stdarch.

The current implementation of this PR is to have

pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality);
pub const fn prefetch_read_data_nontemporal<T>(ptr: *const T);

I've left out the non-temporal variants for write and read_instruction for now; from what I can tell, those don't actually seem that useful and can probably be left to stdarch unless someone has an actual use case.

@programmerjake
Member

programmerjake commented Oct 5, 2025

for streaming writes where you're unlikely to access the written data again in the near future, prefetch_write_data_nontemporal seems useful; at least it doesn't have crazy semantics like non-temporal stores do.

@programmerjake
Member

also, for naming, imo we should leave out _data, since that's likely way more common than _instruction and so makes a good default.

@Amanieu
Member

Amanieu commented Oct 5, 2025

How heavily should we weigh potential future LLVM additions? Apparently no current architecture provides the fine-grained control of picking the cache level for non-temporal reads. So we're trading additional complexity for everyone versus a hypothetical future CPU capability.

AArch64 has this capability, see https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/PRFM--immediate---Prefetch-Memory--immediate--

@folkertdev
Contributor Author

for streaming writes where you're unlikely to access the written data again in the near future, prefetch_write_data_nontemporal seems useful, at least it doesn't have crazy semantics like nontemporal stores do.

Can't you just do the non-temporal store? What benefit does a prefetch provide here?

also, for naming, imo we should leave out _data since that's likely waay more common than _instruction so makes a good default.

Yeah, I had been thinking that too; I'll change that.

AArch64 has this capability

You can encode it in the instruction; I haven't been able to figure out whether it actually does anything in practice.

We can add the locality argument to the non-temporal function(s) though, I'd be OK with that given that non-temporal is even more niche than standard prefetches.

@programmerjake
Member

programmerjake commented Oct 5, 2025

Can't you just do the non-temporal store?

non-temporal stores break the memory model on x86: llvm/llvm-project#64521 and #114582

@folkertdev
Contributor Author

And then the idea is that a non-temporal prefetch write hint plus a standard write will in effect create a well-behaved non-temporal store?

@programmerjake
Member

And then the idea is that a non-temporal prefetch write hint plus a standard write will in effect create a well-behaved non-temporal store?

maybe, depending on the arch? it at least won't break the memory model.
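
A sketch of the pattern under discussion (the prefetch_write_nontemporal stub is a hypothetical stand-in; as the comment above notes, whether this actually behaves like a non-temporal store is arch-dependent, and the 64-byte line size is an assumption):

```rust
#[inline(always)]
fn prefetch_write_nontemporal<T>(_ptr: *mut T) {
    // no-op stand-in; real code would issue the arch's NT write-prefetch here
}

// Fill a buffer that won't be read back soon: hint each cache line for early
// eviction, then write through the normal, memory-model-respecting path.
fn fill_streaming(out: &mut [u8], value: u8) {
    const LINE: usize = 64; // assumed cache-line size
    for chunk in out.chunks_mut(LINE) {
        prefetch_write_nontemporal(chunk.as_mut_ptr());
        chunk.fill(value); // ordinary stores: no memory-model surprises
    }
}

fn main() {
    let mut buf = vec![0u8; 1000];
    fill_streaming(&mut buf, 7);
    assert!(buf.iter().all(|&b| b == 7));
    println!("fill ok");
}
```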

@Amanieu
Member

Amanieu commented Nov 9, 2025

I'm not too concerned about the pointer type for prefetch_read_instruction: we can document this as a pointer into the data address space and simply treat it as a nop if this is disjoint from the code address space. In any case, we can defer stabilization of this until the issues of code vs data pointer size is resolved by lang.

I'm much more concerned about the locality and retention arguments and increasingly think we should not accept those parameters on the generic intrinsic given that proper use of those requires a lot of hardware specific knowledge. I would prefer to point people to the arch-specific intrinsics in stdarch if they need this level of precision.

@folkertdev
Contributor Author

I'm much more concerned about the locality and retention arguments and increasingly think we should not accept those parameters on the generic intrinsic given that proper use of those requires a lot of hardware specific knowledge. I would prefer to point people to the arch-specific intrinsics in stdarch if they need this level of precision.

I'd agree for retention, but I think locality can be exposed. At least in the zstd case, both (what we here call) L1 and L2 are used, and so I'd like a cross-platform function that can do that. Sure you do need some low-level knowledge to use these options effectively, but I think the mental model is reasonably clear.

@Amanieu Amanieu added the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label Dec 9, 2025
Comment on lines 840 to 848
L3 = 1,
/// Data is expected to be reused in the near future.
///
/// Typically prefetches into L2 cache.
L2 = 2,
/// Data is expected to be reused very soon.
///
/// Typically prefetches into L1 cache.
L1 = 3,
Member

Could you remove the integer values from the enum? These are internal implementation details and not something that we want to publicly expose.

@Amanieu Amanieu removed the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label Dec 9, 2025
@the8472
Member

the8472 commented Dec 9, 2025

At least in the zstd case, both (what we here call) L1 and L2 are used, and so I'd like a cross-platform function that can do that

Are CPUs consistent enough across vendors and generations that using the same hints in the same places is useful on all of them? Naively I'd expect that with different memory latencies, pipeline depths, different cache-line sizes and prefetcher differences they'd need bespoke optimizations. Has anyone benchmarked that?
I peeked at the git history from zstd and the one PR I looked at (facebook/zstd#2749) did several things and has benchmarks across multiple CPUs, but it didn't seem to benchmark the prefetching in isolation.

@rustbot
Collaborator

rustbot commented Dec 9, 2025

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@Amanieu
Member

Amanieu commented Dec 9, 2025

@bors r+

@bors
Collaborator

bors commented Dec 9, 2025

📌 Commit b9e3e41 has been approved by Amanieu

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 9, 2025
@bors
Collaborator

bors commented Dec 10, 2025

⌛ Testing commit b9e3e41 with merge 2e667b0...

@bors
Collaborator

bors commented Dec 10, 2025

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing 2e667b0 to main...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 10, 2025
@bors bors merged commit 2e667b0 into rust-lang:main Dec 10, 2025
12 checks passed
@rustbot rustbot added this to the 1.94.0 milestone Dec 10, 2025
@github-actions
Contributor

What is this? This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.

Comparing 5f1173b (parent) -> 2e667b0 (this PR)

Test differences


2 doctest diffs were found. These are ignored, as they are noisy.

Test dashboard

Run

cargo run --manifest-path src/ci/citool/Cargo.toml -- \
    test-dashboard 2e667b0c6491678642a83e3aff86626397360af5 --output-dir test-dashboard

And then open test-dashboard/index.html in your browser to see an overview of all executed tests.

Job duration changes

  1. dist-aarch64-apple: 6723.7s -> 7889.0s (+17.3%)
  2. x86_64-gnu-llvm-20-3: 5726.8s -> 6693.8s (+16.9%)
  3. x86_64-gnu-gcc: 2966.0s -> 3382.5s (+14.0%)
  4. x86_64-rust-for-linux: 2757.2s -> 3119.7s (+13.1%)
  5. x86_64-gnu-llvm-20: 2433.7s -> 2750.6s (+13.0%)
  6. x86_64-gnu-tools: 3217.4s -> 3620.1s (+12.5%)
  7. dist-i586-gnu-i586-i686-musl: 5611.9s -> 4948.1s (-11.8%)
  8. aarch64-gnu-llvm-20-2: 2194.4s -> 2442.7s (+11.3%)
  9. i686-gnu-2: 5511.1s -> 6133.1s (+11.3%)
  10. aarch64-gnu-debug: 3832.5s -> 4264.3s (+11.3%)
How to interpret the job duration changes?

Job durations can vary a lot, based on the actual runner instance
that executed the job, system noise, invalidated caches, etc. The table above is provided
mostly for t-infra members, for simpler debugging of potential CI slow-downs.

@rust-timer
Collaborator

Finished benchmarking commit (2e667b0): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-0.5% [-0.5%, -0.5%] 1
All ❌✅ (primary) - - 0

Max RSS (memory usage)

Results (primary 0.6%, secondary -2.2%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
0.6% [0.6%, 0.6%] 1
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-2.2% [-2.2%, -2.2%] 1
All ❌✅ (primary) 0.6% [0.6%, 0.6%] 1

Cycles

Results (primary 2.5%, secondary -0.2%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
2.5% [2.5%, 2.5%] 2
Regressions ❌
(secondary)
2.0% [2.0%, 2.0%] 1
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-2.4% [-2.4%, -2.4%] 1
All ❌✅ (primary) 2.5% [2.5%, 2.5%] 2

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 471.137s -> 469.7s (-0.31%)
Artifact size: 389.02 MiB -> 388.99 MiB (-0.01%)

Labels

merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.

9 participants