Only box the function once when creating threads. #31621

rphmeier · 2016-02-13T03:33:35Z

I felt like non-capturing thread functions really shouldn't require any heap allocations.

rust-highfive · 2016-02-13T03:33:48Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

alexcrichton · 2016-02-13T09:04:00Z

Personally I don't necessarily feel that this is worth the cost. The farther down we push the generics the more code we have to monomorphize for any vanilla call to thread::spawn, which makes "hello world" style programs compile that much longer. In the pasts I've seen thread::spawn take compile times from 10ms to 500ms, which is certainly quite noticeable!

Without performance numbers one way or another (and I would be quite skeptical that this could show up in any profile), I'd prefer to stick to concrete types instead of generics wherever possible.

rphmeier · 2016-02-13T14:58:32Z

I had anticipated that as a possible downside, and the cost of any allocation would definitely be dwarfed by spawning the thread. In that sense, this change is unsubstantiated. However, I personally like to have very fine-grained control over how memory is used in my programs, and it's always a little disheartening to look into std and see hidden allocations like this, even when they don't really mean much.

Another approach would be to violate parametricity in thread::spawn, with a different path for ZSTs. The only way to make that work would be to use "raw" in the ZST version to pass the vtable as the thread args, which feels very hacky.

arielb1 · 2016-02-13T15:03:07Z

Box::new(zst) does not actually allocate anyway.

rphmeier · 2016-02-13T15:18:18Z

@arielb1 Yes, but the old code boxes the box, which does allocate.

The core of this issue is that you need a pointer-width argument to pass to the system thread creation function. You can't get that from an *mut Trait, so the old code boxed it again to get an *mut Box<FnBox()>.

What I meant by using raw in the ZST path is to deconstruct the first Box into a TraitObject, pass only the pointer-width vtable, and rebuild it on the other side. This is definitely hacky, but would probably work.

arielb1 · 2016-02-13T17:42:44Z

Ah right. I would have new be a shim, as so:

    pub unsafe fn new<F: FnOnce()>(stack: usize, p: F) -> io::Result<Thread> {
        let p = box p;
        match Self::new_inner(stack, &*p as *const F as *const libc::c_void, thread_start::<F>) {
            Ok(o) => { mem::forget(p); Ok(o) }
            Err(e) => Err(e)
        }

        extern fn thread_start<F: FnOnce()>(main: *mut libc::c_void)
            -> *mut libc::c_void {
            unsafe {
                let main = Box::from_raw(main as *mut F);
                start_thread(main);
            }
            ptr::null_mut()
        }
    }

    unsafe fn new_inner<'a>(stack: usize, p: *const libc::c_void,
                            f: extern "C" fn(*mut libc::c_void) -> *mut libc::c_void)
                            -> io::Result<Thread> {
        // ..
    }

There's a potential use-after-free if new_inner panics after spawning the new thread, but the use-after-free is already present today: there's a

        assert_eq!(libc::pthread_attr_destroy(&mut attr), 0);

line - if it panics, we get a double-free of p. This is a place where #[noexcept] would be useful.

This way we also avoid creating a monomorphzied vtable.

rphmeier · 2016-02-13T18:21:05Z

Making new a shim is probably a good idea -- it'll reduce the volume of monomorphized code.

w.r.t. the destruction of the thread attributes, I really doubt that would ever fail if the pthreads implementation is compliant.

alexcrichton · 2016-02-14T07:33:45Z

In that sense, this change is unsubstantiated

Having concrete data showing that we pay for this change without showing how the tradeoff is worth it seems bad to me though?

rphmeier · 2016-02-14T17:27:17Z

@alexcrichton I'll try to drum up a benchmark to see how much this really hurts compile times. It's not as though we don't already do some monomorphization for thread::spawn, so I wouldn't expect there to be a huge increase.

rphmeier · 2016-02-15T04:22:43Z

I did a
time make check-stage1-std NO_REBUILD=1 NO_BENCH=1

for each of:

master (HEAD = 0c4d81f)
thread_box_once
thread_box_once_shim (a local branch)

which yielded the following times after two runs:

Branch	Run 1	Run 2
master	2m2.351s	1m59.618s
thread_box_once	2m11.934s	2m13.587s
thread_box_once_shim	2m5.279s	2m4.556s

The std tests use thread::spawn extensively in the thread, 'net', and sync modules, so I figure this was as good a test case as any. I couldn't figure out how to just measure the compile times without running (something like cargo test --no-run), so these numbers haphazardly represent the aggregate of compile-time regression and thread::spawn performance increase (which is probably ~0).

I consider this level of regression acceptable, especially considering the sheer volume of thread::spawn calls in the tests, 199 in libstd alone (git grep thread::spawn | awk '/libstd/ { print $0 }' | wc -l).

arielb1 · 2016-02-15T10:18:13Z

@rphmeier

I figure that you might want to manually inline start_thread (the stack overflow handler) into thread_start - this will avoid monomorphizing the <[closure] as FnBox> vtable and possibly speed up compile times even more.

rphmeier · 2016-02-15T16:09:07Z

@arielb1 I tried doing the inlined variety here: rphmeier@9fd687b

But it seems to have slower compile times than the previous shim, but not as bad as my first attempt.
I do wonder whether the boxing is necessary at all with this restructuring of the code. Theoretically, you could just pass a pointer to "p" on the stack frame, and then use it the same way. This requires additional synchronization via atomics such that you don't mem::forget(p) or return from Thread::new() before the data has been read on the other thread. In my local proof-of-concept, this passed the tests, but test compile + run time took slightly more than the boxing version.

arielb1 · 2016-02-15T16:15:57Z

thats... a bit odd, but go figure. What measurement method were you using?

rphmeier · 2016-02-15T17:21:43Z

Just a time make check-stage1-std NO_REBUILD=1 NO_BENCH=1

If you have any other suggestions for benchmarking, I'd be glad to hear them. I don't know very much about the Rust build system, so I'm sure there are better ways to perform this test.

arielb1 · 2016-02-15T17:25:35Z

You can cut out the middleman

$ rustc ../src/libstd/lib.rs --test -Z time-passes

in opt mode or in no-opt mode.

alexcrichton · 2016-02-16T05:49:49Z

I was curious what this did to compile times in a more extreme case, as stdtest may not give us great insight into compiling this function specifically. The test case I had was spawning 1000 threads (each returning a new integer) and then just dropping all the handles (no join). An optimized compile of this program went from 9:06.92 to 14:18.06, a 36% slowdown.

Have you measured any runtime benefit to this? I'm not super eager to make the standard library more generic without any runtime benefit.

rphmeier · 2016-02-17T01:10:48Z

I did the same test as you're proposing, but with only 200 threads as opposed to 1000, since 1000 ran on my laptop for about half an hour in LLVM with no signs of finishing. With 200 threads, I measured a 1.8% compilation slowdown with my changes (47.277s to 48.148s). Running rustc -Z time-passes indicated that for N=1000, it got hung up in LLVM, which probably means that there's a non-linear-complexity algorithm leading to the massive spike we're observing. Regardless, I don't think it's related to my changes.

Actually running the benchmark:

master:

running 1 test
test bench_run ... bench:   5,535,172 ns/iter (+/- 4,100,966)

thread_box_once:

running 1 test
test bench_run ... bench:   5,436,470 ns/iter (+/- 1,974,170)

So it seems that the runtimes are just about even. The 1.8% slowdown amounted to a 0.871s increase in compile times, or about 5ms per thread::spawn call. I'm not going to dismiss it, but I think that this is definitely a tolerable increase.

arielb1 · 2016-02-17T09:21:43Z

Maybe try a strategically placed inline(never)?

alexcrichton · 2016-02-22T18:41:04Z

I don't think it's related to my changes.

What I mean to point out is that making these functions more generic comes at a cost, and that cost is inlining more code downstream. This is minimized here, but the cost is not zero.

So it seems that the runtimes are just about even.

While I agree with the conclusion the method here doesn't look like it's producing accurate results. The variance in those measurements is massive, so we can't really gain any information from that. I suspect, however, that it's basically absolutely guaranteed that there is no difference in perf from one more allocation when spawning a thread

bors · 2016-03-23T19:35:49Z

☔ The latest upstream changes (presumably #32390) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton · 2016-05-01T18:08:07Z

Closing due to inactivity, but feel free to resubmit with a rebase!

rust-highfive assigned brson Feb 13, 2016

rphmeier force-pushed the thread_box_once branch from 893ff0c to 6e3582b Compare February 13, 2016 03:39

rphmeier force-pushed the thread_box_once branch 2 times, most recently from b4e9419 to 2543cd7 Compare February 15, 2016 04:31

rphmeier force-pushed the thread_box_once branch from 2543cd7 to edd679a Compare February 15, 2016 15:18

Only box the function once when creating threads.

ee7a85c

rphmeier force-pushed the thread_box_once branch from edd679a to ee7a85c Compare February 15, 2016 15:38

alexcrichton closed this May 1, 2016

Only box the function once when creating threads. #31621

Only box the function once when creating threads. #31621

Uh oh!

Conversation

rphmeier commented Feb 13, 2016

Uh oh!

rust-highfive commented Feb 13, 2016

Uh oh!

alexcrichton commented Feb 13, 2016

Uh oh!

rphmeier commented Feb 13, 2016

Uh oh!

arielb1 commented Feb 13, 2016

Uh oh!

rphmeier commented Feb 13, 2016

Uh oh!

arielb1 commented Feb 13, 2016

Uh oh!

rphmeier commented Feb 13, 2016

Uh oh!

alexcrichton commented Feb 14, 2016

Uh oh!

rphmeier commented Feb 14, 2016

Uh oh!

rphmeier commented Feb 15, 2016

Uh oh!

arielb1 commented Feb 15, 2016

Uh oh!

rphmeier commented Feb 15, 2016

Uh oh!

arielb1 commented Feb 15, 2016

Uh oh!

rphmeier commented Feb 15, 2016

Uh oh!

arielb1 commented Feb 15, 2016

Uh oh!

alexcrichton commented Feb 16, 2016

Uh oh!

rphmeier commented Feb 17, 2016

Uh oh!

arielb1 commented Feb 17, 2016

Uh oh!

alexcrichton commented Feb 22, 2016

Uh oh!

bors commented Mar 23, 2016

Uh oh!

alexcrichton commented May 1, 2016

Uh oh!

Uh oh!