Skip to content

String searching is slow #14107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
huonw opened this issue May 11, 2014 · 1 comment · Fixed by #14135
Closed

String searching is slow #14107

huonw opened this issue May 11, 2014 · 1 comment · Fixed by #14135
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@huonw
Copy link
Member

huonw commented May 11, 2014

The current implement of functions like StrSlice.contains are naive string searches, not using a more intelligent algorithm like Boyer-Moore (with a comment referring one to #1932).

Unfortunately that means we're trounced when performing operations like that. e.g.

// contains_string.rs
extern crate test;

fn main() {
    let haystack = std::io::stdin().read_to_str().unwrap();

    let needle = "Anything whatsoever related to the Rust programming language: an open-source systems programming language from Mozilla, emphasizing safety, concurrency, and speed.";

    println!("{} {}", haystack.len(), needle.len());
    for _ in range(0, 1000) {
        test::black_box(haystack.contains(needle))
    }
}
// contains_string.c
#include<string.h>
#include<stdio.h>
#include<unistd.h>
#include<stdlib.h>

// 16MB
#define N (1 << 24)

int main() {
    char * haystack = calloc(N, 1);

    read(0, haystack, N - 1);

    const char * needle = "Anything whatsoever related to the Rust programming language: an open-source systems programming language from Mozilla, emphasizing safety, concurrency, and speed.";

    printf("%d %d", strlen(haystack), strlen(needle));
    int count = 0;
    for (int i = 0; i < 1000; i ++) {
        haystack[N - 1] = 0;
        count += strstr(haystack, needle) == NULL;
    }

    return count;
}
// contains_string.py
import sys

haystack = sys.stdin.read()
needle = "Anything whatsoever related to the Rust programming language: an open-source systems programming language from Mozilla, emphasizing safety, concurrency, and speed."

print(len(haystack), len(needle))
for _ in range(0, 1000):
    needle in haystack
$ wget http://www.gutenberg.org/cache/epub/1342/pg1342.txt # pride and prejudice (700 kb)
[...snip...]
$ rustc -O contains_string.rs

$ gcc -O contains_string.c -o contains_string_c -std=gnu11

$ time ./contains_string < pg1342.txt 
717569 163

real    0m1.041s
user    0m1.036s
sys     0m0.000s

$ time ./contains_string_c < pg1342.txt 
717569 163
real    0m0.064s
user    0m0.056s
sys     0m0.004s

$ time python ./contains_string.py < pg1342.txt 
(717569, 163)

real    0m0.155s
user    0m0.152s
sys     0m0.000s

General notes:

bors added a commit that referenced this issue May 16, 2014
This changes the previously naive string searching algorithm to a two-way search like glibc, which should be faster on average while still maintaining worst case linear time complexity. This fixes #14107. Note that I don't think this should be merged yet, as this is the only approach to speeding up search I've tried - it's worth considering options like Boyer-Moore or adding a bad character shift table to this. However, the benchmarks look quite good so far:

    test str::bench::bench_contains_bad_naive                   ... bench:       290 ns/iter (+/- 12)     from 1309 ns/iter (+/- 36)
    test str::bench::bench_contains_equal                       ... bench:       479 ns/iter (+/- 10)     from  137 ns/iter (+/- 2)
    test str::bench::bench_contains_short_long                  ... bench:      2844 ns/iter (+/- 105)    from 5473 ns/iter (+/- 14)
    test str::bench::bench_contains_short_short                 ... bench:        55 ns/iter (+/- 4)      from   57 ns/iter (+/- 6)

Except for the case specifically designed to be optimal for the naive case (`bench_contains_equal`), this gets as good or better performance as the previous code.
@mahkoh
Copy link
Contributor

mahkoh commented Feb 1, 2015

~/dragons/strstr$ /usr/bin/time -p ./rust < pg1342.txt
real 26.85
user 26.79
sys 0.00
~/dragons/strstr$ /usr/bin/time -p ./c < pg1342.txt    
real 0.83
user 0.83
sys 0.00

This is for the case where the needle is a single character. In the case above the rust version is as fast or faster than the C version.

#include <string.h>
#include <unistd.h>

int main(void) {
    static char haystack[1024 * 1024];
    read(0, haystack, sizeof(haystack) - 1);

    const char *needle = "_";

    for (size_t i = 0; i < 10000000; i++) {
        size_t j = (size_t)strstr(haystack, needle);
        asm volatile("" : : "r"(j), "r"(&needle) : "memory");
    }
}
#![feature(core, test, io)]

extern crate test;

fn main() {
    let haystack = std::old_io::stdin().read_to_string().unwrap();

    let mut needle = "_";

    for _ in range(0, 10000000) {
        test::black_box(&mut needle);
        test::black_box(haystack.contains(needle));
    }
}

Maybe there is a bug in the case for short needles.

flip1995 pushed a commit to flip1995/rust that referenced this issue Apr 3, 2025
close rust-lang#2177

changelog: [`manual_dangling_ptr`]: new lint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants