Remove unreachable #8

Open: hkBst wants to merge 6 commits into main

Conversation

hkBst
Member

@hkBst hkBst commented May 14, 2025

This improves the API of this crate so that it no longer uses `unreachable`, and is the continuation of rust-lang/rust#138163.

It also eliminates internal uses of `unreachable` by inlining the `Mode` methods into the `*_common` functions, and eliminates the resulting duplication by using traits instead. Using traits is much more verbose than a macro-based variant, because traits are very explicit, but hopefully also a bit clearer.
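As a rough illustration of the trait-based idea (a minimal sketch, not the code in this PR): the names `CheckRawSketch`, `SketchError`, `RawStrSketch` and `RawByteStrSketch` are hypothetical, and the real crate handles many more cases. Each kind of literal becomes its own type, so the shared driver never needs a `match` on a mode with `unreachable!()` arms, and byte strings can report `u8` instead of `char`.

```rust
use std::ops::Range;

/// Hypothetical error type standing in for the crate's `EscapeError`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum SketchError {
    BareCarriageReturn,
    NonAsciiInByteString,
}

/// Hypothetical trait: one implementation per kind of raw literal.
trait CheckRawSketch {
    /// Unit reported per source character: `char` for strings, `u8` for byte strings.
    type Unit;

    /// Converts one source character into a unit, or rejects it.
    fn char_to_unit(c: char) -> Result<Self::Unit, SketchError>;

    /// Shared driver: reports every character of `src` together with its byte range.
    fn check_raw(
        src: &str,
        mut callback: impl FnMut(Range<usize>, Result<Self::Unit, SketchError>),
    ) {
        for (start, c) in src.char_indices() {
            let res = if c == '\r' {
                Err(SketchError::BareCarriageReturn)
            } else {
                Self::char_to_unit(c)
            };
            callback(start..start + c.len_utf8(), res);
        }
    }
}

/// Marker type for raw string literals (hypothetical).
struct RawStrSketch;
/// Marker type for raw byte string literals (hypothetical).
struct RawByteStrSketch;

impl CheckRawSketch for RawStrSketch {
    type Unit = char;

    fn char_to_unit(c: char) -> Result<char, SketchError> {
        Ok(c)
    }
}

impl CheckRawSketch for RawByteStrSketch {
    type Unit = u8;

    fn char_to_unit(c: char) -> Result<u8, SketchError> {
        if c.is_ascii() {
            Ok(c as u8)
        } else {
            Err(SketchError::NonAsciiInByteString)
        }
    }
}
```

A caller would pick the literal kind through the type, for example `RawByteStrSketch::check_raw(src, |range, res| ...)`, and receive `Result<u8, _>` values directly.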

I've tried to use separate commits to explain the story, but have probably only succeeded at the beginning.

There is a companion PR to use this new API here: rust-lang/rust#140999

r? @nnethercote

@GuillaumeGomez
Member

Gonna take a look later on.

@nnethercote

Thanks for splitting this into multiple pieces. I think it's good that @GuillaumeGomez will look at this; it will be good to get another pair of eyes on it after rust-lang/rust#138163.

@GuillaumeGomez
Member

Changes look good to me. Now comes the not so fun question: can you add benchmarks please?

@hkBst
Member Author

hkBst commented May 21, 2025

> Changes look good to me. Now comes the not so fun question: can you add benchmarks please?

Are you asking for numbers or for benchmark code in this crate? Before the crate was split off, there were numbers for the macro variant of this change. I'm happy to come up with some benchmark code.

@GuillaumeGomez
Member

Mostly code, I can check locally when done. Considering it'll impact performance, better check ahead of time. We just need to check the entry functions.

@hkBst
Member Author

hkBst commented May 27, 2025

Benchmarks PR: #9

src/lib.rs Outdated
/// Takes the contents of a literal (without quotes)
/// and produces a sequence of errors,
/// which are returned by invoking `error_callback`.
pub fn unescape_for_errors(
Member

What is the purpose of this function? It does the conversion but doesn't actually make use of it. It only provides information if one error occurred. Where is it meant to be used?

Member Author

@hkBst hkBst May 28, 2025

Its only purpose is to be used in the companion PR using the new API here: https://github.com/rust-lang/rust/pull/140999/files#diff-36d0ff95049fa1b66bdd47ec2c03e1588268303571a9561d1ba664ca29034dacR1019-R1049.

It seemed like a good compromise to remove a stubborn use of `unescape_{unicode,mixed}`, and it signals intent well. I suppose it could alternatively live where it is used instead of here, except for its use of `unescape_single`...

Member

Oh I see, it only checks if there is an error. It doesn't need the unescaped content. Then let me make a suggestion for its documentation.

Member Author

Right, it is like the other unescape_* functions, but it only gives you the error results and not the Oks. That's why I named it unescape_for_errors. Was the name not a good indication of this behavior?

Member

Not really. Always open to interpretation. I needed your comment to understand why this function was working this way. Documentation is here to clarify that.

@GuillaumeGomez
Member

Also need to update the benchmarks.

hkBst added 5 commits May 28, 2025 09:23
The old API exposes `unreachable` in both unescape_unicode and unescape_mixed.
These are conceptually one function, but because their return types are incompatible,
they could not be unified.

The new API takes this insight further to separate unescape_unicode into separate functions,
such that byte functions can return bytes instead of chars.
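The difference between the two API shapes can be sketched as follows. This is a simplified illustration under stated assumptions, not the crate's code: all `*_sketch` names and the `NonAscii` error are hypothetical, and escape sequences are ignored entirely. The old shape takes a `Mode` and must reject unsupported modes at run time, while the new shape has one function per literal kind, so there is no unsupported case and bytes come back as `u8`.

```rust
/// Hypothetical error for this sketch.
#[derive(Debug, PartialEq)]
struct NonAscii;

#[allow(dead_code)] // not every variant is constructed in this sketch
#[derive(Clone, Copy)]
enum Mode {
    Str,
    ByteStr,
    CStr,
}

/// Old-style shape: one entry point selected by `mode`, always yielding `char`.
fn unescape_unicode_old_sketch(
    src: &str,
    mode: Mode,
    callback: &mut impl FnMut(Result<char, NonAscii>),
) {
    match mode {
        Mode::Str => {
            for c in src.chars() {
                callback(Ok(c));
            }
        }
        Mode::ByteStr => {
            for c in src.chars() {
                callback(if c.is_ascii() { Ok(c) } else { Err(NonAscii) });
            }
        }
        // Wrong entry point for this mode: the mistake is only detected at run time.
        Mode::CStr => unreachable!(),
    }
}

/// New-style shape: strings have their own function and yield `char`.
fn unescape_str_sketch(src: &str, callback: &mut impl FnMut(Result<char, NonAscii>)) {
    for c in src.chars() {
        callback(Ok(c));
    }
}

/// New-style shape: byte strings have their own function and yield `u8`.
fn unescape_byte_str_sketch(src: &str, callback: &mut impl FnMut(Result<u8, NonAscii>)) {
    for c in src.chars() {
        callback(if c.is_ascii() { Ok(c as u8) } else { Err(NonAscii) });
    }
}
```

With the second shape, a caller that wants bytes no longer has to narrow each `char` to `u8` itself, and passing an unsupported mode is no longer possible.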
@rust-cloud-vms bot force-pushed the remove_unreachable branch from 00b6cfd to 45a5bf4 on May 28, 2025 09:27
src/lib.rs Outdated
Comment on lines 543 to 545
/// Takes the contents of a literal (without quotes)
/// and produces a sequence of errors,
/// which are returned by invoking `error_callback`.
Member

Suggested change
-/// Takes the contents of a literal (without quotes)
-/// and produces a sequence of errors,
-/// which are returned by invoking `error_callback`.
+/// Takes the contents of a literal (without quotes) and calls `error_callback` if any error is encountered
+/// while unescaping it. Please note that the unescaped content is not provided, this function is only meant
+/// to be used to confirm whether or not the literal content is (in)valid.

Member Author

I've renamed this to check_for_errors and improved its docs. Also took the chance to polish up some of the other doc comments. Let me know what you think.
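The behavior discussed in this thread (report only the errors, drop the unescaped content) could look roughly like the sketch below. The real signature of `check_for_errors` is not shown in this conversation, so everything here is an assumption: `unescape_sketch`, `check_for_errors_sketch` and the `BareCr` error are hypothetical stand-ins, and only bare carriage returns are treated as errors.

```rust
use std::ops::Range;

/// Hypothetical error for this sketch.
#[derive(Debug, PartialEq)]
struct BareCr;

/// Stand-in for an `unescape_*` function: reports every character, valid or not.
fn unescape_sketch(src: &str, callback: &mut impl FnMut(Range<usize>, Result<char, BareCr>)) {
    for (start, c) in src.char_indices() {
        let res = if c == '\r' { Err(BareCr) } else { Ok(c) };
        callback(start..start + c.len_utf8(), res);
    }
}

/// Forwards only the errors to `error_callback`; `Ok` results are discarded.
fn check_for_errors_sketch(src: &str, mut error_callback: impl FnMut(Range<usize>, BareCr)) {
    unescape_sketch(src, &mut |range, res| {
        if let Err(err) = res {
            error_callback(range, err);
        }
    });
}
```

A caller that only wants to know whether a literal is valid can then check whether the callback was invoked at all.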

@GuillaumeGomez
Member

GuillaumeGomez commented May 29, 2025

API changes look good to me. Let me check benches now and then I think we're ready to go. :)

@GuillaumeGomez
Member

Here are the bench results:

| test name | main branch | new changes | diff |
|---|---|---|---|
| bench_check_raw_byte_str | 25,941.68 ns/iter (+/- 57.51) | 48,097.89 ns/iter (+/- 104.59) | +85% |
| bench_check_raw_c_str_ascii | 24,387.77 ns/iter (+/- 83.40) | 39,709.25 ns/iter (+/- 116.80) | +62% |
| bench_check_raw_c_str_unicode | 40,485.60 ns/iter (+/- 181.36) | 50,116.99 ns/iter (+/- 115.94) | +23.7% |
| bench_check_raw_str_ascii | 25,679.40 ns/iter (+/- 68.78) | 39,727.86 ns/iter (+/- 137.46) | +54.7% |
| bench_check_raw_str_unicode | 42,031.95 ns/iter (+/- 199.86) | 47,429.04 ns/iter (+/- 121.53) | +12.8% |
| bench_skip_ascii_whitespace | 6,321.42 ns/iter (+/- 70.54) | 6,322.61 ns/iter (+/- 20.25) | 0% |
| bench_unescape_byte_str_ascii | 64,825.30 ns/iter (+/- 68.90) | 57,407.75 ns/iter (+/- 272.16) | -11.4% |
| bench_unescape_byte_str_hex | 95,602.13 ns/iter (+/- 214.39) | 85,577.19 ns/iter (+/- 188.78) | -10.4% |
| bench_unescape_byte_str_trivial | 34,165.35 ns/iter (+/- 114.53) | 52,296.49 ns/iter (+/- 260.71) | +53% |
| bench_unescape_c_str_ascii | 54,627.06 ns/iter (+/- 76.32) | 48,477.48 ns/iter (+/- 214.53) | -10.7% |
| bench_unescape_c_str_hex_ascii | 94,049.71 ns/iter (+/- 192.47) | 86,601.92 ns/iter (+/- 226.26) | -7.9% |
| bench_unescape_c_str_hex_byte | 94,030.54 ns/iter (+/- 170.87) | 86,285.94 ns/iter (+/- 129.85) | -7.9% |
| bench_unescape_c_str_trivial | 44,069.69 ns/iter (+/- 63.67) | 37,109.07 ns/iter (+/- 89.56) | -15.8% |
| bench_unescape_c_str_unicode | 183,698.16 ns/iter (+/- 236.18) | 165,312.95 ns/iter (+/- 167.56) | -10% |
| bench_unescape_str_ascii | 64,803.76 ns/iter (+/- 88.18) | 66,926.66 ns/iter (+/- 228.30) | +3.3% |
| bench_unescape_str_hex | 95,642.76 ns/iter (+/- 145.34) | 102,020.68 ns/iter (+/- 140.15) | +6.7% |
| bench_unescape_str_trivial | 34,071.43 ns/iter (+/- 54.39) | 46,578.22 ns/iter (+/- 91.29) | +36.7% |
| bench_unescape_str_unicode | 185,521.24 ns/iter (+/- 228.33) | 180,438.70 ns/iter (+/- 428.62) | -2.7% |

Overall, the check* functions are much slower (+12.8% to +85%). The unescape* ones are mixed: several small gains, but also big regressions in the trivial cases (+53% and +36.7%). More work is required to improve everything; we cannot merge as is.

@hkBst
Member Author

hkBst commented May 30, 2025

Interesting! Is there an easy way to create such a nice table?

@GuillaumeGomez
Member

Sadly no. I ran the benches many times on both main and your branch, kept the runs with the lowest +/- deviation, and finally computed the diff for each benchmark. So I recommend you do the same for main, and then you can compare against your branch.
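The manual procedure described here can be scripted. The following is a minimal sketch, assuming the `ns/iter` value and the `+/-` deviation of each run have already been extracted by hand; the `Run` type and both function names are hypothetical and not part of any benchmarking tool.

```rust
/// One benchmark run: the reported `ns/iter` value and its `+/-` deviation.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Run {
    ns_per_iter: f64,
    deviation: f64,
}

/// Picks the run with the smallest `+/-` deviation, or `None` if there are no runs.
fn steadiest_run(runs: &[Run]) -> Option<Run> {
    runs.iter()
        .copied()
        .min_by(|a, b| a.deviation.total_cmp(&b.deviation))
}

/// Relative change from `baseline_ns` to `candidate_ns`, in percent.
/// Positive values mean the candidate is slower.
fn percent_change(baseline_ns: f64, candidate_ns: f64) -> f64 {
    (candidate_ns - baseline_ns) / baseline_ns * 100.0
}
```

For the first row of the table, `percent_change(25_941.68, 48_097.89)` comes out at roughly 85.4, matching the reported +85%.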

@hkBst
Member Author

hkBst commented May 30, 2025

Taking the worst offender (`bench_check_raw_byte_str`), if I manually inline `bench_check_raw` (and so get rid of `&mut dyn FnMut`), then a lot (but not all) of the slowdown disappears. Unfortunately just annotating with `#[inline(always)]` does nothing.

On the other hand, if I make the main branch use the newer more generic `bench_check_raw` (which includes adding `+ ?Sized` bounds to `unescape_unicode` and `*_common`), then it becomes just as slow as the new code.

Maybe I should rewrite the benchmarks as macros, to minimize such issues...
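For readers unfamiliar with the `+ ?Sized` bound mentioned above, here is a minimal sketch of what it enables. The function names are hypothetical and unrelated to the crate. With the bound, the same generic function accepts both a concrete closure (static dispatch, which the compiler can inline) and a `dyn FnMut` trait object (dynamic dispatch through a vtable), which is the difference the benchmark harness appears to be sensitive to.

```rust
/// Generic over the callback type. Without `?Sized`, `F` would have to be a
/// sized type and `F = dyn FnMut(u8)` would be rejected.
fn for_each_byte_sketch<F>(src: &str, callback: &mut F)
where
    F: FnMut(u8) + ?Sized,
{
    for b in src.bytes() {
        callback(b);
    }
}

/// Static dispatch: `F` is the concrete closure type.
fn sum_static(src: &str) -> u32 {
    let mut sum = 0u32;
    for_each_byte_sketch(src, &mut |b| sum += u32::from(b));
    sum
}

/// Dynamic dispatch: `F` is `dyn FnMut(u8)`, so each call goes through a vtable.
fn sum_dynamic(src: &str) -> u32 {
    let mut sum = 0u32;
    let callback: &mut dyn FnMut(u8) = &mut |b| sum += u32::from(b);
    for_each_byte_sketch(src, callback);
    sum
}
```

Both functions compute the same result; only the dispatch mechanism differs, so any timing gap between them comes from the call path rather than from the work done.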

diff --git a/benches/benches.rs b/benches/benches.rs
index a028dfd..1100832 100644
--- a/benches/benches.rs
+++ b/benches/benches.rs
@@ -3,7 +3,9 @@
 extern crate test;
 
 use rustc_literal_escaper::*;
+use std::fmt::Debug;
 use std::iter::repeat_n;
+use std::ops::Range;
 
 const LEN: usize = 10_000;
 
@@ -37,6 +39,24 @@ fn bench_skip_ascii_whitespace(b: &mut test::Bencher) {
 // Check raw
 //
 
+#[allow(clippy::type_complexity)]
+fn new_bench_check_raw<UNIT: Into<char> + PartialEq + Debug + Copy>(
+    b: &mut test::Bencher,
+    c: UNIT,
+    check_raw: fn(&str, &mut dyn FnMut(Range<usize>, Result<UNIT, EscapeError>)),
+) {
+    let input: String = test::black_box(repeat_n(c.into(), LEN).collect());
+    assert_eq!(input.len(), LEN * c.into().len_utf8());
+
+    b.iter(|| {
+        let mut output = vec![];
+
+        check_raw(&input, &mut |range, res| output.push((range, res)));
+        assert_eq!(output.len(), LEN);
+        assert_eq!(output[0], (0..c.into().len_utf8(), Ok(c)));
+    });
+}
+
 fn bench_check_raw(b: &mut test::Bencher, c: char, mode: Mode) {
     let input: String = test::black_box(repeat_n(c, LEN).collect());
     assert_eq!(input.len(), LEN * c.len_utf8());
@@ -64,7 +84,20 @@ fn bench_check_raw_str_unicode(b: &mut test::Bencher) {
 
 #[bench]
 fn bench_check_raw_byte_str(b: &mut test::Bencher) {
-    bench_check_raw(b, 'a', Mode::RawByteStr);
+    //    bench_check_raw(b, 'a', Mode::RawByteStr);
+
+    new_bench_check_raw(b, 'a', |s, cb| unescape_unicode(s, Mode::RawByteStr, cb));
+
+    // let input: String = test::black_box(repeat_n('a', LEN).collect());
+    // assert_eq!(input.len(), LEN * 'a'.len_utf8());
+
+    // b.iter(|| {
+    //     let mut output = vec![];
+
+    //     check_raw_byte_str(&input, &mut |range, res| output.push((range, res)));
+    //     assert_eq!(output.len(), LEN);
+    //     assert_eq!(output[0], (0..1, Ok(b'a')));
+    // });
 }
 
 // raw C str
diff --git a/src/lib.rs b/src/lib.rs
index d315ed2..c381032 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -87,7 +87,7 @@ impl EscapeError {
 /// the callback will be called exactly once.
 pub fn unescape_unicode<F>(src: &str, mode: Mode, callback: &mut F)
 where
-    F: FnMut(Range<usize>, Result<char, EscapeError>),
+    F: FnMut(Range<usize>, Result<char, EscapeError>) + ?Sized,
 {
     match mode {
         Char | Byte => {
@@ -357,7 +357,7 @@ fn unescape_char_or_byte(chars: &mut Chars<'_>, mode: Mode) -> Result<char, Esca
 /// sequence of escaped characters or errors.
 fn unescape_non_raw_common<F, T: From<char> + From<u8>>(src: &str, mode: Mode, callback: &mut F)
 where
-    F: FnMut(Range<usize>, Result<T, EscapeError>),
+    F: FnMut(Range<usize>, Result<T, EscapeError>) + ?Sized,
 {
     let mut chars = src.chars();
     let allow_unicode_chars = mode.allow_unicode_chars(); // get this outside the loop
@@ -424,7 +424,7 @@ where
 /// only produce errors on bare CR.
 fn check_raw_common<F>(src: &str, mode: Mode, callback: &mut F)
 where
-    F: FnMut(Range<usize>, Result<char, EscapeError>),
+    F: FnMut(Range<usize>, Result<char, EscapeError>) + ?Sized,
 {
     let mut chars = src.chars();
     let allow_unicode_chars = mode.allow_unicode_chars(); // get this outside the loop
