Improve performance of ncodeunits(::Char) (#54001)

Seelengrab · web-flow · commit d183ee1bc0f6 · 2024-04-09T17:06:49.000-04:00
This improves performance of `ncodeunits(::Char)` by simply counting the number of non-zero bytes (except for `\0`, which is encoded as all zero bytes). For a performance comparison, see [this gist]( https://gist.github.com/Seelengrab/ebb02d4b8d754700c2869de8daf88cad); there's an up to 10x improvement here for collections of `Char`, with a minor improvement for single `Char` (with much smaller spread). The version in this PR is called `nbytesencoded` in the benchmarks. Correctness has been verified with Supposition.jl, using the existing implementation as an oracle: ```julia julia> using Supposition julia> const chars = Data.Characters() julia> @check max_examples=1_000_000 function bytesenc(i=Data.Integers{UInt32}()) c = reinterpret(Char, i) ncodeunits(c) == nbytesdiv(c) end; Test Summary: | Pass Total Time bytesenc | 1 1 1.0s julia> ncodeunits('\0') == nbytesencoded('\0') true ``` Let's see if CI agrees! Notably, neither the existing nor the new implementation check whether the given `Char` is valid or not, since the only thing that matters is how many bytes are written out. --------- Co-authored-by: Sukera <Seelengrab@users.noreply.github.com>
diff --git a/base/char.jl b/base/char.jl
@@ -62,7 +62,14 @@ to an output stream, or `ncodeunits(string(c))` but computed efficiently.
     This method requires at least Julia 1.1. In Julia 1.0 consider
     using `ncodeunits(string(c))`.
 """
-ncodeunits(c::Char) = write(devnull, c) # this is surprisingly efficient
+function ncodeunits(c::Char)
+    u = reinterpret(UInt32, c)
+    # We care about how many trailing bytes are all zero
+    # subtract that from the total number of bytes
+    n_nonzero_bytes = sizeof(UInt32) - div(trailing_zeros(u), 0x8)
+    # Take care of '\0', which has an all-zero bitpattern
+    n_nonzero_bytes + iszero(u)
+end
 
 """
     codepoint(c::AbstractChar) -> Integer