Commit d183ee1

Improve performance of ncodeunits(::Char) (#54001)
This improves performance of `ncodeunits(::Char)` by simply counting the number of non-zero bytes (except for `\0`, which is encoded as all zero bytes). For a performance comparison, see [this gist](https://gist.github.com/Seelengrab/ebb02d4b8d754700c2869de8daf88cad); there's an up to 10x improvement here for collections of `Char`, with a minor improvement for single `Char` (with much smaller spread). The version in this PR is called `nbytesencoded` in the benchmarks.

Correctness has been verified with Supposition.jl, using the existing implementation as an oracle:

```julia
julia> using Supposition

julia> const chars = Data.Characters()

julia> @check max_examples=1_000_000 function bytesenc(i=Data.Integers{UInt32}())
           c = reinterpret(Char, i)
           ncodeunits(c) == nbytesdiv(c)
       end;
Test Summary: | Pass  Total  Time
bytesenc      |    1      1  1.0s

julia> ncodeunits('\0') == nbytesencoded('\0')
true
```

Let's see if CI agrees!

Notably, neither the existing nor the new implementation check whether the given `Char` is valid or not, since the only thing that matters is how many bytes are written out.

---------

Co-authored-by: Sukera <[email protected]>
1 parent f870ea0 commit d183ee1

1 file changed: +8 -1 lines changed

base/char.jl

Lines changed: 8 additions & 1 deletion
@@ -62,7 +62,14 @@ to an output stream, or `ncodeunits(string(c))` but computed efficiently.
     This method requires at least Julia 1.1. In Julia 1.0 consider
     using `ncodeunits(string(c))`.
 """
-ncodeunits(c::Char) = write(devnull, c) # this is surprisingly efficient
+function ncodeunits(c::Char)
+    u = reinterpret(UInt32, c)
+    # We care about how many trailing bytes are all zero
+    # subtract that from the total number of bytes
+    n_nonzero_bytes = sizeof(UInt32) - div(trailing_zeros(u), 0x8)
+    # Take care of '\0', which has an all-zero bitpattern
+    n_nonzero_bytes + iszero(u)
+end
 
 """
     codepoint(c::AbstractChar) -> Integer
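
For readers who want to try the idea outside of `Base`, here is a minimal sketch that compares the byte-counting approach with the previous `write(devnull, c)` definition on a few characters. The names `ncodeunits_bitcount` and `ncodeunits_write` are hypothetical and exist only so that `Base.ncodeunits` is not overwritten; this is a small spot check, not a replacement for the property test described in the commit message.

```julia
# Hypothetical standalone copy of the new approach (not the Base method itself).
function ncodeunits_bitcount(c::Char)
    u = reinterpret(UInt32, c)
    # A Char keeps its encoded bytes in the high-order end of a UInt32,
    # so the unused bytes show up as whole bytes of trailing zero bits.
    n = sizeof(UInt32) - div(trailing_zeros(u), 8)
    # '\0' is all zero bits (n == 0 here) but is still written as one byte.
    n + iszero(u)
end

# The pre-change definition: count the bytes actually written to a sink.
ncodeunits_write(c::Char) = write(devnull, c)

# Spot check on 1-, 2-, 3- and 4-byte characters plus the '\0' special case.
for c in ('\0', 'a', 'é', '€', '😀')
    @assert ncodeunits_bitcount(c) == ncodeunits_write(c) == ncodeunits(string(c))
end
```

The `iszero(u)` term is the only special case: for every other bit pattern, the count follows directly from the position of the lowest set bit.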
