GODRIVER-2914 bson: improve marshal/unmarshal performance by ~58% and ~29% #1313
Conversation
Force-pushed 97b2de0 to 280b324
These benchmarks do a good job of identifying bottlenecks in the bson encoding/decoding logic.
This commit improves BSON marshaling performance by ~58% and unmarshaling performance by ~29% by replacing the mutex-based decoder/encoder caches with sync.Map, which can often avoid locking and is ideally suited for caches that only grow.

The commit also adds the BenchmarkCodeMarshal and BenchmarkCodeUnmarshal benchmarks from the Go standard library's encoding/json package since they do an excellent job of stress testing parallel encoding/decoding (a common use case in a database driver) and are how the lock contention that led to this commit was discovered.

```
goos: darwin
goarch: arm64
pkg: go.mongodb.org/mongo-driver/bson
                        │ base.20.txt │            new.20.txt             │
                        │   sec/op    │   sec/op     vs base              │
CodeUnmarshal/BSON-10     3.192m ± 1%   2.246m ± 1%  -29.64% (p=0.000 n=20)
CodeUnmarshal/JSON-10     2.735m ± 1%   2.737m ± 0%        ~ (p=0.640 n=20)
CodeMarshal/BSON-10       2.972m ± 0%   1.221m ± 3%  -58.93% (p=0.000 n=20)
CodeMarshal/JSON-10       471.0µ ± 1%   464.6µ ± 0%   -1.36% (p=0.000 n=20)
geomean                   1.870m        1.366m       -26.92%

                        │ base.20.txt │             new.20.txt              │
                        │     B/s     │     B/s       vs base               │
CodeUnmarshal/BSON-10     579.7Mi ± 1%   823.9Mi ± 1%   +42.13% (p=0.000 n=20)
CodeUnmarshal/JSON-10     676.6Mi ± 1%   676.2Mi ± 0%         ~ (p=0.640 n=20)
CodeMarshal/BSON-10       622.7Mi ± 0%  1516.2Mi ± 3%  +143.46% (p=0.000 n=20)
CodeMarshal/JSON-10       3.837Gi ± 1%   3.890Gi ± 0%    +1.38% (p=0.000 n=20)
geomean                   989.8Mi        1.323Gi        +36.84%

                        │ base.20.txt │            new.20.txt            │
                        │    B/op     │    B/op      vs base             │
CodeUnmarshal/BSON-10     4.219Mi ± 0%   4.219Mi ± 0%       ~ (p=0.077 n=20)
CodeUnmarshal/JSON-10     2.904Mi ± 0%   2.904Mi ± 0%       ~ (p=0.672 n=20)
CodeMarshal/BSON-10       2.821Mi ± 1%   2.776Mi ± 2%  -1.59% (p=0.023 n=20)
CodeMarshal/JSON-10       1.857Mi ± 0%   1.859Mi ± 0%       ~ (p=0.331 n=20)
geomean                   2.830Mi        2.820Mi       -0.37%

                        │ base.20.txt │            new.20.txt            │
                        │  allocs/op  │  allocs/op   vs base             │
CodeUnmarshal/BSON-10      230.4k ± 0%   230.4k ± 0%       ~ (p=1.000 n=20)
CodeUnmarshal/JSON-10      92.67k ± 0%   92.67k ± 0%       ~ (p=1.000 n=20) ¹
CodeMarshal/BSON-10        94.07k ± 0%   94.07k ± 0%       ~ (p=0.112 n=20)
CodeMarshal/JSON-10         1.000 ± 0%    1.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                    6.694k        6.694k       +0.00%
¹ all samples are equal
```
Force-pushed 280b324 to 7ee2919
This commit eliminates all allocations during marshalling (except for the final []byte slice allocation, which is unavoidable). This matches the behavior of encoding/json. It also prevents the vwPool from leaking writers and includes a few small optimizations to the value_writer.

Other changes:
* bsoncodec: reduce use of reflect.Value.Interface()
* bson: use a buffer pool instead of SliceWriter to eliminate 2 allocs
* bsonrw: use references to slice indexes

This work builds off of mongodb#1313 and uses the BenchmarkCode* suite added by that PR. The combined performance improvement is below:

```
goos: darwin
goarch: arm64
pkg: go.mongodb.org/mongo-driver/bson
                        │ base.20.txt │            new.20.txt             │
                        │   sec/op    │   sec/op     vs base              │
CodeUnmarshal/BSON-10     3.177m ± 1%   2.202m ± 1%  -30.68% (p=0.000 n=20)
CodeMarshal/BSON-10      2922.7µ ± 0%   757.1µ ± 2%  -74.10% (p=0.000 n=20)
geomean                   3.047m        1.291m       -57.63%

                        │ base.20.txt │             new.20.txt              │
                        │     B/s     │     B/s       vs base               │
CodeUnmarshal/BSON-10     582.5Mi ± 1%   840.4Mi ± 1%   +44.26% (p=0.000 n=20)
CodeMarshal/BSON-10       633.2Mi ± 0%  2444.3Mi ± 2%  +286.03% (p=0.000 n=20)
geomean                   607.3Mi        1.400Gi       +135.99%

                        │ base.20.txt │            new.20.txt             │
                        │    B/op     │    B/op      vs base              │
CodeUnmarshal/BSON-10     4.219Mi ± 0%   4.148Mi ± 0%   -1.69% (p=0.000 n=20)
CodeMarshal/BSON-10       2.818Mi ± 3%   1.630Mi ± 0%  -42.16% (p=0.000 n=20)
geomean                   3.448Mi        2.600Mi       -24.59%

                        │  base.20.txt  │           new.20.txt             │
                        │   allocs/op   │  allocs/op   vs base             │
CodeUnmarshal/BSON-10      230.4k ± 0%     220.7k ± 0%    -4.21% (p=0.000 n=20)
CodeMarshal/BSON-10     94066.000 ± 0%      1.000 ± 0%  -100.00% (p=0.000 n=20)
geomean                    147.2k          469.7         -99.68%
```
Awesome performance improvements! A few requested changes.
```go
pc.l.Unlock()
// TODO(charlie): handle concurrent requests for the same type
enc, err := ec.LookupEncoder(typ.Elem())
enc = pc.ecache.LoadOrStore(typ, enc)
```
Does keeping the local cache in a `PointerCodec` actually get us anything? It seems redundant with the type encoder lookup in `LookupEncoder` (other than the extra call to `reflect.Type.Elem`).
Nice catch, I copied the original logic here to minimize any potential for a change in behavior, but this cache is not needed at all. Will remove.

I'm actually going to look into this a bit more. It looks like this cache might be hot enough to justify skipping the call to `.Elem()`.
bson/benchmark_test.go
```go
		}
	}
})
b.SetBytes(int64(len(codeJSON)))
```
Should this use `codeBSON` instead of `codeJSON`?
I used `codeJSON` since that provides a common reference point between the "BSON" and "JSON" benchmarks (basically it makes the decoding rate in MB/s comparable between the two formats; otherwise they differ due to the differences in encoded size).

That said, it does not accurately represent the BSON decoding speed, so I'm happy to change it to use `len(codeBSON)`.
Ah, interesting, thanks for clarifying the motivation! I am in favor of changing it to `len(codeBSON)` because I think subsequent readers will likely be confused about the intent (as I was), and the "sec/op" measurement already gives us a reasonable comparison between JSON/BSON.
Good point - changed it to use `codeBSON` instead.
```go
		}
	}
})
b.SetBytes(int64(len(codeJSON)))
```
Should this use the byte slice returned by `json.Marshal` instead?
The outputs of `json.Marshal` and `codeJSON` are identical (we test for this when initializing these values):

mongo-go-driver/bson/benchmark_test.go, lines 369 to 377 in 9de8c88:
```go
if data, err = json.Marshal(&codeStruct); err != nil {
	panic("json.Marshal code.json: " + err.Error())
}
if codeBSON, err = Marshal(&codeStruct); err != nil {
	panic("Marshal code.json: " + err.Error())
}
if !bytes.Equal(data, codeJSON) {
```
Ah, thanks for the update! Sounds good.
```go
type kindEncoderCache struct {
	entries [reflect.UnsafePointer + 1]atomic.Value // *kindEncoderCacheEntry
}
```
Using an array sized by `reflect.UnsafePointer` risks becoming a problem if a future version of Go ever introduces a `reflect` kind that is greater than `UnsafePointer`. It doesn't seem to offer a significant performance improvement compared to using a `sync.Map`, so we should use a `sync.Map` here instead.

That suggestion goes for `kindDecoderCache` as well.
Here's the benchmark comparison between using a `[reflect.UnsafePointer + 1]atomic.Value` (old.txt) and a `sync.Map` (new.txt):
```
goos: darwin
goarch: arm64
pkg: go.mongodb.org/mongo-driver/bson
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
CodeUnmarshal/BSON-10 2.492m ± 4% 2.430m ± 5% ~ (p=0.912 n=10)
CodeUnmarshal/JSON-10 3.078m ± 2% 3.038m ± 3% ~ (p=0.075 n=10)
CodeMarshal/BSON-10 1.342m ± 3% 1.322m ± 4% ~ (p=0.353 n=10)
CodeMarshal/JSON-10 497.1µ ± 1% 499.5µ ± 1% ~ (p=0.218 n=10)
geomean 1.504m 1.486m -1.20%
│ old.txt │ new.txt │
│ B/s │ B/s vs base │
CodeUnmarshal/BSON-10 742.7Mi ± 4% 761.5Mi ± 5% ~ (p=0.912 n=10)
CodeUnmarshal/JSON-10 601.2Mi ± 2% 609.2Mi ± 2% ~ (p=0.075 n=10)
CodeMarshal/BSON-10 1.346Gi ± 3% 1.367Gi ± 4% ~ (p=0.353 n=10)
CodeMarshal/JSON-10 3.635Gi ± 1% 3.618Gi ± 1% ~ (p=0.218 n=10)
geomean 1.202Gi 1.216Gi +1.22%
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
CodeUnmarshal/BSON-10 4.219Mi ± 0% 4.219Mi ± 0% ~ (p=0.644 n=10)
CodeUnmarshal/JSON-10 2.904Mi ± 0% 2.904Mi ± 0% ~ (p=0.304 n=10)
CodeMarshal/BSON-10 2.720Mi ± 2% 2.769Mi ± 2% ~ (p=0.063 n=10)
CodeMarshal/JSON-10 1.855Mi ± 1% 1.863Mi ± 0% ~ (p=0.105 n=10)
geomean 2.804Mi 2.820Mi +0.55%
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
CodeUnmarshal/BSON-10 230.4k ± 0% 230.4k ± 0% ~ (p=1.000 n=10) ¹
CodeUnmarshal/JSON-10 92.67k ± 0% 92.67k ± 0% ~ (p=1.000 n=10) ¹
CodeMarshal/BSON-10 94.07k ± 0% 94.07k ± 0% +0.00% (p=0.033 n=10)
CodeMarshal/JSON-10 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=10) ¹
geomean 6.694k 6.694k +0.00%
¹ all samples are equal
```

Edit: Here's a gist with the changes used to get those results.
I can go either way with this, since we quickly store the result returned by the `kind{Encoder,Decoder}Cache` into `Registry.type{Encoders,Decoders}` (honestly, I wish we didn't actually need this and had a better way of handling `lookupInterfaceEncoder`).

That said, it is a lot faster when benchmarked in isolation, and we do test against a new type being added, which shouldn't be possible until Go 2.0; the last time `reflect.Kind` was modified it did not change the `Kind` values.
```
$ go test -run '^$' -bench 'BenchmarkKindEncoderCacheLoad' -v
goos: darwin
goarch: arm64
pkg: go.mongodb.org/mongo-driver/bson/bsoncodec
BenchmarkKindEncoderCacheLoad
BenchmarkKindEncoderCacheLoad/Array
BenchmarkKindEncoderCacheLoad/Array-10      541177398    2.033 ns/op
BenchmarkKindEncoderCacheLoad/SyncMap
BenchmarkKindEncoderCacheLoad/SyncMap-10     82671672   14.64 ns/op
BenchmarkKindEncoderCacheLoad/Map
BenchmarkKindEncoderCacheLoad/Map-10        207793422    5.779 ns/op
PASS
ok  	go.mongodb.org/mongo-driver/bson/bsoncodec	4.638s
```
```go
func BenchmarkKindEncoderCacheLoad(b *testing.B) {
m := new(sync.Map)
x := make(map[reflect.Kind]ValueEncoder)
c := new(kindEncoderCache)
for k := reflect.Kind(0); k <= reflect.UnsafePointer; k++ {
c.Store(k, fakeCodec{})
m.Store(k, fakeCodec{})
x[k] = fakeCodec{}
}
b.Run("Array", func(b *testing.B) {
for i := 0; i < b.N; i++ {
kind := reflect.Kind(i) % (reflect.UnsafePointer + 1)
_, ok := c.Load(kind)
if !ok {
b.Fatal("missing:", kind)
}
}
})
b.Run("SyncMap", func(b *testing.B) {
for i := 0; i < b.N; i++ {
kind := reflect.Kind(i) % (reflect.UnsafePointer + 1)
_, ok := m.Load(kind)
if !ok {
b.Fatal("missing:", kind)
}
}
})
// Map listed here because this is potentially read-only and could be represented as a plain map.
b.Run("Map", func(b *testing.B) {
for i := 0; i < b.N; i++ {
kind := reflect.Kind(i) % (reflect.UnsafePointer + 1)
_, ok := x[kind]
if !ok {
b.Fatal("wat")
}
}
})
}
```
Yeah, `[...]atomic.Value` definitely benchmarks faster in isolation, and it does seem unlikely that Go 1.x will add new `Kind`s anytime soon. I also see the test now. I'm fine keeping it as-is.
My concern is that a user could run an old driver against a new Go version that defines a reflect kind greater than `UnsafePointer`, and a user can hardly be expected to anticipate this issue. I would suggest using a `sync.Map` as long as it performs better than the current implementation. Otherwise, we can move the capacity check into an `init()` in codec_cache.go, like:
```go
func init() {
	rt := reflect.UnsafePointer
	s := rt.String()
	if !strings.Contains(s, strconv.Itoa(int(rt))) {
		panic("the capacity of kindEncoderCache is too small")
	}
}
```
Fixed with: 924008c
Force-pushed df92e60 to 924008c
This check will guard against changes to the reflect.Kind constants.
Force-pushed 924008c to 7bb0482
Apologies for the slow follow-up. Looks good 👍
Thank you for merging this in!
…coder caches with sync.Map. (#1313)
…coder/encoder caches with sync.Map. (mongodb#1313)" This reverts commit 6c7e124.
GODRIVER-2914
Summary
This commit improves BSON marshaling performance by ~58% and unmarshaling performance by ~29% by replacing the mutex-based decoder/encoder caches with sync.Map, which can often avoid locking and is ideally suited for caches that only grow.
The commit also adds the `BenchmarkCodeMarshal` and `BenchmarkCodeUnmarshal` benchmarks from the Go standard library's `encoding/json` package since they do an excellent job of stress testing parallel encoding/decoding (a common use case in a database driver) and are how the lock contention that led to this commit was discovered.

Background & Motivation
Lock contention around the encoder/decoder caches was identified as a serious performance bottleneck, and faster is always better (within reason).