PYTHON-4179: Optimize JSON decoding performance by avoiding object_pairs_hook #1493

ilukyanchikov · 2024-01-31T13:46:59Z

Optimized performance by skipping the object_pairs_hook call when all we need is the default conversion to dict. See the Jira task.
object_hook already receives a dict, so calling object_pairs_hook just to cast the pairs to dict is redundant.
I validated my changes with the test_json_util cases: pytest -v -s test/test_json_util.py
I compared performance using the TestJson*Decoding cases (the change affects performance only in these cases): pytest -v -s test/performance/perf_test.py::TestJsonFlatDecoding test/performance/perf_test.py::TestJsonDeepDecoding test/performance/perf_test.py::TestJsonFullDecoding

| Test Name | master | current branch |
| --- | --- | --- |
| JsonFlatDecoding | 96.18636555725531 | 117.50816018425552 |
| JsonDeepDecoding | 58.337081713027956 | 77.52136671585596 |
| JsonFullDecoding | 33.81260536995342 | 38.88821810950474 |

This fix only works with JSONOptions where document_class is dict, and I haven't found cases where something other than dict was used.
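For context, here is a minimal sketch of the difference between the two hooks, using only the standard-library json module. The names pairs_hook, dict_hook, and seen are illustrative and are not part of bson; the point is only that object_hook is handed a dict the decoder has already built, while object_pairs_hook is handed the raw list of pairs.

```python
import json

seen = []  # records the argument type each hook was called with


def pairs_hook(pairs):
    # object_pairs_hook receives a list of (key, value) tuples per JSON object.
    seen.append(type(pairs))
    return dict(pairs)


def dict_hook(obj):
    # object_hook receives a dict that the decoder has already built.
    seen.append(type(obj))
    return obj


text = '{"a": {"b": 1}}'
with_pairs = json.loads(text, object_pairs_hook=pairs_hook)
with_dict = json.loads(text, object_hook=dict_hook)
```

Each hook runs once per JSON object, innermost first, so the pairs-to-dict conversion in pairs_hook is extra work whenever a plain dict is the desired result.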

ShaneHarvey

Nice work, this is great!

ShaneHarvey · 2024-01-31T18:07:02Z

bson/json_util.py

+    if json_options.document_class is dict:
+        kwargs["object_hook"] = lambda pairs: object_hook(pairs, json_options)
+    else:
+        kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options)


Now that native dictionaries are always ordered (we only support Python>=3.7), I don't think we ever need to use object_pairs_hook. I wonder if JSON decoding with document_class=SON or document_class=OrderedDict is also faster using object_hook instead of object_pairs_hook?

I tested this and see a comically large improvement. Before:

$ python -m timeit -s '
from bson import SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
1 loop, best of 5: 364 msec per loop

After:

$ python -m timeit -s '
from bson import SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
50 loops, best of 5: 4.82 msec per loop

To be clear I'm suggesting:

    json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
    kwargs["object_hook"] = lambda obj: object_hook(obj, json_options)
    return json.loads(s, *args, **kwargs)

It looks like this change would break backward compatibility: code like the following would raise an exception.

python -c '
from bson.son import SON1 as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={"1":"2", "2":{"3":"4"}}
json_string=dumps(doc)
obj=loads(json_string, json_options=opts)
obj["2"].to_dict()
'
Traceback (most recent call last):
  File "<string>", line 8, in <module>
AttributeError: 'dict' object has no attribute 'to_dict'

If this is okay, then we can make these changes.

But if we want to maintain backward compatibility, I would suggest the following changes.

Replace the SON implementation with the following. (It will also speed up the code you mentioned in your comment on this PR.)

# The only reason for this class is to maintain backward compatibility
# so that the code son_obj.to_dict() and son_obj[key].to_dict() works correctly.
class SON(Dict[_Key, _Value]):
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "Class SON is deprecated and will be removed in version x.x.x; use the default Python dict instead",
            category=DeprecationWarning,
            stacklevel=2)
        super().__init__(*args, **kwargs)
        for k, v in self.items():
            if isinstance(v, dict):
                self[k] = SON(v)

    def __repr__(self):
        return f"SON{super().__repr__()}"

    def to_dict(self) -> dict[_Key, _Value]:
        ...

Add a similar warning if document_class is passed to or set on the JSONOptions object.

Then in version x.x.x we can change the behavior of the loads function, remove document_class, and possibly get rid of SON as well.

python -m timeit -s '
from bson.son import SON_OLD as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
1 loop, best of 5: 427 msec per loop

python -m timeit -s '
from bson.son import SON as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
10 loops, best of 5: 25.7 msec per loop
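For anyone who wants to try a comparable measurement without bson installed, the following is a rough stdlib-only sketch. It assumes collections.OrderedDict as a stand-in for a non-dict document_class and a pass-through lambda as the object_hook, so it does not reproduce the SON numbers above; the document size and repeat counts are arbitrary.

```python
import json
import timeit
from collections import OrderedDict

# Same document shape as the timeit commands above, but smaller.
doc = {str(i): {"a": 1, "b": 2} for i in range(2000)}
json_doc = json.dumps(doc)


def decode_with_pairs_hook():
    # Every JSON object is rebuilt from its list of pairs.
    return json.loads(json_doc, object_pairs_hook=OrderedDict)


def decode_with_object_hook():
    # Every JSON object arrives as a ready-made dict and is returned unchanged.
    return json.loads(json_doc, object_hook=lambda obj: obj)


pairs_seconds = min(timeit.repeat(decode_with_pairs_hook, number=5, repeat=3))
hook_seconds = min(timeit.repeat(decode_with_object_hook, number=5, repeat=3))
print(f"object_pairs_hook: {pairs_seconds:.4f}s  object_hook: {hook_seconds:.4f}s")
```

The absolute numbers depend on the machine and Python version; only the relative difference on the same run is meaningful.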

Oh, I see. I was under the incorrect assumption that object_hook also cast the input to document_class.
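The misunderstanding can be reproduced with the standard library alone. This sketch assumes collections.OrderedDict as a stand-in for SON and a pass-through object_hook; it is not the bson implementation, only an illustration of why nested objects come back as plain dict when object_pairs_hook is dropped.

```python
import json
from collections import OrderedDict

text = '{"1": "2", "2": {"3": "4"}}'

# object_pairs_hook constructs every JSON object, nested ones included,
# as the requested class.
via_pairs = json.loads(text, object_pairs_hook=OrderedDict)

# A pass-through object_hook performs no conversion, so every JSON object
# stays a plain dict unless the hook converts it explicitly.
via_hook = json.loads(text, object_hook=lambda obj: obj)

nested_types = (type(via_pairs["2"]), type(via_hook["2"]))
```

Here nested_types holds OrderedDict for the pairs-hook result and dict for the object_hook result, which mirrors the AttributeError on to_dict() shown earlier in the thread.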

bson/json_util.py

ShaneHarvey · 2024-01-31T18:22:58Z

Also feel free to add your name to doc/contributors.rst if you like.

NoahStapp

As Shane said, great work!

NoahStapp · 2024-01-31T18:35:44Z

bson/json_util.py

@@ -497,7 +497,11 @@ def loads(s: Union[str, bytes, bytearray], *args: Any, **kwargs: Any) -> Any:
       Accepts optional parameter `json_options`. See :class:`JSONOptions`.
    """
    json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
-    kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options)
+    # Execution time optimization if json_options.document_class is dict
+    if json_options.document_class is dict:


Suggested change

-    if json_options.document_class is dict:
+    if isinstance(json_options.document_class, dict):

Using isinstance supports inheritance; is dict only works for a literal dict object.

This code will go away with my suggestion to always use object_hook above.

I suppose what was meant here is issubclass(json_options.document_class, dict), because

>>> isinstance(dict, dict)
False

However, if you simply use issubclass, loads will return an object of the wrong type: always a plain dict.
A similar problem will occur if object_pairs_hook is removed completely without additional modifications. Still, I think it's worth trying to reduce the number of calls to the constructor of the dict-like class, which could improve execution time.
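The three checks discussed here behave differently, and the sketch below spells that out. MyDoc and loads_gated are hypothetical names used only for illustration; loads_gated is a simplified stand-in for gating on issubclass and is not the bson loads function.

```python
import json


class MyDoc(dict):
    """Hypothetical dict subclass standing in for a custom document_class."""


# A class is not an instance of dict, so this check never passes for a class.
class_is_instance = isinstance(dict, dict)

# issubclass accepts dict itself and any subclass.
dict_is_subclass = issubclass(dict, dict)
mydoc_is_subclass = issubclass(MyDoc, dict)

# An identity check accepts only dict itself.
mydoc_is_dict = MyDoc is dict


def loads_gated(text, document_class):
    # Simplified stand-in: take the fast path for any dict subclass.
    if issubclass(document_class, dict):
        return json.loads(text, object_hook=lambda obj: obj)
    return json.loads(text, object_pairs_hook=document_class)


result = loads_gated('{"a": {"b": 1}}', MyDoc)
```

With the issubclass gate, result is a plain dict rather than MyDoc, which is the wrong-type problem described above; the identity check avoids it by sending every subclass down the object_pairs_hook path.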

Additionally, I've noticed that the behavior I described isn't being tested.
https://github.com/mongodb/mongo-python-driver/blob/master/test/test_json_util.py#L565 This test looks as though it checks the document class, but in reality it doesn't check the type, so a test such as
    self.assertEqual(
        SON([("foo", "bar"), ("b", 1)]),
        {"foo": "bar", "b": 1}
    )
will run without errors.
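A type-aware assertion could look roughly like the sketch below. It assumes collections.OrderedDict as a stand-in for SON, and assert_all_documents_are is a hypothetical helper, not something taken from the pymongo test suite.

```python
import json
from collections import OrderedDict


def assert_all_documents_are(value, document_class):
    """Recursively check that every mapping in value is exactly document_class."""
    if isinstance(value, dict):
        assert type(value) is document_class, type(value)
        for item in value.values():
            assert_all_documents_are(item, document_class)
    elif isinstance(value, list):
        for item in value:
            assert_all_documents_are(item, document_class)


ordered = OrderedDict([("foo", "bar"), ("b", 1)])
plain = {"foo": "bar", "b": 1}

# Equality compares contents only, so this holds despite the differing types.
contents_equal = ordered == plain

# Decoding with object_pairs_hook yields the requested class at every level.
decoded = json.loads('{"foo": {"b": [{"c": 1}]}}', object_pairs_hook=OrderedDict)
assert_all_documents_are(decoded, OrderedDict)
```

Calling assert_all_documents_are(plain, OrderedDict) raises AssertionError, which is the failure an equality-only test cannot produce.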

Good catch. We need to add assertions that top-level and embedded objects are all SON with document_class=SON.

Update: I've added this test here: #1509

ShaneHarvey · 2024-02-05T22:54:38Z

@ilukyanchikov thanks for the great work here!

ShaneHarvey · 2024-02-05T22:58:40Z

The perf benchmarks confirm a 20-30% decoding improvement:

https://spruce.mongodb.com/task/mongo_python_driver_perf_tests_perf_6.0_standalone_97b9a333c84af5093874880d1d3ae7d5e59a8b59_24_02_05_21_59_14/trend-charts?execution=0

skip object_pairs_hook in case we need default behavior

da12f19

ilukyanchikov requested a review from a team as a code owner January 31, 2024 13:47

ilukyanchikov requested review from blink1073 and removed request for a team January 31, 2024 13:47

blink1073 requested review from NoahStapp and removed request for blink1073 January 31, 2024 14:30

ShaneHarvey changed the title ~~PYTHON-1374: Optimized performance by removing object_pairs_hook call if we need the default conversion to dict behavior~~ PYTHON-4179: Optimize JSON decoding performance by avoiding object_pairs_hook Jan 31, 2024

ShaneHarvey requested changes Jan 31, 2024

NoahStapp requested changes Jan 31, 2024

ilukyanchikov added 2 commits February 1, 2024 03:10

fix arg name

dab1331

update contributors doc

98fd4de

ilukyanchikov force-pushed the PYTHON-1374-5 branch from a08a4be to 98fd4de Compare February 1, 2024 01:10

ilukyanchikov requested review from ShaneHarvey and NoahStapp February 1, 2024 01:35

ShaneHarvey approved these changes Feb 5, 2024

ShaneHarvey merged commit 97b9a33 into mongodb:master Feb 5, 2024

ShaneHarvey mentioned this pull request Feb 5, 2024

PYTHON-4179 Verify document_class type in json_util.loads test #1509

Merged
