Skip to content

PYTHON-4179: Optimize JSON decoding performance by avoiding object_pairs_hook #1493

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 5, 2024

Conversation

ilukyanchikov
Copy link
Contributor

Optimized performance by removing object_pairs_hook call if we need the default conversion to dict behavior jira task
In object_hook, a dictionary is already passed, and calling object_pairs_hook for casting pairs to dict is redundant.
I validated my changes with test_json_util cases. pytest -v -s test/test_json_util.py
Compare performance with TestJson*Decoding cases(changes affect performance only in these cases) pytest -v -s test/performance/perf_test.py::TestJsonFlatDecoding test/performance/perf_test.py::TestJsonDeepDecoding test/performance/perf_test.py::TestJsonFullDecoding

Test Name master current branch
JsonFlatDecoding 96.18636555725531 117.50816018425552
JsonDeepDecoding 58.337081713027956 77.52136671585596
JsonFullDecoding 33.81260536995342 38.88821810950474

This fix only works with JSONOptions where document_class is dict, and I haven't found cases where something other than dict was used.

@ilukyanchikov ilukyanchikov requested a review from a team as a code owner January 31, 2024 13:47
@ilukyanchikov ilukyanchikov requested review from blink1073 and removed request for a team January 31, 2024 13:47
@blink1073 blink1073 requested review from NoahStapp and removed request for blink1073 January 31, 2024 14:30
@ShaneHarvey ShaneHarvey changed the title PYTHON-1374: Optimized performance by removing object_pairs_hook call if we need the default conversion to dict behavior PYTHON-4179: Optimize JSON decoding performance by avoiding object_pairs_hook Jan 31, 2024
Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work this is great!

if json_options.document_class is dict:
kwargs["object_hook"] = lambda pairs: object_hook(pairs, json_options)
else:
kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that native dictionaries are always ordered (we only support Python>=3.7), I don't think we ever need to use object_pairs_hook. I wonder if JSON decoding with document_class=SON or document_class=OrderedDict is also faster using object_hook instead of object_pairs_hook?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this and see a comically large improvement. Before:

$ python -m timeit -s '
from bson import SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
1 loop, best of 5: 364 msec per loop

After:

$ python -m timeit -s '
from bson import SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
50 loops, best of 5: 4.82 msec per loop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear I'm suggesting:

json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
kwargs["object_hook"] = lambda obj: object_hook(obj, json_options)
return json.loads(s, *args, **kwargs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear I'm suggesting:

json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
kwargs["object_hook"] = lambda obj: object_hook(obj, json_options)
return json.loads(s, *args, **kwargs)

It looks like such changes will break backward compatibility and such code will throw an exception.

python -c '
from bson.son import SON1 as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={"1":"2", "2":{"3":"4"}}
json_sting=dumps(doc)
obj=loads(json_sting, json_options=opts)
obj["2"].to_dict()
'
Traceback (most recent call last):
  File "<string>", line 8, in <module>
AttributeError: 'dict' object has no attribute 'to_dict'

If this is okay, then we can make this changes.

But if we want to maintain backward compatibility, I would suggest the following changes.

# The only reason for this class is to maintain backward compatibility
# so that the code son_obj.to_dict() and son_obj[key].to_dict() works correctly.
class SON(Dict[_Key, _Value]):
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "Class SON is deprecated and will be removed in version x.x.x; use the default Python dict instead",
            category=DeprecationWarning, stacklevel=2)
        super().__init__(*args, **kwargs)
        for k, v in self.items():
            if isinstance(v, dict):
                self[k] = SON(v)

    def __repr__(self):
        return f"SON{super().__repr__()}"

    def to_dict(self) -> dict[_Key, _Value]:
        ...
  • Add similar warning if we pass/set the document_class in the JSONOptions object.

And in version x.x.x we can change loads function behavior and remove document_class, and also possibly get rid of SON.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python -m timeit -s '
from bson.son import SON_OLD as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
1 loop, best of 5: 427 msec per loop
python -m timeit -s '
from bson.son import SON as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
10 loops, best of 5: 25.7 msec per loop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. I was under the incorrect assumption that object_hook also casted the input to document_class.

@ShaneHarvey
Copy link
Member

Also feel free to add your name to doc/contributors.rst if you like.

Copy link
Contributor

@NoahStapp NoahStapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Shane said, great work!

@@ -497,7 +497,11 @@ def loads(s: Union[str, bytes, bytearray], *args: Any, **kwargs: Any) -> Any:
Accepts optional parameter `json_options`. See :class:`JSONOptions`.
"""
json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options)
# Execution time optimization if json_options.document_class is dict
if json_options.document_class is dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if json_options.document_class is dict:
if isinstance(json_options.document_class, dict):

Using isinstance supports inheritance, is dict only works for a literal dict object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will go away with my suggestion to always use object_hook above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose what was meant here is issubclass(json_options.document_class, dict) because

>>> isinstance(dict, dict)
False

However, if you simply use issubclass, loads will return an object of the incorrect type, specifically always just dict.
Similar problem will occur if object_pairs_hook is completely removed without additional modifications. But I think it's worth trying to reduce the number of calls to the constructor of the dict-like class, which could optimize the execution time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I've noticed that the behavior I described isn't being tested.
https://github.com/mongodb/mongo-python-driver/blob/master/test/test_json_util.py#L565 This test looks as though it's testing, but in reality, it doesn't check the type, and such a test
self.assertEqual(
SON([("foo", "bar"), ("b", 1)]),
{"foo": "bar", "b": 1}
)
will run without errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. We need to add assertions that top-level and embedded objects are all SON with document_class=SON.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I'm added this test here: #1509

@ShaneHarvey
Copy link
Member

@ilukyanchikov thanks for the great work here!

@ShaneHarvey
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants