PYTHON-4179: Optimize JSON decoding performance by avoiding object_pairs_hook #1493

Merged
merged 3 commits into from
Feb 5, 2024
6 changes: 5 additions & 1 deletion bson/json_util.py
@@ -497,7 +497,11 @@ def loads(s: Union[str, bytes, bytearray], *args: Any, **kwargs: Any) -> Any:
     Accepts optional parameter `json_options`. See :class:`JSONOptions`.
     """
     json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
-    kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options)
+    # Execution time optimization if json_options.document_class is dict
+    if json_options.document_class is dict:
Contributor

Suggested change
- if json_options.document_class is dict:
+ if isinstance(json_options.document_class, dict):

Using isinstance supports inheritance, whereas is dict matches only the dict class itself.

Member

This code will go away with my suggestion to always use object_hook above.

Contributor Author

I suppose what was meant here is issubclass(json_options.document_class, dict), because

>>> isinstance(dict, dict)
False

However, if you simply use issubclass, loads will return an object of the wrong type: always a plain dict.
A similar problem will occur if object_pairs_hook is removed completely without additional modifications. Still, I think it's worth trying to reduce the number of calls to the constructor of the dict-like class, which could reduce execution time.
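
The distinction between the three checks discussed above can be sketched with the standard library alone. MyDoc and the two takes_fast_path_* helpers are hypothetical names used only for illustration; they are not part of bson.json_util.

```python
from collections import OrderedDict


class MyDoc(dict):
    """Hypothetical dict subclass standing in for a custom document_class."""


def takes_fast_path_identity(document_class):
    # True only when document_class is the built-in dict type itself.
    return document_class is dict


def takes_fast_path_subclass(document_class):
    # True for dict and for every subclass of dict.
    return issubclass(document_class, dict)


# isinstance asks whether an object is a dict instance; the dict class is not one.
print(isinstance(dict, dict))

for cls in (dict, MyDoc, OrderedDict):
    print(cls.__name__, takes_fast_path_identity(cls), takes_fast_path_subclass(cls))
```

With the issubclass variant, MyDoc and OrderedDict would also be routed to the dict-only branch, which is the case the comment above warns about.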

Contributor Author

Additionally, I've noticed that the behavior I described isn't tested.
https://github.com/mongodb/mongo-python-driver/blob/master/test/test_json_util.py#L565 This test appears to cover it, but it does not actually check the type of the result, so an assertion such as
self.assertEqual(
    SON([("foo", "bar"), ("b", 1)]),
    {"foo": "bar", "b": 1}
)
passes without errors.
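
A minimal sketch of the gap, using collections.OrderedDict as a stand-in for SON so that it runs without pymongo installed. The helper all_documents_are is a hypothetical name, not something from the test suite; it shows one way a test could check the type of the top-level and embedded documents in addition to their contents.

```python
from collections import OrderedDict


def all_documents_are(value, document_class):
    """Return True if every mapping in value, at any depth, has exactly the type document_class."""
    if isinstance(value, dict):
        return type(value) is document_class and all(
            all_documents_are(item, document_class) for item in value.values()
        )
    if isinstance(value, list):
        return all(all_documents_are(item, document_class) for item in value)
    return True


expected = OrderedDict([("foo", "bar"), ("b", OrderedDict([("c", 1)]))])
mixed = OrderedDict([("foo", "bar"), ("b", {"c": 1})])  # embedded plain dict

# Equality compares contents only, so it cannot tell the two apart.
print(expected == mixed)
# The explicit type check can.
print(all_documents_are(expected, OrderedDict), all_documents_are(mixed, OrderedDict))
```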

Member

Good catch. We need to add assertions that top-level and embedded objects are all SON with document_class=SON.

Member

Update: I added this test in #1509.

+        kwargs["object_hook"] = lambda obj: object_hook(obj, json_options)
+    else:
+        kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options)
Member

Now that native dictionaries are always ordered (we only support Python>=3.7), I don't think we ever need to use object_pairs_hook. I wonder if JSON decoding with document_class=SON or document_class=OrderedDict is also faster using object_hook instead of object_pairs_hook?

Member

I tested this and see a comically large improvement. Before:

$ python -m timeit -s '
from bson import SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
1 loop, best of 5: 364 msec per loop

After:

$ python -m timeit -s '
from bson import SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
50 loops, best of 5: 4.82 msec per loop
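
For readers without a pymongo checkout, a comparison of this shape can be sketched against the standard json module. This is only a harness under stated assumptions: OrderedDict stands in for SON, the two decode_* function names are made up for illustration, and the timings it prints will not match the figures above.

```python
import json
import timeit
from collections import OrderedDict

# Same document shape as in the timeit commands above.
doc = {str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc = json.dumps(doc)


def decode_with_pairs_hook():
    # The decoder passes each object's (key, value) pairs to the hook.
    return json.loads(json_doc, object_pairs_hook=OrderedDict)


def decode_with_object_hook():
    # The decoder builds a plain dict first, then passes it to the hook.
    return json.loads(json_doc, object_hook=OrderedDict)


for fn in (decode_with_pairs_hook, decode_with_object_hook):
    loops = 3
    best = min(timeit.repeat(fn, number=loops, repeat=3))
    print(f"{fn.__name__}: {best / loops * 1000:.2f} msec per loop")
```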

Member

To be clear I'm suggesting:

json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
kwargs["object_hook"] = lambda obj: object_hook(obj, json_options)
return json.loads(s, *args, **kwargs)

Contributor Author

> To be clear I'm suggesting:
>
> json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
> kwargs["object_hook"] = lambda obj: object_hook(obj, json_options)
> return json.loads(s, *args, **kwargs)

It looks like this change would break backward compatibility: code such as the following would raise an exception.

python -c '
from bson.son import SON1 as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={"1":"2", "2":{"3":"4"}}
json_sting=dumps(doc)
obj=loads(json_sting, json_options=opts)
obj["2"].to_dict()
'
Traceback (most recent call last):
  File "<string>", line 8, in <module>
AttributeError: 'dict' object has no attribute 'to_dict'

If this is acceptable, we can make this change.

But if we want to maintain backward compatibility, I would suggest the following changes.

# The only reason for this class is to maintain backward compatibility
# so that the code son_obj.to_dict() and son_obj[key].to_dict() works correctly.
class SON(Dict[_Key, _Value]):
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "Class SON is deprecated and will be removed in version x.x.x; use the default Python dict instead",
            category=DeprecationWarning, stacklevel=2)
        super().__init__(*args, **kwargs)
        for k, v in self.items():
            if isinstance(v, dict):
                self[k] = SON(v)

    def __repr__(self):
        return f"SON{super().__repr__()}"

    def to_dict(self) -> dict[_Key, _Value]:
        ...
  • Add a similar warning when document_class is passed to or set on the JSONOptions object.

Then, in version x.x.x, we can change the behavior of loads, remove document_class, and possibly get rid of SON as well.

Contributor Author

python -m timeit -s '
from bson.son import SON_OLD as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
1 loop, best of 5: 427 msec per loop
python -m timeit -s '
from bson.son import SON as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
10 loops, best of 5: 25.7 msec per loop

Member

Oh, I see. I had incorrectly assumed that object_hook also cast the input to document_class.
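
The underlying behavior of the standard json module can be sketched as follows. Doc is a hypothetical dict-like class with a to_dict method, standing in for SON; the identity lambda stands in for a hook that transforms values but does not convert the container type. Neither is taken from bson.json_util.

```python
import json
from collections import OrderedDict


class Doc(OrderedDict):
    """Hypothetical stand-in for a document_class such as SON."""

    def to_dict(self):
        return dict(self)


text = '{"1": "2", "2": {"3": "4"}}'

# object_pairs_hook receives a list of pairs, so the hook itself builds the container.
via_pairs = json.loads(text, object_pairs_hook=Doc)

# object_hook receives a plain dict that the decoder has already built.
# A hook that returns it unchanged leaves every level as dict.
via_hook = json.loads(text, object_hook=lambda obj: obj)

# Converting inside object_hook restores the type, one constructor call per object.
converted = json.loads(text, object_hook=Doc)

print(type(via_pairs["2"]).__name__)
print(type(via_hook["2"]).__name__)
print(type(converted["2"]).__name__)
```

Calling via_hook["2"].to_dict() fails with the same kind of AttributeError shown in the traceback earlier in this thread, because the embedded object is a plain dict.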

     return json.loads(s, *args, **kwargs)

1 change: 1 addition & 0 deletions doc/contributors.rst
@@ -99,3 +99,4 @@ The following is a list of people who have contributed to
 - Iris Ho (sleepyStick)
 - Stephan Hof (stephan-hof)
 - Casey Clements (caseyclements)
+- Ivan Lukyanchikov (ilukyanchikov)