-
Notifications
You must be signed in to change notification settings - Fork 1.1k
PYTHON-4179: Optimize JSON decoding performance by avoiding object_pairs_hook #1493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -497,7 +497,11 @@ def loads(s: Union[str, bytes, bytearray], *args: Any, **kwargs: Any) -> Any: | |
Accepts optional parameter `json_options`. See :class:`JSONOptions`. | ||
""" | ||
json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS) | ||
kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options) | ||
# Execution time optimization if json_options.document_class is dict | ||
if json_options.document_class is dict: | ||
kwargs["object_hook"] = lambda obj: object_hook(obj, json_options) | ||
else: | ||
kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Now that native dictionaries are always ordered (we only support Python>=3.7), I don't think we ever need to use object_pairs_hook. I wonder if JSON decoding with document_class=SON or document_class=OrderedDict is also faster using object_hook instead of object_pairs_hook? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tested this and see a comically large improvement. Before:
After:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be clear I'm suggesting: json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
kwargs["object_hook"] = lambda obj: object_hook(obj, json_options)
return json.loads(s, *args, **kwargs) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It looks like such changes will break backward compatibility and such code will throw an exception.
If this is okay, then we can make this changes. But if we want to maintain backward compatibility, I would suggest the following changes.
And in version x.x.x we can change loads function behavior and remove document_class, and also possibly get rid of SON. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh I see. I was under the incorrect assumption that object_hook also casted the input to document_class. |
||
return json.loads(s, *args, **kwargs) | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using
isinstance
supports inheritance,is dict
only works for a literaldict
object.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code will go away with my suggestion to always use object_hook above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose what was meant here is issubclass(json_options.document_class, dict) because
However, if you simply use issubclass, loads will return an object of the incorrect type, specifically always just dict.
Similar problem will occur if object_pairs_hook is completely removed without additional modifications. But I think it's worth trying to reduce the number of calls to the constructor of the dict-like class, which could optimize the execution time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, I've noticed that the behavior I described isn't being tested.
https://github.com/mongodb/mongo-python-driver/blob/master/test/test_json_util.py#L565 This test looks as though it's testing, but in reality, it doesn't check the type, and such a test
self.assertEqual(
SON([("foo", "bar"), ("b", 1)]),
{"foo": "bar", "b": 1}
)
will run without errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. We need to add assertions that top-level and embedded objects are all SON with document_class=SON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I'm added this test here: #1509