-
Notifications
You must be signed in to change notification settings - Fork 1.1k
PYTHON-4179: Optimize JSON decoding performance by avoiding object_pairs_hook #1493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work this is great!
if json_options.document_class is dict: | ||
kwargs["object_hook"] = lambda pairs: object_hook(pairs, json_options) | ||
else: | ||
kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that native dictionaries are always ordered (we only support Python>=3.7), I don't think we ever need to use object_pairs_hook. I wonder if JSON decoding with document_class=SON or document_class=OrderedDict is also faster using object_hook instead of object_pairs_hook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this and see a comically large improvement. Before:
$ python -m timeit -s '
from bson import SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
1 loop, best of 5: 364 msec per loop
After:
$ python -m timeit -s '
from bson import SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
50 loops, best of 5: 4.82 msec per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear I'm suggesting:
json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
kwargs["object_hook"] = lambda obj: object_hook(obj, json_options)
return json.loads(s, *args, **kwargs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear I'm suggesting:
json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS) kwargs["object_hook"] = lambda obj: object_hook(obj, json_options) return json.loads(s, *args, **kwargs)
It looks like such changes will break backward compatibility and such code will throw an exception.
python -c '
from bson.son import SON1 as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={"1":"2", "2":{"3":"4"}}
json_sting=dumps(doc)
obj=loads(json_sting, json_options=opts)
obj["2"].to_dict()
'
Traceback (most recent call last):
File "<string>", line 8, in <module>
AttributeError: 'dict' object has no attribute 'to_dict'
If this is okay, then we can make this changes.
But if we want to maintain backward compatibility, I would suggest the following changes.
- Replace the SON implementation with such (It will speed up the code you mentioned in this comment PYTHON-4179: Optimize JSON decoding performance by avoiding object_pairs_hook #1493 (comment))
# The only reason for this class is to maintain backward compatibility
# so that the code son_obj.to_dict() and son_obj[key].to_dict() works correctly.
class SON(Dict[_Key, _Value]):
def __init__(self, *args, **kwargs):
warnings.warn(
"Class SON is deprecated and will be removed in version x.x.x; use the default Python dict instead",
category=DeprecationWarning, stacklevel=2)
super().__init__(*args, **kwargs)
for k, v in self.items():
if isinstance(v, dict):
self[k] = SON(v)
def __repr__(self):
return f"SON{super().__repr__()}"
def to_dict(self) -> dict[_Key, _Value]:
...
- Add similar warning if we pass/set the document_class in the JSONOptions object.
And in version x.x.x we can change loads function behavior and remove document_class, and also possibly get rid of SON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python -m timeit -s '
from bson.son import SON_OLD as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
1 loop, best of 5: 427 msec per loop
python -m timeit -s '
from bson.son import SON as SON
from bson.json_util import dumps,loads,DEFAULT_JSON_OPTIONS
opts=DEFAULT_JSON_OPTIONS.with_options(document_class=SON)
doc={str(i): {"a": 1, "b": 2} for i in range(10000)}
json_doc=dumps(doc)
' 'loads(json_doc, json_options=opts)'
10 loops, best of 5: 25.7 msec per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. I was under the incorrect assumption that object_hook also casted the input to document_class.
Also feel free to add your name to doc/contributors.rst if you like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Shane said, great work!
@@ -497,7 +497,11 @@ def loads(s: Union[str, bytes, bytearray], *args: Any, **kwargs: Any) -> Any: | |||
Accepts optional parameter `json_options`. See :class:`JSONOptions`. | |||
""" | |||
json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS) | |||
kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(pairs, json_options) | |||
# Execution time optimization if json_options.document_class is dict | |||
if json_options.document_class is dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if json_options.document_class is dict: | |
if isinstance(json_options.document_class, dict): |
Using isinstance
supports inheritance, is dict
only works for a literal dict
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code will go away with my suggestion to always use object_hook above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose what was meant here is issubclass(json_options.document_class, dict) because
>>> isinstance(dict, dict)
False
However, if you simply use issubclass, loads will return an object of the incorrect type, specifically always just dict.
Similar problem will occur if object_pairs_hook is completely removed without additional modifications. But I think it's worth trying to reduce the number of calls to the constructor of the dict-like class, which could optimize the execution time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, I've noticed that the behavior I described isn't being tested.
https://github.com/mongodb/mongo-python-driver/blob/master/test/test_json_util.py#L565 This test looks as though it's testing, but in reality, it doesn't check the type, and such a test
self.assertEqual(
SON([("foo", "bar"), ("b", 1)]),
{"foo": "bar", "b": 1}
)
will run without errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. We need to add assertions that top-level and embedded objects are all SON with document_class=SON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I'm added this test here: #1509
a08a4be
to
98fd4de
Compare
@ilukyanchikov thanks for the great work here! |
The perf benchmarks confirm a 20-30% decoding improvement: |
Optimized performance by removing object_pairs_hook call if we need the default conversion to dict behavior jira task
In object_hook, a dictionary is already passed, and calling object_pairs_hook for casting pairs to dict is redundant.
I validated my changes with test_json_util cases.
pytest -v -s test/test_json_util.py
Compare performance with TestJson*Decoding cases(changes affect performance only in these cases)
pytest -v -s test/performance/perf_test.py::TestJsonFlatDecoding test/performance/perf_test.py::TestJsonDeepDecoding test/performance/perf_test.py::TestJsonFullDecoding
This fix only works with JSONOptions where document_class is dict, and I haven't found cases where something other than dict was used.