Redesign codecs #233
Conversation
Force-pushed from 85bfcb3 to 9448930.
Build is still failing; #235 is needed to fix the build error.
Codecov Report

@@            Coverage Diff             @@
##           master    #233      +/-   ##
==========================================
- Coverage    70.2%   70.15%   -0.05%
==========================================
  Files          68       68
  Lines        4692     4715      +23
  Branches      623      632       +9
==========================================
+ Hits         3294     3308      +14
- Misses       1258     1267       +9
  Partials      140      140
==========================================

Continue to review full report at Codecov.
Managed to get the build passing by temporarily merging #235. Here is the build: https://travis-ci.org/scrapinghub/frontera/builds/183326040
Force-pushed from 2a5aeb0 to 409b49a.
Force-pushed from e8fb2e7 to 31777f6.
A great contribution for this annoying problem, and I'm happy that it will be solved soon. Thank you.
from w3lib.util import to_unicode, to_bytes


def _convert(obj):
This is not entirely about conversion but about encoding with type preservation; I propose changing the name to _encode_recursively().
I agree. Will do.
done
    return 'other', obj


def _reconvert(obj):
the same here
will do
done
    elif hasattr(obj, '__dict__'):
        return serialize(obj.__dict__)
    else:
        return None
This function silently skips the object if it's not possible to serialize it. I propose keeping this behavior, but making it emit a warning.
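A minimal sketch of what that could look like (the dict/list/string branches of serialize are elided here; the exact wording and placement in the PR may differ):

```python
import warnings

def serialize(obj):
    # ... branches handling dicts, lists, strings and numbers elided ...
    if hasattr(obj, '__dict__'):
        return serialize(obj.__dict__)
    warnings.warn("Can not serialize object of type %s, skipping it" % type(obj))
    return None
```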
I have added this function back with a minor change: I have excluded the encoding part for six.text_type, as this will now be handled by msgpack.
    def encode_add_seeds(self, seeds):
-       return packb([b'as', [_prepare_request_message(seed) for seed in seeds]])
+       return packb([b'as', [_prepare_request_message(seed) for seed in seeds]], use_bin_type=True, encoding="utf-8")
According to the short msgpack-python documentation, it's recommended to use the encoding argument with unpackb, not with packb.
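For reference, a round trip under msgpack-python 0.4/0.5 semantics might look like this (in msgpack >= 1.0 the encoding argument was removed in favor of raw=False):

```python
from msgpack import packb, unpackb

# use_bin_type=True makes packb keep bytes and unicode apart on the wire
packed = packb([b'as', u'some text'], use_bin_type=True)

# the encoding argument belongs on unpackb: it decodes str-typed
# payloads back to unicode while leaving bin-typed payloads as bytes
unpacked = unpackb(packed, encoding='utf-8')
assert unpacked == [b'as', u'some text']
```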
done
        super(Encoder, self).__init__(request_model, *a, **kw)

    def encode(self, obj):
        converted = _convert(obj)
Currently, only meta requires type information to be stored along with the content. Type information causes message-size growth and introduces CPU overhead on parsing. So I suggest reimplementing this to make it efficient; the same is true (and especially so) for msgpack. I don't care much about JSON codec performance, but it's very important for msgpack to be efficient, because it's used in large deployments.
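As a rough illustration of this suggestion, type tagging could be limited to meta when building the request message. This is a hypothetical sketch (field names follow Frontera's Request model), not code from the PR:

```python
def _prepare_request_message(request):
    # Hypothetical: only meta carries type tags, since the other fields
    # have known, fixed types; _convert is the type-tagging helper
    # discussed in this thread.
    return {
        'url': request.url,
        'method': request.method,
        'headers': request.headers,
        'cookies': request.cookies,
        'meta': _convert(request.meta),
    }
```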
Won't this error occur for headers, cookies, or in fact all dict-type objects? Right now issues are only reported for meta.
Also, any suggestions to make this efficient for msgpack?
This is something I haven't fixed yet. @Preetwinder, your input here would be of great value.
        enc = encoder(Request, send_body=True)
        dec = decoder(Request, Response)
-       req = Request(url="http://www.yandex.ru",method=b'GET', meta={b"test": b"shmest"}, headers={b'reqhdr': b'value'})
+       req = Request(url="http://www.yandex.ru", method=b'GET',
I suggest explicitly checking the type of string objects in meta, to prove that your updated codec code really respects the type and preserves it.
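A hedged sketch of such a check in the test (the round-trip entry points and helper names here are assumptions; the actual test may differ):

```python
req = Request(url="http://www.yandex.ru", method=b'GET',
              meta={b'test': b'shmest', u'unicode': u'text'},
              headers={b'reqhdr': b'value'})
# hypothetical round trip through the codec under test
req2 = dec.decode_request(enc.encode_request(req))
# types must survive the round trip, not merely compare equal
assert isinstance(req2.meta[b'test'], bytes)
assert isinstance(req2.meta[u'unicode'], type(u''))
```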
Will do
done
Also @Preetwinder, please have a look. Your opinion would be much appreciated.
@Preetwinder Since this PR overwrites some of your changes, your input here would be highly appreciated.
@voith Thanks for your hard work! I've reviewed the code and it looks good. I like the idea of using tuples for the JSON encoding/decoding to reduce the payload size, as opposed to a dict (or at least that's what I think you're doing).
Force-pushed from a420692 to 9ac2b49.
@chuckus Thanks for your feedback. My idea to use tuples was to create a generator, but unfortunately,
Rebased. @sibiryakov Can I get a second review? I have still not implemented your suggestion for optimization; I'm waiting for your reply on the comments I left in the code review.
    elif hasattr(obj, '__dict__'):
        return serialize(obj.__dict__)
    else:
        return None
Let's make it generate a warning that an object of some non-serializable type was skipped.
import pytest


def _compare_dicts(dict1, dict2):
Is there any reason to have such complex machinery for comparing dicts?
Just wanted to be a little more precise.
For me it was not immediately obvious, but I think the main difference from built-in dict comparison is this line: https://github.com/scrapinghub/frontera/pull/233/files#diff-252cc4202700183e8f0c782fb28dbfc2R34. It will catch, e.g., Python 2 dicts with equal str and unicode values, int/float, etc.
tests/test_codecs.py
    dicts_are_equal = True
    for key in dict1.keys():
        if type(dict1[key]) is dict:
            dicts_are_equal = dicts_are_equal and _compare_dicts(dict1[key], dict2[key])
Here you could return False early in case of inequality; there is no need to iterate to the end.
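A short sketch of the early-return variant (hypothetical; the helper in the PR also compares value types, per the discussion above):

```python
def _compare_dicts(dict1, dict2):
    if len(dict1) != len(dict2):
        return False
    for key, value in dict1.items():
        if key not in dict2:
            return False
        other = dict2[key]
        if type(value) is not type(other):
            return False          # catches e.g. str vs unicode on Python 2
        if type(value) is dict:
            if not _compare_dicts(value, other):
                return False      # stop at the first mismatch
        elif value != other:
            return False
    return True
```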
Otherwise looks good!
LGTM.
@Preetwinder Thanks for your review. @sibiryakov Is there anything else that needs to be done to get this merged?
Are there any changes needed in the documentation? For example, here: http://frontera.readthedocs.io/en/latest/topics/frontier-objects.html#frontera.core.models.Request.meta
    def _response_from_object(self, obj):
-       url = to_native_str(obj[b'url'])
+       url = obj['url']
        request = self._request_model(url=url,
Are the keys guaranteed to be native strings for this and all the following obj accesses?
Have a look at this line: https://github.com/scrapinghub/frontera/pull/233/files?diff=split#diff-6cd48fdd254557ffc1b50d60a0323364R154. The previous method called in the decoder was dict_to_bytes, which was changing the native-string types to bytes. But the encoder was never changed, and it used native strings. Since the encoder takes care to use native strings, the decoder need not worry about any native-string conversion.
If url had type bytes, then after decoding it will have type bytes, and in Frontera URLs have to be native strings (when in process). So maybe we still need a preventive conversion to native strings.
> If url had type 'bytes' then after decoding it will have type bytes

You are right that this PR saves the type of the object originally given to the codec and that, on decoding, it returns a value of the original type.

> and in Frontera URLs have to be native strings

If that is the case, then such type checking and conversion should be done somewhere else, not in a codec.

> so maybe we still need a preventive conversion to native strings

A codec should never change the type of the object it was originally given. Forcing such a mechanism into a codec is a hacky way of getting around this problem.
> A codec should never change the type of the object it was originally given. Forcing such a mechanism into a codec is a hacky way of getting around this problem.

Agree, it's not a codec's responsibility. But then we would need to check the rest of the code and make sure we're using native strings for URLs. URLs could appear in the crawling strategy (SW), backend, message bus, or converters. Such a change could break code relying on URLs being native strings.
OK, false alarm probably: https://github.com/scrapinghub/frontera/blob/master/frontera/core/models.py#L28. Any thoughts, @Preetwinder?
Yes, URLs shouldn't be a problem; they are guaranteed to be native strings. @voith I understand. The reason for using dict_to_bytes was that the JSON encoder returns unicode strings and we wanted meta, headers, etc. to be bytes. But it seems the new function takes care of that.
    elif isinstance(obj, (list, tuple)):
        return type(obj).__name__, [_encode_recursively(item) for item in obj]
    return 'other', obj
Isn't storing the types 'other' and 'dict' redundant, since the JSON encoder will take care of these types?
A dict can have nested dicts or other kinds of objects. Since we are doing some kind of conversion here, we want to save the type for a successful de-conversion; the JSON encoder taking care of it is not sufficient. We need to decode this back to its original form, and for that we need to save the type.
Although this implementation currently only converts strings, lists, and dicts, it can easily be extended to something like sets.
If you have any other implementation in mind, please let me know.
Oh, and I borrowed this code from https://github.com/samuel/python-bert/blob/master/bert/codec.py#L93-L123; I just simplified it for this use case.
Still, it's not clear why 'dict' and 'other' are needed. Please be more specific. My concern is about integration with other components: imagine somebody else decides to implement this codec in another language. The less type-related information we store here, the easier that would be.
Okay, this is a bit hard to explain, but I'll try my best through a series of examples. Let's take Preetwinder's suggestion and see the potential problems with it:

> Isn't storing the types 'other' and 'dict' redundant, since the JSON encoder will take care of these types?

If we had to implement this suggestion, the code would look somewhat like this:
def _encode_recursively(obj):
    if isinstance(obj, bytes):
        return ('bytes', to_unicode(obj))
    elif isinstance(obj, dict):
        return {_encode_recursively(k): _encode_recursively(v) for k, v in six.iteritems(obj)}
    elif isinstance(obj, (list, tuple)):
        return obj.__class__(_encode_recursively(item) for item in obj)
    return obj
Now consider this:
In [59]: import json

In [60]: encoder = json.JSONEncoder()
    ...: decoder = json.JSONDecoder()

In [61]: meta = {b'byte_key': 'str_value'}

In [62]: encoded_value = _encode_recursively(meta)  # using the above defined code

In [63]: print(encoded_value)
{('bytes', 'byte_key'): 'str_value'}

In [64]: encoder.encode(encoded_value)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-65-abb76ad6421f> in <module>()
----> 1 encoder.encode(encoded_value)

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py in encode(self, o)
    196         # exceptions aren't as detailed.  The list call should be roughly
    197         # equivalent to the PySequence_Fast that ''.join() would do.
--> 198         chunks = self.iterencode(o, _one_shot=True)
    199         if not isinstance(chunks, (list, tuple)):
    200             chunks = list(chunks)

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py in iterencode(self, o, _one_shot)
    254             self.key_separator, self.item_separator, self.sort_keys,
    255             self.skipkeys, _one_shot)
--> 256         return _iterencode(o, 0)
    257
    258     def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

TypeError: keys must be a string
So the goal here is to use a data structure that serializes well with JSONEncoder and also helps store the type.
After a lot of thinking, and with reference to this code, I decided that a tuple would be the best way to store this. But note that with this design a dict object is converted to a tuple (for serialization purposes), so its type has to be stored; otherwise the information needed for decoding is lost.
Even a tuple has to be converted and its type stored, so that we can clearly differentiate a tuple used by the encoder from one actually received and converted.
So the code I came up with converts any basic Python object to the form (original_object_type, converted_object_value), and applies this recursively, since objects can be nested.
The way decoding works is that the decoder knows the first element of an encoded object always contains the type information, and the second always contains the value, which in turn is decoded recursively to recover the original object.
I hope this at least explains why the type of a dict has to be saved.
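For illustration, a decoder matching this scheme might look roughly like the following (a sketch close in spirit to the PR, not a verbatim copy; it assumes dicts are encoded as ('dict', [(key_pair, value_pair), ...])):

```python
from w3lib.util import to_bytes

def _decode_recursively(obj):
    # every encoded value is a (saved_type, converted_value) pair
    obj_type, obj_value = obj
    if obj_type == 'bytes':
        return to_bytes(obj_value, 'utf-8')
    if obj_type == 'dict':
        return dict((_decode_recursively(k), _decode_recursively(v))
                    for k, v in obj_value)
    if obj_type in ('list', 'tuple'):
        seq_type = list if obj_type == 'list' else tuple
        return seq_type(_decode_recursively(item) for item in obj_value)
    return obj_value  # 'other': ints, floats, unicode, None, ...
```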
Yes, thanks for the explanation, it's much clearer now. OK, let's leave the dict there. But could you leave a comment in the code explaining why, at least in short form?
Thanks for the explanation, I get it.
@sibiryakov will add a docstring.
@Preetwinder Thank you for your review. I will address your concerns tonight.
Sorry for keeping this PR on hold; I have been a little busy lately.
NP, I had vacations too ;)
def _encode_recursively(obj):
    """
    recursively encodes an object to unicode and store its type
A nitpick: you can't encode to unicode, because unicode is not an encoding (at least in Python); you can only decode to unicode.
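In Python terms, the direction of the two operations:

```python
u = b'http://example.com'.decode('utf-8')  # bytes -> unicode: decoding
b = u'http://example.com'.encode('utf-8')  # unicode -> bytes: encoding
```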
@kmike Thanks for the review. I had previously named this method convert: 33347d9#diff-6cd48fdd254557ffc1b50d60a0323364R12.
Side note: I think I should start contributing to Scrapy; there's a lot I can learn from your reviews :)
@Preetwinder I hope I've addressed your concerns. I do agree that some doc changes are needed, but after having a look at #243, it seems we need a lot of refactoring.
I've also tested this PR manually with the HBase backend and it works fine for me.
@sibiryakov Please don't merge this; I still have to apply kmike's suggestion.
I think we're close. Let's clarify all the issues and try to merge it.
@sibiryakov Your concerns are real. I'll try to address these issues when I get some more time. Finding it hard to manage time! But I'll get back to this as soon as possible.
@sibiryakov I have tried to address your comments as best as I possibly could. Please let me know if there's still any doubt. It's a bit late here now (IST), so I can't apply kmike's suggestion yet. But I hope I have at least addressed your concerns.
Have you tried crawling in distributed backends mode with the JSON codec, @voith?
@sibiryakov No, I haven't. I only checked that it's not blowing anything up. But I have tested msgpack in distributed mode. I will let you know when I finish testing the JSON codec in distributed mode.
Force-pushed from 635388d to 29d0310.
I have changed the names of the methods, but I didn't really get time to test the JSON codec with the distributed backend. If anyone in the community could help me test it, that would be great. In any case, it'll take some more time to test this thoroughly.
-   def _decode_recursively(obj):
+   def _revert_from_saved_type(obj):
IMO revert is used in the wrong context here (http://dictionary.cambridge.org/dictionary/english/revert-to-sth); convert_from_saved_type would be better.
will do
renamed
OK, final call before I merge it. @voith please rename the revert_* method.
    first element of the nested tuple as the original type information and
    the second value to be the converted value. It applies the original type
    recursively on the object to retrieve the original form of the object.
    """
This is super minor, but I would maybe add a check that obj has exactly two items, e.g. with obj_type, obj_value = obj, just in case some list is passed instead of an object returned by _convert_and_save_type.
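Per the commit message later in the thread, the guard that was eventually added amounts to something like this (body elided):

```python
def _convert_from_saved_type(obj):
    # obj must be the 2-item (type, value) pair produced by _convert_and_save_type
    assert len(obj) == 2
    obj_type, obj_value = obj
    ...
```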
Great, thanks @voith! 👍
Looks good to me! I didn't review the changes related to msgpack, as I don't know it well enough.
@sibiryakov Here's a suspicious line I found. I haven't really tested it, but I have a feeling it will break. I think some cleanup is also needed as part of this PR: all strings related to Scrapy should be native strings. Also, I did try to test the JSON codec with the distributed backend, but I was having some problems with my HBase setup. I have recently shifted to a Mac and
Yeah @voith, I noticed that. But that was broken before your change, and I don't see how it is connected to this PR.
@sibiryakov You're probably right; I wasn't entirely sure. But give me a day to manually test everything. I managed to set up HBase locally. I will try to verify it tonight and then let you know.
Sure @voith
@sibiryakov I have finished testing the JSON codec in distributed mode and can confirm that everything works. The only error I was getting has been fixed in #252; if I apply that patch to this branch, everything works fine. Also, that patch has nothing to do with this PR: the bug existed before I made this PR. This PR should be merged after #252 is merged.
I expect some merge conflicts in this branch if #252 gets merged first. I will resolve them later.
OK, rebase it pls.
added test for testing encoded unicode values in meta using the msgpack codec
fixed message_bus_backend_test
added tests for _convert and _reconvert
made the msgpack requirement >=0.4
made suggested changes
added a warning message for non-serializable objects and added an early return
renamed the encode/decode methods to convert/revert and added docstrings to them
renamed method _revert_from_saved_type to _convert_from_saved_type
replaced "revert" with "restores" in a docstring
added an assertion to check object length in _convert_from_saved_type
resolved merge conflicts
Force-pushed from b7f5a2d to 9c1b873.
@sibiryakov I have rebased the PR and fixed the merge conflicts.
Congratulations, and thank you @voith, @Preetwinder, and others! 🍺 🍺 🍕
Thanks to everyone who reviewed this PR and helped me make the code better! :)
Thank you for bearing with all this!
No problem @sibiryakov. I personally learnt a lot from this PR.
Issue discussed here: #211 (comment)

Todo List

This PR fixes #211.

Other things done here besides the todo list:

- Added _convert and _reconvert in the JSON codec. These are needed because JSONEncoder accepts strings only as unicode; _convert converts objects recursively to unicode and saves their type.
- Fixed test_message_bus_backend, which got exposed after fixing the codecs.