-
Notifications
You must be signed in to change notification settings - Fork 6
replace org.json:json with AOSP json in MetaData #16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
`org.json:json` has a non-standard license. This commit drops it from the POM and replaces it with the Apache 2 licensed clean room reimplementation done by Google as part of the Android Open Source Project. See https://android.googlesource.com/platform/libcore/+/ics-plus-aosp/json/ for original code. This doesn't build - a future commit will patch the necessary places in the CC to adapt call sites for some incompatible API changes in Google's reimplementation.
`append` appears to have become `accumulate`, e.g. `JSONTokener` no longer takes a `Reader`
|
Some comments about testing:
|
|
And before I forget - I wasn't able to validate the change in https://github.com/commoncrawl/ia-web-commons/pull/16/files#diff-4775e792517029f7a0ee1542d80904d5R31, as I'm not sure what code path exercises it. |
|
Thanks, @cldellow! I'll test it.
JSON seems somewhat indifferent regarding repeating keys:
But in any case, many JSON libs backed by a simple map would just keep one value and throw away the other. |
|
Thanks @cldellow, I've tested your improvements and
I would agree that it isn't worth to try to change the code substantially to get little more improvements. Next step was to integrate the change into the production system. However, if the lib (webarchive-commons) is used as an upstream dependency, then the included org.json replacement classes may cause troubles because they're not fully compatible to the API of the original org.json package. There are two options to fix it:
Shading would still require to maintain the classes, ie. the maintainers webarchive-commons would need to follow the change history and watch out for potential issues. That's why I've decided to try Ted Dunning's JSON package and added it as an upstream dependency. It's derived from the Android classes and still maintained. Unfortunately, it also not fully compatible to the API of org.json package, although the required adoption is minimal (see #17 as an alternative solution). However, it turned out that it's less performant:
|
|
|
re duplicate keys in a map for headers: Good point! Multiple entries would give the client the ability to extract the value via a streaming mode, even if the non-streaming mode does something like last one wins/first one wins, etc. I opened #18 to track that. I can submit a separate PR once the other JSON change has been made. re json lib: Nice find! I did a bit of looking to see if someone maintained a port but didn't find those two repos. I'll close this PR, since it seems like #17 will supersede it. |




This is a followup to #15 (comment)
I think the ideal end state would be to sever metadata representation from serialization. The existing
org.jsonapproach extendsJSONObject, and hangs some custom behaviours on theMetaData. That may not be the pragmatic end state, though. Indeed, I went down that path, but it began to look like a fairly invasive change.Instead, I figured I'd break it into three steps:
move to Google's Apache 2 licensed reimplementation of
org.json, released as part of the Android Open Source Project, to deal with the licensing concerndecouple serialization from the representation (eg still have
MetaDataextendJSONObject, but move to jsoniter or other)consider dropping the
org.jsonrepresentation (eg move to plainMap<String, Object>orList<Object>instead ofJSONObjectandJSONArray)This PR is just (1), but as it turns out, even that gets a speed boost. Running
org.archive.extract.ResourceExtractor -waton two different WARCs from the November 2018 crawl:warc1: before 282 s, after 251 s, 11% faster
warc2: before 276 s, after 249 s, 10% faster
Both WARCs were in the OS file cache prior to running, so this should mainly be measuring CPU use, not I/O.
I now wonder if the marginal value of steps (2) and (3) outweighs their risks:
Before I go further, I figured I'd share what I had and get feedback. Comments appreciated!