replace org.json:json with AOSP json in MetaData #16

cldellow · 2019-07-17T09:40:46Z

This is a followup to #15 (comment)

I think the ideal end state would be to sever metadata representation from serialization. The existing org.json approach extends JSONObject, and hangs some custom behaviours on the MetaData. That may not be the pragmatic end state, though. Indeed, I went down that path, but it began to look like a fairly invasive change.

Instead, I figured I'd break it into three steps:

move to Google's Apache 2 licensed reimplementation of org.json, released as part of the Android Open Source Project, to deal with the licensing concern
decouple serialization from the representation (eg still have MetaData extend JSONObject, but move to jsoniter or other)
consider dropping the org.json representation (eg move to plain Map<String, Object> or List<Object> instead of JSONObject and JSONArray)

This PR is just (1), but as it turns out, even that gets a speed boost. Running org.archive.extract.ResourceExtractor -wat on two different WARCs from the November 2018 crawl:

warc1: before 282 s, after 251 s, 11% faster
warc2: before 276 s, after 249 s, 10% faster

Both WARCs were in the OS file cache prior to running, so this should mainly be measuring CPU use, not I/O.

I now wonder if the marginal value of steps (2) and (3) outweighs their risks:

the non-free license has been removed
a performance improvement was achieved
this is a relatively low-risk change in that it churns few files
to achieve the best speed boost from jsoniter/jackson requires a well-defined static type that you can generate code for (vs enumerating maps at runtime)
taking a dependency on jackson (or jsoniter, which uses jackson) introduces an ops burden for future Hadoop upgrades, if the jackson versions are mismatched or not shaded correctly

Before I go further, I figured I'd share what I had and get feedback. Comments appreciated!

`org.json:json` has a non-standard license. This commit drops it from the POM and replaces it with the Apache 2 licensed clean room reimplementation done by Google as part of the Android Open Source Project. See https://android.googlesource.com/platform/libcore/+/ics-plus-aosp/json/ for original code. This doesn't build - a future commit will patch the necessary places in the CC to adapt call sites for some incompatible API changes in Google's reimplementation.

`append` appears to have become `accumulate`, e.g. `JSONTokener` no longer takes a `Reader`

cldellow · 2019-07-17T09:49:31Z

Some comments about testing:

I haven't added any tests, as I'm not sure what the desired level of testing / test philosophy is. A smoke test that uses a warc record for a single file (provided as a test resource) and verifies the output matches something?
I tested manually by comparing two runs. Except for timestamps/UUIDs, I believe the only changes of interest are that the old serializer escaped Unicode references, but the new one leaves them as literals. eg "–" vs "\u2013"
@sebastian-nagel was right to be suspicious of header handling. Set-Cookie, e.g. is a header that often occurs multiple times in an HTTP response, but we only capture the last one in due to

ia-web-commons/src/main/java/org/archive/resource/http/HTTPResponseResource.java

Line 64 in c7e79be

headers.putString(h.getName(),h.getValue());

. I think this is a problem with the spec, though, and can't be remedied without breaking spec.

cldellow · 2019-07-17T09:54:21Z

And before I forget - I wasn't able to validate the change in https://github.com/commoncrawl/ia-web-commons/pull/16/files#diff-4775e792517029f7a0ee1542d80904d5R31, as I'm not sure what code path exercises it.

sebastian-nagel · 2019-07-19T16:42:16Z

Thanks, @cldellow! I'll test it.

header handling. Set-Cookie, e.g. is a header that often occurs multiple times in an HTTP response, but we only capture the last one

JSON seems somewhat indifferent regarding repeating keys:

"The JSON syntax does not
impose any restrictions on the strings used as names, does not require that name be unique, and does not assign any significance to the ordering of name/value pairs." (ISO-IEC-27778-2017)
"The names within an object SHOULD be unique." (RFC 4627)

But in any case, many JSON libs backed by a simple map would just keep one value and throw away the other.

sebastian-nagel · 2019-07-22T14:09:20Z

Thanks @cldellow,

I've tested your improvements and

also see a performance improvement (7-8% but that may depend on the hardware and the tested WARC files)
and can confirm that despite the sorting/ordering of keys in JSON objects there are no changes to the WAT output.

I would agree that it isn't worth to try to change the code substantially to get little more improvements.

Next step was to integrate the change into the production system. However, if the lib (webarchive-commons) is used as an upstream dependency, then the included org.json replacement classes may cause troubles because they're not fully compatible to the API of the original org.json package. There are two options to fix it:

shade the included org.json package, so that it's possible to have it in parallel to some other org.json package
use another replacement.

Shading would still require to maintain the classes, ie. the maintainers webarchive-commons would need to follow the change history and watch out for potential issues.

That's why I've decided to try Ted Dunning's JSON package and added it as an upstream dependency. It's derived from the Android classes and still maintained. Unfortunately, it also not fully compatible to the API of org.json package, although the required adoption is minimal (see #17 as an alternative solution). However, it turned out that it's less performant:

using the original org.json lib
Android classes / PR replace org.json:json with AOSP json in MetaData #16
Ted Dunning's open-json library

sebastian-nagel · 2019-07-22T14:35:48Z

Ted Dunning's open-json library with a fix applied (tdunning/open-json#13)

cldellow · 2019-07-23T13:17:48Z

re duplicate keys in a map for headers:

Good point! Multiple entries would give the client the ability to extract the value via a streaming mode, even if the non-streaming mode does something like last one wins/first one wins, etc. I opened #18 to track that. I can submit a separate PR once the other JSON change has been made.

re json lib:

Nice find! I did a bit of looking to see if someone maintained a port but didn't find those two repos.

I'll close this PR, since it seems like #17 will supersede it.

sebastian-nagel · 2019-07-23T21:51:52Z

Thanks, @cldellow! It was a step into the right direction. And thanks for opening #18.

cldellow added 2 commits July 16, 2019 17:13

Adapt to AOSP JSONObject signatures

18101d5

`append` appears to have become `accumulate`, e.g. `JSONTokener` no longer takes a `Reader`

cldellow mentioned this pull request Jul 17, 2019

WAT/WET generator performance improvements #15

Closed

sebastian-nagel mentioned this pull request Jul 22, 2019

Replace the org.json dependency by openjson #17

Merged

sebastian-nagel mentioned this pull request Jul 22, 2019

Speed up serialization of String values in JSONObject tdunning/open-json#13

Closed

cldellow mentioned this pull request Jul 23, 2019

WAT extraction: handle duplicate HTTP response headers #18

Closed

cldellow closed this Jul 23, 2019

sebastian-nagel mentioned this pull request Oct 22, 2019

Replace the org.json dependency by openjson library iipc/webarchive-commons#84

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

replace org.json:json with AOSP json in MetaData #16

replace org.json:json with AOSP json in MetaData #16

Uh oh!

cldellow commented Jul 17, 2019

Uh oh!

cldellow commented Jul 17, 2019

Uh oh!

cldellow commented Jul 17, 2019

Uh oh!

sebastian-nagel commented Jul 19, 2019

Uh oh!

sebastian-nagel commented Jul 22, 2019 •

edited

Loading

Uh oh!

sebastian-nagel commented Jul 22, 2019 •

edited

Loading

Uh oh!

cldellow commented Jul 23, 2019

Uh oh!

sebastian-nagel commented Jul 23, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

replace org.json:json with AOSP json in MetaData #16

replace org.json:json with AOSP json in MetaData #16

Uh oh!

Conversation

cldellow commented Jul 17, 2019

Uh oh!

cldellow commented Jul 17, 2019

Uh oh!

cldellow commented Jul 17, 2019

Uh oh!

sebastian-nagel commented Jul 19, 2019

Uh oh!

sebastian-nagel commented Jul 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sebastian-nagel commented Jul 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cldellow commented Jul 23, 2019

Uh oh!

sebastian-nagel commented Jul 23, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sebastian-nagel commented Jul 22, 2019 •

edited

Loading

sebastian-nagel commented Jul 22, 2019 •

edited

Loading