Integrating continuous fuzzing by way of OSS-Fuzz #771

Closed
DavidKorczynski opened this issue Jan 18, 2021 · 11 comments

@DavidKorczynski
Contributor

Hi,

I was thinking that it would be nice to set up continuous fuzzing of jsonschema, by way of OSS-Fuzz. In this PR: google/oss-fuzz#4996 I have done exactly that, namely created the necessary logic from an OSS-Fuzz perspective to integrate jsonschema. This includes developing initial fuzzers as well as integrating into OSS-Fuzz.

Essentially, OSS-Fuzz is a free service run by Google that performs continuous fuzzing of important open source projects. The only expectation of integrating into OSS-Fuzz is that bugs will be fixed. This is not a "hard" requirement in that no one enforces this and the main point is if bugs are not fixed then it is a waste of resources to run the fuzzers, which we would like to avoid.

If you would like to integrate, could I please have an email(s) that will get access to the data produced by OSS-Fuzz, such as bug reports, coverage reports and more stats. Notice the emails affiliated with the project will be public in the OSS-Fuzz repo, as they will be part of a configuration file.

@Julian
Member

Julian commented Jan 20, 2021

Hi there. Thanks for the offer / sending the PR.

So -- all of jsonschema's code is pure Python at the minute, so I'd be curious whether OSS-Fuzz could say anything interesting without at least some information about how to generate JSON Schema specification-alike objects. But happy to see what pops out too.

The email address that's in the SECURITY.md file is a decent place to send these.

CC @Zac-HD in case you're interested or have opinions :)

And thanks for raising!

@Julian
Member

Julian commented Jan 20, 2021

I didn't know this, though surely @Zac-HD will, but looks like OSS-Fuzz supports generating data via Hypothesis.

Something like that would be a big improvement over random dictionary poking I think. Zac may already be doing that himself as part of hypothesis-jsonschema? But if not, @DavidKorczynski I think that'd be the right kind of integration with OSS-Fuzz.

@DavidKorczynski
Contributor Author

Am happy to set up a property-based approach with Hypothesis if you are happy to integrate with OSS-Fuzz! If I go ahead with the integration I can submit the fuzzers upstream in this repository instead of keeping them on OSS-Fuzz, then we can also get the property-based testing going - does that sound good?

@Julian
Member

Julian commented Jan 20, 2021

Sounds good to me!

@Zac-HD
Member

Zac-HD commented Jan 21, 2021

I do indeed have opinions! (and wrote the integration docs over the weekend 😉) The real trick here is that Hypothesis supports driving arbitrary tests using a traditional fuzzer, which naturally includes OSS-Fuzz's various backends.

In short, I'd be very surprised if you can discover interesting bugs by parsing strings into JSON and nothing else - though I can imagine this working OK if you had all the JSON tokens ({}[]",0123456789 etc) and schema keywords in a dictionary to randomly splice in; and then started from a decent seed corpus like the upstream test suite plus schemastore.org schemas.
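
To make the "dictionary" idea above concrete, here is a minimal sketch that writes JSON tokens and a few schema keywords in the quoted-entry format that libFuzzer and AFL dictionaries use. The keyword list is illustrative rather than exhaustive, and the helper names (`escape_token`, `make_dictionary`, `dictionary_tokens`) are made up for this example, not part of any existing tooling.

```python
import json

# JSON punctuation and literals, plus a few JSON Schema keywords.
# Both lists are illustrative; a real dictionary would carry many more.
JSON_TOKENS = ["{", "}", "[", "]", ":", ",", '"', "true", "false", "null"]
SCHEMA_KEYWORDS = ["type", "properties", "items", "required", "enum", "$ref"]


def escape_token(token: str) -> str:
    """Escape a token for use inside a quoted dictionary entry."""
    out = []
    for byte in token.encode("utf-8"):
        char = chr(byte)
        if char in ('"', "\\"):
            out.append("\\" + char)          # quotes and backslashes are escaped
        elif 0x20 <= byte < 0x7F:
            out.append(char)                 # printable ASCII is kept as-is
        else:
            out.append(f"\\x{byte:02x}")     # everything else becomes \xNN
    return "".join(out)


def dictionary_tokens():
    """Schema keywords appear as quoted strings in JSON text, so keep the quotes."""
    return JSON_TOKENS + [json.dumps(keyword) for keyword in SCHEMA_KEYWORDS]


def make_dictionary(tokens) -> str:
    """Return the dictionary file contents, one quoted entry per line."""
    return "\n".join(f'"{escape_token(token)}"' for token in tokens) + "\n"


if __name__ == "__main__":
    print(make_dictionary(dictionary_tokens()), end="")
```

The output can be saved to a file and handed to the fuzzing engine alongside a seed corpus; how that file is wired into a particular OSS-Fuzz project is not shown here.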

Fortunately though, hypothesis-jsonschema exists specifically to be the better way to do this! I even have some internal test tooling to generate arbitrary schemas: https://github.com/Zac-HD/hypothesis-jsonschema/blob/master/tests/gen_schemas.py which cover basically everything except $ref, I think 😁

The main tricks will be that:

  1. Not all good property-based tests make good fuzz targets; for this you want an integration-test which can in principle exercise the whole codebase. This makes assertions in the code (vs in the test) particularly powerful.
  2. My tooling, both in hypothesis-jsonschema and in the test code above, generally assumes that jsonschema mostly works. I have discovered several bugs using it, but they are more likely to show up as odd crashes (assert FTW) than as clear test failures, so a quick code audit with this in mind might be useful. That said, Hypothesis' builtin shrinking tends to make the cause of errors quite immediately obvious!

Also happy to collaborate and split any integration reward or direct to charity (e.g. the PSF or one of https://www.givewell.org/charities/top-charities)

@Julian
Member

Julian commented Jan 21, 2021

@Zac-HD I think I followed that (and very helpful as usual).

I can't say in my brain that I know yet what sorts of fuzzing seem useful here but as you say you've certainly found issues via it before so maybe there are more gaps to fill...

> Also happy to collaborate and split any integration reward or direct to charity (e.g. the PSF or one of https://www.givewell.org/charities/top-charities)

This sounds great to me too yeah. Should take it offline maybe to discuss but you know I still have a soft spot for PyPy so throwing some dollars at them to make hypothesis+PyPy support even better is attractive :) but so is PSF.

@DavidKorczynski
Contributor Author

My general view on refining the fuzzer to rely on more structural approaches is that it should be verified empirically. The argument is that the coverage-guided aspects of the fuzzing engine will be great at coming up with inputs that satisfy the various input structures of the target application. I think this is particularly true in a case like this where the execution speed is high, the structural complexity of JSON is relatively low (say in comparison to PDFs or image formats), and OSS-Fuzz will throw significant CPU power at it. The original fuzzer starts reaching jsonschema code within seconds. Based on these my personal view is to refine only after we get empirical results, i.e. if we get results then that's great and if not then we should refine. Naturally I respect the view of the maintainers - but my personal advice would be to either not refine at first or have both.

The perspective the fuzzer takes (the original one) was simply to follow the pattern described here https://pypi.org/project/jsonschema/ i.e. the comment in the code A sample schema, like what we'd get from json.load().

@Julian
Member

Julian commented Jan 21, 2021

> Based on these my personal view is to refine only after we get empirical results,

The question though to me is what results we are expecting. For a normal fuzzing process that OSS-Fuzz is running, it seems to me often that is "the software doesn't crash", especially if it's fuzzing code in memory-unsafe languages.

If a fuzzer is to say anything useful though about JSON Schema (and this library jsonschema) I think it needs to know sets of invariants we believe to be true, and then it can go find counterexamples to them if they exist.

If I'm understanding your comment I think you're saying that we should test one invariant "valid JSON doesn't blow up jsonschema", but yeah I don't know that I expect fuzzing to produce much that's interesting there, but who knows...

I think if I follow @Zac-HD's comment:

> Not all good property-based tests make good fuzz targets; for this you want an integration-test which can in principle exercise the whole codebase. This makes assertions in the code (vs in the test) particularly powerful.

That looks more like what I'd expect, namely if the key invariant is "valid pairs of schemas and instances produce successful jsonschema output" and "invalid pairs of schemas and instances produce unsuccessful jsonschema output" then there's likely to be more bang-for-the-buck there.

But I'm as I say also willing to go with what the experts say :) so you @DavidKorczynski may be more familiar with OSS-Fuzz and I know @Zac-HD is more familiar with property testing in general so I'd be willing to defer.
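
The paired invariant described above can be sketched by generating schema/instance pairs whose verdict is known by construction and comparing that against what a validator reports. This is only an illustration: `toy_is_valid` is a hypothetical stand-in that understands integer `minimum`/`maximum` schemas and nothing else, and `generate_cases` and `find_counterexamples` are names invented for this example. With the real library one would presumably pass something like `lambda instance, schema: jsonschema.Draft7Validator(schema).is_valid(instance)` instead.

```python
import random


def toy_is_valid(instance, schema):
    """Stand-in validator (not jsonschema): integer range schemas only."""
    if isinstance(instance, bool) or not isinstance(instance, int):
        return False
    return schema["minimum"] <= instance <= schema["maximum"]


def generate_cases(rng, count=100):
    """Yield (schema, instance, expected) triples with a known verdict."""
    for _ in range(count):
        low = rng.randint(-50, 50)
        high = low + rng.randint(0, 20)
        schema = {"type": "integer", "minimum": low, "maximum": high}
        yield schema, rng.randint(low, high), True       # inside the range
        yield schema, high + rng.randint(1, 10), False   # above the maximum
        yield schema, low - rng.randint(1, 10), False    # below the minimum
        yield schema, str(low), False                    # wrong type


def find_counterexamples(is_valid, rng, count=100):
    """Return every case where the validator disagrees with the known verdict."""
    return [
        (schema, instance, expected)
        for schema, instance, expected in generate_cases(rng, count)
        if is_valid(instance, schema) != expected
    ]


if __name__ == "__main__":
    print(find_counterexamples(toy_is_valid, random.Random(1)))
```

An empty list means both halves of the invariant held for every generated pair; any entry is a concrete counterexample to look at.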

@DavidKorczynski
Contributor Author

DavidKorczynski commented Jan 21, 2021

Ah right - now I understand. The bugs that I am after are unhandled exceptions. validate promises to throw ValidationError or SchemaError - that's the guarantee I am after, i.e. if we can trigger other types of exceptions assuming we give input acceptable by validate.
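
As a rough sketch of that guarantee, the harness below treats the two documented exception types as expected and lets anything else propagate, which a fuzzing engine would report as a crash. It is not the fuzzer from the OSS-Fuzz PR: the exception classes and `toy_validate` are stand-ins so the example runs without jsonschema installed, and encoding the input as a JSON `[instance, schema]` pair is an assumption made for illustration. With the real library, `validate` and `expected` would presumably be `jsonschema.validate` and `(jsonschema.ValidationError, jsonschema.SchemaError)`.

```python
import json


class ValidationError(Exception):
    """Stand-in for jsonschema.ValidationError."""


class SchemaError(Exception):
    """Stand-in for jsonschema.SchemaError."""


def toy_validate(instance, schema):
    """Hypothetical stand-in for jsonschema.validate; only knows "type": "string"."""
    if not isinstance(schema, (dict, bool)):
        raise SchemaError("schema must be an object or a boolean")
    if isinstance(schema, dict) and schema.get("type") == "string":
        if not isinstance(instance, str):
            raise ValidationError("instance is not a string")


def fuzz_one_input(data, validate=toy_validate,
                   expected=(ValidationError, SchemaError)):
    """Run one fuzz input; only unexpected exceptions escape."""
    try:
        pair = json.loads(data.decode("utf-8", errors="replace"))
    except (ValueError, RecursionError):
        return  # not parseable JSON, so there is nothing to validate
    if not (isinstance(pair, list) and len(pair) == 2):
        return  # this sketch expects a JSON [instance, schema] pair
    instance, schema = pair
    try:
        validate(instance, schema)
    except expected:
        pass  # documented failure modes are not bugs
    # Any other exception propagates to the caller (the fuzzing engine).


if __name__ == "__main__":
    fuzz_one_input(b'[1, {"type": "string"}]')  # ValidationError is swallowed
```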

@Julian
Member

Julian commented Jan 21, 2021

Probably you know this but validate only makes that "guarantee" for some subset of objects, yes? E.g. if you give it the working equivalent of:

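class Foo(dict):
    def __getitem__(self, key):
        # Simulate a mapping whose lookup fails with a non-validation error.
        if key == "12":
            raise ZeroDivisionError()
        # Delegate to dict; self.__dict__ holds attributes, not the items.
        return super().__getitem__(key)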

you indeed may get ZeroDivisionError (i.e. it does not wrap other exceptions emitted during validation).

But for suitable subsets of objects, ones I assume you'll use as fuzzing input, then yeah.

(And fair enough, maybe we start there.)

@Zac-HD
Member

Zac-HD commented Jan 22, 2021

Re: donation of any integration reward

I'd be very happy to direct it to PyPy for use at their discretion... and to include a note suggesting that efficient code coverage would be great for fuzzers 😉

Fuzzing with consume_unicode

I think this could be effective - it's certainly worth including both this and the Hypothesis-based harness (see e.g. Levels of Fuzzing). I just think that including a "dictionary" of tokens and a decent seed corpus would make it considerably more efficient.

More sophisticated tests

I'd love to write a harness which generates a random valid schema, then random known-valid objects, and checks that they validate as expected. I already know that via #686 this isn't always the case! Metamorphic testing also has some super-powerful ideas. For example, given a valid schema and instance, deleting a constraint from the schema should never make the instance invalid... and with time we could think of many more, I'm sure.

But this is a conversation for later 🙂
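
For reference, the "deleting a constraint" relation mentioned above could be sketched roughly as follows. Everything here is a toy: `is_valid` is a hypothetical validator for three purely restrictive keywords on integers, not jsonschema, and the function names are invented for the example. One caveat to keep in mind is that in full JSON Schema some keywords interact (for example, removing `properties` next to `additionalProperties: false` can turn a valid instance invalid), so a real harness would have to pick which keywords it deletes.

```python
import random

# Each keyword maps to a check that can only reject, never accept more.
CHECKS = {
    "minimum": lambda value, bound: value >= bound,
    "maximum": lambda value, bound: value <= bound,
    "multipleOf": lambda value, factor: value % factor == 0,
}


def is_valid(instance, schema):
    """Toy validator: the instance must pass every keyword in the schema."""
    return all(CHECKS[key](instance, arg) for key, arg in schema.items())


def random_schema(rng):
    """Pick a random non-empty subset of the supported keywords."""
    candidates = {
        "minimum": rng.randint(-10, 0),
        "maximum": rng.randint(0, 10),
        "multipleOf": rng.randint(1, 3),
    }
    keys = rng.sample(sorted(candidates), rng.randint(1, 3))
    return {key: candidates[key] for key in keys}


def check_deleting_constraints(validator, rng, rounds=200):
    """Return (schema, weakened, instance) triples that break the relation."""
    failures = []
    for _ in range(rounds):
        schema = random_schema(rng)
        instance = rng.randint(-10, 10)
        if not validator(instance, schema):
            continue  # the relation only says something about valid pairs
        for keyword in schema:
            weakened = {k: v for k, v in schema.items() if k != keyword}
            if not validator(instance, weakened):
                failures.append((schema, weakened, instance))
    return failures


if __name__ == "__main__":
    print(check_deleting_constraints(is_valid, random.Random(0)))
```

The appeal of this style is that no oracle for "the right answer" is needed: the check only compares the validator against itself on related inputs.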

@Julian Julian closed this as completed Mar 13, 2021
V02460 added a commit to V02460/jsonschema that referenced this issue May 26, 2025