DRIVERS-2917 - Standardized Performance Testing of ODMs and Integrations #1828
Conversation
> @@ -0,0 +1 @@
> {"field1":"miNVpaKW","field2":"CS5VwrwN","field3":"Oq5Csk1w","field4":"ZPm57dhu","field5":"gxUpzIjg","field6":"Smo9whci","field7":"TW34kfzq","field8":55336395,"field9":41992681,"field10":72188733,"field11":46660880,"field12":3527055,"field13":74094448} (no newline at end of file)
format
@Jibola are you asking for a trailing newline here?
Suggested change (add a trailing newline):

> {"field1":"miNVpaKW","field2":"CS5VwrwN","field3":"Oq5Csk1w","field4":"ZPm57dhu","field5":"gxUpzIjg","field6":"Smo9whci","field7":"TW34kfzq","field8":55336395,"field9":41992681,"field10":72188733,"field11":46660880,"field12":3527055,"field13":74094448}
Yep
> ### Benchmark placement and scheduling
>
> The MongoDB ODM Performance Benchmark should be placed within the ODM's test directory as an independent test suite. Due
I still think we should leave an option for folks to create their own benchmarking repo if that helps out. I'm open to others' takes on this one, seeing as I worry about maintainers not wanting a benchmark repo.
We don't agree that they should be in the tests directory but haven't ruled out including them in the ODM. For the purposes of getting the spec done, I wonder if requiring the ODM to document the location of the test suite is enough. If not, I would definitely remove the "test directory" requirement and make it "should be placed within the ODM". I think that is enough to make it clear that the goal is to have the perf tests included in the ODM.
I don't think a separate benchmark repo is a good choice here. We could reach out to existing maintainers and see if they want to weigh in, but I imagine having a separate repo for benchmarking is more trouble than it's worth.
I worry that ODMs might not be receptive to a large addition of performance tests to their repository. The ticket makes it sound like we (DBX) planned to run these tests ourselves, probably in a CI we maintain:
> The testing criteria would be documented in a human readable form (such as either a docs page or a markdown file), and once benchmarks have been developed we would run these against each new notable release of the client library. Providing well documented benchmarks will hopefully also encourage the developer community to contribute additional tests to further improve coverage.
I don't see any mention of where these tests will live in the scope, either.
Why do we plan on contributing spec tests to ODM repos, instead of creating a pipeline similar to the AI testing pipeline? Or just integrating perf testing within drivers' existing performance testing? We already have the test runners and infrastructure to run these ourselves. And to @ajcvickers's point, we already have dedicated performance test hosts in Evergreen that are stable and isolated from other hosts in CI.
I don't believe there was any concrete plan one way or the other at the time the ticket and scope were created.
In my view, there are a few fundamental differences between the libraries being tested here versus for AI integrations.
- Many ODMs are or are planned to be first-party libraries rather than contributions to third-party AI frameworks.
- The AI space moves extremely rapidly and broken CI/CD or testing suites are extremely common. Both factors were significant motivators in the creation of our AI testing pipeline. Those motivations don't seem to exist here.
- AI frameworks tend to have several to dozens of integrations all housed within a single repo, each with their own dependencies and tests. Third-party ODMs are more often standalone repos with far less complexity in this manner, so adding a single additional test file for performance testing is much less significant.
What would integrating perf testing within the existing drivers perf testing look like? Would all of the ODM benchmarks live in a separate repo, with each driver cloning and using the specific subdirectory that contains the ODMs they want to test?
Using the same skeleton of test runners and infrastructure for the ODM testing makes it very easy to get these tests up and running without polluting the existing drivers tests.
For ODMs where we own the repo, it's easy to run the performance suite on our existing test infrastructure. For external ODMs, unless the maintainers are fine with us adding config to run evergreen tasks directly into the repo, we'll have to set up a separate ODM testing pipeline repo similar to the AI/ML one that already exists.
Perhaps it's worth adjusting the wording here accordingly, then
Circling back here, it looks like the agreement solely for test placement is this:
First Party ODMs (libraries we maintain e.g. Django, Hibernate, EFCore ...)
- Are required to include performance tests directly in the repository. They need to be in an explicit test directory, but do not need to be run as part of the standard set of CI/CD to confirm functionality.
External ODMs (libraries we contribute to e.g. Mongoose, Spring Data, Doctrine ...)
- Make best-effort attempts at including tests directly in the repository. If that is not possible, create a separate repository that clones the target repository and then executes the performance benchmark. These tests should still be in a clear `tests` directory.
Have I understood that correctly?
cc: @dariakp , @baileympearson , @NoahStapp
What is meant by "the standard set of CI/CD to confirm functionality"? I would expect the performance tests to be run on the same standard infrastructure we maintain for the ODM.
For external ODMs, there should be only one separate repository for all ODMs, similar to how the AI/ML testing pipeline operates.
Sorry, to clarify. I meant they do not need to be run at the same time as the unit and integration tests.
> to the relatively long runtime of the benchmarks, including them as part of an automated suite that runs against every
> PR is not recommended. Instead, scheduling benchmark runs on a regular cadence is the recommended method of automating
> this suite of tests.
Per your suggestion earlier, we should include some new information about testing mainline use cases.
> As discussed earlier in this document, ODM feature sets vary significantly across libraries. Many ODMs have features
> unique to them or their niche in the wider ecosystem, which makes specifying concrete benchmark test cases for every
> possible API unfeasible. Instead, ODM authors should determine what mainline use cases of their library are not covered
> by the benchmarks specified above and expand this testing suite with additional benchmarks to cover those areas.
This section is attempting to specify that ODMs should implement additional benchmark tests to cover mainline use cases that do not fall into those included in this specification. One example would be the use of Django's in filter operator: Model.objects.filter(field__in=["some_val"]).
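For illustration, a hedged sketch of what one such additional benchmark might look like for that Django use case; `SmallDoc` and `run_benchmark` are hypothetical names, not defined by this specification:

```python
# Hypothetical ODM-specific benchmark for Django's __in filter operator.
from myapp.models import SmallDoc              # assumed model mapped to SMALL_DOC
from benchmarks.harness import run_benchmark   # assumed timing harness


def setup():
    # Insert 10,000 small documents and keep a subset of values to query for.
    SmallDoc.objects.bulk_create([SmallDoc(field1=f"value-{i}") for i in range(10_000)])
    return [f"value-{i}" for i in range(0, 10_000, 100)]


def task(values):
    # Mainline use case under test: the __in filter operator.
    return list(SmallDoc.objects.filter(field1__in=values))


if __name__ == "__main__":
    values = setup()
    run_benchmark(lambda: task(values))
```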
JamesKovacs
left a comment
Overall looking good. The most pressing concerns are around the percentile calculation and picking a stable server version to test against.
> - Sort the array into ascending order (i.e. shortest time first)
> - Let the index i for percentile p in the range [1,100] be defined as: `i = int(N * p / 100) - 1`
>
> *N.B. This is the [Nearest Rank](https://en.wikipedia.org/wiki/Percentile#The_Nearest_Rank_method) algorithm, chosen for
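For illustration only, a minimal Python sketch of the index calculation quoted above (assuming `timings` is the list of per-iteration wall-clock times collected for a task):

```python
def nearest_rank_percentile(timings, p):
    """Percentile via the formula quoted above: i = int(N * p / 100) - 1."""
    data = sorted(timings)        # ascending order, shortest time first
    n = len(data)
    i = int(n * p / 100) - 1      # index for percentile p in [1, 100]
    return data[max(i, 0)]        # guard against i == -1 for very small p and N


# Example usage with a handful of fake timings (seconds).
timings = [1.2, 0.9, 1.5, 1.1, 1.3]
print(nearest_rank_percentile(timings, 50), nearest_rank_percentile(timings, 99))
```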
#The_Nearest_Rank_method anchor should be #The_nearest-rank_method.
This anchor needs to be corrected.
> Unless otherwise specified, the number of iterations to measure per task is variable:
>
> - iterations should loop for at least 30 seconds cumulative execution time
> - iterations should stop after 10 iterations or 1 minute cumulative execution time, whichever is shorter
Those two conditions seem to be working at cross purposes. The measurement should loop for at least 30 seconds but not more than 60, but stop after 10 iterations. This caps the number of iterations at 10, possibly fewer if each iteration takes longer than 6 seconds.
This is confusing on my part (also taken from the driver benchmarking spec).
The intent is a 30-second minimum execution time with a 120-second execution time cap. Once the minimum time is reached, the benchmark stops as soon as at least 10 iterations have completed, or once it reaches 120 seconds of execution time, whichever comes first.
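A sketch of that intent, assuming a `run_once()` callable that executes a single benchmark iteration (the names and harness here are illustrative, not part of the spec):

```python
import time

MIN_SECONDS = 30      # minimum cumulative execution time
MAX_SECONDS = 120     # hard cap on cumulative execution time
MIN_ITERATIONS = 10   # once past the minimum time, stop after this many iterations


def measure(run_once):
    """Collect per-iteration timings following the intent described above."""
    timings = []
    elapsed = 0.0
    while True:
        start = time.perf_counter()
        run_once()
        duration = time.perf_counter() - start
        timings.append(duration)
        elapsed += duration
        if elapsed < MIN_SECONDS:
            continue  # never stop before the 30-second minimum
        if len(timings) >= MIN_ITERATIONS or elapsed >= MAX_SECONDS:
            return timings
```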
> The data will be stored as strict JSON with no extended types. These JSON representations must be converted into
> equivalent models as part of each benchmark task.
>
> Flat model benchmark tasks include:s
Extraneous s at the end of the line.
Not addressed yet.
> | Phase | Description |
> | ----- | ----------- |
> | Setup | Load the SMALL_DOC dataset into memory as an ODM-appropriate model object. Insert 10,000 instances into the database, saving the inserted `id` field for each into a list. |
Is this the _id?
Yes. I'll update the wording to clarify since ODM naming conventions for the document _id will vary.
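A rough sketch of that setup phase, assuming a hypothetical `SmallDoc` model class mapped to the SMALL_DOC dataset (the model name and the attribute exposing the document `_id` will vary by ODM):

```python
import json

from myapp.models import SmallDoc  # hypothetical ODM model mapped to SMALL_DOC


def setup_small_doc(path="small_doc.json", count=10_000):
    """Load SMALL_DOC, insert 10,000 model instances, and record their _ids."""
    with open(path) as f:
        fields = json.load(f)

    inserted_ids = []
    for _ in range(count):
        doc = SmallDoc(**fields)      # one model instance per document
        doc.save()                    # insert into the database
        inserted_ids.append(doc.id)   # the ODM's accessor for the document _id
    return inserted_ids
```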
> Summary: This benchmark tests ODM performance creating a single large model.
>
> Dataset: The dataset (LARGE_DOC) is contained within `large_doc.json` and consists of a sample document stored as strict
> JSON with an encoded length of approximately 8,000 bytes.
8,000 bytes is still relatively small. Do we want to have perf tests for huge documents close to the 16MB limit? While we may not recommend large models, customers will run into these scenarios especially if their models contain large arrays of subdocuments.
Close to the 16MB limit seems excessive and would increase execution time significantly. Increasing the size here to be a few MB, similar to what the driver benchmarks use for their large document tests, would likely result in similar performance characteristics without as large of a latency hit. The downside to increasing the size of documents here is that we need to define the data's structure carefully to not significantly complicate the process of model creation for implementing ODMs, which is not a concern for the driver benchmarks.
A few MB makes sense to me for the reasons stated. If the intent is to test large payloads, we could have a binary field that contains a 1MB JPG of a leaf. (e.g. Something incompressible.)
Missed this, sorry. My main concern with a several MB document is modeling it in a way that is representative of real data manipulated using an ODM. The existing driver benchmark simply uses a document with a massive amount of top-level fields, which is not the kind of structure you would expect for data wrapped by an ODM. Something closer to a very large array of embedded documents could be a better representation, or your suggestion of a large binary field.
I'll try adding a large binary field and see how practical that is.
I've updated the flat large_doc.json to be 3MB using a binary image field.
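As a hedged sketch only (the actual structure and contents of `large_doc.json` are defined by the spec's data files, not by this snippet), a flat document of roughly that size could be produced with a base64-encoded binary payload:

```python
import base64
import json
import os

# Illustrative only: generate a flat document whose bulk is a single binary field.
# base64 expands the payload by about a third, so ~2.2 MB of bytes yields ~3 MB of JSON.
def make_large_doc(binary_size=2_200_000):
    payload = os.urandom(binary_size)  # incompressible binary data
    return {
        "field1": "example",
        "image": base64.b64encode(payload).decode("ascii"),
    }


with open("large_doc_sample.json", "w") as f:
    json.dump(make_large_doc(), f)
```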
> - Nested models -- reading and writing nested models of various sizes, to explore basic operation efficiency for complex
>   data
>
> The suite is intentionally kept small for several reasons. One, ODM feature sets vary significantly across libraries.
May prefer bulleted list here e.g.
The suite is intentionally kept small for the following reasons:
- ODM feature sets vary …
- Several popular MongoDB ODMs are maintained by third-parties …
aclark4life
left a comment
LGTM pending clarification of whether "in the repo" can replace "in the repo's test dir".
> We expect substantial performance differences between ODMs based on both their language families (e.g. static vs.
> dynamic or compiled vs. virtual-machine-based) as well as their inherent design (e.g. web frameworks such as Django vs.
> application-agnostic such as Mongoose). However we still expect "vertical" comparison within families of ODMs to expose
I don't think it is worthwhile to compare different ODMs to each other. The performance of ODMs doing different types of things varies widely based on the approach taken by the ODM, as opposed to anything the provider/adapter for Mongo is doing.
I do think it could be valuable to compare a given ODM with Mongo to that same ODM but with a similar (e.g. Cosmos) database, and a different (e.g. PostgreSQL) database. Whether or not this will show differences in the client is dependent on many things. For example, in .NET making the PostgreSQL provider faster is measurable because the data transfer and server can keep up. On the other hand, making the SQL Server provider faster makes no difference, because the wire protocol and server blocking is already the limiting factor.
It may also be useful to test raw driver to ODM perf, especially since customers often ask about this. However, in most cases the performance overhead will come from the core ODM code, rather than anything we are doing, so I doubt that there will be much actionable to come out of this.
Comparing ODMs to each other could be useful in identifying potential design or implementation issues. For example, if one ODM implements embedded document querying in an inefficient way, comparing its performance on a benchmark to a similar ODM with much better performance could unlock improvements that would be difficult to identify otherwise. Outside of that specific case, I agree that ODM comparisons are not very useful.
Comparing performance across databases is an interesting idea. Django did apples-to-apples comparisons with benchmarks against both MongoDB and Postgres and got a lot of useful data out of that. ODMs make doing so relatively easy as only the backend configuration and models (for differences like embedded documents and relational links) need to change. We'd need to be careful to isolate performance differences to the database alone as much as possible, due to all the factors you state.
Comparing raw driver to ODM perf is part of the stated goals of this project. Determining exactly which benchmarks should be directly compared is still under consideration, for both maintainability and overhead concerns.
I talked to @JamesKovacs and then @NoahStapp offline about my concerns. In a nutshell, there is a lot here, and perf testing is hard and resource (human and otherwise) intensive. It would be my preference to start with one goal, such as testing for ODM perf regressions, and get that working well end-to-end. In terms of the details here (iterations, payloads, and the scenarios themselves) they look fine as starting points, and then we can modify them as we start getting data based on what we are seeing.
I'm going to say LGTM here, because this feels like a plan we can get started with and then iterate on.
I agree -- let's start by using performance tests to get a first indicator of performance and prevent regressions, then focus on improving performance before starting to compare different ODMs against one another. FWIW, even though we've been testing drivers performance for a long time, we haven't really compared their performance, nor explored how language differences (e.g. compiled vs. interpreted) may favour one driver over another in the performance tests. I don't think it's good to include such a goal in the initial spec.
Agreed that the initial focus of this spec is to prevent regressions and track performance for individual ODMs. However, I don't think that not having compared performance across drivers in the past should prevent us from aiming to compare performance across ODMs in the future. The driver benchmarking spec includes a similar paragraph about comparing across drivers that we may not have ever fully implemented, but for the reasons stated above I feel it's a worthwhile goal.
aclark4life
left a comment
Agree with @ajcvickers ! I think this is a reasonable framework to get folks started doing something collaboratively across ODMs. We'll see how it goes and update the spec accordingly.
> The MongoDB ODM Performance Benchmark must be run against a MongoDB replica set of size 1 patch-pinned to the latest
> stable database version without authentication or SSL enabled. This database version should be updated as needed to
> better differentiate between ODM performance changes and server performance changes. The Benchmark should be run on the
> established internal performance distro for the sake of consistency.
Should "established internal performance distro" reference another document? Based on @dariakp's resolved comment above, I assume this is referring to the basic driver benchmarks.
This is an internal document specifying infrastructure-specific config and details. I would support linking it here, but that is against our current policy for specs.
> to the relatively long runtime of the benchmarks, including them as part of an automated suite that runs against every
> PR is not recommended. Instead, scheduling benchmark runs on a regular cadence is the recommended method of automating
> this suite of tests.
Should the dedicated Evergreen hosts be discussed in this document? Alternatively, if it's already discussed in the general driver benchmarks spec we can reference that.
jmikola
left a comment
Left one suggestion for a trailing newline, which I think is what @Jibola was asking for.
OK to have my other open comment resolved without further input on my end.
> @@ -0,0 +1 @@
> {"field1":"miNVpaKW","field2":"CS5VwrwN","field3":"Oq5Csk1w","field4":"ZPm57dhu","field5":"gxUpzIjg","field6":"Smo9whci","field7":"TW34kfzq","field8":55336395,"field9":41992681,"field10":72188733,"field11":46660880,"field12":3527055,"field13":74094448} (no newline at end of file)
@Jibola are you asking for a trailing newline here?
Suggested change (add a trailing newline):

> {"field1":"miNVpaKW","field2":"CS5VwrwN","field3":"Oq5Csk1w","field4":"ZPm57dhu","field5":"gxUpzIjg","field6":"Smo9whci","field7":"TW34kfzq","field8":55336395,"field9":41992681,"field10":72188733,"field11":46660880,"field12":3527055,"field13":74094448}
> to the relatively long runtime of the benchmarks, including them as part of an automated suite that runs against every
> PR is not recommended. Instead, scheduling benchmark runs on a regular cadence is the recommended method of automating
> this suite of tests.
@NoahStapp let me know if this is also an internal document or something we can easily reference. Feel free to resolve accordingly.
I'm not aware of an existing document that lays out best practices for scheduling performance tests; this is pulled from our own experience with the Python driver.
JamesKovacs
left a comment
Still some pending changes that need addressing. Minor typos, an anchor tag fix, and such. Please address so I can LGTM. Thanks.
> Summary: This benchmark tests ODM performance creating a single large model.
>
> Dataset: The dataset (LARGE_DOC) is contained within `large_doc.json` and consists of a sample document stored as strict
> JSON with an encoded length of approximately 8,000 bytes.
A few MB makes sense to me for the reasons stated. If the intent is to test large payloads, we could have a binary field that contains a 1MB JPG of a leaf. (e.g. Something incompressible.)
> - Sort the array into ascending order (i.e. shortest time first)
> - Let the index i for percentile p in the range [1,100] be defined as: `i = int(N * p / 100) - 1`
>
> *N.B. This is the [Nearest Rank](https://en.wikipedia.org/wiki/Percentile#The_Nearest_Rank_method) algorithm, chosen for
This anchor needs to be corrected.
> The data will be stored as strict JSON with no extended types. These JSON representations must be converted into
> equivalent models as part of each benchmark task.
>
> Flat model benchmark tasks include:s
Not addressed yet.
Python Django implementation: mongodb/django-mongodb-backend#366.