-
Notifications
You must be signed in to change notification settings - Fork 294
tests: database_types dual-use for benchmarks #67
Conversation
data_diff/database.py
Outdated
@@ -381,7 +381,13 @@ def to_string(self, s: str): | |||
|
|||
def _query(self, sql_code: str) -> list: | |||
"Uses the standard SQL cursor interface" | |||
return _query_conn(self._conn, sql_code) | |||
c = self._conn.cursor() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is Presto-specific, it should only be used for Presto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what you mean 👀 This is on the Presto driver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, my bad.
@@ -378,6 +378,13 @@ def _diff_tables(self, table1, table2, level=0, segment_index=None, segment_coun | |||
f"size: {table2.max_key-table1.min_key}" | |||
) | |||
|
|||
# The entire segment wasn't below the threshold, but the next set of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This belongs in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. The reason I included it here is because this is how the 'download' is a fair benchmark comparison where it splits into multiple segments, processing those in threads
"2022-01-01 15:10:05.009900", | ||
], | ||
"float": [0.0, 0.1, 0.10, 10.0, 100.98], | ||
"int": IntFaker(N_SAMPLES), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a step backwards. How can we be sure that the "faker" creates all the edge cases that we are interested in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely don't think this is a step backwards. It allows us to generate millions of samples predictably when we set N_SAMPLES
higher for more comprehensive testing (and benchmarking), which is a far more thorough than anything we can type out manually.
It generates plenty of edge-cases naturally by creating lots of 9s, 8s, etc. in the first bit:
['2000-01-01 00:00:03.000571', '2000-01-01 00:00:06.001142', '2000-01-01 00:00:09.001713', '2000-01-01 00:00:12.002284', '2000-01-01 00:00:15.002855', '2000-01-01 00:00:18.003426', '2000-01-01 00:00:21.003997', '2000-01-01 00:00:24.004568', '2000-01-01 00:00:27.005139', '2000-01-01 00:00:30.005710', '2000-01-01 00:00:33.006281', '2000-01-01 00:00:36.006852', '2000-01-01 00:00:39.007423', '2000-01-01 00:00:42.007994', '2000-01-01 00:00:45.008565', '2000-01-01 00:00:48.009136', '2000-01-01 00:00:51.009707', '2000-01-01 00:00:54.010278', '2000-01-01 00:00:57.010849', '2000-01-01 00:01:00.011420', '2000-01-01 00:01:03.011991', '2000-01-01 00:01:06.012562', '2000-01-01 00:01:09.013133', '2000-01-01 00:01:12.013704', '2000-01-01 00:01:15.014275', '2000-01-01 00:01:18.014846', '2000-01-01 00:01:21.015417', '2000-01-01 00:01:24.015988', '2000-01-01 00:01:27.016559', '2000-01-01 00:01:30.017130', '2000-01-01 00:01:33.017701', '2000-01-01 00:01:36.018272', '2000-01-01 00:01:39.018843', '2000-01-01 00:01:42.019414', '2000-01-01 00:01:45.019985', '2000-01-01 00:01:48.020556', '2000-01-01 00:01:51.021127', '2000-01-01 00:01:54.021698', '2000-01-01 00:01:57.022269', '2000-01-01 00:02:00.022840', '2000-01-01 00:02:03.023411', '2000-01-01 00:02:06.023982', '2000-01-01 00:02:09.024553', '2000-01-01 00:02:12.025124', '2000-01-01 00:02:15.025695', '2000-01-01 00:02:18.026266', '2000-01-01 00:02:21.026837', '2000-01-01 00:02:24.027408', '2000-01-01 00:02:27.027979', '2000-01-01 00:02:30.028550']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there are edge-cases we already know are problematic. It doesn't make sense to throw them away, and hope we get them through an equation..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I can hard-code the first M elements of the faker to the existing hard-coded values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's acceptable.
tests/test_database_types.py
Outdated
for source_type in source_types: | ||
for target_type in target_type_categories[type_category]: | ||
for target_type in target_type_categories.get(type_category, []): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll never know if a typo caused some the tests to not run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair! Reason I had it was to just be able to comment out something quickly, but I see now that the risk isn't worth it. Agree
tests/test_database_types.py
Outdated
dst_table = dst_conn.quote(".".join(dst_table_path)) | ||
|
||
# If you want clean house from e.g. changing fakers. | ||
src_conn.query(f"DROP TABLE IF EXISTS {src_table}", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oracle doesn't support "DROP TABLE IF EXISTS"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually going to remove/comment out these lines. The whole idea is to keep these tables around so we can re-run benchmarks without building up millions of rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment about old revisions.
self.table = TableSegment(self.src_conn, src_table_path, "id", None, ("col",), quote_columns=False) | ||
self.table2 = TableSegment(self.dst_conn, dst_table_path, "id", None, ("col",), quote_columns=False) | ||
start = time.time() | ||
already_seeded = self._create_table(dst_conn, dst_table, target_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be dropping the tables when we're done, no? So this exists only in case the tests fail.. And what if it's an old table with different values? Better to drop the tables and re-create everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We keep the tables around on purpose for performance with large N_SAMPLES
. Only if you change the fakers would you need to drop them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that's exactly what's going to happen if I check-out some old revision on git. And I won't understand why the tests are behaving strangely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's fair. I'm thinking then I will add a benchmarking "mode" via an environment variable where you pass the sample numbers. We need to select a small subset of the tests for this either way. I'll comment it for now to get the numbers I need.
Biggest benefit of this PR is that running benchmarks through this suite is possible at all
tests/test_database_types.py
Outdated
count2_duration = round(time.time() - start, 2) | ||
|
||
# Ensure we actually checksum even for few samples! | ||
ch_threshold = 5_000 if N_SAMPLES > DEFAULT_N_SAMPLES else 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs an explanation
tests/test_database_types.py
Outdated
diff = list(differ.diff_tables(self.table, self.table2)) | ||
expected = [] | ||
self.assertEqual(expected, diff) | ||
self.assertEqual(len(sample_values), differ.stats.get("rows_downloaded", 0)) | ||
download_duration = round(time.time() - start, 2) | ||
|
||
print( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we agreed tests don't print. Use the logger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, for sure, just a debugging relic! Will log to debug
!
tests/test_database_types.py
Outdated
) | ||
|
||
def _create_table(self, conn, table, type) -> bool: | ||
conn.query(f"CREATE TABLE IF NOT EXISTS {table}(id int, col {type})", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all dialects support IF NOT EXISTS here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, is it Oracle? I've run this on all but Redshift + Oracle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Oracle, yes.
tests/test_database_types.py
Outdated
return already_seeded | ||
|
||
def _insert_to_table(self, conn, table, type, values): | ||
conn.query(f"TRUNCATE {table}", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you make sure TRUNCATE exists in all dialects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worked on all I've seen so far. It's fairly standard SQL... I mean until we discover it's not.
data_diff/database.py
Outdated
if sql_code.lower().startswith("select"): | ||
return c.fetchall() | ||
# Required for the query to actually run 🤯 | ||
if re.match(r"\A(insert|create)", sql_code, re.IGNORECASE): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\A is not necessary. Python's match()
doesn't scan the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh TIL
03a4a06
to
86c3866
Compare
3a53501
to
99dc0ee
Compare
presto: fix test suite diff_tables: fix not double recursing tests: fix presto create indexes, etc., haven't tested all dbs more
99dc0ee
to
e973e04
Compare
Most is addressed in #125 and |
@erezsh The idea is to dual-use this test-suite for benchmarking too by upping
N_SAMPLES
. However, you'll want to run this for individual tests with a highN_SAMPLES
... so it's a bit annoying, but workable.Plan is to use this to create the graphs.