Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.

tests: database_types dual-use for benchmarks #67

Closed
wants to merge 1 commit into from

Conversation

sirupsen
Copy link
Contributor

@sirupsen sirupsen commented Jun 17, 2022

@erezsh The idea is to dual-use this test-suite for benchmarking too by upping N_SAMPLES. However, you'll want to run this for individual tests with a high N_SAMPLES... so it's a bit annoying, but workable.

Plan is to use this to create the graphs.

@@ -381,7 +381,13 @@ def to_string(self, s: str):

def _query(self, sql_code: str) -> list:
"Uses the standard SQL cursor interface"
return _query_conn(self._conn, sql_code)
c = self._conn.cursor()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is Presto-specific, it should only be used for Presto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you mean 👀 This is on the Presto driver

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, my bad.

@@ -378,6 +378,13 @@ def _diff_tables(self, table1, table2, level=0, segment_index=None, segment_coun
f"size: {table2.max_key-table1.min_key}"
)

# The entire segment wasn't below the threshold, but the next set of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This belongs in a separate PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. The reason I included it here is because this is how the 'download' is a fair benchmark comparison where it splits into multiple segments, processing those in threads

"2022-01-01 15:10:05.009900",
],
"float": [0.0, 0.1, 0.10, 10.0, 100.98],
"int": IntFaker(N_SAMPLES),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a step backwards. How can we be sure that the "faker" creates all the edge cases that we are interested in?

Copy link
Contributor Author

@sirupsen sirupsen Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely don't think this is a step backwards. It allows us to generate millions of samples predictably when we set N_SAMPLES higher for more comprehensive testing (and benchmarking), which is a far more thorough than anything we can type out manually.

It generates plenty of edge-cases naturally by creating lots of 9s, 8s, etc. in the first bit:

['2000-01-01 00:00:03.000571', '2000-01-01 00:00:06.001142', '2000-01-01 00:00:09.001713', '2000-01-01 00:00:12.002284', '2000-01-01 00:00:15.002855', '2000-01-01 00:00:18.003426', '2000-01-01 00:00:21.003997', '2000-01-01 00:00:24.004568', '2000-01-01 00:00:27.005139', '2000-01-01 00:00:30.005710', '2000-01-01 00:00:33.006281', '2000-01-01 00:00:36.006852', '2000-01-01 00:00:39.007423', '2000-01-01 00:00:42.007994', '2000-01-01 00:00:45.008565', '2000-01-01 00:00:48.009136', '2000-01-01 00:00:51.009707', '2000-01-01 00:00:54.010278', '2000-01-01 00:00:57.010849', '2000-01-01 00:01:00.011420', '2000-01-01 00:01:03.011991', '2000-01-01 00:01:06.012562', '2000-01-01 00:01:09.013133', '2000-01-01 00:01:12.013704', '2000-01-01 00:01:15.014275', '2000-01-01 00:01:18.014846', '2000-01-01 00:01:21.015417', '2000-01-01 00:01:24.015988', '2000-01-01 00:01:27.016559', '2000-01-01 00:01:30.017130', '2000-01-01 00:01:33.017701', '2000-01-01 00:01:36.018272', '2000-01-01 00:01:39.018843', '2000-01-01 00:01:42.019414', '2000-01-01 00:01:45.019985', '2000-01-01 00:01:48.020556', '2000-01-01 00:01:51.021127', '2000-01-01 00:01:54.021698', '2000-01-01 00:01:57.022269', '2000-01-01 00:02:00.022840', '2000-01-01 00:02:03.023411', '2000-01-01 00:02:06.023982', '2000-01-01 00:02:09.024553', '2000-01-01 00:02:12.025124', '2000-01-01 00:02:15.025695', '2000-01-01 00:02:18.026266', '2000-01-01 00:02:21.026837', '2000-01-01 00:02:24.027408', '2000-01-01 00:02:27.027979', '2000-01-01 00:02:30.028550']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there are edge-cases we already know are problematic. It doesn't make sense to throw them away, and hope we get them through an equation..

Copy link
Contributor Author

@sirupsen sirupsen Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I can hard-code the first M elements of the faker to the existing hard-coded values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's acceptable.

for source_type in source_types:
for target_type in target_type_categories[type_category]:
for target_type in target_type_categories.get(type_category, []):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll never know if a typo caused some the tests to not run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair! Reason I had it was to just be able to comment out something quickly, but I see now that the risk isn't worth it. Agree

dst_table = dst_conn.quote(".".join(dst_table_path))

# If you want clean house from e.g. changing fakers.
src_conn.query(f"DROP TABLE IF EXISTS {src_table}", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oracle doesn't support "DROP TABLE IF EXISTS"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually going to remove/comment out these lines. The whole idea is to keep these tables around so we can re-run benchmarks without building up millions of rows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment about old revisions.

self.table = TableSegment(self.src_conn, src_table_path, "id", None, ("col",), quote_columns=False)
self.table2 = TableSegment(self.dst_conn, dst_table_path, "id", None, ("col",), quote_columns=False)
start = time.time()
already_seeded = self._create_table(dst_conn, dst_table, target_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be dropping the tables when we're done, no? So this exists only in case the tests fail.. And what if it's an old table with different values? Better to drop the tables and re-create everything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We keep the tables around on purpose for performance with large N_SAMPLES. Only if you change the fakers would you need to drop them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that's exactly what's going to happen if I check-out some old revision on git. And I won't understand why the tests are behaving strangely.

Copy link
Contributor Author

@sirupsen sirupsen Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that's fair. I'm thinking then I will add a benchmarking "mode" via an environment variable where you pass the sample numbers. We need to select a small subset of the tests for this either way. I'll comment it for now to get the numbers I need.

Biggest benefit of this PR is that running benchmarks through this suite is possible at all

count2_duration = round(time.time() - start, 2)

# Ensure we actually checksum even for few samples!
ch_threshold = 5_000 if N_SAMPLES > DEFAULT_N_SAMPLES else 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an explanation

diff = list(differ.diff_tables(self.table, self.table2))
expected = []
self.assertEqual(expected, diff)
self.assertEqual(len(sample_values), differ.stats.get("rows_downloaded", 0))
download_duration = round(time.time() - start, 2)

print(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we agreed tests don't print. Use the logger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, for sure, just a debugging relic! Will log to debug!

)

def _create_table(self, conn, table, type) -> bool:
conn.query(f"CREATE TABLE IF NOT EXISTS {table}(id int, col {type})", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all dialects support IF NOT EXISTS here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, is it Oracle? I've run this on all but Redshift + Oracle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Oracle, yes.

return already_seeded

def _insert_to_table(self, conn, table, type, values):
conn.query(f"TRUNCATE {table}", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you make sure TRUNCATE exists in all dialects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worked on all I've seen so far. It's fairly standard SQL... I mean until we discover it's not.

if sql_code.lower().startswith("select"):
return c.fetchall()
# Required for the query to actually run 🤯
if re.match(r"\A(insert|create)", sql_code, re.IGNORECASE):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\A is not necessary. Python's match() doesn't scan the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh TIL

@sirupsen sirupsen force-pushed the benchmark-suite branch 4 times, most recently from 3a53501 to 99dc0ee Compare June 24, 2022 15:39
presto: fix test suite

diff_tables: fix not double recursing

tests: fix presto

create indexes, etc., haven't tested all dbs

more
@sirupsen sirupsen mentioned this pull request Jun 28, 2022
@sirupsen
Copy link
Contributor Author

Most is addressed in #125 and

@sirupsen sirupsen closed this Jun 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants