-
Notifications
You must be signed in to change notification settings - Fork 15
ingester: Port batch ingester to django #1489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
aa39103
to
9e3cd40
Compare
json_files = [ | ||
f | ||
for f in os.listdir(spool_dir) | ||
if os.path.isfile(os.path.join(spool_dir, f)) | ||
and f.endswith(".json") | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChatGPT suggested scandir, which is a bit more cleaner around is_file
, and https://stackoverflow.com/questions/59268696/why-is-os-scandir-as-slow-as-os-listdir seems to suggest it would be as fast as listdir (which apparently is due to how it's using C structs instead of Python objects).
with os.scandir(spool_dir) as it:
json_files = [
entry.name
for entry in it
if entry.is_file() and entry.name.endswith(".json")
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
# Batching and backpressure controls | ||
try: | ||
INGEST_BATCH_SIZE = int(os.environ.get("INGEST_BATCH_SIZE", "1000")) | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: prefer being more specific (applies to lines below)
except Exception: | |
except (ValueError, TypeError): |
# Write debug/perf output to stdout | ||
# logger was unreliable in some environments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
# Write debug/perf output to stdout | |
# logger was unreliable in some environments | |
"""Write debug/perf output to stdout.""" | |
# logger was unreliable in some environments |
def db_worker( # noqa: C901 | ||
stop_event: threading.Event, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, I'm almost sure you could've put the noqa in front.
def db_worker( # noqa: C901 | |
stop_event: threading.Event, | |
): | |
def db_worker(stop_event: threading.Event): # noqa: C901 |
But I'd argue that the c901 alert was accurate, and we should try to refactor this function in the future
Don't worry too much about the nitpicking, it's just how I walk through the patch while I understand the work. I appreciate the batching and the need to call to I think the code could be drier... but I also think the functionality was already thoroughly tested manually. So we could merge this critical piece, write unit tests and performance tests over it, and only then think about refining the code. |
In other words, I'm looking at this PR as your requirements with regards to performance, and matching the functionality we had in kcidb ingester, and we need to work on top of it. |
backend/kernelCI_app/management/commands/helpers/process_submissions.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/helpers/process_submissions.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
flush_start = time.time() | ||
try: | ||
# Single transaction for all tables in the flush | ||
with transaction.atomic(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it could work if you made a single shared function like
def bulk_create_items(table, item_buffer):
t0 = time.time()
# table should be Issues, Checkouts, etc.
table.objects.bulk_create(
item_buffer, batch_size=batch_size, ignore_conflicts=True
)
_out(
"[%s] bulk_create issues: n=%d in %.3fs"
% (_ts(), len(item_buffer), time.time() - t0)
)
and then just call it like bulk_create_items(Issues, issue_buf)
and similarly to the other tables. This way would reduce the code a lot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it doesn't work, just move all this insertion to a separate function to lower the complexity of db_worker
, please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it doesn't work, just move all this insertion to a separate function to lower the complexity of db_worker, please
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
This is just initial stage to get things working, and to disable old ingester. |
b90b8cf
to
29c10e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why add clutter to the code when it can simply be refactored without much effort. I understand the reasoning, but this ingester is not even being currently used, we don't need to keep "legacy" code.
flush_start = time.time() | ||
try: | ||
# Single transaction for all tables in the flush | ||
with transaction.atomic(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it doesn't work, just move all this insertion to a separate function to lower the complexity of db_worker, please
backend/kernelCI_app/management/commands/helpers/process_submissions.py
Outdated
Show resolved
Hide resolved
@MarceloRobert i removed legacy code now |
3be7455
to
338ad56
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There could still be some improvements, but looks good enough to me.
try: | ||
# Attempt to cancel remaining futures and exit early | ||
# Note: this only cancels tasks not yet started | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there some code supposed to be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More graceful shutdown if process is interrupted. I decided it is not critical for now, as in general this ingester better to not interrupt, especially when it is in this part of code (due potential data loss)
backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py
Outdated
Show resolved
Hide resolved
c97b59b
to
8750e8b
Compare
Also added some verbose logs to monitor ingestion rate, queue, etc. Signed-off-by: Denys Fedoryshchenko <[email protected]>
8750e8b
to
6801b26
Compare
Also added some verbose logs to monitor ingestion rate, queue, etc.