Skip to content

Conversation

@rtibbles
Copy link
Member

Summary

During testing of 0.8 with the KA integration script, I was still seeing cases where a download would fail, but not trigger an error from the file pipeline.
Adds a check that any output file from the pipeline must be in the ricecooker storage, as otherwise, the pipeline has failed to execute properly.

@rtibbles rtibbles force-pushed the output_path_required branch from 16a0b26 to d9656f0 Compare September 25, 2025 22:11
Copy link
Member

@marcellamaki marcellamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concerns just two small questions, then i'll do a re-read and approve.

"""A dummy handler that passes through the original path without transferring to storage.
This simulates the bug where a download handler fails to actually download/transfer
the file but returns the original URL as the path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but returns the original URL as the path

I'm not sure what this means. Previously, the bug was what the file wasn't getting transferred/downloaded actually, and the bug was that it seemed like it was, because there was a return value that was a path? but that the path was not the local path download location, but rather the original URL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that was what was actually happening for me - but because of the way that the handlers work, it is possible that this would cause an issue - basically, this just makes sure that download handlers always return a local file path in storage after they have finished, because otherwise they could cause issues for every other handler.

# Use explicit timeout to prevent hanging downloads
# (connection_timeout, read_timeout) - connection timeout for establishing connection,
# read timeout for time between receiving data chunks (prevents stuck downloads)
r = config.DOWNLOAD_SESSION.get(path, stream=True, timeout=(30, 60))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly a curiosity question -- how did you decide what timeout values to use here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't - Claude chose them, and they seemed fine to me!

Copy link
Member

@marcellamaki marcellamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions resolved! Thanks, Richard

@marcellamaki marcellamaki merged commit 75dd992 into learningequality:develop Oct 23, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants