-
Notifications
You must be signed in to change notification settings - Fork 20k
community: Add recursive sitemap support to GitbookLoader with concurrent processing #30681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community: Add recursive sitemap support to GitbookLoader with concurrent processing #30681
Conversation
…ded async processing for speeding document loading
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
|
Could we speed up web loader instead of making changes in gitbook loader? |
f29f11c to
10d6ad5
Compare
You are right - any optimizations should be performed in WebBaseLoader. Nevertheless, this is a more involved task as it has lots of dependencies, so perhaps this could be another ticket. |
|
@eyurtsev , can you please review when you got a chance? Thank you! |
|
@andrasfe I'll review this tomorrow more carefully. But if there are no changes here that optimize specifically for gitbookloader, then either changes will need to be made in the parent abstraction or the implementation needs to be refactored before we can merge the code. |
|
For context, could you provide a quick explanation of what this does that was not do-able with the web based loader in terms of concurrent processing? |
All is good with the base class now. I modified the code to fully take advantage of the WebBaseLoader async processing. The main issue I addressed, as outlined in the ticket I opened, was that the original GitbookLoader implementation could not handle hierarchical sitemaps—clearly a bug. At first, I introduced the optimization without realizing that WebBaseLoader already supports asynchronous loading -- my bad, it's fixed now. While there’s room for further optimization in this class, as you pointed out, that’s not the primary focus of this ticket or fix. Thank you!! |
eyurtsev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to make 3 changes to this:
- Stop inherit from WebBaseLoader and re-use WebBaseLoader to simplify implementation?
- Make sure that this does not introduce a CVE unless there's something already built in that can filter urls that are not in allowed domains.
- remove changes to pyproject.toml and uv.lock and replace with pytest.marker.requires
libs/community/tests/unit_tests/document_loaders/test_gitbook.py
Outdated
Show resolved
Hide resolved
e758a3f to
eeae614
Compare
|
All changes implemented, as requested. thank you! |
…safely, CVE support added, reverted pyproject.toml and uv.lock to master
eeae614 to
5195dc9
Compare
|
Hi @eyurtsev , when you got a chance.... thank you :) |
eyurtsev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrasfe thank you for the updated implementation and apologies for the delay I just came back from vacation yesterday.
The implementation looks good overall, i flagged a few places that still need to be adjusted for security purposes.
We're in the process of moving community to a separate repository:
https://github.com/langchain-ai/langchain-community
Would you be able to address the issue and re-open the PR there?
libs/community/tests/unit_tests/document_loaders/test_gitbook.py
Outdated
Show resolved
Hide resolved
- Improve domain validation to ensure URLs come from allowed domains - Add explicit scheme validation to prevent protocol-based attacks - Implement safe URL checking before adding URLs to processing queue - Refactor URL handling code to apply validation consistently
|
Thank you @eyurtsev and no problem at all. I saw you had no activity during this time frame - know you were off. I completed the changes ad provided my 2 cents for some, but I'm open to further suggestions. I will open a PR against the new repo once I get confirmation from you that we are good to go. |
|
Closing for now as we've already moved the library. |
|
re-created PR in the new repo: langchain-ai/langchain-community#13 |
…cursive-sitemap community: Add recursive sitemap support to GitbookLoader with concurrent processing langchain-ai#30681
Description:
Enhanced
GitbookLoaderto support recursive sitemap structures and asynchronous processing. The loader now recursively processes sitemap index files, following links to child sitemaps, and extracts all URLs to content pages. Also added async processing.Issue:
Fixes #30629 - GitbookLoader fails to process nested sitemaps
Dependencies:
None added