@andrasfe
Contributor

@andrasfe andrasfe commented Apr 4, 2025

Description:

Enhanced GitbookLoader to support recursive sitemap structures and asynchronous processing. The loader now recursively processes sitemap index files, following links to child sitemaps, and extracts all URLs that point to content pages. Pages are then loaded asynchronously to speed up document loading.
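
For readers following along, the recursive step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `collect_page_urls`, the injected `fetch` callable, and the `FAKE_SITE` data are all hypothetical, and the sketch assumes the XML comes from a trusted source.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional, Set

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return content-page URLs, descending into nested sitemap indexes.

    `fetch` is a hypothetical callable that maps a URL to its XML text.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]

    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # An index file: every <loc> points at a child sitemap, so recurse.
        pages: List[str] = []
        for child_url in locs:
            pages.extend(collect_page_urls(child_url, fetch, seen))
        return pages

    # A plain <urlset>: every <loc> is a content page.
    return locs


# Hypothetical in-memory site used instead of real HTTP requests.
_NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
FAKE_SITE: Dict[str, str] = {
    "https://docs.example.com/sitemap.xml": (
        f"<sitemapindex {_NS}>"
        "<sitemap><loc>https://docs.example.com/guide/sitemap.xml</loc></sitemap>"
        "<sitemap><loc>https://docs.example.com/api/sitemap.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://docs.example.com/guide/sitemap.xml": (
        f"<urlset {_NS}>"
        "<url><loc>https://docs.example.com/guide/intro</loc></url>"
        "<url><loc>https://docs.example.com/guide/setup</loc></url>"
        "</urlset>"
    ),
    "https://docs.example.com/api/sitemap.xml": (
        f"<urlset {_NS}>"
        "<url><loc>https://docs.example.com/api/reference</loc></url>"
        "</urlset>"
    ),
}

if __name__ == "__main__":
    for url in collect_page_urls("https://docs.example.com/sitemap.xml", FAKE_SITE.__getitem__):
        print(url)
```

Passing the fetcher in as a parameter keeps the traversal logic separate from HTTP concerns, which also makes it easy to exercise without network access.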

Issue:

Fixes #30629 - GitbookLoader fails to process nested sitemaps

Dependencies:

None added

…ded async processing for speeding document loading
@dosubot dosubot bot added size:XL labels Apr 4, 2025
@eyurtsev
Collaborator

eyurtsev commented Apr 4, 2025

Could we speed up web loader instead of making changes in gitbook loader?

@eyurtsev eyurtsev self-assigned this Apr 4, 2025
@dosubot dosubot bot added size:L and removed size:XL labels Apr 5, 2025
@andrasfe andrasfe force-pushed the community-gitbook-recursive-sitemap branch from f29f11c to 10d6ad5 Compare April 5, 2025 13:40
@dosubot dosubot bot added size:XL and removed size:L labels Apr 5, 2025
@andrasfe
Contributor Author

andrasfe commented Apr 5, 2025

> Could we speed up web loader instead of making changes in gitbook loader?

You are right - any optimizations should be performed in WebBaseLoader. Nevertheless, this is a more involved task as it has lots of dependencies, so perhaps this could be another ticket.

@andrasfe
Contributor Author

andrasfe commented Apr 8, 2025

@eyurtsev, can you please review when you get a chance? Thank you!

@eyurtsev
Collaborator

eyurtsev commented Apr 9, 2025

@andrasfe I'll review this more carefully tomorrow.

But if there are no changes here that optimize specifically for gitbookloader, then either changes will need to be made in the parent abstraction or the implementation needs to be refactored before we can merge the code.

@eyurtsev
Collaborator

eyurtsev commented Apr 9, 2025

For context, could you provide a quick explanation of what this does that was not doable with the web-based loader in terms of concurrent processing?

@andrasfe
Contributor Author

andrasfe commented Apr 9, 2025

> For context, could you provide a quick explanation of what this does that was not doable with the web-based loader in terms of concurrent processing?

All is good with the base class now. I modified the code to fully take advantage of the WebBaseLoader async processing.

The main issue I addressed, as outlined in the ticket I opened, was that the original GitbookLoader implementation could not handle hierarchical sitemaps—clearly a bug. At first, I introduced the optimization without realizing that WebBaseLoader already supports asynchronous loading -- my bad, it's fixed now. While there’s room for further optimization in this class, as you pointed out, that’s not the primary focus of this ticket or fix. Thank you!!

Collaborator

@eyurtsev eyurtsev left a comment

Is it possible to make 3 changes to this:

  1. Stop inheriting from WebBaseLoader and instead re-use it internally to simplify the implementation?
  2. Make sure that this does not introduce a CVE: URLs that are not in the allowed domains must be filtered out, unless something already built in does that.
  3. Remove the changes to pyproject.toml and uv.lock and replace them with pytest.mark.requires.
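
As a rough illustration of the first request (re-use rather than inherit), the shape might look like the sketch below. Every name here is hypothetical: `PageFetcher` merely stands in for whatever WebBaseLoader would provide, and `SitemapBackedLoader` is not the class from this PR.

```python
from typing import Iterator, List, Protocol, Tuple


class PageFetcher(Protocol):
    """Hypothetical stand-in for the component that downloads pages."""

    def fetch_all(self, urls: List[str]) -> List[str]:
        ...


class SitemapBackedLoader:
    """Owns a fetcher instead of subclassing it (composition)."""

    def __init__(self, page_urls: List[str], fetcher: PageFetcher) -> None:
        self.page_urls = list(page_urls)
        self.fetcher = fetcher

    def lazy_load(self) -> Iterator[Tuple[str, str]]:
        # All network work is delegated; this class only decides which URLs to load.
        bodies = self.fetcher.fetch_all(self.page_urls)
        for url, body in zip(self.page_urls, bodies):
            yield url, body


class FakeFetcher:
    """Test double that records calls instead of hitting the network."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def fetch_all(self, urls: List[str]) -> List[str]:
        self.calls.append(list(urls))
        return [f"<html>{u}</html>" for u in urls]


if __name__ == "__main__":
    loader = SitemapBackedLoader(["https://docs.example.com/a"], FakeFetcher())
    print(list(loader.lazy_load()))
```

With this layout, any speed-up made in the fetching component benefits the loader automatically, and the loader can be tested with a fake fetcher.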

@andrasfe andrasfe force-pushed the community-gitbook-recursive-sitemap branch from e758a3f to eeae614 Compare April 12, 2025 18:22
@andrasfe
Contributor Author

All changes implemented, as requested. Thank you!

…safely, CVE support added, reverted pyproject.toml and uv.lock to master
@andrasfe andrasfe force-pushed the community-gitbook-recursive-sitemap branch from eeae614 to 5195dc9 Compare April 14, 2025 14:32
@andrasfe
Contributor Author

Hi @eyurtsev, when you get a chance... thank you :)

Collaborator

@eyurtsev eyurtsev left a comment

@andrasfe thank you for the updated implementation, and apologies for the delay; I just came back from vacation yesterday.

The implementation looks good overall. I flagged a few places that still need to be adjusted for security purposes.

We're in the process of moving community to a separate repository:

https://github.com/langchain-ai/langchain-community

Would you be able to address the issue and re-open the PR there?

- Improve domain validation to ensure URLs come from allowed domains
- Add explicit scheme validation to prevent protocol-based attacks
- Implement safe URL checking before adding URLs to processing queue
- Refactor URL handling code to apply validation consistently
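
One way to picture the checks listed above is a single validation helper that every URL passes through before it is queued. This is only a sketch under assumptions: `is_safe_url`, `enqueue_safe`, and `ALLOWED_SCHEMES` are hypothetical names, not identifiers from the PR.

```python
from collections import deque
from typing import Deque, Iterable, Set
from urllib.parse import urlparse

# Assumption for this sketch: only plain web schemes are acceptable.
ALLOWED_SCHEMES = frozenset({"http", "https"})


def is_safe_url(url: str, allowed_domains: Set[str]) -> bool:
    """Accept a URL only if both its scheme and its exact hostname are allowed."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False  # rejects file://, ftp://, javascript:, and similar
    # Exact hostname match, so "docs.example.com.evil.net" does not pass.
    return parsed.hostname is not None and parsed.hostname in allowed_domains


def enqueue_safe(
    queue: Deque[str], candidates: Iterable[str], allowed_domains: Set[str]
) -> None:
    """Single entry point for adding URLs, so validation is applied consistently."""
    for url in candidates:
        if is_safe_url(url, allowed_domains):
            queue.append(url)


if __name__ == "__main__":
    pending: Deque[str] = deque()
    enqueue_safe(
        pending,
        [
            "https://docs.example.com/guide",
            "file:///etc/passwd",
            "https://docs.example.com.evil.net/guide",
        ],
        {"docs.example.com"},
    )
    print(list(pending))
```

Routing every addition through one function is what keeps the validation consistent: there is no second code path that could append an unchecked URL.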
@andrasfe
Contributor Author

Thank you @eyurtsev, and no problem at all. I saw you had no activity during this time frame, so I figured you were off. I completed the changes and added my two cents on some of them, but I'm open to further suggestions. I will open a PR against the new repo once I get confirmation from you that we are good to go.

@ccurme
Collaborator

ccurme commented Apr 29, 2025

Closing for now as we've already moved the library.

@ccurme ccurme closed this Apr 29, 2025
@andrasfe
Contributor Author

andrasfe commented May 1, 2025

Re-created the PR in the new repo: langchain-ai/langchain-community#13

fileames pushed a commit to rohanaggarwal7997/langchain that referenced this pull request Jun 4, 2025
…cursive-sitemap

community: Add recursive sitemap support to GitbookLoader with concurrent processing langchain-ai#30681