@aflansburg aflansburg commented Nov 26, 2024

TL;DR

Playwright's browser context accepts a storage_state parameter: a path to a JSON state file holding saved session data (cookies and local storage). This pull request lets that parameter be passed through to the Playwright loader, so a scraper built on the ChromiumLoader class can reuse an authenticated session. Handling authentication with Selenium or Playwright is one of the medium-term goals I noted; as a short-term step, this change allows for more flexible and secure scraping operations.

Example usage:

Imagine a workflow where, in some module, you use Playwright directly to authenticate a session. The browser state captured right after login can then be reused in later Playwright calls.

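import asyncio
import random

from playwright.async_api import Page


# app_config is the application's own settings object (not shown here).
async def _login(page: Page):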
    print("Logging in...")
    username = app_config.foo_username
    password = app_config.foo_password
    unusual_activity_text = app_config.unusual_activity_text

    await page.goto("https://foo.bar.com/i/flow/login")

    await page.get_by_role("textbox").fill(username)

    await page.get_by_role("button", name="Next").click()

    print("Password page detected.")
    await page.get_by_role("textbox", name="password").fill(password)
    await asyncio.sleep(random.uniform(1, 3))

    await page.get_by_role("button", name="Log in").click()

    profile_links = await page.locator("a[href='/your_profile']").element_handles()
    if len(profile_links) > 0:
        print("Logged in.")
        # save the state of the page
        await page.context.storage_state(path="data/state.json")
        return True
    else:
        print("Not logged in.")
        raise Exception("Unable to login.")

At the end of the _login function, the browser context's storage_state is saved to a file (data/state.json).
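Before handing the saved file to a scraper, a quick sanity check can save a confusing failure later. The sketch below is not part of this PR: load_storage_state and expired_cookie_names are hypothetical helper names. It relies only on the JSON layout Playwright writes, with top-level "cookies" and "origins" keys and a cookie "expires" field in Unix seconds (-1 for session cookies).

```python
import json
import time
from pathlib import Path
from typing import List, Optional


def load_storage_state(path: str) -> dict:
    """Load a Playwright storage-state file and run basic sanity checks."""
    state_file = Path(path)
    if not state_file.is_file():
        raise FileNotFoundError(f"No storage state file at {path!r}; log in first.")

    state = json.loads(state_file.read_text(encoding="utf-8"))

    # An entirely empty state usually means the login step did not succeed.
    if not state.get("cookies") and not state.get("origins"):
        raise ValueError("Storage state is empty; nothing to reuse.")
    return state


def expired_cookie_names(state: dict, now: Optional[float] = None) -> List[str]:
    """Return the names of cookies whose expiry time has already passed.

    Session cookies are stored with expires == -1 and are skipped.
    """
    now = time.time() if now is None else now
    return [
        cookie["name"]
        for cookie in state.get("cookies", [])
        if cookie.get("expires", -1) != -1 and cookie["expires"] < now
    ]
```

If expired_cookie_names returns the session cookie your target site depends on, re-running the login step is probably simpler than debugging a scrape that silently lands on the login page.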

Then, when using scrapegraph-ai, pass the path into the graph by including the storage_state parameter in graph_config so that Playwright can use it:

from scrapegraphai.graphs import OmniScraperGraph

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o",
    },
    "max_images": 10,
    "verbose": True,
    "headless": False,
    "storage_state": "data/state.json",
}


async def execute_scraper_graph():
    graph = OmniScraperGraph(
        app_config.prompt,
        app_config.url,
        graph_config,
    )
    result = graph.run()
    print(result)

Summary of Changes

This pull request includes several changes to improve the functionality and maintainability of the scrapegraphai package, specifically focusing on the chromium.py, abstract_graph.py, and code_generator_graph.py files. The most important changes include adding support for storage_state, improving error handling, and reformatting code for better readability.

Note that some changes were introduced by running the ruff formatter; they do not affect functionality in any way. Please let me know if this is an issue (ruff is really good).

Enhancements to chromium.py:
• Added storage_state parameter to the ChromiumLoader class to support session storage.
• Updated ascrape_playwright and ascrape_with_js_support methods to use storage_state when creating a new browser context.
• Improved error message formatting for better readability.
• Reformatted conditional logic in lazy_load and alazy_load methods for clarity.
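
As a rough illustration of the chromium.py change, here is how a storage_state path can be handed to Playwright when a browser context is created. This is a minimal sketch, not the actual ChromiumLoader code: context_kwargs and fetch_html are hypothetical names, and only the Playwright calls themselves (async_playwright, chromium.launch, new_context, new_page, goto, content) are real API.

```python
from typing import Any, Dict, Optional


def context_kwargs(storage_state: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for browser.new_context().

    The storage_state key is only included when a path was supplied, so the
    default behaviour (a fresh, unauthenticated context) is unchanged.
    """
    kwargs: Dict[str, Any] = {}
    if storage_state:
        kwargs["storage_state"] = storage_state
    return kwargs


async def fetch_html(
    url: str,
    storage_state: Optional[str] = None,
    headless: bool = True,
) -> str:
    """Fetch a page's HTML, optionally reusing a saved session."""
    # Imported lazily so context_kwargs can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=headless)
        try:
            # The saved cookies and local storage are applied to this context.
            context = await browser.new_context(**context_kwargs(storage_state))
            page = await context.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            await browser.close()
```

With Playwright installed, asyncio.run(fetch_html(url, storage_state="data/state.json")) would load the page with the saved session applied, provided the stored cookies are still valid for that site.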

Enhancements to abstract_graph.py:
• Reformatted import statements and function definitions for better readability.
• Added storage_state configuration to the AbstractGraph class initialization.
• Improved readability of the _create_llm method by reformatting and simplifying code.

Enhancements to code_generator_graph.py:
• Reformatted import statements and function definitions for better readability.
• Added storage_state configuration to the CodeGeneratorGraph class initialization and graph creation.
• Improved readability of the _create_graph method by reformatting and simplifying code.
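
To make the graph-level change concrete, the sketch below shows one way an optional storage_state key in a config dictionary could be read at initialization and forwarded to a loader. It is an assumption-laden illustration only: SketchGraph and its loader_kwargs method are hypothetical and do not reproduce the AbstractGraph or CodeGeneratorGraph code touched by this PR.

```python
from typing import Any, Dict, Optional


class SketchGraph:
    """Hypothetical stand-in for a graph class; not scrapegraph-ai's own code."""

    def __init__(self, prompt: str, source: str, config: Dict[str, Any]):
        self.prompt = prompt
        self.source = source
        # Both keys are optional; a missing storage_state means "no saved session".
        self.headless: bool = config.get("headless", True)
        self.storage_state: Optional[str] = config.get("storage_state")

    def loader_kwargs(self) -> Dict[str, Any]:
        """Collect the options a browser-based loader would need."""
        kwargs: Dict[str, Any] = {"headless": self.headless}
        if self.storage_state is not None:
            kwargs["storage_state"] = self.storage_state
        return kwargs
```

Reading the key with config.get keeps existing configurations working unchanged, since graphs created without storage_state simply never pass it on.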


@VinciGit00 VinciGit00 left a comment

Hi @aflansburg, please add an example of this in the examples/extra folder and, if possible, add the auth to the other graphs.

@aflansburg

Hi @VinciGit00 I'll add the example. I tried to add it to all graphs that might leverage the ChromiumLoader class, but I'll go back through and confirm!

@aflansburg

@VinciGit00 I added the example and the storage_state parameter to DocumentScraperGraph and SearchGraph

@aflansburg aflansburg requested a review from VinciGit00 December 1, 2024 18:30

@VinciGit00 VinciGit00 left a comment

thanks

does it work with each login page?

@VinciGit00 VinciGit00 merged commit a86e7d6 into ScrapeGraphAI:pre/beta Dec 3, 2024
1 check passed

github-actions bot commented Dec 5, 2024

🎉 This PR is included in version 1.33.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀


github-actions bot commented Dec 5, 2024

🎉 This PR is included in version 1.33.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@aflansburg aflansburg deleted the pre/beta branch December 8, 2024 17:57