
After Dockerizing the Selenium Web App not opening the webpage that needs to be crawled. #2976


Closed
Orbiszeus opened this issue Jul 30, 2024 · 9 comments
Labels
invalid usage · UC Mode / CDP Mode

Comments

@Orbiszeus

Dear Michael,
I have built a full web app that scrapes a site and gathers some information. Locally everything runs perfectly; however, after dockerizing the application, an exception I had never seen before is raised:

INFO:     127.0.0.1:44352 - "POST /crawl_menu HTTP/1.1" 200 OK
        Exception in Getir Crawler:  Message: 
 Element {button[aria-label='Tümünü Reddet']} was not present after 7 seconds!

I have never gotten that before. I think that inside the Docker container the crawler runs with headless=True and tries to reach the site within 6 seconds but cannot. What should I do to work around that? I will provide my crawler.py and Dockerfile below.

crawler.py:

import json

import pandas as pd
from seleniumbase import SB


def g_crawler(url, is_area):
    menu_items = []
    if not is_area:
        with SB(uc=True, headless=True) as sb:
            sb.driver.uc_open_with_reconnect(url, 6)
            try:
                sb.uc_gui_handle_cf()  # handle the Cloudflare check if present
                sb.sleep(3)
                # Dismiss the cookie banner ("Tümünü Reddet" = "Reject All")
                sb.click("button[aria-label='Tümünü Reddet']")
                sb.sleep(3)
                all_items = sb.find_elements("div[class='sc-be09943-2 gagwGV']")
                for item in all_items:
                    product_name = item.find_element("css selector", "h4[class='style__Title4-sc-__sc-1nwjacj-5 jrcmhy sc-be09943-0 bpfNyi']").text
                    sb.sleep(2)
                    try:
                        product_description = item.find_element("css selector", "p[contenteditable='false']").text
                    except Exception:
                        product_description = "No description for this product."
                    sb.sleep(2)
                    product_price = item.find_element("css selector", "span[class='style__Text-sc-__sc-1nwjacj-0 jbOUDC sc-be09943-5 kA-DgzG']").text
                    sb.sleep(2)
                    menu_item = {
                        "Menu Item": product_name,
                        "Menu Ingredients": product_description,
                        "Price": product_price
                    }
                    if product_name == "Poşet":  # skip the "bag" line item
                        continue
                    menu_items.append(menu_item)
                menu_items_json = json.dumps(menu_items, ensure_ascii=False, indent=4)
                menu_items_list = json.loads(menu_items_json)
                df = pd.DataFrame(menu_items_list)
                # title = sb.get_title()
                # excel_file = f'{title}_getir_menu.xlsx'
                # df.to_excel(excel_file, index=False)
                return df.to_json(orient='split')
            except Exception as e:
                print(f"Exception in Getir Crawler:  {e}")

My Dockerfile:

# Use a smaller base image
FROM python:3.10-slim

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV TZ=Europe/Istanbul
ENV LC_ALL=tr_TR.UTF-8
ENV LANG=tr_TR.UTF-8

# Set the working directory in the container
WORKDIR /app

# Install dependencies and Chrome in one layer to keep image size smaller
RUN apt-get update && apt-get install -y \
     wget \
     gnupg \
     unzip \
     curl \
     ca-certificates \
     fonts-liberation \
     libappindicator3-1 \
     libasound2 \
     libatk-bridge2.0-0 \
     libatk1.0-0 \
     libcups2 \
     libdbus-1-3 \
     libgdk-pixbuf2.0-0 \
     libnspr4 \
     libnss3 \
     libx11-xcb1 \
     libxcomposite1 \
     libxdamage1 \
     libxrandr2 \
     xdg-utils \
     locales \
     --no-install-recommends \
     && ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone \
     && wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb \
     && apt-get install ./google-chrome-stable_current_amd64.deb --yes \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*

# Configure locale settings for Türkiye
RUN echo "LC_ALL=tr_TR.UTF-8" >> /etc/environment \
     && echo "LANG=tr_TR.UTF-8" >> /etc/environment \
     && locale-gen tr_TR.UTF-8

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY . .

# Expose the ports for FastAPI and Streamlit
EXPOSE 8000 8501

# Command to run FastAPI and Streamlit
CMD ["sh", "-c", "uvicorn menu_crawler:app --host 0.0.0.0 --port 8000 & streamlit run Hotel_Analyst.py"]

@Orbiszeus Orbiszeus changed the title After Dockerizing the Selenium Web App not responding to the sb elements. After Dockerizing the Selenium Web App not opening the webpage that needs to be crawled. Jul 30, 2024
@mdmintz mdmintz added the invalid usage and UC Mode / CDP Mode labels Jul 30, 2024
@mdmintz (Member) commented Jul 30, 2024

UC Mode doesn't support headless mode anymore because UC Mode uses pyautogui for lots of things, and that doesn't work with a headless browser. (You don't need headless mode on Linux anymore because of the special Xvfb virtual display.)

Also, running a browser in Docker can leave a trail that makes the automation detectable from websites (due to unique fingerprints). I'd suggest not using Docker for UC Mode unless you know how to configure your Docker container to have the appearance of regular Linux (where UC Mode works normally).

You can try increasing the default reconnect_time value to see if that helps with slow page loads.
You can also try saving a screenshot and inspecting it to find out why the element wasn't present after the page load.
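
A minimal sketch of both suggestions, reusing the names from the crawler above (the placeholder URL, the 20-second reconnect time, and the screenshot filename are arbitrary example values):

from seleniumbase import SB

url = "https://example.com"  # placeholder for the real page being crawled

with SB(uc=True, xvfb=True) as sb:  # xvfb=True: Xvfb virtual display instead of headless
    # A longer reconnect_time than the original 6 seconds, for slow page loads:
    sb.driver.uc_open_with_reconnect(url, reconnect_time=20)
    # Save a screenshot to inspect what actually rendered:
    sb.save_screenshot("after_load.png")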

@mdmintz mdmintz closed this as completed Jul 30, 2024
@Orbiszeus (Author)

Thank you so much for your help. Can you please tell me how to avoid leaving that trail in my Dockerfile?

@mdmintz (Member) commented Jul 30, 2024

I'm not sure how to undo the fingerprint changes that Docker added.
That's why I use regular Linux for UC Mode, and not a Docker-branded Linux.

@Orbiszeus (Author) commented Jul 30, 2024

But without Docker, how can I deploy my crawler and install Chrome on a host machine like I was doing with Docker?
Because I keep getting this when deployed with headless off:

    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: cannot connect to chrome at 127.0.0.1:9222
from chrome not reachable

@Orbiszeus (Author)

Please help me. I have been trying to find a way to publish my crawler web app that uses your software. As you mentioned, it needs Chrome/chrome-stable, so to achieve this I had to dockerize the app and install Chrome on the Linux machine that runs all my crawler functions. But I am still stuck.

@mdmintz (Member) commented Jul 30, 2024

For regular SeleniumBase (non-UC Mode), try this Dockerfile: SeleniumBase/Dockerfile.

But if you need to avoid automation-detection (e.g., bypassing Cloudflare), then don't use Docker.

@Orbiszeus (Author)

Okay, I will be trying it. Thanks!

@Orbiszeus (Author)

Okay, hello again. I tried your Dockerfile. However, it is raising an error:

 > [16/23] COPY virtualenv_install.sh /SeleniumBase/virtualenv_install.sh:
------
Dockerfile:77
--------------------
  75 |     COPY pytest.ini /SeleniumBase/pytest.ini
  76 |     COPY setup.cfg /SeleniumBase/setup.cfg
  77 | >>> COPY virtualenv_install.sh /SeleniumBase/virtualenv_install.sh
  78 |     RUN find . -name '*.pyc' -delete
  79 |     RUN pip install --upgrade pip setuptools wheel
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 35d2166f-7983-467f-904c-36f44fbc4b38::ycw93pzwijz37f1shofemcanl: "/virtualenv_install.sh": not found

@Orbiszeus (Author) commented Jul 31, 2024

Also, the other thing is that I need headless=True. I know that it is not used anymore, but the server cannot run with headless=False, and I also need to specify some options like no-sandbox and disable-dev-shm-usage. I could not find them in sb_manager.py; I could only write sb_config.no_sandbox = True, and I don't know whether it will work like this or not.
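
For reference, a sketch of one way to pass such flags, assuming SB() accepts a chromium_arg parameter (a comma-separated string of extra Chromium arguments, as listed among sb_manager.py's parameters):

from seleniumbase import SB

# Assumed: chromium_arg takes a comma-separated string of Chromium flags.
with SB(uc=True, xvfb=True,
        chromium_arg="--no-sandbox,--disable-dev-shm-usage") as sb:
    sb.open("https://example.com")  # placeholder URL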
