Getting Random SSL Errors with upload_from_file function #992


Closed
shubham07507 opened this issue Feb 6, 2023 · 11 comments
Assignees
Labels
  • api: storage - Issues related to the googleapis/python-storage API.
  • needs more info - This issue needs more information from the customer to proceed.
  • priority: p3 - Desirable enhancement or fix. May not be included in next release.
  • type: question - Request for information or clarification. Not an issue.

Comments

@shubham07507

requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/athena-samples-prod/o?uploadType=multipart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1056)')))

at .send ( /usr/local/lib/python3.7/site-packages/requests/adapters.py:563 )
at .send ( /usr/local/lib/python3.7/site-packages/requests/sessions.py:701 )
at .request ( /usr/local/lib/python3.7/site-packages/requests/sessions.py:587 )
at .request ( /usr/local/lib/python3.7/site-packages/google/auth/transport/requests.py:555 )
at .retriable_request ( /usr/local/lib/python3.7/site-packages/google/resumable_media/requests/upload.py:146 )
at .wait_and_retry ( /usr/local/lib/python3.7/site-packages/google/resumable_media/requests/_request_helpers.py:148 )
at .wait_and_retry ( /usr/local/lib/python3.7/site-packages/google/resumable_media/requests/_request_helpers.py:171 )
at .transmit ( /usr/local/lib/python3.7/site-packages/google/resumable_media/requests/upload.py:154 )
at ._do_multipart_upload ( /usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py:1890 )
at ._do_upload ( /usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py:2367 )
at .upload_from_file ( /usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py:2552 )
at .upload_from_filename ( /usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py:2696 )

@atulep

atulep commented Feb 6, 2023

Please share your code.

@atulep atulep added the needs more info This issue needs more information from the customer to proceed. label Feb 6, 2023
@atulep atulep self-assigned this Feb 6, 2023
@parthea parthea transferred this issue from googleapis/google-cloud-python Feb 15, 2023
@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/python-storage API. label Feb 15, 2023
@parthea
Contributor

parthea commented Feb 15, 2023

Transferring to the python-storage repo for triage

@cojenco
Contributor

cojenco commented Feb 17, 2023

Hi @shubham07507, could you please elaborate on your use case along with a code snippet for investigation? Also, could you share which versions of google-cloud-storage and google-resumable-media you're using?

@ddelgrosso1 ddelgrosso1 added priority: p2 Moderately-important priority. Fix may not be included in next release. priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Feb 21, 2023
@cojenco cojenco added the type: question Request for information or clarification. Not an issue. label Feb 21, 2023
@cojenco
Contributor

cojenco commented Feb 24, 2023

Closing this issue for now. Happy to reopen if you have more information or questions.

@cojenco cojenco closed this as completed Feb 24, 2023
@cdeln

cdeln commented May 19, 2024

I am also getting random errors. They are completely spurious; I cannot reproduce the issue deterministically. I'll give you as much info as I can, please ask for more if you need it. I will also keep adding logging to gather more clues about why this is happening.

Execution environment: Cloud Run Job, 8 CPUs, 16 GB memory, 10 tasks in parallel (1 / 10 fails spuriously).

System info:

  • OS: Ubuntu 22.04
  • Python version: 3.10 (shipped with Ubuntu 22.04)
  • google-cloud-storage==2.16.0
  • google-resumable-media==2.7.0

Stack traces below (Note that I have anonymized sensitive names, such as SECRET_BUCKET_NAME).

Stack trace (level 1)

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 715, in urlopen
  httplib_response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
  conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 244, in request
  super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.10/http/client.py", line 1283, in request
  self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
  self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
  self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1077, in _send_output
  self.send(chunk)
File "/usr/lib/python3.10/http/client.py", line 999, in send
  self.sock.sendall(data)
File "/usr/lib/python3.10/ssl.py", line 1266, in sendall
  v = self.send(byte_view[count:])
File "/usr/lib/python3.10/ssl.py", line 1235, in send
  return self._sslobj.write(data)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2426)

Stack trace (level 2)

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/SECRET_BUCKET_NAME/o?uploadType=multipart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))

Stack trace (level 3)

Traceback (most recent call last):
  File "/app/tasks/task.py", line 74, in <module>
    main()
  File "/app/tasks/task.py", line 71, in main
    run(workspace, args)
  File "/app/tasks/task.py", line 50, in run
    upload(output_bucket, local_output_path, output_file)
  File "/app/taskutil.py", line 63, in upload
    blob.upload_from_filename(local_path)
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 2959, in upload_from_filename
    self._handle_filename_and_upload(
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 2829, in _handle_filename_and_upload
    self._prep_and_do_upload(
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 2637, in _prep_and_do_upload
    created_json = self._do_upload(
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 2443, in _do_upload
    response = self._do_multipart_upload(
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 1956, in _do_multipart_upload
    response = upload.transmit(
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/upload.py", line 153, in transmit
    return _request_helpers.wait_and_retry(
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/_request_helpers.py", line 178, in wait_and_retry
    raise error
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/_request_helpers.py", line 155, in wait_and_retry
    response = func()
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/upload.py", line 145, in retriable_request
    result = transport.request(
  File "/usr/local/lib/python3.10/dist-packages/google/auth/transport/requests.py", line 541, in request
    response = super(AuthorizedSession, self).request(
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 563, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/SECRET_BUCKET_NAME/o?uploadType=multipart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))

Worth noting: the job runs quite a heavy workload and pushes performance with multiprocessing.Pool. I suspected OOM issues at first, but the Metrics tab under Cloud Run Jobs shows about 20% memory usage. I am not sure how granular that metric is; maybe there is a better way to get stats. If you have any suggestions, please advise.
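One thing that may be worth ruling out (my assumption, not something confirmed in this thread): HTTPS connection pools are generally not fork-safe, so if a storage client or authorized session created in the parent process is reused inside multiprocessing.Pool workers, forked workers can interleave writes on the same TLS stream and produce exactly this kind of SSLEOFError. A minimal sketch of the safer pattern, creating the client per worker; the actual storage calls are left as comments since they need real credentials:

```python
import multiprocessing

# from google.cloud import storage  # assumed available in the job image

_client = None  # per-process client, created after the fork

def init_worker():
    # Runs once in each worker process, after the fork, so every worker
    # gets its own client and therefore its own HTTPS connection pool.
    global _client
    _client = "per-process-client"  # placeholder for storage.Client()

def upload_one(task):
    bucket_name, local_path, blob_name = task
    # _client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
    return blob_name  # report which object this worker handled

def run_uploads(tasks, processes=2):
    # The initializer guarantees no worker inherits a parent-created client.
    with multiprocessing.Pool(processes=processes, initializer=init_worker) as pool:
        return pool.map(upload_one, tasks)
```

If the parent process never creates a client before forking, the shared-connection failure mode is off the table.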

@cojenco
Contributor

cojenco commented Jun 6, 2024

Hi @cdeln, thanks for reporting. This seems to be caused by issues in the underlying cpython and urllib3 packages. The upstream urllib3 issue has not been resolved yet, so I'd suggest trying a few workarounds in the meantime:

  • Switch to Python 3.9 (the cpython issue does not affect Python versions < 3.10)
  • Ensure that retries are enabled for uploads: either (a) use if_generation_match preconditions with the upload method, as shown here, or (b) modify the call to use DEFAULT_RETRY, as shown below
from google.cloud.storage.retry import DEFAULT_RETRY

blob.upload_from_file(file_obj, retry=DEFAULT_RETRY)
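To illustrate conceptually what DEFAULT_RETRY buys you (a stdlib-only sketch of the idea, not the client's actual implementation): transient transport errors, SSLEOFError included, are caught and the request is re-issued with exponential backoff:

```python
import ssl
import time

def retry_transient(func, attempts=5, base_delay=0.01):
    """Retry func on transient SSL/connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except (ssl.SSLError, ConnectionError):
            if attempt == attempts - 1:
                raise  # retries exhausted; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky upload: fails twice with SSLEOFError, then succeeds.
# (ssl.SSLEOFError is a subclass of ssl.SSLError, so it is caught above.)
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ssl.SSLEOFError("EOF occurred in violation of protocol")
    return "ok"

print(retry_transient(flaky_upload))  # -> ok
```

Without a retry policy, the first SSLEOFError would propagate exactly as in the stack traces above.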

[cpython]

  • The cpython bug affects Python 3.10+. Errors such as broken pipe and connection reset by peer that occur when the connection is interrupted during SSL communication are all surfaced as SSLEOFError instead.
  • The bug is fixed in the latest Python version but has not been backported to Python 3.10 and 3.11. Applications running on Python 3.9 or earlier are unaffected.

[urllib3]

While urllib3.exceptions.SSLError is considered a retryable error for upload operations in the python-storage client, without further resolution in the underlying urllib3 library it is quite possible that the SSL error will still exhaust the retries.

@cdeln

cdeln commented Jun 7, 2024

@cojenco Thanks for your support. I do not see the need for a retry policy, as my service runs in a completely managed solution provided by Google in the cloud. I see why retries become important if you are accessing Cloud Storage from something like a mobile device. Two example scenarios:

  1. The user drives into a tunnel with their car
  2. The user leaves their home and switches networks

But my use case runs entirely in the cloud, and there is nothing to disturb the network! I think adding a retry policy will only hide the problem, not solve it. Please correct me if I have misunderstood the retry policy.

With that said, I am reading up on the links to the cpython and urllib3 threads. These are both quite technical, so again, please correct me if I am wrong.

According to the cpython thread, there is a regression that changes some instances of OSError into SSLEOFError, which breaks some downstream libraries, including urllib3. See this (unresolved) issue: urllib3/urllib3#3382. This thread does not address the underlying connection error (at least as far as I can see).

According to the urllib3 thread, the issue seems related to the SSL handshake, which then somehow gets surfaced as an SSLEOFError. If I read the thread correctly, they indicate that the issue is a protocol violation on the server side: the client uses a weak cipher configuration, the server requires a stricter one, and the server abruptly closes the connection in violation of the protocol.

The weirdest part of this is still that it's random, which feels like a stability issue with GCS backend, but this is impossible for me to verify.

@cojenco
Contributor

cojenco commented Jun 7, 2024

There are various reasons (broken connections, network congestion, etc.) that can cause connection issues inside the Google network boundaries. When the connection is interrupted during SSL communication, it is not surprising to see transient errors such as broken pipe, connection reset by peer, or urllib3.exceptions.SSLError. The python-storage client considers these transient errors safe and useful to retry. That is why we recommend applying retry strategies in user applications, as documented in https://cloud.google.com/storage/docs/retry-strategy.
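For reference, the retry-strategy document linked above describes truncated exponential backoff. A hedged sketch of that schedule (parameter values here are illustrative, not the client's actual defaults):

```python
import random

def backoff_delays(initial=1.0, multiplier=2.0, maximum=32.0, deadline=60.0):
    """Yield jittered, truncated exponential backoff delays.

    Stops once the worst-case cumulative sleep would exceed the deadline.
    """
    delay, elapsed = initial, 0.0
    while elapsed + delay <= deadline:
        # Full jitter: sleep a random amount up to the current cap.
        yield random.uniform(0, delay)
        elapsed += delay
        # Grow the cap exponentially, truncated at the maximum.
        delay = min(delay * multiplier, maximum)
```

A caller would iterate over these delays between retry attempts, giving up when the generator is exhausted.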

As for the Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')) error you're seeing, I think this is specifically impacted by the cpython and urllib3 issues. The urllib3 package gracefully handles broken pipe/connection reset by peer errors but raises SSLErrors. Overall network instability will intensify the number of SSLErrors we see. @cdeln have you had the chance to observe what happens when running in Python 3.9?

@cojenco
Contributor

cojenco commented Jun 14, 2024

Hi @cdeln, following up to see if you have had the chance to try running in Python 3.9. I haven't been able to reproduce this. Is there a code snippet you could share?

@cdeln

cdeln commented Jun 14, 2024

Not yet, I haven't had time, but I am thinking about a debugging strategy. It is unfortunate that I discovered this so late in my development process; I don't want to use my complex workflow as a basis for debugging.
To debug this properly, we can split it into several sub-tasks:

  1. Reproduce: even if the failure is random, we should be able to reproduce it statistically. I suggest we set up a minimal Cloud Run job with a Dockerfile + Python script that uploads a blob to storage, then run the job many times and count how often it fails.

  2. After reproducing the bug, fix it, either by downgrading Python or by some other means

There are some variables that need to be considered, such as the compute and storage region. I am using europe-north1 at the moment. Any other variables to consider?

This is just a proposed debugging strategy; please improve it with any ideas you may have.

Here is a Dockerfile + Python script (main.py) and utility bash scripts for starters.

FROM ubuntu:22.04

WORKDIR /app/

ENV DEBIAN_FRONTEND noninteractive

ENV PYTHONUNBUFFERED True

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir \
    google-cloud-storage==2.16.0

COPY main.py .

ENTRYPOINT ["python3", "main.py"]

main.py:

#!/usr/bin/env python3
import argparse
import os
import uuid

import google.cloud.storage

parser = argparse.ArgumentParser()
parser.add_argument('bucket')
parser.add_argument('--filename', default=str(uuid.uuid4()))
parser.add_argument('--filesize', type=int, default=1_000_000_000)
args = parser.parse_args()

storage = google.cloud.storage.Client()
bucket = storage.bucket(args.bucket)
blob = bucket.blob(args.filename)
payload = b'A' * args.filesize
print(f'Uploading {len(payload)} bytes to gs://{args.bucket}/{args.filename}')
blob.upload_from_string(payload)
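Since the failure is statistical, a variant of the script above could loop and count SSL failures instead of crashing on the first one (a hypothetical sketch; the upload callable is injected so the counting loop itself needs no GCS access):

```python
import ssl

def count_ssl_failures(upload_once, iterations):
    """Call upload_once repeatedly; return (successes, ssl_failures)."""
    ok = failed = 0
    for _ in range(iterations):
        try:
            upload_once()
            ok += 1
        except ssl.SSLError:
            # SSLEOFError is a subclass of SSLError, so it is counted here.
            failed += 1
    return ok, failed

# In the real job this would be wired up roughly as:
#   upload_once = lambda: bucket.blob(str(uuid.uuid4())).upload_from_string(payload)
#   print(count_ssl_failures(upload_once, iterations=100))
```

A failure rate over many iterations would be a much stronger signal than a single crash.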

I build the Docker image with this script

#!/usr/bin/env bash

docker build -t google-cloud-storage-ssl-error:latest .

and run locally with this script (there is no need to mount credentials when deploying the job, of course)

#!/usr/bin/env bash

docker run \
       -v ~/.config/gcloud/application_default_credentials.json:/app/credentials.json \
       -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json \
       -e GCLOUD_PROJECT=$(gcloud config get project) \
       google-cloud-storage-ssl-error:latest "$@"

The script is parameterized with the bucket name, the object filename (defaults to a unique ID on every invocation), and the payload size. Can we settle on sensible defaults for filename and filesize? If we can, we remove more variables from the problem. A constant filename (say testfile.dat) would be ideal to avoid creating a lot of junk objects in storage, and ideally filesize should be as low as possible, though if set too low we might not be able to trigger the bug.

I set up a repo with the code as well: https://github.com/cdeln/google-cloud-storage-ssl-bug

@cdeln

cdeln commented Jun 18, 2024

Hi, I've also added a workflow and tweaked the existing scripts a bit (see the repo).

main:
  params: []
  steps:
    - init:
        assign:
          - project_id: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          - job_location: europe-north1
          - job_namespace: ${"namespaces/" + project_id + "/jobs/"}
          - bucket: ${sys.get_env("BUCKET")}
    - loop:
        for:
          value: i
          range: ${[0, 100]}
          steps:
            - upload:
                call: googleapis.run.v1.namespaces.jobs.run
                args:
                  name: ${job_namespace + "google-cloud-storage-ssl-error"}
                  location: ${job_location}
                  body:
                    overrides:
                      containerOverrides:
                        args:
                          - ${bucket}
                          - --folder
                          - testfolder

I configure the Run Job with 1 vCPU and 4 GB of memory, with 40 tasks in parallel.
I've managed to run the workflow for 20 iterations without any issues.
The current state for me: my "real" code fails spuriously, while the "repro" code works fine.
I'll have to carefully dismantle the "real" workflow until it starts working again. Hopefully, given enough time, I can figure out what the triggering issue is. I'll keep you posted.
