
tests.system.test_gbq.TestToGBQIntegration: test_upload_data_with_valid_user_schema failed #511


Closed
flaky-bot bot opened this issue Apr 1, 2022 · 1 comment
Labels
api: bigquery (Issues related to the googleapis/python-bigquery-pandas API.)
flakybot: flaky (Tells the Flaky Bot not to close or comment on this issue.)
flakybot: issue (An issue filed by the Flaky Bot. Should not be added manually.)
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments


flaky-bot bot commented Apr 1, 2022

This test failed!

To configure my behavior, see the Flaky Bot documentation.

If I'm commenting on this issue too often, add the flakybot: quiet label and
I will stop commenting.


commit: 82d38ea
buildURL: Build Status, Sponge
status: failed

Test output
self = <google.cloud.bigquery.client.Client object at 0x7fd1ec068ac0>
file_obj = <_io.BufferedReader name='/tmp/tmpblhmw9v2_job_8006a745.parquet'>
destination = TableReference(DatasetReference('precise-truck-742', 'python_bigquery_pandas_tests_system_20220401031324_b7a6ab'), 'new_test18')
rewind = True, size = 1649, num_retries = 6
job_id = '8006a745-3fc5-4e12-9ebe-2520d654d042', job_id_prefix = None
location = None, project = 'precise-truck-742'
job_config = <google.cloud.bigquery.job.load.LoadJobConfig object at 0x7fd1ec078220>
timeout = None
def load_table_from_file(
    self,
    file_obj: IO[bytes],
    destination: Union[Table, TableReference, TableListItem, str],
    rewind: bool = False,
    size: int = None,
    num_retries: int = _DEFAULT_NUM_RETRIES,
    job_id: str = None,
    job_id_prefix: str = None,
    location: str = None,
    project: str = None,
    job_config: LoadJobConfig = None,
    timeout: ResumableTimeoutType = DEFAULT_TIMEOUT,
) -> job.LoadJob:
    """Upload the contents of this table from a file-like object.

    Similar to :meth:`load_table_from_uri`, this method creates, starts and
    returns a :class:`~google.cloud.bigquery.job.LoadJob`.

    Args:
        file_obj:
            A file handle opened in binary mode for reading.
        destination:
            Table into which data is to be loaded. If a string is passed
            in, this method attempts to create a table reference from a
            string using
            :func:`google.cloud.bigquery.table.TableReference.from_string`.

    Keyword Arguments:
        rewind:
            If True, seek to the beginning of the file handle before
            reading the file.
        size:
            The number of bytes to read from the file handle. If size is
            ``None`` or large, resumable upload will be used. Otherwise,
            multipart upload will be used.
        num_retries: Number of upload retries. Defaults to 6.
        job_id: Name of the job.
        job_id_prefix:
            The user-provided prefix for a randomly generated job ID.
            This parameter will be ignored if a ``job_id`` is also given.
        location:
            Location where to run the job. Must match the location of the
            destination table.
        project:
            Project ID of the project of where to run the job. Defaults
            to the client's project.
        job_config:
            Extra configuration options for the job.
        timeout:
            The number of seconds to wait for the underlying HTTP transport
            before using ``retry``. Depending on the retry strategy, a request
            may be repeated several times using the same timeout each time.

            Can also be passed as a tuple (connect_timeout, read_timeout).
            See :meth:`requests.Session.request` documentation for details.

    Returns:
        google.cloud.bigquery.job.LoadJob: A new load job.

    Raises:
        ValueError:
            If ``size`` is not passed in and can not be determined, or if
            the ``file_obj`` can be detected to be a file opened in text
            mode.

        TypeError:
            If ``job_config`` is not an instance of :class:`~google.cloud.bigquery.job.LoadJobConfig`
            class.
    """
    job_id = _make_job_id(job_id, job_id_prefix)

    if project is None:
        project = self.project

    if location is None:
        location = self.location

    destination = _table_arg_to_table_ref(destination, default_project=self.project)
    job_ref = job._JobReference(job_id, project=project, location=location)
    if job_config:
        job_config = copy.deepcopy(job_config)
        _verify_job_config_type(job_config, google.cloud.bigquery.job.LoadJobConfig)
    load_job = job.LoadJob(job_ref, None, destination, self, job_config)
    job_resource = load_job.to_api_repr()

    if rewind:
        file_obj.seek(0, os.SEEK_SET)

    _check_mode(file_obj)

    try:
        if size is None or size >= _MAX_MULTIPART_SIZE:
            response = self._do_resumable_upload(
                file_obj, job_resource, num_retries, timeout, project=project
            )
        else:
>           response = self._do_multipart_upload(
                file_obj, job_resource, size, num_retries, timeout, project=project
            )

.nox/system-3-9/lib/python3.9/site-packages/google/cloud/bigquery/client.py:2423:


self = <google.cloud.bigquery.client.Client object at 0x7fd1ec068ac0>
stream = <_io.BufferedReader name='/tmp/tmpblhmw9v2_job_8006a745.parquet'>
metadata = {'configuration': {'load': {'destinationTable': {'datasetId': 'python_bigquery_pandas_tests_system_20220401031324_b7a6... 'PARQUET', ...}}, 'jobReference': {'jobId': '8006a745-3fc5-4e12-9ebe-2520d654d042', 'projectId': 'precise-truck-742'}}
size = 1649, num_retries = 6, timeout = None, project = 'precise-truck-742'

def _do_multipart_upload(
    self,
    stream: IO[bytes],
    metadata: Mapping[str, str],
    size: int,
    num_retries: int,
    timeout: Optional[ResumableTimeoutType],
    project: Optional[str] = None,
):
    """Perform a multipart upload.

    Args:
        stream: A bytes IO object open for reading.

        metadata: The metadata associated with the upload.

        size:
            The number of bytes to be uploaded (which will be read
            from ``stream``). If not provided, the upload will be
            concluded once ``stream`` is exhausted (or :data:`None`).

        num_retries:
            Number of upload retries. (Deprecated: This
            argument will be removed in a future release.)

        timeout:
            The number of seconds to wait for the underlying HTTP transport
            before using ``retry``. Depending on the retry strategy, a request may
            be repeated several times using the same timeout each time.

            Can also be passed as a tuple (connect_timeout, read_timeout).
            See :meth:`requests.Session.request` documentation for details.

        project:
            Project ID of the project of where to run the upload. Defaults
            to the client's project.

    Returns:
        requests.Response:
            The "200 OK" response object returned after the multipart
            upload request.

    Raises:
        ValueError:
            if the ``stream`` has fewer than ``size``
            bytes remaining.
    """
    data = stream.read(size)
    if len(data) < size:
        msg = _READ_LESS_THAN_SIZE.format(size, len(data))
        raise ValueError(msg)

    headers = _get_upload_headers(self._connection.user_agent)

    if project is None:
        project = self.project

    # TODO: Increase the minimum version of google-cloud-core to 1.6.0
    # and remove this logic. See:
    # https://github.com/googleapis/python-bigquery/issues/509
    hostname = (
        self._connection.API_BASE_URL
        if not hasattr(self._connection, "get_api_base_url_for_mtls")
        else self._connection.get_api_base_url_for_mtls()
    )
    upload_url = _MULTIPART_URL_TEMPLATE.format(host=hostname, project=project)
    upload = MultipartUpload(upload_url, headers=headers)

    if num_retries is not None:
        upload._retry_strategy = resumable_media.RetryStrategy(
            max_retries=num_retries
        )
>   response = upload.transmit(
        self._http, data, metadata, _GENERIC_CONTENT_TYPE, timeout=timeout
    )

.nox/system-3-9/lib/python3.9/site-packages/google/cloud/bigquery/client.py:2976:


self = <google.resumable_media.requests.upload.MultipartUpload object at 0x7fd1ec078700>
transport = <google.auth.transport.requests.AuthorizedSession object at 0x7fd1ec068a90>
data = b'PAR1\x15\x04\x15P\x154L\x15\n\x15\x04\x12\x00\x00(\x00\x002\x01\x00\x04\xf0?\r\x0f\x00@\t\x08$\x08@\x00\x00\x00\x00...=\x00\x18\x1fparquet-cpp-arrow version 7.0.0\x19L\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x00[\x03\x00\x00PAR1'
metadata = {'configuration': {'load': {'destinationTable': {'datasetId': 'python_bigquery_pandas_tests_system_20220401031324_b7a6... 'PARQUET', ...}}, 'jobReference': {'jobId': '8006a745-3fc5-4e12-9ebe-2520d654d042', 'projectId': 'precise-truck-742'}}
content_type = '*/*', timeout = None

def transmit(
    self,
    transport,
    data,
    metadata,
    content_type,
    timeout=(
        _request_helpers._DEFAULT_CONNECT_TIMEOUT,
        _request_helpers._DEFAULT_READ_TIMEOUT,
    ),
):
    """Transmit the resource to be uploaded.

    Args:
        transport (~requests.Session): A ``requests`` object which can
            make authenticated requests.
        data (bytes): The resource content to be uploaded.
        metadata (Mapping[str, str]): The resource metadata, such as an
            ACL list.
        content_type (str): The content type of the resource, e.g. a JPEG
            image has content type ``image/jpeg``.
        timeout (Optional[Union[float, Tuple[float, float]]]):
            The number of seconds to wait for the server response.
            Depending on the retry strategy, a request may be repeated
            several times using the same timeout each time.

            Can also be passed as a tuple (connect_timeout, read_timeout).
            See :meth:`requests.Session.request` documentation for details.

    Returns:
        ~requests.Response: The HTTP response returned by ``transport``.
    """
    method, url, payload, headers = self._prepare_request(
        data, metadata, content_type
    )

    # Wrap the request business logic in a function to be retried.
    def retriable_request():
        result = transport.request(
            method, url, data=payload, headers=headers, timeout=timeout
        )

        self._process_response(result)

        return result
>   return _request_helpers.wait_and_retry(
        retriable_request, self._get_status_code, self._retry_strategy
    )

.nox/system-3-9/lib/python3.9/site-packages/google/resumable_media/requests/upload.py:153:


func = <function MultipartUpload.transmit..retriable_request at 0x7fd1ec067040>
get_status_code = <function RequestsMixin._get_status_code at 0x7fd205000310>
retry_strategy = <google.resumable_media.common.RetryStrategy object at 0x7fd1ec072d90>

def wait_and_retry(func, get_status_code, retry_strategy):
    """Attempts to retry a call to ``func`` until success.

    Expects ``func`` to return an HTTP response and uses ``get_status_code``
    to check if the response is retry-able.

    ``func`` is expected to raise a failure status code as a
    common.InvalidResponse, at which point this method will check the code
    against the common.RETRIABLE list of retriable status codes.

    Will retry until :meth:`~.RetryStrategy.retry_allowed` (on the current
    ``retry_strategy``) returns :data:`False`. Uses
    :func:`_helpers.calculate_retry_wait` to double the wait time (with jitter)
    after each attempt.

    Args:
        func (Callable): A callable that takes no arguments and produces
            an HTTP response which will be checked as retry-able.
        get_status_code (Callable[Any, int]): Helper to get a status code
            from a response.
        retry_strategy (~google.resumable_media.common.RetryStrategy): The
            strategy to use if the request fails and must be retried.

    Returns:
        object: The return value of ``func``.
    """
    total_sleep = 0.0
    num_retries = 0
    # base_wait will be multiplied by the multiplier on the first retry.
    base_wait = float(retry_strategy.initial_delay) / retry_strategy.multiplier

    # Set the retriable_exception_type if possible. We expect requests to be
    # present here and the transport to be using requests.exceptions errors,
    # but due to loose coupling with the transport layer we can't guarantee it.

    while True:  # return on success or when retries exhausted.
        error = None
        try:
>           response = func()

.nox/system-3-9/lib/python3.9/site-packages/google/resumable_media/requests/_request_helpers.py:147:


def retriable_request():
    result = transport.request(
        method, url, data=payload, headers=headers, timeout=timeout
    )
>   self._process_response(result)

.nox/system-3-9/lib/python3.9/site-packages/google/resumable_media/requests/upload.py:149:


self = <google.resumable_media.requests.upload.MultipartUpload object at 0x7fd1ec078700>
response = <Response [403]>

def _process_response(self, response):
    """Process the response from an HTTP request.

    This is everything that must be done after a request that doesn't
    require network I/O (or other I/O). This is based on the `sans-I/O`_
    philosophy.

    Args:
        response (object): The HTTP response object.

    Raises:
        ~google.resumable_media.common.InvalidResponse: If the status
            code is not 200.

    .. _sans-I/O: https://sans-io.readthedocs.io/
    """
    # Tombstone the current upload so it cannot be used again (in either
    # failure or success).
    self._finished = True
>   _helpers.require_status_code(response, (http.client.OK,), self._get_status_code)

.nox/system-3-9/lib/python3.9/site-packages/google/resumable_media/_upload.py:114:


response = <Response [403]>, status_codes = (<HTTPStatus.OK: 200>,)
get_status_code = <function RequestsMixin._get_status_code at 0x7fd205000310>
callback = <function do_nothing at 0x7fd204ff4e50>

def require_status_code(response, status_codes, get_status_code, callback=do_nothing):
    """Require a response has a status code among a list.

    Args:
        response (object): The HTTP response object.
        status_codes (tuple): The acceptable status codes.
        get_status_code (Callable[Any, int]): Helper to get a status code
            from a response.
        callback (Optional[Callable]): A callback that takes no arguments,
            to be executed when an exception is being raised.

    Returns:
        int: The status code.

    Raises:
        ~google.resumable_media.common.InvalidResponse: If the status code
            is not one of the values in ``status_codes``.
    """
    status_code = get_status_code(response)
    if status_code not in status_codes:
        if status_code not in common.RETRYABLE:
            callback()
>       raise common.InvalidResponse(
            response,
            "Request failed with status code",
            status_code,
            "Expected one of",
            *status_codes
        )

E google.resumable_media.common.InvalidResponse: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>)

.nox/system-3-9/lib/python3.9/site-packages/google/resumable_media/_helpers.py:105: InvalidResponse

During handling of the above exception, another exception occurred:

self = <pandas_gbq.gbq.GbqConnector object at 0x7fd1ec0688b0>
dataframe = A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07
destination_table_ref = TableReference(DatasetReference('precise-truck-742', 'python_bigquery_pandas_tests_system_20220401031324_b7a6ab'), 'new_test18')
chunksize = None
schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}, {'name': 'B', 'type': 'FLOAT'}, {'name': 'C', 'type': 'STRING'}, {'name': 'D', 'type': 'TIMESTAMP'}]}
progress_bar = True, api_method = 'load_parquet'
billing_project = 'precise-truck-742'

def load_data(
    self,
    dataframe,
    destination_table_ref,
    chunksize=None,
    schema=None,
    progress_bar=True,
    api_method: str = "load_parquet",
    billing_project: Optional[str] = None,
):
    from pandas_gbq import load

    total_rows = len(dataframe)

    try:
>       chunks = load.load_chunks(
            self.client,
            dataframe,
            destination_table_ref,
            chunksize=chunksize,
            schema=schema,
            location=self.location,
            api_method=api_method,
            billing_project=billing_project,
        )

pandas_gbq/gbq.py:591:


client = <google.cloud.bigquery.client.Client object at 0x7fd1ec068ac0>
dataframe = A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07
destination_table_ref = TableReference(DatasetReference('precise-truck-742', 'python_bigquery_pandas_tests_system_20220401031324_b7a6ab'), 'new_test18')
chunksize = None
schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}, {'name': 'B', 'type': 'FLOAT'}, {'name': 'C', 'type': 'STRING'}, {'name': 'D', 'type': 'TIMESTAMP'}]}
location = None, api_method = 'load_parquet'
billing_project = 'precise-truck-742'

def load_chunks(
    client,
    dataframe,
    destination_table_ref,
    chunksize=None,
    schema=None,
    location=None,
    api_method="load_parquet",
    billing_project: Optional[str] = None,
):
    if api_method == "load_parquet":
>       load_parquet(
            client,
            dataframe,
            destination_table_ref,
            location,
            schema,
            billing_project=billing_project,
        )

pandas_gbq/load.py:238:


client = <google.cloud.bigquery.client.Client object at 0x7fd1ec068ac0>
dataframe = A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07
destination_table_ref = TableReference(DatasetReference('precise-truck-742', 'python_bigquery_pandas_tests_system_20220401031324_b7a6ab'), 'new_test18')
location = None
schema = {'fields': [{'name': 'A', 'type': 'FLOAT'}, {'name': 'B', 'type': 'FLOAT'}, {'name': 'C', 'type': 'STRING'}, {'name': 'D', 'type': 'TIMESTAMP'}]}
billing_project = 'precise-truck-742'

def load_parquet(
    client: bigquery.Client,
    dataframe: pandas.DataFrame,
    destination_table_ref: bigquery.TableReference,
    location: Optional[str],
    schema: Optional[Dict[str, Any]],
    billing_project: Optional[str] = None,
):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = "WRITE_APPEND"
    job_config.source_format = "PARQUET"

    if schema is not None:
        schema = pandas_gbq.schema.remove_policy_tags(schema)
        job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
        dataframe = cast_dataframe_for_parquet(dataframe, schema)

    try:
>       client.load_table_from_dataframe(
            dataframe,
            destination_table_ref,
            job_config=job_config,
            location=location,
            project=billing_project,
        ).result()

pandas_gbq/load.py:130:


self = <google.cloud.bigquery.client.Client object at 0x7fd1ec068ac0>
dataframe = A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07
destination = TableReference(DatasetReference('precise-truck-742', 'python_bigquery_pandas_tests_system_20220401031324_b7a6ab'), 'new_test18')
num_retries = 6, job_id = '8006a745-3fc5-4e12-9ebe-2520d654d042'
job_id_prefix = None, location = None, project = 'precise-truck-742'
job_config = <google.cloud.bigquery.job.load.LoadJobConfig object at 0x7fd1ec0d7190>
parquet_compression = 'SNAPPY', timeout = None

def load_table_from_dataframe(
    self,
    dataframe: "pandas.DataFrame",
    destination: Union[Table, TableReference, str],
    num_retries: int = _DEFAULT_NUM_RETRIES,
    job_id: str = None,
    job_id_prefix: str = None,
    location: str = None,
    project: str = None,
    job_config: LoadJobConfig = None,
    parquet_compression: str = "snappy",
    timeout: ResumableTimeoutType = DEFAULT_TIMEOUT,
) -> job.LoadJob:
    """Upload the contents of a table from a pandas DataFrame.

    Similar to :meth:`load_table_from_uri`, this method creates, starts and
    returns a :class:`~google.cloud.bigquery.job.LoadJob`.

    .. note::

        REPEATED fields are NOT supported when using the CSV source format.
        They are supported when using the PARQUET source format, but
        due to the way they are encoded in the ``parquet`` file,
        a mismatch with the existing table schema can occur, so
        REPEATED fields are not properly supported when using ``pyarrow<4.0.0``
        using the parquet format.

        https://github.com/googleapis/python-bigquery/issues/19

    Args:
        dataframe:
            A :class:`~pandas.DataFrame` containing the data to load.
        destination:
            The destination table to use for loading the data. If it is an
            existing table, the schema of the :class:`~pandas.DataFrame`
            must match the schema of the destination table. If the table
            does not yet exist, the schema is inferred from the
            :class:`~pandas.DataFrame`.

            If a string is passed in, this method attempts to create a
            table reference from a string using
            :func:`google.cloud.bigquery.table.TableReference.from_string`.

    Keyword Arguments:
        num_retries: Number of upload retries.
        job_id: Name of the job.
        job_id_prefix:
            The user-provided prefix for a randomly generated
            job ID. This parameter will be ignored if a ``job_id`` is
            also given.
        location:
            Location where to run the job. Must match the location of the
            destination table.
        project:
            Project ID of the project of where to run the job. Defaults
            to the client's project.
        job_config:
            Extra configuration options for the job.

            To override the default pandas data type conversions, supply
            a value for
            :attr:`~google.cloud.bigquery.job.LoadJobConfig.schema` with
            column names matching those of the dataframe. The BigQuery
            schema is used to determine the correct data type conversion.
            Indexes are not loaded.

            By default, this method uses the parquet source format. To
            override this, supply a value for
            :attr:`~google.cloud.bigquery.job.LoadJobConfig.source_format`
            with the format name. Currently only
            :attr:`~google.cloud.bigquery.job.SourceFormat.CSV` and
            :attr:`~google.cloud.bigquery.job.SourceFormat.PARQUET` are
            supported.
        parquet_compression:
            [Beta] The compression method to use if intermittently
            serializing ``dataframe`` to a parquet file.

            The argument is directly passed as the ``compression``
            argument to the underlying ``pyarrow.parquet.write_table()``
            method (the default value "snappy" gets converted to uppercase).
            https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table

            If the job config schema is missing, the argument is directly
            passed as the ``compression`` argument to the underlying
            ``DataFrame.to_parquet()`` method.
            https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet
        timeout:
            The number of seconds to wait for the underlying HTTP transport
            before using ``retry``. Depending on the retry strategy, a request may
            be repeated several times using the same timeout each time.

            Can also be passed as a tuple (connect_timeout, read_timeout).
            See :meth:`requests.Session.request` documentation for details.

    Returns:
        google.cloud.bigquery.job.LoadJob: A new load job.

    Raises:
        TypeError:
            If ``job_config`` is not an instance of :class:`~google.cloud.bigquery.job.LoadJobConfig`
            class.
    """
    job_id = _make_job_id(job_id, job_id_prefix)

    if job_config:
        _verify_job_config_type(job_config, google.cloud.bigquery.job.LoadJobConfig)
        # Make a copy so that the job config isn't modified in-place.
        job_config_properties = copy.deepcopy(job_config._properties)
        job_config = job.LoadJobConfig()
        job_config._properties = job_config_properties

    else:
        job_config = job.LoadJobConfig()

    supported_formats = {job.SourceFormat.CSV, job.SourceFormat.PARQUET}
    if job_config.source_format is None:
        # default value
        job_config.source_format = job.SourceFormat.PARQUET

    if (
        job_config.source_format == job.SourceFormat.PARQUET
        and job_config.parquet_options is None
    ):
        parquet_options = ParquetOptions()
        # default value
        parquet_options.enable_list_inference = True
        job_config.parquet_options = parquet_options

    if job_config.source_format not in supported_formats:
        raise ValueError(
            "Got unexpected source_format: '{}'. Currently, only PARQUET and CSV are supported".format(
                job_config.source_format
            )
        )

    if location is None:
        location = self.location

    # If table schema is not provided, we try to fetch the existing table
    # schema, and check if dataframe schema is compatible with it - except
    # for WRITE_TRUNCATE jobs, the existing schema does not matter then.
    if (
        not job_config.schema
        and job_config.write_disposition != job.WriteDisposition.WRITE_TRUNCATE
    ):
        try:
            table = self.get_table(destination)
        except core_exceptions.NotFound:
            pass
        else:
            columns_and_indexes = frozenset(
                name
                for name, _ in _pandas_helpers.list_columns_and_indexes(dataframe)
            )
            job_config.schema = [
                # Field description and policy tags are not needed to
                # serialize a data frame.
                SchemaField(
                    field.name,
                    field.field_type,
                    mode=field.mode,
                    fields=field.fields,
                )
                # schema fields not present in the dataframe are not needed
                for field in table.schema
                if field.name in columns_and_indexes
            ]

    job_config.schema = _pandas_helpers.dataframe_to_bq_schema(
        dataframe, job_config.schema
    )

    if not job_config.schema:
        # the schema could not be fully detected
        warnings.warn(
            "Schema could not be detected for all columns. Loading from a "
            "dataframe without a schema will be deprecated in the future, "
            "please provide a schema.",
            PendingDeprecationWarning,
            stacklevel=2,
        )

    tmpfd, tmppath = tempfile.mkstemp(
        suffix="_job_{}.{}".format(job_id[:8], job_config.source_format.lower())
    )
    os.close(tmpfd)

    try:

        if job_config.source_format == job.SourceFormat.PARQUET:
            if job_config.schema:
                if parquet_compression == "snappy":  # adjust the default value
                    parquet_compression = parquet_compression.upper()

                _pandas_helpers.dataframe_to_parquet(
                    dataframe,
                    job_config.schema,
                    tmppath,
                    parquet_compression=parquet_compression,
                    parquet_use_compliant_nested_type=True,
                )
            else:
                dataframe.to_parquet(
                    tmppath,
                    engine="pyarrow",
                    compression=parquet_compression,
                    **(
                        {"use_compliant_nested_type": True}
                        if _helpers.PYARROW_VERSIONS.use_compliant_nested_type
                        else {}
                    ),
                )

        else:

            dataframe.to_csv(
                tmppath,
                index=False,
                header=False,
                encoding="utf-8",
                float_format="%.17g",
                date_format="%Y-%m-%d %H:%M:%S.%f",
            )

        with open(tmppath, "rb") as tmpfile:
            file_size = os.path.getsize(tmppath)
>           return self.load_table_from_file(
                tmpfile,
                destination,
                num_retries=num_retries,
                rewind=True,
                size=file_size,
                job_id=job_id,
                job_id_prefix=job_id_prefix,
                location=location,
                project=project,
                job_config=job_config,
                timeout=timeout,
            )

.nox/system-3-9/lib/python3.9/site-packages/google/cloud/bigquery/client.py:2657:


self = <google.cloud.bigquery.client.Client object at 0x7fd1ec068ac0>
file_obj = <_io.BufferedReader name='/tmp/tmpblhmw9v2_job_8006a745.parquet'>
destination = TableReference(DatasetReference('precise-truck-742', 'python_bigquery_pandas_tests_system_20220401031324_b7a6ab'), 'new_test18')
rewind = True, size = 1649, num_retries = 6
job_id = '8006a745-3fc5-4e12-9ebe-2520d654d042', job_id_prefix = None
location = None, project = 'precise-truck-742'
job_config = <google.cloud.bigquery.job.load.LoadJobConfig object at 0x7fd1ec078220>
timeout = None

def load_table_from_file(
    self,
    file_obj: IO[bytes],
    destination: Union[Table, TableReference, TableListItem, str],
    rewind: bool = False,
    size: int = None,
    num_retries: int = _DEFAULT_NUM_RETRIES,
    job_id: str = None,
    job_id_prefix: str = None,
    location: str = None,
    project: str = None,
    job_config: LoadJobConfig = None,
    timeout: ResumableTimeoutType = DEFAULT_TIMEOUT,
) -> job.LoadJob:
    """Upload the contents of this table from a file-like object.

    Similar to :meth:`load_table_from_uri`, this method creates, starts and
    returns a :class:`~google.cloud.bigquery.job.LoadJob`.

    Args:
        file_obj:
            A file handle opened in binary mode for reading.
        destination:
            Table into which data is to be loaded. If a string is passed
            in, this method attempts to create a table reference from a
            string using
            :func:`google.cloud.bigquery.table.TableReference.from_string`.

    Keyword Arguments:
        rewind:
            If True, seek to the beginning of the file handle before
            reading the file.
        size:
            The number of bytes to read from the file handle. If size is
            ``None`` or large, resumable upload will be used. Otherwise,
            multipart upload will be used.
        num_retries: Number of upload retries. Defaults to 6.
        job_id: Name of the job.
        job_id_prefix:
            The user-provided prefix for a randomly generated job ID.
            This parameter will be ignored if a ``job_id`` is also given.
        location:
            Location where to run the job. Must match the location of the
            destination table.
        project:
            Project ID of the project of where to run the job. Defaults
            to the client's project.
        job_config:
            Extra configuration options for the job.
        timeout:
            The number of seconds to wait for the underlying HTTP transport
            before using ``retry``. Depending on the retry strategy, a request
            may be repeated several times using the same timeout each time.

            Can also be passed as a tuple (connect_timeout, read_timeout).
            See :meth:`requests.Session.request` documentation for details.

    Returns:
        google.cloud.bigquery.job.LoadJob: A new load job.

    Raises:
        ValueError:
            If ``size`` is not passed in and can not be determined, or if
            the ``file_obj`` can be detected to be a file opened in text
            mode.

        TypeError:
            If ``job_config`` is not an instance of :class:`~google.cloud.bigquery.job.LoadJobConfig`
            class.
    """
    job_id = _make_job_id(job_id, job_id_prefix)

    if project is None:
        project = self.project

    if location is None:
        location = self.location

    destination = _table_arg_to_table_ref(destination, default_project=self.project)
    job_ref = job._JobReference(job_id, project=project, location=location)
    if job_config:
        job_config = copy.deepcopy(job_config)
        _verify_job_config_type(job_config, google.cloud.bigquery.job.LoadJobConfig)
    load_job = job.LoadJob(job_ref, None, destination, self, job_config)
    job_resource = load_job.to_api_repr()

    if rewind:
        file_obj.seek(0, os.SEEK_SET)

    _check_mode(file_obj)

    try:
        if size is None or size >= _MAX_MULTIPART_SIZE:
            response = self._do_resumable_upload(
                file_obj, job_resource, num_retries, timeout, project=project
            )
        else:
            response = self._do_multipart_upload(
                file_obj, job_resource, size, num_retries, timeout, project=project
            )
    except resumable_media.InvalidResponse as exc:
>       raise exceptions.from_http_response(exc.response)

E google.api_core.exceptions.Forbidden: 403 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/precise-truck-742/jobs?uploadType=multipart: Access Denied: Table precise-truck-742:python_bigquery_pandas_tests_system_20220401031324_b7a6ab.new_test18: Permission bigquery.tables.updateData denied on table precise-truck-742:python_bigquery_pandas_tests_system_20220401031324_b7a6ab.new_test18 (or it may not exist).

.nox/system-3-9/lib/python3.9/site-packages/google/cloud/bigquery/client.py:2427: Forbidden

During handling of the above exception, another exception occurred:

self = <system.test_gbq.TestToGBQIntegration object at 0x7fd203d95b20>
project_id = 'precise-truck-742'

def test_upload_data_with_valid_user_schema(self, project_id):
    # Issue #46; tests test scenarios with user-provided
    # schemas
    df = make_mixed_dataframe_v1()
    test_id = "18"
    test_schema = [
        {"name": "A", "type": "FLOAT"},
        {"name": "B", "type": "FLOAT"},
        {"name": "C", "type": "STRING"},
        {"name": "D", "type": "TIMESTAMP"},
    ]
    destination_table = self.destination_table + test_id
>   gbq.to_gbq(
        df,
        destination_table,
        project_id,
        credentials=self.credentials,
        table_schema=test_schema,
    )

tests/system/test_gbq.py:943:


pandas_gbq/gbq.py:1198: in to_gbq
connector.load_data(
pandas_gbq/gbq.py:610: in load_data
self.process_http_error(ex)


ex = Forbidden('POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/precise-truck-742/jobs?uploadType=multipar...n table precise-truck-742:python_bigquery_pandas_tests_system_20220401031324_b7a6ab.new_test18 (or it may not exist).')

@staticmethod
def process_http_error(ex):
    # See `BigQuery Troubleshooting Errors
    # <https://cloud.google.com/bigquery/troubleshooting-errors>`__

    if "cancelled" in ex.message:
        raise QueryTimeout("Reason: {0}".format(ex))
>   raise GenericGBQException("Reason: {0}".format(ex))

E pandas_gbq.exceptions.GenericGBQException: Reason: 403 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/precise-truck-742/jobs?uploadType=multipart: Access Denied: Table precise-truck-742:python_bigquery_pandas_tests_system_20220401031324_b7a6ab.new_test18: Permission bigquery.tables.updateData denied on table precise-truck-742:python_bigquery_pandas_tests_system_20220401031324_b7a6ab.new_test18 (or it may not exist).

pandas_gbq/gbq.py:386: GenericGBQException
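
For reference, a minimal standalone sketch of the call the failing test makes, reduced from the traceback above. The DataFrame values, schema, project, and table names are copied from the log; it is illustrative only (it needs valid credentials and the test dataset to actually run) and is not an official reproduction script.

import pandas

import pandas_gbq

# DataFrame contents as shown in the traceback locals (make_mixed_dataframe_v1).
df = pandas.DataFrame(
    {
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
        "D": pandas.to_datetime(
            ["2009-01-01", "2009-01-02", "2009-01-05", "2009-01-06", "2009-01-07"]
        ),
    }
)

# User-provided schema from the test (the Issue #46 scenario).
test_schema = [
    {"name": "A", "type": "FLOAT"},
    {"name": "B", "type": "FLOAT"},
    {"name": "C", "type": "STRING"},
    {"name": "D", "type": "TIMESTAMP"},
]

# to_gbq() serializes the DataFrame to Parquet and uploads it via
# Client.load_table_from_dataframe(); the multipart upload is where the
# 403 "Access Denied ... (or it may not exist)" response surfaced.
pandas_gbq.to_gbq(
    df,
    "python_bigquery_pandas_tests_system_20220401031324_b7a6ab.new_test18",
    project_id="precise-truck-742",
    table_schema=test_schema,
)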

flaky-bot bot added the flakybot: issue, priority: p1, and type: bug labels on Apr 1, 2022
product-auto-label bot added the api: bigquery label on Apr 1, 2022
flaky-bot bot added the flakybot: flaky label on Apr 1, 2022

flaky-bot bot commented Apr 1, 2022

Looks like this issue is flaky. 😟

I'm going to leave this open and stop commenting.

A human should fix and close this.


When run at the same commit (82d38ea), this test passed in one build (Build Status, Sponge) and failed in another build (Build Status, Sponge).
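
Since the failure is a transient 403 on a table created moments earlier, one possible mitigation a human could apply is to retry the upload when that specific error appears. The sketch below is hypothetical: the retry_to_gbq helper, attempt count, and backoff values are illustrative and are not part of pandas-gbq or this test suite.

import time

import pandas_gbq
from pandas_gbq.exceptions import GenericGBQException


def retry_to_gbq(df, destination_table, project_id, table_schema, attempts=3):
    """Retry to_gbq() when a freshly created table has not propagated yet."""
    for attempt in range(attempts):
        try:
            return pandas_gbq.to_gbq(
                df,
                destination_table,
                project_id=project_id,
                table_schema=table_schema,
                if_exists="append",  # the table may already exist after a failed attempt
            )
        except GenericGBQException as exc:
            # Only retry the transient "Access Denied ... (or it may not exist)" case.
            if "Access Denied" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)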

meredithslota added the priority: p2 label and removed the priority: p1 label on Apr 5, 2022