
scikit_bring_your_own.ipynb train model pandas error #219

@professoroakz


Hello!

I am following the scikit_bring_your_own tutorial to set up a bring-your-own (BYO) model for production use, but I get the following error when I try to train the model on Amazon SageMaker.


    AlgorithmError: Exception during training: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
    Traceback (most recent call last):
      File "/opt/program/train", line 48, in train
        raw_data = [ pd.read_csv(file, header=None) for file in input_files ]
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 709, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 449, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 818, in __init__
        self._make_engine(self.engine)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1049, in _make_engine
        self._engine = CParserWrapper(self.f, **self.options)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1695, in __init__
        self._reader = parsers.TextReader(src, **kwds)
      File "pandas/_libs/parser

I uploaded the data to S3 using:

    def upload_data(self):
        self.logger.info(
            'Uploading locally available data to s3 in path: %s, using bucket: %s using s3 directory prefix: %s'
            % (
                self.config.data_directory_path,
                self.config.data_upload_bucket,
                self.config.s3_data_directory_prefix,
            )
        )

        self.train_data_location = self.session.upload_data(
            path=self.config.data_directory_path,
            bucket=self.config.data_upload_bucket,
            key_prefix=self.config.s3_data_directory_prefix
        )

        self.logger.info('Uploaded local data to s3 path: %s' % (self.train_data_location))

I ran the build_and_push.sh script.

Then I tried to train the model using:

    def estimator(self):
        self.logger.info(
            'Creating estimator for %s model %s using image %s' % (
                'BYO',
                self.config.model_name,
                self.image,
            )
        )

        return Estimator(
            image_name=self.image,
            role=self.config.role,
            train_instance_count=self.config.train_instance_count,
            train_instance_type=self.config.train_instance_type,
            output_path=self.config.output_path,
            base_job_name=self.config.base_job_name,
            sagemaker_session=self.session,
        )

(I'm using the same code as in the notebook, just restructured as methods of a class.)

Am I missing something or doing something wrong?
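
One possible cause, offered as an assumption rather than a confirmed diagnosis: the traceback shows `pd.read_csv` being called on every entry in `input_files`, and older pandas versions can raise this `read(nbytes)` error when an entry is a directory rather than a file. If the uploaded data directory contains subdirectories, they might end up as entries in the training channel directory. A minimal sketch of a check for that, where `list_input_files` is a hypothetical helper and not part of the tutorial's `train` script:

```python
import os


def list_input_files(training_path):
    """Split the entries directly under training_path into regular files
    and everything else (for example subdirectories).

    Hypothetical helper for illustration; the name and return shape are
    assumptions, not part of the scikit_bring_your_own container code.
    """
    files, skipped = [], []
    for name in sorted(os.listdir(training_path)):
        full_path = os.path.join(training_path, name)
        if os.path.isfile(full_path):
            files.append(full_path)    # candidate for pd.read_csv
        else:
            skipped.append(full_path)  # directories etc. cannot be parsed as CSV
    return files, skipped
```

Running this against the local data directory before the upload, or against the channel directory inside the container, would show whether anything other than flat CSV files is present. If `skipped` is non-empty, flattening the directory or reading only `files` may be worth trying.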
