
@sanjaychelliah sanjaychelliah commented May 9, 2024

Overview

The primary goal is to let users run the data ETL process with the clarifai-datautils library with ease. The library is meant to be used alongside the Python SDK to load text files (PDF, DOC, etc.), transform and chunk them, and upload the result to the Clarifai Platform. The requirement was to give users pipelines that they can define and then use to ingest data chunks into the Platform. This implementation uses the unstructured library internally.

Usage

from clarifai.client import Dataset
from clarifai_datautils.text import Pipeline, PDFPartition
from clarifai_datautils.text.pipeline.extractors import ExtractTextAfter

# Define the pipeline: partition the PDF into chunks, then extract the text
# that follows a marker string into each chunk's metadata
pipeline = Pipeline(
    name='Pdf-Splitter',
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractTextAfter(key='text_after', string='demon to survive'),
    ]
)

# Using SDK to upload
# dataset_url and file_name must be defined by the caller before this point
dataset = Dataset(dataset_url)
dataset.upload_dataset(pipeline.run(files=file_name, loader=True))
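To make the flow in the usage snippet easier to follow, here is a rough, library-free sketch of the idea: split text into size-bounded chunks, then run each transformation over the chunks in order. Everything in it (`Chunk`, `chunk_by_max_characters`, `extract_text_after`, `run_pipeline`) is a hypothetical stand-in written for illustration. It is not the clarifai-datautils or unstructured implementation, and the simple word-packing here is much cruder than a real `by_title` strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for one chunk of partitioned text."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


Transform = Callable[[List[Chunk]], List[Chunk]]


def chunk_by_max_characters(text: str, max_characters: int) -> List[Chunk]:
    """Greedily pack words into chunks no longer than max_characters.

    A single word longer than max_characters still becomes its own chunk.
    """
    chunks: List[Chunk] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_characters and current:
            chunks.append(Chunk(current))
            current = word
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current))
    return chunks


def extract_text_after(key: str, string: str) -> Transform:
    """Build a transform that stores the text following `string` under `key`."""
    def transform(chunks: List[Chunk]) -> List[Chunk]:
        for chunk in chunks:
            _, found, after = chunk.text.partition(string)
            if found:  # only annotate chunks that contain the marker
                chunk.metadata[key] = after.strip()
        return chunks
    return transform


def run_pipeline(text: str, max_characters: int,
                 transformations: List[Transform]) -> List[Chunk]:
    """Chunk the text, then apply each transformation in order."""
    chunks = chunk_by_max_characters(text, max_characters)
    for transform in transformations:
        chunks = transform(chunks)
    return chunks


chunks = run_pipeline(
    "He needed a demon to survive the long winter ahead",
    max_characters=40,
    transformations=[extract_text_after(key="text_after", string="demon to survive")],
)
for chunk in chunks:
    print(repr(chunk.text), chunk.metadata)
```

Only chunks that contain the marker string get a `text_after` entry; the rest pass through unchanged, which keeps each transformation independent of the ones before it.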

Added

TODO:

  • OCR Pipelines
  • Schema for Pipeline init
  • num_workers support
  • Other supported formats

@sainivedh sainivedh left a comment

added some comments

@phatvo9 phatvo9 left a comment

Nice work @sanjaychelliah. I left some comments.

@sanjaychelliah sanjaychelliah requested a review from sainivedh May 17, 2024 08:20
@sainivedh sainivedh left a comment

lgtm. Tremendous effort!

@sanjaychelliah sanjaychelliah merged commit adc9e24 into main May 27, 2024
@sanjaychelliah sanjaychelliah mentioned this pull request May 29, 2024