CodesLikeIcarus
No description provided.

cursor[bot]

This comment was marked as outdated.

Collaborator

@MKhalusova MKhalusova left a comment

Thank you for adding the notebook! This is a really cool example for our collection; however, I do have some feedback we should address before we publish it.
Mainly, we need to add some narrative, more context, examples, and guide the reader through the notebook.

Detailed feedback (not necessarily in order):

  • In the intro, you briefly mention why one would use it: “perfect for web apps, knowledge bases, or downstream AI workflows.” Start with the “why”. Got lots of unstructured data in various formats that you’d like to visually standardize for XYZ…. You can do it with Unstructured. Add 2 sentences about what Unstructured does, and what the standard output of the VLM parser contains. Give a bit more context for the reader who’s new to Unstructured. Give a TLDR of what the reader will learn to do. Then move on to the steps.
  • List prerequisites: an Unstructured API key + uploading some PDF to your Google Colab environment. Check an example of prerequisites here: https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/main/notebooks/Getting_Started_with_Unstructured_API_and_Redis.ipynb
  • Add a sentence or two on environment variables management.
  • Maybe explain why only certain file types are eligible for the VLM partitioner? Also, you’re not using the VLM_ELIGIBLE_FILE_TYPES. Maybe replace it with markdown narrative about supported file types?
  • Please explain what types of partitioners Unstructured offers, what we recommend the VLM partitioner for, and why in this case it is the only option.
  • Mention that in this example for illustration purposes you’re partitioning just one file, but if they want to do this at scale for many files, they would need to use the Workflows Endpoint with connectors. Link to the docs for the partition endpoint: https://docs.unstructured.io/api-reference/partition/overview, and for the workflow endpoint: https://docs.unstructured.io/api-reference/workflow/overview
  • After you partition the file, print out an example of an element. This helps to make it more interactive. The reader can understand what we’re working with, and when they try it for themselves, this gives them a checkpoint.
  • Explain what happens in every step, guide the learner. E.g. Step 2 is a lot of code, explain how you leverage the metadata. Link to the docs where we explain what our output JSON looks like. Ideally, show an example of an element. https://docs.unstructured.io/api-reference/partition/document-elements
  • Step 3: Stylize the outputs. Say a sentence or two about this Step.
  • Show a screenshot of the result in a markdown cell to give the reader an idea of what the output looks like. Add the wow factor.
  • Write a conclusion with what could be next steps (e.g. building a production workflow with connectors, and then applying the style to batches of processed documents).
  • Add a CTA: encourage the reader to try it with their documents. Lead them to sign up for the platform, and mention the free trial.
  • Minor nitpicks: “# Platform Partition URL” -> “Unstructured Partition Endpoint URL”
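
To make the environment-variable and "print an example element" points above concrete, here is a rough sketch of what those cells could look like. The helper names `require_env` and `preview_element` are hypothetical, the variable name `UNSTRUCTURED_API_KEY` is an assumption, and the sample element is made up for illustration; it only mirrors the general `type` / `element_id` / `text` / `metadata` layout described in the document-elements docs linked above.

```python
import json
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, failing early if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell.")
    return value


def preview_element(elements: list, index: int = 0) -> str:
    """Pretty-print one element so the reader has a checkpoint after partitioning."""
    return json.dumps(elements[index], indent=2)


# Made-up sample standing in for the partition output (not real API output).
sample_elements = [
    {
        "type": "Title",
        "element_id": "example-id-1",
        "text": "Quarterly Report",
        "metadata": {"filename": "report.pdf", "page_number": 1},
    },
    {
        "type": "NarrativeText",
        "element_id": "example-id-2",
        "text": "Revenue grew compared to the previous quarter.",
        "metadata": {"filename": "report.pdf", "page_number": 1},
    },
]

print(preview_element(sample_elements))
```

In the notebook, something like `api_key = require_env("UNSTRUCTURED_API_KEY")` would sit near the top, so a missing key fails with a readable message rather than an authentication error later on.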

@CodesLikeIcarus
Author

Great feedback. I'll address each of these and push an update.

},
"outputs": [],
"source": [
"file_as_html = VlmJsonToHtmlConverter(title).convert(json_data)\n",

Bug: Undefined Variables Cause Conversion Error

The notebook attempts to convert JSON to HTML using VlmJsonToHtmlConverter(title).convert(json_data), but both title and json_data are undefined. The title variable is never declared, and json_data is likely a typo for file_as_json, which was defined earlier. This results in a NameError when the cell executes.
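
A possible shape for the fix, as a sketch only: declare `title` before use and pass `file_as_json` instead of `json_data`. The notebook's real `VlmJsonToHtmlConverter` is not shown in this thread, so the class below is a hypothetical stand-in that only mimics the `VlmJsonToHtmlConverter(title).convert(...)` call shape, and the sample `file_as_json` is made up.

```python
import html


class VlmJsonToHtmlConverter:
    """Hypothetical stand-in; the notebook's actual converter is not shown here."""

    def __init__(self, title: str):
        self.title = title

    def convert(self, elements: list) -> str:
        # Render each element's text as a paragraph, escaping HTML-sensitive characters.
        body = "\n".join(
            f"<p>{html.escape(element.get('text', ''))}</p>" for element in elements
        )
        head = f"<head><title>{html.escape(self.title)}</title></head>"
        return f"<html>{head}<body>\n{body}\n</body></html>"


# Made-up stand-in for the partition result defined earlier in the notebook.
file_as_json = [
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew."},
]

# Declare the title before it is used, then pass the variable that actually exists.
title = "Quarterly Report"
file_as_html = VlmJsonToHtmlConverter(title).convert(file_as_json)
```

If `json_data` was meant to be a different variable than `file_as_json`, the same pattern applies: whatever is passed to `convert` has to be defined in an earlier cell.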

