New VLM JSON to stylized HTML notebook draft #17
base: main
Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB
Thank you for adding the notebook! This is a really cool example for our collection, however I do have some feedback we should address before we publish it.
Mainly, we need to add some narrative, more context, examples, and guide the reader through the notebook.
Detailed feedback (not necessarily in order):
- In the intro, you briefly mention why one would use it - “perfect for web apps, knowledge bases, or downstream AI workflows.” Start with the “why”. Got lots of unstructured data in various formats that you’d like to visually standardize for XYZ…. You can do it with Unstructured. 2 sentences about what Unstructured does, and what the standard output of the VLM parser contains. Give a bit more context for the reader, who’s new to Unstructured. Give a TLDR of what the reader will learn to do. Then move on to the steps.
- List prerequisites: an Unstructured API key, plus uploading a PDF to your Google Colab environment. Check an example of prerequisites here: https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/main/notebooks/Getting_Started_with_Unstructured_API_and_Redis.ipynb
- Add a sentence or two on environment variables management.
- Maybe explain why only certain file types are eligible for VLM partitioner? Also, you’re not using the `VLM_ELIGIBLE_FILE_TYPES`. Maybe replace it with markdown narrative about supported file types?
- Please explain what type of partitioners Unstructured offers, what we recommend the VLM partitioner for, and why in this case it is the only option.
- Mention that in this example for illustration purposes you’re partitioning just one file, but if they want to do this at scale for many files, they would need to use the Workflows Endpoint with connectors. Link to the docs for the partition endpoint: https://docs.unstructured.io/api-reference/partition/overview, and for the workflow endpoint: https://docs.unstructured.io/api-reference/workflow/overview
- After you partition the file, print out an example of an element. This helps to make it more interactive. Reader can understand what we’re working with, and also when they try it for themselves, this gives them a checkpoint.
- Explain what happens in every step, guide the learner. E.g. Step 2 is a lot of code, explain how you leverage the metadata. Link to the docs where we explain what our output JSON looks like. Ideally, show an example of an element. https://docs.unstructured.io/api-reference/partition/document-elements
- Step 3: Stylize the outputs. Say a sentence or two about this Step.
- Show a screenshot of the result in a markdown cell to give the reader an idea of what the output looks like. Add the wow factor.
- Write a conclusion with what could be next steps (e.g. building a production workflow with connectors, and then applying the style to batches of processed documents).
- Add a CTA - encourage the reader to try it with their documents. Lead them to sign up for the platform, and mention the free trial.
- Minor nitpicks: “# Platform Partition URL” -> “Unstructured Partition Endpoint URL”
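To illustrate the environment-variable and "print an example element" points above, here is a minimal sketch. It makes several assumptions: that the key is stored under the name `UNSTRUCTURED_API_KEY`, and that the partition output has been parsed into a list of element dicts. The helpers `load_api_key` and `preview_element` are hypothetical names, and the sample element is hand-written for illustration rather than real API output.

```python
import json
import os


def load_api_key(env=os.environ, name="UNSTRUCTURED_API_KEY"):
    """Return the API key from the environment, failing early if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set {name} before running the notebook.")
    return key


def preview_element(elements, index=0, max_chars=200):
    """Pretty-print one element, truncating long string values in its metadata."""
    element = dict(elements[index])
    element["metadata"] = {
        k: (v[:max_chars] + "..." if isinstance(v, str) and len(v) > max_chars else v)
        for k, v in element.get("metadata", {}).items()
    }
    return json.dumps(element, indent=2)


# Illustrative element, hand-written for this sketch (not real API output).
sample_elements = [
    {
        "type": "Title",
        "element_id": "example-id",
        "text": "Quarterly Report",
        "metadata": {
            "filename": "report.pdf",
            "page_number": 1,
            "text_as_html": "<h1>Quarterly Report</h1>",
        },
    }
]

print(preview_element(sample_elements))
```

Keeping the key out of the notebook source and printing one element right after partitioning gives the reader a checkpoint to compare against when they run it on their own file.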
Great feedback. I'll address each of these and push an update.
},
"outputs": [],
"source": [
"file_as_html = VlmJsonToHtmlConverter(title).convert(json_data)\n",
Bug: Undefined Variables Cause Conversion Error
The notebook attempts to convert JSON to HTML using `VlmJsonToHtmlConverter(title).convert(json_data)`, but both `title` and `json_data` are undefined. The `title` variable is never declared, and `json_data` is likely a typo for `file_as_json`, which was defined earlier. This results in a `NameError` when the cell executes.
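One possible shape of the fix is sketched below. The converter class here is a hypothetical stand-in so the snippet runs on its own; the notebook's real `VlmJsonToHtmlConverter` is not shown in this thread and may behave differently. The sketch also assumes `file_as_json` holds a parsed list of element dicts and that the title is chosen by the author.

```python
import html


class VlmJsonToHtmlConverter:
    """Hypothetical stand-in for the notebook's converter, defined only so this runs."""

    def __init__(self, title):
        self.title = title

    def convert(self, elements):
        # Render each element's text as an escaped paragraph.
        body = "\n".join(
            f"<p>{html.escape(el.get('text', ''))}</p>" for el in elements
        )
        return (
            f"<html><head><title>{html.escape(self.title)}</title></head>"
            f"<body>\n{body}\n</body></html>"
        )


# Assumed to be defined earlier in the notebook; sample data for illustration.
file_as_json = [
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew in Q3 & Q4."},
]

# Declare the title before use, and pass file_as_json instead of json_data.
title = "Quarterly Report"
file_as_html = VlmJsonToHtmlConverter(title).convert(file_as_json)
print(file_as_html)
```

With both names defined before the call, the cell no longer raises a `NameError`.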
No description provided.