-
-
Notifications
You must be signed in to change notification settings - Fork 146
Add summary backfill #1948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add summary backfill #1948
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
llm/normalize_summaries.py
Outdated
| strip_summary = re.sub(r"^Summary:", "", summary) | ||
| lines = strip_summary.splitlines() | ||
| handle_list_items = [re.sub(r"^- ", "", x) for x in lines] | ||
| handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strip all extraneous whitespace and filter any empty lines
| databases/ | ||
| .secret.local | ||
| .secret.local | ||
| summaries-and-topics.csv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is generated by running llm/backfill_summaries.py and I assume we don't want to accidentally commit that.
3e0c451 to
2652ed9
Compare
jicruz96
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
nesanders
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great! Just a few comments, all minor stuff.
llm/backfill_summaries.py
Outdated
| @@ -0,0 +1,82 @@ | |||
| # This script fills any missing 'summary' or 'topics' fields on the data model. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor: recommend changing to multiline comment for readability, i.e.
"""This script...
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend adding to comment the explanation that it queries, by default, for session 194 bills and writes output to a local CSV and also automatically edits it in the firebase, IIUC.
| import csv | ||
| from normalize_summaries import normalize_summary | ||
|
|
||
| # Application Default credentials are automatically created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have docs on how to connect to the MAPLE prod firebase, assuming that's what you are doing? If so, can we link that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, as far as I know yes. In ## Contributing Backend Features to Dev/Prod: in the main README.md file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Can we link here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By link, do you mean just note that it exists? Or, do you want a link to github directly? Or expect that we'll use sphinx or something in the future and add a relative link in sphinx doc?
llm/backfill_summaries.py
Outdated
| return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"] | ||
|
|
||
|
|
||
| bills_ref = db.collection("generalCourts/194/bills") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend moving global constants to top of script and using ALL_CAPS naming convention, per PEP8.
llm/backfill_summaries.py
Outdated
|
|
||
| bills_ref = db.collection("generalCourts/194/bills") | ||
| bills = bills_ref.get() | ||
| with open("./summaries-and-topics.csv", "w") as csvfile: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend making this filename a global constant surfaced at top of script.
| continue | ||
| # Note: `normalize_summary` does some post-processing to clean up the summaries | ||
| # As of 2025-10-21 this was necessary due to the LLM prompt | ||
| summary = normalize_summary(summary["summary"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be a followup issue/PR, but do we also need to inject this function somewhere in our production code, i.e. when we run this as a lambda?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we do, good call out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, did you file a followup issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not done that, but I totally can do that quick!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nesanders
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comments! There are a couple dangling comments I am replying to here, but nothing major, so happy to approve.
| import csv | ||
| from normalize_summaries import normalize_summary | ||
|
|
||
| # Application Default credentials are automatically created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Can we link here?
| continue | ||
| # Note: `normalize_summary` does some post-processing to clean up the summaries | ||
| # As of 2025-10-21 this was necessary due to the LLM prompt | ||
| summary = normalize_summary(summary["summary"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, did you file a followup issue?
Summary
Add a summary/topics backfill script. This is really intended as a one off script but could be useful if the document trigger didn't work properly.
First test, this looks good enough to attempt to write back to firebase.
https://console.firebase.google.com/u/0/project/digital-testimony-dev/firestore/databases/-default-/data/~2FgeneralCourts~2F194~2Fbills~2FH3872 <- has an example where I set both summary and topics.
Here is an example of the corrected summary formatting with appropriate CSV output,
As additional proof, I can read it back in via pandas
Checklist
Screenshots
Add some screenshots highlighting your changes.
Known issues
If you've run against limitations or caveats, include them here. Include follow-up issues as well.
Steps to test/reproduce
For each feature or bug fix, create a step by step list for how a reviewer can test it out. E.g.: