Add summary backfill #1948

chiroptical · 2025-09-24T00:17:58Z

Summary

Add a summary/topics backfill script. It is intended as a one-off script, but it could also be useful if the document trigger doesn't work properly.

First test: the output looks good enough to attempt writing back to Firebase.

H3284,generated_summary,Summary: The bill proposes a program to help state agencies and departments prioritize the hiring of military veterans and individuals who have completed service with the Peace Corps, AmeriCorps, and Commonwealth Corps. It aims to improve recruitment, development, and retention of these individuals in public sector jobs. The human resources division would provide certifications to confirm their status as veterans or alumni of these programs. This initiative seeks to enhance employment opportunities for those who have served in these capacities.,None

https://console.firebase.google.com/u/0/project/digital-testimony-dev/firestore/databases/-default-/data/~2FgeneralCourts~2F194~2Fbills~2FH3872 <- has an example where I set both summary and topics.

Here is an example of the corrected summary formatting, with appropriate CSV output:

H4602,generated_topics,"The bill proposes to increase the Monson select board from 3 to 5 members, allowing for broader representation. If the bill is passed, three new select board members would be elected at the next annual town election, with varying term lengths based on the number of votes received. After these initial elections, all future select board members would serve 3-year terms. The bill would take effect as soon as it is passed.","[{'topic': 'Political advertising', 'category': 'Government Operations and Elections'}, {'topic': 'Government studies and investigations', 'category': 'Government Operations and Elections'}, {'topic': 'Community life and organization', 'category': 'Housing and Community Development'}]"

As additional proof, I can read it back in via pandas:

> python
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_csv("~/summaries-and-topics.csv")
>>> df.head()
  bill_id            status summary topics
0      H1           skipped     NaN    NaN
1     H10  previous_summary     NaN    NaN
2    H100  previous_summary     NaN    NaN
3   H1000  previous_summary     NaN    NaN
4   H1001  previous_summary     NaN    NaN
>>>
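
The `topics` column in the row above is written as a Python repr (single-quoted keys), not JSON, so reading it back takes one extra step. A minimal sketch using only the standard library follows. It assumes the header shown by `df.head()`; `parse_topics` and `SAMPLE_CSV` are hypothetical names, and the sample summary is shortened to its first sentence.

```python
import ast
import csv
import io

# One row in the shape shown above; the summary is shortened to its first sentence.
SAMPLE_CSV = '''bill_id,status,summary,topics
H4602,generated_topics,"The bill proposes to increase the Monson select board from 3 to 5 members, allowing for broader representation.","[{'topic': 'Political advertising', 'category': 'Government Operations and Elections'}, {'topic': 'Government studies and investigations', 'category': 'Government Operations and Elections'}, {'topic': 'Community life and organization', 'category': 'Housing and Community Development'}]"
'''


def parse_topics(cell: str):
    """Turn a topics cell back into a list of dicts, or None if empty or 'None'."""
    if cell in ("", "None"):
        return None
    # The cell is a Python repr (single quotes), so json.loads would reject it;
    # ast.literal_eval evaluates literals only, never arbitrary code.
    return ast.literal_eval(cell)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
topics = parse_topics(rows[0]["topics"])
print([t["category"] for t in topics])
```

With pandas, the same function could be applied to the `topics` column after handling the `NaN` cells.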

Checklist

On the frontend, I've made my strings translate-able.
If I've added shared components, I've added a storybook story.
I've made pages responsive and look good on mobile.


vercel · 2025-09-24T00:18:04Z

Project	Deployment	Preview	Comments	Updated (UTC)
maple-dev	Ready	Preview	Comment	Oct 29, 2025 1:33am

llm/.gitignore

llm/backfill_summaries.py

llm/normalize_summaries.py

chiroptical · 2025-10-15T01:40:25Z

llm/normalize_summaries.py

+    strip_summary = re.sub(r"^Summary:", "", summary)
+    lines = strip_summary.splitlines()
+    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
+    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]


Strip all extraneous whitespace and filter any empty lines
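
Put together as a function, the four lines above might look like the sketch below. The signature and the final `return` are assumptions: the diff excerpt does not show how the real `normalize_summary` combines the cleaned lines, so joining them with single spaces is only one plausible choice.

```python
import re


def normalize_summary(summary: str) -> str:
    # These four steps are copied from the diff under review.
    strip_summary = re.sub(r"^Summary:", "", summary)
    lines = strip_summary.splitlines()
    handle_list_items = [re.sub(r"^- ", "", x) for x in lines]
    handle_remaining_whitespace = [x.strip() for x in handle_list_items if x.strip() != ""]
    # Assumption: join the cleaned lines with single spaces. The excerpt does
    # not show what the real function returns.
    return " ".join(handle_remaining_whitespace)


raw = "Summary: The bill does X.\n\n- First point.\n-  Second point. \n"
print(normalize_summary(raw))
```

Under that assumption, the leading `Summary:` label, the list markers, the blank line, and the stray whitespace are all removed, leaving one line of prose.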

llm/normalize_summaries.py

llm/test_bill_on_document_created.py

chiroptical · 2025-10-22T00:24:13Z

llm/.gitignore

 databases/
-.secret.local
+.secret.local
+summaries-and-topics.csv


This file is generated by running llm/backfill_summaries.py, and I assume we don't want to commit it accidentally.

jicruz96

LGTM

nesanders

Looking great! Just a few comments, all minor stuff.

llm/normalize_summaries.py

nesanders · 2025-10-29T00:27:43Z

llm/backfill_summaries.py

@@ -0,0 +1,82 @@
+# This script fills any missing 'summary' or 'topics' fields on the data model.


Very minor: recommend changing this to a module docstring for readability, i.e.

"""This script... """

Recommend also explaining in it that, by default, the script queries session 194 bills, writes its output to a local CSV, and also automatically updates the documents in Firebase, IIUC.

nesanders · 2025-10-29T00:28:38Z

llm/backfill_summaries.py

+import csv
+from normalize_summaries import normalize_summary
+
+# Application Default credentials are automatically created.


Do we have docs on how to connect to the MAPLE prod firebase, assuming that's what you are doing? If so, can we link that here?

Good question. As far as I know, yes: see the "Contributing Backend Features to Dev/Prod" section of the main README.md file.

Great! Can we link here?

By link, do you mean just noting that it exists? Or do you want a direct link to GitHub? Or do you expect that we'll use Sphinx or something similar in the future and add a relative link in the Sphinx docs?

llm/backfill_summaries.py

nesanders · 2025-10-29T00:30:38Z

llm/backfill_summaries.py

+    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]
+
+
+bills_ref = db.collection("generalCourts/194/bills")


Recommend moving global constants to the top of the script and using the ALL_CAPS naming convention, per PEP 8.

nesanders · 2025-10-29T00:30:53Z

llm/backfill_summaries.py

+
+bills_ref = db.collection("generalCourts/194/bills")
+bills = bills_ref.get()
+with open("./summaries-and-topics.csv", "w") as csvfile:


Recommend making this filename a global constant surfaced at top of script.
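
Taken together, the review suggestions on this file (a module docstring, ALL_CAPS constants at the top, and a named output filename) could be sketched as below. This is not the script from the PR: the constant names, `make_row`, `write_rows`, and the docstring wording are hypothetical, and the Firestore calls are left out so that the sketch runs without credentials. The header matches the columns shown by `df.head()` in the description.

```python
"""Backfill missing 'summary' and 'topics' fields on bills.

Illustrative wording only: by default the script queries the session 194
bills, writes its results to a local CSV, and updates the documents in
Firebase.
"""
import csv

# Module-level constants in ALL_CAPS, per PEP 8. All names are hypothetical.
GENERAL_COURT = 194
BILLS_COLLECTION = f"generalCourts/{GENERAL_COURT}/bills"
CSV_FILENAME = "./summaries-and-topics.csv"
CSV_HEADER = ["bill_id", "status", "summary", "topics"]


def make_row(bill_id, status, summary, topics):
    """Format one CSV row, mirroring the f-string formatting in the diff."""
    return [f"{bill_id}", f"{status}", f"{summary}", f"{topics}"]


def write_rows(rows, filename=CSV_FILENAME):
    """Write the header and rows; csv.writer quotes cells that contain commas."""
    with open(filename, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(CSV_HEADER)
        writer.writerows(rows)
```

Passing the filename as a parameter with the constant as its default keeps the default behavior while letting a test or a dry run write somewhere else.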

llm/backfill_summaries.py

nesanders · 2025-10-29T00:34:59Z

llm/backfill_summaries.py

+            continue
+        # Note: `normalize_summary` does some post-processing to clean up the summaries
+        # As of 2025-10-21 this was necessary due to the LLM prompt
+        summary = normalize_summary(summary["summary"])


It can be a followup issue/PR, but do we also need to inject this function somewhere in our production code, i.e. when we run this as a lambda?

Yes, we do. Good callout.

Thanks, did you file a followup issue?

I haven't done that yet, but I can do it quickly!

nesanders

Thanks for addressing the comments! There are a couple dangling comments I am replying to here, but nothing major, so happy to approve.

nesanders · 2025-11-05T00:56:54Z

llm/backfill_summaries.py

+import csv
+from normalize_summaries import normalize_summary
+
+# Application Default credentials are automatically created.


Great! Can we link here?

nesanders · 2025-11-05T00:57:24Z

llm/backfill_summaries.py

+            continue
+        # Note: `normalize_summary` does some post-processing to clean up the summaries
+        # As of 2025-10-21 this was necessary due to the LLM prompt
+        summary = normalize_summary(summary["summary"])


Thanks, did you file a followup issue?

vercel bot deployed to Preview – maple-dev September 24, 2025 00:21

vercel bot deployed to Preview – maple-dev September 24, 2025 00:58

vercel bot deployed to Preview – maple-dev September 24, 2025 01:14

vercel bot deployed to Preview – maple-dev October 8, 2025 01:05

chiroptical marked this pull request as ready for review October 8, 2025 01:21

chiroptical requested review from Mephistic, alexjball, kiminkim724, mertbagt, mvictor55, nesanders, sashamaryl and timblais as code owners October 8, 2025 01:21