Add Integrity Checker for BookTitle #13854

TheYorouzoya · 2025-09-11T12:38:57Z

Closes #12271

This PR adds a new integrity checker for the booktitle field along with the associated cleanup actions.

I'll be doing the development in the following major phases (these will be refined further as I move along):

Gathering and defining a clear set of requirements
Drafting an implementation approach
Adding the core logic/classes required
Integrating the feature into the GUI
Iterating until the feature meets expectations

Steps to test

Integrity Checker marks an improper Booktitle field

Check Integrity dialog lists a failing check for each embedded field individually

New booktitle cleanup checkbox and sub-panel added to Cleanup Entries Dialog Box

Clicking on the "Clean up 'booktitle'..." checkbox will enable the cleanup sub-panel, allowing the user to pick a cleanup action for each individual field found in booktitle

Post clean up, the fields are moved to their respective field editors

Mandatory checks

I own the copyright of the code submitted and I license it under the MIT license
I manually tested my changes in running JabRef (always required)
I added JUnit tests for changes (if applicable)
I added screenshots in the PR description (if change is visible to the user)
I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)
I checked the user documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request updating file(s) in https://github.com/JabRef/user-documentation/tree/main/en.

Part of JabRef#12271

TheYorouzoya · 2025-09-12T12:35:35Z

Requirements

1. Integrity Checker should flag a `booktitle` with year numbers, locations, and page numbers

The booktitle field should be marked similar to when an integrity check fails for other fields (for example, Author)

The field should also show up in the Check Integrity dialog box in Quality -> Check Integrity

2. Integrity Checker should allow the user to move nested fields to their appropriate place via a cleanup action

If a booktitle is found to contain years, locations, or page numbers, the user should be allowed to perform a cleanup action under the Quality -> Clean Up Entries dialog box which would move them to their appropriate fields.

Since it is possible that the field we're trying to move our extracted data to is already populated, the user should be given a choice whether to perform the move for each piece of data found. This translates to adding three cleanup actions or options to the dialog box (one each for year, page number, and location).

Implementation Approach: Details, Edge Cases, and Questions

I'll take an example from the issue post and break it down as follows:

Input: European Conference on Circuit Theory and Design, {ECCTD} 2015, Trondheim, Norway, August 24-26, 2015
Year: 2015, 2015
Month: August
Page Numbers: 24-26
Locations: Trondheim, Norway

Year Numbers

Years take the form XXXX: four digits standing on their own. Since the booktitle field deals with scientific journals, books, and conferences, we can narrow this down to the range 16XX to 20XX (the first scientific journals date from the 1600s).

Q1. What about multiple years present in a title? Which one do we pick to transfer to the year field? My guess would be the latest one.
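A minimal sketch of the year detection described above, assuming a regex restricted to 1600-2099 and the "pick the latest year" guess from Q1. The class and method names are hypothetical and are not taken from the PR's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BooktitleYearSketch {
    // Assumption: a year is a four-digit number from 1600 to 2099
    // that is not part of a longer run of digits.
    private static final Pattern YEAR =
            Pattern.compile("(?<!\\d)(1[6-9]\\d{2}|20\\d{2})(?!\\d)");

    /** Returns every year-like number in the order it appears. */
    static List<Integer> findYears(String booktitle) {
        List<Integer> years = new ArrayList<>();
        Matcher matcher = YEAR.matcher(booktitle);
        while (matcher.find()) {
            years.add(Integer.parseInt(matcher.group(1)));
        }
        return years;
    }

    /** Picks the latest year, following the guess in Q1. */
    static Optional<Integer> latestYear(String booktitle) {
        return findYears(booktitle).stream().max(Integer::compare);
    }

    public static void main(String[] args) {
        String input = "European Conference on Circuit Theory and Design, "
                + "{ECCTD} 2015, Trondheim, Norway, August 24-26, 2015";
        System.out.println(findYears(input));
        System.out.println(latestYear(input));
    }
}
```

The lookarounds keep longer digit runs (for example a five-digit identifier) from being reported as years.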

Months

Months are spelled out in text (like January, February, and so on). These can be easily picked up via a regex or a simple string comparison. Once found, they can be moved to the Month field under Optional fields.
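As a rough illustration of the regex route, the sketch below matches full English month names on word boundaries. The names used here are hypothetical, and matching is deliberately case-sensitive as an assumption to reduce false hits on ordinary words such as "may".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BooktitleMonthSketch {
    // Assumption: only fully spelled-out, capitalized English month names count.
    private static final Pattern MONTH = Pattern.compile(
            "\\b(January|February|March|April|May|June|July|August"
                    + "|September|October|November|December)\\b");

    /** Returns the first month name found in the booktitle, if any. */
    static Optional<String> findMonth(String booktitle) {
        Matcher matcher = MONTH.matcher(booktitle);
        if (matcher.find()) {
            return Optional.of(matcher.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findMonth("{ECCTD} 2015, Trondheim, Norway, August 24-26, 2015"));
    }
}
```

"March" and "May" can still appear as regular words at the start of a sentence, so a real checker would probably need extra context before moving them.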

Page Numbers

These will typically be of the form <number>-<number> (like 22-45) or <number>--<number> (like 22--45). There aren't any edge cases to discuss here, other than whether we want to support more formats.

Note

I'm assuming that the 24-26 in the example is referring to a page range and not a date. If that is not the case, then I'd like some examples of page numbers in a booktitle for reference.
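A small sketch of the two page-range forms, under the assumption that a match is normalized to the double-hyphen form commonly used in BibTeX. The names are hypothetical; the pattern cannot tell a page range from a day range such as "24-26", which is exactly the ambiguity raised in the note.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BooktitlePageRangeSketch {
    // <number>-<number> or <number>--<number>, not embedded in a longer
    // digit/hyphen sequence (which would look more like an ISBN or a date).
    private static final Pattern PAGE_RANGE =
            Pattern.compile("(?<![\\d-])(\\d+)--?(\\d+)(?![\\d-])");

    /** Returns the first range found, normalized to "first--last". */
    static Optional<String> findPageRange(String booktitle) {
        Matcher matcher = PAGE_RANGE.matcher(booktitle);
        if (matcher.find()) {
            return Optional.of(matcher.group(1) + "--" + matcher.group(2));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findPageRange("Trondheim, Norway, August 24-26, 2015"));
    }
}
```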

Locations

These can refer to the names of countries and/or cities present in the field. While a previous attempt doesn't provide much in terms of details, this comment on the original issue post does suggest an offline-friendly approach.

Since we'll encounter location names only in cases of conferences, we're only looking at cities big enough to host one of them. GeoNames provides multiple datasets with cutoffs based on population (>500, >1000, >5000, >15000). Just to be on the safe side, we can pick the >1000 population dataset and incorporate that into our search database.

The data has 162,090 cities in it along with a bunch of associated information. We'll strip away all the metadata and keep only the names. Since the data provides city names in UTF-8 as well as in ASCII as separate columns, we'll flatten it further down to one city per line and deduplicate the resulting dataset. Doing this brings us down to 1.9MB from our ~30MB starting point with the total number of cities going up to 173,371 (not that many cities have different UTF-8 and ASCII names). We can add the list of countries on top of this to get our final dataset.
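The flattening step can be sketched roughly as follows, assuming the tab-separated GeoNames dump layout in which the second column holds the UTF-8 name and the third the ASCII name. The class name and the sample rows are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class GeoNamesFlattenSketch {
    /**
     * Keeps only the name columns of each row and deduplicates them.
     * Assumed layout: geonameid TAB name TAB asciiname TAB further metadata.
     */
    static Set<String> flatten(Iterable<String> rows) {
        Set<String> names = new TreeSet<>(); // sorted and free of duplicates
        for (String row : rows) {
            String[] columns = row.split("\t", -1);
            if (columns.length < 3) {
                continue; // skip rows that do not match the assumed layout
            }
            addIfPresent(names, columns[1]); // UTF-8 name
            addIfPresent(names, columns[2]); // ASCII name
        }
        return names;
    }

    private static void addIfPresent(Set<String> names, String value) {
        String trimmed = value.trim();
        if (!trimmed.isEmpty()) {
            names.add(trimmed);
        }
    }

    public static void main(String[] args) {
        // Illustrative rows only; the ids are placeholders, not real GeoNames ids.
        List<String> rows = Arrays.asList(
                "1\tZ\u00fcrich\tZurich\tZuerich",
                "2\tTrondheim\tTrondheim\t");
        flatten(rows).forEach(System.out::println);
    }
}
```

Writing the resulting set out one name per line gives the flattened file; a city contributes two lines only when its UTF-8 and ASCII names differ.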

Note

If we really want to be stingy about space, we can compress this further with something like gzip and get down to ~800KB, as plaintext compresses quite well.
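If the list were shipped gzipped, reading it back could look roughly like this sketch, which uses the JDK's java.util.zip streams. The class and method names are hypothetical, and the byte-array round trip in main merely stands in for a bundled resource file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipLocationListSketch {
    /** Compresses a one-name-per-line list into gzip bytes. */
    static byte[] compress(List<String> names) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(String.join("\n", names).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the names back, one per line, from a gzipped stream. */
    static List<String> readNames(InputStream gzipped) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] packed = compress(Arrays.asList("Trondheim", "Norway", "Rio de Janeiro"));
        System.out.println(readNames(new ByteArrayInputStream(packed)));
    }
}
```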

Loading this many entries into memory shouldn't be that big of a burden either. A GPT-assisted rough calculation puts us at around 40MB of heap usage if we're using a HashSet<String> [Edit: I'm now reconsidering the HashSet and instead using a specialized Trie data structure to accommodate punctuation and whitespace within location names].
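One way such a trie could work is sketched below: it is keyed by whole words rather than characters, so multi-word names like "New York" match as a unit and surrounding punctuation is ignored. This is only an assumption about the design, not the LocationDetector from the PR, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LocationTrieSketch {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String location; // non-null when a complete location name ends here
    }

    private final Node root = new Node();

    /** Splits on whitespace and ASCII punctuation, dropping empty tokens. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[\\s\\p{Punct}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    void add(String location) {
        List<String> tokens = tokenize(location);
        if (tokens.isEmpty()) {
            return;
        }
        Node node = root;
        for (String token : tokens) {
            node = node.children.computeIfAbsent(token, key -> new Node());
        }
        node.location = location;
    }

    /** Scans left to right, keeping the longest known name at each position. */
    List<String> findAll(String text) {
        List<String> tokens = tokenize(text);
        List<String> found = new ArrayList<>();
        int start = 0;
        while (start < tokens.size()) {
            Node node = root;
            String longest = null;
            int end = start;
            for (int i = start; i < tokens.size(); i++) {
                node = node.children.get(tokens.get(i));
                if (node == null) {
                    break;
                }
                if (node.location != null) {
                    longest = node.location;
                    end = i + 1;
                }
            }
            if (longest != null) {
                found.add(longest);
                start = end; // continue after the matched name
            } else {
                start++;
            }
        }
        return found;
    }
}
```

With "New York" and "York" both loaded, a title containing "New York" reports only the longer name, which a flat HashSet lookup over single tokens could not do.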

Q2. Is this approach okay? If there are issues with the overhead, we can use a bloom filter, but that is a probabilistic data structure which can lead to some false positives.
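For comparison, a bare-bones bloom filter might look like the sketch below. The sizing, the double-hashing scheme, and all names are assumptions chosen for illustration. It never reports a false negative, but it can report a name that was never added, which is the false-positive risk mentioned in Q2.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class LocationBloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    LocationBloomFilterSketch(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String name) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(name, i));
        }
    }

    /** false means definitely absent; true means possibly present. */
    boolean mightContain(String name) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(name, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int index(String name, int i) {
        int h1 = name.hashCode();
        int h2 = fnv1a(name) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second hash.
    private static int fnv1a(String name) {
        int hash = 0x811c9dc5;
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }
}
```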

Q3. Is there a specific field where these should be moved to as part of the cleanup action? I have noticed these fields: address, location, and venue.

TheYorouzoya · 2025-09-12T12:40:31Z

@koppor please check if the approach fits with the expectations of the feature, and help clarify those questions. I'll start laying out some of the core logic in the meantime.

TheYorouzoya · 2025-09-27T10:44:17Z

Since I have not received any feedback on the approach for the last two weeks, I'll be pushing an implementation as per the outlined approach in a couple of days.

- Add integrity checkers for each of the following fields: Year, Month, Page Range, and Location.
- Update FieldCheckers to apply the checkers on the Booktitle, Journal, and Journaltitle fields.
- Add location data (countries_cities1000.txt) containing country and city names. The data is sourced from
- Add LocationDetector to load location data and allow location extraction from input.
- Add BooktitleCleanupField enum to represent each checker field.
- Add BooktitleCleanupAction enum to represent the following cleanup actions: remove only, replace, replace if empty, and skip.
- Consolidate reused pre-compiled regex patterns into RegexPatterns.
- Add BooktitleCleanupPanel and BooktitleCleanupPanelViewModel (inspired by FieldFormatterCleanups) to the GUI.
- Add BooktitleCleanupPanel.fxml for the cleanup section's layout and integrate it into CleanupPresetPanel.fxml.
- Update CleanupPreferences, CleanupPresetPanel, and CleanupWorker to include Booktitle cleanups.
- Update module-info.java to export the BooktitleCleanupAction enum.
- Add localization keys to JabRef_en.properties for all the new cleanup GUI text.
- Add unit tests and relevant javadoc.

Part of JabRef#12271
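To make the four cleanup actions concrete, here is a hedged sketch over a plain map of field names to values. The semantics are an assumption read off the action names (every action except skip strips the text from booktitle; replace overwrites the target; replace if empty writes only to an empty target). The enum and method here are hypothetical stand-ins, not the PR's BooktitleCleanupAction.

```java
import java.util.HashMap;
import java.util.Map;

final class BooktitleCleanupSketch {
    /** Hypothetical mirror of the four actions named in the commit message. */
    enum Action { REMOVE_ONLY, REPLACE, REPLACE_IF_EMPTY, SKIP }

    static void apply(Map<String, String> entry, String targetField,
                      String extracted, Action action) {
        if (action == Action.SKIP) {
            return; // leave both booktitle and the target field untouched
        }
        // Assumed: every other action strips the extracted text from booktitle.
        String booktitle = entry.getOrDefault("booktitle", "");
        entry.put("booktitle",
                booktitle.replace(extracted, "").replaceAll("\\s{2,}", " ").trim());

        String current = entry.getOrDefault(targetField, "");
        boolean write = action == Action.REPLACE
                || (action == Action.REPLACE_IF_EMPTY && current.isBlank());
        if (write) {
            entry.put(targetField, extracted);
        }
    }

    public static void main(String[] args) {
        Map<String, String> entry = new HashMap<>();
        entry.put("booktitle", "ECCTD 2015");
        entry.put("year", "2014");
        apply(entry, "year", "2015", Action.REPLACE_IF_EMPTY);
        System.out.println(entry);
    }
}
```

A real cleanup would also need to tidy the punctuation left behind in booktitle (for example a dangling comma), which this sketch does not attempt.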

jabgui/src/main/java/org/jabref/gui/commonfxcontrols/BooktitleCleanupPanel.java

jablib/src/main/java/org/jabref/logic/cleanup/BooktitleCleanups.java

jablib/src/test/java/org/jabref/logic/integrity/BooktitleLocationCheckerTest.java

jablib/src/test/java/org/jabref/logic/integrity/BooktitleMonthCheckerTest.java

TheYorouzoya · 2025-09-29T18:54:45Z

@koppor check if the feature implementation so far is up to expectations. Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.

jablib/src/test/java/org/jabref/logic/integrity/BooktitlePageRangeCheckerTest.java

jablib/src/test/java/org/jabref/logic/integrity/BooktitleYearCheckerTest.java

jablib/src/test/java/org/jabref/logic/util/LocationDetectorTest.java

koppor · 2025-10-17T12:19:42Z

> Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.

I assume the functionality is existing for the other settings. Thus, please also support this for the new checkbox. Thank you 😅.

TheYorouzoya · 2025-10-17T13:56:20Z

> > Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.
>
> I assume the functionality is existing for the other settings. Thus, please also support this for the new checkbox. Thank you 😅.

My question wasn't whether or not to add the functionality. It asked whether enabling it requires A or B, i.e., which of the two options needs to be updated to get it to work.

Update Changelog

2528b8c

TheYorouzoya and others added 3 commits September 29, 2025 22:25

Merge branch 'main' into booktitle-integrity-checker

4332291

Fix Whitespace Formatting Issues Across Files

8f0a25f
