TheYorouzoya
Contributor

@TheYorouzoya TheYorouzoya commented Sep 11, 2025

Closes #12271

This PR adds a new integrity checker for the booktitle field along with the associated cleanup actions.

I'll be doing the development in the following major phases (these will be refined further as I move along):

  • Gathering and defining a clear set of requirements
  • Drafting an implementation approach
  • Adding the core logic/classes required
  • Integrating the feature into the GUI
  • Iterating until the feature meets expectations

Steps to test

  1. Integrity Checker marks an improper Booktitle field
  2. Check Integrity dialog lists a failing check for each embedded field individually
  3. New booktitle cleanup checkbox and sub-panel added to the Cleanup Entries dialog box
  4. Clicking on the "Clean up 'booktitle'..." checkbox enables the cleanup sub-panel, allowing the user to pick a cleanup action for each individual field found in booktitle
  5. Post cleanup, the fields are moved to their respective field editors

Mandatory checks

@TheYorouzoya
Contributor Author

TheYorouzoya commented Sep 12, 2025

Requirements

1. Integrity Checker should flag a booktitle with year numbers, locations, and page numbers

The booktitle field should be marked in the same way as other fields that fail an integrity check (for example, Author).

The field should also show up in the Check Integrity dialog box in Quality -> Check Integrity

2. Integrity Checker should allow the user to move nested fields to their appropriate place via a cleanup action

If a booktitle is found to contain years, locations, or page numbers, the user should be able to run a cleanup action from the Quality -> Clean Up Entries dialog box that moves them to their appropriate fields.

Since it is possible that the field we're trying to move our extracted data to is already populated, the user should be given a choice whether to perform the move for each piece of data found. This translates to adding three cleanup actions or options to the dialog box (one each for year, page number, and location).
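To make the per-field choice concrete, here is a minimal sketch in Java. It is not the PR's code: the enum name `CleanupChoice`, the class `BooktitleMoveSketch`, and the use of a plain `Map` in place of a JabRef entry are all assumptions made for illustration. It also assumes that a value is only stripped from booktitle when it was actually written to the target field or when the user asked for removal only.

```java
import java.util.Map;

// Hypothetical per-field choices; the names are illustrative, not JabRef's API.
enum CleanupChoice { REMOVE_ONLY, REPLACE, REPLACE_IF_EMPTY, SKIP }

final class BooktitleMoveSketch {
    /**
     * Applies one choice to a value that was found inside booktitle.
     * The entry is modelled as a plain field -> value map (an assumption).
     */
    static void apply(Map<String, String> entry, String targetField,
                      String extracted, CleanupChoice choice) {
        String current = entry.get(targetField);
        boolean targetEmpty = current == null || current.trim().isEmpty();

        // Write to the target field on REPLACE, or on REPLACE_IF_EMPTY when it is empty.
        boolean write = choice == CleanupChoice.REPLACE
                || (choice == CleanupChoice.REPLACE_IF_EMPTY && targetEmpty);
        // Strip from booktitle only if the value was moved or removal was requested.
        boolean strip = write || choice == CleanupChoice.REMOVE_ONLY;

        if (write) {
            entry.put(targetField, extracted);
        }
        if (strip) {
            String booktitle = entry.getOrDefault("booktitle", "");
            entry.put("booktitle", tidy(booktitle.replace(extracted, "")));
        }
    }

    // Collapses leftover whitespace and commas after a value was cut out.
    static String tidy(String s) {
        return s.replaceAll("\\s{2,}", " ")
                .replaceAll("(\\s*,)+", ",")
                .replaceAll("^[\\s,]+|[\\s,]+$", "");
    }
}
```

With this shape, `SKIP` leaves the entry untouched, and `REPLACE_IF_EMPTY` on an already populated target leaves both the target and booktitle as they were, so no data is lost.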

Implementation Approach: Details, Edge Cases, and Questions

I'll take an example from the issue post and break it down as follows:

Input: European Conference on Circuit Theory and Design, {ECCTD} 2015, Trondheim, Norway, August 24-26, 2015
Year: 2015, 2015
Month: August
Page Numbers: 24-26
Locations: Trondheim, Norway

Year Numbers

Years appear as standalone four-digit numbers of the form XXXX. Since the booktitle field deals with scientific journals, books, and conferences, we can narrow this to the range 16XX to 20XX (the earliest scientific journals date from the 1600s).

Q1. What about multiple years present in a title? Which one do we pick to transfer to the year field? My guess would be the latest one.
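As a rough sketch of the year detection described above, the following Java snippet matches standalone four-digit numbers between 1600 and 2099 and returns the latest one. Picking the latest year is only the guess from Q1, and the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BooktitleYearSketch {
    // Four-digit numbers from 1600 to 2099; the lookarounds reject longer digit runs.
    private static final Pattern YEAR =
            Pattern.compile("(?<!\\d)(1[6-9]\\d{2}|20\\d{2})(?!\\d)");

    /** Returns the latest year found in the booktitle, if any (the Q1 guess). */
    static Optional<Integer> latestYear(String booktitle) {
        Matcher matcher = YEAR.matcher(booktitle);
        Optional<Integer> latest = Optional.empty();
        while (matcher.find()) {
            int year = Integer.parseInt(matcher.group(1));
            if (!latest.isPresent() || year > latest.get()) {
                latest = Optional.of(year);
            }
        }
        return latest;
    }
}
```

Note that a pattern like this cannot tell a year from a four-digit page number in the same range, so the page range check would need to run alongside it.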

Months

Months are spelled out in text (like January, February, and so on). These can be easily picked up via a regex or a simple string comparison. Once found, they can be moved to the Month field under Optional fields.
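A minimal sketch of the regex variant, assuming full English month names only (abbreviations such as "Aug." are not covered). The names `BooktitleMonthSketch` and `firstMonth` are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BooktitleMonthSketch {
    // Full English month names as whole words, matched case-insensitively.
    private static final Pattern MONTH = Pattern.compile(
            "\\b(January|February|March|April|May|June|July|August"
                    + "|September|October|November|December)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns the first month name found in the booktitle, as written there. */
    static Optional<String> firstMonth(String booktitle) {
        Matcher matcher = MONTH.matcher(booktitle);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

One caveat of this approach is that "May" and "March" are also ordinary English words, so a title containing them could be flagged incorrectly.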

Page Numbers

These will typically be of the form <number>-<number> (like 22-45) or <number>--<number> (like 22--45). There aren't any edge cases to discuss here, other than whether we want to support more formats.

Note

I'm assuming that the 24-26 in the example is referring to a page range and not a date. If that is not the case, then I'd like some examples of page numbers in a booktitle for reference.
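The two formats above could be matched roughly as follows. This is a sketch under assumptions: the names are hypothetical, each number is capped at six digits, ranges whose start exceeds their end are skipped, and the result is normalized to the double-hyphen form. As the note says, it cannot distinguish a page range from a day range such as 24-26.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BooktitlePageRangeSketch {
    // <number>-<number> or <number>--<number>, each number at most six digits.
    private static final Pattern RANGE =
            Pattern.compile("(?<!\\d)(\\d{1,6})\\s*-{1,2}\\s*(\\d{1,6})(?!\\d)");

    /** Returns the first plausible range, normalized to "start--end". */
    static Optional<String> firstPageRange(String booktitle) {
        Matcher matcher = RANGE.matcher(booktitle);
        while (matcher.find()) {
            int start = Integer.parseInt(matcher.group(1));
            int end = Integer.parseInt(matcher.group(2));
            if (start <= end) {
                return Optional.of(start + "--" + end);
            }
        }
        return Optional.empty();
    }
}
```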

Locations

These can refer to the names of countries and/or cities present in the field. While a previous attempt doesn't provide much in terms of details, this comment on the original issue post does suggest an offline-friendly approach.

Since we'll encounter location names only in cases of conferences, we're only looking at cities big enough to host one of them. GeoNames provides multiple datasets with cutoffs based on population (>500, >1000, >5000, >15000). Just to be on the safe side, we can pick the >1000 population dataset and incorporate that into our search database.

The dataset lists 162,090 cities along with a fair amount of associated metadata. We'll strip the metadata and keep only the names. Since each city's name is provided both in UTF-8 and in ASCII as separate columns, we'll flatten these to one name per line and deduplicate the result. This brings the file down from ~30MB to 1.9MB, while the number of names only rises to 173,371 (not many cities have different UTF-8 and ASCII names). We can add the list of countries on top of this to get our final dataset.
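The flattening step can be sketched like this, assuming the tab-separated GeoNames layout in which the second column holds the UTF-8 name and the third the ASCII name. The class and method names are hypothetical, and reading the file into lines is left to the caller.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class GeoNamesFlattenSketch {
    /**
     * Keeps only the UTF-8 name (column 1) and ASCII name (column 2) of each
     * tab-separated row and deduplicates them; one set element per output line.
     */
    static SortedSet<String> flatten(List<String> tsvLines) {
        SortedSet<String> names = new TreeSet<>();
        for (String line : tsvLines) {
            String[] columns = line.split("\t", -1);
            if (columns.length < 3) {
                continue; // skip malformed rows
            }
            addIfPresent(names, columns[1]);
            addIfPresent(names, columns[2]);
        }
        return names;
    }

    private static void addIfPresent(SortedSet<String> names, String value) {
        String trimmed = value.trim();
        if (!trimmed.isEmpty()) {
            names.add(trimmed);
        }
    }
}
```

A city whose two name columns are identical contributes one entry, and one whose columns differ contributes two, which is why the name count ends up only slightly above the city count.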

Note

If we really want to be stingy about space, we can compress this further with something like gzip and get down to roughly 800KB, as plaintext compresses quite well.
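If the list were shipped gzipped, loading it could look roughly like the sketch below, which uses only the JDK's `GZIPInputStream`. The class and method names are hypothetical, and how the stream is obtained (for example from a bundled resource) is left open.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

final class GzipLocationLoaderSketch {
    /** Reads a gzipped, UTF-8 encoded list with one location name per line. */
    static Set<String> readNames(InputStream gzipped) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            Set<String> names = new HashSet<>();
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    names.add(trimmed);
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```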

Loading this many entries into memory shouldn't be much of a burden either. A GPT-assisted rough calculation puts us at around 40MB of heap usage if we're using a HashSet<String> [Edit: I'm now reconsidering the HashSet in favor of a specialized Trie data structure, to account for punctuation and whitespace within location names].
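To illustrate why a trie helps with names containing whitespace or punctuation, here is a minimal character-level sketch. It is not the PR's LocationDetector; the class name, the case-insensitive matching, and the word-boundary rule are assumptions. It scans the text and keeps the longest name that starts and ends on a word boundary, so "New York" wins over "York".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LocationTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a location name ends at this node
    }

    private final Node root = new Node();

    void add(String name) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.computeIfAbsent(
                    Character.toLowerCase(name.charAt(i)), key -> new Node());
        }
        node.terminal = true;
    }

    /** Returns all non-overlapping longest matches, in order of appearance. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = isBoundary(text, i) ? longestMatchEnd(text, i) : -1;
            if (end > i) {
                hits.add(text.substring(i, end));
                i = end;
            } else {
                i++;
            }
        }
        return hits;
    }

    // Walks the trie from 'start' and remembers the last boundary-aligned hit.
    private int longestMatchEnd(String text, int start) {
        Node node = root;
        int best = -1;
        for (int j = start; j < text.length(); j++) {
            node = node.children.get(Character.toLowerCase(text.charAt(j)));
            if (node == null) {
                break;
            }
            if (node.terminal && isBoundary(text, j + 1)) {
                best = j + 1;
            }
        }
        return best;
    }

    // A position is a boundary unless it falls between two letters or digits.
    private static boolean isBoundary(String text, int pos) {
        if (pos == 0 || pos == text.length()) {
            return true;
        }
        return !(Character.isLetterOrDigit(text.charAt(pos - 1))
                && Character.isLetterOrDigit(text.charAt(pos)));
    }
}
```

Unlike a `HashSet<String>` lookup, this does not require splitting the booktitle into candidate tokens first, which is where multi-word names and embedded punctuation become awkward.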

Q2. Is this approach okay? If there are issues with the overhead, we can use a bloom filter, but that is a probabilistic data structure which can lead to some false positives.

Q3. Is there a specific field where these should be moved to as part of the cleanup action? I have noticed these fields: address, location, and venue.

@TheYorouzoya
Contributor Author

TheYorouzoya commented Sep 12, 2025

@koppor please check if the approach fits with the expectations of the feature, and help clarify those questions. I'll start laying out some of the core logic in the meantime.

@TheYorouzoya
Contributor Author

Since I have not received any feedback on the approach for the last two weeks, I'll be pushing an implementation as per the outlined approach in a couple of days.

TheYorouzoya and others added 3 commits September 29, 2025 22:25
- Add integrity checkers for each of the following fields: Year, Month,
  Page Range, and Location.
- Update FieldCheckers to apply the checkers on the Booktitle, Journal,
  and Journaltitle fields.
- Add location data (countries_cities1000.txt) containing country and city
  names. The data is sourced from
- Add LocationDetector to load location data and allow location extraction
  from input.
- Add BooktitleCleanupField enum to represent each checker field.
- Add BooktitleCleanupAction enum to represent the following cleanup
  actions: remove only, replace, replace if empty, and skip.
- Consolidate reused pre-compiled regex patterns into RegexPatterns.
- Add BooktitleCleanupPanel and BooktitleCleanupPanelViewModel (inspired
  by FieldFormatterCleanups) to the GUI.
- Add BooktitleCleanupPanel.fxml for the cleanup section's layout and
  integrate it into CleanupPresetPanel.fxml.
- Update CleanupPreferences, CleanupPresetPanel, and CleanupWorker to
  include Booktitle cleanups.
- Update module-info.java to export the BooktitleCleanupAction enum.
- Add localization keys to JabRef_en.properties for all the new cleanup
  GUI text.
- Add unit tests and relevant javadoc.

Part of JabRef#12271
@TheYorouzoya
Contributor Author

@koppor check if the feature implementation so far is up to expectations. Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.

@koppor koppor added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Oct 17, 2025
@koppor
Member

koppor commented Oct 17, 2025

Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.

I assume the functionality is existing for the other settings. Thus, please also support this for the new checkbox. Thank you 😅.

@TheYorouzoya
Contributor Author

Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.

I assume the functionality is existing for the other settings. Thus, please also support this for the new checkbox. Thank you 😅.

My question wasn't whether to add the functionality. It asked which of the two options (a preference migration or an update to CliPreferences) needs to be changed to get it to work.

Development

Successfully merging this pull request may close these issues.

New integrity checker for booktitle
