Skip to content

Conversation

@LonelyMidoriya
Copy link
Contributor

@LonelyMidoriya LonelyMidoriya commented Dec 18, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Improved validation for BOM‑prefixed text strings (UTF‑16BE and UTF‑8), with stricter checks for language escape sequences and preserved behavior for valid/absent sequences.
    • Now emits explicit warnings for invalid language escape sequences and for unsupported UTF‑16LE encoding.
  • Chores

    • Added named warnings for clearer diagnostics when encoding or escape issues are detected.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 18, 2025

Walkthrough

Adds private validation of language escape sequences for BOM-prefixed text strings in COSString.isTextString, new warning constants in StringWarnings, and a UTF‑16LE warning replacement; validators log specific warnings and cause isTextString to return false on invalid sequences.

Changes

Cohort / File(s) Summary
Text String Escape Sequence Validation
src/main/java/org/verapdf/cos/COSString.java
Extends isTextString to detect BOM-prefixed UTF‑16BE and UTF‑8 strings and invoke new private validators checkUTF16BEEscapeSequence(byte[]) and checkUTF8EscapeSequence(byte[]); adds isASCIILetter(byte) helper; validators check language escape sequences (0x00 0x1B for UTF‑16BE, 0x1B for UTF‑8), require 2 or 4 ASCII letters, log warnings and return false for invalid patterns.
String Warning Constants
src/main/java/org/verapdf/as/warnings/StringWarnings.java
Adds StringWarnings class with NOT_ASCII_LETTER, INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH, and NOT_SUPPORTED_UTF16LE_ENCODING public static final string constants; COSString now uses these constants for logged warnings.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Check BOM detection bytes and that validators only run when BOMs are present.
  • Review escape-sequence parsing offsets, length checks (2 vs 4 letters) and ASCII-letter validation.
  • Confirm logging uses the new StringWarnings constants and UTF‑16LE message replacement.

Poem

🐰 I nibble bytes beneath the moonlit stack,

I sniff for BOMs and mark the letters back,
Two or four I count with careful paw,
I flag the wrongs and leave the right in awe,
A tiny hop — strings tidy on my track 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately summarizes the main change: adding warning constants and validation logic for invalid language escape sequences in text strings (UTF-16BE and UTF-8 BOMs).
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch language-escape-sequence

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/main/java/org/verapdf/cos/COSString.java (1)

260-260: Fix grammar in warning message.

The warning message uses "have" but should use "has" for grammatical correctness.

🔎 Apply this diff:
-                LOGGER.warning("Text string language escape sequence have invalid length");
+                LOGGER.warning("Text string language escape sequence has invalid length");
-                LOGGER.warning("Text string language escape sequence have invalid length");
+                LOGGER.warning("Text string language escape sequence has invalid length");

Also applies to: 291-291

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcfffe0 and 43aace0.

📒 Files selected for processing (1)
  • src/main/java/org/verapdf/cos/COSString.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/main/java/org/verapdf/cos/COSString.java (1)
src/main/java/org/verapdf/tools/PDFDocEncoding.java (1)
  • PDFDocEncoding (28-112)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (2)
src/main/java/org/verapdf/cos/COSString.java (2)

298-301: LGTM!

The helper method correctly identifies ASCII letters (A-Z and a-z) with proper bit masking to handle Java's signed bytes.


237-265: No changes needed—the language code encoding is correct per PDF specification.

The PDF specification defines the escape sequence as: the Unicode value U+001B (byte sequence 0 followed by 27), followed by a 2-character ISO 639 language code—for example, en for English or ja for Japanese. Language escape sequences in UTF-16BE strings use a 2-byte language tag with an optional 2-byte country code.

The current implementation is correct. Language codes in UTF-16BE strings are stored as raw ASCII bytes (not UTF-16BE encoded characters). For example, the language code "en" is represented as two single bytes (0x65, 0x6E), not as UTF-16BE encoded characters that would require four bytes (0x00 0x65, 0x00 0x6E). The method correctly validates this format by checking individual bytes as ASCII letters.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/main/java/org/verapdf/cos/COSString.java (1)

222-236: Critical: Validation return values are ignored (issue not resolved from previous review).

The checkUTF16BEEscapeSequence and checkUTF8EscapeSequence methods return false when invalid escape sequences are detected, but isTextString ignores these return values and always returns true for BOM-prefixed strings. This completely defeats the purpose of the validation—strings with invalid language escape sequences are incorrectly classified as valid text strings.

🔎 Apply this diff to fix the issue:
     public boolean isTextString() {
         if (value.length > 2) {
             if ((value[0] & 0xFF) == 0xFE && (value[1] & 0xFF) == 0xFF) {
-                checkUTF16BEEscapeSequence(value);
-                return true;
+                return checkUTF16BEEscapeSequence(value);
             }
         }
         if (value.length > 3) {
             if ((value[0] & 0xFF) == 0xEF && (value[1] & 0xFF) == 0xBB && (value[2] & 0xFF) == 0xBF) {
-                checkUTF8EscapeSequence(value);
-                return true;
+                return checkUTF8EscapeSequence(value);
             }
         }
         return PDFDocEncoding.isPDFDocEncodingString(value);
🧹 Nitpick comments (3)
src/main/java/org/verapdf/as/warnings/StringWarnings.java (1)

3-7: Consider following utility class best practices.

The class contains only static constants but can still be instantiated or subclassed. Consider making the class final and adding a private constructor to prevent instantiation, following standard utility class patterns.

🔎 View suggested enhancement:
-public class StringWarnings {
+public final class StringWarnings {
+    private StringWarnings() {
+        // Prevent instantiation
+    }
+    
     public static final String NOT_ASCII_LETTER = "Text string language escape sequence contains not ASCII letter";
     public static final String INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH = "Text string language escape sequence has invalid length";
     public static final String NOT_SUPPORTED_UTF16LE_ENCODING = "String object uses encoding UTF16-LE not supported by PDF";
 }
src/main/java/org/verapdf/cos/COSString.java (2)

261-261: Add space after comma.

-                LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);
+                LOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);

292-292: Add space after comma.

-                LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);
+                LOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d90e11 and 656b2d3.

📒 Files selected for processing (2)
  • src/main/java/org/verapdf/as/warnings/StringWarnings.java (1 hunks)
  • src/main/java/org/verapdf/cos/COSString.java (3 hunks)
🔇 Additional comments (5)
src/main/java/org/verapdf/cos/COSString.java (5)

23-23: LGTM!

Import correctly added to reference the new warning constants.


115-115: LGTM!

Refactoring to use a centralized constant improves maintainability.


268-297: UTF-8 escape sequence validation logic looks correct.

The validation properly handles UTF-8 encoding where ASCII characters are single bytes. The pattern matching for 2 or 4 letter sequences followed by ESC markers is appropriate.


299-302: LGTM!

The helper method correctly identifies ASCII letters (A-Z, a-z) by checking the appropriate hex ranges.


238-266: Fix minor formatting issue: add space after comma in log statement (line 261)

Line 261 is missing a space after the comma in the log statement: LOGGER.log(Level.WARNING,StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH); should be LOGGER.log(Level.WARNING, StringWarnings.INVALID_LANGUAGE_ESCAPE_SEQUENCE_LENGTH);

@MaximPlusov MaximPlusov merged commit 08a131a into integration Dec 18, 2025
9 checks passed
@MaximPlusov MaximPlusov deleted the language-escape-sequence branch December 18, 2025 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants