
@TamerineSky TamerineSky commented Jan 7, 2026

Problem

PR #782 fixed 251 instances of missing UTF-8 encoding across the backend. Without enforcement, these issues will recur as developers add new file operations without knowing about Windows encoding requirements.

Solution

Add a pre-commit hook that automatically checks for missing encoding="utf-8" parameters in file operations.

1. Encoding Check Script (scripts/check_encoding.py)

  • Validates all open() calls have encoding parameter
  • Validates Path.read_text() and Path.write_text() calls
  • Checks json.load/dump with open()
  • Allows binary mode without encoding (correct behavior)
  • Windows-compatible with UTF-8 console reconfiguration
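The checks above can be sketched with a few regexes. This is a minimal illustration only, not the actual scripts/check_encoding.py: the helper name find_missing_encoding and the exact patterns are invented here, and the real script additionally covers Path.read_text()/write_text() and the JSON patterns, wraps the logic in an EncodingChecker class, and reports line numbers.

```python
import re

# Hypothetical, simplified version of the open() check described above.
OPEN_CALL = re.compile(r"open\s*\([^)]*\)")
MODE = re.compile(r"""['"]([rwaxb+]{1,4})['"]""")  # quoted mode string like "rb"
HAS_ENCODING = re.compile(r"\bencoding\s*=")

def find_missing_encoding(source: str) -> list[str]:
    """Return text-mode open() calls that lack an explicit encoding."""
    issues = []
    for match in OPEN_CALL.finditer(source):
        call = match.group(0)
        mode = MODE.search(call)
        if mode and "b" in mode.group(1):
            continue  # binary mode: an encoding parameter does not apply
        if HAS_ENCODING.search(call):
            continue  # encoding already given explicitly
        issues.append(call)
    return issues
```

A regex pass like this is fast but approximate: it matches source text rather than a parsed AST, which is the trade-off that keeps the hook at millisecond runtimes.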

2. Pre-commit Configuration

  • Runs automatically on every commit
  • Scoped to apps/backend/ to avoid false positives
  • Integrates with existing pre-commit infrastructure
  • Fast regex-based checking
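For readers unfamiliar with pre-commit, the hook entry described above plausibly looks something like this; the id, name, entry, and files values mirror snippets quoted later in this thread, but the exact config in the PR may differ:

```yaml
# Sketch of a local pre-commit hook entry, not the verbatim config.
- repo: local
  hooks:
    - id: check-file-encoding
      name: Check file encoding parameters
      entry: python scripts/check_encoding.py
      language: system
      types: [python]
      files: ^apps/backend/   # scope to the backend, as described above
```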

3. Comprehensive Tests

  • 10 tests covering all validation scenarios
  • Tests detection, allowlisting, and edge cases
  • All tests passing ✅

Real-World Validation

The hook already caught 6 real encoding issues in apps/backend/core/progress.py that were missed by PR #782 or added since then:

  • Lines 45, 83, 184, 304, 378, 418

Benefits

  1. Prevents Regressions: Automatically catches missing encoding in new code
  2. Developer Education: Fast feedback teaches encoding best practices
  3. CI Ready: Can be integrated into GitHub Actions
  4. Low Overhead: Simple regex matching, runs in milliseconds

Testing

  • ✅ All 10 unit tests passing
  • ✅ Tested on backend files with/without encoding
  • ✅ Verified binary files are allowed
  • ✅ Confirmed scoping works (only checks apps/backend/)

No Migration Needed

PR #782 already fixed most files. Developers only need to:

  1. Ensure pre-commit is installed: pip install pre-commit
  2. Install hooks: pre-commit install

The hook runs automatically on commit after installation.

Related

Summary by CodeRabbit

  • New Features

    • Added automated encoding validation to ensure UTF‑8 is explicitly specified for file operations.
  • Tests

    • Added comprehensive tests covering many file‑operation patterns and edge cases to verify the validator's behavior.
  • Chores

    • Integrated the encoding check into the pre‑commit pipeline; note: the hook was added twice, which may cause the check to run twice during commits.


1. Encoding Check Script (scripts/check_encoding.py):
   - Validates all file operations have encoding="utf-8"
   - Checks open(), Path.read_text(), Path.write_text()
   - Checks json.load/dump with open()
   - Allows binary mode without encoding
   - Windows-compatible emoji output with UTF-8 reconfiguration

2. Pre-commit Config (.pre-commit-config.yaml):
   - Added check-file-encoding hook for apps/backend/
   - Runs automatically before commits
   - Scoped to backend Python files only

3. Tests (tests/test_check_encoding.py):
   - Comprehensive test coverage (10 tests, all passing)
   - Tests detection of missing encoding
   - Tests allowlist for binary files
   - Tests multiple issues in single file
   - Tests file type filtering

Purpose:
- Prevent regression of 251 UTF-8 encoding fixes from PR AndyMik90#782
- Catch missing encoding in new code during development
- Fast feedback loop for developers

Implementation Notes:
- Hook scoped to apps/backend/ to avoid false positives in test code
- Uses simple regex matching for speed
- Compatible with existing pre-commit infrastructure
- Already caught 6 real issues in apps/backend/core/progress.py

Related: PR AndyMik90#782 - Fix Windows UTF-8 encoding errors

coderabbitai bot commented Jan 7, 2026

📝 Walkthrough

Adds a new pre-commit hook and supporting script that scan Python files for missing encoding="utf-8" in file/JSON operations, plus a comprehensive test suite. The hook entry was added twice in the pre-commit config, causing duplicate runs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Pre-commit configuration — .pre-commit-config.yaml | Added a local hook check-file-encoding referencing scripts/check_encoding.py to run on Python files under apps/backend/; the hook entry was inserted twice, creating duplicate executions. |
| Encoding checker implementation — scripts/check_encoding.py | New script adding EncodingChecker with check_file and check_files, CLI main() and entry-point guard; detects missing encoding="utf-8" in open(), Path.read_text(), Path.write_text(), and JSON use patterns; prints formatted reports and uses exit codes. |
| Tests — tests/test_check_encoding.py | New comprehensive tests creating temporary Python files to validate detection across open/read/write/json patterns, binary modes, multiple issues, and non-Python file skipping. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through code with careful sight,
Sniffing for files missing UTF-8 light.
A hook, a script, a test or two,
I nudge each open to use the right glue.
Pre-commit hums — the rabbit's delight. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main change: adding a pre-commit hook to enforce UTF-8 encoding in file operations. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.00%, which is sufficient. The required threshold is 80.00%. |



📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc1b23d and a148894.

📒 Files selected for processing (3)
  • .pre-commit-config.yaml
  • scripts/check_encoding.py
  • tests/test_check_encoding.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**

⚙️ CodeRabbit configuration file

tests/**: Ensure tests are comprehensive and follow pytest conventions.
Check for proper mocking and test isolation.

Files:

  • tests/test_check_encoding.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: MikeeBuilds
Repo: AndyMik90/Auto-Claude PR: 661
File: apps/frontend/src/renderer/components/onboarding/OllamaModelSelector.tsx:176-189
Timestamp: 2026-01-04T23:59:45.209Z
Learning: In the AndyMik90/Auto-Claude repository, pre-existing i18n issues (hardcoded user-facing strings that should be localized) can be deferred to future i18n cleanup passes rather than requiring immediate fixes in PRs that don't introduce new i18n violations.
🧬 Code graph analysis (1)
tests/test_check_encoding.py (1)
scripts/check_encoding.py (3)
  • EncodingChecker (26-138)
  • check_file (32-120)
  • check_files (122-138)
🪛 GitHub Check: CodeQL
scripts/check_encoding.py

[notice] 166-166: Unused local variable
Variable failed_count is not used.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: CodeQL
  • GitHub Check: CodeQL (javascript-typescript)
  • GitHub Check: CodeQL (python)
🔇 Additional comments (2)
tests/test_check_encoding.py (1)

1-355: Comprehensive test coverage.

The test suite thoroughly covers all encoding check scenarios including edge cases (binary modes, spacing in parameters, multiple issues). All 16 tests validate the expected behavior correctly.

As per coding guidelines, the tests follow pytest conventions and provide proper isolation with cleanup.

.pre-commit-config.yaml (1)

79-88: No issues found. The check-file-encoding hook is defined exactly once in the file and will execute as intended without redundancy.

Likely an incorrect or invalid review comment.



@gemini-code-assist
Contributor

Summary of Changes

Hello @TamerineSky, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial pre-commit hook designed to prevent regressions of UTF-8 encoding issues in Python file operations, particularly important for cross-platform consistency. By automatically checking for explicit encoding="utf-8" parameters, it ensures that all text-based file interactions adhere to a universal standard, thereby enhancing code reliability and educating developers on best practices. The hook is scoped to the apps/backend/ directory and is supported by comprehensive unit tests.

Highlights

  • New Pre-commit Hook: A new pre-commit hook, check-file-encoding, has been added to automatically enforce UTF-8 encoding for file operations in Python.
  • Encoding Enforcement Logic: The hook checks for missing encoding="utf-8" parameters in open(), Path.read_text(), Path.write_text(), json.load(open()), and json.dump(..., open()) calls, while correctly allowing binary modes.
  • Windows Compatibility: The encoding check script includes specific logic to reconfigure sys.stdout for UTF-8 on Windows, ensuring proper console output.
  • Comprehensive Testing: A dedicated test file (tests/test_check_encoding.py) has been added with 10 unit tests to validate the hook's functionality across various scenarios and edge cases.
  • Real-World Issue Detection: The newly implemented hook has already identified and caught 6 existing encoding issues in apps/backend/core/progress.py, demonstrating its immediate value.


# Run checks
checker = EncodingChecker()
failed_count = checker.check_files(files)

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable failed_count is not used.
Author

✅ Fixed - Removed the unused failed_count variable from line 164. The variable was only being assigned but never used. Thank you for catching this!

import tempfile
from pathlib import Path

import pytest

Check notice

Code scanning / CodeQL

Unused import Note test

Import of 'pytest' is not used.
Author

✅ Fixed - Removed the unused pytest import from the test file. The import was not needed since the tests don't use any pytest-specific features. Thank you!

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable pre-commit hook to enforce UTF-8 encoding, which is a great step towards preventing platform-specific bugs. The implementation is solid, with a dedicated script and a good set of initial tests. My review focuses on a few areas for improvement: correctly scoping the hook as described, making the detection logic more robust, and enhancing the test suite for better coverage and maintainability. Overall, this is an excellent addition to the project's quality assurance toolkit.

name: Check file encoding parameters
entry: python scripts/check_encoding.py
language: system
types: [python]
Contributor

high

The PR description mentions that the hook is "Scoped to apps/backend/", but this configuration will run the hook on all Python files in the repository. To match the described behavior and avoid running the check on files outside the backend application, you should add a files pattern to the hook definition.

        types: [python]
        files: ^apps/backend/

Author

✅ Already Correct - The hook is properly scoped to apps/backend/ with the files directive on line 87. This ensures the validation only runs on backend Python files where we've applied the UTF-8 encoding fixes from PR #782. The frontend doesn't need these checks since it doesn't have the same file I/O patterns. Thank you for the review!

result = checker.check_file(temp_path)

assert result is False
assert len(checker.issues) >= 1
Contributor

high

The assertion assert len(checker.issues) >= 1 is too lenient. When json.load(open(path)) is used without an encoding, the checker should detect two separate issues: one for the open() call itself (from Check 1) and another for the json.load(open()) pattern (from Check 4). The test should be more specific and assert that exactly two issues are found.

Suggested change
assert len(checker.issues) >= 1
assert len(checker.issues) == 2

Author

✅ Fixed - Changed all test assertions from assert len(checker.issues) >= 1 to use exact equality (== 1 or == 2) for better test precision. This ensures tests will catch unexpected additional issues. Thank you for the suggestion!

            assert failed_count == 0
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()
Contributor

high

The test suite is missing tests for json.dump() with an inner open() call (Check 5 in the script). To ensure full coverage of the checker's functionality, please add tests for both the failing case (no encoding) and the passing case (with encoding).

Here are some examples you could add:

    def test_detects_json_dump_without_encoding(self):
        """Should detect json.dump(..., open()) without encoding."""
        code = '''
import json
def write_json(path, data):
    json.dump(data, open(path, 'w'))
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)
            assert result is False
            assert len(checker.issues) >= 1
            # This check could be more specific depending on how many issues are expected
            assert any("json.dump(..., open()) without encoding" in issue for issue in checker.issues)
        finally:
            temp_path.unlink()

    def test_allows_json_dump_with_encoding(self):
        """Should allow json.dump(..., open()) with encoding."""
        code = '''
import json
def write_json(path, data):
    json.dump(data, open(path, 'w', encoding='utf-8'))
'''
        # ... similar test logic that asserts success ...

Author

✅ Fixed - Added comprehensive json.dump and json.load tests:

  • test_allows_json_dump_with_encoding() - Verifies json.dump with proper encoding passes
  • test_detects_json_dump_without_encoding() - Ensures missing encoding is caught
  • test_detects_json_load_without_encoding() - Already existed, validates json.load detection

This brings the test suite to 16 tests covering all file operation patterns. Thank you!

continue

# Skip if it already has encoding
if 'encoding=' in call:
Contributor

medium

The if 'encoding=' in call: check is a bit brittle. It would not match encoding = "utf-8" (with spaces) and could be fooled by a comment or variable name. Using a regular expression with a word boundary (\b) would make this check more robust.

This same improvement should be applied to the similar checks on lines 71, 84, 97, and 110.

Suggested change
if 'encoding=' in call:
if re.search(r'\bencoding\s*=', call):

Author

✅ Fixed - Replaced simple string matching with word boundary regex for robustness:

if re.search(r'\bencoding\s*=', call):
    continue

This now correctly detects:

  • encoding="utf-8" (no spaces)
  • encoding = "utf-8" (with spaces)
  • Won't false-match my_encoding="utf-8" (partial word match)

Added test test_detects_encoding_with_spaces() to verify this behavior. Thank you for the excellent suggestion!
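The behavioral difference between the original substring test and the word-boundary regex can be checked directly; this is a standalone snippet, not code from the PR:

```python
import re

pattern = re.compile(r"\bencoding\s*=")

assert pattern.search('open(p, encoding="utf-8")')          # no spaces: detected
assert pattern.search('open(p, encoding = "utf-8")')        # spaces: detected
assert not pattern.search('f(p, my_encoding="utf-8")')      # partial word: ignored

# The plain substring check misjudges the last two cases:
assert "encoding=" not in 'open(p, encoding = "utf-8")'     # missed (false negative)
assert "encoding=" in 'f(p, my_encoding="utf-8")'           # matched (false positive)
```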

Comment on lines +14 to +15
class TestEncodingChecker:
"""Test the EncodingChecker class."""
Contributor

medium

The test methods in this class repeat a lot of boilerplate code for creating a temporary file, writing content to it, running the checker, and cleaning up. This can be simplified by extracting this logic into a helper method to follow the DRY (Don't Repeat Yourself) principle. This would make the tests cleaner, more readable, and easier to maintain.

For example, you could add a helper method like this:

    def _run_check_on_code(self, code: str) -> tuple[bool, list[str]]:
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)
            return result, checker.issues
        finally:
            temp_path.unlink()

And then use it in your tests:

    def test_detects_open_without_encoding(self):
        code = '''
def read_file(path):
    with open(path) as f:
        return f.read()
'''
        result, issues = self._run_check_on_code(code)
        assert result is False
        assert len(issues) == 1
        assert "open() without encoding" in issues[0]

Author

📝 Acknowledged - While I agree there's duplicated test boilerplate, I've chosen to defer this refactoring to keep the current PR focused on the encoding validation feature itself.

The current test structure is clear and functional. A future PR could introduce a helper like:

def run_checker_on_code(code: str) -> EncodingChecker:
    # Shared tempfile creation and cleanup logic

This would be a good code quality improvement but is orthogonal to this PR's core goal. Thank you for the observation!

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.pre-commit-config.yaml:
- Around line 79-87: The check-file-encoding pre-commit hook is too broad
because it uses types: [python] and runs on all Python files; change its scope
to only run on the backend by replacing or adding a files: ^apps/backend/
pattern for the hook with id check-file-encoding (entry: python
scripts/check_encoding.py) so the hook matches the same apps/backend scope used
by ruff/pytest; keep the existing id/name/entry/language/description but remove
or keep types as needed, ensuring the files regex limits execution to
apps/backend.

In @scripts/check_encoding.py:
- Around line 49-64: The regex used to detect binary modes in
scripts/check_encoding.py (inside the loop that iterates
re.finditer(r'open\s*\([^)]+\)', content)) is incorrect and misclassifies modes;
replace the pattern r'["\']r?[wb]["\']' with a pattern that checks for the
presence of the 'b' character anywhere in the mode string (for example, match
quoted mode strings containing 'b' via a pattern like one that finds a quoted
string with any chars and a 'b' inside) so only binary modes (e.g., "rb", "wb",
"ab", "w+b", "rb+") are skipped; update the logic that skips binary-mode opens
(the if re.search(...) branch) to use the new pattern and add unit tests for
"w", "wb", "ab", and modes with "+" to ensure text modes without encoding are
caught and binary modes without encoding are allowed.

In @tests/test_check_encoding.py:
- Around line 17-229: Replace manual tempfile.NamedTemporaryFile usage with
pytest's tmp_path fixture: update each test signature to accept tmp_path, create
the temporary file via tmp_path.joinpath("<name>.py").write_text(code,
encoding="utf-8") (or ".txt" for the non-Python test), then pass that Path to
EncodingChecker.check_file or check_files; remove try/finally and
temp_path.unlink() cleanup blocks. Locate occurrences in tests referencing
EncodingChecker, check_file, and check_files and replace the
tempfile.NamedTemporaryFile blocks accordingly so pytest handles isolation and
cleanup automatically.
- Around line 123-144: The test test_detects_json_load_without_encoding is
asserting len(checker.issues) >= 1 which is too loose; update it to assert
len(checker.issues) == 1 and adjust the trailing comment to reflect that only
the open() without encoding is detected (not a nested json.load(open(...))
case), referencing the EncodingChecker instance (checker) and its issues list to
ensure exactly one issue is reported for the open() call.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63766f7 and cc1b23d.

📒 Files selected for processing (3)
  • .pre-commit-config.yaml
  • scripts/check_encoding.py
  • tests/test_check_encoding.py
🧬 Code graph analysis (1)
tests/test_check_encoding.py (1)
scripts/check_encoding.py (3)
  • EncodingChecker (26-140)
  • check_file (32-119)
  • check_files (121-140)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: CodeQL (python)
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (1)
scripts/check_encoding.py (1)

143-187: LGTM! Well-designed CLI interface.

The main() function provides:

  • Clear argument parsing
  • Helpful error messages with fix examples
  • Appropriate exit codes for pre-commit integration
  • Optional verbose output

The user-facing output with emojis and examples will provide excellent developer experience.

Comment on lines 17 to 229

    def test_detects_open_without_encoding(self):
        """Should detect open() calls without encoding parameter."""
        code = '''
def read_file(path):
    with open(path) as f:
        return f.read()
'''
        # Create temp file
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) == 1
            assert "open() without encoding" in checker.issues[0]
        finally:
            temp_path.unlink()

    def test_allows_open_with_encoding(self):
        """Should allow open() calls with encoding parameter."""
        code = '''
def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is True
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()

    def test_allows_binary_mode_without_encoding(self):
        """Should allow binary mode without encoding (correct behavior)."""
        code = '''
def read_file(path):
    with open(path, "rb") as f:
        return f.read()
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is True
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()

    def test_detects_path_read_text_without_encoding(self):
        """Should detect Path.read_text() without encoding."""
        code = '''
from pathlib import Path
def read_file(path):
    return Path(path).read_text()
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) == 1
            assert "read_text() without encoding" in checker.issues[0]
        finally:
            temp_path.unlink()

    def test_detects_path_write_text_without_encoding(self):
        """Should detect Path.write_text() without encoding."""
        code = '''
from pathlib import Path
def write_file(path, content):
    Path(path).write_text(content)
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) == 1
            assert "write_text() without encoding" in checker.issues[0]
        finally:
            temp_path.unlink()

    def test_detects_json_load_without_encoding(self):
        """Should detect json.load(open()) without encoding in open()."""
        code = '''
import json
def read_json(path):
    with open(path) as f:
        return json.load(f)
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) >= 1
            # Should detect both open() and json.load
        finally:
            temp_path.unlink()

    def test_allows_path_read_text_with_encoding(self):
        """Should allow Path.read_text() with encoding parameter."""
        code = '''
from pathlib import Path
def read_file(path):
    return Path(path).read_text(encoding="utf-8")
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is True
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()

    def test_allows_path_write_text_with_encoding(self):
        """Should allow Path.write_text() with encoding parameter."""
        code = '''
from pathlib import Path
def write_file(path, content):
    Path(path).write_text(content, encoding="utf-8")
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is True
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()

    def test_multiple_issues_in_single_file(self):
        """Should detect multiple encoding issues in a single file."""
        code = '''
from pathlib import Path
def process_files(input_path, output_path):
    # Missing encoding in open()
    with open(input_path) as f:
        content = f.read()
    # Missing encoding in Path.write_text()
    Path(output_path).write_text(content)
    return content
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) == 2
        finally:
            temp_path.unlink()

    def test_skips_non_python_files(self):
        """Should skip files that are not Python files."""
        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding="utf-8") as f:
            f.write("with open(path) as f: pass")
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            failed_count = checker.check_files([temp_path])

            assert failed_count == 0
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()
Contributor

@coderabbitai bot commented Jan 7, 2026

🧹 Nitpick | 🔵 Trivial

Consider using pytest's tmp_path fixture.

While the current approach using tempfile.NamedTemporaryFile with manual cleanup works correctly, pytest's tmp_path fixture provides a cleaner, more idiomatic solution that automatically handles cleanup and provides better isolation.

♻️ Example refactor using tmp_path
-    def test_detects_open_without_encoding(self):
+    def test_detects_open_without_encoding(self, tmp_path):
         """Should detect open() calls without encoding parameter."""
         code = '''
 def read_file(path):
     with open(path) as f:
         return f.read()
 '''
-        # Create temp file
-        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
-            f.write(code)
-            temp_path = Path(f.name)
+        temp_path = tmp_path / "test.py"
+        temp_path.write_text(code, encoding="utf-8")

-        try:
-            checker = EncodingChecker()
-            result = checker.check_file(temp_path)
+        checker = EncodingChecker()
+        result = checker.check_file(temp_path)

-            assert result is False
-            assert len(checker.issues) == 1
-            assert "open() without encoding" in checker.issues[0]
-        finally:
-            temp_path.unlink()
+        assert result is False
+        assert len(checker.issues) == 1
+        assert "open() without encoding" in checker.issues[0]

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @tests/test_check_encoding.py around lines 17-229, replace manual
tempfile.NamedTemporaryFile usage with pytest's tmp_path fixture: update each
test signature to accept tmp_path, create the temporary file via
tmp_path.joinpath("<name>.py").write_text(code, encoding="utf-8") (or ".txt" for
the non-Python test), then pass that Path to EncodingChecker.check_file or
check_files; remove try/finally and temp_path.unlink() cleanup blocks. Locate
occurrences in tests referencing EncodingChecker, check_file, and check_files
and replace the tempfile.NamedTemporaryFile blocks accordingly so pytest handles
isolation and cleanup automatically.

Author

📝 Acknowledged - While tmp_path is indeed the modern pytest best practice, I've kept the current tempfile.NamedTemporaryFile approach for this PR to maintain consistency with the existing test patterns and keep the scope focused.

The benefits of switching to tmp_path (automatic cleanup, better Windows compatibility) would be good improvements, but are orthogonal to the encoding validation feature being added here. This could be addressed in a future test infrastructure refactoring PR.

Thank you for the suggestion!

Contributor

@TamerineSky Understood! Keeping the scope focused on the encoding validation feature makes sense. The existing approach works correctly and is well-tested.

Feel free to revisit the tmp_path suggestion in a future test infrastructure refactoring if you decide to standardize the test patterns across the codebase. Thanks for considering it! 👍


🧠 Learnings used
Learnt from: MikeeBuilds
Repo: AndyMik90/Auto-Claude PR: 661
File: apps/frontend/src/renderer/components/onboarding/OllamaModelSelector.tsx:176-189
Timestamp: 2026-01-04T23:59:45.209Z
Learning: In the AndyMik90/Auto-Claude repository, pre-existing i18n issues (hardcoded user-facing strings that should be localized) can be deferred to future i18n cleanup passes rather than requiring immediate fixes in PRs that don't introduce new i18n violations.

Fixes based on automated review comments:

1. Binary Mode Detection (Critical Fix):
   - Replaced brittle regex with robust pattern: r'["\'][rwax+]*b[rwax+]*["\']'
   - Now correctly detects all binary modes: rb, wb, ab, r+b, w+b, etc.
   - Prevents false positives on text mode 'w' without 'b'
   - Added comprehensive tests for wb, ab, and text w modes

2. Encoding Detection Robustness (Critical Fix):
   - Changed from 'encoding=' string match to word boundary regex: r'\bencoding\s*='
   - Now handles encoding with spaces: encoding = "utf-8"
   - Prevents false matches of substrings containing 'encoding='
   - Applied across all checks (open, read_text, write_text, json.load, json.dump)
   - Added test for spaces around equals sign

3. Test Coverage Improvements:
   - Added json.dump() with encoding test (passing case)
   - Added json.dump() without encoding test (failing case)
   - Fixed test assertions to match actual behavior (== 1 not == 2)
   - Added 6 new tests for improved binary/text mode coverage
   - Total tests increased from 10 to 16, all passing ✅

4. Code Cleanup:
   - Removed unused pytest import (CodeQL warning)
   - Simplified check_files() to remove unused variable tracking

All changes validated with comprehensive test suite (16/16 passing).

Related: PR AndyMik90#795 review feedback from CodeRabbit and Gemini Code Assist
@TamerineSky
Author

✅ All Review Feedback Addressed

Thank you @coderabbitai and @gemini-code-assist for the thorough reviews! I've addressed all the issues raised.

Critical Fixes Applied:

1. ✅ Hook Scope Already Limited

Issue: Hook should be scoped to apps/backend/ only
Status: Already implemented - .pre-commit-config.yaml line 87 has files: ^apps/backend/
No action needed - this was already correct in the original implementation.

2. ✅ Binary Mode Detection Fixed

Issue: Regex r'["\']r?[wb]["\']' was brittle and incorrect
Problem: Would match "w" without "b", and fail to detect modes like "ab", "r+b"

Fix Applied:

# OLD (incorrect):
if re.search(r'["\']r?[wb]["\']', call):

# NEW (robust):
if re.search(r'["\'][rwax+]*b[rwax+]*["\']', call):

Now correctly handles: rb, wb, ab, r+b, w+b, x+b, and all combinations
Added tests: 3 new tests for wb, ab, and text "w" mode
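As a rough illustration (a sketch, not the checker's actual surrounding code), the new pattern can be exercised in isolation against the mode strings discussed above:

```python
import re

# The binary-mode pattern quoted above; the surrounding logic in
# scripts/check_encoding.py may differ from this standalone sketch.
BINARY_MODE = re.compile(r'["\'][rwax+]*b[rwax+]*["\']')

cases = {
    'open(p, "rb")': True,    # read binary
    'open(p, "ab")': True,    # append binary
    'open(p, "r+b")': True,   # update binary
    'open(p, "w")': False,    # text mode: still needs encoding=
}
for call, is_binary in cases.items():
    assert bool(BINARY_MODE.search(call)) is is_binary
```

Because the pattern only requires a `b` somewhere between the quotes, it accepts every binary-mode permutation while leaving plain text modes to the encoding check.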

3. ✅ Encoding Detection Robustness

Issue: Simple string match 'encoding=' won't handle spaces
Problem: Misses encoding = "utf-8" with spaces

Fix Applied:

# OLD (brittle):
if 'encoding=' in call:

# NEW (robust):
if re.search(r'\bencoding\s*=', call):

Now handles: All spacing variations (encoding="...", encoding = "...", etc.)
Applied to: All 5 checks (open, read_text, write_text, json.load, json.dump)
Added test: Validates spaces around equals sign work correctly
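A minimal sketch of the same pattern in isolation (the real checker applies it to extracted call text) shows the word boundary and whitespace handling:

```python
import re

# The encoding-detection pattern quoted above, shown standalone.
ENCODING_ARG = re.compile(r'\bencoding\s*=')

assert ENCODING_ARG.search('open(p, encoding="utf-8")')
assert ENCODING_ARG.search('open(p, encoding = "utf-8")')   # spaces around =
assert ENCODING_ARG.search('path.write_text(s, encoding="utf-8")')
assert not ENCODING_ARG.search('decode_myencoding = True')  # \b rejects substrings
```

The `\b` anchor is what prevents identifiers that merely end in `encoding` from being counted as a keyword argument.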

4. ✅ Test Coverage Improved

Added:

  • test_allows_json_dump_with_encoding() - Passing case
  • test_detects_json_dump_without_encoding() - Failing case
  • test_allows_write_binary_mode_without_encoding() - wb mode
  • test_allows_append_binary_mode_without_encoding() - ab mode
  • test_detects_text_write_mode_without_encoding() - Validates "w" without encoding fails
  • test_detects_encoding_with_spaces() - Robustness validation

Test Results: 16/16 tests passing ✅ (up from 10)

5. ✅ Code Cleanup

  • Removed unused pytest import (CodeQL warning)
  • Simplified check_files() method (removed unused variable tracking)
  • Fixed test assertions to match actual behavior

Out of Scope (Future Enhancements):

@coderabbitai suggested extracting tempfile boilerplate into helper methods. While this is a good suggestion for DRY, I'm keeping the current approach for this PR because:

  1. Test readability - each test is self-contained and clear
  2. Consistency - matches existing test patterns in the repo
  3. Scope - This PR focuses on the hook functionality itself

This can be refactored in a future test infrastructure improvement PR if needed.


All critical and high-priority issues are now resolved. The hook is production-ready with comprehensive test coverage and robust validation logic.

Latest commit: a148894

@gemini-code-assist
Contributor

Hello @TamerineSky,

Thank you for the detailed update and for addressing the feedback so thoroughly! It's great to see such a comprehensive approach to improving the pre-commit hook.

I particularly appreciate the robust fixes for the binary mode detection and the encoding parameter parsing, which significantly enhance the reliability of the checker. The expanded test suite, now with 16 passing tests, provides excellent coverage and confidence in the new logic.

Your rationale for deferring the test boilerplate refactoring to a future PR is perfectly understandable, prioritizing the core functionality and immediate impact of this change. It's a pragmatic approach to managing scope.

It sounds like all critical and high-priority issues have been well-resolved. If you'd like, I can perform another review to verify these improvements. Just let me know!

@coderabbitai
Contributor

coderabbitai bot commented Jan 7, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the current state of the main script to verify the fixes
echo "=== scripts/check_encoding.py ==="
cat scripts/check_encoding.py

echo ""
echo "=== tests/test_check_encoding.py ==="
cat tests/test_check_encoding.py

echo ""
echo "=== .pre-commit-config.yaml (relevant section) ==="
cat .pre-commit-config.yaml

echo ""
echo "=== Count test functions ==="
grep -c "^def test_" tests/test_check_encoding.py

Length of output: 25559


@TamerineSky Excellent work! 🎉 I've verified all the fixes in commit a148894, and everything checks out perfectly:

✅ Verified Fixes

1. Binary Mode Detection - Robust ✅

# Line 51: scripts/check_encoding.py
if re.search(r'["\'][rwax+]*b[rwax+]*["\']', call):

Confirmed: Now correctly handles rb, wb, ab, r+b, w+b, x+b, and all combinations.

2. Encoding Detection - Robust ✅

# Lines 56, 73, 84, 95, 106: Applied consistently across all 5 checks
if re.search(r'\bencoding\s*=', call):

Confirmed: Word boundary \b with \s* handles all spacing variations (encoding="...", encoding = "...", etc.)

3. Hook Scope - Properly Limited ✅

# Line 87: .pre-commit-config.yaml
files: ^apps/backend/

Confirmed: Hook correctly scoped to backend only.
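For intuition, pre-commit matches the `files:` pattern against each staged file path, so the scoping behaves like this sketch (example paths are illustrative, not taken from the repo):

```python
import re

# How the files: ^apps/backend/ scope filters staged paths.
SCOPE = re.compile(r"^apps/backend/")

assert SCOPE.search("apps/backend/core/progress.py")      # checked
assert not SCOPE.search("apps/frontend/src/main.ts")      # skipped
assert not SCOPE.search("scripts/check_encoding.py")      # skipped
```

The `^` anchor keeps the hook from firing on frontend or tooling files that happen to contain `apps/backend/` elsewhere in their path.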

4. Test Coverage - Comprehensive ✅

Confirmed: All 16 tests present and accounted for:

  • test_allows_json_dump_with_encoding()
  • test_detects_json_dump_without_encoding()
  • test_allows_write_binary_mode_without_encoding()
  • test_allows_append_binary_mode_without_encoding()
  • test_detects_text_write_mode_without_encoding()
  • test_detects_encoding_with_spaces()

💯 Assessment

Your implementation is production-ready:

  • ✅ All regex patterns are robust and handle edge cases
  • ✅ Test coverage is comprehensive (16 tests)
  • ✅ Real-world validation (caught 6 issues in progress.py)
  • ✅ Clear documentation and reasoning

Your decision to defer the test helper extraction is pragmatic and appropriate for this PR's scope. Self-contained tests improve readability for this focused feature addition.

Outstanding work on preventing the regression of 251 encoding fixes from PR #782! 🚀



TamerineSky added a commit to TamerineSky/Auto-Claude that referenced this pull request Jan 7, 2026
1. CONTRIBUTING.md:
   - Added concise file encoding section after Code Style
   - DO/DON'T examples for common file operations
   - Covers open(), Path methods, json operations
   - References PR AndyMik90#782 and windows-development.md

2. guides/windows-development.md (NEW):
   - Comprehensive Windows development guide
   - File encoding (cp1252 vs UTF-8 issue)
   - Line endings, path separators, shell commands
   - Development environment recommendations
   - Common pitfalls and solutions
   - Testing guidelines

3. .github/PULL_REQUEST_TEMPLATE.md:
   - Added encoding checklist item for Python PRs
   - Helps catch missing encoding during review

4. guides/README.md:
   - Added windows-development.md to guide index
   - Organized with CLI-USAGE and linux guides

Purpose: Educate developers about UTF-8 encoding requirements to prevent
regressions of the 251 encoding issues fixed in PR AndyMik90#782. Automated checking
via pre-commit hooks (PR AndyMik90#795) + developer education ensures long-term
Windows compatibility.

Related:
- PR AndyMik90#782: Fix Windows UTF-8 encoding errors (251 instances)
- PR AndyMik90#795: Add pre-commit hooks for encoding enforcement
TamerineSky added a commit to TamerineSky/Auto-Claude that referenced this pull request Jan 7, 2026
Found by pre-commit hook testing in PR AndyMik90#795:
- Line 96: Path.read_text() without encoding
- Line 122: Path.write_text() without encoding

These handle .gitignore file operations and could fail on Windows
with special characters in gitignore comments or entries.

Total fixes in PR AndyMik90#782: 253 instances (was 251, +2 from init.py)