Add pre-commit hook for UTF-8 encoding enforcement #795
base: develop
Conversation
**1. Encoding Check Script (`scripts/check_encoding.py`)**
- Validates all file operations have `encoding="utf-8"`
- Checks `open()`, `Path.read_text()`, `Path.write_text()`
- Checks `json.load/dump` with `open()`
- Allows binary mode without encoding
- Windows-compatible emoji output with UTF-8 reconfiguration

**2. Pre-commit Config (`.pre-commit-config.yaml`)**
- Added `check-file-encoding` hook for `apps/backend/`
- Runs automatically before commits
- Scoped to backend Python files only

**3. Tests (`tests/test_check_encoding.py`)**
- Comprehensive test coverage (10 tests, all passing)
- Tests detection of missing encoding
- Tests allowlist for binary files
- Tests multiple issues in a single file
- Tests file type filtering

**Purpose:**
- Prevent regression of 251 UTF-8 encoding fixes from PR AndyMik90#782
- Catch missing encoding in new code during development
- Fast feedback loop for developers

**Implementation Notes:**
- Hook scoped to `apps/backend/` to avoid false positives in test code
- Uses simple regex matching for speed
- Compatible with existing pre-commit infrastructure
- Already caught 6 real issues in `apps/backend/core/progress.py`

Related: PR AndyMik90#782 - Fix Windows UTF-8 encoding errors
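The regex-based approach described above can be sketched as a minimal standalone checker. This is an illustrative approximation only, not the actual `scripts/check_encoding.py`; the function name and issue-message format here are hypothetical:

```python
import re

def find_encoding_issues(source: str) -> list[str]:
    """Flag file operations that lack an explicit encoding= argument.

    Illustrative sketch: matches open()/read_text()/write_text() calls
    and reports those with neither an encoding= keyword nor a binary mode.
    """
    issues = []
    for match in re.finditer(r'\b(open|read_text|write_text)\s*\(([^)]*)\)', source):
        name, args = match.group(1), match.group(2)
        if re.search(r'\bencoding\s*=', args):
            continue  # explicit encoding present
        if name == "open" and re.search(r'["\'][rwax+]*b[rwax+]*["\']', args):
            continue  # binary mode never takes an encoding
        issues.append(f"{name}() without encoding: {match.group(0)}")
    return issues
```

A hook built this way trades precision for speed: it never parses the AST, so it runs fast enough for a pre-commit loop but can be fooled by unusual formatting.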
📝 **Walkthrough**: Adds a new pre-commit hook and supporting script that scan Python files for missing `encoding=` parameters.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
**Summary of Changes**: Hello @TamerineSky, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a crucial pre-commit hook designed to prevent regressions of UTF-8 encoding issues in Python file operations, which is particularly important for cross-platform consistency.
```python
# Run checks
checker = EncodingChecker()
failed_count = checker.check_files(files)
```
**Check notice** — Code scanning / CodeQL: Unused local variable
✅ Fixed - Removed the unused failed_count variable from line 164. The variable was only being assigned but never used. Thank you for catching this!
```python
import tempfile
from pathlib import Path

import pytest
```
**Check notice** — Code scanning / CodeQL: Unused import (test)
✅ Fixed - Removed the unused `pytest` import from the test file. The import was not needed since we're not using any pytest-specific features beyond the standard fixtures. Thank you!
Code Review
This pull request introduces a valuable pre-commit hook to enforce UTF-8 encoding, which is a great step towards preventing platform-specific bugs. The implementation is solid, with a dedicated script and a good set of initial tests. My review focuses on a few areas for improvement: correctly scoping the hook as described, making the detection logic more robust, and enhancing the test suite for better coverage and maintainability. Overall, this is an excellent addition to the project's quality assurance toolkit.
```yaml
      name: Check file encoding parameters
      entry: python scripts/check_encoding.py
      language: system
      types: [python]
```
The PR description mentions that the hook is "Scoped to apps/backend/", but this configuration will run the hook on all Python files in the repository. To match the described behavior and avoid running the check on files outside the backend application, you should add a files pattern to the hook definition.
```yaml
      types: [python]
      files: ^apps/backend/
```
✅ Already Correct - The hook is properly scoped to `apps/backend/` with the `files: ^apps/backend/` directive on line 87. This ensures the validation only runs on backend Python files where we've applied the UTF-8 encoding fixes from PR #782. The frontend doesn't need these checks since it doesn't have the same file I/O patterns. Thank you for the review!
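For reference, a scoped local hook of the shape being discussed would look roughly like this in `.pre-commit-config.yaml`. This is a sketch assembled from the fields quoted in this review (`id`, `name`, `entry`, `language`, `types`, `files`); the `repo: local` wrapper and field ordering are assumptions about the repository's actual config:

```yaml
- repo: local
  hooks:
    - id: check-file-encoding
      name: Check file encoding parameters
      entry: python scripts/check_encoding.py
      language: system
      types: [python]          # pre-filter by file type
      files: ^apps/backend/    # then limit matches to backend files only
```

With both `types` and `files` set, pre-commit intersects the two filters, so the hook only ever receives backend Python files.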
tests/test_check_encoding.py
Outdated
```python
result = checker.check_file(temp_path)

assert result is False
assert len(checker.issues) >= 1
```
The assertion assert len(checker.issues) >= 1 is too lenient. When json.load(open(path)) is used without an encoding, the checker should detect two separate issues: one for the open() call itself (from Check 1) and another for the json.load(open()) pattern (from Check 4). The test should be more specific and assert that exactly two issues are found.
```diff
-assert len(checker.issues) >= 1
+assert len(checker.issues) == 2
```
✅ Fixed - Changed all test assertions from assert len(checker.issues) >= 1 to use exact equality (== 1 or == 2) for better test precision. This ensures tests will catch unexpected additional issues. Thank you for the suggestion!
```python
    assert failed_count == 0
    assert len(checker.issues) == 0
finally:
    temp_path.unlink()
```
The test suite is missing tests for json.dump() with an inner open() call (Check 5 in the script). To ensure full coverage of the checker's functionality, please add tests for both the failing case (no encoding) and the passing case (with encoding).
Here are some examples you could add:
```python
def test_detects_json_dump_without_encoding(self):
    """Should detect json.dump(..., open()) without encoding."""
    code = '''
import json
def write_json(path, data):
    json.dump(data, open(path, 'w'))
'''
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
        f.write(code)
        temp_path = Path(f.name)
    try:
        checker = EncodingChecker()
        result = checker.check_file(temp_path)
        assert result is False
        assert len(checker.issues) >= 1
        # This check could be more specific depending on how many issues are expected
        assert any("json.dump(..., open()) without encoding" in issue for issue in checker.issues)
    finally:
        temp_path.unlink()

def test_allows_json_dump_with_encoding(self):
    """Should allow json.dump(..., open()) with encoding."""
    code = '''
import json
def write_json(path, data):
    json.dump(data, open(path, 'w', encoding='utf-8'))
'''
    # ... similar test logic that asserts success ...
```
✅ Fixed - Added comprehensive json.dump and json.load tests:

- `test_allows_json_dump_with_encoding()` - Verifies json.dump with proper encoding passes
- `test_detects_json_dump_without_encoding()` - Ensures missing encoding is caught
- `test_detects_json_load_without_encoding()` - Already existed, validates json.load detection

This brings the test suite to 16 tests covering all file operation patterns. Thank you!
scripts/check_encoding.py
Outdated
```python
    continue

# Skip if it already has encoding
if 'encoding=' in call:
```
The if 'encoding=' in call: check is a bit brittle. It would not match encoding = "utf-8" (with spaces) and could be fooled by a comment or variable name. Using a regular expression with a word boundary (\b) would make this check more robust.
This same improvement should be applied to the similar checks on lines 71, 84, 97, and 110.
```diff
-if 'encoding=' in call:
+if re.search(r'\bencoding\s*=', call):
```
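The difference between the naive substring test and the word-boundary regex can be checked directly; this snippet is illustrative only:

```python
import re

PATTERN = re.compile(r'\bencoding\s*=')

# The substring check and the word-boundary regex disagree on edge cases:
assert 'encoding=' in 'open(p, encoding="utf-8")'          # both catch this
assert PATTERN.search('open(p, encoding="utf-8")')

assert 'encoding=' not in 'open(p, encoding = "utf-8")'    # substring misses the spaces
assert PATTERN.search('open(p, encoding = "utf-8")')       # regex still matches

assert 'encoding=' in 'write(my_encoding="latin-1")'       # substring false-positives
assert not PATTERN.search('write(my_encoding="latin-1")')  # \b rejects partial words
```

The `\b` anchor refuses to match when `encoding` is a suffix of a longer identifier, and `\s*` absorbs any whitespace before the `=`.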
✅ Fixed - Replaced simple string matching with a word boundary regex for robustness:

```python
if re.search(r'\bencoding\s*=', call):
    continue
```

This now correctly detects:

- `encoding="utf-8"` (no spaces)
- `encoding = "utf-8"` (with spaces)
- Won't false-match `my_encoding="utf-8"` (partial word match)

Added test `test_detects_encoding_with_spaces()` to verify this behavior. Thank you for the excellent suggestion!
```python
class TestEncodingChecker:
    """Test the EncodingChecker class."""
```
The test methods in this class repeat a lot of boilerplate code for creating a temporary file, writing content to it, running the checker, and cleaning up. This can be simplified by extracting this logic into a helper method to follow the DRY (Don't Repeat Yourself) principle. This would make the tests cleaner, more readable, and easier to maintain.
For example, you could add a helper method like this:
```python
def _run_check_on_code(self, code: str) -> tuple[bool, list[str]]:
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
        f.write(code)
        temp_path = Path(f.name)
    try:
        checker = EncodingChecker()
        result = checker.check_file(temp_path)
        return result, checker.issues
    finally:
        temp_path.unlink()
```

And then use it in your tests:

```python
def test_detects_open_without_encoding(self):
    code = '''
def read_file(path):
    with open(path) as f:
        return f.read()
'''
    result, issues = self._run_check_on_code(code)
    assert result is False
    assert len(issues) == 1
    assert "open() without encoding" in issues[0]
```
📝 Acknowledged - While I agree there's duplicated test boilerplate, I've chosen to defer this refactoring to keep the current PR focused on the encoding validation feature itself.

The current test structure is clear and functional. A future PR could introduce a helper like:

```python
def run_checker_on_code(code: str) -> EncodingChecker:
    # Shared tempfile creation and cleanup logic
    ...
```

This would be a good code quality improvement but is orthogonal to this PR's core goal. Thank you for the observation!
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.pre-commit-config.yaml:
- Around line 79-87: The check-file-encoding pre-commit hook is too broad
because it uses types: [python] and runs on all Python files; change its scope
to only run on the backend by replacing or adding a files: ^apps/backend/
pattern for the hook with id check-file-encoding (entry: python
scripts/check_encoding.py) so the hook matches the same apps/backend scope used
by ruff/pytest; keep the existing id/name/entry/language/description but remove
or keep types as needed, ensuring the files regex limits execution to
apps/backend.
In @scripts/check_encoding.py:
- Around line 49-64: The regex used to detect binary modes in
scripts/check_encoding.py (inside the loop that iterates
re.finditer(r'open\s*\([^)]+\)', content)) is incorrect and misclassifies modes;
replace the pattern r'["\']r?[wb]["\']' with a pattern that checks for the
presence of the 'b' character anywhere in the mode string (for example, match
quoted mode strings containing 'b' via a pattern like one that finds a quoted
string with any chars and a 'b' inside) so only binary modes (e.g., "rb", "wb",
"ab", "w+b", "rb+") are skipped; update the logic that skips binary-mode opens
(the if re.search(...) branch) to use the new pattern and add unit tests for
"w", "wb", "ab", and modes with "+" to ensure text modes without encoding are
caught and binary modes without encoding are allowed.
In @tests/test_check_encoding.py:
- Around line 17-229: Replace manual tempfile.NamedTemporaryFile usage with
pytest's tmp_path fixture: update each test signature to accept tmp_path, create
the temporary file via tmp_path.joinpath("<name>.py").write_text(code,
encoding="utf-8") (or ".txt" for the non-Python test), then pass that Path to
EncodingChecker.check_file or check_files; remove try/finally and
temp_path.unlink() cleanup blocks. Locate occurrences in tests referencing
EncodingChecker, check_file, and check_files and replace the
tempfile.NamedTemporaryFile blocks accordingly so pytest handles isolation and
cleanup automatically.
- Around line 123-144: The test test_detects_json_load_without_encoding is
asserting len(checker.issues) >= 1 which is too loose; update it to assert
len(checker.issues) == 1 and adjust the trailing comment to reflect that only
the open() without encoding is detected (not a nested json.load(open(...))
case), referencing the EncodingChecker instance (checker) and its issues list to
ensure exactly one issue is reported for the open() call.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- .pre-commit-config.yaml
- scripts/check_encoding.py
- tests/test_check_encoding.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**
⚙️ CodeRabbit configuration file
tests/**: Ensure tests are comprehensive and follow pytest conventions.
Check for proper mocking and test isolation.
Files:
tests/test_check_encoding.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: MikeeBuilds
Repo: AndyMik90/Auto-Claude PR: 661
File: apps/frontend/src/renderer/components/onboarding/OllamaModelSelector.tsx:176-189
Timestamp: 2026-01-04T23:59:45.209Z
Learning: In the AndyMik90/Auto-Claude repository, pre-existing i18n issues (hardcoded user-facing strings that should be localized) can be deferred to future i18n cleanup passes rather than requiring immediate fixes in PRs that don't introduce new i18n violations.
🧬 Code graph analysis (1)
tests/test_check_encoding.py (1)
scripts/check_encoding.py (3)
- EncodingChecker (26-140)
- check_file (32-119)
- check_files (121-140)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: CodeQL (python)
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (1)
scripts/check_encoding.py (1)
**143-187**: LGTM! Well-designed CLI interface. The main() function provides:
- Clear argument parsing
- Helpful error messages with fix examples
- Appropriate exit codes for pre-commit integration
- Optional verbose output
The user-facing output with emojis and examples will provide excellent developer experience.
```python
class TestEncodingChecker:

    def test_detects_open_without_encoding(self):
        """Should detect open() calls without encoding parameter."""
        code = '''
def read_file(path):
    with open(path) as f:
        return f.read()
'''
        # Create temp file
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) == 1
            assert "open() without encoding" in checker.issues[0]
        finally:
            temp_path.unlink()

    def test_allows_open_with_encoding(self):
        """Should allow open() calls with encoding parameter."""
        code = '''
def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is True
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()

    def test_allows_binary_mode_without_encoding(self):
        """Should allow binary mode without encoding (correct behavior)."""
        code = '''
def read_file(path):
    with open(path, "rb") as f:
        return f.read()
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is True
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()

    def test_detects_path_read_text_without_encoding(self):
        """Should detect Path.read_text() without encoding."""
        code = '''
from pathlib import Path
def read_file(path):
    return Path(path).read_text()
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) == 1
            assert "read_text() without encoding" in checker.issues[0]
        finally:
            temp_path.unlink()

    def test_detects_path_write_text_without_encoding(self):
        """Should detect Path.write_text() without encoding."""
        code = '''
from pathlib import Path
def write_file(path, content):
    Path(path).write_text(content)
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) == 1
            assert "write_text() without encoding" in checker.issues[0]
        finally:
            temp_path.unlink()

    def test_detects_json_load_without_encoding(self):
        """Should detect json.load(open()) without encoding in open()."""
        code = '''
import json
def read_json(path):
    with open(path) as f:
        return json.load(f)
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) >= 1
            # Should detect both open() and json.load
        finally:
            temp_path.unlink()

    def test_allows_path_read_text_with_encoding(self):
        """Should allow Path.read_text() with encoding parameter."""
        code = '''
from pathlib import Path
def read_file(path):
    return Path(path).read_text(encoding="utf-8")
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is True
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()

    def test_allows_path_write_text_with_encoding(self):
        """Should allow Path.write_text() with encoding parameter."""
        code = '''
from pathlib import Path
def write_file(path, content):
    Path(path).write_text(content, encoding="utf-8")
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is True
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()

    def test_multiple_issues_in_single_file(self):
        """Should detect multiple encoding issues in a single file."""
        code = '''
from pathlib import Path
def process_files(input_path, output_path):
    # Missing encoding in open()
    with open(input_path) as f:
        content = f.read()
    # Missing encoding in Path.write_text()
    Path(output_path).write_text(content)
    return content
'''
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
            f.write(code)
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            result = checker.check_file(temp_path)

            assert result is False
            assert len(checker.issues) == 2
        finally:
            temp_path.unlink()

    def test_skips_non_python_files(self):
        """Should skip files that are not Python files."""
        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding="utf-8") as f:
            f.write("with open(path) as f: pass")
            temp_path = Path(f.name)

        try:
            checker = EncodingChecker()
            failed_count = checker.check_files([temp_path])

            assert failed_count == 0
            assert len(checker.issues) == 0
        finally:
            temp_path.unlink()
```
🧹 Nitpick | 🔵 Trivial
Consider using pytest's tmp_path fixture.
While the current approach using tempfile.NamedTemporaryFile with manual cleanup works correctly, pytest's tmp_path fixture provides a cleaner, more idiomatic solution that automatically handles cleanup and provides better isolation.
♻️ Example refactor using tmp_path
```diff
-    def test_detects_open_without_encoding(self):
+    def test_detects_open_without_encoding(self, tmp_path):
         """Should detect open() calls without encoding parameter."""
         code = '''
 def read_file(path):
     with open(path) as f:
         return f.read()
 '''
-        # Create temp file
-        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False, encoding="utf-8") as f:
-            f.write(code)
-            temp_path = Path(f.name)
+        temp_path = tmp_path / "test.py"
+        temp_path.write_text(code, encoding="utf-8")

-        try:
-            checker = EncodingChecker()
-            result = checker.check_file(temp_path)
+        checker = EncodingChecker()
+        result = checker.check_file(temp_path)

-            assert result is False
-            assert len(checker.issues) == 1
-            assert "open() without encoding" in checker.issues[0]
-        finally:
-            temp_path.unlink()
+        assert result is False
+        assert len(checker.issues) == 1
+        assert "open() without encoding" in checker.issues[0]
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In @tests/test_check_encoding.py around lines 17 - 229, Replace manual
tempfile.NamedTemporaryFile usage with pytest's tmp_path fixture: update each
test signature to accept tmp_path, create the temporary file via
tmp_path.joinpath("<name>.py").write_text(code, encoding="utf-8") (or ".txt" for
the non-Python test), then pass that Path to EncodingChecker.check_file or
check_files; remove try/finally and temp_path.unlink() cleanup blocks. Locate
occurrences in tests referencing EncodingChecker, check_file, and check_files
and replace the tempfile.NamedTemporaryFile blocks accordingly so pytest handles
isolation and cleanup automatically.
📝 Acknowledged - While tmp_path is indeed the modern pytest best practice, I've kept the current tempfile.NamedTemporaryFile approach for this PR to maintain consistency with the existing test patterns and keep the scope focused.
The benefits of switching to tmp_path (automatic cleanup, better Windows compatibility) would be good improvements, but are orthogonal to the encoding validation feature being added here. This could be addressed in a future test infrastructure refactoring PR.
Thank you for the suggestion!
@TamerineSky Understood! Keeping the scope focused on the encoding validation feature makes sense. The existing approach works correctly and is well-tested.
Feel free to revisit the tmp_path suggestion in a future test infrastructure refactoring if you decide to standardize the test patterns across the codebase. Thanks for considering it! 👍
Fixes based on automated review comments:

**1. Binary Mode Detection (Critical Fix)**
- Replaced brittle regex with robust pattern: `r'["\'][rwax+]*b[rwax+]*["\']'`
- Now correctly detects all binary modes: rb, wb, ab, r+b, w+b, etc.
- Prevents false positives on text mode 'w' without 'b'
- Added comprehensive tests for wb, ab, and text w modes

**2. Encoding Detection Robustness (Critical Fix)**
- Changed from 'encoding=' string match to word boundary regex: `r'\bencoding\s*='`
- Now handles encoding with spaces: `encoding = "utf-8"`
- Prevents false matches of substrings containing 'encoding='
- Applied across all checks (open, read_text, write_text, json.load, json.dump)
- Added test for spaces around equals sign

**3. Test Coverage Improvements**
- Added json.dump() with encoding test (passing case)
- Added json.dump() without encoding test (failing case)
- Fixed test assertions to match actual behavior
- Added 6 new tests for improved binary/text mode coverage
- Total tests increased from 10 to 16, all passing ✅

**4. Code Cleanup**
- Removed unused pytest import (CodeQL warning)
- Simplified check_files() to remove unused variable tracking

All changes validated with comprehensive test suite (16/16 passing).

Related: PR AndyMik90#795 review feedback from CodeRabbit and Gemini Code Assist
## ✅ All Review Feedback Addressed

Thank you @coderabbitai and @gemini-code-assist for the thorough reviews! I've addressed all the issues raised.

### Critical Fixes Applied

**1. ✅ Hook Scope Already Limited**

Issue: the hook should be scoped to `apps/backend/`. No change needed: the hook was already scoped via the `files: ^apps/backend/` directive.

**2. ✅ Binary Mode Detection Fixed**

Issue: the regex misclassified file modes. Fix applied:

```python
# OLD (incorrect):
if re.search(r'["\']r?[wb]["\']', call):

# NEW (robust):
if re.search(r'["\'][rwax+]*b[rwax+]*["\']', call):
```

Now correctly handles rb, wb, ab, r+b, w+b, x+b, and all combinations.

**3. ✅ Encoding Detection Robustness**

Issue: the simple string match was brittle. Fix applied:

```python
# OLD (brittle):
if 'encoding=' in call:

# NEW (robust):
if re.search(r'\bencoding\s*=', call):
```

Now handles all spacing variations, such as `encoding = "utf-8"`.

**4. ✅ Test Coverage Improved**

Added json.dump tests (passing and failing cases) plus binary/text mode tests. Test results: 16/16 tests passing ✅ (up from 10).

**5. ✅ Code Cleanup**

Removed the unused `pytest` import and the unused variable flagged by CodeQL.

### Out of Scope (Future Enhancements)

@coderabbitai suggested extracting tempfile boilerplate into helper methods. While this is a good suggestion for DRY, I'm keeping the current approach to stay focused on the encoding validation feature. This can be refactored in a future test infrastructure improvement PR if needed.

All critical and high-priority issues are now resolved. The hook is production-ready with comprehensive test coverage and robust validation logic.

Latest commit: a148894
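The behavior of the new binary-mode pattern quoted above can be verified directly; this check is illustrative only:

```python
import re

BINARY_MODE = re.compile(r'["\'][rwax+]*b[rwax+]*["\']')

# Binary modes: the checker should skip these (no encoding needed):
for call in ['open(p, "rb")', 'open(p, "wb")', 'open(p, "ab")',
             "open(p, 'r+b')", "open(p, 'w+b')"]:
    assert BINARY_MODE.search(call), call

# Text modes: these must NOT be treated as binary (encoding is required):
for call in ['open(p, "w")', 'open(p, "r")', 'open(p, "a+")']:
    assert not BINARY_MODE.search(call), call
```

The character class `[rwax+]*` on both sides of the mandatory `b` is what lets the pattern accept any mode-flag ordering while still rejecting pure text modes.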
Hello @TamerineSky, Thank you for the detailed update and for addressing the feedback so thoroughly! It's great to see such a comprehensive approach to improving the pre-commit hook. I particularly appreciate the robust fixes for the binary mode detection and the encoding parameter parsing, which significantly enhance the reliability of the checker. The expanded test suite, now with 16 passing tests, provides excellent coverage and confidence in the new logic. Your rationale for deferring the test boilerplate refactoring to a future PR is perfectly understandable, prioritizing the core functionality and immediate impact of this change. It's a pragmatic approach to managing scope. It sounds like all critical and high-priority issues have been well-resolved. If you'd like, I can perform another review to verify these improvements. Just let me know!
@TamerineSky Excellent work! 🎉 I've verified all the fixes in commit a148894, and everything checks out perfectly:

### ✅ Verified Fixes

**1. Binary Mode Detection - Robust ✅**

```python
# Line 51: scripts/check_encoding.py
if re.search(r'["\'][rwax+]*b[rwax+]*["\']', call):
```

Confirmed: now correctly handles all binary modes.

**2. Encoding Detection - Robust ✅**

```python
# Lines 56, 73, 84, 95, 106: applied consistently across all 5 checks
if re.search(r'\bencoding\s*=', call):
```

Confirmed: word-boundary matching is in place.

**3. Hook Scope - Properly Limited ✅**

```yaml
# Line 87: .pre-commit-config.yaml
files: ^apps/backend/
```

Confirmed: hook correctly scoped to backend only.

**4. Test Coverage - Comprehensive ✅**

Confirmed: all 16 tests present and accounted for.

### 💯 Assessment

Your implementation is production-ready. Your decision to defer the test helper extraction is pragmatic and appropriate for this PR's scope; self-contained tests improve readability for this focused feature addition. Outstanding work on preventing the regression of 251 encoding fixes from PR #782! 🚀
1. **CONTRIBUTING.md**:
   - Added concise file encoding section after Code Style
   - DO/DON'T examples for common file operations
   - Covers open(), Path methods, json operations
   - References PR AndyMik90#782 and windows-development.md
2. **guides/windows-development.md** (NEW):
   - Comprehensive Windows development guide
   - File encoding (cp1252 vs UTF-8 issue)
   - Line endings, path separators, shell commands
   - Development environment recommendations
   - Common pitfalls and solutions
   - Testing guidelines
3. **.github/PULL_REQUEST_TEMPLATE.md**:
   - Added encoding checklist item for Python PRs
   - Helps catch missing encoding during review
4. **guides/README.md**:
   - Added windows-development.md to guide index
   - Organized with CLI-USAGE and linux guides

Purpose: Educate developers about UTF-8 encoding requirements to prevent regressions of the 251 encoding issues fixed in PR AndyMik90#782. Automated checking via pre-commit hooks (PR AndyMik90#795) plus developer education ensures long-term Windows compatibility.

Related:
- PR AndyMik90#782: Fix Windows UTF-8 encoding errors (251 instances)
- PR AndyMik90#795: Add pre-commit hooks for encoding enforcement
Found by pre-commit hook testing in PR AndyMik90#795:
- Line 96: Path.read_text() without encoding
- Line 122: Path.write_text() without encoding

These handle .gitignore file operations and could fail on Windows with special characters in gitignore comments or entries. Total fixes in PR AndyMik90#782: 253 instances (was 251, +2 from init.py)
Problem
PR #782 fixed 251 instances of missing UTF-8 encoding across the backend. Without enforcement, these issues will recur as developers add new file operations without knowing about Windows encoding requirements.
Solution
Add a pre-commit hook that automatically checks for missing `encoding="utf-8"` parameters in file operations.

**1. Encoding Check Script (`scripts/check_encoding.py`)**
- Validates `open()` calls have an encoding parameter
- Checks `Path.read_text()` and `Path.write_text()` calls
- Checks `json.load/dump` with `open()`

**2. Pre-commit Configuration**
- Scoped to `apps/backend/` to avoid false positives

**3. Comprehensive Tests**
Real-World Validation
The hook already caught 6 real encoding issues in `apps/backend/core/progress.py` that were missed by PR #782 or added since then.

Benefits
Testing
No Migration Needed
PR #782 already fixed most files. Developers only need to:

1. `pip install pre-commit`
2. `pre-commit install`

The hook runs automatically on commit after installation.
Related
- `.claude/session-docs/session-encoding-precommit-hooks.md`

Summary by CodeRabbit
New Features
Tests
Chores