-
Notifications
You must be signed in to change notification settings - Fork 2
Feat/mscasso/sp 3820 add skip headers #16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Warning Rate limit exceeded@eeisegn has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 6 minutes and 8 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughAdds header-skipping to WFP generation via a new HeaderFilter and ScanossBase logging; integrates header stripping into Winnowing (C and Python paths), bumps package version to 0.7.0, adds tests, introduces lint tooling and CI workflow, and raises minimum Python requirement to 3.9. Changes
Sequence DiagramsequenceDiagram
actor User
participant Winnowing
participant ScanossBase
participant HeaderFilter
participant LanguagePatterns
User->>Winnowing: instantiate(..., skip_headers=True, skip_headers_limit=N)
Winnowing->>ScanossBase: __init__(debug, trace, quiet)
Winnowing->>HeaderFilter: HeaderFilter(debug, trace, quiet, max_skipped_lines)
User->>Winnowing: wfp_for_contents(file_path, contents)
alt skip_headers enabled
Winnowing->>HeaderFilter: filter(file_path, decoded_contents)
HeaderFilter->>LanguagePatterns: map extension → patterns
loop scan lines
HeaderFilter->>HeaderFilter: is_shebang / is_comment / is_import / is_license_header
end
HeaderFilter-->>Winnowing: return line_offset
Winnowing->>Winnowing: __strip_lines_until_offset(wfp, line_offset)
end
Winnowing-->>User: return WFP (may include start_line)
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~30 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (7)
src/scanoss_winnowing/scanossbase.py (2)
83-83: Use explicitOptionalor union type for nullable parameter.Per PEP 484, implicit
Optional(using= Nonewithout type annotation) is prohibited. As flagged by static analysis.+from typing import Optional + ... - def print_to_file_or_stdout(self, content: str, file: str = None): + def print_to_file_or_stdout(self, content: str, file: str | None = None):Or use
Optional[str]if supporting Python < 3.10.
96-96: Same implicit Optional issue here.- def print_to_file_or_stderr(self, msg: str, file: str = None): + def print_to_file_or_stderr(self, msg: str, file: str | None = None):src/scanoss_winnowing/winnowing.py (1)
120-122: Minor formatting note on parameter list.The trailing comma and extra newline before the closing parenthesis is unconventional. Consider cleaning up the formatting.
strip_snippet_ids=None, strip_hpsm_ids=None, skip_md5_ids=None, skip_headers: bool = False, - ): + ):src/scanoss_winnowing/header_filter.py (4)
42-196: Consider addingClassVarannotations for class-level mutable attributes.The
COMMENT_PATTERNS,IMPORT_PATTERNS, andLICENSE_KEYWORDSare mutable class attributes. AddingClassVarannotations clarifies they are shared across all instances and not meant to be overwritten per-instance.+from typing import ClassVar, Dict, List, Optional, Tuple + class LanguagePatterns: """Regex patterns for different programming languages""" + COMMENT_PATTERNS: ClassVar[Dict[str, Dict[str, str]]] = { - COMMENT_PATTERNS = { # ... } + IMPORT_PATTERNS: ClassVar[Dict[str, List[str]]] = { - IMPORT_PATTERNS = { # ... } + LICENSE_KEYWORDS: ClassVar[List[str]] = [ - LICENSE_KEYWORDS = [ # ... ]
374-374: Remove unusednoqadirective.Static analysis indicates
PLR0911is not enabled, so this directive has no effect.- def is_comment(self, line: str, language: str, in_multiline: bool) -> Tuple[bool, bool]: # noqa: PLR0911 + def is_comment(self, line: str, language: str, in_multiline: bool) -> Tuple[bool, bool]:
429-429: Remove unusednoqadirective.Static analysis indicates
PLR0912is not enabled, so this directive has no effect.- def find_first_implementation_line(self, lines: list[str], language: str) -> Optional[int]: # noqa: PLR0912 + def find_first_implementation_line(self, lines: list[str], language: str) -> Optional[int]:
441-441: Unused variable:consecutive_imports_countis set but never read.The variable is incremented on line 499 but its value is never used for any logic or output. Either remove it or document the intended future use.
- consecutive_imports_count = 0 # ... # Check if it's an import if self.is_import(line, language): - if consecutive_imports_count == 0: - self.print_debug(f'Line {line_number}: Detected import section') - consecutive_imports_count += 1 + self.print_debug(f'Line {line_number}: Detected import line') continueAlso applies to: 497-499
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
CHANGELOG.md(1 hunks)src/scanoss_winnowing/__init__.py(1 hunks)src/scanoss_winnowing/header_filter.py(1 hunks)src/scanoss_winnowing/scanossbase.py(1 hunks)src/scanoss_winnowing/winnowing.py(8 hunks)tests/winnowing-test.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/scanoss_winnowing/header_filter.py (1)
src/scanoss_winnowing/scanossbase.py (2)
ScanossBase(28-107)print_debug(58-63)
tests/winnowing-test.py (1)
src/scanoss_winnowing/winnowing.py (2)
Winnowing(74-600)wfp_for_contents(379-520)
src/scanoss_winnowing/winnowing.py (1)
src/scanoss_winnowing/scanossbase.py (2)
ScanossBase(28-107)print_debug(58-63)
🪛 Ruff (0.14.8)
src/scanoss_winnowing/scanossbase.py
83-83: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
96-96: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
src/scanoss_winnowing/header_filter.py
46-73: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
76-184: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
187-192: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
374-374: Unused noqa directive (non-enabled: PLR0911)
Remove unused noqa directive
(RUF100)
429-429: Unused noqa directive (non-enabled: PLR0912)
Remove unused noqa directive
(RUF100)
🔇 Additional comments (8)
src/scanoss_winnowing/__init__.py (1)
25-25: LGTM!Version bump to 0.7.0 aligns with the new skip-headers feature addition documented in the changelog.
tests/winnowing-test.py (1)
154-250: Good test coverage for the skip_headers feature.The test comprehensively validates:
- Both modes produce valid WFP output with
file=lines- Snippet line extraction works correctly
- The skip_headers mode results in higher starting line numbers
- The
start_line=tag is present only when skip_headers is enabled- The start_line value is correctly parsed and validated
src/scanoss_winnowing/winnowing.py (6)
40-41: LGTM on new imports.The imports for
HeaderFilterandScanossBaseare correctly added for the new functionality.
74-74: Good inheritance structure.Winnowing now correctly inherits from ScanossBase, consolidating the logging utilities.
135-156: LGTM on initialization changes.The super().init call correctly propagates the logging flags to ScanossBase, and the new skip_headers and header_filter attributes are properly initialized.
336-377: Well-structured line stripping logic.The
__strip_lines_until_offsetmethod correctly:
- Returns early for non-positive offsets
- Preserves non-snippet lines (file=, hpsm=, etc.)
- Injects start_line tag before the first retained snippet
- Uses exception handling for parsing edge cases
447-451: Header filtering correctly integrated in C-accelerated path.The skip_headers logic is applied after strip_snippet_ids processing, ensuring consistent behavior.
515-519: Header filtering correctly integrated in Python path.Good consistency - the Python path applies the same header filtering logic as the C-accelerated path, ensuring feature parity.
| '.pm': 'perl', | ||
| '.r': 'r', | ||
| '.R': 'r', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dead code: .R extension mapping is unreachable.
Since path.suffix.lower() is called on line 293, the extension is always lowercase. The .R entry on line 324 will never match; only .r on line 323 is used.
'.r': 'r',
- '.R': 'r',
'.lua': 'lua',📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| '.pm': 'perl', | |
| '.r': 'r', | |
| '.R': 'r', | |
| '.pm': 'perl', | |
| '.r': 'r', |
🤖 Prompt for AI Agents
In src/scanoss_winnowing/header_filter.py around lines 322 to 324, the mapping
includes both '.r' and '.R' but path.suffix.lower() is used earlier, making '.R'
unreachable; remove the redundant '.R' entry (or alternatively ensure keys are
normalized to lowercase and deduplicated) so the mapping only contains lowercase
extensions to avoid dead code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
♻️ Duplicate comments (1)
src/scanoss_winnowing/header_filter.py (1)
237-238: Dead code:.Rextension mapping is unreachable.This issue was already identified in a previous review. Since
path.suffix.lower()is called on line 376, the extension is always lowercase. The.Rentry on line 238 will never match; only.ron line 237 is used.Apply this diff to remove the unreachable entry:
'.r': 'r', - '.R': 'r', '.lua': 'lua',
🧹 Nitpick comments (3)
.github/workflows/lint.yml (1)
40-42: Minor: Consider grouped redirects for cleaner output syntax.The shellcheck tool suggests using grouped redirects for better readability and fewer operations.
Based on static analysis hints, apply this diff:
- echo "files<<EOF" >> "$GITHUB_OUTPUT" - echo "${filtered_files}" >> "$GITHUB_OUTPUT" - echo "EOF" >> "$GITHUB_OUTPUT" + { + echo "files<<EOF" + echo "${filtered_files}" + echo "EOF" + } >> "$GITHUB_OUTPUT"src/scanoss_winnowing/winnowing.py (1)
446-450: Consider extracting the header filtering logic into a helper method.The header filtering logic is duplicated between the C-accelerated path (lines 446-450) and the Python path (lines 514-518). While the duplication is minimal, extracting it to a helper method like
__apply_header_filter(file, wfp, decoded_contents)could improve maintainability if this logic needs to evolve.Example refactoring:
def __apply_header_filter(self, file: str, wfp: str, decoded_contents: str) -> str: """Apply header filtering to WFP if enabled.""" if self.skip_headers: line_offset = self.header_filter.filter(file, decoded_contents) if line_offset > 0: wfp = self.__strip_lines_until_offset(file, wfp, line_offset) return wfpThen use it in both places:
wfp = self.__apply_header_filter(file, wfp, decoded_contents)Also applies to: 514-518
src/scanoss_winnowing/header_filter.py (1)
490-560: Remove or clarify the confusing comment.The comment
# End for loop?on line 559 appears after the for loop has already ended naturally. This comment is confusing and doesn't add value. Consider removing it or replacing it with something clearer if there's a specific reason to mark the loop end.Apply this diff to remove the confusing comment:
return line_number - # End for loop? return None
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
.github/workflows/lint.yml(1 hunks)CHANGELOG.md(2 hunks)Makefile(1 hunks)src/scanoss_winnowing/header_filter.py(1 hunks)src/scanoss_winnowing/winnowing.py(9 hunks)tools/linter.sh(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
🧰 Additional context used
🧬 Code graph analysis (2)
src/scanoss_winnowing/winnowing.py (2)
src/scanoss_winnowing/header_filter.py (2)
HeaderFilter(284-560)filter(313-360)src/scanoss_winnowing/scanossbase.py (3)
ScanossBase(28-107)print_stderr(45-49)print_debug(58-63)
src/scanoss_winnowing/header_filter.py (1)
src/scanoss_winnowing/scanossbase.py (4)
ScanossBase(28-107)print_msg(51-56)print_debug(58-63)print_trace(65-70)
🪛 actionlint (1.7.9)
.github/workflows/lint.yml
13-13: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
18-18: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
29-29: shellcheck reported issue in this script: SC2129:style:11:1: Consider using { cmd1; cmd2; } >> file instead of individual redirects
(shellcheck)
🔇 Additional comments (13)
Makefile (1)
32-36: LGTM!The lint targets are well-documented and correctly invoke the linter script with appropriate flags.
src/scanoss_winnowing/winnowing.py (6)
39-40: LGTM!The imports are correctly structured and the new dependencies (HeaderFilter and ScanossBase) are properly integrated into the class hierarchy and functionality.
116-121: LGTM!The new parameters
skip_headersandskip_headers_max_linesare well-named and have sensible defaults. The parameter list formatting is consistent with Python conventions.
135-135: LGTM!The super().init call correctly passes debug, trace, and quiet parameters to the ScanossBase parent class.
156-158: LGTM!The HeaderFilter initialization correctly uses the
or Noneidiom to convert a zero value to None for the max_skipped_lines parameter, which enables unlimited filtering when the max is not set.
338-378: LGTM!The
__strip_lines_until_offsetmethod correctly:
- Filters WFP snippet lines based on line numbers
- Preserves non-snippet metadata lines (file=, hpsm=)
- Adds a
start_linemarker to indicate where filtering occurred- Handles parsing errors gracefully with try-except
The logic
line_num > line_offsetis correct since line_offset represents the last line to skip (the line before implementation starts).
413-415: LGTM!Creating
decoded_contentsonce and reusing it is efficient. The'ignore'error handler for UTF-8 decoding is appropriate here since the code already handles binary files separately and needs to be robust against encoding issues.src/scanoss_winnowing/header_filter.py (6)
42-204: LGTM!The LanguagePatterns class provides comprehensive regex patterns for identifying comments, imports, and license headers across a wide range of programming languages. The patterns are well-organized by language family (c_style, python_style, lua_style, html_style) and cover major languages effectively.
257-281: LGTM!The utility functions
is_blank_lineandis_shebangare simple, correct, and clearly documented. The explicit length check inis_blank_lineis more readable than alternatives.
293-311: LGTM!The constructor properly initializes the HeaderFilter with sensible defaults and correctly calls the parent class constructor. The
max_skipped_linesparameter allows flexible control over filtering behavior.
313-360: LGTM!The
filtermethod correctly:
- Validates inputs and handles edge cases
- Returns 0 (no filtering) for unsupported languages or empty files
- Calculates the line offset as
implementation_start - 1(correctly converting from 1-indexed line numbers to 0-indexed offset)- Applies the optional
max_linescap- Provides appropriate debug/trace logging
362-428: LGTM!The helper methods are well-implemented:
detect_languagecorrectly uses lowercase extension matchingis_license_headerperforms case-insensitive keyword detectionget_comment_styleprovides sensible defaults for language families
430-488: LGTM!The
is_commentandis_importmethods correctly handle:
- Single-line and multi-line comment detection
- State tracking for multi-line comments
- Python docstrings with proper quote counting
- Import/include pattern matching across languages
The complexity suppressions (noqa: PLR0911, PLR0912) are justified given the inherent complexity of parsing multiple language syntaxes.
.github/workflows/lint.yml
Outdated
| echo "Linting the following files:" | ||
| echo "${{ steps.changed_files.outputs.files }}" | ||
| # Pass the list of changed files to Ruff. | ||
| echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential issue with xargs handling many files.
Using xargs without -r or proper handling could fail if there are many files or special characters in filenames. Additionally, if the file list is empty, xargs might read from stdin.
Apply this diff to make it more robust:
echo "Linting the following files:"
echo "${{ steps.changed_files.outputs.files }}"
# Pass the list of changed files to Ruff.
- echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check
+ echo "${{ steps.changed_files.outputs.files }}" | xargs -r ruff check
fiThe -r flag prevents xargs from running if the input is empty, providing an extra safety layer alongside the conditional check.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check | |
| echo "${{ steps.changed_files.outputs.files }}" | xargs -r ruff check |
🤖 Prompt for AI Agents
.github/workflows/lint.yml around line 56: the current pipeline uses echo ... |
xargs ruff check which can break on empty input or filenames with spaces/special
chars; change it to produce null-delimited file names and invoke xargs with the
-r (do not run if input is empty) and --null flags so ruff is not executed on
empty input and filenames with special characters are handled safely.
tools/linter.sh
Outdated
| done | ||
|
|
||
| # Find merge base with origin/main | ||
| merge_base=$(git merge-base origin/main HEAD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle missing origin/main gracefully.
If origin/main doesn't exist (e.g., in a fresh clone or different remote setup), the script will fail. Consider adding error handling or documentation about the requirement.
Apply this diff to add error handling:
# Find merge base with origin/main
+if ! git rev-parse --verify origin/main >/dev/null 2>&1; then
+ echo "Error: origin/main branch not found. Ensure you have fetched from origin."
+ exit 1
+fi
merge_base=$(git merge-base origin/main HEAD)Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tools/linter.sh around line 32, the git merge-base call assumes origin/main
exists and will fail if it doesn't; add a pre-check for origin/main (e.g., git
rev-parse --verify --quiet refs/remotes/origin/main or git show-ref) and if that
check fails, fall back to a safe default (for example set merge_base to HEAD or
to HEAD~1) and print a clear warning so the script continues instead of
erroring; ensure subsequent logic uses the fallback value and that the script
exits non-fatally or instructs the user how to set up the remote if required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/scanoss_winnowing/header_filter.py (1)
237-238: Dead code:.Rextension mapping is unreachable.Since
path.suffix.lower()is called on line 376, the extension is always lowercase. The.Rentry on line 238 will never match; only.ron line 237 is used.Apply this diff to remove the redundant entry:
'.r': 'r', - '.R': 'r', '.lua': 'lua',src/scanoss_winnowing/scanossbase.py (1)
51-56: Typo in docstring: "quite" should be "quiet".Apply this diff:
def print_msg(self, *args, **kwargs): """ - Print message if quite mode is not enabled + Print message if quiet mode is not enabled """ if not self.quiet: self.print_stderr(*args, **kwargs)
🧹 Nitpick comments (4)
requirements-dev.txt (1)
5-5: Consider pinning the ruff version for reproducibility.While the dependency is correctly added, consider specifying a version constraint (e.g.,
ruff>=0.8.0) to ensure reproducible builds and consistent linting behavior across environments and CI runs.Apply this diff to pin the ruff version:
-ruff +ruff>=0.8.0src/scanoss_winnowing/header_filter.py (2)
293-311: Consider renaming for clarity.The parameter
skip_limitis stored asself.max_lines, which can be confusing. Consider using a consistent name likemax_skip_linesfor both the parameter and the attribute to improve readability.
60-204: Add unit tests validating language patterns with real-world code samples.The comment, import, and license keyword patterns are production-ready and actively used in filtering logic, but test coverage is limited to Python files. Consider adding integration tests for JavaScript, Java, Go, Rust, C++, and other supported languages using realistic code samples to validate that the regex patterns correctly identify headers, comments, and imports across edge cases.
src/scanoss_winnowing/scanossbase.py (1)
83-94: Consider specifying encoding explicitly.Lines 91 and 104 open files without specifying encoding. For consistency with the rest of the codebase (which uses UTF-8, e.g., line 471 in winnowing.py), consider specifying
encoding='utf-8'to avoid platform-dependent behavior.Apply this diff:
if file: - with open(file, 'w') as f: + with open(file, 'w', encoding='utf-8') as f: f.write(content) else: self.print_stdout(content)And similarly for
print_to_file_or_stderr:if file: - with open(file, 'w') as f: + with open(file, 'w', encoding='utf-8') as f: f.write(msg) else: self.print_stderr(msg)
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
.github/workflows/lint.yml(1 hunks)Makefile(1 hunks)requirements-dev.txt(1 hunks)setup.cfg(1 hunks)src/scanoss_winnowing/header_filter.py(1 hunks)src/scanoss_winnowing/scanossbase.py(1 hunks)src/scanoss_winnowing/winnowing.py(9 hunks)tools/linter.sh(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/lint.yml
- tools/linter.sh
🧰 Additional context used
🧬 Code graph analysis (2)
src/scanoss_winnowing/winnowing.py (2)
src/scanoss_winnowing/header_filter.py (2)
HeaderFilter(284-560)filter(313-360)src/scanoss_winnowing/scanossbase.py (3)
ScanossBase(28-107)print_stderr(45-49)print_debug(58-63)
src/scanoss_winnowing/header_filter.py (1)
src/scanoss_winnowing/scanossbase.py (4)
ScanossBase(28-107)print_msg(51-56)print_debug(58-63)print_trace(65-70)
🔇 Additional comments (12)
Makefile (1)
32-42: LGTM! Lint targets are well-structured.The new lint targets follow the existing Makefile conventions and provide clear help text. They appropriately delegate to the
tools/linter.shscript with the correct flags for different linting modes.src/scanoss_winnowing/header_filter.py (3)
313-360: LGTM!The filter method correctly handles edge cases (empty content, unsupported languages, no implementation found) and properly applies the skip limit with appropriate None-checking.
362-386: LGTM!The language detection logic is sound, with appropriate None-handling and debug logging.
536-549: LGTM!The Go import block handling correctly identifies and skips multi-line import declarations enclosed in parentheses.
src/scanoss_winnowing/scanossbase.py (1)
28-107: LGTM!The ScanossBase class provides a clean abstraction for logging and I/O operations. The implementation is straightforward and effective.
src/scanoss_winnowing/winnowing.py (7)
39-40: LGTM!The new imports are correct and support the header filtering feature.
118-118: LGTM!Inheriting from ScanossBase provides consistent logging functionality across the codebase.
161-219: LGTM!The initialization correctly integrates the header filtering feature. The HeaderFilter is instantiated regardless of the
skip_headersflag, which has minimal overhead but could be optimized lazily if needed.
502-506: LGTM!The header filtering is correctly applied to the C-accelerated path after WFP generation.
571-579: LGTM!The header filtering is correctly applied to the Python path, with proper guard against empty WFP content.
471-471: Acceptable tradeoff in encoding handling.Line 471 uses
decode('utf-8', 'ignore')which silently skips invalid UTF-8 bytes. While this ensures the code doesn't crash on encoding errors, it might affect header detection accuracy for files with encoding issues. This is an acceptable tradeoff since such files are edge cases.
396-436: Clarify the semantic meaning ofstart_linein the WFP format.Line 422 sets
start_line={line_offset}, whereline_offsetrepresents "the last line previous to real code" (as noted in line 418's comment). Given thatimplementation_startis 1-indexed andline_offset = implementation_start - 1, the current implementation records the last skipped line rather than the first implementation line. While this appears intentional based on the header_filter semantics, the field namestart_linetypically implies "where code begins" rather than "where code before it ends". Consider either: (1) renaming toskip_until_linefor clarity, or (2) adjusting tostart_line={line_offset + 1}to record the actual first implementation line. If the current behavior is correct, add a clarifying comment explaining thatstart_linerecords the offset point, not the implementation start line.
| if 'doc_string_start' in patterns and '"""' in line: | ||
| # Count how many quotes there are | ||
| count = line.count('"""') | ||
| if count == COMPLETE_DOCSTRING_QUOTE_COUNT: # Complete docstring in one line | ||
| return True, False | ||
| if count == 1: # Start of a multiline docstring | ||
| return True, True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential false positive in docstring detection.
Line 465 counts """ occurrences in the entire line without considering if they appear inside string literals. For example, some_string = 'text with """ inside' would incorrectly be treated as a docstring. While unlikely in headers, consider using the existing regex patterns more precisely.
🤖 Prompt for AI Agents
In src/scanoss_winnowing/header_filter.py around lines 463 to 469, the code uses
line.count('"""') which can misclassify triple quotes that occur inside other
string literals; replace the simple count with the existing/appropriate compiled
regex from patterns (e.g. use patterns['doc_string_start'] or a dedicated
triple-quote regex) and use its finditer/search results to detect actual
triple-quote occurrences (ignoring matches inside other strings/escaped
contexts), then base the COMPLETE_DOCSTRING_QUOTE_COUNT and start-vs-multiline
logic on those regex matches rather than raw line.count().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/scanoss_winnowing/scanossbase.py (3)
72-81: Consider condensing the formatting.The multi-line formatting is valid but unconventional for this simple print statement. A one-liner would be more concise.
Apply this diff to condense:
@staticmethod def print_stdout(*args, **kwargs): """ Print message to STDOUT """ - print( - *args, - file=sys.stdout, - **kwargs, - ) + print(*args, file=sys.stdout, **kwargs)
83-94: Add error handling and explicit encoding for file operations.File operations can raise
FileNotFoundError,PermissionError, orOSError. Additionally, omitting theencodingparameter makes the behavior platform-dependent. Consider adding error handling and explicitly specifyingencoding='utf-8'for cross-platform consistency.Apply this diff to improve robustness:
def print_to_file_or_stdout(self, content: str, file: str = None): """ Print message to file if provided or stdout """ if not content: return if file: - with open(file, 'w') as f: - f.write(content) + try: + with open(file, 'w', encoding='utf-8') as f: + f.write(content) + except (FileNotFoundError, PermissionError, OSError) as e: + self.print_stderr(f"Error writing to file {file}: {e}") + raise else: self.print_stdout(content)Alternatively, document in the docstring that callers should handle file I/O exceptions.
96-107: Add error handling and explicit encoding for file operations.Similar to
print_to_file_or_stdout, this method lacks error handling and explicit encoding. Apply the same improvements for consistency.Apply this diff:
def print_to_file_or_stderr(self, msg: str, file: str = None): """ Print message to file if provided or stderr """ if not msg: return if file: - with open(file, 'w') as f: - f.write(msg) + try: + with open(file, 'w', encoding='utf-8') as f: + f.write(msg) + except (FileNotFoundError, PermissionError, OSError) as e: + self.print_stderr(f"Error writing to file {file}: {e}") + raise else: self.print_stderr(msg)
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
CHANGELOG.md(2 hunks)src/scanoss_winnowing/scanossbase.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- CHANGELOG.md
🔇 Additional comments (2)
src/scanoss_winnowing/scanossbase.py (2)
1-26: LGTM!The MIT license header is properly formatted and the import is appropriate for the logging utilities provided.
28-70: LGTM!The base class design is clean with well-structured logging levels. The initialization and conditional print methods (
print_msg,print_debug,print_trace) correctly implement the intended behavior.
Summary by CodeRabbit
New Features
Tests
Chores
CI / Tooling
Compatibility
✏️ Tip: You can customize this high-level summary in your review settings.