Feat/mscasso/sp 3820 add skip headers #16

mscasso-scanoss · 2025-12-17T02:29:54Z

Summary by CodeRabbit

New Features
- Add skip-headers support to omit file headers (comments, license blocks, imports) from analysis and integrate a standardized logging base.
Tests
- Add test validating skip-headers behavior and start-line reporting.
Chores
- Bump package to 0.7.0 and update changelog.
CI / Tooling
- Add lint workflow, Makefile lint targets, linter helper script, and ruff dev dependency.
Compatibility
- Raise minimum Python requirement to 3.9.


coderabbitai · 2025-12-17T02:30:12Z


📥 Commits

Reviewing files that changed from the base of the PR and between 7cccf3e and a47186d.

📒 Files selected for processing (1)

src/scanoss_winnowing/header_filter.py (1 hunks)

Walkthrough

Adds header-skipping to WFP generation via a new HeaderFilter and ScanossBase logging; integrates header stripping into Winnowing (C and Python paths), bumps package version to 0.7.0, adds tests, introduces lint tooling and CI workflow, and raises minimum Python requirement to 3.9.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Version & Changelog | `CHANGELOG.md`, `src/scanoss_winnowing/__init__.py` | Bumped package version to `0.7.0` and added a `0.7.0` changelog entry. |
| Logging Base | `src/scanoss_winnowing/scanossbase.py` | New `ScanossBase` class providing debug/trace/quiet flags and stdout/stderr/file output helpers. |
| Header Filtering | `src/scanoss_winnowing/header_filter.py` | New `HeaderFilter`, `LanguagePatterns`, extension mapping and helpers for detecting shebangs, comments, imports, and license headers; exposes `filter()` and detection utilities. |
| Core Integration | `src/scanoss_winnowing/winnowing.py` | `Winnowing` now inherits `ScanossBase`, accepts `skip_headers` and `skip_headers_limit`, initializes `HeaderFilter`, adds `__strip_lines_until_offset`, and applies header-stripping in both C-accelerated and Python WFP generation paths; removed legacy logging helpers. |
| Tests | `tests/winnowing-test.py` | Added `test_skip_headers_flag` to validate header-skipping behavior and `start_line` injection. |
| Linting / Tooling | `.github/workflows/lint.yml`, `Makefile`, `tools/linter.sh`, `requirements-dev.txt` | Added GitHub Actions lint workflow, Makefile lint targets (`lint`, `lint-fix`, `lint-all`, `lint-fix-all`), `tools/linter.sh` script, and added `ruff` to dev requirements. |
| Packaging / Docs | `setup.cfg`, `PACKAGE.md`, `README.md` | Raised Python compatibility requirement from `>=3.7` to `>=3.9` and updated docs to match. |

Sequence Diagram

sequenceDiagram
    actor User
    participant Winnowing
    participant ScanossBase
    participant HeaderFilter
    participant LanguagePatterns

    User->>Winnowing: instantiate(..., skip_headers=True, skip_headers_limit=N)
    Winnowing->>ScanossBase: __init__(debug, trace, quiet)
    Winnowing->>HeaderFilter: HeaderFilter(debug, trace, quiet, max_skipped_lines)
    User->>Winnowing: wfp_for_contents(file_path, contents)
    alt skip_headers enabled
        Winnowing->>HeaderFilter: filter(file_path, decoded_contents)
        HeaderFilter->>LanguagePatterns: map extension → patterns
        loop scan lines
            HeaderFilter->>HeaderFilter: is_shebang / is_comment / is_import / is_license_header
        end
        HeaderFilter-->>Winnowing: return line_offset
        Winnowing->>Winnowing: __strip_lines_until_offset(wfp, line_offset)
    end
    Winnowing-->>User: return WFP (may include start_line)

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Areas needing attention:
- src/scanoss_winnowing/header_filter.py — regex patterns, language mapping, multiline comment edge cases, and max-lines limit.
- src/scanoss_winnowing/winnowing.py — ensure header-stripping is applied uniformly across C and Python paths and correct start_line injection.
- src/scanoss_winnowing/scanossbase.py — logging integration and replacement of removed helpers.
- Packaging/config: setup.cfg, docs updates, and CI lint workflow correctness.

Poem

🐰 I nibble shebangs, comments, and lines,
I hop past licenses, imports, and signs.
ScanossBase hums while HeaderFilter aligns,
Winnowing skips where the real code shines.
Version 0.7.0 — carrots and finds!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main feature: adding skip-headers functionality to the winnowing process. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (7)

src/scanoss_winnowing/scanossbase.py (2)
83-83: Use explicit Optional or union type for nullable parameter.

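Per PEP 484, implicit Optional (a = None default on a parameter annotated as plain str) is prohibited, as flagged by static analysis. Because this package supports Python 3.9, where str | None fails when the annotation is evaluated, use Optional[str]:
+from typing import Optional
+
 ...
 
-    def print_to_file_or_stdout(self, content: str, file: str = None):
+    def print_to_file_or_stdout(self, content: str, file: Optional[str] = None):
The str | None spelling needs Python 3.10+ (or from __future__ import annotations).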

96-96: Same implicit Optional issue here.
-    def print_to_file_or_stderr(self, msg: str, file: str = None):
+    def print_to_file_or_stderr(self, msg: str, file: Optional[str] = None):
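As a minimal sketch of the explicit form on Python 3.9, the snippet below uses a hypothetical standalone function (not the real ScanossBase method) and checks the resolved annotation with typing.get_type_hints:

```python
from typing import Optional, get_type_hints


def print_to_file_or_stdout(content: str, file: Optional[str] = None) -> str:
    """Hypothetical stand-in: report where the content would be written."""
    return file if file else 'stdout'


# The annotation now states explicitly that None is an accepted value.
hints = get_type_hints(print_to_file_or_stdout)
```

Optional[str] works on every supported interpreter, whereas str | None in an evaluated annotation raises TypeError before Python 3.10.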
src/scanoss_winnowing/winnowing.py (1)
120-122: Minor formatting note on parameter list.

The trailing comma and extra newline before the closing parenthesis is unconventional. Consider cleaning up the formatting.
                  strip_snippet_ids=None, strip_hpsm_ids=None, skip_md5_ids=None, skip_headers: bool = False,
-
                  ):
+                 ):
src/scanoss_winnowing/header_filter.py (4)
42-196: Consider adding ClassVar annotations for class-level mutable attributes.

The COMMENT_PATTERNS, IMPORT_PATTERNS, and LICENSE_KEYWORDS are mutable class attributes. Adding ClassVar annotations clarifies they are shared across all instances and not meant to be overwritten per-instance.
+from typing import ClassVar, Dict, List, Optional, Tuple
+
 class LanguagePatterns:
     """Regex patterns for different programming languages"""
 
-    COMMENT_PATTERNS = {
+    COMMENT_PATTERNS: ClassVar[Dict[str, Dict[str, str]]] = {
         # ...
     }
-    IMPORT_PATTERNS = {
+    IMPORT_PATTERNS: ClassVar[Dict[str, List[str]]] = {
         # ...
     }
-    LICENSE_KEYWORDS = [
+    LICENSE_KEYWORDS: ClassVar[List[str]] = [
         # ...
     ]
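To illustrate what the annotation communicates, here is a small self-contained sketch. The dictionary keys and patterns are made up for the example and are not the real tables from header_filter.py:

```python
from typing import ClassVar, Dict, List


class LanguagePatterns:
    """Illustrative subset only; the real pattern tables are much larger."""

    # ClassVar marks these as shared class-level tables, not per-instance fields.
    COMMENT_PATTERNS: ClassVar[Dict[str, Dict[str, str]]] = {
        'python_style': {'single_line': r'^\s*#.*$'},  # hypothetical entry
    }
    LICENSE_KEYWORDS: ClassVar[List[str]] = ['copyright', 'license']


a, b = LanguagePatterns(), LanguagePatterns()
# Both instances see the very same list object.
shared = a.LICENSE_KEYWORDS is b.LICENSE_KEYWORDS
```

ClassVar changes nothing at runtime; it only tells type checkers and linters such as Ruff (RUF012) that the mutable attribute is intentionally shared.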
374-374: Remove unused noqa directive.

Static analysis indicates PLR0911 is not enabled, so this directive has no effect.
-    def is_comment(self, line: str, language: str, in_multiline: bool) -> Tuple[bool, bool]:  # noqa: PLR0911
+    def is_comment(self, line: str, language: str, in_multiline: bool) -> Tuple[bool, bool]:
429-429: Remove unused noqa directive.

Static analysis indicates PLR0912 is not enabled, so this directive has no effect.
-    def find_first_implementation_line(self, lines: list[str], language: str) -> Optional[int]:  # noqa: PLR0912
+    def find_first_implementation_line(self, lines: list[str], language: str) -> Optional[int]:
441-441: Unused variable: consecutive_imports_count is set but never read.

The variable is incremented on line 499 but its value is never used for any logic or output. Either remove it or document the intended future use.
-        consecutive_imports_count = 0
         # ...
             # Check if it's an import
             if self.is_import(line, language):
-                if consecutive_imports_count == 0:
-                    self.print_debug(f'Line {line_number}: Detected import section')
-                consecutive_imports_count += 1
+                self.print_debug(f'Line {line_number}: Detected import line')
                 continue
Also applies to: 497-499

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 151fe63 and 5417301.

📒 Files selected for processing (6)

CHANGELOG.md (1 hunks)
src/scanoss_winnowing/__init__.py (1 hunks)
src/scanoss_winnowing/header_filter.py (1 hunks)
src/scanoss_winnowing/scanossbase.py (1 hunks)
src/scanoss_winnowing/winnowing.py (8 hunks)
tests/winnowing-test.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

src/scanoss_winnowing/header_filter.py (1)

src/scanoss_winnowing/scanossbase.py (2)

ScanossBase (28-107)

print_debug (58-63)

tests/winnowing-test.py (1)

src/scanoss_winnowing/winnowing.py (2)

Winnowing (74-600)

wfp_for_contents (379-520)

src/scanoss_winnowing/winnowing.py (1)

src/scanoss_winnowing/scanossbase.py (2)

ScanossBase (28-107)

print_debug (58-63)

🪛 Ruff (0.14.8)

src/scanoss_winnowing/scanossbase.py

83-83: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

96-96: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

src/scanoss_winnowing/header_filter.py

46-73: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

76-184: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

187-192: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

374-374: Unused noqa directive (non-enabled: PLR0911)

Remove unused noqa directive

(RUF100)

429-429: Unused noqa directive (non-enabled: PLR0912)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (8)

src/scanoss_winnowing/__init__.py (1)

25-25: LGTM!

Version bump to 0.7.0 aligns with the new skip-headers feature addition documented in the changelog.

tests/winnowing-test.py (1)

154-250: Good test coverage for the skip_headers feature.

The test comprehensively validates:

Both modes produce valid WFP output with file= lines

Snippet line extraction works correctly

The skip_headers mode results in higher starting line numbers

The start_line= tag is present only when skip_headers is enabled

The start_line value is correctly parsed and validated

src/scanoss_winnowing/winnowing.py (6)

40-41: LGTM on new imports.

The imports for HeaderFilter and ScanossBase are correctly added for the new functionality.

74-74: Good inheritance structure.

Winnowing now correctly inherits from ScanossBase, consolidating the logging utilities.

135-156: LGTM on initialization changes.

The super().__init__() call correctly propagates the logging flags to ScanossBase, and the new skip_headers and header_filter attributes are properly initialized.

336-377: Well-structured line stripping logic.

The __strip_lines_until_offset method correctly:

Returns early for non-positive offsets

Preserves non-snippet lines (file=, hpsm=, etc.)

Injects start_line tag before the first retained snippet

Uses exception handling for parsing edge cases
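The behavior listed above can be sketched as a standalone function. This is a simplified illustration, not the method from winnowing.py: it assumes snippet lines have the form <line_number>=<hashes>, that any other line (file=, hpsm=, blank) is metadata, and it uses a digit check where the real method relies on exception handling.

```python
def strip_lines_until_offset(wfp: str, line_offset: int) -> str:
    """Drop snippet lines at or before line_offset and mark where output resumes."""
    if line_offset <= 0:
        return wfp  # Nothing to strip.
    kept = []
    start_line_added = False
    for line in wfp.split('\n'):
        key, sep, _ = line.partition('=')
        if not (sep and key.isdigit()):
            kept.append(line)  # Metadata such as file= or hpsm= is preserved.
            continue
        if int(key) > line_offset:
            if not start_line_added:
                # Inject the marker once, before the first retained snippet line.
                kept.append(f'start_line={line_offset}')
                start_line_added = True
            kept.append(line)
    return '\n'.join(kept)


sample = 'file=abc,10,a.py\n3=aaaa\n12=bbbb,cccc\n'
stripped = strip_lines_until_offset(sample, 5)
```

With an offset of 5, the snippet on line 3 is dropped, the one on line 12 is kept, and the marker is placed between the file= line and the first retained snippet.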

447-451: Header filtering correctly integrated in C-accelerated path.

The skip_headers logic is applied after strip_snippet_ids processing, ensuring consistent behavior.

515-519: Header filtering correctly integrated in Python path.

Good consistency - the Python path applies the same header filtering logic as the C-accelerated path, ensuring feature parity.

CHANGELOG.md

coderabbitai · 2025-12-17T02:33:21Z

src/scanoss_winnowing/header_filter.py

+            '.pm': 'perl',
+            '.r': 'r',
+            '.R': 'r',


⚠️ Potential issue | 🟡 Minor

Dead code: .R extension mapping is unreachable.

Since path.suffix.lower() is called on line 293, the extension is always lowercase. The .R entry on line 324 will never match; only .r on line 323 is used.

             '.r': 'r',
-            '.R': 'r',
             '.lua': 'lua',


🤖 Prompt for AI Agents

In src/scanoss_winnowing/header_filter.py around lines 322 to 324, the mapping includes both '.r' and '.R' but path.suffix.lower() is used earlier, making '.R' unreachable; remove the redundant '.R' entry (or alternatively ensure keys are normalized to lowercase and deduplicated) so the mapping only contains lowercase extensions to avoid dead code.
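A short sketch of why the uppercase key can never match, assuming the lookup lowercases the suffix before consulting the mapping. EXTENSION_MAP and detect_language here are simplified stand-ins for the real code:

```python
from pathlib import Path
from typing import Optional

# Illustrative subset of the extension mapping, including the redundant key.
EXTENSION_MAP = {'.pm': 'perl', '.r': 'r', '.R': 'r', '.lua': 'lua'}


def detect_language(file_path: str) -> Optional[str]:
    """Look up the language using the lowercased file suffix."""
    return EXTENSION_MAP.get(Path(file_path).suffix.lower())


# Only keys that are already lowercase can ever be hit by the lookup.
reachable = {key for key in EXTENSION_MAP if key == key.lower()}
```

A file named stats.R still resolves to r through the lowercase key, so removing the uppercase entry does not change behavior.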

src/scanoss_winnowing/scanossbase.py

…filter usage

coderabbitai

Actionable comments posted: 6

♻️ Duplicate comments (1)

src/scanoss_winnowing/header_filter.py (1)
237-238: Dead code: .R extension mapping is unreachable.

This issue was already identified in a previous review. Since path.suffix.lower() is called on line 376, the extension is always lowercase. The .R entry on line 238 will never match; only .r on line 237 is used.

Apply this diff to remove the unreachable entry:
         '.r': 'r',
-        '.R': 'r',
         '.lua': 'lua',

🧹 Nitpick comments (3)

.github/workflows/lint.yml (1)
40-42: Minor: Consider grouped redirects for cleaner output syntax.

The shellcheck tool suggests using grouped redirects for better readability and fewer operations.

Based on static analysis hints, apply this diff:
-          echo "files<<EOF" >> "$GITHUB_OUTPUT"
-          echo "${filtered_files}" >> "$GITHUB_OUTPUT"
-          echo "EOF" >> "$GITHUB_OUTPUT"
+          {
+            echo "files<<EOF"
+            echo "${filtered_files}"
+            echo "EOF"
+          } >> "$GITHUB_OUTPUT"
src/scanoss_winnowing/winnowing.py (1)
446-450: Consider extracting the header filtering logic into a helper method.

The header filtering logic is duplicated between the C-accelerated path (lines 446-450) and the Python path (lines 514-518). While the duplication is minimal, extracting it to a helper method like __apply_header_filter(file, wfp, decoded_contents) could improve maintainability if this logic needs to evolve.

Example refactoring:
def __apply_header_filter(self, file: str, wfp: str, decoded_contents: str) -> str:
    """Apply header filtering to WFP if enabled."""
    if self.skip_headers:
        line_offset = self.header_filter.filter(file, decoded_contents)
        if line_offset > 0:
            wfp = self.__strip_lines_until_offset(file, wfp, line_offset)
    return wfp
Then use it in both places:
wfp = self.__apply_header_filter(file, wfp, decoded_contents)
Also applies to: 514-518
src/scanoss_winnowing/header_filter.py (1)
490-560: Remove or clarify the confusing comment.

The comment # End for loop? on line 559 appears after the for loop has already ended naturally. This comment is confusing and doesn't add value. Consider removing it or replacing it with something clearer if there's a specific reason to mark the loop end.

Apply this diff to remove the confusing comment:
             return line_number
-        # End for loop?
         return None

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5417301 and bf66688.

📒 Files selected for processing (6)

.github/workflows/lint.yml (1 hunks)
CHANGELOG.md (2 hunks)
Makefile (1 hunks)
src/scanoss_winnowing/header_filter.py (1 hunks)
src/scanoss_winnowing/winnowing.py (9 hunks)
tools/linter.sh (1 hunks)

✅ Files skipped from review due to trivial changes (1)

CHANGELOG.md

🧰 Additional context used

🧬 Code graph analysis (2)

src/scanoss_winnowing/winnowing.py (2)

src/scanoss_winnowing/header_filter.py (2)

HeaderFilter (284-560)

filter (313-360)

src/scanoss_winnowing/scanossbase.py (3)

ScanossBase (28-107)

print_stderr (45-49)

print_debug (58-63)

src/scanoss_winnowing/header_filter.py (1)

src/scanoss_winnowing/scanossbase.py (4)

ScanossBase (28-107)

print_msg (51-56)

print_debug (58-63)

print_trace (65-70)

🪛 actionlint (1.7.9)

.github/workflows/lint.yml

13-13: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

18-18: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

29-29: shellcheck reported issue in this script: SC2129:style:11:1: Consider using { cmd1; cmd2; } >> file instead of individual redirects

(shellcheck)

🔇 Additional comments (13)

Makefile (1)

32-36: LGTM!

The lint targets are well-documented and correctly invoke the linter script with appropriate flags.

src/scanoss_winnowing/winnowing.py (6)

39-40: LGTM!

The imports are correctly structured and the new dependencies (HeaderFilter and ScanossBase) are properly integrated into the class hierarchy and functionality.

116-121: LGTM!

The new parameters skip_headers and skip_headers_max_lines are well-named and have sensible defaults. The parameter list formatting is consistent with Python conventions.

135-135: LGTM!

The super().__init__() call correctly passes debug, trace, and quiet parameters to the ScanossBase parent class.

156-158: LGTM!

The HeaderFilter initialization correctly uses the or None idiom to convert a zero value to None for the max_skipped_lines parameter, which enables unlimited filtering when the max is not set.

338-378: LGTM!

The __strip_lines_until_offset method correctly:

Filters WFP snippet lines based on line numbers

Preserves non-snippet metadata lines (file=, hpsm=)

Adds a start_line marker to indicate where filtering occurred

Handles parsing errors gracefully with try-except

The logic line_num > line_offset is correct since line_offset represents the last line to skip (the line before implementation starts).

413-415: LGTM!

Creating decoded_contents once and reusing it is efficient. The 'ignore' error handler for UTF-8 decoding is appropriate here since the code already handles binary files separately and needs to be robust against encoding issues.

src/scanoss_winnowing/header_filter.py (6)

42-204: LGTM!

The LanguagePatterns class provides comprehensive regex patterns for identifying comments, imports, and license headers across a wide range of programming languages. The patterns are well-organized by language family (c_style, python_style, lua_style, html_style) and cover major languages effectively.

257-281: LGTM!

The utility functions is_blank_line and is_shebang are simple, correct, and clearly documented. The explicit length check in is_blank_line is more readable than alternatives.

293-311: LGTM!

The constructor properly initializes the HeaderFilter with sensible defaults and correctly calls the parent class constructor. The max_skipped_lines parameter allows flexible control over filtering behavior.

313-360: LGTM!

The filter method correctly:

Validates inputs and handles edge cases

Returns 0 (no filtering) for unsupported languages or empty files

Calculates the line offset as implementation_start - 1 (the 1-indexed number of the last header line, which equals the count of lines to skip)

Applies the optional max_lines cap

Provides appropriate debug/trace logging
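The offset arithmetic described above can be sketched as follows. compute_line_offset is a hypothetical helper, and returning 0 when no implementation line is found is an assumption rather than confirmed behavior of filter():

```python
from typing import Optional


def compute_line_offset(implementation_start: Optional[int],
                        max_lines: Optional[int] = None) -> int:
    """Return how many leading lines to skip, given the 1-indexed first code line."""
    if implementation_start is None:
        return 0  # Assumed: no implementation found means no filtering.
    line_offset = implementation_start - 1  # Number of the last header line.
    if max_lines is not None and line_offset > max_lines:
        line_offset = max_lines  # Apply the optional cap.
    return line_offset
```

For a file whose first implementation line is line 12, the offset is 11; with a cap of 5 it becomes 5.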

362-428: LGTM!

The helper methods are well-implemented:

detect_language correctly uses lowercase extension matching

is_license_header performs case-insensitive keyword detection

get_comment_style provides sensible defaults for language families

430-488: LGTM!

The is_comment and is_import methods correctly handle:

Single-line and multi-line comment detection

State tracking for multi-line comments

Python docstrings with proper quote counting

Import/include pattern matching across languages

The complexity suppressions (noqa: PLR0911, PLR0912) are justified given the inherent complexity of parsing multiple language syntaxes.

.github/workflows/lint.yml

coderabbitai · 2025-12-17T03:19:13Z

.github/workflows/lint.yml

+            echo "Linting the following files:"
+            echo "${{ steps.changed_files.outputs.files }}"
+            # Pass the list of changed files to Ruff.
+            echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check


⚠️ Potential issue | 🟡 Minor

Potential issue with xargs handling many files.

Plain xargs splits its input on whitespace and interprets quotes, so filenames containing spaces or quote characters are passed to ruff incorrectly. Additionally, if the file list is empty, GNU xargs without -r still runs ruff check once with no arguments.

Apply this diff to make it more robust:

             echo "Linting the following files:"
             echo "${{ steps.changed_files.outputs.files }}"
             # Pass the list of changed files to Ruff.
-            echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check
+            echo "${{ steps.changed_files.outputs.files }}" | xargs -r ruff check
           fi

The -r flag prevents xargs from running if the input is empty, providing an extra safety layer alongside the conditional check.


🤖 Prompt for AI Agents

.github/workflows/lint.yml around line 56: the current pipeline uses echo ... | xargs ruff check which can break on empty input or filenames with spaces/special chars; change it to produce null-delimited file names and invoke xargs with the -r (do not run if input is empty) and --null flags so ruff is not executed on empty input and filenames with special characters are handled safely.

coderabbitai · 2025-12-17T03:19:13Z

tools/linter.sh

+done
+
+# Find merge base with origin/main
+merge_base=$(git merge-base origin/main HEAD)


⚠️ Potential issue | 🟠 Major

Handle missing origin/main gracefully.

If origin/main doesn't exist (e.g., in a fresh clone or different remote setup), the script will fail. Consider adding error handling or documentation about the requirement.

Apply this diff to add error handling:

 # Find merge base with origin/main
+if ! git rev-parse --verify origin/main >/dev/null 2>&1; then
+    echo "Error: origin/main branch not found. Ensure you have fetched from origin."
+    exit 1
+fi
 merge_base=$(git merge-base origin/main HEAD)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In tools/linter.sh around line 32, the git merge-base call assumes origin/main exists and will fail if it doesn't; add a pre-check for origin/main (e.g., git rev-parse --verify --quiet refs/remotes/origin/main or git show-ref) and if that check fails, fall back to a safe default (for example set merge_base to HEAD or to HEAD~1) and print a clear warning so the script continues instead of erroring; ensure subsequent logic uses the fallback value and that the script exits non-fatally or instructs the user how to set up the remote if required.

tools/linter.sh

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

src/scanoss_winnowing/header_filter.py (1)
237-238: Dead code: .R extension mapping is unreachable.

Since path.suffix.lower() is called on line 376, the extension is always lowercase. The .R entry on line 238 will never match; only .r on line 237 is used.

Apply this diff to remove the redundant entry:
             '.r': 'r',
-            '.R': 'r',
             '.lua': 'lua',
src/scanoss_winnowing/scanossbase.py (1)
51-56: Typo in docstring: "quite" should be "quiet".

Apply this diff:
     def print_msg(self, *args, **kwargs):
         """
-        Print message if quite mode is not enabled
+        Print message if quiet mode is not enabled
         """
         if not self.quiet:
             self.print_stderr(*args, **kwargs)

🧹 Nitpick comments (4)

requirements-dev.txt (1)
5-5: Consider constraining the ruff version for reproducibility.

While the dependency is correctly added, an unconstrained ruff lets lint results drift between environments and CI runs as new releases ship. A lower bound (e.g., ruff>=0.8.0) guards against stale installs; only an exact pin (ruff==<version>) makes runs fully reproducible.

Apply this diff to add a minimum ruff version:
-ruff
+ruff>=0.8.0
src/scanoss_winnowing/header_filter.py (2)

293-311: Consider renaming for clarity.

The parameter skip_limit is stored as self.max_lines, which can be confusing. Consider using a consistent name like max_skip_lines for both the parameter and the attribute to improve readability.

60-204: Add unit tests validating language patterns with real-world code samples.

The comment, import, and license keyword patterns are production-ready and actively used in filtering logic, but test coverage is limited to Python files. Consider adding integration tests for JavaScript, Java, Go, Rust, C++, and other supported languages using realistic code samples to validate that the regex patterns correctly identify headers, comments, and imports across edge cases.
src/scanoss_winnowing/scanossbase.py (1)
83-94: Consider specifying encoding explicitly.

Lines 91 and 104 open files without specifying encoding. For consistency with the rest of the codebase (which uses UTF-8, e.g., line 471 in winnowing.py), consider specifying encoding='utf-8' to avoid platform-dependent behavior.

Apply this diff:
         if file:
-            with open(file, 'w') as f:
+            with open(file, 'w', encoding='utf-8') as f:
                 f.write(content)
         else:
             self.print_stdout(content)
And similarly for print_to_file_or_stderr:
         if file:
-            with open(file, 'w') as f:
+            with open(file, 'w', encoding='utf-8') as f:
                 f.write(msg)
         else:
             self.print_stderr(msg)

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bf66688 and 11f06b9.

📒 Files selected for processing (8)

.github/workflows/lint.yml (1 hunks)
Makefile (1 hunks)
requirements-dev.txt (1 hunks)
setup.cfg (1 hunks)
src/scanoss_winnowing/header_filter.py (1 hunks)
src/scanoss_winnowing/scanossbase.py (1 hunks)
src/scanoss_winnowing/winnowing.py (9 hunks)
tools/linter.sh (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

.github/workflows/lint.yml
tools/linter.sh

🧰 Additional context used

🧬 Code graph analysis (2)

src/scanoss_winnowing/winnowing.py (2)

src/scanoss_winnowing/header_filter.py (2)

HeaderFilter (284-560)

filter (313-360)

src/scanoss_winnowing/scanossbase.py (3)

ScanossBase (28-107)

print_stderr (45-49)

print_debug (58-63)

src/scanoss_winnowing/header_filter.py (1)

src/scanoss_winnowing/scanossbase.py (4)

ScanossBase (28-107)

print_msg (51-56)

print_debug (58-63)

print_trace (65-70)

🔇 Additional comments (12)

Makefile (1)

32-42: LGTM! Lint targets are well-structured.

The new lint targets follow the existing Makefile conventions and provide clear help text. They appropriately delegate to the tools/linter.sh script with the correct flags for different linting modes.

src/scanoss_winnowing/header_filter.py (3)

313-360: LGTM!

The filter method correctly handles edge cases (empty content, unsupported languages, no implementation found) and properly applies the skip limit with appropriate None-checking.

362-386: LGTM!

The language detection logic is sound, with appropriate None-handling and debug logging.

536-549: LGTM!

The Go import block handling correctly identifies and skips multi-line import declarations enclosed in parentheses.

src/scanoss_winnowing/scanossbase.py (1)

28-107: LGTM!

The ScanossBase class provides a clean abstraction for logging and I/O operations. The implementation is straightforward and effective.

src/scanoss_winnowing/winnowing.py (7)

39-40: LGTM!

The new imports are correct and support the header filtering feature.

118-118: LGTM!

Inheriting from ScanossBase provides consistent logging functionality across the codebase.

161-219: LGTM!

The initialization correctly integrates the header filtering feature. The HeaderFilter is instantiated regardless of the skip_headers flag, which has minimal overhead but could be optimized lazily if needed.

502-506: LGTM!

The header filtering is correctly applied to the C-accelerated path after WFP generation.

571-579: LGTM!

The header filtering is correctly applied to the Python path, with proper guard against empty WFP content.

471-471: Acceptable tradeoff in encoding handling.

Line 471 uses decode('utf-8', 'ignore') which silently skips invalid UTF-8 bytes. While this ensures the code doesn't crash on encoding errors, it might affect header detection accuracy for files with encoding issues. This is an acceptable tradeoff since such files are edge cases.

396-436: Clarify the semantic meaning of start_line in the WFP format.

Line 422 sets start_line={line_offset}, where line_offset represents "the last line previous to real code" (as noted in line 418's comment). Given that implementation_start is 1-indexed and line_offset = implementation_start - 1, the current implementation records the last skipped line rather than the first implementation line. While this appears intentional based on the header_filter semantics, the field name start_line typically implies "where code begins" rather than "where the code before it ends". Consider either:

(1) renaming to skip_until_line for clarity, or

(2) adjusting to start_line={line_offset + 1} to record the actual first implementation line.

If the current behavior is correct, add a clarifying comment explaining that start_line records the offset point, not the implementation start line.

setup.cfg

coderabbitai · 2025-12-17T17:24:08Z

src/scanoss_winnowing/header_filter.py

+        if 'doc_string_start' in patterns and '"""' in line:
+            # Count how many quotes there are
+            count = line.count('"""')
+            if count == COMPLETE_DOCSTRING_QUOTE_COUNT:  # Complete docstring in one line
+                return True, False
+            if count == 1:  # Start of a multiline docstring
+                return True, True


⚠️ Potential issue | 🟡 Minor

Potential false positive in docstring detection.

Line 465 counts """ occurrences in the entire line without considering if they appear inside string literals. For example, some_string = 'text with """ inside' would incorrectly be treated as a docstring. While unlikely in headers, consider using the existing regex patterns more precisely.

🤖 Prompt for AI Agents

In src/scanoss_winnowing/header_filter.py around lines 463 to 469, the code uses line.count('"""') which can misclassify triple quotes that occur inside other string literals; replace the simple count with the existing/appropriate compiled regex from patterns (e.g. use patterns['doc_string_start'] or a dedicated triple-quote regex) and use its finditer/search results to detect actual triple-quote occurrences (ignoring matches inside other strings/escaped contexts), then base the COMPLETE_DOCSTRING_QUOTE_COUNT and start-vs-multiline logic on those regex matches rather than raw line.count().
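The false positive can be reproduced with a small sketch. Both functions are hypothetical: the first mirrors the count-based check (assuming COMPLETE_DOCSTRING_QUOTE_COUNT is 2), and the second is one possible stricter alternative that only accepts a triple quote opening the line.

```python
import re


def looks_like_docstring(line: str) -> bool:
    """Naive check: any line with one or two triple-quote tokens counts."""
    return line.count('"""') in (1, 2)


def starts_with_docstring(line: str) -> bool:
    """Stricter check: the triple quote must open the line, after optional
    indentation and an optional r/u string prefix."""
    return re.match(r'^\s*[rRuU]?"""', line) is not None


assignment = 'some_string = \'text with """ inside\''
docstring = '    """Module docs."""'
```

The stricter check rejects the assignment line but still accepts a real docstring opener. It does not recognize a closing line of a multi-line docstring, which would still need the existing in_multiline state tracking.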

src/scanoss_winnowing/header_filter.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

src/scanoss_winnowing/scanossbase.py (3)

72-81: Consider condensing the formatting.

The multi-line formatting is valid but unconventional for this simple print statement. A one-liner would be more concise.

Apply this diff to condense:

     @staticmethod
     def print_stdout(*args, **kwargs):
         """
         Print message to STDOUT
         """
-        print(
-            *args,
-            file=sys.stdout,
-            **kwargs,
-        )
+        print(*args, file=sys.stdout, **kwargs)

83-94: Add error handling and explicit encoding for file operations.

File operations can raise FileNotFoundError, PermissionError, or OSError. Additionally, omitting the encoding parameter makes the behavior platform-dependent. Consider adding error handling and explicitly specifying encoding='utf-8' for cross-platform consistency.

Apply this diff to improve robustness:

     def print_to_file_or_stdout(self, content: str, file: str = None):
         """
         Print message to file if provided or stdout
         """
         if not content:
             return
 
         if file:
-            with open(file, 'w') as f:
-                f.write(content)
+            try:
+                with open(file, 'w', encoding='utf-8') as f:
+                    f.write(content)
+            except (FileNotFoundError, PermissionError, OSError) as e:
+                self.print_stderr(f"Error writing to file {file}: {e}")
+                raise
         else:
             self.print_stdout(content)

Alternatively, document in the docstring that callers should handle file I/O exceptions.
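
As a side note on the suggested `except` clause: `FileNotFoundError` and `PermissionError` are both subclasses of `OSError`, so catching `OSError` alone has the same effect. The sketch below shows that pattern in a standalone function. `write_text` is a hypothetical name, and the function prints to stderr directly instead of calling the class's `print_stderr`.

```python
import os
import sys
import tempfile


def write_text(content, path):
    """Write content to path as UTF-8; report and re-raise on I/O failure."""
    try:
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content)
    except OSError as e:  # also covers FileNotFoundError and PermissionError
        print(f"Error writing to file {path}: {e}", file=sys.stderr)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'out.txt')
    write_text('naïve café', target)
    with open(target, encoding='utf-8') as f:
        round_trip = f.read()  # same text back, independent of platform default encoding

    missing = os.path.join(tmp, 'no_such_dir', 'out.txt')
    try:
        write_text('data', missing)
        failed = False
    except FileNotFoundError:
        failed = True  # parent directory does not exist, so the error propagates
```

Re-raising after logging keeps the decision about recovery with the caller, which matches the alternative of documenting that callers handle file I/O exceptions.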

96-107: Add error handling and explicit encoding for file operations.

Similar to print_to_file_or_stdout, this method lacks error handling and explicit encoding. Apply the same improvements for consistency.

Apply this diff:

     def print_to_file_or_stderr(self, msg: str, file: str = None):
         """
         Print message to file if provided or stderr
         """
         if not msg:
             return
 
         if file:
-            with open(file, 'w') as f:
-                f.write(msg)
+            try:
+                with open(file, 'w', encoding='utf-8') as f:
+                    f.write(msg)
+            except (FileNotFoundError, PermissionError, OSError) as e:
+                self.print_stderr(f"Error writing to file {file}: {e}")
+                raise
         else:
             self.print_stderr(msg)

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11f06b9 and 712016e.

📒 Files selected for processing (2)

CHANGELOG.md (2 hunks)
src/scanoss_winnowing/scanossbase.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

CHANGELOG.md

🔇 Additional comments (2)

src/scanoss_winnowing/scanossbase.py (2)

1-26: LGTM!

The MIT license header is properly formatted and the import is appropriate for the logging utilities provided.

28-70: LGTM!

The base class design is clean with well-structured logging levels. The initialization and conditional print methods (print_msg, print_debug, print_trace) correctly implement the intended behavior.

scanoss-qg added 3 commits December 15, 2025 11:12

Update winnowing.py to support --skip-header

362e42c

class rename from LineFilter to HeaderFilter

5b9f810

update header_filter definitions

5417301

mscasso-scanoss requested a review from eeisegn December 17, 2025 02:29

mscasso-scanoss self-assigned this Dec 17, 2025

coderabbitai bot reviewed Dec 17, 2025


Project config update: add linter and gitactions. Update file header filter usage

bf66688

coderabbitai bot reviewed Dec 17, 2025


eeisegn added 5 commits December 17, 2025 17:14

increase min python version

4a83df8

cleanup linter script and calls

5073bf7

update copyright

141b98d

update variable name and fix filter location

11f06b9

review updates

712016e

coderabbitai bot reviewed Dec 17, 2025


update min python requirement to 3.9

7cccf3e

eeisegn approved these changes Dec 17, 2025


add key missing safety check

a47186d

eeisegn merged commit 74cc6e7 into main Dec 17, 2025
2 checks passed

eeisegn deleted the feat/mscasso/SP-3820-add-skip-headers branch December 17, 2025 17:45

	echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check
	echo "${{ steps.changed_files.outputs.files }}" | xargs -r ruff check
