Skip to content

Conversation

@mscasso-scanoss
Copy link
Contributor

@mscasso-scanoss mscasso-scanoss commented Dec 17, 2025

Summary by CodeRabbit

  • New Features

    • Add skip-headers support to omit file headers (comments, license blocks, imports) from analysis and integrate a standardized logging base.
  • Tests

    • Add test validating skip-headers behavior and start-line reporting.
  • Chores

    • Bump package to 0.7.0 and update changelog.
  • CI / Tooling

    • Add lint workflow, Makefile lint targets, linter helper script, and ruff dev dependency.
  • Compatibility

    • Raise minimum Python requirement to 3.9.

✏️ Tip: You can customize this high-level summary in your review settings.

@mscasso-scanoss mscasso-scanoss self-assigned this Dec 17, 2025
@coderabbitai
Copy link

coderabbitai bot commented Dec 17, 2025

Warning

Rate limit exceeded

@eeisegn has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 6 minutes and 8 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 7cccf3e and a47186d.

📒 Files selected for processing (1)
  • src/scanoss_winnowing/header_filter.py (1 hunks)

Walkthrough

Adds header-skipping to WFP generation via a new HeaderFilter and ScanossBase logging; integrates header stripping into Winnowing (C and Python paths), bumps package version to 0.7.0, adds tests, introduces lint tooling and CI workflow, and raises minimum Python requirement to 3.9.

Changes

Cohort / File(s) Summary
Version & Changelog
CHANGELOG.md, src/scanoss_winnowing/__init__.py
Bumped package version to 0.7.0 and added a 0.7.0 changelog entry.
Logging Base
src/scanoss_winnowing/scanossbase.py
New ScanossBase class providing debug/trace/quiet flags and stdout/stderr/file output helpers.
Header Filtering
src/scanoss_winnowing/header_filter.py
New HeaderFilter, LanguagePatterns, extension mapping and helpers for detecting shebangs, comments, imports, and license headers; exposes filter() and detection utilities.
Core Integration
src/scanoss_winnowing/winnowing.py
Winnowing now inherits ScanossBase, accepts skip_headers and skip_headers_limit, initializes HeaderFilter, adds __strip_lines_until_offset, and applies header-stripping in both C-accelerated and Python WFP generation paths; removed legacy logging helpers.
Tests
tests/winnowing-test.py
Added test_skip_headers_flag to validate header-skipping behavior and start_line injection.
Linting / Tooling
.github/workflows/lint.yml, Makefile, tools/linter.sh, requirements-dev.txt
Added GitHub Actions lint workflow, Makefile lint targets (lint, lint-fix, lint-all, lint-fix-all), tools/linter.sh script, and added ruff to dev requirements.
Packaging / Docs
setup.cfg, PACKAGE.md, README.md
Raised Python compatibility requirement from >=3.7 to >=3.9 and updated docs to match.

Sequence Diagram

sequenceDiagram
    actor User
    participant Winnowing
    participant ScanossBase
    participant HeaderFilter
    participant LanguagePatterns

    User->>Winnowing: instantiate(..., skip_headers=True, skip_headers_limit=N)
    Winnowing->>ScanossBase: __init__(debug, trace, quiet)
    Winnowing->>HeaderFilter: HeaderFilter(debug, trace, quiet, max_skipped_lines)
    User->>Winnowing: wfp_for_contents(file_path, contents)
    alt skip_headers enabled
        Winnowing->>HeaderFilter: filter(file_path, decoded_contents)
        HeaderFilter->>LanguagePatterns: map extension → patterns
        loop scan lines
            HeaderFilter->>HeaderFilter: is_shebang / is_comment / is_import / is_license_header
        end
        HeaderFilter-->>Winnowing: return line_offset
        Winnowing->>Winnowing: __strip_lines_until_offset(wfp, line_offset)
    end
    Winnowing-->>User: return WFP (may include start_line)
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • Areas needing attention:
    • src/scanoss_winnowing/header_filter.py — regex patterns, language mapping, multiline comment edge cases, and max-lines limit.
    • src/scanoss_winnowing/winnowing.py — ensure header-stripping is applied uniformly across C and Python paths and correct start_line injection.
    • src/scanoss_winnowing/scanossbase.py — logging integration and replacement of removed helpers.
    • Packaging/config: setup.cfg, docs updates, and CI lint workflow correctness.

Poem

🐰 I nibble shebangs, comments, and lines,
I hop past licenses, imports, and signs.
ScanossBase hums while HeaderFilter aligns,
Winnowing skips where the real code shines.
Version 0.7.0 — carrots and finds!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main feature: adding skip-headers functionality to the winnowing process.
Docstring Coverage ✅ Passed Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (7)
src/scanoss_winnowing/scanossbase.py (2)

83-83: Use explicit Optional or union type for nullable parameter.

Per PEP 484, implicit Optional (using = None without type annotation) is prohibited. As flagged by static analysis.

+from typing import Optional
+
 ...
 
-    def print_to_file_or_stdout(self, content: str, file: str = None):
+    def print_to_file_or_stdout(self, content: str, file: str | None = None):

Or use Optional[str] if supporting Python < 3.10.


96-96: Same implicit Optional issue here.

-    def print_to_file_or_stderr(self, msg: str, file: str = None):
+    def print_to_file_or_stderr(self, msg: str, file: str | None = None):
src/scanoss_winnowing/winnowing.py (1)

120-122: Minor formatting note on parameter list.

The trailing comma and extra newline before the closing parenthesis is unconventional. Consider cleaning up the formatting.

                  strip_snippet_ids=None, strip_hpsm_ids=None, skip_md5_ids=None, skip_headers: bool = False,
-
                  ):
+                 ):
src/scanoss_winnowing/header_filter.py (4)

42-196: Consider adding ClassVar annotations for class-level mutable attributes.

The COMMENT_PATTERNS, IMPORT_PATTERNS, and LICENSE_KEYWORDS are mutable class attributes. Adding ClassVar annotations clarifies they are shared across all instances and not meant to be overwritten per-instance.

+from typing import ClassVar, Dict, List, Optional, Tuple
+
 class LanguagePatterns:
     """Regex patterns for different programming languages"""
 
+    COMMENT_PATTERNS: ClassVar[Dict[str, Dict[str, str]]] = {
-    COMMENT_PATTERNS = {
         # ...
     }
+    IMPORT_PATTERNS: ClassVar[Dict[str, List[str]]] = {
-    IMPORT_PATTERNS = {
         # ...
     }
+    LICENSE_KEYWORDS: ClassVar[List[str]] = [
-    LICENSE_KEYWORDS = [
         # ...
     ]

374-374: Remove unused noqa directive.

Static analysis indicates PLR0911 is not enabled, so this directive has no effect.

-    def is_comment(self, line: str, language: str, in_multiline: bool) -> Tuple[bool, bool]:  # noqa: PLR0911
+    def is_comment(self, line: str, language: str, in_multiline: bool) -> Tuple[bool, bool]:

429-429: Remove unused noqa directive.

Static analysis indicates PLR0912 is not enabled, so this directive has no effect.

-    def find_first_implementation_line(self, lines: list[str], language: str) -> Optional[int]:  # noqa: PLR0912
+    def find_first_implementation_line(self, lines: list[str], language: str) -> Optional[int]:

441-441: Unused variable: consecutive_imports_count is set but never read.

The variable is incremented on line 499 but its value is never used for any logic or output. Either remove it or document the intended future use.

-        consecutive_imports_count = 0
         # ...
             # Check if it's an import
             if self.is_import(line, language):
-                if consecutive_imports_count == 0:
-                    self.print_debug(f'Line {line_number}: Detected import section')
-                consecutive_imports_count += 1
+                self.print_debug(f'Line {line_number}: Detected import line')
                 continue

Also applies to: 497-499

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 151fe63 and 5417301.

📒 Files selected for processing (6)
  • CHANGELOG.md (1 hunks)
  • src/scanoss_winnowing/__init__.py (1 hunks)
  • src/scanoss_winnowing/header_filter.py (1 hunks)
  • src/scanoss_winnowing/scanossbase.py (1 hunks)
  • src/scanoss_winnowing/winnowing.py (8 hunks)
  • tests/winnowing-test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/scanoss_winnowing/header_filter.py (1)
src/scanoss_winnowing/scanossbase.py (2)
  • ScanossBase (28-107)
  • print_debug (58-63)
tests/winnowing-test.py (1)
src/scanoss_winnowing/winnowing.py (2)
  • Winnowing (74-600)
  • wfp_for_contents (379-520)
src/scanoss_winnowing/winnowing.py (1)
src/scanoss_winnowing/scanossbase.py (2)
  • ScanossBase (28-107)
  • print_debug (58-63)
🪛 Ruff (0.14.8)
src/scanoss_winnowing/scanossbase.py

83-83: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


96-96: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

src/scanoss_winnowing/header_filter.py

46-73: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


76-184: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


187-192: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


374-374: Unused noqa directive (non-enabled: PLR0911)

Remove unused noqa directive

(RUF100)


429-429: Unused noqa directive (non-enabled: PLR0912)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (8)
src/scanoss_winnowing/__init__.py (1)

25-25: LGTM!

Version bump to 0.7.0 aligns with the new skip-headers feature addition documented in the changelog.

tests/winnowing-test.py (1)

154-250: Good test coverage for the skip_headers feature.

The test comprehensively validates:

  • Both modes produce valid WFP output with file= lines
  • Snippet line extraction works correctly
  • The skip_headers mode results in higher starting line numbers
  • The start_line= tag is present only when skip_headers is enabled
  • The start_line value is correctly parsed and validated
src/scanoss_winnowing/winnowing.py (6)

40-41: LGTM on new imports.

The imports for HeaderFilter and ScanossBase are correctly added for the new functionality.


74-74: Good inheritance structure.

Winnowing now correctly inherits from ScanossBase, consolidating the logging utilities.


135-156: LGTM on initialization changes.

The super().init call correctly propagates the logging flags to ScanossBase, and the new skip_headers and header_filter attributes are properly initialized.


336-377: Well-structured line stripping logic.

The __strip_lines_until_offset method correctly:

  • Returns early for non-positive offsets
  • Preserves non-snippet lines (file=, hpsm=, etc.)
  • Injects start_line tag before the first retained snippet
  • Uses exception handling for parsing edge cases

447-451: Header filtering correctly integrated in C-accelerated path.

The skip_headers logic is applied after strip_snippet_ids processing, ensuring consistent behavior.


515-519: Header filtering correctly integrated in Python path.

Good consistency - the Python path applies the same header filtering logic as the C-accelerated path, ensuring feature parity.

Comment on lines 322 to 324
'.pm': 'perl',
'.r': 'r',
'.R': 'r',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Dead code: .R extension mapping is unreachable.

Since path.suffix.lower() is called on line 293, the extension is always lowercase. The .R entry on line 324 will never match; only .r on line 323 is used.

             '.r': 'r',
-            '.R': 'r',
             '.lua': 'lua',
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
'.pm': 'perl',
'.r': 'r',
'.R': 'r',
'.pm': 'perl',
'.r': 'r',
🤖 Prompt for AI Agents
In src/scanoss_winnowing/header_filter.py around lines 322 to 324, the mapping
includes both '.r' and '.R' but path.suffix.lower() is used earlier, making '.R'
unreachable; remove the redundant '.R' entry (or alternatively ensure keys are
normalized to lowercase and deduplicated) so the mapping only contains lowercase
extensions to avoid dead code.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (1)
src/scanoss_winnowing/header_filter.py (1)

237-238: Dead code: .R extension mapping is unreachable.

This issue was already identified in a previous review. Since path.suffix.lower() is called on line 376, the extension is always lowercase. The .R entry on line 238 will never match; only .r on line 237 is used.

Apply this diff to remove the unreachable entry:

         '.r': 'r',
-        '.R': 'r',
         '.lua': 'lua',
🧹 Nitpick comments (3)
.github/workflows/lint.yml (1)

40-42: Minor: Consider grouped redirects for cleaner output syntax.

The shellcheck tool suggests using grouped redirects for better readability and fewer operations.

Based on static analysis hints, apply this diff:

-          echo "files<<EOF" >> "$GITHUB_OUTPUT"
-          echo "${filtered_files}" >> "$GITHUB_OUTPUT"
-          echo "EOF" >> "$GITHUB_OUTPUT"
+          {
+            echo "files<<EOF"
+            echo "${filtered_files}"
+            echo "EOF"
+          } >> "$GITHUB_OUTPUT"
src/scanoss_winnowing/winnowing.py (1)

446-450: Consider extracting the header filtering logic into a helper method.

The header filtering logic is duplicated between the C-accelerated path (lines 446-450) and the Python path (lines 514-518). While the duplication is minimal, extracting it to a helper method like __apply_header_filter(file, wfp, decoded_contents) could improve maintainability if this logic needs to evolve.

Example refactoring:

def __apply_header_filter(self, file: str, wfp: str, decoded_contents: str) -> str:
    """Apply header filtering to WFP if enabled."""
    if self.skip_headers:
        line_offset = self.header_filter.filter(file, decoded_contents)
        if line_offset > 0:
            wfp = self.__strip_lines_until_offset(file, wfp, line_offset)
    return wfp

Then use it in both places:

wfp = self.__apply_header_filter(file, wfp, decoded_contents)

Also applies to: 514-518

src/scanoss_winnowing/header_filter.py (1)

490-560: Remove or clarify the confusing comment.

The comment # End for loop? on line 559 appears after the for loop has already ended naturally. This comment is confusing and doesn't add value. Consider removing it or replacing it with something clearer if there's a specific reason to mark the loop end.

Apply this diff to remove the confusing comment:

             return line_number
-        # End for loop?
         return None
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5417301 and bf66688.

📒 Files selected for processing (6)
  • .github/workflows/lint.yml (1 hunks)
  • CHANGELOG.md (2 hunks)
  • Makefile (1 hunks)
  • src/scanoss_winnowing/header_filter.py (1 hunks)
  • src/scanoss_winnowing/winnowing.py (9 hunks)
  • tools/linter.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🧰 Additional context used
🧬 Code graph analysis (2)
src/scanoss_winnowing/winnowing.py (2)
src/scanoss_winnowing/header_filter.py (2)
  • HeaderFilter (284-560)
  • filter (313-360)
src/scanoss_winnowing/scanossbase.py (3)
  • ScanossBase (28-107)
  • print_stderr (45-49)
  • print_debug (58-63)
src/scanoss_winnowing/header_filter.py (1)
src/scanoss_winnowing/scanossbase.py (4)
  • ScanossBase (28-107)
  • print_msg (51-56)
  • print_debug (58-63)
  • print_trace (65-70)
🪛 actionlint (1.7.9)
.github/workflows/lint.yml

13-13: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)


18-18: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)


29-29: shellcheck reported issue in this script: SC2129:style:11:1: Consider using { cmd1; cmd2; } >> file instead of individual redirects

(shellcheck)

🔇 Additional comments (13)
Makefile (1)

32-36: LGTM!

The lint targets are well-documented and correctly invoke the linter script with appropriate flags.

src/scanoss_winnowing/winnowing.py (6)

39-40: LGTM!

The imports are correctly structured and the new dependencies (HeaderFilter and ScanossBase) are properly integrated into the class hierarchy and functionality.


116-121: LGTM!

The new parameters skip_headers and skip_headers_max_lines are well-named and have sensible defaults. The parameter list formatting is consistent with Python conventions.


135-135: LGTM!

The super().init call correctly passes debug, trace, and quiet parameters to the ScanossBase parent class.


156-158: LGTM!

The HeaderFilter initialization correctly uses the or None idiom to convert a zero value to None for the max_skipped_lines parameter, which enables unlimited filtering when the max is not set.


338-378: LGTM!

The __strip_lines_until_offset method correctly:

  • Filters WFP snippet lines based on line numbers
  • Preserves non-snippet metadata lines (file=, hpsm=)
  • Adds a start_line marker to indicate where filtering occurred
  • Handles parsing errors gracefully with try-except

The logic line_num > line_offset is correct since line_offset represents the last line to skip (the line before implementation starts).


413-415: LGTM!

Creating decoded_contents once and reusing it is efficient. The 'ignore' error handler for UTF-8 decoding is appropriate here since the code already handles binary files separately and needs to be robust against encoding issues.

src/scanoss_winnowing/header_filter.py (6)

42-204: LGTM!

The LanguagePatterns class provides comprehensive regex patterns for identifying comments, imports, and license headers across a wide range of programming languages. The patterns are well-organized by language family (c_style, python_style, lua_style, html_style) and cover major languages effectively.


257-281: LGTM!

The utility functions is_blank_line and is_shebang are simple, correct, and clearly documented. The explicit length check in is_blank_line is more readable than alternatives.


293-311: LGTM!

The constructor properly initializes the HeaderFilter with sensible defaults and correctly calls the parent class constructor. The max_skipped_lines parameter allows flexible control over filtering behavior.


313-360: LGTM!

The filter method correctly:

  • Validates inputs and handles edge cases
  • Returns 0 (no filtering) for unsupported languages or empty files
  • Calculates the line offset as implementation_start - 1 (correctly converting from 1-indexed line numbers to 0-indexed offset)
  • Applies the optional max_lines cap
  • Provides appropriate debug/trace logging

362-428: LGTM!

The helper methods are well-implemented:

  • detect_language correctly uses lowercase extension matching
  • is_license_header performs case-insensitive keyword detection
  • get_comment_style provides sensible defaults for language families

430-488: LGTM!

The is_comment and is_import methods correctly handle:

  • Single-line and multi-line comment detection
  • State tracking for multi-line comments
  • Python docstrings with proper quote counting
  • Import/include pattern matching across languages

The complexity suppressions (noqa: PLR0911, PLR0912) are justified given the inherent complexity of parsing multiple language syntaxes.

echo "Linting the following files:"
echo "${{ steps.changed_files.outputs.files }}"
# Pass the list of changed files to Ruff.
echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential issue with xargs handling many files.

Using xargs without -r or proper handling could fail if there are many files or special characters in filenames. Additionally, if the file list is empty, xargs might read from stdin.

Apply this diff to make it more robust:

             echo "Linting the following files:"
             echo "${{ steps.changed_files.outputs.files }}"
             # Pass the list of changed files to Ruff.
-            echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check
+            echo "${{ steps.changed_files.outputs.files }}" | xargs -r ruff check
           fi

The -r flag prevents xargs from running if the input is empty, providing an extra safety layer alongside the conditional check.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
echo "${{ steps.changed_files.outputs.files }}" | xargs ruff check
echo "${{ steps.changed_files.outputs.files }}" | xargs -r ruff check
🤖 Prompt for AI Agents
.github/workflows/lint.yml around line 56: the current pipeline uses echo ... |
xargs ruff check which can break on empty input or filenames with spaces/special
chars; change it to produce null-delimited file names and invoke xargs with the
-r (do not run if input is empty) and --null flags so ruff is not executed on
empty input and filenames with special characters are handled safely.

tools/linter.sh Outdated
done

# Find merge base with origin/main
merge_base=$(git merge-base origin/main HEAD)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle missing origin/main gracefully.

If origin/main doesn't exist (e.g., in a fresh clone or different remote setup), the script will fail. Consider adding error handling or documentation about the requirement.

Apply this diff to add error handling:

 # Find merge base with origin/main
+if ! git rev-parse --verify origin/main >/dev/null 2>&1; then
+    echo "Error: origin/main branch not found. Ensure you have fetched from origin."
+    exit 1
+fi
 merge_base=$(git merge-base origin/main HEAD)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tools/linter.sh around line 32, the git merge-base call assumes origin/main
exists and will fail if it doesn't; add a pre-check for origin/main (e.g., git
rev-parse --verify --quiet refs/remotes/origin/main or git show-ref) and if that
check fails, fall back to a safe default (for example set merge_base to HEAD or
to HEAD~1) and print a clear warning so the script continues instead of
erroring; ensure subsequent logic uses the fallback value and that the script
exits non-fatally or instructs the user how to set up the remote if required.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
src/scanoss_winnowing/header_filter.py (1)

237-238: Dead code: .R extension mapping is unreachable.

Since path.suffix.lower() is called on line 376, the extension is always lowercase. The .R entry on line 238 will never match; only .r on line 237 is used.

Apply this diff to remove the redundant entry:

             '.r': 'r',
-            '.R': 'r',
             '.lua': 'lua',
src/scanoss_winnowing/scanossbase.py (1)

51-56: Typo in docstring: "quite" should be "quiet".

Apply this diff:

     def print_msg(self, *args, **kwargs):
         """
-        Print message if quite mode is not enabled
+        Print message if quiet mode is not enabled
         """
         if not self.quiet:
             self.print_stderr(*args, **kwargs)
🧹 Nitpick comments (4)
requirements-dev.txt (1)

5-5: Consider pinning the ruff version for reproducibility.

While the dependency is correctly added, consider specifying a version constraint (e.g., ruff>=0.8.0) to ensure reproducible builds and consistent linting behavior across environments and CI runs.

Apply this diff to pin the ruff version:

-ruff
+ruff>=0.8.0
src/scanoss_winnowing/header_filter.py (2)

293-311: Consider renaming for clarity.

The parameter skip_limit is stored as self.max_lines, which can be confusing. Consider using a consistent name like max_skip_lines for both the parameter and the attribute to improve readability.


60-204: Add unit tests validating language patterns with real-world code samples.

The comment, import, and license keyword patterns are production-ready and actively used in filtering logic, but test coverage is limited to Python files. Consider adding integration tests for JavaScript, Java, Go, Rust, C++, and other supported languages using realistic code samples to validate that the regex patterns correctly identify headers, comments, and imports across edge cases.

src/scanoss_winnowing/scanossbase.py (1)

83-94: Consider specifying encoding explicitly.

Lines 91 and 104 open files without specifying encoding. For consistency with the rest of the codebase (which uses UTF-8, e.g., line 471 in winnowing.py), consider specifying encoding='utf-8' to avoid platform-dependent behavior.

Apply this diff:

         if file:
-            with open(file, 'w') as f:
+            with open(file, 'w', encoding='utf-8') as f:
                 f.write(content)
         else:
             self.print_stdout(content)

And similarly for print_to_file_or_stderr:

         if file:
-            with open(file, 'w') as f:
+            with open(file, 'w', encoding='utf-8') as f:
                 f.write(msg)
         else:
             self.print_stderr(msg)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bf66688 and 11f06b9.

📒 Files selected for processing (8)
  • .github/workflows/lint.yml (1 hunks)
  • Makefile (1 hunks)
  • requirements-dev.txt (1 hunks)
  • setup.cfg (1 hunks)
  • src/scanoss_winnowing/header_filter.py (1 hunks)
  • src/scanoss_winnowing/scanossbase.py (1 hunks)
  • src/scanoss_winnowing/winnowing.py (9 hunks)
  • tools/linter.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/lint.yml
  • tools/linter.sh
🧰 Additional context used
🧬 Code graph analysis (2)
src/scanoss_winnowing/winnowing.py (2)
src/scanoss_winnowing/header_filter.py (2)
  • HeaderFilter (284-560)
  • filter (313-360)
src/scanoss_winnowing/scanossbase.py (3)
  • ScanossBase (28-107)
  • print_stderr (45-49)
  • print_debug (58-63)
src/scanoss_winnowing/header_filter.py (1)
src/scanoss_winnowing/scanossbase.py (4)
  • ScanossBase (28-107)
  • print_msg (51-56)
  • print_debug (58-63)
  • print_trace (65-70)
🔇 Additional comments (12)
Makefile (1)

32-42: LGTM! Lint targets are well-structured.

The new lint targets follow the existing Makefile conventions and provide clear help text. They appropriately delegate to the tools/linter.sh script with the correct flags for different linting modes.

src/scanoss_winnowing/header_filter.py (3)

313-360: LGTM!

The filter method correctly handles edge cases (empty content, unsupported languages, no implementation found) and properly applies the skip limit with appropriate None-checking.


362-386: LGTM!

The language detection logic is sound, with appropriate None-handling and debug logging.


536-549: LGTM!

The Go import block handling correctly identifies and skips multi-line import declarations enclosed in parentheses.

src/scanoss_winnowing/scanossbase.py (1)

28-107: LGTM!

The ScanossBase class provides a clean abstraction for logging and I/O operations. The implementation is straightforward and effective.

src/scanoss_winnowing/winnowing.py (7)

39-40: LGTM!

The new imports are correct and support the header filtering feature.


118-118: LGTM!

Inheriting from ScanossBase provides consistent logging functionality across the codebase.


161-219: LGTM!

The initialization correctly integrates the header filtering feature. The HeaderFilter is instantiated regardless of the skip_headers flag, which has minimal overhead but could be optimized lazily if needed.


502-506: LGTM!

The header filtering is correctly applied to the C-accelerated path after WFP generation.


571-579: LGTM!

The header filtering is correctly applied to the Python path, with proper guard against empty WFP content.


471-471: Acceptable tradeoff in encoding handling.

Line 471 uses decode('utf-8', 'ignore') which silently skips invalid UTF-8 bytes. While this ensures the code doesn't crash on encoding errors, it might affect header detection accuracy for files with encoding issues. This is an acceptable tradeoff since such files are edge cases.


396-436: Clarify the semantic meaning of start_line in the WFP format.

Line 422 sets start_line={line_offset}, where line_offset represents "the last line previous to real code" (as noted in line 418's comment). Given that implementation_start is 1-indexed and line_offset = implementation_start - 1, the current implementation records the last skipped line rather than the first implementation line. While this appears intentional based on the header_filter semantics, the field name start_line typically implies "where code begins" rather than "where code before it ends". Consider either: (1) renaming to skip_until_line for clarity, or (2) adjusting to start_line={line_offset + 1} to record the actual first implementation line. If the current behavior is correct, add a clarifying comment explaining that start_line records the offset point, not the implementation start line.

Comment on lines +463 to +469
if 'doc_string_start' in patterns and '"""' in line:
# Count how many quotes there are
count = line.count('"""')
if count == COMPLETE_DOCSTRING_QUOTE_COUNT: # Complete docstring in one line
return True, False
if count == 1: # Start of a multiline docstring
return True, True
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential false positive in docstring detection.

Line 465 counts """ occurrences in the entire line without considering if they appear inside string literals. For example, some_string = 'text with """ inside' would incorrectly be treated as a docstring. While unlikely in headers, consider using the existing regex patterns more precisely.

🤖 Prompt for AI Agents
In src/scanoss_winnowing/header_filter.py around lines 463 to 469, the code uses
line.count('"""') which can misclassify triple quotes that occur inside other
string literals; replace the simple count with the existing/appropriate compiled
regex from patterns (e.g. use patterns['doc_string_start'] or a dedicated
triple-quote regex) and use its finditer/search results to detect actual
triple-quote occurrences (ignoring matches inside other strings/escaped
contexts), then base the COMPLETE_DOCSTRING_QUOTE_COUNT and start-vs-multiline
logic on those regex matches rather than raw line.count().

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/scanoss_winnowing/scanossbase.py (3)

72-81: Consider condensing the formatting.

The multi-line formatting is valid but unconventional for this simple print statement. A one-liner would be more concise.

Apply this diff to condense:

     @staticmethod
     def print_stdout(*args, **kwargs):
         """
         Print message to STDOUT
         """
-        print(
-            *args,
-            file=sys.stdout,
-            **kwargs,
-        )
+        print(*args, file=sys.stdout, **kwargs)

83-94: Add error handling and explicit encoding for file operations.

File operations can raise FileNotFoundError, PermissionError, or OSError. Additionally, omitting the encoding parameter makes the behavior platform-dependent. Consider adding error handling and explicitly specifying encoding='utf-8' for cross-platform consistency.

Apply this diff to improve robustness:

     def print_to_file_or_stdout(self, content: str, file: str = None):
         """
         Print message to file if provided or stdout
         """
         if not content:
             return
 
         if file:
-            with open(file, 'w') as f:
-                f.write(content)
+            try:
+                with open(file, 'w', encoding='utf-8') as f:
+                    f.write(content)
+            except (FileNotFoundError, PermissionError, OSError) as e:
+                self.print_stderr(f"Error writing to file {file}: {e}")
+                raise
         else:
             self.print_stdout(content)

Alternatively, document in the docstring that callers should handle file I/O exceptions.


96-107: Add error handling and explicit encoding for file operations.

Similar to print_to_file_or_stdout, this method lacks error handling and explicit encoding. Apply the same improvements for consistency.

Apply this diff:

     def print_to_file_or_stderr(self, msg: str, file: str = None):
         """
         Print message to file if provided or stderr
         """
         if not msg:
             return
 
         if file:
-            with open(file, 'w') as f:
-                f.write(msg)
+            try:
+                with open(file, 'w', encoding='utf-8') as f:
+                    f.write(msg)
+            except (FileNotFoundError, PermissionError, OSError) as e:
+                self.print_stderr(f"Error writing to file {file}: {e}")
+                raise
         else:
             self.print_stderr(msg)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11f06b9 and 712016e.

📒 Files selected for processing (2)
  • CHANGELOG.md (2 hunks)
  • src/scanoss_winnowing/scanossbase.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • CHANGELOG.md
🔇 Additional comments (2)
src/scanoss_winnowing/scanossbase.py (2)

1-26: LGTM!

The MIT license header is properly formatted and the import is appropriate for the logging utilities provided.


28-70: LGTM!

The base class design is clean with well-structured logging levels. The initialization and conditional print methods (print_msg, print_debug, print_trace) correctly implement the intended behavior.

@eeisegn eeisegn merged commit 74cc6e7 into main Dec 17, 2025
2 checks passed
@eeisegn eeisegn deleted the feat/mscasso/SP-3820-add-skip-headers branch December 17, 2025 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants