Commit bfe204a

refactor: improve experimental source code pattern analysis of pypi packages
Signed-off-by: Carl Flottmann <[email protected]>
1 parent 256fd0c commit bfe204a

35 files changed: +2191 -720 lines changed

.pre-commit-config.yaml

Lines changed: 18 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -30,6 +30,7 @@ repos:
3030
- id: isort
3131
name: Sort import statements
3232
args: [--settings-path, pyproject.toml]
33+
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
3334

3435
# Add Black code formatters.
3536
- repo: https://github.com/ambv/black
@@ -38,6 +39,7 @@ repos:
3839
- id: black
3940
name: Format code
4041
args: [--config, pyproject.toml]
42+
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
4143
- repo: https://github.com/asottile/blacken-docs
4244
rev: 1.19.1
4345
hooks:
@@ -65,6 +67,7 @@ repos:
6567
files: ^src/macaron/|^tests/
6668
types: [text, python]
6769
additional_dependencies: [flake8-bugbear==22.10.27, flake8-builtins==2.0.1, flake8-comprehensions==3.10.1, flake8-docstrings==1.6.0, flake8-mutable==1.2.0, flake8-noqa==1.4.0, flake8-pytest-style==1.6.0, flake8-rst-docstrings==0.3.0, pep8-naming==0.13.2]
70+
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
6871
args: [--config, .flake8]
6972

7073
# Check GitHub Actions workflow files.
@@ -82,6 +85,7 @@ repos:
8285
entry: pylint
8386
language: python
8487
files: ^src/macaron/|^tests/
88+
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
8589
types: [text, python]
8690
args: [--rcfile, pyproject.toml]
8791

@@ -94,6 +98,7 @@ repos:
9498
language: python
9599
files: ^src/macaron/|^tests/
96100
types: [text, python]
101+
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
97102
args: [--show-traceback, --config-file, pyproject.toml]
98103

99104
# Check for potential security issues.
@@ -106,6 +111,7 @@ repos:
106111
files: ^src/macaron/|^tests/
107112
types: [text, python]
108113
additional_dependencies: ['bandit[toml]']
114+
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
109115

110116
# Enable a whole bunch of useful helper hooks, too.
111117
# See https://pre-commit.com/hooks.html for more hooks.
@@ -197,6 +203,18 @@ repos:
197203
always_run: true
198204
pass_filenames: false
199205

206+
# Checks that tests/malware_analyzer/pypi/resources/sourcecode_samples files do not have executable permissions
207+
# This is another measure to make sure the files can't be accidentally executed
208+
- repo: local
209+
hooks:
210+
- id: sourcecode-sample-permissions
211+
name: Sourcecode sample executable permissions checker
212+
entry: scripts/dev_scripts/samples_permissions_checker.sh
213+
language: system
214+
always_run: true
215+
pass_filenames: false
216+
217+
200218
# A linter for Golang
201219
- repo: https://github.com/golangci/golangci-lint
202220
rev: v1.64.6

.semgrepignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
# Items added to this file will be ignored by Semgrep.

CONTRIBUTING.md

Lines changed: 4 additions & 0 deletions
@@ -72,6 +72,10 @@ See below for instructions to set up the development environment.
7272
- PRs should be merged using the `Squash and merge` strategy. In most cases a single commit with
7373
a detailed commit message body is preferred. Make sure to keep the `Signed-off-by` line in the body.
7474

75+
### PyPI Malware Detection Contribution
76+
77+
Please see the [README for the malware analyzer](./src/macaron/malware_analyzer/README.md) for information on contributing Heuristics and code patterns.
78+
7579
## Branching model
7680

7781
* The `main` branch should be used as the base branch for pull requests. The `release` branch is designated for releases and should only be merged into when creating a new release for Macaron.

docker/Dockerfile.final

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ RUN : \
4646
&& . .venv/bin/activate \
4747
&& pip install --no-compile --no-cache-dir --upgrade pip setuptools \
4848
&& find $HOME/dist -depth \( -type f \( -name "macaron-*.whl" \) \) -exec pip install --no-compile --no-cache-dir '{}' \; \
49-
&& pip uninstall semgrep \
49+
&& pip uninstall semgrep -y \
5050
&& find $HOME/dist -depth \( -type f \( -name "semgrep-*.whl" \) \) -exec pip install --no-compile --no-cache-dir '{}' \; \
5151
&& rm -rf $HOME/dist \
5252
&& deactivate

docs/source/pages/developers_guide/apidoc/macaron.malware_analyzer.pypi_heuristics.sourcecode.rst

Lines changed: 8 additions & 0 deletions
@@ -9,6 +9,14 @@ macaron.malware\_analyzer.pypi\_heuristics.sourcecode package
99
Submodules
1010
----------
1111

12+
macaron.malware\_analyzer.pypi\_heuristics.sourcecode.pypi\_sourcecode\_analyzer module
13+
---------------------------------------------------------------------------------------
14+
15+
.. automodule:: macaron.malware_analyzer.pypi_heuristics.sourcecode.pypi_sourcecode_analyzer
16+
:members:
17+
:undoc-members:
18+
:show-inheritance:
19+
1220
macaron.malware\_analyzer.pypi\_heuristics.sourcecode.suspicious\_setup module
1321
------------------------------------------------------------------------------
1422

pyproject.toml

Lines changed: 6 additions & 5 deletions
@@ -37,6 +37,7 @@ dependencies = [
3737
"beautifulsoup4 >= 4.12.0,<5.0.0",
3838
"problog >= 2.2.6,<3.0.0",
3939
"cryptography >=44.0.0,<45.0.0",
40+
"semgrep == 1.113.0",
4041
]
4142
keywords = []
4243
# https://pypi.org/classifiers/
@@ -119,12 +120,14 @@ Issues = "https://github.com/oracle/macaron/issues"
119120
[tool.bandit]
120121
tests = []
121122
skips = ["B101"]
122-
123+
exclude_dirs = ['tests/malware_analyzer/pypi/resources/sourcecode_samples']
123124

124125
# https://github.com/psf/black#configuration
125126
[tool.black]
126127
line-length = 120
127-
128+
force-exclude = '''
129+
tests/malware_analyzer/pypi/resources/sourcecode_samples/
130+
'''
128131

129132
# https://github.com/commitizen-tools/commitizen
130133
# https://commitizen-tools.github.io/commitizen/bump/
@@ -170,7 +173,6 @@ exclude = [
170173
"SECURITY.md",
171174
]
172175

173-
174176
# https://pycqa.github.io/isort/
175177
[tool.isort]
176178
profile = "black"
@@ -181,7 +183,6 @@ skip_gitignore = true
181183

182184
# https://mypy.readthedocs.io/en/stable/config_file.html#using-a-pyproject-toml
183185
[tool.mypy]
184-
# exclude=
185186
show_error_codes = true
186187
show_column_numbers = true
187188
check_untyped_defs = true
@@ -209,7 +210,6 @@ module = [
209210
]
210211
ignore_missing_imports = true
211212

212-
213213
# https://pylint.pycqa.org/en/latest/user_guide/configuration/index.html
214214
[tool.pylint.MASTER]
215215
fail-under = 10.0
@@ -261,6 +261,7 @@ addopts = """-vv -ra --tb native \
261261
--doctest-modules --doctest-continue-on-failure --doctest-glob '*.rst' \
262262
--cov macaron \
263263
--ignore tests/integration \
264+
--ignore tests/malware_analyzer/pypi/resources/sourcecode_samples \
264265
""" # Consider adding --pdb
265266
# https://docs.python.org/3/library/doctest.html#option-flags
266267
doctest_optionflags = "IGNORE_EXCEPTION_DETAIL"

scripts/dev_scripts/samples_permissions_checker.sh

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
#!/usr/bin/env bash
2+
3+
# Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
4+
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
5+
6+
#
7+
# Checks if the files in tests/malware_analyzer/pypi/resources/sourcecode_samples have executable permissions,
8+
# failing if any do.
9+
#
10+
11+
# Strict bash options.
12+
#
13+
# -e: exit immediately if a command fails (with non-zero return code),
14+
# or if a function returns non-zero.
15+
#
16+
# -u: treat unset variables and parameters as error when performing
17+
# parameter expansion.
18+
# In case a variable ${VAR} is unset but we still need to expand,
19+
# use the syntax ${VAR:-} to expand it to an empty string.
20+
#
21+
# -o pipefail: set the return value of a pipeline to the value of the last
22+
# (rightmost) command to exit with a non-zero status, or zero
23+
# if all commands in the pipeline exit successfully.
24+
#
25+
# Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html.
26+
set -euo pipefail
27+
28+
MACARON_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && cd ../.. && pwd)"
29+
SAMPLES_PATH="${MACARON_DIR}/tests/malware_analyzer/pypi/resources/sourcecode_samples"
30+
31+
# any files have any of the executable bits set
32+
executables=$( ( find "$SAMPLES_PATH" -type f -perm -u+x -o -type f -perm -g+x -o -type f -perm -o+x | sed "s|$MACARON_DIR/||"; git ls-files "$SAMPLES_PATH" --full-name) | sort | uniq -d)
33+
if [ -n "$executables" ]; then
34+
echo "The following files should not have any executable permissions:"
35+
echo "$executables"
36+
exit 1
37+
fi
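
For readers more comfortable with Python than with `find`, the permission test in the script above can be sketched as follows. This is an illustrative equivalent and not part of the commit: the function name `find_executable_files` is hypothetical, and the sketch omits the script's intersection with `git ls-files`, so it would also report untracked files.

```python
import stat
from pathlib import Path

# Any of the user, group, or other executable bits.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def find_executable_files(root: Path) -> list[Path]:
    """Return regular files under ``root`` that have any executable bit set.

    Hypothetical helper mirroring the ``find ... -perm`` expression in the
    shell script; symlinks are skipped, as ``find -type f`` skips them.
    """
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file()
        and not path.is_symlink()
        and path.stat().st_mode & EXEC_BITS
    )
```

A caller could treat a non-empty result as a failure, in the same way the script exits with status 1 when its `executables` variable is non-empty.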

src/macaron/__main__.py

Lines changed: 20 additions & 3 deletions
@@ -96,6 +96,10 @@ def analyze_slsa_levels_single(analyzer_single_args: argparse.Namespace) -> None
9696

9797
global_config.local_maven_repo = user_provided_local_maven_repo
9898

99+
if analyzer_single_args.force_analyze_source and not analyzer_single_args.analyze_source:
100+
logger.error("'--force-analyze-source' requires '--analyze-source'.")
101+
sys.exit(os.EX_USAGE)
102+
99103
analyzer = Analyzer(global_config.output_path, global_config.build_log_path)
100104

101105
# Initiate reporters.
@@ -172,8 +176,9 @@ def analyze_slsa_levels_single(analyzer_single_args: argparse.Namespace) -> None
172176
analyzer_single_args.sbom_path,
173177
deps_depth,
174178
provenance_payload=prov_payload,
175-
validate_malware=analyzer_single_args.validate_malware,
176179
verify_provenance=analyzer_single_args.verify_provenance,
180+
analyze_source=analyzer_single_args.analyze_source,
181+
force_analyze_source=analyzer_single_args.force_analyze_source,
177182
)
178183
sys.exit(status_code)
179184

@@ -477,10 +482,22 @@ def main(argv: list[str] | None = None) -> None:
477482
)
478483

479484
single_analyze_parser.add_argument(
480-
"--validate-malware",
485+
"--analyze-source",
481486
required=False,
482487
action="store_true",
483-
help=("Enable malware validation."),
488+
help=(
489+
"For improved malware detection, analyze the source code of the"
490+
+ " (PyPI) package using a textual scan and dataflow analysis."
491+
),
492+
)
493+
494+
single_analyze_parser.add_argument(
495+
"--force-analyze-source",
496+
required=False,
497+
action="store_true",
498+
help=(
499+
"Forces PyPI sourcecode analysis to run regardless of other heuristic results. Requires '--analyze-source'."
500+
),
484501
)
485502

486503
single_analyze_parser.add_argument(
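
The dependency between the two new flags can be reproduced in isolation. The sketch below is a minimal stand-in, not Macaron's parser: `build_parser` and `source_flags_are_valid` are hypothetical names, and only the two flag names are taken from the diff. Where the commit logs an error and exits with `os.EX_USAGE`, the sketch returns a boolean so the rule is easy to test.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser that carries only the two source-analysis flags."""
    parser = argparse.ArgumentParser(prog="analyze-sketch")
    parser.add_argument("--analyze-source", action="store_true")
    parser.add_argument("--force-analyze-source", action="store_true")
    return parser


def source_flags_are_valid(args: argparse.Namespace) -> bool:
    """Return False only when forcing is requested without enabling analysis."""
    return args.analyze_source or not args.force_analyze_source
```

Checking the combination after parsing, rather than encoding it in `argparse`, keeps both options as plain `store_true` flags, which matches how the diff performs the check in `analyze_slsa_levels_single`.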

src/macaron/config/defaults.ini

Lines changed: 14 additions & 0 deletions
@@ -600,3 +600,17 @@ major_threshold = 20
600600
epoch_threshold = 3
601601
# The number of days +/- the day of publish the calendar versioning day may be.
602602
day_publish_error = 4
603+
604+
# disable default semgrep rulesets here (i.e. all rule IDs in a Semgrep .yaml file) using comma separated ruleset names. Use the name
605+
# without the .yaml extension here. Currently, we disable the exfiltration rulesets by default due to a high false positive rate.
606+
disabled_default_rulesets = exfiltration
607+
# disable individual rules here (i.e. individual rule IDs inside a Semgrep .yaml file) using comma separated rule IDs. You may also
608+
# provide the IDs of your custom semgrep rules here too, as all Semgrep rule IDs must be unique.
609+
disabled_rules =
610+
# absolute path to a directory where a custom set of semgrep rules for source code analysis are stored. These will be included
611+
# with Macaron's default rules. The path will be normalised to the OS path type.
612+
custom_semgrep_rules =
613+
# disable custom semgrep rulesets here (i.e. all rule IDs in a Semgrep .yaml file) using comma separated ruleset names. Use the name
614+
# without the .yaml extension here. Note, this will be ignored if a path to custom semgrep rules is not provided. These file names must
615+
# be unique.
616+
disabled_custom_rulesets =
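
The four options above are comma-separated lists or a single path, and an empty value means "nothing configured". A minimal sketch of reading them with the standard `configparser` module follows. The section name `example.section` is a placeholder, because the hunk does not show which section of `defaults.ini` holds these keys, and `load_settings` and `read_list` are hypothetical helpers rather than Macaron functions.

```python
import configparser

# Placeholder section name: the real section is not visible in the diff.
SAMPLE = """
[example.section]
disabled_default_rulesets = exfiltration
disabled_rules =
custom_semgrep_rules =
disabled_custom_rulesets =
"""


def read_list(section: configparser.SectionProxy, key: str) -> list[str]:
    """Split a comma-separated option into trimmed, non-empty items."""
    raw = section.get(key, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


def load_settings(text: str, section_name: str = "example.section") -> dict:
    """Parse the four options into lists and an optional path string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser[section_name]
    return {
        "disabled_default_rulesets": read_list(section, "disabled_default_rulesets"),
        "disabled_rules": read_list(section, "disabled_rules"),
        # An empty path means no custom rules directory was supplied.
        "custom_semgrep_rules": section.get("custom_semgrep_rules", "").strip() or None,
        "disabled_custom_rulesets": read_list(section, "disabled_custom_rulesets"),
    }
```

Filtering out empty items means a trailing comma or a blank value yields an empty list instead of a list containing an empty string.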

src/macaron/errors.py

Lines changed: 4 additions & 0 deletions
@@ -109,3 +109,7 @@ class HeuristicAnalyzerValueError(MacaronError):
109109

110110
class LocalArtifactFinderError(MacaronError):
111111
"""Happens when there is an error looking for local artifacts."""
112+
113+
114+
class SourceCodeError(MacaronError):
115+
"""Error for operations on package source code."""

src/macaron/malware_analyzer/README.md

Lines changed: 50 additions & 1 deletion
@@ -1,4 +1,4 @@
1-
# Implementation of Heuristic Malware Detector
1+
# Implementation of Malware Detector
22

33
## PyPI Ecosystem
44

@@ -52,20 +52,69 @@ When a heuristic fails, with `HeuristicResult.FAIL`, then that is an indicator b
5252
- **Rule**: Return `HeuristicResult.FAIL` if the major or epoch is abnormally high; otherwise, return `HeuristicResult.PASS`.
5353
- **Dependency**: Will be run if the One Release heuristic fails.
5454

55+
### Source Code Analysis with Semgrep
56+
57+
The following analyzer has been included as an optional feature, available by supplying `--analyze-source` in the CLI to `macaron analyze`:
58+
59+
**PyPI Source Code Analyzer**
60+
- **Description**: Uses Semgrep, with default rules written in `src/macaron/resources/pypi_malware_rules` and custom rules available by supplying a path to `custom_semgrep_rules` in `defaults.ini`, to scan the package `.tar` source code.
61+
- **Rule**: If any Semgrep rule is triggered, the heuristic fails with `HeuristicResult.FAIL` and subsequently fails the package with `CheckResultType.FAILED`. If no rule is triggered, the heuristic passes with `HeuristicResult.PASS` and the `CheckResultType` result from the combination of all other heuristics is maintained.
62+
- **Dependency**: Will be run if the Source Code Repo heuristic fails. This dependency can be bypassed by supplying `--force-analyze-source` in the CLI, along with `--analyze-source`.
63+
64+
This feature is currently a work in progress and supports detection of code obfuscation techniques and remote exfiltration behaviors. It uses Semgrep OSS for detection. The following `defaults.ini` options add custom rules and disable rules:
65+
- `disabled_default_rulesets`: a comma-separated list of default Semgrep rule file names (excluding the `.yaml` extension). Every rule ID in each listed file is disabled.
66+
- `disabled_rules`: a comma-separated list of individual rule IDs to disable (from both the default and custom rules).
67+
- `custom_semgrep_rules`: an absolute path to a directory containing custom Semgrep `.yaml` files to be run alongside the default ones.
68+
- `disabled_custom_rulesets`: a comma-separated list of custom Semgrep rule file names (excluding the `.yaml` extension). Every rule ID in each listed file is disabled.
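
To make the interaction between ruleset-level and rule-level disabling concrete, here is a small sketch of the selection logic the options describe. It is an assumption about the behavior, not the analyzer's actual code: `active_rule_ids` is a hypothetical function, and the rule IDs used with it are invented for illustration.

```python
def active_rule_ids(
    rulesets: dict[str, list[str]],
    disabled_rulesets: set[str],
    disabled_rules: set[str],
) -> list[str]:
    """Return rule IDs that survive both ruleset-level and rule-level disabling.

    ``rulesets`` maps a rule file name without its ``.yaml`` extension to the
    rule IDs defined in that file.
    """
    return [
        rule_id
        for name, rule_ids in rulesets.items()
        if name not in disabled_rulesets
        for rule_id in rule_ids
        if rule_id not in disabled_rules
    ]
```

For example, with hypothetical rulesets `{"obfuscation": ["obfuscation_a", "obfuscation_b"], "exfiltration": ["exfiltration_c"]}`, disabling the `exfiltration` ruleset and the rule `obfuscation_b` would leave only `obfuscation_a` active.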
69+
5570
### Contributing
5671

5772
When contributing an analyzer, it must meet the following requirements:
5873

5974
- The analyzer must be implemented in a separate file, placed in the relevant folder based on what it analyzes ([metadata](./pypi_heuristics/metadata/) or [sourcecode](./pypi_heuristics/sourcecode/)).
6075
- The analyzer must inherit from the `BaseHeuristicAnalyzer` class and implement the `analyze` function, returning relevant information specific to the analysis.
6176
- The analyzer name must be added to [heuristics.py](./pypi_heuristics/heuristics.py) file so it can be used for rule combinations in [detect_malicious_metadata_check.py](../slsa_analyzer/checks/detect_malicious_metadata_check.py)
77+
- The analyzer must be added to the list of analyzers in `detect_malicious_metadata_check.py` to be run.
6278
- Update the `malware_rules_problog_model` in [detect_malicious_metadata_check.py](../slsa_analyzer/checks/detect_malicious_metadata_check.py) with logical statements where the heuristic should be included. When adding new rules, please follow the following guidelines:
6379
- Provide a [confidence value](../slsa_analyzer/checks/check_result.py) using the `Confidence` enum.
6480
- Ensure it is assigned to the `problog_result_access` string variable, otherwise it will not be queried and evaluated.
6581
- Assign a rule ID to the rule. This will be used to backtrack to determine if it was triggered.
6682
- Make sure to wrap pass/fail statements in `passed()` and `failed()`. Not doing so may result in undesirable behaviour, see the comments in the model for more details.
6783
- If there are commonly used combinations introduced by adding the heuristic, combine and justify them at the top of the static model (see `quickUndetailed` and `forceSetup` as current examples).
6884

85+
**Contributing Code Pattern Rules**
86+
87+
When contributing more Semgrep rules for `pypi_sourcecode_analyzer.py` to use, the following requirements must be met:
88+
89+
- Semgrep `.yaml` rules are stored in `src/macaron/resources/pypi_malware_rules` and are named after the category of code behaviors they detect.
90+
- If the rule falls under an already defined category, place it within that `.yaml` file; otherwise, create a new `.yaml` file named after the category.
91+
- Each rule ID must be prefixed by the category followed by a single underscore ('_'), so for obfuscation rules in `obfuscation.yaml` each rule ID is prefixed with `obfuscation_`, followed by an ID which uses a hyphen ('-') as a separator.
92+
- Tests must be written for each rule contributed. These are stored in `tests/malware_analyzer/pypi/test_pypi_sourcescode_analyzer.py`.
93+
- These tests are written on a per-category basis, running each category individually. Each category must have a folder under `tests/malware_analyzer/pypi/resources/sourcecode_samples`.
94+
- Within these folders, there must be sample code patterns for testing, and a file `expected_results.json` with the expected JSON output of the analyzer for that category.
95+
- Each sample code pattern `.py` file must not have executable permissions and must include code that prevents it from being accidentally imported or run. The current files use this method:
96+
97+
```
98+
"""
99+
Running this code will not produce any malicious behavior, but code isolation measures are
100+
in place for safety.
101+
"""
102+
103+
import sys
104+
105+
# ensure no symbols are exported so this code cannot accidentally be used
106+
__all__ = []
107+
sys.exit()
108+
109+
def test_function():
110+
"""
111+
All code to be tested will be defined inside this function, so it is all local to it. This is
112+
to isolate the code to be tested, as it exists to replicate the patterns present in malware
113+
samples.
114+
"""
115+
sys.exit()
116+
```
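
The rule ID naming convention described in the requirements above (category, a single underscore, then a hyphen-separated ID) can be checked mechanically. The sketch below is one possible check and not part of the commit: `is_valid_rule_id` is a hypothetical helper, the example IDs are invented, and it assumes the part after the prefix consists of lowercase letters and digits.

```python
import re

# Assumed shape of the part after the category prefix: lowercase
# alphanumeric words joined by single hyphens.
_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_rule_id(rule_id: str, category: str) -> bool:
    """Check that ``rule_id`` is ``<category>_<hyphen-separated-name>``."""
    prefix = f"{category}_"
    if not rule_id.startswith(prefix):
        return False
    return _NAME.fullmatch(rule_id[len(prefix):]) is not None
```

A check like this could run over every ID in a category's `.yaml` file as part of the per-category tests, so that a misnamed rule fails early.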
117+
69118
### Confidence Score Motivation
70119

71120
The original seven heuristics which started this work were Empty Project Link, Unreachable Project Links, One Release, High Release Frequency, Unchanged Release, Closer Release Join Date, and Suspicious Setup. These heuristics (excluding those with a dependency) were run on 1167 packages from trusted organizations, with the following results:

src/macaron/malware_analyzer/pypi_heuristics/heuristics.py

Lines changed: 3 additions & 0 deletions
@@ -37,6 +37,9 @@ class Heuristics(str, Enum):
3737
#: Indicates that the package has an unusually large version number for a single release.
3838
ANOMALOUS_VERSION = "anomalous_version"
3939

40+
#: Indicates that the package source code contains suspicious code patterns.
41+
SUSPICIOUS_PATTERNS = "suspicious_patterns"
42+
4043

4144
class HeuristicResult(str, Enum):
4245
"""Result type indicating the outcome of a heuristic."""
