Semgrep Static Analysis
When to Use Semgrep
Ideal scenarios:
- Quick security scans (minutes, not hours)
- Pattern-based bug detection
- Enforcing coding standards and best practices
- Finding known vulnerability patterns
- Single-file analysis without complex data flow
- First-pass analysis before deeper tools
Consider CodeQL instead when:
- You need interprocedural taint tracking across files
- Complex data flow analysis is required
- You are analyzing custom proprietary frameworks
When NOT to Use
Do NOT use this skill for:
- Complex interprocedural data flow analysis (use CodeQL instead)
- Binary analysis or compiled code without source
- Custom deep semantic analysis requiring AST/CFG traversal
- Tracking taint across many function boundaries
Installation
# pip
python3 -m pip install semgrep
# Homebrew
brew install semgrep
# Docker
docker run --rm -v "${PWD}:/src" semgrep/semgrep semgrep --config auto /src
# Update
python3 -m pip install --upgrade semgrep
Core Workflow
1. Quick Scan
semgrep --config auto . # Auto-detect rules
semgrep --config p/default --metrics=off . # Disable telemetry for proprietary code (--config auto requires metrics to be on)
2. Use Rulesets
semgrep --config p/<RULESET> . # Single ruleset
semgrep --config p/security-audit --config p/trailofbits . # Multiple
| Ruleset | Description |
|---|---|
| `p/default` | General security and code quality |
| `p/security-audit` | Comprehensive security rules |
| `p/owasp-top-ten` | OWASP Top 10 vulnerabilities |
| `p/cwe-top-25` | CWE Top 25 vulnerabilities |
| `p/r2c-security-audit` | r2c security audit rules |
| `p/trailofbits` | Trail of Bits security rules |
| `p/python` | Python-specific |
| `p/javascript` | JavaScript-specific |
| `p/golang` | Go-specific |
3. Output Formats
semgrep --config p/security-audit --sarif -o results.sarif . # SARIF
semgrep --config p/security-audit --json -o results.json . # JSON
semgrep --config p/security-audit --dataflow-traces . # Show data flow
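The JSON report lends itself to scripted triage. The following is a minimal sketch, assuming the report has a top-level `results` list whose entries carry `check_id`, `path`, `start.line`, and `extra.severity` (the layout recent Semgrep releases emit; verify against your version). The helper name `summarize_findings` and the sample data are hypothetical.

```python
import json
from collections import Counter


def summarize_findings(report: dict) -> dict:
    """Count findings per severity and list their locations.

    Assumes the report layout described above; missing fields fall back
    to placeholder values instead of raising.
    """
    by_severity = Counter()
    locations = []
    for finding in report.get("results", []):
        severity = finding.get("extra", {}).get("severity", "UNKNOWN")
        by_severity[severity] += 1
        line = finding.get("start", {}).get("line", 0)
        locations.append(
            f"{finding.get('path', '?')}:{line} {finding.get('check_id', '?')}"
        )
    return {"by_severity": dict(by_severity), "locations": locations}


# Hypothetical sample shaped like `semgrep --json` output.
SAMPLE_REPORT = json.loads("""
{
  "results": [
    {"check_id": "hardcoded-password", "path": "app.py",
     "start": {"line": 12}, "extra": {"severity": "ERROR", "message": "..."}},
    {"check_id": "sql-injection", "path": "db.py",
     "start": {"line": 40}, "extra": {"severity": "WARNING", "message": "..."}}
  ],
  "errors": []
}
""")

if __name__ == "__main__":
    print(summarize_findings(SAMPLE_REPORT))
```

To run it against a real scan, load `results.json` with `json.load` and pass the resulting dictionary to `summarize_findings`.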
4. Scan Specific Paths
semgrep --config p/python app.py # Single file
semgrep --config p/javascript src/ # Directory
semgrep --config auto --include='**/test/**' . # Scan only paths matching the glob (here, test directories)
Writing Custom Rules
Basic Structure
rules:
  - id: hardcoded-password
    languages: [python]
    message: "Hardcoded password detected: $PASSWORD"
    severity: ERROR
    pattern: password = "$PASSWORD"
Pattern Syntax
| Syntax | Description | Example |
|---|---|---|
| `...` | Match anything | `func(...)` |
| `$VAR` | Capture metavariable | `$FUNC($INPUT)` |
| `<... ...>` | Deep expression match | `<... user_input ...>` |
Pattern Operators
| Operator | Description |
|---|---|
| `pattern` | Match exact pattern |
| `patterns` | All must match (AND) |
| `pattern-either` | Any matches (OR) |
| `pattern-not` | Exclude matches |
| `pattern-inside` | Match only inside context |
| `pattern-not-inside` | Match only outside context |
| `pattern-regex` | Regex matching |
| `metavariable-regex` | Regex on captured value |
| `metavariable-comparison` | Compare values |
Combining Patterns
rules:
  - id: sql-injection
    languages: [python]
    message: "Potential SQL injection"
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: cursor.execute($QUERY)
          - pattern: db.execute($QUERY)
      # pattern-not takes a pattern directly, not a nested list
      - pattern-not: cursor.execute("...", (...))
      - metavariable-regex:
          metavariable: $QUERY
          regex: .*\+.*|.*\.format\(.*|.*%.*
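The `metavariable-regex` in the rule above inspects only the source text bound to `$QUERY`. The sketch below shows roughly which expressions that regex would flag. It uses Python's `re.match` as a stand-in for Semgrep's own regex engine, which is an assumption, and the helper name `looks_dynamic` is hypothetical.

```python
import re

# Same regex as in the rule: string concatenation, str.format, or % formatting.
DYNAMIC_QUERY = re.compile(r".*\+.*|.*\.format\(.*|.*%.*")


def looks_dynamic(query_expr: str) -> bool:
    """Return True if the expression text appears to build SQL dynamically."""
    return DYNAMIC_QUERY.match(query_expr) is not None


if __name__ == "__main__":
    for expr in [
        '"SELECT * FROM users WHERE id = " + user_id',
        '"SELECT * FROM users WHERE id = {}".format(user_id)',
        '"SELECT * FROM users WHERE id = %s" % user_id',
        '"SELECT * FROM users"',
        "query",
        "\"SELECT * FROM t WHERE name LIKE '%a'\"",
    ]:
        print(f"{looks_dynamic(expr)!s:5}  {expr}")
```

The last two inputs show the limits of a text-based filter: a query assembled earlier and passed as a bare variable is not flagged, while a constant string that merely contains `%` is.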
Taint Mode (Data Flow)
Simple pattern matching finds obvious cases:
# Pattern `os.system($CMD)` catches this:
os.system(user_input) # Found
But misses indirect flows:
# Same pattern misses this:
cmd = user_input
processed = cmd.strip()
os.system(processed) # Missed - no direct match
Taint mode tracks data through assignments and transformations:
- Source: Where untrusted data enters (`user_input`)
- Propagators: How it flows (`cmd = ...`, `processed = ...`)
- Sanitizers: What makes it safe (`shlex.quote()`)
- Sink: Where it becomes dangerous (`os.system()`)
rules:
  - id: command-injection
    languages: [python]
    message: "User input flows to command execution"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: os.system($SINK)
      - pattern: subprocess.call($SINK, shell=True)
      - pattern: subprocess.run($SINK, shell=True, ...)
    pattern-sanitizers:
      - pattern: shlex.quote(...)
      - pattern: int(...)
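To make the sanitizer's role concrete, the sketch below builds a shell command string with and without `shlex.quote`. Nothing is executed; the function names and the payload are hypothetical and only illustrate why the rule treats `shlex.quote(...)` as breaking the taint flow.

```python
import shlex


def build_unsafe(filename: str) -> str:
    """Interpolate input directly: a taint rule would flag this reaching os.system."""
    return "ls -l " + filename


def build_quoted(filename: str) -> str:
    """Pass input through shlex.quote, which the rule lists as a sanitizer."""
    return "ls -l " + shlex.quote(filename)


if __name__ == "__main__":
    payload = "notes.txt; rm -rf /tmp/demo"
    print(build_unsafe(payload))  # the shell would see two commands
    print(build_quoted(payload))  # the payload stays a single quoted argument
```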
Full Rule with Metadata
rules:
  - id: flask-sql-injection
    languages: [python]
    message: "SQL injection: user input flows to query without parameterization"
    severity: ERROR
    metadata:
      cwe: "CWE-89: SQL Injection"
      owasp: "A03:2021 - Injection"
      confidence: HIGH
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: cursor.execute($QUERY)
      - pattern: db.execute($QUERY)
    pattern-sanitizers:
      - pattern: int(...)
    # Illustrative autofix: `params` is a placeholder, so review the result by hand
    fix: cursor.execute($QUERY, (params,))
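The difference this rule targets can be reproduced with the standard-library `sqlite3` module. This is a minimal sketch: the in-memory table and the function names are hypothetical, and the `?` placeholder style is the one `sqlite3` uses (other drivers may use `%s`).

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two demo users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    return conn


def find_user_unsafe(conn: sqlite3.Connection, user_id: str) -> list:
    """Concatenate input into the SQL text: the pattern the rule reports."""
    return conn.execute("SELECT name FROM users WHERE id = " + user_id).fetchall()


def find_user_safe(conn: sqlite3.Connection, user_id: str) -> list:
    """Bind input as a parameter, so it is treated as data, not SQL."""
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()


if __name__ == "__main__":
    conn = make_db()
    payload = "1 OR 1=1"
    print(find_user_unsafe(conn, payload))  # the injected condition matches every row
    print(find_user_safe(conn, payload))    # the payload is compared as a value; no rows
```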
Testing Rules
Test File Format
# flask-sql-injection.py (test file; shares its base name with the rule file, e.g. flask-sql-injection.yaml)
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))
semgrep --test rules/
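Since false negatives are silent, it can help to confirm that every rule has at least one `ruleid:` case and one `ok:` case before trusting a passing test run. The sketch below is one way to check that. The helper names and sample source are hypothetical, and it only recognizes annotations written in `#` comments.

```python
import re
from collections import defaultdict

# Matches "# ruleid: <id>" and "# ok: <id>" annotation comments.
ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*([\w.-]+)")


def annotation_counts(source: str) -> dict:
    """Map each rule id to the number of ruleid/ok annotations in the test source."""
    counts = defaultdict(lambda: {"ruleid": 0, "ok": 0})
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if match:
            kind, rule_id = match.groups()
            counts[rule_id][kind] += 1
    return dict(counts)


def missing_cases(source: str) -> list:
    """Return rule ids that lack either a positive or a negative test case."""
    return sorted(
        rule_id
        for rule_id, c in annotation_counts(source).items()
        if c["ruleid"] == 0 or c["ok"] == 0
    )


SAMPLE_TEST = '''
def test_vulnerable():
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))

def test_other():
    # ruleid: hardcoded-password
    password = "hunter2"
'''

if __name__ == "__main__":
    print(annotation_counts(SAMPLE_TEST))
    print(missing_cases(SAMPLE_TEST))
```

In the sample, `hardcoded-password` is reported because it has a positive case but no `ok:` case guarding against over-matching.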
CI/CD Integration (GitHub Actions)
name: Semgrep
on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 0 1 * *' # Monthly
jobs:
  semgrep:
    runs-on: ubuntu-latest
    container:
      image: semgrep/semgrep
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Required for diff-aware scanning
      - name: Run Semgrep
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            semgrep ci --baseline-commit ${{ github.event.pull_request.base.sha }}
          else
            semgrep ci
          fi
        env:
          SEMGREP_RULES: >-
            p/security-audit
            p/owasp-top-ten
            p/trailofbits
Configuration
.semgrepignore
tests/fixtures/
**/testdata/
generated/
vendor/
node_modules/
Suppress False Positives
password = get_from_vault() # nosemgrep: hardcoded-password
dangerous_but_safe() # nosemgrep
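A bare `# nosemgrep` silences every rule on its line, so blanket suppressions are worth auditing periodically. The sketch below lists suppression comments and the rule ids they name. The helper name and sample are hypothetical, and only `#`-style comments are handled.

```python
import re

# "nosemgrep", optionally followed by ": rule-id[, rule-id...]".
NOSEMGREP = re.compile(r"#\s*nosemgrep(?::\s*([\w.,\s-]+))?")


def audit_suppressions(source: str) -> list:
    """Return (line_number, rule_ids) tuples; an empty id list is a blanket suppression."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = NOSEMGREP.search(line)
        if match:
            ids = [r.strip() for r in (match.group(1) or "").split(",") if r.strip()]
            found.append((number, ids))
    return found


SAMPLE = '''password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
plain_call()
'''

if __name__ == "__main__":
    for number, ids in audit_suppressions(SAMPLE):
        label = ", ".join(ids) if ids else "ALL RULES (blanket)"
        print(f"line {number}: suppresses {label}")
```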
Performance
semgrep --config rules/ --time . # Check rule performance
ulimit -n 4096 # Increase file descriptors for large codebases
Path Filtering in Rules
rules:
  - id: my-rule
    paths:
      include: [src/]
      exclude: [src/generated/]
Third-Party Rules
pip install semgrep-rules-manager
semgrep-rules-manager --dir ~/semgrep-rules download
semgrep -f ~/semgrep-rules .
Rationalizations to Reject
| Shortcut | Why It's Wrong |
|---|---|
| "Semgrep found nothing, code is clean" | Semgrep is pattern-based; it can't track complex data flow across functions |
| "I wrote a rule, so we're covered" | Rules need testing with semgrep --test; false negatives are silent |
| "Taint mode catches injection" | Only if you defined all sources, sinks, AND sanitizers correctly |
| "Pro rules are comprehensive" | Pro rules are good but not exhaustive; supplement with custom rules for your codebase |
| "Too many findings = noisy tool" | High finding count often means real problems; tune rules, don't disable them |
Resources
- Registry: https://semgrep.dev/explore
- Playground: https://semgrep.dev/playground
- Docs: https://semgrep.dev/docs/
- Trail of Bits Rules: https://github.com/trailofbits/semgrep-rules
- Blog: https://semgrep.dev/blog/