---
name: Finding-Duplicate-Functions
description: Use when auditing a codebase for semantic duplication - functions that do the same thing but have different names or implementations. Especially useful for LLM-generated codebases where new functions are often created rather than reusing existing ones.
---
# Finding Duplicate-Intent Functions
## Overview
LLM-generated codebases accumulate semantic duplicates: functions that serve the same purpose but were implemented independently. Classical copy-paste detectors such as jscpd find syntactic duplicates but miss "same intent, different implementation."
This skill combines two techniques: classical extraction of a function catalog, followed by LLM-powered intent clustering, run as the phases below.
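For illustration, here is a hypothetical pair with the same intent but no shared tokens for a copy-paste detector to match (names and bodies are invented; the same pattern appears in any language):
```bash
# Hypothetical duplicates: identical intent ("make a URL slug"),
# disjoint implementations. Token-based tools like jscpd see no overlap.
slugify() { tr '[:upper:]' '[:lower:]' <<<"$1" | tr ' ' '-'; }
to_url_slug() { echo "$1" | awk '{ gsub(/ /, "-"); print tolower($0) }'; }

slugify "Hello World"      # hello-world
to_url_slug "Hello World"  # hello-world
```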
## When to Use
- Codebase has grown organically with multiple contributors (human or LLM)
- You suspect utility functions have been reimplemented multiple times
- Before major refactoring to identify consolidation opportunities
- After jscpd has been run and syntactic duplicates are already handled
## Quick Reference
| Phase | Tool | Model | Output |
|-------|------|-------|--------|
| 1. Extract | `scripts/extract-functions.sh` | - | `catalog.json` |
| 2. Categorize | `scripts/categorize-prompt.md` | haiku | `categorized.json` |
| 3. Split | `scripts/prepare-category-analysis.sh` | - | `categories/*.json` |
| 4. Detect | `scripts/find-duplicates-prompt.md` | opus | `duplicates/*.json` |
| 5. Report | `scripts/generate-report.sh` | - | `duplicates-report.md` |
## Process
```dot
digraph duplicate_detection {
rankdir=TB;
node [shape=box];
extract [label="1. Extract function catalog\n./scripts/extract-functions.sh"];
categorize [label="2. Categorize by domain\n(haiku subagent)"];
split [label="3. Split into categories\n./scripts/prepare-category-analysis.sh"];
detect [label="4. Find duplicates per category\n(opus subagent per category)"];
report [label="5. Generate report\n./scripts/generate-report.sh"];
review [label="6. Human review & consolidate"];
extract -> categorize -> split -> detect -> report -> review;
}
```
### Phase 1: Extract Function Catalog
```bash
./scripts/extract-functions.sh src/ -o catalog.json
```
Options:
- `-o FILE`: Output file (default: stdout)
- `-c N`: Lines of context to capture (default: 15)
- `-t GLOB`: File types (default: `*.ts,*.tsx,*.js,*.jsx`)
- `--include-tests`: Include test files (excluded by default)
Test files (`*.test.*`, `*.spec.*`, `__tests__/**`) are excluded by default since test utilities are less likely to be consolidation candidates.
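The catalog's exact schema is defined by the script; as a sketch, each entry likely carries the function's name, location, and surrounding context. The field names below are illustrative assumptions, not guaranteed output:
```bash
# Peek at the first entry (assumes catalog.json is a JSON array):
jq '.[0]' catalog.json
# {
#   "name": "formatError",
#   "file": "src/utils/errors.ts",
#   "line": 12,
#   "context": "export function formatError(err: unknown): string { ... }"
# }
```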
### Phase 2: Categorize by Domain
Dispatch a **haiku** subagent using the prompt in `scripts/categorize-prompt.md`.
Insert the contents of `catalog.json` where indicated in the prompt template. Save output as `categorized.json`.
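How you splice the catalog in depends on the template; one minimal sketch, assuming the template marks the insertion point with a literal `{{CATALOG_JSON}}` placeholder (check `scripts/categorize-prompt.md` for the actual marker):
```bash
# Replace the placeholder line with the contents of catalog.json:
sed -e '/{{CATALOG_JSON}}/r catalog.json' \
    -e '/{{CATALOG_JSON}}/d' \
    scripts/categorize-prompt.md > /tmp/categorize-prompt-filled.md
```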
### Phase 3: Split into Categories
```bash
./scripts/prepare-category-analysis.sh categorized.json ./categories
```
Creates one JSON file per category. Only categories with 3+ functions are worth analyzing.
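To see which categories clear that bar, list sizes first (assumes each category file is a JSON array of function records):
```bash
# Function count per category, largest first:
for f in categories/*.json; do
  printf '%4d  %s\n' "$(jq 'length' "$f")" "$f"
done | sort -rn
```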
### Phase 4: Find Duplicates (Per Category)
For each category file in `./categories/`, dispatch an **opus** subagent using the prompt in `scripts/find-duplicates-prompt.md`.
Save each output as `./duplicates/{category}.json`.
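A sketch of the per-category fan-out, under the same placeholder assumption as Phase 2 (here `{{FUNCTIONS_JSON}}`; verify against the actual template). The loop only prepares prompts; dispatching each to an opus subagent and saving its output remains an agent-driven step:
```bash
mkdir -p prompts duplicates
for f in categories/*.json; do
  cat="$(basename "$f" .json)"
  # One filled-in prompt per category; dispatch each to an opus
  # subagent and save its answer as duplicates/${cat}.json.
  sed -e "/{{FUNCTIONS_JSON}}/r $f" \
      -e '/{{FUNCTIONS_JSON}}/d' \
      scripts/find-duplicates-prompt.md > "prompts/${cat}.md"
done
```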
### Phase 5: Generate Report
```bash
./scripts/generate-report.sh ./duplicates ./duplicates-report.md
```
Produces a prioritized markdown report grouped by confidence level.
### Phase 6: Human Review
Review the report. For HIGH confidence duplicates:
1. Verify the recommended survivor has tests
2. Update callers to use the survivor (see the call-site check below)
3. Delete the duplicates
4. Run tests
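For step 2, enumerate every call site of a duplicate before touching it; a minimal check with ripgrep (`formatErrMsg` is a stand-in for the duplicate's actual name):
```bash
# List call sites of the duplicate before rewriting or deleting it:
rg -n '\bformatErrMsg\(' src/
```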
## High-Risk Duplicate Zones
Focus extraction on these areas first - they accumulate duplicates fastest; a per-zone extraction sketch follows the table:
| Zone | Common Duplicates |
|------|-------------------|
| `utils/`, `helpers/`, `lib/` | General utilities reimplemented |
| Validation code | Same checks written multiple ways |
| Error formatting | Error-to-string conversions |
| Path manipulation | Joining, resolving, normalizing paths |
| String formatting | Case conversion, truncation, escaping |
| Date formatting | Same formats implemented repeatedly |
| API response shaping | Similar transformations for different endpoints |
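One way to scope extraction to the zones above, assuming `catalog.json` is a plain JSON array so per-zone catalogs can be concatenated with `jq -s 'add'` (the zone paths are examples):
```bash
# Extract per zone, then merge the catalogs:
./scripts/extract-functions.sh src/utils -o catalog-utils.json
./scripts/extract-functions.sh src/lib   -o catalog-lib.json
jq -s 'add' catalog-*.json > catalog.json
```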
## Common Mistakes
**Extracting too much**: Focus on exported functions and public methods. Internal helpers are less likely to be duplicated across files.
**Skipping the categorization step**: Going straight to duplicate detection on the full catalog produces noise. Categories focus the comparison.
**Using haiku for duplicate detection**: Haiku is cost-effective for categorization but misses subtle semantic duplicates. Use Opus for the actual duplicate analysis.
**Consolidating without tests**: Before deleting duplicates, ensure the survivor has tests covering all use cases of the deleted functions.