|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +league/commonmark is a highly-extensible PHP Markdown parser that fully supports the CommonMark spec and GitHub-Flavored Markdown (GFM). It's based on the CommonMark JS reference implementation and provides a robust, extensible architecture for parsing and rendering Markdown content. |
| 8 | + |
| 9 | +## Development Commands |
| 10 | + |
| 11 | +### Testing |
| 12 | +- `composer test` - Run all tests (includes linting, static analysis, unit tests, and pathological tests) |
| 13 | +- `composer phpunit` - Run PHPUnit tests only (no coverage) |
| 14 | +- `composer pathological` - Run pathological performance tests |
| 15 | + |
| 16 | +### Code Quality |
| 17 | +- `composer phpcs` - Run PHP CodeSniffer for coding standards |
| 18 | +- `composer phpcbf` - Automatically fix coding standards issues |
| 19 | +- `composer phpstan` - Run PHPStan static analysis |
| 20 | +- `composer psalm` - Run Psalm static analysis with stats |
| 21 | + |
| 22 | +(IMPORTANT: you MUST ALWAYS use PHP 7.4 to run `phpcs` and `phpcbf`. You SHOULD use the `php` service from docker-compose, which uses that version. Example: `docker compose exec php composer phpcs`) |
| 23 | + |
| 24 | +### Benchmarking |
| 25 | +- `./tests/benchmark/benchmark.php` - Compare performance against other Markdown parsers |
| 26 | + |
| 27 | +## Architecture Overview |
| 28 | + |
| 29 | +### Core Components |
| 30 | + |
| 31 | +**Converters**: Main entry points using Facade pattern |
| 32 | +- `CommonMarkConverter` - Preconfigured with `CommonMarkCoreExtension` |
| 33 | +- `GithubFlavoredMarkdownConverter` - Includes GFM extensions bundle |
| 34 | +- `MarkdownConverter` - Base class orchestrating `MarkdownParser` + `HtmlRenderer` |
| 35 | +- Pattern: Factory with default configurations + Facade for complex pipeline |
| 36 | + |
| 37 | +**Environment System**: Service container and registry |
| 38 | +- `Environment` - Central registry managing parsers/renderers with priorities |
| 39 | +- Implements PSR-14 event dispatcher for pre/post processing hooks |
| 40 | +- Uses lazy initialization - extensions registered on first use |
| 41 | +- Pattern: Registry + Builder + Dependency Injection |
| 42 | + |
| 43 | +**Parser Architecture**: Two-phase recursive descent parsing |
| 44 | +- **Block Phase**: `MarkdownParser` processes line-by-line with active parser stack |
| 45 | + - `BlockStartParserInterface` - Strategy pattern for block detection |
| 46 | + - State machine with continuation tracking and reference processing |
| 47 | + - Security: NUL character replacement, configurable nesting limits |
| 48 | +- **Inline Phase**: `InlineParserEngine` with regex pre-compilation |
| 49 | + - `InlineParserInterface` - Strategy with regex-based matching |
| 50 | + - Position-based parser coordination with delimiter processing |
| 51 | + - Adjacent text merging optimization |
| 52 | + |
| 53 | +**AST (Abstract Syntax Tree)**: Composite pattern with doubly-linked structure |
| 54 | +- `Node` base class with tree navigation/manipulation methods |
| 55 | +- `AbstractBlock`/`AbstractInline` - Template method pattern for element types |
| 56 | +- `Document` - Root node with reference map storage |
| 57 | +- Uses `Dflydev\DotAccessData\Data` for flexible metadata storage |
| 58 | +- Supports multiple traversal methods: iterator, walker, and query system |
| 59 | + |
| 60 | +**Rendering**: Visitor pattern with strategy delegation |
| 61 | +- `HtmlRenderer` - Traverses AST, delegates to node-specific renderers |
| 62 | +- `NodeRendererInterface` - Strategy pattern for extensible rendering |
| 63 | +- Hierarchical renderer lookup supporting inheritance |
| 64 | +- Pre/post-render events with configurable block separators |
| 65 | + |
| 66 | +**Extension System**: Plugin pattern with composite support |
| 67 | +- `ExtensionInterface` - Simple contract for environment configuration |
| 68 | +- `CommonMarkCoreExtension` - Complete spec implementation with priorities |
| 69 | +- `GithubFlavoredMarkdownExtension` - Composite bundling multiple GFM features |
| 70 | +- Performance: Optimized parser ordering and lazy registration |
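As a rough illustration of how the converters, environment, and extensions described above fit together, the sketch below puts the preconfigured facade next to a manually assembled environment. It assumes league/commonmark 2.x installed through Composer; the Markdown input strings and variable names are placeholders chosen for the example.

```php
<?php

declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php'; // assumes a Composer install

use League\CommonMark\CommonMarkConverter;
use League\CommonMark\Environment\Environment;
use League\CommonMark\Extension\CommonMark\CommonMarkCoreExtension;
use League\CommonMark\Extension\GithubFlavoredMarkdownExtension;
use League\CommonMark\MarkdownConverter;

// Facade: already configured with CommonMarkCoreExtension.
$simple = new CommonMarkConverter();
$simpleHtml = (string) $simple->convert('# Hello *world*');

// Manual assembly: build the Environment, add extensions,
// then hand it to the MarkdownConverter base class.
$environment = new Environment([]);
$environment->addExtension(new CommonMarkCoreExtension());
$environment->addExtension(new GithubFlavoredMarkdownExtension());

$custom = new MarkdownConverter($environment);
$customHtml = (string) $custom->convert('~~old~~ new');

echo $simpleHtml, $customHtml;
```

The manual route is the one to reach for when an extension or event listener has to be registered before the first conversion.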
| 71 | + |
| 72 | +### Key Directories |
| 73 | + |
| 74 | +**`src/Extension/`**: All built-in extensions |
| 75 | +- `CommonMark/` - Core CommonMark specification features |
| 76 | +- `GithubFlavoredMarkdownExtension.php` - GFM bundle extension |
| 77 | +- Individual feature extensions: `Table/`, `Strikethrough/`, `TaskList/`, etc. |
| 78 | + |
| 79 | +**`src/Parser/`**: Parsing logic |
| 80 | +- `Block/` - Block-level parsing components |
| 81 | +- `Inline/` - Inline parsing components |
| 82 | +- `MarkdownParser.php` - Main parsing coordinator |
| 83 | + |
| 84 | +**`src/Node/`**: AST node definitions |
| 85 | +- `Block/` - Block-level nodes (paragraphs, headings, lists, etc.) |
| 86 | +- `Inline/` - Inline nodes (text, emphasis, links, etc.) |
| 87 | + |
| 88 | +**`src/Renderer/`**: Output rendering |
| 89 | +- `Block/` and `Inline/` subdirectories mirror node structure |
| 90 | +- `HtmlRenderer.php` - Main HTML output renderer |
| 91 | + |
| 92 | +## AST (Abstract Syntax Tree) Manipulation |
| 93 | + |
| 94 | +The library uses a doubly-linked AST in which every element (including the root `Document`) extends the `Node` class. |
| 95 | + |
| 96 | +### AST Traversal Methods |
| 97 | + |
| 98 | +- **Iterator**: `$node->iterator()` - Fastest for complete tree traversal |
| 99 | +- **Walker**: `$node->walker()` - Full control with enter/leave events, use `resumeAt()` for safe modifications |
| 100 | +- **Query**: `(new Query())->where(Query::type(Link::class))->findAll($node)` - Easy but memory-intensive, creates snapshots |
| 101 | +- **Manual**: `$node->next()`, `$node->parent()`, `$node->children()` - Best for direct relationships |
| 102 | + |
| 103 | +### AST Modification |
| 104 | + |
| 105 | +- **Adding**: `appendChild()`, `prependChild()`, `insertAfter()`, `insertBefore()` |
| 106 | +- **Removing**: `detach()`, `replaceWith()`, `detachChildren()`, `replaceChildren()` |
| 107 | +- **Data**: `$node->data->set('custom/info', $value)`, `$node->data->set('attributes/class', 'css-class')` |
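A minimal sketch of the iterator traversal and the `data` API listed above, assuming league/commonmark 2.x. It uses a `DocumentParsedEvent` listener to tag every `Link` node with a CSS class; the class name `external-link` and the input string are hypothetical.

```php
<?php

declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php'; // assumes a Composer install

use League\CommonMark\Environment\Environment;
use League\CommonMark\Event\DocumentParsedEvent;
use League\CommonMark\Extension\CommonMark\CommonMarkCoreExtension;
use League\CommonMark\Extension\CommonMark\Node\Inline\Link;
use League\CommonMark\MarkdownConverter;

$environment = new Environment();
$environment->addExtension(new CommonMarkCoreExtension());

// The listener runs after parsing and before rendering.
$environment->addEventListener(
    DocumentParsedEvent::class,
    static function (DocumentParsedEvent $event): void {
        // iterator() visits each node in the tree once.
        foreach ($event->getDocument()->iterator() as $node) {
            if ($node instanceof Link) {
                // 'external-link' is a hypothetical class name.
                $node->data->set('attributes/class', 'external-link');
            }
        }
    }
);

$converter = new MarkdownConverter($environment);
$html = (string) $converter->convert('[example](https://example.com)');

echo $html;
```

Only node data changes here, so the plain iterator is enough. For structural changes during traversal (detaching or replacing nodes), the walker with `resumeAt()` is the safer choice, as noted above.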
| 108 | + |
| 109 | +## Extension Development |
| 110 | + |
| 111 | +### Creating Extensions |
| 112 | +1. Implement `ExtensionInterface` with `register(EnvironmentBuilderInterface $environment)` method |
| 113 | +2. Register components with priorities: `addInlineParser()`, `addBlockStartParser()`, `addRenderer()` |
| 114 | +3. Follow existing extension patterns in `src/Extension/` |
| 115 | + |
| 116 | +### Key Interfaces |
| 117 | +- **Block Parsers**: `BlockStartParserInterface` - implement `tryStart()`; the `BlockContinueParserInterface` it returns implements `tryContinue()` |
| 118 | +- **Inline Parsers**: `InlineParserInterface` - implement `getMatchDefinition()` and `parse()` |
| 119 | +- **Delimiter Processors**: `DelimiterProcessorInterface` - for emphasis-style wrapping syntax |
| 120 | +- **Renderers**: `NodeRendererInterface` - implement `render()`, use `HtmlElement` for safety |
| 121 | +- **Events**: PSR-14 events like `DocumentParsedEvent` for AST manipulation |
| 122 | +- **Configuration**: `ConfigurableExtensionInterface` with `league/config` validation |
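The steps and interfaces above can be sketched as a small extension that registers one inline parser. This assumes league/commonmark 2.x; `MentionParser`, `MentionExtension`, and the `https://example.com/users/` URL are hypothetical names invented for the example and are not part of the library.

```php
<?php

declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php'; // assumes a Composer install

use League\CommonMark\Environment\Environment;
use League\CommonMark\Environment\EnvironmentBuilderInterface;
use League\CommonMark\Extension\CommonMark\CommonMarkCoreExtension;
use League\CommonMark\Extension\CommonMark\Node\Inline\Link;
use League\CommonMark\Extension\ExtensionInterface;
use League\CommonMark\MarkdownConverter;
use League\CommonMark\Parser\Inline\InlineParserInterface;
use League\CommonMark\Parser\Inline\InlineParserMatch;
use League\CommonMark\Parser\InlineParserContext;

// Hypothetical parser: turns "@handle" into a profile link.
final class MentionParser implements InlineParserInterface
{
    public function getMatchDefinition(): InlineParserMatch
    {
        return InlineParserMatch::regex('@([A-Za-z0-9_]{1,15}(?!\w))');
    }

    public function parse(InlineParserContext $inlineContext): bool
    {
        $cursor = $inlineContext->getCursor();

        // Accept the match only at the start of a line or after a space.
        // peek() does not move the cursor, so nothing needs restoring.
        $previous = $cursor->peek(-1);
        if ($previous !== null && $previous !== ' ') {
            return false;
        }

        // Consume the matched text.
        $cursor->advanceBy($inlineContext->getFullMatchLength());

        // The first capture group holds the handle without the "@".
        [$handle] = $inlineContext->getSubMatches();
        $inlineContext->getContainer()->appendChild(
            new Link('https://example.com/users/' . $handle, '@' . $handle)
        );

        return true;
    }
}

// Hypothetical extension wrapping the parser.
final class MentionExtension implements ExtensionInterface
{
    public function register(EnvironmentBuilderInterface $environment): void
    {
        $environment->addInlineParser(new MentionParser());
    }
}

$environment = new Environment();
$environment->addExtension(new CommonMarkCoreExtension());
$environment->addExtension(new MentionExtension());

$html = (string) (new MarkdownConverter($environment))->convert('Thanks @alice');

echo $html;
```

Because the parser emits a regular `Link` node, the existing link renderer handles output and no custom `NodeRendererInterface` is needed in this sketch.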
| 123 | + |
| 124 | +### Cursor Usage & Parsing |
| 125 | +- `Cursor` class: dual ASCII/UTF-8 paths, character caching, position state management |
| 126 | +- Key methods: `peek()`, `match()`, `saveState()`/`restoreState()`, `advanceBy()` |
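A short sketch of the save/match/restore idiom with the `Cursor` methods listed above, assuming league/commonmark 2.x. The input line and the heading-marker regex are illustrative only and are not how the library's own heading parser is written.

```php
<?php

declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php'; // assumes a Composer install

use League\CommonMark\Parser\Cursor;

$cursor = new Cursor('## Heading text');

// Remember the position so a failed attempt can be undone.
$state = $cursor->saveState();

// match() advances past the matched text and returns it, or returns null.
$marker = $cursor->match('/^#{1,6}/');

if ($marker === null || $cursor->getCurrentCharacter() !== ' ') {
    // Not a valid marker: rewind to where we started.
    $cursor->restoreState($state);
    $marker = null;
} else {
    // Skip the single space after the marker.
    $cursor->advanceBy(1);
}

$rest = $cursor->getRemainder();

echo $marker ?? '(no marker)', ' | ', $rest, "\n";
```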
| 127 | + |
| 128 | +## Testing Strategy |
| 129 | + |
| 130 | +### Test Categories |
| 131 | +- **Unit Tests** (`tests/unit/`) - Component testing, mirrors source structure |
| 132 | +- **Functional Tests** (`tests/functional/`) - End-to-end with `.md`/`.html` pairs |
| 133 | +- **Pathological Tests** (`tests/pathological/`) - Security/DoS prevention |
| 134 | +- **Extension Tests** (`tests/functional/Extension/`) - Per-extension testing |
| 135 | + |
| 136 | +### Running Tests |
| 137 | +- `composer test` - Full test suite |
| 138 | +- `composer phpunit` - PHPUnit tests only |
| 139 | +- `composer pathological` - Security/performance tests |
| 140 | + |
| 141 | +## Security Configuration (CRITICAL for Untrusted Input) |
| 142 | + |
| 143 | +When handling untrusted user input, the following settings are essential to prevent XSS, DoS, and other attacks. Review each one wherever untrusted input is processed: |
| 144 | + |
| 145 | +### HTML Input Security (`html_input`) |
| 146 | + |
| 147 | +**Implementation**: `HtmlFilter::filter()` in `HtmlBlockRenderer` and `HtmlInlineRenderer` |
| 148 | +**Default**: `'allow'` (unsafe for untrusted input) |
| 149 | +**Attack Vector**: XSS through raw HTML injection |
| 150 | + |
| 151 | +**Options**: |
| 152 | +- `HtmlFilter::STRIP` returns empty string |
| 153 | +- `HtmlFilter::ESCAPE` uses `htmlspecialchars($html, ENT_NOQUOTES)` |
| 154 | +- `HtmlFilter::ALLOW` returns raw HTML unchanged |
| 155 | + |
| 156 | +### Unsafe Links Protection (`allow_unsafe_links`) |
| 157 | + |
| 158 | +**Implementation**: `RegexHelper::isLinkPotentiallyUnsafe()` in `LinkRenderer` and `ImageRenderer` |
| 159 | +**Default**: `true` (allows unsafe links) |
| 160 | +**Attack Vector**: XSS through malicious protocols (javascript:, vbscript:, file:, data:) |
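A minimal configuration sketch for untrusted input that applies the two options described above, assuming league/commonmark 2.x. The `max_nesting_level` value of 100 is an arbitrary illustration of the nesting limit mentioned in the architecture notes, and the input string is a made-up attack sample.

```php
<?php

declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php'; // assumes a Composer install

use League\CommonMark\CommonMarkConverter;

$converter = new CommonMarkConverter([
    'html_input'         => 'escape', // or 'strip'; the default 'allow' passes raw HTML through
    'allow_unsafe_links' => false,    // the default true keeps javascript:, data:, etc.
    'max_nesting_level'  => 100,      // arbitrary example value
]);

$untrusted = "<script>alert(1)</script>\n\n[click](javascript:alert(1))";
$html = (string) $converter->convert($untrusted);

echo $html;
```

With these settings the raw `<script>` tag is escaped rather than emitted, and the `javascript:` URL is dropped from the rendered link.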