|
| 1 | +# HTML::Parser - Perl HTML Parser Module |
| 2 | + |
| 3 | +HTML::Parser is a C/XS-based Perl module for parsing HTML documents. It's part of the libwww-perl organization and provides event-driven HTML parsing with support for multiple parser modes and extensive customization options. |
| 4 | + |
| 5 | +Always reference these instructions first, and fall back to search or bash commands only when you encounter unexpected information that does not match the info here. |
| 6 | + |
| 7 | +## Working Effectively |
| 8 | + |
| 9 | +- Bootstrap, build, and test the repository: |
| 10 | + - `perl Makefile.PL` -- generates Makefile (takes ~0.12 seconds) |
| 11 | + - `make` -- builds the C/XS module (takes ~1.5 seconds, NEVER CANCEL) |
| 12 | + - `make test` -- runs 464 tests (takes ~3 seconds, NEVER CANCEL) |
| 13 | +- Clean the build: |
| 14 | + - `make clean` -- removes build artifacts (takes ~0.01 seconds) |
| 15 | +- Dependencies are automatically handled by the Perl build system: |
| 16 | + - Runtime dependencies: HTML::Tagset, HTTP::Headers, URI, etc. |
| 17 | + - Test dependencies: Test::More, File::Spec, Config, etc. |
| 18 | + - Test::More, File::Spec, and Config ship with core Perl; HTML::Tagset, HTTP::Headers, and URI are CPAN modules and may need to be installed separately |
| 19 | + |
| 20 | +## Validation |
| 21 | + |
| 22 | +- ALWAYS test core functionality after making changes to the XS code or main Parser.pm: |
| 23 | + ```bash |
| 24 | + perl -MHTML::Parser -e 'print "HTML::Parser loads successfully\n"' |
| 25 | + ``` |
| 26 | +- ALWAYS test example scripts in eg/ directory: |
| 27 | + - `perl eg/htext test.html` -- extracts plain text from HTML |
| 28 | + - `perl eg/hstrip test.html` -- strips unwanted tags and attributes |
| 29 | +- ALWAYS run a complete parsing scenario manually: |
| 30 | + - Create test HTML file with various elements (tags, attributes, text, comments) |
| 31 | + - Parse it with HTML::Parser using start_h, end_h, text_h handlers |
| 32 | + - Verify all elements are parsed correctly |
| 33 | +- ALWAYS run the full test suite before committing: `make test` |
| 34 | +- The test suite covers 464 test cases across 50 test files and must ALL pass |
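
The manual parsing scenario described above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the helper name `collect_events` and the sample HTML are assumptions made for illustration, and `unbroken_text` is enabled so that each run of text arrives as a single event.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: parse a string and return the events seen, in order.
sub collect_events {
    my ($html) = @_;
    my @events;
    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run as one event
        start_h   => [ sub { push @events, [ 'start', $_[0], $_[1] ] }, 'tagname, attr' ],
        end_h     => [ sub { push @events, [ 'end', $_[0] ] }, 'tagname' ],
        text_h    => [ sub { push @events, [ 'text', $_[0] ] }, 'dtext' ],
        comment_h => [ sub { push @events, [ 'comment', $_[0] ] }, 'text' ],
    );
    $p->parse($html);
    $p->eof;    # flush any buffered text
    return @events;
}

# Print one line per event for manual inspection.
for my $event ( collect_events('<p class="intro">Hello &amp; <b>world</b></p><!-- note -->') ) {
    my ( $type, $value ) = @$event;
    print "$type: $value\n";
}
```

Inspecting the returned list makes it easy to confirm that tag names, attributes, decoded text, and comments all arrive as expected after a change to the XS code.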
| 35 | + |
| 36 | +## Build System Details |
| 37 | + |
| 38 | +- Uses ExtUtils::MakeMaker build system (traditional Perl approach) |
| 39 | +- XS (C extension) compilation is handled automatically |
| 40 | +- Generated files during build: Parser.c (from Parser.xs), Parser.so, blib/ directory |
| 41 | +- Configuration: Makefile.PL defines build parameters including MARKED_SECTION support |
| 42 | +- Build artifacts are placed in blib/ directory structure |
| 43 | + |
| 44 | +## Project Structure |
| 45 | + |
| 46 | +### Key Files and Directories: |
| 47 | +- `lib/HTML/Parser.pm` -- Main Perl module with XS loading |
| 48 | +- `Parser.xs` -- XS interface between Perl and C code |
| 49 | +- `hparser.c` -- Core C parsing engine |
| 50 | +- `lib/HTML/` -- Additional modules (Entities, LinkExtor, HeadParser, etc.) |
| 51 | +- `t/` -- Test suite (50 test files, 464 tests total) |
| 52 | +- `eg/` -- Example scripts demonstrating usage |
| 53 | +- `cpanfile` -- Dependency specification |
| 54 | +- `Makefile.PL` -- Build configuration |
| 55 | + |
| 56 | +### Important Modules: |
| 57 | +- `HTML::Parser` -- Main parser class (lib/HTML/Parser.pm) |
| 58 | +- `HTML::Entities` -- HTML entity encoding/decoding (lib/HTML/Entities.pm) |
| 59 | +- `HTML::LinkExtor` -- Extract links from HTML (lib/HTML/LinkExtor.pm) |
| 60 | +- `HTML::HeadParser` -- Parse HTML head sections (lib/HTML/HeadParser.pm) |
| 61 | +- `HTML::PullParser` -- Pull-style parsing interface (lib/HTML/PullParser.pm) |
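
As a quick orientation to two of the helper modules listed above, the following sketch exercises `HTML::Entities` and `HTML::LinkExtor`. The helper name `link_urls` and the sample markup are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);
use HTML::LinkExtor ();

# Hypothetical helper: return every link URL found in an HTML string.
sub link_urls {
    my ($html) = @_;
    my $extor = HTML::LinkExtor->new;    # no callback, so links are accumulated
    $extor->parse($html);
    $extor->eof;
    # Each link is [ $tag, $attr_name => $url, ... ]; keep only the URLs.
    return map { my ( $tag, %attr ) = @$_; sort values %attr } $extor->links;
}

print decode_entities('&lt;p&gt; &amp; &#39;'), "\n";
print encode_entities('<test>'), "\n";
print "$_\n" for link_urls('<a href="a.html">A</a> <img src="x.png">');
```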
| 62 | + |
| 63 | +## Testing |
| 64 | + |
| 65 | +- Test suite is comprehensive with 464 tests across multiple scenarios |
| 66 | +- Tests cover: basic parsing, entity handling, filters, callbacks, edge cases |
| 67 | +- All tests use Test::More framework |
| 68 | +- Key test categories: |
| 69 | + - Parser functionality (t/parser.t, t/callback.t) |
| 70 | + - Entity handling (t/entities.t, t/uentities.t) |
| 71 | + - Filter methods (t/filter.t, t/filter-methods.t) |
| 72 | + - Unicode support (t/unicode.t) |
| 73 | + - Various parser modes and options |
| 74 | + |
| 75 | +## Common Tasks |
| 76 | + |
| 77 | +### Building from scratch: |
| 78 | +```bash |
| 79 | +perl Makefile.PL |
| 80 | +make |
| 81 | +make test |
| 82 | +``` |
| 83 | + |
| 84 | +### Testing specific functionality: |
| 85 | +```bash |
| 86 | +# Test entity handling |
| 87 | +perl -MHTML::Entities -e 'print HTML::Entities::encode_entities("<test>") . "\n"' |
| 88 | + |
| 89 | +# Test basic parsing |
| 90 | +perl -MHTML::Parser -e ' |
| 91 | + my $p = HTML::Parser->new(text_h => [sub { print "$_[0]\n" }, "dtext"]); |
| 92 | + $p->parse("<p>Hello & world</p>"); |
| 93 | + $p->eof; |
| 94 | +' |
| 95 | +``` |
| 96 | + |
| 97 | +### Manual validation scenarios: |
| 98 | +1. **Basic HTML parsing**: Create HTML with tags, attributes, text, and entities. Parse and verify all components are extracted correctly. |
| 99 | +2. **Entity decoding**: Test that HTML entities such as `&amp;`, `&lt;`, `&gt;`, and `&#39;` are properly decoded. |
| 100 | +3. **Filter functionality**: Test ignore_tags, report_tags, and ignore_elements work correctly. |
| 101 | +4. **Callback handling**: Verify start_h, end_h, text_h callbacks receive correct parameters. |
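
The filter scenario (item 3) can be checked with a small script along these lines. It is a sketch under stated assumptions: the helper `filtered_events`, its option names, and the sample markup are invented here, while `ignore_elements`, `ignore_tags`, and `report_tags` are the parser methods named above.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: apply one or more filters, then return the start tags
# that were reported and the concatenated text.
sub filtered_events {
    my ( $html, %filter ) = @_;
    my @tags;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { push @tags, $_[0] }, 'tagname' ],
        text_h  => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    # Each filter key names the parser method to call with the listed tags.
    for my $method (qw(ignore_elements ignore_tags report_tags)) {
        $p->$method( @{ $filter{$method} } ) if $filter{$method};
    }
    $p->parse($html);
    $p->eof;
    return ( \@tags, $text );
}

my $html = '<p>Keep <b>bold</b><script>alert(1)</script></p>';
for my $method (qw(ignore_elements ignore_tags report_tags)) {
    my $arg = $method eq 'ignore_elements' ? 'script' : $method eq 'ignore_tags' ? 'b' : 'p';
    my ( $tags, $text ) = filtered_events( $html, $method => [$arg] );
    print "$method($arg): tags=[@$tags] text=[$text]\n";
}
```

The distinction worth verifying is that `ignore_elements` drops an element together with its content, whereas `ignore_tags` and `report_tags` only affect which tag events are reported.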
| 102 | + |
| 103 | +### File outputs from commonly run commands: |
| 104 | + |
| 105 | +#### Repository root listing: |
| 106 | +``` |
| 107 | +Changes TODO dist.ini hparser.c lib/ t/ |
| 108 | +LICENSE cpanfile eg/ hparser.h mkhctype test.html |
| 109 | +Makefile.PL .github/ entities.html hints/ mkpfunc test_parser.pl |
| 110 | +META.json .gitignore .perltidyrc typemap Parser.xs tokenpos.h |
| 111 | +README.md .mailmap hctype.h pfunc.h ppport.h util.c |
| 112 | +``` |
| 113 | + |
| 114 | +#### Example scripts (eg/ directory): |
| 115 | +``` |
| 116 | +hanchors hbody hdisable hdump hform hlc hrefsub hstrip htext htextsub htitle |
| 117 | +``` |
| 118 | + |
| 119 | +#### Test directory: |
| 120 | +- 50 test files covering all functionality |
| 121 | +- Tests range from basic parsing to complex Unicode scenarios |
| 122 | +- All tests must pass for a valid build |
| 123 | + |
| 124 | +## CI/CD |
| 125 | + |
| 126 | +- GitHub Actions workflows for Linux, macOS, and Windows |
| 127 | +- Workflows test multiple Perl versions (5.10 to 5.40) |
| 128 | +- All builds must pass before merge |
| 129 | +- Located in .github/workflows/ |
| 130 | + |
| 131 | +## Development Notes |
| 132 | + |
| 133 | +- This is a mature, stable codebase (version 3.84) |
| 134 | +- Changes should be minimal and well-tested |
| 135 | +- XS/C code changes require careful validation |
| 136 | +- Backward compatibility is important |
| 137 | +- Follow existing code style (see .perltidyrc) |
| 138 | + |
| 139 | +## Performance |
| 140 | + |
| 141 | +- Parser is optimized for speed with C implementation |
| 142 | +- Handles large documents efficiently |
| 143 | +- Event-driven approach minimizes memory usage |
| 144 | +- Build times are fast (~1.5 seconds total) |
| 145 | +- Test execution is quick (~3 seconds for full suite) |
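
To illustrate the event-driven point above, here is a minimal sketch of feeding a document to the parser in pieces, so the whole input never has to be held in memory at once. The helper `count_start_tags` and the chunk boundaries are assumptions chosen to show that a tag split across two chunks is still recognized.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: count start tags across a list of input chunks.
sub count_start_tags {
    my (@chunks) = @_;
    my %count;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { $count{ $_[0] }++ }, 'tagname' ],
    );
    $p->parse($_) for @chunks;    # incomplete markup is buffered between calls
    $p->eof;                      # signal end of document
    return \%count;
}

# Chunks deliberately split tags and text in the middle.
my $count = count_start_tags( '<ul><l', 'i>one</li><li>tw', 'o</li></ul>' );
print "$_: $count->{$_}\n" for sort keys %$count;
```

In real use the chunks would typically come from reading a file or socket in fixed-size blocks.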
| 146 | + |
| 147 | +## Troubleshooting |
| 148 | + |
| 149 | +- If build fails, ensure C compiler is available |
| 150 | +- Missing dependencies are usually auto-detected |
| 151 | +- Test failures indicate breaking changes |
| 152 | +- XS compilation errors suggest C code issues |
| 153 | +- Use `make clean` to reset build state |
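
When a build fails, one way to check the first point above is to ask Perl which C compiler it was configured with and whether that compiler is usable. This is a sketch only; the helper name `compiler_status` is an assumption, and it relies on the core modules `Config` and `ExtUtils::CBuilder`.

```perl
use strict;
use warnings;
use Config;
use ExtUtils::CBuilder ();

# Hypothetical helper: report the configured compiler and whether it works.
sub compiler_status {
    my $builder = ExtUtils::CBuilder->new( quiet => 1 );
    return {
        cc     => $Config{cc},                        # compiler Perl was built with
        usable => $builder->have_compiler ? 1 : 0,    # tries compiling a tiny C file
    };
}

my $status = compiler_status();
print "cc: $status->{cc}\n";
print "usable: ", ( $status->{usable} ? 'yes' : 'no' ), "\n";
```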