Commit 9d6d9ab

Copilot and oalders committed
Add comprehensive GitHub Copilot instructions
Co-authored-by: oalders <[email protected]>
1 parent 9676237 commit 9d6d9ab

5 files changed: 156 additions and 52 deletions
.github/copilot-instructions.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
# HTML::Parser - Perl HTML Parser Module

HTML::Parser is a C/XS-based Perl module for parsing HTML documents. It's part of the libwww-perl organization and provides event-driven HTML parsing with support for multiple parser modes and extensive customization options.

Always reference these instructions first and fall back to search or bash commands only when you encounter unexpected information that does not match the info here.

## Working Effectively

- Bootstrap, build, and test the repository:
  - `perl Makefile.PL` -- generates Makefile (takes ~0.12 seconds)
  - `make` -- builds the C/XS module (takes ~1.5 seconds, NEVER CANCEL)
  - `make test` -- runs 464 tests (takes ~3 seconds, NEVER CANCEL)
- Clean the build:
  - `make clean` -- removes build artifacts (takes ~0.01 seconds)
- Dependencies are automatically handled by the Perl build system:
  - Runtime dependencies: HTML::Tagset, HTTP::Headers, URI, etc.
  - Test dependencies: Test::More, File::Spec, Config, etc.
  - The test dependencies above ship with core Perl; the runtime dependencies above are CPAN modules and may need to be installed first

## Validation

- ALWAYS test core functionality after making changes to the XS code or main Parser.pm:
  ```bash
  perl -MHTML::Parser -e 'print "HTML::Parser loads successfully\n"'
  ```
- ALWAYS test example scripts in eg/ directory:
  - `perl eg/htext test.html` -- extracts plain text from HTML
  - `perl eg/hstrip test.html` -- strips unwanted tags and attributes
- ALWAYS run a complete parsing scenario manually:
  - Create test HTML file with various elements (tags, attributes, text, comments)
  - Parse it with HTML::Parser using start_h, end_h, text_h handlers
  - Verify all elements are parsed correctly
- ALWAYS run the full test suite before committing: `make test`
- The test suite covers 464 test cases across 50 test files and must ALL pass
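
The manual parsing scenario above could be sketched roughly as follows. This is a minimal illustration, not a script from the repository: the helper name `collect_events` and the sample HTML are made up for the example. The `start_h`, `end_h`, `text_h` and `comment_h` handlers and their argspec strings are the documented HTML::Parser interface.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: parse an HTML string and return one string per event.
sub collect_events {
    my ($html) = @_;
    my @events;

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run as a single event
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                my $attrs = join ' ', map {"$_=$attr->{$_}"} sort keys %$attr;
                push @events,
                    length($attrs) ? "start:$tag [$attrs]" : "start:$tag";
            },
            'tagname, attr'
        ],
        end_h     => [ sub { push @events, "end:$_[0]" },     'tagname' ],
        text_h    => [ sub { push @events, "text:$_[0]" },    'dtext' ],
        comment_h => [ sub { push @events, "comment:$_[0]" }, 'text' ],
    );

    $p->parse($html);
    $p->eof;    # flush any buffered text
    return @events;
}

# Sample input covering a tag, an attribute, an entity, and a comment.
print "$_\n"
    for collect_events('<p class="x">Hello &amp; <b>world</b><!-- note --></p>');
```

Comparing the printed event list against the input by eye is one way to carry out the "verify all elements are parsed correctly" step.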

## Build System Details

- Uses ExtUtils::MakeMaker build system (traditional Perl approach)
- XS (C extension) compilation is handled automatically
- Generated files during build: Parser.c (from Parser.xs), Parser.so, blib/ directory
- Configuration: Makefile.PL defines build parameters including MARKED_SECTION support
- Build artifacts are placed in blib/ directory structure

## Project Structure

### Key Files and Directories:
- `lib/HTML/Parser.pm` -- Main Perl module with XS loading
- `Parser.xs` -- XS interface between Perl and C code
- `hparser.c` -- Core C parsing engine
- `lib/HTML/` -- Additional modules (Entities, LinkExtor, HeadParser, etc.)
- `t/` -- Test suite (50 test files, 464 tests total)
- `eg/` -- Example scripts demonstrating usage
- `cpanfile` -- Dependency specification
- `Makefile.PL` -- Build configuration

### Important Modules:
- `HTML::Parser` -- Main parser class (lib/HTML/Parser.pm)
- `HTML::Entities` -- HTML entity encoding/decoding (lib/HTML/Entities.pm)
- `HTML::LinkExtor` -- Extract links from HTML (lib/HTML/LinkExtor.pm)
- `HTML::HeadParser` -- Parse HTML head sections (lib/HTML/HeadParser.pm)
- `HTML::PullParser` -- Pull-style parsing interface (lib/HTML/PullParser.pm)
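
As a rough illustration of how two of these companion modules are typically used, the sketch below round-trips a string through `HTML::Entities` and collects link attributes with `HTML::LinkExtor`. The helper name `extract_links` and the sample markup are assumptions made for the example. The module functions and methods used (`encode_entities`, `decode_entities`, `links`) are part of the documented interfaces.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);
use HTML::LinkExtor ();

# Encode the characters that are unsafe in HTML, then decode them again.
my $encoded = encode_entities('<a & b>');
my $decoded = decode_entities($encoded);
print "$encoded\n$decoded\n";

# Hypothetical helper: return "tag attr=url" strings for every link found.
sub extract_links {
    my ($html) = @_;

    # With no callback, HTML::LinkExtor stores links for later retrieval.
    my $extor = HTML::LinkExtor->new;
    $extor->parse($html);
    $extor->eof;

    my @found;
    for my $link ($extor->links) {
        # Each entry is [ $tag, $attr_name => $url, ... ].
        my ($tag, %attr) = @$link;
        push @found, map {"$tag $_=$attr{$_}"} sort keys %attr;
    }
    return @found;
}

print "$_\n"
    for extract_links('<a href="http://example.com/">x</a><img src="logo.png">');
```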

## Testing

- Test suite is comprehensive with 464 tests across multiple scenarios
- Tests cover: basic parsing, entity handling, filters, callbacks, edge cases
- All tests use Test::More framework
- Key test categories:
  - Parser functionality (t/parser.t, t/callback.t)
  - Entity handling (t/entities.t, t/uentities.t)
  - Filter methods (t/filter.t, t/filter-methods.t)
  - Unicode support (t/unicode.t)
  - Various parser modes and options
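
For orientation, a Test::More-based test in this style might look like the sketch below. It is not copied from `t/`; the file it would live in (say, a hypothetical `t/example.t`) and the assertions are assumptions chosen to show the general shape of such a test.

```perl
use strict;
use warnings;
use Test::More tests => 2;
use HTML::Parser ();

# Collect decoded text events from a small document.
my @text;
my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,    # keep the text run in one piece
    text_h        => [ sub { push @text, shift }, 'dtext' ],
);
$p->parse('<p>Hello &amp; world</p>')->eof;

is( scalar @text, 1, 'one text event reported' );
is( $text[0], 'Hello & world', 'entity decoded by dtext' );
```

A file like this can be run on its own with `perl -Mblib` after `make`, which is quicker than the full suite while iterating on a change.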

## Common Tasks

### Building from scratch:
```bash
perl Makefile.PL
make
make test
```

### Testing specific functionality:
```bash
# Test entity handling
perl -MHTML::Entities -e 'print HTML::Entities::encode_entities("<test>") . "\n"'

# Test basic parsing
perl -MHTML::Parser -e '
my $p = HTML::Parser->new(text_h => [sub { print "$_[0]\n" }, "dtext"]);
$p->parse("<p>Hello &amp; world</p>");
$p->eof;
'
```

### Manual validation scenarios:
1. **Basic HTML parsing**: Create HTML with tags, attributes, text, and entities. Parse and verify all components are extracted correctly.
2. **Entity decoding**: Test that HTML entities such as `&amp;`, `&lt;`, `&gt;`, and `&#39;` are properly decoded.
3. **Filter functionality**: Test that ignore_tags, report_tags, and ignore_elements work correctly.
4. **Callback handling**: Verify that start_h, end_h, and text_h callbacks receive the correct parameters.
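
The filter scenario can be exercised with a short script along these lines. This is only a sketch: the helper name `visible_events`, its option keys, and the sample markup are invented for the example. `ignore_tags`, `report_tags`, and `ignore_elements` are the parser methods named above.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the events that survive the given filters.
# %opt keys mirror the parser method names and hold array refs of tag names.
sub visible_events {
    my ($html, %opt) = @_;
    my @events;

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { push @events, "<$_[0]>" },  'tagname' ],
        end_h   => [ sub { push @events, "</$_[0]>" }, 'tagname' ],
        text_h  => [ sub { push @events, $_[0] },      'dtext' ],
    );

    # Apply whichever filters the caller asked for.
    for my $filter (qw(ignore_tags report_tags ignore_elements)) {
        $p->$filter( @{ $opt{$filter} } ) if $opt{$filter};
    }

    $p->parse($html)->eof;
    return @events;
}

my @events = visible_events(
    '<p>Hi <b>there</b></p><script>x()</script>',
    ignore_tags     => ['b'],         # drop the <b> tags but keep their text
    ignore_elements => ['script'],    # drop <script> and everything inside it
);
print join('', @events), "\n";
```

The difference to check for is that `ignore_tags` suppresses only the tag events, whereas `ignore_elements` also suppresses the element's content.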

### File outputs from commonly run commands:

#### Repository root listing:
```
Changes TODO dist.ini hparser.c lib/ t/
LICENSE cpanfile eg/ hparser.h mkhctype test.html
Makefile.PL .github/ entities.html hints/ mkpfunc test_parser.pl
META.json .gitignore .perltidyrc typemap Parser.xs tokenpos.h
README.md .mailmap hctype.h pfunc.h ppport.h util.c
```

#### Example scripts (eg/ directory):
```
hanchors hbody hdisable hdump hform hlc hrefsub hstrip htext htextsub htitle
```

#### Test directory:
- 50 test files covering all functionality
- Tests range from basic parsing to complex Unicode scenarios
- All tests must pass for a valid build

## CI/CD

- GitHub Actions workflows for Linux, macOS, and Windows
- Workflows test multiple Perl versions (5.10 to 5.40)
- All builds must pass before merge
- Located in .github/workflows/

## Development Notes

- This is a mature, stable codebase (version 3.84)
- Changes should be minimal and well-tested
- XS/C code changes require careful validation
- Backward compatibility is important
- Follow existing code style (see .perltidyrc)

## Performance

- Parser is optimized for speed with C implementation
- Handles large documents efficiently
- Event-driven approach minimizes memory usage
- Build times are fast (~1.5 seconds total)
- Test execution is quick (~3 seconds for full suite)

## Troubleshooting

- If build fails, ensure C compiler is available
- Missing dependencies are usually auto-detected
- Test failures indicate breaking changes
- XS compilation errors suggest C code issues
- Use `make clean` to reset build state

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -9,4 +9,6 @@ Makefile
 MYMETA.*
 .build/
 HTML-Parser-*/
+test.html
+test_parser.pl

eg/htext

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ sub tag {

 sub text {
     return if $inside{script} || $inside{style};
-    print Encode::encode('utf8', $_[0]);
+    print encode('utf8', $_[0]);
 }

 HTML::Parser->new(

test.html

Lines changed: 0 additions & 18 deletions
This file was deleted.

test_parser.pl

Lines changed: 0 additions & 33 deletions
This file was deleted.
