Use file content heuristics to decide file reader. #1962

Dimi1010 · 2025-09-12T09:40:58Z

The PR adds heuristics based on the file content that is more robust than deciding based on the file extension.

The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.

A new function createReader and tryCreateReader has been added due to changes in the public API of the factory.
The functions differ in the error handling scheme, as createReader throws and tryCreateReader returns nullptr on error.

Method behaviour changes during erroneous scenarios:

Scenario	`getReader`	`createReader`	`tryCreateReader`
File not found	N/A	Throws exception	Return `nullptr`
Unsupported format	Return `PcapFileDeviceReader`	Throws exception	Return `nullptr`

…sed on the magic number.

…le-selection

… tied to it.

…ics detection method.

codecov · 2025-09-12T09:59:59Z

Codecov Report

❌ Patch coverage is 92.15017% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.49%. Comparing base (b91640f) to head (a15f529).

Files with missing lines	Patch %	Lines
Tests/Pcap++Test/Tests/FileTests.cpp	93.12%	5 Missing and 4 partials ⚠️
Pcap++/src/CaptureFileFormatDetector.cpp	90.24%	7 Missing and 1 partial ⚠️
Pcap++/src/PcapFileDevice.cpp	91.54%	4 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              dev    #1962      +/-   ##
==========================================
+ Coverage   83.45%   83.49%   +0.04%     
==========================================
  Files         311      312       +1     
  Lines       54578    54848     +270     
  Branches    11808    11881      +73     
==========================================
+ Hits        45546    45798     +252     
- Misses       7832     7835       +3     
- Partials     1200     1215      +15

Flag	Coverage Δ
alpine320	`75.88% <77.92%> (+<0.01%)`	⬆️
fedora42	`75.44% <77.02%> (-0.01%)`	⬇️
macos-14	`81.58% <84.10%> (+<0.01%)`	⬆️
macos-15	`81.57% <85.35%> (+0.01%)`	⬆️
mingw32	`70.01% <77.01%> (+0.03%)`	⬆️
mingw64	`69.97% <77.01%> (+0.20%)`	⬆️
npcap	`?`
rhel94	`75.45% <77.02%> (-0.02%)`	⬇️
ubuntu2004	`59.46% <61.53%> (+<0.01%)`	⬆️
ubuntu2004-zstd	`59.59% <60.89%> (-0.01%)`	⬇️
ubuntu2204	`75.39% <77.02%> (-0.02%)`	⬇️
ubuntu2204-icpx	`57.84% <60.57%> (+0.01%)`	⬆️
ubuntu2404	`75.48% <76.87%> (-0.02%)`	⬇️
ubuntu2404-arm64	`75.56% <77.92%> (+<0.01%)`	⬆️
unittest	`83.49% <92.15%> (+0.04%)`	⬆️
windows-2022	`85.42% <88.97%> (+0.17%)`	⬆️
windows-2025	`85.45% <89.06%> (+0.13%)`	⬆️
winpcap	`85.45% <89.06%> (-0.08%)`	⬇️
xdp	`52.70% <0.00%> (-0.29%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Pcap++/src/PcapFileDevice.cpp

Tests/Pcap++Test/Tests/FileTests.cpp

seladb · 2025-09-15T08:08:40Z

Tests/Pcap++Test/Tests/FileTests.cpp

-	PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader));
-	PTF_ASSERT_TRUE(genericReader->open());
+	// ------- IFileReaderDevice::createReader() Factory
+	// TODO: Move to a separate unit test.


We should add the following to get more coverage:

Open a snoop file

Open a file that is not any of the options

Open pcap files with different magic numbers

Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

3d713ab adds the following tests:

Pcap, PcapNG, Zst file with correct content + extension

Pcap, PcanNG file with correct content + wrong extension

Bogus content file with correct extension (pcap, pcapng, zst)

Bogus content file with wrong extension (txt)

Haven't found a snoop file to add. Do we have any?

Open pcap files with different magic numbers

Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that invalid format and fail as regular bogus data.

Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

Pending on #1962 (comment) .

Pcap++/src/PcapFileDevice.cpp

Move it out if it needs to be reused somewhere.

Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.

seladb · 2025-09-21T08:10:16Z

@Dimi1010 some CI tests fail...

…le-selection

seladb · 2025-10-16T06:34:23Z

Pcap++/header/PcapFileDevice.h

+		enum class CaptureFileFormat
+		{
+			Unknown,
+			Pcap,        // regular pcap with microsecond precision
+			PcapNano,    // regular pcap with nanosecond precision
+			PcapNG,      // uncompressed pcapng
+			PcapNGZstd,  // zstd compressed pcapng
+			Snoop,       // solaris snoop
+		};
+
+		/// @brief Heuristic file format detector that scans the magic number of the file format header.
+		class CaptureFileFormatDetector


If I'm not mistaken, this used to be in the .cpp file, right? Is the reason we moved it to the .h file is to make it easier to test?

If yes, I think we can test it using createReader() - create a temporary fake file with the data we want to test, and delete it when the test is done

I tried that suggestion initially, but it would have been an extremely fragile unit test. The "pass" conditions would have been checked indirectly.

Also, createReader has multiple return paths for Nano / Zst file formats, which would have caused complications since the format test would have needed to care about the environment it runs at, which it doesn't have to as a standalone.

Any additional changes to createReader could also break the test, which they really shouldn't. For example, I am thinking of maybe adding additional logic for Zst archive to check if the compressed data is actually a pcapng, and not a random file. This would be a nightmare to make compatible with the "spoofed files" test due to assumptions on the test that createReader doesn't do anything more complicated than check the initial magic number.

So, in the end, you end up with a more compilcated unit test to read through that:

depends on the environment it runs on.

can be broken not just by changes to the format detector but also changes to the createReader factory, too.

induces requirements on createReader as it uses its behavior to test detectFormat.

I understand it's better to test CaptureFileFormatDetector as a standalone class, but it requires exposing it in the .h file which is not great (even though it's in the internal namespace). Testing createReader is a bit more fragile, but I don't think the difference is that big. Of course, if we add logic to detect more file types or update the existing detection logic some tests might break, but we easily fix them as needed.

I usually try to avoid the internal namespace where possible because it's still in the .h file and is exposed to users, and we'd like to keep our API as clean as possible

Testing createReader is a bit more fragile, but I don't think the difference is that big. Of course, if we add logic to detect more file types or update the existing detection logic some tests might break, but we easily fix them as needed.

It is a big difference and it's not always an easy fix. I plan to add the aforementioned Zst checks in another PR after this one, and that would make zst spoofing in createReader impossible, due to zst format automatically being checked for PcapNg or Unknown contents. Therefor you can't rely on the return of createReader to find out what the return of detectFormat was, because nullptr can be returned from several paths from detectFormat return value (Unknown, Nano + unsupported, Zst + unsupported). We have already had issues with tests being silently broken (#1977 comes to mind), so I would prefer to avoid fragile tests if we can.

I usually try to avoid the internal namespace where possible because it's still in the .h file and is exposed to users, and we'd like to keep our API as clean as possible

Fair, it is exposed, but the that is the entire reason of having the internal namespace. It is a common convention that external users shouldn't really touch it. If you want to keep the primary public header files clean there are a couple options:

I have seen many libraries have a subfolder internal / detail in their public include folder, where they keep all their internal code headers that need to be exposed. That keeps the "internal" code separate from the "public" code, if users want to read through the headers. This is a common convention used in Boost libraries. "public" headers that depend on internal headers include them from the internal subfolder.

In the current case, we have another option. Since the CaptureFileFormatDetector is only needed in the cpp part and not in the header part, we can extract it to a fully internal header, kept with the source files. This would prevent it from being exposed in the public API, but the Test project can be manually set to search for headers from "Pcap++/src" too, to allow it to link in the tests.

It is a big difference and it's not always an easy fix. I plan to add the aforementioned Zst checks in another PR after this one, and that would make zst spoofing in createReader impossible, due to zst format automatically being checked for PcapNg or Unknown contents. Therefor you can't rely on the return of createReader to find out what the return of detectFormat was, because nullptr can be returned from several paths from detectFormat return value (Unknown, Nano + unsupported, Zst + unsupported). We have already had issues with tests being silently broken (#1977 comes to mind), so I would prefer to avoid fragile tests if we can.

I'm not sure I understand... if we create fake files we know which type to expect, so all the test needs to do is verify the created file device is of the expected type 🤔

In the current case, we have another option. Since the CaptureFileFormatDetector is only needed in the cpp part and not in the header part, we can extract it to a fully internal header, kept with the source files. This would prevent it from being exposed in the public API, but the Test project can be manually set to search for headers from "Pcap++/src" too, to allow it to link in the tests.

I guess we can do that, but I still don't understand why we can't test it with createReader or tryCreateReader

Curious why wouldn't you want that?

Extracting the archived file just to verify the format is wasteful and might take a long time, especially if it's called on large files. Also - for these large files it'd mean the file is extracted twice - once in createReader and again when actually reading the file

Yes, but I am unsure if it gives a precise error message of what went wrong or just a generic failure error.

If this is indeed the case, maybe we need to fix the LightPcapNg code?

Which is as simple as calling open() inside the factory function, no? As you said, the backend already does validation, so why not reuse it for the factory validation?

Not necessarily - as far as I know open() checks mostly the header and doesn't go over the rest of the file, so a user can open a file with a correct header but corrupted data and reading the file will fail

Extracting the archived file just to verify the format is wasteful and might take a long time, especially if it's called on large files.

There is no need to extract the entire file. ZST compression works on independent frames allowing frame-by-frame (streaming) decompression. We only need to decompress the first frame to read the magic number to validate that the archive contents appear to be PcapNG. How large the total file is is irrelevant.

Incidentally, frame-by-frame is also how LightPcapNG reads the ZST archive. It decompresses a frame, reads the fully decompressed PcapNG records in it, and decompresses the next frame if needed.

If this is indeed the case, maybe we need to fix the LightPcapNg code?

Which is C code, and makes it much harder to output a readable error. Not to mention we need to deal with passing that error up the stack.

Not necessarily - as far as I know open() checks mostly the header and doesn't go over the rest of the file, so a user can open a file with a correct header but corrupted data and reading the file will fail

But it will still have passed open(). My idea isn't that createReader should validate that everything is correct. It is that it should validate just enough to guarantee that the returned device can successfully pass an open() call. The device might even be returned already opened and ready for reading, reducing the user side boilerplate.

There is no reason to return a device that can't even be opened, since the user can't do anything with it. It just adds more boilerplate as the user has to do the error handling twice.

If the records afterwards are corrupted at some point, the read should fail when the corrupted data is reached.

There is no need to extract the entire file. ZST compression works on independent frames allowing frame-by-frame (streaming) decompression. We only need to decompress the first frame to read the magic number to validate that the archive contents appear to be PcapNG. How large the total file is is irrelevant.

Incidentally, frame-by-frame is also how LightPcapNG reads the ZST archive. It decompresses a frame, reads the fully decompressed PcapNG records in it, and decompresses the next frame if needed.

I asked ChatGPT, and there's indeed a way to read just the header using ZSTD_getFrameHeader - is this what you're referring to?

Which is C code, and makes it much harder to output a readable error. Not to mention we need to deal with passing that error up the stack.

Still, I'd be in favor of fixing it in LightPcapNg. In any case - PcapPlusPlus doesn't call Zstd API directly, everything goes through LightPcapNg so I think the right way would be to fix or add it in LightPcapNg

But it will still have passed open(). My idea isn't that createReader should validate that everything is correct. It is that it should validate just enough to guarantee that the returned device can successfully pass an open() call. The device might even be returned already opened and ready for reading, reducing the user side boilerplate.

There is no reason to return a device that can't even be opened, since the user can't do anything with it. It just adds more boilerplate as the user has to do the error handling twice.

If the records afterwards are corrupted at some point, the read should fail when the corrupted data is reached.

This is one way to look at it, another approach could say - createReader should do its best to guess the right reader, and whether it can be opened or not is a question for open(). I'm not sure which approach is better so I would vote for the simplest option 🙂

I asked ChatGPT, and there's indeed a way to read just the header using ZSTD_getFrameHeader - is this what you're referring to?

Not really, but similar. From what I understand, the ZST archive has the following format:
[ZST Header] [CBlock (Compressed Block) 0] [CBlock 1] [CBlock 2] ... [CBlock N]

Those blocks need to be decompressed sequentially starting from CBlock 0. (I was wrong earlier when I mentioned they were completely independent). The thing is, the magic number of the inner file is at the start of CBlock 0, so we don't need to decompress any of the other blocks. ZSTD provides a streaming API. We can be use that to only decompress CBlock 0 to fetch the magic number of the compressed content and stop.

Still, I'd be in favor of fixing it in LightPcapNg. In any case - PcapPlusPlus doesn't call Zstd API directly, everything goes through LightPcapNg so I think the right way would be to fix or add it in LightPcapNg

It is possible to add the check to light_pcapng. It would need to do a check around the time it tries to read the header block. My question is how to communicate that error condition to the pcpp layer? From what I can see in light_pcapng_open_read, if it fails to open the error message is outputted as fprint(...) assertion messages and the return value is nullptr. I suppose we can do something similar, but then the pcpp layer gets a generic pass / fail, and not really a reason why it failed.

This is one way to look at it, another approach could say - createReader should do its best to guess the right reader, and whether it can be opened or not is a question for open(). I'm not sure which approach is better so I would vote for the simplest option 🙂

Both approaches are valid. I think the main decision point are the two questions:

In what situation do you want to make a device and not open it? I think that would be a pretty rare occasion.

What can you do with a device that can't open? I think the answer here is nothing, except to destroy it, which raises the question of why return such a device in the first place?

I updated the createReader procedure in ae9caa8 to include a openDevice parameter. I think that would do for a factory procedure for now? Users would have the option of keeping the old behaviour and manually opening the device later, but for the general case it is handled for them and they can immediately begin reading from it for ease of use.

What do you think?

It is possible to add the check to light_pcapng. It would need to do a check around the time it tries to read the header block. My question is how to communicate that error condition to the pcpp layer? From what I can see in light_pcapng_open_read, if it fails to open the error message is outputted as fprint(...) assertion messages and the return value is nullptr. I suppose we can do something similar, but then the pcpp layer gets a generic pass / fail, and not really a reason why it failed.

This is a valid question, but maybe PcapPlusPlus doesn't need to know the exact reason - it only needs to know that it failed - fprint will print the reason like GetLastError used in many libraries

Both approaches are valid. I think the main decision point are the two questions:

* In what situation do you want to make a device and not open it? I think that would be a pretty rare occasion. * What can you do with a device that can't open? I think the answer here is nothing, except to destroy it, which raises the question of why return such a device in the first place?

I updated the createReader procedure in ae9caa8 to include a openDevice parameter. I think that would do for a factory procedure for now? Users would have the option of keeping the old behaviour and manually opening the device later, but for the general case it is handled for them and they can immediately begin reading from it for ease of use.

What do you think?

This is generally true with out API - usually when creating a device the next thing to do is opening it. I originall did it to avoid throwing exceptions or handling validations in the c'tor. I'd consider createReader as a "smart constructor" that decides on the object to return, but the rest of the API should stay consistent -> the next action would be calling open()

Pcap++/src/PcapFileDevice.cpp

seladb · 2025-10-16T07:04:01Z

Tests/Pcap++Test/Tests/FileTests.cpp

 	}
 };

+PTF_TEST_CASE(TestFileFormatDetector)


Please see my previous comment. Maybe we can create a temp fake file with the expected data and run createReader()

Tests/Pcap++Test/Tests/FileTests.cpp

Tests/Pcap++Test/PcapExamples/file_heuristics/pcapng-example.pcapng

Tests/Pcap++Test/PcapExamples/file_heuristics/pcapng-with-pcap-ext.pcapng.pcap

Tests/Pcap++Test/Tests/FileTests.cpp

…e headers.

…in `createReader` instead of having the Format detector assume that is what is intended.

…t detector from libpcap behaviour.

…AII initialization.

Dimi1010 added 4 commits September 12, 2025 12:03

Added heuristics file content detector that determines the content ba…

02de760

…sed on the magic number.

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

d2b6339

…le-selection

Moved stream checkpoint outside format detector as it is not directly…

685dd9f

… tied to it.

Added a new factory function createReader that uses the new heurist…

40dee69

…ics detection method.

Dimi1010 added the enhancement label Sep 12, 2025

Add <algorithm> include.

f1e3e18

Dimi1010 added 2 commits September 12, 2025 13:17

Added unit tests.

8da1790

Deprecated old factory function.

3ad51e2

Dimi1010 added the API deprecation Pull requests that deprecate parts of the public interface. label Sep 12, 2025

Dimi1010 added 3 commits September 12, 2025 14:08

Add byte-swapped zstd magic number.

15c2000

Lint

17af8d4

Move enum closer to first usage.

46418ec

Dimi1010 marked this pull request as ready for review September 12, 2025 11:36

Dimi1010 requested a review from seladb as a code owner September 12, 2025 11:36

Dimi1010 requested review from clementperon, egecetin and tigercosmos September 12, 2025 11:36

tigercosmos approved these changes Sep 12, 2025

View reviewed changes

seladb reviewed Sep 15, 2025

View reviewed changes

Dimi1010 added 4 commits September 15, 2025 15:45

Added unit tests for file reader device factory.

3d713ab

Revert indentation.

a2391ec

Fixed StreamCheckpoint to restore original stream state.

ea328d7

Merge branch 'dev' into feature/heuristic-file-selection

db86c3e

Dimi1010 commented Sep 19, 2025

View reviewed changes

Pcap++/src/PcapFileDevice.cpp Outdated Show resolved Hide resolved

Dimi1010 added 3 commits September 20, 2025 12:59

Merge branch 'dev' into feature/heuristic-file-selection

4aed9bd

Moved isStreamSeekable helper to inside CaptureFileFormatDetector.

a83ae2b

Move it out if it needs to be reused somewhere.

Added pcap magic number for Alexey Kuznetzov's modified pcap format.

916e872

Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

022529f

…le-selection

Added tests for createReader failures.

af12d2f

Dimi1010 mentioned this pull request Oct 9, 2025

Added clang-format rule to insert braces for multiline control blocks. #1990

Closed

Dimi1010 added 5 commits October 11, 2025 00:09

Merge branch 'dev' into feature/heuristic-file-selection

b357087

Simplified pcap detection to not require to read the entire pcap header.

443c883

Added const qualifiers to detector methods.

202d5cc

Added dedicated unit tests for CaptureFileFormatDetector.

e6b2aa9

Added more tests for createReader.

b8fb635

Dimi1010 requested a review from seladb October 11, 2025 13:35

Add static assert for array indice checks.

c6c7720

seladb reviewed Oct 16, 2025

View reviewed changes

Dimi1010 added 7 commits October 18, 2025 12:28

Updated detectPcap selection.

181a8b4

Merge branch 'dev' into feature/heuristic-file-selection

76aa850

Extracted capture format detector to remove it from publicly availabl…

d8f7419

…e headers.

Fix includes.

91a7a0a

Removed duplicate files from tracking.

e275950

Lint

e7a42b5

Trimmed pcapng sample.

54f7bae

Dimi1010 linked an issue Oct 20, 2025 that may be closed by this pull request

Add indication if LightPcapNG backend is compiled with ZSTD compression support. #1973

Open

Dimi1010 mentioned this pull request Oct 20, 2025

Add indication if LightPcapNG backend is compiled with ZSTD compression support. #1973

Open

nbooster mentioned this pull request Oct 21, 2025

Unaligned packet field (UBSanitizer error report) plus warnings... #2001

Open

Dimi1010 added 10 commits October 24, 2025 09:29

Merge branch 'dev' into feature/heuristic-file-selection

93cba3d

Change PcapNGZst to ZstArchive. Zst to PcapNG branch folding is done …

3643fac

…in `createReader` instead of having the Format detector assume that is what is intended.

Added separate format value for "modified" pcap to separate the forma…

ec5980f

…t detector from libpcap behaviour.

Docs fix.

b3639a9

Merge branch 'dev' into feature/heuristic-file-selection

ce561b9

Merge branch 'dev' into feature/heuristic-file-selection

6322b41

Add automatic open functionality to createReader factory to mimic R…

ae9caa8

…AII initialization.

Merge branch 'dev' into feature/heuristic-file-selection

956b596

Update docstring.

5275349

Fix exception message assert.

a15f529

Use file content heuristics to decide file reader. #1962

Are you sure you want to change the base?

Use file content heuristics to decide file reader. #1962

Uh oh!

Conversation

Dimi1010 commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

seladb commented Sep 21, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seladb Oct 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dimi1010 Nov 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dimi1010 Nov 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Dimi1010 commented Sep 12, 2025 •

edited

Loading

codecov bot commented Sep 12, 2025 •

edited

Loading

seladb Oct 16, 2025 •

edited

Loading

Dimi1010 Nov 16, 2025 •

edited

Loading

Dimi1010 Nov 23, 2025 •

edited

Loading