-
Couldn't load subscription status.
- Fork 88
New read api #233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
New read api #233
Conversation
| if self.check == res { | ||
| Ok(()) | ||
| } else { | ||
| /* TODO: make this into our own Crc32Error error type! */ |
Check notice
Code scanning / devskim
A "TODO" or similar was left in source code, possibly indicating incomplete functionality
| where | ||
| R: Read + Seek, | ||
| { | ||
| // TODO: use .get_or_try_init() once stabilized to provide a closure returning a Result! |
Check notice
Code scanning / devskim
A "TODO" or similar was left in source code, possibly indicating incomplete functionality
| reader: R, | ||
| ) -> Result<EntryReader<R>, ZipError> | ||
| where | ||
| /* TODO: this really shouldn't be required upon construction (especially since the reader |
Check notice
Code scanning / devskim
A "TODO" or similar was left in source code, possibly indicating incomplete functionality
| let xz_reader = XzDecoder::new(buf_reader); | ||
| Ok(EntryReader::Xz(xz_reader)) | ||
| } | ||
| /* TODO: make this into its own EntryReadError error type! */ |
Check notice
Code scanning / devskim
A "TODO" or similar was left in source code, possibly indicating incomplete functionality
|
|
||
| #[allow(deprecated)] | ||
| if let CompressionMethod::Unsupported(_) = data.compression_method { | ||
| /* TODO: make this into its own EntryReadError error type! */ |
Check notice
Code scanning / devskim
A "TODO" or similar was left in source code, possibly indicating incomplete functionality
| info, | ||
| compressed_size: data.compressed_size, | ||
| }); | ||
| /* TODO: make this into its own EntryReadError error type! */ |
Check notice
Code scanning / devskim
A "TODO" or similar was left in source code, possibly indicating incomplete functionality
| } | ||
|
|
||
| if let Some(last_modified_time) = data.last_modified_time { | ||
| /* TODO: use let chains once stabilized! */ |
Check notice
Code scanning / devskim
A "TODO" or similar was left in source code, possibly indicating incomplete functionality
ea58b90 to
dcbabcc
Compare
|
I've just added the new API to the fuzz testing in 568fcab, and benchmarks confirming it achieves the same performance as the current API in 2731e7e. These benchmarks also show that decompressing entries with the stream API is just as fast as iterating over entries from a |
7f56cfe to
e882a72
Compare
|
Re point 3: I'm confused about why this is a bug. If the block has |
@cosmicexplorer I'm not sure I can properly review this PR without an answer to this question. |
That's correct, for the case in which we are only interested in local blocks, but it causes us to drop the first metadata block when we are interested in the This is actually not an indictment of I will think about whether |
|
I guess I should say another thing for context: this shouldn't really be considered a priority, and it doesn't block any current work on the crate (including #208 or #235). I continue to think it's a useful refactoring change, but given that it necessarily requires introducing a breaking API change, it might be worth leaving on the back burner for now and coming back to when there are other breaking changes we want to introduce with a major version bump. I was very interested to see whether it would produce any performance improvement (maybe it would when operating over an in-memory The part that I like the most about this (and the reason for its creation) is how the parameterized Given the idea I just had above (making
I'm going to mark this as draft until I can do the above, and ping you when I've got something for review. |
e882a72 to
f33748b
Compare
- also use git dep for zstd to pull in removal of R: BufRead bound - move new methods into unstable/read.rs - support getting aes verification info from a raw ZipEntry - flesh out other methods from ZipFile -> ZipEntry - now create a streaming abstraction - move some methods around - make AesModeInfo pub(crate) to make its intention clear - make ArchiveEntry more publicly accessible - make ZipFile impl EntryData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that I didn't have time to finish this review. Could you please resolve the merge conflicts, and then either mark this PR as ready to merge or provide an update on your plans for it?
| * tests/data/generate-king-lear-zip.py. | ||
| */ | ||
| fn get_test_data() -> ZipResult<ZipArchive<fs::File>> { | ||
| let path = Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/data/king-lear-compressed.zip"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer include_bytes! here, to eliminate the type of disk and the state of the filesystem as potential sources of noise.
Moved from #207 after weird github errors.
Problem
I believe there are several shortcomings that necessitate a slight reimplementation of much of
src/read.rs, some involving breaking API changes:Sendand/orSyncbounds fromRtoZipFile<R>is literally impossible with the current public API that type-erases todyn Read.dyn Read + Send, but that would be a lot of trouble and not worth it.RisSendand always carry adyn Read + Send, but that would be unsound andunsafeand could introduce memory unsafety.ZipFilehas a lot of stateful logic, including aZipFileReader::NoReaderstate, even necessitating aninvalid_state()helper method for error checking.ZipFilehas aDropimpl which is only used when reading stream archives (to ensure we read out the rest of the entry from the stream), but this niche use case necessitates every reader struct in the crate to provide an.into_inner()method.Dropimpl forZipFileis a bad idea, there is currently a bug inZipStreamReaderwhich drops the firstZipStreamFileMetadataentry, becauseread_zipfile_from_stream()does not save the contents of the lastZipLocalEntryBlockit reads which hasspec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE.ZipFile.ZipStreamVisitorinterface, and we can have direct control over when to iterate through each entry of the streaming zip file. That's whatzip::unstable::read::streamingprovides.StreamingArchivemaintains the internal state necessary to retain the bytes of the first metadata entry after we go through all the local file entries, fixing the bug inZipStreamReaderwhich drops the first metadata entry.ZipFileDatathat we have onZipFile, but for other structs. That's why this change introducesEntryData, which implements all the methods traditionally provided as member functions ofZipFile.ZipFileDataitself, and we can instead keep that to ourselves as an internal implementation detail.ZipFileReader::Compressed(Box<Crc32Reader<Decompressor<io::BufReader<CryptoReader<'a>>>>>)always goes through the logic forCryptoReader, when encrypted zip files are an extremely niche use case. We shouldn't be cluttering up our type definitions or our stack traces with encryption logic when most users are never going to touch it!Crc32Readeratzip2/src/crc32.rs
Line 15 in 3f6768e
Crc32Readerat all, and this change means we actually don't, instead of having to insert little workarounds and bolt on flags to disable stuff; see https://github.com/cosmicexplorer/zip/blob/04f6ced5f2197d0ef790774442e6371ab904229e/src/read.rs#L1329-L1345, where we do the more complex logic to instantiate the crypto reader in this new change:That's the entire method! No jumping into decryption code paths for the much more common case of simple decompression with crc32 checking!
So that's my case for the API changes. We can do them in pieces, but I'm proposing this intentionally breaking change that I believe will improve the internal architecture and correctness, increase our ability to perform optimizations, and generate more flexible entry structs in our public API.
More Specifically
This section was copied from #207.
As described in gyscos/zstd-rs#288 (comment), our usage of
&'a mut dyn ReadinZipFileReaderstops us from being able to implSendorSync, or generally to send aZipFileinto a separate thread than the one owning theZipArchiveit belongs to. While this is generally not an issue, as the single synchronousRead + Seekhandle makes it difficult to farm out work to subthreads, for the purposes of #72, we would like to be able to reuse some code instead of creating an entirely separate implementation. Note: #72 does not actually use this code, and this is a false argument. However, it is correct that other code using this crate, especially inasynccontexts, could really use the ability to imposeSendbounds, whichZipFileis unable to satisfy.That brings us to the second set of issues, surrounding zipcrypto support:
CryptoReaderenum and themake_crypto_reader()method, even though zip crypto is a very uncommon use case,ZipFileitself contains some very complex mutable state with both aZipFileReaderand aCryptoReader, even thoughZipFileReaderitself also contains aCryptoReader,Crc32Readerhas to maintain a special flag to disable CRC checking in the special case of AE2 encryption, which is a sign of a leaky abstraction and may easily introduce a failure to check CRC in other cases by accident in the future.Solution
Luckily, all of this becomes significantly easier without too many changes:
src/unstable/read.rsto reimplementread::ZipFilewith the newunstable::read::ZipEntry.EntryReaderandZipEntrywith a parameterized reader typeRinstead of an internal&'a mut dyn Read.ZipArchive::by_index()returningZipResult<ZipEntry<'_, impl Read + '_>>to avoid leaking internal structs while retaining the ability to rearrange the layout of our internal structs.CryptoReadercreation throughCryptoVariant.This also leads to a significantly improved refactoring of streaming support in the
streamingsubmodule ofsrc/unstable/read.rs.Result
zipcrypto support doesn't muck up the non-crypto methods,
ZipEntryis nowSend, andsrc/unstable/read.rsis significantly easier to read with equivalent functionality. This will have to be a breaking change in order to achieveSend-ability, but I believe the readability/maintainability is substantially improved with this change.TODO
ZipFilewrap adyn Readhandle produced fromZipEntry, so there's noCryptoReadertype in our non-crypto code paths.ZipStreamReaderuseStreamingArchiveinternally (generatingZipFileinstances from theZipEntrys it produces).