expose the offset of a zipfile from the start of the file as a public API #84481
Module: zipfile. Tag "Components": I am not sure "Library (Lib)" is the correct one. If it isn't, please fix it. I use Python to check zip files for malware. zipfile already handles this by finding the ZIP structure inside the file. My change just adds a new public property that exposes an internal variable: the file offset of the ZIP structure. I know I am past the code freeze of Python 2.7.18. |
This is a new feature and cannot be added to older versions which are in feature-freeze. Adding the feature to (say) Python 2.7.18 would be inconsistent, because it wouldn't exist in 2.7.0 through .17. Likewise for all the other versions before 3.9. Personally, this sounds like a nice feature to have, and your use-case sounds convincing to me. |
Hi Steven Every software "ecosystem" has its guidelines and I am a newbie about Mmh I see your concerns. I agree about your deletions of all py 3 versions About Py 2, I remark these facts:
I agree my request is an exception but I think you have to agree this I ask you please
Many thanks, Massimo |
Could something similar be achieved by looking for the earliest file header offset?

def find_earliest_header_offset(zf):
    earliest_offset = None
    for zinfo in zf.infolist():
        if earliest_offset is None:
            earliest_offset = zinfo.header_offset
        else:
            earliest_offset = min(zinfo.header_offset, earliest_offset)
    return earliest_offset

You could also adapt this using
to see if there were any sections inside the archive which were not referenced from the central directory. Not sure if zip files with arbitrary bytes inside the archive would be valid everywhere, but I think they are using zipfile. You can also have zipped content inside an archive which has a valid fileheader but no reference from the central directory. Those entries are discoverable by implementations which process content serially from the start of the file but not implementations which rely on the central directory. |
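For readers who want to try the idea above, here is a compact sketch of the same earliest-header-offset lookup. It assumes an archive made by naively prepending bytes to a zip, which `zipfile` reads by adjusting each `header_offset`. The helper `build_prefixed_zip` and the sample prefix are hypothetical and exist only for this illustration.

```python
import io
import zipfile


def earliest_header_offset(zf):
    """Return the smallest local-header offset, or None for an empty archive."""
    return min((zinfo.header_offset for zinfo in zf.infolist()), default=None)


def build_prefixed_zip(prefix):
    """Hypothetical helper: return `prefix` followed by a one-member zip."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("hello.txt", "hello")
    return prefix + archive.getvalue()


# The prefix is an invented stand-in for any payload placed before the archive.
prefix = b"#!/bin/sh\n"
with zipfile.ZipFile(io.BytesIO(build_prefixed_zip(prefix))) as zf:
    offset = earliest_header_offset(zf)
```

In this simple case the result equals the length of the prepended bytes. Archives with unreferenced data between members may behave differently.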
Sorry Massimo, there are no new features being added to 2.7, not even https://www.python.org/doc/sunset-python-2/ Python 2 is effectively now a dead project from the point of view of us You could try submitting your feature request to third-party bundlers of For what it is worth, I don't agree that this situation is exceptional. If you want this in 2.7 for your own personal use, wait for the 2.7.18 |
I am not sure it would help you. There are legitimate files which contain a payload followed by the ZIP archive (self-extracting archives, programs with embedded ZIP archives). And malware can make the offset of the ZIP archive be zero. If you want to check whether the file looks like an executable, analyze the first few bytes of the file. All executable files should start with one of the well-recognized signatures, otherwise the OS would not know how to execute them and they would not be malware. |
On Sat, 18 Apr 2020 at 04:37, Steven D'Aprano wrote: Yes, it seems obvious to me it will work only with Python 2.7.18, and I see I am used to other software where some features are backported to older Speaking in general, not only Python: if the maintainers backport that Steven, many thanks for your answers and your patience in explaining. |
Hi Serhiy, thanks for the suggestion, but I don't need to analyse different A few words about my work. I analyze ZIP archives because they are also the "incarnation" of Microsoft I always find that these kinds of files with a non-zero offset aren't strictly For us, checking the offset is very effective: we discard "bad" documents at Massimo On Sat, 18 Apr 2020 at 09:36, Serhiy Storchaka
|
Hi Daniel, could you please elaborate on the advantages of your loop versus my two lines Thanks, Massimo On Sat, 18 Apr 2020 at 03:26, Daniel Hillier wrote:
|
Hi Massimo, Unless I'm missing something about your requirements, the advantage is that Cheers, On Sat, Apr 18, 2020 at 11:36 PM Massimo Sala
|
Just check the first 4 bytes of the file. In a "normal" ZIP archive they are b'PK\3\4' (or b'PK\5\6' if it is empty). It is as reliable as checking the offset, and more efficient. It is even more reliable, because malware can have a zero ZIP archive offset, but it cannot start with b'PK\3\4'. |
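The signature check described above can be sketched as follows. The function name `starts_like_plain_zip` and the sample prefix are assumptions made for this example. The two signatures are the ones quoted in the comment, written with hex escapes.

```python
import io
import zipfile

LOCAL_FILE_HEADER = b"PK\x03\x04"   # same bytes as b'PK\3\4'
END_OF_CENTRAL_DIR = b"PK\x05\x06"  # same bytes as b'PK\5\6', empty archive


def starts_like_plain_zip(fileobj):
    """Return True if the stream begins with one of the two ZIP signatures."""
    fileobj.seek(0)
    return fileobj.read(4) in (LOCAL_FILE_HEADER, END_OF_CENTRAL_DIR)


# Build three in-memory samples: a normal zip, an empty zip, a prefixed zip.
plain = io.BytesIO()
with zipfile.ZipFile(plain, "w") as zf:
    zf.writestr("a.txt", "data")

empty = io.BytesIO()
zipfile.ZipFile(empty, "w").close()

# The prefix bytes are invented; any non-ZIP leading data would do.
prefixed = io.BytesIO(b"MZ-not-a-real-exe" + plain.getvalue())

results = [starts_like_plain_zip(f) for f in (plain, empty, prefixed)]
```

This only inspects the first four bytes. It does not validate the rest of the archive.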
Sorry, I forgot to mention one specific case. I agree your tip can be useful to other readers. On Sat, 18 Apr 2020 at 15:45, Serhiy Storchaka
|
I chose to use the internal variable *concat* because
Mmh
We can try / except but we lose the computation.
Yes, indeed. If I am right about the pros of my patch, I stand by it. Many thanks for your attention. On Sat, 18 Apr 2020 at 15:45, Daniel Hillier wrote:
|
Skimming the issue, I think what was being asked for here is a way to expose the offset of the zipfile from the start of the file as a documented public API. Is that accurate? Does anyone still want this feature? A PR against main would be useful. |
I think it is useful to be able to interact with the prefix portion of a file that has a zip file suffix. I've opened #132165 with a patch to fix this. |
* Add ZipFile.data_offset attribute
  This attribute provides the offset to zip data from the start of the file, when available.
* Add blurb-it
* Try fixing class ref in NEWS
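As a rough illustration of how the new attribute might be used to read the prefix portion of a file, here is a minimal sketch. It assumes `ZipFile.data_offset` is present only on Python versions that include this change (3.14 alphas onward, per the thread) and falls back to `None` elsewhere. The function `read_prefix` and the sample bytes are hypothetical.

```python
import io
import zipfile


def read_prefix(fileobj):
    """Return the bytes before the ZIP data, or None if the offset is unknown."""
    with zipfile.ZipFile(fileobj) as zf:
        # getattr keeps this sketch usable on versions without data_offset.
        offset = getattr(zf, "data_offset", None)
    if offset is None:
        return None
    fileobj.seek(0)
    return fileobj.read(offset)


# Build a sample: invented prefix bytes followed by a one-member archive.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("member.txt", "payload")
prefix = b"#!/bin/sh\nexit 0\n"
result = read_prefix(io.BytesIO(prefix + archive.getvalue()))
```

On a Python that has the attribute, `result` should match the prepended bytes. On older versions the sketch returns `None` rather than guessing.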
thanks, merged! |
I'm re-opening this because of that: https://github.com/python/cpython/pull/132165/files#r2030278958. @emmatyping Can you take care of either amending the docs or making sure that the attribute is correctly defined (whether it's None or not) independently of the opening mode? TiA |
@emmatyping, how is the offset calculated? I'm trying to build Python 3.14.0a7. I run the tests twice: once during the build, where the tests pass, and then using the installed Python. In the second case the tests testing the offset fail. In the CI run, this is the outcome:

FAIL: test_data_offset_with_exe_prepended (test.test_zipfile.test_core.TestDataOffsetPrependedZip.test_data_offset_with_exe_prepended)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.14/test/test_zipfile/test_core.py", line 3431, in test_data_offset_with_exe_prepended
    self._test_data_offset(self.exe_zip)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/usr/lib64/python3.14/test/test_zipfile/test_core.py", line 3428, in _test_data_offset
    self.assertEqual(zipfp.data_offset, 713)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 717 != 713
======================================================================
FAIL: test_data_offset_with_exe_prepended_zip64 (test.test_zipfile.test_core.TestDataOffsetPrependedZip.test_data_offset_with_exe_prepended_zip64)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.14/test/test_zipfile/test_core.py", line 3434, in test_data_offset_with_exe_prepended_zip64
    self._test_data_offset(self.exe_zip64)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.14/test/test_zipfile/test_core.py", line 3428, in _test_data_offset
    self.assertEqual(zipfp.data_offset, 713)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 717 != 713

Do you have any pointers as to how to approach this? |
@befeleme can you provide more info on your platform, and the steps you took to run the tests on the installed Python? When I You can also set Also, the code is here: https://github.com/python/cpython/blob/main/Lib/zipfile/__init__.py#L1486-L1494 |
One way to reproduce the issue is to install the built Python packages to a Fedora container.
This ends up with the above AssertionError. |
Output with
|
I'm afraid the tests pass for me. @befeleme What CPU is your host? I am on an x86_64 machine. |
I've tested this on a few more systems I have access to (all x86_64) and I cannot reproduce the failure. |
I also have an x86_64 machine. I took the latest container with Fedora Rawhide today (coming from quay.io), installed Python 3.14.0a7 available in the repositories and reran test_zipfile, with the same result of two failing tests.
I see the same failure on my machine, on Fedora CI, and running the container on Debian, x86_64 machine. |
The problem is that the Fedora Python specfile changes the shebang of two files:
It replaces This issue is unrelated to Python and is specific to the Fedora specfile (RPM). |
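To see why a rewritten shebang can move the expected offset (717 instead of 713 in the failing tests), here is a small sketch under stated assumptions. Both shebang lines below are invented for illustration and are not the ones Fedora uses. The helper `first_header_offset` is hypothetical, and the archive is built by plain concatenation rather than copied from the CPython test data.

```python
import io
import zipfile


def first_header_offset(prefix):
    """Build prefix + one-member zip in memory; return the first header offset."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("member.txt", "payload")
    with zipfile.ZipFile(io.BytesIO(prefix + archive.getvalue())) as zf:
        return zf.infolist()[0].header_offset


# Invented shebangs: the second one is exactly 4 bytes longer than the first.
short = first_header_offset(b"#!/opt/py\nprint('stub')\n")
longer = first_header_offset(b"#!/opt/py-3.x\nprint('stub')\n")
shift = longer - short
```

Any change in the length of the prepended script shifts the ZIP data by the same number of bytes, so a test that hard-codes the offset is sensitive to shebang rewriting.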
I'm closing this issue as completed in this case. Thanks for the help Victor! |
Do the test files need to be executable? |
Python and the Python test suite don't need the scripts to be executable, but you can run these 2 test scripts. Example:
|
Would you accept a PR removing the executable bit? (I don't know if Fedora is the only distribution mangling the shebangs, but I imagine this can unexpectedly hit other downstream packagers). |
Turns out |