segfault in docker when used with torch and matplotlib #381
Comments
Is there a reason you wish to use |
There isn't a particular reason, but the fact is most users don't know the existence of "GUI conflicts" doesn't seem like the right cause, given that the following also segfaults in the same environment:
from ctypes import cdll
import torch
try:
    import cv2
except:
    pass
cdll.LoadLibrary("/lib64/libgcc_s.so.1")
or the following also segfaults:
from ctypes import cdll
def load(x):
    print("Loading", x)
    try:
        cdll.LoadLibrary(x)
    except Exception as e:
        print("\tFailed", e)
    else:
        print("\tSucc")
load("/opt/app-root/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so")
load("/opt/app-root/lib/python3.6/site-packages/cv2/../opencv_python.libs/libcrypto-354cbd1a.so.1.1")
load("/lib64/libgcc_s.so.1") |
This repository and the packages are probably the first results when you google for opencv-python. The README clearly explains different available packages. Therefore, I would expect that pytorch maintainers (or any package maintainers / users who use these packages) would spend a few minutes of their time to read through the provided documentation. There's certainly some kind of conflict somewhere in those binaries (some overlapping symbols maybe...?) and it's likely that the headless version works because it does not use the same symbols as I'll try to have a look into this when I have more time. |
I agree with you that this is not a problem for those who are able to reach here and find out about "opencv-python-headless", but I'm mainly speaking from the perspective of an average user, who may:
A majority of users match some of the above points in my experience. Addressing issues like this would help them a lot! Thanks |
As I wrote, I will address the issue once I have enough free time to look deeper into it. |
Ah, the root cause was a lot easier to find than I anticipated. The If you set the contents of
Traceback (most recent call last):
  File "a.py", line 1, in <module>
    import cv2
  File "/opt/app-root/lib/python3.6/site-packages/cv2/__init__.py", line 5, in <module>
    from .cv2 import *
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
Fixed Dockerfile:
FROM centos/python-36-centos7
USER root
WORKDIR /root
# This is needed, since opencv-python depends on the full X11 stack
RUN yum update -y && yum install mesa-libGL -y
RUN pip install --upgrade pip
RUN pip install opencv-python
RUN pip install -U matplotlib
RUN pip install torch==1.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
COPY a.py /root/a.py
RUN python -X faulthandler a.py
This happens because the Again, I would recommend using |
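A quick way to check for the missing libGL dependency before importing cv2 is to ask Python whether the library can be located at all. This is only a minimal sketch: the helper name `missing_shared_libs` is made up for illustration, and `ctypes.util.find_library` uses its own search heuristics (ldconfig, the compiler, ld), so it approximates what the dynamic loader will do at import time and is not guaranteed to match it.

```python
import ctypes.util


def missing_shared_libs(names):
    """Return the library names that ctypes.util.find_library cannot locate.

    Hypothetical helper for illustration; not part of opencv-python.
    """
    # find_library returns None when it finds no matching shared library.
    return [name for name in names if ctypes.util.find_library(name) is None]


# "GL" corresponds to libGL, the library named in the ImportError above.
print(missing_shared_libs(["GL"]))
```

If "GL" shows up in the printed list, the image probably lacks the package that provides libGL (mesa-libGL in the CentOS Dockerfile above).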
I have seen the error, but I don't think that justifies causing a segfault? An import failure is not supposed to have any side effects. Missing X11 shouldn't cause segfault either. Moreover, using the other example I posted with:
also reproduces the segfault. It reports a missing "libz.so", and if "opencv_python.libs/libz.so" was loaded before this line, it will run without segfault. So it reads more like a problem about the initialization of certain .so libs. (A bit more background: two projects I maintained optionally depend on opencv, with |
I have no control over the initialization order of the dynamically loaded libraries.
Additionally, you might have not seen this behaviour in older While I was testing your Dockerfile, I found out that you can avoid the issue by just commenting out the GDB output for original
Next, after removing the whole
So, something odd going on with If import order is changed, everything is ok:
import torch
import matplotlib.pyplot as plt
error = None
try:
    import cv2
except Exception as e:
    error = e
print(error)
yields:
I have no time to dig deeper currently, but changing the import order will fix the issue temporarily. There is some side effect somewhere during the |
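Since a segfault kills the interpreter, comparing import orders is easier when each candidate runs in its own child process. The following is a rough sketch under assumptions: `crash_signal` and `compare_orderings` are hypothetical helper names, and it relies on the POSIX convention that `subprocess` reports a child killed by signal N with return code -N.

```python
import signal
import subprocess
import sys


def crash_signal(snippet, timeout=120):
    """Run `snippet` in a fresh interpreter.

    Returns the name of the signal that killed the child (e.g. "SIGSEGV"),
    or None if the child exited on its own, even with an error.
    """
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        timeout=timeout,
    )
    # On POSIX, a negative return code -N means "killed by signal N".
    if result.returncode < 0:
        return signal.Signals(-result.returncode).name
    return None


def compare_orderings(orderings):
    """Map each label to the crash signal (or None) of its snippet."""
    return {label: crash_signal(code) for label, code in orderings.items()}
```

For example, `compare_orderings({"torch first": "import torch\nimport cv2", "cv2 first": "import cv2\nimport torch"})` would report which ordering, if any, dies from a signal in a given container. A plain ImportError in the child is reported as None, because the process exits normally in that case.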
Thanks for the investigation! I also tried a few versions just now: it seems all the
What's interesting is that, in 3.4.10.37 |
Auditwheel behavior could have changed between those releases. It's not a perfect tool, there are issues (also libz related, see https://github.com/pypa/auditwheel/issues) but it does its job reasonably well. I rebuild occasionally the extended About zlib: pypa/auditwheel#152 |
Some more investigations using the following code:
Running this docker, it shows that:
> 7ff0c820b000-7ff0c8282000 r--p 00000000 08:01 2901798 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libcrypto-354cbd1a.so.1.1
> 7ff0c8282000-7ff0c8431000 r-xp 00077000 08:01 2901798 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libcrypto-354cbd1a.so.1.1
> 7ff0c8431000-7ff0c84bc000 r--p 00226000 08:01 2901798 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libcrypto-354cbd1a.so.1.1
> 7ff0c84bc000-7ff0c84bd000 ---p 002b1000 08:01 2901798 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libcrypto-354cbd1a.so.1.1
> 7ff0c84bd000-7ff0c84ea000 rw-p 002b1000 08:01 2901798 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libcrypto-354cbd1a.so.1.1
> 7ff0c84ea000-7ff0c84ee000 rw-p 00000000 00:00 0
> 7ff0c84ee000-7ff0c8523000 rw-p 00334000 08:01 2901798 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libcrypto-354cbd1a.so.1.1
> 7ff0c8523000-7ff0c853f000 r--p 00000000 08:01 2901801 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libssl-1eebc6e1.so.1.1
> 7ff0c853f000-7ff0c858e000 r-xp 0001c000 08:01 2901801 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libssl-1eebc6e1.so.1.1
> 7ff0c858e000-7ff0c85a8000 r--p 0006b000 08:01 2901801 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libssl-1eebc6e1.so.1.1
> 7ff0c85a8000-7ff0c85a9000 ---p 00085000 08:01 2901801 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libssl-1eebc6e1.so.1.1
> 7ff0c85a9000-7ff0c85b6000 rw-p 00085000 08:01 2901801 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libssl-1eebc6e1.so.1.1
> 7ff0c85b6000-7ff0c85c6000 rw-p 000a8000 08:01 2901801 /usr/local/lib/python3.8/site-packages/opencv_python.libs/libssl-1eebc6e1.so.1.1
That seems to be either a bug in the so files, or a bug in dlopen. Do you have any ideas what might be wrong with the so files, or where this could be escalated (e.g. could it be a bug in auditwheel?) |
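Listings like the one above are easier to compare when the segments are grouped per library. Below is a small sketch, not the code used in this thread (that was not preserved here): `segments_by_library` is a hypothetical helper that parses text in the `/proc/<pid>/maps` format and collects the permission string of every named mapping. It also tolerates the leading "> " quoting seen above.

```python
from collections import defaultdict


def segments_by_library(maps_text):
    """Group permission strings of named mappings by their path.

    Expects lines in /proc/<pid>/maps format:
        address perms offset dev inode [path]
    Lines without a path (anonymous mappings) are skipped.
    """
    segments = defaultdict(list)
    for line in maps_text.splitlines():
        # Drop any leading quote marker ("> ") before splitting the columns.
        fields = line.lstrip("> ").split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping or blank line
        _addr, perms, _offset, _dev, _inode, path = fields
        segments[path.strip()].append(perms)
    return dict(segments)
```

On Linux, `segments_by_library(open("/proc/self/maps").read())` run after the failing import lists which segments of libcrypto and libssl are still mapped in the current process.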
I don't have any new ideas currently. I think the next step would be to open an issue to the auditwheel repo. However, I'll keep this issue open as well. One thing to note is that there is a conflict between |
It looks like a result of this glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=20839, because:
Given that many popular versions of glibc still cannot correctly handle SO files with |
Thanks for debugging the issue. I'll add the flag to the Dockerfiles and rebuild them. I'll also post download links for the wheels here when I have them available. |
Seems to be working now at least with the original Dockerfile:
FROM centos/python-36-centos7
USER root
WORKDIR /root
RUN yum update -y
RUN pip install --upgrade pip
RUN pip install https://opencvpythonartifacts.blob.core.windows.net/c001042fefc533f7a68c24c249972dcb2596305b/opencv_python-4.4.0+c001042-cp36-cp36m-manylinux2014_x86_64.whl
RUN pip install -U matplotlib
RUN pip install torch==1.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
COPY a.py /root/a.py
RUN python -X faulthandler a.py
Python 3.8 wheel if you wish to test it: https://opencvpythonartifacts.blob.core.windows.net/c001042fefc533f7a68c24c249972dcb2596305b/opencv_python-4.4.0+c001042-cp38-cp38-manylinux2014_x86_64.whl |
I will publish a new release once you have confirmed that this solution works for you. |
Thanks! That has fixed the problem in two containers that had this issue. |
The fix is included now in the latest releases. |
Is anybody from this thread able to explain how to fix this? I am receiving the same segmentation fault for python3.8 as well as for 3.9.
This is my requirements.txt:
My base image |
Steps to reproduce
It works if I use opencv-python-headless instead.
Issue submission checklist
opencv-python