Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched domains of object detection include face detection/recognition, pedestrian detection, image annotation, vehicle counting, activity recognition, video object co-segmentation and so on. It is also used in tracking objects, for example tracking a ball during a football match, tracking movement of a cricket bat, or tracking a person in a video.
Object detection approaches generally fall into two categories: neural network-based and non-neural. Non-neural approaches require first defining features, then using a technique such as a support vector machine (SVM) to do the classification. In this project, we emphasize the neural techniques, which are able to do end-to-end object detection without explicitly defining features.
YOLO (You Only Look Once) is a popular real-time object detection algorithm. It looks at the entire image at once, and only once, which allows it to capture the context of detected objects. It combines what was once a multi-step process into a single neural network that performs both classification and bounding-box prediction for detected objects. As such, it is heavily optimized for detection speed and runs much faster than pipelines with two separate neural networks for detecting and classifying objects, at the cost of some recognition precision, which is acceptable in general cases.
This advantage makes it an ideal choice for real-time detection applications. Additionally, YOLO learns generalizable representations of objects, making it applicable to a variety of new environments. Without loss of generality, we adopt the YOLOv5 model as the per-frame object detector in this project.
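To make the single-pass idea concrete, here is a minimal sketch (illustrative, not the project's actual code) of decoding one row of YOLOv5's raw output tensor: each candidate row carries box coordinates, an objectness score, and per-class scores, so localization and classification come out of the same forward pass.

#include <algorithm>

// One decoded candidate: box center/size, final confidence, class index.
struct Detection { float cx, cy, w, h, conf; int cls; };

// row points to (5 + numClasses) floats: cx, cy, w, h, objectness, class scores.
Detection decodeRow(const float* row, int numClasses) {
    const float* scores = row + 5;
    int cls = static_cast<int>(std::max_element(scores, scores + numClasses) - scores);
    // final confidence = objectness * best class score
    return {row[0], row[1], row[2], row[3], row[4] * scores[cls], cls};
}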
DeepSORT is a tracking-by-detection algorithm that considers both the bounding-box parameters of the detection results and appearance information about the tracked objects to associate the detections in a new frame with previously tracked objects. It is an online/real-time tracking algorithm: it only uses information about the current and previous frames to make predictions about the current frame, without needing to process the whole video at once.
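As an illustration of the association step, the sketch below (simplified from the DeepSORT paper, not taken from the project code) combines a motion term and an appearance term into the matching cost; lambda is a tunable weight between the two.

#include <cstddef>
#include <vector>

// Cosine distance between a detection's appearance embedding and a track's
// stored embedding; the re-ID network is assumed to output L2-normalized vectors.
double cosineDistance(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) dot += a[i] * b[i];
    return 1.0 - dot;
}

// Combined cost used to fill the assignment matrix before matching detections
// to tracks; the motion term is the Mahalanobis distance from the Kalman filter.
double associationCost(double motionDist, double appearanceDist, double lambda) {
    return lambda * motionDist + (1.0 - lambda) * appearanceDist;
}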
The Kalman filter is the key to DeepSORT. It computes the best compromise between a prediction and a measurement, where the prediction is just one step of a motion model (at least for tracking). The catch is that the prediction is not made into the future, but from the past state to the present one. The filter then compares this prediction with the latest measurement (i.e., the detection result), which gives two estimates of the position, both with uncertainties; combining them yields a better estimate than either alone.
The measurement uncertainty is hard to get right on image data and is usually just selected empirically. Note that the state's uncertainty is also updated by the filter, which shifts the relative weight placed on the motion model versus the measurements. The best part of the Kalman filter is that it is recursive: we use the motion model to predict the current state, then use the current measurement to correct that prediction, and the corrected state seeds the next prediction.
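Below is a minimal constant-velocity Kalman filter sketch using Eigen (already a project dependency). DeepSORT's actual filter tracks an 8-dimensional bounding-box state, but the predict/update cycle is the same; the noise values here are placeholders chosen for illustration.

#include <Eigen/Dense>

// 1-D constant-velocity filter: state x = [position, velocity], measurement = position.
struct Kalman1D {
    Eigen::Vector2d x = Eigen::Vector2d::Zero();             // state estimate
    Eigen::Matrix2d P = Eigen::Matrix2d::Identity();         // state uncertainty
    Eigen::Matrix2d Q = Eigen::Matrix2d::Identity() * 1e-2;  // process noise (assumed)
    double R = 1.0;                                          // measurement noise (empirical)

    // Predict: advance the state from the previous frame to the current one.
    void predict(double dt) {
        Eigen::Matrix2d F;
        F << 1, dt,
             0, 1;                          // constant-velocity motion model
        x = F * x;
        P = F * P * F.transpose() + Q;      // uncertainty grows with motion
    }

    // Update: blend the prediction with the measurement z, weighted by their
    // uncertainties through the Kalman gain K.
    void update(double z) {
        Eigen::RowVector2d H(1.0, 0.0);                 // we only observe position
        double S = P(0, 0) + R;                         // innovation covariance H*P*H' + R
        Eigen::Vector2d K = P * H.transpose() / S;      // Kalman gain
        x = x + K * (z - x(0));                         // corrected estimate
        P = (Eigen::Matrix2d::Identity() - K * H) * P;  // shrunk uncertainty
    }
};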
TensorRT is an SDK for optimizing trained deep learning models to enable high-performance inference. It contains a deep learning inference optimizer for trained models and a runtime for execution. Once we have the trained deep learning models, TensorRT lets us run them with higher throughput and lower latency. In this project, TensorRT is leveraged to parse and accelerate the inference of the pretrained YOLO and DeepSORT models.
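As a rough illustration of the runtime side (TensorRT C++ API, error handling omitted), the sketch below deserializes a prebuilt engine file; inference then runs through an execution context.

#include <NvInfer.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};
static Logger gLogger;

// Read a serialized .engine file and deserialize it into a CUDA engine.
nvinfer1::ICudaEngine* loadEngine(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}
// Inference then goes through engine->createExecutionContext() and
// context->enqueueV2() with device buffers bound to the model's inputs/outputs.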
Nvidia Jetson is a series of embedded computing boards. Jetson boards are low-power systems designed for accelerating machine learning applications. In this project, we use a Jetson TX2 as the development/deployment target.
- Model: Jetson TX2
- CPU: Dual-core NVIDIA Denver 2 64-bit + quad-core ARM® Cortex®-A57 MPCore
- GPU: 256-core NVIDIA Pascal™ GPU architecture
- Memory: 8GB 128-bit LPDDR4 Memory 1866MHz - 59.7 GB/s
- Storage: 32GB eMMC 5.1
- OS: Ubuntu 20.04
- JetPack: 5.0.1 DP
- CUDA: 11.4.14
- cuDNN: 8.3.2
- TensorRT: 8.4.0.11 (DP)
- Triton: 22.03
We temporarily skip the model training phase (training from scratch) and use pre-trained models from community resources, for the following reasons:
- Due to limited test data and insufficient hardware resources, it would be quite time-consuming to start from scratch.
- The open-source versions of YOLOv5 and DeepSORT have been proven to yield satisfying performance and accurate predictions.
- If necessary in the near future, we can still fine-tune the pre-trained networks on customized data to achieve better results.
There are three main options for converting a model with TensorRT:
- TF-TRT: For converting TensorFlow models, the TensorFlow integration (TF-TRT) provides both model conversion and a high-level runtime API, and has the capability to fall back to TensorFlow implementations where TensorRT does not support a particular operator.
- Programmatic ONNX conversion from .onnx files: A more performant option for automatic model conversion and deployment is to convert using ONNX. TensorRT supports automatic conversion from ONNX files using either the TensorRT API or trtexec. ONNX conversion is all-or-nothing: all operations in the model must be supported by TensorRT, or users must provide custom plugins for unsupported operations. The end result of ONNX conversion is a single TensorRT engine, which incurs less overhead than using TF-TRT.
- Manually constructing a network using the TensorRT API (in either C++ or Python): In this case, the network architecture is built by calling TensorRT APIs, and a weight file is then used to define the edge weights. By combining the network structure and the weights, the original neural network model is restored, and TensorRT can handle it for the next step, deployment.
In this project, we decided to use approach 2, since the pretrained networks in .onnx format can be successfully translated by TensorRT.
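The conversion can also be done offline with the trtexec tool shipped with TensorRT. The file names below match the defaults used later in this report; --fp16 is an optional flag that builds a half-precision engine for the Jetson GPU:

trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s.engine --fp16
trtexec --onnx=deepsort.onnx --saveEngine=deepsort.engine --fp16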
According to the explanation above, the program workflow is straightforward and easy to understand (a minimal code sketch follows the list):
- The original input can be a live video capture or an offline video file, both handled through the OpenCV API;
- The YOLOv5 engine runs inference to detect all objects in the current frame;
- The DeepSORT engine tracks all moving objects;
- Bounding boxes are drawn to mark all identified objects;
- The OpenCV API renders the final result to the display;
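The sketch below mirrors this workflow; Yolov5Detector and DeepSortTracker are hypothetical placeholder names for the project's engine wrappers, not its actual class names.

#include <opencv2/opencv.hpp>

int main(int argc, char** argv) {
    cv::VideoCapture cap(argv[1]);               // offline file, or a device index for a camera
    Yolov5Detector detector("yolov5s.engine");   // hypothetical YOLOv5 engine wrapper
    DeepSortTracker tracker("deepsort.engine");  // hypothetical DeepSORT engine wrapper
    cv::Mat frame;
    while (cap.read(frame)) {
        auto detections = detector.detect(frame);         // detect objects in the frame
        auto tracks = tracker.update(detections, frame);  // associate detections with tracks
        for (const auto& t : tracks)                      // mark each identified object
            cv::rectangle(frame, t.box, cv::Scalar(0, 255, 0), 2);
        cv::imshow("result", frame);                      // render the final result
        if (cv::waitKey(1) == 27) break;                  // ESC quits
    }
    return 0;
}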
The structure of the project is described as follows:
- yolo: the header/source code for the YoloV5 model;
- deepsort: the header/source code for the DeepSort model;
- include: the common header directory across the whole project;
- src: the source code for the main module;
- build: the directory holding the output binaries, including the final result;
- resources: the directory holding the model .onnx files and video clips used for testing;
We use the following sources to generate our test videos:
- 7AM London bus ride through the heart of London - Oxford St, Piccadilly, Big Ben - Bus Route 453 🚌 - YouTube
- Amsterdam street view Elandsgracht - Centrum - YouTube
- Waikiki, Hawaii, USA | Street Walk | Shopping and Waikiki Beach - YouTube
- 🔴Tokyo Walk 24/7🚶♀️【🌈Urban Paradise】🏙 Shibuya, Shinjuku, Ikebukuro, etc.😊💖Please Subscribe!💖⬇︎⬇︎⬇︎ - YouTube
- Paris Drive 4K - Sunset Drive - France - YouTube
The project depends on the following libraries:
- OpenCV library;
- Boost library;
- Eigen3 library;
- CUDA/cuDNN/TensorRT libraries (usually installed by default);
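On Ubuntu 20.04, the first three can typically be installed from the standard repositories (package names assumed):

sudo apt-get install libopencv-dev libboost-all-dev libeigen3-dev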
Build the project with CMake:
cd tensorrtx/yolov5
# update CLASS_NUM in yololayer.h if your model is trained on a custom dataset
mkdir build
cd build
cmake ..
make
Running the built executable is straightforward:
./yolosort video_file_input_path
Here are some notes:
- Put the .onnx or .engine files in the same directory as the executable file.
- The default names for the model files are deepsort[.onnx/.engine] and yolov5s[.onnx/.engine].
- If the input network files have the .onnx suffix, the program will first try to convert them to TensorRT engine files.
- If the input network files are already engine files, the program goes directly to the inference procedure.
- In these experiments, offline video files are used. For real-world deployment, we can switch to a live camera by simply constructing the capture with a device index instead of a file path:
cv::VideoCapture cap(0);
Here are some screenshots from our experiments; please check the video file for the full results. From the result videos, we can draw the following conclusions:
- The program outputs results at XXX FPS, which falls within the range we consider acceptable.
- Under human inspection, the output boxes cover most recognizable objects that appear in the video.
We believe our current implementation has achieved our goal at this moment.
Our future work is oriented toward the following aspects:
- continuously optimizing performance so that the program stays fast when there are many (100+) recognizable objects;
- applying more sophisticated industry models, such as YOLOX, SSD, and so on;
- adapting to more Jetson models and runtime versions: TensorRT engine files are not portable across different versions, so it is necessary to pre-compile different engine files for different runtime versions in order to reduce deployment effort.
- the code package: http://20.112.98.130:8080/code.zip
- the demo video: http://20.112.98.130:8080/demo.mp4
- You Only Look Once: Unified, Real-Time Object Detection https://arxiv.org/abs/1506.02640
- Simple Online and Realtime Tracking with a Deep Association Metric https://arxiv.org/abs/1703.07402
- TensorRT https://developer.nvidia.com/tensorrt
- DeepStream https://developer.nvidia.com/deepstream-sdk