
An open-source structured video analysis framework: VideoPipe

Author: Not bald programmer

In this era of ubiquitous video, we enjoy the convenience of personalized recommendations: every refresh surfaces video content that seems tailor-made. Have you ever wondered about the "intelligent screening" behind it? What technology lets a system accurately capture your interests in a vast sea of videos and find the right clips from just a few keywords? Beneath this seemingly simple browsing experience lies computers' in-depth analysis and understanding of video content.

How does a computer understand massive amounts of video?

Video is essentially a sequence of image frames played back at a certain frame rate, which produces the effect of continuous motion. Computer analysis of video can be broken down into three core steps:

"1. Decoding: Conversion of video to image frames"

The video is first decoded, a process that breaks down the continuous stream of motion into frames of static images.

"2. Analysis/Reasoning: The Magic of AI Algorithms"

The decoded image frames then move to the analysis phase, where AI comes into play. Using deep learning and other machine learning algorithms, computers can not only recognize basic elements in the images, such as objects, faces, and text, but also understand the scene context, detect actions, and even analyze emotions and intent. This makes it possible to tag video content, generate summaries, and extract topics.

"3. Code"

The image frames that have undergone specific processing (annotations, filters, special effects, and so on) are reassembled into a video.
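
To make these three steps concrete, here is a minimal sketch using OpenCV. The file names are placeholders, and stamping the frame index stands in for the real AI analysis step:

#include <opencv2/opencv.hpp>
#include <string>

int main() {
    // 1. Decoding: open the video and pull out frames one by one
    cv::VideoCapture cap("input.mp4");            // input path is a placeholder
    if (!cap.isOpened()) return -1;

    double fps = cap.get(cv::CAP_PROP_FPS);
    cv::Size size((int)cap.get(cv::CAP_PROP_FRAME_WIDTH),
                  (int)cap.get(cv::CAP_PROP_FRAME_HEIGHT));

    // 3. Encoding: write processed frames back into a video container
    cv::VideoWriter writer("output.mp4",
                           cv::VideoWriter::fourcc('m', 'p', '4', 'v'),
                           fps, size);

    cv::Mat frame;
    int index = 0;
    while (cap.read(frame)) {
        // 2. Analysis: a real system would run AI inference here;
        //    drawing the frame index onto the image stands in for that step
        cv::putText(frame, "frame " + std::to_string(index++),
                    cv::Point(20, 40), cv::FONT_HERSHEY_SIMPLEX,
                    1.0, cv::Scalar(0, 255, 0), 2);
        writer.write(frame);
    }
    return 0;
}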


Although this may sound like just a few steps, many technical details and complex algorithms are involved. For example, how do you quickly deploy a trained AI image model to a real application scenario? For programmers who have never touched computer vision (hereinafter referred to as CV), or algorithm engineers who work purely on algorithms, implementing AI video analysis features can be difficult. Yet as video becomes ever more widespread in daily life, the need to process and analyze video data keeps growing.


Today, I would like to introduce an open-source structured video analysis framework, VideoPipe, which aims to make developing video analysis applications as easy as building a web application with Django.

https://github.com/sherlockchou86/VideoPipe

VideoPipe is a framework for video analysis and structuring, written in C++ with few dependencies and easy to integrate. It is designed as a pipeline in which nodes are independent of each other and can be combined freely, letting you build different types of video analysis applications such as video structuring, image search, face recognition, and behavior analysis in the traffic/security field (for example, traffic incident detection).


Introduction to VideoPipe

VideoPipe is similar to NVIDIA's DeepStream and Huawei's mxVision frameworks, but it's easier to use and more portable. It's written entirely in native C++ and relies on only a handful of popular third-party modules (such as OpenCV).

VideoPipe adopts a plug-in-oriented coding style: independent plug-ins (Node types in the framework) can be combined according to different needs to build different types of video analysis applications. You just need to prepare the model and understand how to parse its output; inference can then be implemented on different backends, such as OpenCV's cv::dnn module (the default), TensorRT, PaddleInference, or ONNXRuntime, whichever you prefer (a minimal inference sketch follows the feature list below). The following figure shows how VideoPipe works.

[Figure: how VideoPipe works]

As you can see, it offers the following features:

  • Stream read/push: supports multiple real-time streaming protocols, such as UDP, RTSP, and RTMP.
  • Video decoding/encoding: Integrates OpenCV and GStreamer libraries to provide high-performance video and image encoding and decoding capabilities, and supports hardware acceleration to ensure real-time and smooth video processing.
  • Deep learning-based algorithm inference: built-in support for a variety of deep learning models, including object detection, image classification, and feature extraction, providing the computing power for intelligent analysis of video content (see the inference sketch after this list).
  • Object tracking: integrates tracking algorithms such as IOU (Intersection over Union) matching and SORT (Simple Online and Realtime Tracking) for stable, accurate tracking of moving objects (the IOU computation is sketched after this list as well).
  • Behavior analysis (BA): builds on object tracking to analyze specific behaviors, such as traffic violations (line crossing, illegal parking) and crowd flow, providing a decision-making basis for traffic management and security monitoring.
  • Data broker: Efficiently forwards the analyzed structured data (such as JSON, XML, or custom formats) to a specified destination for subsequent data storage, analysis, or display.
  • Recording & Screenshots: Automatically record videos for a specific time period or capture keyframe screenshots as needed.
  • On-Screen Display (OSD): overlays model outputs on the video frames, such as drawing boxes around detected targets and annotating behavior analysis results, improving user interaction and system transparency.
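
As a taste of what "prepare the model and understand how to parse its output" means in practice, here is a minimal, hypothetical sketch of one backend, OpenCV's cv::dnn. The model path, input size, and normalization are assumptions, and a real model's preprocessing and output layout must come from its documentation; this is a generic classifier forward pass, not VideoPipe's own node code:

#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>
#include <cstdio>

int main() {
    // load a hypothetical ONNX classification model (path is a placeholder)
    cv::dnn::Net net = cv::dnn::readNet("./models/classifier.onnx");

    cv::Mat image = cv::imread("frame.jpg");
    // preprocessing: the scale factor and input size are assumptions,
    // check what your model actually expects
    cv::Mat blob = cv::dnn::blobFromImage(image, 1.0 / 255.0,
                                          cv::Size(224, 224),
                                          cv::Scalar(), true, false);
    net.setInput(blob);

    // parse the output: assume one score per class in a 1xN matrix
    cv::Mat scores = net.forward();
    cv::Point class_id;
    double confidence;
    cv::minMaxLoc(scores, nullptr, &confidence, nullptr, &class_id);

    std::printf("class %d, score %.3f\n", class_id.x, confidence);
    return 0;
}

Swapping in TensorRT or ONNXRuntime would change only this inference step; the surrounding pipeline stays the same.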
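The IOU matching used for tracking is also easy to picture: it scores how much two boxes overlap, and a tracker assigns each detection to the track whose last box overlaps it most. A minimal sketch with cv::Rect follows; the sample boxes are made up:

#include <opencv2/core.hpp>
#include <cstdio>

// IOU: intersection area divided by union area of two boxes
static double iou(const cv::Rect& a, const cv::Rect& b) {
    double inter = (a & b).area();   // cv::Rect overloads & as rectangle intersection
    double uni = a.area() + b.area() - inter;
    return uni > 0 ? inter / uni : 0.0;
}

int main() {
    // match a new detection against a track's last known box
    cv::Rect last_track(100, 100, 50, 80), detection(110, 105, 50, 80);
    std::printf("IOU = %.3f\n", iou(last_track, detection));
    return 0;
}

SORT builds on the same idea, adding a Kalman filter that predicts where each track's box should be before matching.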

Get started quickly

VideoPipe is not picky about hardware: whether on a high-end server equipped with a professional accelerator card or an ordinary CPU-only computer, it runs smoothly. The project also includes a number of detailed samples; the one below shows how to use the framework to quickly build a face recognition application.

/*
* Name: 1-1-N sample
* Full code: samples/1-1-N_sample.cpp
* Description: 1 video input, 1 video analysis task (face detection and recognition),
*              2 outputs (on-screen display / RTMP streaming)
* Note: prepare the model and video files yourself
*/

int main() {
    // logging configuration
    VP_SET_LOG_INCLUDE_CODE_LOCATION(false);
    VP_SET_LOG_INCLUDE_THREAD_ID(false);
    VP_LOGGER_INIT();

    // 1. Create the nodes
    // Video source node:
    // reads the video stream from a local file (./test_video/10.mp4)
    auto file_src_0 = std::make_shared<vp_nodes::vp_file_src_node>("file_src_0", 0, "./test_video/10.mp4", 0.6);
    // 2. Model inference nodes
    // Primary inference: face detection with the pre-trained model face_detection_yunet_2022mar.onnx
    auto yunet_face_detector_0 = std::make_shared<vp_nodes::vp_yunet_face_detector_node>("yunet_face_detector_0", "./models/face/face_detection_yunet_2022mar.onnx");
    // Secondary inference: face recognition, extracting face features with face_recognition_sface_2021dec.onnx
    auto sface_face_encoder_0 = std::make_shared<vp_nodes::vp_sface_feature_encoder_node>("sface_face_encoder_0", "./models/face/face_recognition_sface_2021dec.onnx");
    // 3. OSD node
    // draws the face recognition results onto the video frames
    auto osd_0 = std::make_shared<vp_nodes::vp_face_osd_node_v2>("osd_0");
    // displays the processed video on the local screen
    auto screen_des_0 = std::make_shared<vp_nodes::vp_screen_des_node>("screen_des_0", 0);
    // pushes the stream via RTMP to the given server (rtmp://192.168.77.60/live/10000)
    auto rtmp_des_0 = std::make_shared<vp_nodes::vp_rtmp_des_node>("rtmp_des_0", 0, "rtmp://192.168.77.60/live/10000");

    // Build the pipeline: connect the nodes in processing order to form a data processing chain.
    // Video data flows from the source node through face detection and face recognition
    // to the OSD node, and is then output to the screen and the RTMP stream at the same time.
    yunet_face_detector_0->attach_to({file_src_0});
    sface_face_encoder_0->attach_to({yunet_face_detector_0});
    osd_0->attach_to({sface_face_encoder_0});

    // the pipeline splits automatically; results go to the screen and the RTMP stream
    screen_des_0->attach_to({osd_0});
    rtmp_des_0->attach_to({osd_0});

    // start the pipeline
    file_src_0->start();

    // visualize the pipeline
    vp_utils::vp_analysis_board board({file_src_0});
    board.display();
}

As the code above shows, the VideoPipe framework abstracts video analysis/processing into a pipe: each processing step is a node in the pipeline. The processing flow is as follows:

  1. Video Reading (Node): Reads video data from a file or network stream and performs preliminary decoding processing to prepare it for subsequent analysis.
  2. Model Inference (Node): Encapsulates the inference process of deep learning models, making the integration of advanced features such as facial recognition straightforward and efficient.
  3. OSD (Node): visualizes the analysis results and graphically superimposes the recognized face information on the video frame.
  4. Build a pipeline: Connect nodes in a logical order to form a complete processing link.
  5. Startup and monitoring: the whole pipeline is started by starting its source node (here, the video file source). A built-in visualization board also lets developers monitor the pipeline's running state intuitively.

[Figure: pipeline status board, on-screen display window, and RTMP player output]

After the code runs, three windows appear, as shown in the figure above: the pipeline state diagram (auto-refreshing), the on-screen display result (GUI), and the player display result (RTMP). With that, you're up and running with VideoPipe!
