The project started with a basic setup: Flask server, OpenCV for capture, and a USB-based media capture device. The video pipeline was shuttling frames through too many conversions - capture, process, save to disk, HTTP reference, render on page - making the entire streaming solution effectively useless for our needs.
Having each frame captured, processed, saved to a file directory, then referenced via HTTP before finally being rendered on a Flask page is fine - up to a certain point. This setup had reached its limits, both for project complexity and for my own patience and sanity. We needed something much more direct, with fewer handoffs between components - yesterday.
(tldr on MIPI - it's a fast, lightweight protocol for packaging and transmitting image data between camera sensors and processors on smartphones and embedded devices)
Since we were still working with the Raspberry Pi, an HDMI-to-CSI-2 converter seemed like the logical next step - the RPi has tons of documentation for working with MIPI via ArduCam inputs. This would let us use the captured video directly with fewer conversion steps, reducing latency by eliminating those file system and HTTP bottlenecks.
We went with a Geekworm HDMI-CSI2 converter, a fresh change from the USB capture method. This bridge device is powered by the Toshiba TC358743XBG, a CMOS digital IC that converts HDMI input into MIPI CSI-2 output. I still have nightmares where the datasheet chases after me with a bat, taunting me like a little freak.
The converter in all its glory. Cute little blighter - shame you were so troublesome. After checking the FFCs a million times over, I was still running into corrupted frame buffers. Nothing made sense… the device was being recognized but kept throwing errors over misalignment, so I wrote an EDID config to explicitly define correct timings and formats. For anyone unfamiliar, having to mess with EDID configs means you've already crossed into a technical wasteland where most folks would bail cause it's kinda like telling your hardware "no, you're actually this other thing lmao" and hoping it believes you. In my case, I was SOL.
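For reference, the whole dance on a stock RPi looks roughly like this - enable the TC358743 overlay, hand the driver an EDID, then ask it to lock onto whatever timings the source negotiates. The device node and EDID filename below are placeholders, and your overlay parameters may differ:

# Enable the TC358743 overlay on a regular Pi (append to /boot/config.txt, then reboot)
echo "dtoverlay=tc358743" | sudo tee -a /boot/config.txt

# Load a hand-written EDID so the HDMI source offers a mode the bridge can actually handle
v4l2-ctl -d /dev/video0 --set-edid=file=edid_1080p60.txt --fix-edid-checksums

# Query the timings the chip now detects, then tell the capture pipeline to use them
v4l2-ctl -d /dev/video0 --query-dv-timings
v4l2-ctl -d /dev/video0 --set-dv-bt-timings query

# Confirm the pixel format the bridge is producing before pointing anything at it
v4l2-ctl -d /dev/video0 --get-fmt-video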
The recommendations I received ranged from dedicated Lontium display chips to going the full FPGA route (LMAO)… it was becoming clear that this approach was adding more complexity than it solved. The MIPI rabbit hole was getting deeper by the day, and I was spending more time fighting the hardware than solving our actual latency problem.
In retrospect, using a Compute Module 4 would have probably murked it off rip. The CM4 has far better support for these kinds of bridge chips - the firmware support on the regular RPis is largely outdated, while setup on the CM4 is a much more straightforward process, without all the hacky workarounds.
By this point, I'd sunk nearly two weeks into the MIPI approach with diminishing returns. Time to pivot.
Don't be fooled by the name - the Orange Pi isn't actually a Raspberry Pi knock-off. It's a versatile, palm-sized single-board computer with eMMC + M.2 support, dedicated video processing (with up to 8K video decoding!), and plenty of GPIO pins, making it closer to a mini-desktop-homunculus than a microcontroller or a restricted IoT device*.
*These devices are lovely, if you can find the right open-source software and tweak the firmware to make it work for you.
We followed a lead on the Orange Pi 5 Plus and soon left the RPi in the rearview mirror. This was my first time working with an SBC from Rockchip, a semiconductor manufacturer whose components show up in everything from Unitree's Go-series robot dogs and their humanoid lineup to, reportedly, a captured Russian FPV drone… yikes. (Note: Rockchip has since largely pulled out of direct SBC sales.)
The Orange Pi offered something crucial that made all the difference: native HDMI input. This completely bypassed the need for an HDMI bridge that was causing data quality issues from lane splitting and merging. It supported 4K output, USB-OTG capability, and would run a full Linux OS - essentially everything we needed. After the MIPI struggle, this felt like finding water in the desert.
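Sanity-checking that was refreshingly boring after the bridge saga - the HDMI input just shows up as a regular V4L2 capture device. The /dev/video0 node below is an assumption; list the devices first to find yours:

# List capture devices; the HDMI RX should appear as its own /dev/videoN node
v4l2-ctl --list-devices

# See what pixel formats and resolutions the HDMI input advertises
v4l2-ctl -d /dev/video0 --list-formats-ext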
Rockchip's hardware acceleration with their Hantro VPU backend opened up possibilities I hadn't considered before. We could push full-resolution display output at 120 fps if we wanted. But local high-quality display wasn't enough… we needed to access this stream from anywhere… anytime… which meant yet another layer of complexity.
Here's where the real engineering challenge began. With a device capable of encoding and decoding captured media at high resolutions and refresh rates, I needed a consumption node to broadcast the stream anywhere. Initially, I explored several streaming frameworks.
While each could technically handle network streaming, I was after something open-source, modular, and grounded in a decentralized design philosophy. After testing multiple options, MediaMTX emerged as the clear winner due to its easy-to-manage network topologies, thorough documentation, and completely open-source codebase.
I set the Orange Pi up as the publisher node, with a subscriber node on a Google Cloud VM instance. I tried a few different protocols - RTMP, WebRTC, HLS - but ultimately went with RTSP, one of MediaMTX's many supported protocols.
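The moving parts are pleasantly simple, at least on paper: MediaMTX sits on the VM waiting for a publisher, and anything that speaks RTSP can read the stream back off it. Roughly (addresses and path names are placeholders):

# On the cloud VM (the subscriber/relay node): run MediaMTX with its default config.
# RTSP listens on :8554 out of the box; a mediamtx.yml next to the binary is picked up automatically.
./mediamtx

# From anywhere that can reach the VM: quick check that a path is actually being published
ffprobe rtsp://<vm-address>:8554/mystream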
Since the publisher node was running behind Tailscale, connecting the two isolated networks called for a STUN/TURN setup. For those unfamiliar, these are servers that help devices on different networks establish a direct connection (STUN) or relay traffic when a direct path isn't possible (TURN) - essential for real-time video.
After using a STUN-TURN diagnostic tool, it turned out a STUN server alone was enough to get both nodes talking. Once we set the right ports and correctly addressed the nodes, this part of the setup was solid. One small victory tucked in a massive series of defeats.
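Most of "correctly addressing the nodes" was just confirming the overlay network actually saw both machines - something like this, with the node name as a placeholder:

# Confirm both machines are on the tailnet and can reach each other
tailscale status
tailscale ping <vm-node-name>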
With the networking configured, it was time to actually stream content to the node. I turned to FFmpeg - or rather, a fork: ffmpeg-rockchip. This version added support for hardware-accelerated video encoding and decoding by tapping directly into Rockchip's VPU (Hantro) instead of traditional CPU-bound H.264 codecs. (This is honestly gonna need its own dedicated article - jesus, what an absolute shitshow.)
I'd used Ninja, CMake, and other lower-level build systems before for smaller projects, but boy was I spoiled. (A similar experience that would later come back to haunt me when wrestling with Catkin and catkin_ws during ROS 1 Noetic builds for ABB robot libraries). I got sucked deep into dependency and versioning hell, and had just about maxxed out the number of yerbs I could consume in one day (please sponsor me Guayaki, I love you, Bluephoria Yerba Mate).
Here's what I was up against: build errors with cryptic messages, incompatibilities in required libraries, cross-compilation failures, version conflicts… it never seemed to end. God bless open source embedded devs - are y'all doing ok? Apparently, Joshua Riek's had enough. Every step felt like two steps backward, one step forward. But after the nth time cleaning and rebuilding, I had it: a minimal FFmpeg build with support for hardware-accelerated encoding, decoding, and RGA. The battle was won, but the war wasn't over…
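For the curious, the working build boiled down to something in this shape (a rough sketch assuming the widely used nyanmisaka fork; the exact configure flags and the MPP/RGA library prerequisites are documented in that repo, so treat this as an outline rather than a recipe):

# Grab the fork and configure a minimal build with the Rockchip bits enabled
git clone https://github.com/nyanmisaka/ffmpeg-rockchip.git
cd ffmpeg-rockchip
./configure --prefix=/usr --enable-gpl --enable-version3 --enable-libdrm --enable-rkmpp
make -j$(nproc)
sudo make install

# Verify the hardware encoder actually made it into the build
ffmpeg -encoders | grep rkmpp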
I spent days crawling through under-documented repos, hunting for obscure flags and hitting dead ends. The whole time I kept thinking there had to be an easier way, but also knew that if it were easy, it wouldn't be worth writing about lol. After countless iterations, I parsed together this command:
ffmpeg -hwaccel rkmpp -hwaccel_output_format drm_prime -afbc rga \
-f v4l2 -input_format nv12 -i /dev/video0 -vf crop=896:1944 -c:a copy -strict -2 \
-c:v h264_rkmpp -rc_mode vbr -b:v 6m -maxrate 6m -bufsize 12m -profile:v high -g:v 120 \
-f rtsp rtsp://none_of_your_business:yr_mom's_port/mystream
Breaking down a few key parts:
-hwaccel rkmpp: Tells FFmpeg to use Rockchip's hardware acceleration
-hwaccel_output_format drm_prime: Specifies the memory model for hardware acceleration
-afbc rga: Enables ARM Frame Buffer Compression using the RGA (Raster Graphics Acceleration) unit
-c:v h264_rkmpp: Uses hardware H.264 encoding via Rockchip's MPP

I ran it on my Orange Pi, listened for connections from the MediaMTX VM to start serving it, then visited the URL in my browser - and there it was - the stream. Not just any stream, but a goddamn near real-time one. My beautiful baby stream. And as Nicki would say:
"All you bitches (media nodes, frame buffers, VPU cores) is my sons."
The latency dropped from 3 seconds to practically nothing. After weeks of technical hell, seeing that real-time video feed felt like god's mercy; washing your hands in warm water after a long hike, or catching the last train for the night right as the doors close... A fleeting, glorious moment from an otherwise unglamorous venture.
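If you want to eyeball that latency yourself on the consuming end, a bare-bones low-latency ffplay invocation is enough (these are generic low-latency flags, not gospel, and the address is a placeholder):

# Pull the stream with buffering mostly disabled so you see roughly what the camera sees
ffplay -rtsp_transport tcp -fflags nobuffer -flags low_delay rtsp://<vm-address>:8554/mystream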
Was it worth the dependency hell, the multiple devices, the custom compiling, the obscure flags? For a project where real-time feedback was critical, hell yeah brother! For anything else, probably not.
But now I'm carrying around this bizarre specialized knowledge about hardware acceleration for a relatively niche SoC and its peripherals - knowledge that might be completely obsolete in 18 months. Isn't that the true essence of modern technical work, though? We spend months becoming experts in tech stacks that'll be replaced before we've even updated our LinkedIn profiles. Though I love getting in the weeds, it seems like the deeper you dive into specialized tech problems, the more ephemeral your expertise becomes - especially in the age of just dumping several complete codebases and docs into a 1M+ context window LLM to fetch quick answers for a one-off problem. (But if you're looking for a good way to do exactly that… I'd recommend GitIngest.)
Despite all the hubbub, there's a peculiar satisfaction in solving these kinds of technical puzzles that feels different from regular development work. Elegant architecture and clean code sound great, but I admit I take far too much joy in wrestling with poorly documented hardware, quirky drivers, and esoteric configurations until they finally submit to my will.
And for those brief moments when everything finally works, it's absolutely worth it.