FLOROS: Flow Residuals for Open Set Dynamic Object Tracking
Qingyuan Li | Juan Rached
MIT 6.8300 Final Project
Advances in autonomous Unmanned Aerial Vehicle (UAV) and Unmanned Ground Vehicle (UGV) navigation are improving our quality of life. The integration of this technology into emergency services, search and rescue, first response, law enforcement, and agriculture has accelerated over the past decade, promising a safer future. UAV/UGV navigation has also had an impact on the delivery industry, where UAV/UGV deliveries enable the prompt dispatch of essential products, such as medicine, to remote communities that are difficult to reach through conventional shipping. However, this is still an early technology, with several limitations and open research questions. In particular, dynamic obstacle avoidance for UAVs/UGVs is a longstanding problem in robotics.
Current state-of-the-art methods rely on event cameras, stereo cameras, depth cameras, or LiDAR. However, power and payload limitations restrict the perception and compute capacity of unmanned vehicles, particularly UAVs. These constraints force a difficult trade-off between latency and performance for real-time applications that remains an open question in the field. Some methods favor efficiency while others favor performance, but the field has yet to adopt a general procedure for estimating the motion of dynamic obstacles. As a consequence, a large number of path-planning algorithms assume static environments, which is a strong assumption given that we live in a dynamic world. Other methods perform dynamic obstacle avoidance by assuming the state of the dynamic obstacle is known at all times; this works in a laboratory setting, where a motion capture system can provide such estimates, but breaks down in real-world applications. Therefore, a scheme for estimating the motion of dynamic obstacles from sensory input is indispensable if we are to deploy a complete system. Some approaches employ Kalman filters to recover the position and velocity of obstacles in the field of view from point clouds, but these methods are either restrictive, allowing robust planning only around vehicles with known dynamics, or brittle, if nothing is assumed about the obstacles' dynamics.
There are two core challenges in estimating the motion of objects from a moving platform:
- Separating the image motion induced by the agent's own movement (ego-motion) from the motion of the obstacles themselves.
- Recovering the obstacles' motion from the remaining visual signal using only onboard sensing and compute.
Unfortunately, this becomes increasingly difficult when performing highly dynamic, high-speed motion, as is the norm in the UAV literature. At those speeds, odometry latency and inaccuracy build up quickly, and motion blur can pollute the visual signal, which complicates the task of separating agent from obstacle motion. In this project, we propose using a combination of RAFT optical flow and the geometric optical flow induced by scene flow to solve both of the aforementioned problems with just a frame camera.
We compute the residuals from RAFT optical flow and the geometric optical flow induced by scene flow, using solely RGBD and odometry data. The latter assumes a static scene, so subtracting the flow produced by the motion of the vehicle produces residuals that theoretically capture only the movement of the dynamic objects in the scene. Below is a detailed discussion on how each of the modules involved in our approach works and the role they play in the pipeline.
Optical flow assigns pixel correspondences between two frames by solving the optimization problem below:
\[ E = \min_{\mu} \left[ \left(\nabla \mathbf{I}_1^{T}\mu + (\mathbf{I}_2-\mathbf{I}_1)\right)^2 + \alpha\left(|\nabla \mu_x|^2+|\nabla \mu_y|^2\right) \right] \]
where the first term enforces correspondences based on the color constancy assumption, which states that the color of each point in the scene does not change between frames, and the second term promotes smooth, uniform flow by encouraging each flow vector to stay close to the average of its neighboring vectors. Common pitfalls of classic optical flow include:
- Sensitivity to lighting changes, which violate the color constancy assumption.
- Degraded estimates under motion blur.
- The cost of solving the optimization for every frame pair, which is prohibitive for real-time use.
As a consequence, it has become the norm to replace classical optical flow with learned models, the most notable of which is RAFT. RAFT extracts per-pixel features and builds multi-scale 4D correlation volumes. This rich representation helps the model generalize and reduces the effect of lighting changes and blur on the flow estimates, and its compact form also makes computing the flow between frames faster than solving the original optimization problem.
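As a rough illustration, the snippet below shows one way to obtain RAFT flow using the pretrained models that ship with torchvision; the checkpoints from the original RAFT repository (raft-chairs, raft-things, raft-sintel, raft-kitti, compared in the evaluation below) can be dropped in analogously, and the function name here is only illustrative.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

# Pretrained RAFT from torchvision; the original RAFT repository checkpoints
# can be loaded instead with the same calling convention.
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def raft_flow(img1, img2):
    """img1, img2: (B, 3, H, W) uint8 RGB tensors, H and W divisible by 8.
    Returns (B, 2, H, W) flow from frame 1 to frame 2, in pixels."""
    img1, img2 = preprocess(img1, img2)
    return model(img1, img2)[-1]   # last element of the iterative refinements
```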
We obtain our flow residuals by computing the difference between the RAFT optical flow and the "geometric optical flow", or optical flow
resulting solely from camera-induced scene flow.
For each pixel \( [u, v] \) with corresponding depth \( d \) in the first frame, we first unproject the pixel coordinates into 3D space using the depth data and camera intrinsics:
\[ p_1=d K^{-1} [u, v, 1]^T \]
We assume that \( p_1 \) is static in the 3D scene. We transform the 3D point \( p_1 \), which is in the reference frame of the camera at frame 1, into the reference frame of the camera at frame 2. Given that
the robot pose is \( T_{world}^{r1} \) at frame 1 and \( T_{world}^{r2} \) at frame 2, and that the transform between the odometry frame and camera frame is \( T_{odom}^{cam} \),
we can transform the point as follows:
\[ T_{r2}^{r1} = (T_{world}^{r2})^{-1} \cdot T_{world}^{r1} \]
\[ p_2 = (T_{odom}^{cam})^{-1} \cdot T_{r2}^{r1} \cdot T_{odom}^{cam} \cdot [p_1^T \; 1]^T \]
Finally, we project the point \( p_2 \) into the image plane to get its pixel location in the second frame:
\[ [u', v'] = [f_x \frac{p_{2,x}}{p_{2,z}} + c_x, f_y \frac{p_{2,y}}{p_{2,z}} + c_y] \]
where \( f_x, f_y \) are the focal lengths of the camera and \( c_x, c_y \) are the coordinates of the camera center in the image plane.
Then, \( [u', v'] - [u, v] \) gives us the geometric optical flow at \( [u, v] \) in frame 1.
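The sketch below summarizes this computation in NumPy, assuming the relative camera transform \( (T_{odom}^{cam})^{-1} \cdot T_{r2}^{r1} \cdot T_{odom}^{cam} \) has already been assembled from odometry; the function and argument names, and the handling of invalid depth, are illustrative rather than our exact implementation.

```python
import numpy as np

def geometric_flow(depth, K, T_cam2_cam1):
    """Optical flow induced purely by camera motion, assuming a static scene.

    depth:       (H, W) frame-1 depth in meters (<= 0 where invalid)
    K:           (3, 3) camera intrinsics
    T_cam2_cam1: (4, 4) transform from the frame-1 camera frame to the
                 frame-2 camera frame, built from odometry as above
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)

    # Unproject: p1 = d * K^-1 [u, v, 1]^T
    p1 = np.linalg.inv(K) @ pix * depth.reshape(1, -1)

    # Move the (assumed static) points into the frame-2 camera frame
    p1_h = np.vstack([p1, np.ones((1, p1.shape[1]))])
    p2 = (T_cam2_cam1 @ p1_h)[:3]

    # Reproject into the frame-2 image plane
    uv2 = K @ p2
    uv2 = uv2[:2] / uv2[2:]

    flow = (uv2 - pix[:2]).T.reshape(H, W, 2)   # geometric optical flow, pixels
    p2_z = p2[2].reshape(H, W)                  # depth in frame 2, used later
    valid = depth > 0                           # no flow where depth is missing
    return flow, p2_z, valid
```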
The key observation is the assumption that \( p_1 \) is static in the 3D scene. If \( p_1 \) moves \( \delta = [\delta_x, \delta_y, \delta_z] \) (in the camera reference frame at frame 2)
between frame 1 and frame 2, we get the following image projection at frame 2:
\[ [u'_{dynamic}, v'_{dynamic}] = [f_x \frac{p_{2,x} + \delta_x}{p_{2,z} + \delta_z} + c_x, f_y \frac{p_{2,y} + \delta_y}{p_{2,z} + \delta_z} + c_y] \]
In the video below, RAFT optical flow is juxtaposed with geometric optical flow.
Where the geometric optical flow is white, it cannot be computed because there is no depth data (we capped the
max depth at 6.0m; see Appendix I.).
Since RAFT estimates the actual observed optical flow from frame 1 to frame 2, it will return an optical flow close to \( [u'_{dynamic}, v'_{dynamic}] - [u, v] \) at \( [u, v] \).
Then, the residual (RAFT minus geometric flow) at \( [u, v] \) is \( [u'_{dynamic}, v'_{dynamic}] - [u', v'] \).
Although we could theoretically recover the depth of \( [u'_{dynamic}, v'_{dynamic}] \) from the depth data for frame 2, in practice, noisy depth data and imperfect
pixel associations make this a noisy process (see Appendix II.). Instead, we assume that the object maintains constant depth between frames (\( \delta_z = 0 \)) and use the residual to recover only \( [\delta_x, \delta_y] \):
\[ [\delta_x, \delta_y] = p_{2,z} ([u'_{dynamic}, v'_{dynamic}] - [u', v']) \cdot [\frac{1}{f_x}, \frac{1}{f_y}] \]
This assumption works reasonably well in practice, although it is not effective at detecting objects moving along the camera Z axis.
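As a sanity check, subtracting the static projection from the dynamic one with \( \delta_z = 0 \) gives
\[ [u'_{dynamic}, v'_{dynamic}] - [u', v'] = \left[ f_x \frac{\delta_x}{p_{2,z}}, \; f_y \frac{\delta_y}{p_{2,z}} \right] \]
which rearranges directly into the recovery formula above.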
Finally, given \( v=\|[\delta_x, \delta_y]\| \) as an estimate for the velocity of the 3D point corresponding to each pixel, we can mask areas of the image with
\( v \) higher than a set threshold to obtain masks of dynamic objects in frame 1.
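In code, the per-pixel speed estimate amounts to a few array operations; the sketch below assumes the flows and the frame-2 depths come from functions like those above, and the names are again illustrative.

```python
import numpy as np

def residual_speed(raft_flow, geo_flow, p2_z, fx, fy):
    """Per-pixel speed estimate (meters per frame) from the flow residual.

    raft_flow, geo_flow: (H, W, 2) flows in pixels
    p2_z:                (H, W) depth of each point in the frame-2 camera frame
    """
    residual = raft_flow - geo_flow
    delta_x = p2_z * residual[..., 0] / fx   # delta_z = 0 assumed
    delta_y = p2_z * residual[..., 1] / fy
    return np.hypot(delta_x, delta_y)        # multiply by the framerate for m/s
```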
Below is a video of our prototype of this method running on data from Kimera-Multi [10], where the right image shows flow residual magnitudes masked by the velocity threshold. To track the motion of dynamic objects across frames, we initially fed the residuals to the point tracking algorithm from [7]. However, this introduced unnecessary computation, as RAFT optical flow is accurate enough to estimate the motion of each obstacle between the current and next frame. The details are discussed below.
We implemented the above method in a pipeline that takes standard ROS bags as input, with flow and residuals computed in batches as PyTorch tensors on the GPU for performance. Once velocity magnitudes are computed from the residual flow, we threshold them to obtain a binary mask of dynamic objects. Because depth camera noise increases with depth, the (tunable) threshold scales with depth: \( \mathrm{thresh} = \mathrm{min\_thresh} + d \cdot \mathrm{thresh\_depth\_gain} \)
Because of the noise in odometry and depth data, we improved the performance by post-processing the binary mask with OpenCV morphological operations. Opening removes small points of noise, while closing removes small holes in the mask. Once we have extracted a processed binary mask for the frame, we use scipy.ndimage.label to label connected components. Each connected component is extracted as a mask and assumed to be a dynamic object. The pixels in the mask are unprojected to get a 3D pointcloud representing the dynamic object. We discard objects with a maximum axial variance outside a set range to further remove noise.
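A condensed sketch of the thresholding and post-processing steps just described is shown below; the threshold, kernel size, and variance-range values are placeholders rather than our tuned parameters.

```python
import numpy as np
import cv2
from scipy import ndimage

def extract_objects(speed, depth, K, min_thresh=0.05, thresh_depth_gain=0.01,
                    var_range=(1e-3, 1.0)):
    # Depth-scaled velocity threshold -> raw binary mask of dynamic pixels
    thresh = min_thresh + depth * thresh_depth_gain
    mask = ((speed > thresh) & (depth > 0)).astype(np.uint8)

    # Morphological opening (remove specks) then closing (fill small holes)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # Connected components -> candidate dynamic objects as 3D point clouds
    labels, n = ndimage.label(mask)
    K_inv = np.linalg.inv(K)
    objects = []
    for i in range(1, n + 1):
        v, u = np.nonzero(labels == i)
        d = depth[v, u]
        pts = (K_inv @ np.stack([u, v, np.ones_like(u)]).astype(float) * d).T
        # Reject components whose largest axial variance falls outside the range
        if not (var_range[0] <= pts.var(axis=0).max() <= var_range[1]):
            continue
        objects.append(pts)
    return objects
```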
To facilitate tracking, we compute frame-by-frame associations between currently tracked dynamic objects and newly detected dynamic objects.
At each iteration, we predict the 3D location of currently tracked dynamic objects by propagating their masks into the next image frame using optical flow,
then using the next frame's depth data to unproject the masks into predicted point clouds. When dynamic objects are detected in the next frame, we unproject their masks
in that frame to get detected point clouds. We then use the Hungarian algorithm, a global nearest-neighbor assignment with a cost based on the distance between point-cloud centroids, to match newly detected dynamic objects with currently tracked ones.
We augment the cost matrix and set costs to a large value if the point clouds are more than
a set distance apart, so that newly detected objects not close to any currently tracked objects are not matched and become new tracked objects. Otherwise, currently
tracked objects are updated with their new detections.
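The association step can be sketched as follows; here the distance gating is applied directly on a rectangular cost matrix (which scipy's solver accepts) rather than through explicit augmentation, and `max_dist` is a placeholder value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracked_centroids, detected_centroids, max_dist=1.0):
    """Match tracked objects to new detections by centroid distance.

    tracked_centroids:  (T, 3) predicted centroids of currently tracked objects
    detected_centroids: (D, 3) centroids of newly detected objects
    Returns (matches, unmatched_detections); unmatched detections start new tracks.
    """
    BIG = 1e6
    cost = np.linalg.norm(
        tracked_centroids[:, None, :] - detected_centroids[None, :, :], axis=-1)
    cost = np.where(cost > max_dist, BIG, cost)   # gate far-apart pairs

    rows, cols = linear_sum_assignment(cost)
    matches, unmatched = [], set(range(len(detected_centroids)))
    for r, c in zip(rows, cols):
        if cost[r, c] < BIG:                      # gated pairs are not real matches
            matches.append((r, c))
            unmatched.discard(c)
    return matches, sorted(unmatched)
```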
Representative videos of our algorithm running on the data we collected are shown below. Top left and right are the RGB and depth images. Bottom right is the velocity residual magnitude, and bottom left is the post-processed binary mask. Overlaid on the images are the tracked dynamic objects and their IDs. Noise is still present, but much of it is filtered out.
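For reference when reading the tables that follow, the percentage columns are consistent with
\[ det\% = \frac{N_{det}}{N_{gt}}, \qquad fp\% = \frac{N - N_{corr}}{N} \]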
Model | \( N_{gt} \) | \( N_{det} \) | \( N \) | \( N_{corr} \) | \( det\% \) | \( fp\% \) |
---|---|---|---|---|---|---|
raft-kitti | 1283 | 1041 | 1416 | 1058 | 81.14% | 25.28% |
raft-things | 1283 | 1085 | 1383 | 1110 | 84.57% | 19.74% |
raft-sintel | 1283 | 1088 | 1384 | 1114 | 84.80% | 19.51% |
raft-chairs | 1283 | 672 | 1173 | 702 | 52.38% | 40.15% |
Framerate | \( N_{gt} \) | \( N_{det} \) | \( N \) | \( N_{corr} \) | \( det\% \) | \( fp\% \) |
---|---|---|---|---|---|---|
15Hz | 1283 | 1088 | 1384 | 1114 | 84.80% | 19.51% |
10Hz | 849 | 715 | 900 | 731 | 84.22% | 18.78% |
6Hz | 513 | 416 | 530 | 422 | 81.09% | 20.38% |
Method | \( N_{gt} \) | \( N_{det} \) | \( N \) | \( N_{corr} \) | \( det\% \) | \( fp\% \) |
---|---|---|---|---|---|---|
Original | 1283 | 1088 | 1384 | 1114 | 84.80% | 19.51% |
No Morphology | 1283 | 1102 | 2626 | 1294 | 85.89% | 50.72% |
No axial variance-based threshold | 1283 | 1119 | 1662 | 1183 | 87.22% | 28.82% |
No depth-based velocity threshold gain* | 1318 | 1097 | 1605 | 1112 | 83.23% | 30.72% |
Method | \( N_{gt} \) | \( N_{det} \) | \( N \) | \( N_{corr} \) | \( det\% \) | \( fp\% \) |
---|---|---|---|---|---|---|
Original | 1283 | 1088 | 1384 | 1114 | 84.80% | 19.51% |
2-Consecutive Requirement | 1283 | 1080 | 1224 | 1093 | 84.18% | 10.70% |
We would like to thank Professor Jonathan How for his guidance throughout this project and for generously providing access to the resources that we utilized. Special thanks as well to Lucas Jia for assisting us with physical experiments and data collection.
See [dynamic-object-detection on Github] for our final implementation code. Instructions for downloading our data and running the algorithm and evaluations are in the README.
The prototype code is also [hosted on Github].
Method | \( N_{gt} \) | \( N_{det} \) | \( N \) | \( N_{corr} \) | \( det\% \) | \( fp\% \) |
---|---|---|---|---|---|---|
Original | 1283 | 1088 | 1384 | 1114 | 84.80% | 19.51% |
3D residual | 1283 | 939 | 35857 | 1400 | 73.19% | 96.10% |
Depth Limit (m) | \( N_{gt} \) | \( N_{det} \) | \( N \) | \( N_{corr} \) | \( det\% \) | \( fp\% \) |
---|---|---|---|---|---|---|
6.0 | 1283 | 1088 | 1384 | 1114 | 84.80% | 19.51% |
7.5 | 1306 | 1087 | 1434 | 1113 | 83.23% | 22.38% |
10.0 | 1306 | 1049 | 1317 | 1075 | 80.32% | 28.82% |
15.0 | 1306 | 1042 | 1288 | 1068 | 79.79% | 17.08% |