FLOROS: Flow Residuals for Open Set Dynamic Object Tracking
Qingyuan Li, Juan Rached
MIT 6.8300 Final Project
Outline

Introduction

Related Works

Methodology

Experiments

Results

Conclusion

References

Codebase

Appendix


Introduction

Advances in autonomous Unmanned Aerial Vehicle (UAV) and Unmanned Ground Vehicle (UGV) navigation are improving our quality of life. The integration of this technology into emergency services, search and rescue, first response, law enforcement, and agriculture has seen accelerated growth over the past decade, promising a safer future. UAV/UGV navigation has also had an impact on the delivery industry, where UAV/UGV deliveries enable the prompt dispatch of essential products, such as medicine, to remote communities that are difficult to reach through conventional shipping. However, this is still an early technology with several limitations and open research questions. In particular, dynamic obstacle avoidance for UAVs/UGVs is a longstanding problem in robotics.

Current state-of-the-art methods rely on event cameras, stereo cameras, depth cameras, or lidar. However, power and payload limitations restrict the perception and compute capacity of unmanned vehicles, particularly UAVs. These constraints force a difficult balance between latency and performance for real-time applications that remains an open question in the field. Some methods favor efficiency while others favor performance, but the field has yet to adopt a general procedure for estimating the motion of dynamic obstacles. As a consequence, a large number of path-planning algorithms assume static environments, a strong assumption given that we live in a dynamic world. Other methods perform dynamic obstacle avoidance by assuming the state of the dynamic obstacle is known at all times. This is fine in a laboratory setting, where a motion capture system can provide such estimates, but it breaks down in real-world applications. Therefore, a scheme for estimating the motion of dynamic obstacles from sensory input is indispensable if we are to deploy a complete system. Some approaches employ Kalman filters to recover the position and velocity of obstacles in the field of view from point clouds, but these methods are either restrictive, allowing robust planning only around vehicles with known dynamics, or brittle, if nothing is assumed about the obstacles' dynamics.

There are two core challenges in estimating the motion of objects from a moving platform:

  1. Estimating the motion of the scene.
  2. Removing the motion of the vehicle attached to the camera.

Unfortunately, this becomes increasingly difficult when performing highly dynamic, high-speed motion, as is the norm in the UAV literature. At those speeds, odometry latency and inaccuracy build up quickly, and motion blur can pollute the visual signal, which complicates the task of separating agent motion from obstacle motion. In this project, we propose combining RAFT optical flow with the geometric optical flow induced by camera motion to solve both of the aforementioned problems with just a frame camera.


Methodology

We compute the residuals between RAFT optical flow and the geometric optical flow induced by camera motion, using solely RGBD and odometry data. The latter assumes a static scene, so subtracting the flow produced by the motion of the vehicle yields residuals that, in theory, capture only the movement of the dynamic objects in the scene. Below, we discuss how each module in our approach works and the role it plays in the pipeline.

Theory

Optical flow assigns pixel correspondences between two frames by solving the optimization problem below:

\[ E = \min_{\mu} \left[ \left( \nabla^T \mathbf{I}_1 \, \mu - (\mathbf{I}_2-\mathbf{I}_1) \right)^2 + \alpha \left( |\nabla \mu_x|^2 + |\nabla \mu_y|^2 \right) \right] \]

where the first term enforces correspondences based on the color constancy assumption, which states that the color of each point in the scene does not change between frames, and the second term promotes smooth, uniform flow by encouraging each flow vector to be close to the average of its neighbors. Common pitfalls of classical optical flow include:

  1. It is expensive to solve an optimization problem for each frame in a video sequence.
  2. The color constancy assumption is susceptible to blur and changes in lighting conditions.

As a consequence, it has become the norm to replace the classical optical flow formulation with learned models, the most notable of which is RAFT [9]. RAFT extracts per-pixel features and builds multi-scale 4D correlation volumes. This rich representation allows the model to generalize and reduces the effect of lighting changes and blur on the flow estimates. The compact representation also makes computing the flow between consecutive frames faster than solving the original optimization problem.
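For reference, the minimal sketch below shows one way to obtain dense RAFT flow between two frames using torchvision's pretrained implementation. This is an illustrative stand-in only: our pipeline loads the original RAFT checkpoints (raft-sintel, raft-things, etc.) rather than the torchvision weights, but the interface is analogous.

```python
# Minimal sketch of dense optical flow with torchvision's pretrained RAFT.
# Illustration only; our pipeline uses the original RAFT repository and checkpoints.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval().to(device)

@torch.no_grad()
def raft_flow(img1, img2, iters=12):
    """img1, img2: float tensors of shape (1, 3, H, W) in [0, 1], H and W divisible by 8."""
    img1, img2 = weights.transforms()(img1, img2)      # normalization RAFT expects
    # RAFT returns a list of iteratively refined flow fields; keep the final one.
    flows = model(img1.to(device), img2.to(device), num_flow_updates=iters)
    return flows[-1]                                   # (1, 2, H, W): per-pixel (du, dv)
```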

We obtain our flow residuals by computing the difference between the RAFT optical flow and the "geometric optical flow", or optical flow resulting solely from camera-induced scene flow.

For each pixel \( [u, v] \) with corresponding depth \( d \) in the first frame, we first unproject the pixel coordinates into 3D space using the depth data and camera intrinsics: \[ p_1=d K^{-1} [u, v, 1]^T \] We assume that \( p_1 \) is static in the 3D scene. We transform the 3D point \( p_1 \), which is in the reference frame of the camera at frame 1, into the reference frame of the camera at frame 2. Given that the robot pose is \( T_{world}^{r1} \) at frame 1 and \( T_{world}^{r2} \) at frame 2, and that the transform between the odometry frame and the camera frame is \( T_{odom}^{cam} \), we can transform the point as follows: \[ T_{r2}^{r1} = (T_{world}^{r2})^{-1} \cdot T_{world}^{r1} \] \[ p_2 = (T_{odom}^{cam})^{-1} \cdot T_{r2}^{r1} \cdot T_{odom}^{cam} \cdot [p_1^T \; 1]^T \] Finally, we project the point \( p_2 \) into the image plane to get its pixel location in the second frame: \[ [u', v'] = [f_x \frac{p_{2,x}}{p_{2,z}} + c_x, f_y \frac{p_{2,y}}{p_{2,z}} + c_y] \] where \( f_x, f_y \) are the focal lengths of the camera and \( c_x, c_y \) are the coordinates of the camera center in the image plane. Then, \( [u', v'] - [u, v] \) gives us the geometric optical flow at \( [u, v] \) in frame 1. The key observation is the assumption that \( p_1 \) is static in the 3D scene. If \( p_1 \) instead moves by \( \delta = [\delta_x, \delta_y, \delta_z] \) (in the camera reference frame at frame 2) between frame 1 and frame 2, we get the following image projection at frame 2: \[ [u'_{dynamic}, v'_{dynamic}] = [f_x \frac{p_{2,x} + \delta_x}{p_{2,z} + \delta_z} + c_x, f_y \frac{p_{2,y} + \delta_y}{p_{2,z} + \delta_z} + c_y] \] In the video below, RAFT optical flow is juxtaposed with geometric optical flow. Where the geometric optical flow is white, it cannot be computed because there is no depth data (we capped the max depth at 6.0m; see Appendix II).

Figure 1. Clockwise from top left: RGB image, depth image,
geometric optical flow, RAFT flow
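To make the projection chain above concrete, here is a minimal per-pixel NumPy sketch of the geometric optical flow computation under the static-scene assumption. The function and variable names are illustrative; the actual pipeline vectorizes this over the full image using PyTorch tensors on the GPU.

```python
import numpy as np

def geometric_flow_at_pixel(u, v, d, K, T_world_r1, T_world_r2, T_odom_cam):
    """Flow at pixel (u, v) with depth d induced purely by camera motion,
    assuming the corresponding 3D point is static. Transforms are 4x4 matrices
    following the conventions in the text above."""
    # Unproject the pixel into the frame-1 camera frame.
    p1 = d * np.linalg.inv(K) @ np.array([u, v, 1.0])

    # Relative robot motion between frames, conjugated by the odometry-to-camera
    # extrinsics so that it acts on points expressed in the camera frame.
    T_r2_r1 = np.linalg.inv(T_world_r2) @ T_world_r1
    T_cam = np.linalg.inv(T_odom_cam) @ T_r2_r1 @ T_odom_cam

    # Transform the (assumed static) point into the frame-2 camera frame.
    p2 = (T_cam @ np.append(p1, 1.0))[:3]

    # Project into the frame-2 image plane.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u2 = fx * p2[0] / p2[2] + cx
    v2 = fy * p2[1] / p2[2] + cy
    return np.array([u2 - u, v2 - v]), p2   # geometric flow and p_2 (reused for residuals)
```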

Since RAFT estimates the actual observed optical flow from frame 1 to frame 2, it will return an optical flow close to \( [u'_{dynamic}, v'_{dynamic}] - [u, v] \) at \( [u, v] \). The residual (RAFT minus geometric flow) at \( [u, v] \) is therefore \( [u'_{dynamic}, v'_{dynamic}] - [u', v'] \). Although we could theoretically recover the depth of \( [u'_{dynamic}, v'_{dynamic}] \) from the depth data for frame 2, in practice, noisy depth data and imperfect pixel associations make this a noisy process (see Appendix I). Instead, we assume that the object maintains constant depth between frames (\( \delta_z = 0 \)) and use the residual to recover only \( [\delta_x, \delta_y] \): \[ [\delta_x, \delta_y] = p_{2,z} \left( [u'_{dynamic}, v'_{dynamic}] - [u', v'] \right) \cdot \left[ \frac{1}{f_x}, \frac{1}{f_y} \right] \] This assumption works reasonably well in practice, although it is not effective at detecting objects moving along the camera Z axis.

Finally, taking \( v = \| [\delta_x, \delta_y] \| \) as an estimate of the velocity of the 3D point corresponding to each pixel, we can mask areas of the image where \( v \) exceeds a set threshold to obtain masks of dynamic objects in frame 1.
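A minimal vectorized sketch of this residual step, assuming the RAFT and geometric flows are available as (2, H, W) arrays and p2_z holds the depth of each camera-motion-compensated point (names are illustrative):

```python
import numpy as np

def residual_speed(raft_flow, geom_flow, p2_z, fx, fy):
    """Per-pixel displacement magnitude recovered from the flow residual under
    the constant-depth assumption (delta_z = 0).
    raft_flow, geom_flow: (2, H, W) arrays of (du, dv); p2_z: (H, W) depths."""
    residual = raft_flow - geom_flow          # (2, H, W), in pixels
    delta_x = p2_z * residual[0] / fx         # back-project the pixel residual
    delta_y = p2_z * residual[1] / fy         # into metric camera X/Y displacement
    return np.hypot(delta_x, delta_y)         # v = ||[delta_x, delta_y]|| per pixel
```

Thresholding this speed map then yields the binary dynamic-object mask used in the final algorithm below.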

Prototype

Below is a video of our prototype of this method running on data from Kimera-Multi [10], where flow residual magnitudes are masked by the velocity threshold (Figure 2). To track the motion of dynamic objects across frames, we first fed the residuals to the point-tracking algorithm from [7]. However, this introduced unnecessary computation, as RAFT optical flow is accurate enough to estimate the motion of each obstacle between the current and next frame. The details are discussed below.

Figure 2. Top is RGB image, bottom is flow residual magnitude
masked by the velocity threshold.

Final Algorithm

We implemented the above method in a pipeline that takes standard ROS bags as input, with flow and residuals computed in batches as PyTorch tensors on the GPU for performance. Once velocity magnitudes are computed from the residual flow, we threshold them to obtain a binary mask of dynamic objects. The threshold is tunable. Because depth-camera noise increases with depth, we scale the threshold with depth: \( \text{thresh} = \text{min\_thresh} + d \cdot \text{thresh\_depth\_gain} \)
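A short sketch of this depth-scaled thresholding; the defaults shown mirror the minimum threshold and depth gain quoted in the ablations below, but the function itself is illustrative rather than our exact implementation.

```python
def dynamic_object_mask(speed, depth, min_thresh=0.4, thresh_depth_gain=0.125):
    """Binary dynamic-object mask via a per-pixel, depth-scaled velocity threshold.
    speed, depth: (H, W) arrays; defaults match the parameters quoted in the ablations."""
    thresh = min_thresh + depth * thresh_depth_gain   # looser threshold at larger depth
    return speed > thresh                              # boolean (H, W) mask
```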

Because of the noise in the odometry and depth data, we improve performance by post-processing the binary mask with OpenCV morphological operations. Opening removes small specks of noise, while closing fills small holes in the mask. Once we have a processed binary mask for the frame, we use scipy.ndimage.label to label connected components. Each connected component is extracted as a mask and treated as a candidate dynamic object. The pixels in the mask are unprojected to obtain a 3D point cloud representing the object. To further remove noise, we discard objects whose maximum per-axis variance falls outside a set range.
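The post-processing path can be sketched as follows with OpenCV and SciPy; the kernel size and variance bounds here are illustrative placeholders, not our tuned values.

```python
import cv2
import numpy as np
from scipy import ndimage

def extract_dynamic_objects(mask, depth, K, kernel_size=5, var_range=(0.01, 1.0)):
    """Clean up the binary mask and split it into candidate dynamic objects.
    kernel_size and var_range are illustrative, not our tuned parameters."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = mask.astype(np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove small speckles
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes

    labels, num = ndimage.label(mask)                        # connected components
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    objects = []
    for i in range(1, num + 1):
        vs, us = np.nonzero(labels == i)
        d = depth[vs, us]
        # Unproject the component's pixels into a 3D point cloud.
        pts = np.stack([(us - cx) * d / fx, (vs - cy) * d / fy, d], axis=1)
        # Keep only components whose largest per-axis variance is plausible for an object.
        if var_range[0] <= pts.var(axis=0).max() <= var_range[1]:
            objects.append({"mask": labels == i, "points": pts})
    return objects
```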

To facilitate tracking, we compute frame-by-frame associations between currently tracked dynamic objects and newly detected dynamic objects.

At each iteration, we predict the 3D location of each currently tracked dynamic object by propagating its mask into the next image frame using optical flow, then using the next frame's depth data to unproject the mask into a predicted point cloud. When dynamic objects are detected in the next frame, we unproject their masks in that frame to get detected point clouds. We then use the Hungarian algorithm (global nearest neighbor) with a cost based on point-cloud centroid distance to match newly detected dynamic objects with currently tracked ones.
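A sketch of the prediction step: push a tracked object's mask forward with the optical flow, then unproject it with the next frame's depth. Names are illustrative; flow is assumed to be a (2, H, W) array.

```python
import numpy as np

def predict_point_cloud(mask, flow, depth_next, K):
    """Propagate a tracked object's mask into the next frame via optical flow,
    then unproject with the next frame's depth to get a predicted point cloud."""
    H, W = mask.shape
    vs, us = np.nonzero(mask)
    u_next = np.clip(np.round(us + flow[0, vs, us]), 0, W - 1).astype(int)
    v_next = np.clip(np.round(vs + flow[1, vs, us]), 0, H - 1).astype(int)

    d = depth_next[v_next, u_next]
    valid = d > 0                                   # drop pixels with no depth return
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.stack([(u_next[valid] - cx) * d[valid] / fx,
                     (v_next[valid] - cy) * d[valid] / fy,
                     d[valid]], axis=1)             # predicted points in the next camera frame
```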

We augment the cost matrix and set costs to a large value if the point clouds are more than a set distance apart, so that newly detected objects not close to any currently tracked objects are not matched and become new tracked objects. Otherwise, currently tracked objects are updated with their new detections.
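A simplified association sketch using SciPy's Hungarian solver. Here gating is done by inflating out-of-range costs rather than explicitly augmenting the matrix, and the gate distance is an illustrative value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracked_centroids, detected_centroids, gate=1.5, big_cost=1e6):
    """Match tracked objects to new detections by centroid distance (global
    nearest neighbor). tracked_centroids: (T, 3); detected_centroids: (D, 3).
    Detections farther than `gate` meters from every track stay unmatched and
    become new tracks. gate and big_cost are illustrative values."""
    if len(tracked_centroids) == 0 or len(detected_centroids) == 0:
        return [], list(range(len(detected_centroids)))

    # Pairwise distances: rows are tracks, columns are detections.
    cost = np.linalg.norm(tracked_centroids[:, None, :] -
                          detected_centroids[None, :, :], axis=-1)
    cost = np.where(cost > gate, big_cost, cost)     # gate out implausible pairs

    rows, cols = linear_sum_assignment(cost)
    matches, unmatched = [], set(range(len(detected_centroids)))
    for r, c in zip(rows, cols):
        if cost[r, c] < big_cost:                    # keep only in-gate assignments
            matches.append((r, c))                   # (track index, detection index)
            unmatched.discard(c)
    return matches, sorted(unmatched)
```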

Representative videos of our algorithm running on the data we collected are shown below. Top left and right are the RGB and depth images. Bottom right is the velocity residual magnitude, and bottom left is the post-processed binary mask. Overlaid on the images are the tracked dynamic objects and their IDs. Noise is still present, but much of it is filtered out.

Figure 3. Clockwise from top left: RGB image, depth image,
velocity magnitude from flow residual, post-processed binary mask

Figure 4. The same dynamic object maintains a consistent ID throughout frames where it is
moving continuously, showing the efficacy of the Hungarian algorithm for frame-to-frame association.

Experiments

Dataset

We evaluate our method on our own dataset, collected in the Highbay of the Aerospace Controls Laboratory at MIT. We used a RealSense D455 depth camera on a Clearpath Husky with an Ouster lidar and MicroStrain IMU running lidar-inertial odometry. The moving objects are a hexcopter and another ground robot. All robots were teleoperated. The ground truth pose data was collected using a Vicon motion capture system and recorded in a ROS2 bag. The Husky recorded a ROS1 bag with depth camera and pose data. We use our own ROS bag processing software for dataloading.

The video of our algorithm running on the entire dataset is linked here: [hamilton full demo].

Metrics

We run our algorithm on the entire dataset. For each frame of the output, if there is ground truth data close to the associated timestamp, we compute the following counts: \( N_{gt} \), the number of dynamic objects in view according to the ground truth; \( N_{det} \), the number of those objects that are detected*; \( N \), the total number of detections produced by the algorithm; and \( N_{corr} \), the number of detections that match a ground-truth object*.

* A detection and a ground-truth object are considered to match if their centroids are within 1.0m. Our robots are ~1.0m in their largest dimension, so we used that as a reasonable threshold. The ground truth data is not aligned with the robot center of mass along the robot Z axis, creating a steady-state error. Although far from a perfect metric, this should be a reasonable estimate and allows us to compare runs with different parameters.

We also define the following percentages for ease of comparison: the detection rate \( det\% = N_{det} / N_{gt} \) and the false-positive rate \( fp\% = (N - N_{corr}) / N \).
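As a concrete check, the percentages reported in the tables below follow directly from the counts; the raft-sintel row of Table 1 is reproduced in this small sketch.

```python
def det_pct(n_det, n_gt):
    """Percentage of ground-truth dynamic objects that were detected."""
    return 100.0 * n_det / n_gt

def fp_pct(n, n_corr):
    """Percentage of detections that did not match a ground-truth object."""
    return 100.0 * (n - n_corr) / n

# Example: the raft-sintel row of Table 1.
print(f"{det_pct(1088, 1283):.2f}%")  # 84.80%
print(f"{fp_pct(1384, 1114):.2f}%")   # 19.51%
```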

Results

We evaluated the performance of our algorithm as individual parameters were changed while all others were held constant. First, we determined the best RAFT model for our application. RAFT options were left at their defaults, apart from setting the number of flow refinement iterations to 12 for each model.
Table 1. RAFT model comparison
Model \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
raft-kitti 1283 1041 1416 1058 81.14% 25.28%
raft-things 1283 1085 1383 1110 84.57% 19.74%
raft-sintel 1283 1088 1384 1114 84.80% 19.51%
raft-chairs 1283 672 1173 702 52.38% 40.15%

Subsequent evaluations were run with the raft-sintel model.

We then evaluated the effect of framerate on our algorithm. The original RGBD data was recorded at 30Hz, but we downsampled it to 15, 10, and 6 Hz.
Table 2. Framerate comparison
Framerate \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
15Hz 1283 1088 1384 1114 84.80% 19.51%
10Hz 849 715 900 731 84.22% 18.78%
6Hz 513 416 530 422 81.09% 20.38%

The relative performance of the algorithm is similar across framerates, but a higher framerate allows for earlier detection of dynamic objects. Therefore, we ran all other evaluations at 15Hz.

We then evaluated the effect of various noise-reduction methods by performing ablations on various parameters of the algorithm.
Table 3. Noise reduction comparison
Method \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
Original 1283 1088 1384 1114 84.80% 19.51%
No Morphology 1283 1102 2626 1294 85.89% 50.72%
No axial variance-based threshold 1283 1119 1662 1183 87.22% 28.82%
No depth-based velocity threshold gain* 1318 1097 1605 1112 83.23% 30.72%

* We set the fixed velocity threshold to 0.775 (0.4 + 0.125 * 3) for this run. This was previously the average velocity threshold in our depth range (6m), since the original parameters were 0.4 for the minimum velocity threshold and 0.125 for the depth-based gain.

We find that morphological operations are very effective at removing noise from the mask. Variance-based discarding and depth-based velocity threshold gain also have a noticeable effect. Variance-based thresholding and morphological operations somewhat reduce the detection rate, but the reduction in false positives is more significant.

To further reduce false positives, we evaluated the effect of requiring that a dynamic object be detected in two consecutive frames before being considered a dynamic object.
Table 4. Two-consecutive-frame detection requirement
Method \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
Original 1283 1088 1384 1114 84.80% 19.51%
2-Consecutive Requirement 1283 1080 1224 1093 84.18% 10.70%

This change halved the number of false positives while barely decreasing the detection rate. We nevertheless did not include it in the default algorithm used for the other evaluations, since we did not want to influence results by relying on the probabilistic assumption that noise is independent between frames.

Conclusion

Although our method shows promise for efficient real-time detection and tracking of dynamic obstacles, it is extremely sensitive to data quality. During our experiments, we observed that poor-quality depth data or odometry estimates led to large deviations between the geometric optical flow and the RAFT optical flow in static areas of the image. Future work could include using a trained neural network to segment dynamic objects, since our current pipeline does not use the corresponding RGB image as a prior when segmenting flow residuals. A network that also leverages FastSAM segments or superpixels would make better use of all the available data.

A planned extension of this work is to use LiDAR as the source for depth data. This would greatly increase the range and accuracy of depth data, at the cost of sparsity (which could be addressed with stereo camera-LiDAR fusion). In addition, it may allow us to use the 3D residual method discussed in Appendix I, which can better detect movement in the camera Z axis.

A major limitation of this method is that under high accelerations, the RAFT and geometric flows diverge significantly, leading to underwhelming results during UAV flights. There are multiple culprits, including noisy depth data and poor RAFT results at high accelerations. The most likely cause seems to be poor odometry, since the divergence often occurs during acceleration, suggesting latency and/or noise in the odometry data. Also, since pose data is collected at times independent of the RGBD data, we must linearly interpolate it for every frame, which introduces additional error.

Despite these shortcomings, our method proves to be a promising approach to the problem of dynamic obstacle tracking. With only odometry and depth camera data, it demonstrates good performance and runtime on real-world data in the tested environments. There appear to be many directions for further work to improve robustness, efficiency, and generalization.


References

[1] Tordesillas, J., Lopez, B. T., & How, J. P. (2019, November). Faster: Fast and safe trajectory planner for flights in unknown environments. In 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1934-1940). IEEE.

[2] Tordesillas, J., & How, J. P. (2021). MADER: Trajectory planner in multiagent and dynamic environments. IEEE Transactions on Robotics, 38(1), 463-476.

[3] Kondo, K., Tordesillas, J., Figueroa, R., Rached, J., Merkel, J., Lusk, P. C., & How, J. P. (2023, May). Robust MADER: Decentralized and asynchronous multiagent trajectory planner robust to communication delay. In 2023 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1687-1693). IEEE.

[4] Tordesillas, J., & How, J. P. (2022). Panther: Perception-aware trajectory planner in dynamic environments. IEEE Access, 10, 22662-22677.

[5] Falanga, D., Kleber, K., & Scaramuzza, D. (2020). Dynamic obstacle avoidance for quadrotors with event cameras. Science Robotics, 5(40), eaaz9712.

[6] Chao, H., Gu, Y., & Napolitano, M. (2013, May). A survey of optical flow techniques for UAV navigation applications. In 2013 International Conference on Unmanned Aircraft Systems (ICUAS) (pp. 710-716). IEEE.

[7] Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., & Rupprecht, C. (2024, September). Cotracker: It is better to track together. In European Conference on Computer Vision (pp. 18-35). Cham: Springer Nature Switzerland.

[8] Smith, C., Charatan, D., Tewari, A., & Sitzmann, V. (2024). Flowmap: High-quality camera poses, intrinsics, and depth via gradient descent. arXiv preprint arXiv:2404.15259.

[9] Teed, Z., & Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II 16 (pp. 402-419). Springer International Publishing.

[10] Tian, Y., Chang, Y., Arias, F. H., Nieto-Granda, C., How, J. P., & Carlone, L. (2022). Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems. IEEE Transactions on Robotics, 38(4).

Acknowledgments

We would like to thank Professor Jonathan How for his guidance throughout this project and for generously providing access to the resources that we utilized. Special thanks as well to Lucas Jia for assisting us with physical experiments and data collection.


Codebase

See [dynamic-object-detection on Github] for our final implementation code. Instructions for downloading our data and running the algorithm and evaluations are in the README.

The prototype code is also [hosted on Github].


Appendix

I. Using 3D Residuals

Because we have the depth data for both frame 1 and frame 2, in theory we can compute the full 3D movement of each pixel between frames. Given RAFT flow \( [\delta u_{raft}, \delta v_{raft}] \) at pixel \( [u, v] \), we can compute the 3D residual as follows: \[ p_2 = (T_{odom}^{cam})^{-1} \cdot T_{r2}^{r1} \cdot T_{odom}^{cam} \cdot [p_1^T \; 1]^T \] \[ p_{2, raft} = d_2 K^{-1} [u + \delta u_{raft}, v + \delta v_{raft}, 1]^T \] \[ \delta \approx p_{2, raft} - p_2 \] where \( d_2 \) is the depth of pixel \( [u + \delta u_{raft}, v + \delta v_{raft}] \) in frame 2. This method works by assuming that RAFT correctly predicts pixel correspondences, in which case the 3D point corresponding to pixel \( [u + \delta u_{raft}, v + \delta v_{raft}] \) in frame 2 is the same as the 3D point corresponding to pixel \( [u, v] \) in frame 1. Since \( p_2 \) accounts for the camera motion between frames but not the motion of the point itself, while RAFT accounts for both, the difference is the approximate 3D motion of the point between frames.
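A per-pixel sketch of this 3D residual computation, reusing \( p_2 \) from the geometric-flow step; the function and argument names are illustrative.

```python
import numpy as np

def residual_3d(u, v, p2, raft_flow, depth2, K):
    """3D residual at pixel (u, v): where RAFT says the point went (unprojected
    with frame-2 depth) minus where camera motion alone would place it (p2)."""
    du, dv = raft_flow                       # RAFT flow (du, dv) at (u, v)
    u2, v2 = int(round(u + du)), int(round(v + dv))
    d2 = depth2[v2, u2]                      # frame-2 depth at the matched pixel
    p2_raft = d2 * np.linalg.inv(K) @ np.array([u2, v2, 1.0])
    return p2_raft - p2                      # approximate 3D motion of the point
```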

Unfortunately, this method is not effective in practice. The RAFT flow is not accurate enough, and noise in the depth data is amplified by the projection into 3D. We ran our evaluation with the 3D residual method and found that it decreased the detection rate and dramatically increased the number of false positives, mostly patches of depth noise at larger depths.
Table 5. Using 3D residuals
Method \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
Original 1283 1088 1384 1114 84.80% 19.51%
3D residual 1283 939 35857 1400 73.19% 96.10%

II. Max Depth Limits

We later compared performances at different maximum depth limits (all other parameters fixed).
Table 6. Depth limit comparison
Depth Limit (m) \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
6.0 1283 1088 1384 1114 84.80% 19.51%
7.5 1306 1087 1434 1113 83.23% 22.38%
10.0 1306 1049 1317 1075 80.32% 28.82%
15.0 1306 1042 1288 1068 79.79% 17.08%

As the maximum depth limit increases, the number of dynamic objects according to the ground truth does not significantly increase (possibly due to the depth-based velocity threshold gain). The detection rate decreases with increased depth limit.

We looked at the output videos and found that the 15.0m limit decreased the false positive rate because it brought the walls into range of the depth data, so odometry errors caused larger portions of the image to be masked, which were then pruned by the axial variance threshold. However, this is not a generalizable phenomenon. Therefore, we decided to continue using the 6.0m limit suggested by Intel, since depth readings beyond that range become significantly noisier.

III. Smoothing and Quadcopter Experiments

While developing the prototype, we noticed that rapid changes in pose produced high-frequency noise in the flow residuals, especially in quadcopter experiments where video was recorded on a highly dynamic vehicle. Our first instinct was to design a low-pass filter to smooth the outputs of our pipeline. Thus, our prototype returned a running average of the flow residuals, parametrized by \( \alpha \) as \[ \text{smoothFlow} = \alpha \cdot \text{Flow} + (1 - \alpha) \cdot \text{prevFlow} \] As expected of a running average, smoothing mitigated some of the noise, but it also introduced a delay between the estimated and ground-truth obstacle positions. In fact, light smoothing granted a minor performance boost, but performance quickly degraded as we over-smoothed the signal and widened the gap between the estimate and the actual position.
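The running average itself is just an exponential moving average over residuals; a one-line sketch:

```python
def smooth_flow(flow, prev_flow, alpha):
    """Running average of flow residuals. Smaller alpha (larger 1 - alpha) means
    heavier smoothing and more lag behind the true obstacle motion."""
    return alpha * flow + (1.0 - alpha) * prev_flow
```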

This motivated us to pivot away from the running-average approach. The analysis above was performed on data collected onboard an x500 Holybro quadcopter. Below are the flow-residual masks from the videos recorded by the quadcopter with \( 1 - \alpha = 0.2 \) smoothing (left) and \( 1 - \alpha = 0.8 \) smoothing (right). The flashes of white are moments in which the RAFT flow and geometric flow disagree. The attentive viewer will notice that the flashing occurs when the quadcopter rolls. We found this to be a common trend: the signal-to-noise ratio decreases as the vehicle performs aggressive maneuvers, particularly during sharp changes in orientation. Notice that the video on the right exhibits less flashing, as it is heavily smoothed. This comes at the expense of a higher trajectory error, as the smoothed estimate lags behind the unprocessed one. Although we developed an improved noise-reduction strategy, the noise introduced during aggressive UAV flight requires denoising techniques beyond the scope of this project, so we constrained the remainder of our experiments to ground vehicles.

Figure 5. Left is light smoothing, right is heavy smoothing.

To estimate the obstacle position, we unproject the points in the mask into 3D space and take their centroid, which we use as our prediction of the obstacle's position. Clearly, a significant limitation of this approach is that it only works in scenes with at most one dynamic obstacle. Note that these are preliminary results, meant to convey the challenge of handling noisy visual signals during aggressive motion. Our final implementation computes the trajectory error following the more sophisticated procedure outlined in the Methodology and Experiments sections of this blog.