FLOROS: Flow Residuals for Open Set Dynamic Object Tracking
Qingyuan Li, Juan Rached
MIT 6.8300 Final Project
Outline

Introduction

Related Works

Methodology

Experiments

Results

Conclusion

References

Codebase

Appendix


Introduction

Advances in autonomous Unmanned Aerial Vehicle (UAV) and Unmanned Ground Vehicle (UGV) navigation are improving our quality of life. The integration of this technology into emergency services, search and rescue, first response, law enforcement, and agriculture has seen accelerated growth over the past decade, promising a safer future. UAV/UGV navigation has also had an impact on the delivery industry, where UAV/UGV deliveries enable the prompt dispatch of essential products, such as medicine, to remote communities that are difficult to reach through conventional shipping. However, this is still an early technology with several limitations and open research questions. In particular, dynamic obstacle avoidance for UAVs/UGVs is a longstanding problem in robotics.

Current state-of-the-art methods rely on event cameras, stereo cameras, depth cameras, or lidar. However, power and payload limitations restrict the perception and compute capacity of unmanned vehicles, particularly UAVs. These constraints force a difficult balance between latency and performance for real-time applications that remains an open question in the field. Some methods favor efficiency while others favor performance, but the field has yet to adopt a general procedure for estimating the motion of dynamic obstacles. As a consequence, a large number of path-planning algorithms assume static environments, a strong assumption given that we live in a dynamic world. Other methods perform dynamic obstacle avoidance by assuming the state of the dynamic obstacle is known at all times. This is fine in a laboratory setting, where a motion capture system can provide such estimates, but it breaks down in real-world applications. Therefore, a scheme for estimating the motion of dynamic obstacles from sensory input is indispensable if we are to deploy a complete system. Some approaches employ Kalman filters to recover the position and velocity of obstacles in the field of view from point clouds, but these methods are either restrictive, allowing robust planning only around vehicles with known dynamics, or brittle, if nothing is assumed about the obstacles' dynamics.

There are two core challenges in estimating the motion of objects from a moving platform:

  1. Estimating the motion of the scene.
  2. Removing the motion of the vehicle attached to the camera.

Unfortunately, this becomes increasingly difficult when performing highly dynamic, high-speed motion, as is the norm in the UAV literature. At those speeds, odometry latency and inaccuracy build up quickly, and motion blur can pollute the visual signal, which complicates the task of separating agent motion from obstacle motion. In this project, we propose combining RAFT optical flow with the geometric optical flow induced by camera motion to solve both of the aforementioned problems with just a frame camera.


Methodology

We compute the residuals between RAFT optical flow and the geometric optical flow induced by camera motion, using solely RGBD and odometry data. The latter assumes a static scene, so subtracting the flow produced by the motion of the vehicle yields residuals that, in theory, capture only the movement of the dynamic objects in the scene. Below, we discuss how each module in our approach works and the role it plays in the pipeline.

Theory

Optical flow assigns pixel correspondences between two frames by solving the optimization problem below:

\[ E = \min_{\mu} \left[ \left( \nabla^T \mathbf{I}_1 \, \mu - (\mathbf{I}_2-\mathbf{I}_1) \right)^2 + \alpha \left( |\nabla \mu_x|^2 + |\nabla \mu_y|^2 \right) \right] \]

where the first term enforces correspondences based on the color constancy assumption, which states that the color of each point in the scene does not change between frames, and the second term promotes smooth, uniform flow by encouraging each flow vector to be close to the average of its neighbors. Common pitfalls of classical optical flow include:

  1. It is expensive to solve an optimization problem for each frame in a video sequence.
  2. The color constancy assumption is susceptible to blur and changes in lighting conditions.

As a consequence, it has become the norm to replace the classical optical flow formulation with learned models, the most notable of which is RAFT [9]. RAFT extracts per-pixel features and builds multi-scale 4D correlation volumes. This rich representation allows the model to generalize and reduces the effect of lighting changes and blur on the flow estimates. The compact representation also makes computing the flow between consecutive frames faster than solving the original optimization problem.
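For reference, the minimal sketch below shows one way to obtain dense RAFT flow between two frames using torchvision's pretrained implementation. This is an illustrative stand-in only: our pipeline loads the original RAFT checkpoints (raft-sintel, raft-things, etc.) rather than the torchvision weights, but the interface is analogous.

```python
# Minimal sketch of dense optical flow with torchvision's pretrained RAFT.
# Illustration only; our pipeline uses the original RAFT repository and checkpoints.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval().to(device)

@torch.no_grad()
def raft_flow(img1, img2, iters=12):
    """img1, img2: float tensors of shape (1, 3, H, W) in [0, 1], H and W divisible by 8."""
    img1, img2 = weights.transforms()(img1, img2)      # normalization RAFT expects
    # RAFT returns a list of iteratively refined flow fields; keep the final one.
    flows = model(img1.to(device), img2.to(device), num_flow_updates=iters)
    return flows[-1]                                   # (1, 2, H, W): per-pixel (du, dv)
```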

We obtain our flow residuals by computing the difference between the RAFT optical flow and the "geometric optical flow", or optical flow resulting solely from camera-induced scene flow.

For each pixel \( [u, v] \) with corresponding depth \( d \) in the first frame, we first unproject the pixel coordinates into 3D space using the depth data and camera intrinsics: \[ p_1=d K^{-1} [u, v, 1]^T \] We assume that \( p_1 \) is static in the 3D scene. We transform the 3D point \( p_1 \), which is in the reference frame of the camera at frame 1, into the reference frame of the camera at frame 2. Given that the robot pose is \( T_{world}^{r1} \) at frame 1 and \( T_{world}^{r2} \) at frame 2, and that the transform between the odometry frame and the camera frame is \( T_{odom}^{cam} \), we can transform the point as follows: \[ T_{r2}^{r1} = (T_{world}^{r2})^{-1} \cdot T_{world}^{r1} \] \[ p_2 = (T_{odom}^{cam})^{-1} \cdot T_{r2}^{r1} \cdot T_{odom}^{cam} \cdot [p_1^T \; 1]^T \] Finally, we project the point \( p_2 \) into the image plane to get its pixel location in the second frame: \[ [u', v'] = [f_x \frac{p_{2,x}}{p_{2,z}} + c_x, f_y \frac{p_{2,y}}{p_{2,z}} + c_y] \] where \( f_x, f_y \) are the focal lengths of the camera and \( c_x, c_y \) are the coordinates of the camera center in the image plane. Then, \( [u', v'] - [u, v] \) gives us the geometric optical flow at \( [u, v] \) in frame 1. The key observation is the assumption that \( p_1 \) is static in the 3D scene. If \( p_1 \) instead moves by \( \delta = [\delta_x, \delta_y, \delta_z] \) (in the camera reference frame at frame 2) between frame 1 and frame 2, we get the following image projection at frame 2: \[ [u'_{dynamic}, v'_{dynamic}] = [f_x \frac{p_{2,x} + \delta_x}{p_{2,z} + \delta_z} + c_x, f_y \frac{p_{2,y} + \delta_y}{p_{2,z} + \delta_z} + c_y] \] In the video below, RAFT optical flow is juxtaposed with geometric optical flow. Where the geometric optical flow is white, it cannot be computed because there is no depth data (we capped the max depth at 6.0m; see Appendix II).

Figure 1. Clockwise from top left: RGB image, depth image,
geometric optical flow, RAFT flow
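To make the projection chain above concrete, here is a minimal per-pixel NumPy sketch of the geometric optical flow computation under the static-scene assumption. The function and variable names are illustrative; the actual pipeline vectorizes this over the full image using PyTorch tensors on the GPU.

```python
import numpy as np

def geometric_flow_at_pixel(u, v, d, K, T_world_r1, T_world_r2, T_odom_cam):
    """Flow at pixel (u, v) with depth d induced purely by camera motion,
    assuming the corresponding 3D point is static. Transforms are 4x4 matrices
    following the conventions in the text above."""
    # Unproject the pixel into the frame-1 camera frame.
    p1 = d * np.linalg.inv(K) @ np.array([u, v, 1.0])

    # Relative robot motion between frames, conjugated by the odometry-to-camera
    # extrinsics so that it acts on points expressed in the camera frame.
    T_r2_r1 = np.linalg.inv(T_world_r2) @ T_world_r1
    T_cam = np.linalg.inv(T_odom_cam) @ T_r2_r1 @ T_odom_cam

    # Transform the (assumed static) point into the frame-2 camera frame.
    p2 = (T_cam @ np.append(p1, 1.0))[:3]

    # Project into the frame-2 image plane.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u2 = fx * p2[0] / p2[2] + cx
    v2 = fy * p2[1] / p2[2] + cy
    return np.array([u2 - u, v2 - v]), p2   # geometric flow and p_2 (reused for residuals)
```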

Since RAFT estimates the actual observed optical flow from frame 1 to frame 2, it will return an optical flow close to \( [u'_{dynamic}, v'_{dynamic}] - [u, v] \) at \( [u, v] \). The residual (RAFT minus geometric flow) at \( [u, v] \) is therefore \( [u'_{dynamic}, v'_{dynamic}] - [u', v'] \). Although we could theoretically recover the depth of \( [u'_{dynamic}, v'_{dynamic}] \) from the depth data for frame 2, in practice, noisy depth data and imperfect pixel associations make this a noisy process (see Appendix I). Instead, we assume that the object maintains constant depth between frames (\( \delta_z = 0 \)) and use the residual to recover only \( [\delta_x, \delta_y] \): \[ [\delta_x, \delta_y] = p_{2,z} \left( [u'_{dynamic}, v'_{dynamic}] - [u', v'] \right) \cdot \left[ \frac{1}{f_x}, \frac{1}{f_y} \right] \] This assumption works reasonably well in practice, although it is not effective at detecting objects moving along the camera Z axis.

Finally, taking \( v = \| [\delta_x, \delta_y] \| \) as an estimate of the velocity of the 3D point corresponding to each pixel, we can mask areas of the image where \( v \) exceeds a set threshold to obtain masks of dynamic objects in frame 1.
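A minimal vectorized sketch of this residual step, assuming the RAFT and geometric flows are available as (2, H, W) arrays and p2_z holds the depth of each camera-motion-compensated point (names are illustrative):

```python
import numpy as np

def residual_speed(raft_flow, geom_flow, p2_z, fx, fy):
    """Per-pixel displacement magnitude recovered from the flow residual under
    the constant-depth assumption (delta_z = 0).
    raft_flow, geom_flow: (2, H, W) arrays of (du, dv); p2_z: (H, W) depths."""
    residual = raft_flow - geom_flow          # (2, H, W), in pixels
    delta_x = p2_z * residual[0] / fx         # back-project the pixel residual
    delta_y = p2_z * residual[1] / fy         # into metric camera X/Y displacement
    return np.hypot(delta_x, delta_y)         # v = ||[delta_x, delta_y]|| per pixel
```

Thresholding this speed map then yields the binary dynamic-object mask used in the final algorithm below.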

Prototype

Below is a video of our prototype of this method running on data from Kimera-Multi [10], where flow residual magnitudes are masked by the velocity threshold (Figure 2). To track the motion of dynamic objects across frames, we first fed the residuals to the point-tracking algorithm from [7]. However, this introduced unnecessary computation, as RAFT optical flow is accurate enough to estimate the motion of each obstacle between the current and next frame. The details are discussed below.

Figure 2. Top is RGB image, bottom is flow residual magnitude
masked by the velocity threshold.

Final Algorithm

We implemented the above method in a pipeline that takes standard ROS bags as input, with flow and residuals computed in batches as PyTorch tensors on the GPU for performance. Once velocity magnitudes are computed from the residual flow, we threshold them to obtain a binary mask of dynamic objects. The threshold is tunable. Because depth-camera noise increases with depth, we scale the threshold with depth: \( \text{thresh} = \text{min\_thresh} + d \cdot \text{thresh\_depth\_gain} \)
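A short sketch of this depth-scaled thresholding; the defaults shown mirror the minimum threshold and depth gain quoted in the ablations below, but the function itself is illustrative rather than our exact implementation.

```python
def dynamic_object_mask(speed, depth, min_thresh=0.4, thresh_depth_gain=0.125):
    """Binary dynamic-object mask via a per-pixel, depth-scaled velocity threshold.
    speed, depth: (H, W) arrays; defaults match the parameters quoted in the ablations."""
    thresh = min_thresh + depth * thresh_depth_gain   # looser threshold at larger depth
    return speed > thresh                              # boolean (H, W) mask
```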

Because of the noise in the odometry and depth data, we improve performance by post-processing the binary mask with OpenCV morphological operations. Opening removes small specks of noise, while closing fills small holes in the mask. Once we have a processed binary mask for the frame, we use scipy.ndimage.label to label connected components. Each connected component is extracted as a mask and treated as a candidate dynamic object. The pixels in the mask are unprojected to obtain a 3D point cloud representing the object. To further remove noise, we discard objects whose maximum per-axis variance falls outside a set range.
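The post-processing path can be sketched as follows with OpenCV and SciPy; the kernel size and variance bounds here are illustrative placeholders, not our tuned values.

```python
import cv2
import numpy as np
from scipy import ndimage

def extract_dynamic_objects(mask, depth, K, kernel_size=5, var_range=(0.01, 1.0)):
    """Clean up the binary mask and split it into candidate dynamic objects.
    kernel_size and var_range are illustrative, not our tuned parameters."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = mask.astype(np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove small speckles
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes

    labels, num = ndimage.label(mask)                        # connected components
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    objects = []
    for i in range(1, num + 1):
        vs, us = np.nonzero(labels == i)
        d = depth[vs, us]
        # Unproject the component's pixels into a 3D point cloud.
        pts = np.stack([(us - cx) * d / fx, (vs - cy) * d / fy, d], axis=1)
        # Keep only components whose largest per-axis variance is plausible for an object.
        if var_range[0] <= pts.var(axis=0).max() <= var_range[1]:
            objects.append({"mask": labels == i, "points": pts})
    return objects
```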

To facilitate tracking, we compute frame-by-frame associations between currently tracked dynamic objects and newly detected dynamic objects.

At each iteration, we predict the 3D location of each currently tracked dynamic object by propagating its mask into the next image frame using optical flow, then using the next frame's depth data to unproject the mask into a predicted point cloud. When dynamic objects are detected in the next frame, we unproject their masks in that frame to get detected point clouds. We then use the Hungarian algorithm (global nearest neighbor) with a cost based on point-cloud centroid distance to match newly detected dynamic objects with currently tracked ones.
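A sketch of the prediction step: push a tracked object's mask forward with the optical flow, then unproject it with the next frame's depth. Names are illustrative; flow is assumed to be a (2, H, W) array.

```python
import numpy as np

def predict_point_cloud(mask, flow, depth_next, K):
    """Propagate a tracked object's mask into the next frame via optical flow,
    then unproject with the next frame's depth to get a predicted point cloud."""
    H, W = mask.shape
    vs, us = np.nonzero(mask)
    u_next = np.clip(np.round(us + flow[0, vs, us]), 0, W - 1).astype(int)
    v_next = np.clip(np.round(vs + flow[1, vs, us]), 0, H - 1).astype(int)

    d = depth_next[v_next, u_next]
    valid = d > 0                                   # drop pixels with no depth return
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.stack([(u_next[valid] - cx) * d[valid] / fx,
                     (v_next[valid] - cy) * d[valid] / fy,
                     d[valid]], axis=1)             # predicted points in the next camera frame
```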

We augment the cost matrix and set costs to a large value if the point clouds are more than a set distance apart, so that newly detected objects not close to any currently tracked objects are not matched and become new tracked objects. Otherwise, currently tracked objects are updated with their new detections.
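A simplified association sketch using SciPy's Hungarian solver. Here gating is done by inflating out-of-range costs rather than explicitly augmenting the matrix, and the gate distance is an illustrative value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracked_centroids, detected_centroids, gate=1.5, big_cost=1e6):
    """Match tracked objects to new detections by centroid distance (global
    nearest neighbor). tracked_centroids: (T, 3); detected_centroids: (D, 3).
    Detections farther than `gate` meters from every track stay unmatched and
    become new tracks. gate and big_cost are illustrative values."""
    if len(tracked_centroids) == 0 or len(detected_centroids) == 0:
        return [], list(range(len(detected_centroids)))

    # Pairwise distances: rows are tracks, columns are detections.
    cost = np.linalg.norm(tracked_centroids[:, None, :] -
                          detected_centroids[None, :, :], axis=-1)
    cost = np.where(cost > gate, big_cost, cost)     # gate out implausible pairs

    rows, cols = linear_sum_assignment(cost)
    matches, unmatched = [], set(range(len(detected_centroids)))
    for r, c in zip(rows, cols):
        if cost[r, c] < big_cost:                    # keep only in-gate assignments
            matches.append((r, c))                   # (track index, detection index)
            unmatched.discard(c)
    return matches, sorted(unmatched)
```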

Representative videos of our algorithm running on the data we collected are shown below. Top left and right are the RGB and depth images. Bottom right is the velocity residual magnitude, and bottom left is the post-processed binary mask. Overlaid on the images are the tracked dynamic objects and their IDs. Noise is still present, but much of it is filtered out.

Figure 3. Clockwise from top left: RGB image, depth image,
velocity magnitude from flow residual, post-processed binary mask

Figure 4. The same dynamic object maintains a consistent ID throughout frames where it is
moving continuously, showing the efficacy of the Hungarian algorithm for frame-to-frame association.

Experiments

Dataset

We evaluate our method on our own dataset, collected in the Highbay of the Aerospace Controls Laboratory at MIT. We used a RealSense D455 depth camera on a Clearpath Husky with an Ouster lidar and MicroStrain IMU running lidar-inertial odometry. The moving objects are a hexcopter and another ground robot. All robots were teleoperated. The ground truth pose data was collected using a Vicon motion capture system and recorded in a ROS2 bag. The Husky recorded a ROS1 bag with depth camera and pose data. We use our own ROS bag processing software for dataloading.

The video of our algorithm running on the entire dataset is linked here: [hamilton full demo].

Metrics

We run our algorithm on the entire dataset. For each frame of the output, if there is ground truth data close to the associated timestamp, we compute the following counts: \( N_{gt} \), the number of dynamic objects in view according to the ground truth; \( N_{det} \), the number of those objects that are detected*; \( N \), the total number of detections produced by the algorithm; and \( N_{corr} \), the number of detections that match a ground-truth object*.

* A detection and a ground-truth object are considered to match if their centroids are within 1.0m. Our robots are ~1.0m in their largest dimension, so we used that as a reasonable threshold. The ground truth data is not aligned with the robot center of mass along the robot Z axis, creating a steady-state error. Although far from a perfect metric, this should be a reasonable estimate and allows us to compare runs with different parameters.

We also define the following percentages for ease of comparison: the detection rate \( det\% = N_{det} / N_{gt} \) and the false-positive rate \( fp\% = (N - N_{corr}) / N \).
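As a concrete check, the percentages reported in the tables below follow directly from the counts; the raft-sintel row of Table 1 is reproduced in this small sketch.

```python
def det_pct(n_det, n_gt):
    """Percentage of ground-truth dynamic objects that were detected."""
    return 100.0 * n_det / n_gt

def fp_pct(n, n_corr):
    """Percentage of detections that did not match a ground-truth object."""
    return 100.0 * (n - n_corr) / n

# Example: the raft-sintel row of Table 1.
print(f"{det_pct(1088, 1283):.2f}%")  # 84.80%
print(f"{fp_pct(1384, 1114):.2f}%")   # 19.51%
```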

Results

We evaluated the performance of our algorithm as individual parameters were changed while all others were held constant. First, we determined the best RAFT model for our application. RAFT options were left at their defaults, apart from setting the number of flow refinement iterations to 12 for each model.
Table 1. RAFT model comparison
Model \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
raft-kitti 1283 1041 1416 1058 81.14% 25.28%
raft-things 1283 1085 1383 1110 84.57% 19.74%
raft-sintel 1283 1088 1384 1114 84.80% 19.51%
raft-chairs 1283 672 1173 702 52.38% 40.15%

Subsequent evaluations were run with the raft-sintel model.

We then evaluated the effect of framerate on our algorithm. The original RGBD data was recorded at 30Hz, but we downsampled it to 15, 10, and 6 Hz.
Table 2. Framerate comparison
Framerate \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
15Hz 1283 1088 1384 1114 84.80% 19.51%
10Hz 849 715 900 731 84.22% 18.78%
6Hz 513 416 530 422 81.09% 20.38%

The relative performance of the algorithm is similar across framerates, but a higher framerate allows for earlier detection of dynamic objects. Therefore, we ran all other evaluations at 15Hz.

We then evaluated the effect of various noise-reduction methods by performing ablations on various parameters of the algorithm.
Table 3. Noise reduction comparison
Method \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
Original 1283 1088 1384 1114 84.80% 19.51%
No Morphology 1283 1102 2626 1294 85.89% 50.72%
No axial variance-based threshold 1283 1119 1662 1183 87.22% 28.82%
No depth-based velocity threshold gain* 1318 1097 1605 1112 83.23% 30.72%

* We set the fixed velocity threshold to 0.775 (0.4 + 0.125 * 3) for this run. This was previously the average velocity threshold in our depth range (6m), since the original parameters were 0.4 for the minimum velocity threshold and 0.125 for the depth-based gain.

We find that morphological operations are very effective at removing noise from the mask. Variance-based discarding and depth-based velocity threshold gain also have a noticeable effect. Variance-based thresholding and morphological operations somewhat reduce the detection rate, but the reduction in false positives is more significant.

To further reduce false positives, we evaluated the effect of requiring that a dynamic object be detected in two consecutive frames before being considered a dynamic object.
Table 4. Two-consecutive-frame detection requirement
Method \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
Original 1283 1088 1384 1114 84.80% 19.51%
2-Consecutive Requirement 1283 1080 1224 1093 84.18% 10.70%

This change halved the number of false positives while barely decreasing the detection rate. We nevertheless did not include it in the default algorithm used for the other evaluations, since we did not want to influence results by relying on the probabilistic assumption that noise is independent between frames.

Conclusion

Although our method shows promise for efficient real-time detection and tracking of dynamic obstacles, it is extremely sensitive to data quality. During our experiments, we observed that poor-quality depth data or odometry estimates led to large deviations between the geometric optical flow and the RAFT optical flow in static areas of the image. Future work could include using a trained neural network to segment dynamic objects, since our current pipeline does not use the corresponding RGB image as a prior when segmenting flow residuals. A network that also leverages FastSAM segments or superpixels would make better use of all the available data.

A planned extension of this work is to use LiDAR as the source for depth data. This would greatly increase the range and accuracy of depth data, at the cost of sparsity (which could be addressed with stereo camera-LiDAR fusion). In addition, it may allow us to use the 3D residual method discussed in Appendix I, which can better detect movement in the camera Z axis.

A major limitation of this method is that under high accelerations, the RAFT and geometric flows diverge significantly, leading to underwhelming results during UAV flights. There are multiple culprits, including noisy depth data and poor RAFT results at high accelerations. The most likely cause seems to be poor odometry, since the divergence often occurs during acceleration, suggesting latency and/or noise in the odometry data. Also, since pose data is collected at times independent of the RGBD data, we must linearly interpolate it for every frame, which introduces additional error.

Despite these shortcomings, our method proves to be a promising approach to the problem of dynamic obstacle tracking. With only odometry and depth camera data, it demonstrates good performance and runtime on real-world data in the tested environments. There appear to be many directions for further work to improve robustness, efficiency, and generalization.


References

[1] Tordesillas, J., Lopez, B. T., & How, J. P. (2019, November). Faster: Fast and safe trajectory planner for flights in unknown environments. In 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1934-1940). IEEE.

[2] Tordesillas, J., & How, J. P. (2021). MADER: Trajectory planner in multiagent and dynamic environments. IEEE Transactions on Robotics, 38(1), 463-476.

[3] Kondo, K., Tordesillas, J., Figueroa, R., Rached, J., Merkel, J., Lusk, P. C., & How, J. P. (2023, May). Robust MADER: Decentralized and asynchronous multiagent trajectory planner robust to communication delay. In 2023 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1687-1693). IEEE.

[4] Tordesillas, J., & How, J. P. (2022). Panther: Perception-aware trajectory planner in dynamic environments. IEEE Access, 10, 22662-22677.

[5] Falanga, D., Kleber, K., & Scaramuzza, D. (2020). Dynamic obstacle avoidance for quadrotors with event cameras. Science Robotics, 5(40), eaaz9712.

[6] Chao, H., Gu, Y., & Napolitano, M. (2013, May). A survey of optical flow techniques for UAV navigation applications. In 2013 International Conference on Unmanned Aircraft Systems (ICUAS) (pp. 710-716). IEEE.

[7] Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., & Rupprecht, C. (2024, September). Cotracker: It is better to track together. In European Conference on Computer Vision (pp. 18-35). Cham: Springer Nature Switzerland.

[8] Smith, C., Charatan, D., Tewari, A., & Sitzmann, V. (2024). Flowmap: High-quality camera poses, intrinsics, and depth via gradient descent. arXiv preprint arXiv:2404.15259.

[9] Teed, Z., & Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II 16 (pp. 402-419). Springer International Publishing.

[10] Tian, Y., Chang, Y., Arias, F. H., Nieto-Granda, C., How, J. P., & Carlone, L. (2022). Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems. IEEE Transactions on Robotics, 38(4).

Acknowledgments

We would like to thank Professor Jonathan How for his guidance throughout this project and for generously providing access to the resources that we utilized. Special thanks as well to Lucas Jia for assisting us with physical experiments and data collection.


Codebase

See [dynamic-object-detection on Github] for our final implementation code. Instructions for downloading our data and running the algorithm and evaluations are in the README.

The prototype code is also [hosted on Github].


Appendix

I. Using 3D Residuals

Because we have the depth data for both frame 1 and frame 2, in theory we can compute the full 3D movement of each pixel between frames. Given RAFT flow \( [\delta u_{raft}, \delta v_{raft}] \) at pixel \( [u, v] \), we can compute the 3D residual as follows: \[ p_2 = (T_{odom}^{cam})^{-1} \cdot T_{r2}^{r1} \cdot T_{odom}^{cam} \cdot [p_1^T \; 1]^T \] \[ p_{2, raft} = d_2 K^{-1} [u + \delta u_{raft}, v + \delta v_{raft}, 1]^T \] \[ \delta \approx p_{2, raft} - p_2 \] where \( d_2 \) is the depth of pixel \( [u + \delta u_{raft}, v + \delta v_{raft}] \) in frame 2. This method works by assuming that RAFT correctly predicts pixel correspondences, in which case the 3D point corresponding to pixel \( [u + \delta u_{raft}, v + \delta v_{raft}] \) in frame 2 is the same as the 3D point corresponding to pixel \( [u, v] \) in frame 1. Since \( p_2 \) accounts for the camera motion between frames but not the motion of the point itself, while RAFT accounts for both, the difference is the approximate 3D motion of the point between frames.
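A per-pixel sketch of this 3D residual computation, reusing \( p_2 \) from the geometric-flow step; the function and argument names are illustrative.

```python
import numpy as np

def residual_3d(u, v, p2, raft_flow, depth2, K):
    """3D residual at pixel (u, v): where RAFT says the point went (unprojected
    with frame-2 depth) minus where camera motion alone would place it (p2)."""
    du, dv = raft_flow                       # RAFT flow (du, dv) at (u, v)
    u2, v2 = int(round(u + du)), int(round(v + dv))
    d2 = depth2[v2, u2]                      # frame-2 depth at the matched pixel
    p2_raft = d2 * np.linalg.inv(K) @ np.array([u2, v2, 1.0])
    return p2_raft - p2                      # approximate 3D motion of the point
```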

Unfortunately, this method is not effective in practice. The RAFT flow is not accurate enough, and noise in the depth data is amplified by the projection into 3D. We ran our evaluation with the 3D residual method and found that it decreased the detection rate and dramatically increased the number of false positives, mostly patches of depth noise at larger depths.
Table 5. Using 3D residuals
Method \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
Original 1283 1088 1384 1114 84.80% 19.51%
3D residual 1283 939 35857 1400 73.19% 96.10%

II. Max Depth Limits

We later compared performances at different maximum depth limits (all other parameters fixed).
Table 6. Depth limit comparison
Depth Limit (m) \( N_{gt} \) \( N_{det} \) \( N \) \( N_{corr} \) \( det\% \) \( fp\% \)
6.0 1283 1088 1384 1114 84.80% 19.51%
7.5 1306 1087 1434 1113 83.23% 22.38%
10.0 1306 1049 1317 1075 80.32% 28.82%
15.0 1306 1042 1288 1068 79.79% 17.08%

As the maximum depth limit increases, the number of dynamic objects according to the ground truth does not significantly increase (possibly due to the depth-based velocity threshold gain). The detection rate decreases with increased depth limit.

We looked at the output videos and found that the 15.0m limit decreased the false positive rate because it brought the walls into range of the depth data, so odometry errors caused larger portions of the image to be masked, which were then pruned by the axial variance threshold. However, this is not a generalizable phenomenon. Therefore, we decided to continue using the 6.0m limit suggested by Intel, since depth readings beyond that range become significantly noisier.

III. Smoothing and Quadcopter Experiments

While developing the prototype, we noticed that rapid changes in pose produced high-frequency noise in the flow residuals, especially in quadcopter experiments where video was recorded on a highly dynamic vehicle. Our first instinct was to design a low-pass filter to smooth the outputs of our pipeline. Thus, our prototype returned a running average of the flow residuals, parametrized by \( \alpha \) as \[ \text{smoothFlow} = \alpha \cdot \text{Flow} + (1 - \alpha) \cdot \text{prevFlow} \] As expected of a running average, smoothing mitigated some of the noise, but it also introduced a delay between the estimated and ground-truth obstacle positions. In fact, light smoothing granted a minor performance boost, but performance quickly degraded as we over-smoothed the signal and widened the gap between the estimate and the actual position.
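The running average itself is just an exponential moving average over residuals; a one-line sketch:

```python
def smooth_flow(flow, prev_flow, alpha):
    """Running average of flow residuals. Smaller alpha (larger 1 - alpha) means
    heavier smoothing and more lag behind the true obstacle motion."""
    return alpha * flow + (1.0 - alpha) * prev_flow
```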

This motivated us to pivot away from the running-average approach. The analysis above was performed on data collected onboard an x500 Holybro quadcopter. Below are the flow-residual masks from the videos recorded by the quadcopter with \( 1 - \alpha = 0.2 \) smoothing (left) and \( 1 - \alpha = 0.8 \) smoothing (right). The flashes of white are moments in which the RAFT flow and geometric flow disagree. The attentive viewer will notice that the flashing occurs when the quadcopter rolls. We found this to be a common trend: the signal-to-noise ratio decreases as the vehicle performs aggressive maneuvers, particularly during sharp changes in orientation. Notice that the video on the right exhibits less flashing, as it is heavily smoothed. This comes at the expense of a higher trajectory error, as the smoothed estimate lags behind the unprocessed one. Although we developed an improved noise-reduction strategy, the noise introduced during aggressive UAV flight requires denoising techniques beyond the scope of this project, so we constrained the remainder of our experiments to ground vehicles.

Figure 5. Left is light smoothing, right is heavy smoothing.

To estimate the obstacle position, we unproject the points in the mask into 3D space and take their centroid, which we use as our prediction of the obstacle's position. Clearly, a significant limitation of this approach is that it only works in scenes with at most one dynamic obstacle. Note that these are preliminary results, meant to convey the challenge of handling noisy visual signals during aggressive motion. Our final implementation computes the trajectory error following the more sophisticated procedure outlined in the Methodology and Experiments sections of this blog.