3D Object Detection for TIAGo Robot: using a Faster R-CNN Network

Authors: Jianfeng Cui & Aakaash Radhoe

Introduction and Motivation

Figure 1: Products in the retail store. The products are labeled from 0 to 4 from left to right.

TIAGo is equipped with an RGB-D camera. So basically the strategy is that we trained a Faster R-CNN on our customized dataset(with COCO format) and utilized it for 2D object detection. Each target’s 2D bounding box is able to generate a 3D viewing frustum, by which we could filter out the object point clouds from all possible candidates. Finally, the filtered “clean” point clouds are fed to a cylinder fitting detector by using the PCL library to calculate a 3D pose. This works because we priorly know that the five products are all shaped like a cylinder.

Now let’s go through the work step by step!

Design: 2D object detection

The model

The Faster R-CNN is composed of a feature extraction network which is usually a pre-trained CNN. This is then followed by the object detection system, which is composed of two modules, which are both trainable. The first module is the Region Proposal Network (RPN), which generates the object proposals. The second module predicts the actual class of the object.

Figure 2: Faster R-CNN, the unified object detection network, Source:[7]

Anchors are very important in Faster R-CNN, an anchor is a box. These anchors are used to provide a predefined set of bounding boxes, in different sizes and ratios. The default for Faster R-CNN is 9 anchor boxes for an image. The anchors are used to capture the scale and aspect ratio of the specific object classes we want to detect.

Figure 3: Anchor boxes at (320,320), Source:[8]

The Region Proposal Network (RPN) will output proposals (anchors), this will then go through a classifier and regressor to check for the occurrence of objects. After the RPN, the proposals are in different sizes. With different sized regions, we will get different sized CNN feature maps, this will make it inefficient to work with. To solve this Region of Interest Pooling is used, this way the feature maps are reduced to the same size.

The code we used for the Faster R-CNN is from open-mmlab. This toolbox is based on PyTorch. The Faster R-CNN network we used only has a minor edit for the anchor boxes. The anchor boxes used for this model are 8 instead of the default 9.

We used this toolbox to train the network on the product images we selected and annotated. But before that, we need to convert the generated dataset format. Labelme generated JSON files separately for each image sample, now for feeding the dataset to the training pipeline we use this package [3] to generate a single COCO data formatted JSON file. Paired with the folder containing all the images, it integrates all the information about our dataset.

Prepare the data

Figure 4: Annotation process of the items

Training the model


Figure 5: cls and bbox loss plot

In figure 5, the loss functions are plotted for the cls and bbox loss. Here we can see that the cls loss converges around 900 iterations and the bbox loss after around 1250 iterations.

Figure 6: bbox_mAP plotted per epoch

In figure 6, the bbox mAP is plotted for each epoch. At 12 epoch the bbox mAP is at 1. These results were sufficient to proceed and implement the model in the simulation.

Deploy the network in ROS

The inference result from the deployed network will be a list containing arrays representing the detected bounding box for each object respectively. Each valid array contains 5 elements: the x and y pixel coordinates of the upper left and bottom right corner, and the prediction confidence. The result is then used to generate a standard ROS message Detection2D:

def generate_obj(self, result, id, msg):
obj = Detection2D()
obj.header = msg.header
obj.source_img = msg
result = result[0]
obj.bbox.center.x = (result[0] + result[2]) / 2
obj.bbox.center.y = (result[1] + result[3]) / 2
obj.bbox.size_x = result[2] — result[0]
obj.bbox.size_y = result[3] — result[1]
obj_hypothesis = ObjectHypothesisWithPose()
obj_hypothesis.id = str(id)
obj_hypothesis.score = result[4]
return obj

After appending all detected objects into a Detection2DArray, the packaged message can be published to ROS and used by other modules.

Another problem is that the CvBridge library for converting image formats for ROS is not available for Python3. This can be smartly solved by a trick using NumPy:

im = np.frombuffer(msg.data, dtype = np.uint8).reshape(msg.height, msg.width, -1)

When the simulation is running and using it on the TIAGo robot, these are the results we get. In figure 7, we can see how the network detects the items presented to the robot and shows the prediction confidence.

Figure 7: Faster R-CNN deployed in ROS

Now we have successfully deployed a 2D object detection network in ROS! Now let’s further use it for our 3D object detection.

Design: 3D object detection

In fact, we found that this strategy is quite similar to the F-PointNet proposed by this paper [9] for 3D object detection. As shown below in the figure from the paper, the basic idea is also extruding the 2D region to a 3D viewing frustum. But then we do not use PointNet or other 3D object detection networks to generate 3D bounding boxes, rather in this task we use the point clouds algorithm for fitting cylinders to calculate a 3D pose of our target object.

Figure 8: F-PointNet strategy, Source: [9]

Filter the object point cloud out

sensor_msgs::PointCloud2 pcs_filtered;
int POINT_STEP = latestPointClouds.point_step; // NOTE: 32, actually
size_t tc_u_min = int(tc_u — tc_size_u / 2) * POINT_STEP;
size_t tc_v_min = int(tc_v — tc_size_v / 2);
size_t tc_u_max = tc_u_min + tc_size_u * POINT_STEP;
size_t tc_v_max = tc_v_min + tc_size_v;

Note that here the target object’s bounding box is already found and selected out. The Variables tc_u, tc_v, tc_size_u and tc_size_v are the centroid’s x, y coordinate and the size along x and y-direction. The four calculated variables indicate the lower and upper bound index limit in the data field of the raw point clouds. Then we construct the filtered point cloud:

pcs_filtered.header.frame_id = latestPointClouds.header.frame_id;
pcs_filtered.header.stamp = ros::Time::now();
pcs_filtered.height = tc_size_v;
pcs_filtered.width = tc_size_u;
pcs_filtered.fields = latestPointClouds.fields;
pcs_filtered.is_bigendian = latestPointClouds.is_bigendian;
pcs_filtered.point_step = POINT_STEP; //32
pcs_filtered.row_step = POINT_STEP * tc_size_u;
pcs_filtered.is_dense = latestPointClouds.is_dense;
int raw_row_step = latestPointClouds.row_step;
decltype(latestPointClouds.data) filtered_data = {};
for(size_t row = tc_v_min; row < tc_v_max; row++)
for(size_t col = tc_u_min; col < tc_u_max; col++)
auto dat = latestPointClouds.data[row * raw_row_step + col];
pcs_filtered.data = filtered_data;

Now pcs_filtered contains the filtered clean point clouds that we want! We can now calculate its pose.

Calculate the 3D pose

pcl::SACSegmentationFromNormals<PointT, pcl::Normal> seg;
pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
seg.setNormalDistanceWeight (0.1);
seg.setRadiusLimits(minRadius, maxRadius);
seg.segment(*inliers, *cylinderCoefficients);

The cylinder detector can fit the input object point cloud as a cylinder and also return its centroid’s pose. That is what we want! Now the pipeline is done!

Communication overview

Figure 9: Ros Nodes

We have find_object and mmdetector nodes. find_object is for the TIAGo’s perception. It receives the object id from set_nav_goal and returns the 3D pose of this target object.

The node mmdetector utilizes the Faster R-CNN on the RGB image from the camera to calculate object 2D bounding boxes and publish them on mmdetector/objects for find_object to use, and also publish the debug image on mmdetector/debug_image for image_view to visualize it. After find_object receives the call from set_nav_goal(perception module starts), it subscribes to the 2D bounding box and also the whole raw pointcloud, processes it(filter out the pointcloud located in the frustum generated by 2D bounding boxes), and publishes this “clean” filtered pointcloud on the topic to_cylinder_detector. The cylinder detector processes this pointcloud, calculates the 3D pose of this object and publishes it. Note that find_object already keeps monitoring if the 3D pose is available, so now it gets it and returns this result back to set_nav_goal(perception module finished the task).


Animation: Object Detection in ROS

The whole video can be found at: https://drive.google.com/file/d/1VqTg8dFCKsVXLAXaDL9LmUPcT7icDjg0/view?usp=sharing, where the 3D object detection task is to recognize the sprite and calculate its 3D pose. In the video you can see two windows running: Inside of TIAGo++’s headshowing the current camera view and mmdetector/debug_image showing the debug image calculated by the Faster R-CNN. The debug image window runs slower(with 1 Hz) because this process can stress a burden on your GPU so we tuned this frequency. After TIAGo moves in front of the 5 objects on the table, it looks down, which is a pre-defined behavior because we assume that the objects will be roughly located on the table, lower than the original image view. And then TIAGo looks at the target object(in this case, sprite) and calculates the 3D pose. At 1:48 you can see the whole point cloud attained(with colors painted on) by the RGBD camera. At 2:14 you can see the filtered “sprite” point cloud and at 2:49 the calculated 3D pose of the sprite. The perception part can recognize the object it wants to find out pretty well. Below is an animation from the video showing the sprite point cloud.

Animation: Point Cloud of the sprite can







[5] https://pcl.readthedocs.io/projects/tutorials/en/latest/cylinder _segmentation.html?highlight=cylinder


[7]Ren, S., He, K., Girshick, R., Sun, J. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. https://arxiv.org/abs/1506.01497


[9]Charles, R. Q., Wei, L., Chenxia, W., Hao, S., Leonidas, J. G. (2018) Frustum PointNets for 3D Object Detection From RGB-D Data https://arxiv.org/abs/1711.08488



MSc Student Robotics at the Technical University of Delft.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store