3D Object Detection for TIAGo Robot: using a Faster R-CNN Network

Authors: Jianfeng Cui & Aakaash Radhoe

Introduction and Motivation

In this project, we designed a strategy for solving the 3D object detection problem for TIAGo [1], a robot working in a retail store. TIAGo’s overall task is to recognize, pick up and deliver products to customers, and this project serves as its perception module. The module recognizes the target products that appear in the camera image, calculates their 3D poses and passes this information on to the other modules. In our case, the retail store has 5 products that need to be recognized: ‘biscuits’, ‘cocacola’, ‘pringles’, ‘pringles2’ and ‘sprite’, with encoded ids from 0 to 4.

Figure 1: Products in the retail store. The products are labeled from 0 to 4 from left to right.

TIAGo is equipped with an RGB-D camera. Our strategy is to train a Faster R-CNN on a custom dataset (in COCO format) and use it for 2D object detection. Each target’s 2D bounding box defines a 3D viewing frustum, which we use to select the object’s points from the full point cloud. Finally, the filtered “clean” point cloud is fed to a cylinder fitting detector built on the PCL library, which calculates a 3D pose. This works because we know beforehand that all five products are shaped like cylinders.

Now let’s go through the work step by step!

Design: 2D object detection

The model

We first looked around for an object detection algorithm with available code and came across several good choices, such as YOLOv4, SqueezeDet and Faster R-CNN. At first we chose YOLOv4, but it turned out to be too computationally expensive for the simulation and we could not reproduce it in the given timeframe. So we switched to Faster R-CNN, which was easy to train and fast enough for real-time detection in the simulation.

The Faster R-CNN is composed of a feature extraction network which is usually a pre-trained CNN. This is then followed by the object detection system, which is composed of two modules, which are both trainable. The first module is the Region Proposal Network (RPN), which generates the object proposals. The second module predicts the actual class of the object.

Figure 2: Faster R-CNN, the unified object detection network, Source:[7]

Anchors are very important in Faster R-CNN. An anchor is a box: the anchors provide a predefined set of bounding boxes in different sizes and aspect ratios. By default, Faster R-CNN uses 9 anchor boxes (3 scales times 3 aspect ratios) at each sliding-window position. The anchors are used to capture the scale and aspect ratio of the specific object classes we want to detect.
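To make the idea of anchors concrete, here is a minimal sketch that generates the 9 boxes for one position. The scales (128, 256, 512) and aspect ratios (0.5, 1, 2) are the defaults from the Faster R-CNN paper [7]; the function name make_anchors is ours, and this is not the code mmdetection uses internally.

```python
import numpy as np


def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return a (len(scales) * len(ratios), 4) array of [x1, y1, x2, y2] boxes."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # ratio is height / width; the box area stays close to scale ** 2
            width = scale / np.sqrt(ratio)
            height = scale * np.sqrt(ratio)
            anchors.append([
                center_x - width / 2, center_y - height / 2,
                center_x + width / 2, center_y + height / 2,
            ])
    return np.array(anchors)


# Anchors centered at pixel (320, 320), as in figure 3
anchors = make_anchors(320, 320)
print(anchors.shape)
```

Every box shares the same center; only the size and the aspect ratio change.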

Figure 3: Anchor boxes at (320,320), Source:[8]

The Region Proposal Network (RPN) scores each anchor with a classifier, which checks whether an object is present, and refines its box with a regressor; the result is a set of proposals. These proposals come in different sizes, and differently sized regions yield differently sized CNN feature maps, which are inefficient to work with. To solve this, Region of Interest (RoI) pooling is used, which reduces the feature maps to the same size.
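As an illustration of how RoI pooling turns regions of different sizes into outputs of one fixed size, here is a simplified single-channel sketch in NumPy. The function roi_max_pool and the toy feature map are our own assumptions for the example; real implementations work on batched multi-channel tensors and handle sub-cell coordinates.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a 2D feature map to a fixed output size.

    roi is (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    The region must be at least as large as output_size in both directions.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid of bins
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.max()
    return pooled


feature_map = np.arange(36).reshape(6, 6)
small = roi_max_pool(feature_map, (0, 0, 4, 4))  # 4 x 4 region
large = roi_max_pool(feature_map, (1, 1, 6, 6))  # 5 x 5 region
print(small.shape, large.shape)
```

Both regions end up as 2 x 2 outputs, so the layers that follow can treat every proposal the same way.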

The code we used for the Faster R-CNN is from open-mmlab. This toolbox is based on PyTorch. The Faster R-CNN network we used only has a minor edit for the anchor boxes. The anchor boxes used for this model are 8 instead of the default 9.

We used this toolbox to train the network on the product images we selected and annotated. Before that, we need to convert the dataset format. Labelme generates a separate JSON file for each image sample, so to feed the dataset to the training pipeline we use the package [3] to generate a single JSON file in COCO format. Paired with the folder containing all the images, this file holds all the information about our dataset.
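The sketch below shows roughly what such a conversion does; it is not the package from [3]. It assumes each labelme record carries the usual imagePath, imageHeight, imageWidth and shapes fields, and it approximates the annotation area by the bounding-box area. The function name labelme_to_coco and the example record are made up for illustration.

```python
import json

CATEGORIES = ["biscuits", "cocacola", "pringles", "pringles2", "sprite"]


def labelme_to_coco(labelme_records):
    """Merge per-image labelme dicts into one COCO-style dict."""
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": name} for i, name in enumerate(CATEGORIES)],
    }
    for image_id, record in enumerate(labelme_records):
        coco["images"].append({
            "id": image_id,
            "file_name": record["imagePath"],
            "height": record["imageHeight"],
            "width": record["imageWidth"],
        })
        for shape in record["shapes"]:
            xs = [point[0] for point in shape["points"]]
            ys = [point[1] for point in shape["points"]]
            x, y = min(xs), min(ys)
            w, h = max(xs) - x, max(ys) - y
            coco["annotations"].append({
                "id": len(coco["annotations"]),
                "image_id": image_id,
                "category_id": CATEGORIES.index(shape["label"]),
                "bbox": [x, y, w, h],  # COCO boxes are [x, y, width, height]
                "area": w * h,  # simplification: box area, not polygon area
                "segmentation": [[c for point in shape["points"] for c in point]],
                "iscrowd": 0,
            })
    return coco


# Hypothetical record, shaped like one labelme JSON file
example_record = {
    "imagePath": "img_000.jpg",
    "imageHeight": 480,
    "imageWidth": 640,
    "shapes": [{"label": "sprite", "points": [[10, 20], [50, 20], [50, 100], [10, 100]]}],
}
coco = labelme_to_coco([example_record])
print(json.dumps(coco["annotations"][0]["bbox"]))
```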

Prepare the data

To collect images, we ran TIAGo in ROS, subscribed to the images published by the camera at a reduced frequency, and kept the images that are representative of the five objects. We picked 121 images and built the custom dataset with labelme [2], a graphical image annotation tool. By manually drawing polygons on an image and labeling them with category names, labelme generates a JSON file for each image that stores the polygon coordinates and the label information. Figure 4 shows how we labeled the data with the labelme tool.

Figure 4: Annotation process of the items

Training the model

With the data prepared, we started training the Faster R-CNN. Training took approximately 15 minutes on an Nvidia GTX1650. The trained network is saved as a .pth file, which can be found on GitHub, and is loaded later on in the ROS detection module.

With the model trained, we validated it on the test set and got the following results:

Figure 5: cls and bbox loss plot

In figure 5, the cls and bbox losses are plotted. The cls loss converges at around 900 iterations and the bbox loss at around 1250 iterations.

Figure 6: bbox_mAP plotted per epoch

In figure 6, the bbox mAP is plotted for each epoch. At epoch 12 the bbox mAP reaches 1. These results were sufficient to proceed and implement the model in the simulation.

Deploy the network in ROS

Now it’s time to bring the magic to ROS. With the Faster R-CNN model trained and able to detect all the products, we implemented this in a general ROS package, so we can run this in a real-time Gazebo simulation. The ROS package is designed for running a 2D object detector’s inference using mmdetection in ROS, which can be found at [4]. There the installation and usage are explained.

The inference result from the deployed network is a list of arrays that hold the detected bounding boxes for each object. Each valid detection contains 5 elements: the x and y pixel coordinates of the upper left and bottom right corners, and the prediction confidence. The result is then used to generate a standard ROS message Detection2D:

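from vision_msgs.msg import Detection2D, ObjectHypothesisWithPose

def generate_obj(self, result, id, msg):
    obj = Detection2D()
    obj.header = msg.header
    obj.source_img = msg
    # Use the first box found for this object: [x1, y1, x2, y2, score]
    result = result[0]
    obj.bbox.center.x = (result[0] + result[2]) / 2
    obj.bbox.center.y = (result[1] + result[3]) / 2
    obj.bbox.size_x = result[2] - result[0]
    obj.bbox.size_y = result[3] - result[1]
    obj_hypothesis = ObjectHypothesisWithPose()
    obj_hypothesis.id = str(id)
    obj_hypothesis.score = result[4]
    # Attach the hypothesis so that the id and score travel with the message
    obj.results.append(obj_hypothesis)
    return obj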

After appending all detected objects into a Detection2DArray, the packaged message can be published to ROS and used by other modules.
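Before building the messages, it can help to drop weak detections. The following sketch keeps, for each class, the highest-scoring box above a threshold. It assumes the result is a list with one (n, 5) array per class, in class-id order; the function best_detections, the threshold of 0.5 and the sample numbers are our own choices for the example.

```python
import numpy as np


def best_detections(result, score_threshold=0.5):
    """Map class id -> best box (center, size, score) above the threshold."""
    detections = {}
    for class_id, boxes in enumerate(result):
        boxes = np.asarray(boxes, dtype=float)
        if boxes.shape[0] == 0:
            continue  # nothing detected for this class
        best = boxes[np.argmax(boxes[:, 4])]
        if best[4] < score_threshold:
            continue
        x1, y1, x2, y2, score = best
        detections[class_id] = {
            "center": ((x1 + x2) / 2, (y1 + y2) / 2),
            "size": (x2 - x1, y2 - y1),
            "score": float(score),
        }
    return detections


# Made-up result for three classes: none, two boxes, one weak box
sample_result = [
    np.zeros((0, 5)),
    np.array([[10, 20, 50, 80, 0.9], [0, 0, 5, 5, 0.3]]),
    np.array([[1, 1, 2, 2, 0.2]]),
]
detections = best_detections(sample_result)
print(sorted(detections))
```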

Another problem is that the CvBridge library, which converts between ROS and OpenCV image formats, is not available for Python 3. This can be solved with a small NumPy trick:

im = np.frombuffer(msg.data, dtype = np.uint8).reshape(msg.height, msg.width, -1)
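The one-liner above relies on the rows being stored without padding. A slightly more defensive sketch, which uses the step field of sensor_msgs/Image to skip any padding bytes, could look like this. The helper image_msg_to_array is hypothetical, the message is faked with SimpleNamespace so the example runs without ROS, and only three 8-bit encodings are handled.

```python
from types import SimpleNamespace

import numpy as np

# Channels per pixel for a few common 8-bit encodings (not an exhaustive list)
CHANNELS = {"rgb8": 3, "bgr8": 3, "mono8": 1}


def image_msg_to_array(msg):
    """Convert an 8-bit image message to a (height, width, channels) array."""
    channels = CHANNELS[msg.encoding]
    rows = np.frombuffer(msg.data, dtype=np.uint8).reshape(msg.height, msg.step)
    # msg.step is the full row length in bytes; cut off any padding at the end
    pixels = rows[:, :msg.width * channels]
    return pixels.reshape(msg.height, msg.width, channels)


# Fake 2 x 2 rgb8 image with 2 padding bytes per row (step = 8)
fake_msg = SimpleNamespace(height=2, width=2, step=8, encoding="rgb8",
                           data=bytes(range(16)))
image = image_msg_to_array(fake_msg)
print(image.shape)
```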

When the simulation is running and the detector is used on the TIAGo robot, we get the following results. Figure 7 shows how the network detects the items presented to the robot, along with the prediction confidence.

Figure 7: Faster R-CNN deployed in ROS

We have now successfully deployed a 2D object detection network in ROS! Let’s use it further for our 3D object detection.

Design: 3D object detection

Our strategy for extending the current ability to 3D object detection is quite simple. TIAGo generates raw point clouds with its RGB-D camera, and we know that the image pixels can be seen as the 2D projection of the actual 3D points onto the image plane. So the 2D bounding box obtained in the last section defines a 3D frustum that restricts the raw point cloud to our area of interest. We simply assume that the points located in this area form the target 3D object. In this way, we can extract the small cluster of object points from all raw points.
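The geometry behind the frustum can be sketched with a pinhole camera model: a 3D point belongs to the frustum when its projection falls inside the 2D bounding box. The example below is only an illustration of that idea, with made-up intrinsics (fx, fy, cx, cy) and points; our actual implementation indexes the organized point cloud directly instead of projecting points.

```python
import numpy as np


def points_in_frustum(points, bbox, fx, fy, cx, cy):
    """Boolean mask of the points whose projection lies inside bbox.

    points: (N, 3) array in the camera optical frame (z pointing forward).
    bbox: (u_min, v_min, u_max, v_max) in pixels.
    """
    u_min, v_min, u_max, v_max = bbox
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    in_front = z > 0
    # Avoid dividing by zero or negative depth; those points are masked out anyway
    z_safe = np.where(in_front, z, 1.0)
    u = fx * x / z_safe + cx
    v = fy * y / z_safe + cy
    return in_front & (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)


# Made-up points and intrinsics, just to exercise the function
points = np.array([
    [0.0, 0.0, 1.0],     # projects to the image center
    [1.0, 0.0, 1.0],     # far to the right of the box
    [0.0, 0.0, -1.0],    # behind the camera
    [0.05, 0.05, 2.0],   # slightly off-center, still inside
])
mask = points_in_frustum(points, (300, 220, 340, 260), fx=500, fy=500, cx=320, cy=240)
object_points = points[mask]
print(mask)
```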

In fact, we found that this strategy is quite similar to F-PointNet, proposed in [9] for 3D object detection. As shown in the figure below, taken from the paper, the basic idea there is also to extrude the 2D region into a 3D viewing frustum. However, we do not use PointNet or another 3D object detection network to generate 3D bounding boxes; instead, we use a point cloud algorithm for fitting cylinders to calculate the 3D pose of our target object.

Figure 8: F-PointNet strategy, Source: [9]

Filter the object point cloud out

To achieve this, we hack the ROS PointCloud2 message: we index the wanted points in its data field and package them into a new object point cloud message. Here is a small snippet from the code:

sensor_msgs::PointCloud2 pcs_filtered;
int POINT_STEP = latestPointClouds.point_step; // NOTE: 32, actually
size_t tc_u_min = int(tc_u - tc_size_u / 2) * POINT_STEP;
size_t tc_v_min = int(tc_v - tc_size_v / 2);
size_t tc_u_max = tc_u_min + tc_size_u * POINT_STEP;
size_t tc_v_max = tc_v_min + tc_size_v;

Note that at this point the target object’s bounding box has already been found and selected. The variables tc_u, tc_v, tc_size_u and tc_size_v are the x and y coordinates of the bounding box center and its size along the x and y directions. The four calculated variables are the lower and upper index bounds in the data field of the raw point cloud. Then we construct the filtered point cloud:

pcs_filtered.header.frame_id = latestPointClouds.header.frame_id;
pcs_filtered.header.stamp = ros::Time::now();
pcs_filtered.height = tc_size_v;
pcs_filtered.width = tc_size_u;
pcs_filtered.fields = latestPointClouds.fields;
pcs_filtered.is_bigendian = latestPointClouds.is_bigendian;
pcs_filtered.point_step = POINT_STEP; //32
pcs_filtered.row_step = POINT_STEP * tc_size_u;
pcs_filtered.is_dense = latestPointClouds.is_dense;
int raw_row_step = latestPointClouds.row_step;
decltype(latestPointClouds.data) filtered_data = {};
// Copy the bytes of every point inside the bounding box, row by row
for(size_t row = tc_v_min; row < tc_v_max; row++)
{
    for(size_t col = tc_u_min; col < tc_u_max; col++)
    {
        filtered_data.push_back(latestPointClouds.data[row * raw_row_step + col]);
    }
}
pcs_filtered.data = filtered_data;

Now pcs_filtered contains the filtered clean point clouds that we want! We can now calculate its pose.
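For readers who prefer Python, the same byte-level cropping can be sketched with NumPy. This is an illustrative equivalent, not part of our package: the function crop_organized_cloud is hypothetical, it assumes an organized cloud (one point per pixel), and it additionally clamps the box to the cloud borders.

```python
import numpy as np


def crop_organized_cloud(data, height, width, point_step, row_step,
                         u_center, v_center, size_u, size_v):
    """Cut the bounding-box window out of an organized point cloud buffer.

    Returns the cropped bytes together with the new height and width.
    """
    rows = np.frombuffer(data, dtype=np.uint8).reshape(height, row_step)
    # Window limits in points, clamped to the cloud borders
    u_min = max(int(u_center - size_u / 2), 0)
    v_min = max(int(v_center - size_v / 2), 0)
    u_max = min(u_min + int(size_u), width)
    v_max = min(v_min + int(size_v), height)
    # Each point occupies point_step bytes within a row
    cropped = rows[v_min:v_max, u_min * point_step:u_max * point_step]
    return cropped.tobytes(), v_max - v_min, u_max - u_min


# Tiny fake cloud: 4 rows of 5 points, 2 bytes per point
fake_data = bytes(range(40))
cropped_data, new_height, new_width = crop_organized_cloud(
    fake_data, height=4, width=5, point_step=2, row_step=10,
    u_center=2, v_center=2, size_u=2, size_v=2)
print(new_height, new_width, len(cropped_data))
```

The returned height and width play the role of pcs_filtered.height and pcs_filtered.width, and the new row step would be point_step times the new width.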

Calculate the 3D pose

Since the five objects are all shaped like cylinders, we can simply use the cylinder segmentation approach from the PCL library [5]. Furthermore, PAL Robotics already provides a cylinder detector node using PCL [6], so we only need to define its communication with our code. The code for the segmentation is:

pcl::SACSegmentationFromNormals<PointT, pcl::Normal> seg;
pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
seg.setNormalDistanceWeight (0.1);
seg.setRadiusLimits(minRadius, maxRadius);
seg.segment(*inliers, *cylinderCoefficients);

The cylinder detector can fit the input object point cloud as a cylinder and also return its centroid’s pose. That is what we want! Now the pipeline is done!
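To give an idea of how a position can be derived from the fitted model, here is a small sketch. PCL’s cylinder model has 7 coefficients: a point on the axis, the axis direction and the radius. One simple option, which is our assumption and not necessarily what the PAL Robotics node does, is to project the centroid of the inlier points onto the axis. The function cylinder_center and the sample numbers are made up.

```python
import numpy as np


def cylinder_center(coefficients, inlier_points):
    """Estimate a cylinder center from model coefficients and inlier points.

    coefficients: [px, py, pz, dx, dy, dz, radius] (axis point, axis direction, radius).
    Returns (center on the axis, unit axis direction, radius).
    """
    coefficients = np.asarray(coefficients, dtype=float)
    axis_point = coefficients[0:3]
    axis_direction = coefficients[3:6] / np.linalg.norm(coefficients[3:6])
    radius = coefficients[6]
    centroid = np.asarray(inlier_points, dtype=float).mean(axis=0)
    # Project the centroid onto the axis so the center lies on the axis itself
    offset = np.dot(centroid - axis_point, axis_direction)
    center = axis_point + offset * axis_direction
    return center, axis_direction, radius


# Made-up cylinder along the z-axis with a radius of 3 cm
coefficients = [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.03]
inliers = np.array([[0.03, 0.0, 0.0], [0.0, 0.03, 0.1], [-0.03, 0.0, 0.2]])
center, axis, radius = cylinder_center(coefficients, inliers)
print(center, axis, radius)
```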

Communication overview

To integrate all parts into the ROS system and make them work in the simulation, this section shows the overall communication between the ROS nodes, giving a clear picture of the data flow behind this strategy. See the figure below.

Figure 9: ROS nodes

We have find_object and mmdetector nodes. find_object is for the TIAGo’s perception. It receives the object id from set_nav_goal and returns the 3D pose of this target object.

The mmdetector node runs the Faster R-CNN on the RGB image from the camera to calculate the 2D bounding boxes of the objects. It publishes them on mmdetector/objects for find_object to use, and publishes the debug image on mmdetector/debug_image for image_view to visualize. After find_object receives the call from set_nav_goal (the perception module starts), it subscribes to the 2D bounding boxes and to the whole raw point cloud, processes them (it extracts the points located in the frustum generated by the 2D bounding box), and publishes this “clean” filtered point cloud on the topic to_cylinder_detector. The cylinder detector processes this point cloud, calculates the 3D pose of the object and publishes it. Meanwhile, find_object keeps monitoring whether the 3D pose is available; once it is, find_object returns the result to set_nav_goal (the perception module has finished its task).


Animation: Object Detection in ROS

The whole video can be found at: https://drive.google.com/file/d/1VqTg8dFCKsVXLAXaDL9LmUPcT7icDjg0/view?usp=sharing, where the 3D object detection task is to recognize the sprite and calculate its 3D pose. In the video you can see two windows: Inside of TIAGo++’s head, showing the current camera view, and mmdetector/debug_image, showing the debug image calculated by the Faster R-CNN. The debug image window runs slower (at 1 Hz) because this process puts a heavy load on the GPU, so we lowered its frequency. After TIAGo moves in front of the 5 objects on the table, it looks down. This is a pre-defined behavior, because we assume that the objects are located roughly on the table, lower than the original image view. TIAGo then looks at the target object (in this case, the sprite) and calculates its 3D pose. At 1:48 you can see the whole point cloud (with colors painted on) captured by the RGB-D camera. At 2:14 you can see the filtered “sprite” point cloud and at 2:49 the calculated 3D pose of the sprite. The perception part recognizes the object it is looking for quite well. Below is an animation from the video showing the sprite point cloud.

Animation: Point Cloud of the sprite can


Looking back at the project, we are very happy with the result. The object detection using Faster R-CNN performed very well, and the implementation in ROS also went well. With the Faster R-CNN network, it was possible to detect the shopping items in real time and calculate their 3D pose, so we were able to implement everything we set out to. There are still some improvements we would like to make, such as adding more items or using an even less computationally expensive network. It would also be interesting to reproduce the Faster R-CNN from scratch, implement it again in ROS, and compare the two sets of results to see where the model could be improved further. Due to time constraints, however, we used the existing Faster R-CNN implementation. In the end, we are pretty happy with the results and implementation of the object detection for TIAGo.


All the code for this project can be found using the following links:

The Faster R-CNN code used can be found at: https://github.com/open-mmlab/mmdetection

The ROS package with our implementation of the Faster R-CNN can be found at: https://github.com/jianfengc11/TIAGo_find_object





[5] https://pcl.readthedocs.io/projects/tutorials/en/latest/cylinder_segmentation.html?highlight=cylinder


[7] Ren, S., He, K., Girshick, R., Sun, J. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. https://arxiv.org/abs/1506.01497


[9] Qi, C. R., Liu, W., Wu, C., Su, H., Guibas, L. J. (2018). Frustum PointNets for 3D Object Detection from RGB-D Data. https://arxiv.org/abs/1711.08488

MSc Student Robotics at the Technical University of Delft.