Region-based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision, and specifically object detection and localization.[1] The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box contains an object and also the category (e.g. car or pedestrian) of the object. In general, R-CNN architectures perform selective search[2] over feature maps outputted by a CNN.
R-CNN has been extended to perform other computer vision tasks, such as: tracking objects from a drone-mounted camera,[3] locating text in an image,[4] and enabling object detection in Google Lens.[5]
Mask R-CNN is also one of seven tasks in the MLPerf Training Benchmark, which is a competition to speed up the training of neural networks.[6]
^Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "14.8. Region-based CNNs (R-CNNs)". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN978-1-009-38943-3.