The Sama-Coco Dataset

We are proud to offer the Sama-Coco dataset, a relabeling of the Coco-2017 dataset by our own in-house Sama associates (here’s more information about our people!). We invite the machine learning (ML) community to use it for any purpose, free of charge and ungated.

This is part of our ongoing effort to redefine data quality for the modern age, and to contribute to the wider research and development efforts of the ML community. Here are the ungated links to the two datasets (both covered by the Creative Commons license) so that you can get started right away.

Coco-2017
Sama-Coco

Sama-Coco by the Numbers

Here’s a quick overview of the two datasets’ most important characteristics:

Overview                                       | Coco-2017       | Sama-Coco       | Difference
-----------------------------------------------|-----------------|-----------------|-------------------
Number of images                               | 123 287         | 123 287         | 0
Number of classes                              | 80              | 80              | 0
Number of classes with more objects annotated  | 33              | 47              |

Instances                                      | Coco-2017       | Sama-Coco       | Difference
-----------------------------------------------|-----------------|-----------------|-------------------
Number of instances (crowds included)          | 896 782         | 1 115 464       | 218 682 (x1.24)
Number of crowds                               | 10 498          | 47 428          | 36 930 (x4.5)
Objects composed of more than one polygon      | 86 156          | 175 698         | 89 952 (x2)
Number of vertices                             | 21 726 743      | 40 258 235      | 18 531 492 (x1.85)

Object Sizes                                   | Coco-2017       | Sama-Coco       | Difference
-----------------------------------------------|-----------------|-----------------|-------------------
Very small objects (<=10×10 pixels)            | 78 213          | 48 394          | -29 819 (x0.6)
Small objects (<32×32 pixels)                  | 371 655 (41.4%) | 555 006 (49.8%) | 183 351 (x1.49)
Medium objects (>=32×32 and <96×96 pixels)     | 307 732 (34.3%) | 354 290 (31.8%) | 46 558 (x1.15)
Large objects (>=96×96 pixels)                 | 217 395 (24.2%) | 206 168 (18.4%) | -11 227
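
Counts like the ones in this overview can be derived from any COCO-format annotation file. The following is a minimal sketch, assuming the annotations follow the standard COCO JSON layout: an `annotations` list whose entries carry an `iscrowd` flag and a `segmentation` that is either a list of flat `[x1, y1, x2, y2, ...]` polygons or a run-length-encoded mask stored as a dictionary. The function name `summarize` and the tiny `example` dictionary are illustrative only and are not part of either dataset. The exact counting rules behind the published figures may differ.

```python
def summarize(coco):
    """Count images, classes, instances, crowds, multi-polygon objects and vertices."""
    stats = {
        "images": len(coco.get("images", [])),
        "classes": len(coco.get("categories", [])),
        "instances": 0,              # every annotation, crowds included
        "crowds": 0,                 # annotations flagged with iscrowd == 1
        "multi_polygon_objects": 0,  # annotations made of more than one polygon
        "vertices": 0,               # total polygon vertices
    }
    for ann in coco.get("annotations", []):
        stats["instances"] += 1
        if ann.get("iscrowd", 0) == 1:
            stats["crowds"] += 1
        seg = ann.get("segmentation")
        # Polygon segmentations are lists of flat coordinate lists;
        # run-length-encoded masks are dictionaries and have no vertices.
        if isinstance(seg, list):
            if len(seg) > 1:
                stats["multi_polygon_objects"] += 1
            stats["vertices"] += sum(len(polygon) // 2 for polygon in seg)
    return stats


# Hypothetical two-annotation example (not real dataset content).
example = {
    "images": [{"id": 1}],
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "iscrowd": 0,
         "segmentation": [[0, 0, 10, 0, 10, 10],
                          [20, 20, 30, 20, 30, 30, 20, 30]]},
        {"id": 2, "image_id": 1, "category_id": 1, "iscrowd": 1,
         "segmentation": {"size": [100, 100], "counts": [50, 10, 9940]}},
    ],
}

stats = summarize(example)
```

For a real annotation file, the same function can be applied to the dictionary returned by `json.load`.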

[Figure: Number of instances per class (10 most frequent classes). Axis: number of instances.]

Sama-Coco’s Key Features

Some key features should be highlighted:

  • The number of images and the number of object classes are the same in Sama-Coco and the original Coco-2017 dataset.
  • Sama-Coco contains far more crowds (47 428 versus 10 498) and more instances overall. This is partly because our associates were tasked with decomposing large, singular crowds into individual elements and smaller crowds. Although both datasets share the same base images, Sama-Coco has more instances for 47 of the 80 classes. For some classes, such as person, the number of instances is significantly higher than in Coco-2017.
  • Associates were instructed to be more precise and comprehensive when annotating instances and crowds. As a result, the total number of vertices nearly doubled (x1.85). The number of large objects also dropped, by 11 227, as the individual members or elements of big crowds or clusters of objects were relabeled as their own unique items.
  • There are significantly fewer very small objects, those measuring 10×10 pixels or less (48 394 versus 78 213). We chose at the outset to ask associates not to annotate such small objects, in order to balance quality against the time allocated to labeling. We believe that the significantly greater number of other small objects (between 10×10 and 32×32 pixels) and medium objects (between 32×32 and 96×96 pixels) in our dataset justifies this decision.
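
The size categories used above can be expressed as a small classification function. This is a sketch under one stated assumption: that an object's size is judged by its pixel area, so that "32×32" means an area of 1 024 pixels. The source only gives the thresholds, not the measurement method. In the published figures, the small, medium and large counts add up to the total number of instances, so the sketch treats very small objects as a subset of small objects. All names here (`size_bucket`, `is_very_small`, `bucket_counts`) are illustrative.

```python
# Area limits in pixels (assumption: sizes are compared by area).
VERY_SMALL_LIMIT = 10 * 10   # very small: area <= 100
SMALL_LIMIT = 32 * 32        # small: area < 1024
MEDIUM_LIMIT = 96 * 96       # medium: 1024 <= area < 9216, large: area >= 9216


def size_bucket(area):
    """Return 'small', 'medium' or 'large' for an object area in pixels."""
    if area < SMALL_LIMIT:
        return "small"
    if area < MEDIUM_LIMIT:
        return "medium"
    return "large"


def is_very_small(area):
    """Very small objects are the subset of small objects at or below 10x10."""
    return area <= VERY_SMALL_LIMIT


def bucket_counts(areas):
    """Count objects per size category; 'very small' overlaps with 'small'."""
    counts = {"very small": 0, "small": 0, "medium": 0, "large": 0}
    for area in areas:
        counts[size_bucket(area)] += 1
        if is_very_small(area):
            counts["very small"] += 1
    return counts


# Hypothetical areas, chosen to hit each boundary.
counts = bucket_counts([64, 100, 500, 1024, 5000, 9216, 20000])
```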

Illustrative Differences between Sama-Coco and Coco-2017

The following two examples illustrate some of the differences between Sama-Coco and Coco-2017.

In the first example, the Coco-2017 labelers largely treated the group as one singular crowd, whereas in Sama-Coco each person was labeled individually.

[Image: Sama-Coco dataset comparison]

The second example shows the high level of precision with which most annotations were carried out. Coco-2017’s motorcycle annotation is rather coarse, whereas Sama-Coco’s is more fine-grained.

[Image: Sama-Coco dataset comparison vs. Coco-2017]

How Sama-Coco was Labeled

Up to 500 associates revisited all 123 287 images, which were pre-loaded with the annotations from the Coco-2017 dataset, and performed three key tasks. They had to:

  • Distinguish crowd from non-crowd images (note that both Sama-Coco and Coco-2017 loosely defined a crowd as a group of instances of the same class that are co-located).
  • Prioritize annotating individual instances of objects over crowds of objects. However, when associates encountered more than a certain number of instances of a specific class in a single image, they were told to label the first such instances individually and then label the remainder as part of a crowd. The exact number of instances to annotate individually changed over the course of the project. This requirement was set to balance budget, time, and quality considerations.
  • Ignore objects that were smaller than 10×10 pixels (some associates deleted Coco-2017 pre-annotations for such small objects whereas others simply ignored them).
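
The three tasks above can be sketched as a simple decision rule. This is only an illustration of the logic, not the tooling the associates used. The per-class limit `max_individual` is a hypothetical parameter, since the actual number changed over the course of the project, and "smaller than 10×10 pixels" is interpreted here as an area below 100 pixels, which is an assumption.

```python
def plan_labels(objects, max_individual, min_area=10 * 10):
    """Split the objects of one image into individually labeled and crowd groups.

    objects: list of (class_name, width, height) tuples in annotation order.
    max_individual: hypothetical per-class limit on individual labels.
    min_area: objects with a smaller pixel area are ignored (assumed rule).
    """
    individual = {}  # class name -> objects labeled one by one
    crowd = {}       # class name -> objects folded into a crowd annotation
    for obj in objects:
        class_name, width, height = obj
        if width * height < min_area:
            continue  # too small to annotate
        labeled = individual.setdefault(class_name, [])
        if len(labeled) < max_individual:
            labeled.append(obj)
        else:
            crowd.setdefault(class_name, []).append(obj)
    return individual, crowd


# Hypothetical image content: five people, one tiny person, one dog.
objects = [("person", 40, 80)] * 5 + [("person", 5, 5), ("dog", 30, 30)]
individual, crowd = plan_labels(objects, max_individual=3)
```

With a limit of three, the first three people are labeled individually, the remaining two go into a person crowd, the 5×5 person is ignored, and the dog is labeled on its own.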

Sama-Coco Installation Instructions for the FiftyOne App

You can load Sama-Coco directly in the FiftyOne app, explore all 123 287 images, and compare them side by side with the original Coco-2017 dataset.

To set up:

  1. Download the FiftyOne app here
  2. You can then load both datasets with the following code:
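
The original snippet is not reproduced here, so the following is a minimal sketch rather than the official instructions. It assumes that FiftyOne is installed (`pip install fiftyone`) and that the installed version's Dataset Zoo provides both datasets under the names `coco-2017` and `sama-coco`. The helper names `load_both` and `open_in_app` are illustrative, and `max_samples` is used only to keep the first download small.

```python
# Dataset Zoo names (assumption: both are available in your FiftyOne version).
ZOO_NAMES = {"coco": "coco-2017", "sama": "sama-coco"}


def load_both(split="validation", max_samples=50):
    """Load a small slice of each dataset from the FiftyOne Dataset Zoo."""
    import fiftyone.zoo as foz  # imported lazily; requires fiftyone to be installed

    return {
        key: foz.load_zoo_dataset(name, split=split, max_samples=max_samples)
        for key, name in ZOO_NAMES.items()
    }


def open_in_app(dataset):
    """Open one loaded dataset in the FiftyOne app and block until it is closed."""
    import fiftyone as fo

    session = fo.launch_app(dataset)
    session.wait()
```

Calling `load_both()` and then `open_in_app(datasets["sama"])` on the returned dictionary opens the Sama-Coco slice; passing `datasets["coco"]` opens the Coco-2017 slice for comparison.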

Please Give Us Your Feedback!

We’d love to hear from you about your experience with Sama-Coco! Please contact sama-coco@sama.com with your feedback. Thanks!