Object detection without training using Grounding Dino

Historically, the cost of training computer vision models has been high and often prohibitive, but with Grounding Dino you may be able to skip the training step altogether. Watch this video to see how Grounding Dino, a zero-shot object detector, performs object detection without custom training.


"The things you can do with computer vision are pretty impressive. For example, you can perform classification, object detection, segmentation, and other amazing things that can solve real problems for you. But computer vision systems come with costs that have historically been pretty high. Well, that is, until recently.

So, what are these high costs I’m referring to?

Well, you've got to train the computer vision algorithm to recognize whatever objects you're interested in detecting.

For example, if we wanted to detect compasses in a video stream, we'd need to first collect hundreds or even thousands of images of compasses, then we'd need to annotate or label the images with bounding boxes, and then we'd need to feed the labeled dataset into a training process. As a result, we'd get a weights file that we can then use in our computer vision application.

Now, this training process is very time-consuming and expensive, and the associated costs could be prohibitive.

Ok, but I've got some good news for you and some slightly bad news, but let's start with the good news.

The good news is, there's a new tool you can use, called Grounding Dino, which is a zero-shot object detector.

Ok, so what's a zero-shot object detector?

Well, it's a tool that allows you to perform object detection without needing to perform training.

In other words, you can perform arbitrary object detection in images, without having to perform the expensive training process that I mentioned a moment ago.

Let me show you how this works in a Google Colab notebook, which you can find a link to in the description of this video.

So, the first cell runs the nvidia-smi command, which will verify we've got access to a GPU. If you receive a message that the command isn't found, follow these instructions to change the Colab runtime to include a GPU.

In the next cell, I'm simply storing the current working directory in a variable that will get used in subsequent cells.

Now in this cell, I'm cloning the Grounding Dino repo and I'm installing it with pip install, then I'm installing the supervision package. This cell takes a little bit of time to execute, so I'll fast-forward to the end.

Next, I'm downloading a weights file that will get used by Grounding Dino.

Now in this cell, I'm downloading a few images that we'll perform object detection on.

In the next cell, I'm loading the model, and now we're ready to do some object detection.

So, in this cell, I'm going to load up this image with a compass, and the next question you probably have is, how do we tell Grounding Dino what to look for in this image?

Well, to search for an object, you simply include a text prompt for the object you're attempting to detect, so in this example, we're looking for a compass.

So, I'll go ahead and run this cell, which takes a few seconds, and here's the result. We see this image with a bounding box around the compass, and again I want to reemphasize that we didn't do any training to make this object detection work, which is pretty amazing.
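The cell described above can be sketched roughly like this. This is a hedged sketch, not the notebook's exact code: the config path, weights filename, and threshold values are assumptions based on the Grounding Dino repo's demo defaults, so check the linked notebook for the real values.

```python
# Rough sketch of the detection cell; paths and thresholds are assumptions,
# not necessarily the notebook's exact values.
def detect(image_path, prompt, box_threshold=0.35, text_threshold=0.25):
    # Imports are deferred so this sketch only needs Grounding Dino at call time.
    from groundingdino.util.inference import load_model, load_image, predict, annotate

    # Hypothetical paths -- use wherever you cloned the repo and saved the weights.
    model = load_model(
        "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
        "weights/groundingdino_swint_ogc.pth",
    )
    image_source, image = load_image(image_path)

    # The text prompt is how we tell Grounding Dino what to look for -- no training needed.
    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold,
    )
    # Draw the detected bounding boxes onto the original image.
    return annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
```

So a call like `detect("compass.jpg", "compass")` would return the image with a bounding box drawn around the compass.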

Let's look at the next cell. In this cell, we're loading up an image named air and we're performing object detection for parachutes, and so I'll run this. And check that out, all three parachutes were detected, and again this happened without any training on our part.

Now, here's something interesting you can do with Grounding Dino, I could change the search term to parachute on the left, and let's run this cell again, and cool, it worked, it only put a bounding box around the leftmost parachute.
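You can get a similar "leftmost only" effect yourself by post-processing the detections, since Grounding Dino's `predict` returns boxes as normalized (center-x, center-y, width, height) values. Here's a small sketch of that idea; the box values below are made up for illustration:

```python
# Pick the detection whose center is furthest left.
# Boxes are assumed to be normalized (cx, cy, w, h) tuples,
# matching the format Grounding Dino's predict() returns.
def leftmost_box(boxes):
    if not boxes:
        return None
    return min(boxes, key=lambda box: box[0])  # smallest center-x wins

# Three hypothetical parachute detections, left to right:
boxes = [(0.70, 0.30, 0.1, 0.2), (0.15, 0.40, 0.1, 0.2), (0.45, 0.35, 0.1, 0.2)]
left = leftmost_box(boxes)  # -> the box centered at cx = 0.15
```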

This next cell will detect surfers in the ocean, let's try it out.

Ok, it looks like it identified every surfer in the image, nice.

Ok, I've got one more example to show you, and this example has a combination of a snowboarder and some skiers, and I want to see if it can identify the snowboarder.

So, I'll run this cell, and check out the result. It was able to detect the snowboarder, cool.

Ok, so you might be thinking that Grounding Dino is the new go-to tool for object detection. In other words, you might be thinking you can throw out YOLO, or whatever model you currently use, and start using Grounding Dino.

Well, there's one thing you need to know about Grounding Dino: it's relatively slow. For example, I measured the time it took to infer the snowboarder in this image, and it took over four-tenths of a second on this GPU. Even if we were using a really powerful GPU, like an A100, we'd still only be able to achieve around 7 to 8 frames per second, so as of now, you can't achieve real-time object detection with Grounding Dino.
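To put those numbers in perspective: frame rate is just the reciprocal of per-frame latency, so 0.4 seconds per inference works out to only about 2.5 frames per second. The 0.13-second figure below is an assumption inferred from the 7-to-8-frames-per-second claim, not a measured number:

```python
def fps(latency_seconds):
    # Frames per second is the reciprocal of per-frame inference latency.
    return 1.0 / latency_seconds

print(fps(0.4))   # 0.4 s per frame on this GPU -> 2.5 fps
print(fps(0.13))  # assumed A100-class latency -> roughly 7.7 fps
```

Either way, that's well short of the 24 to 30 frames per second typically needed for real-time video.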

Ok, so Grounding Dino is relatively slow, which is kind of a bummer, and it means you'll likely still need to use a faster model like YOLO, for many scenarios.

But, Grounding Dino can still help you out, even if you're using a model like YOLO.

Ok, so what am I getting at here?

[Show animation w/ Grounding DINO in the workflow]

Well, what if we used Grounding Dino to label or annotate images for the training process? Essentially, we could eliminate the expensive and time-consuming process of manually labeling images for training, which is a big win, and the fact that Grounding Dino is slow likely doesn't matter for the training process.

In other words, as part of the training process, we could collect images that contain the objects we want to perform training on. Then we could feed these images into Grounding Dino, which will generate bounding boxes or labels, and then we could feed these labeled images into the training process.
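A convenient detail for this workflow is that Grounding Dino's `predict` returns boxes as normalized (center-x, center-y, width, height), which is the same layout YOLO label files use. So turning detections into training labels can be as simple as writing one line per box. This is a hedged sketch of that idea; the single-class mapping (class id 0, one prompt per class) is an assumption:

```python
# Sketch: convert Grounding Dino detections into YOLO-format label lines.
# Assumes boxes are normalized (cx, cy, w, h) and one class per prompt (id 0).
def to_yolo_labels(boxes, class_id=0):
    lines = []
    for cx, cy, w, h in boxes:
        # YOLO label line: "<class_id> <cx> <cy> <w> <h>", all normalized to [0, 1].
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return "\n".join(lines)

# Two hypothetical detections -> contents of the image's .txt label file.
labels = to_yolo_labels([(0.5, 0.5, 0.2, 0.3), (0.8, 0.2, 0.1, 0.1)])
```

Running Grounding Dino over your collected images and writing one such `.txt` file per image gives you a labeled dataset without any manual annotation.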

This is a pretty clever way of eliminating the most costly part of performing object detection, the training.

Ok, now this workflow won't work for every object. I mean, if you've got some obscure object, there's a chance that Grounding Dino won't be able to understand what you're looking for, but there are plenty of scenarios where Grounding Dino can work just fine.

Hey, here at Mycelial, we're building development tools for machine learning on the edge. More specifically, we're building the data backbone for edge machine learning applications. If you're curious about what we're building, I'd encourage you to join our mailing list to learn more."