"So imagine that you and I work together, and we're just about to start a new project that uses computer vision. The only problem is we've never worked with computer vision, and so, to get a bit of practice, we're going to work on a simple computer vision project, together.
So what's the project we're going to work on?
Well, we want to use object detection, on surfers.
So for example, given this video of surfers, we'd like the application to draw bounding boxes over each surfer in the water.
So, how challenging is it to do something like this?
Well, I wouldn't say it's simple, but it is pretty easy with the modern tools that are available to us.
So what are the tools we'll be using?
Well, primarily two tools: the ultralytics package, which gives us YOLO, and OpenCV, which we'll use to work with the video.
Ok, let's get started.
The first thing we'll do is create a project directory named surfer_vision and inside that directory, I'll create a virtual environment, by keying in, python, and I'll add the m switch for module, followed by venv, and I'll name the directory venv as well.
Next I'll activate the virtual environment by keying source venv/bin/activate.
Now, there's only one package that we need to install, which is the ultralytics package. This package includes YOLOv8, and it will install OpenCV as a dependency, which we'll use as well.
So, to install our dependency, I'll key in, pip install ultralytics
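Collected in one place, the setup steps so far look like this (a sketch; the activation line is the macOS/Linux syntax, and on some systems the interpreter is invoked as python3):

```shell
# Create the project directory and a virtual environment inside it
mkdir surfer_vision
cd surfer_vision
python -m venv venv        # -m runs the venv module; the environment directory is also named venv

# Activate the virtual environment (macOS/Linux)
source venv/bin/activate

# The only package we install directly; it brings in OpenCV as a dependency
pip install ultralytics
```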
Now I'll open up this directory in visual studio code, and I'll create a main.py file.
Alright, I've got this video named surfers, that I'll put in this directory, and this is the video I'd like to perform the object detection on.
So, to start, let's programmatically read frames from this video, one by one.
To do this, I'll first import cv2, which is the OpenCV library that was installed along with ultralytics.
Next we'll load the video by creating a capture object. So to do this, I'll say capture equals cv2 dot VideoCapture, and I'll pass in the video's file name.
Now, to read an individual frame from the video, I'll call capture.read(), and of course I'll store what's returned from this method, which is a two element tuple, in an is_frame and a frame variable.
Next, I'll show the frame, on our screen, by keying cv2, dot, imshow, then I'll pass in a name for the window, which I'll call video, and then I'll pass in the frame.
Ok, we can run our script as it is, which I'll do, but you'll notice that it immediately ends, and we don't really see anything.
Now, in reality, the script we just wrote, showed the frame momentarily, but then it immediately closed, which isn't very useful.
So, to pause execution after showing the frame, I'll key in cv2, dot, waitKey, and I'll pass in a zero. Ok, so this line will effectively block execution until a key is pressed on the keyboard, which gives us a chance to look at the frame on the screen.
Next, I'll do a little bit of cleanup, at the end of the script, by calling capture, release and lastly calling, cv2, dot, destroyAllWindows.
Ok, let's go run our script, and as you can see here, we're seeing the first frame, of our video, cool.
Alright, so we can read the first frame of the video, but how do we get the other frames in the video?
Well, we just need to put this bit of code here, in a while loop.
So, to do this, I'll say while True, then I'll indent these three lines of code.
Let's run this script again.
Ok, so we're still only seeing the first frame, but if we press a key on the keyboard, you'll notice that we see the next frame, and the next frame, and so on.
All right, but how do we get the video to play, continuously?
Well, we can just change the argument to waitKey from zero to 1, which means wait 1 millisecond for a keypress before continuing.
Now I'll run our script again, and as you can see, the video is playing, and it's playing pretty fast.
So, there's a problem with our script, which is that, there's no way to stop the video. It just keeps playing until it's done, and then we see an error in the terminal.
Alright, to fix this, I'm going to store the value returned by the waitKey function, which by the way, is the key code for the key that's pressed on the keyboard.
Then, I'll say, if the key is the number 27, which is the code for the escape key, then break out of this loop.
Ok, let's give this a try, and now we see the video playing, and if I press escape, the video stops playing, cool.
Alright, there's still one bug in our script, which you can see if I play the video to completion.
So, the problem is, when we get to the end of the video, there are no more frames left to show, but we're still trying to show it, on this line right here, which leads to an error.
Ok, so how do we fix this?
Well, we have this is_frame variable, right here, which we can use to see if the returned frame is a valid frame.
So, to properly handle this scenario, I'll say, if not is_frame, then I'll break out of the loop.
Ok, so, we're able to read each frame of the video, and we're displaying it on the screen, which is great, now let's add in object detection.
So, to handle object detection, we'll use YOLO in our script.
So, I'll say, from ultralytics, import YOLO.
Next, I'll create the yolo object, by setting model, equal to YOLO, and we need to pass in the model we wish to use in YOLO.
Ok, so what do we put here?
Well, if you look at YOLO's readme file, you see a few different models you can use.
So, will these models work for us?
In other words, will one of these models be able to detect surfers?
Well, it might work, because the default models can detect 80 different classes of objects, and two of the objects it can detect are relevant to our project.
So what two objects am I talking about? Well, the default models can detect people and surfboards, so let's try the small sized model to start.
So, to specify the model, I'll pass in the string "yolov8s.pt", where the s stands for small.
Now, you might be wondering, where is this model file, we intend on using?
Well, the way this works is, if this file exists locally in this project directory, then it will use the local model file, but if it doesn't exist, it will go ahead and download it from the repo's assets and then use it.
The next thing we'll do is perform the object detection by calling model, and passing in our frame, then I'll save what's returned by this function in a results variable.
Now, just below this, I'll print out the results, and also for the moment, I'll change this one, back to a zero, just so we can more easily inspect what's printed to the console.
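The detection step can be sketched as a small helper (my own function name; the import happens inside the function so the sketch also loads without the ultralytics package installed):

```python
def detect(frame, model_name="yolov8s.pt"):
    """Run YOLOv8 object detection on a single frame and return the results."""
    from ultralytics import YOLO  # requires: pip install ultralytics
    model = YOLO(model_name)      # uses a local weights file, or downloads it first
    results = model(frame)        # a list of Results objects, one per input image
    return results
```

In the actual script, the model is created once, before the while loop, so the weights aren't reloaded for every frame.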
So, I'll run our script again, but this time things are taking a bit longer, because we're performing object detection now. Ok, now it's downloading the model file, which is to be expected, and now we see the first frame.
Alright, let's go take a closer look at the console.
Ok, so we see that the model detected 4 people and it took about 64 ms to complete, which is a little bit slow, but we'll come back to this in a moment.
You see this boxes attribute? This holds the coordinates of the bounding boxes for the 4 people that were detected. I want to access these bounding box coordinates so that we can draw some boxes onto the frames.
So, how do we get the bounding boxes?
Well, first I want to point out that our results object, which we see printed in the console, is actually a list, so let's do the following.
We'll say, for result in results, then in the loop, to get the bounding boxes, I'll create a variable named bboxes, and I'll set it to result, dot, boxes, dot, xyxy, and then I'll print bboxes, and I'll run our script again.
Ok, so now we see this list of lists here, and each one of these lists represents the x y coordinates for the bounding box, and we can use these to draw boxes on the image frames.
Well, we need to do a couple of other things first though.
We need to convert this tensor to a numpy array, and we need to convert the floating point numbers to integers.
So, to handle this, I'll do the following. I'll call the cpu function, followed by a call to numpy, and then a call to the function astype, and I'll pass in the string int.
Ok, let's run our script again, and now we see a numpy array of integers, cool.
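The conversion chain can be illustrated with plain NumPy, simulating the tensor with a float array (with the real results object the full chain is `result.boxes.xyxy.cpu().numpy().astype(int)`):

```python
import numpy as np

# Simulated bounding box: floating-point xyxy coordinates, like YOLO returns.
# (A real box is a torch tensor, hence the extra .cpu().numpy() step first.)
raw = np.array([[10.7, 20.2, 110.9, 220.5]])

# astype(int) truncates the floats, giving whole-pixel coordinates cv2 can use
bboxes = raw.astype(int)
print(bboxes.tolist())  # [[10, 20, 110, 220]]
```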
Alright, now let's draw these boxes on our image frames.
So, to do this, I'll say, for bbox in bboxes, then I'll destructure the x and y coordinates from the bounding box by keying x, y, x2, y2, equals bbox.
Now I'll draw the rectangle by keying cv2, dot, rectangle, which we'll draw on the frame, and I'll pass in the top left point, followed by the bottom right point, and I'll pass in a green color, with a line thickness of 2.
Ok, let's run our script again, and check that out, we've got bounding boxes around the people it detected, cool.
Now, it didn't detect all the surfers, but we'll circle back to this in a moment.
Let's change the waitKey parameter back to one, so the video plays continuously, and we'll run the script again.
Ok, so it plays, but it's pretty slow, each inference is taking around 60 ms, which gives us a frame rate of around 16 to 17 frames per second.
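The frame-rate figure here is just the reciprocal of the per-frame time, in the right units:

```python
def fps_from_ms(per_frame_ms):
    """Convert a per-frame processing time in milliseconds to frames per second."""
    return 1000.0 / per_frame_ms

print(round(fps_from_ms(60), 1))  # 16.7, roughly the 16-17 fps seen here
```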
So, why is it slow?
Well, the reason it's slow is because YOLO is using my CPU instead of my GPU.
So, the question is, can YOLO use my GPU on my M1 Mac?
There's a way you can check this by using the PyTorch package.
So, I'll import torch, then right here, where the inference happens, I'll say if torch dot backends dot mps dot is available, then we'll pass the device parameter which I'll set to the string, mps.
Otherwise, we'll infer the results, using the cpu, in my case.
Now, if you're not on a mac, and you've got a properly installed nvidia GPU, then the model should default to using the GPU.
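The device check can be sketched as a helper that falls back gracefully (my own function name; the import sits inside the function so the sketch also loads without PyTorch installed):

```python
def pick_device():
    """Pick the best available inference device: Apple's MPS, then CUDA, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"                  # no PyTorch at all: stay on the CPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"                  # Apple Silicon GPU
    if torch.cuda.is_available():
        return "cuda"                 # NVIDIA GPU
    return "cpu"
```

The result can then be passed along with the frame, e.g. model(frame, device=pick_device()).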
Ok, I'll run the script again, and check out how fast it's working now. Inference takes about 20 ms when it's using the GPU, so it's about 3 to 4 times faster, cool.
So, how well is the model we're using working?
Well, I'd say it's decent, but towards the end of the video, when the surfers are off in the distance, it doesn't detect most of the surfers.
So, what can we do to improve the accuracy of detections?
Well, we could use one of the bigger models, but there's a tradeoff. With the bigger models you get improved accuracy, but slower inference.
Let's try using the biggest model, so I'll change the model from yolov8s, for small, to yolov8x, for extra large, and then I'll run the script.
Now we see the extra large model being downloaded.
Ok, so inference is definitely slower, it's taking about 55 ms in total time, which gets us to about 18 frames per second, but it does seem to be more accurate in detecting the surfers.
Alright, so what if we're not satisfied with the accuracy and/or the speed of inference?
Well, another option would be to perform custom training on images or videos of actual surfers, which in theory could result in significantly improved accuracy, with a potentially smaller and faster model.
So, stay tuned to the next video, where you'll learn how to train a model, on actual surfer images, and we'll see if we can achieve better results with the custom-trained model.
Hey, here at Mycelial, we're building development tools for machine learning on the edge. More specifically, we're building the data backbone for edge machine learning applications. If you're curious about what we're building, I'd encourage you to join our mailing list to learn more."