kinect-climbing

Uses an Xbox Kinect, Python, NumPy, and OpenCV2 to detect climbers and holds on an indoor rock-climbing wall. You can see a detailed write-up at https://nickcrews.weebly.com/virtuwall.html

Detected Climber

Goal:

I want to create an augmented reality climbing wall, similar to those shown in these videos. The first step of the project was to detect the climber and holds on the wall, and that is what I talk about here. Next year, I want to add the projector and begin building interaction into the system.

Setup:

The Computer Science Department at Colorado College has an XBox Kinect that they allowed me to borrow all semester. It is a first-generation model 1473, which means that it uses an infrared (IR) projector and camera, offset from each other, to create a stereoscopic depth image, much as our two eyes do. This depth image is calculated on the Kinect from the raw IR feed with embedded hardware, and can be requested as an 11-bit 480x640 image. The Kinect can also stream the 480x640 raw feed from the IR camera, and a 3-channel, 8-bit, 480x640 feed from another built-in RGB (visible light) camera.

I chose to use Python because of its ease of use and the computer vision libraries available. The primary libraries I used were NumPy, for fast and easy operations on matrices and images, and OpenCV, a free computer vision library. To interface with the Kinect, I used OpenKinect.org's libfreenect library, which has a nice wrapper for Python that interfaces directly with NumPy. For visualizing image histograms, and other plotting, I used the library Matplotlib.
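For reference, grabbing a depth and RGB frame through libfreenect's Python wrapper looks roughly like this (the sync_* helpers are the wrapper's synchronous convenience API; exact behavior can depend on how libfreenect and its bindings were built):

```python
import freenect

# One depth frame and one RGB frame via libfreenect's synchronous helpers.
depth, _ = freenect.sync_get_depth()  # (480, 640) uint16 array, 11 bits used
rgb, _ = freenect.sync_get_video()    # (480, 640, 3) uint8 RGB array

print(depth.shape, depth.dtype)
print(rgb.shape, rgb.dtype)
```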

I was able to use the Ritt Kellogg Climbing Gym outside of normal operation hours for testing. I used a flat 15x15 foot section of wall, which was covered in a normal distribution of routes.

Recording and Replaying Video:

One of the first orders of business was making it so I didn't have to work in the gym all the time. I wanted an easy way to save and replay video captured from the Kinect in a way that exactly replicated a live feed. To save video, I wrote a VideoSaver class, which takes individual frames and saves them to numbered PNG files in a directory. This method is uglier than saving frames directly to a video, but I had trouble finding a library that could write to a lossless video format. Without a lossless format, the frames would be compressed and distorted, making tests with a saved recording inconsistent with the live feed from the Kinect. PNG compression is lossless, so the saved frames perfectly emulate a live feed. Plus, using individual images made it easier to inspect or edit them manually if I wanted. Also, by numbering individual frames this way, depth and RGB frames could be related to each other.

I also wrote a VideoSource class that could read from either the Kinect or these saved recordings, making them completely indistinguishable, as I wanted. These two classes, along with some other utility I/O methods, make up the inputoutput.py module.
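The real classes live in inputoutput.py; the sketch below just shows the shape of the idea (class and file names here are illustrative, not the repo's exact API):

```python
import glob
import os

import cv2

class VideoSaver:
    """Save frames to numbered PNG files so recordings replay losslessly."""

    def __init__(self, directory, prefix="depth"):
        self.directory = directory
        self.prefix = prefix
        self.count = 0
        os.makedirs(directory, exist_ok=True)

    def save(self, frame):
        # PNG compression is lossless, so replayed frames match the live feed exactly.
        path = os.path.join(self.directory, "%s_%05d.png" % (self.prefix, self.count))
        cv2.imwrite(path, frame)
        self.count += 1

class RecordedSource:
    """Read saved frames back in order, mimicking a live Kinect feed."""

    def __init__(self, directory, prefix="depth"):
        self.paths = sorted(glob.glob(os.path.join(directory, prefix + "_*.png")))
        self.index = 0

    def next_frame(self):
        if self.index >= len(self.paths):
            return None  # recording is finished
        frame = cv2.imread(self.paths[self.index], cv2.IMREAD_UNCHANGED)
        self.index += 1
        return frame
```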

Depth Normalization:

Another problem I had to deal with was converting the depth frames from the Kinect to a format that was useful. The raw NumPy array that libfreenect gives you is a single-channel 16-bit unsigned integer array, of which only the lowest 11 bits are actually used. Normally, pixels have values ranging from 0 to 2^10 - 1 = 1023, with larger numbers representing things that are further away from the Kinect. A value of 2^11 - 1 = 2047 signifies an error, where the object is either too close to or too far from the Kinect to be detected properly, or there is some other problem.

There are two issues with this: many of the methods that I wanted to use from OpenCV only take 8-bit images as parameters, and the errors created ugly splotches and noise that messed up the algorithms used later on. At first I tried rescaling the 10-bit image to 8 bits by dividing by 4, but that threw out a lot of the contrast that I really needed. After looking at histograms of many different videos, it looked like most of the pixels lived in the upper portion of the 0-1023 range. Based on this, I first clipped the pixel values to the range 768-1023, and then subtracted 768 from all of them. This left values in the range 0-255, which is just what we wanted. Since almost all of the pixels are above 768, not much information is lost from clipping. There might be better ways to convert from 10 bits to 8 bits that don't throw away any useful information, and I'm still thinking about exploring this step more.

After converting to 8 bits, I still had to deal with the 'errors' in the image. For this, I used a method called 'inpainting', where the error zones are filled in from their borders by replacing each error pixel with a weighted average of the surrounding pixels. Luckily, OpenCV has a very nice implementation of this. The results were very good, fixing both the 'salt and pepper' noise throughout the image and the large error blotches that randomly appeared, as well as the consistent ones where the IR light scatters off of hair and other troublesome materials.

These two steps of normalization are taken care of in the method pretty_depth() in processing.py.
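A minimal sketch of what those two steps look like, assuming a raw 11-bit depth array where 2047 marks an error pixel (the actual implementation is pretty_depth() in processing.py, which may differ in details):

```python
import cv2
import numpy as np

def pretty_depth(raw_depth):
    """Normalize a raw 11-bit Kinect depth frame into a clean 8-bit image."""
    # Pixels the Kinect couldn't resolve come back as the error value 2047.
    errors = (raw_depth == 2047).astype(np.uint8)

    # Most real depth values sit in the upper part of the 0-1023 range,
    # so clip to 768-1023 and shift down to land in 0-255.
    depth8 = (np.clip(raw_depth, 768, 1023) - 768).astype(np.uint8)

    # Fill the error regions in from their borders with OpenCV's inpainting.
    return cv2.inpaint(depth8, errors, 3, cv2.INPAINT_TELEA)
```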

Climber Detection:

After dealing with these petty implementation problems, I got started on the main point of the project: detecting the climber on the wall.

The first idea was to use the raw IR feed from the Kinect, and have the climber wear reflective bracelets and anklets. The IR light from the Kinect's projector would bounce off the reflectors much better than the surrounding wall and clothes, and so the climber's hands and feet could be located by finding the brightest points in the raw IR image. The benefit of this method is its simplicity, but it has many drawbacks. First, the climber is required to wear these bands, which greatly decreases the usability. Also, this method would only be able to locate the wrists and ankles. It would be ideal to be able to determine the whole outline of the climber's body, to check for collisions when dodging chainsaws in games for instance, or in case you wanted to tap a button with your elbow. Even if hands and feet were the only things we cared about, it would still be hard to estimate them accurately from just wrist and ankle positions. Finally, at longer distances, the reflectors had to be oriented fairly well to be picked up, and the strange poses that climbing often requires made this system too inconsistent.
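For reference, finding the brightest spot in the raw IR image takes only a couple of OpenCV calls; something like this, assuming an 8-bit single-channel IR frame:

```python
import cv2

def brightest_point(ir_frame):
    """Locate the brightest spot (e.g. a reflective band) in an 8-bit IR image."""
    blurred = cv2.GaussianBlur(ir_frame, (11, 11), 0)  # keep lone hot pixels from winning
    _, max_val, _, max_loc = cv2.minMaxLoc(blurred)
    return max_loc, max_val
```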

The second method I tried was background subtraction with the RGB feed. Background subtraction is a technique that uses the assumption that the climber is the only thing moving in the video. It uses a statistical model: store the last N frames (I used 500), and calculate the mean value and the deviation for each pixel over this time series. Then, each time you get a new frame, compare it to this model. If a pixel's value is further than, say, 1 standard deviation from the mean, then it must be a foreground pixel (AKA the climber). Otherwise, it's background. Add this new frame to the model, throw out the oldest one, and repeat.
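Sketched in NumPy, the model looks something like this (a naive version that recomputes the statistics every frame; OpenCV's built-in subtractors do essentially this incrementally and much faster):

```python
from collections import deque

import numpy as np

class MeanStdBackgroundSubtractor:
    """Per-pixel mean/standard-deviation model over the last N frames."""

    def __init__(self, history=500, n_sigmas=1.0):
        self.frames = deque(maxlen=history)
        self.n_sigmas = n_sigmas

    def apply(self, frame):
        frame = frame.astype(np.float32)
        if len(self.frames) < 2:
            # Not enough history yet to build a model; call everything background.
            self.frames.append(frame)
            return np.zeros(frame.shape[:2], dtype=np.uint8)
        stack = np.stack(self.frames)
        mean = stack.mean(axis=0)
        # Small epsilon keeps the threshold from collapsing on perfectly static pixels.
        std = stack.std(axis=0) + 1e-6
        # Pixels far from their historical mean are foreground (the climber).
        foreground = np.abs(frame - mean) > self.n_sigmas * std
        if foreground.ndim == 3:
            foreground = foreground.any(axis=2)  # any channel deviating counts
        self.frames.append(frame)
        return foreground.astype(np.uint8) * 255
```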

In certain situations, such as video surveillance software that only records when there's movement, this works really well. However, in this context, where we need to find the outline of a climber very precisely, the shadows created by the bright point lights in the gym mess things up. Modifying the lights in the gym was out of the question, and I wanted the system to be able to adapt to the lighting conditions of any gym, so without some sort of background subtraction that's robust to shadows, this wasn't going to work.

An obvious solution to this problem is to use the depth capabilities of the Kinect with a background subtractor. There are no shadows to worry about in a depth image! However, one problem is that the climber is close to the wall, so the contrast between the climber and the background, especially around the hands and feet, is small. To improve contrast, I used a technique called Contrast Limited Adaptive Histogram Equalization (CLAHE). Like normal histogram equalization, CLAHE looks at the histogram of pixel values and shifts them so that intensities are spread out evenly. It is Adaptive because it performs this operation on small tiles of the image rather than the whole frame, increasing local contrast without causing global changes, and it is Contrast Limited because it clips the histogram so that noise doesn't get amplified too much. This is a built-in algorithm in OpenCV, and it worked wonderfully. After applying CLAHE, the normal background subtractor does an excellent job finding the torso and upper legs of the climber, and detects the extremities around 50% of the time.
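Put together, the depth pipeline looks roughly like this, with OpenCV's built-in CLAHE and its MOG2 background subtractor standing in for the model described above (the parameters are illustrative, not the values from the repo):

```python
import cv2

# CLAHE boosts local contrast so the climber stands out from the nearby wall.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

# MOG2 plays the role of the per-pixel statistical model sketched earlier.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

def climber_mask(depth8):
    """Return a foreground mask of the climber from a normalized 8-bit depth frame."""
    equalized = clahe.apply(depth8)  # CLAHE expects a single-channel 8-bit image
    mask = subtractor.apply(equalized)
    # Morphological opening cleans up leftover speckle in the mask.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```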

I packaged these algorithms into a ClimberDetector class, which is part of the processing.py module.

Hold Detection:

One important piece of any sort of augmented reality climbing wall is the ability to know where the holds are on the wall. One option for this, that other teams have used, is to keep the Kinect totally immobile, with a constant arrangement of holds on the wall, and then manually mark all of the holds. I wanted the system to be more versatile and user friendly than this. At worst, the user should only have to confirm the computer's detection and possibly correct a few of its mistakes.

Compared to many other computer vision object-recognition tasks, finding the brightly colored, fairly predictably sized plastic holds on the homogeneously colored background wall should be a relatively simple job. The first thing I wanted to do was remove a lot of the complexity of the RGB input images and get rid of texture and noise. A good tool for this job was OpenCV's pyrMeanShiftFiltering(), which performs a mean shift clustering algorithm on the image, resulting in a sort of 'cartoon' feel. Areas of similar color and texture are all filled with the same value, resulting in nice uniform blobs with sharp boundaries. This result is perfect for OpenCV's Canny() edge detection algorithm, which finds the gradient of color change throughout the image and marks the maxima above some threshold as edges.
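Roughly, the two calls chained together look like this (the spatial/color radii and Canny thresholds are placeholders to tune, not the values I actually settled on):

```python
import cv2

def hold_edges(bgr_frame):
    """Flatten the wall into uniform color blobs, then mark their edges."""
    # Mean shift filtering smooths texture into 'cartoon' regions.
    # sp = spatial window radius, sr = color window radius (both to tune).
    smoothed = cv2.pyrMeanShiftFiltering(bgr_frame, sp=21, sr=40)
    gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
    # Canny marks strong intensity gradients above the thresholds as edges.
    return cv2.Canny(gray, 50, 150)
```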

The result is pretty good. There are only occasional false positives, and only the most similarly colored holds are missed. The sharp corners of the actual wall are also picked up, but they can easily be filtered out as long skinny lines. Although the mean shift filtering algorithm is too slow to run in real time, that's okay because we can just detect the holds during a calibration period, before the climber ever gets on the wall. I still want to do a lot more work on perfecting hold detection, but this is a pretty good start.

Converting Between Depth and RGB Frames:

The RGB and depth cameras on the Kinect are not aligned with each other and have different FOVs. We need to be able to translate between the two different frames, so that we can do things such as relate the location of a hold detected in the RGB frame to the climber's hand detected in the depth frame. To do this we use OpenCV's getPerspectiveTransform() and warpPerspective() methods. First, we take a picture with the Kinect of a scene that has 4 landmarks distinguishable in both the depth and RGB frames. I used yardsticks stuck up in the air. Then, we find the coordinates of each of these points in both frames, and feed these two sets of 4 points into getPerspectiveTransform(). This returns a 3x3 perspective transformation (homography) matrix that encodes how to rotate, scale, shear, and warp one image to get it to line up with the other. So, to convert an image from the depth frame to the RGB frame or vice versa, we call warpPerspective() with this transformation matrix and the source image. This is very useful for overlaying the mask of the climber, determined from the depth frame, on top of the RGB image.
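In code, the alignment looks roughly like this, with placeholder coordinates standing in for the landmark positions found by hand:

```python
import cv2
import numpy as np

# Pixel coordinates of the same 4 landmarks located by hand in each frame.
# (These numbers are placeholders, not measured values.)
depth_points = np.float32([[100, 80], [540, 90], [530, 400], [110, 410]])
rgb_points = np.float32([[120, 70], [560, 85], [555, 395], [125, 405]])

# 3x3 homography mapping depth-frame coordinates onto the RGB frame.
depth_to_rgb = cv2.getPerspectiveTransform(depth_points, rgb_points)

def depth_mask_in_rgb(mask, rgb_shape):
    """Warp a mask computed in the depth frame so it overlays the RGB image."""
    height, width = rgb_shape[:2]
    return cv2.warpPerspective(mask, depth_to_rgb, (width, height))
```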
