Project 5 - Vehicle Detection and Tracking

The objective of this project is to develop a pipeline that takes a picture of a roadway and identifies the vehicles that are present therein. In order to reduce the occurrence rate of false positives, information is kept between sequential images. Single identifications are generally false positives, as vehicles do not appear and disappear within a single frame of video.

A Note About Setup

In order to run this pipeline, one must have either

1. A pair of subdirectories ('/vehicles' and '/non-vehicles') containing 64x64 pixel color PNG images on which to train
2. Saved numpy datafiles: X_train.npy, X_test.npy, y_train.npy, y_test.npy

If load_data == True the pipeline will take option 2 and load the numpy data. If load_data == False the pipeline will take option 1, and save the numpy files to save time later. Neither of these datasets is small enough to upload to this repository. I used the Vehicle Data and Non-Vehicle Data links provided by Udacity.
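The switch between the two options can be sketched roughly as follows. This is only an illustration: the helper name load_or_build, the build_fn callback and the data_dir argument are assumptions, and only the four file names come from the setup described above.

```python
import os
import numpy as np

# file names from the setup note above (without the .npy extension)
FILES = ("X_train", "X_test", "y_train", "y_test")

def load_or_build(load_data, build_fn, data_dir="."):
    """Return (X_train, X_test, y_train, y_test).

    load_data=True  -> option 2: read the saved .npy files.
    load_data=False -> option 1: call build_fn() (hypothetical; it stands in
                       for feature extraction from the PNG images) and save
                       the four arrays so later runs can skip that step.
    """
    paths = [os.path.join(data_dir, name + ".npy") for name in FILES]
    if load_data:
        return tuple(np.load(path) for path in paths)
    arrays = tuple(build_fn())
    for path, arr in zip(paths, arrays):
        np.save(path, arr)
    return arrays
```

The first run would use load_data = False to build and save the arrays, and later runs can set load_data = True to skip feature extraction.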

Feature Creation

Using the provided image set, there are 8792 images of vehicles (label = 1), and 8968 images of non-vehicles (label = 0).

To create a feature array for each image, I concatenated the following data arrays: HOG vector (8x8 pixels per cell, 2x2 cells per block, 12 orientations), RGB red histogram (32 bins), RGB green histogram (32 bins), RGB blue histogram (32 bins), saturation histogram (32 bins; the code below takes the S channel from an HSV conversion), RGB red spatial data (resized 32x32), RGB green spatial data (resized 32x32), RGB blue spatial data (resized 32x32) and saturation spatial data (resized 32x32). This results in each 64x64 pixel input image having 6576 features (7*7*2*2*12 + 32*4 + 1024*4 = 2352 + 128 + 4096). I applied manual normalization so that each element is scaled to the range [-0.5, 0.5].
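The feature count can be checked with a few lines of arithmetic. The function names here are made up for the check; the parameter values are the ones listed above, and the HOG length assumes blocks that slide one cell at a time (which is what gives the 7*7 term).

```python
def hog_length(img_size=64, pix=8, cells=2, orient=12):
    """Length of the HOG vector for a square image."""
    cells_per_side = img_size // pix               # 8 cells across 64 pixels
    blocks_per_side = cells_per_side - cells + 1   # 7 block positions per side
    return blocks_per_side ** 2 * cells ** 2 * orient

def feature_length(n_bins=32, spatial=32, n_channels=4):
    """HOG + four histograms + four resized channels (R, G, B, S)."""
    hist_len = n_bins * n_channels
    spatial_len = spatial * spatial * n_channels
    return hog_length() + hist_len + spatial_len

print(hog_length(), feature_length())   # 2352 6576
```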

The code below shows how feature extraction is done for a 64x64x3 image.

#---------------------------------#
### Get feature vector function ###
#---------------------------------#

import cv2
import numpy as np
from skimage.feature import hog

def get_feature(img, spatial_size = (32, 32), hist_width = 4, n_bins = 32, orient = 12, pix = 8, cells = 2):
    # take in an RGB image (uint8, e.g. 64x64x3)
    # extract hog data
    # extract color histograms
    # extract spatial color data
    # concatenate and normalize to [-0.5, 0.5]
    # note: hist_width is not used in this function

    # feature maximum value for normalizing (pixel count, 4096 for 64x64)
    feat_max = np.float32(img.shape[0]*img.shape[1])

    # hog data, rescaled so that the largest element equals feat_max
    # (the visualization flag is left at its default, False)
    img_gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    hog_array = hog(img_gray, orientations=orient, pixels_per_cell=(pix, pix),
                    cells_per_block=(cells, cells), feature_vector=True)
    hog_max = hog_array.max()
    if hog_max > 0: # avoid dividing by zero on a featureless image
        hog_array = feat_max*hog_array/hog_max

    # color histograms (each bin count is at most feat_max)
    img_S = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)[:,:,1]
    rhist = np.histogram(img[:,:,0], bins=n_bins, range=(0, 256))
    ghist = np.histogram(img[:,:,1], bins=n_bins, range=(0, 256))
    bhist = np.histogram(img[:,:,2], bins=n_bins, range=(0, 256))
    shist = np.histogram(img_S[:,:], bins=n_bins, range=(0, 256))

    # spatial color data, rescaled from [0, 255] to [0, feat_max]
    small_rgb = cv2.resize(img, (spatial_size[0], spatial_size[1]))
    spatial_rgb = np.array(small_rgb.ravel(), dtype = np.float64)
    spatial_rgb *= feat_max / 255.0
    small_s = cv2.resize(img_S, (spatial_size[0], spatial_size[1]))
    spatial_s = np.array(small_s.ravel(), dtype = np.float64)
    spatial_s *= feat_max / 255.0

    # concatenate, scale to [0, 1], then shift to [-0.5, 0.5]
    feature = np.concatenate((hog_array, rhist[0], ghist[0], bhist[0], shist[0], spatial_rgb, spatial_s))/feat_max
    feature = feature.astype(np.float64)-0.5

    return feature

The image below shows an example image, its transformations, and the resulting feature vector.

Example image data features

HOG Parameter Optimization

To tune the hog() function parameters, I used a straightforward technique. I removed all other features, so that only HOG data was used to train the classifier. I then tuned the parameters in a manual, random-walk type of process: if the test accuracy of the classifier improved, I kept the change; if it worsened, I reversed the change. I repeated this for several iterations until the gains in test accuracy diminished.
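The keep-or-revert process was done by hand, but the same idea can be written as a small loop. The sketch below is not the code used for this project: score_fn is a hypothetical callback that would train a HOG-only classifier and return its test accuracy, and the function name and arguments are made up for illustration.

```python
import random

def tune(params, candidates, score_fn, n_steps=50, seed=0):
    """Greedy keep-or-revert search over a small parameter grid.

    params:     dict of starting values, e.g. {"orient": 9, "pix": 8, "cells": 2}
    candidates: dict mapping each parameter name to the values it may take
    score_fn:   hypothetical callback returning a score for a params dict
    """
    rng = random.Random(seed)
    best = dict(params)
    best_score = score_fn(best)
    for _ in range(n_steps):
        # change one randomly chosen parameter to a randomly chosen value
        name = rng.choice(sorted(candidates))
        trial = dict(best)
        trial[name] = rng.choice(candidates[name])
        score = score_fn(trial)
        if score > best_score:
            # the change helped, so keep it
            best, best_score = trial, score
        # otherwise the change is reverted by simply discarding `trial`
    return best, best_score
```

Because each step only accepts improvements, this kind of search can stall in a local optimum, which matches the diminishing returns described above.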

Training and Validation

I used a linear SVM as my classifier. The image below shows the test accuracy and training accuracy versus the number of training iterations. Since overfitting appears to become an issue after around 20 iterations, I chose 20 iterations for my final fit.

Training and validation error versus number of iterations

I made two errors when it came to creating training and validation datasets. Initially I used train_test_split() to create my training and validation datasets. This resulted in very high training and validation accuracies, but very poor performance on test images. This implied the model was overfit. Reviewing comments from the Confluence message boards, there is a very good point raised on this topic. Essentially, train_test_split() integrates a shuffling step of the data. Since the data is a set of images that are temporally very similar, shuffling these images puts nearly identical samples in the training and validation sets. This results in extreme overfitting.

So, to overcome this issue, I implemented a blunt training/validation data split: keep the images in order and just take the last 20% for my test data. This was a very effective suggestion by both Mikel and Gilad. In this way, very similar images are kept together as either training or testing information, but don't cross over. However, my first implementation of this split resulted in 100% of the validation data being "non-vehicle". Below is my original implementation, second implementation, and final implementation.

# Original data split technique
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    test_size = 0.2,
                                                    random_state = 0)

# Second data split technique (validation data biased to non-vehicle)
X_train = features[0:14208]
X_test = features[14208:17760]
y_train = labels[0:14208]
y_test = labels[14208:17760]

# Final data split technique (training and validation sets evenly biased)
# create an array of feature filenames
X_train_veh_filenames = vehlist[0:int(num_veh_imgs*0.8)]
X_test_veh_filenames = vehlist[int(num_veh_imgs*0.8):num_veh_imgs]
X_train_nveh_filenames = nvehlist[0:int(num_nveh_imgs*0.8)]
X_test_nveh_filenames = nvehlist[int(num_nveh_imgs*0.8):num_nveh_imgs]
X_train_filenames = X_train_veh_filenames + X_train_nveh_filenames
X_test_filenames = X_test_veh_filenames + X_test_nveh_filenames

# create an array of labels (1 = vehicle, 0 = non-vehicle)
y_train = np.append(np.ones(len(X_train_veh_filenames), dtype = np.uint8),
                    np.zeros(len(X_train_nveh_filenames), dtype = np.uint8))
y_test = np.append(np.ones(len(X_test_veh_filenames), dtype = np.uint8),
                   np.zeros(len(X_test_nveh_filenames), dtype = np.uint8))

# allocate the feature arrays
num_train = len(X_train_filenames)
num_test = len(X_test_filenames)
feat_L = 6576 # length of the vector returned by get_feature()
X_train = np.zeros((num_train, feat_L), dtype = np.float64)
X_test = np.zeros((num_test, feat_L), dtype = np.float64)
for idx in range(num_train):
    fname = X_train_filenames[idx]
    img = get_image(fname)
    feat = get_feature(img)
    assert len(feat) == feat_L, 'length of feature is wrong!'
    X_train[idx] = feat

for idx in range(num_test):
    fname = X_test_filenames[idx]
    img = get_image(fname)
    feat = get_feature(img)
    assert len(feat) == feat_L, 'length of feature is wrong!'
    X_test[idx] = feat
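The final split boils down to "keep each class in order, cut each class at 80%, then join". A compact sketch of just that logic, using plain lists and made-up function names, is:

```python
def ordered_split(items, train_frac=0.8):
    """Split a sequence in order: head for training, tail for testing."""
    n_train = int(len(items) * train_frac)
    return items[:n_train], items[n_train:]

def split_by_class(veh, nveh, train_frac=0.8):
    """Split vehicles and non-vehicles separately, then join the halves.

    Returns (train, test, y_train, y_test) with label 1 for vehicles
    and label 0 for non-vehicles.
    """
    veh_train, veh_test = ordered_split(veh, train_frac)
    nveh_train, nveh_test = ordered_split(nveh, train_frac)
    train = veh_train + nveh_train
    test = veh_test + nveh_test
    y_train = [1] * len(veh_train) + [0] * len(nveh_train)
    y_test = [1] * len(veh_test) + [0] * len(nveh_test)
    return train, test, y_train, y_test
```

Splitting each class separately is what keeps both labels present in the test set, which the second technique above failed to do.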

Scanning a Single Frame

In order to create a function to scan a frame of video, I created class BBox(). Within this class are methods to update the box location within an image.

#----------------------------#
### Class for bounding box ###
#----------------------------#

class BBox():
    # Bounding box to search in an image
    def __init__(self, size, stride, origin, stop):
        # coordinates of the upper left corner to start search
        self.origin = origin
        self.origindeep = origin #never lose the original origin
        # width and height
        self.size = size
        # horizontal and vertical stride for scanning
        self.stride = stride
        # coordinates of the upper left corner of the box
        self.start = self.origin
        # y coordinate above which to scan
        self.stop = stop
        # flag to keep track of box within boundary
        self.out_of_bounds = 0

    def new_origin(self, origin):
        # update the origin of the box
        self.origin = origin
        self.start = origin

    def end(self):
        # return the end (lower-right corner) of the box
        x = self.start[0]+self.size[0]
        y = self.start[1]+self.size[1]
        return (x, y)

    def walk_right(self):
        # move the box in the x direction
        x = self.start[0] + self.stride[0]
        y = self.start[1]
        self.start = (x, y)

    def walk_down(self):
        # move the box in the y direction, reset x to the origin column
        x = self.origin[0]
        y = self.start[1] + self.stride[1]
        self.start = (x, y)

    def __str__(self):
        return "Start = %s, End = %s" % (self.start, self.end())

    def reset(self):
        # set the origin back to its original place
        self.origin = self.origindeep
        self.start = self.origin

Then I created a list of BBox() instances that I could use to iterate over a single frame. For all bounding boxes, I began searching at y-coordinate 350. There will not be any vehicles above this point, as it corresponds to the sky in the images. For the smallest bounding box (96x96), I stopped the search at y-coordinate 550. Vehicles below that coordinate are close to the camera and appear much larger than 96x96 pixels. For the larger boxes, I stopped the search at y-coordinate 680 (near the hood).

box1 = BBox(size = (196,196), stride = (32, 32), origin = (0,350), stop = 680)
box2 = BBox(size = (128,128), stride = (24, 24), origin = (0,350), stop = 680)
box3 = BBox(size = (96,96), stride = (16, 16), origin = (0,350), stop = 550)

boxes = [box1, box2, box3]
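To get a feel for how many windows each box size produces, the scan can be sketched as a generator. This is an illustration rather than the scanning code used here: it assumes a frame width of 1280 pixels and assumes that a window is only scanned if it fits entirely above the stop coordinate.

```python
def window_positions(size, stride, origin, stop, img_width=1280):
    """Yield (x1, y1, x2, y2) for each window inside the scan band.

    size, stride and origin are (x, y) tuples, as in BBox above.
    img_width is an assumed frame width.
    """
    y = origin[1]
    while y + size[1] <= stop:
        x = origin[0]
        while x + size[0] <= img_width:
            yield (x, y, x + size[0], y + size[1])
            x += stride[0]
        y += stride[1]

# number of 96x96 windows under the assumptions above
n_small = sum(1 for _ in window_positions((96, 96), (16, 16), (0, 350), 550))
print(n_small)   # 525
```

Every one of these windows needs a feature vector and a classifier call, so the stride has a direct effect on run time.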

The resulting scan areas can be seen in the boxscan.mp4 video below. I chose this frame for my test since it has two cars present.

Scanning a single frame for vehicles

Heat Mapping

To identify and track vehicles within the video, I created class HeatMap(), which creates a black image with bright red spots where many bounding boxes are identified. The cool() method removes a fixed amount of heat after each frame, so only hot spots that persist across multiple frames stay bright. This builds confidence in areas that are identified in many consecutive frames and suppresses temporary false positives.

class HeatMap():
    #blank images
    def __init__(self, image):
        self.img = np.zeros_like(image)
        self.cool_rate = 25
        self.threshold = 0

    def addheat(self, boxlist):
        # heat up (brighten) areas that have bounding boxes 
        for idx in range(len(boxlist.list)):
            x1 = boxlist.list[idx][0]
            y1 = boxlist.list[idx][1]
            x2 = boxlist.list[idx][2]
            y2 = boxlist.list[idx][3]
            for x in range(x1, x2):
                for y in range(y1, y2):
                    # don't allow overflow or bright areas turn
                    # black inadvertently
                    if self.img[y, x, 0] < 245:
                        self.img[y, x, 0] += 10
                    
    def cool(self):
        # cool down areas after each frame to remove false positives
        self.img[self.img < self.cool_rate] = 0
        self.img[self.img >= self.cool_rate] -= self.cool_rate

    def thresh(self):
        # throw out areas that aren't hot (bright) enough
        # note: this zeroes the stored heat map in place and returns it
        img = self.img
        img[img <= self.threshold] = 0

        return img

    def reset(self):
        # clear the heat map, keeping its shape and dtype
        self.img = np.zeros_like(self.img)

Simple thresholding, in conjunction with the cool() method above, is able to resolve the false positives. There are two key parameters that need to be tuned in order to implement robust heat mapping.

cool_rate - how much heat is removed after each frame
threshold - the amount of accumulated heat necessary to positively identify a region as a vehicle
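The interaction between the two parameters can be tried out on a small single-channel array. The sketch below uses standalone functions with made-up names instead of the HeatMap() class, uses array slicing in place of per-pixel loops, and assumes a per-frame order of add heat, read the mask, then cool.

```python
import numpy as np

def add_heat(heat, boxes, step=10, ceiling=245):
    """Brighten every pixel covered by a box, staying below uint8 overflow."""
    for x1, y1, x2, y2 in boxes:
        region = heat[y1:y2, x1:x2]        # a view, so edits land in `heat`
        region[region < ceiling] += step
    return heat

def cool(heat, cool_rate=25):
    """Remove cool_rate units of heat from every pixel, clamping at zero."""
    heat[heat < cool_rate] = 0
    heat[heat >= cool_rate] -= cool_rate
    return heat

def hot_mask(heat, threshold=30):
    """Boolean mask of pixels hot enough to count as a vehicle."""
    return heat > threshold
```

With cool_rate = 25 and 10 units of heat per box, a region covered by three boxes in every frame gains a net 5 units per frame and eventually crosses threshold = 30, while a region hit by a single box in a single frame is back to zero after one cooling step.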

The heatmap_video_out.mp4 video below shows how my heat mapping algorithm performs on the project video, with threshold = 0. It's clear that there are a few false positives but that, in general, the performance is good. A threshold of 30 was used for the final solution.

Heat map results over project video

Final Identification

Even after heat mapping and thresholding, my final video still had some very small blips visible. To overcome these, I added a size check on the final bounding regions: hot spots that are no more than 50 pixels wide or no more than 50 pixels tall are not treated as vehicles.

labels = label(image_thresh)

for box_label in range(labels[1]):
    pixels = (labels[0] == (box_label+1)).nonzero()
    x1 = np.min(np.array(pixels[1]))
    x2 = np.max(np.array(pixels[1]))
    y1 = np.min(np.array(pixels[0]))
    y2 = np.max(np.array(pixels[0]))
    
    # filter out very small bounding regions
    if (y2-y1) > 50 and (x2-x1)> 50:
        cv2.rectangle(image, (x1, y1), (x2, y2), (255,255,0), 6)
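The same size filter can be written as a function that returns the surviving boxes instead of drawing them, which makes it easier to test. This sketch assumes a labelled integer array and a region count like the pair returned by the label() call above; the function name and the min_size argument are made up for illustration.

```python
import numpy as np

def boxes_from_labels(label_img, n_labels, min_size=50):
    """Return (x1, y1, x2, y2) for each labelled region that is large enough.

    label_img: 2D integer array, 0 = background, 1..n_labels = regions
    min_size:  a region is kept only if both its width and its height
               exceed this many pixels
    """
    boxes = []
    for region in range(1, n_labels + 1):
        ys, xs = np.nonzero(label_img == region)
        if ys.size == 0:
            continue                       # label not present in the image
        x1, x2 = int(xs.min()), int(xs.max())
        y1, y2 = int(ys.min()), int(ys.max())
        if (x2 - x1) > min_size and (y2 - y1) > min_size:
            boxes.append((x1, y1, x2, y2))
    return boxes
```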

With this in place, my result is the project_video_out.mp4 video here. One issue that I was not able to resolve is that the white car is seemingly lost as it crosses the brightly colored pavement (around the 20 second mark). This may be due to insufficient training data for the classifier. I was unable to detect the car well in this area while still robustly rejecting false positives. A better classifier (e.g. a CNN) may help here.

Final output video with vehicles found

Reflections

The single biggest shortcoming of this pipeline, in my opinion, is the speed. It takes approximately 50 minutes to process a 50 second video, which is 60 times slower than real time, so it needs to be 60 to 100 times faster. I didn't extract all HOG data upfront, as this suggestion was posted to the lesson after my code design was underway, and the code was not well structured to handle this change. It would be an extensive tear-up to make the code compatible with this approach.

Additional time could be saved by not running cv2 processing functions on full images. Even though the box scan only takes place on the lower region of the camera images, many image transformations still run on the complete image. This could be a simple way to speed up the pipeline.
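A rough sketch of that idea: crop the frame to the scanned rows before any color conversion, then shift the detected boxes back to full-frame coordinates. The function names are made up, and the 350 and 680 limits are the scan limits used earlier.

```python
import numpy as np

def crop_scan_band(img, y_start=350, y_stop=680):
    """Return only the rows that are scanned, plus the row offset."""
    return img[y_start:y_stop], y_start

def to_full_frame(box, y_offset):
    """Map an (x1, y1, x2, y2) box found in the band back to frame coordinates."""
    x1, y1, x2, y2 = box
    return (x1, y1 + y_offset, x2, y2 + y_offset)

# example with an assumed 720x1280 frame
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
band, offset = crop_scan_band(frame)
print(band.shape)   # (330, 1280, 3)
```

For an assumed 720 row frame, the band holds 330 of the 720 rows, so any per-pixel transformation would touch a bit less than half of the pixels.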

In the end, it is not clear to me why these methods (HOG, histogram, SVM, etc.) were introduced after convolutional neural networks. My intuition is that using a CNN as the classifier here would give much better results. Based on experience in other projects, a well-constructed CNN will not identify a blank road or barrier as a vehicle.

There are some additional aspects of vehicle tracking that would be interesting to explore. This pipeline only locates a vehicle in the 2D image plane. What is more interesting to an automated vehicle is the relative position and velocity of these vehicles. For instance, a vehicle that is approaching in the automated vehicle's own lane is much more interesting than a vehicle that is moving away in a different lane.
