Surveillance cameras are the standard tool for homeowners and business operators who want to monitor their premises. But reviewing footage is an arduous task, especially for users with many cameras, and networked camera systems that offer a remote viewing interface are expensive.
We designed an inexpensive, scalable camera system that harnesses computer vision and machine learning techniques to save home and business users time. Our system only records video footage that is likely to be useful (where motion is occurring) and shows the user a summary of the actual video contents, eliminating the need to skip through hours of footage. We offer the following features and advantages:
Inexpensive, commodity hardware based on the Raspberry Pi. The price of our hardware continues to drop, whereas existing “smart” cameras cost $100–200 (two to three times the cost of ours).
Cloud-based storage and processing in Amazon Web Services. Our backend has near-unlimited storage and scales to work with any number of cameras.
Automatic motion detection. Our system only records video that contains motion, so you won’t waste time viewing blank footage.
Face detection and face counting. Want to know how much foot traffic is crossing through your store, or even home, at any point during the day? We can keep count and provide you a summary or detailed look at activity.
Image recognition. What occurred in-frame when the camera detected motion? Did a person walk by, or was it just a housepet? We’ll give you a quick summary so you don’t have to watch every video.
Web interface. Review data collected by our system and watch video footage from your computer or phone, from wherever you are.
Live data streaming. Monitor a camera in real time from your browser. You’ll see all the video frames, regardless of whether any motion is occurring.
Our architecture, illustrated below, consists of three major blocks:
At least one Raspberry Pi with a camera module.
Amazon Web Services (AWS) cloud infrastructure, comprising storage for video files, a database for metadata (e.g., video source, timestamp, and information about contents), and all of our backend video processing.
An interactive website with the user interface.
The Raspberry Pi processes incoming video frames from the camera module and performs motion detection using Python and the OpenCV library. It segments the video into 10-second files, uploads each file that contains motion to a bucket in Amazon S3, and writes metadata (video source, timestamp) in JSON format to Amazon DynamoDB.
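A minimal sketch of the upload step using the boto3 SDK; the bucket name, table name, and field names here are placeholders rather than our exact configuration:

```python
# Sketch of the Raspberry Pi upload step (bucket, table, and field names
# are illustrative placeholders, not the exact ones we use).
import time
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("VideoMetadata")  # placeholder table name

def upload_segment(path, camera_id):
    """Upload one 10-second video segment and record its metadata."""
    timestamp = int(time.time())
    key = f"{camera_id}/{timestamp}.h264"

    # Store the raw video file in S3.
    s3.upload_file(path, "smart-camera-videos", key)

    # Store the metadata record in DynamoDB; the 'processed' flag lets
    # the EC2 workers find segments they have not analyzed yet.
    table.put_item(Item={
        "camera_id": camera_id,
        "timestamp": timestamp,
        "s3_key": key,
        "processed": False,
    })
```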
A Python script running in an Amazon EC2 virtual server queries DynamoDB for unprocessed videos, runs the face detection and counting algorithms on each video, and writes the results back to DynamoDB.
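A sketch of that worker loop against the same placeholder table, with `count_faces` standing in for the Viola-Jones counting step described below:

```python
# Sketch of the face-detection worker on EC2 (table/field names and
# count_faces() are illustrative placeholders).
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("VideoMetadata")
s3 = boto3.client("s3")

def process_pending():
    # Find uploaded segments that have not been analyzed yet.
    pending = table.scan(FilterExpression=Attr("processed").eq(False))["Items"]
    for item in pending:
        local_path = "/tmp/" + item["s3_key"].replace("/", "_")
        s3.download_file("smart-camera-videos", item["s3_key"], local_path)

        faces = count_faces(local_path)  # Viola-Jones step, described below

        # Write the result back and mark the segment as processed.
        table.update_item(
            Key={"camera_id": item["camera_id"], "timestamp": item["timestamp"]},
            UpdateExpression="SET face_count = :f, processed = :p",
            ExpressionAttributeValues={":f": faces, ":p": True},
        )
```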
A second Python script running on a more powerful EC2 instance operates similarly, except it performs image classification on the video. We use a neural net trained using TensorFlow. The script samples frames from the video, obtains classification results from TensorFlow, and writes aggregated results back to DynamoDB.
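A sketch of the sampling-and-aggregation step; `classify_frame` is the TensorFlow call (a sketch of it appears in the image recognition section below), and the sampling stride is an assumption:

```python
# Sketch of frame sampling and label aggregation (classify_frame() and
# the sampling stride are illustrative assumptions).
from collections import Counter
import cv2

def classify_video(path, stride=30):
    """Sample every `stride`-th frame and tally the predicted labels."""
    capture = cv2.VideoCapture(path)
    counts = Counter()
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % stride == 0:
            # Encode the frame as JPEG bytes for the TensorFlow graph.
            jpeg_bytes = cv2.imencode(".jpg", frame)[1].tobytes()
            label, _score = classify_frame(jpeg_bytes)
            counts[label] += 1
        index += 1
    capture.release()
    # e.g. {'person': 4, 'indoor': 2} -> written back to DynamoDB
    return dict(counts)
```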
The user can access all videos and information stored in S3 and DynamoDB through an interactive website (see below). The website allows a user to view video footage and data over a custom date range, as well as watch one of the Raspberry Pi cameras in real time.
Motion detection is not a solved problem, and there are many ways to approach it. Our approach uses two algorithms from the OpenCV library:
Background Subtraction using Mixture of Gaussians assumes that the camera sees the background most of the time and that the foreground enters the field of view only intermittently. Under this assumption, the algorithm estimates the foreground pixel by pixel: over a sliding window, it fits between 3 and 5 Gaussians to each pixel, weighting each Gaussian by the number of data points it accounts for. Pixels that fall into the lightly weighted Gaussians are considered foreground. We flag a frame as containing motion if more than 5% of its pixels are foreground. Each frame is blurred using median blurring before the algorithm is applied. The video below shows this in action.
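A minimal sketch of this per-frame check, using OpenCV's MOG2 mixture-of-Gaussians subtractor (the specific subtractor variant and blur kernel size are illustrative; the 5% threshold is the one described above):

```python
# Sketch of the per-frame motion check. The MOG2 variant of OpenCV's
# mixture-of-Gaussians subtractor is used here for illustration.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2()

def has_motion(frame, threshold=0.05):
    # Median blurring suppresses sensor noise before background subtraction.
    blurred = cv2.medianBlur(frame, 5)
    mask = subtractor.apply(blurred)
    # Foreground pixels are non-zero in the mask; trigger motion when
    # more than 5% of the frame is foreground.
    foreground_ratio = cv2.countNonZero(mask) / float(mask.size)
    return foreground_ratio > threshold
```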
Face Detection is done using an OpenCV implementation of the Viola-Jones algorithm. The algorithm combines the following 4 broad concepts to detect faces:
Haar-like features that capture contrast between rectangular image regions.
Integral images, which let those features be computed in constant time.
AdaBoost, which selects and weights the most discriminative features.
A cascade of classifiers that quickly rejects regions unlikely to contain a face.
Face Detection using Viola-Jones is extremely fast and works in near real time. Faces are detected in each frame of the video or stream, and since the same faces can appear in subsequent frames, histogram similarity between the Regions of Interest (ROIs) of consecutive frames is used to count the number of unique faces per frame.
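The sketch below shows both steps: Haar-cascade (Viola-Jones) detection on a grayscale frame, and a histogram comparison that decides whether a detected ROI is a new face or one already seen in the previous frame. The cascade file, histogram bins, and similarity cutoff are illustrative choices, not necessarily the ones we use:

```python
# Sketch of Viola-Jones face detection plus ROI histogram comparison
# between frames (the 0.9 similarity cutoff is an illustrative assumption).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

def roi_histogram(frame, box):
    x, y, w, h = box
    roi = frame[y:y + h, x:x + w]
    hist = cv2.calcHist([roi], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def is_new_face(hist, previous_hists, cutoff=0.9):
    # A face counts as new only if it is not too similar to any ROI
    # seen in the previous frame.
    return all(cv2.compareHist(hist, prev, cv2.HISTCMP_CORREL) < cutoff
               for prev in previous_hists)
```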
The following video snippet provides a glimpse of how face detection and counting are done in the cloud. Blue rectangles (Regions of Interest, ROIs) appear over the faces once they are detected in a frame.
To classify the images/frames within an uploaded video, we used a technique called transfer learning: we retrained only the final layer of TensorFlow’s Inception_V3 model, which was originally trained on the ImageNet dataset. Modern object recognition models have millions of parameters and can take weeks to train fully; transfer learning retrains the existing weights for new classes while leaving all the others untouched. Currently, the frames are classified into 6 categories: person, indoor, garage, outdoor, dog, and cat. These could be extended or improved by training on a cluster of GPUs with a richer training set. We assembled the training data by hand, using images from open data sources (see below) as well as images generated by our Raspberry Pis, and labeled each one with one of the 6 categories.
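As a sketch, a retrained graph exported by TensorFlow's standard retraining script can be loaded and used to classify one JPEG-encoded frame roughly as follows (the file paths and tensor names follow that script's defaults and the TensorFlow 1.x API; they are assumptions here, not necessarily our exact export):

```python
# Sketch of classifying a single frame with a retrained Inception-v3
# graph (paths and tensor names assume TensorFlow's standard retraining
# example and the TF 1.x API).
import tensorflow as tf

labels = [line.strip() for line in tf.gfile.GFile("retrained_labels.txt")]

with tf.gfile.GFile("retrained_graph.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

def classify_frame(jpeg_bytes):
    """Return (label, score) for one JPEG-encoded frame."""
    with tf.Session() as sess:
        # 'final_result' is the retrained softmax layer added on top of
        # Inception-v3 by the retraining script.
        predictions = sess.run("final_result:0",
                               {"DecodeJpeg/contents:0": jpeg_bytes})[0]
    best = predictions.argmax()
    return labels[best], float(predictions[best])
```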
Even with limited training data of roughly 100 images per category, TensorFlow reported training accuracy as high as 90%. While the results may not be as good as those of a fully trained model, transfer learning is surprisingly effective for applications such as ours.
This feature complemented Viola-Jones nicely, since it is much more tolerant of lighting and of face and body postures: we were no longer restricted to faces to identify a human. However, in its current form it is slow, taking almost half a minute to process a single frame, and is therefore not suited to real-time processing.
Would you like to know more about this project or our team?
Check out our backend code and frontend code or contact us and we’ll be glad to tell you more.