About the Client
The client develops a new generation of home surveillance and assistance ecosystems, including smart doorbells, home surveillance systems, and AI-powered voice-activated home assistants.
The client has a range of home surveillance devices and wants to use information from the video and audio channels to improve security in the home and neighborhood by automatically analyzing the incoming data and notifying users of any suspicious activity.
The developed solution processes video and audio feeds from the recording devices. The video processing module detects people in the video feed and determines whether each person is an authorized user. The audio processing module listens for a custom wake phrase (comparable to "Okay Google" or "Hey Siri") to activate the assistant. Audio processing also includes target sound detection: baby cry monitoring, gunshots, glass-shattering sounds, and others. The modules send alerts to notify the user of any disturbance.
The Quantum team created a pipeline that takes video and audio frames from the data streams and processes them. Video processing includes target object detection (people, pets, etc.). Audio processing consists of two separate modules: wake word detection and target sound detection. Wake word detection spots the keyword that activates the user's command. Target sound detection distinguishes sounds of interest in the audio stream (for example, a baby crying, a gunshot, glass shattering) and sends an alert to the system on a positive classification.
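The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the client's actual code: all function names, frame fields, and labels are hypothetical placeholders for the trained TensorFlow models behind each module.

```python
# Hypothetical sketch of the stream-processing pipeline: each frame is
# routed to the video or audio module, and positive detections become alerts.

def detect_objects(video_frame):
    # Placeholder for the video module: returns labels of detected targets.
    return ["person"] if video_frame.get("has_person") else []

def detect_wake_word(audio_frame):
    # Placeholder for wake word spotting on an audio chunk.
    return audio_frame.get("transcript") == "hey assistant"

def detect_target_sound(audio_frame):
    # Placeholder for target sound classification (baby cry, gunshot, ...).
    return audio_frame.get("sound_label") in {"baby_cry", "gunshot", "glass_shatter"}

def process_frame(frame, alerts):
    """Route a frame to the matching module and collect any alerts."""
    if frame["kind"] == "video":
        for label in detect_objects(frame):
            alerts.append(f"video: {label} detected")
    elif frame["kind"] == "audio":
        if detect_wake_word(frame):
            alerts.append("audio: wake word detected")
        if detect_target_sound(frame):
            alerts.append(f"audio: {frame['sound_label']} detected")
    return alerts

# A toy stream of mixed frames.
stream = [
    {"kind": "video", "has_person": True},
    {"kind": "audio", "transcript": "hey assistant"},
    {"kind": "audio", "sound_label": "glass_shatter"},
]
alerts = []
for frame in stream:
    process_frame(frame, alerts)
```

In the real system each placeholder would wrap a model inference call, and alerts would be pushed to the user's mobile app rather than collected in a list.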
The dataset for custom wake word detection was collected through a crowdsourcing platform, Amazon Mechanical Turk, which allowed the model to be trained on a variety of accents and voice timbres. A collection interface was set up, and the audio recordings were filtered and assessed for quality before being used in preprocessing, modeling, and evaluation.
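An automatic quality check of the kind described might look like the NumPy sketch below: reject clips that are too short or near-silent before they enter preprocessing. The sample rate, thresholds, and function name are assumptions for illustration, not the project's actual filtering criteria.

```python
import numpy as np

SAMPLE_RATE = 16000    # assumed recording rate
MIN_DURATION_S = 0.5   # reject clips shorter than half a second
MIN_RMS = 0.01         # reject near-silent recordings

def passes_quality_check(signal, sample_rate=SAMPLE_RATE):
    """Keep a clip only if it is long enough and not near-silent."""
    duration = len(signal) / sample_rate
    rms = float(np.sqrt(np.mean(np.square(signal)))) if len(signal) else 0.0
    return duration >= MIN_DURATION_S and rms >= MIN_RMS

# Example clips: a 1-second tone passes; a silent or truncated clip does not.
t = np.linspace(0, 1, SAMPLE_RATE, endpoint=False)
tone = 0.1 * np.sin(2 * np.pi * 440 * t)
silence = np.zeros(SAMPLE_RATE)
short = tone[:1000]
```

A real pipeline would typically add checks such as clipping detection or signal-to-noise estimation on top of these two.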
The project was developed in Python. OpenCV and TensorFlow were the main libraries in the video processing module, while the audio processing modules used Librosa, SoundFile, TensorFlow, and scikit-learn. Postprocessing relied on Pandas and NumPy.
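The audio models consume time-frequency features rather than raw waveforms. The sketch below shows the general idea in plain NumPy: frame the signal, window each frame, and take a log-magnitude FFT. In practice Librosa's feature extractors serve this role; the frame length, hop size, and function name here are illustrative assumptions.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    """Frame, window, and log-compress the magnitude spectrum of a signal.

    A simplified stand-in for librosa-style feature extraction: with a
    16 kHz signal, frame_len=400 and hop=160 give 25 ms frames every 10 ms.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-10)  # small offset avoids log(0)

# One second of audio at 16 kHz becomes a (frames x frequency bins) matrix,
# which is the kind of 2-D input the detection models are trained on.
signal = np.random.default_rng(0).standard_normal(16000)
features = log_spectrogram(signal)
```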