
VOICE SENTIMENT ANALYZER

Voice Sentiment Analyzer is a model built to analyze real-time or recorded audio for service-based companies, giving them insights into customer feedback and queries, along with the sentiment of each call, without any manual review. This helps organisations improve their service-delivery systems efficiently and effectively.

CONTENTS

  • ARCHITECTURE
  • RESEARCH
  • USAGE
  • FURTHER ENHANCEMENT

ARCHITECTURE

The image above shows the Voice Analysis Pipeline. Let's walk through each phase of its working:
  1. Data Collection: Using a Flask API, we collect audio as a real-time recording through the microphone, as a file uploaded to the server, or by connecting an institutional or organisational database (a minimal endpoint sketch follows this list).
  2. Preprocessing: Preprocessing is required when audio is uploaded or a database is connected. It includes Speaker Diarization, which uses clustering mechanisms to separate the different speakers' voices into different audio files. Diarization is followed by Voice Activity Detection, which splits speech and silence into separate audio chunks so that the analysis runs faster.
  3. Model Building: The model-building process makes our product unique among its competitors. Our model comprises two models:

    • Speech to Text and Analysis: Incoming speech segments are converted into text using the Google Speech-to-Text API. The sentiment of the text is then obtained from a pretrained model and classified as Positive, Neutral, or Negative.

    • Speech Emotion Analyzer: Using a pretrained emotion analyzer, we classify the emotion of the audio broadly into three categories: Happy, Neutral, and Angry.

  4. Result Generation: The results from the two models are combined into one final result and visualized as various graphs, as shown below (a hedged combination sketch follows the visualization).
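To make the data-collection phase concrete, here is a minimal sketch of a Flask upload endpoint. The route, the field names ("audio", "speakers"), and the upload directory are illustrative assumptions, not the repository's actual API.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    uploaded = request.files["audio"]                  # uploaded recording
    n_speakers = int(request.form.get("speakers", 0))  # 0 = auto-detect later
    uploaded.save("uploads/" + uploaded.filename)      # "uploads/" must exist
    # Diarization, VAD, and the two models would run from here.
    return jsonify({"status": "processing", "speakers": n_speakers})

if __name__ == "__main__":
    app.run(debug=True)
```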

Result Visualization
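The exact rule for merging the two model outputs is not spelled out above, so the following is only a hedged sketch of one plausible combination using the labels listed earlier; the repository may weight the two signals differently.

```python
def combine(text_sentiment: str, voice_emotion: str) -> str:
    """Merge text sentiment (Positive/Neutral/Negative) with voice
    emotion (Happy/Neutral/Angry) into one final label."""
    # Negative signals dominate: an angry tone often carries the real
    # sentiment of a call even when the transcript reads as neutral.
    if voice_emotion == "Angry" or text_sentiment == "Negative":
        return "Negative"
    if voice_emotion == "Happy" or text_sentiment == "Positive":
        return "Positive"
    return "Neutral"

print(combine("Neutral", "Angry"))  # -> Negative
```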

RESEARCH

This section discusses how we arrived at our final choice of models and pipeline, along with the problems we faced and their solutions.
  1. Speaker Diarization: Speaker Diarization is the process of applying clustering techniques to the features of the audio to separate the speakers present, so that we get a detailed per-speaker sentiment in addition to the overall audio sentiment.

    Our first approach was UIS-RNN (Unbounded Interleaved-State Recurrent Neural Network), Google's supervised diarization algorithm, combined with VGG-16 voice feature extraction, but it had lower accuracy and suffered from an overlapping-speech problem.

    Currently we use pyAudioAnalysis, which applies K-Means and SVMs to segregate speaker voices from the audio file, as sketched below; the number of speakers may be defined by the user, or the elbow method is used to determine an appropriate number of clusters.

    As a further enhancement, we will use RPNSD (Region Proposal Network Speaker Diarization), which is more accurate and resolves the overlap problem.
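A minimal sketch of the pyAudioAnalysis step described above; the file name is illustrative, and the exact function name and return shape vary between library versions (older releases call it speakerDiarization).

```python
from pyAudioAnalysis import audioSegmentation as aS

# n_speakers=0 asks the library to estimate the cluster count itself;
# the result is an array of per-window speaker labels.
labels = aS.speaker_diarization("call.wav", n_speakers=0, plot_res=False)
print(labels)
```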

  2. Voice Activity Detection: This technique classifies the different parts of an audio file as either speech or silence. Removing the silence splits the audio into separate speech chunks, which helps the model work efficiently.

    We use the WebRTC Voice Activity Detection algorithm to classify the segments of audio into speech and silence, as in the sketch below.
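For example, with the py-webrtcvad bindings (the file name and the 30 ms frame size here are illustrative choices):

```python
import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness: 0 (least strict) to 3 (most strict)
with wave.open("speaker0.wav", "rb") as wf:  # must be 16-bit mono PCM
    rate = wf.getframerate()      # 8000, 16000, 32000 or 48000 Hz
    samples = int(rate * 0.03)    # 30 ms frames (10/20/30 ms are allowed)
    while True:
        frame = wf.readframes(samples)
        if len(frame) < samples * 2:   # 2 bytes per 16-bit sample
            break
        print("speech" if vad.is_speech(frame, rate) else "silence")
```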

  3. Speech Recognition: This is the process through which we convert speech into text in specific languages; currently we convert to Indian English and US English.

    Our first approach was Mozilla's open-source DeepSpeech recognition model, trained on American English with a WER (Word Error Rate) of 5.83%. However, its memory efficiency is low: the model file is about 1.2 GB, which exhausts the memory available on cloud services.

    Our final approach is the Google Speech-to-Text API, which is trained on and serves a vast group of languages with a WER of about 3.44%. It is highly efficient in terms of both size and performance; a usage sketch follows.
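A minimal sketch using the SpeechRecognition package's recognize_google wrapper, one common way to call Google's speech API from Python; whether the repository uses this wrapper or the Cloud client directly is an assumption, and the file name is illustrative.

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("speech_chunk.wav") as source:
    audio = r.record(source)        # read the whole chunk into memory

try:
    # "en-IN" = Indian English; pass "en-US" for US English
    text = r.recognize_google(audio, language="en-IN")
except sr.UnknownValueError:        # API could not understand the audio
    text = ""
print(text)
```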

  4. Emotion Recognition: We add this to the model to increase accuracy, because the tone, together with the text, decides the overall sentiment. With the speech recognition part done, we move on to emotion recognition.

    We trained our own model on various publicly available datasets and achieved an accuracy of 88.14%, classifying audio files into 3 categories: Angry, Neutral, and Happy (a hedged inference sketch follows).
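The following is only a hedged inference sketch, assuming a Keras classifier over averaged MFCC features; the model file name, the 40-coefficient input, and the label order are illustrative assumptions, not the repository's actual artifacts.

```python
import librosa
import numpy as np
from tensorflow.keras.models import load_model

LABELS = ["Angry", "Neutral", "Happy"]   # the three classes described above

# Average 40 MFCC coefficients over the clip as a fixed-size feature vector.
y, sr = librosa.load("speech_chunk.wav", sr=22050)
features = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)

model = load_model("emotion_model.h5")        # hypothetical trained model file
probs = model.predict(features.reshape(1, -1))[0]
print(LABELS[int(np.argmax(probs))])
```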

  5. Text Sentiment Analysis: Text sentiment analysis determines whether a text has a Positive, Negative, or Neutral impact.

    Our first approach was VaderSentiment, which uses a dictionary-based approach: put simply, it holds a large lexicon of words and decides the sentiment of a text on that basis.

    We then moved to a more advanced embedding-based approach, in which the meaning of a word changes with the context it is used in, i.e. it depends on the neighbouring words. For this we use embeddings: vectors in an N-dimensional space where the similarity between words is derived from cosine similarity or Euclidean distance. Models based on this approach include Flair and fastText (by Facebook).

    Such systems are often organisation-specific: some organisations consider certain words negative, for example "Fraud" in banking. So we need a combination of both the dictionary-based and embedding-based approaches (a VADER usage example follows).
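As a concrete example of the dictionary-based approach, here is standard vaderSentiment usage with the conventional +/-0.05 compound-score thresholds; the sample sentence is illustrative.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The agent resolved my issue very quickly!")

# "compound" is a normalized score in [-1, 1]; +/-0.05 are the
# cut-offs recommended by the VADER authors.
if scores["compound"] >= 0.05:
    label = "Positive"
elif scores["compound"] <= -0.05:
    label = "Negative"
else:
    label = "Neutral"
print(label, scores)
```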

USAGE

Recording

1. Press the Record button.



2. It will take you to another webpage, which asks for microphone access in your browser, starts recording, and shows a live visualization graph.



3. After you have finished recording, press the Stop button. It will take some time to buffer your audio and then show a static graph, which is automatically saved to your history.

Uploading a File

  1. Click the Upload button on the main page.



  2. Upload the file from your device using the Browse button, enter the number of speakers present in the audio (optional), and then click Upload.



  3. After a successful upload, a message and an Analyze button will pop up. Click the button.



  4. After clicking the Analyze button, it will take you to the next window, which shows a message that the audio is being processed in the backend.



  5. After successful processing, a message pops up with further instructions: use the visualizations, type the speaker number accordingly, and click the Show button.



FURTHER ENHANCEMENT

  • Fine-tuning of the models
  • Connecting organisations' databases
  • A more generic application
  • More features: query extraction, voice verification
