Captionizer creates transcriptions of lectures, online meetings, or videos in DOCX format. It can capture sound from any audio input or from the speakers' output (stereo mix). Two speech recognition engines are supported: Google and Azure.
The app is intended to transcribe lectures and other mostly single-speaker content, so such examples were chosen for the tests. The Azure engine performs much better than the Google one, so I use it for the most part. The mistake ratio is defined as: (wrong words + 0.5 * words recognized correctly but in the wrong form) / all words. Punctuation accuracy is not measured, although it's worth noting that only the Azure engine provides punctuation (its accuracy is solid in English and acceptable in Polish).
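The mistake ratio above can be sketched as a small function; the word counts in the example are hypothetical, not taken from any of the tests below:

```python
def mistake_ratio(wrong_words: int, wrong_form_words: int, total_words: int) -> float:
    """Wrong words count fully; words recognized correctly
    but in the wrong grammatical form count half."""
    return (wrong_words + 0.5 * wrong_form_words) / total_words

# Hypothetical counts for a 200-word fragment:
print(mistake_ratio(wrong_words=6, wrong_form_words=4, total_words=200))  # 0.04
```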
AZURE / ENG
- Easy: World politics, Yale University lecture - 11:58-14:28 - see transcription fragment - 0% mistakes - transcribed perfectly, with punctuation that reflects the speaker's intention
- Harder: YouTube vlog, history of Nutella - 1:12-2:42 - see transcription fragment - 1% mistakes - mostly small mistakes; the video's dynamics are not captured in the punctuation, but it's fine overall
AZURE / PL
- Easy: Prof. Jerzy Bralczyk: "Jak mówić, żeby nas słuchano" - 12:56-14:10 - see transcription fragment - 4% mistakes - mostly small, understandable mistakes and one serious one; punctuation is not great, but still better than plain text
- Harder: Travel vlog, Rio Grande, Argentina - 8:46-10:40 - see transcription fragment - 6% mistakes - a non-scripted, very spontaneous video, so this was a really hard task; due to punctuation issues the transcription takes a little more effort to understand, but the accuracy is fairly good
GOOGLE / ENG
- World politics, Yale University lecture - 11:58-12:25 - see transcription fragment - 9% mistakes - perfect for the most part; the high mistake ratio is due to one missing expression at the end (probably classified as mumbling)
GOOGLE / PL
- Prof. Jerzy Bralczyk: "Jak mówić, żeby nas słuchano" - 12:56-13:22 - see transcription fragment - 5% mistakes - surprisingly good; the mistakes are understandable
- Install all dependencies:
```
pip install -r requirements.txt
```
- Enable stereo mix in Windows: right-click the sound icon on the taskbar -> Sounds -> Recording -> Stereo Mix -> Enable. Be aware that changing your default recording device might change the default settings of your online meeting software.
- Get an API key for Azure Cognitive Services (preferred) or the Google Speech API (you can use the default Google key, but it is very limited).
- Run the app:
- Go to Preferences and set the engine, key, and language. Set stereo mix as the input device if you want to transcribe the sounds you hear:
- Press "Record" to start recording:
- Press "Finish recording" to stop:
- Wait until the transcription process finishes:
- The transcription is saved to the specified path:
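As background on the last step, a DOCX file is just a ZIP archive of XML parts. The sketch below writes a minimal single-paragraph document using only the Python standard library; it is an illustration of the format, not necessarily how Captionizer writes its output (the app may well use a library such as python-docx), and the function name and output path are hypothetical:

```python
import zipfile

# The three parts every valid DOCX package needs: content types,
# the package relationships, and the main document body.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:body></w:document>'
)

def save_transcription(text: str, path: str) -> None:
    """Write `text` as a one-paragraph DOCX file at `path`.
    NOTE: real transcription text would need XML escaping first."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES)
        z.writestr("_rels/.rels", RELS)
        z.writestr("word/document.xml", DOCUMENT.format(text=text))

save_transcription("Hello from Captionizer.", "transcription.docx")
```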