This project covers the research and creation of a voice transcription system for video games. The system transcribes voice data from a Discord channel to text and displays it within the game window, similar to how subtitles work in TV and film. It is an audio accessibility feature designed to improve the social multiplayer gaming experience for deaf and hard-of-hearing players.
Below are the results of the first phase of the study, Research Element 1. This involved taking a five-minute voice clip from a video for each of nine participants, putting these clips through a set of transcription systems (four were planned, three were actually tested), and measuring the results.
The four transcription systems selected were all TaaS (Transcription as a Service) APIs. Several factors ruled out local transcription, primarily the available computational power: a local transcription system could affect the running game and vice versa. As a result, latency was the primary concern, along with reliability; these were combined to give an overall reliability percentage.
A significant advantage of going with TaaS is that these are dedicated transcription services backed by people with years of experience in the field, which in theory should make them more reliable than a local transcription system. They also have large sample sets with which to train the neural networks at the heart of these systems, whereas a local transcription system would require voice training from scratch, putting it at a significant disadvantage.
Only three systems were actually used: Amazon Web Services’ Transcribe, Azure’s Cognitive Services, and rev.ai’s transcription system. (Google Cloud suspended my billing account immediately after I signed up, citing “suspicious activity” before I had even done anything; this was sorted, but not in time.)
I ended up going with Azure because latency was king, even though it had the middle accuracy rating of the three systems tested.
Maths and raw values can only reveal so much: a major deciding factor was that AWS’s observed transcription accuracy for certain critical words was low. For example:
| Base Transcription | AWS Transcription |
|---|---|
| That’s why you are on the hills | you are to me. Yeah, e u was on the hills |
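The study does not say how accuracy was scored, so the following is only a minimal sketch of one common option, word error rate (WER): the number of word-level insertions, deletions, and substitutions needed to turn the transcription into the base text, divided by the length of the base text. The function names `tokenize` and `word_error_rate` are illustrative, not part of any service's API.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    # Normalise curly apostrophes so "That’s" and "That's" compare equal.
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: this is a generic WER, not necessarily the metric the
    study used to rate the three services.
    """
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


base = "That’s why you are on the hills"
aws = "you are to me. Yeah, e u was on the hills"
print(f"WER: {word_error_rate(base, aws):.2f}")
```

Because inserted words count as errors, WER can exceed 1.0, as it does for this row: the AWS output needs 8 word edits against a 7-word base transcription.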
The transcription accuracy of numbers and position markers is critical in order to provide the player with the correct information.
Imagine playing PUBG and one of your friends spots a sniper at “west, 285”, but the transcribed text is “best two eighty”. The “best” part could be inferred as “west”, but the “two eighty” part could be the difference between your team being killed by the sniper and their team being killed by yours.
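One way to make this concern measurable is to score only the critical words rather than the whole sentence. The sketch below is a hypothetical check, not something taken from the study: it treats compass directions and digit strings in the base text as critical and reports what fraction of them survive in the transcription. The `DIRECTIONS` set and both function names are assumptions made for illustration.

```python
import re

# Hypothetical callout vocabulary; a real game would need its own list.
DIRECTIONS = {
    "north", "south", "east", "west",
    "northeast", "northwest", "southeast", "southwest",
}


def critical_terms(text: str) -> list[str]:
    """Return the direction words and digit strings found in the text."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [t for t in tokens if t.isdigit() or t in DIRECTIONS]


def critical_recall(reference: str, hypothesis: str) -> float:
    """Fraction of the reference's critical terms present in the hypothesis.

    Deliberately strict: spelled-out numbers such as "two eighty" are not
    converted to digits, so they never match "285" or "280".
    """
    wanted = critical_terms(reference)
    if not wanted:
        return 1.0  # nothing critical to get wrong
    heard = set(re.findall(r"[a-z]+|\d+", hypothesis.lower()))
    return sum(term in heard for term in wanted) / len(wanted)


print(critical_recall("west, 285", "best two eighty"))
```

For the sniper callout above, neither “west” nor “285” appears in the transcribed text, so the score is 0.0 even though a human reader might guess at the direction.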
In the above table, the other transcription systems correctly transcribed all of the words mentioned. It should be noted that, as Discord uses sound suppression technology, the actual accuracy should in theory be much higher.