It’s amazing what technology can do, and how easily it can do it. Last night, I was playing around with some ideas that eventually got me curious about speech recognition in C#. That led me to this page on Microsoft Docs.
It took all of five minutes to set this project up, copy some code, and start recognizing text…
Microsoft’s code is pretty bare-bones and straightforward, so I won’t dive too much into that. However, I wanted to see what it would take to do something useful with the recognized text.
Before I go much further, I think it’s important to note that Windows 10 has a built-in speech recognition app that is used for the actual recognition in this app. It will need to be set up and trained properly to get useful transcriptions, but I digress.
Oh, and here is the full solution on GitHub, in case you don’t want to read any of this.
Make your words DO something
I didn’t spend a TON of time on this, but I wanted to see if I could make my computer tell me the answer to a simple addition problem. This mostly involved string manipulation and word-to-number conversion written into the SpeechRecognized event handler.
    // Namespaces this handler relies on: System, System.Linq,
    // System.Speech.Recognition, System.Speech.Synthesis,
    // System.Text.RegularExpressions.
    static void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        var text = e.Result.Text;

        // Uncomment to use test text instead of voice input.
        // text = "what is 500,000 plus 200,000";

        if (text.Contains("plus"))
        {
            // Split the sentence into the parts before and after "plus".
            var splitTextByPlus = text.Split(new string[] { "plus" }, StringSplitOptions.RemoveEmptyEntries);
            Console.WriteLine("Text before plus: " + splitTextByPlus[0] + " Text after plus: " + splitTextByPlus[1]);

            // Take the last word before "plus" and the first word after it, minus any commas.
            var splitBySpace1 = splitTextByPlus[0].Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
            var splitBySpace2 = splitTextByPlus[1].Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
            var beforePlus = splitBySpace1[splitBySpace1.Length - 1].Replace(",", "");
            var afterPlus = splitBySpace2[0].Replace(",", "");

            // Try digits first; fall back to the word lookup table.
            long num1;
            long num2;
            var wasLong1 = long.TryParse(beforePlus, out num1);
            var wasLong2 = long.TryParse(afterPlus, out num2);
            if (!wasLong1)
            {
                num1 = Regex.Matches(beforePlus, @"\w+").Cast<Match>()
                    .Select(m => m.Value.ToLowerInvariant())
                    .Where(v => numberTable.ContainsKey(v))
                    .Select(v => numberTable[v]).FirstOrDefault();
            }
            if (!wasLong2)
            {
                num2 = Regex.Matches(afterPlus, @"\w+").Cast<Match>()
                    .Select(m => m.Value.ToLowerInvariant())
                    .Where(v => numberTable.ContainsKey(v))
                    .Select(v => numberTable[v]).FirstOrDefault();
            }
            Console.WriteLine("Number before plus: " + num1 + " Number after plus: " + num2);

            // Add the numbers and speak the result.
            var answer = num1 + num2;
            using (var synthesizer = new SpeechSynthesizer())
            {
                synthesizer.SelectVoiceByHints(VoiceGender.Male, VoiceAge.Senior);
                synthesizer.Speak("The answer to your question is " + answer.ToString());
            }
        }

        Console.WriteLine("Recognized text: " + e.Result.Text);
    }
Let’s break it down
    if (text.Contains("plus"))
    {
        var splitTextByPlus = text.Split(new string[] { "plus" }, StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine("Text before plus: " + splitTextByPlus[0] + " Text after plus: " + splitTextByPlus[1]);

        var splitBySpace1 = splitTextByPlus[0].Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
        var splitBySpace2 = splitTextByPlus[1].Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
        var beforePlus = splitBySpace1[splitBySpace1.Length - 1].Replace(",", "");
        var afterPlus = splitBySpace2[0].Replace(",", "");

First, we only care if the word “plus” was recognized. Otherwise, we aren’t doing an addition problem. Then we get the left and right sides of the string by splitting on “plus”.
Then we need to get the two numbers we want to add together. It’s going to be the word before the plus and the word after the plus. Split the string by spaces, remove empty strings (just in case), and then take the last word in the first string and the first word in the second string.
Lastly, we want to remove any commas that might have been recognized when dealing with large numbers (think 100,000).
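If you want to play with the splitting step on its own, here is a rough sketch of it pulled out into a helper. The names AdditionParser and TryGetOperands are made up for this example and are not in the original solution. It also departs from the handler in one way: it splits on spaces first and looks for “plus” as a whole word, so a word like “surplus” doesn’t count and a sentence ending in “plus” returns false instead of throwing.

```csharp
using System;

public static class AdditionParser
{
    // Hypothetical helper (not part of the original handler): returns the word
    // just before "plus" and the word just after it, with commas stripped.
    public static bool TryGetOperands(string text, out string before, out string after)
    {
        before = string.Empty;
        after = string.Empty;

        // Split into words, dropping empty entries caused by repeated spaces.
        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var plusIndex = Array.FindIndex(
            words, w => w.Equals("plus", StringComparison.OrdinalIgnoreCase));

        // "plus" must be a whole word with at least one word on each side.
        if (plusIndex <= 0 || plusIndex >= words.Length - 1)
        {
            return false;
        }

        // Remove thousands separators such as the comma in "100,000".
        before = words[plusIndex - 1].Replace(",", "");
        after = words[plusIndex + 1].Replace(",", "");
        return true;
    }
}
```

Calling AdditionParser.TryGetOperands("what is 500,000 plus 200,000", out var left, out var right) should leave “500000” in left and “200000” in right.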
Now our text is split up and ready to be added together. Almost.
    long num1;
    long num2;
    var wasLong1 = long.TryParse(beforePlus, out num1);
    var wasLong2 = long.TryParse(afterPlus, out num2);
    if (!wasLong1)
    {
        num1 = Regex.Matches(beforePlus, @"\w+").Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .Where(v => numberTable.ContainsKey(v))
            .Select(v => numberTable[v]).FirstOrDefault();
    }
    if (!wasLong2)
    {
        num2 = Regex.Matches(afterPlus, @"\w+").Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .Where(v => numberTable.ContainsKey(v))
            .Select(v => numberTable[v]).FirstOrDefault();
    }
    Console.WriteLine("Number before plus: " + num1 + " Number after plus: " + num2);
When speech recognition hears “one hundred”, it converts it to “100” but when it hears “eight” or “two”, it doesn’t convert it. So we have to handle that ourselves. The first thing we are going to do is try to convert the given numbers to longs. If that works, we are good. If not, we have to do some fancy stuff.
I wasn’t sure how to do it, but I found this post on Stack Overflow. Basically, we create a dictionary that maps number words to their numeric values.

    private static Dictionary<string, long> numberTable = new Dictionary<string, long>
    {
        {"zero",0},{"one",1},{"two",2},{"three",3},{"four",4},{"five",5},{"six",6},
        {"seven",7},{"eight",8},{"nine",9},{"ten",10},{"eleven",11},{"twelve",12},
        {"thirteen",13},{"fourteen",14},{"fifteen",15},{"sixteen",16},{"seventeen",17},
        {"eighteen",18},{"nineteen",19},{"twenty",20},{"thirty",30},{"forty",40},
        {"fifty",50},{"sixty",60},{"seventy",70},{"eighty",80},{"ninety",90},
        {"hundred",100}
    };
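Since each operand is a single word, the “digits first, then the table” idea can also be sketched as one small helper. This is only an illustration: NumberWords and TryToNumber are hypothetical names, and the table is trimmed to a handful of entries to keep it short. Unlike the handler, which quietly ends up with 0 for a word it doesn’t know, this version reports failure so the caller can decide what to do.

```csharp
using System;
using System.Collections.Generic;

public static class NumberWords
{
    // Trimmed copy of the lookup table, for illustration only.
    // The case-insensitive comparer replaces the ToLowerInvariant() call.
    private static readonly Dictionary<string, long> Table =
        new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase)
        {
            {"zero", 0}, {"one", 1}, {"two", 2}, {"eight", 8}, {"twenty", 20}
        };

    // Hypothetical helper: try to parse digits first, then fall back to the word table.
    // Returns false when the token is neither a number nor a known number word.
    public static bool TryToNumber(string token, out long value)
    {
        var cleaned = token.Replace(",", "");
        return long.TryParse(cleaned, out value) || Table.TryGetValue(cleaned, out value);
    }
}
```

With this, NumberWords.TryToNumber("Eight", out var n) should succeed with n set to 8, while a word like “banana” should return false.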
Wrapping up
Now that we’ve got our numbers, we just need to add them together and tell the computer to speak the answer back to us.
    var answer = num1 + num2;
    using (var synthesizer = new SpeechSynthesizer())
    {
        synthesizer.SelectVoiceByHints(VoiceGender.Male, VoiceAge.Senior);
        synthesizer.Speak("The answer to your question is " + answer.ToString());
    }

There’s really not much to it, and this concept could easily be expanded to other mathematical operations or, even more interestingly, voice-driven script execution. I think the most important aspect of this is making sure your Windows Speech Recognition app is trained and set up properly. Otherwise, you are gonna have a bad time.
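As a rough idea of what expanding to other operations might look like, here is a sketch that maps spoken operator words to functions. None of this is in the original solution: VoiceCalculator, TryEvaluate, and the choice of the words “minus” and “times” are assumptions, and I haven’t checked how the recognizer actually transcribes them. To keep it short, it only handles operands that arrive as digits.

```csharp
using System;
using System.Collections.Generic;

public static class VoiceCalculator
{
    // Spoken operator words mapped to operations. The word choices are
    // assumptions; the recognizer may transcribe them differently.
    private static readonly Dictionary<string, Func<long, long, long>> Operations =
        new Dictionary<string, Func<long, long, long>>(StringComparer.OrdinalIgnoreCase)
        {
            {"plus", (a, b) => a + b},
            {"minus", (a, b) => a - b},
            {"times", (a, b) => a * b}
        };

    // Finds the first operator word with a digit operand on each side and applies it.
    // Word operands ("eight") are not handled in this sketch.
    public static bool TryEvaluate(string text, out long answer)
    {
        answer = 0;
        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Start at 1 and stop one short of the end so both neighbors exist.
        for (var i = 1; i < words.Length - 1; i++)
        {
            Func<long, long, long> operation;
            long left, right;
            if (Operations.TryGetValue(words[i], out operation)
                && long.TryParse(words[i - 1].Replace(",", ""), out left)
                && long.TryParse(words[i + 1].Replace(",", ""), out right))
            {
                answer = operation(left, right);
                return true;
            }
        }

        return false;
    }
}
```

Adding another operation then comes down to adding one more entry to the dictionary, and the answer can be handed to the synthesizer the same way as before.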