I love how some people here are talking about speech recognition, speech synthesis, and facial recognition software as if they were experts on it. I have a friend who is blind and uses speech recognition to send emails. He says it works wonders for him: it was very good at picking up the exact words he was using and rarely made mistakes. I haven't talked to him in four years, so the software is probably even better now. And he's Japanese with a heavy accent, and he still said the software he used was so good that it barely made mistakes.
Granted, that software might be more expensive than standard speech recognition software, but the technology is clearly there. Also, there's more to speech recognition than just the software: microphone quality matters too. I don't know about you, but even I have trouble understanding some of my friends on the phone through their shitty mics, so how could you expect a computer to pick it up properly every time? Hell, I sometimes have trouble understanding some of my friends in person, since not everyone speaks equally clearly.
Then in terms of emotion/gesture recognition, at least for the face, all you have to do is set a baseline. Which is why, in some interview Molyneux gave, whoever tried Milo was asked to smile and then frown: that sets a baseline for both expressions. It shouldn't be too hard to store those images and try to match new frames against them later. I don't think the software will see and understand each and every little thing, but it's a start.
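The baseline idea above can be sketched in a few lines. This is just a toy illustration, not how Milo actually works: it assumes the system captures one reference frame per expression during calibration, then labels a new frame by whichever baseline it's closest to. The function names and the mean-absolute-difference comparison are my own choices for the sketch; a real system would crop and align the face first and use far more robust features.

```python
import numpy as np

def calibrate(frames):
    """Store one baseline frame per labeled expression, e.g. 'smile', 'frown'."""
    return {label: frame.astype(float) for label, frame in frames.items()}

def classify(baselines, frame):
    """Return the label whose baseline frame is closest to the new frame
    (lowest mean absolute pixel difference)."""
    frame = frame.astype(float)
    return min(baselines, key=lambda label: np.abs(baselines[label] - frame).mean())

# Toy usage with 2x2 stand-in "frames": calibrate on two expressions,
# then classify a slightly noisy new frame.
smile = np.array([[0.9, 0.1], [0.1, 0.9]])
frown = np.array([[0.1, 0.9], [0.9, 0.1]])
baselines = calibrate({"smile": smile, "frown": frown})
print(classify(baselines, smile + 0.05))  # prints "smile"
```

Obviously nearest-baseline matching like this only handles expressions you calibrated for, which fits what Molyneux described: smile and frown first, everything else later.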
But I'm excited for Milo. The implications of the work done on Milo are staggering. Imagine playing an RPG where you can actually be a person in the world and interact with its people. It'll take a while until the A.I. truly works the way people want it to, but again, it's a start.