Well, after a bit of source browsing and a look at the theora mailing list: the MMX assembly is written as gcc inline asm, which does not compile with VC. There was talk about rewriting it as external ASM (using a separate assembler like NASM), but nothing's been decided or done as far as I can see. Anyway, from my tests and some numbers reported on the mailing list, using it only gives about a 7% speedup. Perhaps it will make a larger difference once it is more fully optimized, or for larger or higher-quality movies, but I will put this aside for now.
The other things that might help speed things up are:
:arrow: YUV -> RGB pixel shader - if someone could volunteer to try this, I would be happy.

:arrow: Instead of doing a pixel buffer copy in the blitting routine, locking the destination buffer and writing into it directly might lead to less memory (and maybe speed) overhead.
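
For anyone who wants to try either idea, here is a rough CPU-side sketch in C of the per-pixel math a YUV -> RGB shader would have to reproduce. It is not the current blitting code. It assumes video-range BT.601 coefficients (Y in 16..235) and 4:2:0 planes, so check that against the stream's actual colour space and pixel format. All the names (`ycbcr_to_rgb`, `yuv420_to_rgb24`, `dst_pitch`) are made up for illustration. The frame function writes straight into a caller-supplied destination with its own pitch, which is roughly what a locked buffer would hand you, so there is no intermediate copy.

```c
#include <stdint.h>

/* Convert an 8.8 fixed-point value to an 8-bit channel, clamping to 0..255. */
static uint8_t fix8_to_u8(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* One Y'CbCr sample to RGB. ASSUMPTION: video-range BT.601 coefficients. */
void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr, uint8_t *rgb)
{
    int c = 298 * ((int)y - 16);   /* luma, scaled from 16..235 to full range */
    int d = (int)cb - 128;         /* blue-difference chroma, centred on 0 */
    int e = (int)cr - 128;         /* red-difference chroma, centred on 0 */

    rgb[0] = fix8_to_u8(c + 409 * e + 128);            /* R */
    rgb[1] = fix8_to_u8(c - 100 * d - 208 * e + 128);  /* G */
    rgb[2] = fix8_to_u8(c + 516 * d + 128);            /* B */
}

/*
 * Convert a 4:2:0 frame directly into dst (24-bit RGB).
 * dst and dst_pitch stand in for whatever a locked buffer returns;
 * the pitch may be larger than width * 3, and padding is left untouched.
 */
void yuv420_to_rgb24(const uint8_t *y_plane, int y_stride,
                     const uint8_t *cb_plane, const uint8_t *cr_plane,
                     int c_stride,
                     uint8_t *dst, int dst_pitch,
                     int width, int height)
{
    for (int row = 0; row < height; ++row) {
        const uint8_t *y_row  = y_plane  + row * y_stride;
        /* Chroma is subsampled by 2 in both directions. */
        const uint8_t *cb_row = cb_plane + (row / 2) * c_stride;
        const uint8_t *cr_row = cr_plane + (row / 2) * c_stride;
        uint8_t *out = dst + row * dst_pitch;

        for (int col = 0; col < width; ++col)
            ycbcr_to_rgb(y_row[col], cb_row[col / 2], cr_row[col / 2],
                         out + 3 * col);
    }
}
```

A shader version would do the same three multiply-adds per pixel with the Y, Cb and Cr planes bound as textures. Whether that or the lock-and-write approach actually beats the current copy is something that would need measuring.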