Saturday, August 1, 2009

Julius Speech Recognition and Tools

I have not worked on Speech Recognition engines for very long, only around 3-4 months. I had to do some research on which ASR would best suit the idea we had in mind, and Julius topped our list. Sphinx 4 is exciting, as it is in Java, and I expect it to mature even more as time goes on.
So, as I mentioned in my previous post, from the Julius home page you won't get many details other than the source code and a handbook. To try out Julius immediately, download the Julius Quick Start Demo from VoxForge.

Julius is written in C and compiled using GCC. So if you are using Windows, you will need Cygwin to compile, build, and run the Julius source.

Julius provides a set of tools that are pretty useful in building your SR system. These executables are found in the 'bin' folder. I am going to cover three of those handy tools, which we will use most.

  • 'Julius' - The main recognizer module that does all the recognition. Julius needs a language model and an acoustic model to run as a speech recognizer: an HMM acoustic model plus a language model (word N-gram, grammar, or isolated word). Input can be a WAV or MFC file, a direct microphone, or voice data over the network. Note that for waveform file input, only WAV (no compression) and RAW (mono, 16-bit, big-endian) are supported by default. There are also options to pass a file listing the input files to be recognized. Dozens of command-line options are available; going through the Julius manual will give a better idea. It is always better to put these options in a configuration file and pass that file to Julius.
  • 'adinrec' - This tool helps you record voice in a Julius-acceptable audio format: 16-bit, 1-channel WAV. Here too, as with Julius, you can set the sampling frequency, recording even at 48 kHz. The tool records every utterance as a single file.
  • 'adintool' - This tool is similar to adinrec but with more options. Most of the Julius options can be set, but since it is just an audio tool, unrelated options are skipped without any error. One interesting option is 'adinnet', which lets you run Julius in 'server mode': you specify the port number Julius listens on and a server name using the -server option, so Julius receives data directly from adintool for recognition. Say the Julius recognizer runs on the server side and your SR program runs on the client side; this option can indeed let you do real-time recognition.
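To make the server-mode setup concrete, here is a minimal configuration sketch. The model file names (hmmdefs, tiedlist, sample.dfa/sample.dict) and the port number are illustrative, not from any real setup; the option names are from the Julius and adintool manuals:

```
# sample.jconf -- a hypothetical Julius configuration file
-h hmmdefs            # HMM acoustic model definitions
-hlist tiedlist       # HMM list file
-gram sample          # grammar prefix (reads sample.dfa and sample.dict)
-input adinnet        # receive audio over the network from adintool
-adport 5530          # port Julius listens on

# Start the recognizer:      julius -C sample.jconf
# Stream microphone audio:   adintool -in mic -out adinnet -server localhost -port 5530
```

With this wiring, adintool detects utterance segments on the client side and ships them to Julius, which recognizes each segment as it arrives.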

In my next post we will see a small example of how to run Julius in server mode. Explore Julius till then!


Wednesday, June 17, 2009

A Dive into Japanese LVCSR Engine

In my previous post, I mentioned switching to Julius from Sphinx. I had no idea what I had in store. Googling for 'Speech Recognizer' gives you eight out of ten results for CMU Sphinx; I found Julius through a wiki link. I only wanted an open-source speech recognizer, and Julius seemed like the best option to dig into and see if I could make use of it.

I must say using Julius was not easy, but the end results achieved with it were great.
Let me highlight some initial problems you will face when you go for Julius.
  • Julius is an open-source Japanese speech recognizer, developed as a Japanese LVCSR engine since 1997. The project has home pages in both English and Japanese.
  • The site has user documentation, originally written in Japanese; an English version is still under development. But do not worry, Google Translate comes to our rescue. Here is the translated English version of the Julius Book.
  • Being a Japanese recognizer, it originally shipped with an acoustic model only for the Japanese language. But good people are present all over the web: the VoxForge project is working on the creation of an open-source acoustic model for the English language.
  • If you go to the Julius home site, you might get lost after downloading the source code or binaries and reading bits and pieces of info. I suggest you start by downloading the Julius Quick Start from VoxForge. It is based on version 3.5.2 of Julius, but porting to the latest version is as easy as copying the acoustic model and grammar files.
  • The Julius forum was also a painful experience for me. It has English and Japanese topics, so again, use Google Translate. I don't think what the Japanese users ask is reflected in the English forum.

The points above will definitely get you started with Julius, especially the Quick Start from VoxForge. Check out the VoxForge forums too; they hold useful information, though meagre for a novice.

Saturday, May 30, 2009

Switching Sphinx to Julius

Ah.. It's been a long time since I updated this blog. I was really busy trying to use Julius.
Yes, right. I had to ditch Sphinx4 and move to Julius.

Following are the reasons that made me shift from Sphinx to Julius:
  • First and foremost, poor recognition. I really could not get even 80% accuracy from this ASR. One of my American friends and I tested it many times, still with no success.
  • Sphinx4 is based on Java, hence it is 'obviously' slow and hogs a lot of memory.
  • It doesn't recognize words properly. Digits, however, are recognized pretty accurately.
  • No backward compatibility: Sphinx4 is re-written in Java, whereas all previous versions are written in C/C++.
Reasons I will miss Sphinx:
  • Good documentation
  • Great helper demo examples
  • Active Forums and help by their developers
  • I feel comfortable using Java, so working in Eclipse with Sphinx4 made learning about it easy.
From the next article onwards I will switch to Julius. I really couldn't get the hang of Sphinx to make it work for my task. Sometime later, I shall work on this again and find out where I went wrong, or whether Sphinx has indeed become a better recognizer :)

Thursday, May 7, 2009

Creating Your own Demo using Sphinx4

Sphinx4 provides a good number of demos, which I used in my program. I actually had to write an application that records user speech on the client side and sends it as a WAV file to the server. On the server side I had to recognize this WAV file and return the result, with a confidence score attached indicating how well the speech was recognized.
Sounds pretty simple?

I decided to use a Java applet like the one on VoxForge: display a list of sentences and ask the user to record their voice. I was partly successful. I developed an applet that used the Java Sound APIs for recording and playback. I ran into certain security issues, as applets are not supposed to save files locally on the client machine or access the file system. After some digging, I got around this issue by signing my applet JAR using jarsigner. So my front end was ready: this applet sends the WAV file to the server.
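To illustrate the recording side, here is a minimal Java sketch (the class name is my own) of the audio format such an applet would capture in: 16 kHz, 16-bit, mono, signed PCM, the kind of format the recognizer side expects for WAV input:

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

// Sketch of the capture format used by the recording applet.
public class SpeechFormat {
    public static AudioFormat pcm16kMono() {
        // sampleRate, bitsPerSample, channels, signed, bigEndian
        return new AudioFormat(16000f, 16, 1, true, false);
    }

    public static void main(String[] args) {
        AudioFormat fmt = pcm16kMono();
        // On the client, a TargetDataLine opened with this format feeds the
        // recorder; AudioSystem.write(...) can then save the stream as WAV.
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, fmt);
        System.out.println(fmt + " -> " + info);
    }
}
```

Opening the actual `TargetDataLine` is where the unsigned-applet security restrictions bite, which is why the JAR had to be signed.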

Next, server-side planning. For the demo I used sockets to receive the input and send out results. Sphinx4 has a sample program that shows how to pass an input audio file to Sphinx for recognition. That's it, my task was done. Later on I created a new program based on that demo to recognize more words, and used my own language model for the task. This was my first application using Sphinx. I wished to let users download the application and test it, but one problem with Sphinx4 is that, being Java-based, the acoustic model and dictionary make the program too heavy for me to upload.
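The socket transport in that demo can be sketched as a simple length-prefixed exchange. The class and method names here are hypothetical, but the pattern matches what is described above: the client writes the WAV's length and then its bytes, and the server reads them back before handing the audio to the recognizer.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for the demo's client/server WAV exchange.
public class WavTransfer {
    // Client side: write the WAV length, then the raw bytes.
    public static void send(OutputStream out, byte[] wav) throws IOException {
        DataOutputStream d = new DataOutputStream(out);
        d.writeInt(wav.length);
        d.write(wav);
        d.flush();
    }

    // Server side: read the length, then exactly that many bytes.
    public static byte[] receive(InputStream in) throws IOException {
        DataInputStream d = new DataInputStream(in);
        byte[] wav = new byte[d.readInt()];
        d.readFully(wav);
        return wav; // ready to hand to the recognizer
    }
}
```

On the server, the received bytes would be written to a temporary WAV file and passed to the recognizer, as in the bundled audio-file demo.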

Commenting on accuracy, I was not very satisfied. Various factors determine the accuracy of a speech recognition system, like pronunciation, microphone quality, surrounding noise, etc. I got good results when using it to recognize digits, but on providing random words for recognition, accuracy fell below 50%. I visited the forums but found no proper solution.
The focus is still on improving the results. Changing a few parameters did increase accuracy, but not enough to convince me to use it in production. I had to leave this work stalled for now.

Edit: This is one of the initial samples I had developed. Download
Update 20/3/2012: The download link was broken. Thanks Jaishu for pointing it out.

Sunday, April 26, 2009

Sphinx4 Configuration file : config.xml

There are three primary modules in the Sphinx-4 framework:
  • The FrontEnd
  • The Decoder
  • The Linguist

Sphinx4 is very modular in nature: every block can be separately configured in a configuration file. In this file we specify the front end Sphinx4 will use; the acoustic model and dictionary used to create the search graph used during recognition; and the language model (grammar), which makes the recognizer look for the 'most likely' words during recognition. The Sphinx-4 Decoder block uses output from the FrontEnd in conjunction with the SearchGraph produced by the Linguist to generate the recognition Result.

Let us now walk through a sample Configuration file: config.xml (download here)
Every config file has been logically separated into different sections. You can find syntax and rules for creating a configuration file at Sphinx Configuration management site.
  1. Frequently Used Properties consists of properties that are referenced by the other sections.
  2. In Language Model we specify the grammar Sphinx will use to match the speech. Sphinx4 has pluggable language model support for ASCII and binary versions of unigram, bigram, and trigram models, the Java Speech API Grammar Format (JSGF), and ARPA-format FST grammars.
  3. The Dictionary can be the Wall Street Journal (WSJ) dictionary, the TIDIGITS dictionary, or your own dictionary in the same format. You can find the WSJ and TIDIGITS dictionaries in the Sphinx4 binaries themselves. The dictionary consists of words and their phonetic pronunciations.
  4. Next, define the Acoustic Model depending upon the type of dictionary you use. Again, Sphinx4 includes acoustic models for WSJ and TIDIGITS.
  5. In the Front End we specify whether the input comes from a microphone or from a data source (wav, au, etc.).
These are the major sections in any Sphinx4 configuration file. I will discuss few of these sections in subsequent articles.
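As an abbreviated sketch of how a few of these sections fit together (component names and paths here are illustrative, loosely modeled on the demo configs; a real config.xml needs many more components), it might look like:

```xml
<config>
    <!-- Frequently used properties -->
    <property name="logLevel" value="WARNING"/>

    <!-- Language model: a JSGF grammar in this sketch -->
    <component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
        <property name="grammarLocation" value="resource:/grammar/"/>
        <property name="grammarName" value="hello"/>
    </component>

    <!-- Front end: live input from the microphone -->
    <component name="microphone" type="edu.cmu.sphinx.frontend.util.Microphone">
        <property name="msecPerRead" value="10"/>
    </component>

    <!-- The recognizer ties the pieces together via its decoder -->
    <component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
        <property name="decoder" value="decoder"/>
    </component>
</config>
```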

Friday, April 17, 2009

Getting Started with Sphinx4 and Eclipse

I started by reading the Getting Started page of Sphinx4. After downloading both the source code and binaries, it was time to set up the environment. I have always used and enjoyed working in Eclipse, so I went about setting up an Eclipse environment.

Pre-requisites: Eclipse (Callisto, Europa, or Ganymede), JRE 1.4+.
Follow these steps to set up your development environment in the Eclipse IDE:
  1. Extract the Sphinx4 source and binaries into a folder. I have /Speech/Sphinx/sphinx4-1.0beta2.
  2. Create an empty Java project titled 'sphinx', or any name you wish.
  3. In the Package Explorer, right-click the project and select 'Build Path -> Link Source'.
  4. A dialog box will appear asking for a source folder from your filesystem. Navigate to /Speech/Sphinx/sphinx4-1.0beta2/src/sphinx4.
  5. Eclipse parses the entire folder structure recursively and names your source folder.
  6. Click Next and Finish. A new source folder is now linked to your project.
  7. It will contain lots of errors, as the libraries have not been added yet.
  8. Add /lib/js.jar, lib/tags.jar, and lib/jsapi.jar to your project classpath by right-clicking the project and selecting 'Build Path -> Configure Build Path -> Libraries'.
  9. Eclipse then refreshes the workspace and all the errors disappear.

Sphinx4 also provides you with a handful of sample programs. I found every sample program very useful; together they cover most of the details needed to learn Sphinx4.

To set up the environment for using the samples and viewing their source code, follow these steps.

  1. Create another new project and title it 'sphinx-demos'.
  2. Right-click and link this source folder: /Speech/Sphinx/sphinx4-1.0beta2/src/apps.
  3. Add the missing libraries /lib/js.jar, lib/tags.jar, lib/jsapi.jar, and lib/sphinx4.jar to your project classpath.
  4. Every demo is a simple Java file. To test any demo, just right-click it and select Run -> Run As -> Java Application.

There you go. You are now ready to learn Sphinx4 with a proper development environment set up.

The sphinx project gives you access to the entire Sphinx source code.
The sphinx-demos project gives you access to all the demo apps for Sphinx.

Wednesday, April 15, 2009

Now where to start?

Initially I had no idea what to look into, or where, to find good speech recognition software. I kind of 'specialize' in Java, so I narrowed things down by searching for any SR engine based on Java.

I came across the Java Speech API, developed by Sun Microsystems, Inc. in collaboration with leading speech technology companies: Apple Computer, Inc., AT&T, Dragon Systems, Inc., IBM Corporation, Novell, Inc., Philips Speech Processing, and Texas Instruments Incorporated. The moment I saw this, I felt it was SUPER COOL; Sun has done brilliant work in every field. BUT after reading a bit more I realized that these are just APIs.

The Java Speech API defines a standard, easy-to-use, cross-platform software interface to state-of-the-art speech technology. JS API defines two technologies: Speech Synthesis and Speech Recognition.

Speech Synthesis is basically Text To Speech.
Speech Recognition is Speech to Text.

These APIs led me to Sphinx, IBM ViaVoice, the Microsoft Speech API, Julius, and others.
I couldn't get IBM ViaVoice or the MS Speech API for my use, hence I started off with Sphinx.

Sphinx was a first-of-its-kind continuous speech recognizer; it covers only the recognition part of the JS API. Sphinx was developed at Carnegie Mellon University by Kai-Fu Lee and is currently in its 4th version. Sphinx-4 is a complete re-write of the Sphinx engine with the goal of providing a more flexible framework for research in speech recognition. It is written entirely in the Java programming language, leveraging the Java Speech API standard.

Sphinx4 is currently one of the best speech recognizers in use. There is also a version for embedded devices, called 'PocketSphinx'. If you need a speech recognizer, then I suggest you start with this.
Sphinx has very good documentation and an active forum too. I really enjoyed exploring it. In the following articles I will write about using Sphinx in more detail.

One thing that struck me was that Sphinx was developed by a non-American, Kai-Fu Lee, who is said to be the man behind Microsoft's Speech *WRECKecognition* initiative. This video shows it all. :)