Because this is a major milestone in the MIND|CONSTRUCT project, I am posting it here instead of in the other announcements topic.
The first important milestone of the prototype implementation has been reached: we have implemented a state-of-the-art Part-of-Speech tagger (POS tagger). This means the system is now at the level of current NLP research and is in fact at the forefront of development in this area.
The implemented POS tagger is a stochastic tagger based in part on previous research by Eric Brill (the Brill tagger) and on our own model for Word Sequence Aggregation. What sets our POS tagger apart from other efforts in this domain is that it does not use a previously tagged language corpus; it builds its own knowledge base (corpus) while it is being trained. The other distinguishing feature is that it uses a fairly small ruleset, whereas other taggers need large numbers of rules (sometimes hundreds) to reach a usable recognition rate within a sentence. Our tagger also skips the step of tagging for grammar; instead, words are tagged semantically right away.
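The implementation details are not published here, but purely as an illustration, a minimal sketch of a tagger that builds its own knowledge base from supervised corrections (instead of loading a pre-tagged corpus) could look like the following Python. All class, method, and tag names are hypothetical, and the statistics are reduced to simple per-word tag counts:

```python
from collections import defaultdict

class SelfBuildingTagger:
    """Illustrative sketch: the 'corpus' is grown during training."""

    def __init__(self):
        # word -> {semantic tag: times seen}; this map IS the knowledge base
        self.knowledge = defaultdict(lambda: defaultdict(int))

    def train(self, sentence, corrected_tags):
        """Supervised step: record the corrected tag for each word."""
        for word, tag in zip(sentence, corrected_tags):
            self.knowledge[word.lower()][tag] += 1

    def tag(self, sentence):
        """Pick the most frequent tag seen so far; None marks unknown words."""
        tags = []
        for word in sentence:
            seen = self.knowledge.get(word.lower())
            tags.append(max(seen, key=seen.get) if seen else None)
        return tags

tagger = SelfBuildingTagger()
tagger.train(["the", "cat", "sleeps"], ["DET", "AGENT", "ACTION"])
print(tagger.tag(["the", "dog", "sleeps"]))  # ['DET', None, 'ACTION']
```

Note that the tags in the example are semantic roles rather than grammatical categories, matching the "tag semantically right away" approach, and that an unseen word ('dog') simply comes back untagged instead of breaking the run.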
The tagger gives impressive results on very sparse data sets: our implementation reached around 90% correct tagging after being trained on just 400 sentences (between six and sixteen words each), adding up to just over 1000 words in total. Because the tagger works without a previously tagged external corpus, it can recognize unknown words and has no problems with typos. We are still training the tagger to reach at least 96%, but the best part is that it is actually learning recursively: eventually (around the 95-96% mark) we will stop entering corrections (supervised learning) and instead let the system defer words it cannot yet tag, resolving them at a later stage once it has gathered enough information to do so.
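Again purely as an illustration, and building on the hypothetical sketch above, the deferral idea could be wrapped around any tagger with `train()`/`tag()` methods. The bigram evidence used here is an assumption for the sake of the example, not a description of our actual mechanism:

```python
from collections import defaultdict

class DeferredResolver:
    """Sketch: park unknown words instead of correcting them, retry later."""

    def __init__(self, tagger):
        self.tagger = tagger
        self.pending = []  # (sentence, word_index) pairs awaiting evidence
        # evidence: tag of the previous word -> counts of the tag that followed
        self.follows = defaultdict(lambda: defaultdict(int))

    def process(self, sentence):
        tags = self.tagger.tag(sentence)
        # record tag-sequence evidence from the words we could tag
        for prev, cur in zip(tags, tags[1:]):
            if prev and cur:
                self.follows[prev][cur] += 1
        # park the words we could not tag, instead of asking for a correction
        self.pending.extend((sentence, i) for i, t in enumerate(tags) if t is None)
        return tags

    def retry_pending(self):
        """Later pass: resolve parked words from accumulated evidence."""
        unresolved = []
        for sentence, i in self.pending:
            prev = self.tagger.tag(sentence)[i - 1] if i > 0 else None
            evidence = self.follows.get(prev)
            if evidence:
                guess = max(evidence, key=evidence.get)
                self.tagger.train([sentence[i]], [guess])  # fold guess back in
            else:
                unresolved.append((sentence, i))
        self.pending = unresolved  # keep waiting for more evidence
```

The point of the sketch is the control flow: nothing is forced at first sight, and a word only gets a tag once the system has seen enough to decide on its own.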
Although our POS tagger is pretty much state of the art in Part-of-Speech tagging, it is only an infrastructural facility in our system, primarily aimed at assisting the training of our ASTRID system. Once ASTRID has accumulated enough semantic knowledge, she will be able to infer the rules for understanding language all by herself.
http://mindconstruct.com/webpages/newsitem/19