

Major milestone: POS-tagger implemented
 
 

Because this is a major milestone in the MIND|CONSTRUCT project, I'm posting it here instead of in the other announcements topic wink

The first important milestone of the prototype implementation has been reached: we have implemented a state-of-the-art 'Part of Speech tagger' (POS-tagger). This means the system is now at the level of current NLP research and is actually at the forefront of development in this area.

The implemented POS-tagger is a stochastic tagger based (in part) on previous research by Eric Brill (the Brill tagger) and our own model for Word Sequence Aggregation. What sets our POS-tagger apart from other efforts in this domain is that it does not use a previously tagged language corpus; it builds its own knowledge base (corpus) while it is being trained. The other distinguishing feature is that our POS-tagger uses a fairly small ruleset, whereas other taggers need large numbers of rules (sometimes hundreds) to reach a usable recognition rate on a sentence. Our tagger also skips the step of tagging for grammar; instead, words are tagged semantically right away.

The tagger gives impressive results on very sparse data sets: our implementation reached around 90% correct tagging after being trained on just 400 sentences (ranging between six and sixteen words each), adding up to just over 1,000 words in total. Because the tagger works without a previously tagged external corpus, it can recognize unknown words and has no problems with typos. We are currently still training the tagger to reach at least 96%, but the best part is that the tagger is actually learning recursively: eventually (around the 95-96% mark) we will no longer input corrections (supervised learning), but instead let the system figure out, at a later stage, the words it cannot yet tag right away, once it has accumulated enough information to do so.
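To make that "figure it out later" idea concrete, here is a minimal toy sketch in Python. The data structures, tag names and evidence threshold are invented for illustration; this is not the actual MIND|CONSTRUCT implementation.

```python
from collections import defaultdict, Counter

lexicon = {}                     # word -> tag, once learned
follows = defaultdict(Counter)   # previous tag -> counts of the next tag
deferred = []                    # (word, previous tag) pairs to retry later

def learn(words, tags):
    """Supervised phase: store corrected tags and tag-sequence patterns."""
    prev = "<s>"
    for word, tag in zip(words, tags):
        lexicon[word] = tag
        follows[prev][tag] += 1
        prev = tag

def tag_sentence(words, min_evidence=3):
    """Tag a sentence, deferring words that cannot be resolved yet."""
    prev, result = "<s>", []
    for word in words:
        if word in lexicon:
            tag = lexicon[word]
        elif follows[prev] and follows[prev].most_common(1)[0][1] >= min_evidence:
            tag = follows[prev].most_common(1)[0][0]   # confident pattern guess
            lexicon[word] = tag
        else:
            deferred.append((word, prev))              # figure it out later
            tag = "?"
        result.append((word, tag))
        prev = tag
    return result

def retry_deferred(min_evidence=3):
    """Once enough patterns have accumulated, earlier failures may resolve."""
    for word, prev in list(deferred):
        counts = follows[prev]
        if counts and counts.most_common(1)[0][1] >= min_evidence:
            lexicon[word] = counts.most_common(1)[0][0]
            deferred.remove((word, prev))
```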

Although our POS-tagger is pretty much state-of-the-art in Part of Speech tagging, it is only an infrastructural facility in our system, basically aimed at assisting the training of our ASTRID system. When ASTRID has accumulated enough semantic knowledge, she will be able to infer rules for understanding language all by herself.

http://mindconstruct.com/webpages/newsitem/19

 

 
  [ # 1 ]

We just passed 94% of words tagged correctly and 48% of sentences fully correct, with just under 600 sentences / 1,600 words of training.

 

 
  [ # 2 ]

And how do you measure 94% correct tagging? Do you take a slice of the Penn Treebank and confirm against that, or what? How many sentences/words are in your test suite?

 

 
  [ # 3 ]

Hey,
Do you parse a Spanish corpus, or only an English one?

One special observation: the general task of POS tagging short sentences in a chat context is hard and very error-prone, because the semantics and syntax are irregular, especially in online environments!

Just parsing short stuff is very hard. I tried it for a long time and gave up on parsing the traditional way; instead I do selective chunking of sentence parts (heuristically driven).

What is the tagset you use? EAGLES or Penn?

thanks!

 

 
  [ # 4 ]

Hey Bruce,

Bruce Wilcox - Jun 9, 2015:

And how do you measure 94% correct tagging? Do you take a slice of the Penn Treebank and confirm against that, or what? How many sentences/words are in your test suite?

We specifically do not use any pre-tagged corpus. The objective of our tagger is to get high scores on very sparse training sets, so we train (mainly) with sentences from English novels. Sentences are entered and scored for correctly tagged words, and then the wrongly tagged and untagged words are given to the system (supervised learning). Because the system cannot 'stop learning' (every new sentence builds additional pattern knowledge), we score with a moving window over the last X sentences entered.

We started with basically no words in the system at all, which of course gave a zero tagging score. The overall score of the system is currently around 85% correct for the whole data set, which of course includes all the low-scored sentences from the beginning. Because we started out with a zero score, we use the windowed score to reflect the current status of the tagger. As the data set grows, the overall score will approach the windowed score, since the amount of (very) low-scored sentences shrinks relative to the whole data set.
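As a concrete illustration of such windowed scoring, here is a minimal sketch in Python; the window size of 100 sentences is an assumption, not the project's actual setting:

```python
from collections import deque

WINDOW = 100                      # score over the last 100 sentences (assumed)
recent = deque(maxlen=WINDOW)     # (correct_words, total_words) per sentence
overall = [0, 0]                  # running totals for the whole data set

def record(correct_words, total_words):
    recent.append((correct_words, total_words))
    overall[0] += correct_words
    overall[1] += total_words

def windowed_score():
    correct = sum(c for c, _ in recent)
    total = sum(t for _, t in recent)
    return correct / total if total else 0.0

def overall_score():
    return overall[0] / overall[1] if overall[1] else 0.0
```

As the data set grows, `overall_score()` converges toward `windowed_score()`, which is the convergence described above.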

We do use the Penn Treebank corpus, via a tagger available online, to check our own system's tagging in cases where we are not sure what the correct tagging should be (our native language is not English). However, the corpus data is not available to our tagger in any way. We specifically wanted to develop a tagger that learns 'like humans' do, instead of taking a 'big data' approach with a large training corpus (because that is obviously not how we learn language). This approach was also chosen because it reflects the way the whole ASTRID system is going to learn: the system learns from conversation and (virtualized) sensor data, and does not have any previously prepared and/or formatted data loaded into it.

The ideas, science and R&D behind our tagger will be the subject of an upcoming research paper wink

 

 
  [ # 5 ]

Hey Andres,

Andres Hohendahl - Jun 9, 2015:

Hey,
Do you parse a Spanish corpus, or only an English one?

One special observation: the general task of POS tagging short sentences in a chat context is hard and very error-prone, because the semantics and syntax are irregular, especially in online environments!

Just parsing short stuff is very hard. I tried it for a long time and gave up on parsing the traditional way; instead I do selective chunking of sentence parts (heuristically driven).

What is the tagset you use? EAGLES or Penn?

As mentioned in my reply to Bruce, we don't use any corpus for training; the system is completely self-learning, based on the sentences it is fed. 94% correct word tagging is certainly not 'state of the art', but it is pretty impressive (if I may say so myself) given the very small amount of data the system needs to achieve it. Mind you, this is just the current state: tagging results get better with almost every sentence we input, and this will continue, as there is no actual training stage that comes to an end.

I do agree with your point about short sentences. We train with sentences from novels (stories), as they tend to contain proper sentence structures, just as we tend to teach children proper grammar and syntax (at least we try to). To 'understand' short sentences or strange stuff like 'chat language', the system (like humans) first needs a good grasp of language overall. Our tagger is only a training aid for our ASTRID system; sentence-based tagging lacks the broader context of a conversation, but ASTRID has a continuous inner representation of the conversation context at any moment (episodic memory) and can infer the meaning of a sentence from that context (if it is there, of course). So ASTRID does not actually use POS-tagging at all; instead she gets tokenized input from a sentence and matches those tokens against her own knowledge graph, both to find meaning and to recognize word-order occurrences for pattern matching.
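A minimal sketch of that kind of token matching might look as follows; the graph structure, field names and example entries are invented for illustration and are not ASTRID's internals:

```python
from collections import defaultdict

# Invented toy graph; the real knowledge graph is not public.
knowledge_graph = {
    "car": {"is_a": "entity", "has": ["wheels"]},
    "wheels": {"is_a": "entity"},
}
order_patterns = defaultdict(int)   # (word, next_word) -> times seen

def process(sentence):
    tokens = sentence.lower().split()
    # Match each token against the knowledge graph to find meaning.
    meanings = {t: knowledge_graph.get(t) for t in tokens}
    # Record word-order occurrences for pattern matching.
    for a, b in zip(tokens, tokens[1:]):
        order_patterns[(a, b)] += 1
    return meanings
```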

 

 
  [ # 6 ]

It sounds effective, but can you clarify: you have a POS tagger that doesn't use POS-tagging at all?
What does it tag words as? Does it tag determiners, nouns, adjectives, verbs, but not grammatical subject/object/location roles?
Or does it treat all words as words and guess how they relate to one another, based on how they are usually (statistically) related in the corpus?

 

 
  [ # 7 ]
Don Patrick - Jun 9, 2015:

It sounds effective, but can you clarify: you have a POS tagger that doesn't use POS-tagging at all?
What does it tag words as? Does it tag determiners, nouns, adjectives, verbs, but not grammatical subject/object/location roles?
Or does it treat all words as words and guess how they relate to one another, based on how they are usually (statistically) related in the corpus?

It DOES do POS-tagging, but we don't tag with grammar tags; instead we tag with semantic tags. Grammar is not sufficient to describe certain semantic relations between words in a sentence. For more background, see http://en.wikipedia.org/wiki/Conceptual_dependency_theory (although we do not use THIS specific method, we have developed our own).

Some tags we use are: Entity, Attribute, State, Action (past/present/future), Assigner, Concatenator, Logic and Quantity. There are several more; some overlap with grammatical tags while others are completely different. Certain grammatical tags are grouped into one semantic tag, while other grammatical tags are described by several semantic tags in more (semantic) detail.

The main reason we tag semantically is that we can then directly extract semantic constructs (including predicate logic) from a sentence.

Example: the(pointer) car(entity) has(assigner) a(pointer) new(state) set(entity) of(assigner) wheels(entity)

From here we can extract:

- the car
- car has wheels
- car has new wheels
- a set of wheels
- car has a set of wheels
- a new set of wheels
...etc.

Based on the semantically tagged structure of a sentence, we can mine for semantic structures like attribute-entity, entity-assigner-entity, entity-assigner-attribute-entity, action-assigner-entity, state-assigner-entity, etc.

The aim is to extract as rich a contextual representation as possible for each concept in the sentence. Because of how our tags are structured, we can extract all kinds of structural, causal, temporal and other relationships from a sentence.
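As a rough illustration, mining contiguous tag patterns from the example sentence above could look like this in Python; the matching code is a simplification of my own (the real extraction also yields non-adjacent constructs such as "car has wheels"):

```python
# Tagged example sentence from the post above.
tagged = [("the", "pointer"), ("car", "entity"), ("has", "assigner"),
          ("a", "pointer"), ("new", "state"), ("set", "entity"),
          ("of", "assigner"), ("wheels", "entity")]

# A subset of the tag-sequence patterns listed above.
patterns = [
    ("entity", "assigner", "entity"),
    ("pointer", "entity"),
    ("state", "entity"),
]

def mine(tagged, patterns):
    """Return the word spans whose tag sequences match a pattern."""
    tags = [t for _, t in tagged]
    found = []
    for pat in patterns:
        for i in range(len(tags) - len(pat) + 1):
            if tuple(tags[i:i + len(pat)]) == pat:
                found.append(" ".join(w for w, _ in tagged[i:i + len(pat)]))
    return found

print(mine(tagged, patterns))   # ['set of wheels', 'the car', 'new set']
```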

 

 
  [ # 8 ]

Ah, that I understand, as it is very similar to my methods. I still do grammar before semantics, but hope to phase out most of the grammar eventually. I also distinguish between living entities and objects, because they are treated differently by words like "it/he" and "who/what", but there are edge cases and I acknowledge that considering them all 'entities' is preferable for a knowledge database.
Lately I've been wondering whether "set" shouldn't be considered a quantity ("a set of wheels" = 2 or more wheels), but that's also a matter of ambiguity. Either way, you have overtaken my work in less than a year smile. Congratulations.

In the interest of the other readers though, I think it is fair to conclude that the 94% you mention is not directly comparable with classic POS-taggers like Stanford’s.

 

 
  [ # 9 ]
Don Patrick - Jun 10, 2015:

In the interest of the other readers though, I think it is fair to conclude that the 94% you mention is not directly comparable with classic POS-taggers like Stanford’s.

It is indeed not comparable, but in a most favorable way: classic taggers are trained on very large corpora (1 million to 300 million words, in the cases I studied), either to derive a (quite large) set of rules or to be queried directly by the tagger. Our tagger does not (currently) query a pre-tagged corpus, not even its own; each word (except learned unambiguous words) in every new sentence is determined anew, based on learned language patterns (something like hidden Markov models).

No new sentence already exists in the data set, so the sliding-window test set is always different from what has been trained on before. The words stored in the system therefore constitute both the (small) amount of training data AND the very small data set that is mined for tag-sequence occurrences.

I’d say that the current result is pretty impressive smile

The important part is that the way our tagger stores and queries language patterns is completely different from how current taggers work. In a nutshell, this is what we do (a rough sketch follows the list):

- for each word we have a predecessor bigram and a successor bigram.
- for each bigram we search the word-space for all occurring possible trigrams for that predecessor and successor (consisting again of all available predecessor and successor bigrams).
- all found predecessor and successor bigrams for the actual preceding and succeeding word are run through a weighted rule-system to determine the best combination.
- the actual word is tagged based on the result from the rule-system.
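To make those steps concrete, here is a much-simplified toy sketch in Python. The data structures and the frequency-based scoring are my assumptions, standing in for the actual weighted rule-system and its 10-level query structure:

```python
from collections import defaultdict, Counter

# (predecessor tag, successor tag) -> counts of the tags seen between them
trigrams = defaultdict(Counter)

def learn_tag_sequence(tags):
    """Record every tag trigram from a tagged training sentence."""
    padded = ["<s>"] + tags + ["</s>"]
    for prev, mid, nxt in zip(padded, padded[1:], padded[2:]):
        trigrams[(prev, nxt)][mid] += 1

def best_tag(prev_tag, next_tag, candidates):
    """Pick the candidate tag most often seen between these neighbours
    (a crude frequency stand-in for the weighted rule-system)."""
    seen = trigrams[(prev_tag, next_tag)]
    return max(candidates, key=lambda t: seen[t])
```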

The nested query structure to handle this is 10 levels deep wink

Besides the rule-system, we use some other strategies in concert with it (as other taggers do). For example, when a word is tagged unambiguously, we keep tagging it as unambiguous until the system learns otherwise; after that, it is again tagged by the rule-system.
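That unambiguity shortcut could be sketched like this (illustrative only; all names are invented):

```python
unambiguous = {}      # word -> the single tag seen for it so far
ambiguous = set()     # words that turned out to have more than one tag

def observe(word, tag):
    if word in ambiguous:
        return
    if word not in unambiguous:
        unambiguous[word] = tag      # treat as unambiguous for now
    elif unambiguous[word] != tag:
        del unambiguous[word]        # "learned otherwise"
        ambiguous.add(word)          # from now on, use the rule-system

def quick_tag(word):
    """Cached tag, or None to fall through to the rule-system."""
    return unambiguous.get(word)
```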

 

 
  [ # 10 ]

I'm now working on a small update to the rule-system that will make it even more efficient (quite a bit, actually), based on our training experience so far.

The system also keeps counts of pattern occurrences, which we don't use yet but which could be another step in further optimizing the tagger.

 

 
  [ # 11 ]

The tagger in action:

[Image attachment: atominspector.png]

 