The evaluation systems for Chinese word segmentation and named entity recognition are the same as in the previous bakeoff, Bakeoff 2006 (see bakeoff2006 at http://sighan.cs.uchicago.edu/bakeoff2006/ for more information). The evaluation system for Chinese POS tagging is described as follows:

The Chinese POS-tagging evaluation system

In this test, a participating system will take a given segmented corpus as the input, and only the POS-tagging performance will be evaluated.

Two evaluation tracks are available on every corpus: open and closed.
In the open tests, participants may use any external data in addition to the training corpus to train their systems.
In the closed tests, participants may use only the information found in the training data. Absolutely no other data or information may be used. (If you have any questions, please follow the hyperlinks given above for more information.)

Definition of terms:

OOV POS tag: if a word occurs in the test corpus with a POS tag that it never has in the training corpus, or if the word itself is an OOV word, the corresponding word-POS pair is called an OOV POS tag.

IV POS tag: if the word-POS pair does occur in the training corpus, the pair is called an IV POS tag.
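The two definitions reduce to a membership test on word-POS pairs. The following is a minimal sketch in Python, not the official scorer. It assumes each corpus is available as a list of (word, tag) tuples; the function name, the romanized example words and the tag labels are all hypothetical.

```python
def classify_pos_pairs(train_pairs, test_pairs):
    """Label every (word, tag) pair of the test corpus as 'IV' or 'OOV'.

    A pair is IV if the exact pair occurs in the training corpus.
    Otherwise it is OOV, which covers both an unseen word and a known
    word carrying a tag it never has in the training corpus.
    """
    seen = set(train_pairs)
    return ["IV" if pair in seen else "OOV" for pair in test_pairs]


# Hypothetical toy data; the tag labels are placeholders.
train = [("wo", "r"), ("ai", "v"), ("yanjiu", "v")]
test = [("wo", "r"), ("yanjiu", "n"), ("Beijing", "ns")]
labels = classify_pos_pairs(train, test)
# labels == ["IV", "OOV", "OOV"]
```

Here ("yanjiu", "n") is OOV because the word is known but never has that tag in the training data, and ("Beijing", "ns") is OOV because the word itself is unseen.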

1. Accuracy

In this test, only the accuracy of POS tagging is evaluated, because recall and precision cannot be calculated.

$\mathrm{Accuracy} = N_{\mathrm{correct}} / N_{\mathrm{truth}}$

where $N_{\mathrm{correct}}$ denotes the number of words that are correctly tagged, and $N_{\mathrm{truth}}$ denotes the number of words in the truth corpus.

Furthermore, using the definitions of IV POS tag and OOV POS tag, the accuracy on IV POS tags and on OOV POS tags is also calculated.
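The three accuracy figures can be sketched as follows. This is an illustrative Python sketch, not the official scoring script. It assumes the truth corpus is a list of (word, tag) tuples and the system output is a list of tags aligned with it, which is possible because the segmentation is given. The function name and the dictionary keys are hypothetical.

```python
def pos_accuracies(train_pairs, gold_pairs, predicted_tags):
    """Return overall, IV and OOV tagging accuracy as a dict.

    Each truth-corpus pair is counted as IV or OOV according to whether
    the pair occurs in the training corpus. A group with no items gets
    None instead of a ratio.
    """
    if len(gold_pairs) != len(predicted_tags):
        raise ValueError("system output must align with the truth corpus")
    seen = set(train_pairs)
    counts = {"all": [0, 0], "iv": [0, 0], "oov": [0, 0]}  # [correct, total]
    for (word, gold), pred in zip(gold_pairs, predicted_tags):
        group = "iv" if (word, gold) in seen else "oov"
        hit = int(pred == gold)
        for key in ("all", group):
            counts[key][0] += hit
            counts[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


# Hypothetical toy data.
train = [("wo", "r"), ("yanjiu", "v")]
gold = [("wo", "r"), ("yanjiu", "n"), ("Beijing", "ns"), ("yanjiu", "v")]
pred = ["r", "v", "ns", "v"]
scores = pos_accuracies(train, gold, pred)
# scores == {"all": 0.75, "iv": 1.0, "oov": 0.5}
```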

2. Evaluation for Multi-POS words tagging

Multi-POS words are words that occur in the training corpus and have more than one POS tag in either the training corpus or the testing corpus. For instance, if an IV word has only one POS tag in the training corpus, but has other POS tags in the testing corpus, it is a multi-POS word.

With this definition, the accuracy of multi-POS word tagging can be calculated.
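One possible reading of this definition is sketched below in Python. It is an assumption, not the official implementation: a word counts as multi-POS when it occurs in the training corpus and the union of its tags across the training corpus and the truth (testing) corpus has more than one member, which matches the example given above. Corpora are assumed to be lists of (word, tag) tuples, and both function names are hypothetical.

```python
from collections import defaultdict


def multi_pos_words(train_pairs, gold_pairs):
    """Return the set of IV words with more than one tag overall."""
    tags = defaultdict(set)
    for word, tag in train_pairs:
        tags[word].add(tag)
    for word, tag in gold_pairs:
        if word in tags:  # only words seen in training can qualify
            tags[word].add(tag)
    return {word for word, tag_set in tags.items() if len(tag_set) > 1}


def multi_pos_accuracy(train_pairs, gold_pairs, predicted_tags):
    """Accuracy restricted to tokens whose word is a multi-POS word."""
    targets = multi_pos_words(train_pairs, gold_pairs)
    scored = [
        (gold, pred)
        for (word, gold), pred in zip(gold_pairs, predicted_tags)
        if word in targets
    ]
    if not scored:
        return None  # no multi-POS tokens in the truth corpus
    return sum(gold == pred for gold, pred in scored) / len(scored)


# Hypothetical toy data.
train = [("wo", "r"), ("yanjiu", "v"), ("hao", "a"), ("hao", "d")]
gold = [("wo", "r"), ("yanjiu", "n"), ("hao", "a"), ("Beijing", "ns")]
pred = ["r", "v", "a", "ns"]
acc = multi_pos_accuracy(train, gold, pred)
# multi-POS words: {"yanjiu", "hao"}; acc == 0.5
```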

3. Baseline

The baseline indicates how difficult each individual corpus is to tag.

The baseline for each corpus is calculated by generating a list of words and their POS tags from the training corpus, then:
1. IV words in the testing corpus that have only one POS tag in the list are assigned that tag.
2. IV words that have more than one tag in the training corpus are assigned their unique most frequent tag in the training corpus.
3. IV words that do not have a unique most frequent tag in the training corpus, and all OOV words, are assigned the unique overall most frequent tag in the training corpus. If there is no unique overall most frequent tag in the training corpus, those words are considered incorrectly tagged.
The result is compared to the truth corpus to calculate the baseline accuracy.
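The three baseline steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the organizers' script: corpora are lists of (word, tag) tuples, and a returned tag of None marks a word that is counted as incorrectly tagged. The names `unique_top`, `baseline_tags` and `baseline_accuracy` are hypothetical.

```python
from collections import Counter, defaultdict


def unique_top(counter):
    """Return the single most frequent key, or None if empty or tied."""
    ranked = counter.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place: no unique most frequent tag
    return ranked[0][0]


def baseline_tags(train_pairs, test_words):
    """Assign a baseline tag (or None) to every word of the test corpus."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in train_pairs:
        per_word[word][tag] += 1
        overall[tag] += 1
    fallback = unique_top(overall)  # step 3; None if the overall top is tied
    guesses = []
    for word in test_words:
        # Steps 1 and 2: a word with a single tag trivially has a unique top.
        guess = unique_top(per_word[word]) if word in per_word else None
        guesses.append(guess if guess is not None else fallback)
    return guesses


def baseline_accuracy(train_pairs, gold_pairs):
    """Compare the baseline tags with the truth corpus."""
    if not gold_pairs:
        return None
    guesses = baseline_tags(train_pairs, [word for word, _ in gold_pairs])
    correct = sum(
        guess is not None and guess == gold
        for guess, (_, gold) in zip(guesses, gold_pairs)
    )
    return correct / len(gold_pairs)


# Hypothetical toy data: "v" is the unique overall most frequent tag.
train = [("wo", "r"), ("yanjiu", "v"), ("yanjiu", "v"), ("yanjiu", "n"),
         ("hao", "a"), ("hao", "d"), ("chi", "v")]
gold = [("wo", "r"), ("yanjiu", "n"), ("hao", "a"), ("Beijing", "ns"), ("pao", "v")]
base = baseline_accuracy(train, gold)
# baseline tags: ["r", "v", "v", "v", "v"]; base == 0.4
```

In the toy data, "hao" is tied between two tags and so falls through to the overall fallback, as do the two unseen words.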

4. OOV POS tag Rate

OOV POS tag Rate is another measure that indicates how difficult each individual corpus is to tag.

$\mathrm{OOV\ POS\ tag\ Rate} = N_{\mathrm{OOV}} / N_{\mathrm{truth}}$

where $N_{\mathrm{OOV}}$ denotes the number of OOV POS tags in the truth corpus, and $N_{\mathrm{truth}}$ denotes the number of words in the truth corpus.
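As a rough illustration, the rate can be computed by counting the truth-corpus pairs that never occur in the training corpus. The Python sketch below assumes both corpora are lists of (word, tag) tuples; the function name is hypothetical.

```python
def oov_pos_tag_rate(train_pairs, gold_pairs):
    """Fraction of truth-corpus word-POS pairs that are OOV POS tags."""
    if not gold_pairs:
        return None
    seen = set(train_pairs)
    n_oov = sum(1 for pair in gold_pairs if pair not in seen)
    return n_oov / len(gold_pairs)


# Hypothetical toy data: two of the four truth pairs are unseen.
train = [("wo", "r"), ("yanjiu", "v")]
gold = [("wo", "r"), ("yanjiu", "n"), ("Beijing", "ns"), ("yanjiu", "v")]
rate = oov_pos_tag_rate(train, gold)
# rate == 0.5
```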