For the Chinese word segmentation task, corpora from the following sources will be available:
| CKIP | Academia Sinica, Taipei (721551 tokens, Training) |
| CityU | City University of Hong Kong, Hong Kong (1092687 tokens, Training) |
| CTB | University of Colorado, United States (642246 tokens, Training) |
| NCC | State Language Commission of P.R.C., Beijing (917255 tokens, Training) |
| SXU | Shanxi University, Taiyuan |
For the Named Entity Recognition task, corpora from the following sources will be available:
| CityU | City University of Hong Kong, Hong Kong (1772202 characters, Training) |
| MSRA | Microsoft Research Asia, Beijing (1089050 characters, Training) |
| PKU | Peking University, Beijing (1833177 characters, Training) |
For the Chinese POS-tagging task, corpora from the following sources will be available:
| CKIP | Academia Sinica, Taipei (721551 tokens, Training) |
| CityU | City University of Hong Kong, Hong Kong (1092687 tokens, Training) |
| CTB | University of Colorado, United States (642246 tokens, Training) |
| NCC | State Language Commission of P.R.C., Beijing (535023 tokens, Training) |
| PKU | Peking University, Beijing (1116754 tokens, Training) |
The format of the corpora is described below:
Overall Format:
1. Corpora may use Simplified or Traditional Chinese characters.
2. Corpora may be encoded in GBK (Microsoft’s CP936), BIG5 (Microsoft’s CP950), BIG5plus, or BIG5/HKSCS, but each corpus should also have a Unicode (UTF-16 little-endian) version. ALL CHARACTERS in the corpora should be in the Unicode BMP (Basic Multilingual Plane).
3. ALL Arabic numerals, Latin letters, and punctuation marks should be full-width characters.
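The BMP and full-width requirements above can be checked mechanically. The following is a minimal sketch, not part of the official tooling; the helper names `is_bmp` and `to_fullwidth` are hypothetical. It relies on the fact that the printable ASCII range 0x21 to 0x7E maps onto the Unicode full-width forms U+FF01 to U+FF5E by a fixed offset of 0xFEE0.

```python
def is_bmp(text: str) -> bool:
    """Return True if every code point lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def to_fullwidth(text: str) -> str:
    """Map printable ASCII (0x21-0x7E) to full-width forms (U+FF01-U+FF5E).

    The half-width space (0x20) is left untouched, because the corpus
    formats use it as a delimiter.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))  # fixed offset to the full-width block
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_fullwidth("A1,"))     # full-width letter, digit, and comma
    print(is_bmp("中文"))           # characters inside the BMP
    print(is_bmp("\U00020000"))    # a code point outside the BMP
```

In POS-tagged corpora the "/" delimiter and the tags themselves stay half-width, so a conversion like this would be applied to the word text only, not to whole lines.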
Corpora for Chinese Word Segmentation
1. In training data and truth data, the segmentation delimiter should be TWO half-width spaces (ASCII: 0x20, Unicode: 0x0020). Half-width spaces should not occur inside words; any space inside a word should be a full-width space (GBK: 0xA1A1, BIG5: 0xA140, Unicode: 0x3000).
2. The testing data is a raw corpus without segmentation delimiters, so no half-width spaces (ASCII: 0x20, Unicode: 0x0020) should occur in the testing corpora. If a space is needed in the testing data, it should be a full-width space (GBK: 0xA1A1, BIG5: 0xA140, Unicode: 0x3000).
3. Each line should contain only one sentence and end with a “Return” (ASCII: 0x0D, Unicode: 0x000D) followed by a “Newline” (ASCII: 0x0A, Unicode: 0x000A).
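As an illustration of the segmentation format above, the sketch below splits a training line on the two-space delimiter, writes words back out with a CR LF line ending, and checks that a testing line contains no half-width space. The function names are hypothetical, and the input is assumed to be already decoded into a Python string (for example from the UTF-16 little-endian version).

```python
from typing import List

DELIM = "  "  # TWO half-width spaces (U+0020 U+0020)


def split_segmented_line(line: str) -> List[str]:
    """Split one training/truth line into its words."""
    body = line.rstrip("\r\n")  # drop the CR LF line ending
    # Empty strings can appear if a line ends with the delimiter; skip them.
    return [word for word in body.split(DELIM) if word]


def join_words(words: List[str]) -> str:
    """Serialize words as one line in the training/truth format."""
    return DELIM.join(words) + "\r\n"


def is_valid_test_line(line: str) -> bool:
    """A testing line must not contain any half-width space."""
    return " " not in line.rstrip("\r\n")


if __name__ == "__main__":
    sample = "中文  分词  任务\r\n"
    words = split_segmented_line(sample)
    print(words)
    print(join_words(words) == sample)
    print(is_valid_test_line("中文分词任务\r\n"))
```

Because a full-width space (U+3000) is a different character from the half-width delimiter, a word that contains one is kept intact by the split.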
Corpora for Chinese Named Entity Recognition
1. Training data is presented in a two-column format: the first column holds a character and the second holds its tag. The tags are defined as follows:
| Tag | Meaning |
| N | Not part of a named entity |
| B-PER | Beginning character of a person name |
| I-PER | Non-beginning character of a person name |
| B-ORG | Beginning character of an organization name |
| I-ORG | Non-beginning character of an organization name |
| B-LOC | Beginning character of a location name |
| I-LOC | Non-beginning character of a location name |
2. Sentences should be separated by an empty line.
3. The delimiter between the first column and the second column should be one half-width space (ASCII: 0x20, Unicode: 0x0020).
4. Each line should end with a “Return” (ASCII: 0x0D, Unicode: 0x000D) followed by a “Newline” (ASCII: 0x0A, Unicode: 0x000A).
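The two-column format described above can be read with a few lines of code. This is a sketch under stated assumptions rather than an official reader: the names `read_ner_sentences` and `extract_entities` are hypothetical, the input is assumed to be already decoded text, and an I- tag with no matching B- tag before it is simply ignored.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (character, tag)


def read_ner_sentences(text: str) -> List[List[Token]]:
    """Group "character tag" lines into sentences separated by empty lines."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in text.split("\n"):
        line = raw.rstrip("\r")  # remove the Return of the CR LF pair
        if line == "":
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the LAST half-width space; the character column may
        # itself be a full-width space, which is why strip() is avoided.
        char, tag = line.rsplit(" ", 1)
        current.append((char, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from B-/I- tags."""
    entities: List[Tuple[str, str]] = []
    chars: List[str] = []
    etype = None
    for char, tag in sentence:
        if tag.startswith("B-"):
            if chars:
                entities.append(("".join(chars), etype))
            chars, etype = [char], tag[2:]
        elif tag.startswith("I-") and chars and tag[2:] == etype:
            chars.append(char)
        else:  # "N", or an I- tag that does not continue the open entity
            if chars:
                entities.append(("".join(chars), etype))
            chars, etype = [], None
    if chars:
        entities.append(("".join(chars), etype))
    return entities


if __name__ == "__main__":
    sample = "张 B-PER\r\n三 I-PER\r\n在 N\r\n北 B-LOC\r\n京 I-LOC\r\n\r\n他 N\r\n"
    for sent in read_ner_sentences(sample):
        print(extract_entities(sent))
```

Splitting on "\n" and trimming the trailing "\r" by hand keeps the reader tied to the CR LF convention, instead of relying on broader line-splitting rules.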
Corpora for Chinese POS-tagging
1. The corpora for Chinese POS-tagging follow the same format as the corpora for Chinese word segmentation: the delimiter between words is TWO half-width spaces (ASCII: 0x20, Unicode: 0x0020).
2. A word and its POS-tag are delimited by one half-width “/” (ASCII: 0x2F, Unicode: 0x002F).
3. All POS-tags in a corpus should be half-width Latin letters.
EXAMPLE(GBK):
A corresponding XML version for each corpus is also available.
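To make the plain-text POS-tagging format described above concrete, the sketch below parses one line into (word, tag) pairs. The function name `parse_pos_line` is hypothetical, and the tags `n`, `v`, and `w` in the sample are placeholders for illustration only; they are not taken from any of the corpora's actual tag sets.

```python
from typing import List, Tuple


def parse_pos_line(line: str) -> List[Tuple[str, str]]:
    """Parse one POS-tagged line into (word, tag) pairs.

    Tokens are separated by TWO half-width spaces, and each token has
    the form word/tag with a half-width slash. Raises ValueError if a
    token contains no slash.
    """
    pairs: List[Tuple[str, str]] = []
    for token in line.rstrip("\r\n").split("  "):
        if not token:
            continue  # skip empty pieces from a trailing delimiter
        # Punctuation inside words is full-width, so the last half-width
        # slash in a token is taken as the word/tag delimiter.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    # Illustrative line with placeholder tags (assumption, not corpus data).
    print(parse_pos_line("中文/n  分词/v  。/w\r\n"))
```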