For the Chinese word segmentation task, corpora from the following institutions are available:

CKIP Academia Sinica, Taipei
(721551 tokens,Training)
CityU City University of Hong Kong, Hong Kong
(1092687 tokens,Training)
CTB University of Colorado, United States
(642246 tokens,Training)
NCC State Language Commission of P.R.C., Beijing
(917255 tokens,Training)
SXU Shanxi University, Taiyuan
(528238 tokens,Training)

For the Named Entity Recognition task, corpora from the following institutions are available:

CityU City University of Hong Kong, Hong Kong
(1772202 characters,Training)
MSRA Microsoft Research Asia, Beijing
(1089050 characters,Training)
PKU Peking University, Beijing
(1833177 characters,Training)

For the Chinese POS-tagging task, corpora from the following institutions are available:

CKIP Academia Sinica, Taipei
(721551 tokens,Training)
CityU City University of Hong Kong, Hong Kong
(1092687 tokens,Training)
CTB University of Colorado, United States
(642246 tokens,Training)
NCC State Language Commission of P.R.C., Beijing
(535023 tokens,Training)
PKU Peking University, Beijing
(1116754 tokens,Training)

The formats of the corpora are described below:

Overall Format:

1.Corpora may use Simplified or Traditional Chinese characters.
2.Corpora may be encoded in GBK (Microsoft’s CP936), BIG5 (Microsoft’s CP950), BIG5plus, or BIG5/HKSCS, but a Unicode (UTF-16 little-endian) version should also be provided for each corpus. ALL CHARACTERS in the corpora should be within the Unicode BMP (Basic Multilingual Plane).
3.ALL Arabic numerals, Latin letters, and punctuation marks should be full-width characters.
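
The full-width and BMP rules above can be checked mechanically. The following is a minimal Python sketch, not part of the official tooling; the function names `to_fullwidth` and `in_bmp` are hypothetical. It relies on the fact that the full-width forms of printable ASCII (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 from their half-width counterparts. The half-width space (0x20) is deliberately left untouched, because the corpus formats described later use it as a delimiter.

```python
FULLWIDTH_OFFSET = 0xFEE0  # distance from U+0021..U+007E to U+FF01..U+FF5E


def to_fullwidth(text: str) -> str:
    """Map printable half-width ASCII characters to their full-width forms.

    The half-width space (0x20) is kept as-is, since it serves as a
    delimiter in the segmentation, NER and POS-tagging formats.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if 0x21 <= code <= 0x7E:
            out.append(chr(code + FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def in_bmp(text: str) -> bool:
    """Return True if every character lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)
```

Note that `to_fullwidth` should only be applied to corpus content, not to half-width delimiters such as the “/” and the Latin POS-tags used in the POS-tagging format.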

Corpora for Chinese Word Segmentation

1.In the training data and truth data, the segmentation delimiter should be TWO half-width spaces (ASCII: 0x20, Unicode: 0x0020). Half-width spaces should not occur within words; any space inside a word should be a full-width space (GBK: 0xA1A1, BIG5: 0xA140, Unicode: 0x3000).
2.The testing data is a raw corpus without segmentation delimiters, so no half-width spaces (ASCII: 0x20, Unicode: 0x0020) should occur in the testing corpora. If a space is needed in the testing data, it should be a full-width space (GBK: 0xA1A1, BIG5: 0xA140, Unicode: 0x3000).
3.Each line should contain exactly one sentence and should end with a “Return” (ASCII: 0x0D, Unicode: 0x000D) followed by a “Newline” (ASCII: 0x0A, Unicode: 0x000A).
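
As an illustration of the segmentation rules above, here is a short Python sketch that splits one training line into words and rebuilds the raw (test-style) text. It assumes the line has already been decoded to a Python string; the function names `split_segmented_line` and `strip_delimiters` are hypothetical and not part of any official tool.

```python
from typing import List

DELIMITER = "  "  # TWO half-width spaces (U+0020 U+0020)


def split_segmented_line(line: str) -> List[str]:
    """Split one segmented sentence into its words.

    Raises ValueError if a half-width space remains inside a word,
    which the format forbids (spaces in words must be U+3000).
    """
    line = line.rstrip("\r\n")  # drop the trailing Return + Newline
    words = [w for w in line.split(DELIMITER) if w]
    for w in words:
        if " " in w:
            raise ValueError("half-width space inside word: %r" % w)
    return words


def strip_delimiters(words: List[str]) -> str:
    """Rebuild the raw sentence, as it would appear in the testing data."""
    return "".join(words)
```

For example, a line holding `Ａ　Ｂ`, two half-width spaces, and `中文` yields two words, the first of which keeps its internal full-width space.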

EXAMPLE(GBK):

Corpora for Chinese Named Entity Recognition

1.Training data are presented in a two-column format, where the first column is the character and the second column is its tag. The tags are specified as follows:

Tag      Meaning
N        Not part of a named entity
B-PER    Beginning character of a person name
I-PER    Non-beginning character of a person name
B-ORG    Beginning character of an organization name
I-ORG    Non-beginning character of an organization name
B-LOC    Beginning character of a location name
I-LOC    Non-beginning character of a location name

2.There should be an empty line between sentences.
3.The delimiter between the first column and the second column should be a half-width space (ASCII: 0x20, Unicode: 0x0020).
4.Each line should end with a “Return” (ASCII: 0x0D, Unicode: 0x000D) followed by a “Newline” (ASCII: 0x0A, Unicode: 0x000A).
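
The two-column format and tag scheme above can be read with a few lines of Python. The sketch below is only illustrative: the function names `read_ner` and `extract_entities` are hypothetical, the input is assumed to be already decoded text, and an `I-` tag that does not continue an entity of the same type is simply treated as outside any entity, which is one possible policy rather than a rule stated by the format.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (character, tag) pairs


def read_ner(text: str) -> List[Sentence]:
    """Parse two-column NER data; sentences are separated by empty lines."""
    sentences, current = [], []
    for line in text.split("\n"):
        line = line.rstrip("\r")  # lines end with Return + Newline
        if line == "":
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.rsplit(" ", 1)  # columns split by one half-width space
        if len(parts) != 2 or not parts[0]:
            raise ValueError("malformed line: %r" % line)
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (type, text) pairs from B-/I- tags, e.g. ("PER", "...")."""
    entities, kind, chars = [], None, []
    for ch, tag in sentence:
        if tag.startswith("B-"):
            if kind:
                entities.append((kind, "".join(chars)))
            kind, chars = tag[2:], [ch]
        elif tag.startswith("I-") and kind == tag[2:]:
            chars.append(ch)
        else:  # tag N, or an I- tag that does not continue the open entity
            if kind:
                entities.append((kind, "".join(chars)))
            kind, chars = None, []
    if kind:
        entities.append((kind, "".join(chars)))
    return entities
```

Splitting on the last half-width space keeps a full-width space (U+3000) in the first column intact, since it is an ordinary character of the sentence.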

Corpora for Chinese POS-tagging

1.The corpora for Chinese POS-tagging follow the same format as the corpora for Chinese word segmentation: the delimiter between words is TWO half-width spaces (ASCII: 0x20, Unicode: 0x0020).
2.Each word and its POS-tag are separated by one half-width “/” (ASCII: 0x2F, Unicode: 0x002F).
3.All POS-tags in the corpora should consist of half-width Latin letters.
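
The POS-tagging rules above can be sketched in Python as follows. The function name `parse_tagged_line` is hypothetical, and the tag names used in the usage note are placeholders rather than tags taken from any of the listed corpora. Because corpus punctuation is full-width, a slash inside a word would be “／” (U+FF0F), so the last half-width “/” in a token is taken as the delimiter.

```python
from typing import List, Tuple


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split one POS-tagged sentence into (word, tag) pairs."""
    line = line.rstrip("\r\n")
    pairs = []
    for token in line.split("  "):  # words are separated by TWO spaces
        if not token:
            continue
        word, sep, tag = token.rpartition("/")  # last half-width slash
        if not sep or not word:
            raise ValueError("missing word or '/' in token: %r" % token)
        # Tags must consist of half-width (ASCII) Latin letters only.
        if not (tag.isascii() and tag.isalpha()):
            raise ValueError("invalid POS-tag in token: %r" % token)
        pairs.append((word, tag))
    return pairs
```

For instance, a line such as `中国/NR  人民/NN` would be parsed into the pairs `("中国", "NR")` and `("人民", "NN")`.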

EXAMPLE(GBK):

A corresponding XML version for each corpus is also available.