Chinese ConceptNet

April 5, 2022

This dataset is a refined and expanded version of Chinese ConceptNet (the original ConceptNet dataset is in commonsense/conceptnet5).

ConceptNet collects commonsense knowledge from volunteer web users around the world through crowdsourcing. It covers a variety of real-world domains and can be applied to tasks such as analogy, commonsense reasoning, and natural language understanding.

Knowledge acquired from crowds tends to be noisy, redundant, and sometimes meaningless, especially in unguided, unsupervised projects that rely on voluntary participants.

Therefore, we refined ConceptNet to reduce its error rate. At the same time, we improved its quality, where quality refers to correctness, coverage, and the number of concepts.

The table compares the refined and expanded version with the original one.

The expanded ConceptNet is 3.2 times the size of the original. In fact, the true value of each statistic for the original ConceptNet should be smaller than reported, because it includes numerous incorrect concepts.

To turn a concept into a machine-readable representation, a word embedding is needed. When a concept is segmented into several words, the average of those words' embeddings may not fully represent the original meaning and can even differ from it, especially when a segmented word is ambiguous or when the concept is a term that should not be split. Therefore, we reduced the number of concepts that segment into more than one word by up to 44%, so that concepts can be represented more precisely.
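The problem with averaging can be seen in a minimal sketch. Everything here is an assumption made for illustration: the toy three-dimensional vectors, the example concept 熱狗 ("hot dog"), and the helper names `average_embedding` and `concept_embedding` do not come from the dataset or the research reports.

```python
from typing import Dict, List, Optional

Vector = List[float]

# Toy embedding table. The values are made up purely for illustration.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "熱": [1.0, 0.0, 0.0],    # "hot"
    "狗": [0.0, 1.0, 0.0],    # "dog"
    "熱狗": [0.0, 0.0, 1.0],  # "hot dog" (the food), kept as one term
}


def average_embedding(tokens: List[str], table: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of the tokens that exist in the table."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return None  # no token has an embedding
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def concept_embedding(concept: str, tokens: List[str],
                      table: Dict[str, Vector]) -> Optional[Vector]:
    """Use the whole-concept vector when it exists, else fall back to averaging."""
    if concept in table:
        return table[concept]
    return average_embedding(tokens, table)


# Averaging the segmented words mixes "hot" and "dog".
averaged = average_embedding(["熱", "狗"], TOY_EMBEDDINGS)

# Looking up the unsegmented term keeps its own meaning.
direct = concept_embedding("熱狗", ["熱", "狗"], TOY_EMBEDDINGS)

print("averaged:", averaged)
print("direct:  ", direct)
```

In this toy setup the averaged vector sits between "hot" and "dog" and shares nothing with the vector for the food, which is the kind of drift that fewer multi-word concepts would avoid.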

In a text generation task, the human rating score obtained with the expanded ConceptNet is twice as high as the score obtained with the original one.

The data and research reports are available on my GitHub.

Ying-Ren Chen 陳櫻仁

