
Corpus-based research on English word recognition rates in primary school and word selection strategy Article

Wen-yan XIAO, Ming-wen WANG, Zhen WENG, Li-lin ZHANG, Jia-li ZUO

Frontiers of Information Technology & Electronic Engineering 2017, Volume 18, Issue 3, Pages 362-372 doi: 10.1631/FITEE.1601118

Abstract: In this paper, we develop an English webpage corpus (EWC) and create a word frequency list using web … By comparing EWC word lists with the British National Corpus (BNC), we find that the BNC word frequency … comparing the word frequency lists of several corpora, including EWC, BNC, SUBTLEX-US, and Subtitle Corpus …

Keywords: Corpus; Primary English; Recognition rate; Word frequency; Coverage rate

Automatically building large-scale named entity recognition corpora from Chinese Wikipedia

Jie ZHOU, Bi-cheng LI, Gang CHEN

Frontiers of Information Technology & Electronic Engineering 2015, Volume 16, Issue 11, Pages 940-956 doi: 10.1631/FITEE.1500067

Abstract: Named entity recognition (NER) is a core component in many natural language processing applications. Most NER systems rely on supervised machine learning methods, which depend on time-consuming and expensive annotations in different languages and domains. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel and language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; a combined method is then used to improve precision and coverage. Next, we identify the types of implicit mentions by using the boundary information of outgoing links. By selecting sentences related to the domains of the test data, we can train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results show the effectiveness of the automatically annotated corpora, and the trained NER models achieve the best performance when our silver-standard corpora are combined with gold-standard corpora.

Keywords: NER corpora; Chinese Wikipedia; Entity classification; Domain adaptation; Corpus selection

Zipfian interpretation of textbook vocabulary lists: comments on Xiao et al.’s Corpus-based research Correspondence

Qiong HU, Ming YUE

Frontiers of Information Technology & Electronic Engineering 2017, Volume 18, Issue 7, Pages 863-866 doi: 10.1631/FITEE.1700418

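Abstract: Based on a comparative analysis of four corpora, Xiao et al. (2017) argued that the growth in the word recognition rate of sixth-grade primary school students in China is unsatisfactory. They recommended adding 903 words to the existing 726-word total vocabulary of the People's Education Press general-purpose primary school English textbooks, and deleting low-frequency words such as twelfth (the ordinal number). As foreign language teachers and linguists, we endorse their use of advanced information technology to evaluate traditional word lists, but we believe that this work: (1) should attend to the soundness of text sampling when building a reference corpus; (2) should take Zipf's law (English word frequency is inversely proportional to word rank) into account when interpreting word frequencies, since it is reasonable for the growth in recognition rate to slow as vocabulary size increases; (3) should consider practical constraints such as the cognitive characteristics and academic workload of primary school pupils, as well as the overall goals of language education, when proposing a textbook word selection strategy, so that culture-bearing words such as twelfth are not casually deleted; and (4) should recognize that compiling nationally used foreign language textbooks for school-age children is a complex systems engineering task that requires the joint attention of experts from many fields.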

Keywords: Zipf's law; Corpus; English; Textbook; Word list
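
The Zipfian argument in the abstract above (recognition-rate growth slows as vocabulary grows) can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the method of either paper: it assumes an idealized Zipf distribution with exponent 1 over a hypothetical vocabulary of 50,000 word types (`VOCAB_SIZE` is an illustrative assumption), so the share of running text covered by the top k words is H_k / H_N, where H_n is the n-th harmonic number. The figures 726 and 903 are the word counts quoted in the abstract; the function names are hypothetical.

```python
def harmonic(n: int) -> float:
    """Return the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n (H_0 = 0)."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def zipf_coverage(k: int, vocab_size: int) -> float:
    """Share of running text covered by the k most frequent words,
    assuming word frequency is exactly proportional to 1/rank."""
    if not 0 <= k <= vocab_size:
        raise ValueError("k must be between 0 and vocab_size")
    return harmonic(k) / harmonic(vocab_size)


# Hypothetical vocabulary size, chosen only for illustration.
VOCAB_SIZE = 50_000

# Word counts quoted in the abstract: 726 existing words, 903 proposed additions.
base_words = 726
extended_words = 726 + 903

base_cov = zipf_coverage(base_words, VOCAB_SIZE)
extended_cov = zipf_coverage(extended_words, VOCAB_SIZE)

print(f"coverage of top {base_words} words: {base_cov:.3f}")
print(f"coverage of top {extended_words} words: {extended_cov:.3f}")
print(f"gain from the extra 903 words: {extended_cov - base_cov:.3f}")
```

Under this idealized model, the additional 903 words contribute far less coverage than the first 726, which is the diminishing-returns pattern the authors describe as reasonable. Real corpora deviate from an exact 1/rank law, so the printed values are indicative only.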
