Domain-specific corpus construction and entity recognition for water intake permission in Changjiang River Basin

    • Abstract: Water intake permit management is an important means of conserving and protecting water resources. Permit management in the Changjiang River Basin has generated a large volume of important textual material, but its analysis and use still rely mainly on manual processing, which is inefficient. To make information extraction from water intake permit documents more intelligent, this study proposes an automated method for building a domain-specific corpus and recognizing entities in water intake permit texts for the Changjiang River Basin. Because terminology in this domain is highly specialized and labeled samples are scarce, we combine expert knowledge and industry standards with a data augmentation method based on domain dictionaries and pre-trained models to build a dedicated water intake permit corpus. Because the documents feature complex sentence structures, varied writing habits, and strong dependence on context, we further propose a multi-feature fusion model for water-resource entity recognition that accurately extracts entities from these specialized texts. Experimental evaluation shows that, when trained on the constructed corpus, entity recognition on water intake permit texts achieves an accuracy of 89.64%, a recall of 88.71%, and an F1-score of 89.26%, and that the total time for business approval is reduced by approximately 66%. The method provides effective support for the automated processing of textual materials in water intake permit management.
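The abstract does not detail the augmentation procedure, so the following is only a minimal sketch of one common dictionary-based approach: swapping each labeled entity for another dictionary term of the same type. It is not the authors' method and omits the pre-trained-model component entirely. The dictionary contents, the entity type names, the BIO tagging scheme, and the function `augment` are all illustrative assumptions.

```python
import random

# Hypothetical domain dictionary (assumption): entity type -> interchangeable terms.
DOMAIN_DICT = {
    "WATER_SOURCE": ["Changjiang River", "Hanjiang River", "Dongting Lake"],
    "PERMIT_HOLDER": ["Plant A", "Waterworks B"],
}


def augment(tokens, tags, dictionary, rng):
    """Return a new (tokens, tags) pair in which every BIO-tagged entity span
    is replaced by a different dictionary term of the same entity type."""
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            # Find the end of the entity span (consecutive I- tags of the same type).
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1
            original = " ".join(tokens[i:j])
            candidates = [t for t in dictionary.get(etype, []) if t != original]
            # Fall back to the original span when the dictionary offers no alternative.
            new = rng.choice(candidates).split() if candidates else tokens[i:j]
            out_tokens.extend(new)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new) - 1))
            i = j
        else:
            # Non-entity tokens are copied unchanged.
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_tokens, out_tags


tokens = ["Plant", "A", "draws", "water", "from", "Dongting", "Lake"]
tags = ["B-PERMIT_HOLDER", "I-PERMIT_HOLDER", "O", "O", "O",
        "B-WATER_SOURCE", "I-WATER_SOURCE"]
new_tokens, new_tags = augment(tokens, tags, DOMAIN_DICT, random.Random(0))
```

Regenerating the tags along with the tokens keeps the labels aligned when the replacement term has a different length from the original, which is what allows the augmented sentences to be used directly as additional training data.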

       
