Abstract:
Water licensing management is an important measure for conserving and protecting water resources. Water licensing management in the Changjiang River Basin has generated a large volume of important textual material, yet the analysis and use of this material still rely primarily on inefficient manual processing. To improve intelligent information extraction from water licensing documents, this study proposes an automated approach to constructing a domain-specific corpus and performing entity recognition for water licensing management in the Changjiang River Basin. Because the domain has highly specialized terminology and little labeled data, we propose a data augmentation method based on domain dictionaries and pre-trained models, which integrates expert knowledge and industry standards to build a dedicated corpus. Furthermore, to handle the complex sentence structures, varied linguistic styles, and strong context dependency found in water licensing documents, we develop a multi-feature fusion entity recognition model to accurately extract entities from domain-specific texts. Experimental results show that, using the constructed domain corpus, the proposed method achieves an accuracy of 89.64%, a recall of 88.71%, and an F1-score of 89.26% for entity recognition in water licensing texts. Moreover, the total processing time for business approval is reduced by approximately 66%. The proposed method thus provides effective support for the automated processing of textual materials in water licensing management.