Data enhancement and named entity recognition for knowledge extraction from Yangtze River water intake permit management documentation
Abstract
Water intake permit management is a crucial tool for conserving and protecting water resources. In the Yangtze River Basin, the permit management process generates a large volume of important textual data, but analyzing and using this data still depends heavily on manual work, which is inefficient. To raise the level of automation in extracting information from water intake permit texts, this paper proposes a method for constructing a professional corpus and automating entity recognition for the water intake permit domain in the Yangtze River Basin. Because domain-specific terms are highly technical and labeled samples are scarce, a data augmentation approach based on dictionaries and pre-trained models is introduced, incorporating expert knowledge and industry standards to build the professional corpus. To handle the complex sentence structures, diverse language patterns, and strong contextual dependencies of water intake permit texts, a multi-feature fusion water resource entity recognition model is proposed to extract entities accurately from the domain-specific text. Experimental evaluation shows that training on the specialized corpus achieves an accuracy of 89.64%, a recall of 88.71%, and an F1 score of 89.26% for entity recognition on water intake permit text data. In addition, the overall business approval time is reduced by 66%, providing effective support for automating water intake permit text processing.
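As an illustration of the dictionary-based part of the augmentation idea, the following minimal Python sketch replaces each labeled entity in a BIO-tagged sentence with another entry of the same type drawn from a domain dictionary. It is a sketch under stated assumptions and not the implementation used in this paper: the function name `augment_by_dictionary`, the entity type `WATER`, the example sentence, and the dictionary entries are all hypothetical, and the pre-trained-model component of the approach is not shown.

```python
import random


def augment_by_dictionary(tokens, tags, dictionary, rng):
    """Return a new (tokens, tags) pair in which every BIO-tagged entity span
    is replaced by a randomly chosen dictionary entry of the same type.

    Assumption: `dictionary` maps an entity type to a list of non-empty
    token lists. Types missing from the dictionary are left unchanged.
    """
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            # Find the end of the current entity span (B- followed by I- tags).
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1
            candidates = dictionary.get(etype)
            replacement = rng.choice(candidates) if candidates else tokens[i:j]
            # Re-label the replacement so the tags stay aligned with the tokens.
            new_tokens.extend(replacement)
            new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(replacement) - 1))
            i = j
        else:
            # Non-entity tokens are copied through unchanged.
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags


# Hypothetical example sentence and dictionary (illustrative only).
tokens = ["The", "permit", "covers", "Three", "Gorges", "Reservoir", "intake"]
tags = ["O", "O", "O", "B-WATER", "I-WATER", "I-WATER", "O"]
dictionary = {"WATER": [["Han", "River"], ["Dongting", "Lake"]]}

new_tokens, new_tags = augment_by_dictionary(tokens, tags, dictionary, random.Random(0))
print(list(zip(new_tokens, new_tags)))
```

Re-labeling the substituted span keeps the augmented sentence usable as training data even when the replacement has a different number of tokens than the original entity.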