Research on the optimization of question retrieval models for water conservancy inspection

    • Abstract: In water conservancy inspection, staff frequently need to locate the specific clauses, within a large body of laws, regulations, and technical standards, that relate to problems found on site. The task demands specialized expertise and is slow and inefficient. With the rapid development of artificial intelligence in recent years, knowledge retrieval systems based on natural language models have become widely used and offer a new approach to this task. However, inspection problem descriptions tend to be concrete, whereas clauses are written in rigorous, abstract, and generalized language; the resulting semantic gap makes it difficult for general-purpose retrieval models to match the two accurately. To bridge this gap, we propose a Hierarchical Retrieval and Optimization Method (HROM), which uses an embedding model to generate text vectors and a reranking model to rerank the results by deep semantic relevance. To improve retrieval quality, we construct a retrieval dataset that combines real data extracted from historical inspection reports with synthetic data generated by a large language model, and we use it to optimize both the embedding model and the reranking model. To validate the method, we compare it against current state-of-the-art retrieval models and large language models; it achieves the best overall performance. The approach is extensible and lays a foundation for content retrieval research in other business domains.
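The two-stage flow described in the abstract (vector retrieval followed by reranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation: `embed`, `cosine`, `rerank_score`, and `retrieve` are hypothetical names, the character-bigram vectors stand in for a trained embedding model, and the word-overlap score stands in for a trained reranking model. The example clauses and query are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag of character bigrams."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query: str, clause: str) -> float:
    """Toy stand-in for a reranking model: word-level Jaccard overlap."""
    q = set(query.lower().split())
    c = set(clause.lower().split())
    return len(q & c) / len(q | c) if (q | c) else 0.0


def retrieve(query: str, clauses: list[str], top_k: int = 3, top_n: int = 1) -> list[str]:
    """Stage 1: keep the top_k clauses by vector similarity.
    Stage 2: rerank those candidates and return the top_n."""
    query_vec = embed(query)
    candidates = sorted(
        clauses, key=lambda c: cosine(query_vec, embed(c)), reverse=True
    )[:top_k]
    return sorted(
        candidates, key=lambda c: rerank_score(query, c), reverse=True
    )[:top_n]


# Invented example data (not taken from the paper's dataset).
clauses = [
    "Embankment slopes shall be inspected for cracks and seepage before flood season.",
    "Construction contracts shall be signed before work begins.",
    "Project funds shall be used only for approved purposes.",
]
query = "cracks and seepage found on the embankment slope"

best = retrieve(query, clauses, top_k=2, top_n=1)
print(best)
```

In a real system, the first stage would typically query a vector index built from a fine-tuned embedding model, and the second stage would score each query-clause pair jointly with a fine-tuned reranker. The split keeps the expensive pairwise scoring limited to a small candidate set.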

       
