Abstract:
Water resources inspections often require retrieving pertinent clauses from extensive collections of legal documents and technical standards based on field observations, a process that demands specialized expertise and remains largely inefficient. Although recent advances in natural language processing have facilitated the development of intelligent retrieval systems, a fundamental semantic mismatch persists between concrete problem descriptions and the abstract, generalized language found in legal documents and technical standards. To bridge this gap, we propose a Hierarchical Retrieval and Optimization Method (HROM), which uses an embedding model to generate text vectors and a reranking model to perform deep semantic reranking. We further construct a hybrid training corpus that integrates authentic inspection records with synthetically generated data to enhance both the embedding and reranking models. Experimental results demonstrate that our approach significantly outperforms existing state-of-the-art retrievers and large language models in overall accuracy. The proposed method is extensible and lays the groundwork for content retrieval in other domain-specific applications.