
IMPROVING QUERY EFFECTIVENESS THROUGH FEATURE-BASED QUERY REFINEMENT

MASc Oral Exam Announcement
By: Mahtab Tamannaee
March 23, 2022

Abstract:

With the massive and fast-growing amount of information on the Web, maintaining the effectiveness of Information Retrieval (IR) is a real challenge. The system in charge of online search must be able to search through billions of documents stored on millions of devices (Manning et al., 2010). Traditional information retrieval systems answer input queries primarily by relying on lexical similarity and exact term matching between query and documents using frequency-based methods. In other words, the relevance of a document to a query is judged by how closely the distribution of words in the candidate document matches that of the query. Since the lexical content of the optimal response is not usually known to the user, the user formulates a query with vocabulary that may have minimal overlap with the vocabulary appearing in its optimal document. This low overlap between query and document vocabulary is called term mismatch, and it manifests in retrieval results as poor recall. The term mismatch problem has also been referred to as the lexical gap or lexical chasm, with the query on one side of the gap and documents on the other. IR systems use different techniques to bridge the lexical chasm and solve the term mismatch problem, and many query refinement techniques have already been developed. Given a user query, each refinement technique outputs a modified version of that query that can serve as an arch over the lexical gap from the query side to the document side.

This thesis addresses the lexical chasm problem by studying different unsupervised query refinement methods and, ultimately, building an extensible framework that predicts the most effective query refinement method for any given input query by linking it to its best reformulation. This research topic necessitates an extensive comparison-based study of different lexical query refinement methods over a reliable query refinement dataset. To facilitate this line of research, as the first contribution of this thesis, a configurable framework, ReQue, has been designed and implemented to generate gold standard datasets by taking three inputs: (1) a dataset of queries along with their associated relevance judgments (e.g., TREC topics), (2) an information retrieval method (e.g., BM25), and (3) an evaluation metric (e.g., MAP).
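To make this generation process concrete, the following Python sketch shows what such a gold-standard loop might look like under these three inputs: for each query, every refinement method is applied, its result is retrieved and evaluated against the relevance judgments, and only refinements that improve on the original query are kept. The refiners, the retrieve function, the evaluation function, and the toy data below are hypothetical stand-ins for illustration, not the actual ReQue implementation.

# Minimal sketch of a ReQue-style gold-standard generation loop.
# All names below (refiners, retrieve, evaluate, the toy data) are
# hypothetical stand-ins used for illustration, not the actual ReQue code.

def build_gold_dataset(queries, qrels, refiners, retrieve, evaluate):
    """For each query, keep only the refinements that improve the chosen metric."""
    gold = {}
    for qid, query in queries.items():
        baseline = evaluate(retrieve(query), qrels[qid])
        improving = []
        for name, refine in refiners.items():
            refined_query = refine(query)
            score = evaluate(retrieve(refined_query), qrels[qid])
            if score > baseline:  # a refinement must beat the original query
                improving.append((name, refined_query, score))
        # sort best-first so the top entry is this query's best reformulation
        gold[qid] = sorted(improving, key=lambda x: x[2], reverse=True)
    return gold


if __name__ == "__main__":
    # Toy stand-ins: in the thesis these would be TREC topics and qrels,
    # a BM25 retriever, and MAP as the evaluation metric.
    queries = {"301": "international organized crime"}
    qrels = {"301": {"doc1", "doc7"}}
    refiners = {
        "synonym_expansion": lambda q: q + " criminal syndicate",
        "stemming": lambda q: " ".join(w.rstrip("s") for w in q.split()),
    }
    retrieve = lambda q: ["doc1", "doc3", "doc7"]                       # dummy ranked list
    evaluate = lambda ranking, rel: len(set(ranking) & rel) / len(rel)  # dummy recall-like metric
    print(build_gold_dataset(queries, qrels, refiners, retrieve, evaluate))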
As one of the contributions of ReQue, we fed the ReQue framework with TREC collections (Robust04, ClueWeb09, ClueWeb12, and Gov2) and generated gold standard datasets, called the ReQue datasets, based on the TREC topics (queries). Leveraging these ReQue datasets in the second phase of this thesis, a learning-to-rank framework is adopted within which 45 features are introduced and systematically classified in order to identify the best reformulation of any given input query. Features are broadly classified into (1) count-based, (2) retrieval-based, (3) content-based, and (4) text similarity-based categories. Results show that text similarity-based features built on external embeddings, as well as content-based features built on user readability metrics, are effective for identifying the most suitable query refinement method for an input query.
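As a rough illustration of this second phase, the sketch below uses a pointwise stand-in for the learning-to-rank setup: each (query, refinement) pair is described by one feature from each of the four categories, a model learns to predict its retrieval gain, and the refinement with the highest predicted gain is selected. The feature values, the toy labels, and the choice of scikit-learn's GradientBoostingRegressor are illustrative assumptions, not the configuration used in the thesis.

# Pointwise stand-in for the learning-to-rank stage (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row: [count-based (e.g., query length), retrieval-based (e.g., clarity),
#            content-based (e.g., readability), similarity-based (e.g., embedding cosine)]
X_train = np.array([
    [3, 0.42, 55.0, 0.81],
    [7, 0.17, 38.5, 0.33],
    [4, 0.61, 62.1, 0.74],
    [6, 0.25, 41.0, 0.40],
])
# Target: observed MAP gain of each refinement over the original query (toy values)
y_train = np.array([0.12, -0.03, 0.20, 0.01])

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# At inference time, score every candidate refinement of a new query
# and pick the one with the highest predicted gain.
candidates = {
    "synonym_expansion": [4, 0.50, 58.0, 0.78],
    "stemming":          [4, 0.30, 58.0, 0.52],
}
predicted = {name: model.predict([feats])[0] for name, feats in candidates.items()}
best = max(predicted, key=predicted.get)
print(predicted, "->", best)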

(Anyone can attend this Zoom session. Please send an email to Dr. Ebrahim Bagheri requesting the Zoom URL and passcode.)