Advanced user query processing for related searches

October 22, 2024

One of the main functionalities of the FAME platform is a search engine that enables data asset discovery. For that purpose, the interests a user expresses in the search box must be processed so that the engine displays not only the most closely related results, but also additional ones that may be of interest.

Our goal, in addition to offering the most closely related results, is to promote the usage of the platform and boost the data commercialization market. Implementing this feature requires combining several technologies and approaches:

Query Understanding

The first step is to break down the query into individual tokens and normalize them (Lexical Analysis). This involves tokenization to split the query into words, lowercasing, removing stop words, and stemming or lemmatization to reduce words to their base forms.
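The lexical analysis steps can be illustrated with a minimal sketch. The stop-word list and the `naive_stem` helper below are assumptions made for illustration only; a production system would more likely rely on a proper stemmer or lemmatizer.

```python
import re

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "or", "to", "about"}


def naive_stem(token: str) -> str:
    """Very rough suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "s"):
        if (
            token.endswith(suffix)
            and not token.endswith("ss")
            and len(token) - len(suffix) >= 3
        ):
            return token[: -len(suffix)]
    return token


def normalize_query(query: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and reduce tokens to a base form."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize_query("The Prices of Energy Datasets"))
```

Under these assumptions, the query is reduced to a short list of normalized terms that the later stages can work with.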

Next, named entities in the query, such as organizations, people, or specific data asset types, need to be identified and classified (Named Entity Recognition, NER), which is crucial for understanding the query intent.
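As a rough illustration of entity tagging, the sketch below uses a hand-written dictionary lookup (a gazetteer) rather than a trained NER model. The entries, the labels, and the `tag_entities` function are hypothetical and only meant to show the kind of output this step produces.

```python
import re

# Hypothetical gazetteer: phrase -> entity label (illustrative entries only).
GAZETTEER = {
    "eurostat": "ORGANIZATION",
    "world bank": "ORGANIZATION",
    "time series": "DATA_ASSET_TYPE",
    "dataset": "DATA_ASSET_TYPE",
}


def tag_entities(query: str) -> list[tuple[str, str]]:
    """Return (matched text, label) pairs in the order they appear in the query."""
    found = []
    for phrase, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, query, flags=re.IGNORECASE):
            found.append((match.start(), match.group(0), label))
    # Sort by position so the output follows the query's word order.
    return [(text, label) for _, text, label in sorted(found)]


print(tag_entities("World Bank time series on inflation"))
```

A dictionary lookup cannot handle unseen names or ambiguous phrases, which is why statistical or neural NER models are normally preferred for this step.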

Semantic Processing

Word embedding technologies are used to represent query terms in a dense vector space, which allows semantic relationships between terms to be captured. The main approaches are pre-trained models such as Word2Vec, GloVe, or FastText.

To generate contextual embeddings for the query, capturing nuanced meanings and relationships, state-of-the-art transformer language models such as BERT, RoBERTa, or T5 are the most common choice.
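To show how a dense vector space captures relatedness, the sketch below compares terms by cosine similarity. The three-dimensional vectors are invented for illustration; real pre-trained embeddings typically have hundreds of dimensions and would be loaded from a model file.

```python
import math

# Toy vectors, invented for illustration (not taken from any real model).
TOY_EMBEDDINGS = {
    "energy": [0.9, 0.1, 0.0],
    "electricity": [0.8, 0.2, 0.1],
    "health": [0.0, 0.9, 0.3],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(term: str, k: int = 1) -> list[str]:
    """Return the k vocabulary terms closest to `term` in the toy vector space."""
    target = TOY_EMBEDDINGS[term]
    scored = [
        (cosine_similarity(target, vec), other)
        for other, vec in TOY_EMBEDDINGS.items()
        if other != term
    ]
    return [word for _, word in sorted(scored, reverse=True)[:k]]


print(most_similar("energy"))
```

With these toy vectors, "electricity" lies much closer to "energy" than "health" does, which is the kind of relationship that related-search suggestions can build on.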

Retrieval and Ranking

To match query terms with data asset descriptions, traditional information retrieval techniques using Term Frequency-Inverse Document Frequency (TF-IDF) are used.
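A minimal sketch of TF-IDF scoring is given below, assuming whitespace tokenization and a smoothed IDF variant. The `rank_by_tfidf` function and the sample descriptions are hypothetical and do not reflect the platform's actual implementation.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def rank_by_tfidf(query: str, descriptions: list[str]) -> list[tuple[int, float]]:
    """Return (description index, score) pairs, best match first, zero scores dropped."""
    docs = [tokenize(d) for d in descriptions]
    n_docs = len(docs)

    # Document frequency: in how many descriptions each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed IDF (one common variant), so rare terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1
                score += (tf[term] / len(doc)) * idf
        scores.append((idx, score))

    ranked = sorted(scores, key=lambda pair: (-pair[1], pair[0]))
    return [(idx, score) for idx, score in ranked if score > 0]


descriptions = [
    "hourly energy consumption by household",
    "hospital admissions by region",
    "energy prices and energy taxes in europe",
]
for idx, score in rank_by_tfidf("energy prices", descriptions):
    print(idx, round(score, 3), descriptions[idx])
```

Descriptions that share more, and rarer, terms with the query are ranked first, while descriptions with no matching term are left out.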

Query Expansion

Expanding the query with synonyms or related terms can help capture a broader range of relevant data assets. This can be combined with pseudo-relevance feedback, in which the top-ranked results are used to expand the original query, refining the search and generating more accurate related topics.
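Both expansion strategies can be sketched in a few lines. The synonym table, the stop-word list, and the function names below are assumptions for illustration; in practice the synonyms might come from a thesaurus or from embedding neighbours.

```python
from collections import Counter

# Hypothetical synonym table and stop-word list, for illustration only.
SYNONYMS = {
    "energy": ["power", "electricity"],
    "price": ["cost", "tariff"],
}
STOP_WORDS = frozenset({"by", "in", "of", "the", "and"})


def expand_with_synonyms(terms: list[str]) -> list[str]:
    """Append known synonyms of each query term, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def expand_with_feedback(terms: list[str], top_docs: list[str], n_new: int = 2) -> list[str]:
    """Pseudo-relevance feedback: add the most frequent new terms from top results."""
    counts = Counter()
    for doc in top_docs:
        for token in doc.lower().split():
            if token not in terms and token not in STOP_WORDS:
                counts[token] += 1
    # Most frequent first; ties broken alphabetically for a stable result.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_new]
    return list(terms) + [token for token, _ in best]


print(expand_with_synonyms(["energy", "price"]))
print(expand_with_feedback(
    ["energy"],
    ["energy consumption by household", "household energy consumption in europe"],
))
```

The feedback variant assumes that the top-ranked results are relevant, so a poor initial ranking can pull the expanded query off topic.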

Personalization (Only for registered users)

User behaviour and preferences are incorporated to recommend data assets and topics that similar users have found relevant. This can be built on user profiles derived from past queries and interactions, so that topic suggestions are tailored to individual interests.

By combining these approaches, a robust system can be developed to process queries effectively and generate relevant related topics for data asset discovery.
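One simple way to recommend what similar users found relevant is user-based collaborative filtering. The sketch below assumes each profile is just the set of data assets a user has interacted with and measures similarity with the Jaccard index; the user names and asset identifiers are invented.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two sets of assets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(user: str, interactions: dict[str, set[str]], k: int = 3) -> list[str]:
    """Suggest up to k unseen assets, weighted by how similar their users are."""
    seen = interactions[user]
    scores: dict[str, float] = {}
    for other, assets in interactions.items():
        if other == user:
            continue
        similarity = jaccard(seen, assets)
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for asset in assets - seen:
            scores[asset] = scores.get(asset, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [asset for asset, _ in ranked[:k]]


# Invented interaction history, for illustration only.
interactions = {
    "alice": {"energy-prices", "grid-load"},
    "bob": {"energy-prices", "grid-load", "solar-output"},
    "carol": {"hospital-admissions"},
}
print(recommend("alice", interactions))
```

Since this relies on stored interaction history, it fits the restriction of personalization to registered users.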

More info:
Jot Internet Media