🔍 mSearch finds documents with similar meaning 🔍
Below is a very basic collection configuration for news texts:
{
"collection_name": "news1",
"collection_id": "ef014064-1a99-4360-847c-f60071892146",
"semantic_config": {
"enabled": true,
"transformer_model_id": "paraphrase-multilingual-mpnet-base-v2"
},
"fulltext_config": {
"enabled": true,
"engine": "elastic"
},
"collection_schema": {
"fields": [
{
"field": "news_id",
"type": "id"
},
{
"field": "heading",
"type": "text",
"language": "en",
"sources": ["h1", "h2", "title"]
},
{
"field": "content",
"type": "text",
"language": "en",
"sources": ["p"]
},
{
"field": "category",
"type": "enum_text",
"language": "en"
},
{
"field": "author",
"type": "enum_text",
"language": "en"
}
]
},
"search_config": {
"max_best": 30,
"search_mode": "hybrid",
"restore_diacritics": true
},
"answer_config": {
"language": "en",
"max_tokens_in_text": 4000
},
"ingestion_config": {
"granularity": "page",
"skip_pages": [0]
}
}
The collection configuration consists of:
semantic_config
: enables or disables semantic search and configures its parameters.
fulltext_config
: enables or disables TF/IDF full-text search and configures its parameters.
search_config
: configures how search works at query time. Most parameters specified here set the default behavior and can be overridden for each query.
answer_config
: configures question answering options for retrieval-augmented generation (RAG), where mSearch hooks up with a suitable LLM to answer questions based on retrieved documents.
ingestion_config
: specifies how documents are ingested. The granularity parameter can be set to page or paragraph; the default value is paragraph. The skip_pages parameter can be used to skip the given pages of a document (pages are counted from 0); the default value is an empty list, and the parameter currently works for PDF and HTML. The read_all_as_text parameter can be set to true to read all text from Excel or CSV files and treat it as plain text; the default value is false.
collection_schema
: specifies the document fields that are relevant for searching. Each field should define its type (text by default) and can specify the prevailing language for the content of that field, such as cs or en. The optional sources property lists the sources from which a field can be populated from ingested documents; see Ingesting documents.
Major field types include:
id
: exactly one of the fields must be of type id. It uniquely identifies each document within the collection, which matters, for example, when uploading a new version of a document.
text
: free-form, potentially long text content; included in search, cannot be used for filtering.
enum_text
: text content that is usually shorter; included in search, can be used for filtering search results.
enum
: excluded from search; can be used for filtering search results.
For most applications, keeping both semantic and TF/IDF search enabled produces the best results. That is, both semantic_config.enabled and fulltext_config.enabled should be true, which is also the default if they are left out.
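The schema rules above can be sketched as a small client-side check. This is only an illustration under stated assumptions: the helper names (field_type, check_schema, fields_of, is_enabled) are hypothetical and not part of mSearch, and the code merely restates the rules described in this section (exactly one id field, text as the default type, which types are searchable or filterable, and enabled defaulting to true).

```python
# Hypothetical helpers, not part of mSearch: they only restate the rules above.
SEARCHABLE_TYPES = {"text", "enum_text"}   # included in search
FILTERABLE_TYPES = {"enum_text", "enum"}   # usable for filtering search results


def field_type(field: dict) -> str:
    # A field without an explicit type is treated as text, the documented default.
    return field.get("type", "text")


def check_schema(config: dict) -> list:
    """Return a list of problems; an empty list means the check passed."""
    fields = config.get("collection_schema", {}).get("fields", [])
    id_fields = [f["field"] for f in fields if field_type(f) == "id"]
    problems = []
    if len(id_fields) != 1:
        problems.append(
            f"expected exactly one field of type id, found {len(id_fields)}"
        )
    return problems


def fields_of(config: dict, types: set) -> list:
    """Names of the schema fields whose type is in the given set."""
    fields = config.get("collection_schema", {}).get("fields", [])
    return [f["field"] for f in fields if field_type(f) in types]


def is_enabled(config: dict, section: str) -> bool:
    # semantic_config.enabled and fulltext_config.enabled default to true.
    return config.get(section, {}).get("enabled", True)


# A trimmed-down config used only for this illustration.
example = {
    "collection_name": "news1",
    "fulltext_config": {"enabled": True, "engine": "elastic"},
    "collection_schema": {
        "fields": [
            {"field": "news_id", "type": "id"},
            {"field": "heading", "language": "en"},
            {"field": "category", "type": "enum_text"},
        ]
    },
}

print(check_schema(example))
print(fields_of(example, SEARCHABLE_TYPES))
print(fields_of(example, FILTERABLE_TYPES))
print(is_enabled(example, "semantic_config"))
```

Running such a check before uploading a configuration catches a missing or duplicated id field early. Since the list of field types above is described as the major ones, the sketch deliberately does not reject unknown types.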
{
"collection_name": "news1",
"collection_id": "ef014064-1a99-4360-847c-f60071892146",
"semantic_config": {
"enabled": true,
"transformer_model_id": "paraphrase-multilingual-mpnet-base-v2",
"max_tokens_per_vector": -1,
"vector_to_document_factor": 4.0,
"hnsw_m": 48,
"hnsw_ef_construction": 200,
"hnsw_ef_factor": 100
},
"fulltext_config": {
"enabled": true,
"engine": "elastic"
},
"collection_schema": {
"fields": [
{
"field": "news_id",
"type": "id"
},
{
"field": "heading",
"type": "text",
"language": "en",
"index_standalone": true,
"boost": 1.5,
"sources": ["h1", "h2", "title"]
},
{
"field": "content",
"type": "text",
"language": "en",
"sources": ["p"]
},
{
"field": "category",
"type": "enum_text",
"language": "en"
},
{
"field": "author",
"type": "enum_text",
"language": "en",
"source_path": "/metadata/author"
}
]
},
"search_config": {
"max_best": 30,
"search_mode": "hybrid",
"restore_diacritics": true
},
"answer_config": {
"language": "en",
"provider": "azure",
"service": "llm1-themamaai",
"engine": "gpt35turbo",
"max_tokens_in_text": 4000,
"max_tokens_in_response": null,
"min_sentences": null,
"max_sentences": null,
"temperature": null,
"priming": "You are a newsroom agent who can comment on news on a chosen topic."
}
}
In the above collection config, several more config parameters are used, including:
field.index_standalone: true
: causes semantic search to create standalone vectors just for this field.
field.boost: 1.5
: optional boost factor that promotes documents with matches in this field. The default boost is 1.0. Only affects full-text search, not semantic search.
field.source_path
: specifies an XPath-like location of this field's values when ingesting, for example /metadata/author for a document such as:
{
"news_id": "n1",
"metadata": {
"author": "Charlie"
}
}
search_config.restore_diacritics
: useful for the languages cs and sk when users may type their queries unaccented. If set to true, accents are automatically restored in the query text before the search starts.
search_config.rescore_method
: specifies the method used to rescore the search results. The default value is none. The possible values are:
none
: returns the top max_best documents based on their scores. If search_mode is hybrid, the top max_best documents are returned based on the combined scores of semantic and full-text search.
llm
: the rescoring behavior depends on search_mode:
If search_mode is hybrid, the results from both semantic and full-text search are combined into one set. The LLM then evaluates the relevance of the documents in this set and rescales the scores based on relevance, semantic rank, and full-text rank. Finally, the max_best documents with the highest scores are returned.
If search_mode is either semantic or fulltext, the relevant documents are returned; if there are fewer than max_best relevant documents, the result is filled up with non-relevant ones.
answer_config.*
: these parameters configure the LLM used for question answering and control how retrieved documents are summarized for the LLM.
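The selection steps described for rescore_method llm can be sketched roughly as follows. This is a minimal illustration, not mSearch's implementation: the function names are hypothetical, is_relevant stands in for the LLM's relevance judgment, and keeping the original rank order within each group is an assumption. The actual score rescaling used in hybrid mode is not specified here, so it is left out.

```python
from typing import Callable, Iterable, List


def hybrid_candidates(semantic_ids: Iterable[str],
                      fulltext_ids: Iterable[str]) -> List[str]:
    """Combine both result lists into one set of candidates.

    Duplicates are dropped and first-seen order is preserved (an assumption).
    """
    seen = set()
    combined = []
    for doc_id in list(semantic_ids) + list(fulltext_ids):
        if doc_id not in seen:
            seen.add(doc_id)
            combined.append(doc_id)
    return combined


def fill_to_max_best(ranked_ids: List[str],
                     is_relevant: Callable[[str], bool],
                     max_best: int) -> List[str]:
    """Single-mode case (semantic or fulltext).

    Documents judged relevant come first; if there are fewer than max_best
    of them, the remaining slots are filled with non-relevant documents.
    """
    relevant = [d for d in ranked_ids if is_relevant(d)]
    not_relevant = [d for d in ranked_ids if not is_relevant(d)]
    return (relevant + not_relevant)[:max_best]


# Stand-in for the LLM judgment: here only "b" and "d" count as relevant.
judged_relevant = {"b", "d"}

print(hybrid_candidates(["a", "b"], ["b", "c"]))
print(fill_to_max_best(["a", "b", "c", "d"],
                       lambda d: d in judged_relevant, max_best=3))
```

In the second call only two documents are judged relevant, so the third slot is taken by the best-ranked non-relevant document.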