🔍 mSearch finds documents with similar meaning 🔍
Below is a very basic collection configuration for news texts:
{
"collection_name": "news1",
"collection_id": "ef014064-1a99-4360-847c-f60071892146",
"semantic_config": {
"enabled": true,
"transformer_model_id": "paraphrase-multilingual-mpnet-base-v2"
},
"fulltext_config": {
"enabled": true,
"engine": "elastic"
},
"collection_schema": {
"fields": [
{
"field": "news_id",
"type": "id"
},
{
"field": "heading",
"type": "text",
"language": "en",
"sources": ["h1", "h2", "title"]
},
{
"field": "content",
"type": "text",
"language": "en",
"sources": ["p"]
},
{
"field": "category",
"type": "enum_text",
"language": "en"
},
{
"field": "author",
"type": "enum_text",
"language": "en"
}
]
},
"search_config": {
"max_best": 30,
"search_mode": "hybrid",
"restore_diacritics": true
},
"answer_config": {
"language": "en",
"max_tokens_in_text": 4000
},
"ingestion_config": {
"granularity": "page",
"skip_pages": [0],
}
}
The collection configuration consists of:

- semantic_config: enables/disables semantic search and configures its parameters.
  - enabled: default: true. For most applications, keeping both semantic and fulltext search enabled produces the best results.
  - transformer_model_id: the identifier of a transformer model used to transform text into semantic vectors. This can be one of these locally served models:
    - intfloat/multilingual-e5-small (recommended for most use cases)
    - intfloat/multilingual-e5-large (slightly more precise, slightly slower, larger semantic index)
    - paraphrase-multilingual-mpnet-base-v2 (legacy)
    Alternatively, semantic_config can be set to use an online OpenAI embedding model, such as text-embedding-3-large. To enable this, contact The Mama AI.
  - max_tokens_per_vector: optional, integer. Limits the number of tokens used to construct the passages that are then translated to semantic vectors. E.g. set this to 512 if you would like your documents to be split into passages of that granularity. Semantic search works at the level of passages; the top documents returned are those that contain the passages most similar to the user query. If omitted, the transformer model's maximum token limit is used, producing the longest passages the model can handle. A sketch follows this block.
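For illustration, a minimal semantic_config sketch that splits documents into passages of at most 512 tokens, using the locally served model recommended above:

"semantic_config": {
  "enabled": true,
  "transformer_model_id": "intfloat/multilingual-e5-small",
  "max_tokens_per_vector": 512
}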
- fulltext_config: enables/disables TF/IDF full-text search and configures its parameters.
  - enabled: default: true. For most applications, keeping both semantic and fulltext search enabled produces the best results.
  - engine: the fulltext engine to use for classic TF/IDF search (default: elastic, else solr)
  - use_default_stop_words: default: true to use the language's default stopword list (defined per field)
  - stop_words: optional; may provide custom stop words per language, e.g. "stop_words": { "cs": ["book", "read"] }. Stop words only apply to fulltext search. Default and custom stop words can be used together, only one of them, or both disabled. A sketch follows this block.
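As a sketch, a fulltext_config that keeps the default per-field stopword lists and adds a few custom English stop words (the words are invented for this example):

"fulltext_config": {
  "enabled": true,
  "engine": "elastic",
  "use_default_stop_words": true,
  "stop_words": { "en": ["breaking", "exclusive"] }
}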
- search_config: configures how search works at query time. Most parameters specified here set the default behavior and can be overridden for each query. Useful properties:
  - synonyms: used to expand parts of the user query with synonyms. Each key identifies the text in the query that should be augmented with its synonyms. Values are lists of synonyms to add to the query, right after the text identified by the key. The keys are of 2 kinds: (1) plain texts, such as "tldr" in the sample configuration below, are exact-matched on whole words, ignoring case; (2) keys delimited with slashes (/.../) are parsed as regular expressions and may contain the suffix flags i (ignore case) and a (ignore character accents). Regexes can match parts of words; if you want to avoid that, surround your regex with \b word boundaries. An expansion sketch follows this block.
  - restore_diacritics: true/false: optionally attempts to restore accents (default: false)
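For illustration, a hedged synonyms sketch (the wfh entry is invented for this example; note that \b must be written as \\b inside a JSON string):

"search_config": {
  "synonyms": {
    "tldr": ["too long didn't read"],
    "/\\bwfh\\b/i": ["work from home", "home office"]
  }
}

With this map, the query "tldr of today's news" would be expanded to "tldr too long didn't read of today's news" before searching.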
- answer_config: configures which LLM to use for question answering, and associated LLM parameters. Useful properties:
  - model: sets a logical name of the LLM deployment. Consult The Mama AI for available deployments.
  - provider: set to azure to use Azure OpenAI LLM deployments, or to chat_completions to use a custom LLM deployment that exposes the OpenAI-style Chat Completions API.
  - service: a logical name of the service deployed at a provider. For the azure provider, this is the name of the Azure resource that contains the model deployment, or the URL of that resource. For chat_completions providers, the service is set to the base URL of the Chat Completions API.
  - system_prompt_template: a string template that instructs the LLM how it should respond. This can be set to override the default system prompt template below. The system prompt template text must include the placeholder {text}, which expands to the summarized retrieved documents at runtime. The default system prompt template is:

    Answer questions in {language} as if your knowledge was only based on this text:
    """{text}"""
    Summarize the answer in {language} using {min_sentences} to {max_sentences} sentence paragraph.

    The template may also contain the placeholders {language}, {min_sentences}, and {max_sentences}. If present, they expand to the values of the properties of the same name under this answer_config section. If you want your assistant to respond in the same language as the user's question, instruct it to do so in your system prompt template (see the sketch after this block).
  - max_tokens_in_response: optionally used to limit answer length (integer).
  - json_response: set this to true in case the answer should be a JSON-formatted object. The system prompt template should then contain instructions, a schema, and/or a few examples of the output JSON structure.
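A hedged answer_config sketch with a custom system prompt template that makes the assistant answer in the user's language (the instruction wording is illustrative; only the {text} placeholder is required):

"answer_config": {
  "language": "en",
  "max_tokens_in_text": 4000,
  "system_prompt_template": "Answer in the same language as the question, as if your knowledge was only based on this text:\n\"\"\"{text}\"\"\""
}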
- summary_config: controls how the retrieved relevant documents get pre-processed or "summarized" so that the configured LLM can answer questions based on their content (retrieval-augmented generation, RAG). Useful properties:
  - max_tokens: for answer and summarization purposes, the retrieved documents can be "shortened" to only include the parts that are most relevant to the user query. This specifies the limit in tokens that the total length of document summaries should not exceed. These shortened document "summaries" are used for two purposes: (1) they are optionally included in API responses if the include_content parameter contains "summaries" or "summary" (see Search requests), and (2) they are passed to the LLM as the textual context based on which the LLM is expected to answer questions. Including summaries in API responses can also serve for diagnosing and improving answers while tuning your RAG solution.
  - required_sections: optional list of field names that should always be included in document summaries, regardless of whether their value seems relevant to the user query or not. This could be e.g. a document name or URL in case you instruct the LLM to reference it in its answers.
  - place_required_sections_first: in case there are some required_sections, whether they should be placed first in the generated document summaries (default: false). If false, the ordering from the collection schema is used.
  - include_field_names: set to true if document summaries should include the field names as well, like "title: Chapter 1", instead of just the values, like "Chapter 1" (default: false)
  - exclude_field_names: if include_field_names: true, this optionally declares a blacklist of fields whose names not to include in the summaries. E.g., set this to ["p"] to prevent prefixing each paragraph with "p:" while keeping field names for other document metadata.
  - field_name_delimiter: delimits a field name from its value (default: ": ")
  - section_delimiter: inserted at the start of every document section, that is, every instance of a field: value pair (default: "\n")
  - document_header: inserted at the start of every shortened document (default: "\n")
  - max_adjacent_sections_distance: retrieved documents are shortened by only including sections (field values) that contained passages of text highly similar to the user query. Additionally, neighboring sections can be included in order not to lose context. This controls how many sections adjacent to such a relevant section should be included in document summaries. A sketch follows this block.
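A hedged summary_config sketch for the news collection above that always keeps the heading, prefixes sections with their field names except for the body text, and includes one neighboring section on each side for context (all values are illustrative):

"summary_config": {
  "max_tokens": 3000,
  "required_sections": ["heading"],
  "place_required_sections_first": true,
  "include_field_names": true,
  "exclude_field_names": ["content"],
  "max_adjacent_sections_distance": 1
}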
- ingestion_config: specifies how documents are ingested:
  - granularity: can be set to page or paragraph (default: paragraph)
  - skip_pages: a list of 0-based integer page numbers that will be ignored (default: empty list). Only affects PDF and HTML ingestion.
  - read_all_as_text: can be set to true to read all text from Excel or CSV files and treat them like text (default: false)
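For instance, an ingestion_config sketch that ingests page by page, skips a cover page, and treats spreadsheets as plain text:

"ingestion_config": {
  "granularity": "page",
  "skip_pages": [0],
  "read_all_as_text": true
}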
- collection_schema: specifies the document fields that are relevant for searching. Each field should define some essential properties:
  - field: the field name; used to route texts from input documents during ingestion. E.g. values of Excel or CSV columns named like the field name by default end up ingested as values of that field. Similarly, JSON keys route their values to fields of that name. Documents in mSearch are represented as multiple document sections, each section having a field as its "type" and its value.
  - type: the data type of the field (default: text). Major field types include:
    - id: exactly one of the fields needs to be of type id. This is used to uniquely identify each document within the collection. This is important e.g. when uploading a new version of that document.
    - text: free-form, potentially long, text content; included in search; cannot be used for filtering
    - enum_text: text content that is usually shorter; included in search; can be used to filter search results
    - enum: excluded from search but can be used to filter search results
    - time: ISO-formatted date string
  - language: (default: en) use this to specify the prevailing language of the content of that field, such as cs or en. Affects full-text search, including the treatment of stemming and accents.
  - sources: lists the text sources from which a field can be populated from ingested documents; see Ingesting documents. A schema sketch follows this block.
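A hedged schema sketch that adds a filterable timestamp and a filter-only tag to the news schema (the field names published_at and internal_tag are invented for illustration):

"collection_schema": {
  "fields": [
    { "field": "news_id", "type": "id" },
    { "field": "content", "type": "text", "language": "en", "sources": ["p"] },
    { "field": "published_at", "type": "time" },
    { "field": "internal_tag", "type": "enum" }
  ]
}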
Below is a richer configuration for the same collection that exercises several more parameters:
{
"collection_name": "news1",
"collection_id": "ef014064-1a99-4360-847c-f60071892146",
"semantic_config": {
"enabled": true,
"transformer_model_id": "paraphrase-multilingual-mpnet-base-v2",
"max_tokens_per_vector": -1,
"vector_to_document_factor": 4.0,
"hnsw_m": 48,
"hnsw_ef_construction": 200,
"hnsw_ef_factor": 100,
},
"fulltext_config": {
"enabled": true,
"engine": "elastic"
},
"collection_schema": {
"fields": [
{
"field": "news_id",
"type": "id"
},
{
"field": "heading",
"type": "text",
"language": "en",
"index_standalone": true,
"boost": 1.5,
"sources": ["h1", "h2", "title"]
},
{
"field": "content",
"type": "text",
"language": "en",
"sources": ["p"]
},
{
"field": "category",
"type": "enum_text",
"language": "en"
},
{
"field": "author",
"type": "enum_text",
"language": "en",
"source_path": "/metadata/author"
}
]
},
"search_config": {
"max_best": 30,
"search_mode": "hybrid",
"restore_diacritics": true,
"synonyms": {
"tldr": [
"too long didn't read"
],
"/(HO|H\\.O\\.)/": [
"home office",
"work from home"
],
"/\bmáma\b/ia": [
"The Mama AI"
]
}
},
"answer_config": {
"language": "en",
"provider": "azure",
"service": "llm1-themamaai",
"engine": "gpt35turbo",
"max_tokens_in_text": 4000,
"max_tokens_in_response": null,
"min_sentences": null,
"max_sentences": null,
"temperature": null,
"priming": "You are a newsroom agent who can comment on news on a chosen topic."
}
}
In the above collection config, several more config parameters are utilized, including:

- field.index_standalone: true: causes semantic search to create standalone vectors just for this field;
- field.boost: 1.5: an optional boost factor that promotes documents with matches in this field. The default boost is 1.0. Only affects fulltext search, not semantic search.
- field.source_path: used to specify an XPath-like location of this field's values when ingesting JSON documents. E.g., the author field above is populated from documents shaped like:

{
"news_id": "n1",
"metadata": {
"author": "Charlie"
}
}
- search_config.restore_diacritics: useful for the languages cs and sk in case users may type their queries unaccented. If set to true, accents are automatically restored for the query text before commencing search.
- search_config.rescore_method: optionally enables a re-scoring step that can move more relevant documents closer to the top (default: none; a sketch follows this list). Values:
  - none: return the top max_best documents based on their raw retrieval scores. If search_mode is hybrid, the combined scores of semantic and full-text search are used.
  - llm: consults the LLM so that more relevant documents move closer to the top. With search_mode set to hybrid, this helps prioritize the more relevant documents from both sources. With the semantic or keyword search modes, this helps avoid less relevant documents at the top.
- answer_config.*: these parameters configure the LLM to use for question answering and control how the retrieved documents are summarized for the LLM.
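To tie it together, a hedged search_config sketch that enables LLM-based rescoring on top of hybrid search:

"search_config": {
  "max_best": 30,
  "search_mode": "hybrid",
  "rescore_method": "llm"
}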