🔍 mSearch finds documents with similar meaning 🔍
Below is a very basic collection configuration for news texts:
{
"collection_name": "news1",
"collection_id": "ef014064-1a99-4360-847c-f60071892146",
"semantic_config": {
"enabled": true,
"transformer_model_id": "paraphrase-multilingual-mpnet-base-v2"
},
"fulltext_config": {
"enabled": true,
"engine": "elastic"
},
"collection_schema": {
"fields": [
{
"field": "news_id",
"type": "id"
},
{
"field": "heading",
"type": "text",
"language": "en",
"sources": ["h1", "h2", "title"]
},
{
"field": "content",
"type": "text",
"language": "en",
"sources": ["p"]
},
{
"field": "category",
"type": "enum_text",
"language": "en"
},
{
"field": "author",
"type": "enum_text",
"language": "en"
}
]
},
"search_config": {
"max_best": 30,
"search_mode": "hybrid",
"restore_diacritics": true
},
"answer_config": {
"language": "en",
"max_tokens_in_text": 4000
},
"ingestion_config": {
"granularity": "page",
"skip_pages": [0],
}
}
The collection configuration consists of:

- semantic_config: enables/disables semantic search and configures its parameters.
  - enabled: default: true. For most applications, keeping both semantic and fulltext search enabled produces the best results.
  - transformer_model_id: the identifier of a transformer model used to transform text into semantic vectors. This can be one of these locally served models:
    - intfloat/multilingual-e5-small (recommended for most use cases)
    - intfloat/multilingual-e5-large (slightly more precise, slightly slower, larger semantic index)
    - paraphrase-multilingual-mpnet-base-v2 (legacy)
    Alternatively, semantic_config can be set to use an online OpenAI embedding model, such as text-embedding-3-large. To enable this, contact The Mama AI.
  - max_tokens_per_vector: optional, integer. Limits the number of tokens used to construct the passages that are then translated to semantic vectors. E.g. set this to 512 if you would like your documents to be split into passages of that granularity. Semantic search works at the level of passages; the top documents returned are those that contain the passages most similar to the user query. If omitted, the transformer model's maximum token limit is used, producing the longest passages the model can handle. A sketch follows this block.
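For illustration, a minimal semantic_config sketch that splits documents into passages of at most 512 tokens, using the locally served model recommended above:

"semantic_config": {
  "enabled": true,
  "transformer_model_id": "intfloat/multilingual-e5-small",
  "max_tokens_per_vector": 512
}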
- fulltext_config: enables/disables TF/IDF full-text search and configures its parameters.
  - enabled: default: true. For most applications, keeping both semantic and fulltext search enabled produces the best results.
  - engine: the fulltext engine to use for classic TF/IDF search (default: elastic, else solr)
  - use_default_stop_words: default: true to use the language's default stopword list (defined per field)
  - stop_words: optional; may provide custom stop words per language, e.g. "stop_words": { "cs": ["book", "read"] }. Stop words only apply to fulltext search. Default and custom stop words can be used together, only one of them, or both disabled. A sketch follows this block.
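As a sketch, a fulltext_config that keeps the default per-field stopword lists and adds a few custom English stop words (the words are invented for this example):

"fulltext_config": {
  "enabled": true,
  "engine": "elastic",
  "use_default_stop_words": true,
  "stop_words": { "en": ["breaking", "exclusive"] }
}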
- search_config: configures how search works at query time. Most parameters specified here set the default behavior and can be overridden for each query. Useful properties:
  - synonyms: used to expand parts of the user query with synonyms. Each key identifies the text in the query that should be augmented with its synonyms. Values are lists of synonyms to add to the query, right after the text identified by the key. The keys are of 2 kinds: (1) plain texts, such as "tldr" in the sample configuration below, are exact-matched on whole words, ignoring case; (2) keys delimited with slashes (/.../) are parsed as regular expressions and may contain the suffix flags i (ignore case) and a (ignore character accents). Regexes can match parts of words; if you want to avoid that, surround your regex with \b word boundaries. An expansion sketch follows this block.
  - restore_diacritics: true/false: optionally attempts to restore accents (default: false)
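For illustration, a hedged synonyms sketch (the wfh entry is invented for this example; note that \b must be written as \\b inside a JSON string):

"search_config": {
  "synonyms": {
    "tldr": ["too long didn't read"],
    "/\\bwfh\\b/i": ["work from home", "home office"]
  }
}

With this map, the query "tldr of today's news" would be expanded to "tldr too long didn't read of today's news" before searching.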
- answer_config: configures which LLM to use for question answering, and associated LLM parameters. Useful properties:
  - model: sets a logical name of the LLM deployment. Consult The Mama AI for available deployments.
  - provider: set to azure to use Azure OpenAI LLM deployments, or to chat_completions to use a custom LLM deployment that exposes the OpenAI-style Chat Completions API.
  - service: a logical name of the service deployed at a provider. For the azure provider, this is the name of the Azure resource that contains the model deployment, or the URL of that resource. For chat_completions providers, the service is set to the base URL of the Chat Completions API.
  - system_prompt_template: a string template that instructs the LLM how it should respond. This can be set to override the default system prompt template below. The system prompt template text must include the placeholder {text}, which expands to the summarized retrieved documents at runtime. The default system prompt template is:

    Answer questions in {language} as if your knowledge was only based on this text:
    """{text}"""
    Summarize the answer in {language} using {min_sentences} to {max_sentences} sentence paragraph.

    The template may also contain the placeholders {language}, {min_sentences}, and {max_sentences}. If present, they expand to the values of the properties of the same name under this answer_config section. If you want your assistant to respond in the same language as the user's question, instruct it to do so in your system prompt template (see the sketch after this block).
  - max_tokens_in_response: optionally used to limit answer length (integer).
  - json_response: set this to true in case the answer should be a JSON-formatted object. The system prompt template should then contain instructions, a schema, and/or a few examples of the output JSON structure.
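A hedged answer_config sketch with a custom system prompt template that makes the assistant answer in the user's language (the instruction wording is illustrative; only the {text} placeholder is required):

"answer_config": {
  "language": "en",
  "max_tokens_in_text": 4000,
  "system_prompt_template": "Answer in the same language as the question, as if your knowledge was only based on this text:\n\"\"\"{text}\"\"\""
}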
- summary_config: controls how the retrieved relevant documents get pre-processed or "summarized" so that the configured LLM can answer questions based on their content (retrieval-augmented generation, RAG). Useful properties:
  - max_tokens: for answer and summarization purposes, the retrieved documents can be "shortened" to only include the parts that are most relevant to the user query. This specifies the limit in tokens that the total length of document summaries should not exceed. These shortened document "summaries" are used for two purposes: (1) they are optionally included in API responses if the include_content parameter contains "summaries" or "summary" (see Search requests), and (2) they are passed to the LLM as the textual context based on which the LLM is expected to answer questions. Including summaries in API responses can also serve for diagnosing and improving answers while tuning your RAG solution.
  - required_sections: optional list of field names that should always be included in document summaries, regardless of whether their value seems relevant to the user query or not. This could be e.g. a document name or URL in case you instruct the LLM to reference it in its answers.
  - place_required_sections_first: in case there are some required_sections, whether they should be placed first in the generated document summaries (default: false). If false, the ordering from the collection schema is used.
  - include_field_names: set to true if document summaries should include the field names as well, like "title: Chapter 1", instead of just the values, like "Chapter 1" (default: false)
  - exclude_field_names: if include_field_names: true, this optionally declares a blacklist of fields whose names not to include in the summaries. E.g., set this to ["p"] to prevent prefixing each paragraph with "p:" while keeping field names for other document metadata.
  - field_name_delimiter: delimits a field name from its value (default: ": ")
  - section_delimiter: inserted at the start of every document section, that is, every instance of a field: value pair (default: "\n")
  - document_header: inserted at the start of every shortened document (default: "\n")
  - max_adjacent_sections_distance: retrieved documents are shortened by only including sections (field values) that contained passages of text highly similar to the user query. Additionally, neighboring sections can be included in order not to lose context. This controls how many sections adjacent to such a relevant section should be included in document summaries. A sketch follows this block.
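A hedged summary_config sketch for the news collection above that always keeps the heading, prefixes sections with their field names except for the body text, and includes one neighboring section on each side for context (all values are illustrative):

"summary_config": {
  "max_tokens": 3000,
  "required_sections": ["heading"],
  "place_required_sections_first": true,
  "include_field_names": true,
  "exclude_field_names": ["content"],
  "max_adjacent_sections_distance": 1
}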
- ingestion_config: specifies how documents are ingested:
  - granularity: can be set to page or paragraph (default: paragraph)
  - skip_pages: a list of 0-based integer page numbers that will be ignored (default: empty list). Only affects PDF and HTML ingestion.
  - read_all_as_text: can be set to true to read all text from Excel or CSV files and treat them like text (default: false)
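For instance, an ingestion_config sketch that ingests page by page, skips a cover page, and treats spreadsheets as plain text:

"ingestion_config": {
  "granularity": "page",
  "skip_pages": [0],
  "read_all_as_text": true
}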
- collection_schema: specifies the document fields that are relevant for searching. Each field should define some essential properties:
  - field: the field name; used to route texts from input documents during ingestion. E.g. values of Excel or CSV columns named like the field name by default end up ingested as values of that field. Similarly, JSON keys route their values to fields of that name. Documents in mSearch are represented as multiple document sections, each section having a field as its "type" and its value.
  - type: the data type of the field (default: text). Major field types include:
    - id: exactly one of the fields needs to be of type id. This is used to uniquely identify each document within the collection. This is important e.g. when uploading a new version of that document.
    - text: free-form, potentially long, text content; included in search; cannot be used for filtering
    - enum_text: text content that is usually shorter; included in search; can be used to filter search results
    - enum: excluded from search but can be used to filter search results
    - time: ISO-formatted date string
  - language: (default: en) use this to specify the prevailing language of the content of that field, such as cs or en. Affects full-text search, including the treatment of stemming and accents.
  - sources: lists the text sources from which a field can be populated from ingested documents; see Ingesting documents. A schema sketch follows this block.
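A hedged schema sketch that adds a filterable timestamp and a filter-only tag to the news schema (the field names published_at and internal_tag are invented for illustration):

"collection_schema": {
  "fields": [
    { "field": "news_id", "type": "id" },
    { "field": "content", "type": "text", "language": "en", "sources": ["p"] },
    { "field": "published_at", "type": "time" },
    { "field": "internal_tag", "type": "enum" }
  ]
}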
Below is a richer configuration for the same collection that exercises several more parameters:
{
"collection_name": "news1",
"collection_id": "ef014064-1a99-4360-847c-f60071892146",
"semantic_config": {
"enabled": true,
"transformer_model_id": "paraphrase-multilingual-mpnet-base-v2",
"max_tokens_per_vector": -1,
"vector_to_document_factor": 4.0,
"hnsw_m": 48,
"hnsw_ef_construction": 200,
"hnsw_ef_factor": 100,
},
"fulltext_config": {
"enabled": true,
"engine": "elastic"
},
"collection_schema": {
"fields": [
{
"field": "news_id",
"type": "id"
},
{
"field": "heading",
"type": "text",
"language": "en",
"index_standalone": true,
"boost": 1.5,
"sources": ["h1", "h2", "title"]
},
{
"field": "content",
"type": "text",
"language": "en",
"sources": ["p"]
},
{
"field": "category",
"type": "enum_text",
"language": "en"
},
{
"field": "author",
"type": "enum_text",
"language": "en",
"source_path": "/metadata/author"
}
]
},
"search_config": {
"max_best": 30,
"search_mode": "hybrid",
"restore_diacritics": true,
"synonyms": {
"tldr": [
"too long didn't read"
],
"/(HO|H\\.O\\.)/": [
"home office",
"work from home"
],
"/\bmáma\b/ia": [
"The Mama AI"
]
}
},
"answer_config": {
"language": "en",
"provider": "azure",
"service": "llm1-themamaai",
"engine": "gpt35turbo",
"max_tokens_in_text": 4000,
"max_tokens_in_response": null,
"min_sentences": null,
"max_sentences": null,
"temperature": null,
"priming": "You are a newsroom agent who can comment on news on a chosen topic."
}
}
In the above collection config, several more config parameters are utilized, including:

- field.index_standalone: true: causes semantic search to create standalone vectors just for this field;
- field.boost: 1.5: an optional boost factor that promotes documents with matches in this field. The default boost is 1.0. Only affects fulltext search, not semantic search.
- field.source_path: used to specify an XPath-like location of this field's values when ingesting JSON documents. E.g., the author field above is populated from documents shaped like:

{
"news_id": "n1",
"metadata": {
"author": "Charlie"
}
}
- search_config.restore_diacritics: useful for the languages cs and sk in case users may type their queries unaccented. If set to true, accents are automatically restored for the query text before commencing search.
- search_config.rescore_method: optionally enables a re-scoring step that can move more relevant documents closer to the top (default: none; a sketch follows this list). Values:
  - none: return the top max_best documents based on their raw retrieval scores. If search_mode is hybrid, the combined scores of semantic and full-text search are used.
  - llm: consults the LLM so that more relevant documents move closer to the top. With search_mode set to hybrid, this helps prioritize the more relevant documents from both sources. With the semantic or keyword search modes, this helps avoid less relevant documents at the top.
- answer_config.*: these parameters configure the LLM to use for question answering and control how the retrieved documents are summarized for the LLM.
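To tie it together, a hedged search_config sketch that enables LLM-based rescoring on top of hybrid search:

"search_config": {
  "max_best": 30,
  "search_mode": "hybrid",
  "rescore_method": "llm"
}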