mSearch collection configuration

🔍 mSearch finds documents with similar meaning 🔍

Sample configuration

Below is a very basic collection configuration for news texts:

{
  "collection_name": "news1",
  "collection_id": "ef014064-1a99-4360-847c-f60071892146",
  "semantic_config": {
    "enabled": true,
    "transformer_model_id": "paraphrase-multilingual-mpnet-base-v2"
  },
  "fulltext_config": {
    "enabled": true,
    "engine": "elastic"
  },
  "collection_schema": {
    "fields": [
      {
        "field": "news_id",
        "type": "id"
      },
      {
        "field": "heading",
        "type": "text",
        "language": "en",
        "sources": ["h1", "h2", "title"]
      },
      {
        "field": "content",
        "type": "text",
        "language": "en",
        "sources": ["p"]
      },
      {
        "field": "category",
        "type": "enum_text",
        "language": "en"
      },
      {
        "field": "author",
        "type": "enum_text",
        "language": "en"
      }
    ]
  },
  "search_config": {
    "max_best": 30,
    "search_mode": "hybrid",
    "restore_diacritics": true
  },
  "answer_config": {
    "language": "en",
    "max_tokens_in_text": 4000
  },
  "ingestion_config": {
    "granularity": "page",
    "skip_pages": [0]
  }
}
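
As a quick sanity check, a configuration like the one above can be loaded and its schema summarized. The sketch below is illustrative only: `fields_by_type` is a hypothetical helper, not part of mSearch, and the embedded config is a trimmed copy of the sample.

```python
import json
from collections import defaultdict

# Trimmed copy of the sample configuration (strict JSON, no trailing commas).
CONFIG_JSON = """
{
  "collection_name": "news1",
  "collection_schema": {
    "fields": [
      {"field": "news_id", "type": "id"},
      {"field": "heading", "type": "text", "language": "en", "sources": ["h1", "h2", "title"]},
      {"field": "content", "type": "text", "language": "en", "sources": ["p"]},
      {"field": "category", "type": "enum_text", "language": "en"},
      {"field": "author", "type": "enum_text", "language": "en"}
    ]
  }
}
"""


def fields_by_type(config: dict) -> dict:
    """Group schema field names by their declared type (hypothetical helper)."""
    grouped = defaultdict(list)
    for field in config["collection_schema"]["fields"]:
        grouped[field["type"]].append(field["field"])
    return dict(grouped)


config = json.loads(CONFIG_JSON)  # raises json.JSONDecodeError on invalid JSON
summary = fields_by_type(config)
print(summary)
```

Because `json.loads` is strict, it also catches slips such as a trailing comma before a closing brace.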

The collection configuration consists of:

Major field types include:

For most applications, keeping both semantic and full-text (TF-IDF) search enabled produces the best results. That is, both semantic_config.enabled and fulltext_config.enabled should be true, which is also the default when they are omitted.
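
The default behaviour described above can be mirrored on the client side when inspecting a configuration. This is a minimal sketch, assuming only what the text states (a missing `enabled` flag counts as true); `is_enabled` and `active_search_types` are hypothetical names, not mSearch functions.

```python
def is_enabled(config: dict, section: str) -> bool:
    """Return the section's "enabled" flag, treating a missing flag or section as True."""
    return (config.get(section) or {}).get("enabled", True)


def active_search_types(config: dict) -> list:
    """List which of the two search types a configuration leaves switched on."""
    sections = {"semantic": "semantic_config", "fulltext": "fulltext_config"}
    return [name for name, key in sections.items() if is_enabled(config, key)]


# Neither section is present, so both fall back to the default.
print(active_search_types({"collection_name": "news1"}))

# Full-text search explicitly disabled; semantic search still defaults to on.
print(active_search_types({"fulltext_config": {"enabled": False}}))
```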

An example of a more elaborate collection configuration:

{
  "collection_name": "news1",
  "collection_id": "ef014064-1a99-4360-847c-f60071892146",
  "semantic_config": {
    "enabled": true,
    "transformer_model_id": "paraphrase-multilingual-mpnet-base-v2",
    "max_tokens_per_vector": -1,
    "vector_to_document_factor": 4.0,
    "hnsw_m": 48,
    "hnsw_ef_construction": 200,
    "hnsw_ef_factor": 100
  },
  "fulltext_config": {
    "enabled": true,
    "engine": "elastic"
  },
  "collection_schema": {
    "fields": [
      {
        "field": "news_id",
        "type": "id"
      },
      {
        "field": "heading",
        "type": "text",
        "language": "en",
        "index_standalone": true,
        "boost": 1.5,
        "sources": ["h1", "h2", "title"]
      },
      {
        "field": "content",
        "type": "text",
        "language": "en",
        "sources": ["p"]
      },
      {
        "field": "category",
        "type": "enum_text",
        "language": "en"
      },
      {
        "field": "author",
        "type": "enum_text",
        "language": "en",
        "source_path": "/metadata/author"
      }
    ]
  },
  "search_config": {
    "max_best": 30,
    "search_mode": "hybrid",
    "restore_diacritics": true
  },
  "answer_config": {
    "language": "en",
    "provider": "azure",
    "service": "llm1-themamaai",
    "engine": "gpt35turbo",
    "max_tokens_in_text": 4000,
    "max_tokens_in_response": null,
    "min_sentences": null,
    "max_sentences": null,
    "temperature": null,
    "priming": "You are a newsroom agent who can comment on news on a chosen topic."
  }
}

The configuration above uses several additional parameters, including:

{
  "news_id": "n1",
  "metadata": {
    "author": "Charlie"
  }
}
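
The `source_path` value `/metadata/author` on the `author` field appears to point into a nested document such as the one above. The sketch below shows one way such a path could be resolved, assuming it is a slash-separated list of object keys. That interpretation is an assumption, not confirmed mSearch behaviour, and `resolve_source_path` is a hypothetical helper.

```python
def resolve_source_path(document: dict, source_path: str):
    """Follow a slash-separated path through nested dicts.

    Assumption: each path segment is a plain object key. Returns None
    when any segment is missing or the path runs into a non-object value.
    """
    node = document
    for key in source_path.strip("/").split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


document = {
    "news_id": "n1",
    "metadata": {
        "author": "Charlie",
    },
}

author = resolve_source_path(document, "/metadata/author")
print(author)
```

Returning `None` for a missing segment keeps the lookup from failing on documents that lack the optional metadata.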

API reference