Searching and extracting relevant passages and summaries

🔍 mSearch finds documents with similar meaning 🔍

Search requests

Relevant endpoints:

The two endpoints listed towards the end, search and answer, allow querying multiple collections at once. When searching across multiple collections, those collections should all use the same embedding method, so that relevancy scores are comparable across collections.

A simple search request

POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json

{
  "query": "T-Shirts with animals on them"
}
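
For clients written in Python, the request above could be assembled roughly as follows. This is a minimal sketch using only the standard library; the helper name build_search_request, the collection id "my-collection" and the placeholder authorization value are assumptions for illustration, and the exact format of the Authorization header is not shown on this page.

```python
import json
import urllib.request

BASE_URL = "https://api.msearch.themama.ai/msearch"


def build_search_request(collection_id, query, authorization, **params):
    """Build (but do not send) a POST request for the search endpoint.

    `authorization` is passed through verbatim as the Authorization header;
    its expected format is an assumption and depends on your account setup.
    """
    body = {"query": query, **params}
    return urllib.request.Request(
        url=f"{BASE_URL}/collections/{collection_id}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": authorization,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical collection id and credential, for illustration only.
request = build_search_request(
    "my-collection", "T-Shirts with animals on them", "YOUR_AUTHORIZATION_VALUE"
)
# To actually send it: urllib.request.urlopen(request), then json.load() the response.
```

Additional top-level parameters such as max_results can be passed as keyword arguments and end up in the JSON body unchanged.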

A search request to return the full content of the 5 most relevant documents

Returning the full content of documents is the default, and highlighting is on by default as well. The request below additionally asks for the "sections" format instead of flat:

POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json

{
  "query": "T-Shirts with animals on them",
  "max_results": 5,
  "include_content": ["full"],
  "result_format": "sections",
  "highlight": true
}

A search request to return just document summaries that fit into a token limit

Instead of full document content, clients can retrieve just document excerpts that together fit into a token limit. Replacing "summaries" with "summary" in include_content would return the summaries as a single string in the response.

POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json

{
  "query": "Co je to DPH",
  "include_content": ["summaries"],
  "summary_config": {
    "max_tokens": 3000
  }
}

A search request to return full documents with highlight as well as summaries

Several types of content can be returned at once:

POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json

{
  "query": "Co je to DPH",
  "include_content": ["full", "summaries"],
  "result_format": "sections",
  "summary_config": {
    "max_tokens": 3000
  }
}

Query pre-processing

Before conducting search, the user query can be preprocessed in a number of ways:

Diacritics restoration: controlled by the parameter "restore_diacritics": true, which can be set either in the collection configuration under search_config or as a top-level property of an individual search request, as below.
Synonym expansion: augments the query with synonyms, configured per collection in the collection's search_config section.

If the user query was altered before searching, the altered version of the query is included in the API response under metadata.effective_queries, as shown at the bottom of this page.

POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json

{
  "query": "Jaka je definice HO",
  "restore_diacritics": true,
  "include_content": ["full", "summaries"],
  "result_format": "sections",
  "summary_config": {
    "max_tokens": 3000
  }
}

Control over what information to include in responses

The query and search endpoints of mSearch aim to retrieve the documents most relevant to a user query. This works at the level of documents as they were ingested. The include_content parameter controls whether to return full documents, document summaries, a single summary, or a combination of these. Other commonly used parameters include:

max_results: controls how many documents to retrieve at most,
result_format: the format of full documents, either sections or flat,
include_fields: optionally specifies the document fields (section types) to return (defaults to all),
filter: optionally constrains the value of some fields; see more on Filtering
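
The parameters above can be combined into a request body with a small client-side helper. The sketch below is an illustration only: the function name build_search_body is hypothetical, and the sets of allowed values are taken from the values named on this page, so the service may accept others. The filter parameter is left out because its format is described in the Filtering topic.

```python
# Values named on this page; the service may accept more (assumption).
ALLOWED_CONTENT = {"full", "summaries", "summary"}
ALLOWED_FORMATS = {"sections", "flat"}


def build_search_body(query, include_content=("full",), result_format=None,
                      max_results=None, highlight=None, max_tokens=None):
    """Assemble a search request body, validating values on the client side."""
    unknown = set(include_content) - ALLOWED_CONTENT
    if unknown:
        raise ValueError(f"unsupported include_content values: {sorted(unknown)}")
    if result_format is not None and result_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported result_format: {result_format!r}")

    body = {"query": query, "include_content": list(include_content)}
    # Only send optional parameters that were set, so server defaults apply otherwise.
    if result_format is not None:
        body["result_format"] = result_format
    if max_results is not None:
        body["max_results"] = max_results
    if highlight is not None:
        body["highlight"] = highlight
    if max_tokens is not None:
        body["summary_config"] = {"max_tokens": max_tokens}
    return body


body = build_search_body(
    "Co je to DPH",
    include_content=["full", "summaries"],
    result_format="sections",
    max_tokens=3000,
)
```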

There are several types of information about the retrieved documents that mSearch can return.

document_id: unique document identifier within a collection
score: similarity or relevancy of the document to the query, in the range 0..1, used to sort the retrieved documents
source: whether the document was retrieved by semantic, keyword search, or both (sem, key, sem-key)
documents: full document content, available in 2 formats, sections or flat, according to the result_format parameter
passages: relevant textual content extracted from the retrieved document. Passages come with character offsets into a section of the retrieved document. A passage always points to one document section, and can span the whole content of that section or just part of it. A passage identifies the section it points to by its section_index property, and the offsets are into that section's text content.

To include passages in the response, set the highlight parameter to true. It is recommended to use result_format set to sections when working with passages, as the section indexes and content character offsets point to document sections, and cannot be interpreted with the flat format.
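
As an illustration of how section_index and offsets fit together, a client could resolve a passage back to its place in the document like this. The sketch assumes that offsets is a [start, end] pair of character positions with an exclusive end, which this page does not state explicitly; the helper name and the sample document are made up for the example.

```python
def passage_span(document, passage):
    """Return the text of the section slice that a passage points to.

    Assumes offsets are [start, end) character positions within the section
    identified by section_index (the end being exclusive is an assumption).
    """
    section = document["sections"][passage["section_index"]]
    start, end = passage["offsets"]
    return section["content"][start:end]


# Made-up document in the "sections" format, for illustration only.
document = {
    "document_id": "example-1",
    "sections": [
        {"type": "title", "content": "Animal shirts"},
        {"type": "description", "content": "Cotton T-shirt with a fox print."},
    ],
    "passages": [
        {"text": "fox print", "offsets": [22, 31], "section_index": 1},
    ],
}

span = passage_span(document, document["passages"][0])
```

Comparing the resolved span with the passage's own text property is a cheap way to check the offset convention against real responses.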

Document formats

Full documents can be output in 2 formats:

sections: a list of sections, where each section has its type (like "title", "description", or "h1") and content with the section's text. For document-oriented formats like text files, HTML, docx or PDF documents, the ordering of sections reflects the ingested document flow. For record-oriented formats like JSON, JSON lines, CSV or spreadsheet formats, the ordering reflects the original ordering of columns/properties in the ingested documents.
flat: the document content comes as a flat dictionary of key-value pairs, where keys are section types and values are section contents. If a document has multiple sections of the same type, the value becomes a list of those section values. Not suitable for document-oriented formats, as this format does not preserve document flow.
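
The relationship between the two formats can be sketched as a conversion from sections to flat. This is a client-side illustration of the description above, not the service's own implementation; the function name is hypothetical.

```python
def sections_to_flat(sections):
    """Collapse a list of {"type", "content"} sections into a flat dictionary.

    Repeated section types are collected into a list, as described above.
    The order of different section types relative to each other is lost.
    """
    flat = {}
    for section in sections:
        key, value = section["type"], section["content"]
        if key not in flat:
            flat[key] = value
        elif isinstance(flat[key], list):
            flat[key].append(value)
        else:
            # Second occurrence of this type: turn the single value into a list.
            flat[key] = [flat[key], value]
    return flat


flat = sections_to_flat([
    {"type": "title", "content": "Animal shirts"},
    {"type": "h1", "content": "Foxes"},
    {"type": "h1", "content": "Owls"},
])
```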

Sample response showing "full" documents along with "passages" due to "highlight" enabled

{
  "search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f",
  "documents": [
    {
      "document_id": "ZODPH155",
      "score": 0.5110244154930115,
      "source": "sem",
      "passages": [
        {
          "text": "odnést.\n(3) Prodávající je povinen na vyžádání zahraniční fyzické osoby vystavit doklad... ",
          "offsets": [
            866,
            1306
          ],
          "section_index": 2,
          "score": 0.47954487800598145
        },
        ...
      ],
      "sections": [
        {
          "type": "Id",
          "content": "ZODPH155"
        },
        {
          "type": "Paragraf",
          "content": "§ 84 DZ"
        },
        {
          "type": "Text",
          "content": "Vracení daně fyzickým osobám..."
        },
        ...
      ]
    }
  ]
}

Returning only the relevant parts (summaries) of top documents

In addition or as an alternative to returning whole documents and their relevant passages, mSearch can return other types of content relevant to user query:

summaries - a list of summary texts that are excerpts from the top retrieved documents. The cumulative text length of these summaries, in tokens, is guaranteed not to exceed the setting summary_config.max_tokens specified either in the collection configuration or directly in the summary_config included in the search request body. Each summary is an object that has its document_id set and contains the relevant content extracted. The summarization strategy aims to include summaries for all max_results retrieved documents, and progressively summarizes documents more aggressively when they cannot all fit into the max_tokens limit. When they cannot all fit even with the most aggressive summarization strategy, only the top retrieved documents make it into the summaries list.
summary - essentially the same as summaries, just in the form of a single text where the individual document summaries have been concatenated using the document separators as configured in summary_config.
answer - in case the user query was a question and the answer endpoint was used, this contains the answer to the question based on the summaries of the top retrieved documents. See more in the Answers topic.
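
To make the difference between summaries and summary concrete, the sketch below joins a summaries list into a single string on the client side. The separator shown is a placeholder: the actual document separators come from summary_config, and their parameter name and default value are not given on this page.

```python
def join_summaries(summaries, separator="\n\n"):
    """Concatenate per-document summaries into one text.

    `separator` is a placeholder value (assumption); the service uses the
    document separators configured in summary_config.
    """
    return separator.join(item["content"] for item in summaries)


# Made-up summaries, for illustration only.
summaries = [
    {"document_id": "doc-1", "content": "First excerpt."},
    {"document_id": "doc-2", "content": "Second excerpt."},
]
summary = join_summaries(summaries)
```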

Sample response showing document summaries that fit in 3000 tokens

{
  "search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f",
  "summaries": [
    {
      "document_id": "ZODPH148",
      "content": "§80 DZ\nVracení daně osobám požívajícím výsad a imunit\n(1) Pro účely tohoto zákona ..."
    },
    ...
  ],
  "metadata": {
    "tokens": {
      "prompt_tokens": 2438
    },
    "original_documents_count": 5,
    "summary_documents_count": 5,
    "summary_method": "first_relevant_3_then_relevant_21 distance=1",
    "message": "C. Represented 5 documents with total 12367 tokens by the first 2 docs abbreviated, and the rest abbreviated even more to total 2438 tokens which is less than max 3000 configured",
    "effective_queries": ["Vrácení DPH, daň z přidané hodnoty"]
  }
}
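
A client may want to inspect the metadata block of such a response, for example to see whether the query was altered or whether some documents were left out of the summaries. The following is a sketch under the assumption that the metadata fields appear as in the sample above; the helper name and the returned keys are made up for illustration.

```python
def inspect_metadata(response, original_query):
    """Summarize the metadata block of a search response.

    Field names follow the sample response above; every field is treated
    as optional, since it is an assumption that all of them are always present.
    """
    metadata = response.get("metadata", {})
    effective = metadata.get("effective_queries", [])
    return {
        # The query counts as altered if effective queries differ from the input.
        "query_altered": bool(effective) and effective != [original_query],
        "effective_queries": effective,
        # Documents retrieved but not represented in the summaries.
        "documents_dropped": metadata.get("original_documents_count", 0)
        - metadata.get("summary_documents_count", 0),
        "tokens_used": metadata.get("tokens", {}).get("prompt_tokens"),
    }


sample_response = {
    "metadata": {
        "tokens": {"prompt_tokens": 2438},
        "original_documents_count": 5,
        "summary_documents_count": 5,
        "effective_queries": ["Vrácení DPH, daň z přidané hodnoty"],
    }
}
info = inspect_metadata(sample_response, "Vraceni DPH")
```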

API reference