🔍 mSearch finds documents with similar meaning 🔍
Relevant endpoints:
The two endpoints listed towards the end - search and answer - allow querying multiple collections at once. When using search across multiple collections, the embedding method used by those collections should be the same, to ensure compatible relevancy scores across collections.
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "T-Shirts with animals on them"
}
By default, the full content of documents is returned and highlighting is turned on.
The following request sets these options explicitly and additionally asks for the "sections" format instead of flat:
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "T-Shirts with animals on them",
"max_results": 5,
"include_content": ["full"],
"result_format": "sections",
"highlight": true
}
Instead of full document content, clients can retrieve just document excerpts that together fit within a token limit.
Replacing summaries with summary in include_content returns the summaries as a single string in the response.
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "Co je to DPH",
"include_content": ["summaries"],
"summary_config": {
"max_tokens": 3000
}
}
Several types of content can be returned at once:
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "Co je to DPH",
"include_content": ["full", "summaries"],
"result_format": "sections",
"summary_config": {
"max_tokens": 3000
}
}
Before the search is run, the user query can be preprocessed in a number of ways:
restore_diacritics: setting "restore_diacritics": true restores diacritics in the query. It can be set either in the collection
configuration under search_config or as a top-level property of an individual search request, as below.
If the user query was altered before searching, the altered version of the query is
included in the API response under metadata.effective_queries, as shown at the bottom of this page.
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "Jaka je definice HO",
"restore_diacritics": true,
"include_content": ["full", "summaries"],
"result_format": "sections",
"summary_config": {
"max_tokens": 3000
}
}
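The request variants above differ only in their JSON body, so a client can build them with one small helper. The following is a minimal Python sketch, not an official client: the function name build_search_request, the collection id "my-collection" and the api_key placeholder are assumptions made for illustration, and the Authorization header value is not shown on this page. The helper only constructs the request; sending it is left to the caller.

```python
import json
import urllib.request

BASE_URL = "https://api.msearch.themama.ai/msearch"


def build_search_request(collection_id, query, api_key, **options):
    """Build (but do not send) a POST request for the search endpoint.

    `options` are passed through as top-level body properties, e.g.
    max_results, include_content, result_format, summary_config or
    restore_diacritics. `api_key` is a placeholder: the expected
    Authorization header value is not documented here.
    """
    body = {"query": query, **options}
    return urllib.request.Request(
        url=f"{BASE_URL}/collections/{collection_id}/search",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_search_request(
    "my-collection",  # hypothetical collection id
    "Jaka je definice HO",
    api_key="<your credentials>",  # placeholder value
    restore_diacritics=True,
    include_content=["full", "summaries"],
    result_format="sections",
    summary_config={"max_tokens": 3000},
)
# The caller would send it with, e.g., urllib.request.urlopen(request).
```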
The query and search endpoints of mSearch aim to retrieve the documents most relevant to a user query.
This works at the level of documents as they were ingested.
The include_content parameter controls whether to return the full documents, or just document summaries,
a single summary, or their combination.
Other commonly used parameters include:
max_results: controls how many documents to retrieve at most
result_format: the format of full documents, either sections or flat
include_fields: optionally specifies the document fields (section types) to return (defaults to all)
filter: optionally constrains the values of some fields; see more on Filtering
There are several types of information about the retrieved documents that mSearch can return.
document_id: unique document identifier within a collection
score: similarity or relevancy of the document to the query, in the range 0..1, used to sort the retrieved documents
source: whether the document was retrieved by semantic search, keyword search, or both (sem, key, sem-key)
documents: full document content, available in 2 formats, sections or flat, according to the result_format parameter
passages: passages of relevant textual content extracted from the retrieved document.
Passages come with character offsets into a section of the retrieved document.
A passage always points to one document section, and can span the whole content of that section
or just part of it. A passage identifies the section it points to by its section_index property,
and the offsets are into that section's text content. To include passages in the response, set the highlight parameter to true.
It is recommended to use result_format set to sections when working with passages, as the section indexes
and content character offsets point to document sections, and cannot be interpreted with the flat format.
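To illustrate how a passage maps back onto its section, here is a small Python sketch. It assumes that offsets is a [start, end] pair of character offsets with an exclusive end, which this page does not state explicitly; the helper name passage_text and the toy document are made up for the example.

```python
def passage_text(document, passage):
    """Return the part of a section's content that a passage points to.

    Assumption: `offsets` is [start, end] in characters, relative to the
    content of the section at `section_index`, with `end` exclusive.
    """
    section = document["sections"][passage["section_index"]]
    start, end = passage["offsets"]
    return section["content"][start:end]


# Toy document in the "sections" result format (made-up content).
document = {
    "document_id": "DOC1",
    "sections": [
        {"type": "title", "content": "Animal T-Shirts"},
        {"type": "description", "content": "Cotton shirts printed with cats and dogs."},
    ],
    "passages": [{"section_index": 1, "offsets": [27, 40], "score": 0.5}],
}

for passage in document["passages"]:
    print(passage_text(document, passage))
```

With the flat format there is no section list to index into, which is why the same lookup cannot be done there.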
Full documents can be output in 2 formats, sections or flat.
In the sections format, each section has a type (like "title", "description", or "h1") and content
with the section's text. For document-oriented formats like text files, HTML, docx or PDF documents,
the ordering of sections reflects the ingested document flow. For record-oriented formats like JSON, JSON lines, CSV
or spreadsheet formats, the ordering reflects the original ordering of columns/properties in the ingested documents.
{
"search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f",
"documents": [
{
"document_id": "ZODPH155",
"score": 0.5110244154930115,
"source": "sem",
"passages": [
{
"text": "odnést.\n(3) Prodávající je povinen na vyžádání zahraniční fyzické osoby vystavit doklad... ",
"offsets": [
866,
1306
],
"section_index": 2,
"score": 0.47954487800598145
},
...
],
"sections": [
{
"type": "Id",
"content": "ZODPH155"
},
{
"type": "Paragraf",
"content": "§ 84 DZ"
},
{
"type": "Text",
"content": "Vracení daně fyzickým osobám..."
},
...
]
}
]
}
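A client reading a response like the one above will often want to look up sections by their type. The sketch below shows one way to do that in Python; the helper name sections_by_type is an assumption, and the response literal is a shortened copy of the example above without passages.

```python
def sections_by_type(document):
    """Group section contents by section type, keeping the original order."""
    grouped = {}
    for section in document.get("sections", []):
        grouped.setdefault(section["type"], []).append(section["content"])
    return grouped


# Shortened copy of the example response (passages and long text omitted).
response = {
    "search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f",
    "documents": [
        {
            "document_id": "ZODPH155",
            "score": 0.5110244154930115,
            "source": "sem",
            "sections": [
                {"type": "Id", "content": "ZODPH155"},
                {"type": "Paragraf", "content": "§ 84 DZ"},
            ],
        }
    ],
}

for doc in response["documents"]:
    fields = sections_by_type(doc)
    print(doc["document_id"], doc["source"], round(doc["score"], 3), fields["Paragraf"])
```

Values are kept as lists because nothing on this page rules out several sections sharing one type, such as repeated headings.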
In addition to, or as an alternative to, returning whole documents and their relevant passages,
mSearch can return other types of content relevant to the user query:
summaries - a list of summary texts that are excerpts from the top retrieved documents. The cumulative text length
in these summaries, in tokens, is guaranteed not to exceed the setting summary_config.max_tokens specified
in either the collection configuration, or directly in the summary_config included with the search request body.
Each summary is an object that has its document_id set and the relevant content extracted.
The summarization strategy aims to include the summaries for all max_results retrieved documents,
and tries to progressively summarize documents more when they cannot all fit into the max_tokens limit.
But when they cannot all fit even with the most aggressive summarization strategy,
then just the top retrieved documents make it into the summaries list.
summary - essentially the same as summaries, just in the form of a single text where the individual document
summaries have been concatenated using the document separators as configured in summary_config.
answer - in case the user query was a question and the answer endpoint was used, this contains the answer
to the question based on the summaries of the top retrieved documents. See more in the Answers topic.
{
"search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f",
"summaries": [
{
"document_id": "ZODPH148",
"content": "§80 DZ\nVracení daně osobám požívajícím výsad a imunit\n(1) Pro účely tohoto zákona ..."
},
...
],
"metadata": {
"tokens": {
"prompt_tokens": 2438
},
"original_documents_count": 5,
"summary_documents_count": 5,
"summary_method": "first_relevant_3_then_relevant_21 distance=1",
"message": "C. Represented 5 documents with total 12367 tokens by the first 2 docs abbreviated, and the rest abbreviated even more to total 2438 tokens which is less than max 3000 configured",
"effective_queries": ["Vrácení DPH, daň z přidané hodnoty"]
}
}
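The response above can be consumed with a few lines of Python. This is a sketch under stated assumptions: the helper names join_summaries and effective_queries are made up, the separator argument only stands in for the document separator configured in summary_config (its actual key and default are not shown here), and the toy response uses shortened, invented summary text.

```python
def join_summaries(response, separator="\n\n"):
    """Concatenate the summaries list into one string on the client side.

    `separator` is a stand-in for the separator configured in summary_config.
    """
    return separator.join(s["content"] for s in response.get("summaries", []))


def effective_queries(response):
    """Return the queries actually searched, or an empty list if absent."""
    return response.get("metadata", {}).get("effective_queries", [])


# Toy response shaped like the example above (contents are invented).
response = {
    "search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f",
    "summaries": [
        {"document_id": "ZODPH148", "content": "first excerpt"},
        {"document_id": "ZODPH155", "content": "second excerpt"},
    ],
    "metadata": {
        "tokens": {"prompt_tokens": 2438},
        "effective_queries": ["Vrácení DPH, daň z přidané hodnoty"],
    },
}

print(join_summaries(response, separator="\n---\n"))
print(effective_queries(response))
```

Requesting summary instead of summaries in include_content lets the service do the concatenation, so the client-side join is only needed when the individual document_id values matter as well.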