🔍 mSearch finds documents with similar meaning 🔍
Relevant endpoints:
The two endpoints listed towards the end - search and answer - allow querying multiple collections at once. When using search across multiple collections, the embedding method used by those collections should be the same, to ensure compatible relevancy scores across collections.
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "T-Shirts with animals on them"
}
Returning full content of documents is the default, as well as highlight on.
Additionally, request the "sections" format instead of flat:
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "T-Shirts with animals on them",
"max_results": 5,
"include_content": ["full"],
"result_format": "sections",
"highlight": true,
}
Instead of full document content, clients can retrieve just document excerpts that together fit into a token limit.
Replacing summaries by summary would format summaries as a single string in the response.
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "Co je to DPH",
"include_content": ["summaries"],
"summary_config": {
"max_tokens": 3000
}
}
Several types of content can be returned at once:
POST https://api.msearch.themama.ai/msearch/collections/{collection_id}/search
Authorization: CENSORED
Content-Type: application/json
{
"query": "Co je to DPH",
"include_content": ["full", summaries"],
"result_format": "sections",
"summary_config": {
"max_tokens": 3000
}
}
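The request bodies above differ only in a few optional keys, so a client can assemble them with one helper. The following Python sketch builds (but does not send) such a request using only the standard library. The function name build_search_request, the collection id and the auth_header argument are illustrative assumptions; the Authorization value is not shown in this document and has to be supplied in whatever form the service expects.

```python
import json
import urllib.request

BASE_URL = "https://api.msearch.themama.ai/msearch"


def build_search_request(collection_id, query, auth_header, **options):
    """Build a POST request for the search endpoint without sending it.

    `options` may carry the optional body keys shown in this document,
    e.g. max_results, include_content, result_format, highlight,
    summary_config. Keys set to None are left out so that the service
    defaults apply.
    """
    body = {"query": query}
    body.update({k: v for k, v in options.items() if v is not None})
    return urllib.request.Request(
        url=f"{BASE_URL}/collections/{collection_id}/search",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={
            # Assumption: the caller passes the complete header value.
            "Authorization": auth_header,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_search_request(
    "my-collection",          # hypothetical collection id
    "Co je to DPH",
    "<authorization value>",  # placeholder, not a real credential
    include_content=["full", "summaries"],
    result_format="sections",
    summary_config={"max_tokens": 3000},
)
```

Passing the resulting object to urllib.request.urlopen would send it. Building the request separately keeps the body easy to inspect and test before any network call is made.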
The query and search endpoints of mSearch aim to retrieve the documents most relevant to a user query.
This works at the level of documents as they were ingested.
The include_content parameter controls whether to return the full documents, or just document summaries, a single summary, or their combination.
Other commonly used parameters include:
max_results: controls how many documents to retrieve at most
result_format: the format of full documents, either sections or flat
include_fields: optionally specifies the document fields (section types) to return (defaults to all)
filter: optionally constrains the values of some fields; see more on Filtering
There are several types of information about the retrieved documents that mSearch can return.
document_id: unique document identifier within a collection
score: similarity or relevancy of the document to the query, in the range 0..1, used to sort the retrieved documents
source: whether the document was retrieved by semantic search, keyword search, or both (sem, key, sem-key)
documents: full document content, available in 2 formats, sections or flat, according to the result_format parameter
passages: passages of relevant textual content extracted from the retrieved document. Passages come with character offsets into a section of the retrieved document. A passage always points to one document section, and can span the whole content of that section or just part of it. A passage identifies the section it points to by its section_index property, and the offsets are into that section's text content. To include passages in the response, set the highlight parameter to true.
It is recommended to use result_format set to sections when working with passages, as the section indexes and content character offsets point to document sections, and cannot be interpreted with the flat format.
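To make the relation between passages and sections concrete, here is a minimal Python sketch that recovers a passage's text from a document in the sections format. It assumes that offsets is a [start, end] pair with an exclusive end and that section_index is zero-based (which is consistent with the response example in this document, but not stated explicitly). The helper name and the sample document are made up for illustration.

```python
def passage_text(document, passage):
    """Return the slice of section text that a passage points to.

    Assumptions: `offsets` is a [start, end] pair of character offsets
    into the section's `content` with `end` exclusive, and
    `section_index` counts sections from zero.
    """
    section = document["sections"][passage["section_index"]]
    start, end = passage["offsets"]
    return section["content"][start:end]


# Hypothetical document in the "sections" format.
document = {
    "document_id": "doc-1",
    "sections": [
        {"type": "title", "content": "Animal T-Shirts"},
        {"type": "description", "content": "Cotton shirts with cat and dog prints."},
    ],
    "passages": [
        {"section_index": 1, "offsets": [19, 37], "score": 0.5},
    ],
}

for passage in document["passages"]:
    print(passage_text(document, passage))
```

With the flat format there is no section list to index into, which is why the same lookup cannot be done there.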
Full documents can be output in 2 formats:
sections: a list of sections, each with a type (like "title", "description", or "h1") and content with the section's text. For document-oriented formats like text files, HTML, docx or PDF documents, the ordering of sections reflects the ingested document flow. For record-oriented formats like JSON, JSON lines, CSV or spreadsheet formats, the ordering reflects the original ordering of columns/properties in the ingested documents.
{
"search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f"
"documents": [
{
"document_id": "ZODPH155",
"score": 0.5110244154930115,
"source": "sem",
"passages": [
{
"text": "odnést.\n(3) Prodávající je povinen na vyžádání zahraniční fyzické osoby vystavit doklad... ",
"offsets": [
866,
1306
],
"section_index": 2,
"score": 0.47954487800598145
},
...
],
"sections": [
{
"type": "Id",
"content": "ZODPH155"
},
{
"type": "Paragraf",
"content": "§ 84 DZ"
},
{
"type": "Text",
"content": "Vracení daně fyzickým osobám...
},
...
]
}
]
}
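A client typically walks such a response to list the hits and pick out individual sections. The Python sketch below shows one way to do that; the helper names are hypothetical, the labels for the source codes simply restate the meanings given in this document, and the sample response reuses values from the example above.

```python
SOURCE_LABELS = {
    "sem": "semantic search",
    "key": "keyword search",
    "sem-key": "semantic and keyword search",
}


def section_by_type(document, section_type):
    """Return the content of the first section of the given type, or None."""
    for section in document.get("sections", []):
        if section["type"] == section_type:
            return section["content"]
    return None


def describe_hits(response):
    """Return (document_id, score, source label) tuples, best score first."""
    hits = sorted(
        response.get("documents", []),
        key=lambda doc: doc["score"],
        reverse=True,
    )
    return [
        (doc["document_id"], doc["score"],
         SOURCE_LABELS.get(doc["source"], doc["source"]))
        for doc in hits
    ]


# Abbreviated sample response in the "sections" format.
response = {
    "search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f",
    "documents": [
        {
            "document_id": "ZODPH155",
            "score": 0.5110244154930115,
            "source": "sem",
            "sections": [
                {"type": "Id", "content": "ZODPH155"},
                {"type": "Paragraf", "content": "§ 84 DZ"},
            ],
        }
    ],
}

for document_id, score, label in describe_hits(response):
    print(f"{document_id}  {score:.3f}  via {label}")
```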
In addition or as an alternative to returning whole documents and their relevant passages, mSearch can return other types of content relevant to the user query:
summaries - a list of summary texts that are excerpts from the top retrieved documents. The cumulative text length of these summaries, in tokens, is guaranteed not to exceed the summary_config.max_tokens setting, specified either in the collection configuration or directly in the summary_config included with the search request body. Each summary is an object that has its document_id set and the relevant content extracted. The summarization strategy aims to include summaries for all max_results retrieved documents, and summarizes documents progressively more aggressively when they cannot all fit into the max_tokens limit. When they cannot all fit even with the most aggressive summarization strategy, just the top retrieved documents make it into the summaries list.
summary - essentially the same as summaries, just in the form of a single text where the individual document summaries have been concatenated using the document separators as configured in summary_config.
answer - in case the user query was a question and the answer endpoint was used, this contains the answer to the question based on the summaries of the top retrieved documents. See more in the Answers topic.
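The budget-fitting behaviour described for summaries can be pictured with a toy model. The Python sketch below is not the service's actual summarization method, which this document does not specify. It only mimics the described outcome: try progressively shorter per-document excerpts until everything fits into max_tokens, and otherwise keep only the top documents. Counting tokens by whitespace splitting, the levels tuple and the function name are all assumptions made for illustration.

```python
def fit_summaries(documents, max_tokens, levels=(200, 100, 50)):
    """Toy budget fitting; `documents` is a list of (document_id, text)
    pairs in rank order.

    Each level is a per-document token cap, tried from least to most
    aggressive. Tokens are approximated by whitespace-separated words.
    """
    def truncate(text, cap):
        return " ".join(text.split()[:cap])

    def size(summary):
        return len(summary["content"].split())

    summaries = []
    for cap in levels:
        summaries = [
            {"document_id": doc_id, "content": truncate(text, cap)}
            for doc_id, text in documents
        ]
        if sum(size(s) for s in summaries) <= max_tokens:
            return summaries  # every document is represented

    # Even the most aggressive level does not fit: keep top documents only.
    kept, used = [], 0
    for summary in summaries:
        if used + size(summary) > max_tokens:
            break
        kept.append(summary)
        used += size(summary)
    return kept


# Three hypothetical documents of 300 words each.
docs = [(name, "word " * 300) for name in ("a", "b", "c")]
print([s["document_id"] for s in fit_summaries(docs, max_tokens=300)])
print([s["document_id"] for s in fit_summaries(docs, max_tokens=120)])
```

In this model a budget of 300 still represents all three documents at the second level, while a budget of 120 drops the lowest-ranked one.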
{
"search_id": "90ffb176-434d-4635-96ac-4c55817c3a2f",
"summaries": [
{
"document_id": "ZODPH148",
"content": "§80 DZ\nVracení daně osobám požívajícím výsad a imunit\n(1) Pro účely tohoto zákona ..."
},
...
],
"metadata": {
"tokens": {
"prompt_tokens": 2438
},
"original_documents_count": 5,
"summary_documents_count": 5,
"summary_method": "first_relevant_3_then_relevant_21 distance=1",
"message": "C. Represented 5 documents with total 12367 tokens by the first 2 docs abbreviated,
and the rest abbreviated even more to total 2438 tokens which is less than max 3000 configured"
}
}
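On the client side, a response like this can be turned into a single text or checked for dropped documents. The Python sketch below assumes that the two counters in metadata mean what their names suggest; the helper names and the default separator are placeholders, since the service itself uses the document separators configured in summary_config.

```python
def join_summaries(response, separator="\n\n"):
    """Concatenate the `content` of every summary into one string.

    Assumption: the default separator is a placeholder, not the
    service's configured document separator.
    """
    return separator.join(s["content"] for s in response.get("summaries", []))


def dropped_documents(response):
    """Number of retrieved documents missing from the summaries list."""
    meta = response.get("metadata", {})
    return (meta.get("original_documents_count", 0)
            - meta.get("summary_documents_count", 0))


# Abbreviated sample response with hypothetical summary contents.
response = {
    "summaries": [
        {"document_id": "ZODPH148", "content": "first excerpt"},
        {"document_id": "ZODPH155", "content": "second excerpt"},
    ],
    "metadata": {
        "tokens": {"prompt_tokens": 2438},
        "original_documents_count": 5,
        "summary_documents_count": 5,
    },
}

print(join_summaries(response, separator="\n---\n"))
print(dropped_documents(response))
```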