🔍 mSearch finds documents with similar meaning 🔍
POST /msearch/collections/{collection_id}/documents
content-type: application/json
{
  "documents": [
    {
      "document_id": "ticket_123",
      "ticket_title": "Keyboard replacement",
      "ticket_text": "Spilled coffee on my laptop and now I can't type, need help"
    },
    {
      "document_id": "ticket_456",
      "ticket_title": "Ceiling fan broken",
      "ticket_text": "The ceiling fan in my office won't stop spinning and makes a loud noise"
    }
  ]
}
POST /msearch/collections/{collection_id}/documents
content-type: application/pdf
{bytes of the pdf}
POST /msearch/collections/{collection_id}/documents
content-type: application/zip
{bytes of a zip file that contains multiple docs in various formats}
Documents in a collection consist of multiple fields that hold that document's data.
A collection defines a schema that describes the data types of those fields, such as `text`, `id`, or `enum`.
All documents in one collection are assumed to share the same schema (share the same fields).
Not all fields need to be populated for every document in a collection.
Each document, however, must have its `document_id` property set to an identifier unique within its collection.
The `document_id` is populated from a field typed `id`. There should be exactly one field typed `id` in every
collection schema; the name of that field (as well as all other fields) can be arbitrary.
When a document or documents are uploaded, their `document_id` is either specified explicitly as part of the
upload request, or is computed automatically as a hash digest of the document content.
When uploading a document whose `document_id` already exists in a collection, that document is updated.
With auto-computed document ids, uploading the same document twice will not insert another copy of that document,
but will update the existing one. It is nevertheless recommended to populate the `document_id` of uploaded
documents with meaningful identifiers, such as ids derived from uploaded file names or assigned explicitly.
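The hash algorithm mSearch uses for auto-computed ids is not specified here; the sketch below only illustrates the general idea of content-derived ids (the function name and the choice of SHA-256 are assumptions):

```python
import hashlib

def content_document_id(document_bytes: bytes) -> str:
    """Derive a deterministic document_id from raw document content.

    Illustrative only -- mSearch's actual hashing scheme is not
    documented here.
    """
    return hashlib.sha256(document_bytes).hexdigest()

# Identical bytes always produce the same id, which is why uploading
# the same document twice updates the existing document rather than
# inserting a duplicate.
id_first = content_document_id(b"Spilled coffee on my laptop")
id_second = content_document_id(b"Spilled coffee on my laptop")
assert id_first == id_second
```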
mSearch can ingest documents of various formats. These endpoints can be used to upload documents:
The first two accept single or multiple documents. Both are functionally equivalent but differ in how they are
documented in the interactive mSearch Swagger doc. The first one is documented to accept a
raw POST request body with bytes in a given format. The second is documented to accept a multipart/form-data
POST request body, allowing for the upload of multiple files of various formats and file names.
Outside the Swagger doc, both endpoints accept the same content.
The uploaded document(s) - passed as uploaded files within a multipart/form-data POST request,
or as raw bytes of a POST request body - need to be in one of the formats listed below.
Additionally, the HTTP `content-type` header should be set correctly for each input document.
When ingesting documents packaged in an archive, or when the `content-type` is missing or ambiguous,
mSearch tries to infer the `content-type` from the file extension.
mSearch can ingest file formats that fall under these three groups.
mSearch extracts text from each ingested document, but depending on the format, there may be multiple named "sources" of
text in ingested documents. For instance, spreadsheet column names are the "sources" of text for Excel files.
PDF and MS Word files are currently read so that headings (e.g. `h1`) can be distinguished from other text
(`p` for paragraph). If collection field names differ from the text sources inside ingested documents,
the sources can be mapped to each field in the collection configuration, as follows. In the example below, first- and
second-level headings are stored under the "my_heading" field.
{
  "collection_schema": {
    "fields": [
      {
        "field": "my_heading",
        "type": "text",
        "language": "cs",
        "sources": ["h1", "h2"]
      },
      ...
    ]
  }
}
The tables below describe the standardized sources of text for supported document formats. There are also some common sources that apply across all document formats:
Source | What's in the source |
---|---|
file_name | File name of each ingested file. When ingesting a zip file, this stores the path and name of each file inside the archive. |
Format | File extension | Content-type | Sources |
---|---|---|---|
JSON | json | application/json | JSON keys |
JSON lines | jsonl | application/x-jsonlines | JSON keys |
MS Excel | xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Column names |
CSV | csv | text/csv | Column names |
Format | File extension | Content-type | Sources |
---|---|---|---|
Adobe PDF | pdf | application/pdf | p, h1 |
Rich Text Format | rtf | application/rtf | p |
MS Word | docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | p, h1, h2, h3 |
MS PowerPoint | pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | p |
Downloaded email | eml | application/octet-stream; name="xxx.eml" | subject, from, to, cc, bcc, body |
HTML | html | text/html | p, title, h1-h6, div, pre |
Text | txt | text/plain | p, title, h1 |
When ingesting documents from archives, sources of text depend on each file in the archive.
Format | File extension | Content-type |
---|---|---|
Zip archive | zip | application/zip |
mSearch uses a "mapping" to read various types of content to collection schema fields.
The default mapping used by mSearch for record-based formats is "identity". For instance, a column name of an input CSV file automatically populates a collection field of the same name, if it exists. The same holds for the names of JSON object properties when reading JSON files.
Document-oriented formats also honor the default "identity" mapping.
So if a collection schema contains fields named `h1` or `p`, those get populated
with the corresponding sections of input docx and pdf documents having those styles.
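As a minimal sketch of the "identity" mapping for record-based formats (the helper below is illustrative, not mSearch's actual implementation), reading a CSV with Python's standard library:

```python
import csv
import io

def ingest_csv_identity(csv_text: str, schema_fields: set[str]) -> list[dict]:
    """Identity mapping sketch: each CSV column populates the collection
    field of the same name, if the schema defines such a field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        # Columns without a matching schema field are simply dropped here.
        documents.append({col: val for col, val in row.items() if col in schema_fields})
    return documents

data = "document_id,ticket_title\nticket_123,Keyboard replacement\n"
print(ingest_csv_identity(data, {"document_id", "ticket_title"}))
# -> [{'document_id': 'ticket_123', 'ticket_title': 'Keyboard replacement'}]
```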
Each field defined by the collection schema can explicitly list the `sources` that
should populate it when ingesting documents. For instance, a single collection
field "content" might source its values from document sections styled `p` or `h1`
by setting its sources to `["p", "h1"]`.
A field can collect all the content of an input document by setting its `sources` to `["*"]`;
this can be set for at most a single field.
Sources can also be prefixed with document formats, narrowing their scope to, e.g., just
docx documents. For instance, `"sources": ["docx.p", "docx.h1"]` means the
containing field will collect paragraphs and first-level headings only from Microsoft Word documents.
A simple JSON mapping can optionally be provided with every document upload request.
This is a flat JSON file uploaded as part of a multipart POST request under the name
`_field_map`, alongside the document or documents actually uploaded, which comprise
the remaining parts of the request. The field map is a flat string-to-string mapping
object that links the type of content in a file to the name of the field it should populate.
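For illustration, such a multipart body might be assembled as below. This is a hand-rolled sketch; real clients would typically let an HTTP library build the multipart body, and the example field map `{"h1": "my_heading"}` is an assumption:

```python
import json

def multipart_with_field_map(boundary: str, field_map: dict[str, str],
                             file_name: str, file_bytes: bytes,
                             file_type: str) -> bytes:
    """Assemble a multipart/form-data body whose first part is the
    _field_map JSON and whose second part is the uploaded document."""
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="_field_map"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(field_map)}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_name}"; filename="{file_name}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    )
    return head.encode() + file_bytes + f"\r\n--{boundary}--\r\n".encode()

body = multipart_with_field_map("XBOUNDARY", {"h1": "my_heading"},
                                "report.pdf", b"%PDF-1.4 ...", "application/pdf")
```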
mSearch only processes the first sheet in an Excel file.
Data is expected to come in "one document per row" fashion.
The first row of the sheet contains column names.
Those column names should either correspond to the collection's field names, or they should be
mapped to field names using the `sources` property on the target fields they should populate.
POST /msearch/collections/{collection_id}/documents
content-type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
{bytes of the xlsx file whose column names match the names of collection schema fields}
A document that already exists in a collection can be updated (overwritten) as follows.
POST /msearch/collections/{collection_id}/documents/{doc_id}
content-type: application/pdf
{bytes of the updated pdf}
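The update request above could be constructed with Python's standard library as follows (the request is built but not sent, and the base URL and identifiers are placeholders):

```python
from urllib import request

def build_update_request(base_url: str, collection_id: str, doc_id: str,
                         pdf_bytes: bytes) -> request.Request:
    """Build the POST that overwrites an existing document, mirroring
    the raw HTTP example above."""
    url = f"{base_url}/msearch/collections/{collection_id}/documents/{doc_id}"
    return request.Request(url, data=pdf_bytes,
                           headers={"Content-Type": "application/pdf"},
                           method="POST")

req = build_update_request("https://msearch.example.com", "tickets",
                           "ticket_123", b"%PDF-1.4 ...")
# req.full_url and req.get_method() give the endpoint and POST;
# urllib.request.urlopen(req) would actually send it.
```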
Document(s) POSTed directly to the `/documents` endpoint will also update existing documents if the uploaded
documents contain their document ids and documents with those ids already exist in the collection.
The example below shows how documents can be uploaded where some fields of those documents come from structured JSON, like document metadata, and other fields may be populated by the text coming from one or more textual files. This can be used to couple personal metadata with CVs in PDF form, or to build a collection of stories with their full texts attached, as shown below.
POST /msearch/collections/{collection_id}/documents
Content-Type: multipart/form-data; boundary=----SomeUniqueBoundary123
------SomeUniqueBoundary123
Content-Disposition: form-data; name="document"; filename="pohadky.json"
Content-Type: application/json
{
  "documents": [
    {
      "doc_id": "fairy_tale_1",
      "heading": "Pohádka o Budulínkovi",
      "category": "klasika",
      "minimální věk": "3",
      "_references": {
        "pohadka-budulinek.pdf": {
          "field_map": {
            "*": "content"
          }
        }
      }
    },
    {
      "doc_id": "fairy_tale_2",
      "heading": "O Klekánici",
      "category": ["o strašidlech", "děsivé pohádky", "horory"],
      "minimální věk": "5",
      "_references": {
        "pohadka-klekanice-part1.pdf": {
          "field_map": {
            "*": "content"
          }
        },
        "pohadka-klekanice-part2.pdf": {
          "field_map": {
            "*": "content"
          }
        }
      }
    }
  ]
}
------SomeUniqueBoundary123
Content-Disposition: form-data; name="pohadka-budulinek.pdf"; filename="pohadka-budulinek.pdf"
Content-Type: application/pdf
{bytes of the pdf}
------SomeUniqueBoundary123
Content-Disposition: form-data; name="pohadka-klekanice-part1.pdf"; filename="pohadka-klekanice-part1.pdf"
Content-Type: application/pdf
{bytes of the pdf}
------SomeUniqueBoundary123
Content-Disposition: form-data; name="pohadka-klekanice-part2.pdf"; filename="pohadka-klekanice-part2.pdf"
Content-Type: application/pdf
{bytes of the pdf}
------SomeUniqueBoundary123