Ingesting documents with the mSearch API

🔍 mSearch finds documents with similar meaning 🔍

Ingesting documents

Simple Examples

Upload 2 structured documents from JSON

POST /msearch/collections/{collection_id}/documents
content-type: application/json

{
  "documents": [
    {
      "document_id": "ticket_123",
      "ticket_title": "Keyboard replacement",
      "ticket_text": "Spilled coffee on my laptop and now I can't type, need help"
    },
    {
      "document_id": "ticket_456",
      "ticket_title": "Ceiling fan broken",
      "ticket_text": "The ceiling fan in my office won't stop, makes loud noise"
    }
  ]
}

Upload 1 textual document

POST /msearch/collections/{collection_id}/documents
content-type: application/pdf

{bytes of the pdf} 

Upload several textual documents

POST /msearch/collections/{collection_id}/documents
content-type: application/zip

{bytes of a zip file that contains multiple docs in various formats} 

How ingestion works

Documents in a collection consist of multiple fields that hold the document's data. A collection defines a schema describing the data types of those fields, such as text, id, or enum. All documents in a collection are assumed to share the same schema (the same fields), although not every field needs to be populated for every document.

Each document must, however, have its document_id property set to an identifier unique within its collection. The document_id is populated from a field typed id. Every collection schema should contain exactly one id-typed field; the name of that field (like the name of every other field) can be arbitrary.
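The one-id-field rule can be expressed as a small validation check. The check below is illustrative, not part of the API, and the field names are made up:

```python
def validate_schema(fields):
    """Check that a collection schema has exactly one field of type 'id'.

    `fields` mirrors the "fields" list of a collection schema, e.g.
    [{"field": "ticket_id", "type": "id"}, ...] (field names are hypothetical).
    """
    id_fields = [f["field"] for f in fields if f["type"] == "id"]
    if len(id_fields) != 1:
        raise ValueError(
            f"expected exactly one id-typed field, found {len(id_fields)}: {id_fields}"
        )
    return id_fields[0]  # this is the field that populates document_id

schema = [
    {"field": "ticket_id", "type": "id"},
    {"field": "ticket_title", "type": "text"},
    {"field": "ticket_text", "type": "text"},
]
id_field = validate_schema(schema)  # -> "ticket_id"
```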

When a document or documents are uploaded, their document_id is either specified explicitly as part of the upload request, or computed automatically as a hash digest of the document content. Uploading a document whose document_id already exists in a collection updates that document. With auto-computed document ids, uploading the same document twice therefore does not insert a second copy; it updates the existing one. It is nevertheless recommended to populate document_id with meaningful identifiers, such as ids derived from uploaded file names or assigned explicitly.
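The deduplicating effect of auto-computed ids can be sketched as follows. Note that the exact hash algorithm is not specified above, so the SHA-256 digest here is only an assumption for illustration:

```python
import hashlib
import json

def auto_document_id(content: bytes) -> str:
    # Assumption: the text above only says "a hash digest of document
    # content"; a SHA-256 hex digest is used here purely for illustration.
    return hashlib.sha256(content).hexdigest()

doc = json.dumps({"ticket_title": "Keyboard replacement"}).encode()
first_upload_id = auto_document_id(doc)
second_upload_id = auto_document_id(doc)
# Identical content -> identical id -> the second upload updates rather than duplicates.
```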

Endpoints and formats

mSearch can ingest documents of various formats. These endpoints can be used to upload documents:

The first two endpoints accept single or multiple documents and are functionally equivalent; they differ only in how they are documented in the interactive mSearch Swagger doc. The first is documented to accept a raw POST request body with bytes in a given format. The second is documented to accept a multipart/form-data POST request body, allowing multiple files of various formats and file names to be uploaded at once. Outside the Swagger doc, both endpoints accept the same content.

The uploaded document(s) - passed as uploaded files within a multipart/form-data POST request, or as raw bytes of a POST request body - need to be in one of the formats listed below. Additionally, the HTTP content-type header should be set correctly for each input document. When ingesting documents packaged in an archive, or when the content-type is not given or ambiguous, mSearch tries to infer content-type from file extension.
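The extension-based fallback might be sketched like this, using Python's standard mimetypes module plus an extension from the format tables below. The actual inference logic mSearch uses is not documented here:

```python
import mimetypes

# Extensions from the format tables below that the stdlib table may lack.
EXTRA_TYPES = {
    ".jsonl": "application/x-jsonlines",
}

def infer_content_type(filename):
    """Guess a content-type from the file extension, as a fallback for
    archive members or requests with a missing or ambiguous content-type."""
    for ext, ctype in EXTRA_TYPES.items():
        if filename.lower().endswith(ext):
            return ctype
    guessed, _ = mimetypes.guess_type(filename)
    return guessed

infer_content_type("report.pdf")    # -> "application/pdf"
infer_content_type("data.jsonl")    # -> "application/x-jsonlines"
```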

Supported document formats

mSearch can ingest file formats that fall under the three groups below. mSearch extracts text from each ingested document, but depending on the format there may be multiple named "sources" of text within a document. For instance, spreadsheet column names are the "sources" of text for Excel files. PDF and MS Word files are currently read so that headings (e.g. h1) can be distinguished from other text (p for paragraphs). If collection field names differ from the text sources inside ingested documents, the sources can be mapped to fields in the collection configuration. In the example below, first- and second-level headings are stored under the "my_heading" field.

{
    "collection_schema": {
        "fields": [
            {
                "field": "my_heading",
                "type": "text",
                "language": "cs",
                "sources": ["h1", "h2"]
            },
            ...
        ]
    }
}

The tables below describe the standardized sources of text for supported document formats. There are also some common sources that apply across all document formats:

Source What's in the source
file_name File name of each ingested file. When ingesting a zip file, this stores the path and name of each file inside the archive.

Record-based formats

Format File extension Content-type Sources
JSON json application/json JSON keys
JSON lines jsonl application/x-jsonlines JSON keys
MS Excel xlsx application/vnd.openxmlformats-officedocument.spreadsheetml.sheet Column names
CSV csv text/csv Column names

Document-oriented formats

Format File extension Content-type Sources
Adobe PDF pdf application/pdf p, h1
Rich Text Format rtf application/rtf p
MS Word docx application/vnd.openxmlformats-officedocument.wordprocessingml.document p, h1, h2, h3
MS PowerPoint pptx application/vnd.openxmlformats-officedocument.presentationml.presentation p
downloaded email eml application/octet-stream; name="xxx.eml" subject, from, to, cc, bcc, body
HTML html text/html p, title, h1-h6, div, pre
Text txt text/plain p, title, h1

Archives

When ingesting documents from archives, sources of text depend on each file in the archive.

Format File extension Content-type
Zip archive zip application/zip

Mapping file content to collection fields

mSearch uses a "mapping" to route the various types of content in ingested files into collection schema fields.

1. The default mapping

The default mapping used by mSearch for record-based formats is "identity". For instance, a column name of an input CSV file automatically populates a collection field of the same name, if it exists. The same holds for the names of JSON object properties when reading JSON files.

Document-oriented formats also honor the default "identity" mapping. So if a collection schema contains fields named h1 or p, those get populated with the corresponding sections of input docx and pdf documents having those styles.
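A minimal sketch of the identity mapping for a record-based format such as CSV. The function and field names are illustrative, not mSearch's actual code:

```python
import csv
import io

def identity_map(row, collection_fields):
    """Default mapping: each source populates the collection field of the
    same name; sources without a matching field are ignored."""
    return {k: v for k, v in row.items() if k in collection_fields}

csv_data = "ticket_title,ticket_text,internal_note\nKeyboard replacement,Spilled coffee,ignore me\n"
collection_fields = {"ticket_title", "ticket_text"}  # made-up schema fields
row = next(csv.DictReader(io.StringIO(csv_data)))
document = identity_map(row, collection_fields)
# -> {"ticket_title": "Keyboard replacement", "ticket_text": "Spilled coffee"}
```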

2. Providing a custom mapping as part of collection schema

Each field defined by the collection schema can explicitly list the sources that should populate it when ingesting documents. For instance, a single collection field "content" might source its values from document sections styled as either p or h1, i.e. "sources": ["p", "h1"]. A field can collect all the content of an input document by setting its sources to ["*"]; at most one field per schema may use this catch-all.

Sources can also be prefixed with document formats, narrowing their scope to e.g. just docx documents. For instance, "sources": ["docx.p", "docx.h1"] means the containing field will collect paragraphs and first-level headings only from Microsoft Word documents.
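The scoping rules described above can be sketched as a small matcher. This is an illustrative interpretation, not mSearch's actual implementation:

```python
def source_matches(configured, fmt, source):
    """Decide whether a configured source entry applies to content of a
    given document format and source name. '*' matches everything,
    'docx.p' matches p-sections of docx files only, and a bare 'p'
    matches p-sections of any format."""
    if configured == "*":
        return True
    if "." in configured:
        want_fmt, want_source = configured.split(".", 1)
        return fmt == want_fmt and source == want_source
    return source == configured

source_matches("docx.p", "docx", "p")   # -> True
source_matches("docx.p", "pdf", "p")    # -> False (wrong format)
source_matches("p", "pdf", "p")         # -> True (unprefixed = any format)
```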

3. Providing a custom mapping with every upload request

A simple JSON mapping can optionally be provided with every document upload request. It is a flat JSON file uploaded under the name _field_map as one part of a multipart/form-data POST request, alongside the document or documents being uploaded, which make up the remaining parts. The field map is a flat string-to-string object that links a type of content in a file to the name of the collection field it should populate.
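A sketch of assembling such a request body with only the Python standard library. The part names and the _field_map filename are assumptions based on the description above:

```python
import json
import uuid

def build_upload_body(field_map, files):
    """Assemble a multipart/form-data body carrying the optional
    _field_map part alongside the uploaded documents.
    `files` maps file name -> (content_type, bytes)."""
    boundary = uuid.uuid4().hex
    parts = []

    def add_part(name, filename, content_type, payload):
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
             f"Content-Type: {content_type}\r\n\r\n").encode() + payload + b"\r\n"
        )

    add_part("_field_map", "_field_map.json", "application/json",
             json.dumps(field_map).encode())
    for fname, (ctype, data) in files.items():
        add_part("document", fname, ctype, data)
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, content_type = build_upload_body(
    {"Column A": "my_heading"},                      # content type -> field name
    {"data.csv": ("text/csv", b"Column A\nhello\n")},
)
```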

Advanced examples

Upload 300 structured documents from Excel rows

mSearch only processes the first sheet in an Excel file. Data is expected to come in "one document per row" fashion. The first row of the sheet contains column names. Those column names should either correspond to the collection's field names, or they should be mapped to field names using the sources property on the target fields they should populate.

POST /msearch/collections/{collection_id}/documents
content-type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

{bytes of the xlsx file whose column names match the names of collection schema fields} 
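The "one document per row" conversion described above can be sketched as follows; mSearch performs this server-side, so the code is purely illustrative:

```python
def rows_to_documents(rows):
    """First row holds column names; every following row becomes one
    document, mirroring the 'one document per row' convention above."""
    header, *data = rows
    return {"documents": [dict(zip(header, row)) for row in data]}

rows = [
    ["document_id", "ticket_title"],
    ["ticket_123", "Keyboard replacement"],
    ["ticket_456", "Ceiling fan broken"],
]
payload = rows_to_documents(rows)
# payload holds 2 documents ready for POST /msearch/collections/{collection_id}/documents
```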

Update an existing document

A document that already exists in a collection can be updated (overwritten) as follows.

POST /msearch/collections/{collection_id}/documents/{doc_id}
content-type: application/pdf

{bytes of the updated pdf} 

Documents POSTed directly to the /documents endpoint also update existing documents, provided the uploaded documents carry document ids and documents with those ids already exist in the collection.
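The resulting upsert semantics can be illustrated with a toy in-memory store:

```python
def upsert(store, documents):
    """Toy in-memory model: a document whose document_id already exists
    overwrites the stored copy; unknown ids are inserted."""
    for doc in documents:
        store[doc["document_id"]] = doc
    return store

store = {}
upsert(store, [{"document_id": "ticket_123", "ticket_title": "Keybord replacement"}])
upsert(store, [{"document_id": "ticket_123", "ticket_title": "Keyboard replacement"}])
# still one document; the second upload replaced the first
```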

Upload documents whose fields come from both JSON and PDF

The example below shows how documents can be uploaded where some fields of those documents come from structured JSON, like document metadata, and other fields may be populated by the text coming from one or more textual files. This can be used to couple personal metadata with CVs in PDF form, or to build a collection of stories with their full texts attached, as shown below.

POST /msearch/collections/{collection_id}/documents
Content-Type: multipart/form-data; boundary=----SomeUniqueBoundary123

------SomeUniqueBoundary123
Content-Disposition: form-data; name="document"; filename="pohadky.json"
Content-Type: application/json

{
  "documents": [
    {
      "doc_id": "fairy_tale_1",
      "heading": "Pohádka o Budulínkovi",
      "category": "klasika",
      "minimální věk": "3",
      "_references": {
        "pohadka-budulinek.pdf": {
          "field_map": {
            "*": "content"
          }
        }
      }
    },
    {
      "doc_id": "fairy_tale_2",
      "heading": "O Klekánici",
      "category": ["o strašidlech", "děsivé pohádky", "horory"],
      "minimální věk": "5",
      "_references": {
        "pohadka-klekanice-part1.pdf": {
          "field_map": {
            "*": "content"
          }
        },
        "pohadka-klekanice-part2.pdf": {
          "field_map": {
            "*": "content"
          }
        }
      }
    }
  ]
}

------SomeUniqueBoundary123
Content-Disposition: form-data; name="pohadka-budulinek.pdf"; filename="pohadka-budulinek.pdf"
Content-Type: application/pdf

{bytes of the pdf}

------SomeUniqueBoundary123
Content-Disposition: form-data; name="pohadka-klekanice-part1.pdf"; filename="pohadka-klekanice-part1.pdf"
Content-Type: application/pdf

{bytes of the pdf}

------SomeUniqueBoundary123
Content-Disposition: form-data; name="pohadka-klekanice-part2.pdf"; filename="pohadka-klekanice-part2.pdf"
Content-Type: application/pdf

{bytes of the pdf}

------SomeUniqueBoundary123