Document Stores

You can think of the Document Store as a "database" that:

stores your texts and metadata
provides them to the Retriever at query time

By far the most common way to use a Document Store in Haystack is to fetch documents from it using a Retriever. You pass the Document Store as an argument when you initialize the Retriever. Note that the Retriever functions as a Node, while a Document Store does not.

Initialisation

Initialising a new DocumentStore within Haystack is straightforward.

Input Format

DocumentStores expect Documents in dictionary form, like the one below. They are loaded using the DocumentStore.write_documents() method. See PreProcessor for more information on the cleaning and splitting steps that will help you maximize Haystack's performance.

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
dicts = [
    {
        'content': DOCUMENT_TEXT_HERE,
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]
document_store.write_documents(dicts)
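
As a minimal sketch of preparing this input, the helper below turns (name, text) pairs into dictionaries with the 'content' and 'meta' keys shown above. The helper name to_document_dicts and the sample texts are assumptions made for illustration; they are not part of Haystack.

```python
from typing import Dict, Iterable, List, Tuple


def to_document_dicts(named_texts: Iterable[Tuple[str, str]]) -> List[Dict]:
    """Build dictionaries with the 'content' and 'meta' keys shown above.

    Hypothetical helper for illustration, not a Haystack function.
    """
    dicts = []
    for name, text in named_texts:
        text = text.strip()
        if not text:
            # Skip empty documents; there is nothing to index.
            continue
        dicts.append({"content": text, "meta": {"name": name}})
    return dicts


dicts = to_document_dicts([
    ("intro.txt", "Haystack is a framework for building search systems."),
    ("empty.txt", "   "),
])
```

The resulting list can then be passed to write_documents() in the same way as the hand-written list above.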

Writing Documents (Sparse Retrievers)

Haystack lets you store documents in an optimised fashion so that query times can be kept low. For sparse, keyword-based retrievers such as BM25 and TF-IDF, you simply have to call DocumentStore.write_documents(). The inverted index, which optimises querying speed, is created automatically.

document_store.write_documents(dicts)
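
To give a rough idea of what an inverted index is, here is a toy sketch in plain Python. It is not how Haystack or Elasticsearch implement indexing: the function names, the whitespace tokenisation and the sample documents are all assumptions, and it only performs a boolean lookup without the scoring that BM25 or TF-IDF would add.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercased token to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def lookup(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of documents that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Start from the postings of the first token, then intersect with the rest.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = [
    "Berlin is the capital of Germany",
    "Paris is the capital of France",
    "Berlin has many museums",
]
index = build_inverted_index(docs)
matches = lookup(index, "berlin capital")
```

Because the index maps tokens to document ids ahead of time, a query only touches the entries for its own tokens instead of scanning every document.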

Writing Documents (Dense Retrievers)

For dense, neural-network-based retrievers such as Dense Passage Retrieval or Embedding Retrieval, indexing involves computing the Document embeddings, which are later compared against the Query embedding.

The storing of the text is handled by DocumentStore.write_documents() and the computation of the embeddings is started by DocumentStore.update_embeddings().

document_store.write_documents(dicts)
document_store.update_embeddings(retriever)

This step is computationally intensive because it runs the transformer-based encoders over every document. GPU acceleration will speed it up significantly.
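
The two-step pattern (store the text first, then compute embeddings that are compared against the query embedding) can be sketched in plain Python. Everything here is an assumption made for illustration: toy_embed is a stand-in for a transformer encoder, the vocabulary is made up, and the store is a plain list rather than a Haystack DocumentStore.

```python
import math
from typing import Dict, List

# Toy vocabulary, an assumption for this sketch only.
VOCAB = ["berlin", "paris", "capital", "museum"]


def toy_embed(text: str) -> List[float]:
    """Stand-in for a neural encoder: a unit-length count vector over VOCAB."""
    tokens = text.lower().split()
    vec = [float(tokens.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def toy_update_embeddings(store: List[Dict]) -> None:
    """Index time: compute and attach an embedding to every stored document."""
    for doc in store:
        doc["embedding"] = toy_embed(doc["content"])


def toy_query(store: List[Dict], text: str) -> str:
    """Query time: return the document whose embedding has the highest dot product."""
    q = toy_embed(text)
    best = max(store, key=lambda d: sum(a * b for a, b in zip(q, d["embedding"])))
    return best["content"]


store = [
    {"content": "berlin is the capital of germany"},
    {"content": "paris has a famous museum"},
]
toy_update_embeddings(store)
top = toy_query(store, "which museum is in paris")
```

In this sketch the expensive part is the loop that embeds every document, which mirrors why the embedding step is done once at indexing time rather than on every query.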

Choosing the Right Document Store

The Document Stores have different characteristics. You should choose one depending on the maturity of your project, the use case and technical environment:

Our Recommendations

Restricted environment: Use the InMemoryDocumentStore if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that complicates running Elasticsearch or other databases.

Allrounder: Use the ElasticsearchDocumentStore if you want to evaluate the performance of different retrieval options (dense vs. sparse) and are aiming for a smooth transition from PoC to production.

Vector Specialist: Use the MilvusDocumentStore if you want to focus on dense retrieval and possibly deal with larger datasets.

Working with Existing Databases

If you have an existing Elasticsearch or OpenSearch database with indexed documents, you can quickly create a Haystack-compliant version of it using the elasticsearch_index_to_document_store or open_search_index_to_document_store function.

from haystack.document_stores.utils import elasticsearch_index_to_document_store

new_ds = elasticsearch_index_to_document_store(
    document_store=empty_document_store,
    original_content_field="content",
    original_index_name="document",
    original_name_field="title",
    preprocessor=preprocessor,
    port=9201,
    verify_certs=False,
    scheme="https",
    username="admin",
    password="admin"
)

from haystack.document_stores.utils import open_search_index_to_document_store

new_ds = open_search_index_to_document_store(
    document_store=empty_document_store,
    original_content_field="content",
    original_index_name="document",
    original_name_field="title",
    preprocessor=preprocessor,
    port=9201,
    verify_certs=False,
    scheme="https",
    username="admin",
    password="admin"
)
