Documents, Answers and Labels

In Haystack, there are a handful of core classes that are regularly used in many different places. These are classes that carry data through the system. Users will likely interact with these as either the input or output of their pipeline.

Document

The Document class is the primary data object in Haystack. It stores textual, tabular or image data along with its id and metadata. It may also contain information created in the pipeline including the confidence score of the model that retrieved it or the embedding created for it during indexing.

Documents can be written into DocumentStores via DocumentStore.write_documents(). They are also returned by Retriever and Ranker Nodes or Pipelines containing those Nodes.

Attributes

class Document:
    content: Union[str, pd.DataFrame]
    content_type: Literal["text", "table", "image"]
    id: str
    meta: Dict[str, Any]
    score: Optional[float] = None
    embedding: Optional[np.ndarray] = None
    id_hash_keys: Optional[List[str]] = None

You can find examples of these attributes below.

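from haystack.pipelines import DocumentSearchPipeline

# Assumes `retriever` is an already initialized Retriever node.
pipeline = DocumentSearchPipeline(retriever)
results = pipeline.run("Arya Stark father")
document = results["documents"][0]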

type(document)
# <class 'haystack.schema.Document'>

document.content
# " ===On the Kingsroad=== City Watchmen search the caravan for Gendry but are turned away by Yoren."

document.id
# 'a4d2cc51d351b785c6effddd3345bb39'

document.meta
# {'name': '224_The_Night_Lands.txt'}

document.embedding
# array([9.25596313e+61, 1.00000000e+00 ...])

document.score
# 0.7827358902378247

For a more detailed description of these fields, have a look at the Primitives API documentation.

Conversion

A Document object can be converted into, or initialized from, either a dictionary or a JSON string.

document_dict = document.to_dict()
document_object = Document.from_dict(document_dict)

document_json = document.to_json()
document_object = Document.from_json(document_json)
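
The round trip above can be illustrated without Haystack installed. The following is a minimal sketch only: SimpleDocument is a hypothetical, text-only stand-in with a handful of the fields listed under Attributes, and it is not how haystack.schema.Document is implemented.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SimpleDocument:
    """Hypothetical, text-only stand-in for haystack.schema.Document."""

    content: str
    id: str
    content_type: str = "text"
    meta: Dict[str, Any] = field(default_factory=dict)
    score: Optional[float] = None

    def to_dict(self) -> Dict[str, Any]:
        # asdict copies nested containers such as `meta`.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SimpleDocument":
        return cls(**data)

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_json(cls, data: str) -> "SimpleDocument":
        return cls.from_dict(json.loads(data))


doc = SimpleDocument(
    content="City Watchmen search the caravan for Gendry.",
    id="a4d2cc51d351b785c6effddd3345bb39",
    meta={"name": "224_The_Night_Lands.txt"},
)

# Both round trips should reproduce an equal object.
restored_from_dict = SimpleDocument.from_dict(doc.to_dict())
restored_from_json = SimpleDocument.from_json(doc.to_json())
```

The stand-in leaves out the embedding field on purpose: a NumPy array cannot be passed to json.dumps directly, so a real implementation has to convert it (for example to a list) before serializing.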

Answer

The Answer class contains all the information about the prediction made by a Reader model or a pipeline with a Reader model at the end. In it, you will find the answer string, the model's confidence score, the context around the answer, the document id, and its metadata. You will also find the start and end offsets of the answer string relative to both the full document text and the context window.

Answer objects are returned by Reader and Generator Nodes or Pipelines containing those Nodes.
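
To make the two kinds of offsets concrete, here is a small sketch using plain strings. The Span dataclass is a stand-in for Haystack's own class, the document text and context window are made up for illustration, and the end offset is assumed to be exclusive, which matches the example output on this page where a six-character answer spans 72 to 78.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for the Span class described on this page."""

    start: int
    end: int


# Illustrative text, not taken from a real document store.
document_text = (
    "Arya Stark is the third child of Lord Eddard Stark and Lady Catelyn. "
    "She travels with her father, Eddard, to King's Landing when he is made Hand of the King."
)
answer = "Eddard"

# Hypothetical context window: a 60-character slice of the document.
window_start = document_text.index("She travels")
context = document_text[window_start:window_start + 60]

# Offsets measured from the start of the full document (end is exclusive).
doc_start = document_text.index(answer, window_start)
offsets_in_document = Span(doc_start, doc_start + len(answer))

# The same answer, measured from the start of the context window.
offsets_in_context = Span(
    offsets_in_document.start - window_start,
    offsets_in_document.end - window_start,
)

# Slicing with either pair of offsets recovers the answer string.
document_slice = document_text[offsets_in_document.start:offsets_in_document.end]
context_slice = context[offsets_in_context.start:offsets_in_context.end]
```

The two spans differ only by the position at which the context window starts inside the document.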

Attributes

class Answer:
    answer: str
    type: Literal["generative", "extractive", "other"] = "extractive"
    score: Optional[float] = None
    context: Optional[Union[str, pd.DataFrame]] = None
    offsets_in_document: Optional[List[Span]] = None
    offsets_in_context: Optional[List[Span]] = None
    document_id: Optional[str] = None
    meta: Optional[Dict[str, Any]] = None

You can find examples of these attributes below.

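from haystack.pipelines import ExtractiveQAPipeline

# Assumes `reader` and `retriever` are already initialized nodes.
pipeline = ExtractiveQAPipeline(reader, retriever)
result = pipeline.run("Who is the father of Arya Stark?")
answer = result["answers"][0]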

type(answer)
# <class 'haystack.schema.Answer'>

answer.answer
# 'Eddard'

answer.score
# 0.9946763813495636

answer.context
# 'She travels with her father, Eddard, to King\'s Landing when he is...'

answer.offsets_in_context
# [Span(start=72, end=78)]

answer.offsets_in_document
# [Span(start=147, end=153)]

answer.document_id
# 'ba2a8e87ddd95e380bec55983ee7d55f'

For a more detailed description of these fields, have a look at the Primitives API documentation.

Conversion

An Answer object can be converted into, or initialized from, either a dictionary or a JSON string.

answer_dict = answer.to_dict()
answer_object = Answer.from_dict(answer_dict)

answer_json = answer.to_json()
answer_object = Answer.from_json(answer_json)

Label

The Label class contains all the information relevant to one document retrieval or question answering annotation. It is generally used for evaluation and can be fetched from a document store via document_store.get_all_labels(). However, in scenarios where there may be more than one annotation per query, you will want to use document_store.get_all_labels_aggregated(). This returns a list of MultiLabel objects, each of which contains a list of Labels.

Labels are returned when the DocumentStore.get_all_labels() method is called on a DocumentStore containing labelled data. They also need to be supplied in Evaluation Pipelines where each sample has one annotation.

Attributes

class Label:
    id: str
    query: str
    document: Document
    is_correct_answer: bool
    is_correct_document: bool
    origin: Literal["user-feedback", "gold-label"]
    answer: Optional[Answer] = None
    no_answer: Optional[bool] = None
    pipeline_id: Optional[str] = None
    created_at: Optional[str] = None
    updated_at: Optional[str] = None
    meta: Optional[dict] = None

You can find examples of these attributes below.

multi_labels = document_store.get_all_labels_aggregated()
label = multi_labels[0].labels[0]

type(label)
# <class 'haystack.schema.Label'>

label.query
# 'who is written in the book of life'

label.id
# '47413f49-012a-4258-b897-9196c6ad525e'

label.document
# <Document: id=fcc5f12d8d1f8f57c34e1a3dc574913f-0, content='Book of Life...'>

label.answer
# <Answer: answer='every  person who is destined for Heaven or the World to Come', score=0.0, context='Book of Life - wikipedia Book of Life Jump to: nav...'>

label.no_answer
# False

For a more detailed description of these fields, have a look at the Primitives API documentation.

MultiLabel

There are often multiple Labels associated with a single query. For example, there can be multiple annotated answers for one question or multiple documents containing the information you want for a query. This class groups them together and provides some extra functionality when working with multiple annotations.

MultiLabel objects are returned when the DocumentStore.get_all_labels_aggregated() method is called on a DocumentStore containing labelled data. They also need to be supplied in Evaluation Pipelines where each sample might have multiple annotations.

Attributes

class MultiLabel:
    labels: List[Label]
    drop_negative_labels: bool = False
    drop_no_answers: bool = False

For a more detailed description of these fields, have a look at the Primitives API documentation.

When document_store.get_all_labels_aggregated() is called, a list of MultiLabel objects is returned:

multi_labels = document_store.get_all_labels_aggregated()
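
The idea of grouping several annotations under one query can be sketched in plain Python. SimpleLabel, SimpleMultiLabel and aggregate_by_query are hypothetical names used only for this illustration, and the labels are made-up sample data. The real get_all_labels_aggregated() method may apply further grouping criteria and filters that are not modelled here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SimpleLabel:
    """Hypothetical stand-in holding only two Label fields."""

    query: str
    answer: str


@dataclass
class SimpleMultiLabel:
    """Hypothetical stand-in: all labels that share one query."""

    query: str
    labels: List[SimpleLabel]


def aggregate_by_query(labels: List[SimpleLabel]) -> List[SimpleMultiLabel]:
    """Group labels that share the same query string."""
    grouped: Dict[str, List[SimpleLabel]] = defaultdict(list)
    for label in labels:
        grouped[label.query].append(label)
    # Dicts preserve insertion order, so queries keep their first-seen order.
    return [SimpleMultiLabel(query=q, labels=ls) for q, ls in grouped.items()]


# Illustrative annotations: the second query has two labels.
labels = [
    SimpleLabel("who is written in the book of life", "every person who is destined for Heaven"),
    SimpleLabel("who is the father of Arya Stark", "Eddard"),
    SimpleLabel("who is the father of Arya Stark", "Eddard Stark"),
]
multi_labels = aggregate_by_query(labels)
```

Each SimpleMultiLabel then exposes its annotations through a labels list, mirroring the multi_labels[0].labels[0] access pattern used in the Label section.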

Span

This class contains the indices of the start and end of a span. In extractive QA, these are the character indices of where an answer starts and ends. In table QA, these are the start and end indices of the answer cells, counted from the top left to the bottom right of the table.

Span objects are found within Answer objects.

Attributes

class Span:
    start: int
    end: int
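
For table QA, counting cells from top left to bottom right amounts to a flat index over the table. The sketch below shows one way to map such a span back to row and column positions. It assumes zero-based, row-by-row counting, an exclusive end index, and that the header row is not counted. None of these assumptions is confirmed by this page, so check them against the Primitives API documentation. The cells_in_span helper and the sample table are hypothetical.

```python
from typing import List, Tuple


def cells_in_span(start: int, end: int, n_columns: int) -> List[Tuple[int, int]]:
    """Map flat, row-major cell indices in [start, end) to (row, column) pairs."""
    return [divmod(index, n_columns) for index in range(start, end)]


# Illustrative table body with 2 rows and 3 columns (header row omitted).
table = [
    ["Eddard", "Stark", "Winterfell"],
    ["Robert", "Baratheon", "Storm's End"],
]

# Under the assumptions above, flat index 4 is the second row, second column.
cells = cells_in_span(4, 5, n_columns=3)
values = [table[row][column] for row, column in cells]
```

Dividing the flat index by the number of columns gives the row, and the remainder gives the column.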
