Documents, Answers and Labels
In Haystack, there are a handful of core classes that are regularly used in many different places. These are classes that carry data through the system. Users will likely interact with these as either the input or output of their pipeline.
Document
The Document class is the primary data object in Haystack. It stores textual, tabular or image data along with its id and metadata. It may also contain information created in the pipeline including the confidence score of the model that retrieved it or the embedding created for it during indexing.
Documents can be written into DocumentStores via DocumentStore.write_documents().
They are also returned by Retriever and Ranker Nodes or Pipelines containing those Nodes.
Attributes
class Document:
    content: Union[str, pd.DataFrame]
    content_type: Literal["text", "table", "image"]
    id: str
    meta: Dict[str, Any]
    score: Optional[float] = None
    embedding: Optional[np.ndarray] = None
    id_hash_keys: Optional[List[str]] = None
You can find examples of these attributes below.
pipeline = DocumentSearchPipeline(retriever)
results = pipeline.run("Arya Stark father")
document = results["documents"][0]
type(document)
# <class 'haystack.schema.Document'>
document.content
# " ===On the Kingsroad=== City Watchmen search the caravan for Gendry but are turned away by Yoren."
document.id
# 'a4d2cc51d351b785c6effddd3345bb39'
document.meta
# {'name': '224_The_Night_Lands.txt'}
document.embedding
# 'array([9.25596313e+61, 1.00000000e+00 ...])'
document.score
# 0.7827358902378247
For a more detailed description of these fields, see the Primitives API documentation.
Conversion
A Document object can be converted into, or initialized from, either a dictionary or a JSON string.
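# Convert to a dictionary and back; from_dict is called on the class.
document_dict = document.to_dict()
document_object = Document.from_dict(document_dict)
# Convert to a JSON string and back.
document_json = document.to_json()
document_object = Document.from_json(document_json)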
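The round trip described above can be tried without a Haystack installation. The following is a minimal sketch built on a simplified stand-in class: `SketchDocument` is hypothetical and only borrows the method names `to_dict`, `from_dict`, `to_json` and `from_json`. It is not Haystack's implementation and it ignores tabular content and embeddings.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document holding text content only."""

    content: str
    id: str
    content_type: str = "text"
    meta: Dict[str, Any] = field(default_factory=dict)
    score: Optional[float] = None

    def to_dict(self) -> Dict[str, Any]:
        # One dictionary key per attribute.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchDocument":
        # Rebuild the object from a dictionary produced by to_dict().
        return cls(**data)

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_json(cls, data: str) -> "SketchDocument":
        return cls.from_dict(json.loads(data))


original = SketchDocument(
    content="City Watchmen search the caravan for Gendry.",
    id="a4d2cc51d351b785c6effddd3345bb39",
    meta={"name": "224_The_Night_Lands.txt"},
)

# Both round trips give back an object equal to the original.
from_dict_copy = SketchDocument.from_dict(original.to_dict())
from_json_copy = SketchDocument.from_json(original.to_json())
```

In this sketch the constructors are class methods, so they are called on the class rather than on an existing instance.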
Answer
The Answer class contains all the information about a prediction made by a Reader model, or by a pipeline that ends in a Reader model. It holds the answer string, the model's confidence score, the context around the answer, the document id and its metadata. It also holds the start and end offsets of the answer string, relative both to the full document text and to the context window.
Answer objects are returned by Reader and Generator Nodes or Pipelines containing those Nodes.
Attributes
class Answer:
    answer: str
    type: Literal["generative", "extractive", "other"] = "extractive"
    score: Optional[float] = None
    context: Optional[Union[str, pd.DataFrame]] = None
    offsets_in_document: Optional[List[Span]] = None
    offsets_in_context: Optional[List[Span]] = None
    document_id: Optional[str] = None
    meta: Optional[Dict[str, Any]] = None
You can find examples of these attributes below.
pipeline = ExtractiveQAPipeline(reader, retriever)
result = pipeline.run("Who is the father of Arya Stark?")
answer = result["answers"][0]
type(answer)
# <class 'haystack.schema.Answer'>
answer.answer
# 'Eddard'
answer.score
# 0.9946763813495636
answer.context
# 'She travels with her father, Eddard, to King\'s Landing when he is...'
answer.offsets_in_context
# [Span(start=72, end=78)]
answer.offsets_in_document
# [Span(start=147, end=153)]
answer.document_id
# 'ba2a8e87ddd95e380bec55983ee7d55f'
For a more detailed description of these fields, see the Primitives API documentation.
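The two kinds of offsets can be related to each other with plain string slicing. The sketch below uses an illustrative text and a stand-in `Span` dataclass with the same two fields as the Span class on this page. It assumes that the context is a contiguous window cut out of the document and that `end` is exclusive, which matches the example values above (78 - 72 equals the length of 'Eddard').

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in with the same two fields as the Span class on this page."""

    start: int
    end: int


# Illustrative text, not taken from a real document store.
document_text = (
    "Arya Stark is a character in the series. "
    "She travels with her father, Eddard, to King's Landing."
)
answer = "Eddard"

# Assumption: the context is a contiguous window cut out of the document.
context_start = document_text.index("She travels")
context = document_text[context_start:]

# Offsets measured from the start of the full document text.
doc_start = document_text.index(answer)
offsets_in_document = Span(doc_start, doc_start + len(answer))

# The same answer measured from the start of the context window.
offsets_in_context = Span(
    offsets_in_document.start - context_start,
    offsets_in_document.end - context_start,
)

# Slicing with either pair of offsets recovers the answer string.
answer_from_document = document_text[offsets_in_document.start:offsets_in_document.end]
answer_from_context = context[offsets_in_context.start:offsets_in_context.end]
```

Under these assumptions the two spans always differ by the position at which the context window starts in the document.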
Conversion
An Answer object can be converted into, or initialized from, either a dictionary or a JSON string.
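# Convert to a dictionary and back; from_dict is called on the class.
answer_dict = answer.to_dict()
answer_object = Answer.from_dict(answer_dict)
# Convert to a JSON string and back.
answer_json = answer.to_json()
answer_object = Answer.from_json(answer_json)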
Label
The Label class contains all the information relevant to one document retrieval or question answering annotation. It is generally used for evaluation and can be fetched from a document store via document_store.get_all_labels(). However, in scenarios where there may be more than one annotation per query, use document_store.get_all_labels_aggregated() instead. This returns a list of MultiLabel objects, each of which contains a list of Labels.
Labels are returned when the DocumentStore.get_all_labels() method is called on a DocumentStore containing labelled data.
They also need to be supplied in Evaluation Pipelines where each sample has one annotation.
Attributes
class Label:
    id: str
    query: str
    document: Document
    is_correct_answer: bool
    is_correct_document: bool
    origin: Literal["user-feedback", "gold-label"]
    answer: Optional[Answer] = None
    no_answer: Optional[bool] = None
    pipeline_id: Optional[str] = None
    created_at: Optional[str] = None
    updated_at: Optional[str] = None
    meta: Optional[dict] = None
You can find examples of these attributes below.
multi_labels = document_store.get_all_labels_aggregated()
label = multi_labels[0].labels[0]
type(label)
# <class 'haystack.schema.Label'>
label.query
# 'who is written in the book of life'
label.id
# '47413f49-012a-4258-b897-9196c6ad525e'
label.document
# <Document: id=fcc5f12d8d1f8f57c34e1a3dc574913f-0, content='Book of Life...'>
label.answer
# <Answer: answer='every person who is destined for Heaven or the World to Come', score=0.0, context='Book of Life - wikipedia Book of Life Jump to: nav...'>
label.no_answer
# False
For a more detailed description of these fields, see the Primitives API documentation.
MultiLabel
There are often multiple Labels associated with a single query. For example, there can be multiple annotated answers for one question or multiple documents containing the information you want for a query. This class groups them together and provides some extra functionality when working with multiple annotations.
MultiLabel objects are returned when the DocumentStore.get_all_labels_aggregated() method is called on a DocumentStore containing labelled data.
They also need to be supplied in Evaluation Pipelines where each sample might have multiple annotations.
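The grouping idea can be sketched in a few lines of plain Python. `SketchLabel`, `SketchMultiLabel` and `aggregate_by_query` are hypothetical names introduced for illustration, and the second answer string is made up. The real aggregation in a DocumentStore may group on more than the query string, so treat this only as a picture of the concept.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SketchLabel:
    """Hypothetical, reduced stand-in for a Label (query and answer text only)."""

    query: str
    answer: str


@dataclass
class SketchMultiLabel:
    """Hypothetical stand-in for a MultiLabel: all labels of one query."""

    query: str
    labels: List[SketchLabel]


def aggregate_by_query(labels: List[SketchLabel]) -> List[SketchMultiLabel]:
    grouped: Dict[str, List[SketchLabel]] = defaultdict(list)
    for label in labels:
        grouped[label.query].append(label)
    # Dictionaries keep insertion order, so queries appear in first-seen order.
    return [SketchMultiLabel(query, group) for query, group in grouped.items()]


labels = [
    SketchLabel("Who is the father of Arya Stark?", "Eddard"),
    SketchLabel("Who is the father of Arya Stark?", "Ned Stark"),
    SketchLabel(
        "who is written in the book of life",
        "every person who is destined for Heaven or the World to Come",
    ),
]
multi_labels = aggregate_by_query(labels)
```

Here the two annotations for the first question end up in one group, while the second question forms a group of its own.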
Attributes
class MultiLabel:
    labels: List[Label]
    drop_negative_labels=False
    drop_no_answers=False
For a more detailed description of these fields, see the Primitives API documentation.
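The two boolean options are not explained on this page. The sketch below shows one plausible reading, based only on their names: `drop_negative_labels` removes labels marked as incorrect, and `drop_no_answers` removes labels flagged as having no answer. Both the interpretation and the helper `filter_labels` are assumptions, not Haystack's actual logic, so check the Primitives API documentation before relying on it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchLabel:
    """Hypothetical stand-in carrying only the fields used for filtering."""

    query: str
    is_correct_answer: bool
    is_correct_document: bool
    no_answer: Optional[bool] = None


def filter_labels(
    labels: List[SketchLabel],
    drop_negative_labels: bool = False,
    drop_no_answers: bool = False,
) -> List[SketchLabel]:
    kept = list(labels)
    if drop_negative_labels:
        # Assumption: a label is negative if its answer or document is marked incorrect.
        kept = [
            label
            for label in kept
            if label.is_correct_answer and label.is_correct_document
        ]
    if drop_no_answers:
        # Assumption: drop labels whose no_answer flag is set to True.
        kept = [label for label in kept if not label.no_answer]
    return kept


labels = [
    SketchLabel("q", is_correct_answer=True, is_correct_document=True, no_answer=False),
    SketchLabel("q", is_correct_answer=False, is_correct_document=True, no_answer=False),
    SketchLabel("q", is_correct_answer=True, is_correct_document=True, no_answer=True),
]

unfiltered = filter_labels(labels)
positives_only = filter_labels(labels, drop_negative_labels=True)
with_answers_only = filter_labels(labels, drop_no_answers=True)
strict = filter_labels(labels, drop_negative_labels=True, drop_no_answers=True)
```

With both options left at their default of False, every label is kept.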
When document_store.get_all_labels_aggregated() is called, a list of MultiLabel objects is returned:
multi_labels = document_store.get_all_labels_aggregated()
Span
This class contains the start and end indices of a span. In extractive QA, these are the character indices at which an answer starts and ends. In table QA, these are the start and end indices of the answer cells, counted from the top left to the bottom right of the table.
Span objects are found within Answer objects.
Attributes
class Span:
    start: int
    end: int
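For table QA, a flat cell index has to be mapped back to a row and a column. The sketch below shows one way to do that, under three assumptions that this page does not state: cells are counted row by row, indices are zero-based, and `end` is exclusive. The helper `cells_in_span` and the small table are made up for illustration, and a list of lists stands in for a DataFrame.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Span:
    """Stand-in with the same two fields as the Span class on this page."""

    start: int
    end: int


def cells_in_span(span: Span, n_columns: int) -> List[Tuple[int, int]]:
    """Map flat cell indices to (row, column) pairs.

    Assumptions: row-by-row counting, zero-based indices, end exclusive.
    """
    return [divmod(index, n_columns) for index in range(span.start, span.end)]


# Illustrative table with three rows and two columns.
table = [
    ["Eddard", "Stark"],
    ["Arya", "Stark"],
    ["Jon", "Snow"],
]

span = Span(start=2, end=3)
cells = cells_in_span(span, n_columns=len(table[0]))
values = [table[row][col] for row, col in cells]
```

In this example the flat index 2 lands on the first cell of the second row, which holds "Arya".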