
Module export_utils

print_answers

def print_answers(results: dict,
                  details: str = "all",
                  max_text_len: Optional[int] = None)

Utility function to print the results of Haystack pipelines.

Arguments:

  • results: Results that the pipeline returned.
  • details: Defines the level of details to print. Possible values: minimum, medium, all.
  • max_text_len: Specifies the maximum allowed length for a text field. If you don't want to shorten the text, set this value to None.

Returns:

None
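
For reference, a minimal usage sketch. It assumes a query pipeline pipe (for example an ExtractiveQAPipeline) that was built and populated elsewhere; the query string and top_k values are placeholders.

from haystack.utils import print_answers

# `pipe` is assumed to be a query pipeline (e.g. ExtractiveQAPipeline) built elsewhere.
results = pipe.run(
    query="Who is the father of Arya Stark?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)

# Print only the answer strings, truncating each text field to 100 characters.
print_answers(results, details="minimum", max_text_len=100)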

print_documents

def print_documents(results: dict,
                    max_text_len: Optional[int] = None,
                    print_name: bool = True,
                    print_meta: bool = False)

Utility that prints a compressed representation of the documents returned by a pipeline.

Arguments:

  • results: The results the pipeline returned.
  • max_text_len: Shorten the document's content to a maximum number of characters. When set to None, the document is not shortened.
  • print_name: Whether to print the document's name from the metadata.
  • print_meta: Whether to print the document's metadata.
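
A usage sketch along the same lines, assuming a DocumentSearchPipeline doc_pipe built elsewhere; the query and top_k value are placeholders.

from haystack.utils import print_documents

# `doc_pipe` is assumed to be a DocumentSearchPipeline built elsewhere.
results = doc_pipe.run(query="climate change", params={"Retriever": {"top_k": 5}})

# Show each document's name and metadata, shortening the content to 200 characters.
print_documents(results, max_text_len=200, print_name=True, print_meta=True)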

print_questions

def print_questions(results: dict)

Utility to print the output of a question-generating pipeline in a readable format.
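
A sketch assuming a QuestionGenerationPipeline qg_pipe and a populated document_store, both created elsewhere.

from haystack.utils import print_questions

# `qg_pipe` is assumed to be a QuestionGenerationPipeline, `document_store` a populated store.
for document in document_store.get_all_documents():
    results = qg_pipe.run(documents=[document])
    print_questions(results)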

export_answers_to_csv

def export_answers_to_csv(agg_results: list, output_file)

Exports answers coming from finder.get_answers() to a CSV file.

Arguments:

  • agg_results: A list of predictions coming from finder.get_answers().
  • output_file: The name of the output file.

Returns:

None
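
A sketch of exporting predictions; it assumes results is a prediction dict (or list of dicts) carrying the "query" and "answers" fields this function expects, and the output filename is a placeholder.

from haystack.utils import export_answers_to_csv

# `results` is assumed to hold QA predictions with "query" and "answers" keys,
# e.g. the output of a QA pipeline's run() call.
export_answers_to_csv(agg_results=results, output_file="answers.csv")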

convert_labels_to_squad

def convert_labels_to_squad(labels_file: str)

Convert the export from the labeling UI to the SQuAD format for training.

Arguments:

  • labels_file: The path to the file containing labels.

Module preprocessing

convert_files_to_docs

def convert_files_to_docs(
        dir_path: str,
        clean_func: Optional[Callable] = None,
        split_paragraphs: bool = False,
        encoding: Optional[str] = None,
        id_hash_keys: Optional[List[str]] = None) -> List[Document]

Convert all files (.txt, .pdf, .docx) in the sub-directories of the given path to Documents that can be written to a Document Store.

Arguments:

  • dir_path: The path of the directory containing the files.
  • clean_func: A custom cleaning function that gets applied to each Document (input: str, output: str).
  • split_paragraphs: Whether to split text by paragraph.
  • encoding: Character encoding to use when converting PDF documents.
  • id_hash_keys: A list of Document attribute names from which the Document ID should be hashed. Useful for generating unique IDs even if the Document contents are identical. To ensure you don't have duplicate Documents in your Document Store if texts are not unique, you can modify the metadata and pass ["content", "meta"] to this field. If you do this, the Document ID will be generated by using the content and the defined metadata.
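
A usage sketch; the directory path is a placeholder, and clean_wiki_text is the cleaning helper shipped with Haystack.

from haystack.utils import clean_wiki_text, convert_files_to_docs

# Convert every supported file under "data/docs" (placeholder path) into Document objects.
docs = convert_files_to_docs(
    dir_path="data/docs",
    clean_func=clean_wiki_text,
    split_paragraphs=True,
)

# The resulting Documents can then be written to a Document Store:
# document_store.write_documents(docs)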

tika_convert_files_to_docs

def tika_convert_files_to_docs(
        dir_path: str,
        clean_func: Optional[Callable] = None,
        split_paragraphs: bool = False,
        merge_short: bool = True,
        merge_lowercase: bool = True,
        id_hash_keys: Optional[List[str]] = None) -> List[Document]

Convert all files (.txt, .pdf) in the sub-directories of the given path to Documents that can be written to a Document Store.

Arguments:

  • merge_lowercase: Whether to convert merged paragraphs to lowercase.
  • merge_short: Whether to allow merging of short paragraphs.
  • dir_path: The path to the directory containing the files.
  • clean_func: A custom cleaning function that gets applied to each Document (input: str, output: str).
  • split_paragraphs: Whether to split text by paragraphs.
  • id_hash_keys: A list of Document attribute names from which the Document ID should be hashed. Useful for generating unique IDs even if the Document contents are identical. To ensure you don't have duplicate Documents in your Document Store if texts are not unique, you can modify the metadata and pass ["content", "meta"] to this field. If you do this, the Document ID will be generated by using the content and the defined metadata.
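
A comparable sketch for the Tika-based converter; it assumes a running Apache Tika server is reachable, and the directory path is a placeholder.

from haystack.utils import tika_convert_files_to_docs

# Requires a running Apache Tika server to parse the files.
docs = tika_convert_files_to_docs(
    dir_path="data/docs",
    split_paragraphs=True,
    merge_short=True,
    merge_lowercase=True,
)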

Module squad_data

SquadData

class SquadData()

This class is designed to manipulate data that is in the SQuAD format.

SquadData.__init__

def __init__(squad_data)

Arguments:

  • squad_data: SQuAD format data, either as a dictionary with a data key, or just a list of SQuAD documents.
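
A construction sketch; the filename is a placeholder and the import assumes SquadData is exposed from haystack.utils.

import json

from haystack.utils import SquadData

# Wrap an existing SQuAD-format file; SquadData accepts either {"data": [...]} or the bare list.
with open("dev-v2.0.json", "r", encoding="utf-8") as f:
    squad = SquadData(squad_data=json.load(f))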

SquadData.merge_from_file

def merge_from_file(filename: str)

Merge the contents of a JSON file in the SQuAD format with the data stored in this object.

SquadData.merge

def merge(new_data: List)

Merge data in SQuAD format with the data stored in this object.

Arguments:

  • new_data: A list of SQuAD document data.

SquadData.from_file

@classmethod
def from_file(cls, filename: str)

Create a SquadData object by providing the name of a JSON file in the SQuAD format.

SquadData.save

def save(filename: str)

Write the data stored in this object to a JSON file.
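
A round-trip sketch combining from_file, merge_from_file, and save; all filenames are placeholders.

from haystack.utils import SquadData

squad = SquadData.from_file("train-v2.0.json")      # load a SQuAD JSON file
squad.merge_from_file("custom_annotations.json")    # merge another SQuAD JSON file into it
squad.save("train_merged.json")                     # write the combined data back to disk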

SquadData.to_document_objs

def to_document_objs()

Export all paragraphs stored in this object to haystack.Document objects.

SquadData.to_label_objs

def to_label_objs()

Export all labels stored in this object to haystack.Label objects.
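
A sketch of feeding the converted objects into a Document Store; squad is a SquadData object and document_store is assumed to be any Haystack Document Store that supports labels.

docs = squad.to_document_objs()    # one haystack.Document per paragraph
labels = squad.to_label_objs()     # one haystack.Label per annotation

document_store.write_documents(docs)
document_store.write_labels(labels)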

SquadData.to_df

@staticmethod
def to_df(data)

Convert a list of SQuAD document dictionaries into a pandas dataframe (each row is one annotation).

SquadData.count

def count(unit="questions")

Count the samples in the data. Choose a unit: "paragraphs", "questions", "answers", "no_answers", "span_answers".
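
A brief sketch of counting at different granularities; the filename is a placeholder.

from haystack.utils import SquadData

squad = SquadData.from_file("dev-v2.0.json")
print(squad.count(unit="paragraphs"))
print(squad.count(unit="questions"))
print(squad.count(unit="no_answers"))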

SquadData.df_to_data

@classmethod
def df_to_data(cls, df)

Convert a data frame into the SQuAD format data (list of SQuAD document dictionaries).

SquadData.sample_questions

def sample_questions(n)

Return a sample of n questions in the SQuAD format (a list of SQuAD document dictionaries). Note that if the same question is asked on multiple different passages, this function treats that as a single question.

SquadData.get_all_paragraphs

def get_all_paragraphs()

Return all paragraph strings.

SquadData.get_all_questions

def get_all_questions()

Return all question strings. Note that if the same question appears for different paragraphs, this function returns it multiple times.

SquadData.get_all_document_titles

def get_all_document_titles()

Return all document title strings.

Module early_stopping

EarlyStopping

class EarlyStopping()

An object you can use to control early stopping with a Node's train() method or a Trainer class. You can use a custom EarlyStopping class instead, as long as it implements the check_stopping() method and provides the save_dir attribute.

EarlyStopping.__init__

def __init__(head: int = 0,
             metric: Union[str, Callable] = "loss",
             save_dir: Optional[str] = None,
             mode: Literal["min", "max"] = "min",
             patience: int = 0,
             min_delta: float = 0.001,
             min_evals: int = 0)

Arguments:

  • head: The index of the prediction head that you are evaluating to determine the chosen metric. In Haystack, the large majority of the models are trained from the loss signal of a single prediction head so the default value of 0 should work in most cases.
  • save_dir: The directory where to save the final best model. If you set it to None, the model is not saved.
  • metric: The name of a dev set metric to monitor (default: loss) which is extracted from the prediction head specified by the variable head, or a function that extracts a value from the trainer dev evaluation result. For FARMReader training, some available metrics to choose from are "EM", "f1", and "top_n_accuracy". For DensePassageRetriever training, some available metrics to choose from are "acc", "f1", and "average_rank". NOTE: This is different from the metric that is specified in the Processor which defines how to calculate one or more evaluation metric values from the prediction and target sets. The metric variable in this function specifies the name of one particular metric value, or it is a method to calculate a value from the result returned by the Processor metric.
  • mode: When set to "min", training stops if the metric does not continue to decrease. When set to "max", training stops if the metric does not continue to increase.
  • patience: How many evaluations with no improvement to perform before stopping training.
  • min_delta: Minimum difference to the previous best value to count as an improvement.
  • min_evals: Minimum number of evaluations to perform before checking that the evaluation metric is improving.
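
A training sketch, under the assumption that your reader's train() method accepts this object through an early_stopping argument; the model name, paths, and filenames are placeholders.

from haystack.nodes import FARMReader
from haystack.utils.early_stopping import EarlyStopping

# Stop training once top_n_accuracy stops improving for 3 consecutive evaluations.
early_stopping = EarlyStopping(
    metric="top_n_accuracy",
    mode="max",                  # higher is better for this metric
    save_dir="models/best_reader",
    patience=3,
    min_delta=0.001,
)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
reader.train(
    data_dir="data/squad",
    train_filename="train.json",
    dev_filename="dev.json",
    early_stopping=early_stopping,
)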

EarlyStopping.check_stopping

def check_stopping(eval_result: List[Dict]) -> Tuple[bool, bool, float]

Provide the result of the current evaluation and check whether training should stop.

This saves the model if you provided self.save_dir when initializing EarlyStopping.

Arguments:

  • eval_result: The current evaluation result which consists of a list of dictionaries, one for each prediction head. Each dictionary contains the metrics and reports generated during evaluation.

Returns:

A tuple (stopprocessing, savemodel, eval_value) indicating whether processing should be stopped, whether the current model should be saved, and the evaluation value that was used.
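
A sketch of how a custom training loop might consume the return value; eval_result stands in for the per-head evaluation dictionaries produced by your evaluator, and model.save() is a hypothetical call for persisting the best model.

stop_processing, save_model, eval_value = early_stopping.check_stopping(eval_result)

if save_model and early_stopping.save_dir:
    model.save(early_stopping.save_dir)   # hypothetical: persist the current best model
if stop_processing:
    print(f"Early stopping triggered at metric value {eval_value:.4f}")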