File Converters

Use File Converters to extract text from files in different formats and cast it into the unified Document format.


Position in a Pipeline	Either at the very beginning of an indexing Pipeline or after a File Classifier
Input	File name
Output	Documents
Classes	PDFToTextConverter, PDFToTextOCRConverter, DocxToTextConverter, AzureConverter, ImageToTextConverter, MarkdownConverter, ParsrConverter, TikaConverter, TextConverter

Tutorial: To see an example of file converters in a pipeline, see the advanced indexing tutorial.
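
For a quick impression of where a converter sits, here is a minimal sketch of an indexing Pipeline that starts with a TextConverter. It assumes Haystack 1.x is installed and that the default settings of TextConverter, PreProcessor, and InMemoryDocumentStore suit your data. The helper names build_indexing_pipeline and index_files are made up for this illustration and are not part of Haystack.

```python
from pathlib import Path
from typing import List


def build_indexing_pipeline():
    """Return an indexing pipeline whose first node is a TextConverter."""
    # Haystack is imported inside the function so that this sketch can be
    # loaded even in an environment where Haystack is not installed.
    from haystack import Pipeline
    from haystack.document_stores import InMemoryDocumentStore
    from haystack.nodes import PreProcessor, TextConverter

    pipeline = Pipeline()
    # "File" is the root input of an indexing pipeline, so the converter
    # is the very first node and receives the file paths.
    pipeline.add_node(component=TextConverter(), name="TextConverter", inputs=["File"])
    # The converter outputs Documents, which the following nodes consume.
    pipeline.add_node(component=PreProcessor(), name="PreProcessor", inputs=["TextConverter"])
    pipeline.add_node(component=InMemoryDocumentStore(), name="DocumentStore", inputs=["PreProcessor"])
    return pipeline


def index_files(file_paths: List[Path]) -> None:
    """Convert, preprocess, and store the given text files."""
    pipeline = build_indexing_pipeline()
    pipeline.run(file_paths=[str(path) for path in file_paths])
```

To place the converter after a File Classifier instead, you would connect it to one of the classifier's outputs rather than to "File".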

File Converter Classes

Here's what each type of file converter can do:

PDFToTextConverter: Extracts text from a PDF file using the pdftotext library.
PDFToTextOCRConverter: Extracts text from PDF files that contain images using the pytesseract library.
DocxToTextConverter: Extracts text from .docx files.
AzureConverter: Extracts text and tables from files in the following formats: PDF, JPEG, PNG, BMP, and TIFF. Uses Microsoft Azure's Form Recognizer. To use this converter, you must have an active Azure account and a Form Recognizer or Cognitive Services resource. For more information, see Form Recognizer.
ImageToTextConverter: Extracts text from image files using the pytesseract library.
MarkdownConverter: Converts markdown to plain text.
ParsrConverter: Extracts text and tables from PDF and .docx files using the open-source Parsr by axa-group.
TikaConverter: Converts files into Documents using Apache Tika.
TextConverter: Preprocesses text files and returns documents.
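
One way to read the list above is as a mapping from file type to converter class. The sketch below illustrates that idea with plain Python and no Haystack imports. The lookup table, the pick_converter_name helper, the image extensions chosen, and the fallback to TikaConverter are all assumptions made for this example, not Haystack behavior.

```python
from pathlib import Path

# Hypothetical lookup table: file suffix -> name of a converter class
# from the list above. The image suffixes are illustrative examples only.
CONVERTER_BY_SUFFIX = {
    ".pdf": "PDFToTextConverter",
    ".docx": "DocxToTextConverter",
    ".md": "MarkdownConverter",
    ".txt": "TextConverter",
    ".png": "ImageToTextConverter",
    ".jpg": "ImageToTextConverter",
    ".jpeg": "ImageToTextConverter",
}


def pick_converter_name(file_path: str, default: str = "TikaConverter") -> str:
    """Return the name of a converter class suited to the file's extension.

    Unknown extensions fall back to `default` (an assumption of this sketch).
    """
    suffix = Path(file_path).suffix.lower()
    return CONVERTER_BY_SUFFIX.get(suffix, default)
```

For scanned PDFs you would pick PDFToTextOCRConverter instead, which an extension-based lookup like this one cannot detect on its own.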

Haystack also has a convert_files_to_docs() utility function that converts all .txt or .pdf files in a given directory into Documents.

from haystack.utils.preprocessing import convert_files_to_docs
doc_dir = "data/my_docs"  # example path; point this at the directory with your .txt or .pdf files
docs = convert_files_to_docs(dir_path=doc_dir)
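
Before converting a whole directory, it can help to preview which files have one of the extensions mentioned above. The following is a minimal sketch using only the standard library. It is not how Haystack implements convert_files_to_docs(); the list_convertible_files helper is hypothetical, and searching subdirectories recursively is an assumption of this example.

```python
from pathlib import Path
from typing import List

# Extensions named in the description of convert_files_to_docs() above.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def list_convertible_files(dir_path: str) -> List[Path]:
    """Return the .txt and .pdf files under dir_path, sorted by path."""
    return sorted(
        path
        for path in Path(dir_path).rglob("*")  # recursive search (assumption)
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Calling list_convertible_files("data/my_docs") returns Path objects that you can inspect or count before running the actual conversion.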
