File Classifier

A File Classifier distinguishes between text, PDF, Markdown, Docx and HTML files and routes them to the appropriate File Converter in an indexing pipeline.

Position in a Pipeline: At the very beginning of an indexing pipeline
Input: File name
Output: File name (routed)
Classes: FileTypeClassifier

Usage

By default, the FileTypeClassifier has 5 outgoing edges. It routes each incoming file through one of these edges to a File Converter, which then converts the file into Documents.

These are the default outgoing edges of the File Classifier:

| Outgoing Edge | File Type |
|---------------|-----------|
| 1             | Text      |
| 2             | PDF       |
| 3             | Markdown  |
| 4             | Docx      |
| 5             | HTML      |
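
The edge numbering above can be pictured as a lookup from file extension to edge index. The following is a minimal sketch of that idea, not the library's implementation: the extension list (`txt`, `pdf`, `md`, `docx`, `html`) and the helper name `edge_for` are assumptions made for illustration, while the `output_N` naming follows the edge names used in the pipeline example on this page.

```python
from pathlib import Path

# Assumed extensions, listed in the same order as the edges in the table above.
# This list is illustrative; it is not read from the library.
ASSUMED_EXTENSIONS = ["txt", "pdf", "md", "docx", "html"]


def edge_for(file_path: str) -> str:
    """Return the name of the outgoing edge a file would be routed to (hypothetical helper)."""
    # Take the extension without the leading dot, ignoring case.
    extension = Path(file_path).suffix.lower().lstrip(".")
    if extension not in ASSUMED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {file_path!r}")
    # Edges are numbered from 1, list positions from 0.
    return f"output_{ASSUMED_EXTENSIONS.index(extension) + 1}"
```

Under these assumptions, a PDF file maps to the second edge, which is the one the PDF converter is attached to in the example pipeline.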

Note: The FileTypeClassifier works best when you pass in one file per Pipeline.run() call. If you pass multiple files of different formats into Pipeline.run() or Pipeline.run_batch(), the FileTypeClassifier returns an error.
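
One way to respect this constraint is to group files by extension before indexing, so that no single call mixes formats. The sketch below uses only the standard library; `group_by_extension` is a hypothetical helper, and the commented-out call assumes a pipeline object named `indexing_pipeline` whose `run()` accepts a `file_paths` argument.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(file_paths):
    """Group paths by lower-cased extension so each group holds a single format."""
    groups = defaultdict(list)
    for file_path in file_paths:
        extension = Path(file_path).suffix.lower().lstrip(".")
        groups[extension].append(file_path)
    return dict(groups)


batches = group_by_extension(["a.pdf", "b.txt", "c.pdf"])

# Each group could then be indexed separately, for example:
# for paths in batches.values():
#     indexing_pipeline.run(file_paths=paths)  # hypothetical pipeline object
```

Passing files one at a time, as the note recommends, remains the simplest option; grouping is only useful when you want fewer calls.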

To use a FileTypeClassifier in an indexing pipeline, run:

from haystack.pipelines import Pipeline
from haystack.nodes import TextConverter, FileTypeClassifier, PDFToTextConverter, MarkdownConverter, DocxToTextConverter, PreProcessor
file_type_classifier = FileTypeClassifier()
text_converter = TextConverter()
pdf_converter = PDFToTextConverter()
md_converter = MarkdownConverter()
docx_converter = DocxToTextConverter()
preprocessor = PreProcessor()
# This is an indexing pipeline
p = Pipeline()
p.add_node(component=file_type_classifier, name="FileTypeClassifier", inputs=["File"])
p.add_node(component=text_converter, name="TextConverter", inputs=["FileTypeClassifier.output_1"])
p.add_node(component=pdf_converter, name="PdfConverter", inputs=["FileTypeClassifier.output_2"])
p.add_node(component=md_converter, name="MarkdownConverter", inputs=["FileTypeClassifier.output_3"])
p.add_node(component=docx_converter, name="DocxConverter", inputs=["FileTypeClassifier.output_4"])
# The PreProcessor receives Documents from whichever converter handled the file
p.add_node(
    component=preprocessor,
    name="Preprocessor",
    inputs=["TextConverter", "PdfConverter", "MarkdownConverter", "DocxConverter"],
)