Extract File Node

The Extract File node reads a file received as a fileRef and returns text or structured data, depending on the selected extraction type.


Overview

Property | Value
Type | file-extract
Category | Files
Color | 🟤 Gold (#C89F65)
Input | in
Output | out

When to Use

Use this node when the flow needs to:

  • read TXT or Markdown as text;
  • validate or reuse JSON content;
  • transform CSV into structured rows;
  • read Excel spreadsheets;
  • extract text from PDF;
  • use OCR as a fallback when the PDF is image-based.

Configuration

Field | Type | Description
File | fileRef | Input file
Extraction type | selection | TXT, JSON, CSV, Excel, or PDF

The remaining fields depend on the selected extraction type.

TXT

Field | Description
Encoding | Encoding used to read the text, such as utf8

JSON

Field | Description
Encoding | Encoding used to read the file

The content must be valid JSON.
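
To illustrate what "valid JSON" means in practice, here is a minimal Python sketch that decodes file bytes with a chosen encoding and parses them. The helper name `extract_json` is hypothetical, and the returned keys only mirror the node's `json` and `text` outputs; this is not QANode's implementation.

```python
import json


def extract_json(raw: bytes, encoding: str = "utf-8") -> dict:
    """Decode raw file bytes and parse them as JSON.

    Hypothetical helper: the returned keys mirror the node's `json`
    and `text` outputs, but this is not QANode's actual code.
    """
    text = raw.decode(encoding)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        # Invalid content is reported instead of being passed downstream.
        raise ValueError(f"Content is not valid JSON: {exc.msg}") from exc
    return {"json": parsed, "text": text}


result = extract_json(b'{"email": "ana@example.com", "active": true}')
print(result["json"]["email"])
```

Checking a file this way before it enters the flow can help explain a failed extraction, for example a trailing comma or single-quoted keys.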

CSV

Field | Description
Delimiter | Column separator, such as , or ;
Has header | Uses the first line as column names
Encoding | File encoding
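
The sketch below shows, in Python, how these three fields could map onto the `rows`, `columns`, and `rowCount` outputs. It is an approximation under stated assumptions: `extract_csv` is a hypothetical helper, and the generated names used when there is no header (`column1`, `column2`, ...) are invented for illustration, since the source does not say how QANode names columns in that case.

```python
import csv
import io


def extract_csv(raw: bytes, delimiter: str = ",", has_header: bool = True,
                encoding: str = "utf-8") -> dict:
    """Parse CSV bytes into rows, columns and a row count (hypothetical helper)."""
    stream = io.StringIO(raw.decode(encoding), newline="")
    records = [record for record in csv.reader(stream, delimiter=delimiter) if record]
    if not records:
        return {"rows": [], "columns": [], "rowCount": 0}
    if has_header:
        # The first line supplies the column names.
        columns, data = records[0], records[1:]
    else:
        # Assumption: without a header, columns get generated names.
        columns = [f"column{i + 1}" for i in range(len(records[0]))]
        data = records
    rows = [dict(zip(columns, record)) for record in data]
    return {"rows": rows, "columns": columns, "rowCount": len(rows)}


sample = "name;email\nAna;ana@example.com\nBruno;bruno@example.com\n".encode("utf-8")
result = extract_csv(sample, delimiter=";")
print(result["rows"][0]["email"], result["rowCount"])
```

Parsing the same sample with the default `,` delimiter would yield a single column named `name;email`, which is the usual symptom of a wrong delimiter setting.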

Excel

Field | Description
Sheet name | Sheet to read. If empty, uses the first sheet
Header row | Row used as header

PDF

Field | Description
Page range | Pages to extract, such as 1-3,5
OCR | Enables OCR fallback for image-based PDFs
OCR languages | OCR languages, such as por+eng+spa
OCR scale | Scale used when rendering before OCR
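
A page range such as 1-3,5 can be read as a comma-separated list of single pages and inclusive ranges. The Python sketch below expands that notation; `parse_page_range` is a hypothetical helper, and the assumptions that pages are 1-based and that duplicates are merged are not confirmed by the source.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a page range such as "1-3,5" into sorted page numbers.

    Assumes 1-based page numbers and inclusive ranges.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("1-3,5"))  # [1, 2, 3, 5]
```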

OCR in PDF

When the PDF has a real text layer, QANode tries to extract it directly. If the PDF appears to be image-based and OCR is enabled, OCR is used automatically as a fallback.
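
The fallback order can be sketched as follows in Python. The source does not say how QANode decides that a PDF "appears to be image-based", so the rule used here (fewer than `min_chars` non-whitespace characters in the text layer) is an assumption, and `extract_page_text`, `read_text_layer`, and `run_ocr` are hypothetical names.

```python
from typing import Callable


def extract_page_text(
    page: object,
    read_text_layer: Callable[[object], str],
    run_ocr: Callable[[object], str],
    ocr_enabled: bool = True,
    min_chars: int = 1,
) -> tuple[str, bool]:
    """Return (text, used_ocr) for one page.

    Direct extraction is tried first. OCR runs only when it is enabled
    and the text layer yields fewer than `min_chars` non-whitespace
    characters (an assumed heuristic, not QANode's documented rule).
    """
    text = read_text_layer(page)
    visible_chars = len("".join(text.split()))
    if visible_chars >= min_chars or not ocr_enabled:
        return text, False
    return run_ocr(page), True


# Stub callables standing in for a real PDF text reader and OCR engine.
scanned_text, used_ocr = extract_page_text(
    page="page-1",
    read_text_layer=lambda _page: "   ",
    run_ocr=lambda _page: "Total: 120.00",
)
print(scanned_text, used_ocr)
```

Passing the reader and the OCR engine as callables keeps the sketch independent of any particular PDF or OCR library.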

Use OCR for:

  • scanned PDFs;
  • payslips and receipts as images;
  • scanned documents;
  • files without a text layer.

Common languages:

Value | Languages
por | Portuguese
eng | English
spa | Spanish
por+eng+spa | Portuguese, English, and Spanish

Outputs

Outputs depend on the selected type.

TXT

Output | Type | Description
text | string | Extracted text

JSON

Output | Type | Description
json | any | Parsed JSON
text | string | Original text

CSV and Excel

Output | Type | Description
rows | array | Extracted rows
columns | array | Detected columns
rowCount | number | Number of rows
sheets | array | Sheets found (for Excel)

PDF

Output | Type | Description
text | string | Extracted text
pages | array | Text by page
pageCount | number | Number of processed pages

Examples

Extract CSV

File: {{ steps["file-generate"].outputs.fileRef }}
Extraction type: CSV
Delimiter: ,
Has header: true

Later use:

{{ steps["file-extract"].outputs.rows[0].email }}
{{ steps["file-extract"].outputs.rowCount }}

Extract PDF text

File: {{ steps["http-request"].outputs.fileRef }}
Extraction type: PDF
Page range: 1-2
OCR: true
OCR languages: por+eng+spa

Type Validation

The node validates whether the file is compatible with the extraction type. If the file is not supported, the field turns red in the panel and execution fails with a clear message, for example:

Unsupported file: spreadsheet.xlsx
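
A compatibility check of this kind could look like the Python sketch below. The source does not state whether QANode inspects the file extension, the MIME type, or the content itself, so the extension mapping in `ACCEPTED_EXTENSIONS` and the helper `validate_file_type` are assumptions made only to illustrate the failure message.

```python
from pathlib import Path

# Assumed mapping from extraction type to accepted file extensions.
ACCEPTED_EXTENSIONS = {
    "TXT": {".txt", ".md"},
    "JSON": {".json"},
    "CSV": {".csv"},
    "Excel": {".xlsx", ".xls"},
    "PDF": {".pdf"},
}


def validate_file_type(filename: str, extraction_type: str) -> None:
    """Raise ValueError when the file does not match the extraction type."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ACCEPTED_EXTENSIONS[extraction_type]:
        raise ValueError(f"Unsupported file: {filename}")


try:
    validate_file_type("spreadsheet.xlsx", "PDF")
except ValueError as error:
    print(error)
```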

Tips

  • Choose the extraction type according to the real file content.
  • For scanned PDFs, enable OCR; for PDFs with real text, OCR acts only as a fallback because direct extraction is tried first.
  • For CSV with ;, change the delimiter before running.
  • Use rows to feed loops, components, file generation, or validations.
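
When it is unclear which delimiter a CSV file uses, a quick check outside the flow can help choose the setting. The Python sketch below is a deliberately naive heuristic: `guess_delimiter` is a hypothetical helper that counts candidate characters in the first non-empty line and ignores quoted fields.

```python
def guess_delimiter(sample: str, candidates: tuple[str, ...] = (",", ";")) -> str:
    """Return the candidate that occurs most often in the first non-empty line.

    Naive heuristic: quoted fields are not handled, and ties go to the
    first candidate listed.
    """
    first_line = next((line for line in sample.splitlines() if line.strip()), "")
    return max(candidates, key=first_line.count)


print(guess_delimiter("name;email;city\nAna;ana@example.com;Lisboa\n"))
```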