Extract File Node

The Extract File node reads a file received as a `fileRef` and returns text or structured data, depending on the selected extraction type.

Overview

Property	Value
Type	`file-extract`
Category	Files
Color	🟤 Gold (#C89F65)
Input	`in`
Output	`out`

When to Use

Use this node when the flow needs to:

read TXT or Markdown as text;
validate or reuse JSON content;
transform CSV into structured rows;
read Excel spreadsheets;
extract text from PDF;
use OCR as a fallback when the PDF is image-based.

Configuration

Field	Type	Description
File	`fileRef`	Input file
Extraction type	selection	TXT, JSON, CSV, Excel, or PDF

The following fields change according to the extraction type.

TXT

Field	Description
Encoding	Encoding used to read the text, such as `utf8`

JSON

Field	Description
Encoding	Encoding used to read the file

The content must be valid JSON.
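
As a rough illustration of what "valid JSON" means here, the sketch below decodes the file bytes with the configured encoding and parses them, producing values shaped like the `json` and `text` outputs. The helper name `extract_json` and the wording of the error message are assumptions made for this example, not QANode's internal implementation.

```python
import json


def extract_json(raw: bytes, encoding: str = "utf-8") -> dict:
    """Decode raw bytes and parse them as JSON (hypothetical helper)."""
    text = raw.decode(encoding)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        # Report where parsing stopped so the file can be fixed.
        raise ValueError(
            f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    # Mirror the node outputs: parsed value plus the original text.
    return {"json": parsed, "text": text}


result = extract_json(b'{"email": "ana@example.com", "active": true}')
print(result["json"]["email"])
```

Content such as `{not json}` raises an error in this sketch, which corresponds to a failed extraction.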

CSV

Field	Description
Delimiter	Column separator, such as `,` or `;`
Has header	Uses the first line as column names
Encoding	File encoding
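
The sketch below shows one way the three CSV fields could map to the `rows`, `columns`, and `rowCount` outputs, using Python's standard `csv` module. The function name `extract_csv` and the generated column names (`column1`, `column2`, ...) used when there is no header are assumptions for illustration only.

```python
import csv
import io


def extract_csv(raw: bytes, delimiter: str = ",", has_header: bool = True,
                encoding: str = "utf-8") -> dict:
    """Parse CSV bytes into rows keyed by column name (hypothetical helper)."""
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    records = [record for record in reader if record]  # skip blank lines

    if has_header and records:
        columns, data = records[0], records[1:]
    else:
        # Assumed naming scheme when the file has no header line.
        width = len(records[0]) if records else 0
        columns = [f"column{i + 1}" for i in range(width)]
        data = records

    rows = [dict(zip(columns, record)) for record in data]
    return {"rows": rows, "columns": columns, "rowCount": len(rows)}


sample = b"name;email\nAna;ana@example.com\nBruno;bruno@example.com\n"
out = extract_csv(sample, delimiter=";")
print(out["rows"][0]["email"], out["rowCount"])
```

Parsing the same sample with the default `,` delimiter would yield a single column per row, which is why the delimiter has to match the file.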

Excel

Field	Description
Sheet name	Sheet to read. If empty, uses the first sheet
Header row	Row used as header
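
To make the two Excel fields concrete, here is a minimal sketch that works on an in-memory workbook (a dict of sheet name to a grid of cells) rather than a real spreadsheet file. The helper names and the 1-based header row numbering are assumptions for this example; the node's actual numbering is not stated here.

```python
def select_sheet(workbook: dict, sheet_name: str = "") -> str:
    """Return the requested sheet name, or the first sheet when empty."""
    if not workbook:
        raise ValueError("Workbook has no sheets")
    if not sheet_name:
        return next(iter(workbook))  # dicts keep insertion order
    if sheet_name not in workbook:
        raise KeyError(f"Sheet not found: {sheet_name}")
    return sheet_name


def rows_from_sheet(grid: list, header_row: int = 1) -> dict:
    """Build rows from a grid, treating header_row (assumed 1-based) as column names."""
    columns = [str(cell) for cell in grid[header_row - 1]]
    rows = [dict(zip(columns, record)) for record in grid[header_row:]]
    return {"rows": rows, "columns": columns, "rowCount": len(rows)}


workbook = {
    "Summary": [["Monthly report"], ["name", "total"], ["Ana", 10], ["Bruno", 7]],
    "Raw": [],
}
sheet = select_sheet(workbook)  # empty name falls back to the first sheet
data = rows_from_sheet(workbook[sheet], header_row=2)
print(sheet, data["columns"], data["rowCount"])
```

Rows above the header row (the title line in this sample) are ignored in this sketch.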

PDF

Field	Description
Page range	Pages to extract, such as `1-3,5`
OCR	Enables OCR fallback for image-based PDFs
OCR languages	OCR languages, such as `por+eng+spa`
OCR scale	Scale factor applied when rendering pages to images before OCR
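
A page range such as `1-3,5` can be read as a comma-separated list of single pages and inclusive ranges. The sketch below expands that syntax into page numbers. It assumes pages are 1-based and that pages beyond the end of the document are silently dropped; both are assumptions, and `parse_page_range` is a hypothetical helper name.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a spec like '1-3,5' into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        # Clamp to the document length (assumed behavior).
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


print(parse_page_range("1-3,5", page_count=10))
```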

OCR in PDF

When the PDF has real text, QANode tries to extract it directly. If the PDF appears to be image-based, OCR can be used automatically as a fallback.

Use OCR for:

scanned PDFs;
payslips and receipts saved as images;
scanned documents;
files without a text layer.
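
The fallback idea can be sketched as follows: use the text layer when it has meaningful content, and call OCR only when the page looks image-based and OCR is enabled. How QANode actually detects an image-based PDF is not described here, so the character threshold, the function names, and the `run_ocr` callback are all assumptions for illustration.

```python
from typing import Callable


def looks_image_based(text_layer: str, min_chars: int = 20) -> bool:
    """Assumed heuristic: a nearly empty text layer suggests a scanned page."""
    visible = "".join(text_layer.split())  # ignore whitespace
    return len(visible) < min_chars


def extract_page(text_layer: str, ocr_enabled: bool,
                 run_ocr: Callable[[], str]) -> str:
    """Prefer the text layer; fall back to OCR only when needed and enabled."""
    if ocr_enabled and looks_image_based(text_layer):
        return run_ocr()
    return text_layer


# run_ocr stands in for a real OCR call on the rendered page image.
scanned = extract_page("", ocr_enabled=True, run_ocr=lambda: "text from OCR")
digital = extract_page("Invoice total: 120.00 EUR, due 2024-01-31",
                       ocr_enabled=True, run_ocr=lambda: "text from OCR")
print(scanned)
print(digital)
```

With OCR disabled, a scanned page in this sketch simply returns its empty text layer.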

Common languages:

Value	Languages
`por`	Portuguese
`eng`	English
`spa`	Spanish
`por+eng+spa`	Portuguese, English, and Spanish

Outputs

Outputs depend on the selected type.

TXT

Output	Type	Description
`text`	`string`	Extracted text

JSON

Output	Type	Description
`json`	`any`	Parsed JSON
`text`	`string`	Original text

CSV and Excel

Output	Type	Description
`rows`	`array`	Extracted rows
`columns`	`array`	Detected columns
`rowCount`	`number`	Number of rows
`sheets`	`array`	Sheets found (Excel only)

PDF

Output	Type	Description
`text`	`string`	Extracted text
`pages`	`array`	Text by page
`pageCount`	`number`	Number of processed pages

Examples

Extract CSV

File: {{ steps["file-generate"].outputs.fileRef }}
Extraction type: CSV
Delimiter: ,
Has header: true

Later use:

{{ steps["file-extract"].outputs.rows[0].email }}
{{ steps["file-extract"].outputs.rowCount }}

Extract PDF text

File: {{ steps["http-request"].outputs.fileRef }}
Extraction type: PDF
Page range: 1-2
OCR: true
OCR languages: por+eng+spa

Type Validation

The node validates whether the file is compatible with the extraction type. If the file is not supported, the field turns red in the panel and execution fails with a clear message, for example:

Unsupported file: spreadsheet.xlsx
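
For intuition, the sketch below checks compatibility by file extension and raises an error in the same format as the message above. This is only an approximation: the extension table is hypothetical, and the node may inspect the file content rather than its name.

```python
from pathlib import PurePath

# Hypothetical mapping from extraction type to accepted extensions.
ACCEPTED_EXTENSIONS = {
    "TXT": {".txt", ".md"},
    "JSON": {".json"},
    "CSV": {".csv"},
    "Excel": {".xlsx", ".xls"},
    "PDF": {".pdf"},
}


def validate_file_type(filename: str, extraction_type: str) -> None:
    """Raise ValueError when the file does not match the extraction type."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ACCEPTED_EXTENSIONS[extraction_type]:
        raise ValueError(f"Unsupported file: {filename}")


validate_file_type("report.pdf", "PDF")  # compatible, no error
try:
    validate_file_type("spreadsheet.xlsx", "CSV")
except ValueError as exc:
    message = str(exc)
    print(message)
```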

Tips

Choose the extraction type according to the real file content.
For scanned PDFs, enable OCR; PDFs with real text are extracted directly, and OCR is used only as a fallback.
For CSV files delimited by `;`, change the delimiter before running.
Use `rows` to feed loops, components, file generation, or validations.