Structured Extraction with LLM

When the extraction method is LLM, a model reads meaning from your documents and populates schema-defined fields with inferred values. This page covers the options for that method: schema definition, model selection, schema prompt, and extraction guidance. To compare LLM and Regex before choosing, see Choose an extraction method.

For Unstructured Pipelines users, see Unstructured Pipelines settings for structured extraction with LLM.
For Unstructured API users, see Unstructured API settings for structured extraction with LLM.

Unstructured Pipelines settings for structured extraction with LLM

The following sections describe how to use Unstructured Pipelines to specify settings for structured extraction with LLM.

Define your schema (Pipelines only)

In Unstructured Pipelines, you can build your extraction schema directly in the visual schema builder, or generate a starting point from a plain-language prompt. Once the schema is generated, you can refine it in the builder and export it as JSON. Be aware that generating a new schema from the plain-language prompt overwrites any existing builder content.

If you already have a schema in the visual schema builder and want to try generating one from a plain-language prompt, export your current schema to a JSON file first. You can upload it again later if you prefer the original.

Visual schema builder and JSON upload/export (Pipelines only)

In Unstructured Pipelines, on the Start page or in the workflow designer, you can access the visual schema builder in the Define Schema view. From there you can:

Upload a JSON file to the editor.
Edit the fields in the schema directly in the editor.
Export the schema you have defined to a JSON file for reuse.

An extraction schema is a JSON-formatted schema that defines the structure of the data that Unstructured extracts. If you already have an extraction schema defined in a JSON file, you can click Upload JSON to upload the file to Unstructured.

The schema must conform to the OpenAI Structured Outputs guidelines, which are a subset of the JSON Schema language. Per OpenAI’s guidelines, the maximum supported JSON schema nesting depth is 10 levels.

The following shows the extraction schema for the sample real estate listing, first in the visual schema builder, then as a JSON schema file.

The LLM visual schema builder:

LLM visual schema builder showing an extraction schema with the Export schema as JSON option

JSON schema file:

{
  "type": "object",
  "properties": {
    "street_address": {
      "type": "string",
      "description": "The full street address of the property including street number, street name, city, state, and postal code"
    },
    "square_footage": {
      "type": "number",
      "description": "The total living space area of the property, in square feet"
    },
    "price": {
      "type": "number",
      "description": "The listed selling price of the property, in local currency"
    },
    "features": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "A list of property features and highlights"
    },
    "agent_contact": {
      "type": "object",
      "properties": {
        "phone": {
          "type": "string",
          "description": "The agent's contact phone number"
        }
      },
      "required": [
        "phone"
      ],
      "additionalProperties": false,
      "description": "Contact information for the real estate agent"
    }
  },
  "additionalProperties": false,
  "required": [
    "street_address",
    "square_footage",
    "price",
    "features",
    "agent_contact"
  ]
}
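
Before uploading a schema, it can help to check it locally against the constraints mentioned above. The following is a minimal sketch, not an official validator: it assumes that nesting depth is counted in nested objects (the exact counting rule is not stated here), and it checks two requirements of the OpenAI Structured Outputs guidelines, namely that every object sets additionalProperties to false and lists all of its properties in required. The function name check_schema and the abbreviated copy of the sample schema (descriptions omitted) are for illustration only.

```python
MAX_DEPTH = 10  # maximum nesting depth stated in this section


def check_schema(node, depth=1, path="$"):
    """Return a list of human-readable problems found in a JSON schema node."""
    problems = []
    node_type = node.get("type")
    if node_type == "object":
        # Assumption: depth counts nested objects, starting at 1 for the root.
        if depth > MAX_DEPTH:
            problems.append(f"{path}: nesting depth {depth} exceeds {MAX_DEPTH}")
        if node.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        props = node.get("properties", {})
        missing = sorted(set(props) - set(node.get("required", [])))
        if missing:
            problems.append(f"{path}: properties not listed in required: {missing}")
        for name, child in props.items():
            problems.extend(check_schema(child, depth + 1, f"{path}.{name}"))
    elif node_type == "array":
        # Array items are checked at the same depth as the array itself.
        problems.extend(check_schema(node.get("items", {}), depth, f"{path}[]"))
    return problems


# Abbreviated version of the real estate schema shown above.
listing_schema = {
    "type": "object",
    "properties": {
        "street_address": {"type": "string"},
        "square_footage": {"type": "number"},
        "price": {"type": "number"},
        "features": {"type": "array", "items": {"type": "string"}},
        "agent_contact": {
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"],
            "additionalProperties": False,
        },
    },
    "additionalProperties": False,
    "required": [
        "street_address",
        "square_footage",
        "price",
        "features",
        "agent_contact",
    ],
}

problems = check_schema(listing_schema)
print(problems or "no problems found")
```

An empty list means the sketch found nothing to flag; it does not guarantee that Unstructured or the model provider will accept the schema.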

Plain language in a schema prompt (Pipelines only)

Unstructured Pipelines allows you to specify your extraction schema with a schema prompt instead of using the visual schema builder or a JSON schema. A schema prompt is a set of plain-language instructions that describe what to extract from your documents, similar to a prompt you would give a chatbot or AI agent. Unstructured generates an extraction schema from those instructions: a structured definition (fields, types, and constraints) that guides extraction from the source documents.

This option is only available from the Start page.

From the Start page, click Suggest, enter your prompt in the Prompt a Schema dialog, then click Generate schema. Unstructured generates a schema from your prompt instructions and displays it in the visual schema builder.

Prompt a Schema dialog showing a plain-language prompt for a real estate listing

Selecting Generate schema overwrites the existing schema that’s displayed in the Define Schema pane. If you’d like to save the current schema before generating a new one, click the ellipsis (three dots) icon, then click Export schema as JSON.

You can continue to edit the generated schema in the visual schema builder if you wish. For this real estate listing example, you might enter the following prompt:

Extract the following information from the listing, and present it in the following format:

- street_address: The full street address of the property including street number, street name, city, state, and postal code.
- square_footage: The total living space area of the property, in square feet.
- price: The listed selling price of the property, in local currency.
- features: A list of property features and highlights.
- agent_contact: Contact information for the real estate agent.
  - phone: The agent's contact phone number.

The following image shows the generated schema that displays in the visual schema builder.

LLM visual schema builder displaying a schema generated from a plain-language prompt

Select your LLM provider and model (Pipelines only)

In Unstructured Pipelines, you can select a provider and model for the LLM extraction method. For Model, select your provider and model from the drop-down.

Provider and model selection dropdown in the workflow designer Extract node

This option is only available from the workflow designer.

Configure your output (Pipelines only)

In Unstructured Pipelines, once your schema determines which fields to extract and what types they return, two settings control what the output looks like. Schema-only output lets you strip away Unstructured’s document elements and return just the extracted fields. Extraction guidance lets you tell the LLM how to format, normalize, or summarize values into the fields your schema defines.

Schema-only output (Pipelines only)

In Unstructured Pipelines, the Schema-Only Output setting controls whether Unstructured strips away its document elements and returns just the extracted fields. The setting applies to both the LLM and Regex extraction methods. In the workflow designer, select the workflow’s Extract node. Under Output settings, you can set Schema-Only Output to ON or OFF whenever you edit the workflow.

When Schema-Only Output is ON, the Extract node returns only the JSON produced for your explicitly defined fields. In workflow JSON, this corresponds to the "extracted data only" layout described in Custom defined output, with no surrounding Unstructured element list.
When Schema-Only Output is OFF (the default), Unstructured also emits the usual document elements and metadata alongside those extracted values. In workflow JSON, this corresponds to the "elements with extracted data" layout described in the same Custom defined output section, with structured fields under DocumentData plus the rest of the element list.

Schema-Only Output toggle in the Extract node Output settings

This option is only available from the workflow designer.

Extraction guidance (Pipelines only)

In Unstructured Pipelines, in the workflow designer, use the Extraction Guidance Prompt to tell the LLM how to format, normalize, or present values after your schema defines which fields to extract.

This option is only available from the workflow designer.

The schema still defines what to extract (fields, types, and constraints). Extraction guidance adds plain-language direction for how to format, normalize, or summarize that output when JSON Schema alone is not enough. For example, you can ask the model to standardize addresses, return dates in a consistent format, or summarize long text into a predefined field.

You can save this guidance in the workflow designer with the Extract node settings and with the workflow you’re defining, so later runs, including API operations against that workflow, use the same guidance.

You can add or revise an Extraction Guidance Prompt in the workflow designer after you add or select the Extract node. From the structured data extractor, click + Add Prompt to enter plain-language instructions for how the LLM should format or present values after your schema has defined the fields. Saving writes the prompt into the node’s settings. Extracted values must still conform to the schema; the prompt only describes presentation and cleanup on top of that contract. You can edit and save the extraction guidance again as you iterate.

Extraction Guidance Prompt field in the workflow designer Extract node

Unstructured API settings for structured extraction with LLM

The following sections describe how to use the Unstructured API to specify settings for structured extraction with LLM.

Define your schema (API only)

An extraction schema is a JSON-formatted schema that defines the structure of the data that Unstructured extracts.

To specify an extraction schema with the Unstructured API, use the LLM method of an Extract node. In this node's settings object, set the schema_to_extract.json_schema key. Specify the node as either an object in a workflow_nodes array (for curl) or a WorkflowNode in a WorkflowNodes collection (for Python). This object or collection applies whenever you create a workflow, update a workflow, or create a workflow job that processes local files.
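
As a rough illustration of where the schema sits, the sketch below builds one workflow_nodes entry as a plain Python dictionary and serializes it the way a curl request body might carry it. Only the settings key schema_to_extract.json_schema comes from this page, and it is read here as a nested object. The node's name, type, and subtype values are unfilled placeholders, the function build_extract_node is hypothetical, and whether json_schema is passed as an object or as a JSON string should be confirmed against the API reference.

```python
import json


def build_extract_node(json_schema: dict) -> dict:
    """Return one workflow_nodes entry that carries the extraction schema."""
    return {
        # Placeholder values: replace with the identifiers from the API reference.
        "name": "Extractor",
        "type": "<extract-node-type>",
        "subtype": "<llm-method-subtype>",
        "settings": {
            # Dotted key schema_to_extract.json_schema, written as nested objects.
            "schema_to_extract": {"json_schema": json_schema},
        },
    }


# A one-field schema, kept short for illustration.
schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "description": "The listed selling price of the property, in local currency",
        }
    },
    "required": ["price"],
    "additionalProperties": False,
}

payload = {"workflow_nodes": [build_extract_node(schema)]}
print(json.dumps(payload, indent=2))
```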

Specify your LLM provider and model (API only)

You must specify an LLM provider and model for Unstructured to perform the extraction. To do this with the Unstructured API, use the LLM method of an Extract node. In this node's settings object, set the provider and model keys. Specify the node as either an object in a workflow_nodes array (for curl) or a WorkflowNode in a WorkflowNodes collection (for Python). This object or collection applies whenever you create a workflow, update a workflow, or create a workflow job that processes local files.
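
Because both keys are required, a small guard can catch an empty value before a request is sent. This is a minimal sketch with a hypothetical helper name, with_provider_and_model; the provider and model strings are unfilled placeholders, since the accepted values are not listed on this page.

```python
def with_provider_and_model(settings: dict, provider: str, model: str) -> dict:
    """Return a copy of settings with the provider and model keys set."""
    if not provider or not model:
        raise ValueError("both provider and model must be specified")
    # Copy rather than mutate, so the caller's settings stay unchanged.
    return {**settings, "provider": provider, "model": model}


# Placeholder values: substitute a provider and model that Unstructured supports.
settings = with_provider_and_model({}, "<provider>", "<model>")
print(settings)
```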

Configure your output (API only)

Once your schema determines which fields to extract and what types they return, two settings control what the output looks like. Schema-only output lets you strip away Unstructured’s document elements and return just the extracted fields. Extraction guidance lets you tell the LLM how to format, normalize, or summarize values into the fields your schema defines.

Schema-only output (API only)

You can use the output_mode setting with the Unstructured API to control whether Unstructured strips away its document elements and returns just the extracted fields:

Set output_mode to extracted_data_only to output only the extracted data as JSON, without any parent DocumentData element or any other built-in Unstructured document elements.
Set output_mode to elements_with_extracted_data to output the extracted data as JSON inside a parent DocumentData element. This element is returned along with the other built-in Unstructured document elements.

To specify this setting, use the LLM method of an Extract node. In this node's settings object, set the output_mode key. Specify the node as either an object in a workflow_nodes array (for curl) or a WorkflowNode in a WorkflowNodes collection (for Python). This object or collection applies whenever you create a workflow, update a workflow, or create a workflow job that processes local files.
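
Since output_mode accepts the two values described above, a typo is easy to catch on the client side. The sketch below is one way to do that with a Python enum; the OutputMode class and the with_output_mode helper are hypothetical names, not part of any Unstructured SDK.

```python
from enum import Enum


class OutputMode(str, Enum):
    """The two output_mode values described in this section."""

    EXTRACTED_DATA_ONLY = "extracted_data_only"
    ELEMENTS_WITH_EXTRACTED_DATA = "elements_with_extracted_data"


def with_output_mode(settings: dict, mode: str) -> dict:
    """Return a copy of settings with a validated output_mode key."""
    # OutputMode(mode) raises ValueError for any string that is not a member value.
    return {**settings, "output_mode": OutputMode(mode).value}


settings = with_output_mode({}, "extracted_data_only")
print(settings)
```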

Extraction guidance (API only)

You can use the extraction guidance setting with the Unstructured API to tell the LLM how to format, normalize, or present values after your schema defines which fields to extract. To specify this setting, use the LLM method of an Extract node. In this node's settings object, set the schema_to_extract.extraction_guidance key. Specify the node as either an object in a workflow_nodes array (for curl) or a WorkflowNode in a WorkflowNodes collection (for Python). This object or collection applies whenever you create a workflow, update a workflow, or create a workflow job that processes local files.
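
The guidance key shares the schema_to_extract parent with json_schema, so adding guidance should not overwrite a schema that is already set. The following sketch shows one way to merge it in, reading the dotted key as nested objects; the helper name with_extraction_guidance and the sample guidance text are illustrative assumptions.

```python
import copy


def with_extraction_guidance(settings: dict, guidance: str) -> dict:
    """Return a copy of settings with schema_to_extract.extraction_guidance set."""
    updated = copy.deepcopy(settings)  # leave the caller's settings untouched
    # setdefault keeps an existing schema_to_extract object, including json_schema.
    updated.setdefault("schema_to_extract", {})["extraction_guidance"] = guidance.strip()
    return updated


settings = {"schema_to_extract": {"json_schema": {"type": "object"}}}
settings = with_extraction_guidance(
    settings,
    "Return dates in a consistent format and standardize street addresses.",
)
print(settings)
```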

API limitations

The Unstructured API does not support the following options for structured extraction with LLM. To use either of these options, you must use Unstructured Pipelines instead. To learn how, see the following links:

Visual schema builder and JSON upload/export (Pipelines only)
Plain language in a schema prompt (Pipelines only)
