The Extract Datamodel node extracts structured data from the current page based on a JSON schema. This is useful for scraping data, validating page content, or capturing information for later use in the workflow.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| extract_data_model | object | Yes | JSON schema defining the data structure to extract |
| execution | string | No | Execution type: STATIC (UI: “Static”), LLM_DOM (UI: “AI (HTML)”), LLM_VISION (UI: “AI (Screenshot)”), or PROMPT (UI: “AI (Context)”). If omitted in workflow JSON, the runtime treats it as STATIC |
| selector | string | Conditional | XPath related to extraction. For STATIC: optional; used only as a readiness gate (waits until the element appears before extracting) and does not scope or filter what is extracted. For LLM_DOM: required; defines the DOM subtree whose outerHTML is sent to the model. Not used for LLM_VISION or PROMPT |
| prompt | string | Conditional | Additional instructions for LLM extraction. Required for PROMPT execution |
| wait_time | number | No | Maximum time (ms) to wait for the selector. Default: 15000. Only used when selector is provided |
| selector_error_message | string | No | Custom error message if selector is not found |
| llm_model | string | No | Override the default LLM model for extraction. Only used for LLM_DOM and LLM_VISION |
| keep_html_metadata | boolean | No | If true, preserve HTML attributes (classes, IDs, data attributes) when sending to LLM. Default: false (HTML is sanitized). Enable this when you need to extract data from HTML attributes, e.g. IDs. Only used for LLM_DOM |
How the selector interacts with the datamodel
The selector parameter behaves very differently between STATIC and LLM_DOM — be careful not to confuse the two.
- STATIC: The selector is a readiness gate only. Before extracting, the runtime waits up to wait_time ms for the selector to resolve (throwing selector_error_message if it never does). It is not prepended to field paths and does not restrict what is extracted: every path inside extract_data_model is evaluated against the full document (including iframes and open shadow DOM). To scope STATIC extraction to a specific region, put the XPath on the array or field’s own path instead (e.g., "path": "//table//tbody/tr" for row containers with relative ".//td[1]" children).
- LLM_DOM: The selector is required. The runtime waits for the element, then takes its outerHTML and sends only that subtree to the model as the extraction input. Choose the smallest element that still contains every field you want the model to extract.
- LLM_VISION and PROMPT: The selector is not used.
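The STATIC behavior described above can be sketched as a parameter set. This is an illustrative fragment, not a confirmed workflow export: the XPaths, the field name page_title, and the flat layout of the parameters are assumptions. The selector only delays extraction until the table exists, while the field's path still resolves against the whole document, so it can point outside the table.

```json
{
  "execution": "STATIC",
  "selector": "//table[@id='orders']",
  "wait_time": 10000,
  "selector_error_message": "Orders table did not appear in time",
  "extract_data_model": {
    "type": "object",
    "properties": {
      "page_title": {
        "type": "string",
        "selected": true,
        "mode": "xpath",
        "path": "//h1"
      }
    }
  }
}
```

With LLM_DOM, the same selector would instead define the subtree sent to the model, so a heading outside the table would not be visible to the extraction.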
Schema Structure
The extract_data_model follows JSON Schema with CloudCruise extensions.
Schema Properties
| Property | Description |
|---|---|
| type | Data type: string, number, boolean, array, object |
| selected | Set to true to include this field in extraction |
| description | Description to help LLM understand what to extract |
| path | XPath expression for STATIC extraction |
| mode | Set to xpath for XPath-based extraction |
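As a rough illustration of how these properties combine for LLM-based extraction, the fragment below pairs type, selected, and description on each field. The field names, the selector, and the nesting under properties (borrowed from standard JSON Schema) are assumptions made for the sketch.

```json
{
  "execution": "LLM_DOM",
  "selector": "//div[@id='profile']",
  "extract_data_model": {
    "type": "object",
    "properties": {
      "full_name": {
        "type": "string",
        "selected": true,
        "description": "The user's full name as shown in the profile header"
      },
      "email": {
        "type": "string",
        "selected": true,
        "description": "The contact email address listed on the profile"
      }
    }
  }
}
```

No path or mode is set here because those properties apply to XPath-based STATIC extraction.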
Examples
Basic Extraction with LLM_DOM
Extract user information using AI.
STATIC Extraction with XPath
Extract data using explicit XPath selectors.
To extract attribute values (id, href, data-*), point the XPath at the attribute.
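A minimal sketch of attribute extraction with STATIC execution, assuming a hypothetical product page: the class names, the data-product-id attribute, and the field names are invented for illustration. The trailing /@href and /@data-product-id steps are standard XPath attribute selections.

```json
{
  "execution": "STATIC",
  "extract_data_model": {
    "type": "object",
    "properties": {
      "product_link": {
        "type": "string",
        "selected": true,
        "mode": "xpath",
        "path": "//a[@class='product-link']/@href"
      },
      "product_id": {
        "type": "string",
        "selected": true,
        "mode": "xpath",
        "path": "//div[@class='product']/@data-product-id"
      }
    }
  }
}
```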
Arrays
Extract Array of Items
Extract a list of items from the page.
Static Array Extraction
To extract an array using STATIC execution, provide an XPath that matches multiple elements. Each matched element becomes an item in the array.
Set path on the array to match the repeating container elements, then use relative XPaths for each property within the items.
For a table, for example, the array’s path matches each <tr> row, and each property uses a relative XPath to extract the corresponding cell within that row.
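The row-and-cell pattern could look roughly like the following. The array key rows, the column names, and the items/properties nesting (taken from standard JSON Schema) are assumptions. The two XPath forms, //table//tbody/tr for the container and .//td[n] for the cells, follow the pattern the text describes.

```json
{
  "execution": "STATIC",
  "extract_data_model": {
    "type": "object",
    "properties": {
      "rows": {
        "type": "array",
        "selected": true,
        "mode": "xpath",
        "path": "//table//tbody/tr",
        "items": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string",
              "selected": true,
              "mode": "xpath",
              "path": ".//td[1]"
            },
            "price": {
              "type": "string",
              "selected": true,
              "mode": "xpath",
              "path": ".//td[2]"
            }
          }
        }
      }
    }
  }
}
```

The leading dot in .//td[1] keeps each cell lookup relative to the current row rather than the whole document.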
Overwrite Arrays
Arrays append by default: if you extract into the same array more than once (e.g., in a loop), new items are appended. You can override this behavior by adding the array key to the overwriteArrayKeys array in the JSON schema of the Extract Datamodel node.
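One possible shape for such a schema is sketched below. The position of overwriteArrayKeys at the top level of the schema is an assumption, as are the array key search_results and its description, so check the placement against an exported workflow before relying on it.

```json
{
  "type": "object",
  "overwriteArrayKeys": ["search_results"],
  "properties": {
    "search_results": {
      "type": "array",
      "selected": true,
      "description": "Titles of the results shown on the current page",
      "items": {
        "type": "string"
      }
    }
  }
}
```

With this in place, each pass through a loop would replace search_results instead of growing it.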
Access Browser Variables
We allow extraction of some browser variables:
- The complete URL the browser agent is on: {{window.location.href}}
- The path name of the current URL: {{window.location.pathname}}
- The query string of the current URL: {{window.location.search}}
Use these variables with execution set to STATIC.
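A hedged sketch of how a browser variable might be wired into a schema: it assumes the variable is written in the field's path, which the text here does not confirm, and the field names are hypothetical.

```json
{
  "execution": "STATIC",
  "extract_data_model": {
    "type": "object",
    "properties": {
      "current_url": {
        "type": "string",
        "selected": true,
        "path": "{{window.location.href}}"
      },
      "query_string": {
        "type": "string",
        "selected": true,
        "path": "{{window.location.search}}"
      }
    }
  }
}
```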
Extract Raw HTML
You can extract the HTML content of the current page using document variables:
- Sanitized HTML ({{document.sanitized}}): Extracts a simplified version of the HTML that removes most attributes and only maintains the structure, tags, and content. This is useful for cleaner data extraction and reduces noise when processing HTML.
- Complete HTML ({{document}}): Extracts the entire raw HTML with all attributes intact, including classes, IDs, data attributes, styles, and other metadata.
Use these variables with execution set to STATIC.
Notes
- Use STATIC execution with XPaths for speed and reliability when page structure is stable
- Use LLM_DOM for complex pages or when selectors frequently change
- Add clear descriptions for each field to help the LLM understand what data to extract
- Arrays extracted multiple times (e.g., in a loop) append by default; use overwriteArrayKeys to replace
- For STATIC, scope row containers via the array’s own path (not the node selector), and prefer relative child paths like .//td[1] for cleanest semantics

