Extract Datamodel

The Extract Datamodel node extracts structured data from the current page based on a JSON schema. This is useful for scraping data, validating page content, or capturing information for later use in the workflow.

Parameters

Parameter	Type	Required	Description
`extract_data_model`	object	Yes	JSON schema defining the data structure to extract
`execution`	string	No	Execution type: `STATIC` (UI: “Static”), `LLM_DOM` (UI: “AI (HTML)”), `LLM_VISION` (UI: “AI (Screenshot)”), or `PROMPT` (UI: “AI (Context)”). If omitted in workflow JSON, the runtime treats it as `STATIC`
`selector`	string	Conditional	XPath related to extraction. For `STATIC`: optional; used only as a readiness gate (waits until the element appears before extracting) and does not scope or filter what is extracted. For `LLM_DOM`: required; defines the DOM subtree whose `outerHTML` is sent to the model. Not used for `LLM_VISION` or `PROMPT`
`prompt`	string	Conditional	Additional instructions for LLM extraction. Required for `PROMPT` execution
`wait_time`	number	No	Maximum time (ms) to wait for the selector. Default: 15000. Only used when `selector` is provided
`selector_error_message`	string	No	Custom error message if selector is not found
`llm_model`	string	No	Override the default LLM model for extraction. Only used for `LLM_DOM` and `LLM_VISION`
`keep_html_metadata`	boolean	No	If true, preserve HTML attributes (classes, IDs, data attributes) when sending to LLM. Default: false (HTML is sanitized). Enable this when you need to extract data from HTML attributes e.g. IDs. Only used for `LLM_DOM`

How the `selector` interacts with the datamodel

The selector parameter behaves very differently between STATIC and LLM_DOM — be careful not to confuse the two.

STATIC: The selector is a readiness gate only. Before extracting, the runtime waits up to wait_time ms for the selector to resolve (throwing selector_error_message if it never does). It is not prepended to field paths and does not restrict what is extracted — every path inside extract_data_model is evaluated against the full document (including iframes and open shadow DOM). To scope STATIC extraction to a specific region, put the XPath on the array or field’s own path instead (e.g., "path": "//table//tbody/tr" for row containers with relative ".//td[1]" children).
LLM_DOM: The selector is required. The runtime waits for the element, then takes its outerHTML and sends only that subtree to the model as the extraction input. Choose the smallest element that still contains every field you want the model to extract.
LLM_VISION and PROMPT: The selector is not used.
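
The conditional rules above can be checked before a workflow is saved. The following is a minimal sketch, not part of CloudCruise: check_parameters is a hypothetical helper that takes a node's parameters object as a Python dict and lists the problems it finds, assuming the rules are exactly those stated in the parameter table.

```python
VALID_EXECUTIONS = {"STATIC", "LLM_DOM", "LLM_VISION", "PROMPT"}


def check_parameters(params: dict) -> list[str]:
    """Return human-readable problems for an Extract Datamodel `parameters` dict.

    Hypothetical helper: it only encodes the rules stated in the parameter table.
    """
    problems = []
    # A missing `execution` is treated as STATIC, mirroring the documented default.
    execution = params.get("execution", "STATIC")

    if "extract_data_model" not in params:
        problems.append("extract_data_model is required")
    if execution not in VALID_EXECUTIONS:
        problems.append(f"unknown execution type: {execution}")
    if execution == "LLM_DOM" and not params.get("selector"):
        problems.append("selector is required for LLM_DOM")
    if execution == "PROMPT" and not params.get("prompt"):
        problems.append("prompt is required for PROMPT")
    if execution in {"LLM_VISION", "PROMPT"} and params.get("selector"):
        problems.append(f"selector is not used for {execution}")
    if "wait_time" in params and not params.get("selector"):
        problems.append("wait_time only applies when selector is provided")
    return problems


# An LLM_DOM node without a selector is flagged.
print(check_parameters({"execution": "LLM_DOM", "extract_data_model": {"type": "object"}}))
```

An empty list means the node satisfies the documented rules; it says nothing about whether the XPaths themselves match the page.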

Schema Structure

The extract_data_model follows JSON Schema with CloudCruise extensions:

{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "selected": true,
      "description": "Description for LLM extraction",
      "path": "//xpath/expression",
      "mode": "xpath"
    }
  }
}

Schema Properties

Property	Description
`type`	Data type: `string`, `number`, `boolean`, `array`, `object`
`selected`	Set to `true` to include this field in extraction
`description`	Description to help LLM understand what to extract
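`path`	XPath expression for `STATIC` extraction. Also accepts the browser and document variables described below, such as `{{window.location.href}}` or `{{document}}`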
`mode`	Set to `xpath` for XPath-based extraction

Examples

Basic Extraction with LLM_DOM

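Extract user information using AI. LLM_DOM requires a selector; the XPath below is illustrative and should point at the smallest element that contains every field:

{
  "id": "abc123",
  "name": "Extract user details",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "selector": "//div[@id='account-details']",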
    "extract_data_model": {
      "type": "object",
      "properties": {
        "user_name": {
          "type": "string",
          "selected": true,
          "description": "The user's full name displayed in the header"
        },
        "email": {
          "type": "string",
          "selected": true,
          "description": "The user's email address"
        },
        "account_status": {
          "type": "string",
          "selected": true,
          "description": "The account status (active, inactive, pending)"
        }
      }
    }
  }
}

STATIC Extraction with XPath

Extract data using explicit XPath selectors:

{
  "id": "abc123",
  "name": "Extract order details",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "order_id": {
          "type": "string",
          "selected": true,
          "path": "//span[@data-testid='order-id']",
          "mode": "xpath"
        },
        "total_amount": {
          "type": "string",
          "selected": true,
          "path": "//div[@class='total']//span[@class='amount']",
          "mode": "xpath"
        }
      }
    }
  }
}

You can also extract HTML attributes (e.g., id, href, data-*) by pointing the XPath at the attribute:

{
  "product_ids": {
    "type": "array",
    "items": {
      "type": "string"
    },
    "selected": true,
    "path": "//div[@class='product-card']/@data-product-id",
    "mode": "xpath"
  }
}
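
To see which values an attribute path like the one above selects, here is a small sketch using Python's standard library. It is an approximation under stated assumptions: the page fragment is made up, and xml.etree.ElementTree does not support the /@attribute step, so the sketch matches the elements and then reads the attribute, which gives the same list a full XPath 1.0 engine would return for //div[@class='product-card']/@data-product-id.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; class and attribute names mirror the schema above.
PAGE = """
<div id="results">
  <div class="product-card" data-product-id="sku-101"><h3>Lamp</h3></div>
  <div class="product-card" data-product-id="sku-102"><h3>Desk</h3></div>
  <div class="banner">Sale</div>
</div>
"""


def attribute_values(root, element_path, attribute):
    """Collect one attribute from every element matched by `element_path`."""
    values = []
    for element in root.findall(element_path):
        value = element.get(attribute)
        if value is not None:  # skip matched elements that lack the attribute
            values.append(value)
    return values


root = ET.fromstring(PAGE)
product_ids = attribute_values(root, ".//div[@class='product-card']", "data-product-id")
print(product_ids)
```

Each matched element contributes one array item, and the banner div is skipped because it does not match the class predicate.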

Arrays

Extract Array of Items

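Extract a list of items from the page. Because this example uses LLM_DOM, it needs a selector; the XPath below is illustrative and should wrap the whole list:

{
  "id": "abc123",
  "name": "Extract product list",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "selector": "//div[@id='search-results']",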
    "extract_data_model": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "selected": true,
          "items": {
            "type": "object",
            "properties": {
              "name": {
                "type": "string",
                "description": "Product name"
              },
              "price": {
                "type": "string",
                "description": "Product price"
              },
              "sku": {
                "type": "string",
                "description": "Product SKU"
              }
            }
          },
          "description": "List of all products shown in the search results"
        }
      }
    }
  }
}

Static Array Extraction

To extract an array using STATIC execution, provide an XPath that matches multiple elements. Each matched element becomes an item in the array:

{
  "id": "abc123",
  "name": "Extract all product names",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "product_names": {
          "type": "array",
          "items": {
            "type": "string"
          },
          "selected": true,
          "path": "//div[@class='product-card']//h3[@class='product-name']",
          "mode": "xpath"
        }
      }
    }
  }
}

For extracting an array of objects (e.g., table rows with multiple columns), define the path on the array to match the repeating container elements, then use relative XPaths for each property within the items:

{
  "id": "abc123",
  "name": "Extract table rows",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "orders": {
          "type": "array",
          "selected": true,
          "path": "//table[@id='orders-table']//tbody/tr",
          "mode": "xpath",
          "items": {
            "type": "object",
            "properties": {
              "order_id": {
                "type": "string",
                "path": ".//td[1]",
                "mode": "xpath"
              },
              "customer": {
                "type": "string",
                "path": ".//td[2]",
                "mode": "xpath"
              },
              "amount": {
                "type": "string",
                "path": ".//td[3]",
                "mode": "xpath"
              },
              "status": {
                "type": "string",
                "path": ".//td[4]",
                "mode": "xpath"
              }
            }
          }
        }
      }
    }
  }
}

The array’s path matches each <tr> row, and each property uses a relative XPath to extract the corresponding cell within that row.
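
The row-container pattern can be sketched in a few lines of Python. This is an illustration, not the CloudCruise runtime: the HTML is invented, extract_rows is a hypothetical helper, and the limited XPath subset in xml.etree.ElementTree stands in for the real XPath engine. The point is that the array path is evaluated once against the document, and each relative child path such as .//td[1] is then evaluated against one matched row.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the table id matches the schema above.
PAGE = """
<html><body>
  <table id="orders-table">
    <tbody>
      <tr><td>1001</td><td>Ada</td><td>20.00</td><td>shipped</td></tr>
      <tr><td>1002</td><td>Grace</td><td>35.50</td><td>pending</td></tr>
    </tbody>
  </table>
</body></html>
"""


def extract_rows(root, row_path, field_paths):
    """Build one dict per row container, resolving each field relative to its row."""
    items = []
    for row in root.findall(row_path):      # array path: one match per item
        item = {}
        for name, path in field_paths.items():
            cell = row.find(path)           # child path: evaluated inside this row only
            item[name] = cell.text if cell is not None else None
        items.append(item)
    return items


root = ET.fromstring(PAGE)
orders = extract_rows(
    root,
    ".//table[@id='orders-table']//tbody/tr",
    {"order_id": ".//td[1]", "customer": ".//td[2]", "amount": ".//td[3]", "status": ".//td[4]"},
)
print(orders)
```

A field whose cell is missing in a given row comes back as None in this sketch; how the real runtime reports a missing cell is not specified here.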

Overwrite Arrays

Arrays append by default: if you extract into the same array more than once (for example, in a loop), the new items are added to the existing ones. To replace the array’s contents instead, add the array key to the overwriteArrayKeys array. Here’s an example JSON schema you could use in an Extract Datamodel node:

{
  "type": "object",
  "properties": {
    "organization_names": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "selected": true,
      "description": "A list of all organization names. The organization name is written on top of each card and enclosed by a headline tag"
    }
  },
  "overwriteArrayKeys": [
    "organization_names"
  ]
}
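
The difference between the two behaviors can be modeled as a small merge function. This is a sketch of the documented semantics only: merge_extraction is a hypothetical name, and replacing non-array values is an assumption, since the text above describes arrays only.

```python
def merge_extraction(stored: dict, extracted: dict, schema: dict) -> dict:
    """Merge newly extracted values into previously stored values.

    Assumed model: arrays append unless their key is listed in the schema's
    `overwriteArrayKeys`; any other value is simply replaced (assumption).
    """
    overwrite_keys = set(schema.get("overwriteArrayKeys", []))
    merged = dict(stored)
    for key, value in extracted.items():
        previous = merged.get(key)
        if isinstance(value, list) and isinstance(previous, list) and key not in overwrite_keys:
            merged[key] = previous + value   # default: append to the existing array
        else:
            merged[key] = value              # first extraction, or overwrite requested
    return merged


append_schema = {"type": "object", "properties": {}}
overwrite_schema = {"type": "object", "properties": {}, "overwriteArrayKeys": ["organization_names"]}

first_pass = merge_extraction({}, {"organization_names": ["Acme"]}, append_schema)
appended = merge_extraction(first_pass, {"organization_names": ["Globex"]}, append_schema)
replaced = merge_extraction(first_pass, {"organization_names": ["Globex"]}, overwrite_schema)
print(appended, replaced)
```

In a loop that re-extracts the same list on every iteration, the append default would accumulate duplicates, which is the case overwriteArrayKeys is meant for.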

Access Browser Variables

We allow extraction of some browser variables:

The complete URL the browser agent is on: {{window.location.href}}
The path name of the current URL: {{window.location.pathname}}
The query string of the current URL: {{window.location.search}}

Here’s an example JSON schema you can use in an Extract Datamodel node:

{
  "type": "object",
  "properties": {
    "current_url": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.href}}",
      "mode": "xpath"
    },
    "path_name": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.pathname}}",
      "mode": "xpath"
    },
    "query_string": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.search}}",
      "mode": "xpath"
    }
  }
}

Note that the execution type for this needs to be STATIC.
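
For readers less familiar with the browser's location object, the sketch below shows which part of a URL each variable corresponds to. It uses Python's urllib.parse as a stand-in for the browser, so it is an approximation: location_parts and the example URL are made up for illustration, and the output keys simply reuse the field names from the schema above.

```python
from urllib.parse import urlsplit


def location_parts(href: str) -> dict:
    """Approximate window.location.href, .pathname and .search for a URL string."""
    parts = urlsplit(href)
    return {
        "current_url": href,
        # Browsers report "/" as the pathname when the URL has no path.
        "path_name": parts.path or "/",
        # window.location.search keeps the leading "?" and is empty without a query.
        "query_string": f"?{parts.query}" if parts.query else "",
    }


example = location_parts("https://example.com/orders/42?status=open&page=2")
print(example)
```

The fragment (anything after #) is not part of the query string, and urlsplit separates it in the same way.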

Extract Raw HTML

You can extract the HTML content of the current page using document variables:

Sanitized HTML ({{document.sanitized}}): Extracts a simplified version of the HTML that removes most attributes and only maintains the structure, tags, and content. This is useful for cleaner data extraction and reduces noise when processing HTML.
Complete HTML ({{document}}): Extracts the entire raw HTML with all attributes intact, including classes, IDs, data attributes, styles, and other metadata.

Here’s an example JSON schema you can use in an Extract Datamodel node:

{
  "type": "object",
  "properties": {
    "sanitized_html": {
      "type": "string",
      "selected": true,
      "path": "{{document.sanitized}}",
      "mode": "xpath",
      "description": "Clean HTML with structure, tags, and content only"
    },
    "complete_html": {
      "type": "string",
      "selected": true,
      "path": "{{document}}",
      "mode": "xpath",
      "description": "Full raw HTML with all attributes"
    }
  }
}

Note that the execution type for this needs to be STATIC.

Notes

Use STATIC execution with XPaths for speed and reliability when page structure is stable
Use LLM_DOM for complex pages or when selectors frequently change
Add clear descriptions for each field to help the LLM understand what data to extract
Arrays extracted multiple times (e.g., in a loop) append by default; use overwriteArrayKeys to replace
For STATIC, scope row containers via the array’s own path (not the node selector), and prefer relative child paths like .//td[1] so that each field is resolved within its own row container
