# Extract Datamodel

> Extract structured data from the page

The **Extract Datamodel** node extracts structured data from the current page based on a JSON schema. This is useful for scraping data, validating page content, or capturing information for later use in the workflow.

## Parameters

| Parameter                | Type    | Required    | Description                                                                                                                                                                                                                                                                                                               |
| ------------------------ | ------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `extract_data_model`     | object  | Yes         | JSON schema defining the data structure to extract                                                                                                                                                                                                                                                                        |
| `execution`              | string  | No          | Execution type: `STATIC` (UI: "Static"), `LLM_DOM` (UI: "AI (HTML)"), `LLM_VISION` (UI: "AI (Screenshot)"), or `PROMPT` (UI: "AI (Context)"). If omitted in workflow JSON, the runtime treats it as `STATIC`                                                                                                              |
| `selector`               | string  | Conditional | XPath related to extraction. For `STATIC`: optional; used only as a readiness gate (waits until the element appears before extracting) and does **not** scope or filter what is extracted. For `LLM_DOM`: required; defines the DOM subtree whose `outerHTML` is sent to the model. Not used for `LLM_VISION` or `PROMPT` |
| `prompt`                 | string  | Conditional | Additional instructions for LLM extraction. Required for `PROMPT` execution                                                                                                                                                                                                                                               |
| `wait_time`              | number  | No          | Maximum time (ms) to wait for the selector. Default: 15000. Only used when `selector` is provided                                                                                                                                                                                                                         |
| `selector_error_message` | string  | No          | Custom error message if selector is not found                                                                                                                                                                                                                                                                             |
| `model`                  | string  | No          | Override the default LLM model for extraction. Only used for `LLM_DOM` and `LLM_VISION`                                                                                                                                                                                                                                   |
| `keep_html_metadata`     | boolean | No          | If true, preserve HTML attributes (classes, IDs, data attributes) when sending to LLM. Default: false (HTML is sanitized). Enable this when you need to extract data from HTML attributes e.g. IDs. Only used for `LLM_DOM`                                                                                               |
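
To illustrate how the conditional parameters combine, here is a minimal sketch of an `LLM_DOM` node that needs attribute values and therefore sets `keep_html_metadata`. The `selector` XPath, the node `id` and `name`, the `wait_time` value, and the field name are hypothetical placeholders:

```json
{
  "id": "abc123",
  "name": "Extract product card IDs",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "selector": "//div[@id='product-grid']",
    "keep_html_metadata": true,
    "wait_time": 10000,
    "extract_data_model": {
      "type": "object",
      "properties": {
        "product_ids": {
          "type": "array",
          "items": { "type": "string" },
          "selected": true,
          "description": "The value of the data-product-id attribute on each product card"
        }
      }
    }
  }
}
```

Without `keep_html_metadata`, the sanitized HTML would no longer carry the attribute the description asks for.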

### How the `selector` interacts with the datamodel

The `selector` parameter behaves very differently between `STATIC` and `LLM_DOM` — be careful not to confuse the two.

* **`STATIC`**: The `selector` is a readiness gate only. Before extracting, the runtime waits up to `wait_time` ms for the selector to resolve (throwing `selector_error_message` if it never does). It is **not** prepended to field paths and does **not** restrict what is extracted — every `path` inside `extract_data_model` is evaluated against the full document (including iframes and open shadow DOM). To scope `STATIC` extraction to a specific region, put the XPath on the array or field's own `path` instead (e.g., `"path": "//table//tbody/tr"` for row containers with relative `".//td[1]"` children).
* **`LLM_DOM`**: The `selector` is required. The runtime waits for the element, then takes its `outerHTML` and sends only that subtree to the model as the extraction input. Choose the smallest element that still contains every field you want the model to extract.
* **`LLM_VISION`** and **`PROMPT`**: The `selector` is not used.
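
The `STATIC` behavior above can be sketched as follows. The node uses `selector` purely as a readiness gate and scopes the extraction through the field's own `path`. The XPaths, the node `id` and `name`, and the error message are placeholders, not values from a real workflow:

```json
{
  "id": "abc123",
  "name": "Extract invoice numbers",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "selector": "//table[@id='invoices']",
    "wait_time": 20000,
    "selector_error_message": "Invoices table did not load",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "invoice_numbers": {
          "type": "array",
          "items": { "type": "string" },
          "selected": true,
          "path": "//table[@id='invoices']//tbody/tr/td[1]",
          "mode": "xpath"
        }
      }
    }
  }
}
```

Here the runtime waits up to 20 seconds for the table to appear, but the field `path` still has to repeat the table XPath because the `selector` does not scope it.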

## Schema Structure

The `extract_data_model` follows JSON Schema with CloudCruise extensions:

```json theme={null}
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "selected": true,
      "description": "Description for LLM extraction",
      "path": "//xpath/expression",
      "mode": "xpath"
    }
  }
}
```

### Schema Properties

| Property      | Description                                                 |
| ------------- | ----------------------------------------------------------- |
| `type`        | Data type: `string`, `number`, `boolean`, `array`, `object` |
| `selected`    | Set to `true` to include this field in extraction           |
| `description` | Description to help LLM understand what to extract          |
| `path`        | XPath expression for `STATIC` extraction                    |
| `mode`        | Set to `xpath` for XPath-based extraction                   |

## Examples

### Basic Extraction with LLM\_DOM

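Extract user information using AI. `LLM_DOM` requires a `selector`; the XPath below is a placeholder for the element that contains the user details:

```json theme={null}
{
  "id": "abc123",
  "name": "Extract user details",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "selector": "//div[@id='user-profile']",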
    "extract_data_model": {
      "type": "object",
      "properties": {
        "user_name": {
          "type": "string",
          "selected": true,
          "description": "The user's full name displayed in the header"
        },
        "email": {
          "type": "string",
          "selected": true,
          "description": "The user's email address"
        },
        "account_status": {
          "type": "string",
          "selected": true,
          "description": "The account status (active, inactive, pending)"
        }
      }
    }
  }
}
```
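
For comparison, a screenshot-based variant can be sketched as follows. `LLM_VISION` does not use `selector`, so none is set. Passing `prompt` as extra guidance in this mode is an assumption based on the parameter table, and the prompt text, node `id`, and node `name` are hypothetical:

```json
{
  "id": "abc123",
  "name": "Extract account status from screenshot",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_VISION",
    "prompt": "Read the status badge next to the user's name.",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "account_status": {
          "type": "string",
          "selected": true,
          "description": "The account status (active, inactive, pending)"
        }
      }
    }
  }
}
```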

### STATIC Extraction with XPath

Extract data using explicit XPath selectors:

```json theme={null}
{
  "id": "abc123",
  "name": "Extract order details",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "order_id": {
          "type": "string",
          "selected": true,
          "path": "//span[@data-testid='order-id']",
          "mode": "xpath"
        },
        "total_amount": {
          "type": "string",
          "selected": true,
          "path": "//div[@class='total']//span[@class='amount']",
          "mode": "xpath"
        }
      }
    }
  }
}
```

You can also extract HTML attributes (e.g., `id`, `href`, `data-*`) by pointing the XPath to the attribute:

```json theme={null}
{
  "product_ids": {
    "type": "array",
    "items": {
      "type": "string"
    },
    "selected": true,
    "path": "//div[@class='product-card']/@data-product-id",
    "mode": "xpath"
  }
}
```

## Arrays

### Extract Array of Items

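Extract a list of items from the page. As with any `LLM_DOM` node, a `selector` is required; the XPath below is a placeholder for the results container:

```json theme={null}
{
  "id": "abc123",
  "name": "Extract product list",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "selector": "//div[@id='search-results']",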
    "extract_data_model": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "selected": true,
          "items": {
            "type": "object",
            "properties": {
              "name": {
                "type": "string",
                "description": "Product name"
              },
              "price": {
                "type": "string",
                "description": "Product price"
              },
              "sku": {
                "type": "string",
                "description": "Product SKU"
              }
            }
          },
          "description": "List of all products shown in the search results"
        }
      }
    }
  }
}
```

### Static Array Extraction

To extract an array using `STATIC` execution, provide an XPath that matches multiple elements. Each matched element becomes an item in the array:

```json theme={null}
{
  "id": "abc123",
  "name": "Extract all product names",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "product_names": {
          "type": "array",
          "items": {
            "type": "string"
          },
          "selected": true,
          "path": "//div[@class='product-card']//h3[@class='product-name']",
          "mode": "xpath"
        }
      }
    }
  }
}
```

For extracting an array of objects (e.g., table rows with multiple columns), define the `path` on the array to match the repeating container elements, then use relative XPaths for each property within the items:

```json theme={null}
{
  "id": "abc123",
  "name": "Extract table rows",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "orders": {
          "type": "array",
          "selected": true,
          "path": "//table[@id='orders-table']//tbody/tr",
          "mode": "xpath",
          "items": {
            "type": "object",
            "properties": {
              "order_id": {
                "type": "string",
                "path": ".//td[1]",
                "mode": "xpath"
              },
              "customer": {
                "type": "string",
                "path": ".//td[2]",
                "mode": "xpath"
              },
              "amount": {
                "type": "string",
                "path": ".//td[3]",
                "mode": "xpath"
              },
              "status": {
                "type": "string",
                "path": ".//td[4]",
                "mode": "xpath"
              }
            }
          }
        }
      }
    }
  }
}
```

The array's `path` matches each `<tr>` row, and each property uses a relative XPath to extract the corresponding cell within that row.
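
Relative paths can plausibly also end in an attribute step, combining the row pattern with the attribute extraction shown earlier. The fragment below is a sketch under that assumption; the `detail_url` field and the link inside the first cell are hypothetical:

```json
{
  "orders": {
    "type": "array",
    "selected": true,
    "path": "//table[@id='orders-table']//tbody/tr",
    "mode": "xpath",
    "items": {
      "type": "object",
      "properties": {
        "order_id": {
          "type": "string",
          "path": ".//td[1]",
          "mode": "xpath"
        },
        "detail_url": {
          "type": "string",
          "path": ".//td[1]//a/@href",
          "mode": "xpath"
        }
      }
    }
  }
}
```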

### Overwrite Arrays

Arrays append by default: if a workflow extracts into the same array more than once (for example, inside a loop), the new items are added to the existing ones. To replace the previous contents instead, add the array's key to `overwriteArrayKeys`. Here's an example JSON schema you could use in an Extract Datamodel node:

```json theme={null}
{
  "type": "object",
  "properties": {
    "organization_names": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "selected": true,
      "description": "A list of all organization names. The organization name is written on top of each card and enclosed by a headline tag"
    }
  },
  "overwriteArrayKeys": [
    "organization_names"
  ]
}
```
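
To make the difference concrete, suppose a loop runs the node twice and the page shows two cards on the first pass and one on the second. The organization names and the key names in this illustration are assumptions, not actual runtime output:

```json
{
  "first_pass_extracts": ["Acme Corp", "Globex"],
  "second_pass_extracts": ["Initech"],
  "result_with_default_append": ["Acme Corp", "Globex", "Initech"],
  "result_with_overwriteArrayKeys": ["Initech"]
}
```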

## Access Browser Variables

The following browser variables can be extracted:

* The complete URL the browser agent is on: `{{window.location.href}}`
* The path name of the current URL: `{{window.location.pathname}}`
* The query string of the current URL: `{{window.location.search}}`

Here's an example JSON schema you can use in an Extract Datamodel node:

```json theme={null}
{
  "type": "object",
  "properties": {
    "current_url": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.href}}",
      "mode": "xpath"
    },
    "path_name": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.pathname}}",
      "mode": "xpath"
    },
    "query_string": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.search}}",
      "mode": "xpath"
    }
  }
}
```

Browser variables require the `STATIC` execution type.
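
Wrapped in a full node, such a schema might look like the sketch below, which sets `execution` explicitly and pairs a browser variable with an ordinary XPath field. Mixing the two in one schema is an assumption here, and the node `id`, `name`, field names, and XPath are placeholders:

```json
{
  "id": "abc123",
  "name": "Capture confirmation page",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "confirmation_url": {
          "type": "string",
          "selected": true,
          "path": "{{window.location.href}}",
          "mode": "xpath"
        },
        "confirmation_number": {
          "type": "string",
          "selected": true,
          "path": "//span[@data-testid='confirmation-number']",
          "mode": "xpath"
        }
      }
    }
  }
}
```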

## Extract Raw HTML

You can extract the HTML content of the current page using document variables:

* **Sanitized HTML** (`{{document.sanitized}}`): Extracts a simplified version of the HTML that removes most attributes and only maintains the structure, tags, and content. This is useful for cleaner data extraction and reduces noise when processing HTML.
* **Complete HTML** (`{{document}}`): Extracts the entire raw HTML with all attributes intact, including classes, IDs, data attributes, styles, and other metadata.

Here's an example JSON schema you can use in an Extract Datamodel node:

```json theme={null}
{
  "type": "object",
  "properties": {
    "sanitized_html": {
      "type": "string",
      "selected": true,
      "path": "{{document.sanitized}}",
      "mode": "xpath",
      "description": "Clean HTML with structure, tags, and content only"
    },
    "complete_html": {
      "type": "string",
      "selected": true,
      "path": "{{document}}",
      "mode": "xpath",
      "description": "Full raw HTML with all attributes"
    }
  }
}
```

The document variables require the `STATIC` execution type.

## Notes

* Use `STATIC` execution with XPaths for speed and reliability when page structure is stable
* Use `LLM_DOM` for complex pages or when selectors frequently change
* Add clear descriptions for each field to help the LLM understand what data to extract
* Arrays extracted multiple times (e.g., in a loop) append by default; use `overwriteArrayKeys` to replace
* For `STATIC`, scope row containers via the array's own `path` (not the node `selector`), and prefer relative child paths like `.//td[1]` for the cleanest semantics
