The Extract Datamodel node extracts structured data from the current page based on a JSON schema. This is useful for scraping data, validating page content, or capturing information for later use in the workflow.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| extract_data_model | object | Yes | JSON schema defining the data structure to extract |
| execution | string | No | Execution type: STATIC (UI: “Static”), LLM_DOM (UI: “AI (HTML)”), LLM_VISION (UI: “AI (Screenshot)”), or PROMPT (UI: “AI (Context)”). If omitted in workflow JSON, the runtime treats it as STATIC |
| selector | string | Conditional | XPath related to extraction. For STATIC: optional; used only as a readiness gate (waits until the element appears before extracting) and does not scope or filter what is extracted. For LLM_DOM: required; defines the DOM subtree whose outerHTML is sent to the model. Not used for LLM_VISION or PROMPT |
| prompt | string | Conditional | Additional instructions for LLM extraction. Required for PROMPT execution |
| wait_time | number | No | Maximum time (ms) to wait for the selector. Default: 15000. Only used when selector is provided |
| selector_error_message | string | No | Custom error message if selector is not found |
| llm_model | string | No | Override the default LLM model for extraction. Only used for LLM_DOM and LLM_VISION |
| keep_html_metadata | boolean | No | If true, preserve HTML attributes (classes, IDs, data attributes) when sending to LLM. Default: false (HTML is sanitized). Enable this when you need to extract data from HTML attributes e.g. IDs. Only used for LLM_DOM |
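
To make the conditional parameters concrete, here is a minimal sketch of a node that uses LLM_VISION, which takes no selector and accepts an optional prompt. The node name, field name, and prompt text are hypothetical; only the parameter names come from the table above.

```json
{
  "id": "abc123",
  "name": "Read banner from screenshot",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_VISION",
    "prompt": "Only consider the banner at the top of the page.",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "banner_text": {
          "type": "string",
          "selected": true,
          "description": "The message shown in the banner at the top of the page"
        }
      }
    }
  }
}
```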

How the selector interacts with the datamodel

The selector parameter behaves very differently under STATIC and LLM_DOM execution, so be careful not to confuse the two.
  • STATIC: The selector is a readiness gate only. Before extracting, the runtime waits up to wait_time ms for the selector to resolve (throwing selector_error_message if it never does). It is not prepended to field paths and does not restrict what is extracted — every path inside extract_data_model is evaluated against the full document (including iframes and open shadow DOM). To scope STATIC extraction to a specific region, put the XPath on the array or field’s own path instead (e.g., "path": "//table//tbody/tr" for row containers with relative ".//td[1]" children).
  • LLM_DOM: The selector is required. The runtime waits for the element, then takes its outerHTML and sends only that subtree to the model as the extraction input. Choose the smallest element that still contains every field you want the model to extract.
  • LLM_VISION and PROMPT: The selector is not used.
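
As an illustration of the STATIC behavior described above, the sketch below waits for a results table before extracting, while one of its fields lives elsewhere on the page. The XPaths, wait time, and error message are hypothetical values chosen for the example.

```json
{
  "id": "abc123",
  "name": "Extract after table loads",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "selector": "//table[@id='results']",
    "wait_time": 20000,
    "selector_error_message": "Results table did not load",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "page_title": {
          "type": "string",
          "selected": true,
          "path": "//h1",
          "mode": "xpath"
        },
        "first_result": {
          "type": "string",
          "selected": true,
          "path": "//table[@id='results']//tbody/tr[1]/td[1]",
          "mode": "xpath"
        }
      }
    }
  }
}
```

Here page_title sits outside the table but should still be extracted, because the selector only delays extraction. first_result repeats the table XPath in its own path, since the selector does not scope it.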

Schema Structure

The extract_data_model follows JSON Schema with CloudCruise extensions:
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "selected": true,
      "description": "Description for LLM extraction",
      "path": "//xpath/expression",
      "mode": "xpath"
    }
  }
}

Schema Properties

| Property | Description |
| --- | --- |
| type | Data type: string, number, boolean, array, object |
| selected | Set to true to include this field in extraction |
| description | Description to help LLM understand what to extract |
| path | XPath expression for STATIC extraction |
| mode | Set to xpath for XPath-based extraction |
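
For instance, a schema for LLM-based extraction might mix several of the listed types. This is a sketch with hypothetical field names; path and mode are left out because the table describes them for XPath-based (STATIC) extraction.

```json
{
  "type": "object",
  "properties": {
    "invoice_total": {
      "type": "number",
      "selected": true,
      "description": "The invoice total as a number, without the currency symbol"
    },
    "is_paid": {
      "type": "boolean",
      "selected": true,
      "description": "Whether the invoice is marked as paid"
    }
  }
}
```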

Examples

Basic Extraction with LLM_DOM

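Extract user information using AI. LLM_DOM requires a selector, so this example uses //body as a placeholder; in a real workflow, choose the smallest element that contains every field:
{
  "id": "abc123",
  "name": "Extract user details",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "selector": "//body",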
    "extract_data_model": {
      "type": "object",
      "properties": {
        "user_name": {
          "type": "string",
          "selected": true,
          "description": "The user's full name displayed in the header"
        },
        "email": {
          "type": "string",
          "selected": true,
          "description": "The user's email address"
        },
        "account_status": {
          "type": "string",
          "selected": true,
          "description": "The account status (active, inactive, pending)"
        }
      }
    }
  }
}
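
When the value you need lives in an HTML attribute rather than in visible text, the parameter table suggests enabling keep_html_metadata so attributes are not stripped before the HTML reaches the model. The following is a sketch under that reading; the selector, attribute name, and field name are hypothetical.

```json
{
  "id": "abc123",
  "name": "Extract user ID from attributes",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "selector": "//div[@id='profile']",
    "keep_html_metadata": true,
    "extract_data_model": {
      "type": "object",
      "properties": {
        "user_id": {
          "type": "string",
          "selected": true,
          "description": "The value of the data-user-id attribute on the profile container"
        }
      }
    }
  }
}
```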

STATIC Extraction with XPath

Extract data using explicit XPath selectors:
{
  "id": "abc123",
  "name": "Extract order details",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "order_id": {
          "type": "string",
          "selected": true,
          "path": "//span[@data-testid='order-id']",
          "mode": "xpath"
        },
        "total_amount": {
          "type": "string",
          "selected": true,
          "path": "//div[@class='total']//span[@class='amount']",
          "mode": "xpath"
        }
      }
    }
  }
}
You can also extract HTML attributes (e.g., id, href, data-*) by pointing the XPath to the attribute:
{
  "product_ids": {
    "type": "array",
    "items": {
      "type": "string"
    },
    "selected": true,
    "path": "//div[@class='product-card']/@data-product-id",
    "mode": "xpath"
  }
}
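
The same pattern should apply to a single string field. As a sketch, the fragment below reads the href of one link; the field name and the link's id are hypothetical.

```json
{
  "invoice_link": {
    "type": "string",
    "selected": true,
    "path": "//a[@id='download-invoice']/@href",
    "mode": "xpath"
  }
}
```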

Arrays

Extract Array of Items

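Extract a list of items from the page. The //body selector is a placeholder for the required LLM_DOM selector; narrow it to the element that wraps the list:
{
  "id": "abc123",
  "name": "Extract product list",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "selector": "//body",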
    "extract_data_model": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "selected": true,
          "items": {
            "type": "object",
            "properties": {
              "name": {
                "type": "string",
                "description": "Product name"
              },
              "price": {
                "type": "string",
                "description": "Product price"
              },
              "sku": {
                "type": "string",
                "description": "Product SKU"
              }
            }
          },
          "description": "List of all products shown in the search results"
        }
      }
    }
  }
}

Static Array Extraction

To extract an array using STATIC execution, provide an XPath that matches multiple elements. Each matched element becomes an item in the array:
{
  "id": "abc123",
  "name": "Extract all product names",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "product_names": {
          "type": "array",
          "items": {
            "type": "string"
          },
          "selected": true,
          "path": "//div[@class='product-card']//h3[@class='product-name']",
          "mode": "xpath"
        }
      }
    }
  }
}
For extracting an array of objects (e.g., table rows with multiple columns), define the path on the array to match the repeating container elements, then use relative XPaths for each property within the items:
{
  "id": "abc123",
  "name": "Extract table rows",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "orders": {
          "type": "array",
          "selected": true,
          "path": "//table[@id='orders-table']//tbody/tr",
          "mode": "xpath",
          "items": {
            "type": "object",
            "properties": {
              "order_id": {
                "type": "string",
                "path": ".//td[1]",
                "mode": "xpath"
              },
              "customer": {
                "type": "string",
                "path": ".//td[2]",
                "mode": "xpath"
              },
              "amount": {
                "type": "string",
                "path": ".//td[3]",
                "mode": "xpath"
              },
              "status": {
                "type": "string",
                "path": ".//td[4]",
                "mode": "xpath"
              }
            }
          }
        }
      }
    }
  }
}
The array’s path matches each <tr> row, and each property uses a relative XPath to extract the corresponding cell within that row.
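
To picture the result, assume the table has two rows. The extracted data would then have roughly the shape below. The values are made up for illustration, and the exact envelope in which the runtime returns them is not specified here.

```json
{
  "orders": [
    {
      "order_id": "1001",
      "customer": "Jane Doe",
      "amount": "$42.00",
      "status": "Shipped"
    },
    {
      "order_id": "1002",
      "customer": "John Smith",
      "amount": "$17.50",
      "status": "Pending"
    }
  ]
}
```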

Overwrite Arrays

Arrays append by default: if you extract into the same array twice (for example, in a loop), the new items are added to the existing ones. To replace the array instead, add its key to the overwriteArrayKeys array. Here’s an example JSON schema you could use in an Extract Datamodel node:
{
  "type": "object",
  "properties": {
    "organization_names": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "selected": true,
      "description": "A list of all organization names. The organization name is written on top of each card and enclosed by a headline tag"
    }
  },
  "overwriteArrayKeys": [
    "organization_names"
  ]
}
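
The difference between the two behaviors can be illustrated with two loop iterations. The object below is not a CloudCruise payload; its keys are hypothetical labels and its values are invented, purely to show the resulting array in each case.

```json
{
  "iteration_1_extracted": ["Acme Corp", "Globex"],
  "iteration_2_extracted": ["Initech"],
  "result_with_default_append": ["Acme Corp", "Globex", "Initech"],
  "result_with_overwriteArrayKeys": ["Initech"]
}
```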

Access Browser Variables

The following browser variables can be extracted:
  • The complete URL the browser agent is on: {{window.location.href}}
  • The path name of the current URL: {{window.location.pathname}}
  • The query string of the current URL: {{window.location.search}}
Here’s an example JSON schema you can use in an Extract Datamodel node:
{
  "type": "object",
  "properties": {
    "current_url": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.href}}",
      "mode": "xpath"
    },
    "path_name": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.pathname}}",
      "mode": "xpath"
    },
    "query_string": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.search}}",
      "mode": "xpath"
    }
  }
}
Note that browser variables require the STATIC execution type.
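
As a worked example, suppose the browser is on the hypothetical URL https://example.com/orders/42?tab=items&page=2. Based on how window.location behaves in browsers, the three fields above would hold the values below; the surrounding output shape is an assumption.

```json
{
  "current_url": "https://example.com/orders/42?tab=items&page=2",
  "path_name": "/orders/42",
  "query_string": "?tab=items&page=2"
}
```

Note that the query string includes the leading question mark, and that it is an empty string when the URL has no query.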

Extract Raw HTML

You can extract the HTML content of the current page using document variables:
  • Sanitized HTML ({{document.sanitized}}): Extracts a simplified version of the HTML that removes most attributes and only maintains the structure, tags, and content. This is useful for cleaner data extraction and reduces noise when processing HTML.
  • Complete HTML ({{document}}): Extracts the entire raw HTML with all attributes intact, including classes, IDs, data attributes, styles, and other metadata.
Here’s an example JSON schema you can use in an Extract Datamodel node:
{
  "type": "object",
  "properties": {
    "sanitized_html": {
      "type": "string",
      "selected": true,
      "path": "{{document.sanitized}}",
      "mode": "xpath",
      "description": "Clean HTML with structure, tags, and content only"
    },
    "complete_html": {
      "type": "string",
      "selected": true,
      "path": "{{document}}",
      "mode": "xpath",
      "description": "Full raw HTML with all attributes"
    }
  }
}
Note that document variables require the STATIC execution type.

Notes

  • Use STATIC execution with XPaths for speed and reliability when page structure is stable
  • Use LLM_DOM for complex pages or when selectors frequently change
  • Add clear descriptions for each field to help the LLM understand what data to extract
  • Arrays extracted multiple times (e.g., in a loop) append by default; use overwriteArrayKeys to replace
  • For STATIC, scope row containers via the array’s own path (not the node selector), and prefer relative child paths like .//td[1] so each property is resolved within its own row