The Extract Datamodel node extracts structured data from the current page based on a JSON schema. This is useful for scraping data, validating page content, or capturing information for later use in the workflow.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| extract_data_model | object | Yes | JSON schema defining the data structure to extract |
| execution | string | No | Execution type: STATIC (UI: “Static”), LLM_DOM (UI: “AI (HTML)”), LLM_VISION (UI: “AI (Screenshot)”), or PROMPT (UI: “AI (Context)”). Default: LLM_DOM |
| selector | string | Conditional | XPath to scope the extraction area. Required for LLM_DOM, optional for STATIC (to wait for element before extracting), not used for LLM_VISION |
| prompt | string | Conditional | Additional instructions for LLM extraction. Required for PROMPT execution |
| wait_time | number | No | Maximum time (ms) to wait for the selector. Default: 15000. Only used when selector is provided |
| selector_error_message | string | No | Custom error message if selector is not found |
| llm_model | string | No | Override the default LLM model for extraction. Only used for LLM_DOM and LLM_VISION |
| keep_html_metadata | boolean | No | If true, preserve HTML attributes (classes, IDs, data attributes) when sending to LLM. Default: false (HTML is sanitized). Enable this when you need to extract data from HTML attributes, e.g. IDs. Only used for LLM_DOM |
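
The conditional rules in the table can be checked before a workflow is saved. The sketch below is a hypothetical helper (check_parameters is not part of CloudCruise) that encodes only the rules stated in the table. Treat it as an illustration under that assumption, not as the platform's own validation logic.

```python
from typing import Any

EXECUTION_TYPES = {"STATIC", "LLM_DOM", "LLM_VISION", "PROMPT"}


def check_parameters(params: dict[str, Any]) -> list[str]:
    """Return the problems found in an Extract Datamodel parameter block."""
    problems: list[str] = []
    execution = params.get("execution", "LLM_DOM")  # documented default

    if "extract_data_model" not in params:
        problems.append("extract_data_model is required")
    if execution not in EXECUTION_TYPES:
        problems.append(f"unknown execution type: {execution}")
    if execution == "LLM_DOM" and not params.get("selector"):
        problems.append("selector is required for LLM_DOM")
    if execution == "LLM_VISION" and params.get("selector"):
        problems.append("selector is not used for LLM_VISION")
    if execution == "PROMPT" and not params.get("prompt"):
        problems.append("prompt is required for PROMPT")
    if "wait_time" in params and not params.get("selector"):
        problems.append("wait_time only applies when selector is provided")
    return problems


example = {
    "execution": "LLM_DOM",
    "selector": "//main",  # illustrative XPath, not taken from the docs
    "wait_time": 15000,
    "extract_data_model": {"type": "object", "properties": {}},
}
print(check_parameters(example))
print(check_parameters({"execution": "PROMPT", "extract_data_model": {}}))
```

The first call reports no problems; the second reports the missing prompt.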

Schema Structure

The extract_data_model follows JSON Schema with CloudCruise extensions:
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "selected": true,
      "description": "Description for LLM extraction",
      "path": "//xpath/expression",
      "mode": "xpath"
    }
  }
}

Schema Properties

| Property | Description |
| --- | --- |
| type | Data type: string, number, boolean, array, object |
| selected | Set to true to include this field in extraction |
| description | Description to help LLM understand what to extract |
| path | XPath expression for STATIC extraction |
| mode | Set to xpath for XPath-based extraction |
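
When schemas are generated programmatically, small builder functions keep the two field styles apart. The following is a minimal sketch: llm_field, xpath_field, and selected_fields are hypothetical names, and it assumes that a field with selected set to false (or omitted at the top level) is skipped during extraction.

```python
import json


def llm_field(description: str, type_: str = "string") -> dict:
    """Field resolved by an LLM execution type; the description guides the model."""
    return {"type": type_, "selected": True, "description": description}


def xpath_field(path: str, type_: str = "string") -> dict:
    """Field resolved by STATIC execution from an explicit XPath."""
    return {"type": type_, "selected": True, "path": path, "mode": "xpath"}


def selected_fields(schema: dict) -> list[str]:
    """Names of the top-level properties that are marked for extraction."""
    return [
        name
        for name, spec in schema.get("properties", {}).items()
        if spec.get("selected") is True
    ]


schema = {
    "type": "object",
    "properties": {
        "order_id": xpath_field("//span[@data-testid='order-id']"),
        "summary": llm_field("One-line summary of the order"),
        # Assumption: an unselected field is left out of the result.
        "internal_note": {"type": "string", "selected": False},
    },
}
print(json.dumps(schema, indent=2))
print(selected_fields(schema))
```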

Examples

Basic Extraction with LLM_DOM

Extract user information using AI:
{
  "id": "abc123",
  "name": "Extract user details",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "user_name": {
          "type": "string",
          "selected": true,
          "description": "The user's full name displayed in the header"
        },
        "email": {
          "type": "string",
          "selected": true,
          "description": "The user's email address"
        },
        "account_status": {
          "type": "string",
          "selected": true,
          "description": "The account status (active, inactive, pending)"
        }
      }
    }
  }
}

STATIC Extraction with XPath

Extract data using explicit XPath selectors:
{
  "id": "abc123",
  "name": "Extract order details",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "order_id": {
          "type": "string",
          "selected": true,
          "path": "//span[@data-testid='order-id']",
          "mode": "xpath"
        },
        "total_amount": {
          "type": "string",
          "selected": true,
          "path": "//div[@class='total']//span[@class='amount']",
          "mode": "xpath"
        }
      }
    }
  }
}
You can also extract HTML attributes (e.g., id, href, data-*) by pointing the XPath to the attribute:
{
  "product_ids": {
    "type": "array",
    "items": {
      "type": "string"
    },
    "selected": true,
    "path": "//div[@class='product-card']/@data-product-id",
    "mode": "xpath"
  }
}

Arrays

Extract Array of Items

Extract a list of items from the page:
{
  "id": "abc123",
  "name": "Extract product list",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "LLM_DOM",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "selected": true,
          "items": {
            "type": "object",
            "properties": {
              "name": {
                "type": "string",
                "description": "Product name"
              },
              "price": {
                "type": "string",
                "description": "Product price"
              },
              "sku": {
                "type": "string",
                "description": "Product SKU"
              }
            }
          },
          "description": "List of all products shown in the search results"
        }
      }
    }
  }
}

Static Array Extraction

To extract an array using STATIC execution, provide an XPath that matches multiple elements. Each matched element becomes an item in the array:
{
  "id": "abc123",
  "name": "Extract all product names",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "product_names": {
          "type": "array",
          "items": {
            "type": "string"
          },
          "selected": true,
          "path": "//div[@class='product-card']//h3[@class='product-name']",
          "mode": "xpath"
        }
      }
    }
  }
}
To extract an array of objects (e.g., table rows with multiple columns), define the path on the array so that it matches the repeating container elements, then give each property within items an XPath relative to that container:
{
  "id": "abc123",
  "name": "Extract table rows",
  "action": "EXTRACT_DATAMODEL",
  "parameters": {
    "execution": "STATIC",
    "extract_data_model": {
      "type": "object",
      "properties": {
        "orders": {
          "type": "array",
          "selected": true,
          "path": "//table[@id='orders-table']//tbody/tr",
          "mode": "xpath",
          "items": {
            "type": "object",
            "properties": {
              "order_id": {
                "type": "string",
                "path": "/td[1]",
                "mode": "xpath"
              },
              "customer": {
                "type": "string",
                "path": "/td[2]",
                "mode": "xpath"
              },
              "amount": {
                "type": "string",
                "path": "/td[3]",
                "mode": "xpath"
              },
              "status": {
                "type": "string",
                "path": "/td[4]",
                "mode": "xpath"
              }
            }
          }
        }
      }
    }
  }
}
The array’s path matches each <tr> row, and each property uses a relative XPath to extract the corresponding cell within that row.
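
To make the row-then-cell evaluation concrete, here is a rough local simulation using Python's standard xml.etree.ElementTree. It is only a sketch: the sample page is invented, ElementTree supports a subset of XPath (so the row path is written relative to the document root), and stripping the leading slash from each item path is an assumption about how relative paths are applied to a row, not CloudCruise's actual implementation.

```python
import xml.etree.ElementTree as ET

# Invented sample page for illustration only.
PAGE = """
<html><body>
  <table id="orders-table">
    <tbody>
      <tr><td>1001</td><td>Ada</td><td>19.99</td><td>shipped</td></tr>
      <tr><td>1002</td><td>Grace</td><td>5.00</td><td>pending</td></tr>
    </tbody>
  </table>
</body></html>
"""

# Path on the array: matches each repeating container (one <tr> per item).
ROW_PATH = ".//table[@id='orders-table']/tbody/tr"

# Paths on the item properties, written as in the schema above.
ITEM_PATHS = {
    "order_id": "/td[1]",
    "customer": "/td[2]",
    "amount": "/td[3]",
    "status": "/td[4]",
}


def extract_rows(html: str) -> list[dict[str, str]]:
    """Evaluate the array path once, then each item path inside every match."""
    root = ET.fromstring(html.strip())
    rows = []
    for row in root.findall(ROW_PATH):
        item = {}
        for name, rel_path in ITEM_PATHS.items():
            # Assumption: the item path is resolved relative to the row.
            cell = row.find(rel_path.lstrip("/"))
            item[name] = (cell.text or "").strip() if cell is not None else ""
        rows.append(item)
    return rows


for order in extract_rows(PAGE):
    print(order)
```

Each matched row yields one object, so the result has as many items as the table has body rows.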

Overwrite Arrays

Arrays append by default: if you extract into the same array more than once (for example, in a loop), the new items are added to the existing ones. To replace the array’s contents instead, add the array key to overwriteArrayKeys. Here’s an example JSON schema you could use in an Extract Datamodel node:
{
  "type": "object",
  "properties": {
    "organization_names": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "selected": true,
      "description": "A list of all organization names. The organization name is written on top of each card and enclosed by a headline tag"
    }
  },
  "overwriteArrayKeys": [
    "organization_names"
  ]
}
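
The difference between the two behaviors can be sketched in a few lines of Python. merge_extraction is a hypothetical function that mimics the append and overwrite semantics described above; it is not the platform's code, and the organization names are made up.

```python
def merge_extraction(state: dict, extracted: dict, overwrite_array_keys=()) -> dict:
    """Merge one extraction result into the accumulated workflow data."""
    merged = dict(state)
    for key, value in extracted.items():
        appendable = (
            isinstance(value, list)
            and isinstance(merged.get(key), list)
            and key not in overwrite_array_keys
        )
        if appendable:
            # Default behavior: new items are appended to the existing array.
            merged[key] = merged[key] + value
        else:
            # Listed in overwriteArrayKeys (or not an array): replace the value.
            merged[key] = value
    return merged


first_pass = {"organization_names": ["Acme", "Globex"]}
second_pass = {"organization_names": ["Initech"]}

appended = merge_extraction(first_pass, second_pass)
replaced = merge_extraction(
    first_pass, second_pass, overwrite_array_keys=["organization_names"]
)
print(appended)
print(replaced)
```

With the default, the second pass extends the list to three names; with the key listed in overwriteArrayKeys, only the names from the second pass remain.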

Access Browser Variables

We allow extraction of some browser variables:
  • The complete URL the browser agent is on: {{window.location.href}}
  • The path name of the current URL: {{window.location.pathname}}
  • The query string of the current URL: {{window.location.search}}
Here’s an example JSON schema you can use in an Extract Datamodel node:
{
  "type": "object",
  "properties": {
    "current_url": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.href}}",
      "mode": "xpath"
    },
    "path_name": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.pathname}}",
      "mode": "xpath"
    },
    "query_string": {
      "type": "string",
      "selected": true,
      "path": "{{window.location.search}}",
      "mode": "xpath"
    }
  }
}
Note that the execution type for this needs to be STATIC.
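
For orientation, the three variables map onto the standard parts of a URL. The sketch below uses Python's urllib.parse to show which part each one corresponds to. The URL is an invented example, and the function returns the input string as href without the normalization a browser may apply.

```python
from urllib.parse import urlsplit


def location_parts(url: str) -> dict[str, str]:
    """Split a URL into the pieces exposed by the three browser variables."""
    parts = urlsplit(url)
    return {
        "window.location.href": url,
        # Browsers report "/" as the pathname when the URL has no path.
        "window.location.pathname": parts.path or "/",
        # search includes the leading "?" and is empty when there is no query.
        "window.location.search": f"?{parts.query}" if parts.query else "",
    }


for name, value in location_parts(
    "https://example.com/orders/42?status=open&page=2"
).items():
    print(f"{name}: {value}")
```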

Extract Raw HTML

You can extract the HTML content of the current page using document variables:
  • Sanitized HTML ({{document.sanitized}}): Extracts a simplified version of the HTML that removes most attributes and only maintains the structure, tags, and content. This is useful for cleaner data extraction and reduces noise when processing HTML.
  • Complete HTML ({{document}}): Extracts the entire raw HTML with all attributes intact, including classes, IDs, data attributes, styles, and other metadata.
Here’s an example JSON schema you can use in an Extract Datamodel node:
{
  "type": "object",
  "properties": {
    "sanitized_html": {
      "type": "string",
      "selected": true,
      "path": "{{document.sanitized}}",
      "mode": "xpath",
      "description": "Clean HTML with structure, tags, and content only"
    },
    "complete_html": {
      "type": "string",
      "selected": true,
      "path": "{{document}}",
      "mode": "xpath",
      "description": "Full raw HTML with all attributes"
    }
  }
}
Note that the execution type for this needs to be STATIC.

Notes

  • Use STATIC execution with XPaths for speed and reliability when page structure is stable
  • Use LLM_DOM for complex pages or when selectors frequently change
  • Add clear descriptions for each field to help the LLM understand what data to extract
  • Arrays extracted multiple times (e.g., in a loop) append by default; use overwriteArrayKeys to replace