feat: n8n user docs for WCC actor app #1763
Open
protoss70 wants to merge 12 commits into master from feat/n8n-wcc-docs
+193 −9
Commits
e2a7544 feat: n8n user docs for WCC actor app
8c05fc4 fix: image path update
2a6822a fix: minor md fix
41701e0 fix: doc image path update
090a292 feat: improve docs for WCC n8n app
5bbe301 Merge remote-tracking branch 'origin/master' into feat/n8n-wcc-docs
7bdb83a fix: minor image path fix
028d4bc feat: ai agent use case docs
313cebd fix: remove repetition
7c7aec9 feat: improve wording
59ee9a3 feat: improvements
646cf47 Update sources/platform/integrations/workflows-and-notifications/n8n/…
sources/platform/integrations/workflows-and-notifications/n8n/ai-crawling.md (+184)
---
title: n8n - AI crawling Actor integration
description: Learn about AI Crawling scraper modules.
sidebar_label: AI Crawling
sidebar_position: 6
slug: /integrations/n8n/ai-crawling
toc_max_heading_level: 4
---

## Apify Scraper for AI Crawling

Apify Scraper for AI Crawling from [Apify](https://apify.com/apify/website-content-crawler) lets you extract text content from websites to feed AI models, LLM applications, vector databases, or Retrieval Augmented Generation (RAG) pipelines. It supports rich formatting using Markdown, cleans the HTML of irrelevant elements, downloads linked files, and integrates with AI ecosystems like LangChain, LlamaIndex, and other LLM frameworks.

To use these modules, you need an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token in the [Apify Console](https://console.apify.com/) under **Settings > Integrations**. After connecting, you can automate content extraction at scale and incorporate the results into your AI workflows.

## Prerequisites

Before you begin, make sure you have:

- An [Apify account](https://console.apify.com/)
- An [n8n instance](https://docs.n8n.io/getting-started/) (self-hosted or cloud)

## Install the Apify Node (self-hosted)

If you're running a self-hosted n8n instance, you can install the Apify community node directly from the editor. This adds the node to your available tools, enabling Apify operations in workflows.

1. Open your n8n instance.
1. Go to **Settings > Community Nodes**.
1. Select **Install**.
1. Enter the npm package name: `@apify/n8n-nodes-apify-content-crawler` (for the latest version). To install a specific [version](https://www.npmjs.com/package/@apify/n8n-nodes-apify-content-crawler?activeTab=versions), enter it explicitly, e.g. `@apify/[email protected]`.
1. Agree to the [risks](https://docs.n8n.io/integrations/community-nodes/risks/) of using community nodes and select **Install**.
1. You can now use the node in your workflows.



## Install the Apify Scraper for AI Crawling Node (n8n Cloud)

For n8n Cloud users, installation is even simpler and doesn't require manual package entry. Just search for the node and add it from the canvas.

1. Go to the **Canvas** and open the **nodes panel**.
1. Search for **Apify Scraper for AI Crawling** in the community node registry.
1. Click **Install node** to add the Apify node to your instance.

:::note Verified community nodes visibility

On n8n Cloud, instance owners can toggle visibility of verified community nodes in the Cloud Admin Panel. Ensure this setting is enabled to install the Apify Scraper for AI Crawling node.

:::

## Connect Apify Scraper for AI Crawling (self-hosted)

1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account.



1. To connect your Apify account to n8n, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console.



1. Find your token in the **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking **+ Create a new token**.
1. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your n8n workflow interface.



1. In n8n, click **Create new credential** on the chosen Apify Scraper module.
1. In the **API key** field, paste the API token you copied from Apify and click **Save**.



### OAuth2 (cloud instance only)

1. In n8n Cloud, select **Create Credential**.
1. Search for Apify OAuth2 API and select **Continue**.
1. Select **Connect my account** and authorize with your Apify account.
1. n8n automatically retrieves and stores the OAuth2 tokens.



:::note

On n8n Cloud, you can still use the API key method if you prefer manual control over your credentials.

:::

With authentication set up, you can now create workflows that incorporate the Apify node.

## Apify Scraper for AI Crawling modules

After connecting the app, you can use one of the two modules as native scrapers to extract website content.

### Standard Settings module

The Standard Settings module lets you quickly extract content from websites using optimized default settings. This module is ideal for extracting content from blogs, documentation, and knowledge bases to feed into AI models.

#### How it works

The crawler starts with one or more URLs. It then crawls these initial URLs and discovers links to other pages on the same site, which it adds to a queue. The crawler recursively follows these links as long as they are under the same path as the start URL. You can customize this behavior by defining specific URL patterns for inclusion or exclusion. To ensure efficiency, the crawler automatically skips any duplicate pages it encounters. A variety of settings are available to fine-tune the crawling process, including the crawler type, the maximum number of pages to crawl, the crawl depth, and concurrency.

Once a page is loaded, the Actor processes its HTML to extract high-quality content. It can be configured to wait for dynamic content to load and can scroll the page to trigger the loading of additional content. To access information hidden in interactive sections, the crawler can be set up to expand clickable elements. It also cleans the HTML by removing irrelevant DOM nodes, such as navigation bars, headers, and footers, and can be configured to keep only the content that matches specific CSS selectors. The crawler also handles cookie warnings automatically and transforms the page to extract the main content.
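
To make these settings concrete, here is a minimal sketch of the kind of input they correspond to in the underlying [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor. The field names are illustrative assumptions based on the Actor's input schema; the n8n module exposes them as form fields, so the exact labels and defaults may differ.

```json title="Illustrative crawl configuration (assumed field names)"
{
  "startUrls": [{ "url": "https://docs.apify.com/academy" }],
  "crawlerType": "playwright:firefox",
  "maxCrawlDepth": 2,
  "maxCrawlPages": 50,
  "maxConcurrency": 10,
  "removeElementsCssSelector": "nav, header, footer",
  "clickElementsCssSelector": "[aria-expanded=\"false\"]"
}
```

Here the depth and page limits keep the crawl bounded, while the CSS selectors implement the cleanup and click-to-expand behavior described above.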

#### Output data

For each crawled web page, you'll receive:

- _Page metadata_: URL, title, description, canonical URL
- _Cleaned text content_: The main article content with irrelevant elements removed
- _Markdown formatting_: Structured content with headers, lists, links, and other formatting preserved
- _Crawl information_: Loaded URL, referrer URL, timestamp, HTTP status
- _Optional file downloads_: PDFs, DOCs, and other linked documents

```json title="Sample output (shortened)"
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "crawl": {
    "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "loadedTime": "2025-04-22T14:33:20.514Z",
    "referrerUrl": "https://docs.apify.com/academy",
    "depth": 1,
    "httpStatusCode": 200
  },
  "metadata": {
    "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "title": "Web scraping for beginners | Apify Documentation",
    "description": "Learn the basics of web scraping with a step-by-step tutorial and practical exercises.",
    "languageCode": "en",
    "markdown": "# Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\n## What is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\n## Why learn web scraping?\n\n- **Data collection**: Gather information for research, analysis, or business intelligence\n- **Automation**: Save time by automating repetitive data collection tasks\n- **Integration**: Connect web data with your applications or databases\n- **Monitoring**: Track changes on websites automatically\n\n## Getting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n...",
    "text": "Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\nWhat is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\nWhy learn web scraping?\n\n- Data collection: Gather information for research, analysis, or business intelligence\n- Automation: Save time by automating repetitive data collection tasks\n- Integration: Connect web data with your applications or databases\n- Monitoring: Track changes on websites automatically\n\nGetting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n..."
  }
}
```
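
In subsequent workflow steps, you can reference these output fields with standard n8n expressions. As a purely hypothetical example, a text parameter of a downstream node could pull in each crawled page's Markdown like this:

```json title="Referencing crawler output downstream (hypothetical)"
{
  "text": "Summarize this page:\n\n{{ $json.metadata.markdown }}"
}
```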

### Advanced Settings module

The Advanced Settings module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. This module is ideal for complex websites, JavaScript-heavy applications, or when you need precise control over content extraction.

#### Key features

- _Multiple Crawler Options_: Choose between headless browsers (Playwright) or faster HTTP clients (Cheerio)
- _Custom Content Selection_: Specify exactly which elements to keep or remove
- _Advanced Navigation Control_: Set crawling depth, scope, and URL patterns
- _Dynamic Content Handling_: Wait for JavaScript-rendered content to load
- _Interactive Element Support_: Click expandable sections to reveal hidden content
- _Multiple Output Formats_: Save content as Markdown, HTML, or plain text
- _Proxy Configuration_: Use proxies to handle geo-restrictions or avoid IP blocks
- _Content Transformation Options_: Multiple algorithms for optimal content extraction

#### How it works

The Advanced Settings module provides granular control over the entire crawling process. For _Crawler selection_, you can choose from Playwright (Firefox/Chrome) or Cheerio, depending on the complexity of the target website. _URL management_ allows you to define the crawling scope with include and exclude URL patterns. You can also exercise precise _DOM manipulation_ by controlling which HTML elements to keep or remove. To ensure the best results, you can apply specialized algorithms for _Content transformation_ and select from various _Output formatting_ options for better AI model compatibility.

#### Configuration options

Advanced Settings offers a wide range of configuration options, as sketched below. You can select the _Crawler type_ by choosing the rendering engine (browser or HTTP client) and the _Content extraction algorithm_ from multiple HTML transformers. _Element selectors_ allow you to specify which elements to keep, remove, or click, while _URL patterns_ let you define inclusion and exclusion rules with glob syntax. You can also set _Crawling parameters_ like concurrency, depth, timeouts, and retries. For robust crawling, you can configure _Proxy configuration_ settings and select from various _Output options_ for content formats and storage.
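
For illustration, a fuller Advanced Settings configuration might look like the sketch below, again expressed as the underlying Actor input with assumed field names rather than the exact n8n form labels.

```json title="Illustrative Advanced Settings (assumed field names)"
{
  "startUrls": [{ "url": "https://example.com/docs" }],
  "crawlerType": "cheerio",
  "htmlTransformer": "readableText",
  "includeUrlGlobs": [{ "glob": "https://example.com/docs/**" }],
  "excludeUrlGlobs": [{ "glob": "https://example.com/docs/changelog/**" }],
  "keepElementsCssSelector": "main",
  "removeElementsCssSelector": "nav, footer, .cookie-banner",
  "maxCrawlDepth": 3,
  "maxCrawlPages": 100,
  "requestTimeoutSecs": 60,
  "maxRequestRetries": 3,
  "proxyConfiguration": { "useApifyProxy": true },
  "saveMarkdown": true,
  "saveHtml": false
}
```

Choosing Cheerio with glob-scoped URL patterns keeps the crawl fast and cheap when the target pages don't require JavaScript rendering; switch to a Playwright crawler type for dynamic sites.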

#### Output data

In addition to the standard output fields, this module provides:

- _Multiple format options_: Content in Markdown, HTML, or plain text
- _Debug information_: Detailed extraction diagnostics and snapshots
- _HTML transformations_: Results from different content extraction algorithms
- _File storage options_: Flexible storage for HTML, screenshots, or downloaded files

You can access any of the thousands of scrapers on Apify Store by using the [general Apify app](https://n8n.io/integrations/apify).

## Usage as an AI Agent Tool

You can set up Apify's Scraper for AI Crawling node as a tool for your AI Agents.



### Dynamic URL crawling

In the Website Content Crawler module, you can set the **Start URLs** to be filled in dynamically by your AI Agent. This allows the Agent to decide which pages to scrape.

We recommend using the Advanced Settings module with your AI Agent. Two key parameters to set are **Max crawling depth** and **Max pages**. Remember that the scraping results are passed into the AI Agent's context, so using smaller values helps stay within context limits.

![Scraper settings]

### Example usage

Here, the agent was used to find information about Apify's latest blog post. It correctly filled in the URL for the blog and summarized its content.

![AI Agent example run]
Binary files added:

- sources/platform/integrations/workflows-and-notifications/n8n/images/config.png (+45.5 KB)
- sources/platform/integrations/workflows-and-notifications/n8n/images/install.png (+87.8 KB)
- sources/platform/integrations/workflows-and-notifications/n8n/images/result.png (+445 KB)
- sources/platform/integrations/workflows-and-notifications/n8n/images/setup.png (+117 KB)
- sources/platform/integrations/workflows-and-notifications/n8n/images/token.png (+73 KB)
This section should be below `## Apify Scraper for AI Crawling`. First we need to introduce the concept, then show what is needed.