Web Scraping: Extract, Aggregate, and Research Content from Websites

This workflow automates web scraping: it extracts structured content from websites using CSS selectors and supports content aggregation and research with respectful scraping practices. Key nodes include Webhook for API triggers, Manual Trigger for testing, Condition for security and input validation, Set for data sanitization, HTTP Request for fetching web content, HTML Extract for content parsing, Aggregate for deduplication, and Webhook Response for output. No external API credentials or API keys are required, because the workflow uses standard HTTP requests.

To set up, download n8n from n8n.io for self-hosting or sign up at cloud.n8n.io, then import the workflow JSON via Workflows > Import in n8n. Configure the Webhook node (path: /web-scraping-content-extraction, method: POST) with CORS headers (Access-Control-Allow-Origin: *). For local webhook testing, use ngrok (e.g., ngrok http 5678). In the Set node, ensure that userAgent is set (e.g., 'Scraper/1.0.0 (+https://example.com/bot)') and that respectRobots is true. In the Condition node, verify the HTTPS check for secure transmission. In the HTML Extract node, check that the CSS selectors are valid (e.g., 'h2' for primary content, 'a' for titles/URLs).

To test, send a POST request to the webhook with a JSON body (e.g., {"targetUrl": "https://example.com", "selectors": "h2", "maxItems": 10}) using a tool like Postman, and include the headers x-api-key and x-request-count: 1. Validate scrapingResponse in the Set node output (e.g., success: true, totalItems > 0), and check extractedItems for valid titles and URLs.

The workflow handles errors such as invalid URLs (400 response with 'INVALID_INPUT'), blocked domains (403 response with 'URL_VALIDATION_FAILED'), and fetch failures (502 response with 'FETCH_FAILED'). When these occur, verify URL accessibility, HTTPS, and selector accuracy.

To deploy, save and enable the workflow in n8n. Test with a sample URL (e.g., https://hackernoon.com) to confirm content extraction, and check the Webhook Response for processingStats (e.g., validItems > 0).
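As an alternative to Postman, the test request described above can be sketched in Python with only the standard library. The webhook path, body fields, and header names come from the description; the base URL (host, port, and n8n's usual /webhook prefix), the API key value, and the helper names build_request and send are assumptions for illustration, not part of the workflow.

```python
import json
import urllib.request

# Path taken from the Webhook node configuration in the description.
WEBHOOK_PATH = "/web-scraping-content-extraction"


def build_request(base_url: str, api_key: str,
                  target_url: str = "https://example.com",
                  selectors: str = "h2",
                  max_items: int = 10) -> urllib.request.Request:
    """Build the POST request for the scraping webhook without sending it.

    base_url is an assumption: for a local install it is typically
    something like "http://localhost:5678/webhook".
    """
    payload = {"targetUrl": target_url, "selectors": selectors, "maxItems": max_items}
    headers = {
        "Content-Type": "application/json",
        "x-api-key": api_key,      # placeholder; supply your own value
        "x-request-count": "1",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + WEBHOOK_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body of a successful response.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so the
    400/403/502 cases surface as exceptions rather than return values.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling send(build_request("http://localhost:5678/webhook", "your-key")) against a running instance should return the JSON produced by the Webhook Response node. While the workflow is open in the editor in test mode, n8n typically serves the webhook under /webhook-test instead of /webhook, so the base URL may need adjusting.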
If extraction fails, inspect the HTTP Request node for timeouts or blocked requests, and adjust the selectors in the n8n editor.
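The troubleshooting advice can also be captured in a small client-side helper. This is a minimal sketch: the status/code pairs are the ones listed in the description, while the function name diagnose, the assumption that success is returned with status 200, and the way the error code and item count reach the caller are hypothetical.

```python
from typing import Optional

# Status/code pairs as listed in the workflow description.
KNOWN_ERRORS = {
    (400, "INVALID_INPUT"): "Check that targetUrl is a valid URL.",
    (403, "URL_VALIDATION_FAILED"): "The domain is blocked by the workflow's URL validation.",
    (502, "FETCH_FAILED"): "Inspect the HTTP Request node for timeouts or blocked requests.",
}


def diagnose(status: int, error_code: Optional[str] = None, total_items: int = 0) -> str:
    """Return a troubleshooting hint for a webhook response (hypothetical helper)."""
    hint = KNOWN_ERRORS.get((status, error_code))
    if hint is not None:
        return hint
    # Assumption: a successful scrape is returned with HTTP 200.
    if status == 200 and total_items == 0:
        return "Request succeeded but nothing matched; adjust the CSS selectors."
    if status == 200:
        return "OK"
    return f"Unexpected response (status {status}, code {error_code!r})."
```

For instance, diagnose(502, "FETCH_FAILED") points to the HTTP Request node, while a 200 response with zero items suggests the selectors do not match the page.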

$6.99

Workflow steps: 17

Integrated apps: webhook, manualTrigger, if
