# API Reference (/docs/api-reference) Explore the full AnakinScraper REST API. All endpoints use the base URL `https://api.anakin.io/v1` and require an `X-API-Key` header for authentication. ### Authentication Every request requires an API key passed via the `X-API-Key` header: ``` X-API-Key: your_api_key ``` Get your API key from the [Dashboard](/dashboard). --- # Agentic Search (/docs/api-reference/agentic-search) Advanced 4-stage AI research pipeline that automatically searches the web, scrapes relevant sources, and synthesizes a comprehensive research report. Jobs are processed asynchronously. ### How It Works 1. **Query refinement** — AI refines your research question for optimal search 2. **Web search** — discovers relevant sources and citations 3. **Citation scraping** — automatically scrapes top citation URLs for full content 4. **Analysis & synthesis** — produces a comprehensive research report ### Endpoints --- # GET Get Results (/docs/api-reference/agentic-search/get-search-result) Retrieve the status and results of an agentic search job. Agentic searches run through multiple stages and may take several minutes — poll every 10 seconds.
--- ### Path Parameters | Parameter | Type | Description | |-----------|------|-------------| | `id` **required** | string | The job ID returned from the submit endpoint | --- ### Response — In Progress ```json { "job_id": "3f8aa45d-6ea3-4107-88ce-7f39ecf48a84", "status": "pending", "message": "Job is pending", "created_at": "2024-01-01T12:00:00.000Z" } ``` ### Response — Completed ```json { "id": "3f8aa45d-6ea3-4107-88ce-7f39ecf48a84", "status": "completed", "jobType": "agentic_search", "generatedJson": { "summary": "Summary of the research findings...", "structured_data": { "developments": [ { "title": "Quantum Computing Advances", "description": "Recent developments in quantum computing...", "organization": "IBM", "date": "2024-01" } ] }, "data_schema": { "description": "Schema for structured data extraction", "fields": { "developments": { "type": "array", "description": "List of developments" } } } }, "createdAt": "2024-01-01T12:00:00.000Z", "completedAt": "2024-01-01T12:05:00.000Z", "durationMs": 300000 } ``` --- ### Response Fields | Field | Type | Description | |-------|------|-------------| | `id` | string | Unique identifier for the job | | `status` | string | `pending`, `processing`, `completed`, or `failed` | | `jobType` | string | Always `agentic_search` for this endpoint | | `generatedJson` | object | The full agentic search result (see below) | | `createdAt` | string | ISO 8601 timestamp of job creation | | `completedAt` | string | ISO 8601 timestamp of completion | | `durationMs` | number | Total processing time in milliseconds | ### generatedJson Fields | Field | Type | Description | |-------|------|-------------| | `summary` | string | Concise summary of findings | | `structured_data` | object | Dynamic structured data matching `data_schema.fields` | | `data_schema` | object | Schema describing the structured data format | ### Job Statuses | Status | Description | |--------|-------------| | `pending` | Job is queued | | `processing` | Research
pipeline is running | | `completed` | Research report is ready | | `failed` | Job encountered an error | --- ### Code Examples ```bash curl -X GET https://api.anakin.io/v1/agentic-search/3f8aa45d-6ea3-4107-88ce-7f39ecf48a84 \ -H "X-API-Key: your_api_key" ``` ```python import requests job_id = "3f8aa45d-6ea3-4107-88ce-7f39ecf48a84" result = requests.get( f'https://api.anakin.io/v1/agentic-search/{job_id}', headers={'X-API-Key': 'your_api_key'} ) data = result.json() if data['status'] == 'completed': result_data = data['generatedJson'] print(f"Summary: {result_data['summary']}") print(f"Schema: {result_data['data_schema']}") print(f"Data: {result_data['structured_data']}") ``` ```javascript const jobId = '3f8aa45d-6ea3-4107-88ce-7f39ecf48a84'; const res = await fetch(`https://api.anakin.io/v1/agentic-search/${jobId}`, { headers: { 'X-API-Key': 'your_api_key' } }); const data = await res.json(); if (data.status === 'completed') { const resultData = data.generatedJson; console.log(resultData.summary); console.log(resultData.data_schema); console.log(resultData.structured_data); } ``` For polling patterns, see the [Polling Jobs](/docs/api-reference/polling-jobs) reference. --- # POST Research Query (/docs/api-reference/agentic-search/submit-search) Start an agentic research pipeline. The job runs through 4 stages (query refinement, web search, citation scraping, analysis) and may take several minutes to complete. Poll for results using [GET /v1/agentic-search/\{id\}](/docs/api-reference/agentic-search/get-search-result). 
--- ### Request Body ```json { "prompt": "Comprehensive analysis of quantum computing trends" } ``` | Parameter | Type | Description | |-----------|------|-------------| | `prompt` **required** | string | Research query or question | --- ### Response ```json { "job_id": "3f8aa45d-6ea3-4107-88ce-7f39ecf48a84", "status": "pending", "message": "Agentic search job queued successfully", "created_at": "2024-01-01T12:00:00.000Z" } ``` ### Response Fields | Field | Type | Description | |-------|------|-------------| | `job_id` | string | Unique identifier for the agentic search job | | `status` | string | Job status (`pending`) | | `message` | string | Confirmation message | | `created_at` | string | ISO 8601 timestamp of job creation | Use the `job_id` with [GET /v1/agentic-search/\{id\}](/docs/api-reference/agentic-search/get-search-result) to poll for results. Agentic searches typically take longer than standard scrapes — poll every 10 seconds. --- ### Code Examples ```bash curl -X POST https://api.anakin.io/v1/agentic-search \ -H "X-API-Key: your_api_key" \ -H "Content-Type: application/json" \ -d '{ "prompt": "Comprehensive analysis of quantum computing trends" }' ``` ```python import requests response = requests.post( 'https://api.anakin.io/v1/agentic-search', headers={'X-API-Key': 'your_api_key'}, json={ 'prompt': 'Comprehensive analysis of quantum computing trends' } ) data = response.json() print(f"Agentic search submitted: {data['job_id']}") ``` ```javascript const response = await fetch('https://api.anakin.io/v1/agentic-search', { method: 'POST', headers: { 'X-API-Key': 'your_api_key', 'Content-Type': 'application/json' }, body: JSON.stringify({ prompt: 'Comprehensive analysis of quantum computing trends' }) }); const data = await response.json(); console.log(data.job_id); ``` --- # Browser Sessions (/docs/api-reference/browser-sessions) > **Tip:** > - Session data is protected using **AES-256-GCM encryption** with complete user isolation. 
> - The system does not collect, store, or retain passwords, authentication secrets, or credentials at any time. > - Session data is permanently and irreversibly deleted upon user-initiated session removal. ## What are Browser Sessions? Browser sessions allow you to scrape content that requires authentication. Instead of handling complex login flows programmatically, you log in once through a real browser, and we save your session for future API requests. This is useful for scraping: - Account dashboards and order history - Subscription-based content - Social media profiles - Any page that requires a login --- ## How It Works ### 1. Create a Session From your [dashboard](https://anakin.io/dashboard), click **Create Session** to launch an interactive browser. This opens a real browser in the cloud that you control remotely. ### 2. Log In Manually Navigate to the website you want to scrape and log in with your credentials. Complete any two-factor authentication or captchas as you normally would. ### 3. Save the Session Once logged in, click **Save Session**. We encrypt and store your cookies and localStorage data so you can reuse this authenticated state. ### 4. Use in API Requests Include the `sessionId` in your scrape requests. The API will use your saved session to access authenticated pages. --- ## Using Sessions with the API Add the `sessionId` parameter to your [URL Scraper](/docs/api-reference/url-scraper/submit-scrape-job) request: ```json { "url": "https://amazon.com/your-orders", "sessionId": "session_abc123xyz", "country": "us" } ``` When using a session, browser-based scraping is automatically enabled since sessions require a full browser environment. 
```bash curl -X POST https://api.anakin.io/v1/url-scraper \ -H "X-API-Key: your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://amazon.com/your-orders", "sessionId": "session_abc123xyz", "country": "us" }' ``` ```python import requests response = requests.post( 'https://api.anakin.io/v1/url-scraper', headers={'X-API-Key': 'your_api_key'}, json={ 'url': 'https://amazon.com/your-orders', 'sessionId': 'session_abc123xyz', 'country': 'us' } ) data = response.json() print(f"Job submitted: {data['jobId']}") ``` ```javascript const response = await fetch('https://api.anakin.io/v1/url-scraper', { method: 'POST', headers: { 'X-API-Key': 'your_api_key', 'Content-Type': 'application/json' }, body: JSON.stringify({ url: 'https://amazon.com/your-orders', sessionId: 'session_abc123xyz', country: 'us' }) }); const data = await response.json(); console.log(data.jobId); ``` --- ## Managing Sessions You can manage your sessions from the [dashboard](https://anakin.io/dashboard): - **View** all saved sessions and their details - **Check** when a session was last used - **Delete** sessions you no longer need --- # Error Responses (/docs/api-reference/error-responses) ### 400 Bad Request ```json { "error": "Invalid URL format" } ``` Invalid request parameters or malformed URL. --- ### 401 Unauthorized ```json { "error": "Unauthorized" } ``` Missing or invalid API key. --- ### 402 Payment Required ```json { "error": "Payment required. Please upgrade your plan." } ``` Account requires a plan upgrade to continue. --- ### 404 Not Found ```json { "error": "Job not found" } ``` The requested job ID does not exist. --- ### 429 Too Many Requests ```json { "error": "Rate limit exceeded. Please slow down your requests." } ``` You are sending requests too quickly. Back off and retry after a short delay. --- ### 500 Internal Server Error ```json { "error": "Internal server error" } ``` An unexpected error occurred on our servers. Retry after a short delay. 
If the error persists, contact support. --- ### 503 Service Unavailable ```json { "error": "Scraper service is unavailable. Please try again later." } ``` The scraper service is temporarily unavailable. Retry after 30-60 seconds. --- ### Retry Recommendations | Error Code | Retry? | Strategy | |------------|--------|----------| | `400` | No | Fix the request parameters | | `401` | No | Check your API key | | `402` | No | Upgrade your plan | | `404` | No | Verify the job ID exists | | `429` | Yes | Wait 5-10 seconds, then retry | | `500` | Yes | Wait 5 seconds, then retry (max 3 attempts) | | `503` | Yes | Wait 30-60 seconds, then retry | --- # Polling Jobs (/docs/api-reference/polling-jobs) Most AnakinScraper endpoints process jobs asynchronously. After submitting a request, you receive a `jobId` and poll a GET endpoint until the job completes. --- ### How it works 1. **Submit** a POST request — you receive a `jobId` with status `pending` 2. **Poll** the corresponding GET endpoint with the `jobId` 3. **Check** the `status` field — repeat until `completed` or `failed` | Product | Submit | Poll | |---------|--------|------| | URL Scraper | `POST /v1/url-scraper` | `GET /v1/url-scraper/{id}` | | URL Scraper (batch) | `POST /v1/url-scraper/batch` | `GET /v1/url-scraper/{id}` | | Web Scraper | `POST /v1/web-scraper` | `GET /v1/web-scraper/{id}` | | Agentic Search | `POST /v1/agentic-search` | `GET /v1/agentic-search/{id}` | > **Search API** (`POST /v1/search`) is synchronous — results are returned immediately, no polling needed. 
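The retry recommendations in the error reference above can be wrapped in a small helper. A minimal sketch in Python — the helper names, the `send` callable convention, and the exact delay values are illustrative, not part of the API:

```python
import time

# Retry policy from the error reference above: retryable status codes
# mapped to a base wait in seconds (429: 5-10s, 500: 5s, 503: 30-60s).
RETRY_DELAYS = {429: 5, 500: 5, 503: 30}

def should_retry(status_code, attempt, max_attempts=3):
    """Retry only 429/500/503, and only while attempts remain."""
    return status_code in RETRY_DELAYS and attempt < max_attempts

def request_with_retries(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a non-retryable response.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute — e.g. a requests call wrapped in a lambda.
    """
    attempt = 1
    while True:
        response = send()
        if not should_retry(response.status_code, attempt, max_attempts):
            return response
        sleep(RETRY_DELAYS[response.status_code])
        attempt += 1
```

In practice you would call it as `request_with_retries(lambda: requests.post(url, headers=headers, json=payload))`; 400/401/402/404 responses are returned immediately, since retrying them cannot succeed.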
--- ### Recommended polling interval | Product | Interval | Typical completion | |---------|----------|--------------------| | URL Scraper | 2–5 seconds | 3–15 seconds | | Web Scraper | 2–5 seconds | 3–10 seconds | | Agentic Search | 10 seconds | 1–5 minutes | --- ### Polling examples ```python import requests import time def poll_job(endpoint, job_id, api_key, interval=5, timeout=300): """Poll a job until completed or failed.""" elapsed = 0 while elapsed < timeout: result = requests.get( f'https://api.anakin.io/v1/{endpoint}/{job_id}', headers={'X-API-Key': api_key} ) data = result.json() if data['status'] == 'completed': return data if data['status'] == 'failed': raise Exception(data.get('error', 'Job failed')) time.sleep(interval) elapsed += interval raise TimeoutError('Job polling timed out') # URL Scraper result = poll_job('url-scraper', 'job_abc123xyz', 'your_api_key', interval=3) print(result['markdown']) # Agentic Search (longer interval) result = poll_job('agentic-search', 'agentic_abc123xyz', 'your_api_key', interval=10, timeout=600) print(result['generatedJson']['summary']) ``` ```javascript async function pollJob(endpoint, jobId, apiKey, interval = 5000, timeout = 300000) { const start = Date.now(); while (Date.now() - start < timeout) { const res = await fetch(`https://api.anakin.io/v1/${endpoint}/${jobId}`, { headers: { 'X-API-Key': apiKey } }); const data = await res.json(); if (data.status === 'completed') return data; if (data.status === 'failed') throw new Error(data.error || 'Job failed'); await new Promise(r => setTimeout(r, interval)); } throw new Error('Job polling timed out'); } // URL Scraper const result = await pollJob('url-scraper', 'job_abc123xyz', 'your_api_key', 3000); console.log(result.markdown); // Agentic Search (longer interval) const report = await pollJob('agentic-search', 'agentic_abc123xyz', 'your_api_key', 10000, 600000); console.log(report.generatedJson.summary); ``` --- ### Status values | Status | Description | |--------|-------------| | `pending` |
Job is queued, not yet started | | `queued` | Job is waiting for a worker (agentic search only) | | `processing` | Job is actively being processed | | `completed` | Results are ready — stop polling | | `failed` | Job encountered an error — stop polling | --- # Search API (/docs/api-reference/search) AI-powered web search that returns relevance-ranked results with titles, snippets, and source URLs. Results are returned immediately (synchronous). ### Features - **Synchronous** — results returned instantly, no polling needed - **Relevance-ranked results** — source URLs, titles, and snippets - **Date metadata** — publication and last-updated dates where available ### Endpoints --- # POST Search (/docs/api-reference/search/search) Perform an AI-powered web search. Returns relevance-ranked search results with titles, snippets, and source URLs. Results are returned synchronously. --- ### Request Body ```json { "prompt": "latest AI developments 2024", "limit": 5 } ``` | Parameter | Type | Description | |-----------|------|-------------| | `prompt` **required** | string | Search query or question | | `limit` | number | Maximum number of results to return. Default `5`.
| --- ### Response ```json { "id": "63385e99-3ef5-4667-84a7-e7b398ec8e06", "results": [ { "url": "https://example.com/article", "title": "AI Developments 2024", "snippet": "Recent advancements in AI...", "date": "2024-01-15", "last_updated": "2024-01-20" } ] } ``` ### Response Fields | Field | Type | Description | |-------|------|-------------| | `id` | string | Unique identifier for the search request | | `results` | array | Array of search result objects | | `results[].url` | string | Source URL | | `results[].title` | string | Page title | | `results[].snippet` | string | Relevant text excerpt | | `results[].date` | string | Publication date (when available) | | `results[].last_updated` | string | Last updated date (when available) | --- ### Code Examples ```bash curl -X POST https://api.anakin.io/v1/search \ -H "X-API-Key: your_api_key" \ -H "Content-Type: application/json" \ -d '{ "prompt": "latest AI developments 2024", "limit": 5 }' ``` ```python import requests response = requests.post( 'https://api.anakin.io/v1/search', headers={'X-API-Key': 'your_api_key'}, json={ 'prompt': 'latest AI developments 2024', 'limit': 5 } ) data = response.json() print(f"Search ID: {data['id']}") for result in data['results']: print(f"\nTitle: {result['title']}") print(f"URL: {result['url']}") print(f"Snippet: {result['snippet']}") ``` ```javascript const response = await fetch('https://api.anakin.io/v1/search', { method: 'POST', headers: { 'X-API-Key': 'your_api_key', 'Content-Type': 'application/json' }, body: JSON.stringify({ prompt: 'latest AI developments 2024', limit: 5 }) }); const data = await response.json(); console.log(`Search ID: ${data.id}`); data.results.forEach(r => console.log(r.title, r.url)); ``` --- # Supported Countries & Territories (/docs/api-reference/supported-countries) Use the `country` parameter in your API requests to route through a specific location. Codes follow ISO 3166-1 alpha-2 (lowercase). 
| Country | Code | |---------|------| | Afghanistan | `af` | | Aland | `ax` | | Albania | `al` | | Algeria | `dz` | | Andorra | `ad` | | Angola | `ao` | | Antarctica | `aq` | | Antigua and Barbuda | `ag` | | Argentina | `ar` | | Armenia | `am` | | Aruba | `aw` | | Australia | `au` | | Austria | `at` | | Azerbaijan | `az` | | Bahamas | `bs` | | Bahrain | `bh` | | Bangladesh | `bd` | | Barbados | `bb` | | Belarus | `by` | | Belgium | `be` | | Belize | `bz` | | Benin | `bj` | | Bermuda | `bm` | | Bhutan | `bt` | | Bolivia | `bo` | | Bonaire | `bq` | | Bosnia and Herzegovina | `ba` | | Botswana | `bw` | | Brazil | `br` | | British Indian Ocean Territory | `io` | | British Virgin Islands | `vg` | | Brunei | `bn` | | Bulgaria | `bg` | | Burkina Faso | `bf` | | Cambodia | `kh` | | Cameroon | `cm` | | Canada | `ca` | | Cape Verde | `cv` | | Cayman Islands | `ky` | | Chile | `cl` | | China | `cn` | | Colombia | `co` | | Cook Islands | `ck` | | Costa Rica | `cr` | | Croatia | `hr` | | Cuba | `cu` | | Curacao | `cw` | | Cyprus | `cy` | | Czech Republic | `cz` | | Democratic Republic of the Congo | `cd` | | Denmark | `dk` | | Djibouti | `dj` | | Dominican Republic | `do` | | East Timor | `tl` | | Ecuador | `ec` | | Egypt | `eg` | | El Salvador | `sv` | | Estonia | `ee` | | Eswatini | `sz` | | Ethiopia | `et` | | Faroe Islands | `fo` | | Fiji | `fj` | | Finland | `fi` | | France | `fr` | | French Guiana | `gf` | | French Polynesia | `pf` | | Gabon | `ga` | | Georgia | `ge` | | Germany | `de` | | Ghana | `gh` | | Gibraltar | `gi` | | Greece | `gr` | | Greenland | `gl` | | Grenada | `gd` | | Guadeloupe | `gp` | | Guam | `gu` | | Guatemala | `gt` | | Guernsey | `gg` | | Guinea | `gn` | | Guyana | `gy` | | Haiti | `ht` | | Honduras | `hn` | | Hong Kong | `hk` | | Hungary | `hu` | | Iceland | `is` | | India | `in` | | Indonesia | `id` | | Iran | `ir` | | Iraq | `iq` | | Ireland | `ie` | | Isle of Man | `im` | | Israel | `il` | | Italy | `it` | | Ivory Coast | `ci` | | Jamaica | `jm` 
| | Japan | `jp` | | Jersey | `je` | | Jordan | `jo` | | Kazakhstan | `kz` | | Kenya | `ke` | | Kosovo | `xk` | | Kuwait | `kw` | | Kyrgyzstan | `kg` | | Laos | `la` | | Latvia | `lv` | | Lebanon | `lb` | | Lesotho | `ls` | | Liberia | `lr` | | Libya | `ly` | | Liechtenstein | `li` | | Lithuania | `lt` | | Luxembourg | `lu` | | Macao | `mo` | | Madagascar | `mg` | | Malawi | `mw` | | Malaysia | `my` | | Maldives | `mv` | | Mali | `ml` | | Malta | `mt` | | Martinique | `mq` | | Mauritania | `mr` | | Mauritius | `mu` | | Mexico | `mx` | | Micronesia | `fm` | | Moldova | `md` | | Monaco | `mc` | | Mongolia | `mn` | | Montenegro | `me` | | Montserrat | `ms` | | Morocco | `ma` | | Mozambique | `mz` | | Myanmar (Burma) | `mm` | | Namibia | `na` | | Nepal | `np` | | Netherlands | `nl` | | New Caledonia | `nc` | | New Zealand | `nz` | | Nicaragua | `ni` | | Niger | `ne` | | Nigeria | `ng` | | North Macedonia | `mk` | | Northern Mariana Islands | `mp` | | Norway | `no` | | Oman | `om` | | Pakistan | `pk` | | Palestine | `ps` | | Panama | `pa` | | Papua New Guinea | `pg` | | Paraguay | `py` | | Peru | `pe` | | Philippines | `ph` | | Poland | `pl` | | Portugal | `pt` | | Puerto Rico | `pr` | | Qatar | `qa` | | Republic of the Congo | `cg` | | Reunion | `re` | | Romania | `ro` | | Russia | `ru` | | Rwanda | `rw` | | Saint Kitts and Nevis | `kn` | | Saint Lucia | `lc` | | Saint Martin | `mf` | | Saint Vincent and the Grenadines | `vc` | | Sao Tome and Principe | `st` | | Saudi Arabia | `sa` | | Senegal | `sn` | | Serbia | `rs` | | Sierra Leone | `sl` | | Singapore | `sg` | | Sint Maarten | `sx` | | Slovakia | `sk` | | Slovenia | `si` | | Solomon Islands | `sb` | | Somalia | `so` | | South Africa | `za` | | South Korea | `kr` | | Spain | `es` | | Sri Lanka | `lk` | | Suriname | `sr` | | Sweden | `se` | | Switzerland | `ch` | | Syria | `sy` | | Taiwan | `tw` | | Tajikistan | `tj` | | Tanzania | `tz` | | Thailand | `th` | | Togo | `tg` | | Tonga | `to` | | Trinidad and Tobago | 
`tt` | | Tunisia | `tn` | | Turkey | `tr` | | Turks and Caicos Islands | `tc` | | U.S. Virgin Islands | `vi` | | Uganda | `ug` | | Ukraine | `ua` | | United Arab Emirates | `ae` | | United Kingdom | `gb` | | United States | `us` | | Uruguay | `uy` | | Uzbekistan | `uz` | | Vanuatu | `vu` | | Venezuela | `ve` | | Vietnam | `vn` | | Yemen | `ye` | | Zambia | `zm` | | Zimbabwe | `zw` | --- ### Usage Example ```json { "url": "https://example.com", "country": "jp" } ``` This routes the request through a residential proxy in Japan. --- ### Programmatic Access You can also fetch this list programmatically: ```bash curl https://api.anakin.io/v1/countries ``` Returns a JSON array of all supported countries with their codes. --- # URL Scraper (/docs/api-reference/url-scraper) The URL Scraper is the core scraping API. Submit any URL and receive the scraped HTML, markdown, and optionally AI-extracted JSON data. Supports single URL and batch (up to 10 URLs) modes. ### Features - **Single & batch** scraping in one API - **30x faster** with intelligent caching - **Zero blocks** with anti-detection and proxy routing across [207 countries and territories](/docs/api-reference/supported-countries) - **AI JSON extraction** — structured data from any page - **Browser mode** — headless Chrome for JS-heavy sites and SPAs ### Endpoints --- # POST Scrape URLs (/docs/api-reference/url-scraper/batch-url-scraping) Submit up to 10 URLs for scraping in a single request. All URLs are processed in parallel. Use the returned `jobId` to [poll for results](/docs/api-reference/url-scraper/get-job-status). --- ### Request Body ```json { "urls": [ "https://example.com/page1", "https://example.com/page2", "https://example.com/page3" ], "country": "us", "useBrowser": false, "generateJson": false } ``` | Parameter | Type | Description | |-----------|------|-------------| | `urls` **required** | string[] | Array of URLs to scrape (1–10). | | `country` | string | Country code for proxy routing. 
Default `"us"`. See [Supported Countries](/docs/api-reference/supported-countries) (207 locations). | | `useBrowser` | boolean | Use headless Chrome with Playwright. Default `false`. | | `generateJson` | boolean | AI-extract structured JSON from the content. Default `false`. | --- ### Response ```json { "jobId": "batch_abc123", "status": "pending" } ``` You receive a parent job ID that tracks overall batch progress. Use [GET /v1/url-scraper/\{id\}](/docs/api-reference/url-scraper/get-job-status) to poll for results — the response will include a `results` array with individual URL outcomes. --- ### Code Examples ```bash curl -X POST https://api.anakin.io/v1/url-scraper/batch \ -H "X-API-Key: your_api_key" \ -H "Content-Type: application/json" \ -d '{ "urls": [ "https://example.com/page1", "https://example.com/page2", "https://example.com/page3" ], "country": "us", "useBrowser": false, "generateJson": true }' ``` ```python import requests response = requests.post( 'https://api.anakin.io/v1/url-scraper/batch', headers={'X-API-Key': 'your_api_key'}, json={ 'urls': [ 'https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3' ], 'country': 'us', 'useBrowser': False, 'generateJson': True } ) data = response.json() print(f"Batch job submitted: {data['jobId']}") ``` ```javascript const response = await fetch('https://api.anakin.io/v1/url-scraper/batch', { method: 'POST', headers: { 'X-API-Key': 'your_api_key', 'Content-Type': 'application/json' }, body: JSON.stringify({ urls: [ 'https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3' ], country: 'us', useBrowser: false, generateJson: true }) }); const data = await response.json(); console.log(data.jobId); ``` --- # GET Get Results (/docs/api-reference/url-scraper/get-job-status) Retrieve the status and results of a scrape job. 
Use this to poll for completion after submitting a [single URL](/docs/api-reference/url-scraper/submit-scrape-job) or [batch](/docs/api-reference/url-scraper/batch-url-scraping) scrape request. --- ### Path Parameters | Parameter | Type | Description | |-----------|------|-------------| | `id` **required** | string | The job ID returned from the submit endpoint | --- ### Response — Single URL Job ```json { "id": "job_abc123xyz", "status": "completed", "url": "https://example.com", "jobType": "url_scraper", "country": "us", "html": "...", "cleanedHtml": "...", "markdown": "# Page content...", "generatedJson": { "data": {} }, "cached": false, "error": null, "createdAt": "2024-01-01T12:00:00Z", "completedAt": "2024-01-01T12:00:05Z", "durationMs": 5000 } ```
### Response — Batch Job ```json { "id": "batch_abc123", "status": "completed", "jobType": "batch_url_scraper", "country": "us", "urls": ["https://example.com/page1", "https://example.com/page2"], "results": [ { "index": 0, "url": "https://example.com/page1", "status": "completed", "html": "...", "cleanedHtml": "...", "markdown": "# Content...", "generatedJson": { "data": {} }, "cached": false, "durationMs": 3000 }, { "index": 1, "url": "https://example.com/page2", "status": "failed", "error": "Connection timeout", "durationMs": 5000 } ], "createdAt": "2024-01-01T12:00:00Z", "completedAt": "2024-01-01T12:00:10Z", "durationMs": 10000 } ```
--- ### Response Fields | Field | Type | Description | |-------|------|-------------| | `status` | string | `pending`, `processing`, `completed`, or `failed` | | `html` | string | Raw HTML content. Only present when completed. | | `cleanedHtml` | string | Cleaned HTML with non-essential elements removed. | | `markdown` | string | Markdown version of the content. | | `generatedJson` | object | AI-extracted structured JSON. Only when `generateJson: true` was set. | | `cached` | boolean | `true` if served from cache. | | `error` | string | Error message. Only present when failed. | | `durationMs` | number | Processing time in milliseconds. | | `results` | array | Batch jobs only — array of per-URL results. | ### Job Statuses | Status | Description | |--------|-------------| | `pending` | Job is queued | | `processing` | Job is being executed | | `completed` | Results are ready | | `failed` | Job encountered an error | --- ### Code Examples ```bash curl -X GET https://api.anakin.io/v1/url-scraper/job_abc123xyz \ -H "X-API-Key: your_api_key" ``` ```python import requests job_id = "job_abc123xyz" result = requests.get( f'https://api.anakin.io/v1/url-scraper/{job_id}', headers={'X-API-Key': 'your_api_key'} ) data = result.json() if data['status'] == 'completed': print(data['markdown']) ``` ```javascript const jobId = 'job_abc123xyz'; const res = await fetch(`https://api.anakin.io/v1/url-scraper/${jobId}`, { headers: { 'X-API-Key': 'your_api_key' } }); const data = await res.json(); if (data.status === 'completed') { console.log(data.markdown); } ``` For polling patterns, see the [Polling Jobs](/docs/api-reference/polling-jobs) reference.
--- # POST Scrape URL (/docs/api-reference/url-scraper/submit-scrape-job) Submit a single URL for scraping. The job is processed asynchronously — use the returned `jobId` to [poll for results](/docs/api-reference/url-scraper/get-job-status). --- ### Request Body ```json { "url": "https://example.com", "country": "us", "useBrowser": false, "generateJson": false } ``` | Parameter | Type | Description | |-----------|------|-------------| | `url` **required** | string | The URL to scrape. Must be valid HTTP/HTTPS. | | `country` | string | Country code for proxy routing. Default `"us"`. See [Supported Countries](/docs/api-reference/supported-countries) (207 locations). | | `useBrowser` | boolean | Use headless Chrome with Playwright. Default `false`. Best for JS-heavy sites. | | `generateJson` | boolean | AI-extract structured JSON from the content. Default `false`. | | `sessionId` | string | Browser session ID for scraping authenticated pages. See [Browser Sessions](/docs/api-reference/browser-sessions). | --- ### Response ```json { "jobId": "job_abc123xyz", "status": "pending" } ``` The job is processed asynchronously. Use the `jobId` with [GET /v1/url-scraper/\{id\}](/docs/api-reference/url-scraper/get-job-status) to check status and retrieve results. 
--- ### Code Examples ```bash curl -X POST https://api.anakin.io/v1/url-scraper \ -H "X-API-Key: your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com", "country": "us", "useBrowser": false, "generateJson": false }' ``` ```python import requests response = requests.post( 'https://api.anakin.io/v1/url-scraper', headers={'X-API-Key': 'your_api_key'}, json={ 'url': 'https://example.com', 'country': 'us', 'useBrowser': False, 'generateJson': True } ) data = response.json() print(f"Job submitted: {data['jobId']}") ``` ```javascript const response = await fetch('https://api.anakin.io/v1/url-scraper', { method: 'POST', headers: { 'X-API-Key': 'your_api_key', 'Content-Type': 'application/json' }, body: JSON.stringify({ url: 'https://example.com', country: 'us', useBrowser: false, generateJson: true }) }); const data = await response.json(); console.log(data.jobId); ``` --- # Web Scraper (/docs/api-reference/web-scraper) The Web Scraper runs custom-built, high-throughput scraper configurations to extract structured data from websites. Unlike the [URL Scraper](/docs/api-reference/url-scraper) which returns raw HTML/markdown, the Web Scraper returns data in a predefined schema specific to each scraper. ### Features - **Structured output** — returns data in a schema defined by the scraper configuration - **Custom scrapers** — purpose-built scrapers for specific sites and data types - **High throughput** — optimized for speed and reliability at scale - **Async processing** — submit jobs and poll for results ### Endpoints --- # GET Get Results (/docs/api-reference/web-scraper/get-scrape-result) Retrieve the status and results of a web scraper job. 
--- ### Path Parameters | Parameter | Type | Description | |-----------|------|-------------| | `id` **required** | string | The job ID returned from the run scrape endpoint | --- ### Response ```json { "id": "job_xyz789", "status": "completed", "jobType": "web_scraper", "url": "https://example.com/product-page", "generatedJson": { "name": "Product Name", "price": "$29.99", "description": "Product description..." }, "cached": false, "error": null, "createdAt": "2024-01-01T12:00:00Z", "completedAt": "2024-01-01T12:00:05Z", "durationMs": 5000 } ``` --- ### Response Fields | Field | Type | Description | |-------|------|-------------| | `status` | string | `pending`, `processing`, `completed`, or `failed` | | `url` | string | The URL that was scraped | | `generatedJson` | object | Structured data extracted by the scraper | | `cached` | boolean | `true` if served from cache | | `error` | string | Error message. Only present when failed. | | `durationMs` | number | Processing time in milliseconds | --- ### Code Examples ```bash curl -X GET https://api.anakin.io/v1/web-scraper/job_xyz789 \ -H "X-API-Key: your_api_key" ``` ```python import requests job_id = "job_xyz789" result = requests.get( f'https://api.anakin.io/v1/web-scraper/{job_id}', headers={'X-API-Key': 'your_api_key'} ) data = result.json() if data['status'] == 'completed': print(data['generatedJson']) ``` ```javascript const jobId = 'job_xyz789'; const res = await fetch(`https://api.anakin.io/v1/web-scraper/${jobId}`, { headers: { 'X-API-Key': 'your_api_key' } }); const data = await res.json(); if (data.status === 'completed') { console.log(data.generatedJson); } ``` For polling patterns, see the [Polling Jobs](/docs/api-reference/polling-jobs) reference. --- # POST Run Scraper (/docs/api-reference/web-scraper/submit-scrape-job) Submit a URL for scraping using a custom scraper configuration. 
The job is processed asynchronously — use the returned `jobId` to [poll for results](/docs/api-reference/web-scraper/get-scrape-result).

> **Tip:** You can copy ready-to-use API payloads for any scraper from the **Web Scrapers** section in your [dashboard](https://anakin.io/dashboard).

---

### Request Body

```json
{
  "url": "https://example.com",
  "scraper_code": "your_scraper_code",
  "scraper_scope": "GLOBAL",
  "scraper_params": {
    "param1": "value1",
    "param2": "value2"
  }
}
```

| Parameter | Type | Description |
|-----------|------|-------------|
| `url` **required** | string | The URL to scrape. Must be valid HTTP/HTTPS. |
| `scraper_code` **required** | string | Identifier of the scraper configuration to use. |
| `scraper_scope` **required** | string | Scope of the scraper. Use `"GLOBAL"` for global scrapers. |
| `scraper_params` **required** | object | Key-value parameters for scraper execution. Varies by scraper. |
| `action_type` | string | Optional. Type of action. Defaults to `"scrape_data"`. |

---

### Response

```json
{
  "jobId": "job_xyz789",
  "status": "pending"
}
```

Use the `jobId` with [GET /v1/web-scraper/\{id\}](/docs/api-reference/web-scraper/get-scrape-result) to check status and retrieve results.
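End to end, the submit-then-poll flow can be wrapped in a small helper. This is a sketch, not an official client: the `scraper_code` value and parameters are placeholders you would replace with a payload copied from your dashboard.

```python
import time

import requests

API_KEY = "your_api_key"
BASE = "https://api.anakin.io/v1"


def run_scraper(url, scraper_code, scraper_params, poll_interval=3, timeout=120):
    """Submit a web scraper job, then poll until it completes or fails."""
    submit = requests.post(
        f"{BASE}/web-scraper",
        headers={"X-API-Key": API_KEY},
        json={
            "url": url,
            "scraper_code": scraper_code,
            "scraper_scope": "GLOBAL",
            "scraper_params": scraper_params,
        },
    )
    submit.raise_for_status()
    job_id = submit.json()["jobId"]

    deadline = time.time() + timeout
    while time.time() < deadline:
        job = requests.get(
            f"{BASE}/web-scraper/{job_id}",
            headers={"X-API-Key": API_KEY},
        ).json()
        if job["status"] == "completed":
            return job["generatedJson"]
        if job["status"] == "failed":
            raise RuntimeError(job.get("error") or "Job failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```

Polling endpoints are not rate-limited, so a short `poll_interval` is safe; the timeout guards against jobs that never leave `processing`.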
--- ### Example: Instagram Hashtag Search ```json { "url": "https://instagram.com", "scraper_code": "instagram_hashtag_search", "scraper_scope": "GLOBAL", "scraper_params": { "hashtags": ["webscraping", "automation"], "results_limit": 20 } } ``` --- ### Code Examples ```bash curl -X POST https://api.anakin.io/v1/web-scraper \ -H "Content-Type: application/json" \ -H "X-API-Key: your_api_key" \ -d '{ "url": "https://example.com", "scraper_code": "your_scraper_code", "scraper_scope": "GLOBAL", "scraper_params": {} }' ``` ```python import requests response = requests.post( 'https://api.anakin.io/v1/web-scraper', headers={'X-API-Key': 'your_api_key'}, json={ 'url': 'https://example.com', 'scraper_code': 'your_scraper_code', 'scraper_scope': 'GLOBAL', 'scraper_params': {} } ) data = response.json() print(f"Job submitted: {data['jobId']}") ``` ```javascript const response = await fetch('https://api.anakin.io/v1/web-scraper', { method: 'POST', headers: { 'X-API-Key': 'your_api_key', 'Content-Type': 'application/json' }, body: JSON.stringify({ url: 'https://example.com', scraper_code: 'your_scraper_code', scraper_scope: 'GLOBAL', scraper_params: {} }) }); const data = await response.json(); console.log(data.jobId); ``` --- # Overview (/docs/documentation) Scrape any website, extract structured data with AI, and search the web — all through a simple REST API. ### Products ### Key features - **Zero blocks** — anti-detection and proxy routing across [207 countries and territories](/docs/api-reference/supported-countries) - **Async job pattern** — submit a job, poll for results when ready - **AI extraction** — structured JSON from any page with `generateJson: true` - **Headless browser** — JS-heavy sites and SPAs with `useBrowser: true` - **Intelligent caching** — 30x faster on repeat requests ### Use cases See all [use cases](/docs/documentation/use-cases) for more examples. Get started with the [Quick Start](/docs/documentation/getting-started) guide. 
--- # Quick Start (/docs/documentation/getting-started) Get your API key and scrape your first page — choose the path that fits your workflow. --- ## 1. Get your API key Sign up at the [Dashboard](/dashboard) — it's free, no credit card required. You start with **500 credits**. Copy your API key from the dashboard. It starts with `ak-`. --- ## 2. Choose your path --- ## Products | I want to... | Product | |---|---| | Extract content from a URL | [URL Scraper](/docs/api-reference/url-scraper) | | Scrape multiple URLs at once | [URL Scraper (batch)](/docs/api-reference/url-scraper/batch-url-scraping) | | Run custom scrapers at scale | [Web Scraper](/docs/api-reference/web-scraper) | | Search the web with AI | [Search API](/docs/api-reference/search) | | Deep multi-source research | [Agentic Search](/docs/api-reference/agentic-search) | | Scrape login-protected pages | [Browser Sessions](/docs/api-reference/browser-sessions) | See [Pricing & Credits](/docs/documentation/pricing) for costs per operation. --- ## Quick reference | | | |---|---| | **Base URL** | `https://api.anakin.io/v1` | | **Auth header** | `X-API-Key: ak-your-key-here` | | **Free credits** | 500 on signup | | **Rate limits** | 60/min scrape, 30/min search, 10/min agentic — [details](/docs/documentation/rate-limits) | | **Failed jobs** | Not charged — credits deducted only on success | --- # Quick Start: API (/docs/documentation/getting-started/api) Use the AnakinScraper REST API to scrape pages, extract data, and search the web from any language. --- ## Authentication Every request requires your API key in the `X-API-Key` header: ``` X-API-Key: ak-your-key-here ``` The `Authorization: Bearer ak-your-key-here` header is also accepted. Get your key from the [Dashboard](/dashboard). 
**Base URL:** ``` https://api.anakin.io/v1 ``` --- ## Submit a scrape request ```bash curl -X POST https://api.anakin.io/v1/url-scraper \ -H "X-API-Key: ak-your-key-here" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com"}' ``` ```python import requests response = requests.post( "https://api.anakin.io/v1/url-scraper", headers={"X-API-Key": "ak-your-key-here"}, json={"url": "https://example.com"} ) data = response.json() print(data["jobId"]) # e.g. "job_abc123xyz" ``` ```javascript const response = await fetch("https://api.anakin.io/v1/url-scraper", { method: "POST", headers: { "X-API-Key": "ak-your-key-here", "Content-Type": "application/json" }, body: JSON.stringify({ url: "https://example.com" }) }); const data = await response.json(); console.log(data.jobId); // e.g. "job_abc123xyz" ``` Response: ```json { "jobId": "job_abc123xyz", "status": "pending" } ``` The job is processed asynchronously. Use the `jobId` to poll for results. --- ## Poll for results Jobs typically complete in 3–15 seconds. Poll every 3 seconds until the status is `completed` or `failed`. 
```bash
# Repeat every 3 seconds until status is "completed"
curl https://api.anakin.io/v1/url-scraper/job_abc123xyz \
  -H "X-API-Key: ak-your-key-here"
```

```python
import time

job_id = data["jobId"]

while True:
    result = requests.get(
        f"https://api.anakin.io/v1/url-scraper/{job_id}",
        headers={"X-API-Key": "ak-your-key-here"}
    )
    job = result.json()
    if job["status"] == "completed":
        print(job["markdown"])
        break
    elif job["status"] == "failed":
        print(f"Error: {job.get('error')}")
        break
    time.sleep(3)
```

```javascript
const jobId = data.jobId;

while (true) {
  const res = await fetch(
    `https://api.anakin.io/v1/url-scraper/${jobId}`,
    { headers: { "X-API-Key": "ak-your-key-here" } }
  );
  const job = await res.json();
  if (job.status === "completed") {
    console.log(job.markdown);
    break;
  }
  if (job.status === "failed") {
    console.error(job.error);
    break;
  }
  await new Promise(r => setTimeout(r, 3000));
}
```

Completed response:

```json
{
  "id": "job_abc123xyz",
  "status": "completed",
  "url": "https://example.com",
  "html": "...",
  "cleanedHtml": "...",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "cached": false,
  "durationMs": 4200
}
```

You get back `html`, `cleanedHtml`, and `markdown` — use whichever format fits your pipeline.

See [Polling Jobs](/docs/api-reference/polling-jobs) for advanced patterns and intervals.

---

## Go further

### Extract structured JSON with AI

Add `generateJson: true` to have AI extract structured data from any page:

```bash
curl -X POST https://api.anakin.io/v1/url-scraper \
  -H "X-API-Key: ak-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com", "generateJson": true}'
```

```python
response = requests.post(
    "https://api.anakin.io/v1/url-scraper",
    headers={"X-API-Key": "ak-your-key-here"},
    json={
        "url": "https://news.ycombinator.com",
        "generateJson": True
    }
)
```

```javascript
const response = await fetch("https://api.anakin.io/v1/url-scraper", {
  method: "POST",
  headers: {
    "X-API-Key": "ak-your-key-here",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://news.ycombinator.com",
    generateJson: true
  })
});
```

The completed response includes a `generatedJson` field:

```json
{
  "generatedJson": {
    "articles": [
      {
        "title": "Show HN: I built a web scraping API",
        "url": "https://example.com/article",
        "points": 142,
        "author": "user123",
        "comments": 58
      }
    ]
  }
}
```

### Scrape JavaScript-heavy sites

Add `useBrowser: true` for SPAs and dynamically-loaded pages:

```bash
curl -X POST https://api.anakin.io/v1/url-scraper \
  -H "X-API-Key: ak-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/spa", "useBrowser": true}'
```

> Only use browser mode when needed — standard scraping is faster.
### Search the web with AI The Search API is **synchronous** — results come back immediately, no polling: ```bash curl -X POST https://api.anakin.io/v1/search \ -H "X-API-Key: ak-your-key-here" \ -H "Content-Type: application/json" \ -d '{"prompt": "best web scraping libraries 2025"}' ``` ```python response = requests.post( "https://api.anakin.io/v1/search", headers={"X-API-Key": "ak-your-key-here"}, json={"prompt": "best web scraping libraries 2025"} ) print(response.json()) ``` ```javascript const response = await fetch("https://api.anakin.io/v1/search", { method: "POST", headers: { "X-API-Key": "ak-your-key-here", "Content-Type": "application/json" }, body: JSON.stringify({ prompt: "best web scraping libraries 2025" }) }); console.log(await response.json()); ``` --- ## Next steps --- # Quick Start: CLI (/docs/documentation/getting-started/cli) The CLI is the fastest way to use AnakinScraper. It handles job submission, polling, and output formatting for you. --- ## Install the CLI Requires Python 3.10+. ```bash pip install anakin-cli ``` ```bash pipx install anakin-cli ``` --- ## Authenticate Save your API key (you only need to do this once): ```bash anakin login --api-key "ak-your-key-here" ``` Verify it worked: ```bash anakin status ``` --- ## Scrape your first page ```bash anakin scrape "https://example.com" ``` The CLI submits the job, polls until it's done, and prints the markdown output to your terminal. That's it. **Save to a file:** ```bash anakin scrape "https://example.com" -o page.md ``` --- ## Go further ### Extract structured JSON with AI ```bash anakin scrape "https://news.ycombinator.com" --format json ``` AI automatically extracts structured data from the page — no schema needed. ### Scrape JavaScript-heavy sites ```bash anakin scrape "https://example.com/spa" --browser ``` Enables full browser rendering for SPAs and dynamic content. Only use when needed — standard scraping is faster. 
### Batch scrape multiple URLs ```bash anakin scrape-batch "https://a.com" "https://b.com" "https://c.com" ``` Scrape up to 10 URLs in parallel with a single command. ### Search the web with AI ```bash anakin search "best web scraping libraries 2025" ``` Returns AI-powered search results instantly. ### Deep research ```bash anakin research "comparison of web frameworks 2025" -o report.json ``` Runs a multi-stage AI research pipeline across 20+ sources. Takes 1–5 minutes. --- ## Next steps --- # Pricing & Credits (/docs/documentation/pricing) ## Per-Request Pricing AnakinScraper uses a simple credit-based system. Each API request consumes credits based on the type of operation performed. | Request Type | Credits | Description | |--------------|---------|-------------| | URL scrape | 1 credit | HTML extraction with optional JS rendering | | Batch scraping | 1 credit × URLs | 1 credit per URL in the batch | | Search API | 3 credits | Web search with full content extraction | | Agentic Search | 10 + 1/URL | 10 base + 1 credit per URL scraped during research | ## Plans ### Starter (Free) - 500 credits - Basic scraping - API access ### Pro ($9/month) - 3,000 credits/month - Priority support - Advanced features - 99.9% uptime SLA ### Enterprise (Custom) - Unlimited credits - Dedicated support - Custom integrations - SLA guarantees [Contact sales](https://calendly.com/d/ctqw-64s-rgt/let-s-talk-about-your-use-case-anakin-io) for Enterprise pricing. 
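The per-request costs in the table above are simple to compute. This sketch just restates the pricing rules as arithmetic (illustrative, not an official calculator):

```python
def scrape_cost(num_urls: int = 1) -> int:
    """URL scrape or batch scrape: 1 credit per URL."""
    return 1 * num_urls


def search_cost() -> int:
    """Search API: flat 3 credits per request."""
    return 3


def agentic_search_cost(urls_scraped: int) -> int:
    """Agentic Search: 10 base credits + 1 per URL scraped during research."""
    return 10 + urls_scraped


# An agentic research job that scrapes 15 sources costs 25 credits,
# so the 500 free Starter credits cover 20 such jobs.
print(agentic_search_cost(15))  # 25
print(scrape_cost(10))          # 10
```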
## Credit Usage Examples

### Basic HTML Scrape

```bash
curl -X POST https://api.anakin.io/v1/url-scraper \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

**Cost: 1 credit**

### URL Scrape with JS Rendering

```bash
curl -X POST https://api.anakin.io/v1/url-scraper \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "useBrowser": true}'
```

**Cost: 1 credit** (same as basic — JS rendering has no additional cost)

### Batch Scraping (10 URLs)

```bash
curl -X POST https://api.anakin.io/v1/url-scraper/batch \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["url1", "url2", ...], "useBrowser": true}'
```

**Cost: 10 credits** (1 credit × 10 URLs)

---

# Rate Limits (/docs/documentation/rate-limits)

AnakinScraper applies rate limits per API key to ensure reliable performance for all users. Limits are enforced using a token-bucket algorithm — short bursts above the limit are allowed, but sustained overuse will trigger throttling.

---

## Limits by endpoint

| Endpoint | Rate limit | Bucket |
|----------|-----------|--------|
| `POST /v1/url-scraper` | 60 requests/min | Scraping |
| `POST /v1/url-scraper/batch` | 60 requests/min | Scraping |
| `POST /v1/web-scraper` | 60 requests/min | Scraping |
| `POST /v1/search` | 30 requests/min | Search |
| `POST /v1/agentic-search` | 10 requests/min | Agentic Search |
| `GET /v1/url-scraper/{id}` | No limit | — |
| `GET /v1/web-scraper/{id}` | No limit | — |
| `GET /v1/agentic-search/{id}` | No limit | — |

> **Polling endpoints are not rate-limited.** You can poll for job results as frequently as you need without hitting a limit.

---

## Rate limit response

When you exceed a rate limit, the API returns a `429 Too Many Requests` response:

```json
{
  "error": "Rate limit exceeded. Please slow down your requests."
} ``` --- ## Handling rate limits ### Retry with exponential backoff The recommended approach is to wait and retry with exponential backoff. Start with a short delay and double it on each retry. ```python import requests import time def scrape_with_retry(url, api_key, max_retries=3): """Submit a scrape job with automatic retry on rate limit.""" delay = 2 for attempt in range(max_retries + 1): response = requests.post( "https://api.anakin.io/v1/url-scraper", headers={"X-API-Key": api_key}, json={"url": url} ) if response.status_code == 429: if attempt == max_retries: raise Exception("Rate limit exceeded after retries") print(f"Rate limited, retrying in {delay}s...") time.sleep(delay) delay *= 2 continue response.raise_for_status() return response.json() result = scrape_with_retry("https://example.com", "ak-your-key-here") print(result["jobId"]) ``` ```javascript async function scrapeWithRetry(url, apiKey, maxRetries = 3) { let delay = 2000; for (let attempt = 0; attempt <= maxRetries; attempt++) { const response = await fetch("https://api.anakin.io/v1/url-scraper", { method: "POST", headers: { "X-API-Key": apiKey, "Content-Type": "application/json" }, body: JSON.stringify({ url }) }); if (response.status === 429) { if (attempt === maxRetries) { throw new Error("Rate limit exceeded after retries"); } console.log(`Rate limited, retrying in ${delay / 1000}s...`); await new Promise(r => setTimeout(r, delay)); delay *= 2; continue; } if (!response.ok) throw new Error(`HTTP ${response.status}`); return await response.json(); } } const result = await scrapeWithRetry("https://example.com", "ak-your-key-here"); console.log(result.jobId); ``` ### Use batch endpoints If you're scraping multiple URLs, use the [batch endpoint](/docs/api-reference/url-scraper/batch-url-scraping) instead of submitting individual requests. A single batch request can include up to 10 URLs and only counts as one request against the rate limit. 
```bash
# Bad: 10 requests, 10 against rate limit
for url in url1 url2 ... url10; do
  curl -X POST .../v1/url-scraper -d "{\"url\": \"$url\"}"
done

# Good: 1 request, 1 against rate limit
curl -X POST https://api.anakin.io/v1/url-scraper/batch \
  -H "X-API-Key: ak-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["url1", "url2", "...", "url10"]}'
```

### Spread requests over time

If you have a large list of URLs, pace your submissions rather than sending them all at once. A simple approach is to add a short delay between requests:

```python
import time

api_key = "ak-your-key-here"
urls = ["https://example.com/1", "https://example.com/2", ...]
job_ids = []

for url in urls:
    result = scrape_with_retry(url, api_key)
    job_ids.append(result["jobId"])
    time.sleep(1)  # ~60 requests/min stays within the limit
```

---

## Tips

- **Rate limits apply to submit endpoints only.** Poll as often as you like — GET endpoints for checking job status are not rate-limited.
- **Batch when possible.** A single batch request with 10 URLs uses 1 rate-limit slot, not 10.
- **Cache results.** AnakinScraper caches responses for 24 hours. Repeat requests for the same URL return instantly and cost zero credits, but they still count against rate limits.
- **Use the CLI for simple workloads.** The [Anakin CLI](/docs/sdks/cli) handles rate limiting and retries automatically.

---

## Increasing your limits

If you need higher rate limits for your use case, contact us:

- **Email** — support@anakin.io
- **Enterprise plan** — includes custom rate limits. [Talk to sales](https://calendly.com/d/ctqw-64s-rgt/let-s-talk-about-your-use-case-anakin-io).

---

# Use Cases (/docs/documentation/use-cases)

Explore how different teams leverage AnakinScraper to power their AI applications, data pipelines, and business workflows.
--- # AI & Agent Data Ingestion (/docs/documentation/use-cases/ai-agent-data-ingestion) Use Anakin's scraping API to turn web pages into **structured JSON** (or clean text/markdown) for **RAG pipelines**, **AI agents**, and **support copilots**. Typical workflows include crawling docs, help centers, product pages, and forum threads, then chunking and indexing for retrieval. --- ### Common sources * Documentation sites (MDX/Docs frameworks, API references) * Help centers / knowledge bases (Zendesk-style, custom) * Product pages + changelogs * Forums / community threads --- ### What to extract * Title, headings hierarchy (H1/H2/H3) * Main content blocks (exclude nav/footer) * Code blocks (language + code) * Tables, lists, callouts * Canonical URL, publish/updated timestamps * Outbound links for crawl expansion --- ### Implementation notes * Prefer **browser rendering** for JS docs sites and SPAs. * Use **structured extraction** to keep stable fields: `title`, `sections[]`, `code_blocks[]`, `tables[]`. * Use **dedupe keys** (canonical URL + content hash) to avoid re-indexing. * For RAG: chunk by heading boundaries; store metadata (source URL, section path). --- ### FAQs Do you generate embeddings or manage vector databases? No. The API extracts structured content from webpages. Embeddings, chunking, and indexing are handled in your own AI pipeline. Can I extract clean content from JS-heavy documentation sites? Yes. Use browser rendering to extract the fully rendered DOM and then structure headings, paragraphs, code blocks, and tables. How do I avoid reprocessing unchanged content? Store a content hash of the structured output and compare it across runs. Only re-embed if the hash changes. Can I scrape private or logged-in knowledge bases? Yes, if you provide valid session credentials or cookies using authenticated sessions. Does the API clean or rewrite content? No. It extracts what is present on the page. Any cleaning or transformation happens in your pipeline. 
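The chunking and dedupe notes above can be sketched as follows. The heading-based splitter and hash scheme are illustrative choices for your own pipeline, not part of the API:

```python
import hashlib
import re


def chunk_by_headings(markdown: str, source_url: str) -> list[dict]:
    """Split scraped markdown at heading boundaries and attach provenance
    metadata plus a dedupe key (canonical URL + content hash)."""
    # Split before every markdown heading (#, ##, ###), keeping the heading.
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks = []
    for section in sections:
        body = section.strip()
        if not body:
            continue
        content_hash = hashlib.sha256(body.encode()).hexdigest()[:16]
        chunks.append({
            "text": body,
            "source_url": source_url,
            "section_path": body.splitlines()[0].lstrip("# "),
            "dedupe_key": f"{source_url}#{content_hash}",
        })
    return chunks


doc = "# Guide\nIntro text.\n## Setup\nInstall steps.\n## Usage\nHow to run."
chunks = chunk_by_headings(doc, "https://example.com/docs/guide")
# 3 chunks; re-index only when a chunk's dedupe_key changes between runs.
```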
--- # Browser Automation & Web Workflows (/docs/documentation/use-cases/browser-automation) Automate JS-heavy flows to reach data that isn't available via static HTML alone. This includes **authenticated dashboards**, multi-step navigation, and extracting data after interactions. --- ### Common sources * Authenticated portals (member-only pages, dashboards) * Multi-step flows (filters, tabs, pagination controls) * JS-rendered apps with client-side routing --- ### What to extract * Post-login content (tables, lists, metrics) * UI state-specific data (selected filters, tabs, date ranges) * Export links / report links (when available) --- ### Implementation notes * Use authenticated sessions (cookies/tokens) and browser rendering. * Extract stateful metadata: `filters_applied`, `date_range`, `view_name`. * Build idempotent workflows: re-runable steps, clear failure modes, retries. --- ### FAQs Is this full RPA? No. It enables browser-rendered extraction and authenticated flows. Workflow orchestration is built on your side. Can I scrape dashboards behind login? Yes, using authenticated sessions with valid credentials or cookies. Can I interact with filters or tabs? Yes, if your workflow triggers those states before extraction. What if the page uses client-side routing? Browser rendering resolves the final DOM state for extraction. Does the API maintain persistent sessions automatically? Session management logic should be handled in your integration. --- # Competitive Intelligence & Pricing (/docs/documentation/use-cases/competitive-intel) Extract competitor site data (pricing pages, feature pages, release notes, product catalogs) into structured fields for **competitive analysis**, **pricing intelligence**, and **feature comparison**. 
--- ### Common sources * Pricing pages (plan tiers, add-ons, limits, feature matrices) * Feature pages, integrations pages * Changelogs / release notes * Public product catalogs and category pages --- ### What to extract * Plans and tiers: name, price, billing cadence, currency * Feature matrices (rows/columns normalized) * Usage limits: seats, API calls, storage, rate limits * Add-ons and overage pricing * Promotions, coupons, seasonal offers (where present) * Change metadata: last updated date, page version id, content hash --- ### Implementation notes * Rendering often required for pricing widgets and tabs. * Normalize plan data into a consistent schema across vendors. * Use `content_hash` for change detection and diffing (your system can compute diffs). * Handle geo-based pricing by varying locale headers / region routing (if needed). --- ### FAQs Does this track competitor changes automatically? No. You must schedule recurring scrapes and compare structured outputs to detect changes. Can pricing tables that require interaction be extracted? Yes, if they are rendered in the DOM after interaction. Browser rendering is required for most dynamic pricing widgets. Can I extract feature comparison tables? Yes. Tables and structured lists can be normalized into JSON fields. Does the API interpret pricing logic (discounts, bundles)? No. It extracts visible pricing data. Interpretation is handled downstream. How do I detect real pricing changes? Compare structured pricing JSON across runs instead of raw HTML to avoid layout noise. --- # Content Aggregation (/docs/documentation/use-cases/content-aggregation) Aggregate content from multiple webpages into normalized outputs for **news aggregation**, **research feeds**, **internal dashboards**, and **content pipelines**. 
--- ### Common sources * Blogs and news archives * Company update pages * Release notes/changelog pages * Documentation announcement pages --- ### What to extract * Article list entries: title, URL, excerpt, publish date, author * Full article content (main body + headings) * Tags/categories * Media: featured image URL, embeds (if needed) * Canonical URL + source attribution --- ### Implementation notes * Two-stage approach: scrape index pages → collect article URLs → scrape article pages. * Normalize into a single schema across sources. * Use content hashing to detect updates without storing huge diffs. --- ### FAQs Can I scrape blog archives? Yes. Extract article listings and then scrape individual article pages. Does the API provide RSS feeds? No. You can generate RSS downstream using extracted content. How do I detect new articles? Re-scrape index pages and compare extracted URLs or content hashes. Can I extract publication dates and authors? Yes, if they are visible in the page content. Does the API summarize content? No. It extracts content; summarization must be done using your own AI pipeline. --- # Financial & Corporate Filings (/docs/documentation/use-cases/financial-filings) Extract structured data from corporate websites for **investor relations**, **press releases**, **earnings pages**, and **public disclosures**. Useful for analysis pipelines and research workflows. --- ### Common sources * Investor Relations (IR) pages (quarterly results, presentations) * Press release archives * Leadership pages and governance pages * Public filings portals (where web accessible) --- ### What to extract * Press release entries: title, date, category, URL * Earnings artifacts: PDF links, webcast links, transcripts (if present) * Company metadata: legal name, HQ location, leadership roster * Document metadata: file type, published date, version identifiers --- ### Implementation notes * Many IR sites load content via JS; enable rendering. 
* Extract document links and store them (download/processing happens downstream). * Preserve timestamps and original source URLs for traceability. --- ### FAQs Can I extract press release archives? Yes. You can scrape index pages and extract titles, dates, and URLs for each entry. Can the API parse PDFs? No. It extracts webpage content. PDF parsing must be handled separately. How do I identify the latest earnings release? Scrape the archive page and sort entries by extracted publish date. Can I extract document download links? Yes. If links are visible in the rendered page, they can be captured. Does the API validate financial data? No. It only extracts what is publicly displayed on the webpage. --- # Lead Generation (/docs/documentation/use-cases/lead-generation) Extract structured business data from the public web (directories, company pages, partner lists) for **prospecting**, **account research**, and **CRM enrichment**. --- ### Common sources * B2B directories, association listings * Partner / reseller / agency directories * Conference sponsor/exhibitor lists * Company "Contact" and "About" pages --- ### What to extract * Company name, domain, category/industry * Location, service areas * Contact channels: emails (if public), phone, contact forms * Social links, team pages (if public) * Signals: certifications, partner badges, technologies listed --- ### Implementation notes * Be strict about "public web only" fields; store provenance. * Use extraction schema that separates `contact_methods[]` from `people[]`. * Avoid brittle scraping of obfuscated emails—prefer consistent selectors. --- ### FAQs Does the API provide private contact data? No. It extracts only what is publicly available on the scraped pages. Can I scrape company directories? Yes, including listing pages and company profile pages. Does the API enrich company data? No. It extracts visible fields. Enrichment requires external systems. How do I prevent duplicate companies? 
Use domain normalization and entity deduplication in your own database. Can I scrape contact forms? You can extract form structure, but submission workflows require browser automation logic on your side. --- # Structured Market & Web Research (/docs/documentation/use-cases/market-research) Collect structured facts from many web sources for **market research**, **category research**, and **competitive landscape mapping**. This includes aggregating lists, extracting entities, and building a dataset you can query. --- ### Common sources * Directories (companies, tools, marketplaces) * Public listings pages (partners, agencies, vendor ecosystems) * Industry reports pages and statistics pages * Public datasets pages and data portals --- ### What to extract * Entities: name, description, category, website, location * Pricing/positioning summaries (when publicly listed) * Metadata: tags, industries, integrations, target audience * Tables and structured lists * Links to "detail pages" for deeper extraction --- ### Implementation notes * Start with list pages → collect detail URLs → scrape detail pages for full schema. * Use dedupe by domain + entity name normalization. * Keep provenance: every extracted entity should retain its source URL and scrape timestamp. --- ### FAQs Can I scrape entire directories? Yes. Start from listing pages, collect detail page URLs, then scrape each detail page. Does the API deduplicate entities across sources? No. Deduplication must be implemented in your data pipeline. Can I extract structured attributes consistently across different sites? Yes, but you define the schema. Each source may require slightly different extraction logic. Is rendering required for directory sites? Often yes. Many modern directories load content via client-side JavaScript. Can I crawl multiple levels deep? Yes, as long as you manage crawl scope and URL constraints in your workflow. 
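The dedupe-by-domain-plus-name suggestion above might look like this in practice; the entity schema and helper names are illustrative, and deduplication runs in your own pipeline, not in the API:

```python
from urllib.parse import urlparse


def entity_key(name: str, website: str) -> str:
    """Dedupe key: normalized domain plus normalized entity name."""
    domain = urlparse(website).netloc.lower().removeprefix("www.")
    return f"{domain}::{' '.join(name.lower().split())}"


def dedupe_entities(entities: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (domain, name) pair, preserving
    the provenance fields on the surviving record."""
    seen: set[str] = set()
    unique = []
    for entity in entities:
        key = entity_key(entity["name"], entity["website"])
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique


raw = [
    {"name": "Acme Tools", "website": "https://www.acme.io",
     "source_url": "https://dir.example/a"},
    {"name": "ACME  Tools", "website": "https://acme.io/about",
     "source_url": "https://dir.example/b"},
    {"name": "Beta Labs", "website": "https://betalabs.dev",
     "source_url": "https://dir.example/c"},
]
unique = dedupe_entities(raw)  # the two Acme rows collapse into one
```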
--- # ML & Training Data Collection (/docs/documentation/use-cases/ml-training-data) Create datasets from the web for **machine learning**, including **NLP**, **information extraction**, **classification**, and **entity recognition** pipelines. --- ### Common sources * Public articles, blogs, documentation, forums * Public product catalogs and listings * Tables and structured lists useful as labeled sources --- ### What to extract * Clean text with provenance (URL, timestamp) * Structured fields suitable for supervised labels (title/category/price/attributes) * Tables as normalized rows * Image URLs / media metadata (if needed for CV pipelines) --- ### Implementation notes * Keep dataset rows deterministic: same URL → same schema fields. * Store raw + cleaned versions (raw HTML optional). * Use stable identifiers: `source_url`, `content_hash`, `extraction_version`. --- ### FAQs Does the API label training data? No. It extracts raw structured data; labeling is your responsibility. Can I collect large volumes of data? Yes, subject to your usage limits and rate management. Is the extracted output deterministic? Yes, if the source page content is unchanged. Can I extract multilingual content? Yes. The API returns content as rendered on the page. Should I store raw HTML for training datasets? That depends on your model objective. Structured JSON is typically easier to manage. --- # Review & Sentiment Data Extraction (/docs/documentation/use-cases/review-sentiment) Collect structured text from reviews, forums, and community posts for **sentiment analysis**, **topic modeling**, and **voice-of-customer** pipelines. 
--- ### Common sources * Review pages (product/service reviews) * Community threads and Q&A forums * Public social-like pages (web accessible) --- ### What to extract * Review/post text, rating (if present), date * Author handle (if public), verified flags (if present) * Helpful votes / reactions (if present) * Thread structure: parent/child relationships * Product/entity identifiers (SKU, product name, URL) --- ### Implementation notes * Prefer structured fields: `rating`, `body`, `timestamp`, `thread_id`, `reply_to`. * Handle pagination carefully; store page cursors. * Keep raw text clean (strip UI noise like "Read more", "Translate"). --- ### FAQs Does the API calculate sentiment? No. It extracts review text and metadata. Sentiment scoring must be done separately. Can I scrape paginated reviews? Yes. You must manage pagination logic in your workflow. Can I extract ratings and timestamps? Yes, if those elements are visible in the rendered DOM. How do I avoid duplicate reviews? Use review IDs or compute a hash of stable fields like author, date, and content. Can I extract nested replies? Yes, as long as reply hierarchy is present in the DOM. --- # SEO & Search Intelligence (/docs/documentation/use-cases/seo-marketing) Extract data from search and content surfaces for **SERP research**, **keyword intelligence**, **content auditing**, and **AEO/answer engine optimization** workflows. --- ### Common sources * Search results pages (where accessible) * Competitor content hubs and blog archives * Programmatic landing pages * FAQ-rich pages / schema-heavy pages --- ### What to extract * Page title, meta description, canonical URL * Headings, FAQs, structured data markers * Internal links and topic clusters * Publish date and author (if present) * Content layout: lists, tables, definitions --- ### Implementation notes * Rendering helps when SERP pages are JS-heavy. * Extract "main content" + metadata; ignore navigation elements. 
* Store structured blocks for downstream scoring (readability, topical coverage, schema presence). --- ### FAQs Is this a rank tracking tool? No. You can extract SERP or page content, but tracking and comparison must be built separately. Can I extract meta tags and structured data? Yes. Titles, meta descriptions, canonical tags, and schema markup can be captured. Does the API analyze keyword performance? No. It extracts page content and metadata. Analysis happens downstream. Can I extract FAQ sections from pages? Yes, if they are present in the DOM or marked up with structured data. Is rendering required for search result pages? Often yes, as many are dynamically generated. --- # Integrations (/docs/integrations) Use AnakinScraper inside your IDE, AI frameworks, and workflow automation platforms. --- ### Plugins & Skills Use AnakinScraper inside your AI agent or code editor. Scrape, search, and research without leaving your environment. | Integration | Status | Description | |-------------|--------|-------------| | [Claude Code](/docs/integrations/ide-plugins/claude-code) | Available | Plugin for Claude Code with skills, agents, and hooks | | [Cursor](/docs/integrations/ide-plugins/cursor) | Available | Plugin for Cursor with rules, skills, and agents | | [OpenClaw](/docs/integrations/ide-plugins/openclaw) | Available | Skill for OpenClaw AI agents on ClawHub | --- ### AI Frameworks Integrate AnakinScraper into AI application frameworks. 
| Integration | Status | Description | |-------------|--------|-------------| | [Google ADK](/docs/integrations/ai-frameworks/google-adk) | Available | ADK tools for scraping, search, and research in Gemini agents | | [LangChain](/docs/integrations/ai-frameworks/langchain) | Coming Soon | Document Loader, Tools, and Retriever for scraping and search | | [LlamaIndex](/docs/integrations/ai-frameworks/llamaindex) | Coming Soon | Reader and Tools for RAG pipelines and agents | | [CrewAI](/docs/integrations/ai-frameworks/crewai) | Coming Soon | Tools for multi-agent scraping, search, and research | | [Langflow](/docs/integrations/ai-frameworks/langflow) | Coming Soon | Visual drag-and-drop components for scraping and search | | [Flowise](/docs/integrations/ai-frameworks/flowise) | Coming Soon | Visual chatflow nodes for scraping and search | --- ### Workflow Automation Trigger scrapes, searches, and research from your automation workflows. | Integration | Status | Description | |-------------|--------|-------------| | [Dify](/docs/integrations/workflow/dify) | Available | Plugin with 5 tools for Dify AI workflows | | [n8n](/docs/integrations/workflow/n8n) | Coming Soon | Community node with scrape, search, and agentic search | | [Zapier](/docs/integrations/workflow/zapier) | Available | 4 actions: scrape, AI search, agentic search, get results | | [Make](/docs/integrations/workflow/make) | Available | 4 native modules: scrape, poll, agentic search, AI search | --- # CrewAI (/docs/integrations/ai-frameworks/crewai) | | | |---|---| | **Framework** | [CrewAI](https://www.crewai.com) | | **Type** | Tool | --- ### What to expect The CrewAI integration will provide tools that agents can use autonomously: - **Scrape Tool** — Agents scrape any URL and get back clean markdown or structured JSON - **Search Tool** — Agents perform AI-powered web searches with instant results - **Research Tool** — Agents run deep multi-stage research on any topic --- ### In the meantime You can 
integrate AnakinScraper into your CrewAI agents today using the [REST API](/docs/api-reference) with CrewAI's custom tool support, or use the [Anakin CLI](/docs/sdks/cli) as a subprocess tool. --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # Flowise (/docs/integrations/ai-frameworks/flowise) | | | |---|---| | **Framework** | [Flowise](https://flowiseai.com) | | **Type** | Component | --- ### What to expect The Flowise integration will provide visual nodes for: - **Scrape Node** — Scrape any URL and pass clean data to downstream nodes - **Search Node** — AI-powered web search as a chatflow node - **Research Node** — Deep agentic research as a chatflow node --- ### In the meantime You can use AnakinScraper in Flowise today via the **Custom Tool** or **HTTP Request** node with the [REST API](/docs/api-reference). --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # Google ADK (/docs/integrations/ai-frameworks/google-adk) Google ADK tools for web scraping, search, and research — powered by Anakin. Build AI agents with Gemini that can extract data from any website, perform intelligent web searches, and conduct deep autonomous research. | | | |---|---| | **PyPI** | [pypi.org/project/anakin-adk](https://pypi.org/project/anakin-adk/) | | **Source** | [GitHub](https://github.com/Anakin-Inc/anakin-adk) | | **Type** | Tool | | **Version** | 0.1.2 | | **Tools** | 4 | | **License** | MIT | | **Requires** | Python >=3.10 | --- ### How it works You register Anakin tools with your Google ADK agent. When a user asks something that requires web data, Gemini automatically selects the right tool, fills in the parameters, and returns the results — no manual configuration needed. ``` User → "What's on this page?" 
→ Gemini agent → scrape_website → Anakin API → results → Gemini → response ``` The tools expose their parameter schemas to Gemini via ADK's tool protocol, so the model knows when to use the browser, which country to route through, and whether to extract structured JSON — all based on the conversation context and your agent's instructions. --- ### Key features - **Anti-detection** — Proxy routing across 207 countries prevents blocking - **Intelligent Caching** — Up to 30x faster on repeated requests - **AI Extraction** — Convert any webpage into structured JSON - **Browser Automation** — Full headless Chrome support for SPAs and JS-heavy sites - **Batch Processing** — Scrape up to 10 URLs in a single request - **Deep Research** — Autonomous multi-stage research combining search, scraping, and AI synthesis --- ### Setup #### 1. Get your API key 1. Sign up at [anakin.io/signup](/signup) 2. Go to your [Dashboard](/dashboard) 3. Copy your API key (starts with `ask_`) #### 2. Install the package ```bash pip install anakin-adk ``` You also need the Anakin CLI installed and authenticated: ```bash pip install anakin-cli anakin login --api-key "ask_your-key-here" ``` --- ### Tools Each tool is exposed to Gemini with a full parameter schema. The model decides which parameters to use based on the user's request and your agent instructions. You can guide tool behavior by including hints in your agent's `instruction` field (e.g., "always use the browser for JavaScript-heavy sites" or "route through UK proxies"). #### 1. scrape_website Scrape a single URL and return clean markdown or structured JSON. 
| Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `url` | string | Yes | — | Target URL to scrape (HTTP/HTTPS) | | `country` | string | No | `us` | Proxy location from [207 countries](/docs/api-reference/supported-countries) | | `use_browser` | boolean | No | `false` | Enable headless Chrome for JavaScript-heavy sites | | `generate_json` | boolean | No | `false` | Use AI to extract structured data | | `session_id` | string | No | — | Browser session ID for [authenticated pages](/docs/api-reference/browser-sessions) | **Response includes:** Raw HTML, cleaned HTML, markdown conversion, structured JSON (if `generate_json` enabled), cache status, timing metrics. --- #### 2. batch_scrape Scrape up to 10 URLs at once and return combined results. | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `urls` | string | Yes | — | Comma-separated list of URLs (1–10) | | `country` | string | No | `us` | Proxy location from [207 countries](/docs/api-reference/supported-countries) | | `use_browser` | boolean | No | `false` | Enable headless Chrome for JavaScript-heavy sites | | `generate_json` | boolean | No | `false` | Use AI to extract structured data from each page | **Response includes:** Per-URL results with HTML, markdown, and optional structured JSON. --- #### 3. search_web AI-powered web search returning results with citations. Results are returned immediately without polling. | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `prompt` | string | Yes | — | Search query or question | | `limit` | number | No | `5` | Maximum results to return | **Response includes:** Array of results with URLs, titles, snippets, publication dates, last updated timestamps. --- #### 4. deep_research Autonomous multi-stage research pipeline combining search, scraping, and AI synthesis. Takes 1–5 minutes. 
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `prompt` | string | Yes | Research question or topic | **Response includes:** Comprehensive AI-generated report, structured findings, citations with source URLs, scraped source data, processing metrics. --- ### Processing times | Tool | Type | Typical Duration | |------|------|------------------| | scrape_website | Async | 3–15 seconds | | batch_scrape | Async | 5–30 seconds | | search_web | **Sync** | Immediate | | deep_research | Async | 1–5 minutes | --- ### Usage #### Full toolkit Pass all 4 tools to your agent at once: ```python from anakin_adk import AnakinToolkit from google.adk.agents import Agent agent = Agent( model="gemini-2.5-pro", name="web_researcher", instruction="Help users extract data from the web", tools=AnakinToolkit().get_tools(), ) ``` Run with the ADK dev UI: ```bash adk web ``` #### Individual tools Use specific tools instead of the full toolkit: ```python from anakin_adk import ScrapeWebsiteTool, SearchWebTool from google.adk.agents import Agent agent = Agent( model="gemini-2.5-pro", name="search_and_scrape", instruction="Search the web and scrape relevant pages", tools=[SearchWebTool(), ScrapeWebsiteTool()], ) ``` #### Product research agent An agent that compares products by scraping multiple pages: ```python from anakin_adk import AnakinToolkit from google.adk.agents import Agent agent = Agent( model="gemini-2.5-pro", name="product_researcher", instruction="""You are a product research assistant. When asked to compare products: 1. Use search_web to find relevant product pages 2. Use batch_scrape with generate_json=true to extract structured data 3. 
Summarize findings in a comparison table""", tools=AnakinToolkit().get_tools(), ) ``` #### Deep research agent An agent for comprehensive research reports: ```python from anakin_adk import AnakinToolkit from google.adk.agents import Agent agent = Agent( model="gemini-2.5-pro", name="deep_researcher", instruction="""You are a research analyst. Use deep_research for broad questions that need multiple sources. Use search_web + scrape_website for targeted fact-checking.""", tools=AnakinToolkit().get_tools(), ) ``` #### Geo-targeted scraping agent An agent that routes through specific country proxies: ```python from anakin_adk import AnakinToolkit from google.adk.agents import Agent agent = Agent( model="gemini-2.5-pro", name="geo_scraper", instruction="""You scrape websites for users. Always ask which country to route through. Use use_browser=true for JavaScript-heavy sites. Use generate_json=true when the user wants structured data.""", tools=AnakinToolkit().get_tools(), ) ``` #### More examples The [examples directory](https://github.com/Anakin-Inc/anakin-adk/tree/main/examples) includes: - **`basic_scraping.py`** — Simple scrape agent - **`research_agent.py`** — Deep research agent - **`search_and_scrape.py`** — Multi-step: search then scrape --- ### Troubleshooting | Code | Meaning | Action | |------|---------|--------| | 400 | Invalid parameters | Check your agent's instructions — Gemini may be passing unexpected values | | 401 | Invalid API key | Run `anakin login` to re-authenticate | | 402 | Plan upgrade required | Upgrade at [Pricing](/docs/documentation/pricing) | | 404 | Job not found | Job may have expired | | 429 | Rate limit exceeded | Reduce request frequency or upgrade your plan | | 5xx | Server error | Retry with backoff | **Common issues:** | Issue | Fix | |-------|-----| | Agent never uses tools | Check that `tools=` is set correctly and the instruction mentions web tasks | | Empty scrape results | Add `use_browser=true` hint to your agent 
instruction for JS-heavy sites | | Wrong country data | Add a country hint to your instruction (e.g., "always route through `gb`") | | CLI not authenticated | Run `anakin status` to check, then `anakin login` if needed | --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # LangChain (/docs/integrations/ai-frameworks/langchain) --- ### What to expect The LangChain integration will provide: - **Document Loader** — Load web pages as LangChain documents using AnakinScraper's scraping engine - **Search Tool** — AI-powered web search as a LangChain tool for agents - **Research Tool** — Deep agentic research as a LangChain tool --- ### In the meantime You can integrate AnakinScraper into your LangChain applications today using the [REST API](/docs/api-reference) with LangChain's custom tool support, or use the [Anakin CLI](/docs/sdks/cli) as a subprocess tool. --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # Langflow (/docs/integrations/ai-frameworks/langflow) | | | |---|---| | **Framework** | [Langflow](https://www.langflow.org) | | **Type** | Component | --- ### What to expect The Langflow integration will provide visual drag-and-drop components for: - **Scrape Component** — Scrape any URL and feed clean data into your flow - **Search Component** — AI-powered web search as a flow node - **Research Component** — Deep agentic research as a flow node --- ### In the meantime You can use AnakinScraper in Langflow today via the **HTTP Request** component with the [REST API](/docs/api-reference). 
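The HTTP Request component boils down to plain REST calls against the API. A minimal Python equivalent, using the `X-API-Key` header and the documented job-status paths (`GET /v1/request/{id}` for scrape jobs, `GET /v1/agentic-search/{id}` for agentic search); treat anything beyond those paths as an assumption to verify against the [API Reference](/docs/api-reference):

```python
import json
import urllib.request

BASE_URL = "https://api.anakin.io/v1"

def status_url(job_id: str, kind: str = "scrape") -> str:
    """Build the documented job-status URL for a scrape or agentic search job."""
    path = "request" if kind == "scrape" else "agentic-search"
    return f"{BASE_URL}/{path}/{job_id}"

def get_job(job_id: str, api_key: str, kind: str = "scrape") -> dict:
    """Fetch the current status/result payload for a job."""
    req = urllib.request.Request(
        status_url(job_id, kind),
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

In Langflow itself you would put the same URL and `X-API-Key` header into an HTTP Request component rather than writing code.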
--- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # LlamaIndex (/docs/integrations/ai-frameworks/llamaindex) | | | |---|---| | **Framework** | [LlamaIndex](https://www.llamaindex.ai) | | **Type** | Reader | --- ### What to expect The LlamaIndex integration will provide: - **AnakinReader** — Load web pages as LlamaIndex `Document` objects for RAG pipelines, indexing, and querying - **Search Tool** — AI-powered web search as a LlamaIndex tool for agents - **Research Tool** — Deep agentic research as a LlamaIndex tool --- ### In the meantime You can integrate AnakinScraper into your LlamaIndex applications today using the [REST API](/docs/api-reference) with LlamaIndex's custom reader support, or use the [Anakin CLI](/docs/sdks/cli) as a subprocess tool. --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # Claude Code (/docs/integrations/ide-plugins/claude-code) Scrape websites, search the web, and run deep research directly inside Claude Code. | | | |---|---| | **Source** | [GitHub](https://github.com/Anakin-Inc/anakin-claude-plugin) | | **License** | MIT | | **Requires** | [anakin-cli](/docs/sdks/cli) (Python 3.10+) | --- ### Prerequisites - **Claude Code** installed and working - **Python 3.10+** — required for [anakin-cli](/docs/sdks/cli) (installed automatically by the plugin) - **API key** — get one from the [Dashboard](/dashboard) (the plugin will prompt you if needed) --- ### Setup Clone the plugin from GitHub and point Claude Code to it: ```bash git clone https://github.com/Anakin-Inc/anakin-claude-plugin.git claude --plugin-dir ./anakin-claude-plugin ``` The plugin handles the rest automatically — its `check-auth` hook verifies that `anakin-cli` is installed and authenticated before running commands, and the `/anakin:setup` skill can install the CLI and configure your API key if needed. 
--- ### Skills The plugin adds these skills to Claude Code: | Skill | Command | Description | |-------|---------|-------------| | Scrape Website | `/anakin:scrape-website [url]` | Scrape a single URL to markdown, JSON, or raw | | Scrape Batch | `/anakin:scrape-batch [url1] [url2]` | Scrape up to 10 URLs at once | | Search Web | `/anakin:search-web [query]` | AI-powered web search with instant results | | Deep Research | `/anakin:deep-research [topic]` | Deep agentic multi-step research (1–5 min) | | Setup | `/anakin:setup` | Install CLI, configure API key, set up output directory | | CLI Knowledge | *(auto)* | Background knowledge: escalation workflow, CLI rules, output organization | --- ### Agents | Agent | Description | |-------|-------------| | `data-extraction-architect` | Plans which anakin-cli commands to use for complex extraction tasks | --- ### Hooks | Hook | Event | Description | |------|-------|-------------| | `check-auth` | `PreToolUse` (Bash) | Verifies anakin-cli is installed and authenticated before running commands | --- ### Usage Once the plugin is active, Claude will use Anakin automatically for scraping and search tasks. You can also invoke skills directly: ``` /anakin:search-web latest React documentation /anakin:scrape-website https://example.com /anakin:deep-research pros and cons of microservices vs monolith /anakin:scrape-batch https://a.com https://b.com https://c.com ``` All output is saved to the `.anakin/` directory to keep your context window clean: ``` .anakin/ ├── search-react-docs.json ├── example.com.md ├── batch-results.json └── research-microservices.json ``` --- ### Configuration | Variable | Description | |----------|-------------| | `ANAKIN_API_KEY` | API key (env var, takes precedence over config file) | | `~/.anakin/config.json` | Stored API key (set via `anakin login`) | --- # Cursor (/docs/integrations/ide-plugins/cursor) Scrape websites, search the web, and run deep research directly inside Cursor. 
| | | |---|---| | **Source** | [GitHub](https://github.com/Anakin-Inc/anakin-cursor-plugin) | | **License** | MIT | | **Requires** | [anakin-cli](/docs/sdks/cli) (Python 3.10+) | --- ### Prerequisites - **Cursor** installed and working - **Python 3.10+** — required for [anakin-cli](/docs/sdks/cli) (installed automatically by the plugin) - **API key** — get one from the [Dashboard](/dashboard) (the plugin will prompt you if needed) --- ### Setup Clone the plugin from GitHub and add it to Cursor: ```bash git clone https://github.com/Anakin-Inc/anakin-cursor-plugin.git /add-plugin anakin ``` The plugin handles the rest automatically — its `anakin-setup` rule ensures `anakin-cli` is installed and authenticated before running commands. --- ### Skills | Skill | Description | |-------|-------------| | `scrape-website` | Scrape a single URL to markdown, JSON, or raw using `anakin scrape` | | `scrape-batch` | Scrape up to 10 URLs at once using `anakin scrape-batch` | | `search-web` | AI-powered web search using `anakin search` | | `deep-research` | Deep agentic research across multiple sources using `anakin research` | --- ### Rules | Rule | Description | |------|-------------| | `anakin-setup` | Ensures anakin-cli is installed and authenticated before running commands | | `anakin-cli-usage` | URL quoting, output handling, format defaults, and error recovery | --- ### Agents | Agent | Description | |-------|-------------| | `data-extraction-architect` | Plans which anakin-cli commands to use for complex extraction tasks | --- ### Usage Once the plugin is installed, Cursor's AI agent will automatically use Anakin for web scraping and search tasks. You can also reference the skills directly in your prompts. 
--- ### Configuration | Variable | Description | |----------|-------------| | `ANAKIN_API_KEY` | API key (env var, takes precedence over config file) | | `~/.anakin/config.json` | Stored API key (set via `anakin login`) | --- # OpenClaw (/docs/integrations/ide-plugins/openclaw) Skill for [OpenClaw](https://openclaw.ai) that gives your AI agent web scraping, batch scraping, AI search, and autonomous research capabilities. Available on [ClawHub](https://clawhub.ai/Viraal-Bambori/anakin). | | | |---|---| | **Marketplace** | [ClawHub](https://clawhub.ai/Viraal-Bambori/anakin) | | **Platform** | [OpenClaw](https://openclaw.ai) | | **Version** | 1.0.0 | | **Security** | VirusTotal Benign, OpenClaw Benign (high confidence) | | **Requirements** | `anakin` binary, `ANAKIN_API_KEY` env var | --- ### Setup #### 1. Install the skill ```bash clawhub install anakin ``` This downloads the skill and its dependencies (including `anakin-cli`) into your `./skills` directory automatically. Or browse and download directly from [ClawHub](https://clawhub.ai/Viraal-Bambori/anakin). #### 2. Configure your API key 1. Sign up at [anakin.io/signup](/signup) and get your API key from the [Dashboard](/dashboard) 2. Set the `ANAKIN_API_KEY` environment variable in your OpenClaw config #### 3. Restart OpenClaw Start a new OpenClaw session so it picks up the installed skill. --- ### What the agent gets Once installed, your OpenClaw agent can autonomously: - **Scrape URLs** — Extract content from any web page as clean markdown, structured JSON, or raw HTML - **Batch scrape** — Scrape up to 10 URLs in parallel with a single call - **AI search** — Run intelligent web searches with citations and relevance scoring - **Deep research** — Autonomous multi-source research that synthesizes comprehensive reports (1–5 minutes) The agent decides which capability to use based on the user's request. The skill's `SKILL.md` includes a decision guide so the agent picks the right tool automatically. 
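The skill drives the `anakin` CLI from the agent's working directory, and the same sequence (create a `.anakin/` output folder, run a command, read the output) can be reproduced manually for debugging. A rough sketch; subcommand names like `scrape` and `search` come from the CLI, but flags are omitted here since they are not documented on this page (check `anakin --help`):

```python
import pathlib
import subprocess

def ensure_output_dir(workdir: str = ".") -> pathlib.Path:
    """Create the .anakin/ output folder the skill writes results into."""
    out = pathlib.Path(workdir) / ".anakin"
    out.mkdir(exist_ok=True)
    return out

def run_anakin(subcommand: str, *args: str, workdir: str = ".") -> pathlib.Path:
    """Run an anakin CLI command and return the folder to read results from."""
    out = ensure_output_dir(workdir)
    subprocess.run(["anakin", subcommand, *args], check=True, cwd=workdir)
    return out

# Illustrative only; confirm argument names with `anakin --help`:
# run_anakin("scrape", "https://example.com")
```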
--- ### How it works The skill is a `SKILL.md` file that instructs the OpenClaw agent how to use the `anakin` CLI. When a user asks the agent to scrape a website, search the web, or research a topic, the agent: 1. Creates a `.anakin/` output folder in the working directory 2. Runs the appropriate `anakin` CLI command 3. Reads the output file and summarizes the results All output is saved to files (never floods the agent's context), and the agent handles errors like rate limits and timeouts automatically. --- ### Security The skill has been scanned and verified: | Scanner | Status | Detail | |---------|--------|--------| | VirusTotal | Benign | [View report](https://www.virustotal.com/gui/file/d23ce9f141cfc820b1031d3e7664da171b75709bb58517ab6a122562a4cc5b39) | | OpenClaw | Benign | High confidence — only requires `anakin` binary and `ANAKIN_API_KEY` | The skill does not request unrelated credentials, system files, or hidden endpoints. It only uses the `anakin` CLI and a single API key. --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # Dify (/docs/integrations/workflow/dify) Web scraping and AI-powered search plugin for Dify. Extract data from any website, perform intelligent web searches, and conduct deep research — all inside your Dify workflows and agents. 
| | | |---|---| | **Marketplace** | [Dify Plugin Store](https://marketplace.dify.ai/plugin/anakin/anakin) | | **Source** | [GitHub](https://github.com/Anakin-Inc/anakin-dify-plugins/tree/main/anakin) | | **Type** | Tool Plugin | | **Version** | 0.0.1 | | **Tools** | 5 | --- ### Key features - **Anti-detection** — Proxy routing across 207 countries prevents blocking - **Intelligent Caching** — Up to 30x faster on repeated requests - **AI Extraction** — Convert any webpage into structured JSON - **Browser Automation** — Full headless Chrome support for SPAs and JS-heavy sites - **Session Management** — Authenticated scraping with encrypted session storage (AES-256-GCM) - **Batch Processing** — Submit multiple URLs in a single request --- ### Setup #### 1. Get your API key 1. Sign up at [anakin.io/signup](/signup) 2. Go to your [Dashboard](/dashboard) 3. Copy your API key (starts with `ask_`) #### 2. Install in Dify 1. Install the Anakin plugin in your Dify workspace from the [Plugin Store](https://marketplace.dify.ai/plugin/anakin/anakin) 2. Go to **Plugins** > **Anakin** > **Configure** 3. Enter a name for the authorization (e.g., "Production") 4. Paste your API key 5. Click **Save** --- ### Tools #### 1. URL Scraper Scrapes a single URL, returning HTML, markdown, and optionally structured JSON. | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `url` | string | Yes | — | Target URL to scrape (HTTP/HTTPS) | | `country` | string | No | `us` | Proxy location from 207 countries | | `use_browser` | boolean | No | `false` | Enable headless Chrome for JavaScript-heavy sites | | `generate_json` | boolean | No | `false` | Use AI to extract structured data | | `session_id` | string | No | — | Browser session ID for authenticated pages | **Response includes:** Raw HTML, cleaned HTML, markdown conversion, structured JSON (if `generate_json` enabled), cache status, timing metrics. --- #### 2. 
Batch URL Scraper Scrape up to 10 URLs simultaneously in parallel. | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `urls` | string | Yes | — | Comma-separated list of URLs (1–10) | | `country` | string | No | `us` | Proxy location from 207 countries | | `use_browser` | boolean | No | `false` | Enable headless Chrome for JavaScript-heavy sites | | `generate_json` | boolean | No | `false` | Use AI to extract structured data from each page | --- #### 3. AI Search Synchronous AI-powered web search returning results with citations and relevance scoring. Results are returned immediately without polling. | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `prompt` | string | Yes | — | Search query or question | | `limit` | number | No | `5` | Maximum results to return | **Response includes:** Array of results with URLs, titles, snippets, publication dates, last updated timestamps. --- #### 4. Deep Research (Agentic Search) Multi-stage automated research pipeline combining search, scraping, and AI synthesis. Takes 1–5 minutes. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `prompt` | string | Yes | Research question or topic | **Response includes:** AI-generated comprehensive answers, summaries, structured findings, citations with source URLs, scraped source data, processing metrics. --- #### 5. Custom Web Scraper Execute pre-configured scraper templates for domain-specific structured data extraction. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `url` | string | Yes | Target URL to scrape | | `scraper_code` | string | Yes | Configuration identifier | | `scraper_params` | string | No | JSON string of scraper-specific parameters | **Response:** Structured JSON matching the scraper's defined schema. --- ### Examples #### In a Workflow 1. 
Add a **Tool** node to your workflow 2. Select **Anakin** and choose your tool 3. Configure parameters (e.g., enter URL, enable `generate_json`) 4. Connect to the next node for processing #### In an Agent 1. Create an Agent app 2. Add Anakin tools to the agent's toolset 3. The agent will automatically use scraping/search based on user queries #### Scraping with AI extraction ``` Tool: URL Scraper URL: https://example.com/products Generate JSON: true ``` Returns structured product data automatically extracted by AI. #### Authenticated scraping ``` Tool: URL Scraper URL: https://example.com/dashboard Session ID: your-session-id-from-dashboard Use Browser: true ``` Scrapes pages that require login using your saved browser session. Learn more about [Browser Sessions](/docs/api-reference/browser-sessions). --- ### Processing times | Tool | Type | Typical Duration | |------|------|------------------| | URL Scraper | Async | 3–15 seconds | | Batch Scraper | Async | 5–30 seconds | | AI Search | **Sync** | Immediate | | Deep Research | Async | 1–5 minutes | | Custom Scraper | Async | 3–15 seconds | --- ### Troubleshooting | Code | Meaning | Action | |------|---------|--------| | 400 | Invalid parameters | Check your input | | 401 | Invalid API key | Verify your API key in plugin settings | | 402 | Plan upgrade required | Upgrade at [Pricing](/docs/documentation/pricing) | | 404 | Job not found | Job may have expired | | 429 | Rate limit exceeded | Wait and retry | | 5xx | Server error | Retry with backoff | --- ### Country codes Proxy routing supports [207 countries](/docs/api-reference/supported-countries). Common codes: | Code | Country | |------|---------| | `us` | United States (default) | | `gb` | United Kingdom | | `de` | Germany | | `fr` | France | | `jp` | Japan | | `au` | Australia | --- # Make (/docs/integrations/workflow/make) Official verified integration for Make (formerly Integromat) with native modules for web scraping, AI search, and deep research. 
| | | |---|---| | **Marketplace** | [make.com/integrations/anakin](https://www.make.com/en/integrations/anakin) | | **Status** | Verified, Official Vendor | | **Modules** | 4 (3 Actions, 1 Search) | --- ### Setup #### 1. Get your API key Sign up at [anakin.io/signup](/signup) and get your API key from the [Dashboard](/dashboard). #### 2. Add the Anakin module 1. In your Make scenario, click **+** to add a module 2. Search for **Anakin** 3. Select the module you need (UniversalDataExtractor, Search, etc.) #### 3. Connect your account 1. Click **Create a connection** 2. Enter your API key 3. Click **Save** --- ### Modules #### UniversalDataExtractor (Action) Extract data from any website. Submits a scrape job and returns the results including HTML, markdown, and structured data. Maps to the [URL Scraper API](/docs/api-reference/url-scraper). --- #### DataPoller (Action) Fetch the results for a previously submitted job using its job ID. Use this after **UniversalDataExtractor** or **AgenticSearch** if you need to poll for results separately. Maps to the scrape job status endpoint (`GET /v1/request/{id}`) or agentic search status endpoint (`GET /v1/agentic-search/{id}`). --- #### AgenticSearch (Action) Start an advanced AI search job that searches web sources and extracts structured data. Returns a job ID to check results later. Takes 1–5 minutes. Maps to the [Agentic Search API](/docs/api-reference/agentic-search). --- #### Search (Search) Perform an AI-powered web search. Returns instant results with answers, citations, and sources. This is synchronous — no polling needed. Maps to the [Search API](/docs/api-reference/search). --- ### Examples #### Scrape a URL and save to Google Sheets ``` Schedule → Anakin (UniversalDataExtractor) → Google Sheets (Add Row) ``` 1. **Schedule** — trigger on a daily/hourly schedule 2. **Anakin UniversalDataExtractor** — enter the URL to scrape 3. 
**Google Sheets** — map the markdown or structured data to columns #### Research and email a report ``` Webhook → Anakin (AgenticSearch) → Delay → Anakin (DataPoller) → Gmail (Send Email) ``` 1. **Webhook** — receive a research topic 2. **Anakin AgenticSearch** — start deep research, get a job ID 3. **Delay** — wait 3–5 minutes for research to complete 4. **Anakin DataPoller** — fetch the completed results using the job ID 5. **Gmail** — send the research report #### AI search to Notion ``` Webhook → Anakin (Search) → Notion (Create Page) ``` 1. **Webhook** — receive a search query 2. **Anakin Search** — get instant AI-powered results 3. **Notion** — create a page with the search results #### Popular connections Google Sheets, OpenAI, Gmail, Google Drive, Telegram Bot, Airtable, Notion, Google Docs, Slack, Shopify, HubSpot CRM, WordPress, and [more](https://www.make.com/en/integrations/anakin). --- ### Troubleshooting | Issue | Fix | |-------|-----| | Connection failed | Verify your API key is correct and has credits | | Scrape returns empty | Try enabling browser mode if the option is available, or check if the URL is accessible | | Agentic search still processing | Add a **Delay** module (3–5 min) before polling with DataPoller | | Rate limit error (429) | Add a delay between requests or reduce scenario frequency | --- # n8n (/docs/integrations/workflow/n8n) | | | |---|---| | **Platform** | [n8n](https://n8n.io) | | **Type** | Community Node | --- ### What to expect The n8n community node will provide: - **Scrape URL** — Extract content and structured data from any website with automatic polling - **AI Search** — Synchronous AI-powered web search with instant results - **Agentic Search** — Multi-stage deep research pipeline with structured data extraction --- ### In the meantime You can integrate AnakinScraper into your n8n workflows today using n8n's **HTTP Request** node with the [REST API](/docs/api-reference), or use the [Anakin CLI](/docs/sdks/cli) via the 
**Execute Command** node. --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # Zapier (/docs/integrations/workflow/zapier) Extract structured data from websites and perform AI-powered searches — connected to 8,000+ apps on Zapier. | | | |---|---| | **Marketplace** | [zapier.com/apps/anakin](https://zapier.com/apps/anakin/integrations) | | **Category** | AI Agents | | **Status** | Beta | --- ### Setup #### 1. Create a new Zap Go to [zapier.com/app/zaps](https://zapier.com/app/zaps) and click **Create Zap**. #### 2. Add a trigger Choose any trigger that produces data you want to scrape or research — Webhooks by Zapier, Schedule, Gmail, Slack, or any of 8,000+ apps. #### 3. Add an Anakin action 1. Click **+** to add an action 2. Search for **Anakin** 3. Choose your action (Extract Website Data, Perform AI Search, etc.) 4. Click **Sign in** and enter your API key (get one from the [Dashboard](/dashboard)) 5. Map fields from your trigger 6. Click **Test action**, then **Publish** --- ### Actions #### Extract Website Data Extracts structured data from any website including HTML, markdown, and generated JSON. Submits a scrape job and polls until complete. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `url` | string | Yes | — | The full URL of the website to extract data from | | `country` | string | No | `us` | Country code for proxy routing (e.g., `us`, `gb`, `de`). 
See [supported countries](/docs/api-reference/supported-countries) | | `forceFresh` | boolean | No | `false` | Bypass cache and force fresh data extraction | | `maxWaitTime` | integer | No | `300` | Maximum seconds to wait for the job to complete | | `pollInterval` | integer | No | `3` | Seconds between status checks | **Output fields:** | Field | Type | Description | |-------|------|-------------| | `url` | string | The URL that was scraped | | `status` | string | Job status (`completed` or `failed`) | | `result` | text | Raw HTML content | | `markdown` | text | Clean markdown version of the page | | `generatedJson` | string | AI-extracted structured data | | `cached` | boolean | Whether the result came from cache | | `success` | boolean | Success flag | | `durationMs` | number | Processing time in milliseconds | --- #### Perform AI Search AI-powered search using Perplexity. Returns instant answers with citations and sources. This is synchronous — results are returned immediately. | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `searchQuery` | text | Yes | — | The search query or question (e.g., "What are the latest trends in AI?") | | `maxResults` | integer | No | `5` | Maximum number of search results to return | --- #### Start Agentic Search Starts an advanced AI search job that searches web sources and extracts structured data. Returns a job ID to check results later. Takes 1–5 minutes. | Field | Type | Required | Description | |-------|------|----------|-------------| | `searchPrompt` | text | Yes | The research question or topic | | `useBrowser` | boolean | No | Use headless browser for more reliable scraping | **Output:** Returns a `job_id` to use with **Get Agentic Search Results**. --- #### Get Agentic Search Results Fetches current status and results for an agentic search job. Returns immediately with the current state (`processing`, `completed`, or `failed`). 
| Field | Type | Required | Description | |-------|------|----------|-------------| | `jobId` | string | Yes | The job ID from the Start Agentic Search action | --- ### Examples #### Scrape a URL from a webhook 1. **Trigger:** Webhooks by Zapier — Catch Hook 2. **Action:** Anakin — Extract Website Data (map webhook URL to `url`) 3. **Action:** Google Sheets — Create Row (save markdown and structured data) Send data to your webhook: ```bash curl -X POST YOUR_WEBHOOK_URL \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com"}' ``` #### Daily product monitoring 1. **Trigger:** Schedule by Zapier — Every Day 2. **Action:** Anakin — Extract Website Data (hardcode the product URL) 3. **Action:** Google Sheets — Create Row (append price data) 4. **Action:** Filter — Only continue if price changed 5. **Action:** Slack — Send Channel Message #### Research pipeline 1. **Trigger:** Webhook (receives a research topic) 2. **Action:** Anakin — Start Agentic Search 3. **Action:** Delay — Wait 3 minutes 4. **Action:** Anakin — Get Agentic Search Results 5. **Action:** Gmail — Send Email with the report #### AI search to Airtable 1. **Trigger:** Airtable — New Record (with a search query field) 2. **Action:** Anakin — Perform AI Search (map the query field) 3. 
**Action:** Airtable — Update Record (write results back) #### More ideas - **Price monitoring** — Scrape product pages on a schedule, save to Google Sheets, alert on changes - **Lead enrichment** — Scrape company websites from a CRM trigger, extract structured data - **News aggregation** — Scrape articles from RSS URLs, save cleaned markdown to Notion - **Competitor analysis** — Monitor competitor pricing pages, compare with previous data - **Market research** — Run agentic search on topics, email structured reports --- ### Troubleshooting | Issue | Fix | |-------|-----| | Scrape returns empty | Try setting `forceFresh` to true to bypass cache | | Timeout | Increase `maxWaitTime` (default is 300 seconds) | | Auth error | Re-enter your API key in the Zapier connection settings | | Agentic search still processing | Use a **Delay** action (3–5 min) between Start and Get Results | | Geo-restricted content | Set `country` to match the target site's region | --- # SDKs & CLI (/docs/sdks) Official client libraries and command-line tools for interacting with the AnakinScraper API. --- ### CLI | Tool | Status | Description | |------|--------|-------------| | [Anakin CLI](/docs/sdks/cli) | Available | Scrape, search, and research from your terminal | --- ### SDKs | SDK | Status | Description | |-----|--------|-------------| | [Python SDK](/docs/sdks/python) | Coming Soon | Python client library for the AnakinScraper API | | [Node.js SDK](/docs/sdks/node) | Coming Soon | TypeScript/JavaScript client library for the AnakinScraper API | --- # Anakin CLI (/docs/sdks/cli) Scrape websites, search the web, and run deep research — all from your terminal. | | | |---|---| | **Latest version** | 0.1.0 | | **License** | MIT | | **Python** | 3.10+ | | **PyPI** | [anakin-cli](https://pypi.org/project/anakin-cli/) | | **Source** | [GitHub](https://github.com/Anakin-Inc/anakin-cli) | ```bash pip install anakin-cli ``` --- ### Prerequisites Before installing, make sure you have: 1. 
**Python 3.10 or higher** — check with `python --version` 2. **pip** — Python's package manager (included with Python 3.10+) 3. **An API key** — get one from the [Dashboard](/dashboard). If you don't have an account, [sign up here](/signup) --- ### Installation ```bash pip install anakin-cli ``` ```bash # Install in an isolated environment (recommended for CLI tools) pipx install anakin-cli ``` Verify the installation: ```bash anakin status ``` To upgrade to the latest version: ```bash pip install --upgrade anakin-cli ``` --- ### Authentication Set up your API key so the CLI can make requests on your behalf. **Login command (recommended)** ```bash anakin login --api-key "ak-your-key-here" ``` This saves your key locally. You only need to do this once. **Environment variable** ```bash export ANAKIN_API_KEY="ak-your-key-here" ``` **Interactive prompt** If no key is configured, the CLI will prompt you to enter one. --- ### Quick start ```bash # Scrape a page to markdown anakin scrape "https://example.com" # Extract structured JSON with AI anakin scrape "https://example.com/product" --format json # Scrape a JS-heavy site with headless browser anakin scrape "https://example.com/spa" --browser # Batch scrape multiple URLs anakin scrape-batch "https://a.com" "https://b.com" "https://c.com" # AI-powered web search anakin search "python async best practices" # Deep research (takes 1–5 min) anakin research "comparison of web frameworks 2025" -o report.json ``` --- ### Commands overview | Command | Description | |---------|-------------| | [`anakin login`](/docs/sdks/cli/commands#login) | Save your API key locally | | [`anakin status`](/docs/sdks/cli/commands#status) | Check version and authentication status | | [`anakin scrape`](/docs/sdks/cli/commands#scrape) | Scrape a single URL to markdown, JSON, or raw | | [`anakin scrape-batch`](/docs/sdks/cli/commands#scrape-batch) | Scrape up to 10 URLs in parallel | | [`anakin search`](/docs/sdks/cli/commands#search) | AI-powered 
web search (instant results) | | [`anakin research`](/docs/sdks/cli/commands#research) | Deep multi-stage agentic research | See the full [Commands Reference](/docs/sdks/cli/commands) for all flags and options, or check out [Examples & Recipes](/docs/sdks/cli/examples) for real-world usage patterns. --- ### Output modes Every command that returns data supports the `-o` flag to write to a file. Without it, output goes to stdout. ```bash # Print to terminal anakin scrape "https://example.com" # Save to file anakin scrape "https://example.com" -o page.md ``` The `scrape` command also supports three output formats: | Format | Flag | What you get | |--------|------|-------------| | Markdown | `--format markdown` | Clean readable text (default) | | JSON | `--format json` | AI-extracted structured data | | Raw | `--format raw` | Full API response with HTML and metadata | --- ### Tips **Always quote URLs** containing `?`, `&`, or `#` — shells interpret these as special characters: ```bash # Wrong — zsh will fail anakin scrape https://example.com/page?id=123 # Correct anakin scrape "https://example.com/page?id=123" ``` **Piping works cleanly** because all progress messages go to stderr: ```bash anakin scrape "https://example.com" --format json | jq '.title' ``` **Use `--browser`** for JavaScript-heavy sites, SPAs, and dynamically loaded content. **Use `--country`** to route requests through a specific country's proxy. See all [207 supported countries](/docs/api-reference/supported-countries). --- ### Support - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io - **PyPI** — [pypi.org/project/anakin-cli](https://pypi.org/project/anakin-cli/) --- # CLI Commands (/docs/sdks/cli/commands) All available commands for the Anakin CLI. --- ### login Save your API key for future sessions. 
```bash anakin login --api-key "ak-your-key-here" ``` | Flag | Description | |------|-------------| | `--api-key` | Your AnakinScraper API key | --- ### status Check the CLI version and whether you're authenticated. ```bash anakin status ``` --- ### scrape Scrape a single URL. Returns clean markdown by default. ```bash anakin scrape "https://example.com" ``` | Flag | Type | Description | Default | |------|------|-------------|---------| | `--format` | string | Output format: `markdown`, `json`, or `raw` | `markdown` | | `--browser` | flag | Use headless browser for JS-heavy sites | off | | `--country` | string | Two-letter country code for geo-located scraping | `us` | | `--session-id` | string | Browser session ID for authenticated scraping | — | | `--timeout` | number | Polling timeout in seconds | `120` | | `-o, --output` | string | Save output to a file instead of stdout | stdout | #### Output formats | Format | What you get | Best for | |--------|-------------|----------| | `markdown` | Clean readable page text | Reading, LLM context | | `json` | AI-extracted structured data | Data pipelines | | `raw` | Full API response (HTML, metadata, everything) | Debugging | #### Examples ```bash # Default markdown output anakin scrape "https://example.com" # Save to file anakin scrape "https://example.com" -o page.md # Extract structured JSON anakin scrape "https://example.com/product" --format json -o product.json # Full raw API response anakin scrape "https://example.com" --format raw -o debug.json # JavaScript-heavy site anakin scrape "https://example.com/spa" --browser # Scrape from the UK anakin scrape "https://example.com" --country gb # Longer timeout for slow sites anakin scrape "https://example.com" --timeout 300 # Authenticated scraping with a saved browser session anakin scrape "https://example.com/dashboard" --session-id "session_abc123" ``` --- ### scrape-batch Scrape up to 10 URLs simultaneously. All URLs are processed in parallel. 
```bash
anakin scrape-batch "https://a.com" "https://b.com" "https://c.com"
```

| Flag | Type | Description | Default |
|------|------|-------------|---------|
| `-o, --output` | string | Save output to a file | stdout |

#### Examples

```bash
# Scrape 3 URLs
anakin scrape-batch "https://a.com" "https://b.com" "https://c.com"

# Save batch results to file
anakin scrape-batch "https://a.com" "https://b.com" -o results.json
```

---

### search

AI-powered web search. Returns results instantly (synchronous).

```bash
anakin search "your search query"
```

| Flag | Type | Description | Default |
|------|------|-------------|---------|
| `-o, --output` | string | Save output to a file | stdout |

#### Examples

```bash
# Search the web
anakin search "python async best practices"

# Save search results
anakin search "best web scraping tools 2025" -o results.json

# Pipe to jq
anakin search "latest AI news" | jq '.results[0]'
```

---

### research

Deep agentic research. Runs a multi-stage pipeline: query refinement, web search, citation scraping, and AI synthesis. Takes 1–5 minutes.

```bash
anakin research "your research topic"
```

| Flag | Type | Description | Default |
|------|------|-------------|---------|
| `-o, --output` | string | Save output to a file | stdout |

#### Examples

```bash
# Run deep research
anakin research "comparison of web frameworks 2025"

# Save research report
anakin research "quantum computing industry trends" -o report.json
```

---

### Error handling

The CLI provides clear error messages:

| Error | Code | Fix |
|-------|------|-----|
| Authentication failed | 401 | Run `anakin login --api-key "ak-xxx"` |
| Plan upgrade required | 402 | Visit [Pricing](/docs/documentation/pricing) |
| Rate limit exceeded | 429 | Wait a few seconds and retry |
| Job timed out | — | Increase with `--timeout 300` |
| Job failed | — | Check if the URL is accessible |

**Exit codes:** `0` for success, `1` for any error.
--- # CLI Examples (/docs/sdks/cli/examples) Real-world usage patterns for the Anakin CLI. --- ### Scrape a page to markdown The most common use case — get clean, readable content from any URL: ```bash anakin scrape "https://docs.python.org/3/tutorial/index.html" -o tutorial.md ``` --- ### Extract structured data with AI Use `--format json` to get AI-extracted structured data instead of raw text: ```bash anakin scrape "https://amazon.com/dp/B0EXAMPLE" --format json -o product.json ``` The AI analyzes the page and returns structured fields like title, price, description, etc. --- ### Scrape a JavaScript-heavy site For SPAs, React/Next.js sites, or pages with dynamically loaded content: ```bash anakin scrape "https://app.example.com/dashboard" --browser ``` The `--browser` flag launches a headless browser to render JavaScript before extracting content. --- ### Batch scrape multiple pages Scrape up to 10 URLs in a single command. All are processed in parallel: ```bash anakin scrape-batch \ "https://example.com/page-1" \ "https://example.com/page-2" \ "https://example.com/page-3" \ "https://example.com/page-4" \ "https://example.com/page-5" \ -o pages.json ``` --- ### Scrape from a specific country Route your request through a proxy in a specific country. Useful for geo-restricted content: ```bash # Scrape from the UK anakin scrape "https://example.co.uk/deals" --country gb # Scrape from Japan anakin scrape "https://example.jp/products" --country jp ``` See the full list of [207 supported countries](/docs/api-reference/supported-countries). --- ### Scrape authenticated pages Use a saved browser session to scrape pages that require login: ```bash # First, create a session from the dashboard at anakin.io/dashboard # Then use the session ID: anakin scrape "https://example.com/my-account" --session-id "session_abc123" ``` Learn more about [Browser Sessions](/docs/api-reference/browser-sessions). 
---

### Pipe output to other tools

Progress messages go to stderr, so piping works cleanly:

```bash
# Extract a specific field with jq
anakin scrape "https://example.com" --format json | jq '.title'

# Count words in scraped markdown
anakin scrape "https://example.com" | wc -w

# Feed into another script
anakin search "latest AI papers" | python process_results.py
```

---

### Research a topic

Run deep multi-stage research that scrapes and synthesizes multiple sources:

```bash
anakin research "best practices for web scraping in 2025" -o research.json
```

This takes 1–5 minutes and produces a comprehensive report with citations.

---

### Use in shell scripts

```bash
#!/bin/bash
# scrape-urls.sh — Scrape a list of URLs from a file
while IFS= read -r url; do
  # Derive a filename from the URL: strip the scheme, replace slashes
  filename=$(echo "$url" | sed -E 's|https?://||; s|/|_|g').md
  echo "Scraping: $url -> $filename" >&2
  anakin scrape "$url" -o "$filename"
done < urls.txt
```

---

### Debug a failing scrape

Use `--format raw` to see the full API response including headers, status codes, and error details:

```bash
anakin scrape "https://example.com" --format raw -o debug.json
```

If the default HTTP handler fails, retry with `--browser` to render the page in a headless browser:

```bash
anakin scrape "https://example.com" --browser --format raw -o debug.json
```

---

# Node.js SDK (/docs/sdks/node)

| | |
|---|---|
| **Language** | TypeScript / JavaScript |
| **Package** | `anakin-sdk` (planned) |
| **Runtime** | Node.js 18+ |
| **License** | MIT |

---

### What to expect

The Node.js SDK will provide a typed client for the AnakinScraper API:

- **Scrape** — Scrape any URL and get back clean markdown, HTML, or AI-extracted structured JSON
- **Batch Scrape** — Scrape up to 10 URLs in parallel with a single call
- **Search** — AI-powered web search with instant results and citations
- **Research** — Deep agentic research that synthesizes multiple sources
- **TypeScript-first** — Full type definitions for all request and response objects
- **Automatic polling** —
Submit async jobs and get results back without manual polling --- ### In the meantime You can use the [REST API](/docs/api-reference) directly with `fetch` or `axios`. --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io --- # Python SDK (/docs/sdks/python) | | | |---|---| | **Language** | Python 3.10+ | | **Package** | `anakin-sdk` (planned) | | **License** | MIT | --- ### What to expect The Python SDK will provide a native client for the AnakinScraper API: - **Scrape** — Scrape any URL and get back clean markdown, HTML, or AI-extracted structured JSON - **Batch Scrape** — Scrape up to 10 URLs in parallel with a single call - **Search** — AI-powered web search with instant results and citations - **Research** — Deep agentic research that synthesizes multiple sources - **Async support** — Full `asyncio` support for high-throughput applications - **Automatic polling** — Submit async jobs and get results back without manual polling --- ### In the meantime You can use the [REST API](/docs/api-reference) directly with `requests` or `httpx`, or use the [Anakin CLI](/docs/sdks/cli) which is already available as a Python package: ```bash pip install anakin-cli ``` --- ### Stay updated - **Discord** — [discord.gg/gP2YCJKH](https://discord.gg/gP2YCJKH) - **Email** — support@anakin.io