Farcaster Hub Scraper
Protocol-native Farcaster data ingestion for research, analytics, and social graph analysis. Collect casts, reactions, follows, user profiles, and real-time events directly from Farcaster Hubs via HTTP API.
Features
✅ Protocol-First Design - Direct Hub HTTP API integration (no third-party dependencies)
✅ Three Ingestion Modes - Deterministic backfill by FIDs, time-bounded studies, or incremental event tailing
✅ Comprehensive Data - Casts, reactions (likes/recasts), follows, user profiles, and events
✅ Optional Enrichment - Parse Frames/Mini-Apps metadata from embedded URLs
✅ State Checkpointing - Migration-safe, resumable runs with automatic state persistence
✅ Rate Limiting & Retries - Production-grade reliability with exponential backoff
✅ Neynar v2 Support - Optional integration with Neynar hosted hubs
✅ Multiple Views - Pre-configured dataset views for easy data exploration
Who Uses This Actor?
🎯 Target Users
📊 Web3 Data Analysts & Researchers (Dune, Flipside)
- Export Farcaster data to SQL databases for analytics dashboards
- Track protocol growth, user engagement trends, and network effects
- Cross-reference social data with onchain transactions
🛠️ Farcaster Frame/Mini-App Developers
- Monitor Frame engagement and interaction patterns
- Track which users interact with your Mini-Apps
- Analyze viral content and user acquisition funnels
📢 Web3 Marketing Agencies & Brands
- Track influencer campaigns and brand mentions
- Measure content reach and engagement rates
- Identify key opinion leaders in the Farcaster ecosystem
🎓 Academic Researchers
- Study decentralized social network dynamics
- Analyze information diffusion and community formation
- Research Web3 social graph topology
Use Cases by Persona
📊 For Data Analysts
Influencer Ranking Dashboard
{
"mode": "byFids",
"fids": [2, 3, 6833, 5650, 7890],
"include": {"casts": true, "reactions": true, "userData": true},
"maxRecords": 50000
}
→ Export to Dune to calculate engagement rates, follower growth, content velocity
Protocol Growth Metrics
{
"mode": "tailEvents",
"maxRecords": 100000
}
→ Stream all events to track daily active users, network growth, retention
🛠️ For Frame Developers
Frame Interaction Analysis
{
"mode": "byFids",
"fids": [list of users who interacted],
"include": {"casts": true, "reactions": true},
"fetchEmbeds": true
}
→ Identify which casts contain your Frame, track engagement patterns
Real-Time Frame Monitoring
{
"mode": "tailEvents",
"tail": {"fromEventId": "latest"},
"maxRecords": 10000
}
→ Get notified when users interact with your Frames in real-time
📢 For Marketing Agencies
Campaign Performance Tracking
{
"mode": "byFids",
"fids": [brand_account, influencer1, influencer2],
"startTimestamp": 130000000,
"stopTimestamp": 130100000,
"include": {"casts": true, "reactions": true}
}
→ Measure campaign reach during specific time window
Influencer Discovery
{
"mode": "byFids",
"fids": [competitor_followers],
"include": {"links": true, "userData": true, "reactions": true}
}
→ Find high-engagement users in target communities
🎓 For Researchers
Social Network Topology Study
{
"mode": "byFids",
"discoverFids": true,
"shardIds": [0, 1, 2],
"include": {"links": true, "userData": true},
"maxRecords": 500000
}
→ Build complete follow graph for network analysis
Information Diffusion Analysis
{
"mode": "byTime",
"fids": [seed_users],
"startTimestamp": 100000000,
"stopTimestamp": 100500000,
"include": {"casts": true, "reactions": true}
}
→ Track how content spreads through the network over time
Quick Start
Basic Example: Backfill by FIDs
{
"hubBaseUrl": "https://hub.pinata.cloud",
"mode": "byFids",
"fids": [2, 3, 6833],
"include": {
"casts": true,
"reactions": true,
"links": true,
"userData": true
},
"pageSize": 1000,
"maxRecords": 10000
}
Time-Bounded Study
{
"hubBaseUrl": "https://hub.pinata.cloud",
"mode": "byTime",
"fids": [2, 3],
"startTimestamp": 100000000,
"stopTimestamp": 100050000,
"include": {
"casts": true,
"reactions": true
}
}
Real-Time Event Tail
{
"hubBaseUrl": "https://hub.pinata.cloud",
"mode": "tailEvents",
"tail": {
"fromEventId": "0",
"shardIndex": 0
},
"maxRecords": 1000
}
Auto-Discover FIDs via Shard Scan
{
"hubBaseUrl": "https://hub.pinata.cloud",
"mode": "byFids",
"discoverFids": true,
"shardIds": [0, 1],
"include": {
"casts": true,
"userData": true
},
"maxRecords": 5000
}
With Frame/Mini-App Metadata Parsing
{
"hubBaseUrl": "https://hub.pinata.cloud",
"mode": "byFids",
"fids": [2],
"fetchEmbeds": true,
"maxEmbedsPerRun": 100,
"proxy": "RESIDENTIAL",
"include": {
"casts": true
}
}
Input Configuration
Required Fields
| Field | Type | Description | Default |
|---|---|---|---|
| `hubBaseUrl` | string | HTTP endpoint of a Farcaster Hub | `https://hub.pinata.cloud` |
| `mode` | enum | Ingestion mode: `byFids`, `byTime`, `tailEvents` | `byFids` |
Mode-Specific Fields
By FIDs Mode
| Field | Type | Description | Default |
|---|---|---|---|
| `fids` | array<integer> | List of Farcaster IDs to scrape | `[]` |
| `discoverFids` | boolean | Auto-discover FIDs via shard scan | `false` |
| `shardIds` | array<integer> | Shard IDs to scan when discovering | `[]` |
By Time Mode
| Field | Type | Description | Default |
|---|---|---|---|
| `fids` | array<integer> | FIDs to scrape (required) | `[]` |
| `startTimestamp` | integer | Start time (Farcaster epoch seconds) | - |
| `stopTimestamp` | integer | Stop time (Farcaster epoch seconds) | - |
Tail Events Mode
| Field | Type | Description | Default |
|---|---|---|---|
| `tail.fromEventId` | string | Start from event ID (empty = start from 0) | `"0"` |
| `tail.shardIndex` | integer | Shard index to tail (optional) | - |
Entity Filters
| Field | Type | Description | Default |
|---|---|---|---|
| `include.casts` | boolean | Include cast messages | `true` |
| `include.reactions` | boolean | Include reactions (likes/recasts) | `true` |
| `include.links` | boolean | Include follows | `true` |
| `include.userData` | boolean | Include user profiles | `true` |
Optional Features
| Field | Type | Description | Default |
|---|---|---|---|
| `fetchEmbeds` | boolean | Parse embedded URLs for Frames/Mini-Apps | `false` |
| `maxEmbedsPerRun` | integer | Max embeds to fetch per run | `500` |
| `neynarApiKey` | string | Neynar v2 API key (optional) | - |
| `clientApi` | boolean | Enable Farcaster Client API (experimental) | `false` |
| `proxy` | string | Apify Proxy groups or custom URL | - |
Performance & Limits
| Field | Type | Description | Default |
|---|---|---|---|
| `pageSize` | integer | Records per page (max 1000) | `1000` |
| `maxRecords` | integer | Stop after N records (safety limit) | - |
| `requestPerMinute` | integer | Rate limit for Hub API calls | `600` |
Output Schema
The actor produces normalized entities with the following types:
Cast Entity
{
"entity_type": "cast",
"fid": 2,
"hash": "0x1234567890abcdef",
"ts": 123456789,
"ts_iso": "2025-01-15T10:30:00.000Z",
"text": "Hello Farcaster!",
"mentions": [3, 6833],
"parent": {
"castId": { "fid": 2, "hash": "0xabc..." }
},
"embeds": {
"urls": ["https://example.com"],
"castIds": []
},
"derived": {
"urls": ["https://example.com"],
"frame_meta": {
"name": "My App",
"url": "https://app.example.com"
}
},
"ingest_source": "hub_http",
"ingest_ts": "2025-01-15T10:31:00.000Z",
"raw": { /* original Hub message */ }
}
Reaction Entity
{
"entity_type": "reaction",
"fid": 3,
"type": "like",
"target": {
"castId": { "fid": 2, "hash": "0x1234..." }
},
"ts": 123456790,
"ts_iso": "2025-01-15T10:31:00.000Z",
"hash": "0xabcd...",
"ingest_source": "hub_http",
"ingest_ts": "2025-01-15T10:32:00.000Z",
"raw": { /* original Hub message */ }
}
Link Entity (Follow)
{
"entity_type": "link",
"fid": 3,
"targetFid": 2,
"type": "follow",
"ts": 123456791,
"ts_iso": "2025-01-15T10:32:00.000Z",
"hash": "0xdef...",
"ingest_source": "hub_http",
"ingest_ts": "2025-01-15T10:33:00.000Z",
"raw": { /* original Hub message */ }
}
User Data Entity
{
"entity_type": "user_data",
"fid": 2,
"username": "vitalik.eth",
"display": "Vitalik",
"pfp": "https://example.com/pfp.png",
"bio": "Ethereum co-founder",
"url": "https://vitalik.ca",
"location": "Singapore",
"github": "vbuterin",
"twitter": "VitalikButerin",
"ts": 123456792,
"ts_iso": "2025-01-15T10:33:00.000Z",
"ingest_source": "hub_http",
"ingest_ts": "2025-01-15T10:34:00.000Z",
"raw": [ /* original Hub messages */ ]
}
Event Entity (Tail Mode)
{
"entity_type": "event",
"event_id": "12345",
"event_type": "MERGE_MESSAGE",
"ts": 123456793,
"ts_iso": "2025-01-15T10:34:00.000Z",
"shard_index": 0,
"message": { /* hydrated message if MERGE_MESSAGE */ },
"ingest_source": "hub_http",
"ingest_ts": "2025-01-15T10:35:00.000Z",
"raw": { /* original Hub event */ }
}
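For downstream processing in TypeScript, the entity shapes above can be summarized as a discriminated union. This is an illustrative sketch derived from the JSON samples, not an exhaustive contract:

```typescript
// Illustrative types mirroring the JSON samples above; fields are
// abbreviated, not a complete schema.
type EntityBase = {
  ts: number;          // Farcaster epoch seconds
  ts_iso: string;      // ISO 8601
  ingest_source: 'hub_http' | 'neynar_v2' | 'client_api';
  ingest_ts: string;
  raw: unknown;        // original Hub message(s) or event
};

type Entity = EntityBase & (
  | { entity_type: 'cast'; fid: number; hash: string; text: string }
  | { entity_type: 'reaction'; fid: number; type: 'like' | 'recast'; hash: string }
  | { entity_type: 'link'; fid: number; targetFid: number; type: 'follow'; hash: string }
  | { entity_type: 'user_data'; fid: number; username?: string }
  | { entity_type: 'event'; event_id: string; event_type: string }
);
```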
Farcaster Timestamps
Important: Farcaster uses a custom epoch starting at 2021-01-01T00:00:00.000Z.
- All entities include both `ts` (Farcaster epoch seconds) and `ts_iso` (ISO 8601) fields
- Use `ts_iso` for human-readable timestamps and data analysis
- Use `ts` for filtering Hub API requests
Example conversion:
- Farcaster epoch `100000000` = `2024-03-03T09:46:40.000Z`
- Current time: `isoToFarcasterEpoch(new Date().toISOString())`
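The conversion is plain offset arithmetic against the 2021-01-01 epoch. A minimal TypeScript sketch (the helper names mirror the docs but are illustrative, not the actor's internal API):

```typescript
// Farcaster epoch: 2021-01-01T00:00:00.000Z
const FARCASTER_EPOCH_MS = Date.UTC(2021, 0, 1);

function farcasterToIso(ts: number): string {
  // ts is seconds since the Farcaster epoch
  return new Date(FARCASTER_EPOCH_MS + ts * 1000).toISOString();
}

function isoToFarcasterEpoch(iso: string): number {
  return Math.floor((Date.parse(iso) - FARCASTER_EPOCH_MS) / 1000);
}

console.log(farcasterToIso(100_000_000)); // 2024-03-03T09:46:40.000Z
```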
Ingestion Modes Explained
Mode 1: By FIDs (Deterministic Backfill)
Use Case: Research specific users, backfill known accounts
How it works:
- For each FID in the input list (or discovered via shard scan):
- Fetch all casts with pagination
- Fetch all reactions (likes/recasts)
- Fetch all follows
- Fetch user profile data
- Maintains a checkpoint per FID (`lastTs`, `lastPageToken`) for resumable runs
- Optionally discovers FIDs by scanning the specified shards
Best for: User-centric analysis, follower studies, content backfills
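For reference, the per-FID backfill reduces to standard cursor pagination against the Hub HTTP API. A sketch assuming the usual `/v1/castsByFid` response shape (`messages` plus `nextPageToken`); error handling omitted:

```typescript
// Page through all casts for one FID; an empty nextPageToken ends the loop.
async function backfillCasts(hubBaseUrl: string, fid: number, pageSize = 1000) {
  const casts: unknown[] = [];
  let pageToken = '';
  do {
    const url = new URL('/v1/castsByFid', hubBaseUrl);
    url.searchParams.set('fid', String(fid));
    url.searchParams.set('pageSize', String(pageSize));
    if (pageToken) url.searchParams.set('pageToken', pageToken);

    const res = await fetch(url);
    if (!res.ok) throw new Error(`Hub returned ${res.status}`);
    const body = await res.json();

    casts.push(...(body.messages ?? []));
    pageToken = body.nextPageToken ?? ''; // empty token = last page
  } while (pageToken);
  return casts;
}
```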
Mode 2: By Time Window (Targeted Study)
Use Case: Time-bounded analysis (e.g., "all activity during an event")
How it works:
- For each FID, fetch only messages within `startTimestamp` to `stopTimestamp`
- Applies time filters to casts (Hub-native support)
- Filters reactions and links manually (Hub doesn't support time filters)
- Faster than full backfill when studying specific time periods
Best for: Event analysis, temporal studies, A/B testing
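Because the Hub has no native time filter for reactions and links, the client-side filtering step looks roughly like this, assuming messages carry `data.timestamp` in Farcaster epoch seconds (the standard Hub message shape):

```typescript
// Keep only messages whose Farcaster timestamp falls inside the window.
interface HubMessage {
  data?: { timestamp?: number };
}

function filterByWindow<T extends HubMessage>(
  messages: T[],
  startTimestamp: number,
  stopTimestamp: number,
): T[] {
  return messages.filter((m) => {
    const ts = m.data?.timestamp;
    return ts !== undefined && ts >= startTimestamp && ts <= stopTimestamp;
  });
}
```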
Mode 3: Tail Events (Near-Real-Time)
Use Case: Live monitoring, incremental ingestion
How it works:
- Polls `/v1/events` starting from `fromEventId` (or the last checkpoint)
- For `MERGE_MESSAGE` events, hydrates and pushes the message entity
- Updates the `lastEventId` checkpoint per shard
- Sleeps 5s between polls (configurable)
Important: Hubs prune events older than ~3 days. Run frequently (every 1-2 days) to avoid data loss.
Best for: Real-time dashboards, notifications, streaming pipelines
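Conceptually, the tail loop is cursor polling. A sketch assuming the Hub events endpoint accepts `from_event_id` and returns `events` plus a `nextPageEventId` cursor; exact field names may vary by Hub version:

```typescript
// Poll the events feed forever, advancing the cursor after each batch.
async function tailEvents(hubBaseUrl: string, fromEventId: string) {
  let cursor = fromEventId;
  for (;;) {
    const url = new URL('/v1/events', hubBaseUrl);
    url.searchParams.set('from_event_id', cursor);
    const res = await fetch(url);
    const body = await res.json();

    for (const event of body.events ?? []) {
      if (String(event.type).includes('MERGE_MESSAGE')) {
        // hydrate and push the message entity here
      }
    }
    cursor = body.nextPageEventId ?? cursor; // checkpoint per poll
    await new Promise((r) => setTimeout(r, 5_000)); // 5s between polls
  }
}
```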
Optional Features
Frame/Mini-App Metadata Parsing
When `fetchEmbeds: true`, the actor will:
- Extract all unique URLs from cast embeds
- Fetch each URL (up to the `maxEmbedsPerRun` limit)
- Parse `fc:miniapp:*` and `fc:frame:*` meta tags
- Enrich cast entities with a `derived.frame_meta` object
Use Proxy: Set the `proxy` field to avoid rate limits (e.g., `"RESIDENTIAL"` for Apify Proxy)
Performance: Adds ~2-5s per URL. Use `maxEmbedsPerRun` to cap crawling time.
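For context, the meta-tag extraction amounts to scanning fetched HTML for `fc:frame`/`fc:miniapp` tags. A tolerant regex sketch (a proper HTML parser is safer in production):

```typescript
// Collect fc:frame / fc:miniapp meta tags into a flat key-value map.
function parseFrameMeta(html: string): Record<string, string> {
  const meta: Record<string, string> = {};
  for (const tag of html.match(/<meta\s+[^>]*>/gi) ?? []) {
    const name = tag.match(/(?:name|property)=["']((?:fc:frame|fc:miniapp)[^"']*)["']/i);
    const content = tag.match(/content=["']([^"']*)["']/i);
    if (name && content) meta[name[1]] = content[1];
  }
  return meta;
}
```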
Neynar v2 Integration
Provide `neynarApiKey` to use Neynar's hosted Hub endpoints instead of direct Hub HTTP.
Benefits:
- Faster, managed infrastructure
- No self-hosted Hub required
- Additional features (v2 only; v1 EOL March 31, 2025)
Records flagged: All entities get `ingest_source: "neynar_v2"`
Client API (Experimental)
Set `clientApi: true` to enable Warpcast-specific endpoints (e.g., trending, channels).
Warning: Non-protocol data. Records are flagged as `ingest_source: "client_api"` to avoid confusion.
State Checkpointing & Resumability
The actor automatically persists state every 30 seconds and on Apify migration events:
- Per-FID checkpoints: `{ lastTs, lastPageToken }` for resuming mid-pagination
- Per-shard checkpoints: `{ lastEventId }` for event tail mode
- Migration-safe: Survives container restarts and platform migrations
To resume a run:
- Start the actor with the same input
- State is automatically restored
- Scraping continues from last checkpoint
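On Apify, this pattern is typically built on the default key-value store. A sketch using the Apify SDK; the `STATE` key and record shape here are illustrative, not the actor's exact internals:

```typescript
import { Actor } from 'apify';

await Actor.init();

type State = { fids: Record<string, { lastTs: number; lastPageToken: string }> };
const state: State = (await Actor.getValue<State>('STATE')) ?? { fids: {} };

const persist = () => Actor.setValue('STATE', state);
setInterval(persist, 30_000);    // periodic flush every 30 seconds
Actor.on('migrating', persist);  // flush before a platform migration

// ...update state.fids[fid] as pages complete, then:
await persist();
await Actor.exit();
```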
Performance Tips
- Use time filters: Narrow `startTimestamp`/`stopTimestamp` for faster runs
- Batch FIDs: Process related users together to share the dedup cache
- Tune `pageSize`: Larger pages (1000) mean fewer requests, but each request is slower
- Set `maxRecords`: A safety limit prevents runaway costs
- Monitor rate limits: The default 600 req/min is conservative; increase it if your Hub allows
- Schedule tail runs: Run every 1-2 days to avoid event pruning
Limitations & Best Practices
Hub Event Pruning
- Limitation: Hubs prune events older than ~3 days
- Best Practice: Schedule tail runs every 1-2 days for continuous ingestion
Reaction/Link Time Filters
- Limitation: Hub API doesn't support time filters for reactions/links
- Workaround: The actor fetches all records and filters them manually in `byTime` mode (slower)
Embed Fetching
- Limitation: Some URLs may be slow, dead, or behind auth
- Best Practice: Use the `maxEmbedsPerRun` cap and Apify Proxy to avoid timeouts
Rate Limiting
- Default: 600 req/min (conservative)
- Tuning: Increase `requestPerMinute` if your Hub supports higher rates
- Public Hubs: May have stricter limits; monitor 429 responses
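A common client-side pattern for handling 429s is exponential backoff with a cap; a minimal sketch (the actor's internal retry policy may differ):

```typescript
// Retry on 429 and 5xx with doubling delays, capped at 30 seconds.
async function fetchWithBackoff(url: string, maxRetries = 5): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429 && res.status < 500) return res;
    if (attempt >= maxRetries) throw new Error(`Giving up after ${attempt} retries`);
    const delayMs = Math.min(30_000, 1_000 * 2 ** attempt); // 1s, 2s, 4s, ...
    await new Promise((r) => setTimeout(r, delayMs));
  }
}
```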
Pricing & Compute
Approximate compute units (based on default settings):
| Run Type | Records | Compute Units | Notes |
|---|---|---|---|
| Small backfill | <10k | ~0.01 | 2-3 FIDs, no embeds |
| Medium backfill | 100k | ~0.5 | 10-20 FIDs, all entities |
| Large backfill | 1M | ~5 | 100+ FIDs or full shard scan |
| Tail (1 hour) | 1k events | ~0.005 | Near-real-time streaming |
| With embeds | +100 URLs | +0.02 per 100 | Crawlee overhead |
Formula: ~0.5 CU per 100k records (without embeds)
Example Use Cases
Social Graph Analysis
{
"mode": "byFids",
"fids": [2, 3, 6833, 5650],
"include": {
"links": true,
"userData": true
}
}
Output: Follow relationships + user profiles for network analysis
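To turn this output into an edge list for tools like Gephi or NetworkX, you can read the dataset with the Apify client and keep only `link` entities. A sketch; the dataset ID is a placeholder:

```typescript
import { ApifyClient } from 'apify-client';

// Build a source,target CSV of follow edges from the run's dataset.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const { items } = await client.dataset('<DATASET_ID>').listItems();

const edges = items
  .filter((i) => i.entity_type === 'link' && i.type === 'follow')
  .map((i) => `${i.fid},${i.targetFid}`);

console.log(['source,target', ...edges].join('\n'));
```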
Content Research
{
"mode": "byTime",
"fids": [2],
"startTimestamp": 100000000,
"stopTimestamp": 100050000,
"include": {
"casts": true,
"reactions": true
}
}
Output: All casts + reactions during a specific event
Real-Time Dashboard
{
"mode": "tailEvents",
"tail": { "fromEventId": "0" },
"maxRecords": 10000
}
Output: Live stream of all protocol events (schedule every hour)
Frame/Mini-App Catalog
{
"mode": "byFids",
"fids": [2, 3],
"fetchEmbeds": true,
"maxEmbedsPerRun": 200,
"include": {
"casts": true
}
}
Output: Casts with Frame/Mini-App metadata extracted
Troubleshooting
"Failed to connect to Hub"
- Verify `hubBaseUrl` is correct and accessible
- Check that the Hub is running and serving the HTTP API on port 3381
- Try a public Hub: `https://hub.pinata.cloud`
"No data returned"
- Verify FIDs exist and have activity
- Check that the time window isn't too narrow (`byTime` mode)
- Ensure `include.*` filters aren't excluding all data
"Max records limit reached"
- Increase `maxRecords` or remove the limit for a full backfill
- Use checkpointing to resume across multiple runs
"Rate limit errors (429)"
- Decrease `requestPerMinute`
- Use Neynar hosted Hub (better rate limits)
"Event tail missing data"
- Events pruned >3 days ago
- Schedule runs more frequently (every 1-2 days)
- Use `byFids` mode for historical backfill
Data Views
The actor provides pre-configured dataset views:
- Overview: All entities with key identifiers
- Casts: Cast content, timestamps, and URLs
- Reactions: Likes and recasts by FID
- Follows: Follow relationships (social graph edges)
- Users: User profiles and metadata
Access views in Apify Console → Dataset → Views tab
Support
- Email: kontakt@barrierefix.de
- Documentation: Farcaster Hub API Docs
- Issues: Report bugs or request features via email
Version History
- 1.0.0 (2025-01) - Initial release
- Three ingestion modes (byFids, byTime, tailEvents)
- Hub HTTP API integration
- State checkpointing
- Optional Frame/Mini-App parsing
- Neynar v2 support
🔗 Explore More of Our Actors
📰 Content & Publishing
| Actor | Description |
|---|---|
| Notion Marketplace Scraper | Scrape Notion templates and marketplace listings |
| Ghost Newsletter Scraper | Extract Ghost newsletter content and subscriber data |
| Google Play Reviews Scraper | Extract app reviews from Google Play Store |
💬 Social Media & Community
| Actor | Description |
|---|---|
| Reddit Scraper Pro | Monitor subreddits and track keywords with sentiment analysis |
| Discord Scraper Pro | Extract Discord messages and chat history for community insights |
| YouTube Comments Harvester | Comprehensive YouTube comments scraper with channel-wide enumeration |
| YouTube Contact Scraper | Extract YouTube channel contact information for outreach |
| YouTube Shorts Scraper | Scrape YouTube Shorts for viral content research |
License
MIT License - Free for commercial and non-commercial use