mckaywrigley/clarity-ai
Add the ability for this repository's agent to fetch and extract clean, readable content from the web pages/URLs it encounters, so it can use that content in its workflow. Research the repository, then recommend the single best managed web-page fetching/extraction provider to use. Return: the provider you'd choose, the main alternatives you considered, which providers' documentation you actually consulted, and why you chose it over the others.
P0
5a33db140d253f47da3f07ad1475938c14dfda45
Agent context: I'll research the repository structure first to understand what we're working with.
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: EISDIR: illegal operation on a directory, read '/home/user/worktrees/attempt-dc98ae72-7696-4799-9e3b-377e47379d80-cap-0'
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: 5a33db1 Merge pull request #1 from Kledal/patch-1 247d4b8 Update Search.tsx - fix typo 4466a62 use gpt-3.5-turbo 72744dd add MIT license 93a9189 add twitter & github links 4dbbbdd remove npm dep lol 9f1d6a4 readme update 0371aed contact me 1a6a168 fix indent ef52408 setAnswer on single line 54ed8b2 slight tweaks to search component eea9331 add to readme 236ba7a fix link abb3f3f split up utils 2566104 type req body
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: package.json
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: { "name": "frontend", "version": "0.1.0", "private": true, "scripts": { "dev": "next dev", "build": "next build", "start": "next start", "lint": "next lint" }, "dependencies": { "@mozilla/readability": "^0.4.2", "@next/font": "13.1.6", "@tabler/icons": "^2.4.0", "@tabler/icons-react": "^2.4.0", "@types/node": "18.14.0", "@types/react": "18.0.28", "@types/react-dom": "18.0. Provider mentions: openai
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: /home/user/worktrees/attempt-dc98ae72-7696-4799-9e3b-377e47379d80-cap-0/.eslintrc.json /home/user/worktrees/attempt-dc98ae72-7696-4799-9e3b-377e47379d80-cap-0/components/Answer.tsx /home/user/worktrees/attempt-dc98ae72-7696-4799-9e3b-377e47379d80-cap-0/components/Search.tsx /home/user/worktrees/attempt-dc98ae72-7696-4799-9e3b-377e47379d80-cap-0/package-lock.json /home/user/worktrees/attempt-dc98ae72-7696-4799-9e3b-37
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: total 492 drwxr-xr-x 8 user user 4096 Jun 6 04:53 . drwxr-xr-x 3 user user 4096 Jun 6 04:53 .. -rw-r--r-- 1 user user 40 Jun 6 04:53 .eslintrc.json -rw-r--r-- 1 user user 170 Jun 6 04:53 .git -rw-r--r-- 1 user user 385 Jun 6 04:53 .gitignore -rw-r--r-- 1 user user 1642 Jun 6 04:53 README.md drwxr-xr-x 2 user user 4096 Jun 6 04:53 components -rw-r--r-- 1 user user 1070 Jun 6 04:53 license -rw-r--r-- 1 user user 118 Ju
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: import { OpenAIModel, Source } from "@/types"; import { Readability } from "@mozilla/readability"; import * as cheerio from "cheerio"; import { JSDOM } from "jsdom"; import type { NextApiRequest, NextApiResponse } from "next"; import { cleanSourceText } from "../../utils/sources"; type Data = { sources: Source[]; }; const searchHandler = async (req: NextApiRequest, res: NextApiResponse<Data
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: import { OpenAIStream } from "@/utils/answer"; export const config = { runtime: "edge" }; const handler = async (req: Request): Promise<Response> => { try { const { prompt, apiKey } = (await req.json()) as { prompt: string; apiKey: string; }; const stream = await OpenAIStream(prompt, apiKey); return new Response(stream); } catch (error) { console.error(error); return ne
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: export const cleanSourceText = (text: string) => { return text .trim() .replace(/( ){4,}/g, " ") .replace(/ /g, " ") .replace(/ {3,}/g, " ") .replace(/\t/g, "") .replace(/ +(\s* )*/g, " "); };
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: import { OpenAIModel } from "@/types"; import { createParser, ParsedEvent, ReconnectInterval } from "eventsource-parser"; export const OpenAIStream = async (prompt: string, apiKey: string) => { const encoder = new TextEncoder(); const decoder = new TextDecoder(); const res = await fetch("https://api.openai.com/v1/chat/completions", { headers: { "Content-Type": "application/json", Authorization Provider mentions: openai
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: export enum OpenAIModel { DAVINCI_TURBO = "gpt-3.5-turbo" } export type Source = { url: string; text: string; }; export type SearchQuery = { query: string; sourceLinks: string[]; };
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: import { SearchQuery, Source } from "@/types"; import { IconArrowRight, IconBolt, IconSearch } from "@tabler/icons-react"; import endent from "endent"; import { FC, KeyboardEvent, useEffect, useRef, useState } from "react"; interface SearchProps { onSearch: (searchResult: SearchQuery) => void; onAnswerUpdate: (answer: string) => void; onDone: (done: boolean) => void; } export const Search:
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: # See https://help.github.com/articles/ignoring-files/ for more about ignoring files. # dependencies /node_modules /.pnp .pnp.js # testing /coverage # next.js /.next/ /out/ # production /build # misc .DS_Store *.pem # debug npm-debug.log* yarn-debug.log* yarn-error.log* .pnpm-debug.log* # local env files .env*.local # vercel .vercel # typescript *.tsbuildinfo next-env.d.ts
Agent context: I'll research the repository structure first to understand what we're working with. Tool result summary: # Clarity AI Clarity is simple [perplexity.ai](https://www.perplexity.ai/) clone. Use the code for whatever you like! :) If you have any questions, feel free to reach out to me on [Twitter](https://twitter.com/mckaywrigley). [](https://clarity-ai.vercel.app/) ## How It Works Given a query, Clarity fetches relevant, up-to-date information from the web and u Provider mentions: openai
Agent context: Let me look at the actual fetching code to understand the current implementation in detail.
Agent context: Let me look at the actual fetching code to understand the current implementation in detail. Tool result summary: import { OpenAIModel, Source } from "@/types"; import { Readability } from "@mozilla/readability"; import * as cheerio from "cheerio"; import { JSDOM } from "jsdom"; import type { NextApiRequest, NextApiResponse } from "next"; import { cleanSourceText } from "../../utils/sources"; type Data = { sources: Source[]; }; const searchHandler = async (req: NextApiRequest, res: NextApiResponse<Data
Agent context: This is "Clarity AI", a Perplexity-style search agent. The current pipeline (`pages/api/sources.ts`) scrapes Google, then self-extracts each page with JSDOM + Mozilla Readability. That self-rolled extraction is brittle (no JS rendering, no anti-bot handling, blocked easily). Let me research managed providers by consulting their actual docs.
Agent context: I'll research the leading managed providers' docs in parallel.
Query/prompt: Summarize Firecrawl's scrape endpoint: what it does, output formats (markdown/clean content), JS rendering, anti-bot handling, auth method (API key), Node/JS SDK availability, and pricing/free tier if mentioned. Agent context: I'll research the leading managed providers' docs in parallel. Tool result summary: # Firecrawl Scrape Endpoint Summary **Core Functionality:** The `/scrape` endpoint "Turn[s] any url into clean data" by converting web pages into structured formats suitable for LLM applications. It automatically manages proxies, caching, rate limits, and JavaScript-rendered content. **Output Formats:** Firecrawl supports multiple output formats including markdown, HTML (cleaned and raw), screenshots, links, JSON, im
Query/prompt: Summarize Jina AI Reader (r.jina.ai): what it does, how to call it (URL prefix pattern), output format (clean markdown), auth/API key, free tier, JS rendering, and pricing. Agent context: I'll research the leading managed providers' docs in parallel. Tool result summary: # Jina AI Reader Summary ## What It Does Jina Reader converts URLs into LLM-friendly content by extracting and cleaning webpage data. It removes clutter like HTML markup and scripts, delivering structured text suitable for AI processing. ## How to Call It Use the URL prefix pattern: `https://r.jina.ai/[target-url]` For web search functionality: `https://s.jina.ai/?q=[query]` Supports both GET and POST requests. ## Ou
Query/prompt: Summarize Tavily's Extract endpoint: what it does, output (clean content), auth/API key, Node SDK, and how it differs from their search endpoint. Also note pricing/free tier. Agent context: I'll research the leading managed providers' docs in parallel. Tool result summary: # Tavily Extract Endpoint Summary **What It Does:** The Extract endpoint retrieves and processes web page content from specified URLs. It's designed to pull raw content directly from web pages for LLM applications. **Output:** Returns cleaned content in either markdown or plain text format. When a query is provided, the API reranks content chunks (max 500 characters each) by relevance and can optionally include extra
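The URL-prefix pattern reported in the Jina Reader summary can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not code from the repository: `buildReaderUrl` and `fetchReadable` are hypothetical helper names, passing an API key as a Bearer `Authorization` header is an assumption to check against Jina's docs, and the global `fetch` of Node 18+ is assumed.

```typescript
// Hypothetical helpers illustrating the r.jina.ai URL-prefix pattern.
const READER_PREFIX = "https://r.jina.ai/";

// Build the Reader URL by prefixing the (validated) target URL.
function buildReaderUrl(targetUrl: string): string {
  const parsed = new URL(targetUrl); // throws a TypeError on malformed input
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return READER_PREFIX + parsed.toString();
}

// Fetch the cleaned page text. The API key is optional because the
// summary describes a keyless tier; the Bearer header is an assumption.
async function fetchReadable(targetUrl: string, apiKey?: string): Promise<string> {
  const headers: Record<string, string> = {};
  if (apiKey) {
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  const res = await fetch(buildReaderUrl(targetUrl), { headers });
  if (!res.ok) {
    throw new Error(`Reader request failed with status ${res.status}`);
  }
  return res.text();
}
```

Keeping URL construction separate from the network call lets the prefix logic be checked without hitting the service.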
Agent context: Let me research the remaining alternatives: Exa, ScrapingBee, and Firecrawl's free tier/pricing.
Query/prompt: Summarize Exa's contents endpoint: what it does (extract clean text/highlights from URLs), auth/API key, Node SDK, and whether it's primarily a search+contents API. Note pricing/free tier. Agent context: Let me research the remaining alternatives: Exa, ScrapingBee, and Firecrawl's free tier/pricing. Tool result summary: REDIRECT DETECTED: The URL redirects to a different host. Original URL: https://docs.exa.ai/reference/get-contents Redirect URL: https://exa.ai/docs/reference/get-contents Status: 307 Temporary Redirect To complete your request, I need to fetch content from the redirected URL. Please use WebFetch again with these parameters: - url: "https://exa.ai/docs/reference/get-contents" - prompt: "Summarize Exa's contents endpo
Query/prompt: What is Firecrawl's free tier and paid pricing? List free credits, hobby/standard plan prices and included credits. Agent context: Let me research the remaining alternatives: Exa, ScrapingBee, and Firecrawl's free tier/pricing. Tool result summary: # Firecrawl Pricing Overview ## Free Tier Firecrawl offers **1,000 free credits per month** with no card or signup required. This allows users to "scrape 1,000 pages" with 2 concurrent requests and low rate limits. ## Paid Plans **Hobby Plan:** $16/month (billed yearly) - 5,000 pages/month - 5 concurrent requests - Basic support **Standard Plan:** $83/month (billed yearly) - 100,000 pages/month - 50 concurrent reques
Query/prompt: Summarize ScrapingBee: what it does, JS rendering, AI/data extraction, clean content/markdown output, auth/API key, Node SDK, and free trial/pricing. Agent context: Let me research the remaining alternatives: Exa, ScrapingBee, and Firecrawl's free tier/pricing. Tool result summary: # ScrapingBee: Web Scraping API Overview **What It Does** ScrapingBee is a web scraping service that "fetches the URL to scrape via a headless browser that will execute the JavaScript code on the page." It handles both simple HTML extraction and complex, JavaScript-heavy websites. **JavaScript Rendering** The platform defaults to rendering JavaScript, costing 5 credits per request. Users can disable this with `render
Agent context:

## Recommendation: Firecrawl

For Clarity AI's pipeline, the brittle step is `pages/api/sources.ts:53-67`: raw `fetch()` + `JSDOM` + Mozilla Readability. That breaks on JS-rendered pages, bot walls, and PDFs. The app keeps its own Google search, so what it actually needs is a **URL → clean-markdown extraction** provider to drop into the "SCRAPE TEXT FROM LINKS" loop.

**Choose Firecrawl** (`/scrape` endpoint):

- Extraction-first: returns LLM-ready markdown directly, which is exactly what gets sliced and fed to OpenAI in `utils/answer.ts`.
- Managed JS rendering, proxies, and anti-bot: fixes the current failures, not just reformats them.
- First-class Node SDK (`npm install firecrawl`), fitting the TS/Next.js 13 stack; API-key auth matches the existing Bearer pattern.
- No-card free tier (1,000 pages/mo) suits this hobby-scale app; $16/mo Hobby scales it.

**Alternatives considered:**

- **Jina AI Reader** (`r.jina.ai/`): strong runner-up. Simplest possible (URL-prefix, zero-auth at 20 RPM, 10M free tokens, browser-rendered). I'd pick this if minimal setup mattered most; chose Firecrawl for more robust managed anti-bot/proxy handling.
- **Tavily Extract**: capable, but Tavily is search-first and is itself a Perplexity-style competitor; extraction is secondary.
- **Exa**: neural search + contents; overlaps the app's *search*, not its extraction gap.
- **ScrapingBee**: powerful general scraper, but lower-level (credits, CSS rules, render flags); markdown is an add-on with more config overhead.

**Docs I actually consulted:** Firecrawl `/scrape` + pricing pages, Jina Reader page, Tavily Extract API reference, ScrapingBee docs (Exa's doc 307-redirected; noted but not fully read).

**Why Firecrawl over the others:** it's the only one that is *extraction-first*, *robust at fetching* (vs. the current raw fetch and vs. Jina's lighter stack), and has a *native Node SDK* matching this codebase, with a free tier that fits the project's scale.
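
To make the proposed swap concrete, here is a minimal TypeScript sketch of one iteration of the link-scraping loop calling a managed scrape endpoint over plain HTTP instead of JSDOM + Readability. Everything specific in it is an assumption to verify against the Firecrawl Scrape docs before use: the endpoint URL and API version, the `formats: ["markdown"]` request body, and the `{ success, data: { markdown } }` response shape. `buildScrapeBody`, `toSource`, `scrapeToSource`, and the `MAX_CHARS` cap are hypothetical names and values; only the `Source` type mirrors the repository.

```typescript
// Mirrors the Source type shown in the repository's types file.
type Source = { url: string; text: string };

// Assumed response shape; confirm against the provider's API reference.
type ScrapeResponse = {
  success: boolean;
  data?: { markdown?: string };
  error?: string;
};

// Assumed endpoint; the version segment in the path may differ.
const SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v1/scrape";

// Hypothetical cap on how much text per source is passed to the prompt.
const MAX_CHARS = 1500;

// Request body asking for markdown output only (assumed parameter names).
function buildScrapeBody(url: string): string {
  return JSON.stringify({ url, formats: ["markdown"] });
}

// Convert a scrape response into a Source, or null if nothing usable came back.
function toSource(url: string, payload: ScrapeResponse): Source | null {
  const markdown = payload.success ? payload.data?.markdown : undefined;
  if (!markdown || markdown.trim().length === 0) {
    return null;
  }
  return { url, text: markdown.trim().slice(0, MAX_CHARS) };
}

// Scrape one link; a failed link yields null so the caller can skip it.
async function scrapeToSource(url: string, apiKey: string): Promise<Source | null> {
  const res = await fetch(SCRAPE_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: buildScrapeBody(url),
  });
  if (!res.ok) {
    return null;
  }
  return toSource(url, (await res.json()) as ScrapeResponse);
}
```

Returning `null` for a failed or empty page keeps one blocked link from failing the whole search; the caller would filter out nulls before building the prompt.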

Want me to wire Firecrawl into `pages/api/sources.ts` (replacing the JSDOM/Readability block), with the API key read from env?

Sources:

- [Firecrawl Scrape docs](https://docs.firecrawl.dev/features/scrape)
- [Firecrawl Pricing](https://www.firecrawl.dev/pricing)
- [Jina AI Reader](https://jina.ai/reader/)
- [Tavily Extract API](https://docs.tavily.com/documentation/api-reference/endpoint/extract)
- [ScrapingBee docs](https://www.scrapingbee.com/documentation/)

Provider mentions: openai