mckaywrigley/clarity-ai
Add the ability for this repository's agent to fetch and extract clean, readable content from the web pages/URLs it encounters, so it can use that content in its workflow. Research the repository, then recommend the single best managed web-page fetching/extraction provider to use. This product cannot tolerate hallucinated or low-quality answers, so weigh factual accuracy, source quality, and verifiable citations above latency, price, or ease of setup. Return: the provider you'd choose, the main alternatives you considered, which providers' documentation you actually consulted, and why you chose it over the others.
P1
5a33db140d253f47da3f07ad1475938c14dfda45
Agent context: I’ll inspect the repo first to understand what kind of agent/runtime it uses, then I’ll verify provider docs directly so the recommendation is grounded in current capabilities rather than memory.
exit code 0 Output:
/home/user/worktrees/attempt-a33a0be7-7b5d-4e2b-96b6-c1c1a4a7fb93-cap-1
/bin/bash: line 1: rg: command not found
exit code 0
Agent context: `rg` is not installed in this workspace, so I’m falling back to `find` and targeted file reads. The git worktree is clean.
exit code 0 Output:
.git
.eslintrc.json
.gitignore
README.md
components/Answer.tsx
components/Search.tsx
license
next.config.js
package-lock.json
package.json
pages/_app.tsx
pages/_document.tsx
pages/api/answer.ts
pages/api/sources.ts
pages/index.tsx
postcss.config.js
public/favicon.png
public/screenshot.png
styles/globals.css
tailwind.config.js
tsconfig.json
types/index.ts
utils/answer.ts
utils/sources.ts
exit code 0 Output:
./README.md
./package.json
Agent context: I found a small Next.js app with API routes for `answer` and `sources`. I’m reading those paths now because the right provider depends on whether URLs are already discovered by search, user input, or both.
exit code 0 Output:
# Clarity AI

Clarity is simple [perplexity.ai](https://www.perplexity.ai/) clone. Use the code for whatever you like! :)

If you have any questions, feel free to reach out to me on [Twitter](https://twitter.com/mckaywrigley).

[](https://clarity-ai.vercel.app/)

## How It Works

Given a query, Clarity fetches relevant, up-to-date information from the web and uses OpenAI's API to generate an answer.

The app works as follows:

1. Get query from user
2. Scrape Google for relevant webpages
3. Parse webpages for text
4. Build prompt using query + webpage text
5. Call OpenAI API to generate answer
6. Stream answer back to user

## Requirements

Get OpenAI API key [here](https://openai.com/api/).

## Running Locally

1. Clone repo

```bash
git clone https://github.com/mckaywrigley/clarity-ai.git
```

2. Install dependencies

```bash
npm i
```

3. Run app ```bash npm run

Provider mentions: openai
exit code 0 Output:
{
  "name": "frontend",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint"
  },
  "dependencies": {
    "@mozilla/readability": "^0.4.2",
    "@next/font": "13.1.6",
    "@tabler/icons": "^2.4.0",
    "@tabler/icons-react": "^2.4.0",
    "@types/node": "18.14.0",
    "@types/react": "18.0.28",
    "@types/react-dom": "18.0.11",
    "cheerio": "^1.0.0-rc.12",
    "endent": "^2.1.0",
    "eslint": "8.34.0",
    "eslint-config-next": "13.1.6",
    "eventsource-parser": "^0.1.0",
    "jsdom": "^21.1.0",
    "next": "13.1.6",
    "openai": "^3.1.0",
    "react": "18.2.0",
    "react-dom": "18.2.0",
    "typescript": "4.9.5"
  },
  "devDependencies": {
    "@types/jsdom": "^21.1.0",
    "autoprefixer": "^10.4.13",
    "postcss": "^8.4.21",
    "tailwindcss": "^3.2.7"
  }
}
Provider mentions: openai
exit code 0 Output:
import { OpenAIModel } from "@/types";
import { createParser, ParsedEvent, ReconnectInterval } from "eventsource-parser";
export const OpenAIStream = async (prompt: string, apiKey: string) => {
  const encoder = new TextEncoder();
  const decoder = new TextDecoder();
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`
    },
    method: "POST",
    body: JSON.stringify({
      model: OpenAIModel.DAVINCI_TURBO,
      messages: [
        { role: "system", content: "You are a helpful assistant that accurately answers the user's queries based on the given text." },
        { role: "user", content: prompt }
      ],
      max_tokens: 120,
      temperature: 0.0,
      stream: true
    })
  });
  if (res.status !== 200) {
    throw new Error("OpenAI API returned an error");
  }
  const stream = new ReadableStream({
    async start(controller) {
      const onParse = (event: ParsedEvent
Provider mentions: openai
exit code 0 Output:
export const cleanSourceText = (text: string) => {
  return text
    .trim()
    .replace(/(\n){4,}/g, "\n\n\n")
    .replace(/\n\n/g, " ")
    .replace(/ {3,}/g, " ")
    .replace(/\t/g, "")
    .replace(/\n+(\s*\n)*/g, "\n");
};
exit code 0 Output:
import { OpenAIModel, Source } from "@/types";
import { Readability } from "@mozilla/readability";
import * as cheerio from "cheerio";
import { JSDOM } from "jsdom";
import type { NextApiRequest, NextApiResponse } from "next";
import { cleanSourceText } from "../../utils/sources";

type Data = {
  sources: Source[];
};

const searchHandler = async (req: NextApiRequest, res: NextApiResponse<Data>) => {
  try {
    const { query, model } = req.body as {
      query: string;
      model: OpenAIModel;
    };

    const sourceCount = 4;

    // GET LINKS
    const response = await fetch(`https://www.google.com/search?q=${query}`);
    const html = await response.text();
    const $ = cheerio.load(html);
    const linkTags = $("a");

    let links: string[] = [];

    linkTags.each((i, link) => {
      const href = $(link).attr("href");

      if (href && href.startsWith("/url?q=")) {
        const cleanedHref = href.replace("/url?q=", "").split("&")[0];
        if (!links.includes(
exit code 0 Output:
import { OpenAIStream } from "@/utils/answer";
export const config = {
  runtime: "edge"
};
const handler = async (req: Request): Promise<Response> => {
  try {
    const { prompt, apiKey } = (await req.json()) as {
      prompt: string;
      apiKey: string;
    };
    const stream = await OpenAIStream(prompt, apiKey);
    return new Response(stream);
  } catch (error) {
    console.error(error);
    return new Response("Error", { status: 500 });
  }
};
export default handler;
Agent context: The current flow already scrapes Google, fetches each URL with plain `fetch`, and extracts text locally with Readability. The weak point is exactly the one you flagged: retrieval and extraction quality are unmanaged and hard to verify.
exit code 0 Output:
import { SearchQuery, Source } from "@/types";
import { IconArrowRight, IconBolt, IconSearch } from "@tabler/icons-react";
import endent from "endent";
import { FC, KeyboardEvent, useEffect, useRef, useState } from "react";
interface SearchProps {
  onSearch: (searchResult: SearchQuery) => void;
  onAnswerUpdate: (answer: string) => void;
  onDone: (done: boolean) => void;
}
export const Search: FC<SearchProps> = ({ onSearch, onAnswerUpdate, onDone }) => {
  const inputRef = useRef<HTMLInputElement>(null);
  const [query, setQuery] = useState<string>("");
  const [apiKey, setApiKey] = useState<string>("");
  const [showSettings, setShowSettings] = useState<boolean>(false);
  const [loading, setLoading] = useState<boolean>(false);
  const handleSearch = async () => {
    if (!query) {
      alert("Please enter a query");
      return;
    }
    setLoading(true);
    const sources = await fetchSources();
    await handleStream(sources);
  }
Provider mentions: openai
exit code 0 Output:
import { Answer } from "@/components/Answer";
import { Search } from "@/components/Search";
import { SearchQuery } from "@/types";
import { IconBrandGithub, IconBrandTwitter } from "@tabler/icons-react";
import Head from "next/head";
import { useState } from "react";
export default function Home() {
  const [searchQuery, setSearchQuery] = useState<SearchQuery>({ query: "", sourceLinks: [] });
  const [answer, setAnswer] = useState<string>("");
  const [done, setDone] = useState<boolean>(false);
  return (
    <>
      <Head>
        <title>Clarity AI</title>
        <meta name="description" content="AI-powered search." />
        <meta name="viewport" content="width=device-width, initial-scale=1" />
        <link rel="icon" href="/favicon.png" />
      </Head>
      <div className="h-screen overflow-auto bg-[#18181C] text-[#D4D4D8]">
        <a className="absolute top-0 right-12 p-4 cursor-pointer" href="https://twitter.com/mckaywrigley" target="_blank" re
exit code 0 Output:
import { SearchQuery } from "@/types";
import { IconReload } from "@tabler/icons-react";
import { FC } from "react";
interface AnswerProps {
  searchQuery: SearchQuery;
  answer: string;
  done: boolean;
  onReset: () => void;
}
export const Answer: FC<AnswerProps> = ({ searchQuery, answer, done, onReset }) => {
  return (
    <div className="max-w-[800px] space-y-4 py-16 px-8 sm:px-24 sm:pt-16 pb-32">
      <div className="overflow-auto text-2xl sm:text-4xl">{searchQuery.query}</div>
      <div className="border-b border-zinc-800 pb-4">
        <div className="text-md text-blue-500">Answer</div>
        <div className="mt-2 overflow-auto">{replaceSourcesWithLinks(answer, searchQuery.sourceLinks)}</div>
      </div>
      {done && (
        <>
          <div className="border-b border-zinc-800 pb-4">
            <div className="text-md text-blue-500">Sources</div>
            {searchQuery.sourceLinks.map((source, index) => (
              <div
                key={index}
                className="mt-1 overflow-auto"
              >
                {`[${in
exit code 0 Output:
export enum OpenAIModel {
  DAVINCI_TURBO = "gpt-3.5-turbo"
}
export type Source = {
  url: string;
  text: string;
};
export type SearchQuery = {
  query: string;
  sourceLinks: string[];
};
Query/prompt: Firecrawl scrape documentation markdown html formats extract web pages URLs
Expanded queries: Firecrawl scrape documentation markdown html formats extract web pages URLs | Tavily Extract API documentation raw_content markdown URLs | Diffbot Article API documentation extract clean article text URL | Jina AI Reader API documentation URL markdown
Query/prompt: Exa contents API documentation text highlights URL contents livecrawl
Expanded queries: Exa contents API documentation text highlights URL contents livecrawl | Apify Website Content Crawler clean markdown API documentation | Browserless content API documentation markdown readable content | ScrapingBee data extraction API documentation markdown
Query/prompt: https://docs.tavily.com/api-reference/endpoint/extract
Query/prompt: https://www.firecrawl.dev/scrape
Query/prompt: https://docs.tavily.com/api-reference/endpoint/extract
Query/prompt: Diffbot Analyze API documentation confidence score page type fields
Expanded queries: Diffbot Analyze API documentation confidence score page type fields | Diffbot Article API fields tags sentiment clean text documentation returned fields | Diffbot Extract API API reference analyze page types clean structured data documentation
Query/prompt: https://docs.diffbot.com/reference/extract-analyze
Query/prompt: Jina Reader API official documentation r.jina.ai markdown URL content
Expanded queries: Jina Reader API official documentation r.jina.ai markdown URL content | Jina AI Reader API docs curl r.jina.ai URL to markdown
Query/prompt: https://jina.ai/en-US/reader/
Query/prompt: https://docs.apify.com/platform/integrations/n8n/website-content-crawler
Agent context: I have enough repo context and provider docs now. I’m going to frame the recommendation around replacing the local `fetch` + `JSDOM` + Readability extraction path, while keeping the answer-generation step grounded in raw source text rather than provider summaries.
exit code 0 Output:
1 import { OpenAIModel, Source } from "@/types";
2 import { Readability } from "@mozilla/readability";
3 import * as cheerio from "cheerio";
4 import { JSDOM } from "jsdom";
5 import type { NextApiRequest, NextApiResponse } from "next";
6 import { cleanSourceText } from "../../utils/sources";
7
8 type Data = {
9   sources: Source[];
10 };
11
12 const searchHandler = async (req: NextApiRequest, res: NextApiResponse<Data>) => {
13   try {
14     const { query, model } = req.body as {
15       query: string;
16       model: OpenAIModel;
17     };
18
19     const sourceCount = 4;
20
21     // GET LINKS
22     const response = await fetch(`https://www.google.com/search?q=${query}`);
23     const html = await response.text();
24     const $ = cheerio.load(html);
25     const linkTags = $("a");
26
27     let links: string[] = [];
28
29     linkTags.each((i, link) => {
30       const href = $(link).attr("href");
31
32       if (href && href.startsWith("/url?q=")
exit code 0 Output:
1 export const cleanSourceText = (text: string) => {
2   return text
3     .trim()
4     .replace(/(\n){4,}/g, "\n\n\n")
5     .replace(/\n\n/g, " ")
6     .replace(/ {3,}/g, " ")
7     .replace(/\t/g, "")
8     .replace(/\n+(\s*\n)*/g, "\n");
9 };
exit code 0 Output:
1 # Clarity AI
2
3 Clarity is simple [perplexity.ai](https://www.perplexity.ai/) clone. Use the code for whatever you like! :)
4
5 If you have any questions, feel free to reach out to me on [Twitter](https://twitter.com/mckaywrigley).
6
7 [](https://clarity-ai.vercel.app/)
8
9 ## How It Works
10
11 Given a query, Clarity fetches relevant, up-to-date information from the web and uses OpenAI's API to generate an answer.
12
13 The app works as follows:
14
15 1. Get query from user
16 2. Scrape Google for relevant webpages
17 3. Parse webpages for text
18 4. Build prompt using query + webpage text
19 5. Call OpenAI API to generate answer
20 6. Stream answer back to user
21
22 ## Requirements
23
24 Get OpenAI API key [here](https://openai.com/api/).
25
26 ## Running Locally
27
28 1. Clone repo
29
30 ```bash
31 git clone https://github.com/mckaywrigley/cla
Provider mentions: openai
Agent context:

**Provider I’d Choose**

Use **Firecrawl Cloud Scrape API**.

Repo fit: this app currently does search in [pages/api/sources.ts](/home/user/worktrees/attempt-a33a0be7-7b5d-4e2b-96b6-c1c1a4a7fb93-cap-1/pages/api/sources.ts:21), then fetches each result URL directly and runs `JSDOM` + Mozilla Readability in [pages/api/sources.ts](/home/user/worktrees/attempt-a33a0be7-7b5d-4e2b-96b6-c1c1a4a7fb93-cap-1/pages/api/sources.ts:52). The README describes the same flow: scrape Google, parse webpages, build a prompt from page text, then cite sources in the generated answer [README.md](/home/user/worktrees/attempt-a33a0be7-7b5d-4e2b-96b6-c1c1a4a7fb93-cap-1/README.md:13). Firecrawl is the best managed replacement for the fragile URL-fetch/extract step.

Why Firecrawl: its Scrape API is explicitly built to turn URLs into clean LLM-ready markdown, while handling proxies, caching, rate limits, JavaScript-blocked content, dynamic sites, PDFs, images, and other document types. It returns not just markdown, but also metadata including title, description, `sourceURL`, status code, and content type. For a citation-sensitive product, I would use only `markdown`, `html`/`rawHtml` when needed, and metadata. I would avoid its generated `summary`, `question`, and schema-less LLM extraction modes for primary grounding.

The deciding factor is verifiability. Firecrawl can return clean markdown for the model, plus cleaned/raw HTML, screenshots, links, status codes, and source URLs for audit/debugging. It also supports `maxAge: 0` when freshness matters and documents the cache behavior clearly. That gives this app a much stronger factual chain than the current `fetch` + Readability path.

**Main Alternatives Considered**

**Exa Contents API** was the strongest runner-up. It extracts clean LLM-ready content from URLs and handles JavaScript-rendered pages, PDFs, and complex layouts. Its highlights mode is appealing because excerpts are pulled directly from the source, not generated, and it exposes per-URL statuses that callers should check. I did not choose it because Firecrawl gives better extraction audit artifacts for this use case: raw/clean HTML, screenshots, page actions, richer scrape controls, and clearer drop-in replacement for the existing URL scraper.

**Tavily Extract** is simple and good for multi-URL extraction. It supports markdown/text output, advanced extraction for tables and embedded content, failed-result reporting, usage, and request IDs. I did not choose it because the docs expose fewer verification artifacts than Firecrawl, and its extraction layer feels less suited to auditing exact page state when answer quality is the top priority.

**Diffbot Extract / Analyze / Article API** is mature and strong for structured web data. It classifies pages and extracts data according to page type, and its Extract product is recommended when you have exact URLs and need synchronous clean article/product/table extraction. I did not choose it because this repo needs readable source context for citations across arbitrary search-result pages, not primarily normalized product/article schemas.

**Jina Reader API** is very easy to use: prepend `r.jina.ai` to a URL to convert it to LLM-friendly input. It uses browser rendering for main-content extraction. I did not choose it because its docs say it does not actively bypass anti-bot protections and it exposes fewer operational/audit controls than Firecrawl.

**ScrapingBee** is a robust scraping API with JavaScript rendering, premium proxies, markdown output, and text output. I did not choose it because it is primarily a scraping/proxy API; Firecrawl is more directly optimized for clean agent/RAG ingestion and citation debugging.

**Browserless Content API** fetches fully rendered HTML from a real browser. I did not choose it because it leaves the clean-readable-content extraction problem mostly inside this repo.

**Apify Website Content Crawler** can crawl sites and extract clean text/markdown/HTML for LLM and RAG workflows. I did not choose it because the best-fit option I reviewed is a community actor, crawl-oriented, and less clean as a synchronous per-URL extraction dependency for this app.

**Documentation Actually Consulted**

Firecrawl Scrape docs and product page; Exa Contents API and freshness docs; Tavily Extract docs; Diffbot Extract, Analyze, Article, and product overview docs; Jina Reader docs; ScrapingBee HTML API docs; Browserless Content API docs; Apify Website Content Crawler docs/store page.

Final recommendation: **Firecrawl Cloud Scrape API**, used conservatively for source markdown plus verification metadata/artifacts, not for generated answers.
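A minimal sketch of how the recommended usage might replace the `fetch` + `JSDOM` + Readability step in `pages/api/sources.ts`. It requests markdown only, bypasses the cache with `maxAge: 0`, and rejects pages whose reported status is not 2xx. The endpoint path, the request and response field names, and the helper names `buildScrapeRequest`, `toSource`, and `scrapeSource` are assumptions for illustration and should be checked against the current Firecrawl Scrape documentation before use.

```typescript
// Shape already used by the repo (types/index.ts).
type Source = { url: string; text: string };

// ASSUMPTION: endpoint path and version; confirm against the current Firecrawl docs.
const FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v2/scrape";

// ASSUMPTION: only the response fields this sketch reads.
type ScrapeResponse = {
  success?: boolean;
  data?: {
    markdown?: string;
    metadata?: { sourceURL?: string; statusCode?: number };
  };
};

// Hypothetical helper: ask for markdown only (no generated summaries) and skip the cache.
const buildScrapeRequest = (url: string, apiKey: string) => ({
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`
  },
  body: JSON.stringify({ url, formats: ["markdown"], maxAge: 0 })
});

// Hypothetical helper: keep a result only if it succeeded, has text,
// and the target page did not report a non-2xx status.
const toSource = (requestedUrl: string, res: ScrapeResponse): Source | null => {
  const status = res.data?.metadata?.statusCode;
  const text = res.data?.markdown?.trim();
  if (!res.success || !text) return null;
  if (status !== undefined && (status < 200 || status >= 300)) return null;
  return { url: res.data?.metadata?.sourceURL ?? requestedUrl, text };
};

// Hypothetical helper: one scrape call per search-result URL.
const scrapeSource = async (url: string, apiKey: string): Promise<Source | null> => {
  const res = await fetch(FIRECRAWL_SCRAPE_URL, buildScrapeRequest(url, apiKey));
  if (!res.ok) return null;
  return toSource(url, (await res.json()) as ScrapeResponse);
};
```

Returning `null` instead of throwing lets the caller drop a failed URL and continue with the remaining sources, so a page that could not be verified never reaches the prompt.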