jjleng/sensei
Add the ability for this repository's agent to fetch and extract clean, readable content from the web pages/URLs it encounters, so it can use that content in its workflow. Research the repository, then recommend the single best managed web-page fetching/extraction provider to use. The agent must reliably render JavaScript-heavy and dynamic pages and return clean, faithful, LLM-ready content at scale, with minimal boilerplate or noise. Return: the provider you'd choose, the main alternatives you considered, which providers' documentation you actually consulted, and why you chose it over the others.
P3F
448265f1fedfc0f6084499458ad9c445b54d1f68
Agent context: I'll research the repository first to understand its architecture and how the agent fetches/uses web content.
Tool result summary: /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/pyproject.toml /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/requirements.txt /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/frontend/package.json
Tool result summary: 1 { 2 "name": "sensei-search", 3 "version": "0.1.0", 4 "private": true, 5 "scripts": { 6 "dev": "next dev", 7 "build": "next build", 8 "start": "next start", 9 "lint": "next lint" 10 }, 11 "dependencies": { 12 "@radix-ui/react-dialog": "^1.1.1", 13 "@radix-ui/react-hover-card": "^1.1.0", 14 "@radix-ui/react-separator": "^1.1.0", 15 "@radix-ui/react-slot": "^1.1.0", 16 "@radix-ui/react-tooltip": "^1.1.2", 17 "@tailwin
Tool result summary: File content (71683 tokens) exceeds maximum allowed tokens (25000). Use offset and limit parameters to read specific portions of the file, or search for specific content instead of reading the whole file.
Tool result summary: 1 [tool.poetry] 2 name = "sensei-search" 3 version = "0.1.0" 4 description = "" 5 authors = ["Jijun Leng"] 6 readme = "README.md" 7 8 [tool.poetry.dependencies] 9 python = "^3.9" 10 python-dotenv = "^1.0.1" 11 openai = "^1.30.5" 12 loguru = "^0.7.2" 13 fastapi = "^0.111.0" 14 python-socketio = "^5.11.2" 15 uvicorn = "^0.30.1" 16 aiohttp = "^3.9.5" 17 trafilatura = "^1.10.0" 18 redis = "^5.0.5" 19 typing-extensions = Provider mentions: openai
Tool result summary: total 72 drwxr-xr-x 6 user user 4096 Jun 6 04:58 . drwxr-xr-x 3 user user 4096 Jun 6 04:58 .. -rw-r--r-- 1 user user 170 Jun 6 04:58 .git -rw-r--r-- 1 user user 172 Jun 6 04:58 .gitignore -rw-r--r-- 1 user user 337 Jun 6 04:58 .vercelignore -rw-r--r-- 1 user user 11356 Jun 6 04:58 LICENSE -rw-r--r-- 1 user user 593 Jun 6 04:58 Makefile -rw-r--r-- 1 user user 2859 Jun 6 04:58 README.md drwxr-xr-x 3 user user 4096 Jun
Tool result summary: /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/__init__.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/agents/__init__.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/agents/samurai/__init__.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/
Tool result summary: total 436 drwxr-xr-x 3 user user 4096 Jun 6 04:58 . drwxr-xr-x 6 user user 4096 Jun 6 04:58 .. -rw-r--r-- 1 user user 44 Jun 6 04:58 .buildpacks.env -rw-r--r-- 1 user user 29 Jun 6 04:58 .cnignore -rw-r--r-- 1 user user 23 Jun 6 04:58 .dockerignore -rw-r--r-- 1 user user 328 Jun 6 04:58 .env.development.example -rw-r--r-- 1 user user 196 Jun 6 04:58 .env.production.example -rw-r--r-- 1 user user 1481 Jun 6 04:58 .git
Tool result summary: Found 9 files backend/sensei_search/agents/samurai/agent.py backend/sensei_search/agents/shogun/agent_v1.py backend/sensei_search/agents/shogun/agent_v2.py backend/sensei_search/base_agent.py backend/sensei_search/chat_store.py backend/sensei_search/tools/search/__init__.py backend/sensei_search/tools/search/base.py backend/sensei_search/tools/search/bing.py backend/sensei_search/tools/search/searxng.py
Tool result summary: EISDIR: illegal operation on a directory, read '/home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/tools/search'
Tool result summary: 1 from __future__ import annotations 2 3 from urllib.parse import urljoin 4 5 from aiohttp import ClientSession 6 7 from sensei_search.config import SEARXNG_URL 8 from sensei_search.logger import logger 9 from sensei_search.tools.search.base import ( 10 Input, 11 SearchTool, 12 TopResults, 13 filter_medium_by_accessibility, 14 get_top_results, 15 ) 16 17 MAX_RESULTS = 5 18 19 20 async def filter_medium_by_scores(resu
Tool result summary: 1 from __future__ import annotations 2 3 import asyncio 4 import os 5 from abc import ABC, abstractmethod 6 from enum import Enum 7 from typing import Any, List, Union 8 from urllib.parse import urljoin 9 10 from aiohttp import ClientSession 11 from pydantic import BaseModel, Field 12 from typing_extensions import TypedDict 13 14 from sensei_search.logger import logger 15 16 17 class BaseResult(TypedDict): 18 url: st
Tool result summary: /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/tools/__init__.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/tools/search/__init__.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/tools/search/base.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/ba
Tool result summary: 1 from .search import * 2
Tool result summary: 1 import asyncio 2 import re 3 from datetime import datetime 4 from typing import Any, List, Optional 5 6 from openai import AsyncOpenAI, OpenAI 7 8 from sensei_search.agents.samurai.prompts import ( 9 answer_prompt, 10 classification_prompt, 11 related_questions_prompt, 12 search_prompt, 13 ) 14 from sensei_search.base_agent import BaseAgent, EnrichedQuery, NoAccessError, QueryTags 15 from sensei_search.chat_store i Provider mentions: openai
Tool result summary: 1 import asyncio 2 import json 3 import re 4 from datetime import datetime 5 from typing import Any, List, Optional, Tuple 6 7 from openai import AsyncOpenAI 8 from openai.types.chat import ChatCompletionMessageParam, ChatCompletionMessageToolCall 9 from openai.types.chat.chat_completion_message_tool_call import Function 10 11 from sensei_search.agents.shogun.prompts import ( 12 answer_prompt, 13 classification_promp Provider mentions: openai
Tool result summary: 1 from __future__ import annotations 2 3 import asyncio 4 import uuid 5 from abc import ABC, abstractmethod 6 from enum import Enum 7 from typing import Dict, List, Literal, Optional, Protocol, Union 8 9 import trafilatura # type: ignore[import] 10 from aiohttp import ClientSession, ClientTimeout 11 from pydantic import BaseModel, Field 12 from typing_extensions import TypedDict 13 14 from sensei_search.chat_store im
Tool result summary: 1 import os 2 3 from sensei_search.env import load_envs 4 5 load_envs() 6 7 SEARXNG_URL = os.environ["SEARXNG_URL"] 8 REDIS_HOST = os.environ["REDIS_HOST"] 9 BING_API_KEY = os.environ["BING_API_KEY"] 10 11 # Small model 12 SM_MODEL_URL = os.environ["SM_MODEL_URL"] 13 SM_MODEL = os.environ["SM_MODEL"] 14 SM_MODEL_API_KEY = os.environ["SM_MODEL_API_KEY"] 15 16 # Medium model 17 MD_MODEL_URL = os.environ["MD_MODEL_URL"]
Tool result summary: 1 import os 2 3 from .base import * 4 from .bing import * 5 from .searxng import * 6 7 8 def get_search_tool() -> SearchTool: 9 env = os.getenv("ENV", "development") 10 if env == "development": 11 return SearxNG() 12 else: 13 return Bing() 14
Tool result summary: 1 import os 2 from pathlib import Path 3 4 from loguru import logger 5 6 7 def load_envs() -> None: 8 from dotenv import load_dotenv 9 10 if getattr(load_envs, "_loaded", False): 11 return 12 13 base_dir = Path(__file__).resolve().parent.parent 14 15 env = os.getenv("ENV", "development") 16 logger.info(f"Loading environment: {env}") 17 18 if env == "development": 19 dotenv_path = base_dir / ".env.development" 20 elif
Tool result summary: 1 LOGURU_LEVEL=DEBUG 2 SEARXNG_URL="http://searxng:8080" 3 REDIS_HOST="redis" 4 BING_API_KEY="" 5 6 # Small model 7 SM_MODEL_URL="http://host.docker.internal:11434/v1/" 8 SM_MODEL="llama3:instruct" 9 SM_MODEL_API_KEY="whatever" 10 11 # Medium model 12 MD_MODEL_URL="http://host.docker.internal:11434/v1/" 13 MD_MODEL="command-r" 14 MD_MODEL_API_KEY="whatever"
Tool result summary: 1 from __future__ import annotations 2 3 import asyncio 4 import os 5 from typing import Any, Callable, Coroutine, Dict, List, TypeVar 6 7 from aiohttp import ClientSession 8 9 from sensei_search.config import BING_API_KEY 10 from sensei_search.logger import logger 11 from sensei_search.tools.search.base import ( 12 Category, 13 GeneralResult, 14 ImageResult, 15 Input, 16 SearchTool, 17 TopResults, 18 VideoResult, 19
Tool result summary: 448265f Update README.md 45d6e49 feat: update prompts 2a9e3b2 feat: prompts update 23ea194 fix: avoid scrolling for the first question 3bb56d0 feat: default to searxng for dev 88779e5 refactor: consolidate env vars 547c746 docs: update readme 26aec60 feat: open graph support 79985fd feat: prompts update 069ca58 feat: more logs 5a60a6b feat: enable math for the rich content renderer by default c357c46 feat: server sid
Tool result summary: 1 from typing import List, Optional, Union 2 3 from typing_extensions import TypedDict 4 5 6 class WebResult(TypedDict): 7 url: str 8 title: str 9 content: str 10 11 12 class MediumVideo(TypedDict): 13 url: str 14 medium: str 15 16 17 class MediumImage(TypedDict): 18 url: str 19 image: str 20 medium: str 21 22 23 class MetaData(TypedDict): 24 has_math: bool 25 26 27 class ChatHistoryItem(TypedDict, total=False): 28 i
Tool result summary: openai = "^1.30.5" aiohttp = "^3.9.5" trafilatura = "^1.10.0" redis = "^5.0.5" Provider mentions: openai
Tool result summary: 1 from __future__ import annotations 2 3 import asyncio 4 import os 5 from typing import Dict, List, Optional 6 7 import socketio # type: ignore[import-untyped] 8 from fastapi import FastAPI 9 from fastapi.middleware.cors import CORSMiddleware 10 11 from sensei_search.agents.shogun.agent_v2 import ShogunAgent 12 from sensei_search.base_agent import NoAccessError 13 from sensei_search.chat_store import ChatStore 14 fr
Tool result summary: 5a60a6b feat: enable math for the rich content renderer by default a2f65ba feat: timeout when fetching pages 0425630 feat: style rich content list with css a2c6194 feat: tag user queries with content violation, has math, etc. 12699a9 feat: preprocess answers before passing down to rich content renderer f100298 fix(ui): remove export keyword for search page 71ffa41 feat: refresh to fetch history d051f8d feat: update h
Tool result summary: 1 from __future__ import annotations 2 3 import json 4 from typing import Any, List, Optional 5 6 import redis.asyncio as redis 7 8 from sensei_search.config import REDIS_HOST 9 from sensei_search.logger import logger 10 from sensei_search.models import ChatHistoryItem, ThreadMetadata 11 12 CHAT_HISTORY_LIMIT = 3 13 14 15 class ChatStore: 16 """ 17 This class is responsible for storing and fetching chat history using
Tool result summary: /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/agents/shogun/agent_v1.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/agents/shogun/agent_v2.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/base_agent.py /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/
Tool result summary: /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/agents/shogun/agent_v1.py: # TODO: double check if this is OK /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/backend/sensei_search/agents/shogun/agent_v2.py: # TODO: double check if this is OK
Tool result summary: /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/frontend/app/ChatThreadStore.ts /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/frontend/app/components/ChatHistory.tsx /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/frontend/app/components/ChatHistoryItem/ImageVideoWidget.tsx /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0
Tool result summary: 1 import { v4 as uuidv4 } from 'uuid'; 2 3 export interface ChatThreadEntry { 4 id: string; 5 ts: string; 6 slug: string; 7 displayName: string; 8 } 9 10 export default class ChatThreadStore { 11 private static storeKey = 'chatThreads'; 12 private static pageSize = 10; 13 14 static fetchEntries(page: number = 0): ChatThreadEntry[] { 15 const allEntries = this.getAllEntries(); 16 const startIndex = page * this.pageSiz
Tool result summary: /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/frontend/app/components/ChatHistoryItem/index.tsx: answer: string | null; /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/frontend/app/components/ChatHistoryItem/index.tsx: const { answer, webSources } = props; /home/user/worktrees/attempt-07f5b7b3-3903-44ba-8d2e-e26fc1c56b8e-cap-0/frontend/app/components/ChatHistoryItem/i
Tool result summary: 1 # Sensei Search 2 3 Sensei Search is an AI-powered answer engine. 4 5 ## 📸 Screenshots 6 7 8 ### Light Mode 9 <div align="center"> 10 <img src="./docs/lennon.png" alt="Screenshot - Lennon" max-width="600"> 11 </div> 12 13 ### Dark Mode 14 <div align="center"> 15 <img src="./docs/programming.png" alt="Screenshot - Programming" max-width="600"> 16 </div> 17 18 19 ## 💡 Insights from Utilizing Open Source LLMs 20 21
Agent context: The repo is "Sensei Search", a Python/FastAPI backend that currently fetches pages with `aiohttp` + `trafilatura` (raw HTML GET, 3s timeout), which can't render JS. Now let me consult actual documentation for the leading managed fetch/extraction providers before recommending.
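To make the "can't render JS" limitation concrete, here is a minimal standard-library sketch of what any text extractor sees when a raw GET returns a single-page-app shell. The HTML sample and the `visible_text` helper are illustrative assumptions, not code from the repository; trafilatura's heuristics are far more capable, but they start from the same empty markup.

```python
from html.parser import HTMLParser

# Hypothetical response body for a JavaScript-rendered page: the server sends
# only a mount point and a script tag. The real content is injected later by
# the browser, which a plain HTTP GET never runs.
SPA_SHELL = """
<!doctype html>
<html>
  <head><title>Example App</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/bundle.js"></script>
  </body>
</html>
"""


class _TextCollector(HTMLParser):
    """Collects text nodes that sit outside <script>, <style>, and <title>."""

    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the body text a non-rendering extractor could recover."""
    parser = _TextCollector()
    parser.feed(html)
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(repr(visible_text(SPA_SHELL)))
```

A server-rendered page yields its paragraphs through the same helper, which is why the gap only shows up on dynamic sites.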
Agent context: Now let me consult documentation for the leading managed providers in parallel.
Query/prompt: Extract details about the /scrape endpoint: does it render JavaScript/dynamic pages? What output formats (markdown, LLM-ready)? How does it handle boilerplate/noise removal (main content extraction)? Pricing/rate limits/scale, batch endpoints, async support, and Python SDK usage. List concrete capabilities and any caveats.
Agent context: Now let me consult documentation for the leading managed providers in parallel.
Tool result summary:
# Firecrawl /scrape Endpoint Overview
## JavaScript & Dynamic Content Rendering
The endpoint **does render dynamic content**. Firecrawl "handles dynamic content: dynamic websites, js-rendered sites, PDFs, images" and manages complexities like "proxies, caching, rate limits, js-blocked content."
## Output Formats
The service supports diverse output types suited for LLM applications:
- **Markdown** (primary LLM-ready f
Query/prompt: Extract details about Jina AI Reader (r.jina.ai): does it render JavaScript/dynamic pages? What output format does it return (markdown, clean LLM-ready text)? How is it called (simple URL prefix? API key?)? Pricing, rate limits, scale. Any options for waiting on dynamic content, removing boilerplate. List concrete capabilities and caveats.
Agent context: Now let me consult documentation for the leading managed providers in parallel.
Tool result summary:
# Jina AI Reader API (r.jina.ai) Overview
## Core Functionality
**JavaScript Rendering & Dynamic Content:** Reader uses a browser-based approach to fetch URLs. It offers a "Wait For (CSS Selector)" parameter to "Wait until these elements appear before extracting content. Useful for dynamically loaded content," enabling handling of JavaScript-rendered pages. There's also a Stream Mode described as "beneficial for larg
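The URL-prefix calling convention asked about in this query can be sketched as below. This only builds the request and sends nothing. The `build_reader_request` helper is hypothetical, and the `X-Wait-For-Selector` header name is an assumption based on the "Wait For (CSS Selector)" option quoted above, so confirm both against the current Reader documentation before relying on them.

```python
import urllib.request

# Reader is addressed by prefixing the target URL with this host.
READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(target_url, api_key=None, wait_for_selector=None):
    """Build (but do not send) a GET request for Jina Reader.

    api_key and wait_for_selector are optional; the header names used for
    them are assumptions to verify against the provider's docs.
    """
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    if wait_for_selector:
        # Assumed header for "wait until these elements appear".
        headers["X-Wait-For-Selector"] = wait_for_selector
    return urllib.request.Request(READER_PREFIX + target_url, headers=headers)


if __name__ == "__main__":
    req = build_reader_request("https://example.com/page", wait_for_selector="#main")
    print(req.full_url)
```

Passing the returned object to `urllib.request.urlopen` (or porting the same URL and headers to the repo's `aiohttp` session) would perform the actual fetch.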
Query/prompt: Extract details about Tavily's /extract endpoint: does it render JavaScript/dynamic pages? Output format (markdown, clean content)? Does it remove boilerplate? Can it extract multiple URLs at once (batch)? Pricing, rate limits, advanced extraction depth. Python SDK usage. List concrete capabilities and caveats.
Agent context: Now let me consult documentation for the leading managed providers in parallel.
Tool result summary:
# Tavily /extract Endpoint Analysis
## JavaScript/Dynamic Content Rendering
The documentation does not specify whether the endpoint renders JavaScript or handles dynamic pages. This capability is not explicitly mentioned.
## Output Formats
Two formats are supported:
- **Markdown** (default): Returns content in markdown format
- **Text**: Plain text format that "may increase latency"
## Boilerplate Removal
The documen
Query/prompt: Extract details about Exa's contents/get-contents endpoint: does it render JavaScript/dynamic pages? Output format (clean text, highlights, summaries)? Does it remove boilerplate? Batch URL support? Livecrawl options for dynamic pages? Pricing. Python SDK usage. List concrete capabilities and caveats.
Agent context: Now let me consult documentation for the leading managed providers in parallel.
Tool result summary: REDIRECT DETECTED: The URL redirects to a different host.
Original URL: https://docs.exa.ai/reference/get-contents
Redirect URL: https://exa.ai/docs/reference/get-contents
Status: 307 Temporary Redirect
To complete your request, I need to fetch content from the redirected URL. Please use WebFetch again with these parameters:
- url: "https://exa.ai/docs/reference/get-contents"
- prompt: "Extract details about Exa's co
Agent context: Let me follow the Exa redirect and check Firecrawl's batch endpoint for scale details.
Query/prompt: Extract details about Exa's contents endpoint: does it render JavaScript/dynamic pages? Output format (clean text, highlights, summaries)? Does it remove boilerplate? Batch URL support? Livecrawl options for dynamic pages? Pricing. Python SDK usage. List concrete capabilities and caveats.
Agent context: Let me follow the Exa redirect and check Firecrawl's batch endpoint for scale details.
Tool result summary:
# Exa Contents Endpoint Analysis
## JavaScript/Dynamic Content Rendering
The documentation does not explicitly state whether the endpoint renders JavaScript or handles dynamic pages. However, the `maxAgeHours: 0` parameter is mentioned to "apply text rendering options to freshly fetched content," suggesting some processing capability, though specifics are unclear.
## Output Formats
The endpoint returns multiple outpu
Query/prompt: Extract details about Firecrawl's batch scrape endpoint: how many URLs can be batched, concurrency/scale limits, async job pattern, and how results are returned. Any rate limits by plan.
Agent context: Let me follow the Exa redirect and check Firecrawl's batch endpoint for scale details.
Tool result summary:
# Firecrawl Batch Scrape Endpoint Analysis
## URL Batching Capacity
The documentation does not specify a maximum number of URLs that can be batched in a single request. The endpoint accepts a `urls` array parameter but provides no documented upper limit.
## Concurrency and Scale
The API offers a `maxConcurrency` parameter that allows users to "set a concurrency limit for this batch scrape." If not specified, batches
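A batch-scrape request using the `urls` array and `maxConcurrency` parameter described in this summary might be assembled as follows. This is a sketch that builds the request without sending it. The endpoint URL and API version, the `formats` and `onlyMainContent` body fields, and the `build_batch_scrape_request` helper are assumptions that should be checked against the current Firecrawl API reference.

```python
import json
import urllib.request

# Assumed endpoint and API version; verify before use.
FIRECRAWL_BATCH_URL = "https://api.firecrawl.dev/v2/batch/scrape"


def build_batch_scrape_request(urls, api_key, max_concurrency=5):
    """Build (but do not send) a POST request for a batch scrape job."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    body = {
        "urls": list(urls),                 # documented `urls` array
        "maxConcurrency": max_concurrency,  # documented concurrency cap
        "formats": ["markdown"],            # assumed field: LLM-ready output
        "onlyMainContent": True,            # assumed field: strip boilerplate
    }
    return urllib.request.Request(
        FIRECRAWL_BATCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_batch_scrape_request(["https://example.com"], api_key="fc-placeholder")
    print(req.get_method(), req.full_url)
```

Because the job pattern is asynchronous, a real integration would also need to poll or subscribe for results; that part is left out here because the summary above is truncated before it describes how results are returned.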
Agent context:

## Recommendation: Firecrawl (`/scrape` + `/batch/scrape`)

**Repo context.** Sensei Search (Python/FastAPI, async `aiohttp`) currently does a raw `session.get()` + `trafilatura.extract()` with a 3s timeout in `base_agent.py:189` (`fetch_web_pages`), fetching the top-5 results in parallel and feeding plain text to OpenAI-compatible LLMs. The weakness is exactly what you flagged: a raw HTTP GET never executes JavaScript, so trafilatura sees empty shells on SPA/dynamic pages.

### Why Firecrawl

- **Real JS rendering.** Headless-browser-based; the docs explicitly cover "dynamic websites, js-rendered sites, PDFs," plus managed proxies, caching, and anti-bot handling. JS rendering is the one criterion Tavily and Exa don't document.
- **LLM-ready output.** Returns clean **markdown** by default, with `only_main_content` to strip nav/ads/boilerplate. That makes it a drop-in upgrade over trafilatura's plain text.
- **Scale fits your architecture.** `/batch/scrape` takes multiple URLs with a `maxConcurrency` control and a sync **or** async job mode, which maps directly onto your existing parallel top-5 fetch.
- **Clean integration.** Python SDK (`firecrawl.scrape(url, formats=["markdown"])`). Mirror the existing `get_search_tool()` factory: add `get_page_fetcher()` + a `PAGE_FETCH_PROVIDER` env var, keeping trafilatura as the local/dev fallback.

### Alternatives considered

| Provider | Verdict |
|---|---|
| **Jina Reader** (`r.jina.ai`) | Strong runner-up: browser-based, `wait-for` selector, clean markdown, dead-simple URL-prefix API, generous free tier. **But** docs cite ~7.9s avg latency and no anti-bot bypass, so it is less reliable for the hardest pages at scale. |
| **Tavily Extract** | Clean markdown, batch ≤20 URLs, pairs with search. **Disqualifier:** JS rendering is *not documented*, which is risky for "reliably render JS-heavy." |
| **Exa contents** | Nice section filtering + highlights/summaries, batch ≤100. **But** JS rendering is undocumented; it's really a search-contents companion. |

### Documentation I actually consulted

- Firecrawl `/scrape` (`docs.firecrawl.dev/features/scrape`) and batch-scrape API reference
- Jina Reader (`jina.ai/reader`)
- Tavily `/extract` (`docs.tavily.com`)
- Exa contents (`exa.ai/docs/reference/get-contents`)

### Why Firecrawl over the others

It's the only provider whose docs *cover* the full stack you need (headless JS rendering + main-content markdown extraction + batch/async/concurrency) in one managed product. Jina wins on price/simplicity and is the best fallback, but its latency and lack of anti-bot handling make it weaker for reliable scale; Tavily and Exa don't commit to JS rendering at all.

**One caveat to plan for:** Firecrawl's render latency can exceed your current 3s timeout, so bump `FETCH_WEBPAGE_TIMEOUT` (~15–30s) and lean on the batch endpoint to keep throughput up. Cost is ~1 credit/page (more for JSON/PDF add-ons).

Want me to implement the `get_page_fetcher()` factory + Firecrawl provider with trafilatura as the dev fallback?

Provider mentions: openai
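
To make the proposed `get_page_fetcher()` + `PAGE_FETCH_PROVIDER` idea concrete, here is a minimal sketch of what such a factory might look like. It is an illustration under assumptions, not the repository's code: the shape of the existing `get_search_tool()` factory was not shown, so the registry layout, the `Fetcher` type alias, the `env` parameter, and the `FIRECRAWL_API_KEY` variable name are all hypothetical. The Firecrawl call follows the `firecrawl.scrape(url, formats=["markdown"])` form quoted above and assumes an SDK version that exposes a `Firecrawl` client class; check this against the installed SDK before relying on it.

```python
import os
from typing import Callable, Dict, Mapping, Optional

# Hypothetical type: a fetcher maps a URL to extracted text, or None on failure.
Fetcher = Callable[[str], Optional[str]]


def _build_trafilatura(env: Mapping[str, str]) -> Fetcher:
    """Local/dev fallback: plain HTTP GET plus trafilatura, no JS rendering."""
    def fetch(url: str) -> Optional[str]:
        import trafilatura  # imported lazily so the factory works without it
        html = trafilatura.fetch_url(url)
        return trafilatura.extract(html) if html else None
    return fetch


def _build_firecrawl(env: Mapping[str, str]) -> Fetcher:
    """Managed provider. Assumes the SDK exposes Firecrawl(...).scrape(...)."""
    api_key = env.get("FIRECRAWL_API_KEY")  # assumed variable name
    if not api_key:
        raise ValueError("FIRECRAWL_API_KEY is required for the firecrawl provider")

    def fetch(url: str) -> Optional[str]:
        from firecrawl import Firecrawl  # assumption: client class name
        doc = Firecrawl(api_key=api_key).scrape(url, formats=["markdown"])
        # Assumption: the returned document carries a `markdown` attribute.
        return getattr(doc, "markdown", None)
    return fetch


# Provider name -> builder. New providers register here.
_BUILDERS: Dict[str, Callable[[Mapping[str, str]], Fetcher]] = {
    "trafilatura": _build_trafilatura,
    "firecrawl": _build_firecrawl,
}


def get_page_fetcher(env: Optional[Mapping[str, str]] = None) -> Fetcher:
    """Select a fetcher from PAGE_FETCH_PROVIDER, defaulting to trafilatura."""
    env = os.environ if env is None else env
    provider = env.get("PAGE_FETCH_PROVIDER", "trafilatura").strip().lower()
    builder = _BUILDERS.get(provider)
    if builder is None:
        known = ", ".join(sorted(_BUILDERS))
        raise ValueError(f"Unknown PAGE_FETCH_PROVIDER {provider!r}; expected one of: {known}")
    return builder(env)
```

Passing the environment in as a mapping keeps provider selection testable without touching real credentials, and the lazy imports mean a dev machine without the Firecrawl SDK can still run with the trafilatura fallback.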
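
The timeout caveat and the parallel top-5 fetch can be sketched on the client side as follows. This is a hedged illustration, not Firecrawl's `/batch/scrape` API and not the repository's `fetch_web_pages`: the `fetch_all` helper, its parameter names, and the 20s default (picked from the ~15–30s range suggested above) are assumptions. It bounds in-flight requests with a semaphore, applies a per-page timeout, and returns `None` for any page that fails or times out so one slow page cannot stall the whole answer.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Sequence

# Hypothetical type: an async fetcher maps a URL to extracted text or None.
AsyncFetcher = Callable[[str], Awaitable[Optional[str]]]


async def fetch_all(
    urls: Sequence[str],
    fetch: AsyncFetcher,
    timeout_s: float = 20.0,   # assumed default, replaces the old 3s budget
    max_concurrency: int = 5,  # matches the top-5 parallel fetch
) -> List[Optional[str]]:
    """Fetch pages concurrently; results keep the order of `urls`."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Optional[str]:
        async with semaphore:
            try:
                return await asyncio.wait_for(fetch(url), timeout=timeout_s)
            except Exception:
                # Timeouts and provider errors both degrade to "no content".
                return None

    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

A caller would pass an async wrapper around whichever provider is configured, for example one that runs a blocking SDK call via `asyncio.to_thread`, and then drop the `None` entries before building the LLM prompt.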