# What is the suggested way to do web scraping from within a server function?
Hi @wet elm. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
Convex doesn’t have a single “scraping API”, but it does support doing web scraping from actions, and there are examples that show the recommended patterns.
1. Use an action (not a query/mutation)
Network requests like fetch are only allowed in actions, not in queries or mutations. Scraping examples in the Convex docs and blog are implemented as actions that:
- fetch HTML from a URL
- Parse it (cheerio or custom HTML → text/markdown)
- Store results via ctx.runMutation, or via ctx.db inside an internal mutation. [Server functions; Runtimes]
Example pattern (scraping one page in an action):
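// Import paths assume this file lives in convex/ingest/; adjust to your layout.
import { v } from "convex/values";
import { internal } from "../_generated/api";
import { internalAction } from "../_generated/server";

// parsePage is a helper defined elsewhere that turns raw HTML into plain text.
export const fetchSingle = internalAction({
  args: { url: v.string() },
  handler: async (ctx, { url }) => {
    const response = await fetch(url);
    const text = parsePage(await response.text());
    // Only store pages that produced some text.
    if (text.length > 0) {
      await ctx.runMutation(internal.ingest.load.updateDocument, { url, text });
    }
  },
});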
And the mutation does the database work (chunking, etc.). [Vector search ingest]
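Neither parsePage nor the chunking step is shown in this thread. As a rough, dependency-free sketch of what they might look like: both function bodies, the chunkText name, and the 1000-character default are assumptions for illustration, not the code from the referenced example, and a real HTML parser such as cheerio will be more robust than regular expressions.

```typescript
// Hypothetical stand-in for the parsePage helper used above:
// strips script/style blocks and tags, then collapses whitespace.
function parsePage(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script/style blocks and their contents
    .replace(/<[^>]+>/g, " ") // drop all remaining tags
    .replace(/&nbsp;/g, " ") // decode two common entities only
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical chunker: groups whitespace-separated words into chunks of
// at most maxChars characters. A single word longer than maxChars becomes
// its own (oversized) chunk.
function chunkText(text: string, maxChars = 1000): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const word of text.split(/\s+/).filter(Boolean)) {
    // Start a new chunk if adding " " + word would exceed the limit.
    if (current.length > 0 && current.length + 1 + word.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current.length > 0 ? `${current} ${word}` : word;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

In this arrangement the action would call parsePage, and the mutation would call something like chunkText before writing each chunk to the database.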
Another example uses cheerio to scrape GitHub pages inside an action: [GitHub scraping]
import * as cheerio from "cheerio";

const getGithubRepoPageData = async (owner: string, name: string) => {
  const html = await fetch(`https://github.com/${owner}/${name}`).then(res => res.text());
  const $ = cheerio.load(html);
  // ...select & parse counts...
};
2. Prefer helpers over ctx.runAction when staying in one runtime
If your scraping logic lives entirely in actions using the same runtime, Convex recommends putting the scraping logic into plain TypeScript helper functions and calling them directly, rather than chaining ctx.runAction calls. runAction should mainly be used when you actually need to cross runtimes (Convex ↔ Node). [Use runAction only when using a different runtime]
Pattern:
- Action A:
  - Uses fetch + parsing helpers
  - Calls ctx.runMutation(...) to store results
- Shared helper scrapeSinglePage(ctx, { url }) holds the scraping logic and is called directly from actions.
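The helper pattern can be sketched in plain TypeScript, without Convex itself, so that it runs on its own. Everything except the scrapeSinglePage(ctx, { url }) shape is an assumption made for illustration: MinimalActionCtx and MutationRef are stand-ins for Convex's real context and function-reference types, the "ingest/load:updateDocument" string is a placeholder, scrapeManyHandler is a hypothetical action body, and the injectable fetchText parameter exists only to make the helper testable.

```typescript
// Stand-ins so the sketch is self-contained. In a Convex project the context
// would be the generated ActionCtx type and the reference would come from the
// generated internal API object rather than a string.
type MutationRef = string;
type MinimalActionCtx = {
  runMutation: (ref: MutationRef, args: { url: string; text: string }) => Promise<void>;
};
type FetchText = (url: string) => Promise<string>;

// Default fetcher: uses the global fetch and fails loudly on non-2xx responses.
const defaultFetchText: FetchText = async (url) => {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Fetch failed for ${url}: ${response.status}`);
  return response.text();
};

// Shared helper: a plain async function that holds the scraping logic.
// Returns the length of the extracted text (0 means nothing was stored).
async function scrapeSinglePage(
  ctx: MinimalActionCtx,
  { url }: { url: string },
  fetchText: FetchText = defaultFetchText,
): Promise<number> {
  const html = await fetchText(url);
  // Very rough HTML -> text conversion; a real parser would be more robust.
  const text = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  if (text.length > 0) {
    await ctx.runMutation("ingest/load:updateDocument", { url, text });
  }
  return text.length;
}

// Body of a hypothetical "Action A": it calls the helper directly for each
// URL instead of chaining ctx.runAction calls, and counts stored pages.
async function scrapeManyHandler(
  ctx: MinimalActionCtx,
  { urls }: { urls: string[] },
  fetchText: FetchText = defaultFetchText,
): Promise<number> {
  let stored = 0;
  for (const url of urls) {
    if ((await scrapeSinglePage(ctx, { url }, fetchText)) > 0) stored++;
  }
  return stored;
}
```

Because the helper is an ordinary function, several actions in the same runtime can share it without paying for an extra function call through ctx.runAction.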
3. Reasonableness / limits
The docs and Stack posts explicitly use Convex actions for scraping websites and sitemaps, so this is a supported and reasonable pattern, as long as:
- It’s done from actions (not queries/mutations).
- You keep work per action within Convex’s function resource limits (e.g. batch or paginate, as shown in the GitHub scraping tutorial). [Dealing with flaky data]
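One simple way to bound the work done per action is to split the list of URLs into fixed-size batches and process each batch in its own invocation. The sketch below shows only the batching step; the toBatches name, the example URLs, and the batch size of 2 are arbitrary assumptions, and the actual Convex limits are not stated in this thread.

```typescript
// Splits a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Example: five hypothetical URLs, at most two per action invocation.
const exampleUrls = [
  "https://example.com/1",
  "https://example.com/2",
  "https://example.com/3",
  "https://example.com/4",
  "https://example.com/5",
];
const exampleBatches = toBatches(exampleUrls, 2);
// exampleBatches now holds three batches of sizes 2, 2, and 1.
// In a Convex action, each batch could then be handed to a separate internal
// action, for example by scheduling it with ctx.scheduler.runAfter.
```

Smaller batches mean more invocations but less risk that any single action runs into a time or memory limit.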