`crawl` results in `waiting` but `scrape` works | Firecrawl | Page 1

broken warren Jun 5, 2024, 12:09 AM

#

Hello, when running locally, I'm able to scrape, using curl, successfully.

However, if I try the crawl endpoint it results in a job that is constantly waiting.

Is this because it depends on scrapingbee?

I do see the following log which may be relevant:

Corepack is about to download https://registry.npmjs.org/pnpm/-/pnpm-9.1.4.tgz

> [email protected] start:production /app
> tsc && node dist/src/index.js

Authentication is disabled. Supabase client will not be initialized.
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Web scraper queue created
Server listening on port 3002
For the UI, open http://0.0.0.0:3002/admin//queues

1. Make sure Redis is running on port 6379 by default
2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002 
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
Attempted to access Supabase client when it's not configured.
Error logging crawl job:
 Error: Supabase client is not configured.
    at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
    at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
    at crawlController (/app/dist/src/controllers/crawl.js:87:40)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
[Playwright] Error fetching url: https://www.stuff.co.nz/ with status: 404
Falling back to fetch
WARNING - You're bypassing authentication

The 404 status seems misleading as the same url works from the scrape endpoint.

analog monolith Jun 5, 2024, 2:29 AM

#

@broken warren you need to run the workers separately. Try doing npm run workers in a separate terminal.

broken warren Jun 5, 2024, 2:30 AM

#

Thanks @analog monolith - if I'm using docker compose up to start, which container should I run this command in?

analog monolith Jun 5, 2024, 2:32 AM

#

Oh I see. I think it should have automatically handled that for you. CCing @spare lance, who can probably help you better in this area.

spare lance Jun 5, 2024, 12:03 PM

#

@broken warren you should have a worker container running automatically when you run docker compose at root. Can you confirm if this container is running? You can use docker ps to check
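
A minimal sketch of that check in Python, assuming Compose v2 container names of the form `<project>-<service>-<index>` (for example `firecrawl-worker-1`) and a project name without hyphens. The helper names are hypothetical; only `docker ps` itself comes from the thread.

```python
import subprocess


def running_services(ps_output: str) -> set[str]:
    """Extract service names from `docker ps --format '{{.Names}}'` output."""
    services = set()
    for line in ps_output.splitlines():
        parts = line.strip().split("-")
        # Assumed naming: <project>-<service>-<index>, project without hyphens.
        if len(parts) >= 3 and parts[-1].isdigit():
            services.add("-".join(parts[1:-1]))
    return services


def missing_services(ps_output: str, required=("api", "worker", "redis")) -> list[str]:
    """Return the required services that have no running container."""
    return sorted(set(required) - running_services(ps_output))


def check_compose_stack() -> list[str]:
    """Ask the local Docker daemon which required services are missing."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return missing_services(result.stdout)
```

If `check_compose_stack()` returns `["worker"]`, crawl jobs would be queued by the api container but nothing would pick them up, which matches the `waiting` symptom.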

broken warren Jun 25, 2024, 11:12 PM

#

Hi @spare lance - yes, I do see the worker container running.
I also see this in the api container: api-1 | Worker 73 listening on port 3002

Since I am running docker compose up without -d, I'm seeing all output from the running containers.

The last log output I see from the worker container is:

worker-1              | Web scraper queue created
worker-1              | Connected to Redis Session Store!

It seems as if it never gets a queue message from the API.

When I send a crawl request, this is the only thing logged:

Error logging crawl job:
api-1                 |  Error: Supabase client is not configured.
api-1                 |     at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
api-1                 |     at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
api-1                 |     at crawlController (/app/dist/src/controllers/crawl.js:92:40)
api-1                 |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

eternal kelp Jul 1, 2024, 11:02 AM

#

I also encountered this problem

soft forge Jul 12, 2024, 1:07 PM

#

i am also running into this problem

analog monolith Jul 12, 2024, 2:38 PM

#

Hey yall, quick update #🧡┃firecrawl-status message

empty rune Jul 24, 2024, 3:14 PM

#

analog monolith Hey yall, quick update https://discord.com/channels/1226707384710332458/12613302...

I'm also running into this problem... scraping works but crawling keeps timing out. The update you sent is only for the people using the API key, right? Not for self-hosting?

analog monolith Jul 24, 2024, 3:18 PM

#

empty rune Im also running into this problem... scraping works but crawling keeps timing ou...

Correct. Are you having this issue while self-hosting?

empty rune Jul 24, 2024, 3:18 PM

#

Yes I am, I'm trying to do it via Docker now

analog monolith Jul 24, 2024, 3:18 PM

#

Make sure that you are running the workers

empty rune Jul 24, 2024, 3:18 PM

#

analog monolith Make sure that you are running the workers

MaxRetriesPerRequestError: Reached the max retries per request limit (which is 20). Refer to "maxRetriesPerRequest" option for details.
    at Socket.<anonymous> (C:\Users\Patrick\Desktop\firecrawl-main\apps\api\node_modules\.pnpm\[email protected]\node_modules\ioredis\built\redis\event_handler.js:182:37)
    at Object.onceWrapper (node:events:633:26)
    at Socket.emit (node:events:518:28)
    at Socket.emit (node:domain:488:12)
    at TCP.<anonymous> (node:net:337:12)

analog monolith Jul 24, 2024, 3:18 PM

#

Got it. If you are running it manually, run npm run start and npm run workers in separate terminals.

#

Hmm

#

Do you have Redis running?
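
One way to answer that question without extra dependencies is to open a TCP connection and send a `PING`. This is a rough sketch using only the standard library; the function name is hypothetical, and the default host and port assume Redis on the local machine at 6379 (inside the compose network the host would be `redis`).

```python
import socket


def redis_ping(host: str = "127.0.0.1", port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if whatever listens on host:port answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # inline-command form of the Redis protocol
            reply = conn.recv(64)
    except OSError:
        # Connection refused, timeout, unreachable host, and similar failures.
        return False
    return reply.startswith(b"+PONG")
```

A `False` result is consistent with the `MaxRetriesPerRequestError` above, since ioredis raises it after repeatedly failing to reach the server. A password-protected Redis would also return `False` here, because it replies with an error instead of `+PONG`.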

empty rune Jul 24, 2024, 3:19 PM

#

analog monolith Got it, if you are manually doing it do npm run start and npm run workers on sep...

Yes, I did that and had all the workers online

#

Quick question in between: I got it into Docker Desktop now, but which .env is it using? The one from the folder where I composed it?

analog monolith Jul 24, 2024, 3:23 PM

#

Gotcha

#

I believe it should be using it from the apps/api folder

empty rune Jul 24, 2024, 3:24 PM

#

when i docker-compose up it uses this:

name: firecrawl
version: '3.9'

x-common-service: &common-service
  build: apps/api
  networks:
    - backend
  environment:
    - REDIS_URL=${REDIS_URL:-redis://redis:6379}
    - REDIS_RATE_LIMIT_URL=${REDIS_URL:-redis://redis:6379}
    - PLAYWRIGHT_MICROSERVICE_URL=${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
    - USE_DB_AUTHENTICATION=${USE_DB_AUTHENTICATION}
    - PORT=${PORT:-3002}
    - NUM_WORKERS_PER_QUEUE=${NUM_WORKERS_PER_QUEUE}
    - OPENAI_API_KEY=${OPENAI_API_KEY}
    - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
    - SERPER_API_KEY=${SERPER_API_KEY}
    - LLAMAPARSE_API_KEY=${LLAMAPARSE_API_KEY}
    - LOGTAIL_KEY=${LOGTAIL_KEY}
    - BULL_AUTH_KEY=${BULL_AUTH_KEY}
    - TEST_API_KEY=${TEST_API_KEY}
    - POSTHOG_API_KEY=${POSTHOG_API_KEY}
    - POSTHOG_HOST=${POSTHOG_HOST}
    - SUPABASE_ANON_TOKEN=${SUPABASE_ANON_TOKEN}
    - SUPABASE_URL=${SUPABASE_URL}
    - SUPABASE_SERVICE_TOKEN=${SUPABASE_SERVICE_TOKEN}
    - SCRAPING_BEE_API_KEY=${SCRAPING_BEE_API_KEY}
    - HOST=${HOST:-0.0.0.0}
    - SELF_HOSTED_WEBHOOK_URL=${SELF_HOSTED_WEBHOOK_URL}
  extra_hosts:
    - "host.docker.internal:host-gateway"

services:
  playwright-service:
    build: apps/playwright-service
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - PROXY_USERNAME=${PROXY_USERNAME}
      - PROXY_PASSWORD=${PROXY_PASSWORD}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
    networks:
      - backend

  api:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
    ports:
      - "3002:3002"
    command: [ "pnpm", "run", "start:production" ]

  worker:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
      - api
    command: [ "pnpm", "run", "workers" ]

  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0

networks:
  backend:
    driver: bridge

And so it creates another env
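
For reference, the `${NAME}` and `${NAME:-default}` entries in the compose file above are filled in by Compose itself, by default from the shell environment and a `.env` file in the directory where the command is run. The sketch below imitates just those two forms in Python to show what each service would end up with; `interpolate` is a hypothetical helper, not part of Docker.

```python
import re

# Matches ${NAME} and ${NAME:-default}; other Compose forms are not handled.
_VAR = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}")


def interpolate(value: str, env: dict[str, str]) -> str:
    """Substitute variables the way the two forms used above behave."""
    def _sub(match: re.Match) -> str:
        current = env.get(match.group("name"), "")
        default = match.group("default")
        # ":-" falls back to the default when the variable is unset OR empty.
        if current == "" and default is not None:
            return default
        return current  # a plain ${NAME} that is unset becomes an empty string
    return _VAR.sub(_sub, value)
```

For example, `interpolate("${REDIS_URL:-redis://redis:6379}", {})` yields `redis://redis:6379`, while `interpolate("${USE_DB_AUTHENTICATION}", {})` yields an empty string. So if the `.env` file is not found, variables without a default are passed into the containers empty.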

#

i just cant seem to find my way in docker desktop hahah

#

Okay, so when I run docker-compose config it prints my .env file and that's all correct. It states USE_DB_AUTHENTICATION: "false"
But in Docker Desktop it shows this in the logs

spare lance Jul 24, 2024, 3:40 PM

#

hey @empty rune ! This looks like a warning message, are you able to run crawl or scrape?

empty rune Jul 24, 2024, 3:40 PM

#

I tried opening the workers and queue in 2 separate cmds. When I tried a POST via Python the scrape worked but the crawl keeps timing out

empty rune Jul 24, 2024, 3:41 PM

#

empty rune Okay so when i run docker-compose config it prints me my .env file and thats all...

I'm trying to compose it into Docker Desktop to see if it works there, but for some reason it takes a different env

#

Okay, so I got it to work in Docker now by deleting and re-composing it! For some reason when using cmd I needed to rename .env.local to .env, and in Docker it probably needed the .local

spare lance Jul 24, 2024, 3:44 PM

#

oh ok! Does crawl work now?

empty rune Jul 24, 2024, 3:45 PM

#

Crawl now gives a jobId, which it never has before, so that's a step forward!

However i now get this:

📎 message.txt

spare lance Jul 24, 2024, 3:46 PM

#

oh I think there's a bug there

#

1 sec

empty rune Jul 24, 2024, 3:47 PM

#

alright

#

Okay so despite the error it does work

#

i can retrieve it by job id
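
The submit-then-retrieve flow described here can be sketched roughly as follows. The `jobId` field and the `waiting` state come from this thread; the `/v0/crawl` and `/v0/crawl/status/<id>` paths and the `completed`/`failed` states are assumptions about the self-hosted API at the time, so check them against your checkout.

```python
import json
import time
import urllib.request


def crawl_request(base_url: str, target_url: str) -> urllib.request.Request:
    """Build the POST that enqueues a crawl. The /v0/crawl path is an assumption."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v0/crawl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(base_url: str, job_id: str) -> str:
    """URL polled for job state. The /v0/crawl/status/<id> path is an assumption."""
    return f"{base_url}/v0/crawl/status/{job_id}"


def wait_for_crawl(base_url: str, target_url: str,
                   poll_seconds: float = 2.0, max_polls: int = 30) -> dict:
    """Submit a crawl, then poll until it finishes or we give up."""
    with urllib.request.urlopen(crawl_request(base_url, target_url)) as resp:
        job_id = json.load(resp)["jobId"]  # field name as reported in this thread
    last = None
    for _ in range(max_polls):
        with urllib.request.urlopen(status_url(base_url, job_id)) as resp:
            last = json.load(resp)
        # "completed" and "failed" are assumed terminal states.
        if last.get("status") in ("completed", "failed"):
            return last
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish; last response: {last}")
```

With the compose file above, `wait_for_crawl("http://localhost:3002", "https://example.com")` would be the call to try. A job that never leaves `waiting` ends in the `TimeoutError`, which usually points at the workers or Redis rather than the crawl itself.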

spare lance Jul 24, 2024, 3:54 PM

#

I just pushed a fix for this error

#

you can update your firecrawl repo to pick up the fix

spare lance Jul 24, 2024, 3:55 PM

#

empty rune Okay so despite the error it does work

awesome

empty rune Jul 24, 2024, 3:56 PM

#

Thanks so much! If I find any more errors/bugs I will let you know!

#

Really cool what you guys are working on!

analog monolith Jul 24, 2024, 3:57 PM

#

Awesome that it worked! And thank you! 🔥

empty rune Jul 25, 2024, 1:25 PM

#

Question, how do I use llm extract locally? And are these inputs correct for Python?

#

or is llm extract only for scrape function?

#

I see that you need an API key for llm extraction, which seems logical because this is run on your network.

So I'm trying out the markdown function, but for some reason the markdown is the same as the output...

empty rune Jul 25, 2024, 2:24 PM

#

And it also gives markdown with the standard crawl function

spare lance Jul 26, 2024, 2:28 PM

#

Hey @empty rune , to use llm extract, you need to set up the extractorOptions parameter when using the scrape functions. Also, if you're using it self-hosted, you'll need to configure the OPENAI_API_KEY in your .env file.
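
As a rough illustration of that advice, a scrape request body with `extractorOptions` might be built like this. Only the `extractorOptions` parameter and `OPENAI_API_KEY` are named in the thread; the inner field names (`mode`, `extractionPrompt`, `extractionSchema`) and the helper itself are assumptions to verify against the API docs for your version.

```python
import json


def llm_extract_payload(url: str, prompt: str, schema: dict) -> dict:
    """Build a scrape request body that asks for LLM extraction."""
    return {
        "url": url,
        "extractorOptions": {
            "mode": "llm-extraction",     # assumed mode name
            "extractionPrompt": prompt,   # assumed field name
            "extractionSchema": schema,   # assumed field name (JSON Schema)
        },
    }


# Hypothetical schema: pull a single title string out of the page.
example_schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}
example_body = json.dumps(
    llm_extract_payload("https://example.com", "Extract the page title.", example_schema)
)
```

The resulting JSON would be sent to the scrape endpoint, not the crawl endpoint, and a self-hosted instance would need `OPENAI_API_KEY` set in its `.env` for the extraction step to run.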

empty rune Jul 27, 2024, 12:27 PM

#

spare lance Hey @empty rune , to use llm extract, you need to set up the extractor...

I'm trying to keep away from using a paid LLM like OpenAI. Is there any way to use a self-hosted LLM like Llama for llm extract?

spare lance Jul 29, 2024, 12:58 PM

#

empty rune Im trying to keep away from using paid llm like OpenAI. Is there any way to use ...

we have one open PR for that. It's still under review, though.
