# `crawl` results in `waiting` but `scrape` works

broken warren
#

Hello, when running locally, I'm able to scrape, using curl, successfully.

However, if I try the crawl endpoint it results in a job that is constantly waiting.

Is this because it depends on scrapingbee?

I do see the following log which may be relevant:

Corepack is about to download https://registry.npmjs.org/pnpm/-/pnpm-9.1.4.tgz

> [email protected] start:production /app
> tsc && node dist/src/index.js

Authentication is disabled. Supabase client will not be initialized.
POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more.
Web scraper queue created
Server listening on port 3002
For the UI, open http://0.0.0.0:3002/admin//queues

1. Make sure Redis is running on port 6379 by default
2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002 
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
Attempted to access Supabase client when it's not configured.
Error logging crawl job:
 Error: Supabase client is not configured.
    at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
    at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
    at crawlController (/app/dist/src/controllers/crawl.js:87:40)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
WARNING - You're bypassing authentication
WARNING - You're bypassing authentication
[Playwright] Error fetching url: https://www.stuff.co.nz/ with status: 404
Falling back to fetch
WARNING - You're bypassing authentication

The 404 status seems misleading as the same url works from the scrape endpoint.

analog monolith
#

@broken warren you need to run the workers separately. Try running npm run workers in a separate terminal.

broken warren
#

Thanks @analog monolith - if I'm using docker compose up to start, which container should I run this command in?

analog monolith
#

Oh I see. I think it should have handled that for you automatically. cc'ing @spare lance, who can probably help you better in this area.

spare lance
#

@broken warren you should have a worker container running automatically when you run docker compose at root. Can you confirm if this container is running? You can use docker ps to check
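
If you would rather script that check, here is a minimal Python sketch. It assumes the worker container's name contains the word "worker" (a naming guess, not something confirmed here), and both helper names are made up for illustration.

```python
import subprocess


def running_container_names() -> list[str]:
    """Return the names of running containers, as reported by `docker ps`."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def has_worker(names: list[str], keyword: str = "worker") -> bool:
    """True if any container name contains the keyword (assumed naming convention)."""
    return any(keyword in name for name in names)
```

Calling has_worker(running_container_names()) should return True when the worker container is up. A False result only means no running container matched the assumed name.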

broken warren
#

Hi @spare lance - yes, I do see the worker container running.
I also see this in the api container: api-1 | Worker 73 listening on port 3002

As I am running docker compose up without -d, I'm seeing all output from the running containers.

The last log output I see from the worker container is:

worker-1              | Web scraper queue created
worker-1              | Connected to Redis Session Store!

It seems as if it never gets a queue message from the API.

When I send a crawl request, this is the only thing logged:

Error logging crawl job:
api-1                 |  Error: Supabase client is not configured.
api-1                 |     at Proxy.<anonymous> (/app/dist/src/services/supabase.js:38:23)
api-1                 |     at logCrawl (/app/dist/src/services/logging/crawl_log.js:9:14)
api-1                 |     at crawlController (/app/dist/src/controllers/crawl.js:92:40)
api-1                 |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

eternal kelp
#

I also encountered this problem

soft forge
#

i am also running into this problem

analog monolith
#

Hey yall, quick update #🧡┃firecrawl-status message

empty rune
#

Yes i am, im trying to do it via docker now

analog monolith
#

Make sure that you are running the workers

empty rune
#

MaxRetriesPerRequestError: Reached the max retries per request limit (which is 20). Refer to "maxRetriesPerRequest" option for details.
    at Socket.<anonymous> (C:\Users\Patrick\Desktop\firecrawl-main\apps\api\node_modules\.pnpm\[email protected]\node_modules\ioredis\built\redis\event_handler.js:182:37)
    at Object.onceWrapper (node:events:633:26)
    at Socket.emit (node:events:518:28)
    at Socket.emit (node:domain:488:12)
    at TCP.<anonymous> (node:net:337:12)

analog monolith
#

Got it. If you are running it manually, run npm run start and npm run workers in separate terminals.

#

Hmm

#

Do you have Redis running?
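
A MaxRetriesPerRequestError from ioredis usually means the client could not reach Redis at all. One quick way to rule that out is a plain TCP check. This is a rough sketch that assumes Redis is on localhost:6379 (the default port mentioned in the startup log); can_connect is a made-up helper name.

```python
import socket


def can_connect(host: str = "localhost", port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

This only shows that something is listening on the port, not that it is actually Redis, so treat a True result as a first sanity check.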

empty rune
#

Quick question in between: I got it running in Docker Desktop now, but which .env is it using? The one from the folder where I ran docker from?

analog monolith
#

Gotcha

#

I believe it should be using it from the apps/api folder

empty rune
#

When I run docker-compose up, it uses this:

name: firecrawl
version: '3.9'

x-common-service: &common-service
  build: apps/api
  networks:
    - backend
  environment:
    - REDIS_URL=${REDIS_URL:-redis://redis:6379}
    - REDIS_RATE_LIMIT_URL=${REDIS_URL:-redis://redis:6379}
    - PLAYWRIGHT_MICROSERVICE_URL=${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000}
    - USE_DB_AUTHENTICATION=${USE_DB_AUTHENTICATION}
    - PORT=${PORT:-3002}
    - NUM_WORKERS_PER_QUEUE=${NUM_WORKERS_PER_QUEUE}
    - OPENAI_API_KEY=${OPENAI_API_KEY}
    - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
    - SERPER_API_KEY=${SERPER_API_KEY}
    - LLAMAPARSE_API_KEY=${LLAMAPARSE_API_KEY}
    - LOGTAIL_KEY=${LOGTAIL_KEY}
    - BULL_AUTH_KEY=${BULL_AUTH_KEY}
    - TEST_API_KEY=${TEST_API_KEY}
    - POSTHOG_API_KEY=${POSTHOG_API_KEY}
    - POSTHOG_HOST=${POSTHOG_HOST}
    - SUPABASE_ANON_TOKEN=${SUPABASE_ANON_TOKEN}
    - SUPABASE_URL=${SUPABASE_URL}
    - SUPABASE_SERVICE_TOKEN=${SUPABASE_SERVICE_TOKEN}
    - SCRAPING_BEE_API_KEY=${SCRAPING_BEE_API_KEY}
    - HOST=${HOST:-0.0.0.0}
    - SELF_HOSTED_WEBHOOK_URL=${SELF_HOSTED_WEBHOOK_URL}
  extra_hosts:
    - "host.docker.internal:host-gateway"

services:
  playwright-service:
    build: apps/playwright-service
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - PROXY_USERNAME=${PROXY_USERNAME}
      - PROXY_PASSWORD=${PROXY_PASSWORD}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
    networks:
      - backend

  api:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
    ports:
      - "3002:3002"
    command: [ "pnpm", "run", "start:production" ]

  worker:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
      - api
    command: [ "pnpm", "run", "workers" ]

  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0

networks:
  backend:
    driver: bridge

And so it creates another env

#

i just cant seem to find my way in docker desktop hahah

#

Okay, so when I run docker-compose config it prints the values from my .env file and they're all correct. It states USE_DB_AUTHENTICATION: "false".
But Docker Desktop shows this in the logs

spare lance
#

hey @empty rune ! This looks like a warning message, are you able to run crawl or scrape?

empty rune
#

I tried opening the workers and queue in 2 separate cmds. When I tried a POST via Python, the scrape worked but the crawl keeps timing out.

empty rune
#

Okay so i got it to work in docker now by deleting and re-composing it! For some reason when using cmd i needed to change .env.local to .env and in docker it probably needed the .local

spare lance
#

Oh ok! Does crawl work now?

empty rune
#

Crawl now gives a jobId, which it never did before, so that's a step forward!

However i now get this:

spare lance
#

oh I think there's a bug there

#

1 sec

empty rune
#

alright

#

Okay so despite the error it does work

#

i can retrieve it by job id
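
For reference, the submit-then-poll flow can be sketched in Python roughly like this. Everything beyond the jobId field is an assumption: the base URL (port 3002 from the startup log), the /v0/crawl and /v0/crawl/status/<jobId> paths, and the status values. All function names are made up for illustration.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:3002"  # assumption: local instance on the default port


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def get_json(url: str) -> dict:
    """GET a URL and decode the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def start_crawl(target: str, post=post_json) -> str:
    """Submit a crawl and return its job id (response field assumed to be jobId)."""
    return post(f"{BASE_URL}/v0/crawl", {"url": target})["jobId"]


def wait_for_crawl(job_id: str, get=get_json, interval: float = 2.0, max_polls: int = 30) -> dict:
    """Poll the status endpoint until the job leaves the waiting/active states."""
    status: dict = {}
    for _ in range(max_polls):
        status = get(f"{BASE_URL}/v0/crawl/status/{job_id}")
        if status.get("status") not in ("waiting", "active"):
            return status
        time.sleep(interval)
    # Still pending after max_polls: return the last status seen.
    return status
```

If wait_for_crawl comes back with the status still waiting, that suggests no worker picked the job up, which is the symptom this thread started with.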

spare lance
#

I just pushed a fix for this error

#

you can update your firecrawl repo for solving that

empty rune
#

Thanks so much! if i find any more errors/bugs i will let you know!

#

Really cool what you guys are working on!

analog monolith
#

Awesome that it worked! And thank you! 🔥

empty rune
#

Question, how do I use llm extract locally? And are these inputs correct for Python?

#

or is llm extract only for scrape function?

#

I see that you need an API key for llm extraction, which seems logical because this is run on your network.

So I'm trying out the markdown function, but for some reason the markdown is the same as the output...

empty rune
#

And it also gives markdown with standard crawl function

spare lance
#

Hey @empty rune , to use llm extract, you need to set up the extractorOptions parameter when using the scrape functions. Also, if you're using it self-hosted, you'll need to configure the OPENAI_API_KEY in your .env file.
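
As a rough illustration, a scrape request body with extractorOptions might be built like this in Python. Only the extractorOptions parameter name comes from the message above. The nested field names (mode, extractionPrompt, extractionSchema) and the "llm-extraction" value are assumptions to check against the API docs, and build_llm_extract_payload is a made-up helper.

```python
import json


def build_llm_extract_payload(url: str, prompt: str, schema: dict) -> dict:
    """Build a scrape request body that asks for LLM extraction."""
    return {
        "url": url,
        "extractorOptions": {
            "mode": "llm-extraction",    # assumed mode name
            "extractionPrompt": prompt,  # assumed field name
            "extractionSchema": schema,  # assumed field name (a JSON Schema object)
        },
    }


payload = build_llm_extract_payload(
    "https://example.com",
    "Extract the page title.",
    {
        "type": "object",
        "properties": {"title": {"type": "string"}},
        "required": ["title"],
    },
)
print(json.dumps(payload, indent=2))
```

The resulting dict would be sent as the JSON body of the scrape request. On a self-hosted instance the call would still need OPENAI_API_KEY set in .env, as noted above.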
