LLM Extract Does Not Do Whole Page? | Firecrawl | Page 1

static blaze Sep 19, 2024, 9:15 AM

@clever steppe Trying to extract structured data from a website, but not all of the data is being scraped. Only the first entries at the top of the page are being scraped. Any suggestions?

full cradle Sep 19, 2024, 3:34 PM

Hey, could you share your request url/schema so we can replicate?

My guess is that it has to do with the page loading on scroll

static blaze Sep 19, 2024, 6:32 PM

Thank you very much for helping! Just a caveat, I am not a coder or developer. So, there's a likelihood I am missing something in the request. Here you go:

data = app.scrape_url(
    # The URL is wrapped in single quotes because it contains double quotes itself.
    'https://www.zillow.com/las-vegas-nv/sold/?searchQueryState={"pagination"%3A{}%2C"isMapVisible"%3Atrue%2C"mapBounds"%3A{"west"%3A-116.15555895361328%2C"east"%3A-114.42177904638672%2C"south"%3A35.71392015428329%2C"north"%3A36.7166111978914}%2C"regionSelection"%3A[{"regionId"%3A18959%2C"regionType"%3A6}]%2C"filterState"%3A{"sort"%3A{"value"%3A"globalrelevanceex"}%2C"ah"%3A{"value"%3Atrue}%2C"rs"%3A{"value"%3Atrue}%2C"fsba"%3A{"value"%3Afalse}%2C"fsbo"%3A{"value"%3Afalse}%2C"nc"%3A{"value"%3Afalse}%2C"cmsn"%3A{"value"%3Afalse}%2C"auc"%3A{"value"%3Afalse}%2C"fore"%3A{"value"%3Afalse}}%2C"isListVisible"%3Atrue%2C"mapZoom"%3A11%2C"usersSearchTerm"%3A"Las Vegas NV"}',
    {
        "formats": ["extract"],
        "extract": {
            "prompt": "Extract the address, date, price, beds, baths, and sqft of all homes sold in Las Vegas, Nevada in the last 6 months.",
        },
    },
)

clever steppe Sep 20, 2024, 8:20 PM

Where is your extraction schema? That's the most important parameter to pass, because it tells the model exactly what format it should return the data in.

static blaze Sep 21, 2024, 8:30 AM

Caleb! Great to hear from you. In my ignorance, I put the schema in the prompt. The extraction produced the desired result as far as structuring the output correctly. However, it stopped after 10 extractions when there were over 500 more to do. Should I explicitly state the extraction schema?

clever steppe Sep 25, 2024, 2:32 AM

Yes, explicitly state the extraction schema!

static blaze Sep 25, 2024, 10:45 PM

Caleb. Understood, and thank you for your time. I declared the schema to scrape data off a simpler website. Here is the updated code I used:

from pydantic import BaseModel

# `app` is assumed to be an already-initialized Firecrawl client.

class ExtractSchema(BaseModel):
    Address: str
    Location: str
    Price: int
    Beds: int
    Baths: float
    SqFt: int
    Px_SqFt: int
    Time_On_Redfin: str

data = app.scrape_url(
    "https://www.redfin.com/zipcode/89134/filter/sort=lo-days",
    {
        # Extract the listings
        "formats": ["extract"],
        "extract": {
            "schema": ExtractSchema.model_json_schema(),
            "prompt": "Extract all the listings from the redfin website",
        },
    },
)

print(data["extract"])
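
A possible follow-up, offered only as a sketch: the ExtractSchema class above describes a single listing, while the prompt asks for all of them. One common way to express "many listings" in JSON Schema is to wrap the item schema in an object holding an array. The example below builds such a schema as a plain dictionary using only the standard library. The helper name `wrap_as_list` and the key `listings` are hypothetical, chosen for illustration, and the thread does not confirm that this resolves the truncated output. It also would not help if the page only loads more entries on scroll.

```python
import json

# Per-listing schema; field names mirror the ExtractSchema class in the thread.
LISTING_SCHEMA = {
    "type": "object",
    "properties": {
        "Address": {"type": "string"},
        "Location": {"type": "string"},
        "Price": {"type": "integer"},
        "Beds": {"type": "integer"},
        "Baths": {"type": "number"},
        "SqFt": {"type": "integer"},
        "Px_SqFt": {"type": "integer"},
        "Time_On_Redfin": {"type": "string"},
    },
}


def wrap_as_list(item_schema, key="listings"):
    """Return an object schema whose single property is an array of items.

    Hypothetical helper: the name and the default key are illustrative only.
    """
    return {
        "type": "object",
        "properties": {key: {"type": "array", "items": item_schema}},
        "required": [key],
    }


page_schema = wrap_as_list(LISTING_SCHEMA)

# Print the schema so it can be inspected before passing it along.
print(json.dumps(page_schema, indent=2))
```

Under these assumptions, `page_schema` would take the place of `ExtractSchema.model_json_schema()` as the value of "schema" in the request above, and the result would be read from the `listings` key of the extracted data.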
