#LLM Extract Does Not Do Whole Page?
Hey, could you share your request url/schema so we can replicate?
My guess is that it has to do with the page loading on scroll
Thank you very much for helping! Just a caveat, I am not a coder or developer. So, there's a likelihood I am missing something in the request. Here you go:
data = app.scrape_url(
    # Single quotes around the URL, because the URL itself contains double quotes.
    'https://www.zillow.com/las-vegas-nv/sold/?searchQueryState={"pagination"%3A{}%2C"isMapVisible"%3Atrue%2C"mapBounds"%3A{"west"%3A-116.15555895361328%2C"east"%3A-114.42177904638672%2C"south"%3A35.71392015428329%2C"north"%3A36.7166111978914}%2C"regionSelection"%3A[{"regionId"%3A18959%2C"regionType"%3A6}]%2C"filterState"%3A{"sort"%3A{"value"%3A"globalrelevanceex"}%2C"ah"%3A{"value"%3Atrue}%2C"rs"%3A{"value"%3Atrue}%2C"fsba"%3A{"value"%3Afalse}%2C"fsbo"%3A{"value"%3Afalse}%2C"nc"%3A{"value"%3Afalse}%2C"cmsn"%3A{"value"%3Afalse}%2C"auc"%3A{"value"%3Afalse}%2C"fore"%3A{"value"%3Afalse}}%2C"isListVisible"%3Atrue%2C"mapZoom"%3A11%2C"usersSearchTerm"%3A"Las Vegas NV"}',
    {
        "formats": ["extract"],
        "extract": {
            "prompt": "Extract the address, date, price, beds, baths, and sqft of all homes sold in Las Vegas, Nevada in the last 6 months.",
        },
    },
)
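The query string in the request above embeds a JSON object, which is easy to break when it is pasted into a Python string. As a minimal sketch, assuming only the standard library, the URL could be built from a dict instead. The helper name `build_search_url` is hypothetical, and `state` is trimmed to a few of the keys that appear in the URL above.

```python
import json
from urllib.parse import quote


def build_search_url(base_url: str, state: dict) -> str:
    """Serialize `state` as compact JSON and percent-encode it into the query string."""
    encoded = quote(json.dumps(state, separators=(",", ":")), safe="")
    return f"{base_url}?searchQueryState={encoded}"


# Trimmed example state; the real search state has more keys (mapBounds, filterState, ...).
state = {
    "isMapVisible": True,
    "regionSelection": [{"regionId": 18959, "regionType": 6}],
    "mapZoom": 11,
    "usersSearchTerm": "Las Vegas NV",
}

url = build_search_url("https://www.zillow.com/las-vegas-nv/sold/", state)
print(url)
```

Because every quote, brace, and space is percent-encoded, the resulting string can be passed to `app.scrape_url` without any quoting conflicts.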
Where is your extraction schema? That's the most important parameter to pass, because it tells the model exactly what format it should return the data in.
Caleb! Great to hear from you. In my ignorance, I put the schema in the prompt. The extraction produced the desired result as far as structuring the output correctly. However, it stopped after 10 extractions when there were over 500 more to do. Should I explicitly state the extraction schema?
Yes, explicitly state the extraction schema!
Caleb. Understood, and thank you for your time. I declared the schema to scrape data off a simpler website. Here is the updated code I used:
from pydantic import BaseModel


class ExtractSchema(BaseModel):
    Address: str
    Location: str
    Price: int
    Beds: int
    Baths: float
    SqFt: int
    Px_SqFt: int
    Time_On_Redfin: str


# `app` is the client object created earlier (not shown in this thread).
data = app.scrape_url(
    "https://www.redfin.com/zipcode/89134/filter/sort=lo-days",
    {
        # Extract the listings
        "formats": ["extract"],
        "extract": {
            "schema": ExtractSchema.model_json_schema(),
            "prompt": "Extract all the listings from the redfin website",
        },
    },
)
print(data["extract"])
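One thing that may be worth checking: the schema above describes a single listing, while the prompt asks for all of them. This is a guess rather than a confirmed cause, and it would not help with listings that only load on scroll. As a minimal sketch, assuming the `schema` parameter accepts any JSON Schema dict, the per-listing schema could be wrapped in an array field. The names `listing_schema`, `listings_schema`, and the `listings` key are hypothetical, and the field list is shortened for brevity.

```python
import json

# JSON Schema for one listing, mirroring some of the ExtractSchema fields above.
listing_schema = {
    "type": "object",
    "properties": {
        "Address": {"type": "string"},
        "Location": {"type": "string"},
        "Price": {"type": "integer"},
        "Beds": {"type": "integer"},
        "Baths": {"type": "number"},
        "SqFt": {"type": "integer"},
    },
    "required": ["Address", "Price"],
}

# Hypothetical wrapper: an object holding an array of listings,
# so the expected output is explicitly "many items", not one.
listings_schema = {
    "type": "object",
    "properties": {
        "listings": {"type": "array", "items": listing_schema},
    },
    "required": ["listings"],
}

print(json.dumps(listings_schema, indent=2))
```

The wrapper dict would then be passed as the `"schema"` value in place of `ExtractSchema.model_json_schema()`.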