#Giving URLs as background information

main mountain
#

Maybe somebody has solved this already and can help me.

I’m crawling the top 20 URLs for a specific keyword and want to use their h1/h2 headings and body text as background information.

Now I want to create a new blog post from this information plus additional, manually added background information.

But how do I handle that much data without hitting the token limit?

opal mountain
#

Scrape the h1s and h2s and combine?

#

or do you want all the text from those articles as well?

#

or just section headings
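
For the "scrape the h1s and h2s and combine" option, a minimal sketch using only Python's standard-library `html.parser` could look like the following. It assumes the HTML of each page has already been downloaded; the names `HeadingCollector`, `extract_headings`, and `combine_headings` are made up for illustration and are not from any scraper mentioned in this thread.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1> and <h2> element, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []     # list of (tag, text) tuples
        self._current = None   # heading tag we are currently inside, if any
        self._buffer = []      # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace left over from nested inline tags.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings.append((tag, text))
            self._current = None


def extract_headings(html):
    """Return [(tag, text), ...] for all h1/h2 elements in an HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


def combine_headings(pages):
    """pages: dict mapping URL -> HTML. Returns one plain-text outline."""
    lines = []
    for url, html in pages.items():
        lines.append(f"Source: {url}")
        for tag, text in extract_headings(html):
            indent = "" if tag == "h1" else "  "
            lines.append(f"{indent}{tag.upper()}: {text}")
    return "\n".join(lines)
```

The resulting outline is far smaller than the full page text, so it is the cheapest part of the scraped data to pass to a model.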

main mountain
#

@opal mountain I was thinking of generating new h1/h2 headings with a prompt, based on the h1s/h2s of the scraped URLs. The text should also be scraped and combined as background information. Then I'd check for gaps and missing information, add it, and write a better, more helpful blog post.
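
One way the inputs described here might be assembled into a single prompt is sketched below. The function name `build_prompt`, its parameters, and the wording of the task instruction are assumptions for illustration; the thread does not show the prompt that was actually used.

```python
def build_prompt(keyword, outline, background, manual_notes):
    """Assemble one prompt string from the scraped and manual inputs.

    All four arguments are plain strings; how they are produced
    (scraper, summarizer, hand-written notes) is outside this sketch.
    """
    sections = [
        f"Target keyword: {keyword}",
        "Headings used by competing pages:\n" + outline,
        "Background text from those pages:\n" + background,
        "Additional background supplied manually:\n" + manual_notes,
        (
            "Task: propose a new H1 and a set of H2s for a blog post on the "
            "target keyword, list topics the competing pages miss, and then "
            "write the post so that it covers those gaps."
        ),
    ]
    return "\n\n".join(sections)
```

Keeping the scraped outline, the scraped text, and the manual notes in separately labeled sections makes it easier to trim only one of them when the prompt grows too large.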

opal mountain
#

I understand what you're trying to do here, but it seems like something GPT might struggle to follow. First you'd need to scrape all 20 websites fully, and building a good scraper is a challenge in itself unless you use a service.

#

You need to parse it cleanly, without losing much content or picking up irrelevant data (and some websites have good bot protection that will give you a '403 Forbidden').

#

If you can get past that, you then need a way of giving this data to GPT with a prompt and a good instruction. You'd theoretically get 10-15k words, which is a bit too much, but some models can handle it. However, at times you might get above 20k words, which is too much for ChatGPT if you're going to use that.
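
One simple way to keep the combined text under a fixed size is to give each page a share of a total word budget and trim the longer pages. The sketch below is only an illustration: `fit_to_budget` is a hypothetical helper, and it counts whitespace-separated words as a rough proxy, while the real limit is measured in tokens and depends on the model's tokenizer.

```python
def fit_to_budget(texts, max_words):
    """Trim each text so the combined word count stays within max_words.

    Short texts keep all their words; the share they do not use is
    redistributed to the longer texts. Order of the input is preserved.
    """
    words = [t.split() for t in texts]
    remaining = max_words
    result = [None] * len(words)
    # Process shortest first so leftover budget flows to longer texts.
    order = sorted(range(len(words)), key=lambda i: len(words[i]))
    for position, i in enumerate(order):
        share = remaining // (len(order) - position)
        kept = words[i][:share]
        result[i] = " ".join(kept)
        remaining -= len(kept)
    return result


if __name__ == "__main__":
    pages = ["a b", "c d e f g h", "i j k l"]
    print(fit_to_budget(pages, max_words=9))
```

Plain truncation keeps only the start of each long page, so it can drop later sections; summarizing each page before combining is an alternative, at the cost of extra model calls.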

#

There is another obstacle: you need a good model to parse this and actually figure out what's missing and what should be included and accentuated in your new blog post. o1 can do this, but that's only worth it price-wise if you're doing it manually (no API).

main mountain
#

@opal mountain Thanks for the response. I implemented it successfully yesterday and it’s working fine.