# "Read" the content of a website from a URL


dull apex
#

Hi guys, I need some advice on OpenAI. I would like to build a website that reads the content of a provided URL and then returns specific data. I made this schema; I hope it's clear.
So my goals with AI are:

  • Understand what the page is about.
  • Check and return the list of artists and songs.

I'm not sure how to proceed, so I would love some feedback. I thought about using Linkreader from ChatGPT, or some other tools I found, like browser ai or ParseHub.
I added a schema; I hope this helps.
gloomy birch
#

You can use some Python (BeautifulSoup or similar) to scrape the site, pass the results to GPT, and prompt it to check your steps. If it returns a list of artists and songs, you then use some other custom code to create your playlists. I’m not familiar with Spotify, or whether they have an API you can post to in order to create playlists.

Also, scraping websites can break laws or terms of use, and/or get you blocked by the scraped websites.

GPT will have very little work to do in all this. Most of the work will happen in the custom code that scrapes the page and then posts the potential playlists.
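
A minimal sketch of the scraping step, in case it helps. To keep it dependency-free it uses only Python's standard library (`html.parser` and `urllib`) instead of BeautifulSoup; BeautifulSoup's `get_text()` would do roughly the same job with less code. The names `VisibleText`, `html_to_text`, `fetch_text` and the User-Agent string are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_text(url):
    """Download a page and return its visible text."""
    # The User-Agent value is a placeholder; use something that identifies you.
    request = Request(url, headers={"User-Agent": "playlist-reader-sketch/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))
```

`fetch_text(url)` gives you plain text that you can pass to GPT. Pages that build their content with JavaScript will come back mostly empty with this approach, so check the output before you rely on it.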

fierce holly
#

Agree with smileBeda. You can just pass the full page content to the model and ask it to extract the information you want. BeautifulSoup and requests should do the job if you code in Python.
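
Here is a rough sketch of the "pass it to the model" step. It only builds the prompt and parses the reply; `ask_model` is a hypothetical callable that you would wire to whatever chat API you use (it takes a prompt string and returns the model's text reply). The JSON shape, the function names and the 12,000-character cut-off are assumptions for illustration, not anything OpenAI prescribes.

```python
import json


def build_prompt(page_text, max_chars=12000):
    """Build an extraction prompt; long pages are truncated to max_chars."""
    return (
        "Below is the text of a web page.\n"
        "1. Say in one sentence what the page is about.\n"
        "2. List every artist and song it mentions.\n"
        "Reply with JSON only, shaped like "
        '{"summary": "...", "tracks": [{"artist": "...", "song": "..."}]}.\n\n'
        "PAGE TEXT:\n" + page_text[:max_chars]
    )


def parse_reply(reply):
    """Turn the model's JSON reply into (summary, [(artist, song), ...])."""
    data = json.loads(reply)
    tracks = [
        (item["artist"], item["song"])
        for item in data.get("tracks", [])
        # Drop entries where the model left the artist or song empty.
        if isinstance(item, dict) and item.get("artist") and item.get("song")
    ]
    return data.get("summary", ""), tracks


def extract_tracks(page_text, ask_model):
    """ask_model is a placeholder: prompt string in, reply string out."""
    return parse_reply(ask_model(build_prompt(page_text)))
```

Models do not always return valid JSON, so in practice you would wrap `parse_reply` in a `try`/`except json.JSONDecodeError` and retry or log the raw reply.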

tribal zinc
#

Yes, using Python + LangChain, you can fetch data from a website URL.

LangChain has document loaders that read data from documents and website URLs, so your problem can be solved well with Python and LangChain.

dull apex
#

Sorry for the late replies guys. I appreciate you took time to answer me 😁
Still a lot of work ahead

gloomy birch
# dull apex Thanks for your feedback. I wanted to use it because I'm not confident with scra...

Unfortunately, whether it’s you, GPT, or anything else that wants the contents of a website, you have to either scrape it or read it personally.

One would assume that if a website allows you to read it, you can also scrape it. However, the main reason many hosts block scrapers is traffic spikes, which put a drain on their servers.

On the other hand, there’s no foolproof protection against it.
The “proper” way would be for the site to offer an API and let you consume it.

It’s a bit silly if you think about it from just your or my point of view. Where’s the difference between scraping and then reading, and just reading? None. But imagine there are thousands of scrapers, or even just one that scrapes your whole website and follows every link. That’s a level of traffic no human could ever generate.
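
If you do scrape, one way to keep your traffic closer to what a human would cause is to respect the site's robots.txt and pause between requests. A small sketch using the standard library's `urllib.robotparser`; the `USER_AGENT` value, the function names and the two-second delay are arbitrary choices for this example, and `fetch` stands for whatever download function you already have.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "playlist-reader-sketch"  # hypothetical name for your scraper


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_fetch_all(urls, rules, fetch, delay_seconds=2.0):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    pages = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # the site asked crawlers to stay away from this path
        if pages:
            time.sleep(delay_seconds)  # spread requests out over time
        pages[url] = fetch(url)
    return pages
```

For a live site you would normally let `RobotFileParser` download the file itself with `set_url(...)` followed by `read()`. Note that following robots.txt is a courtesy and does not settle the terms-of-use question.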

And then of course there’s the whole copyright issue. Many sites put right-click protection on images, which stops most users from copying their imagery. But a scraper gets around that with ease.

On the other hand, datasets like the ones GPT was trained on weren’t built by asking for permission. Their “excuse” was (and is) “it’s publicly available data”.
And they clearly haven’t been blocked or sued by many 😅
One of those sets is Common Crawl, for example: https://en.m.wikipedia.org/wiki/Common_Crawl