Unfortunately, whether it’s you, GPT, or anyone else who wants the contents of a website, you have to either scrape it or read it yourself.
One would assume that if a website lets you read it, you can also scrape it. However, the main reason many hosts block scrapers is the traffic spikes they cause, which put extra load on the servers.
On the other hand, there’s no foolproof protection against scraping.
The “proper” way would be for the site to offer an API and for you to consume that instead.
It’s a bit silly if you think about it from just your or my point of view. Where’s the difference if I scrape and read, or just read? None. But imagine there are thousands of scrapers, or even just one that scrapes your whole website and follows every link. That’s a volume of traffic no human reader could ever generate.
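If you do end up scraping, the load problem is something you can at least soften on your side. Here is a rough Python sketch, not a definitive implementation: it assumes the site publishes a robots.txt, filters out URLs that file disallows, and pauses between requests. The bot name, the 2 second delay, and both function names are made up for illustration.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-bot"  # hypothetical bot name, pick your own


def allowed_urls(robots_txt, urls, user_agent=USER_AGENT):
    """Return only the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_fetch(urls, delay_seconds=2.0, user_agent=USER_AGENT):
    """Fetch URLs one at a time, sleeping between requests.

    The delay is an assumed value; a human-like pace is the point,
    not the exact number.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # spread requests out over time
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=10) as response:
            pages[url] = response.read()
    return pages
```

You would first download the site’s robots.txt, pass its text to `allowed_urls`, and hand the result to `polite_fetch`. None of this makes scraping “allowed”; it only keeps your traffic closer to what a single reader would produce.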
And of course then there’s the whole copyright issue. Many sites put right-click protection on images, which stops most users from copying them. But a scraper gets around that with ease.
On the other hand, datasets like the ones GPT was trained on weren’t built by asking for permission. Their “excuse” was (and is) “it’s publicly available data”.
And they clearly haven’t been blocked or sued by many 😅
One of those datasets is Common Crawl, for example: https://en.m.wikipedia.org/wiki/Common_Crawl