Crawler issues - Cannot crawl static document | ToS;DR | Page 1

lapis charm Mar 24, 2023, 10:46 AM

#

I am trying to update the document at https://explore.wolt.com/en/deu/terms. However, when crawling the document, the crawler reports that there is no text.
The website serves the terms of service as static HTML, so this is not a case of JavaScript loading the document later.

#

@craggy harness I think you are the right person to mention in this thread?

craggy harness Mar 24, 2023, 11:34 AM

#

@lapis charm it could be that the website is displayed differently to the crawler, e.g. an "are you a robot?" wall

harsh spear Mar 24, 2023, 12:25 PM

#

That is happening with almost every crawl right now, sometimes with the same doc that worked seconds before

#

I reported that somewhere already

craggy harness Mar 24, 2023, 12:28 PM

#

Hmmm

#

I'll check the logs

harsh spear Mar 24, 2023, 2:21 PM

#

https://edit.tosdr.org/documents/2386

craggy harness Mar 25, 2023, 8:14 PM

#

@here I've further debugged the issue on us-east-1 and it seems the website performs an invalid redirect on HEAD requests

https://sentry.internal.jrbit.de/share/issue/0a942882e1d44021a859bfd5f5e5b924/

Sentry

AssertionError: protocol mismatch

protocol mismatch AssertionError /var/www/crawler/functions/async.crawl.js IncomingMessage. False IncomingMessage.(functions:async.crawl)

#

Sadly that is not something we can fix on our side, as the crawler requires the initial HEAD request

lapis charm Mar 25, 2023, 9:29 PM

#

Thanks for clarifying, I’ll send an email regarding the request

lapis charm Apr 3, 2023, 11:41 AM

#

craggy harness @here I've further debugged the issue on us-east-1 and it seems the Website does...

Would it be possible to create a new error message specifically for this? Then we curators would know what the issue is instead of relying on the devs to tell us.
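
A minimal sketch of what such a curator-facing message could look like, assuming the crawler can hand the raw error text to the editing interface. The function name `curator_message` and the wording of both messages are hypothetical; nothing here reflects how Phoenix or the crawler actually report errors.

```python
def curator_message(error_text: str) -> str:
    """Map a raw crawler error to a message a curator can act on.

    Hypothetical helper: the only error string taken from this thread is
    "protocol mismatch"; the returned wording is illustrative.
    """
    if "protocol mismatch" in error_text.lower():
        return (
            "The site answered the crawler's initial HEAD request with a "
            "redirect the crawler could not follow (protocol mismatch). "
            "This is a problem on the site's side, not with the document URL."
        )
    # Fallback for every error this sketch does not recognise.
    return "The crawler returned no text. Please report this to the developers."
```

For example, `curator_message("AssertionError: protocol mismatch")` returns the HEAD-specific message, while any other error text falls through to the generic one.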

craggy harness Apr 3, 2023, 11:41 AM

#

I'll pass it on to the Phoenix team 👍🏻

lapis charm Apr 3, 2023, 11:46 AM

#

Furthermore, I can't seem to find where the issue is. I see no request header in the HEAD request when running it manually.

You state that it does an invalid redirect, and the error you sent shows that the protocols mismatch, but for me the HEAD request also seems to check out.

#

Only when doing an http:// request do we get a similar issue
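
The manual check described above can be sketched roughly as follows: send a single HEAD request without following redirects, then compare the scheme of any `Location` header with the scheme of the request. This is only an illustration using the Python standard library. The crawler itself is a Node.js service, and the helper names `head_once` and `redirect_changes_protocol` are assumptions, not part of it.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head_once(url: str, timeout: float = 10.0):
    """Send one HEAD request without following redirects.

    Returns (status_code, location_header_or_None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path, headers={"User-Agent": "head-check-sketch"})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def redirect_changes_protocol(request_url: str, location) -> bool:
    """True if the redirect target uses a different scheme than the request."""
    if not location:
        return False
    # urljoin resolves relative and protocol-relative Location values.
    target = urljoin(request_url, location)
    return urlsplit(target).scheme != urlsplit(request_url).scheme
```

Calling `head_once("https://explore.wolt.com/en/deu/terms")` and passing the returned `Location` value to `redirect_changes_protocol` would show whether the HEAD response switches between http:// and https://. The result may differ by region or over time, so it need not match what the crawler saw on us-east-1.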

lapis charm Apr 4, 2023, 12:21 PM

#

@craggy harness sorry for pinging again, but could you repeat the request using https:// instead?

craggy harness Apr 4, 2023, 9:49 PM

#

lapis charm @craggy harness sorry for pinging again. But could you repeat the request ...

The document is already registered with https://, and the site is still broken with HEAD requests
