fleet silo Aug 26, 2023, 4:04 PM

#

Hello everyone,

I hope this message finds you well. I'm currently working on a web scraping project where I need to extract URLs from search results based on a specific keyword. However, I've encountered a challenge in identifying the correct class name to use for extracting the URLs from the search results' HTML.

I've thoroughly examined the HTML structure, and it seems that the issue might be related to how the class name is being identified. I've attempted to use a certain class name, but it appears that either the structure has changed or there's a discrepancy in identifying the class name accurately.

For those of you experienced in web scraping or API utilization, I kindly ask for your guidance on the most effective approach to determine the accurate class name or alternative techniques to successfully extract the URLs from the search results.

If any of you can offer assistance, insights, code examples, or recommended resources, I would greatly appreciate your input. Your expertise would be of immense help in overcoming this challenge.

Thank you very much for considering my request, and I eagerly await any valuable advice you might have to share.

fleet silo Aug 26, 2023, 4:55 PM

#

@lone dawn

lone dawn Aug 26, 2023, 4:57 PM

#

urm

#

when i was doing a web scraping project using selenium, i just copied the xpath and chucked that in my code, but sometimes it didnt work

#

ur better off waiting for someone who has more experience than me in this topic, sorry

fleet silo Aug 26, 2023, 5:00 PM

#

lone dawn ur better off waiting for someone who has more experience than me in this topic,...

Can u pink somebody who has more experience in this topic

#

*ping

lone dawn Aug 26, 2023, 5:01 PM

#

im not sure who will, just wait and see who responds

fleet silo Aug 26, 2023, 5:01 PM

#

WEB SCRAPING

lone dawn Aug 26, 2023, 5:01 PM

#

i would also advise to not dm people, just wait

fleet silo Aug 26, 2023, 5:01 PM

#

lone dawn im not sure who will, just wait and see who responds

No one does, when i post here idk why

lone dawn Aug 26, 2023, 5:01 PM

#

🤷‍♂️

fleet silo Aug 26, 2023, 5:01 PM

#

Ahhman

lone dawn Aug 26, 2023, 5:02 PM

#

post more in one place, that's my advice

#

there will not always be someone here that can answer your question

fleet silo Aug 26, 2023, 5:02 PM

#

Yeah

#

Wait ik this guy

#

Hes the boss here

#

@worldly moat broo helpppp

#

webbb

worldly moat Aug 26, 2023, 5:12 PM

#

What search results are you scraping

#

Scraping Search Results

fleet silo Aug 26, 2023, 5:14 PM

#

worldly moat What search results are you scraping

So there's this API that scrapes Google searches. Like if I put in abc, it returns the sites ranked for that word, but everything it returns is HTML. I just want to automatically extract the site URLs from that HTML. I tried various libraries, then OpenAI's API, to locate the class names that contain the URLs, but failed.

worldly moat Aug 26, 2023, 5:15 PM

#

Google disallows scraping and actively blocks scraping using various techniques

fleet silo Aug 26, 2023, 5:16 PM

#

import re
import sys

import requests
from bs4 import BeautifulSoup

# Step 1: Retrieve sites ranked for the provided keyword

keyword = input("Enter the keyword to search: ")
search_url = f"https://webspi.p.rapidapi.com/search/{keyword}"
headers = {
    "X-RapidAPI-Key": apiKey,  # apiKey must be defined earlier with your RapidAPI key
    "X-RapidAPI-Host": "webspi.p.rapidapi.com"
}

response = requests.get(search_url, headers=headers)

if response.status_code != 200:
    print("Error accessing search API.")
    sys.exit(1)

html_code = response.text

# Step 2: Identify the class name pattern using regular expressions

# Capture the value of a class attribute, in either quote style, as group 1
class_name_pattern = re.compile(r'class=["\']([^"\']+)["\']')

# Step 3: Extract URLs based on the identified class name

soup = BeautifulSoup(html_code, "lxml")

# Try to identify the class name pattern from the first 10 <a> tags

class_name_found = False
for tag in soup.find_all("a")[:10]:
    match = class_name_pattern.search(str(tag))
    if match:
        class_name = match.group(1)
        class_name_found = True
        break

if not class_name_found:
    print("No class name pattern found in the first 10 <a> tags.")
    sys.exit(1)

identified_urls = []
search_results = soup.find_all("a", class_=class_name)
for result in search_results:
    url = result.get("href")  # some <a> tags have no href attribute
    if url:
        identified_urls.append(url)

if not identified_urls:
    print("No URLs identified in the search results.")
    sys.exit(1)
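
Side note on the class name problem: if the class names in the returned HTML keep changing or can't be pinned down, one option is to not depend on them at all and just collect every `href` from the `<a>` tags, then keep only absolute http(s) links. Below is a minimal sketch of that idea using the standard library's `html.parser` instead of BeautifulSoup. `LinkCollector`, `extract_urls`, and the `sample` markup are made-up names and data for illustration, not anything the API in question actually returns.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, ignoring class names entirely."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_urls(html_code):
    """Return unique absolute http(s) URLs from html_code, in document order."""
    parser = LinkCollector()
    parser.feed(html_code)
    seen = set()
    urls = []
    for link in parser.links:
        # Keep only absolute web links; relative paths and other schemes are dropped
        if urlparse(link).scheme in ("http", "https") and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls


# Hypothetical sample markup, not real API output
sample = (
    '<div><a class="x1" href="https://example.com/a">A</a>'
    '<a href="/local">B</a>'
    '<a class="y2" href="https://example.org/">C</a></div>'
)
print(extract_urls(sample))
```

This will also pick up navigation and footer links, so real output would probably need an extra filter (for example by domain) on top.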

worldly moat Aug 26, 2023, 5:16 PM

#

So the reason you're not getting much help here is because it's not something that is allowed here and isn't really possible.

fleet silo Aug 26, 2023, 5:16 PM

#

worldly moat So the reason you're not getting much help here is because it's not something th...

Is it illegal?

worldly moat Aug 26, 2023, 5:16 PM

#

Disallowed by their TOS

fleet silo Aug 26, 2023, 5:16 PM

#

worldly moat Disallowed by their TOS

Like how do SERP SEO tool websites work then?

worldly moat Aug 26, 2023, 5:17 PM

#

They likely have access to the Google API

#

We use the Google search API ourselves
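
For anyone reading along: the message above doesn't say which API is meant. Assuming it is Google's Custom Search JSON API, a request looks roughly like the sketch below. The API returns JSON, so the result URLs come from each item's `link` field and no HTML parsing is needed. `build_search_url`, `extract_links`, and `search` are hypothetical helper names, and the API key and search engine ID are placeholders you would get from your own Google account.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint of the Custom Search JSON API (assumed to be the API in question)
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query):
    """Build the request URL from the key, search engine ID (cx), and query."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def extract_links(payload):
    """Pull the result URLs out of a decoded JSON response."""
    return [item["link"] for item in payload.get("items", []) if "link" in item]


def search(api_key, engine_id, query):
    """Run one search over the network and return the result URLs."""
    with urlopen(build_search_url(api_key, engine_id, query)) as response:
        return extract_links(json.load(response))
```

Calling `search("YOUR_API_KEY", "YOUR_ENGINE_ID", "abc")` with real credentials would return a list of URLs for that keyword; the two helper functions can be tried without any network access.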

#

!google example

rocky elbowBOT Aug 26, 2023, 5:17 PM

#

Google Results

Results for "example"

fleet silo Aug 26, 2023, 5:17 PM

#

Is it free???

worldly moat Aug 26, 2023, 5:17 PM

#

Yes

fleet silo Aug 26, 2023, 5:17 PM

#

Shishhhhhhh

#

Any videos related to that?

worldly moat Aug 26, 2023, 5:18 PM

#

No idea

#

I just read their docs

fleet silo Aug 26, 2023, 5:18 PM

#

Idk anything abt that

fleet silo Aug 26, 2023, 5:18 PM

#

worldly moat I just read their docs

Okkkk thanks man

#

Btw im new to diz stuff i don know whats wrong or right

#

Im sry
