Scraping loop using proxies not scraping everything - Python

Hello fellow coders :)
So as part of my research project I need to scrape data out of a website.
Obviously it detects bots, therefore I am trying to implement proxies in a loop I know works (getting the brands' URLs):
The working loop:
brands_links= []
for country_link in country_links:
    r = requests.get(url + country_link, headers=headers)
    soup_b = BeautifulSoup(r.text, "lxml")
    for link in soup_b.find_all("div", class_='designerlist cell small-6 large-4'):
        for link in link.find_all('a'):
            durl = link.get('href')
            brands_links.append(durl)
The loop using proxies:
brands_links= []
i = 0
while i in range(0, len(country_links)):
    print(i)
    try:
        proxy_index = random.randint(0, len(proxies) - 1)
        proxy = {"http": proxies[proxy_index], "https": proxies[proxy_index]}
        r = requests.get(url + country_links[i], headers=headers, proxies=proxy, timeout=10)
        soup_b = BeautifulSoup(r.text, "lxml")
        for link in soup_b.find_all("div", class_='designerlist cell small-6 large-4'):
            for link in link.find_all('a'):
                durl = link.get('href')
                brands_links.append(durl)
        if durl is not None:
            print("scraping happening")
            i += 1
        else:
            continue
    except:
        print("proxy not working")
        proxies.remove(proxies[proxy_index])
    if i == len(country_links):
        break
    else:
        continue
Unfortunately it does not scrape all the links.
With the working loop, using only headers, I get a list of length 3788. With this one I only get 2387.
By inspecting the data I can see it skips some country links hence the difference in length.
I am trying to force the loop to scrape all the links with the "if" statement but it does not seem to work.
Does anyone know what I am doing wrong, or have an idea that would make it scrape everything?
Thanks in advance.

Thanks for sharing the link.
You said:
Obviously it detects bots therefore I am trying to implement
proxies...
What makes you think this? Here is some code I came up with, which seems to scrape all the divs, as far as I can tell:
def main():
    import requests
    from bs4 import BeautifulSoup

    countries = (
        ("United States", "United+States.html"),
        ("Canada", "Canada.html"),
        ("United Kingdom", "United+Kingdom.html")
    )

    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
    }

    for country, document in countries:
        url = f"https://www.fragrantica.com/country/{document}"
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        divs = soup.find_all("div", {"class": "designerlist"})
        print(f"Number of divs in {country}: {len(divs)}")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Number of divs in United States: 1016
Number of divs in Canada: 40
Number of divs in United Kingdom: 308

So I found a way to force the loop to keep retrying until it actually scrapes the link.
Here's the updated code:
brands_links= []
i = 0
while i in range(0, len(country_links)):
    print(i)
    try:
        proxy_index = random.randint(0, len(proxies) - 1)
        proxy = {"http": proxies[proxy_index], "https": proxies[proxy_index]}
        r = requests.get(url + country_links[i], headers=headers, proxies=proxy, timeout=10)
        soup_b = BeautifulSoup(r.text, "lxml")
        for link in soup_b.find_all("div", class_='designerlist cell small-6 large-4'):
            for link in link.find_all('a'):
                durl = link.get('href')
                brands_links.append(durl)
    except:
        print("proxy not working")
        proxies.remove(proxies[proxy_index])
        continue
    try:
        durl
    except NameError:
        print("scraping not happening")
        continue
    else:
        print("scraping happening")
        del durl
        i += 1
    if i == len(country_links):
        break
    else:
        continue
So it is the final try/except on durl which checks whether a link was actually scraped.
I am not really familiar with functions, so if anyone has a way to make this simpler or more efficient I would highly appreciate it. For now I will be using @Paul M's function to improve my loop or transform it into a function.
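For reference, here is one way the retry-per-link idea could be folded into a function. This is only a rough sketch, assuming the same url, country_links, proxies and headers variables as above; it has not been run against the site:

import random
import requests
from bs4 import BeautifulSoup

def scrape_brand_links(url, country_links, proxies, headers, timeout=10):
    """Sketch: retry each country link with random proxies until it succeeds."""
    brands_links = []
    for country_link in country_links:
        while True:
            if not proxies:
                raise RuntimeError("ran out of working proxies")
            chosen = random.choice(proxies)
            proxy = {"http": chosen, "https": chosen}
            try:
                r = requests.get(url + country_link, headers=headers,
                                 proxies=proxy, timeout=timeout)
                r.raise_for_status()
            except requests.RequestException:
                # drop the proxy that failed and retry the same link
                proxies.remove(chosen)
                continue
            soup_b = BeautifulSoup(r.text, "lxml")
            for div in soup_b.find_all("div", class_="designerlist cell small-6 large-4"):
                for link in div.find_all("a"):
                    brands_links.append(link.get("href"))
            break  # this country link is done, move on to the next one
    return brands_links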

Related

Why am I not seeing any results in my output from extracting indeed data using python

I am trying to run this code in IDLE 3.10.6 and I am not seeing any of the data that should be extracted from Indeed. The data should appear in the output when I run it, but it doesn't. Below is the code:
#Indeed data
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko"}
    url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url,headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def transform(soup):
    divs = soup.find_all("div", class_ = "jobsearch-SerpJobCard")
    for item in divs:
        title = item.find("a").text.strip()
        company = item.find("span", class_="company").text.strip()
        try:
            salary = item.find("span", class_ = "salarytext").text.strip()
        finally:
            salary = ""
        summary = item.find("div",{"class":"summary"}).text.strip().replace("\n","")
        job = {
            "title":title,
            "company":company,
            'salary':salary,
            "summary":summary
        }
        joblist.append(job)

joblist = []
for i in range(0,40,10):
    print(f'Getting page, {i}')
    c = extract(10)
    transform(c)

df = pd.DataFrame(joblist)
print(df.head())
df.to_csv('jobs.csv')
Here is the output I get
Getting page, 0
Getting page, 10
Getting page, 20
Getting page, 30
Empty DataFrame
Columns: []
Index: []
Why is this happening, and what should I do to get the extracted data from Indeed? What I am trying to get is the job title, company, salary, and summary information. Any help would be greatly appreciated.
The URL string includes {page}, but it's not an f-string, so it's not being interpolated, and the URL you are fetching is:
https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}
That returns an error page.
So you should add an f before the opening quote when you set url.
Also, you are calling extract(10) each time, instead of extract(i).
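Putting both fixes together, a rough sketch of the corrected pieces (the rest of the original code is assumed unchanged; headers is also passed as a keyword so requests sends it as headers rather than query parameters):

def extract(page):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko"}
    # f-string, so {page} is actually interpolated into the URL
    url = f"https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, "html.parser")

joblist = []
for i in range(0, 40, 10):
    print(f'Getting page, {i}')
    c = extract(i)   # pass the loop variable, not a constant
    transform(c)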
This is the correct way of building the URL:
url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}".format(page=page)
r = requests.get(url,headers)
Here r.status_code returns 403, which means the request is forbidden: the site is blocking your request. Use the Indeed job search API instead.

Multiple values against the same tag not scraping

I'm getting no values for my "Number of Rooms" and "Room" search.
https://www.zoopla.co.uk/property/uprn/906032139/
I can see here that I should be returning something but not getting anything.
Can anyone possibly point me in the right direction of how to solve this? I am not even sure what to search for, as it's not erroring. I thought it would put all the data in and then I would need to figure out a way to separate it. Do I need to maybe scrape it into a dictionary?
import requests
from bs4 import BeautifulSoup as bs
import numpy as np
import pandas as pd
import matplotlib as plt
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "https://google.co.uk",
    "DNT": "1"
}

page = 1
addresses = []

while page != 2:
    url = f"https://www.zoopla.co.uk/house-prices/edinburgh/?pn={page}"
    print(url)
    response = requests.get(url, headers=headers)
    print(response)
    html = response.content
    soup = bs(html, "lxml")
    time.sleep(1)
    for address in soup.find_all("div", class_="c-rgUPM c-rgUPM-pnwXf-hasUprn-true"):
        details = {}
        # Getting the address
        details["Address"] = address.h2.get_text(strip=True)
        # Getting each addresses unique URL
        scotland_house_url = f'https://www.zoopla.co.uk{address.find("a")["href"]}'
        details["URL"] = scotland_house_url
        scotland_house_url_response = requests.get(
            scotland_house_url, headers=headers)
        scotland_house_soup = bs(scotland_house_url_response.content, "lxml")
        # Lists status of the property
        try:
            details["Status"] = [status.get_text(strip=True) for status in scotland_house_soup.find_all(
                "span", class_="css-10o3xac-Tag e164ranr11")]
        except AttributeError:
            details["Status"] = ""
        # Lists the date of the status of the property
        try:
            details["Status Date"] = [status_date.get_text(
                strip=True) for status_date in scotland_house_soup.find_all("p", class_="css-1jq4rzj e164ranr10")]
        except AttributeError:
            details["Status Date"] = ""
        # Lists the value of the property
        try:
            details["Value"] = [value.get_text(strip=True).replace(",", "").replace(
                "£", "") for value in scotland_house_soup.find_all("p", class_="css-1x01gac-Text eczcs4p0")]
        except AttributeError:
            details["Value"] = ""
        # Lists the number of rooms
        try:
            details["Number of Rooms"] = [number_of_rooms.get_text(strip=True) for number_of_rooms in scotland_house_soup.find_all(
                "p", class_="css-82kmy1 e13gx5i3")]
        except AttributeError:
            details["Number of Rooms"] = ""
        # Lists type of room
        try:
            details["Room"] = [room.get_text(strip=True) for room in scotland_house_soup.find_all(
                "span", class_="css-1avcdf2 e13gx5i4")]
        except AttributeError:
            details["Room"] = ""
        addresses.append(details)
    page = page + 1

for address in addresses[:]:
    print(address)

print(response)
Selecting by class_="css-1avcdf2 e13gx5i4" seems brittle; the class might change at any time. Try a different CSS selector:
import requests
from bs4 import BeautifulSoup
url = "https://www.zoopla.co.uk/property/uprn/906032139/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
tag = soup.select_one('#timeline p:has(svg[data-testid="bed"]) + p')
no_beds, beds = tag.get_text(strip=True, separator=" ").split()
print(no_beds, beds)
Prints:
1 bed
If you want all types of rooms:
for detail in soup.select("#timeline p:has(svg[data-testid]) + p"):
    n, type_ = detail.get_text(strip=True, separator="|").split("|")
    print(n, type_)
Prints:
1 bed
1 bath
1 reception
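If you want to feed those values back into the details dictionary built in the question, a rough sketch (assuming scotland_house_soup is the per-property soup created above) could be:

details["Number of Rooms"] = []
details["Room"] = []
for detail in scotland_house_soup.select("#timeline p:has(svg[data-testid]) + p"):
    n, type_ = detail.get_text(strip=True, separator="|").split("|")
    details["Number of Rooms"].append(n)
    details["Room"].append(type_)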

Multi-Threading BS4 scraper does not speed up the process

I am trying to multithread my scraper so it runs faster. Currently I have commented out the pagination so it finishes faster for time measurement, but it runs just as slowly as the simple scraper that did not use concurrent.futures.ThreadPoolExecutor. Also, when I try to quit the script with Ctrl+C, it seems to quit one process, but immediately afterwards the same scraping continues and I have to stop it again. So something changes, but neither the speed nor the data.
This is my scraper:
from bs4 import BeautifulSoup
import requests
import concurrent.futures

NUM_THREADS = 30
BASEURL = 'https://www.motomoto.lt'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

page = requests.get(BASEURL, headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')
item_list = []

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        executor.map(parse_category, soup)

def parse_category(soup):
    for a in soup.find_all('a', class_='subcategory-name', href=True):
        nexturl = BASEURL + a['href']
        parse_subcategory(nexturl)

def parse_subcategory(url):
    subcategoryPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(subcategoryPage.content, 'html.parser')
    for a in soup.find_all('a', class_='subcategory-image', href=True):
        nexturl= BASEURL + a['href']
        parse_products(nexturl)

def parse_products(url):
    productsPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(productsPage.content, 'html.parser')
    for a in soup.find_all('a', class_='thumbnail product-thumbnail', href=True):
        nexturl = a['href']
        parse_item(nexturl)
    # this = soup.find('a', attrs={'class':'next'}, href=True)
    # if this is not None:
    #     nextpage = BASEURL + this['href']
    #     print('-' * 70)
    #     parse_products(nextpage)

def parse_item(url):
    itemPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(itemPage.content, 'html.parser')
    title = get_title(soup)
    price = get_price(soup)
    category = get_category(soup)
    item = {
        'Title': title,
        'Price': price,
        'Category': category
    }
    item_list.append(item)
    print(item)

def get_title(soup):
    title = soup.find('h1', class_='h1')
    title_value = title.string
    title_string = title_value.strip()
    return title_string

def get_price(soup):
    price = soup.find('span', attrs={'itemprop':'price'}).string.strip()
    return price

def get_category(soup):
    category = soup.find_all("li", attrs={'itemprop':'itemListElement'})[1].find('span', attrs={'itemprop':'name'}).getText()
    return category

if __name__ == "__main__":
    main()
Currently I am only multithreading the first function, which uses the BS4 soup to gather the category links. How can I fix it to make it faster, even though it uses multiple functions?
The signature of ThreadPoolExecutor.map is
map(func, *iterables, timeout=None, chunksize=1)
The executor processes the items of the iterables concurrently.
If you have supplied multiple soups like executor.map(parse_category, [soup1, soup2, ...]) they will be processed in parallel. But since you have supplied only one soup, you are "doing one thing concurrently", which means there is no concurrency.
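For illustration, here is a minimal, self-contained sketch of that behaviour; the URLs and the fetch function below are placeholders and not part of the scraper above:

import concurrent.futures
import requests

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

def fetch(url):
    # each call runs in its own worker thread
    return url, requests.get(url).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # three items in the iterable -> up to three requests in flight at once
    for url, status in executor.map(fetch, urls):
        print(url, status)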
Since you are calling parse_category only once, there is no point adding concurrency to it. Instead, you can parallelize parse_subcategory and parse_products like this:
...

def main():
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS)
    parse_category(soup, executor)

def parse_category(soup, executor):
    executor.map(
        lambda url: parse_subcategory(url, executor),
        [BASEURL + a['href'] for a in soup.find_all('a', class_='subcategory-name', href=True)])

def parse_subcategory(url, executor):
    subcategoryPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(subcategoryPage.content, 'html.parser')
    executor.map(
        lambda url: parse_products(url, executor),
        [BASEURL + a['href'] for a in soup.find_all('a', class_='subcategory-image', href=True)])

def parse_products(url, executor):
    productsPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(productsPage.content, 'html.parser')
    executor.map(
        parse_item,
        # here you missed the `BASEURL`, I kept it as-is
        [a['href'] for a in soup.find_all('a', class_='thumbnail product-thumbnail', href=True)])

...
The remainder of the script is unchanged.
I didn't test it as the website seems inaccessible from my location. Reply if there's any bug.

Loop duplicating results

I'm writing code to web-scrape the Transfermarkt website, but I'm having some issues with it.
The code had returned an error that was fixed through the topic: Loop thru multiple URLs in Python - InvalidSchema("No connection adapters were found for {!r}".format
After this fix, other problems appeared.
First, the code is duplicating results in the data frame.
Second, the code is taking only the last element of each URL. What I actually want is to get all the agency URLs in pagina = range(1) and then scrape all the players of each agency through the URLs scraped in the first part.
P.S.: pagina = range(1) will become range(1, 40); that is the number of pages I will scrape to get all the agencies' links.
Can anyone give me a hand on this issues?
Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from requests.sessions import default_headers

nome=[]
posicao=[]
nacionalidade=[]
idade=[]
clube=[]
contrato=[]
valor=[]

tf = f"http://www.transfermarkt.com.br"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}

pagina = range(1,5)

def main(url):
    with requests.Session() as req:
        links = []
        for lea in pagina:
            print(f"Extraindo links da página {lea}")
            r = req.get(url.format(lea), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{tf}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink")]
            links.extend(link)
            print(f"Collected {len(links)} Links")
            time.sleep(1)
            for url in links:
                r= requests.get(url, headers=headers)
                r.status_code
                soup = BeautifulSoup(r.text, 'html.parser')
                player_info= soup.find_all('tr', class_=['odd', 'even'])
                for info in player_info:
                    player = info.find_all("td")
                    vall= info.find('td', {'class': 'zentriert hauptlink'})
                    nome.append(player[2].text)
                    posicao.append(player[3].text)
                    nacionalidade.append(player[4].img['alt'])
                    idade.append(player[5].text)
                    clube.append(player[6].img['alt'])
                    contrato.append(player[7].text)
                    valor.append(vall)
                time.sleep(1)
    df = pd.DataFrame(
        {"NOME":nome,
         "POSICAO":posicao,
         "NACIONALIDADE":nacionalidade,
         "IDADE":idade,
         "CLUBE":clube,
         "CONTRATO":contrato,
         "VALOR":valor}
    )
    print(df)
    df
    #df.to_csv('MBB.csv', index=False)

main("https://www.transfermarkt.com.br/berater/beraterfirmenuebersicht/berater?ajax=yw1&page={}")

Getting a list of Urls and then finding specific text from all of them in Python 3.5.1

So I have this code that will give me the urls I need in a list format
import requests
from bs4 import BeautifulSoup

offset = 0
links = []

with requests.Session() as session:
    while True:
        r = session.get("http://rayleighev.deviantart.com/gallery/44021661/Reddit?offset=%d" % offset)
        soup = BeautifulSoup(r.content, "html.parser")
        new_links = soup.find_all("a", {'class' : "thumb"})

        # no more links - break the loop
        if not new_links:
            break

        # denotes the number of gallery pages gone through at one time (# of pages times 24 equals the number below)
        links.extend(new_links)
        print(len(links))
        offset += 24

        # denotes the number of gallery pages (# of pages times 24 equals the number below)
        if offset == 48:
            break

for link in links:
    print(link.get("href"))
After that I try to get some text from each of the URLs; the text is in roughly the same place on each page. But whenever I run the second half, below, I keep getting a chunk of HTML text and some errors, and I'm not sure how to fix it, or whether there is another, preferably simpler, way to get the text from each URL.
import urllib.request
import re

for link in links:
    url = print("%s" % link)
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    req = urllib.request.Request(url, headers = headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    paragraphs = re.findall(r'</a><br /><br />(.*?)</div>', str(respData))
    if paragraphs != None:
        paragraphs = re.findall(r'<br /><br />(.*?)</span>', str(respData))
    if paragraphs != None:
        paragraphs = re.findall(r'<br /><br />(.*?)</span></div>', str(respData))

    for eachP in paragraphs:
        print(eachP)

    title = re.findall(r'<title>(.*?)</title>', str(respData))
    for eachT in title:
        print(eachT)
Your code:
for link in links:
    url = print("%s" % link)
assigns None to url. Perhaps you mean:
for link in links:
    url = "%s" % link.get("href")
There's also no reason to use urllib to get the site's content; you can use requests as you did before by changing:
req = urllib.request.Request(url, headers = headers)
resp = urllib.request.urlopen(req)
respData = resp.read()
to
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, "html.parser")
Now you can get the title and paragraph with just:
title = soup.find('div', {'class': 'dev-title-container'}).h1.text
paragraph = soup.find('div', {'class': 'text block'}).text
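Putting both suggestions together, the second half of the script could be sketched roughly like this (assuming the links list from the first half, and that the class names used in this answer still match the page):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

for link in links:
    url = link.get("href")  # the href string, not the printed tag
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, "html.parser")

    title = soup.find('div', {'class': 'dev-title-container'}).h1.text
    paragraph = soup.find('div', {'class': 'text block'}).text
    print(title)
    print(paragraph)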
