I am new to Python and would like to learn web scraping with it.
My first project is the Yellow Pages in Germany.
When executing my code, I get the following IndexError after scraping 12 pages:
Traceback (most recent call last):
  File "C:/Users/Zorro/PycharmProjects/scraping/venv/Lib/site-packages/pip-19.0.3-py3.6.egg/pip/_vendor/pytoml/test.py", line 25, in <module>
    city = city_container[0].text.strip()
IndexError: list index out of range

Process finished with exit code 1
I would like to know how I can skip this error so that Python does not stop scraping.
I tried to use try and except blocks, but did not succeed.
from bs4 import BeautifulSoup as soup
import requests

page_title = "/Seite-"
page_number = 1

for i in range(25):
    my_url = "https://www.gelbeseiten.de/Branchen/Italienisches%20Restaurant/Berlin"
    page_html = requests.get(my_url + page_title + str(page_number))
    page_soup = soup(page_html.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "table"})

    for container in containers:
        name_container = container.findAll("div", {"class": "h2"})
        name = name_container[0].text.strip()

        street_container = container.findAll("span", {"itemprop": "streetAddress"})
        street = street_container[0].text.strip()

        city_container = container.findAll("span", {"itemprop": "addressLocality"})
        city = city_container[0].text.strip()

        plz_container = container.findAll("span", {"itemprop": "postalCode"})
        plz_name = plz_container[0].text.strip()

        tele_container = container.findAll("li", {"class": "phone"})
        tele = tele_container[0].text.strip()

        print(name, "\n" + street, "\n" + plz_name + " " + city, "\n" + tele)
        print()

    page_number += 1
OK, the formatting seems to have suffered a little upon posting the code.
Two things:
1) When web scraping, it is usually advisable to add some downtime between consecutive requests so that you don't get thrown off the server and don't block too many resources. I added time.sleep(5) between the page requests to wait 5 seconds before loading another page.
2) For me, try/except worked just fine once pass is added to the except part. Of course, you can become more sophisticated in treating exceptions - see the sketch after the code below.
from bs4 import BeautifulSoup as soup
import requests
import time

page_title = "/Seite-"
page_number = 1

for i in range(25):
    print(page_number)
    time.sleep(5)
    my_url = "https://www.gelbeseiten.de/Branchen/Italienisches%20Restaurant/Berlin"
    page_html = requests.get(my_url + page_title + str(page_number))
    page_soup = soup(page_html.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "table"})

    for container in containers:
        try:
            name_container = container.findAll("div", {"class": "h2"})
            name = name_container[0].text.strip()

            street_container = container.findAll("span", {"itemprop": "streetAddress"})
            street = street_container[0].text.strip()

            city_container = container.findAll("span", {"itemprop": "addressLocality"})
            city = city_container[0].text.strip()

            plz_container = container.findAll("span", {"itemprop": "postalCode"})
            plz_name = plz_container[0].text.strip()

            tele_container = container.findAll("li", {"class": "phone"})
            tele = tele_container[0].text.strip()

            print(name, "\n" + street, "\n" + plz_name + " " + city, "\n" + tele)
            print()
        except:
            pass

    page_number += 1
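As a small optional refinement of point 2 (this is only a sketch, reusing the containers and page_number variables from the code above): catching only IndexError keeps genuine bugs visible and lets you log which listing was skipped.

    for container in containers:
        try:
            name = container.findAll("div", {"class": "h2"})[0].text.strip()
            street = container.findAll("span", {"itemprop": "streetAddress"})[0].text.strip()
            city = container.findAll("span", {"itemprop": "addressLocality"})[0].text.strip()
            print(name, "\n" + street, "\n" + city)
        except IndexError:
            # one of the fields is missing for this entry - note it and move on
            print("Skipping an incomplete entry on page", page_number)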
My original problem:
"I have created a web scraping tool for picking data from listed houses.
I have a problem when it comes to changing the page. I made a for loop to go from 1 to some number.
The problem is this: on this web page the last "page" can be different all the time. Right now it is 70, but tomorrow it can be 68 or 72. And if I put the range, for example, to (1-74) it will print the last page many times, because if you go over the maximum the page always loads the last one."
Then I got help from Ricco D, who wrote code that knows when to stop:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000'
page = requests.get(url)
soup = bs(page.content, 'html.parser')

last_page = None
pages = []
buttons = soup.find_all('button', class_="Pagination__button__3H2wX")
for button in buttons:
    pages.append(button.text)
print(pages)
This works just fine.
But when I try to combine this with my original code, which also works by itself, I run into an error:
Traceback (most recent call last):
  File "C:/Users/Käyttäjä/PycharmProjects/Etuoviscaper/etuovi.py", line 29, in <module>
    containers = page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
  File "C:\Users\Käyttäjä\PycharmProjects\Etuoviscaper\venv\lib\site-packages\bs4\element.py", line 2173, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
This is the error I get.
Any ideas how to get this to work? Thanks
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
import requests

my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'

filename = "asunnot.csv"
f = open(filename, "w")
headers = "Neliöt; Hinta; Osoite; Kaupunginosa; Kaupunki; Huoneistoselitelmä; Rakennusvuosi\n"
f.write(headers)

page = requests.get(my_url)
soup = soup(page.content, 'html.parser')

pages = []
buttons = soup.findAll("button", {"class": "Pagination__button__3H2wX"})
for button in buttons:
    pages.append(button.text)

last_page = int(pages[-1])

for sivu in range(1, last_page):
    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})

    for container in containers:
        size_list = container.find("div", {"class": "flexboxgrid__col-xs__26GXk flexboxgrid__col-md-4__2DYW-"}).text
        size_number = re.findall("\d+\,*\d+", size_list)
        size = ''.join(size_number)  # Asunnon koko neliöinä (size in square metres)

        prize_line = container.find("div", {"class": "flexboxgrid__col-xs-5__1-5sb flexboxgrid__col-md-4__2DYW-"}).text
        prize_number_list = re.findall("\d+\d+", prize_line)
        prize = ''.join(prize_number_list[:2])  # Asunnon hinta (price)

        address_city = container.h4.text
        address_list = address_city.split(', ')[0:1]
        address = ' '.join(address_list)  # osoite (address)
        city_part = address_city.split(', ')[-2]  # kaupunginosa (district)
        city = address_city.split(', ')[-1]  # kaupunki (city)

        type_org = container.h5.text
        type = type_org.replace("|", "").replace(",", "").replace(".", "")  # asuntotyyppi (type of apartment)

        year_list = container.find("div", {"class": "flexboxgrid__col-xs-3__3Kf8r flexboxgrid__col-md-4__2DYW-"}).text
        year_number = re.findall("\d+", year_list)
        year = ' '.join(year_number)

        print("pinta-ala: " + size)
        print("hinta: " + prize)
        print("osoite: " + address)
        print("kaupunginosa: " + city_part)
        print("kaupunki: " + city)
        print("huoneistoselittelmä: " + type)
        print("rakennusvuosi: " + year)

        f.write(size + ";" + prize + ";" + address + ";" + city_part + ";" + city + ";" + type + ";" + year + "\n")

f.close()
Your main problem has to do with the way you use soup. You first import BeautifulSoup as soup - and then you override this name when you create your first BeautifulSoup instance:
soup = soup(page.content, 'html.parser')
From this point on, soup is no longer a name for the BeautifulSoup class but for the object you just created. Hence, when you try to create a new instance some lines further down (page_soup = soup(req.text, "html.parser")), this fails because soup no longer refers to BeautifulSoup.
So the best thing would be to import the library correctly, like so: from bs4 import BeautifulSoup (or import AND use it as bs - like Ricco D did), and then change the two instantiating lines like so:
soup = BeautifulSoup(page.content, 'html.parser')  # this is Python 2.7 syntax btw
and
page_soup = BeautifulSoup(req.text, "html.parser")  # this is Python 3 syntax btw
If you're on Python 3, the proper requests syntax would be page.text and not page.content, as .content returns bytes in Python 3, which is not what you want (BeautifulSoup needs a str). If you're on Python 2.7, you should probably change req.text to req.content.
Good luck.
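Putting those points together, a minimal sketch of the corrected setup could look like this (only the relevant lines, not the whole script):

import requests
from bs4 import BeautifulSoup  # keep the name BeautifulSoup for the class itself

my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'

page = requests.get(my_url)
soup = BeautifulSoup(page.text, 'html.parser')  # a BeautifulSoup instance named soup

# ... later, inside the page loop ...
req = requests.get(my_url)
page_soup = BeautifulSoup(req.text, "html.parser")  # still works, since the class was never shadowed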
Finding your element by class name doesn't seem to be the best idea here, because the same class name is used for all the following elements.
I don't know what exactly you are looking for because of the language. I suggest you go to the website, press F12, press Ctrl+F and type in an XPath to see which elements you get. If you don't know about XPaths, read this: https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples
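For illustration only, testing such an XPath from Python could look roughly like this with lxml (the XPath below is made up - replace it with whatever you verified via F12/Ctrl+F):

import requests
from lxml import html

page = requests.get("https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1")
tree = html.fromstring(page.content)

# hypothetical XPath - adjust it to the elements you actually want
cards = tree.xpath("//div[contains(@class, 'ListPage__cardContainer')]")
print(len(cards))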
I have created a web scraping tool for picking data from listed houses.
I have a problem when it comes to changing the page. I made a for loop to go from 1 to some number.
The problem is this: on this web page the last "page" can be different all the time. Right now it is 70, but tomorrow it can be 68 or 72. And if I put the range, for example, to (1-74) it will print the last page many times, because if you go over the maximum the page always loads the last one.
html: https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000 <---- if you put this over the real number (70) of pages, it will automatically open the last page (70) as many times as it is ranged.
So how do I make this loop stop when it reaches the maximum number?
for sivu in range(1, 100):
    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})
Thanks
Using the site you gave, you can get the maximum page number by scraping the button texts.
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000'
page = requests.get(url)
soup = bs(page.content, 'html.parser')

last_page = None
pages = []
buttons = soup.find_all('button', class_="Pagination__button__3H2wX")
for button in buttons:
    pages.append(button.text)
print(pages)
Output: ['1', '68', '69', '70']
The last element will be the last page. I was able to get the buttons using class_="Pagination__button__3H2wX". You can just take the last element of the list and use it as the limit of your loop. But note that this might break if the site's developers decide to change something about these buttons.
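Combining that with your loop might look roughly like this (a sketch; pages comes from the snippet above, and the selector can of course change if the site is redesigned):

last_page = int(pages[-1])  # e.g. '70' -> 70
for sivu in range(1, last_page + 1):
    req = requests.get("https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=" + str(sivu))
    # ... parse the listings on this page ...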
So here is my code now. For some reason I still cannot get it going. Any ideas?
Error:
Traceback (most recent call last):
  File "C:/Users/Käyttäjä/PycharmProjects/Etuoviscaper/etuovi.py", line 29, in <module>
    containers = page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
  File "C:\Users\Käyttäjä\PycharmProjects\Etuoviscaper\venv\lib\site-packages\bs4\element.py", line 2173, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
import requests

my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'

filename = "asunnot.csv"
f = open(filename, "w")
headers = "Neliöt; Hinta; Osoite; Kaupunginosa; Kaupunki; Huoneistoselitelmä; Rakennusvuosi\n"
f.write(headers)

page = requests.get(my_url)
soup = soup(page.content, 'html.parser')

pages = []
buttons = soup.findAll("button", {"class": "Pagination__button__3H2wX"})
for button in buttons:
    pages.append(button.text)

last_page = int(pages[-1])

for sivu in range(1, last_page):
    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})

    for container in containers:
        size_list = container.find("div", {"class": "flexboxgrid__col-xs__26GXk flexboxgrid__col-md-4__2DYW-"}).text
        size_number = re.findall("\d+\,*\d+", size_list)
        size = ''.join(size_number)  # Asunnon koko neliöinä (size in square metres)

        prize_line = container.find("div", {"class": "flexboxgrid__col-xs-5__1-5sb flexboxgrid__col-md-4__2DYW-"}).text
        prize_number_list = re.findall("\d+\d+", prize_line)
        prize = ''.join(prize_number_list[:2])  # Asunnon hinta (price)

        address_city = container.h4.text
        address_list = address_city.split(', ')[0:1]
        address = ' '.join(address_list)  # osoite (address)
        city_part = address_city.split(', ')[-2]  # kaupunginosa (district)
        city = address_city.split(', ')[-1]  # kaupunki (city)

        type_org = container.h5.text
        type = type_org.replace("|", "").replace(",", "").replace(".", "")  # asuntotyyppi (type of apartment)

        year_list = container.find("div", {"class": "flexboxgrid__col-xs-3__3Kf8r flexboxgrid__col-md-4__2DYW-"}).text
        year_number = re.findall("\d+", year_list)
        year = ' '.join(year_number)

        print("pinta-ala: " + size)
        print("hinta: " + prize)
        print("osoite: " + address)
        print("kaupunginosa: " + city_part)
        print("kaupunki: " + city)
        print("huoneistoselittelmä: " + type)
        print("rakennusvuosi: " + year)

        f.write(size + ";" + prize + ";" + address + ";" + city_part + ";" + city + ";" + type + ";" + year + "\n")

f.close()
I am trying to get a product list from a website with Selenium. I prototyped the program and everything worked perfectly, but now I built a loop to get all products and it just gives me the same product 484 times (that's the number of products there are on the website).
Here is my code:
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import selenium
from selenium import webdriver

# URL to web scrape from
page_url = "https://www.smythstoys.com/at/de-at/spielzeug/lego/c/SM100114"
driver = webdriver.Chrome()
driver.get(page_url)

buttonName = "loadMoreProducts"
loadMoreButton = driver.find_element_by_id(buttonName)
while loadMoreButton is not None:
    try:
        try:
            loadMoreButton.click()
        except selenium.common.exceptions.ElementNotInteractableException:
            break
    except selenium.common.exceptions.ElementClickInterceptedException:
        break

uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# gets all products
containers = driver.find_elements_by_tag_name('article')
print(len(containers))

# name the output file to write to local disk
out_filename = "smythstoys_product_data.csv"
# header of csv file to be written
headers = "product_name;price; info \n"

# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)

# loops through all products
# -----------------------------------------------------------------------
# here is the problem:
for container in driver.find_elements_by_tag_name('article'):
    print("----------------------------------------------------------------------")
    product_name_container = container.find_element_by_xpath("//h2[@class ='prodName trackProduct']")
    product_name = product_name_container.text
    print(product_name)

    price_container = container.find_element_by_xpath("//div[@class ='price']")
    price = price_container.text
    print("price:", price)

    # ------------------------------------------------------------------------------------
    try:
        info_container = container.find_element_by_xpath("//span[@class ='decalImage-right']").text
        print(info_container)
        if not info_container:
            info = "no special type"
            print(info)
            print(info_container)
            f.write(product_name + "; " + price + "; " + info + "\n")
            continue
        if info_container == "https://smyths-at-prod-images.storage.googleapis.com/sys-master/images/hed/h5f/8823589830686" \
                             "/lego-hard-to-find-decal_CE.svg":
            info = "seltenes Set"
        elif info_container == "https://smyths-at-prod-images.storage.googleapis.com/sys-master/images/h41/h70" \
                               "/8823587930142/new-decal_CE%20%281%29.svg":
            info = "neues Set"
        elif info_container == "https://smyths-at-prod-images.storage.googleapis.com/sys-master/images/hde/hae" \
                               "/8871381303326/sale-decal_CE.svg":
            info = "Sale"
        else:
            info = "unknown type" + info_container
        print(info)
        print(info_container)
    except NameError:
        print("no atribute")
        if info_container is None:
            info = "unknown type"
            print(info)

    # writes the dataset to file
    f.write(product_name + "; " + price + "; " + info + "\n")

f.close()  # Close the file
My output is:
LEGO Star Wars 75244 Tantive IV
price: 199,99€
no special type
and that 484x
I'm not sure why you used Selenium to get the products when requests can do it smoothly. The following is something you'd want to do to get all the products using requests.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

link = "https://www.smythstoys.com/at/de-at/at/de-at/spielzeug/lego/c/SM100114/load-more?"
params = {'q': ':bestsellerRating:productVisible:true', 'page': '1'}

p = 0
while True:
    params['page'] = p
    r = requests.get(link, params=params, headers={
        'content-type': 'application/json; charset=utf-8'
    })
    soup = BeautifulSoup(r.json()['htmlContent'], "lxml")
    if not soup.select_one("a.trackProduct[href]"):
        break

    for item in soup.select("a.trackProduct[href]"):
        product_name = item.select_one("h2.prodName").get_text(strip=True)
        product_price = item.select_one("[itemprop='price']").get("content")
        print(product_name, product_price)

    p += 1
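If you still want the CSV output from your original script, the rows could be collected inside the loop above and written at the end, for example like this (a sketch; the filename and delimiter are arbitrary choices):

import csv

rows = []  # inside the while-loop above, append: rows.append((product_name, product_price))

with open("smythstoys_product_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["product_name", "price"])
    writer.writerows(rows)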
I developed this program to scrape Newegg for PS4 prices. However, I want to scrape multiple pages. Here is what I have, but once it scrapes the first page the program stops. Basically I am trying to change the link so 'Page-1' changes to 2, 3, 4 and so on. Is there a better way to do this?
from bs4 import BeautifulSoup
import requests
import csv

page_num = 1
prod_num = 0

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-' + str(page_num) + '?PageSize=36&order=BESTMATCH').text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('newegg_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Product', 'Price', 'Shipping_info'])

for info in soup.find_all('div', class_='item-container'):
    prod = info.find('a', class_='item-title').text.strip()
    price = info.find('li', class_='price-current').text.strip().splitlines()[1].replace(u'\xa0', '')
    if u'$' not in price:
        price = info.find('li', class_='price-current').text.strip().splitlines()[0].replace(u'\xa0', '')
    ship = info.find('li', class_='price-ship').text.strip()

    print(prod)
    print(price)
    print(ship)

    csv_writer.writerow([prod, price, ship])
    prod_num += 1
    if prod_num > 35:  # there is about 35 items per newegg page
        page_num += 1
    # print(price.splitlines()[1])
    print('-----------')

csv_file.close()
I found the page limit number here,
and I think you can get the page limit by XPath or other means:
# the XPath syntax may look like this
# //span[@class='list-tool-pagination-text']
Hope it's useful for you.
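As a rough sketch of that idea (I'm assuming the span's text contains the total page count as its last number, e.g. something like '1/36', so the parsing may need adjusting):

import re
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-1?PageSize=36&order=BESTMATCH').text
soup = BeautifulSoup(source, 'lxml')

pagination = soup.find('span', class_='list-tool-pagination-text')
if pagination:
    last_page = int(re.findall(r'\d+', pagination.get_text())[-1])  # last number in the text
    print(last_page)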
If you noticed, the "Next" button tag on the last page has the attribute "disabled", so [tag_name].has_attr('disabled') returns True. Using this, you can manage the pagination.
import requests
from bs4 import BeautifulSoup
import csv

csv_file = open('newegg_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Product', 'Price', 'Shipping_info'])

URL_PART1 = "https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-"
URL_PART2 = "?PageSize=36&order=BESTMATCH"
PAGE_NO = 1
url = URL_PART1 + str(PAGE_NO) + URL_PART2

while len(url):
    PAGE_NO += 1
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    all_divs = soup.find_all('div', attrs={'class': 'item-info'})

    for item in all_divs:
        prod = ""
        price = ""
        ship = ""
        # get product name
        prod = item.find('a', attrs={'class': 'item-title'})
        if prod:
            prod = prod.text.strip()
        # get price
        price_part = item.find('li', attrs={'class': 'price-current'})
        if price_part:
            price_part1 = price_part.strong
            if price_part1:
                price_part1 = price_part1.text.strip()
            price_part2 = price_part.sup
            if price_part2:
                price_part2 = price_part2.text.strip()
            if price_part1 and price_part2:
                price = price_part1 + price_part2
        # get shipping info
        ship = item.find('li', attrs={'class': 'price-ship'})
        if ship:
            ship = ship.text.strip()

        csv_writer.writerow([prod, price, ship])

    # manage pagination
    next_button = soup.find('button', attrs={'title': 'Next'})
    if not (next_button.has_attr('disabled')):
        url = URL_PART1 + str(PAGE_NO) + URL_PART2
    else:
        url = ""
I am trying to extract some information about MTG cards from a webpage with the following program, but I repeatedly retrieve information about the initial page given (InitUrl). The crawler is unable to proceed further. I have started to believe that I am not using the correct URLs, or maybe there is a restriction on using urllib that slipped my attention. Here is the code that I have struggled with for weeks now:
import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

InitUrl = "https://mtgsingles.gr/search?q=dragon"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 4  # depth of pages to be retrieved
query = InitUrl.split("?")[1]

for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next
    print(Url)

    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    page_soup = soup(page_html, "html.parser")

    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"

        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    try:
        URL_Next = InitUrl + "&page=" + str(i + 2)
        print("The next URL is: " + URL_Next + "\n")
    except IndexError:
        print("Crawling process completed! No more infomation to retrieve!")
    else:
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
One of the reasons your code fails is that you don't use cookies. The site seems to require these to allow paging.
A clean and simple way of extracting the data you're interested in would be like this:
import requests
from bs4 import BeautifulSoup

# the site actually uses this url under the hood for paging - check out Google Dev Tools
paging_url = "https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon"

return_list = []

# the page-scroll will only work when we support cookies
# so we fetch the page in a session
session = requests.Session()
session.get("https://mtgsingles.gr/")
All pages have a next button except the last one. So we use this knowledge to loop until the next button goes away. When it does - meaning that the last page is reached - the button is replaced with a 'li' tag with the class 'next hidden', which only exists on the last page.
Now we're ready to start looping:
page = 1  # set count for start page
keep_paging = True  # use flag to end loop when last page is reached

while keep_paging:
    print("[*] Extracting data for page {}".format(page))
    r = session.get(paging_url.format(page))
    soup = BeautifulSoup(r.text, "html.parser")
    items = soup.select('.iso-item.item-row-view.clearfix')

    for item in items:
        name = item.find('div', class_='col-md-10').get_text().strip().split('\xa0')[0]
        toughness_element = item.find('div', class_='card-power-toughness')
        try:
            toughness = toughness_element.get_text().strip()
        except:
            toughness = None
        cardtype = item.find('div', class_='cardtype').get_text()
        card_dict = {
            "name": name,
            "toughness": toughness,
            "cardtype": cardtype
        }
        return_list.append(card_dict)

    if soup.select('li.next.hidden'):  # this element only exists if the last page is reached
        keep_paging = False
        print("[*] Scraper is done. Quitting...")
    else:
        page += 1

# do stuff with your list of dicts - e.g. load it into pandas and save it to a spreadsheet
This will scroll until no more pages exist, no matter how many subpages the site has.
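As a minimal example of the pandas suggestion in the last comment (assuming pandas is installed; the filename is arbitrary):

import pandas as pd

df = pd.DataFrame(return_list)  # one row per card dict
df.to_csv("mtg_dragon_cards.csv", index=False)  # or df.to_excel(...) for a spreadsheet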
My point in the comment above was merely that if you encounter an exception in your code, your page count would never increase. That's probably not what you want, which is why I recommended that you learn a little more about the behaviour of the whole try-except-else-finally construct.
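To illustrate that point, here is a tiny, self-contained example of how the four blocks interact (nothing site-specific, just the control flow):

for i in range(3):
    try:
        value = 10 / i  # raises ZeroDivisionError when i == 0
    except ZeroDivisionError:
        print(i, "-> exception caught, the else block is skipped")
    else:
        print(i, "-> no exception, result is", value)  # runs only if the try block succeeded
    finally:
        print(i, "-> finally always runs")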
I am also baffled by the request returning the same reply and ignoring the page parameter. As a dirty solution I can offer you to first set the page size to a high enough number to get all the items that you want (this parameter works for some reason...):
import re
from math import ceil
import requests
from bs4 import BeautifulSoup as soup

InitUrl = Url = "https://mtgsingles.gr/search"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 2  # depth of pages to be retrieved
query = "dragon"
cardSet = set()

for i in range(1, NumOfPages):
    page_html = requests.get(InitUrl, params={"page": i, "q": query, "page-size": 999})
    print(page_html.url)
    page_soup = soup(page_html.text, "html.parser")

    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"

        cardType = card.contents[3].text
        cardString = card_name + "\n" + cardP_T + "\n" + cardType + "\n"
        cardSet.add(cardString)
        print(cardString)

    NumOfCrawledPages += 1
    print("Moving to page : " + str(NumOfCrawledPages + 1) + " with " + str(len(cards)) + " (cards)\n")