I am trying to scrape CoinMarketCap for the historical data (Open, High, Low, Close, Market Cap, Volume) of all cryptocurrencies from April 2013 to October 2019.
for j in range(0, name_size):
    url = ("https://coinmarketcap.com/currencies/" + str(name[j]) + "/historical-data/?start=20130429&end=20191016")
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    priceDiv = soup.find('div', attrs={'class':'table-responsive'})
    rows = priceDiv.find_all('tr')
The problem is that some of the URLs don't exist, and I don't know how to skip those. Can you please help me?
Use try-except
for j in range(0, name_size):
    url = ("https://coinmarketcap.com/currencies/" + str(name[j]) + "/historical-data/?start=20130429&end=20191016")
    try:
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        priceDiv = soup.find('div', attrs={'class':'table-responsive'})
    except:
        print("Could not open url")
        continue  # skip this URL, otherwise priceDiv would be undefined below
    rows = priceDiv.find_all('tr')
Use error catching:

try:
    ...  # do the thing
except Exception as e:
    print(e)  # here you can print the error

The URLs that raise an error are simply skipped with the printed message, and the task continues with the rest.
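For the scraper above, a slightly more targeted sketch (reusing the question's name and name_size, which are assumed to be defined earlier) is to catch the specific errors urllib raises for missing pages and skip to the next coin:

import urllib.error
import urllib.request

from bs4 import BeautifulSoup

for j in range(0, name_size):
    url = "https://coinmarketcap.com/currencies/" + str(name[j]) + "/historical-data/?start=20130429&end=20191016"
    try:
        page = urllib.request.urlopen(url)
    except (urllib.error.HTTPError, urllib.error.URLError) as e:
        print("Could not open", url, "-", e)
        continue  # skip this coin and move on to the next one
    soup = BeautifulSoup(page, 'html.parser')
    priceDiv = soup.find('div', attrs={'class': 'table-responsive'})
    if priceDiv is None:
        continue  # page loaded but had no historical-data table
    rows = priceDiv.find_all('tr')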
I want to check if there is any content available on more than 500 webpages, using Beautiful Soup. Below is the script I wrote. It works, but at some point it stops, and if I fix one error it shows a different one. I just want to be sure the page has a body. I'm also unsure how to handle timeouts; maybe the website just needs more time.
method 1:

res = requests.get(full_https_url, timeout=40)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems == '':
    pass
else:
    print('body found')
method 2:

soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems != '':
    print('body found')
else:
    pass
select() returns a list, not a string, so it will always compare not equal to '', whether it's successful or not. Just test if the result is not empty.
Use try/except to catch the timeout error.
try:
    res = requests.get(full_https_url, timeout=40)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('body')
    if elems:
        pass  # do stuff
    else:
        print(f"No body in {full_https_url}")
except requests.exceptions.Timeout:
    print(f"Timeout on {full_https_url}, skipping")
I'm having an issue. My code loops through the list of URLs, but it's not adding the text content of each scraped page to the presults list. I haven't gotten to the raw text processing yet; I'll probably post a separate question for that if I can't figure it out. What is wrong here? The length of presults stays at 1 even though the loop does seem to run over all the URLs.
Here's part of the code I'm having an issue with:
counter = 0
for xa in range(0, len(qresults)):
    pageURL = qresults[xa].format()
    pageresp = requests.get(pageURL, headers=headers)
    if pageresp.status_code == 200:
        print(pageURL)
        psoup = BeautifulSoup(pageresp.content, 'html.parser')
        presults = []
        para = psoup.text
        presults.append(para)
        print(len(presults))
    else:
        print("Could not reach domain")

print(len(presults))
Your immediate problem is here:
presults=[]
para=psoup.text
presults.append(para)
On every for iteration, you replace your existing presults list with the empty list and add one item. On the next iteration, you again wipe out the previous result.
The initialization must be done only once, before the loop:
presults = []
for xa in range(0,len(qresults)):
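Putting it together, a corrected version of the loop (a sketch reusing the question's names; qresults and headers are assumed to be defined earlier) could look like this:

presults = []  # create the list once, before the loop

for xa in range(0, len(qresults)):
    pageURL = qresults[xa].format()
    pageresp = requests.get(pageURL, headers=headers)
    if pageresp.status_code == 200:
        psoup = BeautifulSoup(pageresp.content, 'html.parser')
        presults.append(psoup.text)  # accumulate instead of overwriting
    else:
        print("Could not reach domain")

print(len(presults))  # one entry per successfully scraped page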
OK, I don't even see you looping through any URLs here, but below is a generic example of how this kind of paginated request can be handled.
import requests
from bs4 import BeautifulSoup

# note the trailing "page=" so the page number can be appended below
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1

while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    firme = zute_soup.findAll('div', {'class': 'jobs-item'})
    for title in firme:
        title1 = title.findAll('h6')[0].text
        print(title1)
        adresa = title.findAll('div', {'class': 'description'})[0].text
        print(adresa)
        kontakt = title.findAll('div', {'class': 'description'})[1].text
        print(kontakt)
        print('\n')
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )
    current_page += 1
I am building a scraper for eBay. I am trying to figure out a way to manipulate the page-number portion of the eBay URL to go to the next page until there are no more pages (if you were on page 2, the page-number portion would look like "_pgn=2"). I noticed that if you put in any number greater than the listing's maximum number of pages, the page reloads to the last page instead of giving a "page doesn't exist" error (if a listing has 5 pages, then _pgn=5 and, say, _pgn=100 route to the same page).

How can I start at page one, get the HTML soup of the page, pull all the relevant data I want from that soup, then load the next page with the new page number, and repeat until there are no new pages to scrape?

I tried to get the number of results a listing has with a Selenium XPath, take math.ceil of that count divided by 50 (the default maximum listings per page), and use the quotient as my max_page, but I get errors saying the element doesn't exist even though it does: self.driver.findxpath('xpath').text. That 243 (the result count shown on the page) is what I am trying to get with the XPath.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


class EbayScraper(object):

    def __init__(self, item, buying_type):
        self.base_url = "https://www.ebay.com/sch/i.html?_nkw="
        self.driver = webdriver.Chrome(r"chromedriver.exe")
        self.item = item
        self.buying_type = buying_type + "=1"
        self.url_seperator = "&_sop=12&rt=nc&LH_"
        self.url_seperator2 = "&_pgn="
        self.page_num = "1"

    def getPageUrl(self):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        url = self.base_url + self.item + self.url_seperator + self.buying_type + self.url_seperator2 + self.page_num
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        for listing in soup.find_all("li", {"class": "s-item"}):
            raw = listing.find_all("a", {"class": "s-item__link"})
            if raw:
                raw_price = listing.find_all("span", {"class": "s-item__price"})[0]
                raw_title = listing.find_all("h3", {"class": "s-item__title"})[0]
                raw_link = listing.find_all("a", {"class": "s-item__link"})[0]
                raw_condition = listing.find_all("span", {"class": "SECONDARY_INFO"})[0]
                condition = raw_condition.text
                price = float(raw_price.text[1:])
                title = raw_title.text
                link = raw_link['href']
                print(title)
                print(condition)
                print(price)
                if self.buying_type != "BIN=1":
                    raw_time_left = listing.find_all("span", {"class": "s-item__time-left"})[0]
                    time_left = raw_time_left.text[:-4]
                    print(time_left)
                print(link)
                print('\n')


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    page = instance.getPageUrl()
    instance.getInfo(page)
If you want to iterate over all pages and gather all results, your script needs to check whether there is a next page after it visits each page:
import requests
from bs4 import BeautifulSoup


class EbayScraper(object):

    def __init__(self, item, buying_type):
        ...
        self.currentPage = 1

    def get_url(self, page=1):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        # _ipg=200 means we expect 200 items per page
        return '{}{}{}{}{}{}&_ipg=200'.format(
            self.base_url, self.item, self.url_seperator, self.buying_type,
            self.url_seperator2, page
        )

    def page_has_next(self, soup):
        container = soup.find('ol', 'x-pagination__ol')
        currentPage = container.find('li', 'x-pagination__li--selected')
        next_sibling = currentPage.next_sibling
        if next_sibling is None:
            print(container)
        return next_sibling is not None

    def iterate_page(self):
        # loop while there are more pages, otherwise end
        while True:
            page = self.getPageUrl(self.currentPage)
            self.getInfo(page)
            if self.page_has_next(page) is False:
                break
            else:
                self.currentPage += 1

    def getPageUrl(self, pageNum):
        url = self.get_url(pageNum)
        print('page: ', url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        ...


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    instance.iterate_page()
The important functions here are page_has_next and iterate_page.

page_has_next - checks whether the page's pagination has another li element after the currently selected page; e.g. with < 1 2 3 >, if we are on page 1 it checks whether a 2 comes next.

iterate_page - loops until there is no next page.

Also note that you don't need Selenium for this unless you need to mimic user clicks or need a browser to navigate.
I wrote a script that web scrapes data for a list of stocks. The data lives on two separate pages, so each stock symbol requires scraping two different pages. If I run the process on a list 1,000 items long, it takes around 30 minutes to complete. That's not horrible (I can set it and forget it), but I'm wondering if there is a way to speed up the process. Maybe store the data and write it all at the end instead of on each loop? Any other ideas are appreciated.
import requests
from BeautifulSoup import BeautifulSoup
from progressbar import ProgressBar
import csv

symbols = {'AMBTQ','AABA','AAOI','AAPL','AAWC','ABEC','ABQQ','ACFN','ACIA','ACIW','ACLS'}
pbar = ProgressBar()

with open('industrials.csv', "ab") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    writer.writerow(['Symbol','5 Yr EPS','EPS TTM'])
    for s in pbar(symbols):
        try:
            url1 = 'https://research.tdameritrade.com/grid/public/research/stocks/fundamentals?symbol='
            full1 = url1 + s
            response1 = requests.get(full1)
            html1 = response1.content
            soup1 = BeautifulSoup(html1)
            for hist_div in soup1.find("div", {"data-module-name": "HistoricGrowthAndShareDetailModule"}):
                EPS5yr = hist_div.find('label').text
        except Exception as e:
            EPS5yr = 'Bad Data'
            pass
        try:
            url2 = 'https://research.tdameritrade.com/grid/public/research/stocks/summary?symbol='
            full2 = url2 + s
            response2 = requests.get(full2)
            html2 = response2.content
            soup2 = BeautifulSoup(html2)
            for div in soup2.find("div", {"data-module-name": "StockSummaryModule"}):
                EPSttm = div.findAll("dd")[11].text
        except Exception as e:
            EPSttm = "Bad data"
            pass
        writer.writerow([s,EPS5yr,EPSttm])
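Since most of the 30 minutes is spent waiting on the network, one way to speed this up is exactly what the question suggests: fetch concurrently and write all the rows once at the end. The sketch below assumes Python 3 with bs4; the parsing is reduced to a placeholder, and the scrape_symbol helper and shortened symbol list are illustrative, not part of the original script.

import csv
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup  # bs4 instead of the old BeautifulSoup 3 import

symbols = ['AAPL', 'AAOI', 'ACIW']  # shortened list for the sketch

def scrape_symbol(s):
    """Fetch both pages for one symbol and return a CSV row.

    The parsing is a placeholder here; the real selectors are the ones
    already used in the question's loop body.
    """
    row = [s]
    for page in ('fundamentals', 'summary'):
        url = 'https://research.tdameritrade.com/grid/public/research/stocks/{}?symbol={}'.format(page, s)
        try:
            soup = BeautifulSoup(requests.get(url, timeout=30).content, 'html.parser')
            # placeholder: pull the EPS values here instead of the page title
            row.append(soup.title.get_text(strip=True) if soup.title else 'Bad Data')
        except Exception:
            row.append('Bad Data')
    return row

# fetch concurrently, then write everything in one go at the end
with ThreadPoolExecutor(max_workers=10) as pool:
    rows = list(pool.map(scrape_symbol, symbols))

with open('industrials.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Symbol', '5 Yr EPS', 'EPS TTM'])
    writer.writerows(rows)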
I'm trying to scrape data from Brazil's Supreme Court using Python's BeautifulSoup and Requests.
Every one of the 144 links has a number between 1 and 3 in the end (e.g.: http://www.stf.jus.br/portal/remuneracao/listarRemuneracao.asp?periodo=012007&ano=2007&mes=01&folha=3).
There's no pattern in the 'folha'(sheet, in Portuguese) part. Some months are 1, others are 2 or 3. It seems random. When the URL with the wrong number is accessed, the site loads, but with the message 'A folha solicitada não é válida' (the requested sheet is invalid, in Portuguese).
In my code (below), after building a list of the links without the sheet number, I load each page and check whether that message is there. If it is, inside a try block, the code appends the next number (2, then 3) to the URL.
But the code doesn't run. Is there a way to use try/except to handle the three possible outcomes?
records = []
for x in links:
    r = requests.get(x+'1')
    soup = BeautifulSoup(r.text, 'html.parser')
    if BeautifulSoup(r.text, 'html.parser') == 'A folha solicitada não é válida':
        try:
            r = requests.get(x+'2')
            soup = BeautifulSoup(r.text, 'html.parser')
            if BeautifulSoup(r.text, 'html.parser') == 'A folha solicitada não é válida':
                try:
                    r = requests.get(x+'3')
                    soup = BeautifulSoup(r.text, 'html.parser')
                else:
                    continue
        else:
            continue
    mes = x[-30:-28]+'/'+x[-28:-24]
    ativos = soup.find_all('table', {'id':'ministros_ativos'})
    ativos = ativos[0]
    for x in range(0,11):
        nome = ativos.find_all('a', {'class':'exibirServidor'})[x].text
        salarios = ativos.contents[3].findAll('td', {'align':'right'})
        salarios_brutos = salarios[::2]
        salarios_liquidos = salarios[1::2]
        for x in salarios_liquidos:
            liquido = x.text
        for x in salarios_brutos:
            bruto = x.text
        records.append((nome, bruto, liquido, mes))
You can use range to produce the numbers 1 to 3 and iterate over them to build each URL. If the response is valid, break the loop and continue with your code.
for x in links:
    for i in range(1, 4):
        try:
            r = requests.get(x + str(i))
        except Exception as e:
            continue
        if 'A folha solicitada não é válida' not in r.text:
            break
    else:
        continue
    soup = BeautifulSoup(r.text, 'html.parser')
Notes:
For Python 2 you'll have to turn the error message into a unicode literal (use the u prefix).
requests won't raise an exception for a 404 response, so you don't need a try/except for that; however, other exceptions may occur.
Use except to catch exceptions. else is used after except in a try/except/else block, and is executed only if no exception occurs.
The else clause of the for/else block is executed if the loop doesn't break. Basically it means "continue with the next x if no valid response is received".
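As a minimal illustration of that for/else behaviour (with a made-up condition standing in for the real response check):

for i in range(1, 4):
    print('trying sheet', i)
    if i == 2:  # pretend sheet 2 returned a valid page
        break   # the else clause below is skipped
else:
    print('no valid sheet found')  # runs only if the loop never breaks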