I am trying to pull the question and answer section from Lazada through web scraping, but I am having an issue when some of the pages don't have any questions/answers. My code returns nothing when I run it for multiple web pages; it only works for a single page that has questions and answers.
How do I make the code continue reading the rest of the web pages even though the first page has no questions?
I have tried adding an if/else statement to my code, as shown below.
import bleach
import csv
import datetime
import requests
from bs4 import BeautifulSoup

urls = ['url1','url2','url3']

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    now = datetime.datetime.now()
    print("Date data being pulled:")
    print(str(now))
    print("")
    nameList = soup.findAll("div", {"class":"qna-content"})
    for name in nameList:
        if nameList == None:
            print('None')
        else:
            print(name.get_text())
            continue
My expected output would be something like the following:
None --> output from url1
None --> output from url2
can choose huzelnut?
Hi Dear Customer , for the latest expiry date its on 2019 , and we will make sure the expiry date is still more than 6 months. --> output from url3
I appreciate your help, thanks in advance!
You have the check in the wrong place: move it outside the loop, and fix the indentation. Note also that findAll() returns an empty list (never None) when nothing matches, so test for emptiness with "if not nameList:" rather than comparing against None.
import datetime
import requests
from bs4 import BeautifulSoup

urls = ['url1','url2','url3']

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    now = datetime.datetime.now()
    print("Date data being pulled:")
    print(str(now))
    print("")
    nameList = soup.findAll("div", {"class":"qna-content"})
    if not nameList:  # findAll() returns an empty list when nothing matches
        print(url, 'None')
        continue # skip this URL
    for name in nameList:
        print(name.get_text())
I made some changes to the logic of the code and managed to print the records for now. Since I am still learning, I would appreciate it if others could share alternative/better solutions as well.
import datetime
import requests
from bs4 import BeautifulSoup

urls = ['url1','url2','url3']

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    now = datetime.datetime.now()
    print("Date data being pulled:")
    print(str(now))
    print("")
    qna = soup.findAll("div", class_="qna-content")
    for qnaqna in qna:
        if not qnaqna:
            print('List is empty')
        else:
            print(qnaqna.get_text())
            continue
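For comparison, a slightly tighter sketch of the same idea (assuming the same placeholder URLs), which checks the result list once per page rather than checking each element inside the loop:

import datetime
import requests
from bs4 import BeautifulSoup

urls = ['url1', 'url2', 'url3']

for url in urls:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Date data being pulled:", datetime.datetime.now())
    qna = soup.select("div.qna-content")  # CSS selector, equivalent to findAll here
    if not qna:                           # empty list -> this page has no Q&A section
        print(url, 'None')
        continue
    for item in qna:
        print(item.get_text())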
I am trying to find a value from the website https://www.coop.se/butiker-erbjudanden/coop/coop-ladugardsangen-/ with the help of BeautifulSoup: the price is shown together with a unit, "st". But the only value I get is the price number and not the "st" value.
Here is the code I am trying to use to get it:
test = product.find('span', class_='Splash-content ')
print(test.text)
import requests
from bs4 import BeautifulSoup as bsoup

site_source = requests.get("https://www.coop.se/butiker-erbjudanden/coop/coop-ladugardsangen-/").content
soup = bsoup(site_source, "html.parser")

all_items = soup.find("div", class_="Section Section--margin")
item_list = soup.find_all("span", class_="Splash-content")
for item in item_list:
    print("Price: ", item.find("span", class_="Splash-priceLarge").text)
    if item.find("span", class_="Splash-priceSub Splash-priceUnitNoDecimal"):
        print("Unit: ", item.find("span", class_="Splash-priceSub Splash-priceUnitNoDecimal").text)
In some cases the unit is missing, so we want to make sure we handle that.
My understanding is that you basically want to print the price and unit of each item, so that is what I attempt to do.
Try with:

import urllib.request
from urllib.error import HTTPError
from bs4 import BeautifulSoup

url = "https://www.coop.se/butiker-erbjudanden/coop/coop-ladugardsangen-/"
try:
    page = urllib.request.urlopen(url, timeout=20)
except HTTPError as e:
    page = e.read()

soup = BeautifulSoup(page, 'html.parser')
body = soup.find('body')
result = body.find("span", class_="Splash-content")
print(result.get_text())

For me it worked!
So I am trying to scrape links from a random Wikipedia page. Here is my code thus far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib2

# function get random page
def get_random():
    import requests
    # r = requests.get('https://en.wikipedia.org/wiki/Special:Random')
    r = requests.get('https://en.wikipedia.org/wiki/Carole_Ann')
    return r.url
#========================

#finding the valid link
def validlink(href):
    if href:
        if re.compile('^/wiki/').search(href):
            if not re.compile('/\w+:').search(href):
                return True
    return False
#validlink()===========

#the first site
a1 = get_random()
#print("the first site is: " + a1)
# the first site end()====

#looking for the article name:
blin = requests.get(a1)
soup = BeautifulSoup(blin.text, 'html.parser')
title = soup.find('h1', {'class' : 'firstHeading'})
print("starting website: " + a1 + " Titled: " + title.text)
print("")
#=============================
#first article done

#find body:
import re
body = requests.get(a1).text
soup = BeautifulSoup(body, 'lxml')
for link in soup.findAll("a"):
    url = link.get("href", "")
print(
#======================
I know I'm doing this last part wrong. I'm new to Python, so I just have no idea how to go about it. What I need is to follow the random page to a site, pull the link and title off of that site, and then pull the Wikipedia links off of that page, which is what I am trying to do in that last bit of code.
At that point I want to print all of the links it finds, after they have been tested against my validlink function at the top.
Again, forgive me for being new and not understanding this, but please help; I cannot figure this out.
So the question I have is: I need a snippet of code that will pull out all of the links from the Wikipedia page (note that I still don't know how to do this; the for loop was my best guess based on my own research), test the links I pulled against my validlink function, and print out all of the valid links.
If you want it as a list, then create a new list and append() the url if it is valid.
Because the same url can appear many times on a page, I also check whether the url is already in the list.
valid_urls = []

for link in soup.find_all('a'): # find_all('a', {'href': True}):
    url = link.get('href', '')
    if url not in valid_urls and validlink(url):
        valid_urls.append(url)

print(valid_urls)
from bs4 import BeautifulSoup
import requests
import re

# --- functions ---

def is_valid(url):
    """finding the valid link"""
    if url:
        if url.startswith('/wiki/'): # you don't need `re` to check it
            if not re.compile(r'/\w+:').search(url):
                return True
    return False

# --- main ---

#random_url = 'https://en.wikipedia.org/wiki/Special:Random'
random_url = 'https://en.wikipedia.org/wiki/Carole_Ann'

r = requests.get(random_url)
print('url:', r.url)

soup = BeautifulSoup(r.text, 'html.parser')

title = soup.find('h1', {'class': 'firstHeading'})
print('starting website:', r.url)
print('titled:', title.text)
print()

valid_urls = []

for link in soup.find_all('a'): # find_all('a', {'href': True}):
    url = link.get('href', '')
    if url not in valid_urls and is_valid(url):
        valid_urls.append(url)

#print(valid_urls)

#for url in valid_urls:
#    print(url)

print('\n'.join(valid_urls))
I am pretty new to Python. I am trying to scrape the website https://nl.soccerway.com/ using BeautifulSoup. The only problem is that when I scrape the team names, they come out with whitespace surrounding them on the left and right. How can I remove it? I know many people have asked this question before, but I cannot get it to work.
2nd Question:
How can I extract the href of the <a> tag inside a <td>? In the HTML I am working with, the <td class="team team-a"> cell contains an <a> whose text is the club name, Perugia.
import requests
from bs4 import BeautifulSoup

def main():
    url = 'https://nl.soccerway.com/'
    get_detail_data(get_page(url))

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('response code is:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'lxml')
        return soup

def get_detail_data(soup):
    minutes = ""
    score = ""
    TeamA = ""
    TeamB = ""

    table_data = soup.find('table', class_='table-container')
    try:
        for tr in table_data.find_all('td', class_='minute visible'):
            minutes = (tr.text)
            print(minutes)
    except:
        pass
    try:
        for tr in soup.find_all('td', class_='team team-a'):
            TeamA = tr.text
            print(TeamA)
    except:
        pass

if __name__ == '__main__':
    main()
You can use the get_text(strip=True) method from BeautifulSoup:
tr.get_text(strip=True)
Use the strip() method to remove trailing and leading whitespace. So in your case, it would be:
TeamA = tr.text.strip()
To get the href attribute, use the pattern tag['attribute']. In your case, it would be:
href = tr.a['href']
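Putting both answers together in the context of the question's loop, here is a short sketch; the class name comes from the question's code, while the <a> structure (href plus a title attribute) is an assumption:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://nl.soccerway.com/').text, 'lxml')

for td in soup.find_all('td', class_='team team-a'):
    a = td.find('a')               # the anchor holding the club name
    if a is None:                  # guard against cells without a link
        continue
    name = a.get_text(strip=True)  # club name without surrounding whitespace
    href = a.get('href', '')       # link target
    title = a.get('title', '')     # title attribute, assumed to carry the name too
    print(name, href, title)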
def findWeather(city):
    import urllib
    connection = urllib.urlopen("http://www.canoe.ca/Weather/World.html")
    rate = connection.read()
    connection.close()
    currentLoc = rate.find(city)
    curr = rate.find("currentDegree")
    temploc = rate.find("</span>", curr)
    tempstart = rate.rfind(">", 0, temploc)
    print "current temp:", rate[tempstart+1:temploc]
The link is provided above. The issue I have is that every time I run the program with, say, "Brussels" in Belgium as the parameter, i.e. findWeather("Brussels"), it always prints 24c as the temperature, whereas (as I am writing this) it should be 19c. This is the case for many other cities provided by the site. Help with this code would be appreciated.
Thanks!
The problem is that rate.find("currentDegree") always searches from the start of the page, so curr always points at the first temperature on the page regardless of the city; you would need rate.find("currentDegree", currentLoc) to start the search at the city's position. A cleaner approach with BeautifulSoup should work:
import requests
from bs4 import BeautifulSoup

url = 'http://www.canoe.ca/Weather/World.html'
response = requests.get(url)

# Get the text of the contents
html_content = response.text

# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'lxml')

cities = soup.find_all("span", class_="titleText")
cels = soup.find_all("span", class_="currentDegree")

for x, y in zip(cities, cels):
    print(x.text, y.text)
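To recover the behaviour of the original findWeather(city), you can filter the zipped pairs for a matching city name. A minimal sketch building on the same class names as above (the helper name is mine):

import requests
from bs4 import BeautifulSoup

def find_weather(city):
    soup = BeautifulSoup(requests.get('http://www.canoe.ca/Weather/World.html').text, 'lxml')
    cities = soup.find_all('span', class_='titleText')
    cels = soup.find_all('span', class_='currentDegree')
    # the city spans and degree spans appear in matching order, so zip() pairs them up
    for name, degree in zip(cities, cels):
        if city.lower() in name.text.lower():
            return degree.text.strip()
    return None  # city not found on the page

print(find_weather('Brussels'))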
I have been working on a scraper for a little while now, and have come very close to getting it to run as intended. My code is as follows:
import urllib.request
from bs4 import BeautifulSoup

# Crawls main site to get a list of city URLs
def getCityLinks():
    city_sauce = urllib.request.urlopen('https://www.prodigy-living.co.uk/') # Enter url here
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []
    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('https://www.prodigy-living.co.uk/' + a['href'])
    return the_city_links

# Crawls each of the city web pages to get a list of unit URLs
def getUnitLinks():
    getCityLinks()
    for the_city_links in getCityLinks():
        unit_sauce = urllib.request.urlopen(the_city_links)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        for unit_href in unit_soup.findAll('a', class_="btn white-green icon-right-open-big", href=True):
            yield('the_url' + unit_href['href'])

the_unit_links = []
for link in getUnitLinks():
    the_unit_links.append(link)

# Soups returns all of the html for the items in the_unit_links
def soups():
    for the_links in the_unit_links:
        try:
            sauce = urllib.request.urlopen(the_links)
            for things in sauce:
                soup_maker = BeautifulSoup(things, 'html.parser')
                yield(soup_maker)
        except:
            print('Invalid url')

# Below scrapes property name, room type and room price
def getPropNames(soup):
    try:
        for propName in soup.findAll('div', class_="property-cta"):
            for h1 in propName.findAll('h1'):
                print(h1.text)
    except:
        print('Name not found')

def getPrice(soup):
    try:
        for price in soup.findAll('p', class_="room-price"):
            print(price.text)
    except:
        print('Price not found')

def getRoom(soup):
    try:
        for theRoom in soup.findAll('div', class_="featured-item-inner"):
            for h5 in theRoom.findAll('h5'):
                print(h5.text)
    except:
        print('Room not found')

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)
When I run this, it returns all the prices for all the urls picked up. However, it does not return the names or the rooms, and I am not really sure why. I would really appreciate any pointers on this, or ways to improve my code; I have been learning Python for a few months now!
I think that the links you are scraping will in the end redirect you to another website, in which case your scraping functions will not be useful. For instance, the link for a room in Birmingham redirects you to another website.
Also, be careful in your usage of the find() and find_all() methods in BeautifulSoup: find() returns only one tag (as when you want a single property name), while find_all() returns a list, allowing you to get, for instance, multiple room prices and types.
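A minimal, self-contained illustration of the difference (toy HTML, not from the site):

from bs4 import BeautifulSoup

html = '<div><p class="room-price">100</p><p class="room-price">200</p></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('p', class_='room-price').text)                    # first match only: 100
print([p.text for p in soup.find_all('p', class_='room-price')])   # every match: ['100', '200']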
Anyway, I have simplified your code a bit, and this is how I came across your issue. Maybe you would like to take some inspiration from it:
import requests
from bs4 import BeautifulSoup

main_url = "https://www.prodigy-living.co.uk/"

# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find("div", class_="footer-city-nav") # Bottom page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")] # Links to cities

# Getting the individual links to the apts
indiv_apts = []

for link in cities_links[0:4]:
    print("At link:", link)
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    links_tags = soup.find_all("a", class_="btn white-green icon-right-open-big")
    for url in links_tags:
        indiv_apts.append(main_url + url.get("href"))

# Now defining your functions
def GetName(tag):
    print(tag.find("h1").get_text())

def GetType_Price(tags_list):
    for tag in tags_list:
        print(tag.find("h5").get_text())
        print(tag.find("p", class_="room-price").get_text())

# Now scraping each of the apts - name, price, room.
for link in indiv_apts[0:2]:
    print("At link:", link)
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    property_tag = soup.find("div", class_="property-cta")
    rooms_tags = soup.find_all("div", class_="featured-item")
    GetName(property_tag)
    GetType_Price(rooms_tags)
You will see that right at the second element of the list, you will get an AttributeError, because you are no longer on the website's own pages. Indeed:
>>> print(indiv_apts[1])
https://www.prodigy-living.co.uk/http://www.iqstudentaccommodation.com/student-accommodation/birmingham/penworks-house?utm_source=prodigylivingwebsite&utm_campaign=birminghampagepenworksbutton&utm_medium=referral # You will not scrape the expected link right at the beginning
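One way to avoid both the malformed concatenation and the external redirects is to resolve each href against the base url and keep it only if it stays on the same site. A minimal sketch (the helper name is mine, not from the original code):

from urllib.parse import urljoin, urlparse

def same_site_link(base, href):
    # Resolve href against base (handles relative and absolute hrefs),
    # then keep the result only if it stays on the same host.
    full = urljoin(base, href)
    return full if urlparse(full).netloc == urlparse(base).netloc else None

print(same_site_link("https://www.prodigy-living.co.uk/", "/birmingham"))
print(same_site_link("https://www.prodigy-living.co.uk/", "http://www.iqstudentaccommodation.com/penworks-house"))  # None: external redirect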
Next time, come with a precise problem to solve; otherwise, take a look at the Code Review section.
On find and find_all: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
Finally, I think it also answers your question here: https://stackoverflow.com/questions/42506033/urllib-error-urlerror-urlopen-error-errno-11001-getaddrinfo-failed
Cheers :)