Python/BeautifulSoup Web Scraping Issue: Irregular Returns

I've scoured the questions/answers and have attempted to implement changes to the following, but to no avail.
I'm trying to scrape pages of course listings from Coursera's "Data Analysis" results, https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=1.
There are 9 pages, each with 25 courses, and each course is under its own <h2> tag. I've found some success with the following code, but it has not been consistent:
from urllib.request import urlopen
from bs4 import BeautifulSoup

courses_data_sci = []
for i in range(10):
    page = "https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=" + str(i)
    html = urlopen(page)
    soup = BeautifulSoup(html.read(), "html.parser")
    for meta in soup.find_all('div', {'id': 'rendered-content'}):
        for x in range(26):
            try:
                course = meta.find_all('h2')[x].text.strip()
                courses_data_sci.append(course)
            except IndexError:
                pass
This code seems to return the first 2-3 pages of results and the last page of results; sometimes, if I run it again after clearing courses_data_sci, it will return the 4th page of results a few times. (I'm working in Jupyter, and I've restarted the kernel to rule out any issues there.)
I'm not sure why the code isn't working correctly, let alone why it returns inconsistent results.
Any help is appreciated. Thank you.
UPDATE
Thanks for the ideas...I am trying to utilize both to make the code work.
Just out of curiosity, I pared down the code to see what it was picking up, with both comments in mind.
courses_data_sci = []
session = requests.Session()
for i in range(10):
    page = "https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=" + str(i)
    html = urlopen(page)
    soup = BeautifulSoup(html.read(), "html.parser")
    for meta in soup.find_all('div', {'id': 'rendered-content'}):
        course = meta.find_all('h2')
        courses_data_sci.append(course)
    # This is to check length of courses_data_sci across pages
    print('Page: %s -- total length %s' % (i, len(courses_data_sci)))
This actually results in a list of lists, which does contain all the courses across the 9 pages (still wrapped in their tags, of course, href info and all, since the text isn't being extracted yet). Each pass pushes one list per page onto courses_data_sci: a list of all the courses on that page. So it appears I should be able to extract the text as each page's list is built, as in the sketch below.
There are 2 <h2> tags per course, so I'm also thinking there could be an issue with the second range() call, for x in range(26). I've tried several different ranges, but none of them works, or they raise an "index out of range" error.
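For instance, dropping the fixed range() entirely and extracting the text while iterating over whatever <h2> tags are actually present (a pared-down sketch that reuses the soup from the loop above):

for meta in soup.find_all('div', {'id': 'rendered-content'}):
    for h2 in meta.find_all('h2'):
        courses_data_sci.append(h2.text.strip())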

I get the same behaviour using your code.
I changed it in order to use requests:
from bs4 import BeautifulSoup
import requests

courses_data_sci = []
session = requests.Session()
for i in range(10):
    page = "https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=" + str(i)
    html = session.get(page)
    soup = BeautifulSoup(html.text, "html.parser")
    for meta in soup.find_all('div', {'id': 'rendered-content'}):
        for x in range(26):
            try:
                course = meta.find_all('h2')[x].text.strip()
                courses_data_sci.append(course)
            except IndexError:
                pass
    # This is to check length of courses_data_sci across pages
    print('Page: %s -- total length %s' % (i, len(courses_data_sci)))
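If pages still come back incomplete now and then, it may also be worth sending a browser-like User-Agent and failing loudly on bad responses, so an error page is never silently parsed. A minimal sketch (the header value is a generic assumption, and the inner h2 loop replaces the fixed range):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # generic browser-like UA; the exact string is an assumption

courses_data_sci = []
for i in range(1, 10):  # the site has 9 pages, numbered 1-9
    page = "https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=" + str(i)
    resp = session.get(page)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    for meta in soup.find_all('div', {'id': 'rendered-content'}):
        for h2 in meta.find_all('h2'):
            courses_data_sci.append(h2.text.strip())
    print('Page: %s -- total length %s' % (i, len(courses_data_sci)))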

Related

BeautifulSoup yields only one result

I have searched for a number of similar problems here, but am still confused why my code yields only 1 result, instead of at least 15 per page.
import requests
from bs4 import BeautifulSoup

for pagenumber in range(0, 2):
    url = 'https://www.autowereld.nl/volkswagen/?mdl=volkswagen_golf|volkswagen_golf-alltrack|volkswagen_golf-cabriolet|volkswagen_golf-plus|volkswagen_golf-sportsvan|volkswagen_golf-variant&p='
    txt = requests.get(url + str(pagenumber))
    soup = BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('article', class_="item")
    for car in soup_table.findAll('a'):
        link = car.get('href')
        sub_url = 'https://www.autowereld.nl' + link
        print(sub_url)
You are using soup.find to find something with tag "article" and class "item". Per the documentation, soup.find returns only the first instance of a tag, while you are looking for every instance. Something like
for pagenumber in range(0, 2):
    url = 'https://www.autowereld.nl/volkswagen/?mdl=volkswagen_golf|volkswagen_golf-alltrack|volkswagen_golf-cabriolet|volkswagen_golf-plus|volkswagen_golf-sportsvan|volkswagen_golf-variant&p='
    txt = requests.get(url + str(pagenumber))
    soup = BeautifulSoup(txt.text, 'html.parser')
    soup_articles = soup.findAll('article', class_="item")
    for article in soup_articles:
        for car in article.findAll('a'):
            link = car.get('href')
            sub_url = 'https://www.autowereld.nl' + link
            print(sub_url)
might work. I also recommend using find_all instead of findAll if you're using bs4 since the mixed case versions of these methods are deprecated, but that's up to you.

Short & Easy - soup.find_all Not Returning Multiple Tag Elements

I need to scrape all 'a' tags with "result-title" class, and all 'span' tags with either class 'results-price' and 'results-hood'. Then, write the output to a .csv file across multiple columns. The current code does not print anything to the csv file. This may be bad syntax but I really can't see what I am missing. Thanks.
f = csv.writer(open(r"C:\Users\Sean\Desktop\Portfolio\Python - Web Scraper\RE Competitor Analysis.csv", "wb"))

def scrape_links(start_url):
    for i in range(0, 2500, 120):
        source = urllib.request.urlopen(start_url.format(i)).read()
        soup = BeautifulSoup(source, 'lxml')
        for a in soup.find_all("a", "span", {"class" : ["result-title hdrlnk", "result-price", "result-hood"]}):
            f.writerow([a['href']], span['results-title hdrlnk'].getText(), span['results-price'].getText(), span['results-hood'].getText())
        if i < 2500:
            sleep(randint(30, 120))
            print(i)

scrape_links('my_url')
If you want to find multiple tags with one call to find_all, you should pass them in a list. For example:
soup.find_all(["a", "span"])
Without access to the page you are scraping, it's too hard to give you a complete solution, but I recommend extracting one variable at a time and printing it to help you debug. For example:
a = soup.find('a', class_='result-title')
a_link = a['href']
a_text = a.text
spans = soup.find_all('span', class_=['results-price', 'result-hood'])
row = [a_link, a_text] + [s.text for s in spans]
print(row)  # verify we are getting the results we expect
f.writerow(row)
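One more thing worth checking, since your use of urllib.request implies Python 3: there, the csv output file must be opened in text mode, or every writerow call fails with a TypeError. A minimal sketch (the path is a stand-in, not your real one):

import csv

# Python 3: open csv output in text mode with newline='', not 'wb' (binary mode raises TypeError on write)
f = csv.writer(open(r"output.csv", "w", newline=""))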

Using For Loop with BeautifulSoup to Select Text on a Different URLs

I have been scratching my head for nearly 4 days trying to find the best way to loop through a table of URLs on one website, request each URL, and scrape text from 2 different areas of the second site.
I have tried to rewrite this script multiple times, using several different solutions to achieve my desired results; however, I have not been able to fully accomplish this.
Currently, I am able to select the first link of the table on page one, go to the new page, and select the data I need, but I can't get the code to continue looping through every link on the first page.
import requests
from bs4 import BeautifulSoup

journal_site = "https://journals.sagepub.com"
site_link = 'http://journals.sagepub.com/action/showPublications?pageSize=100&startPage='
# each page contains 100 results I need to scrape from
page_1 = '0'
page_2 = '1'
page_3 = '3'
page_4 = '4'
journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')
for table_row in soup.select('div.results'):
    journal_name = table_row.findAll('tr', class_='False')
    journal_link = table_row.find('a')['href']
    journal_page = journal_site + journal_link
    r = requests.get(journal_page)
    soup = BeautifulSoup(r.text, 'html.parser')
    for journal_header, journal_description in zip(soup.select('main'),
                                                   soup.select('div.journalCarouselTextText')):
        try:
            title = journal_header.h1.text.strip()
            description = journal_description.p.text.strip()
            print(title, ':', description)
        except AttributeError:
            continue
What is the best way to find the title and the description for every journal_name? Thanks in advance for the help!
Most of your code works for me, just needed to modify the middle section of the code, leaving the parts before and after the same:
# all code same up to here
journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find("div", {"class": "results"})
table = results.find('table')
for row in table.find_all('a', href=True):
    journal_link = row['href']
    journal_page = journal_site + journal_link
    # from here same as your code
I stopped after it got the fourth response (title/description) out of the 100 results on the first page. I'm pretty sure it will get all the expected results; it only needs to loop through the 4 subsequent pages, as sketched below.
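A sketch of that outer loop, reusing the names from the question and assuming the startPage parameter simply counts 0 through 4 across the five result pages:

import requests
from bs4 import BeautifulSoup

journal_site = "https://journals.sagepub.com"
site_link = 'http://journals.sagepub.com/action/showPublications?pageSize=100&startPage='

for start_page in range(5):  # assumes startPage runs 0-4
    r = requests.get(site_link + str(start_page))
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find("div", {"class": "results"})
    table = results.find('table')
    for row in table.find_all('a', href=True):
        journal_page = journal_site + row['href']
        # ...fetch journal_page and scrape title/description as in the question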
Hope this helps.

Python scraper advice

I have been working on a scraper for a little while now and have come very close to getting it to run as intended. My code is as follows:
import urllib.request
from bs4 import BeautifulSoup

# Crawls main site to get a list of city URLs
def getCityLinks():
    city_sauce = urllib.request.urlopen('https://www.prodigy-living.co.uk/')  # Enter url here
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []
    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('https://www.prodigy-living.co.uk/' + a['href'])
    return the_city_links

# Crawls each of the city web pages to get a list of unit URLs
def getUnitLinks():
    getCityLinks()
    for the_city_links in getCityLinks():
        unit_sauce = urllib.request.urlopen(the_city_links)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        for unit_href in unit_soup.findAll('a', class_="btn white-green icon-right-open-big", href=True):
            yield('the_url' + unit_href['href'])

the_unit_links = []
for link in getUnitLinks():
    the_unit_links.append(link)

# Soups returns all of the html for the items in the_unit_links
def soups():
    for the_links in the_unit_links:
        try:
            sauce = urllib.request.urlopen(the_links)
            for things in sauce:
                soup_maker = BeautifulSoup(things, 'html.parser')
                yield(soup_maker)
        except:
            print('Invalid url')

# Below scrapes property name, room type and room price
def getPropNames(soup):
    try:
        for propName in soup.findAll('div', class_="property-cta"):
            for h1 in propName.findAll('h1'):
                print(h1.text)
    except:
        print('Name not found')

def getPrice(soup):
    try:
        for price in soup.findAll('p', class_="room-price"):
            print(price.text)
    except:
        print('Price not found')

def getRoom(soup):
    try:
        for theRoom in soup.findAll('div', class_="featured-item-inner"):
            for h5 in theRoom.findAll('h5'):
                print(h5.text)
    except:
        print('Room not found')

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)
When I run this, it returns all the prices for all the urls picked up. However, it does not return the names or the rooms, and I am not really sure why. I would really appreciate any pointers on this, or ways to improve my code - I've been learning Python for a few months now!
I think the links you are scraping will, in the end, redirect you to another website, in which case your scraping functions will not be useful!
For instance, the link for a room in Birmingham redirects you to another website.
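One way to guard against that is to check the final URL after requests has followed any redirects; a minimal sketch, assuming you only want pages that stay on the original domain (unit_link is a hypothetical name for one of your scraped unit URLs):

import requests

unit_link = "https://www.prodigy-living.co.uk/birmingham"  # placeholder for a scraped unit URL
resp = requests.get(unit_link)  # requests follows redirects by default; resp.url is the final address
if not resp.url.startswith("https://www.prodigy-living.co.uk"):
    print('Redirected off-site, skipping:', resp.url)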
Also, be careful with your usage of the find and find_all methods in BS4. The first returns only one tag (as when you want a single property name), while find_all() returns a list, allowing you to get, for instance, multiple room prices and types.
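For instance, with the room prices (a two-line contrast sketch; soup is assumed to hold a parsed property page):

first_price = soup.find("p", class_="room-price")     # a single Tag, or None if nothing matches
all_prices = soup.find_all("p", class_="room-price")  # always a list, possibly empty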
Anyway, I have simplified your code a bit, and this is how I came across your issue. Maybe you can take some inspiration from it:
import requests
from bs4 import BeautifulSoup

main_url = "https://www.prodigy-living.co.uk/"

# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find("div", class_="footer-city-nav")  # Bottom of page, not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")]  # Links to cities

# Getting the individual links to the apts
indiv_apts = []
for link in cities_links[0:4]:
    print("At link: ", link)
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    links_tags = soup.find_all("a", class_="btn white-green icon-right-open-big")
    for url in links_tags:
        indiv_apts.append(main_url + url.get("href"))

# Now defining your functions
def GetName(tag):
    print(tag.find("h1").get_text())

def GetType_Price(tags_list):
    for tag in tags_list:
        print(tag.find("h5").get_text())
        print(tag.find("p", class_="room-price").get_text())

# Now scraping each of the apts - name, price, room.
for link in indiv_apts[0:2]:
    print("At link: ", link)
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    property_tag = soup.find("div", class_="property-cta")
    rooms_tags = soup.find_all("div", class_="featured-item")
    GetName(property_tag)
    GetType_Price(rooms_tags)
You will see that right at the second element of the list, you will get an AttributeError, as you are no longer on the original website's pages. Indeed:
>>> print(indiv_apts[1])
https://www.prodigy-living.co.uk/http://www.iqstudentaccommodation.com/student-accommodation/birmingham/penworks-house?utm_source=prodigylivingwebsite&utm_campaign=birminghampagepenworksbutton&utm_medium=referral # You will not scrape the expected link right at the beginning
Next time, come with a precise problem to solve; otherwise, take a look at the Code Review section.
On find and find_all: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
Finally, I think it also answers your question here: https://stackoverflow.com/questions/42506033/urllib-error-urlerror-urlopen-error-errno-11001-getaddrinfo-failed
Cheers :)

Python - Show Results from all Pages not just the first page (Beautiful Soup)

I have been making a simple scraper using Beautiful Soup to get the food hygiene ratings of restaurants based on a postcode entered by the user. The code works correctly and pulls results from the URL as expected.
What I need help with is getting all the results to display, not just those from the first page.
My code is below:
import requests
from bs4 import BeautifulSoup

pc = input("Please enter postcode")
url = "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + pc + "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.findAll("div", {"class": "search-result"})
for item in g_data:
    print(item.find_all("a", {"class": "name"})[0].text)
    try:
        print(item.find_all("span", {"class": "address"})[0].text)
    except:
        pass
    try:
        print(item.find_all("div", {"class": "rating-image"})[0].text)
    except:
        pass
I have discovered by looking at the URL that the page shown depends on a variable in the query string called page:
https://www.scoresonthedoors.org.uk/search.php?award_sort=ALPHA&name=&address=BT147AL&x=0&y=0&page=2#results
The pagination code for the Next Page button is:
<a style="float: right" href="?award_sort=ALPHA&name=&address=BT147AL&x=0&y=0&page=3#results" rel="next " title="Go forward one page">Next <i class="fa fa-arrow-right fa-3"></i></a>
Is there a way I can get my code to find out how many pages of results there are and then grab the results from each of those pages?
Would the best solution be code that alters the URL string to change "page=" each time (e.g. a for loop), or is there a way to find a solution using the information in the pagination link?
Many thanks to anyone who provides help or looks at this question.
You're actually going the right way: generating the paginated urls to scrape beforehand is a good approach.
I actually nearly wrote the whole code. What you want to look at first is the find_max_page() function, which extracts the max page number from the pagination string. With this number, you can then generate all the urls you need to scrape, and scrape them one by one.
Check the code below; it's pretty much all there.
import requests
from bs4 import BeautifulSoup


class RestaurantScraper(object):

    def __init__(self, pc):
        self.pc = pc  # the input postcode
        self.max_page = self.find_max_page()  # the number of pages available
        self.restaurants = list()  # the final list where the scraped restaurants end up

    def run(self):
        for url in self.generate_pages_to_scrape():
            restaurants_from_url = self.scrape_page(url)
            self.restaurants += restaurants_from_url  # add this page's restaurants to the global list

    def create_url(self):
        """
        Create a core url to scrape
        :return: A url without pagination (= page 1)
        """
        return "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + self.pc + \
               "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"

    def create_paginated_url(self, page_number):
        """
        Create a paginated url
        :param page_number: pagination (integer)
        :return: A paginated url
        """
        return self.create_url() + "&page={}".format(str(page_number))

    def find_max_page(self):
        """
        Find the number of pages for a specific search.
        :return: The number of pages (integer)
        """
        r = requests.get(self.create_url())
        soup = BeautifulSoup(r.content, "lxml")
        pagination_soup = soup.findAll("div", {"id": "paginator"})
        pagination = pagination_soup[0]
        page_text = pagination("p")[0].text
        return int(page_text.replace('Page 1 of ', ''))

    def generate_pages_to_scrape(self):
        """
        Generate all the paginated urls using the max_page attribute previously scraped.
        :return: List of urls
        """
        return [self.create_paginated_url(page_number) for page_number in range(1, self.max_page + 1)]

    def scrape_page(self, url):
        """
        This is coming from your original code snippet. It probably needs a bit of work, but you get the idea.
        :param url: Url to scrape and get data from.
        :return: List of restaurant names found on the page
        """
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.findAll("div", {"class": "search-result"})
        restaurants = list()
        for item in g_data:
            name = item.find_all("a", {"class": "name"})[0].text
            restaurants.append(name)
            try:
                print(item.find_all("span", {"class": "address"})[0].text)
            except:
                pass
            try:
                print(item.find_all("div", {"class": "rating-image"})[0].text)
            except:
                pass
        return restaurants


if __name__ == '__main__':
    pc = input('Give your post code')
    scraper = RestaurantScraper(pc)
    scraper.run()
    print("{} restaurants scraped".format(str(len(scraper.restaurants))))
