I have searched for a number of similar problems here, but am still confused why my code yields only 1 result, instead of at least 15 on each page.
import requests
from bs4 import BeautifulSoup

for pagenumber in range(0, 2):
    url = 'https://www.autowereld.nl/volkswagen/?mdl=volkswagen_golf|volkswagen_golf-alltrack|volkswagen_golf-cabriolet|volkswagen_golf-plus|volkswagen_golf-sportsvan|volkswagen_golf-variant&p='
    txt = requests.get(url + str(pagenumber))
    soup = BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('article', class_="item")
    for car in soup_table.findAll('a'):
        link = car.get('href')
        sub_url = 'https://www.autowereld.nl' + link
        print(sub_url)
You are using soup.find to find something with tag "article" and class "item". From the documentation, soup.find only finds one instance of a tag, and you are looking for multiple instances of a tag. Something like
import requests
from bs4 import BeautifulSoup

for pagenumber in range(0, 2):
    url = 'https://www.autowereld.nl/volkswagen/?mdl=volkswagen_golf|volkswagen_golf-alltrack|volkswagen_golf-cabriolet|volkswagen_golf-plus|volkswagen_golf-sportsvan|volkswagen_golf-variant&p='
    txt = requests.get(url + str(pagenumber))
    soup = BeautifulSoup(txt.text, 'html.parser')
    soup_articles = soup.findAll('article', class_="item")
    for article in soup_articles:
        for car in article.findAll('a'):
            link = car.get('href')
            sub_url = 'https://www.autowereld.nl' + link
            print(sub_url)
might work. I also recommend using find_all instead of findAll if you're using bs4, since the mixed-case versions of these methods are deprecated, but that's up to you.
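For reference, the same links can also be collected with a single CSS selector instead of nested find calls; a minimal sketch assuming the page structure from the question:

import requests
from bs4 import BeautifulSoup

base = 'https://www.autowereld.nl'
url = base + '/volkswagen/?mdl=volkswagen_golf|volkswagen_golf-alltrack|volkswagen_golf-cabriolet|volkswagen_golf-plus|volkswagen_golf-sportsvan|volkswagen_golf-variant&p='

for pagenumber in range(0, 2):
    txt = requests.get(url + str(pagenumber))
    soup = BeautifulSoup(txt.text, 'html.parser')
    # select() takes a CSS selector: every <a> with an href inside an <article class="item">
    for car in soup.select('article.item a[href]'):
        print(base + car['href'])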
Related
I need to scrape all 'a' tags with the "result-title" class, and all 'span' tags with either the 'results-price' or 'results-hood' class. Then, write the output to a .csv file across multiple columns. The current code does not print anything to the csv file. This may be bad syntax, but I really can't see what I am missing. Thanks.
import csv
import urllib.request
from random import randint
from time import sleep
from bs4 import BeautifulSoup

f = csv.writer(open(r"C:\Users\Sean\Desktop\Portfolio\Python - Web Scraper\RE Competitor Analysis.csv", "wb"))

def scrape_links(start_url):
    for i in range(0, 2500, 120):
        source = urllib.request.urlopen(start_url.format(i)).read()
        soup = BeautifulSoup(source, 'lxml')
        for a in soup.find_all("a", "span", {"class": ["result-title hdrlnk", "result-price", "result-hood"]}):
            f.writerow([a['href']], span['results-title hdrlnk'].getText(), span['results-price'].getText(), span['results-hood'].getText())
        if i < 2500:
            sleep(randint(30, 120))
        print(i)

scrape_links('my_url')
If you want to find multiple tags with one call to find_all, you should pass them in a list. For example:
soup.find_all(["a", "span"])
Without access to the page you are scraping, it's too hard to give you a complete solution, but I recommend extracting one variable at a time and printing it to help you debug. For example:
a = soup.find('a', class_ = 'result-title')
a_link = a['href']
a_text = a.text
spans = soup.find_all('span', class_ = ['results-price', 'result-hood'])
row = [a_link, a_text] + [s.text for s in spans]
print(row) # verify we are getting the results we expect
f.writerow(row)
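One more note: the file in the question is opened with "wb", which breaks csv.writer on Python 3. A minimal sketch of the write step under the assumption you are on Python 3, reusing the soup object and selectors from the snippet above (the file name is a placeholder, and the class names are taken from the question, so adjust them to the real page):

import csv

with open('competitor_analysis.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    a = soup.find('a', class_='result-title')
    spans = soup.find_all('span', class_=['results-price', 'result-hood'])
    # one row: link, link text, then the text of each span found
    writer.writerow([a['href'], a.text] + [s.text for s in spans])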
I have a problem with the following code, and I am sorry, I am new to all this. I want to add the strings in the FullPage list to the actual URL, then visit those pages and scrape some data from them. So far it has been good, but I do not know how to make it visit the other links in the list.
The output only gives me the data of one page, but I need the data for 30 pages. How can I make this program go over each link?
The URL has a pattern: the first part is 'http://arduinopak.com/Prd.aspx?Cat_Name=' and the second part is the product category name.
import urllib2
from bs4 import BeautifulSoup

FullPage = ['New-Arrivals-2017-6', 'Big-Sales-click-here', 'Arduino-Development-boards',
            'Robotics-and-Copters']

urlp1 = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
URL = urlp1 + FullPage[0]

for n in FullPage:
    URL = urlp1 + n
    page = urllib2.urlopen(URL)
    bsObj = BeautifulSoup(page, "html.parser")

descList = bsObj.findAll('div', attrs={"class": "panel-default"})
for desc in descList:
    print(desc.getText(separator=u' '))
If you want to scrape each link, then moving the last 3 lines of your code into the loop will do it.
Your current code fetches all the links, but it stores only one BeautifulSoup object reference. You could instead store them all in a list, or process them before visiting the next URL (as shown below).
for n in FullPage:
    URL = urlp1 + n
    page = urllib2.urlopen(URL)
    bsObj = BeautifulSoup(page, "html.parser")
    descList = bsObj.findAll('div', attrs={"class": "panel-default"})
    for desc in descList:
        print(desc.getText(separator=u' '))
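If you prefer the first option mentioned above, storing everything before processing it, a minimal sketch that collects the description text of every page in a list first (the list name is just for illustration):

all_descriptions = []

for n in FullPage:
    URL = urlp1 + n
    page = urllib2.urlopen(URL)
    bsObj = BeautifulSoup(page, "html.parser")
    # keep the text of every matching div for this page
    for desc in bsObj.findAll('div', attrs={"class": "panel-default"}):
        all_descriptions.append(desc.getText(separator=u' '))

for text in all_descriptions:
    print(text)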
Also, note that names using PascalCase are by convention reserved for classes. FullPage would usually be written as fullPage, or FULL_PAGE if it's meant to be a constant.
I need to take out phone numbers and emails from HTML.
I can get the data.
description_source = soup.select('a[href^="mailto:"]'),
soup.select('a[href^="tel:"]')
But I do not want it; I want it removed from the description.
I am trying to use decompose:
description_source = soup.decompose('a[href^="mailto:"]')
I get this error
TypeError: decompose() takes 1 positional argument but 2 were given
I have thought about using SoupStrainer, but it looks like I would have to include everything except the mailto and tel links to get the correct information...
The full current code for this bit is:
import requests
from bs4 import BeautifulSoup as bs4
item_number = '122124438749'
ebay_url = "http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=" + item_number
r = requests.get(ebay_url)
html_bytes = r.text
soup = bs4(html_bytes, 'html.parser')
description_source = soup.decompose('a[href^="mailto:"]')
#description_source.
print(description_source)
Try using find_all(). Find all the links in that page and then check which ones contain phone and email. Then remove them using extract().
Use the lxml parser for faster processing; it's also recommended in the official documentation.
import requests
from bs4 import BeautifulSoup
item_number = '122124438749'
ebay_url = "http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=" + item_number
r = requests.get(ebay_url)
html_bytes = r.text
soup = BeautifulSoup(html_bytes, 'lxml')
links = soup.find_all('a')
email = ''
phone = ''
for link in links:
    href = link.get('href') or ''  # guard: some anchors may have no href attribute
    if href.find('tel:') > -1:
        link.extract()
    elif href.find('mailto:') > -1:
        link.extract()
print(soup.prettify())
You can use decompose() also instead of extract().
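For instance, decompose() takes no arguments; it is called on each tag you want to drop, so it can be combined with the CSS selectors from the question. A minimal sketch:

# remove every mailto: and tel: anchor in place, then print what is left
for link in soup.select('a[href^="mailto:"]') + soup.select('a[href^="tel:"]'):
    link.decompose()

print(soup.prettify())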
I have been scratching my head for nearly 4 days trying to find the best way to loop through a table of URLs on one website, request the URL and scrape text from 2 different areas of the second site.
I have tried to rewrite this script multiple times, using several different solutions to achieve my desired results, however, I have not been able to fully accomplish this.
Currently, I am able to select the first link of the table on page one, go to the new page and select the data I need, but I can't get the code to continue looping through every link on the first page.
import requests
from bs4 import BeautifulSoup
journal_site = "https://journals.sagepub.com"
site_link = 'http://journals.sagepub.com/action/showPublications?pageSize=100&startPage='

# each page contains 100 results I need to scrape from
page_1 = '0'
page_2 = '1'
page_3 = '3'
page_4 = '4'

journal_list = site_link + page_1

r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')

for table_row in soup.select('div.results'):
    journal_name = table_row.findAll('tr', class_='False')
    journal_link = table_row.find('a')['href']
    journal_page = journal_site + journal_link
    r = requests.get(journal_page)
    soup = BeautifulSoup(r.text, 'html.parser')
    for journal_header, journal_description in zip(soup.select('main'),
                                                   soup.select('div.journalCarouselTextText')):
        try:
            title = journal_header.h1.text.strip()
            description = journal_description.p.text.strip()
            print(title, ':', description)
        except AttributeError:
            continue
What is the best way to find the title and the description for every journal_name? Thanks in advance for the help!
Most of your code works for me, just needed to modify the middle section of the code, leaving the parts before and after the same:
# all code same up to here
journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find("div", { "class" : "results" })
table = results.find('table')
for row in table.find_all('a', href=True):
    journal_link = row['href']
    journal_page = journal_site + journal_link
    # from here same as your code
I stopped after it got the fourth response (title/description) of the 100 results from the first page. I'm pretty sure it will get all the expected results; it only needs to loop through the 4 subsequent pages.
Hope this helps.
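For completeness, a rough sketch of that outer loop, reusing site_link, journal_site and the selectors from the code above; the page count is an assumption based on the question and may need adjusting:

for page in range(0, 5):  # startPage values 0-4, 100 results each
    r = requests.get(site_link + str(page))
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find("div", {"class": "results"})
    table = results.find('table')
    for row in table.find_all('a', href=True):
        journal_page = journal_site + row['href']
        r2 = requests.get(journal_page)
        journal_soup = BeautifulSoup(r2.text, 'html.parser')
        for header, description in zip(journal_soup.select('main'),
                                       journal_soup.select('div.journalCarouselTextText')):
            try:
                print(header.h1.text.strip(), ':', description.p.text.strip())
            except AttributeError:
                continue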
I have been working on a scraper for a little while now, and have come very close to getting it to run as intended. My code is as follows:
import urllib.request
from bs4 import BeautifulSoup
# Crawls main site to get a list of city URLs
def getCityLinks():
    city_sauce = urllib.request.urlopen('https://www.prodigy-living.co.uk/')  # Enter url here
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []
    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('https://www.prodigy-living.co.uk/' + a['href'])
    return the_city_links
# Crawls each of the city web pages to get a list of unit URLs
def getUnitLinks():
    getCityLinks()
    for the_city_links in getCityLinks():
        unit_sauce = urllib.request.urlopen(the_city_links)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        for unit_href in unit_soup.findAll('a', class_="btn white-green icon-right-open-big", href=True):
            yield('the_url' + unit_href['href'])

the_unit_links = []
for link in getUnitLinks():
    the_unit_links.append(link)
# Soups returns all of the html for the items in the_unit_links
def soups():
    for the_links in the_unit_links:
        try:
            sauce = urllib.request.urlopen(the_links)
            for things in sauce:
                soup_maker = BeautifulSoup(things, 'html.parser')
                yield(soup_maker)
        except:
            print('Invalid url')
# Below scrapes property name, room type and room price
def getPropNames(soup):
    try:
        for propName in soup.findAll('div', class_="property-cta"):
            for h1 in propName.findAll('h1'):
                print(h1.text)
    except:
        print('Name not found')

def getPrice(soup):
    try:
        for price in soup.findAll('p', class_="room-price"):
            print(price.text)
    except:
        print('Price not found')

def getRoom(soup):
    try:
        for theRoom in soup.findAll('div', class_="featured-item-inner"):
            for h5 in theRoom.findAll('h5'):
                print(h5.text)
    except:
        print('Room not found')

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)
When I run this, it returns all the prices for all the URLs picked up. However, it does not return the names or the rooms, and I am not really sure why. I would really appreciate any pointers on this, or ways to improve my code - been learning Python for a few months now!
I think that the links you are scraping will in the end redirect you to another website, in which case your scraping functions will not be useful!
For instance, the link for a room in Birmingham is redirecting you to another website.
Also, be careful in your usage of the find and find_all methods in BS. The first returns only one tag (as when you want one property name) while find_all() will return a list allowing you to get, for instance, multiple room prices and types.
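As a quick illustration of that difference, using the classes that appear in the code below:

property_tag = soup.find("div", class_="property-cta")     # a single Tag, or None if nothing matches
rooms_tags = soup.find_all("div", class_="featured-item")  # always a list, possibly empty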
Anyway, I have simplified your code a bit, and this is how I came across your issue. Maybe you would like to get some inspiration from it:
import requests
from bs4 import BeautifulSoup
main_url = "https://www.prodigy-living.co.uk/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find("div", class_ = "footer-city-nav") # Bottom page not loaded dynamycally
cities_links = [main_url+tag["href"] for tag in city_tags.find_all("a")] # Links to cities
# Getting the individual links to the apts
indiv_apts = []
for link in cities_links[0:4]:
    print "At link: ", link
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    links_tags = soup.find_all("a", class_ = "btn white-green icon-right-open-big")
    for url in links_tags:
        indiv_apts.append(main_url+url.get("href"))
# Now defining your functions
def GetName(tag):
    print tag.find("h1").get_text()

def GetType_Price(tags_list):
    for tag in tags_list:
        print tag.find("h5").get_text()
        print tag.find("p", class_ = "room-price").get_text()
# Now scraping each of the apts - name, price, room.
for link in indiv_apts[0:2]:
    print "At link: ", link
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    property_tag = soup.find("div", class_ = "property-cta")
    rooms_tags = soup.find_all("div", class_ = "featured-item")
    GetName(property_tag)
    GetType_Price(rooms_tags)
You will see that right at the second element of the list, you will get an AttributeError, as you are not on your website's page anymore. Indeed:
>>> print indiv_apts[1]
https://www.prodigy-living.co.uk/http://www.iqstudentaccommodation.com/student-accommodation/birmingham/penworks-house?utm_source=prodigylivingwebsite&utm_campaign=birminghampagepenworksbutton&utm_medium=referral # You will not scrape the expected link right at the beginning
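One way to guard against that, assuming you only want links that stay on the original site, is to skip hrefs that are already absolute URLs when building the list; a minimal sketch of the modified inner loop:

for url in links_tags:
    href = url.get("href")
    # skip anchors without an href and absolute links that leave the site, e.g. the redirect above
    if not href or href.startswith("http"):
        continue
    indiv_apts.append(main_url + href)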
Next time, come with a precise problem to solve, or otherwise take a look at the Code Review section.
On find and find_all: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
Finally, I think it also answers your question here: https://stackoverflow.com/questions/42506033/urllib-error-urlerror-urlopen-error-errno-11001-getaddrinfo-failed
Cheers :)