I'm trying to scrape data off Yellow Pages but am running into a problem: I can't get the text of each business name and address/phone. I'm using the code below; where am I going wrong? I'm only printing the text of each business for now so I can see it as I test, but once I'm done I'm going to save the data to a CSV.
import csv
import requests
from bs4 import BeautifulSoup

# don't worry about opening this file
"""with open('cities_louisiana.csv','r') as cities:
    lines = cities.read().splitlines()
cities.close()"""

for city in lines:
    print(city)
    url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page=" + str(count)

for city in lines:
    for x in range(0, 50):
        print("http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page=" + str(x))
        page = requests.get("http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page=" + str(x))
        soup = BeautifulSoup(page.text, "html.parser")
        name = soup.find_all("div", {"class": "v-card"})
        for name in name:
            try:
                print(name.contents[0]).find_all(class_="business-name").text
                #print(name.contents[1].text)
            except:
                pass
You should iterate over search results, then, for every search result locate the business name (the element with the "business-name" class) and the address (the element with the "adr" class):
for result in soup.select(".search-results .result"):
    name = result.select_one(".business-name").get_text(strip=True, separator=" ")
    address = result.select_one(".adr").get_text(strip=True, separator=" ")
    print(name, address)
.select() and .select_one() are handy CSS selector methods.
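As a self-contained illustration, here is the same pattern run against a toy HTML snippet; the markup below is made up to mirror the class names above, not copied from the live site:

```python
from bs4 import BeautifulSoup

# Made-up markup mirroring the class names used in the answer above
html = """
<div class="search-results">
  <div class="result">
    <a class="business-name">Amite Cafe</a>
    <p class="adr">100 Main St, Amite, LA</p>
  </div>
  <div class="result">
    <a class="business-name">Tangipahoa Hardware</a>
    <p class="adr">22 Oak Ave, Amite, LA</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for result in soup.select(".search-results .result"):
    # select_one() returns the first element matching the CSS selector
    name = result.select_one(".business-name").get_text(strip=True)
    address = result.select_one(".adr").get_text(strip=True)
    results.append((name, address))

print(results)
```

Each tuple in `results` can then be written out with `csv.writer` once you are done testing.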
I wanted to scrape the name, roast and price, and I have successfully done it with the code below. However, I am not able to scrape the price; it shows up as None.
URLS = ["https://www.thirdwavecoffeeroasters.com/products/vienna-roast","https://www.thirdwavecoffeeroasters.com/products/baarbara-estate","https://www.thirdwavecoffeeroasters.com/products/el-diablo-blend","https://www.thirdwavecoffeeroasters.com/products/organic-signature-filter-coffee-blend","https://www.thirdwavecoffeeroasters.com/products/moka-pot-express-blend-1","https://www.thirdwavecoffeeroasters.com/products/karadykan-estate","https://www.thirdwavecoffeeroasters.com/products/french-roast","https://www.thirdwavecoffeeroasters.com/products/signature-cold-brew-blend","https://www.thirdwavecoffeeroasters.com/products/bettadakhan-estate","https://www.thirdwavecoffeeroasters.com/products/monsoon-malabar-aa"]

for url in range(0, 10):
    req = requests.get(URLS[url])
    soup = bs(req.text, "html.parser")
    coffees = soup.find_all("div", class_="col-md-4 col-sm-12 col-xs-12")
    for coffee in coffees:
        name = coffee.find("div", class_="product-details-main").find("ul", class_="uk-breadcrumb uk-text-uppercase").span.text
        roast = coffee.find("div", class_="uk-flex uk-flex-middle uk-width-1-1 coff_type_main").find("p", class_="coff_type uk-margin-small-left uk-text-uppercase").text.split("|")[0]
        prices = coffee.find("div", class_="uk-width-1-1 uk-first-column")
        print(name, roast, price)
The issue here is that in the response you are not getting the HTML element you are looking for. If you search for the price element in req.text, you won't find it. This is probably because the website renders the page dynamically with JavaScript.
If you look closely in req.text, however, you can find all the coffee variants with their respective price in the meta JavaScript object. You can look for it and parse it into a Python dictionary using the loads() function from the json library.
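The extraction idea can be demonstrated offline on a made-up string shaped like the page's inline script; the variant data below is invented for illustration:

```python
import json

# Invented text shaped like the page's inline <script> content
page_text = 'var meta = {"product": {"variants": [{"name": "250g", "price": 44900}]}};'

# Slice out the JavaScript object literal after "var meta = "
json_start = page_text.find("var meta = ") + len("var meta = ")
json_end = page_text.find(";", json_start)
meta = json.loads(page_text[json_start:json_end])

for variant in meta["product"]["variants"]:
    # prices come multiplied by 100, so divide to get the display price
    print(variant["name"], variant["price"] / 100)
```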
Here is your working code that also prints the prices. Note that all prices are multiplied by 100, so you may need to account for that.
import requests
from bs4 import BeautifulSoup as bs
import json

URLS = ["https://www.thirdwavecoffeeroasters.com/products/vienna-roast","https://www.thirdwavecoffeeroasters.com/products/baarbara-estate","https://www.thirdwavecoffeeroasters.com/products/el-diablo-blend","https://www.thirdwavecoffeeroasters.com/products/organic-signature-filter-coffee-blend","https://www.thirdwavecoffeeroasters.com/products/moka-pot-express-blend-1","https://www.thirdwavecoffeeroasters.com/products/karadykan-estate","https://www.thirdwavecoffeeroasters.com/products/french-roast","https://www.thirdwavecoffeeroasters.com/products/signature-cold-brew-blend","https://www.thirdwavecoffeeroasters.com/products/bettadakhan-estate","https://www.thirdwavecoffeeroasters.com/products/monsoon-malabar-aa"]

for url in URLS:
    req = requests.get(url)
    soup = bs(req.text, "html.parser")
    coffees = soup.find_all("div", class_="col-md-4 col-sm-12 col-xs-12")
    for coffee in coffees:
        name = coffee.find("div", class_="product-details-main").find("ul", class_="uk-breadcrumb uk-text-uppercase").span.text
        roast = coffee.find("div", class_="uk-flex uk-flex-middle uk-width-1-1 coff_type_main").find("p", class_="coff_type uk-margin-small-left uk-text-uppercase").text.split("|")[0]
        print(name, roast)
    # Find the JavaScript declaration (there is only one `meta` variable, so no chance of ambiguity)
    json_start = req.text.find("var meta = ") + len("var meta = ")
    # Find the end of the JavaScript object (there won't be any `;` inside)
    json_end = req.text.find(";", json_start)
    json_str = req.text[json_start:json_end]
    json_data = json.loads(json_str)
    # Get the prices from the JSON object
    product = json_data["product"]
    variants = product["variants"]
    for variant in variants:
        print(f'Variant: {variant["name"]}, price: {variant["price"]}')
    print("\n")
I hope this helps, cheers.
I am trying to scrape some specific lines and create a table from the collected data (url attached), but I cannot get more than the entire body text, and so I am stuck.
To give some example:
I would like to arrive at the table below by scraping the details from the body content. All the details are there; any help on how to retrieve them in the form given below would be much appreciated.
My code is:
import requests
from bs4 import BeautifulSoup

# providing url
url = 'https://www.polskawliczbach.pl/wies_Baniocha'
# creating request object
req = requests.get(url)
# creating soup object
data = BeautifulSoup(req.text, 'html')
# finding all li tags in ul and printing the text within it
data1 = data.find('body')
for li in data1.find_all("li"):
    print(li.text, end=" ")
First find the ul, then look for the li elements inside it. Scrape the needed data, save it in a variable, and build a table using pandas. That covers everything; if you want to save the table, write it to a CSV file, otherwise just print it.
Here's the code implementing all of the above:
from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get('https://www.polskawliczbach.pl/wies_Baniocha')
soup = BeautifulSoup(page.content, 'lxml')
lis = soup.find_all("ul", class_="list-group row")[1].find_all("li")[1:-1]
dic = {"name": [], "value": []}
for li in lis:
    try:
        dic["name"].append(li.find(text=True, recursive=False).strip())
        dic["value"].append(li.find("span").text.replace(" ", ""))
        print(li.find(text=True, recursive=False).strip(), li.find("span").text.replace(" ", ""))
    except:
        pass
df = pd.DataFrame(dic)
print(df)
# If you want to save this as a file then uncomment the following line:
# df.to_csv("<FILENAME>.csv")
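The key trick here is li.find(text=True, recursive=False), which returns only the text that sits directly inside the li, skipping text nested in child tags. A minimal sketch on made-up markup shaped like the page's list items:

```python
from bs4 import BeautifulSoup

# Made-up list item shaped like the ones on the page
html = '<ul><li>Kod pocztowy <span>05-532</span></li></ul>'
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

# text=True, recursive=False returns only the direct text node of <li>,
# ignoring the text inside the nested <span>
label = li.find(text=True, recursive=False).strip()
value = li.find("span").text
print(label, value)
```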
Additionally, if you want to scrape all the "categories": I don't understand the language, so I don't know which ones are useful and which are not, but here's the code anyway. You can just change this part of the code above:
soup = BeautifulSoup(page.content, 'lxml')
dic = {"name": [], "value": []}
lis = soup.find_all("ul", class_="list-group row")
for li in lis:
    a = li.find_all("li")[1:-1]
    for b in a:
        try:
            print(b.find(text=True, recursive=False).strip(), "\t", b.find("span").text.replace(" ", "").replace(",", ""))
            dic["name"].append(b.find(text=True, recursive=False).strip())
            dic["value"].append(b.find("span").text.replace(" ", "").replace(",", ""))
        except Exception as e:
            pass
df = pd.DataFrame(dic)
Find the main ul tag by its specific class, and from it find all the li tags:
main_data = data.find("ul", class_="list-group").find_all("li")[1:-1]
names = []
values = []
main_values = []
for i in main_data:
    values.append(i.find("span").get_text())
    names.append(i.find(text=True, recursive=False))
main_values.append(values)
For the table representation, use the pandas module:
import pandas as pd
df=pd.DataFrame(columns=names,data=main_values)
df
Output:
Liczba mieszkańców (2011) Kod pocztowy Numer kierunkowy
0 1 935 05-532 (+48) 22
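The same DataFrame construction can be seen with plain lists; the values below are typed in to match the output above rather than scraped:

```python
import pandas as pd

# Hard-coded stand-ins for the scraped names and values
names = ["Liczba mieszkańców (2011)", "Kod pocztowy", "Numer kierunkowy"]
main_values = [["1 935", "05-532", "(+48) 22"]]

# One row, with the scraped labels as column headers
df = pd.DataFrame(columns=names, data=main_values)
print(df)
```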
I am supposed to use Beautiful Soup 4 to obtain course information off of my school's website as an exercise. I have been at this for the past few days and my code still does not work.
The first thing I ask the user for is the course catalog abbreviation. For example, ICS is the abbreviation for Information and Computer Sciences. Beautiful Soup 4 is supposed to list all of the courses and how many students are enrolled.
While I was able to get the input portion to work, I still get errors or the program just stops.
Question: Is there a way for Beautiful Soup to accept user input so that when the user inputs ICS, the output is a list of all courses related to ICS?
Here is the code and my attempt at it:
from bs4 import BeautifulSoup
import requests
import re

#get input for course
course = input('Enter the course:')

#Here is the page link
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"

#get request and response
page_response = requests.get(BASE_AVAILABILITY_URL)

#getting Beautiful Soup to gather the html content
page_content = BeautifulSoup(page_response.content, 'html.parser')

#getting course information
main = page_content.find_all(class_='parent clearfix')
main_p = "".join(str(x) for x in main)

#get the course anchor tags
main_q = BeautifulSoup(main_p, "html.parser")
courses = main.find('a', href=True)

#get each course name
#empty list for courses
courses_list = []
for a in courses:
    courses_list.append(a.text)

search = input('Enter the course title:')
for course in courses_list:
    if re.search(search, course, re.IGNORECASE):
        print(course)
This is the original code that was provided in a Jupyter Notebook:
import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')

def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
What's odd is that if the user saves the HTML file, uploads it into a Jupyter Notebook, and then opens the file to be read, the courses are displayed. But for this task, the user cannot save files; it must work from direct input to get the output.
The problem with your code is that page_content.find_all(class_='parent clearfix') returns an empty list []. So that's the first thing you need to change. Looking at the HTML, you'll want to be looking for <table>, <tr>, and <td> tags.
Working off what was provided in the original code, you just need to alter a few things so it flows logically.
I'll point out what I changed:
import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')

def scrape_availability(text):
    soup = bs4.BeautifulSoup(text) #<-- need to get the html text before creating a bs4 object. So I move the request (line below) before this, and also adjusted the parameter for this function.
    # the rest of the code is fine
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
This will give you:
import requests, bs4

BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s="
#get input for course
course = input('Enter the course:')
url = BASE_AVAILABILITY_URL + course

def scrape_availability(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])

scrape_availability(url)
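To see why the .listOfClasses tr selector works, here is the same row-by-row extraction on a toy table; the markup is invented, only the class name comes from the code above:

```python
import bs4

# Invented table using the class name from the code above
html = """
<table class="listOfClasses">
  <tr><th>Gen</th><th>CRN</th><th>Course</th></tr>
  <tr><td>A</td><td>101</td><td>ICS 101</td></tr>
  <tr><td>A</td><td>211</td><td>ICS 211</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
rows = soup.select('.listOfClasses tr')

courses = []
for row in rows[1:]:  # rows[0] is the header row, so skip it
    columns = row.select('td')
    courses.append(columns[2].contents[0])

print(courses)
```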
I have been working on a scraper for a little while now, and have come very close to getting it to run as intended. My code as follows:
import urllib.request
from bs4 import BeautifulSoup

# Crawls main site to get a list of city URLs
def getCityLinks():
    city_sauce = urllib.request.urlopen('https://www.prodigy-living.co.uk/') # Enter url here
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []
    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('https://www.prodigy-living.co.uk/' + a['href'])
    return the_city_links

# Crawls each of the city web pages to get a list of unit URLs
def getUnitLinks():
    getCityLinks()
    for the_city_links in getCityLinks():
        unit_sauce = urllib.request.urlopen(the_city_links)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        for unit_href in unit_soup.findAll('a', class_="btn white-green icon-right-open-big", href=True):
            yield('the_url' + unit_href['href'])

the_unit_links = []
for link in getUnitLinks():
    the_unit_links.append(link)

# Soups returns all of the html for the items in the_unit_links
def soups():
    for the_links in the_unit_links:
        try:
            sauce = urllib.request.urlopen(the_links)
            for things in sauce:
                soup_maker = BeautifulSoup(things, 'html.parser')
                yield(soup_maker)
        except:
            print('Invalid url')

# Below scrapes property name, room type and room price
def getPropNames(soup):
    try:
        for propName in soup.findAll('div', class_="property-cta"):
            for h1 in propName.findAll('h1'):
                print(h1.text)
    except:
        print('Name not found')

def getPrice(soup):
    try:
        for price in soup.findAll('p', class_="room-price"):
            print(price.text)
    except:
        print('Price not found')

def getRoom(soup):
    try:
        for theRoom in soup.findAll('div', class_="featured-item-inner"):
            for h5 in theRoom.findAll('h5'):
                print(h5.text)
    except:
        print('Room not found')

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)
When I run this, it returns all the prices for all the URLs picked up. However, it does not return the names or the rooms, and I am not really sure why. I would really appreciate any pointers on this, or ways to improve my code - I've been learning Python for a few months now!
I think that the links you are scraping will in the end redirect you to another website, in which case your scraping functions will not be useful.
For instance, the link for a room in Birmingham redirects you to another website.
Also, be careful with your usage of the find and find_all methods in BeautifulSoup. The first returns only one tag (as when you want a single property name), while find_all() returns a list, allowing you to get, for instance, multiple room prices and types.
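The difference is easy to see on a toy snippet; the markup is invented, only the room-price class name is taken from the page:

```python
from bs4 import BeautifulSoup

html = '<div><p class="room-price">£120</p><p class="room-price">£150</p></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p", class_="room-price")           # a single Tag (the first match)
all_prices = soup.find_all("p", class_="room-price")  # a list of every match

print(first.get_text())
print(len(all_prices))
```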
Anyway, I have simplified your code a bit, and this is how I came across your issue. Maybe you can take some inspiration from it:
import requests
from bs4 import BeautifulSoup

main_url = "https://www.prodigy-living.co.uk/"

# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find("div", class_="footer-city-nav")  # Bottom page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")]  # Links to cities

# Getting the individual links to the apts
indiv_apts = []
for link in cities_links[0:4]:
    print("At link: ", link)
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    links_tags = soup.find_all("a", class_="btn white-green icon-right-open-big")
    for url in links_tags:
        indiv_apts.append(main_url + url.get("href"))

# Now defining your functions
def GetName(tag):
    print(tag.find("h1").get_text())

def GetType_Price(tags_list):
    for tag in tags_list:
        print(tag.find("h5").get_text())
        print(tag.find("p", class_="room-price").get_text())

# Now scraping each of the apts - name, price, room.
for link in indiv_apts[0:2]:
    print("At link: ", link)
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    property_tag = soup.find("div", class_="property-cta")
    rooms_tags = soup.find_all("div", class_="featured-item")
    GetName(property_tag)
    GetType_Price(rooms_tags)
You will see that right at the second element of the list, you will get an AttributeError, as you are no longer on the website's own pages. Indeed:
>>> print(indiv_apts[1])
https://www.prodigy-living.co.uk/http://www.iqstudentaccommodation.com/student-accommodation/birmingham/penworks-house?utm_source=prodigylivingwebsite&utm_campaign=birminghampagepenworksbutton&utm_medium=referral # You will not scrape the expected link right at the beginning
Next time, come with a precise problem to solve; otherwise, take a look at the Code Review section.
On find and find_all: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
Finally, I think it also answers your question here: https://stackoverflow.com/questions/42506033/urllib-error-urlerror-urlopen-error-errno-11001-getaddrinfo-failed
Cheers :)
I am writing a web scraper to scrape some information off of the website JW Pepper for a sheet music database. I am using BeautifulSoup and python to do this.
Here is my code:
# a barebones program I created to scrape the description and audio file off the JW Pepper website, will eventually be used in a music database
import urllib2
import re
from bs4 import BeautifulSoup

linkgot = 0

def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="  # this is the url without the keyword that comes up when searching something
    print("enter the name of the desired piece")
    keyword = raw_input("> ")  # this will add the keyword to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    all_links = soup.findAll("a")
    link_dict = []
    item_dict = []
    for link in all_links:
        link_dict.append(link.get('href'))  # adds the links found on the page to link_dict
        item_dict.append(x for x in link_dict if '.item' in x)  # sorts them according to .item
    print item_dict

linkget()
The "print" command returns this: [<generator object <genexpr> at 0x10ec6dc80>], which returns nothing when I google it.
Your filtering of the list was going wrong. Rather than filtering in a separate loop, you could just build the list when .item is present, as follows:
from bs4 import BeautifulSoup
import urllib2

def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="  # this is the url without the keyword that comes up when searching something
    print("enter the name of the desired piece")
    keyword = raw_input("> ")  # this will add the keyword to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    link_dict = []
    item_dict = []
    for link in soup.findAll("a", href=True):
        href = link.get('href')
        link_dict.append(href)  # adds the links found on the page to link_dict
        if '.item' in href:
            item_dict.append(href)
    for href in item_dict:
        print href

linkget()
Giving you something like:
/Festival-of-Carols/4929683.item
/Festival-of-Carols/4929683.item
/Festival-of-Carols/4929683.item
...
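As an aside, the filtering step can also be written as a list comprehension. The original item_dict.append(x for x in link_dict if '.item' in x) appended a generator object rather than strings, which is exactly why the print showed a <generator object ...> repr. A small sketch with made-up hrefs standing in for the scraped links:

```python
# Made-up hrefs standing in for the links scraped from the page
link_dict = ["/Festival-of-Carols/4929683.item", "/cart.jsp", "/Sheet-Music/123.item"]

# Build the filtered list in one expression instead of appending a generator
item_dict = [href for href in link_dict if '.item' in href]
print(item_dict)
```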