I am attempting to scrape Airbnb with BeautifulSoup. name is returning None when there should be text present. Am I doing something wrong? I have only just started learning this, so there is a high chance I'm missing something simple.
from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession
import lxml

def extract_features(listing_html):
    features_dict = {}
    name = listing_html.find('div', {'class': '_xcsyj0'})
    features_dict['name'] = name
    return features_dict

def getdata(url):
    s = requests.get(url)
    soup = BeautifulSoup(s.content, 'html.parser')
    return soup

def getnextpage(soup):
    page = 'https://www.airbnb.com' + str(soup.find('a').get('href'))
    return page

url = 'https://www.airbnb.com/s/Bear-Creek-Lake-Dr--Jim-Thorpe--PA--USA/homes?tab_id=home_tab'
soup = getdata(url)
listings = soup.find_all('div', '_8s3ctt')  # (type of tag, class name)
for listing in listings[0:2]:
    full_url = getnextpage(listing)
    ind_soup = getdata(full_url)
    features = extract_features(ind_soup)
    print(features)
    print("-----------")
name = listing_html.find('div', {'class': '_xcsyj0'})
You are trying to find a div tag with the class _xcsyj0. However, when I take a look at the URL you posted in the comments, there is no tag with that class. This usually happens with auto-generated CSS classes and makes web-scraping tasks like this difficult to impossible.
It doesn't look like the div tag you want has an id set either, so your best bet would be an API for Airbnb. From a cursory search, I found an official API (which seems to be available only to verified partners) and a number of unofficial APIs. Unfortunately, I have not worked with any Airbnb APIs, so I can't suggest which of these is actually good/working.
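As a quick sanity check, here is a minimal sketch (using only requests, as in your code) that confirms whether the class even occurs in the HTML the server returns:

# Quick check: does the auto-generated class from the question appear
# anywhere in the raw HTML the server sends back?
import requests

url = 'https://www.airbnb.com/s/Bear-Creek-Lake-Dr--Jim-Thorpe--PA--USA/homes?tab_id=home_tab'
html = requests.get(url).text
print('_xcsyj0' in html)  # False at the time of writing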
I ran the code and got a result. Please elaborate on the problem; that would help with debugging it.
I created a bs4 web-scraping app with Python. My program returns an empty list for review; printing soup itself works normally.
from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []
usernames = []
titles = []
comments = []

result = requests.get('https://www.kupujemprodajem.com/review.php?action=list')
soup = BeautifulSoup(result.text, 'html.parser')

review = soup.findAll('div', class_="single-review")
print(review)

for i in review:
    header = i.find('div', class_="single-review__header")
    footer = i.find('div', class_="comment-holder")
    username = header.find('a', class_="single-review__username").text
    title = header.find('div', class_="single-review__related-to").text
    comment = footer.find('div', class_="single-review__comment").text
    usernames.append(username)
    titles.append(title)
    comments.append(comment)

data.append(usernames)
data.append(titles)
data.append(comments)
print(data)
It isn't a problem with the class.
It looks like the reason this doesn't work is that the website requires a login in order to access that page. If you were to visit https://www.kupujemprodajem.com/review.php?action=list in a private browser tab, it would just take you to a login page.
There are two paths I can think of that you could take here:
1. Reverse engineer how the login process works and use the requests library to make a login request, then (most likely) grab the session cookie from the response so that you can request pages that require sign-in.
2. (Much simpler) Use Selenium instead. Selenium is a library that lets you control a full browser instance, so you can easily input credentials this way. BeautifulSoup, on the other hand, simply parses HTML, so things like authenticating often take much more work in BeautifulSoup than they do in Selenium. I'd definitely suggest looking into it if you haven't already; there is a sketch below.
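A minimal sketch of the Selenium route; note that the form-field selectors ('email', 'password') are assumptions, not the site's real field names, so inspect the actual login form first:

# Sketch only: the field names 'email' and 'password' are hypothetical;
# inspect the real login form to find the actual selectors.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.kupujemprodajem.com/review.php?action=list')

# fill in and submit the login form
driver.find_element(By.NAME, 'email').send_keys('you@example.com')
driver.find_element(By.NAME, 'password').send_keys('your-password')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()

# once logged in, parse the rendered page exactly as before
soup = BeautifulSoup(driver.page_source, 'html.parser')
review = soup.findAll('div', class_="single-review")
print(review)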
I'm working on a web-scraping project using BS4, where I am trying to aggregate college tuition data. I'm using the site tuitiontracker.org as a data source. Once I've navigated to a specific college, I want to scrape the tuition data off the site. When I inspect element, I can see the tuition data stored as innerHTML, but when I use BeautifulSoup to find it, it returns everything about the location except the tuition data.
Here is the url I'm trying to scrape from: https://www.tuitiontracker.org/school.html?unitid=164580
Here is the code I am using:
import urllib.request
from bs4 import BeautifulSoup

DOWNLOAD_URL = "https://www.tuitiontracker.org/school.html?unitid=164580"

def download_page(url):
    return urllib.request.urlopen(url)

# print(download_page(DOWNLOAD_URL).read())

def parse_html(html):
    """Gathers data from an HTML page"""
    soup = BeautifulSoup(html, features="html.parser")
    # print(soup.prettify())
    tuition_data = soup.find("div", attrs={"id": "price"}).innerHTML
    print(tuition_data)

def main():
    url = DOWNLOAD_URL
    parse_html(download_page(DOWNLOAD_URL).read())

if __name__ == "__main__":
    main()
When I print tuition_data, I see the relevant tags where the tuition data is stored on the page, but no number value. I've tried using .innerHTML and .string, but they end up printing either None or simply a blank space.
Really quite confused; thanks for any clarification.
The data comes from an API endpoint and is dynamically rendered by JavaScript, so you won't get it with BeautifulSoup.
However, you can query the endpoint.
Here's how:
import json
import requests

url = "https://www.tuitiontracker.org/school.html?unitid=164580"
api_endpoint = "https://www.tuitiontracker.org/data/school-data-09042019/"

response = requests.get(f"{api_endpoint}{url.split('=')[-1]}.json").json()

tuition = response["yearly_data"][0]
print(
    round(tuition["price_instate_oncampus"], 2),
    round(tuition["avg_net_price_0_30000_titleiv_privateforprofit"], 2),
)
Output:
75099.8 30255.86
PS. There's a lot more in that JSON. Pro tip for future web-scraping endeavors: your favorite web browser's Developer Tools should be your best friend.
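For example, reusing the response dictionary from the snippet above, you can peek at the rest of the payload:

# Peek at the rest of the payload, reusing `response` from the snippet above.
print(response.keys())
print(len(response["yearly_data"]), "entries in yearly_data")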
I am trying to build a scraper to get some abstracts of academic papers and their corresponding titles on this page.
The problem is that my for link in bsObj.findAll('a',{'class':'search-track'}) does not return the links I need to further build my scraper. In my code, the check is like this:
for link in bsObj.findAll('a', {'class': 'search-track'}):
    print(link)
The for loop above does not print out anything, even though the href links should be inside the <a class="search-track" ...</a>.
I have referred to this post, but changing the BeautifulSoup parser does not solve my problem. I am using "html.parser" in my BeautifulSoup constructor: bsObj = bs(html.content, features="html.parser").
And print(len(bsObj)) prints out "3", while it prints out "2" for both "lxml" and "html5lib".
Also, I started off using urllib.request.urlopen to get the page and then tried requests.get() instead. Unfortunately the two approaches give me the same bsObj.
Here is the code I've written:
#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl
'''
The elsevier search is kind of a tree structure:
"keyword --> a list of journals (a journal contain many articles) --> lists of articles
'''
address = input("Please type in your keyword: ") #My keyword is catalyst for water splitting
#https://www.elsevier.com/en-xs/search-results?
#query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"
journals = []
articles = []
def getJournals(url):
global journals
#html = urlopen(url)
html = requests.get(url)
bsObj = bs(html.content, features="html.parser")
#print(len(bsObj))
#testFile = open('testFile.txt', 'wb')
#testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') +'\n'.encode(encoding='utf-8', errors='strict'))
#testFile.close()
for link in bsObj.findAll('a',{'class':'search-track'}):
print(link)
########does not print anything########
'''
if 'href' in link.attrs and link.attrs['href'] not in journals:
newJournal = link.attrs['href']
journals.append(newJournal)
'''
return None
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
getJournals(address)
print(journals)
Can anyone tell me why the for loop in my code does not print out any links? I need to store the journal links in a list and then visit each link to scrape the abstracts of papers. By rights, the abstracts are free to access, so the website shouldn't have blocked my ID for this.
This page is dynamically loaded with JavaScript, so BeautifulSoup can't handle it directly. You may be able to do it using Selenium, but in this case you can do it by tracking the API calls made by the page (for more, see, as one of many examples, here).
In your particular case it can be done this way:
from bs4 import BeautifulSoup as bs
import requests
import json

# this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"

html = requests.get(url)
soup = bs(html.content, features="html.parser")
data = json.loads(str(soup))  # the response is JSON, so we load it into a dictionary
Note: in this case, it's also possible to dispense with BeautifulSoup altogether and load the response directly, as in data = json.loads(html.content). From this point:
hits = data['hits']['hits']  # target urls are hidden deep inside nested dictionaries and lists
for hit in hits:
    print(hit['_source']['url'])
Output:
https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x
etc.
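Following the note above, here is a minimal sketch of the BeautifulSoup-free version (same endpoint, same keys):

# Same result without BeautifulSoup: requests decodes the JSON directly.
import requests

url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
data = requests.get(url).json()
for hit in data['hits']['hits']:
    print(hit['_source']['url'])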
I am writing a simple web scraper to extract the game times for NCAA basketball games. The code doesn't need to be pretty, just to work. I have extracted the values from other span tags on the same page, but for some reason I cannot get this one working.
from bs4 import BeautifulSoup as soup
import requests

url = 'http://www.espn.com/mens-college-basketball/game/_/id/401123420'
response = requests.get(url)
soupy = soup(response.content, 'html.parser')

containers = soupy.findAll("div", {"class": "team-container"})
for container in containers:
    spans = container.findAll("span")
    divs = container.find("div", {"class": "record"})
    ranks = spans[0].text
    team_name = spans[1].text
    team_mascot = spans[2].text
    team_abbr = spans[3].text
    team_record = divs.text

time_container = soupy.find("span", {"class": "time game-time"})
game_times = time_container.text
refs_container = soupy.find("div", {"class": "game-info-note__container"})
refs = refs_container.text

print(ranks)
print(team_name)
print(team_mascot)
print(team_abbr)
print(team_record)
print(game_times)
print(refs)
The specific code I am concerned about is this:
time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text
I just provided the rest of the code to show that .text works on the other span tags. The time is the only data I truly want, but as my code currently stands I just get an empty string.
This is the output I get when I call time_container:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true"></span>
or just '' when I do game_times.
Here is the line of the HTML from the website:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true">6:10 PM CT</span>
I don't understand why the 6:10 PM is gone when I run the script.
The site is dynamic, so you need to use Selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup

d = webdriver.Chrome('/path/to/chromedriver')
d.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
game_time = soup(d.page_source, 'html.parser').find('span', {'class': 'time game-time'}).text
Output:
'7:10 PM ET'
See full selenium documentation here.
An alternative would be to use some of ESPN's endpoints, which return JSON responses: https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard
You can see other endpoints at this GitHub link: https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b
This will make your application pretty lightweight compared to running Selenium.
I recommend opening up your browser's inspector and going to the Network tab: you can see all the requests the site makes, and all sorts of other cool stuff happening.
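A minimal sketch of querying the scoreboard endpoint above; the key names used here ("events", "date", "name") are assumptions about the JSON shape and may change:

# Query ESPN's public scoreboard endpoint and print each game's date and name.
# The keys "events", "date" and "name" are assumptions about the JSON shape.
import requests

url = ("https://site.api.espn.com/apis/site/v2/sports/"
       "basketball/mens-college-basketball/scoreboard")
data = requests.get(url).json()
for event in data.get("events", []):
    print(event.get("date"), "-", event.get("name"))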
You can easily grab the time from an attribute on the page with requests:
import requests
from bs4 import BeautifulSoup as bs
from dateutil.parser import parse
r = requests.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
soup = bs(r.content, 'lxml')
timing = soup.select_one('[data-date]')['data-date']
print(timing)
match_time = parse(timing).time()
print(match_time)
I've been trying to scrape the information inside a particular set of p tags on a website and running into a lot of trouble.
My code looks like this:
import urllib
import re

def scrape():
    url = "https://www.theWebsite.com"
    statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    status = re.findall(statusText, htmltext)
    print("Status: " + str(status))

scrape()
Which unfortunately returns only: "Status: []".
That said, I have no idea what I am doing wrong, because when I was testing on the same website I could use
statusText = re.compile('(.+?)')
instead, and I would get what I was trying to get: "Status: ['About', 'About']".
Does anyone know what I can do to get the information within the div tags, or more specifically the single set of p tags the div contains? I've tried plugging in just about every value I could think of and have gotten nowhere. After Google, YouTube, and SO searching, I'm running out of ideas.
I use BeautifulSoup for extracting information between HTML tags. Suppose you want to extract a division like this: <div class='article_body' itemprop='articleBody'>...</div>
Then you can use BeautifulSoup and extract this division by:
soup = BeautifulSoup(<htmltext>)  # creating a bs object
ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})
Also see the official documentation of bs4.
As an example, I have edited your code to extract a division from a Bloomberg article; you can make your own changes:
import urllib
import re
from bs4 import BeautifulSoup

def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    soup = BeautifulSoup(htmltext)
    ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})
    print ans

scrape()
You can download BeautifulSoup from here.
P.S.: I use Scrapy and BeautifulSoup for web scraping and I am satisfied with them.