I am trying to use find_all() on the HTML below:
http://www.simon.com/mall
Based on advice in other threads, I ran the link through the validator below and it found errors, but I am not sure how those errors might be hurting what I am trying to do in Beautiful Soup.
https://validator.w3.org/
Here is my code:
from requests import get
from bs4 import BeautifulSoup

url = 'http://www.simon.com/mall'
response = get(url)
html = BeautifulSoup(response.text, 'html5lib')
mall_list = html.find_all('div', class_='col-xl-4 col-md-6 ')
print(type(mall_list))
print(len(mall_list))
The result is:
"C:\Program Files\Anaconda3\python.exe" C:/Users/Chris/PycharmProjects/IT485/src/GetMalls.py
<class 'bs4.element.ResultSet'>
0
Process finished with exit code 0
I know there are hundreds of these divs in the HTML. Why am I not getting any matches?
Your code looks fine; however, when I visit the simon.com/mall link and check Chrome DevTools, there don't seem to be any instances of the class 'col-xl-4 col-md-6 '.
Try testing your code with 'col-xl-2' and you should see some results.
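There is also a subtler gotcha worth knowing here: when you pass a multi-class string to class_, Beautiful Soup matches the class attribute exactly, so the trailing space in 'col-xl-4 col-md-6 ' can never match, while a single class name matches any element that carries it. A minimal self-contained sketch (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="col-xl-4 col-md-6">a</div><div class="col-xl-2">b</div>'
soup = BeautifulSoup(html, 'html.parser')

# A multi-class string must match the class attribute exactly, so the
# trailing space in 'col-xl-4 col-md-6 ' finds nothing:
print(len(soup.find_all('div', class_='col-xl-4 col-md-6 ')))  # 0

# A single class name matches regardless of the element's other classes:
print(len(soup.find_all('div', class_='col-xl-2')))            # 1
```

So even on a static page, a stray space copied from DevTools is enough to turn every match into zero results.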
I assume you are trying to parse the title and location of the different malls on that page (mentioned in your script). The thing is, the content of that page is generated dynamically, so you can't catch it with requests; rather, you need a browser simulator like selenium, which is what I did in the code below. Give this a try:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('http://www.simon.com/mall')
time.sleep(3)  # wait for the JavaScript-rendered content to load
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

for item in soup.find_all(class_="mall-list-item-text"):
    name = item.find(class_='mall-list-item-name').text
    location = item.find(class_='mall-list-item-location').text
    print(name, location)
Results:
ABQ Uptown Albuquerque, NM
Albertville Premium Outlets® Albertville, MN
Allen Premium Outlets® Allen, TX
Anchorage 5th Avenue Mall Anchorage, AK
Apple Blossom Mall Winchester, VA
I sometimes use BeautifulSoup too. The problem lies in the way you get the attributes. The full working code can be seen below:
import requests
from bs4 import BeautifulSoup

url = 'http://www.simon.com/mall'
response = requests.get(url)
html = BeautifulSoup(response.text, 'html.parser')
mall_list = html.find_all('div', attrs={'class': 'col-lg-4 col-md-6'})[1].find_all('option')
malls = []
for mall in mall_list:
    if mall.get('value') == '':
        continue
    malls.append(mall.text)
print(malls)
print(type(malls))
print(len(malls))
I am trying to scrape NBA.com's play-by-play table, i.e. the text for each box shown in the example picture
(for example: https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play).
Checking the HTML, I figured that each line is in an article tag containing a div tag, which in turn contains the two p tags with the information I want. However, when I run the following code I get back 0 articles and only 9 p tags (there should be many more), and even the tags I do get hold text that is not from the boxes but something else. Since I only get 9 tags, I am doing something terribly wrong and I am not sure what it is.
This is the code to get the tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def contains_word(t):
    return t and 'keyword' in t

url = "https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
div_tags = soup.find_all('div', text=contains_word("playByPlayContainer"))
articles = soup.find_all('article')
p_tag = soup.find_all('p', text=contains_word("md:bg"))
Thank you!
Use Selenium, since the page is rendered with JavaScript, and pass the page source to Beautiful Soup. Also pip install selenium and get chromedriver.exe.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
soup = BeautifulSoup(driver.page_source, "html.parser")
This is my first post; I hope it's clear.
I'm scraping a website, and here is the markup I'm interested in:
<div id="live-table">
<div class="event mobile event--summary">
<div elementtiming="SpeedCurveFRP" class="leagues--static event--leagues summary-results">
<div class="sportName tennis">
<div id="g_2_ldRHDOEp" title="Clicca per i dettagli dell'incontro!" class="event__match event__match--static event__match--twoLine">
...
What I would like to obtain is the last id (g_2_ldRHDOEp), and here is the code I produced using the beautifulsoup library:
import urllib.request, urllib.error, urllib.parse
from bs4 import BeautifulSoup

url = '...'
response = urllib.request.urlopen(url)
webContent = response.read()
soup = BeautifulSoup(webContent, 'html.parser')
list = []
list = soup.find_all("div")
total_id = " "
for i in list:
    id = i.get('id')
    total_id = total_id + "\n" + str(id)
print(total_id)
But what I get is only
live-table
None
None
None
None
I'm quite new to both Python and BeautifulSoup, and I'm not a serious programmer; I do this just for fun.
Can anyone tell me why I can't get what I want, and perhaps how I could do this in a better, more successful way?
Thank you in advance
First of all, id is a built-in function and list is a built-in type, so don't shadow them by using them as variable names.
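As a quick illustration of why shadowing built-in names bites (a minimal sketch, not part of the original answer):

```python
# Rebinding the name 'list' hides the built-in type:
list = [1, 2, 3]
try:
    list("abc")  # now fails: a list instance is not callable
except TypeError as e:
    print(e)

del list  # remove the shadow; the built-in is reachable again
assert list("ab") == ['a', 'b']
```

Inside your loop the damage is the same: once id holds a string, any later call to id(something) raises a TypeError.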
The website is loaded dynamically, so requests alone can't get this content. We can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

URL = "https://www.flashscore.it/giocatore/djokovic-novak/AZg49Et9/"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
sleep(5)  # give the JavaScript time to render the content
soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.find_all("div", id="g_2_ldRHDOEp"):
    print(tag.get_text(separator=" "))
driver.quit()
Output:
30.10. 12:05 Djokovic N. (Srb) Sonego L. (Ita) 0 2 2 6 1 6 P
I would like to get just the open-positions text from this website: https://www.praeses.com/careers/. I copied and pasted the class, but it pulls text from most of the site because almost everything uses this class, and there's no other unique attribute to select on. How do I get just the open positions? I basically get everything that is an a tag with this class.
<a class="et_pb_button et_pb_custom_button_icon et_pb_button_1 et_hover_enabled et_pb_bg_layout_dark" href="https://www.praeses.com/senior-national-accounts-manager/" data-icon="5">Senior National Accounts Manager</a>
import requests
from bs4 import BeautifulSoup

print("Praeses jobs:")
praeses_url = "https://www.praeses.com/careers/"
praeses_html_text = requests.get(praeses_url).text
praeses_soup = BeautifulSoup(praeses_html_text, 'html.parser')
# print(praeses_soup)
for job in praeses_soup.find_all('et_pb_button et_pb_custom_button_icon et_pb_button_1 et_hover_enabled et_pb_bg_layout_dark'):
    print(praeses_soup.text)
You can use a CSS selector for the task.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.praeses.com/careers/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for a in soup.select('div:contains("Open Positions") ~ div > a'):
    print('{:<40}{}'.format(a.get_text(strip=True), a['href']))
Prints:
Senior National Accounts Manager https://www.praeses.com/senior-national-accounts-manager/
National Accounts Manager https://www.praeses.com/national-accounts-manager/
Cloud Architect https://www.praeses.com/cloud-architect/
Front-End Developer https://www.praeses.com/front-end-developer/
Senior Project Manager (GOV) https://www.praeses.com/senior-project-manager-gov/
Try with this:
import requests
from bs4 import BeautifulSoup

print("Praeses jobs:")
praeses_url = "https://www.praeses.com/careers/"
praeses_html_text = requests.get(praeses_url).text
praeses_soup = BeautifulSoup(praeses_html_text, 'html.parser')
# print(praeses_soup)
for j in range(1, 10):
    try:
        clase = "et_pb_button et_pb_custom_button_icon et_pb_button_" + str(j) + " et_hover_enabled et_pb_bg_layout_dark"
        hola = praeses_soup.findAll("a", {"class": clase})
        print(hola[0].text)
    except:
        print("Its over")
        break
I am writing a simple web scraper to extract the game times for NCAA basketball games. The code doesn't need to be pretty, it just needs to work. I have extracted values from other span tags on the same page, but for some reason I cannot get this one working.
from bs4 import BeautifulSoup as soup
import requests

url = 'http://www.espn.com/mens-college-basketball/game/_/id/401123420'
response = requests.get(url)
soupy = soup(response.content, 'html.parser')

containers = soupy.findAll("div", {"class": "team-container"})
for container in containers:
    spans = container.findAll("span")
    divs = container.find("div", {"class": "record"})
    ranks = spans[0].text
    team_name = spans[1].text
    team_mascot = spans[2].text
    team_abbr = spans[3].text
    team_record = divs.text

time_container = soupy.find("span", {"class": "time game-time"})
game_times = time_container.text
refs_container = soupy.find("div", {"class": "game-info-note__container"})
refs = refs_container.text

print(ranks)
print(team_name)
print(team_mascot)
print(team_abbr)
print(team_record)
print(game_times)
print(refs)
The specific code I am concerned about is this:
time_container = soupy.find("span", {"class": "time game-time"})
game_times = time_container.text
I provided the rest of the code just to show that .text works on the other span tags. The time is the only data I truly want, but as the code stands I just get an empty string.
This is the output I get when I print time_container:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true"></span>
or just '' when I do game_times.
Here is the line of the HTML from the website:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true">6:10 PM CT</span>
I don't understand why the 6:10 pm is gone when I run the script.
The site is dynamic, so you need to use selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver

d = webdriver.Chrome('/path/to/chromedriver')
d.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
game_time = soup(d.page_source, 'html.parser').find('span', {'class': 'time game-time'}).text
Output:
'7:10 PM ET'
See full selenium documentation here.
An alternative would be to use some of ESPN's endpoints, which return JSON responses, e.g. https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard
You can see other endpoints at this GitHub link: https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b
This will make your application pretty lightweight compared to running Selenium.
I also recommend opening the browser's inspector and going to the Network tab: you can see all sorts of interesting things there, including every request the site makes.
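As a sketch of the endpoint approach: the field names below (events, name, date) are assumptions based on ESPN's public scoreboard feed, so check the actual JSON in the Network tab before relying on them. The parsing is kept in a small function so it can be tried on a trimmed-down sample payload first:

```python
def game_times(data):
    """Pull (matchup name, start date) pairs out of a scoreboard payload."""
    return [(e['name'], e['date']) for e in data.get('events', [])]

# Hypothetical, trimmed-down payload in the assumed shape:
sample = {
    'events': [
        {'name': 'Duke Blue Devils at Kansas Jayhawks',
         'date': '2019-11-06T00:00Z'},
    ]
}
print(game_times(sample))

# Against the live endpoint (requires the requests package) it would be:
# import requests
# url = ('https://site.api.espn.com/apis/site/v2/sports/'
#        'basketball/mens-college-basketball/scoreboard')
# print(game_times(requests.get(url).json()))
```

Since the endpoint returns plain JSON, there is no HTML parsing and no browser to drive, which is where the weight saving over Selenium comes from.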
You can also easily grab the time from a data attribute on the page with requests:
import requests
from bs4 import BeautifulSoup as bs
from dateutil.parser import parse
r = requests.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
soup = bs(r.content, 'lxml')
timing = soup.select_one('[data-date]')['data-date']
print(timing)
match_time = parse(timing).time()
print(match_time)
Usually I'm able to write a script that works for scraping, but I've been having some difficulty scraping this site for the table needed for a research project I'm working on. I plan to verify that the script works on one state before entering the URLs of my other target states.
import requests
import bs4 as bs

url = "http://programs.dsireusa.org/system/program/detail/284"
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text, 'lxml')
# Printing "table" just to ensure that the table information I'm looking for is within this section
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
I'm not sure if the site is attempting to block people from scraping, but if you look at what table outputs, all the info I'm looking to grab only appears inside the unrendered template markup rather than as page text.
The text is rendered with JavaScript.
First render the page with dryscrape.
(If you don't want to use dryscrape, see Web-scraping JavaScript page with Python.)
Then, after it has been rendered, the text can be extracted from a different position on the page, i.e. the place it has been rendered to.
As an example, this code will extract the HTML of the program summary:
import bs4 as bs
import dryscrape
url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])
Outputs:
<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...
So I finally managed to solve the issue and successfully grab the data from the JavaScript page. The following code worked for me, in case anyone on Windows hits the same problem when trying to scrape a JavaScript page with Python (dryscrape being incompatible with Windows):
import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()
url = "http://programs.dsireusa.org/system/program/detail/284"
browser.get(url)
html_source = browser.page_source
browser.quit()

soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
    trip = str(n.text)
    data.append(trip)