I want to extract the price from a course page, but I'm having trouble locating the right class.
On this page, the price shown for the course is $5141. When I check the page source, the class for the price appears to be "field-items".
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.find(class_='field-items')
print(price)
However, when I ran the code I got the course description instead of the price. I'm not sure what I did wrong. Any help is appreciated, thanks!
There are actually several elements with the class "field-item even" on the page, so you have to pick the one inside the wrapper that holds the price. Here's the code:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
section = soup.find(class_='field field-name-field-price field-type-number-decimal field-label-inline clearfix view-mode-full')
price = section.find(class_="field-item even").text
print(price)
And the result :
5141.00
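A minimal offline sketch of why the first lookup returned the description. The markup below is a simplified, hypothetical stand-in built from the class names quoted in this thread; the real page has more wrappers and attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the class names quoted above.
html = """
<div class="field field-name-body">
  <div class="field-items"><div class="field-item even">Course description</div></div>
</div>
<div class="field field-name-field-price field-label-inline">
  <div class="field-label">Price:</div>
  <div class="field-items"><div class="field-item even">5141.00</div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match in document order, which is the description.
first_match = soup.find(class_="field-items").get_text(strip=True)

# Narrowing to the price wrapper first makes the inner lookup unambiguous.
section = soup.find(class_="field-name-field-price")
price = section.find(class_="field-item").get_text(strip=True)

print(first_match)
print(price)
```

Matching on the single class field-name-field-price, rather than the full class string, should keep working if the site adds or reorders the other classes on that wrapper.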
With bs4 4.7.1+ you can use :contains to isolate the appropriate preceding tag, then use the adjacent sibling and descendant combinators to get to the target. (Newer Soup Sieve releases deprecate :contains in favor of :-soup-contains.)
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education')
soup = bs(r.content, 'lxml')
print(soup.select_one('.field-label:contains("Price:") + div .field-item').text)
This
.field-label:contains("Price:")
looks for an element with class field-label (the . is a CSS class selector) that contains the text Price:. The + is an adjacent sibling combinator, which selects the div immediately following that element. In " .field-item", the leading space is a descendant combinator and the dot is a class selector, so it matches any descendant of that adjacent div with class field-item. select_one returns the first match in the DOM for the combined CSS selector.
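The selector can be tried offline on a small fragment. This is only a sketch: the markup and the "Duration:" field are invented for illustration, and it uses the :-soup-contains spelling, which assumes Soup Sieve 2.1 or later is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the "Duration:" field is invented for contrast.
html = """
<div class="field">
  <div class="field-label">Duration:</div>
  <div class="field-items"><div class="field-item even">3 days</div></div>
</div>
<div class="field">
  <div class="field-label">Price:</div>
  <div class="field-items"><div class="field-item even">5141.00</div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: the label whose text contains "Price:".
label = soup.select_one('.field-label:-soup-contains("Price:")')

# Step 2: its adjacent sibling div, then any descendant with class field-item.
price = soup.select_one('.field-label:-soup-contains("Price:") + div .field-item')

print(label.get_text(strip=True))
print(price.get_text(strip=True))
```

Anchoring on the visible label text rather than on position means the lookup does not depend on how many other fields precede the price.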
Reading:
css selectors
To get the price you can use .select_one() with a CSS selector, which is precise and less error-prone.
import requests
from bs4 import BeautifulSoup
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.select_one("[class*='field-price'] .even").text
print(price)
Output:
5141.00
Actually, the class I see using the Firefox inspector is "field-item even"; that is the element holding the text:
<div class="field-items"><div class="field-item even">5141.00</div></div>
But you need to change your code slightly:
price = soup.find_all("div",{"class":'field-item even'})[2]
More than one element has the class "field-item even", and the price is not the first one.
Related
I am curious why I cannot get the div-elements with this class as follows (which worked before but on different sites). Maybe it is an issue with this site?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
url = "https://www.docmorris.de/produkte/abnehmen"
page=requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, features="lxml")
divs=soup.find_all("div",attrs={"class": "l-product mod-standard product
list-item ff-slider"})
print(divs)
This prints an empty list. I want all div elements with class 'l-product mod-standard product list-item ff-slider'.
You only need to match a single class from the multi-valued class attribute, which is also less fragile. Also, remove the headers argument.
from bs4 import BeautifulSoup
import requests
url = "https://www.docmorris.de/produkte/abnehmen"
page = requests.get(url)
soup = BeautifulSoup(page.content, features="lxml")
divs = soup.select('.l-product')
print(divs)
The multi-valued (more fragile) version would be:
divs = soup.select('.l-product.mod-standard.product.list-item.ff-slider')
Or (as noted in the comments, make sure the class string is on one line):
divs = soup.find_all("div",attrs={"class": "l-product mod-standard product list-item ff-slider"})
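A small offline sketch of the three behaviours discussed here, using invented markup (the mod-compact and teaser classes are hypothetical): a single class matches every element that carries it, the full class string matches only an identical attribute value, and a class string broken by a newline matches nothing.

```python
from bs4 import BeautifulSoup

# Invented markup; only the first div carries the full class list from the question.
html = """
<div class="l-product mod-standard product list-item ff-slider">A</div>
<div class="l-product mod-compact">B</div>
<div class="teaser">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One class is enough and tolerates extra classes on the element.
by_single_class = [d.text for d in soup.select(".l-product")]

# The full string only matches an identical class attribute value.
full = "l-product mod-standard product list-item ff-slider"
by_exact_string = [d.text for d in soup.find_all("div", attrs={"class": full})]

# A newline inside the string makes it a different value, so nothing matches.
broken = "l-product mod-standard product\nlist-item ff-slider"
by_broken_string = [d.text for d in soup.find_all("div", attrs={"class": broken})]

print(by_single_class)
print(by_exact_string)
print(by_broken_string)
```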
I am trying to scrape a particular part of a website (https://flightmath.com/from-CDG-to-BLR), but I am unable to target the element that I need.
Below is the relevant part of the HTML:
<h2 style="background-color:#7DC2F8;padding:10px"><i class="fa fa-plane"></i>
flight distance = <strong>4,866</strong> miles</h2>
This is my code
dist = soup.find('h2', attrs={'class': 'fa fa-plane'})
I just want to target the "4,866" part.
I would be really grateful if someone can guide me on this.
Thanks in advance.
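Passing a multi-class string in attrs={'class': '...'} only matches an element whose class attribute is exactly that string (not any combination of those classes). Here the fa fa-plane classes also belong to the <i> tag, not the <h2>, so that find call matches nothing. Instead, use the soup.select_one method to select with a CSS selector: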
from bs4 import BeautifulSoup
import requests
url = 'https://flightmath.com/from-CDG-to-BLR'
html_data = requests.get(url).content
soup = BeautifulSoup(html_data, 'html.parser')
dist = soup.select_one('h2 i.fa-plane + strong')
print(dist.text) # 4,866
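The same lookup can be checked without a network call by parsing the fragment quoted in the question. This is a sketch under that assumption; the conversion to an integer at the end is an added convenience, not part of the original answer.

```python
from bs4 import BeautifulSoup

# The fragment quoted in the question, joined onto one logical string.
html = (
    '<h2 style="background-color:#7DC2F8;padding:10px"><i class="fa fa-plane"></i> '
    "flight distance = <strong>4,866</strong> miles</h2>"
)
soup = BeautifulSoup(html, "html.parser")

# The original attempt: the <h2> has no class attribute, so this is None.
missing = soup.find("h2", attrs={"class": "fa fa-plane"})

# The <i> carries the class; + steps to the next sibling element, the <strong>.
strong = soup.select_one("h2 i.fa-plane + strong")

# Drop the thousands separator to get a number.
miles = int(strong.text.replace(",", ""))

print(missing)
print(strong.text)
print(miles)
```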
In case of interest: The value is hard coded into the html (for a flight speed calculation) so you could also regex out a more precise value with the following. You can use round() to get the value shown on page.
import requests, re
urls = ['https://flightmath.com/from-CDG-to-BOM', 'https://flightmath.com/from-CDG-to-BLR', 'https://flightmath.com/from-CDG-to-IXC']
p = re.compile(r'flightspeed\.min\.value\/60 \+ ([0-9.]+)')
with requests.Session() as s:
    for url in urls:
        print(p.findall(s.get(url).text)[0])
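To illustrate the round() remark, here is a sketch that runs the same pattern against an invented line of script. The sample string and the value 4865.72 are made up for the example; the text embedded in the real page may look different.

```python
import re

# Invented stand-in for the inline script; the real page's text may differ.
sample = "var t = flightspeed.min.value/60 + 4865.72;"

p = re.compile(r'flightspeed\.min\.value\/60 \+ ([0-9.]+)')

# The capture group holds the unrounded distance as a string.
raw = p.findall(sample)[0]

# round() on the float gives the whole-mile figure a page would display.
shown = round(float(raw))

print(raw)
print(shown)
```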
Find the tag by its class name, then use find_next() to find the following strong tag.
from bs4 import BeautifulSoup
import requests
url = 'https://flightmath.com/from-CDG-to-BLR'
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html.parser')
dist = soup.find('i',class_='fa-plane').find_next('strong')
print(dist.text)
Given the following code:
# import the module
import bs4 as bs
import urllib.request
import re
masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = bs.BeautifulSoup(sauce,'lxml')
for div in soup.findAll('ul', {'class': 'song-list'}):
    for span in div:
        for link in span:
            for a in link:
                print(a)
I can parse multiple divs, and I get a result as follows:
My question is: instead of getting the full contents of the div, how can I return only the highlighted portion, the URL in the href?
Try this. You need to specify the right class to fetch the urls connected to it.
from bs4 import BeautifulSoup
import urllib.request
masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = BeautifulSoup(sauce,'lxml')
for div in soup.find_all(class_='subtitle'):
    print(div.get("href"))
Output:
http://www.metrolyrics.com/charles-goose-lyrics.html
http://www.metrolyrics.com/param-singh-lyrics.html
http://www.metrolyrics.com/westlife-lyrics.html
http://www.metrolyrics.com/luis-fonsi-lyrics.html
http://www.metrolyrics.com/grease-lyrics.html
http://www.metrolyrics.com/shanti-dope-lyrics.html
and so on ---
if 'href' in a.attrs:
    a.attrs['href']
This will give you what you need.
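As an alternative to the nested loops, a CSS selector can collect only the anchors that actually have an href. This is a sketch on hypothetical markup: the list structure is assumed, and only the class names and two of the URLs come from this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical list markup; the third item has no href on purpose.
html = """
<ul class="song-list">
  <li><a class="subtitle" href="http://www.metrolyrics.com/westlife-lyrics.html">Westlife</a></li>
  <li><a class="subtitle" href="http://www.metrolyrics.com/grease-lyrics.html">Grease</a></li>
  <li><a class="subtitle">No link</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href] skips anchors without the attribute, so indexing cannot raise KeyError.
links = [a["href"] for a in soup.select("ul.song-list a[href]")]

for link in links:
    print(link)
```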
So I'm making a bitcoin price checker for practice, and I'm having trouble scraping the data because the value I want is in a span with a class and I don't know how to retrieve it.
Here is the line that I got from the inspector:
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
I want to scrape the "11,511.31" number. How do I do this?
I tried so many different things and I honestly have no clue what to do anymore.
Here is the URL: https://www.gdax.com/trade/BTC-USD
I'm scraping the current USD price (right next to "BTC/USD").
EDIT: Guys, a lot of the examples you gave me are ones where I input the data myself. That's not useful, because I want to refresh the page every 30 seconds, so I need the program to find the span class, extract the data, and print it.
EDIT: Current code. I need the program to get the "html" part by itself.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup
url = 'https://www.gdax.com/trade/BTC-USD'
#program need to retrieve this by itself
html = """<span class="MarketInfo_market-num_1lAXs">11,560.00 USD</span>"""
soup = BeautifulSoup(html, "html.parser")
spans=soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD','').strip())
You just have to search for the right tag and class -
from bs4 import BeautifulSoup
html_text = """
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
"""
html = BeautifulSoup(html_text, "lxml")
spans = html.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
This searches for all <span> tags and filters them by the class attribute, which in your case has the value MarketInfo_market-num_1lAXs. Once filtered, loop through the spans, read each one's text with the .text attribute, and strip out 'USD'.
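If the price is needed as a number rather than a string, the cleaned text can be converted in one more step. A minimal sketch using the span quoted in the question; the choice of Decimal is an assumption made here to avoid float rounding on a currency value.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

html = '<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", class_="MarketInfo_market-num_1lAXs")

# Strip surrounding whitespace, then remove the currency code and the comma.
text = span.get_text(strip=True)
amount = Decimal(text.replace("USD", "").replace(",", "").strip())

print(text)
print(amount)
```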
UPDATE
import requests
import json
url = 'https://api.gdax.com/products/BTC-USD/trades'
res = requests.get(url)
json_res = json.loads(res.text)
print(json_res[0]['price'])
No need to understand the HTML. The data in that HTML tag is getting populated from an API call which has a JSON response. You can call that API directly. This will keep your data current.
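Since the question asks for a refresh every 30 seconds, the API call can be wrapped in a small polling loop. This is a sketch, not a definitive implementation: latest_price, poll, and fake_fetch are hypothetical helper names, the JSON shape is assumed from the json_res[0]['price'] lookup above, and the fetcher is injected so the loop can run without a network connection.

```python
import json
import time


def latest_price(trades_json):
    """Return the price of the first trade in a JSON array of trades."""
    trades = json.loads(trades_json)
    return trades[0]["price"]


def poll(fetch, interval_seconds=30, iterations=3, sleep=time.sleep):
    """Call fetch() repeatedly, pausing between calls, and collect the prices."""
    prices = []
    for i in range(iterations):
        prices.append(latest_price(fetch()))
        if i < iterations - 1:
            sleep(interval_seconds)
    return prices


def fake_fetch():
    # Stand-in for the HTTP call; the response body here is invented.
    return '[{"price": "11511.31"}, {"price": "11510.00"}]'


# Skip the real delay so the demonstration finishes immediately.
demo_prices = poll(fake_fetch, iterations=2, sleep=lambda seconds: None)
print(demo_prices)
```

For live data, fetch could be something like lambda: requests.get(url).text with the url from the snippet above, assuming that endpoint is still served.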
You can use BeautifulSoup or lxml.
With BeautifulSoup, the code looks like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""", "lxml")
print(soup.string)
lxml is faster:
from lxml import etree
span = etree.HTML("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""")
for i in span.xpath("//span/text()"):
    print(i)
Try driving a real browser, such as Firefox through Selenium. I tried Selenium with PhantomJS, but it failed...
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
url = 'https://www.gdax.com/trade/BTC-USD'
driver = webdriver.Firefox(executable_path='./geckodriver')
driver.get(url)
sleep(10) # Sleep 10 seconds while waiting for the page to load...
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans=soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD','').strip())
driver.close()
Output:
11,493.00
+
3.06 %
13,432 BTC
[Finished in 15.0s]
I am trying to parse through a div class from an html table on Amazon, and when I run the code, find_all() sometimes returns the right div classes that I am looking for, and other times it will return an empty list. Any ideas on why the results vary?
I am pulling from this url: https://www.amazon.com/dp/B0767653BK
My code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.amazon.com/dp/B0767653BK')
page = req.text
BSoup = BeautifulSoup(page, 'html.parser')
divClass = BSoup.find_all('div', class_='a-section a-spacing-none a-padding-none overflow_ellipsis')
It is better to use a beautifulsoup selector when trying to find all elements with a combination of CSS classes:
from bs4 import BeautifulSoup
import requests
req = requests.get('https://www.amazon.com/dp/B0767653BK')
soup = BeautifulSoup(req.text, 'html.parser')
for div_class in soup.select('div.a-section.a-spacing-none.a-padding-none.overflow_ellipsis'):
    print(div_class.get_text(strip=True))
This is preferable as it allows the four classes to be present in any order. So if the page changes the ordering of the classes, the selector will still find them.
Take a look at Searching by CSS class in the documentation.
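The ordering point can be seen on a two-element fragment. The markup is invented for the illustration; it reuses the class names from the question with the order shuffled on the second div.

```python
from bs4 import BeautifulSoup

# Invented fragment: same four classes, different order on the second div.
html = """
<div class="a-section a-spacing-none a-padding-none overflow_ellipsis">first</div>
<div class="overflow_ellipsis a-padding-none a-section a-spacing-none">second</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string is compared against the attribute value as written.
exact = soup.find_all("div", class_="a-section a-spacing-none a-padding-none overflow_ellipsis")

# A chained class selector only requires that all four classes are present.
any_order = soup.select("div.a-section.a-spacing-none.a-padding-none.overflow_ellipsis")

exact_text = [d.get_text(strip=True) for d in exact]
any_order_text = [d.get_text(strip=True) for d in any_order]

print(exact_text)
print(any_order_text)
```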