I'm building a bitcoin price checker as practice, and I'm having trouble scraping the data because the value I want is inside a span with a particular class and I don't know how to retrieve it.
so here is the line that I got from inspect:
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
I want to scrape the "11,511.31" number. How do I do this?
I tried so many different things and I honestly have no clue what to do anymore.
here is the URL: https://www.gdax.com/trade/BTC-USD
I'm scraping the current USD price (right next to "BTC/USD").
EDIT: A lot of the examples you gave are ones where I input the data myself. That's not useful, because I want to refresh the page every 30 seconds, so I need the program to find the span class, extract the data, and print it on its own.
EDIT: current code. I still need the program to fetch the "html" part by itself:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup
url = 'https://www.gdax.com/trade/BTC-USD'
#program need to retrieve this by itself
html = """<span class="MarketInfo_market-num_1lAXs">11,560.00 USD</span>"""
soup = BeautifulSoup(html, "html.parser")
spans=soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD','').strip())
You just have to search for the right tag and class -
from bs4 import BeautifulSoup
html_text = """
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
"""
html = BeautifulSoup(html_text, "lxml")
spans = html.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
Search for all <span> tags, then filter them by the class attribute, which in your case has the value MarketInfo_market-num_1lAXs. Once filtered, loop through the spans, grab the text via the .text attribute, and strip out the 'USD'.
UPDATE
import requests
import json
url = 'https://api.gdax.com/products/BTC-USD/trades'
res = requests.get(url)
json_res = json.loads(res.text)
print(json_res[0]['price'])
No need to parse the HTML at all. The data in that HTML tag is populated by an API call that returns JSON, so you can call the API directly. This also keeps your data current.
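To refresh every 30 seconds, as the question's edit asks, the API call can be wrapped in a loop. A sketch, assuming the response shape shown above (a JSON list with the newest trade first); extract_price, poll, and POLL_SECONDS are illustrative names:

```python
import json
import time

URL = 'https://api.gdax.com/products/BTC-USD/trades'
POLL_SECONDS = 30  # how long to wait between requests

def extract_price(trades):
    # The trades endpoint returns a JSON list; the newest trade comes first.
    return trades[0]['price']

def poll(iterations):
    # Fetch and print the latest price, sleeping between requests.
    import requests  # imported here; only the network step needs it
    for _ in range(iterations):
        trades = json.loads(requests.get(URL).text)
        print(extract_price(trades))
        time.sleep(POLL_SECONDS)

# The parsing step works the same on a canned response:
sample = json.loads('[{"trade_id": 1, "price": "11511.31", "size": "0.01"}]')
print(extract_price(sample))  # prints 11511.31
```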
You can use BeautifulSoup or lxml.
With BeautifulSoup, the code looks like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""", "lxml")
print(soup.string)
lxml is faster:
from lxml import etree
span = etree.HTML("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""")
for i in span.xpath("//span/text()"):
    print(i)
Try a real browser via Selenium with Firefox. I tried Selenium with PhantomJS, but it failed...
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
url = 'https://www.gdax.com/trade/BTC-USD'
driver = webdriver.Firefox(executable_path='./geckodriver')
driver.get(url)
sleep(10) # Sleep 10 seconds while waiting for the page to load...
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans=soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD','').strip())
driver.close()
Output:
11,493.00
+
3.06 %
13,432 BTC
[Finished in 15.0s]
Related
I am trying to scrape NBA.com's play-by-play table, so I want to get the text of each box shown in the example picture.
For example: https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play.
Checking the HTML, I figured out that each line is in an article tag containing a div tag, which in turn contains two p tags with the information I want. However, with the code below I get back 0 articles and only 9 p tags (there should be many more), and even for the tags I do get, the text is not the box content but something else. Since I only get 9 tags, I am doing something terribly wrong and I am not sure what it is.
this is the code to get the tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def contains_word(t):
    return t and 'keyword' in t
url = "https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
div_tags = soup.find_all('div', text=contains_word("playByPlayContainer"))
articles=soup.find_all('article')
p_tag = soup.find_all('p', text=contains_word("md:bg"))
thank you!
Use Selenium, since the page is rendered with JavaScript, and pass the page source to BeautifulSoup. Also pip install selenium and get chromedriver.exe.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
soup = BeautifulSoup(driver.page_source, "html.parser")
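From there the extraction can follow the structure the question describes. A sketch, assuming each play row really is an article tag whose p tags hold the clock and the description; class names on nba.com are generated and change often, so matching on structure is safer than matching on class:

```python
from bs4 import BeautifulSoup

def extract_plays(html):
    # Collect the <p> texts inside every <article>; per the question, each
    # article is one play-by-play row holding two <p> tags.
    soup = BeautifulSoup(html, 'html.parser')
    plays = []
    for article in soup.find_all('article'):
        texts = [p.get_text(strip=True) for p in article.find_all('p')]
        if texts:
            plays.append(texts)
    return plays

# Works the same on a minimal snippet shaped like the question describes:
sample = '<article><div><p>12:00</p><p>Jump ball</p></div></article>'
print(extract_plays(sample))  # prints [['12:00', 'Jump ball']]
```

With Selenium, pass driver.page_source to extract_plays once the page has rendered.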
How can I get the value of data-d1-value when I am using the requests library in Python?
requests.get(URL) by itself is not returning the data-* attributes of the div, which are present in the original webpage.
The web page is as follows:
<div id="test1" class="class1" data-d1-value="150">
180
</div>
The code I am using is:
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
d1_value = soup.find('div', {'class':"class1"})
print(d1_value)
The result I get is:
<div id="test1" class="class1">
180
</div>
When I debugged this, I found that requests.get(URL) is not returning the full div, only the id and class, without the data-* attributes.
How should I modify it to get the full value?
For better example:
For my case the URL is:
https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG
And the Information of variable:
The div class is class="inprice1 nsecp", and the value of data-numberanimate-value is what I am trying to fetch.
Thanks in advance :)
EDIT
The website's response differs depending on how you request it. In your case, using requests, the value you are looking for is served like this:
<div class="inprice1 nsecp" id="nsecp" rel="92.75">92.75</div>
So you can get it from the rel or from the text:
soup.find('div', {'class':"inprice1"})['rel']
soup.find('div', {'class':"inprice1"}).get_text()
Example
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG')
soup = BeautifulSoup(req.text, 'lxml')
print('rel: '+soup.find('div', {'class':"inprice1"})['rel'])
print('text: '+soup.find('div', {'class':"inprice1"}).get_text())
Output
rel: 92.75
text: 92.75
To get a response that shows the source as you see it in the inspector, you have to try Selenium.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG"
driver.get(url)
sleep(2)
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find('div', class_='inprice1 nsecp')['data-numberanimate-value'])
driver.close()
To get the attribute value just add ['data-d1-value'] to your find()
Example
from bs4 import BeautifulSoup
html='''
<div id="test1" class="class1" data-d1-value="150">
180
</div>
'''
soup = BeautifulSoup(html, 'lxml')
d1_value = soup.find('div', {'class':"class1"})['data-d1-value']
print(d1_value)
You are seeing this issue because you didn't retrieve the other attributes that were defined on the div.
The code below retrieves all of the custom attributes defined on the div as well:
from bs4 import BeautifulSoup
s = '<div id="test1" class="class1" data-d1-value="150">180</div>'
soup = BeautifulSoup(s, 'html.parser')
attributes_dictionary = soup.find('div',{'class':"class1"}).attrs
print(attributes_dictionary)
You can get the data from the HTML, or you can scrape the API directly.
Here is an example for the Money Control website:
If you open the developer tools in your browser and select the Network tab, you can see the requests the website makes.
In the headers you can see the API's URL: priceapi.moneycontrol.com.
This is a strange case, because the API is open... and usually it isn't.
You can access the price:
Imagine you save the JSON data into a variable called 'json'; you can access it with:
json.data.pricecurrent
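json.data.pricecurrent is JavaScript-style notation; in Python the same lookup is dictionary indexing. A sketch, assuming you copy the exact priceapi.moneycontrol.com URL from the Network tab (it isn't reproduced here) and that the payload nests the price under data -> pricecurrent as above:

```python
import json

def price_from_payload(payload):
    # json.data.pricecurrent in JavaScript is
    # payload['data']['pricecurrent'] in Python.
    return payload['data']['pricecurrent']

def price_from_api(api_url):
    # api_url is the priceapi.moneycontrol.com URL copied from the
    # browser's Network tab.
    import requests  # imported here; only the network step needs it
    return price_from_payload(requests.get(api_url).json())

# The same access works on a canned response:
payload = json.loads('{"data": {"pricecurrent": "92.75"}}')
print(price_from_payload(payload))  # prints 92.75
```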
I want to extract the price from the website below; however, I'm having trouble locating the right class.
On this website we see that the price for this course is $5141. When I check the source code, the class for the price should be "field-items".
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.find(class_='field-items')
print(price)
However, when I ran the code I got a description of the course instead of the price... not sure what I did wrong. Any help appreciated, thanks!
There are actually several "field-item even" elements on your webpage, so you have to pick the one inside the right parent class. Here's the code:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
section = soup.find(class_='field field-name-field-price field-type-number-decimal field-label-inline clearfix view-mode-full')
price = section.find(class_="field-item even").text
print(price)
And the result :
5141.00
With bs4 4.7.1+ you can use :contains to isolate the appropriate preceding tag, then use adjacent sibling and descendant combinators to get to the target.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education')
soup = bs(r.content, 'lxml')
print(soup.select_one('.field-label:contains("Price:") + div .field-item').text)
This
.field-label:contains("Price:")
looks for an element with class field-label (the . is a CSS class selector) which contains the text Price:. The + is an adjacent sibling combinator, specifying to get the adjacent div. The .field-item (space, dot, field-item) is a descendant combinator (the space) plus a class selector for a child of that adjacent div having class field-item. select_one returns the first match in the DOM for the CSS selector combination.
Reading:
css selectors
To get the price you can try using .select() which is precise and less error prone.
import requests
from bs4 import BeautifulSoup
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.select_one("[class*='field-price'] .even").text
print(price)
Output:
5141.00
Actually the class I see using the Firefox inspector is "field-item even"; that's where the text is:
<div class="field-items"><div class="field-item even">5141.00</div></div>
But you need to change your code a little bit:
price = soup.find_all("div",{"class":'field-item even'})[2]
There is more than one element with the "field-item even" class, and the price is not the first one.
I'm using Beautiful Soup for the first time, and the text from the span class is not being extracted. I'm not familiar with HTML, so I'm unsure why this happens; it'd be great to understand.
I've used the code below:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content = page_soup.findAll("span",attrs={"data-item":"rate"})
With this code for index 0 it returns the following:
<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-item="rate" data-section="PHL" data-subsection="VR"></span>
However, I'd expect something like the following when I inspect via Chrome, which has the text such as the interest rate:
<span class="productdata" data-cc="AU" data-section="PHL" data-subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>
The data you are trying to extract does not exist in the initial HTML. It is loaded by JavaScript after the page loads: the website uses a JSON API to populate the information on the page, so Beautiful Soup cannot find the data. The data can be viewed at the following link, which hits the site's JSON API and returns JSON data.
https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL
You can parse the JSON and get the data. Also, for HTTP requests I would recommend the requests package.
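A sketch of that approach. The URL comes from the link above; since the exact layout of the JSON isn't documented here, the collect_values helper (an illustrative name) simply walks the whole payload and gathers every value stored under a given key, with 'rate' chosen because it is the data-item the question is after. Inspect the real response and adapt:

```python
import json

API_URL = ('https://www.anz.com/productdata/productdata.asp'
           '?output=json&country=AU&section=PHL')

def collect_values(obj, wanted):
    # The exact JSON layout isn't documented, so walk the whole structure
    # and gather every value stored under the wanted key.
    hits = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == wanted:
                hits.append(value)
            hits.extend(collect_values(value, wanted))
    elif isinstance(obj, list):
        for item in obj:
            hits.extend(collect_values(item, wanted))
    return hits

def fetch_rates():
    import requests  # imported here; only the network step needs it
    resp = requests.get(API_URL, headers={'User-Agent': 'Mozilla/5.0'})
    resp.raise_for_status()
    return collect_values(resp.json(), 'rate')

# The walker works on any nested payload:
sample = json.loads('{"PHL": {"VR": [{"rate": "5.20% p.a."}]}}')
print(collect_values(sample, 'rate'))  # prints ['5.20% p.a.']
```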
As others said, the content is generated by JavaScript; you can use Selenium together with ChromeDriver to find the data you want, with something like:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")
items = driver.find_elements_by_css_selector("span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]
>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]
As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for item in items]
I have a number of Facebook groups whose member counts I would like to get. An example would be this group: https://www.facebook.com/groups/347805588637627/
I have looked at inspect element on the page and it is stored like so:
<span id="count_text">9,413 members</span>
I am trying to get "9,413 members" out of the page. I have tried using BeautifulSoup but cannot work it out.
Thanks
Edit:
from bs4 import BeautifulSoup
import requests
url = "https://www.facebook.com/groups/347805588637627/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
span = soup.find("span", id="count_text")
print(span.text)
In case there is more than one span tag in the page:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_input, 'html.parser')
span = soup.find("span", id="count_text")
span.text
You can use the text attribute of the parsed span:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<span id="count_text">9,413 members</span>', 'html.parser')
>>> soup.span
<span id="count_text">9,413 members</span>
>>> soup.span.text
'9,413 members'
If you have more than one span tag, you can try this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
for tag in tags:
    print(tag.contents[0])
Facebook uses JavaScript to prevent bots from scraping. You need to use Selenium to extract the data in Python.
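A sketch of that approach, reusing the count_text id from the question's snippet. The ChromeDriver setup is an assumption, and the group page may additionally require being logged in:

```python
from bs4 import BeautifulSoup

def member_count(html):
    # The span id comes from the question's inspect-element snippet.
    span = BeautifulSoup(html, 'html.parser').find('span', id='count_text')
    return span.text if span else None

def member_count_live(group_url):
    # Imported here so the parsing helper works without Selenium installed.
    from selenium import webdriver
    driver = webdriver.Chrome()  # needs chromedriver on PATH
    try:
        driver.get(group_url)
        return member_count(driver.page_source)
    finally:
        driver.quit()

# The parser works on the snippet from the question:
print(member_count('<span id="count_text">9,413 members</span>'))
```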