Collect data with BeautifulSoup - python

I'm trying to get the price value out, but I can't; it always returns a "-". I want it to return 1,376,000.
HTML
<span class="price_big_right">
<span id="pc-lowest-1" data-price="1,376,000">
1,376,000
<img alt="c" class="coins_icon_l_bin" src="https://example.com/123"/>
</span>
</span>
Here is what I'm trying to do:
import requests
from bs4 import BeautifulSoup
import smtplib

URL = 'https://www.futbin.com/22/player/426'

def test():
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="pc-lowest-1").get_text()
    print(price)

test()

The data on this page is loaded by JavaScript, so BeautifulSoup alone will not see it; you would have to use Selenium to render the page. However, if you look under the Network tab in Chrome DevTools, you can see that the page loads the data from the endpoint below:
https://www.futbin.com/22/playerPrices?player=20801&rids=50352449,67129665
You can make a request directly to this endpoint and fetch the data. Here is how you can do it:
import requests

URL = 'https://www.futbin.com/22/playerPrices?player=20801&rids=50352449,67129665'
page = requests.get(URL)
x = page.json()
data = x['20801']['prices']
print(data['pc']['LCPrice'])
Output
1,349,000
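Note that the span in the question also carries the price in its data-price attribute, which is easier to clean up than the node text once you have the rendered markup. A minimal sketch, parsing the HTML snippet from the question as a static string:

```python
from bs4 import BeautifulSoup

html = '''
<span class="price_big_right">
<span id="pc-lowest-1" data-price="1,376,000">
1,376,000
<img alt="c" class="coins_icon_l_bin" src="https://example.com/123"/>
</span>
</span>
'''

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find(id='pc-lowest-1')
# the attribute holds '1,376,000'; strip the commas to get an int
price = int(tag['data-price'].replace(',', ''))
print(price)  # 1376000
```

On the live site this markup is injected by JavaScript, so this only helps once you have the rendered HTML (e.g. from Selenium) or when you use the price endpoint instead.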

Related

BeautifulSoup find() function returning None because I am not getting correct html info

I am trying to web-scrape an Amazon product page and print out the productTitle. Code below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import smtplib

def extract():
    headers = {my_info}
    url = "link"  # this link is to a webpage on Amazon
    req = requests.get(url, headers=headers)
    soup1 = BeautifulSoup(req.content, "html.parser")       # get soup1 from the website
    soup2 = BeautifulSoup(soup1.prettify(), "html.parser")  # prettify it in soup2
    # When I inspect element on the website, it shows that there is an html
    # <span> tag with id=productTitle.
    print(soup2)
    # title = soup2.find('span', id='productTitle').get_text()  # find() is returning None
    print(soup2.prettify())
I was expecting the HTML I inspected directly on the website to be the same as the HTML in soup2, but for some reason it isn't, which causes find() to return None. Why is the HTML different? Any help would be appreciated.
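One way to see what requests actually received is to search the raw response text before parsing it. A small diagnostic sketch; the captcha check is an assumption about Amazon serving a robot-check page to suspected bots, not something stated in the question:

```python
def diagnose(html_text):
    """Guess why find(id='productTitle') might return None."""
    if 'productTitle' in html_text:
        return 'present'   # the span is there; check the find() call itself
    if 'captcha' in html_text.lower():
        return 'blocked'   # a robot-check page was likely served instead
    return 'missing'       # the page requests got differs from DevTools

print(diagnose('<span id="productTitle">Some Widget</span>'))  # present
```

If the id is missing, the page you see in DevTools was built by JavaScript or served only to real browsers, and requests alone will never see it.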

Get XHR info from URL

I have this website https://www.futbin.com/22/player/7504 and I want to know if there is a way to get the XHR URL for the information using Python. For example, for the URL above I know the XHR I want is https://www.futbin.com/22/playerPrices?player=231443 (I got it from inspect element -> Network).
My objective is to get the price value from https://www.futbin.com/22/player/1 to https://www.futbin.com/22/player/10000 at once, without using inspect element one by one.
import requests

URL = 'https://www.futbin.com/22/playerPrices?player=231443'
page = requests.get(URL)
x = page.json()
data = x['231443']['prices']
print(data['pc']['LCPrice'])
print(data['ps']['LCPrice'])
print(data['xbox']['LCPrice'])
You can find the player-resource id on the player page and build the URL yourself. I use BeautifulSoup, which is made for parsing websites, but you can also feed the requests content into another HTML parser if you don't want to install BeautifulSoup.
With it, read the first URL, get the id, and use your code to pull the prices. To test, change the 10000 to 2 or 3 and you'll see it works.
import re
import requests
from bs4 import BeautifulSoup

for i in range(1, 10000):
    url = 'https://www.futbin.com/22/player/{}'.format(i)
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    player_resource = soup.find(id=re.compile('page-info')).get('data-player-resource')
    URL = 'https://www.futbin.com/22/playerPrices?player={}'.format(player_resource)
    page = requests.get(URL)
    x = page.json()
    data = x[player_resource]['prices']
    print(data['pc']['LCPrice'])
    print(data['ps']['LCPrice'])
    print(data['xbox']['LCPrice'])
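When looping over thousands of player pages, reusing a single requests.Session and failing loudly on HTTP errors makes the loop more robust. A sketch of that idea; build_price_url and fetch_prices are helper names introduced here, not part of futbin's API:

```python
import time
import requests

def build_price_url(player_resource):
    # assemble the playerPrices endpoint URL for one player resource id
    return 'https://www.futbin.com/22/playerPrices?player={}'.format(player_resource)

def fetch_prices(session, player_resource):
    resp = session.get(build_price_url(player_resource), timeout=10)
    resp.raise_for_status()  # surface 4xx/5xx instead of crashing on bad JSON
    return resp.json()[player_resource]['prices']

# usage sketch:
# with requests.Session() as s:
#     for i in range(1, 4):
#         ...find player_resource on the player page as above, then:
#         print(fetch_prices(s, player_resource)['pc']['LCPrice'])
#         time.sleep(1)  # be polite: don't hammer the server

print(build_price_url('50352449'))
```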

How to Get data-* attributes when web scraping using python requests (Python Requests Creating Some Issues)

How can I get the value of data-d1-value when using the requests library in Python? requests.get(URL) is not returning the data-* attributes of the div that are present in the original web page.
The web page is as follows:
<div id="test1" class="class1" data-d1-value="150">
180
</div>
The code I am using is:
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
d1_value = soup.find('div', {'class': "class1"})
print(d1_value)
The result I get is:
<div id="test1" class="class1">
180
</div>
When I debugged this, I found that requests.get(url) returns only the id and class attributes of the div, not the data-* attributes.
How should I modify the code to get the full value?
For a concrete example, in my case the URL is:
https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG
The div class is class="inprice1 nsecp", and the value of data-numberanimate-value is what I am trying to fetch.
Thanks in advance :)
EDIT
The website's response differs depending on how you request it. In your case, using requests, the value you are looking for is served like this:
<div class="inprice1 nsecp" id="nsecp" rel="92.75">92.75</div>
So you can get it from the rel attribute or from the text:
soup.find('div', {'class': "inprice1"})['rel']
soup.find('div', {'class': "inprice1"}).get_text()
Example
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG')
soup = BeautifulSoup(req.text, 'lxml')
print('rel: ' + soup.find('div', {'class': "inprice1"})['rel'])
print('text: ' + soup.find('div', {'class': "inprice1"}).get_text())
Output
rel: 92.75
text: 92.75
To get a response that shows the source as you see it when inspecting, you have to try Selenium.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG"
driver.get(url)
sleep(2)  # wait for the JavaScript to populate the page
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find('div', class_='inprice1 nsecp')['data-numberanimate-value'])
driver.close()
To get the attribute value, just add ['data-d1-value'] to your find():
Example
from bs4 import BeautifulSoup

html = '''
<div id="test1" class="class1" data-d1-value="150">
180
</div>
'''
soup = BeautifulSoup(html, 'lxml')
d1_value = soup.find('div', {'class': "class1"})['data-d1-value']
print(d1_value)
You are seeing this issue because you didn't retrieve the other attributes that are defined on the div. The code below retrieves all of the attributes defined on the div, including the custom data-* ones:
from bs4 import BeautifulSoup

s = '<div id="test1" class="class1" data-d1-value="150">180</div>'
soup = BeautifulSoup(s, 'html.parser')
attributes_dictionary = soup.find('div', {'class': "class1"}).attrs
print(attributes_dictionary)
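Since .attrs is a plain dict, you can also pull out just the custom data-* attributes with a comprehension. A small sketch:

```python
from bs4 import BeautifulSoup

s = '<div id="test1" class="class1" data-d1-value="150">180</div>'
soup = BeautifulSoup(s, 'html.parser')
attrs = soup.find('div', {'class': 'class1'}).attrs

# keep only the keys that start with 'data-'
data_attrs = {k: v for k, v in attrs.items() if k.startswith('data-')}
print(data_attrs)  # {'data-d1-value': '150'}
```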
You can get the data from the HTML, or you can just scrape the API.
This is an example for the Money Control website.
If you go to the developer tools in your browser and select the Network tab, you can see the requests the website makes.
In the request headers you can see the API URL: priceapi.moneycontrol.com.
This is an unusual case, because the API is open, and it usually isn't.
You can access the price: if you save the JSON response into a variable, say data, you can read it with data['data']['pricecurrent'].

Cannot pull data from pantip.com

I have been trying to pull data from pantip.com, including the title, the post story, and all the comments, using BeautifulSoup.
However, I could pull only the title and the post story; I could not get the comments.
Here is the code for the title and the post story:
import requests
import re
from bs4 import BeautifulSoup

# specify the url
url = 'https://pantip.com/topic/38372443'

# split off the topic number
topic_number = re.split('https://pantip.com/topic/', url)[1]

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# capture the title
elementTag_title = soup.find(id='topic-' + topic_number)
title = str(elementTag_title.find_all(class_='display-post-title')[0].string)

# capture the post story
resultSet_post = elementTag_title.find_all(class_='display-post-story')[0]
post = resultSet_post.contents[1].text.strip()
I tried to find the comments by id:
elementTag_comment = soup.find(id="comments-jsrender")
and I got the result below:
<div id="comments-jsrender">
  <div class="loadmore-bar loadmore-bar-paging">
    <a href="javascript:void(0)">
      <span class="icon-expand-left"><small>▼</small></span>
      <span class="focus-txt"><span class="loading-txt">กำลังโหลดข้อมูล...</span></span>
      <span class="icon-expand-right"><small>▼</small></span>
    </a>
  </div>
</div>
(The Thai text means "Loading data...".)
The question is: how can I get all the comments? Please suggest how to fix this.
The reason you're having trouble locating the rest of these posts is that the site is populated by dynamic JavaScript. To get around this you can use Selenium; see https://github.com/mozilla/geckodriver/releases for the correct driver, and add it to your system variables. Selenium will load the page, and you will have full access to all the attributes you see when inspecting; with BeautifulSoup alone, that data is never rendered into the HTML you parse.
Once you do that, you can use the following to return each post's data:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://pantip.com/topic/38372443'
driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, 'lxml')

for div in soup.find_all("div", id=lambda value: value and value.startswith("comment-")):
    if len(str(div.text).strip()) > 1:
        print(str(div.text).strip())

driver.quit()

How to scrape text in a span class with Python

So I'm making a bitcoin checker for practice, and I'm having trouble scraping data because the data I want is in a span class and I don't know how to retrieve it.
Here is the line that I got from inspect:
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
I want to scrape the "11,511.31" number. How do I do this? I have tried so many different things and I honestly have no clue what to do anymore.
Here is the URL: link
I'm scraping the current USD price (right next to "BTC/USD").
EDIT: A lot of the examples you gave me are ones where I input the data myself. That's not useful, because I want to refresh the page every 30 seconds, so I need the program to find the span class, extract the data, and print it.
EDIT: Current code. I need the program to get the "html" part by itself.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup

url = 'https://www.gdax.com/trade/BTC-USD'
# the program needs to retrieve this by itself
html = """<span class="MarketInfo_market-num_1lAXs">11,560.00 USD</span>"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
You just have to search for the right tag and class:
from bs4 import BeautifulSoup

html_text = """
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
"""
html = BeautifulSoup(html_text, "lxml")
spans = html.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
This searches for all <span> tags and then filters them by the class attribute, which in your case has the value MarketInfo_market-num_1lAXs. Once filtered, loop through the spans, retrieve the text via the .text attribute, and strip out the 'USD'.
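For a bitcoin checker you will usually want the price as a float rather than as text. One way to clean the scraped string; parse_usd is a helper added here, not part of the original answer:

```python
def parse_usd(text):
    """Convert a scraped price like ' 11,511.31 USD ' to a float."""
    return float(text.replace('USD', '').replace(',', '').strip())

print(parse_usd(' 11,511.31 USD '))  # 11511.31
```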
UPDATE
import requests
import json

url = 'https://api.gdax.com/products/BTC-USD/trades'
res = requests.get(url)
json_res = json.loads(res.text)
print(json_res[0]['price'])
There is no need to scrape the HTML. The data in that HTML tag is populated from an API call with a JSON response, so you can call that API directly. This will also keep your data current.
You can use BeautifulSoup or lxml. For BeautifulSoup, the code looks like the following:
from bs4 import BeautifulSoup

soup = BeautifulSoup("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""", "lxml")
print(soup.string)
lxml is faster:
from lxml import etree

span = etree.HTML("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""")
for i in span.xpath("//span/text()"):
    print(i)
Try a real browser with Selenium and Firefox. I tried Selenium with PhantomJS, but I failed...
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

url = 'https://www.gdax.com/trade/BTC-USD'
driver = webdriver.Firefox(executable_path='./geckodriver')
driver.get(url)
sleep(10)  # sleep 10 seconds while waiting for the page to load...
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
driver.close()
Output:
11,493.00
+
3.06 %
13,432 BTC
