Cannot pull data from pantip.com - python

I have been trying to pull data from pantip.com, including the title, the post story, and all comments, using BeautifulSoup.
However, I could only pull the title and the post story. I could not get the comments.
Here is the code for the title and post story:
import requests
import re
from bs4 import BeautifulSoup
# specify the url
url = 'https://pantip.com/topic/38372443'
# Split Topic number
topic_number = re.split('https://pantip.com/topic/', url)
topic_number = topic_number[1]
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Capture title
elementTag_title = soup.find(id = 'topic-'+ topic_number)
title = str(elementTag_title.find_all(class_ = 'display-post-title')[0].string)
# Capture post story
resultSet_post = elementTag_title.find_all(class_ = 'display-post-story')[0]
post = resultSet_post.contents[1].text.strip()
I tried to find the comments by id:
elementTag_comment = soup.find(id = "comments-jsrender")
I got the result below:
elementTag_comment =
<div id="comments-jsrender">
<div class="loadmore-bar loadmore-bar-paging"> <a href="javascript:void(0)">
<span class="icon-expand-left"><small>▼</small></span> <span class="focus-
txt"><span class="loading-txt">กำลังโหลดข้อมูล...</span></span> <span
class="icon-expand-right"><small>▼</small></span> </a> </div>
</div>
The question is: how can I get all the comments? Please suggest how to fix this.

The reason you're having trouble locating the rest of these posts is that the site is populated by dynamic JavaScript. To get around this you can use Selenium; see here how to get the correct driver and add it to your system variables: https://github.com/mozilla/geckodriver/releases . Selenium will load the page and you will have full access to all the attributes you see in your browser; with Beautiful Soup alone that data is never parsed.
Once you do that you can use the following to return each of the posts data:
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://pantip.com/topic/38372443'
driver = webdriver.Firefox()
driver.get(url)
content=driver.page_source
soup=BeautifulSoup(content,'lxml')
for div in soup.find_all("div", id=lambda value: value and value.startswith("comment-")):
    if len(str(div.text).strip()) > 1:
        print(str(div.text).strip())
driver.quit()


Collect data with BeautifulSoup

I'm trying to get the price value out, but I cannot; it always returns a "-".
I want it to return the 1,376,000.
HTML
<span class="price_big_right">
<span id="pc-lowest-1" data-price="1,376,000">
1,376,000
<img alt="c" class="coins_icon_l_bin" src="https://example.com/123"/>
</span>
</span>
Here is what I'm trying to do:
import requests
from bs4 import BeautifulSoup
import smtplib
URL = 'https://www.futbin.com/22/player/426'
def test():
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="pc-lowest-1").get_text()
    # price
    print(price)

test()
The data on this page is loaded by JavaScript, so Beautiful Soup alone will not see it; you would normally have to use Selenium to get the data.
However, if you look under the Network tab in Chrome DevTools, the page loads the data from the endpoint below:
https://www.futbin.com/22/playerPrices?player=20801&rids=50352449,67129665
You can make a request directly to this endpoint and fetch the data.
Here is how you can do it:
import requests
URL = 'https://www.futbin.com/22/playerPrices?player=20801&rids=50352449,67129665'
page = requests.get(URL)
x = page.json()
data = x['20801']['prices']
print(data['pc']['LCPrice'])
1,349,000

Extract a value from html using BeautifulSoup

I'm trying to retrieve a value from this HTML using bs4. I'm really new to data scraping and I have tried to figure out some ways to get this value but to no avail. The closest solution I saw is this one.
Extracting a value from html table using BeautifulSoup
Here is the HTML I am looking at:
<div class="dataItem_hld clearfix">
<div class="smalltxt">ROE</div>
<div name="tixStockRoe" class="value">121.362</div>
</div>
I've tried this so far:
from bs4 import BeautifulSoup as BS
import requests
url = "https://www.bursamarketplace.com/mkt/themarket/stock/SUPM"
html_content = requests.get(url).text
soup = BS(html_content, 'lxml')
val = soup.find_all('div', {'name': "tixStockRoe", 'class':"value"})
Before I even try to use strip() to get the value, my val variable is empty:
In [96]: val
Out[96]: []
I've been searching the posts for a few hours, but I have not managed to write the correct code to get the value yet.
Also, please let me know if there are any good sources to learn about extracting data. Thanks
Update
I have edited the code thanks to the responses to this post. Now I encounter another problem: the number 121.362 does not appear in the variable. Any ideas?
val = soup.find_all(attrs={'name': "tixStockRoe"})
and the output is this:
Out[14]: [<div class="value" name="tixStockRoe"><div class="loader loaderSmall"><div class="loader_hld"><img alt="" src="/img/loading.gif"/></div></div></div>]
The data on that page is loaded by JavaScript, and that is the reason you aren't finding the data you are looking for (121.362) with Beautiful Soup.
Beautiful Soup only sees the static HTML the server returns.
You need to use Selenium to load the page and then get the data.
Here is how you scrape it using Selenium:
import time
from bs4 import BeautifulSoup, Tag, NavigableString
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome("chromedriver.exe", options=options)
url = 'https://www.bursamarketplace.com/mkt/themarket/stock/SUPM'
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
d = soup.find('div', attrs= {'name': 'tixStockRoe'})
print(d.text.strip())
121.362
Check the docs: you can't use name as a keyword argument to search for HTML's "name" attribute, because find_all() already uses name for the tag name. Pass it through attrs instead:
soup.find_all(attrs={"name": "tixStockRoe"})
You can try this :
# I do not know lxml, in my case this parser is working
soup = BS(page, 'html.parser')
val = soup.find_all('div', attrs={'name': "tixStockRoe", 'class':"value"})
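As a self-contained check, the attrs-based lookup does find the value when run against the static snippet from the question (on the live page the number is filled in by JavaScript, so this only works on the snippet itself):

```python
from bs4 import BeautifulSoup

# the static snippet from the question, inlined so the example runs on its own
html = '''
<div class="dataItem_hld clearfix">
<div class="smalltxt">ROE</div>
<div name="tixStockRoe" class="value">121.362</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# "name" clashes with find_all's first parameter (the tag name),
# so it has to go through attrs
val = soup.find_all('div', attrs={'name': 'tixStockRoe', 'class': 'value'})
print(val[0].get_text())  # 121.362
```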

How to Get data-* attributes when web scraping using python requests (Python Requests Creating Some Issues)

How can I get the value of data-d1-value when using Python's requests library?
requests.get(URL) by itself does not return the data-* attributes of the div, even though they are present in the original web page.
The web page is as follows:
<div id="test1" class="class1" data-d1-value="150">
180
</div>
The code I am using is:
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
d1_value = soup.find('div', {'class':"class1"})
print(d1_value)
The result I get is:
<div id="test1" class="class1">
180
</div>
When I debugged this, I found that requests.get(URL) is not returning the full div, only the id and class, not the data-* attributes.
How should I modify it to get the full value?
To give a better example, in my case the URL is:
https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG
The div class is class="inprice1 nsecp", and the value of data-numberanimate-value is what I am trying to fetch.
Thanks in advance :)
EDIT
The website's response differs depending on how you request it. In your case, using requests, the value you are looking for is served like this:
<div class="inprice1 nsecp" id="nsecp" rel="92.75">92.75</div>
So you can get it from the rel or from the text:
soup.find('div', {'class':"inprice1"})['rel']
soup.find('div', {'class':"inprice1"}).get_text()
Example
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG')
soup = BeautifulSoup(req.text, 'lxml')
print('rel: '+soup.find('div', {'class':"inprice1"})['rel'])
print('text: '+soup.find('div', {'class':"inprice1"}).get_text())
Output
rel: 92.75
text: 92.75
To get a response that shows the source as you see it when inspecting the page, you have to use Selenium.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG"
driver.get(url)
sleep(2)
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find('div', class_='inprice1 nsecp')['data-numberanimate-value'])
driver.close()
To get the attribute value, just add ['data-d1-value'] to your find().
Example
from bs4 import BeautifulSoup
html='''
<div id="test1" class="class1" data-d1-value="150">
180
</div>
'''
soup = BeautifulSoup(html, 'lxml')
d1_value = soup.find('div', {'class':"class1"})['data-d1-value']
print(d1_value)
You are seeing this issue because you didn't retrieve all of the other attributes that were defined on the div.
The code below retrieves all of the custom attributes defined on the div as well:
from bs4 import BeautifulSoup
s = '<div id="test1" class="class1" data-d1-value="150">180</div>'
soup = BeautifulSoup(s, 'html.parser')
attributes_dictionary = soup.find('div',{'class':"class1"}).attrs
print(attributes_dictionary)
You can get the data from the HTML, or you can scrape the API directly.
An example for this website (Money Control): if you open the developer tools in your browser and select the Network tab, you can see the requests the website makes.
In the request headers you can see the API URL: priceapi.moneycontrol.com.
This is an unusual case, because the API is open, and usually it isn't.
You can access the price directly: if you save the JSON data into a variable called json, the current price is under its data.pricecurrent field.
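One detail worth noting: data.pricecurrent is JavaScript-style notation. In Python you would parse the body with the json module and index with brackets. A minimal sketch, where the JSON shape and sample value are assumed from the field name mentioned above:

```python
import json

# hypothetical response body, shaped after the data.pricecurrent field named above
body = '{"data": {"pricecurrent": "92.75"}}'
payload = json.loads(body)
# dict indexing, not attribute access
print(payload["data"]["pricecurrent"])  # 92.75
```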

Python BeautifulSoup Printing info from multiple tags div class

I am trying to create a Python script to collect info from a website. I am trying to work out how I can extract information from two or three div tags and print it.
An example of the HTML:
<div class="PowerDetails">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
I have managed to get one of them by running a for loop, but I am trying to get RunningCost and Time side by side.
Here is my Python script; I'm new to this, so I am playing around and trying a few different things:
import bs4, requests, time
while True:
    url = "https://www.website.com"
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    #soupTitle = soup.select('.RunningCost')
    soupDetail = soup.select('.Time')
    for soupDetailList in soupDetail:
        print(soupDetailList.text)
The end goal for this script is a web monitor that lists changes/updates.
zip should do the job.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html_text>" , "html.parser")
div = soup.find("div")
for r, t in zip(div.find_all("p", {"class": "RunningCost"}),
                div.find_all("p", {"class": "Time"})):
    print(r.string, t.string)
$4.44 peek
$2.33 Off-peek
Assuming that soup is the parsed HTML:
PowerDetails = soup.find("div")
RunningCost = PowerDetails.find_all("p", class_="RunningCost")
Time = PowerDetails.find_all("p", class_="Time")
(Note the keyword argument is class_, not _class.)
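Putting the pieces together on the question's own snippet, a runnable sketch that prints each cost next to its time label:

```python
from bs4 import BeautifulSoup

# the snippet from the question, inlined so the example is self-contained
html = """
<div class="PowerDetails">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="PowerDetails")
# collect the two parallel lists, then pair them up positionally
costs = [p.get_text(strip=True) for p in div.find_all("p", class_="RunningCost")]
times = [p.get_text(strip=True) for p in div.find_all("p", class_="Time")]
for cost, time_label in zip(costs, times):
    print(cost, time_label)  # "$4.44 peek", then "$2.33 Off-peek"
```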

How to extract the href attribute value from an a tag with beautiful soup?

This is the part of the HTML that I am extracting on the platform, and it contains the snippet I want to get: the value of the href attribute of the a tag with the class "bookTitle".
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
After logging in using the mechanize library, I have this piece of code to try to extract it, but it returns the name of the book instead. I have tried several ways to get only the href value, but none have worked so far.
from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import Downloader as dw
import requests
def getGenders(browser: mc.Browser, url: str, name: str) -> None:
    res = browser.open(url)
    aux = res.read()
    html2 = bs4(aux, 'html.parser')
    with open(name, "w", encoding='utf-8') as file2:
        file2.write(str(html2))

getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")

with open("gendersBooks.html", "r", encoding='utf8') as file:
    contents = file.read()
    bsObj = bs4(contents, "lxml")
    aux = open("books.text", "w", encoding='utf8')
    officials = bsObj.find_all('a', {'class': 'booktitle'})
    for text in officials:
        print(text.get_text())
        aux.write(text.get_text().format())
    aux.close()
    file.close()
Can you try this? (Sorry if it doesn't work; I am not at a PC with Python right now.)
for text in officials:
    print(text['href'])
BeautifulSoup works just fine with the HTML code that you provided. If you want the text of a tag, you simply use .text; if you want the href, you use .get('href'), or, if you are sure the tag has an href value, you can use ['href'].
Here is a simple example easy to understand with your html code snipet.
from bs4 import BeautifulSoup
html_code = '''
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
'''
soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})
# - Book Title -
title = tag.text
print(title)
# - Href Link -
href = tag.get('href')
print(href)
I don't know why you downloaded the HTML, saved it to disk, and then opened it again. If you just want some tag values, that round trip is unnecessary: you can save the HTML to a variable and pass that variable to BeautifulSoup.
I also see that you imported the requests library but used mechanize instead; as far as I know, requests is the easiest and most modern library for getting data from web pages in Python. You also imported Session from requests; a session is not necessary unless you want to make multiple requests and keep the connection to the server open for faster subsequent requests.
Also, when you open a file with the with statement you are using a Python context manager, which handles closing the file for you, meaning you don't have to close the file at the end.
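For instance, a file opened through with is closed automatically as soon as the block exits, even if an error is raised inside it:

```python
# the context manager closes the file when the with block ends
with open("books_demo.txt", "w", encoding="utf-8") as f:
    f.write("Ways of Seeing\n")
print(f.closed)  # True - no explicit f.close() needed
```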
So, simplified and without saving the downloaded HTML to disk, I would write your code like this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.goodreads.com/shelf/show/art/gendersBooks.html'
html_source = requests.get(url).content
soup = BeautifulSoup(html_source, 'html.parser')
# - To get the tag that we want -
tag = soup.find('a', {'class': 'bookTitle'})
# - Extract Book Title -
title = tag.text
# - Extract href from Tag -
href = tag.get('href')
Now, if there are multiple a tags with the same class name, get all the a tags first:
a_tags = soup.find_all('a', {'class': 'bookTitle'})
Then scrape each tag's info and append each book to a books list:
books = []
for a in a_tags:
    try:
        title = a.text
        href = a.get('href')
        books.append({'title': title, 'href': href})  # <-- add each book dict to books list
        print(title)
        print(href)
    except Exception:
        pass
To understand your code better, I advise you to read these related links:
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests:
https://requests.readthedocs.io/en/master/
Python Context Manager:
https://book.pythontips.com/en/latest/context_managers.html
https://effbot.org/zone/python-with-statement.htm
