how to get text of the latest post with BeautifulSoup, select()

how to get text of the latest post with BeautifulSoup, select() - python

I'd like to get the latest posts text using BeautifulSoup and select() method.
import requests
from bs4 import BeautifulSoup
headers = 'User-Agent':'Mozilla/5.0'
url = "https:// "
req = requests.get(url, headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
link = soup.select('#flagList > div.clear.ab-webzine > div > a')
title = soup.select('#flagList > div.clear.ab-webzine > div > div.wz-item-header > a > span')
latest_link = link[0] # link of latest post
latest_title = title[0].text # title of latest post
# to get the text of latest post
t_url = latest_link
t_req = requests.get(t_url, headers=headers)
t_html = c_res.text
t_soup = BeautifulSoup(t_html, 'html.parser')
maintext = t_soup.select ('#flagArticle > div.document_1234567_0.rhymix_content.xe_content')
print(maintext)
It returns [].
I copied #flagArticle > div.document_1234567_0.rhymix_content.xe_content from chrome developer tools on the posts. so it has specific post number "1234567"
But I want the text of "latest post" not certain post.
So I changed it to just #flagArticle
And it returns as below.
[<article id="flagArticle">
<!--BeforeDocument(1234567,0)-->
<div class="document_1234567_0 rhymix_content xe_content"><p>TEXTTEXTTEXT 1</p>
<p>TEXTTEXTTEXT 2</p>
<p>TEXTTEXTTEXT 3</p></div><!--AfterDocument(1234567,0)-->
<!--
-- color class --
vb-white
vb-green
vb-blue
vb-skyblue
vb-orange
vb-red
-->
<div class="vote">
<button class="vb-btn vb-orange" onclick="vote_doVote('Up','1234567');return false;" type="button">
<span class="lang">
<i class="fas fa-star fa-spin fa-fw"></i>
recommended </span>
<span class="num" id="vm_v_count">
4 </span>
</button> <button class="vb-btn vb-skyblue" onclick="vote_doVote('Declare','1234567');return false;" type="button">
<span class="lang">
<i class="fa fa-times-circle"></i>
report </span>
<span class="num" id="vm_d_count">
</span>
</button></div> </article>]
But I want to get
TEXTTEXTTEXT 1
TEXTTEXTTEXT 2
TEXTTEXTTEXT 3
What should I change?
(I can't share the URL because it's private site)

Just get the first div.
from bs4 import BeautifulSoup
data = '''\
<article id="flagArticle">
<!--BeforeDocument(1234567,0)-->
<div class="document_1234567_0 rhymix_content xe_content"><p>TEXTTEXTTEXT 1</p>
<p>TEXTTEXTTEXT 2</p>
<p>TEXTTEXTTEXT 3</p></div><!--AfterDocument(1234567,0)-->
<!--
-- color class --
vb-white
vb-green
vb-blue
vb-skyblue
vb-orange
vb-red
-->
<div class="vote">
<button class="vb-btn vb-orange" onclick="vote_doVote('Up','1234567');return false;" type="button">
<span class="lang">
<i class="fas fa-star fa-spin fa-fw"></i>
recommended </span>
<span class="num" id="vm_v_count">
4 </span>
</button> <button class="vb-btn vb-skyblue" onclick="vote_doVote('Declare','1234567');return false;" type="button">
<span class="lang">
<i class="fa fa-times-circle"></i>
report </span>
<span class="num" id="vm_d_count">
</span>
</button></div> </article>
'''
soup = BeautifulSoup(data, 'html.parser')
div = soup.select_one('#flagArticle div.xe_content.rhymix_content')
for p in div.select('p'):
print(p.text)

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[10]))
IndexError: list index out of range
Then i tried to print the list like below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.

.question-summary is incorrect locator because it's a portion of id meaning each id value start with question-summary. Now it's working.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on

GitHub get commits number using python and beautiful soup

I'm trying to get the number of commits of github repos using python and beautiful soup
html code:
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>
my code:
r = requests.get(source_code_link)
soup = bs(r.content, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
number = span.select_one('strong')
sometimes works but sometimes no because there are more then one span tag with class d-none d-sm-inline.
how can i solve ?

Here's an approach using list commits from GitHub's REST API
import requests
user = ... # username or organisation
repo = ... # repository name
response = requests.get(f"https://api.github.com/repos/{user}/{repo}/commits")
if response.ok:
ncommits = len(response.json())
else:
raise ValueError(f"error: {response.url} responded {response.reason}")
print(ncommits)

from bs4 import BeautifulSoup as bs
html="""<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>23</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>"""
I have combine both example which look up for tag strong and based on that prints data by using .contents method
soup = bs(html, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
for tag in span.contents:
if tag.name=="strong" :
print(tag.get_text())
using list comprehension :
for span in spans:
data=[tag for tag in span.contents if tag.name=="strong"]
print(data[0].get_text())
Ouput for both case:
26
23

You can use the find_next() method to look for a <strong> after the class d-none d-sm-inline.
In your case:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", class_="d-none d-sm-inline"):
print(tag.find_next("strong").text)

Parsing HTML with BeatifulSoup class == AND title CONTAINS

I am trying to parse the following HTML:
<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
,
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>
I am trying to get the 'id' where the title contains 'Blue' AND the item is not sold.
I have tried:
soup.find_all("a",href=re.compile("Blue"),class_="")
links = soup.find_all("a", href=re.compile("Blue", "Add To Cart"))
ids = [tag["id"] for tag in soup.find_all("a", href=re.compile("Blue"))]
But it is not returning the info I'm looking for.
I would like it to return:
AddToCartSimple-3593

I think your html is corrupted. You can do the entire filtering with css selectors using :has, :not, and :contains (:-soup-contains - latest soupsieve), along with attribute = value selectors. The ^ is a starts with operator, meaning attribute value starts with the string after the =. The ~ is a general sibling combinator and the > is a child combinator. This means looking for a sibling with class (.) tocart and then a child with id that starts with AddToCartSimple-, but that doesn't have text containing SOLD displayed. Less specific than !="SOLD" , as it can be a partial string exclusion. Depends on observed variation in actual data.
from bs4 import BeautifulSoup as bs
html ='''
<div class="product-details">
<h4 class="title">Blue - Standard</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a>
</div>
<div class="product-details">
<h4 class="title">Blue - Wide</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a>
</div>
'''
soup = bs(html, 'html.parser')
print(soup.select_one('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')['id'])
You should check there was a match before accessing with ['id'] of course. You could also go for all matches as follows:
[i['id'] for i in soup.select('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')]

To get the data where the "title" contains "Blue" and the item is not "SOLD":
Use a CSS selector .product-details > h4 a[title*='Blue'] which will select all a where the title=Blue under an h4 under the class product-details
Find the next div using the find_next() method, and check that the text is not "SOLD".
Print the next div's id
from bs4 import BeautifulSoup
html = """<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select(".product-details > h4 a[title*='Blue']"):
if tag.find_next("div").text != "SOLD":
print(tag.find_next("div")["id"])
Output:
AddToCartSimple-3593

BeautifulSoap get multiple element for all img in a div with specific class

I am trying to get the links in image-file attribute (relative link as it is) in img tags under div with id previewImages (I don't want the src link).
Here is the sample HTML:
<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
I tried the following but it only gives me the first link and not all:
import sys
import urllib2
from bs4 import BeautifulSoup
quote_page = sys.argv[1] # this should be the first argument on the command line
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
images_box = soup.find('div', attrs={'id': 'previewImages'})
if images_box.find('img'):
imagesurl = images_box.find('img').get('image-file')
print imagesurl
How can I get all the links in image-file attritube for img tags in div with class previewImages?

Use .findAll
Ex:
from bs4 import BeautifulSoup
html = """<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
images_box = soup.find('div', attrs={'id': 'previewImages'})
for link in images_box.findAll("img"):
print link.get('image-file')
Output:
/image/15.jpg
/image/2.jpg
/image/0.jpg
/image/3.jpg
/image/4.jpg

I think it faster to use id with attribute selector passed to select
from bs4 import BeautifulSoup as bs
html = '''
<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
'''
soup = bs(html, 'lxml')
links = [item['image-file'] for item in soup.select('#previewImages [image-file]')]
print(links)

BeautifulSoup has method .find_all() - check the docs. This is how you can use it in your code:
import sys
import urllib2
from bs4 import BeautifulSoup
quote_page = sys.argv[1] # this should be the first argument on the command line
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
images_box = soup.find('div', attrs={'id': 'previewImages'})
links = [img['image-file'] for img in images_box('img')]
print links # in Python 3: print(links)

To Add up if in case we have do the same scenario with lxml,
import lxml.html
tree = lxml.html.fromstring(sample)
images = tree.xpath("//img/#image-file")
print(images)
Output
['/image/15.jpg', '/image/2.jpg', '/image/0.jpg', '/image/3.jpg', '/image/4.jpg']

python Beautifulsoup count element which has no content

How to count element which has no content?
By saying element which has no content, I mean <div class="myclass" id="myid"></div>
Here is the code I wrote with attempting to achieve the goal:
from bs4 import BeautifulSoup
html_doc = """
<dl>
<dt class="details-row-7">Overall</dt>
<dd id="c0r11" class=" alternate details-row-7">
<div class="mobile-headings">Overall</div>
<div class="mobile-value">
<div class="ca-rating-star" data-size="1"><i class="icon-star icon-1x" style="color: #FF9900"></i>
<i class="icon-star icon-1x" style="color: #FF9900"></i>
<i class="icon-star icon-1x" style="color: #FF9900"></i>
<i class="icon-star icon-1x" style="color: #FF9900"></i>
<i class="icon-star-empty icon-1x" style="color: #FF9900"></i>
</div>
</div>
</dd>
</dl>
"""
soup = BeautifulSoup(html_doc)
ele = soup.find("dd", {"id": "c0r11"}, {"class": "alternate details-row-7"})
if ele.find(text=False):
con_str = ele.find("div", {"class":"mobile-value"})
if con_str.find(text=False):
star_ele = con_str.find("div", {"class":"ca-rating-star"})
if star_ele.find(text=False):
star = star_ele.find_all("i", {"class":"icon-star icon-1x"})
i = 0
for s in star:
if s.find(text=False):
i += 1
print(i)
But the result is 0.....

I answered your question in a gist here.
https://gist.github.com/greatghoul/c2fab58e798a91a736a4

The problem is you're looking for children of the <i> elements where text=False when you say s.find(text=False), but the <i> tags don't have children. You want to see if the <i> tags themselves have empty text. So replace s.find(text=False) with s.get_text() == "".
from bs4 import BeautifulSoup
html_doc = """
<dl>
<dt class="details-row-7">Overall</dt>
<dd id="c0r11" class=" alternate details-row-7">
<div class="mobile-headings">Overall</div>
<div class="mobile-value">
<div class="ca-rating-star" data-size="1"><i class="icon-star icon-1x" style="color: #FF9900"></i>
<i class="icon-star icon-1x" style="color: #FF9900"></i>
<i class="icon-star icon-1x" style="color: #FF9900"></i>
<i class="icon-star icon-1x" style="color: #FF9900"></i>
<i class="icon-star-empty icon-1x" style="color: #FF9900"></i>
</div>
</div>
</dd>
</dl>
"""
soup = BeautifulSoup(html_doc)
ele = soup.find("dd", {"id": "c0r11"}, {"class": "alternate details-row-7"})
if ele.find(text=False):
con_str = ele.find("div", {"class":"mobile-value"})
if con_str.find(text=False):
star_ele = con_str.find("div", {"class":"ca-rating-star"})
if star_ele.find(text=False):
star = star_ele.find_all("i", {"class":"icon-star icon-1x"})
i = 0
for s in star:
if s.get_text() == "": # CHANGE ON THIS LINE
i += 1
print(i)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

how to get text of the latest post with BeautifulSoup, select() - python

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

GitHub get commits number using python and beautiful soup

Parsing HTML with BeatifulSoup class == AND title CONTAINS

BeautifulSoap get multiple element for all img in a div with specific class

python Beautifulsoup count element which has no content

Categories

Resources