Python - Filter BS4 content

Current code:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')
print(bs4_content)
The only content I manage to get is:
[<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>]
I'm trying to get only the content between the <strong> tags.
Thank you, all help much appreciated.

You can use an inner .find_all on each matched element:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')  # find_all returns a list of matching tags
for tag in bs4_content:
    for strong in tag.find_all('strong'):
        print(strong.text)
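If the page only ever has one such element, a shorter option (a minimal sketch using the question's sample markup) is find plus attribute access:
import bs4
html = '<p class="user-post-count">This user has made <strong>5 posts</strong></p>'
soup = bs4.BeautifulSoup(html, 'html.parser')
post_count = soup.find(class_='user-post-count')
if post_count and post_count.strong:  # guard in case the element or <strong> is missing
    print(post_count.strong.text)  # 5 posts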

Try using a CSS Selector .user-post-count strong, which selects the <strong> tags under the user-post-count class.
from bs4 import BeautifulSoup
html = '''<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>
'''
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select('.user-post-count strong'):
    print(tag.text)
Output:
5 posts

Related

How to print the text that's inside a class using BS4?

HTML
<div class="QRiHXd">
"Some very secret link" <<< This is the content I want to print out / btw is a link
</div>
Code
import requests
import urllib
import bs4
url = 'https://www.reddit.com/' # There is actually another link
url_contents = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(url_contents, "html.parser")
div = soup.find('div', {'class_': 'QRiHXd'})
content = str(div)
print(content)
I need to print the text that's inside the element, but when I try to print it, it returns "None" and I do not know why.
find returns None here because the attrs dictionary key should be 'class', not 'class_' (class_ is only used as a keyword argument). Once the tag is found, you can use its .text attribute to get the text:
from bs4 import BeautifulSoup
html_doc = """<div class="QRiHXd">Some very secret link</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find('div', {'class': 'QRiHXd'})
print(div.text) # Some very secret link
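Equivalently (a small sketch of the same lookup), the class_ keyword argument can be passed directly to find instead of an attrs dictionary:
from bs4 import BeautifulSoup
html_doc = """<div class="QRiHXd">Some very secret link</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find('div', class_='QRiHXd')  # class_ keyword form, equivalent to {'class': 'QRiHXd'}
if div is not None:  # guard in case the element is missing
    print(div.get_text(strip=True))  # Some very secret link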

Retrieve HTML tag content using BeautifulSoup

I'm trying to get the plain text of a website article using Python. I've heard about the BeautifulSoup library, but how do I retrieve a specific tag from an HTML page?
This is what I have done:
base_url = 'http://www.nytimes.com'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
Look at this:
import bs4 as bs
import requests as rq
html = rq.get('http://site.com')  # requests needs a full URL including the scheme
s = bs.BeautifulSoup(html.text, features="html.parser")
div = s.find('div', {'class': 'yourclass'})  # or search by id
print(div.text)  # print the tag's text
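Since the question asks for the article's plain text, a short sketch building on the snippet above (yourclass is still a placeholder for the real container class) is to join the text of the <p> tags inside that div:
article = s.find('div', {'class': 'yourclass'})  # placeholder class for the article container
if article is not None:
    text = '\n'.join(p.get_text(strip=True) for p in article.find_all('p'))
    print(text)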

How to extract links from a page using Beautiful Soup

I have an HTML page with multiple divs like:
<div class="post-info-wrap">
<h2 class="post-title">sample post – example 1 post</h2>
<div class="post-meta clearfix">
<div class="post-info-wrap">
<h2 class="post-title">sample post – example 2 post</h2>
<div class="post-meta clearfix">
and I need to get the link for every div with the class post-info-wrap. I am new to BeautifulSoup.
So I need these URLs:
https://www.example.com/blog/111/this-is-1st-post/
https://www.example.com/blog/111/this-is-2nd-post/
and so on...
I have tried:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
    print(link.find('a').attrs['href'])
This code doesn't seem to be working. I am new to Beautiful Soup. How can I extract the links?
i.find('a', href=True) does not always return an anchor tag; it may return None. So you need to check whether link is None and, if it is, continue the loop; otherwise read its href value.
Scrape link by url:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Scrape link by HTML:
from bs4 import BeautifulSoup
html = '''<div class="post-info-wrap"><h2 class="post-title">sample post – example 1 post</h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title">sample post – example 2 post</h2><div class="post-meta clearfix">'''
soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Update (if the links are rendered by JavaScript, load the page with Selenium first):
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")
soup = BeautifulSoup(driver.page_source, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Output:
https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/
For the Chrome browser:
http://chromedriver.chromium.org/downloads
Install the web driver for the Chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Here '/usr/bin/chromedriver' is the Chrome webdriver path.
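Note that in newer Selenium versions (4+), passing the driver path positionally is deprecated; a minimal sketch of the Service-based form (assuming the same chromedriver path) is:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
service = Service('/usr/bin/chromedriver')  # same chromedriver path as above
driver = webdriver.Chrome(service=service)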
You can use soup.find_all:
from bs4 import BeautifulSoup as soup
r = [i.a['href'] for i in soup(html, 'html.parser').find_all('div', {'class':'post-info-wrap'})]
Output:
['https://www.example.com/blog/111/this-is-1st-post/', 'https://www.example.com/blog/111/this-is-2nd-post/']
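A slightly more defensive variant of the same comprehension (a sketch, in case some divs lack an anchor or an href) skips entries without a link:
r = [i.a['href']
     for i in soup(html, 'html.parser').find_all('div', {'class': 'post-info-wrap'})
     if i.a is not None and i.a.has_attr('href')]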

Pull out hrefs with BeautifulSoup attrs

I am trying something new: pulling out all the hrefs in the <a> tags. It isn't pulling out the hrefs though, and I can't figure out why.
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
for href in soup.findAll('a'):
    h = href.attrs['href']
    print(h)
You should check whether the key exists, since an <a> tag may not have an href attribute.
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
print(page.text)  # inspect the raw HTML if needed
soup = BeautifulSoup(page.text, 'html.parser')
for a in soup.findAll('a'):
    if 'href' in a.attrs:
        print(a.attrs['href'])
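Alternatively (a small sketch of the same idea on the soup built above), BeautifulSoup can filter directly to anchors that actually carry an href:
for a in soup.find_all('a', href=True):  # only anchors that have an href attribute
    print(a['href'])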

Get value of span tag using BeautifulSoup

I have a number of facebook groups that I would like to get the count of the members of. An example would be this group: https://www.facebook.com/groups/347805588637627/
I have looked at inspect element on the page and it is stored like so:
<span id="count_text">9,413 members</span>
I am trying to get "9,413 members" out of the page. I have tried using BeautifulSoup but cannot work it out.
Thanks
Edit:
from bs4 import BeautifulSoup
import requests
url = "https://www.facebook.com/groups/347805588637627/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
span = soup.find("span", id="count_text")
print(span.text)
In case there is more than one span tag in the page:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_input, 'html.parser')
span = soup.find("span", id="count_text")
span.text
You can use the text attribute of the parsed span:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<span id="count_text">9,413 members</span>', 'html.parser')
>>> soup.span
<span id="count_text">9,413 members</span>
>>> soup.span.text
'9,413 members'
If you have more than one span tag, you can try this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
for tag in tags:
    print(tag.contents[0])
Facebook renders the page with JavaScript to prevent bots from scraping it. You need to use Selenium to extract the data in Python.
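A minimal Selenium sketch (assuming chromedriver is available and the count_text id still exists on the page; Facebook may also require login before showing the group) would be:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://www.facebook.com/groups/347805588637627/")
soup = BeautifulSoup(driver.page_source, "html.parser")
span = soup.find("span", id="count_text")
if span is not None:  # the id may be absent if the layout changed or login is required
    print(span.text)
driver.quit()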
