Retrive html tag content using beautifulSoup

Retrive html tag content using beautifulSoup - python

I'm trying to get the plain text of a website article using python. I've heard about the BeautifulSoup library, but how to retrieve a specific tag in html page?
This is what I have done:
base_url = 'http://www.nytimes.com'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")

Look this:
import bs4 as bs
import requests as rq
html = rq.get('site.com')
s = bs.BeautifulSoup(html.text, features="html.parser")
div = s.find('div', {'class': 'yourclass'}) # or id
print(str(div.text)) # print text

Related

Python - Filter BS4 content

Current code:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')
print(bs4_content)
Only content I manage to get is
[<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>]
I'm trying to only get the content between the strong tags.
Thank you, all help much appreciated

You can use inner .find_all
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')
for strong in bs4_content.find_all('strong'):
print(strong.text)

Try using a CSS Selector .user-post-count strong, which selects the <strong> tags under the user-post-count class.
from bs4 import BeautifulSoup
html = '''<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>
'''
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select('.user-post-count strong'):
print(tag.text)
Output:
5 posts

How to find all classes under table tag in python using web scraping library beautiful soup

import requests
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")
from bs4 import BeautifulSoup
soup.table["class"]

Add this and you will find the class of table in that page.
soup = BeautifulSoup(req.content, 'html.parser')
soup.table["class"]
Result:
['infobox', 'vcard']

How to receive website link in Python using BeautifulSoup

I want to collect the link : /hmarchhak/102217 from a site (https://www.vanglaini.org/) and print it as https://www.vanglaini.org/hmarchhak/102217. Please help
Img
import requests
import pandas as pd
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
headline = article.a.text
summary=article.p.text
link = article.a.href
print(headline)
print(summary)
print(link)
print()
This is my code.

Unless I am missing something headline and summary appear to be the same text. You can use :has with bs4 4.7.1+ to ensure your article has a child href; and this seems to strip out article tag elements that are not part of main body which I suspect is actually your aim
from bs4 import BeautifulSoup as bs
import requests
base = 'https://www.vanglaini.org'
r = requests.get(base)
soup = bs(r.content, 'lxml')
for article in soup.select('article:has([href])'):
headline = article.h5.text.strip()
summary = re.sub(r'\n+|\r+',' ',article.p.text.strip())
link = f"{base}{article.a['href']})"
print(headline)
print(summary)
print(link)

How to get text following a table/span with BeautifulSoup and Python?

I need to get the text 2,585 shown in the screenshot below. I very new to coding, but this is what i have so far:
import urllib2
from bs4 import BeautifulSoup
url= 'insertURL'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
span = soup.find('span', id='d21475972e793-wk-Fact -8D34B98C76EF518C788A2177E5B18DB0')
print (span.text)
Any info is helpful!! Thanks.
Website HTML

3 things, your using requests not urllib2. Your selecting XML with namespaces so you need to use xml as the parser. The element you want is not span it is ix:nonFraction. Here is a working example using another web-page (you just need to point it at your page and use the commented line).
# Using requests no need for urllib2.
import requests
from bs4 import BeautifulSoup
# Using this page as an example.
url= 'https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004.txt'
r = requests.get(url)
data = r.text
# use xml as the parser.
soup = BeautifulSoup(data, 'xml')
ix = soup.find('ix:nonFraction', id="Fact-7365D69E1478B0A952B8159A2E39B9D8-wk-Fact-7365D69E1478B0A952B8159A2E39B9D8")
# Your original code for your page.
# ix = soup.find('ix:nonFraction', id='d21475972e793-wk-Fact-8D34B98C76EF518C788A2177E5B18DB0')
print (ix.text)

Get value of span tag using BeautifulSoup

I have a number of facebook groups that I would like to get the count of the members of. An example would be this group: https://www.facebook.com/groups/347805588637627/
I have looked at inspect element on the page and it is stored like so:
<span id="count_text">9,413 members</span>
I am trying to get "9,413 members" out of the page. I have tried using BeautifulSoup but cannot work it out.
Thanks
Edit:
from bs4 import BeautifulSoup
import requests
url = "https://www.facebook.com/groups/347805588637627/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
span = soup.find("span", id="count_text")
print(span.text)

In case there is more than one span tag in the page:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_input, 'html.parser')
span = soup.find("span", id="count_text")
span.text

You can use the text attribute of the parsed span:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<span id="count_text">9,413 members</span>', 'html.parser')
>>> soup.span
<span id="count_text">9,413 members</span>
>>> soup.span.text
'9,413 members'

If you have more than one span tag you can try this
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
for tag in tags:
print(tag.contents[0])

Facebook uses javascrypt to prevent bots from scraping. You need to use selenium to extract data on python.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Retrive html tag content using beautifulSoup - python

I'm trying to get the plain text of a website article using python. I've heard about the BeautifulSoup library, but how to retrieve a specific tag in html page? This is what I have done: base_url = 'http://www.nytimes.com' r = requests.get(base_url) soup = BeautifulSoup(r.text, "html.parser")

Look this: import bs4 as bs import requests as rq html = rq.get('site.com') s = bs.BeautifulSoup(html.text, features="html.parser") div = s.find('div', {'class': 'yourclass'}) # or id print(str(div.text)) # print text

Related

Python - Filter BS4 content

How to find all classes under table tag in python using web scraping library beautiful soup

How to receive website link in Python using BeautifulSoup

How to get text following a table/span with BeautifulSoup and Python?

Get value of span tag using BeautifulSoup

Categories

Resources