I'm trying to get the plain text of a website article using Python. I've heard about the BeautifulSoup library, but how do I retrieve a specific tag from an HTML page?
This is what I have done:
import requests
from bs4 import BeautifulSoup
base_url = 'http://www.nytimes.com'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
Try this:
import bs4 as bs
import requests as rq
html = rq.get('http://site.com') # placeholder URL; requests needs the scheme (http:// or https://)
s = bs.BeautifulSoup(html.text, features="html.parser")
div = s.find('div', {'class': 'yourclass'}) # or {'id': 'yourid'}
print(div.text) # print the text inside the div
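As a minimal, self-contained sketch of the same lookup, the snippet below parses an inline HTML string instead of fetching a page. The markup, the class name `article-body`, and the id `main-title` are made up for illustration. `find` returns `None` when nothing matches, so it is worth checking the result before reading its text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<html><body>
<h1 id="main-title">Example headline</h1>
<div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up one tag by id and another by class.
title = soup.find("h1", id="main-title")
body = soup.find("div", class_="article-body")

# find() returns None when there is no match, so guard before using the result.
title_text = title.get_text(strip=True) if title is not None else ""
# The " " separator joins the text of the nested <p> tags with a space.
body_text = body.get_text(" ", strip=True) if body is not None else ""

print(title_text)
print(body_text)
```

Using `get_text` with a separator keeps the paragraphs from running together, which plain `.text` would not do here.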
Current code:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')
print(bs4_content)
The only content I manage to get is:
[<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>]
I'm trying to get only the content between the strong tags.
Thank you, all help is much appreciated.
You can call .find_all again on each matched element:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')
# find_all returns a list of tags, so search inside each one.
for post_count in bs4_content:
    for strong in post_count.find_all('strong'):
        print(strong.text)
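When only one matching element is expected, `find` may be simpler than `find_all`, because it returns a single tag rather than a list. The sketch below runs on the HTML from the question as an inline string, so no URL is needed; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The markup from the question, inlined so the example runs offline.
html = '<p class="user-post-count">This user has made <strong>5 posts</strong>\n</p>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if there is none.
post_count = soup.find(class_="user-post-count")

# tag.strong is shorthand for the first <strong> inside the tag.
if post_count is not None and post_count.strong is not None:
    strong_text = post_count.strong.get_text()
else:
    strong_text = None

print(strong_text)
```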
Try using the CSS selector .user-post-count strong, which selects the <strong> tags inside elements with the user-post-count class.
from bs4 import BeautifulSoup
html = '''<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>
'''
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select('.user-post-count strong'):
    print(tag.text)
Output:
5 posts
import requests
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")
from bs4 import BeautifulSoup
soup.table["class"]
Create the soup object before you use it; soup.table["class"] then returns the class of the first table on that page.
soup = BeautifulSoup(req.content, 'html.parser')
soup.table["class"]
Result:
['infobox', 'vcard']
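The same attribute access can be tried without a network request. The sketch below uses a small made-up table; it shows that `class` comes back as a list (it is a multi-valued attribute), that other attributes come back as strings, and that `.get` avoids a `KeyError` for attributes that are absent.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup; only the class names mirror the result above.
html = '<table class="infobox vcard" id="t1"><tr><td>cell</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

table = soup.table               # first <table> in the document
classes = table["class"]         # multi-valued attribute: a list of class names
table_id = table["id"]           # single-valued attribute: a plain string
missing = table.get("summary")   # .get returns None instead of raising KeyError

print(classes, table_id, missing)
```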
I want to collect the link /hmarchhak/102217 from a site (https://www.vanglaini.org/) and print it as https://www.vanglaini.org/hmarchhak/102217. Please help.
import requests
import pandas as pd
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    headline = article.a.text
    summary = article.p.text
    link = article.a.href
    print(headline)
    print(summary)
    print(link)
    print()
This is my code.
Unless I am missing something, headline and summary appear to be the same text. With bs4 4.7.1+ you can use :has to keep only the article elements that contain a descendant with an href attribute; this also seems to filter out article elements that are not part of the main body, which I suspect is actually your aim.
from bs4 import BeautifulSoup as bs
import requests
import re
base = 'https://www.vanglaini.org'
r = requests.get(base)
soup = bs(r.content, 'lxml')
# Keep only articles that contain an element with an href attribute.
for article in soup.select('article:has([href])'):
    headline = article.h5.text.strip()
    # Replace line breaks inside the summary with spaces.
    summary = re.sub(r'\n+|\r+', ' ', article.p.text.strip())
    link = f"{base}{article.a['href']}"
    print(headline)
    print(summary)
    print(link)
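String concatenation works when every href is site-relative. As a hedged alternative, `urllib.parse.urljoin` from the standard library also handles hrefs that are already absolute. The sketch below uses made-up article markup rather than the live page, so the structure of the real site is an assumption.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "https://www.vanglaini.org"

# Hypothetical markup: one relative link and one link that is already absolute.
html = """
<article><h5>First headline</h5><a href="/hmarchhak/102217">Read more</a></article>
<article><h5>Second headline</h5><a href="https://example.com/story">Read more</a></article>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href] matches only anchors that actually carry an href attribute.
links = [urljoin(base, a["href"]) for a in soup.select("article a[href]")]

for link in links:
    print(link)
```

`urljoin` leaves the absolute URL untouched and prefixes only the relative one, so there is no need to check which kind each href is.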
I need to get the text 2,585 shown in the screenshot below. I'm very new to coding, but this is what I have so far:
import urllib2
from bs4 import BeautifulSoup
url= 'insertURL'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
span = soup.find('span', id='d21475972e793-wk-Fact -8D34B98C76EF518C788A2177E5B18DB0')
print (span.text)
Any info is helpful!! Thanks.
Three things: you're using requests, not urllib2. You're selecting XML with namespaces, so you need to use xml as the parser. The element you want is not span; it is ix:nonFraction. Here is a working example using another web page (you just need to point it at your page and use the commented line).
# Using requests, so there is no need for urllib2.
import requests
from bs4 import BeautifulSoup
# Using this page as an example.
url = 'https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004.txt'
r = requests.get(url)
data = r.text
# Use xml as the parser (this requires the lxml package).
soup = BeautifulSoup(data, 'xml')
ix = soup.find('ix:nonFraction', id="Fact-7365D69E1478B0A952B8159A2E39B9D8-wk-Fact-7365D69E1478B0A952B8159A2E39B9D8")
# Your original code for your page.
# ix = soup.find('ix:nonFraction', id='d21475972e793-wk-Fact-8D34B98C76EF518C788A2177E5B18DB0')
print(ix.text)
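If the element's id is known, another option is to search by id alone, which does not depend on the tag name at all. The sketch below is an assumption-laden illustration: it uses an inline fragment with a made-up id (`fact-1`) and the built-in `html.parser`, and it has not been checked against the actual filing. It also shows turning the text `2,585` into an integer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the id and surrounding markup are illustrative only.
html = '<div><ix:nonFraction id="fact-1">2,585</ix:nonFraction></div>'
soup = BeautifulSoup(html, "html.parser")

# Searching by id only matches whatever tag carries that id.
fact = soup.find(id="fact-1")

# Strip whitespace and thousands separators before converting to int.
value = int(fact.get_text(strip=True).replace(",", "")) if fact is not None else None

print(value)
```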
I have a number of Facebook groups whose member counts I would like to get. An example would be this group: https://www.facebook.com/groups/347805588637627/
I have looked at inspect element on the page and it is stored like so:
<span id="count_text">9,413 members</span>
I am trying to get "9,413 members" out of the page. I have tried using BeautifulSoup but cannot work it out.
Thanks
Edit:
from bs4 import BeautifulSoup
import requests
url = "https://www.facebook.com/groups/347805588637627/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
span = soup.find("span", id="count_text")
print(span.text)
In case there is more than one span tag in the page, look the span up by its id:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_input, 'html.parser')
span = soup.find("span", id="count_text")
print(span.text)
You can use the text attribute of the parsed span:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<span id="count_text">9,413 members</span>', 'html.parser')
>>> soup.span
<span id="count_text">9,413 members</span>
>>> soup.span.text
'9,413 members'
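Once the text is available, the count itself can be pulled out as a number. This is a small sketch on the span from the question; the regular expression and the variable names are illustrative choices, not part of the answers above.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span id="count_text">9,413 members</span>', "html.parser")
span = soup.find("span", id="count_text")

# Match a run of digits that may contain thousands separators, e.g. "9,413".
match = re.search(r"\d[\d,]*", span.text) if span is not None else None

# Drop the commas before converting to an integer.
member_count = int(match.group().replace(",", "")) if match else None

print(member_count)
```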
If you have more than one span tag, you can try this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
for tag in tags:
    print(tag.contents[0])
Facebook uses JavaScript to prevent bots from scraping, so the HTML returned by requests may not contain the member count. You need a browser automation tool such as Selenium to extract the data in Python.