BS4: Getting text in tag - python

I'm using beautiful soup. There is a tag like this:
<li><a> s.r.o., <small>small</small></a></li>
I want to get the text within the anchor <a> tag only, without any from the <small> tag in the output; i.e. " s.r.o., "
I tried find('li').text[0] but it does not work.
Is there a command in BS4 which can do that?

One option would be to get the first element from the contents of the a element:
>>> from bs4 import BeautifulSoup
>>> data = '<li><a> s.r.o., <small>small</small></a></li>'
>>> soup = BeautifulSoup(data, 'html.parser')
>>> print(soup.find('a').contents[0])
 s.r.o.,
Another one would be to find the small tag and get the previous sibling:
>>> print(soup.find('small').previous_sibling)
 s.r.o.,
Well, there are all sorts of alternative/crazy options also:
>>> print(next(soup.find('a').descendants))
 s.r.o.,
>>> print(next(iter(soup.find('a'))))
 s.r.o.,
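When the text you want is split around several child tags, contents[0] only returns the first piece. A minimal sketch of a more general helper follows. own_text is a hypothetical name, not part of Beautiful Soup, and the trailing " Ltd." in the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Join only the strings that are direct children of tag (hypothetical helper)."""
    return ''.join(
        child for child in tag.children
        if isinstance(child, NavigableString)
    )


# Sample markup; the trailing " Ltd." is an assumption added for the demo.
data = '<li><a> s.r.o., <small>small</small> Ltd.</a></li>'
soup = BeautifulSoup(data, 'html.parser')
anchor = soup.find('a')

direct = own_text(anchor)        # strings owned by <a> itself, <small> skipped
everything = anchor.get_text()   # strings of <a> plus every nested tag

print(repr(direct))
print(repr(everything))
```

Unlike decompose(), this leaves the parsed tree untouched, so the small tag is still available afterwards.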

Use .children
next(soup.find('a').children)
s.r.o.,

To loop over every anchor tag in an HTML string or web page and print its leading text (for a live page, fetch the HTML first, for example with urlopen from urllib), this works:
from bs4 import BeautifulSoup
data = '<li><a>s.r.o., <small>small</small></a></li> <li><a>2nd</a></li> <li><a>3rd</a></li>'
soup = BeautifulSoup(data, 'html.parser')
a_tag = soup('a')  # shorthand for soup.find_all('a')
for tag in a_tag:
    print(tag.contents[0])  # .contents[0] is the first child of each <a> tag, here its leading text
Output:
s.r.o.,
2nd
3rd
a_tag is a list-like result set containing all anchor tags; collecting them in one place lets you process them as a group when more than one <a> tag is present.
>>> print(a_tag)
[<a>s.r.o., <small>small</small></a>, <a>2nd</a>, <a>3rd</a>]

According to the documentation, the text of a tag can be retrieved through its .string property. Because .string is None when a tag has more than one child, remove the small tag first:
soup = BeautifulSoup('<li><a> s.r.o., <small>small</small></a></li>', 'html.parser')
res = soup.find('a')
res.small.decompose()
print(res.string)
#  s.r.o.,
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring

Related

Why do I keep getting more than one item in my list when retrieving information with BS4 and requests?

from bs4 import BeautifulSoup
import requests
Webpage = requests.get('https://www.brainyquote.com/quote_of_the_day')
soup = BeautifulSoup( Webpage.content, 'html.parser')
qoute = soup.find(class_='qotd-q-cntr')
words = [qoute.find('a').text for item in qoute]
print(words)
When I print the variable words, the same quote appears three times in my list, but I want it to appear only once. My output is similar to the following:
['qoute','qoute','qoute']
I'm looking to get it to be something like this
['qoute']
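This is because you are scraping via the class attribute, and the class you gave is the one used for all the quotes when you inspect that website. Note also that the list comprehension never uses item: iterating over a tag walks its children, and qoute.find('a').text returns the same text each time, so the identical quote is repeated once per child.
Instead, search for something more specific, such as the h2 tag with class qotd-h2 and inner text "Quote of the Day".
From that anchor element you can then traverse the DOM to get to the quote.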
Example
from bs4 import BeautifulSoup
import requests
Webpage = requests.get('https://www.brainyquote.com/quote_of_the_day')
soup = BeautifulSoup( Webpage.content, 'html.parser')
#? Find the quote of the day title
anchor = soup.find('h2', class_='qotd-h2', text="Quote of the Day")
quoteDiv = anchor.parent.find(class_="clearfix") #? The div surrounding the quote
quote = quoteDiv.find(title="view quote") #? The quote tag
print(quote.text)
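The repetition can be reproduced offline. The sketch below uses made-up markup in place of the real page (the container's three children and their text are assumptions), so it only illustrates the mechanism, not the site's actual structure.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page: one container, three children.
html = (
    '<div class="qotd-q-cntr">'
    '<a title="view quote">Some quote</a>'
    '<a title="view author">Some author</a>'
    '<span>extra</span>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
container = soup.find(class_='qotd-q-cntr')

# Iterating over a Tag yields its children; `item` is never used,
# so the same text is produced once per child.
repeated = [container.find('a').text for item in container]

# Asking directly for the one element you want yields a single entry.
single = [container.find('a', title='view quote').text]

print(repeated)
print(single)
```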

Extracting text from td tag containing br tags inside

I want to extract text from td tag containing br tags inside.
from bs4 import BeautifulSoup
html = "<td class=\"text\">This is <br/>a breakline<br/><br/></td>"
soup = BeautifulSoup(html, 'html.parser')
print(soup.td.string)
Actual Output: None
Expected output: This is a breakline
From the Beautiful Soup documentation:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
And if you want only the text part (also from the documentation):
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
So you can use the following:
print(soup.get_text())
For a specific tag, use soup.td.get_text().
This will give you what you are looking for:
print(soup.td.text)
This is for the specific td tag
Otherwise you also have:
print(soup.text)
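A short sketch comparing the options on the td from the question. The variable names are illustrative only; the separator and strip arguments to get_text() are optional and shown here as one way to control spacing.

```python
from bs4 import BeautifulSoup

html = '<td class="text">This is <br/>a breakline<br/><br/></td>'
soup = BeautifulSoup(html, 'html.parser')
td = soup.td

single_string = td.string               # None, because td has several children
pieces = list(td.stripped_strings)      # each text fragment, whitespace-trimmed
joined = td.get_text()                  # fragments concatenated as they are
spaced = td.get_text(' ', strip=True)   # fragments trimmed, then joined by a space

print(single_string)
print(pieces)
print(joined)
print(spaced)
```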

Print out all article titles

I'm new to Python. Here are some lines of Python code that print out all article titles on http://www.nytimes.com/.
import requests
from bs4 import BeautifulSoup
base_url = 'http://www.nytimes.com'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
for story_heading in soup.find_all(class_="story-heading"):
    if story_heading.a:
        print(story_heading.a.text.replace("\n", " ").strip())
    else:
        print(story_heading.contents[0].strip())
What do .a and .text mean?
Thank you very much.
First, let's see what printing one story_heading alone gives us:
>>> story_heading
<h2 class="story-heading"><a href="...">Mortgage Calculator</a></h2>
To extract only the a tag, we access it using story_heading.a:
>>> story_heading.a
<a href="...">Mortgage Calculator</a>
To get only the text inside the tag itself, and not its attributes, we use .text:
>>> story_heading.a.text
'Mortgage Calculator'
Here,
.a gives you the first anchor tag
.text gives you the text within the tag
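Both branches of the loop can be tried on static markup. In this sketch the two headings and the /example link are invented for illustration; only the class name story-heading comes from the question.

```python
from bs4 import BeautifulSoup

# Invented sample: one heading with a link, one with plain text only.
html = (
    '<h2 class="story-heading"><a href="/example">Mortgage\nCalculator</a></h2>'
    '<h2 class="story-heading"> Plain heading </h2>'
)
soup = BeautifulSoup(html, 'html.parser')

titles = []
for story_heading in soup.find_all(class_='story-heading'):
    if story_heading.a:  # .a is the first <a> inside the heading, or None
        titles.append(story_heading.a.text.replace('\n', ' ').strip())
    else:                # no link: fall back to the heading's first child
        titles.append(story_heading.contents[0].strip())

print(titles)
```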

How find specific data attribute from html tag in BeautifulSoup4?

Is there a way to find an element using only the data attribute in html, and then grab that value?
For example, with this line inside an html doc:
<ul data-bin="Sdafdo39">
How do I retrieve Sdafdo39 by searching the entire html doc for the element that has the data-bin attribute?
A slightly more precise approach:
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
This way, the list only iterates over the ul elements that have the attribute you want to find.
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc, 'html.parser')
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
You can use the find_all method to get all the tags and then keep only those that have "data-bin" among their attributes. From each matching tag you can then read the attribute's value, like this:
from bs4 import BeautifulSoup
html_doc = """<ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc, 'html.parser')
print([item["data-bin"] for item in bs.find_all() if "data-bin" in item.attrs])
# ['Sdafdo39']
You could solve this with gazpacho in just a couple of lines:
First, import and turn the html into a Soup object:
from gazpacho import Soup
html = """<ul data-bin="Sdafdo39">"""
soup = Soup(html)
Then you can just search for the "ul" tag and extract the data-bin attribute:
soup.find("ul").attrs["data-bin"]
# Sdafdo39
As an alternative if one prefers to use CSS selectors via select() instead of find_all():
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
soup = BeautifulSoup(html_doc)
# Select
soup.select('ul[data-bin]')
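select() returns the matching tags rather than the attribute values, so one more step is needed to read data-bin. A minimal sketch, with variable names chosen for illustration and a closing ul tag added to the sample markup:

```python
from bs4 import BeautifulSoup

html_doc = '<ul class="foo">foo</ul><ul data-bin="Sdafdo39"></ul>'
soup = BeautifulSoup(html_doc, 'html.parser')

# CSS attribute selector: every <ul> that carries a data-bin attribute.
via_select = [ul['data-bin'] for ul in soup.select('ul[data-bin]')]

# Equivalent search with find_all and an attribute filter.
via_find_all = [ul['data-bin'] for ul in soup.find_all('ul', attrs={'data-bin': True})]

# First element of any tag name that has the attribute, or None if absent.
first = soup.select_one('[data-bin]')
first_value = first.get('data-bin') if first is not None else None

print(via_select, via_find_all, first_value)
```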

How can I extract only text in a Scrapy selector in Python

I have this code:
site = hxs.select("//h1[@class='state']")
log.msg(str(site[0].extract()),level=log.ERROR)
The output is:
[scrapy] ERROR: <h1 class="state"><strong>
1</strong>
<span> job containing <strong>php</strong> in <strong>region</strong> paying <strong>$30-40k per year</strong></span>
</h1>
Is it possible to get only the text, without any HTML tags?
//h1[@class='state']
In the XPath above you are selecting the h1 element whose class attribute is state, which is why the result contains everything inside the h1 element.
If you just want the text directly inside the h1 tag, all you have to do is:
//h1[@class='state']/text()
If you want the text of the h1 tag as well as the text of its child tags, you have to use:
//h1[@class='state']//text()
So the difference is that /text() selects only the tag's own text, while //text() selects the text of the tag and of all the tags nested inside it.
In your HTML almost all of the text sits inside the nested strong and span tags, so //text() is the variant you need. The following code should work for you:
site = ''.join(hxs.select("//h1[@class='state']//text()").extract()).strip()
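The two expressions can be compared outside Scrapy. The sketch below assumes lxml is installed and uses it as a stand-in for Scrapy's selector; the html and body wrapper around the h1 from the question is added only so the snippet parses as a full document.

```python
from lxml import html as lxml_html

# The h1 from the question, wrapped in a minimal document for parsing.
page = (
    '<html><body><h1 class="state"><strong>\n1</strong>\n'
    '<span> job containing <strong>php</strong> in <strong>region</strong>'
    ' paying <strong>$30-40k per year</strong></span>\n</h1></body></html>'
)
tree = lxml_html.fromstring(page)

# /text(): only text nodes that are direct children of the h1.
direct = tree.xpath("//h1[@class='state']/text()")
# //text(): text nodes of the h1 and of everything nested inside it.
nested = tree.xpath("//h1[@class='state']//text()")

direct_text = ''.join(direct).strip()
full_text = ' '.join(''.join(nested).split())  # collapse runs of whitespace

print(repr(direct_text))
print(repr(full_text))
```

For this particular markup the direct text nodes hold nothing but whitespace, which is why the descendant form is the one that returns the sentence.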
You can use Beautiful Soup's get_text() method.
from bs4 import BeautifulSoup
text = '''
<td>Please can you strip me?
<br/>I am waiting....
</td>
'''
soup = BeautifulSoup(text, 'html.parser')
print(soup.get_text())
I haven't got a scrapy instance running so I couldn't test this; but you could try to use text() within your search expression.
For example:
site = hxs.select("//h1[@class='state']/text()")
(got it from the tutorial)
You can use BeautifulSoup to strip HTML tags; here is an example:
from bs4 import BeautifulSoup
''.join(BeautifulSoup(str(site[0].extract()), 'html.parser').find_all(string=True))
You can then strip any additional whitespace, newlines, etc.
If you don't want to use additional modules, you can try a simple regex:
import re
# replace html tags with ' '
text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
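Replacing tags with spaces leaves runs of whitespace behind. A small standard-library sketch of the follow-up cleanup, using the h1 from the question as a literal string (raw, no_tags and clean are illustrative names). A regex like this is only a rough tool and can misfire on comments, scripts, or a ">" inside an attribute value.

```python
import re

# The extracted h1 from the question, as a plain string.
raw = (
    '<h1 class="state"><strong>\n1</strong>\n'
    '<span> job containing <strong>php</strong> in <strong>region</strong>'
    ' paying <strong>$30-40k per year</strong></span>\n</h1>'
)

no_tags = re.sub(r'<[^>]*?>', ' ', raw)   # replace every tag with a space
clean = ' '.join(no_tags.split())         # collapse whitespace and trim the ends

print(clean)
```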
You can use html2text:
import html2text
converter = html2text.HTML2Text()
print(converter.handle("<div>Please!!!<span>remove me</span></div>"))
