I'm dealing with HTML/XHTML links with BeautifulSoup 4.3.2 and have run into some strangeness with br tags occurring inside a elements.
import re
from bs4 import BeautifulSoup
html = BeautifulSoup('<html><head></head><body><a href="http://example.com/ABCD0000000">ABCD0000000<br /></a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
Gives an empty list.
As I've found, it's caused by the br tag appearing inside the a tag.
Hmm. Well, let's replace it with a newline, as someone advised here...
html.find('br').replace_with('\n')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
Again an empty list, damn.
Maybe,
html.find('br').replace_with('')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
The same result.
But
html = BeautifulSoup('<html><head></head><body><a href="http://example.com/ABCD0000000">ABCD0000000</a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
[<a href="http://example.com/ABCD0000000">ABCD0000000</a>]
Works fine.
So, as far as I can see, there is no way around this except to clean out or replace the br tags before feeding the data to bs4:
import re
re.sub(re.compile(r'<br\s*/>', re.IGNORECASE), '\n', '<html><head></head><body><a href="http://example.com/ABCD0000000">ABCD0000000<br /></a></body></html>')
Or is there a better way?
Thanks for suggestions and comments.
Best regards,
~S.
One option would be to remove all br tags using extract() and then perform the search:
import re
from bs4 import BeautifulSoup
html = BeautifulSoup('<html><head></head><body><a href="http://example.com/ABCD0000000">ABCD0000000<br /></a></body></html>')
for br in html('br'):
    br.extract()
print(html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE)))
Prints:
[<a href="http://example.com/ABCD0000000">ABCD0000000</a>]
Another option would be to check that the href attribute ends with ABCD0000000, using a CSS selector:
html.select('a[href$="ABCD0000000"]')
Another option would be to use a function and check that the link text starts with ABCD0000000:
html.find_all(lambda tag: tag.name == 'a' and tag.text.startswith('ABCD0000000'))
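Assuming the href from the sample above, both of these return the same a element as the first option. Note that the lambda works even with the br still present, because tag.text concatenates all of the tag's descendant strings, so it is still 'ABCD0000000'.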
I want to select all anchor tags that do not contain mailto: in their href property.
Up until version 4.7.0 of BeautifulSoup, I was able to use this code:
links = soup.select("a[href^=mailto:]")
Version 4.7.0 of BeautifulSoup replaced its CSS selector implementation with SoupSieve, which is supposed to be more modern and complete.
Unfortunately, the above code now throws this error:
soupsieve.util.SelectorSyntaxError: Malformed attribute selector
What is the drop-in replacement for that code? What is the proper way to target those same elements?
It appears that the colon in the href's value just needed to be escaped.
You can do that by escaping the individual character:
soup.select("a[href^=mailto\\:]")
Or by quoting the whole value:
soup.select('a[href^="mailto:"]')
IE < 8 doesn't recognize the \: escape, so it's worth knowing you can also use code points. Also, since you want to exclude the mailto: links, negate the match with :not() (bs4 4.7.1+):
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head></head>
<body>
<a href="mailto:someone@example.com">Nada</a>
<a href="http://example.com">Tada</a>
</body>
</html>
'''
soup = bs(html, 'lxml')
print(soup.select('[href]:not([href*=mailto\\3A])'))
N.B. In a browser this would be [href*=mailto\3A].
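With the sample HTML above, this prints only the non-mailto link:
[<a href="http://example.com">Tada</a>]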
I am trying to scrape a text within a site source code using BeautifulSoup. Part of the source code looks like this:
<hr />
<div class="see-more inline canwrap" itemprop="genre">
<h4 class="inline">Genres:</h4>
<a href="/genre/Horror?ref_=tt_stry_gnr"
> Horror</a> <span>|</span>
<a href="/genre/Mystery?ref_=tt_stry_gnr"
> Mystery</a> <span>|</span>
<a href="/genre/Thriller?ref_=tt_stry_gnr"
> Thriller</a>
</div>
So I have been trying to extract the texts 'Horror', 'Mystery', and 'Thriller' with this code:
import requests
from bs4 import BeautifulSoup
url1='http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'
r1=requests.get(url1)
soup1= BeautifulSoup(r1.text, 'lxml')
genre1=soup1.find('div',attrs={'itemprop':'genre'}).contents
print(genre1)
But the return comes out as:
['\n', <h4 class="inline">Genres:</h4>, '\n', <a href="/genre/Horror?ref_=tt_stry_gnr"> Horror</a>, '\xa0', <span>|</span>, '\n', <a href="/genre/Mystery?ref_=tt_stry_gnr"> Mystery</a>, '\xa0', <span>|</span>, '\n', <a href="/genre/Thriller?ref_=tt_stry_gnr"> Thriller</a>, '\n']
I am pretty new at python and webscraping, so I would appreciate all the help I can get. Thanks!
Use the straightforward BeautifulSoup select() function to extract the needed elements with a CSS selector:
import requests
from bs4 import BeautifulSoup
url1 = 'http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'
soup = BeautifulSoup(requests.get(url1).text, 'lxml')
genres = [a.text.strip() for a in soup.select("div[itemprop='genre'] > a")]
print(genres)
The output:
['Horror', 'Mystery', 'Thriller']
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
You can use the BeautifulSoup get_text() method instead of the .contents property to get what you want.
From the get_text() documentation:
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'
You can specify a string to be used to join the bits of text together:
soup.get_text("|")
>>> u'\nI linked to |example.com|\n'
You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:
soup.get_text("|", strip=True)
>>> u'I linked to|example.com'
But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
Try this; I am using html.parser. Let us know if you face any problems:
for data in genre1:
    # .contents also yields NavigableString nodes such as '\n'; keep only the <a> tags
    if getattr(data, 'name', None) == 'a':
        print(data.text.strip())
Please check the indentation, as I am answering from a cellphone.
You can do the same in several ways. CSS selectors are precise, easy to understand, and less error-prone, so you can go with selectors as well:
from bs4 import BeautifulSoup
import requests
link = 'http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'
res = requests.get(link).text
soup = BeautifulSoup(res,'lxml')
genre = ' '.join([item.text.strip() for item in soup.select(".canwrap a[href*='genre']")])
print(genre)
Result:
Horror Mystery Thriller
I'm working in Python 2, and I have the following script:
from bs4 import BeautifulSoup
import requests, re
page = "http://hidden.com/example"
headers = {'User-Agent': 'Craig'}
html = requests.post(page, headers=headers)
soup = BeautifulSoup(html.text, "html.parser")
final = soup.find('p',{'class':'text'})
print final
This works on a website which I'm not going to post publicly; it returns this:
<p class="text">Example text <a href="">Example more example</a> Second example</p>
How would I remove the <p> and <a href=""> tags? And any other tags lurking about?
Most bs4 tags have a .strings attribute that is a generator for all strings in the tag.
print(''.join(final.strings))
# Example text Example more example Second example
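Equivalently, final.get_text() returns the same joined string.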
I suggest matching the HTML tags with a regex and replacing them with an empty string:
reg = r'<[^>]+>'
This seems to be working.
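For instance, a minimal sketch (regex-stripping HTML is fragile, so prefer get_text() where you can):
import re
html = '<p class="text">Example text <a href="">Example more example</a> Second example</p>'
print(re.sub(r'<[^>]+>', '', html))
# Example text Example more example Second example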
I am having a problem finding a value in a soup based on text. Here is the code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text)
findit=soup.find("td", text=re.compile('Market Cap'))
This returns no match, yet there absolutely is text in a td tag containing 'Market Cap'.
When I use
soup.find_all("td")
I get a result set which includes:
<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>
Explanation:
The problem is that this particular tag has other child elements and the .string value, which is checked when you apply the text argument, is None (bs4 has it documented here).
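A minimal sketch of the failure mode, with hypothetical markup:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<td>Market Cap <font>5</font></td>', 'html.parser')
print(soup.td.string)  # None, because the td has more than one child node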
Solutions/Workarounds:
Don't specify the tag name here at all, find the text node and go up to the parent:
soup.find(text=re.compile('Market Cap')).parent.get_text()
Or, you can use find_parent() if td is not the direct parent of the text node:
soup.find(text=re.compile('Market Cap')).find_parent("td").get_text()
You can also use a "search function" to search for the td tags and see if the direct text child nodes has the Market Cap text:
soup.find(lambda tag: tag and
tag.name == "td" and
tag.find(text=re.compile('Market Cap'), recursive=False))
Or, if you are looking for the number 5 that follows:
soup.find(text=re.compile('Market Cap')).next_sibling.get_text()
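For example, against the td shown above, the first workaround returns the full cell text (a sketch with the markup inlined):
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('<td class="yfnc_tablehead1">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>', 'html.parser')
print(soup.find(text=re.compile('Market Cap')).parent.get_text())
# Market Cap (intraday)5: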
You can't combine a text regex with the tag name here; it just won't work. I don't know if it's a bug or the specification. I just search for the text on its own, then get the parent back in a list comprehension, keeping only the matches whose parent is a td tag.
Code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text, "lxml")
findit=soup.find_all(text=re.compile('Market Cap'))
findit=[x.parent for x in findit if x.parent.name == "td"]
print(findit)
Output
[<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>]
Regex is just a terrible thing to integrate into parsing code and in my humble opinion should be avoided whenever possible.
Personally, I don't like BeautifulSoup due to its lack of XPath support. What you're trying to do is the sort of thing that XPath is ideally suited for. If I were doing what you're doing, I would use lxml for parsing rather than BeautifulSoup's built-in parsing and/or regex. It's really quite elegant and extremely fast:
from lxml import etree
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = etree.HTML(source)
tds_w_market_cap = parsed.xpath('//td[contains(., "Market Cap")]')
FYI, the above returns a list of lxml elements rather than the text of the page source. In lxml you don't really work with the source directly, per se. If you need the actual source back for some reason, you would add something like:
print([etree.tostring(i) for i in tds_w_market_cap])
If you absolutely have to use BeautifulSoup for this task, then I'd use a list comprehension:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = bs(source, 'lxml')
tds_w_market_cap = [i for i in parsed.find_all('td') if 'Market Cap' in i.get_text()]
I have this code
site = hxs.select("//h1[#class='state']")
log.msg(str(site[0].extract()),level=log.ERROR)
The output is:
[scrapy] ERROR: <h1 class="state"><strong>
1</strong>
<span> job containing <strong>php</strong> in <strong>region</strong> paying <strong>$30-40k per year</strong></span>
</h1>
Is it possible to get only the text, without any html tags?
//h1[@class='state']
in your above xpath you are selecting the h1 tag that has the class attribute state,
so that's why it's selecting everything that comes inside the h1 element.
if you just want to select the text of the h1 tag, all you have to do is
//h1[@class='state']/text()
if you want to select the text of the h1 tag as well as its children's text, you have to use
//h1[@class='state']//text()
so the difference is: /text() for the text of the specific tag only, and //text() for the text of the tag as well as its children
the code below works for you:
site = ''.join(hxs.select("//h1[@class='state']//text()").extract()).strip()
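A quick way to see the /text() vs //text() difference (a sketch using lxml directly, which implements the same XPath semantics as Scrapy's selectors):
from lxml import etree
h1 = etree.HTML('<h1 class="state">top<strong>1</strong><span> job</span></h1>')
print(h1.xpath("//h1[@class='state']/text()"))   # ['top'] - the h1's own text only
print(h1.xpath("//h1[@class='state']//text()"))  # ['top', '1', ' job'] - includes children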
You can use the BeautifulSoup get_text() feature:
from bs4 import BeautifulSoup
text = '''
<td>Please can you strip me?
<br/>I am waiting....
</td>
'''
soup = BeautifulSoup(text)
print(soup.get_text())
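This prints the text with the markup stripped:
Please can you strip me?
I am waiting....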
I haven't got a scrapy instance running so I couldn't test this; but you could try to use text() within your search expression.
For example:
site = hxs.select("//h1[#class='state']/text()")
(got it from the tutorial)
You can use BeautifulSoup to strip html tags, here is an example:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 it would be: from bs4 import BeautifulSoup
''.join(BeautifulSoup(str(site[0].extract())).findAll(text=True))
You can then strip all the additional whitespaces, new lines etc.
if you don't want to use additional modules, you can try a simple regex:
# replace html tags with ' '
text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
You can use html2text:
import html2text
converter = html2text.HTML2Text()
print(converter.handle("<div>Please!!!<span>remove me</span></div>"))
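This prints the text content; html2text emits Markdown, which for this snippet is just the plain text:
Please!!!remove me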