Python count number of letters on scraped page

I'm making requests in Python with requests.
I then use bs4 to select the wanted div. I now want to count the length of the text in that div, but the string I get out of it includes all the tags too, for example:
<div><a class="some_class">Text here!</a></div>
I want to only count the Text here!, without all the div and a tags.
Anyone have any idea how I could do that?

Do you mean:
tag.text
or
tag.string
Here, tag is the element you found with soup.find(). Check the documentation for more details.
Here is a simple demo that helps you understand what I mean:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div><a class="some_class">Text here!</a></div></body></html>', "html.parser")
>>> tag = soup.find('div')
>>> tag
<div><a class="some_class">Text here!</a></div>
>>> tag.string
'Text here!'
>>> tag.text
'Text here!'
>>>
As for counting the length of the text, do you mean using len() here?
>>> tag.text
'Text here!'
>>> len(tag.text)
10
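One caveat, as a minimal sketch: .string returns None once the tag has more than one child, while .get_text() still concatenates all nested text. The markup below (with an extra span) and the variable names are made up for illustration; the letter and digit counts are just one way to read "count number or letters".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the div now has several children, not just one <a>.
html = '<div><a class="some_class">Text here!</a> <span>More</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

only_string = div.string   # None: .string is ambiguous with several children
text = div.get_text()      # all nested text, tags stripped
total_chars = len(text)    # counts spaces and punctuation too
letters = sum(ch.isalpha() for ch in text)  # letters only
digits = sum(ch.isdigit() for ch in text)   # digits only

print(only_string, repr(text), total_chars, letters, digits)
```

If whitespace should not count, div.get_text(strip=True) trims the ends of each text fragment before joining them.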

Related

BeautifulSoup cannot extract item using find_all()

I am trying to get the location of some text in HTML like the snippet below using BeautifulSoup. Here is my HTML:
<p><em>code of Drink<br></em>
Budweiser: 4BDB1CD96<br>
price: 10$</p>
with this code:
soup = BeautifulSoup(html,'lxml')
result = re.escape('4BDB1CD96')
tag = soup.find(['li','div','p','em'],string=re.compile(result))
I cannot extract the tag this way, but when I change the find() call to:
tag = soup.find(string=re.compile(result))
then I can get the result:
Budweiser: 4BDB1CD96
So I want to know why this happens, and how to get the result as a tag.

The problem here is that your tags have nested tags, and the text you are searching for is inside such a tag (p here).
So, the easiest approach is to use a lambda inside .find() to check the tag names and whether their .text property contains your pattern. Here, you do not even need a regex:
>>> tag = soup.find(lambda t: t.name in ['li','div','p','em'] and '4BDB1CD96' in t.text)
>>> tag
<p><em>code of Drink<br/></em>
Budweiser: 4BDB1CD96<br/>
price: 10$</p>
>>> tag.string
>>> tag.text
'code of Drink\nBudweiser: 4BDB1CD96\nprice: 10$'
Of course, you may use a regex for more complex searches:
r = re.compile('4BDB1CD96') # or whatever the pattern is
tag = soup.find(lambda t: t.name in ['li','div','p','em'] and r.search(t.text))
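To see why the original call returns nothing, here is a small sketch. It assumes the string= argument is compared against each candidate tag's .string, which is None for a tag with several children. It also shows one alternative: find the text node first, then climb to its enclosing tag with find_parent(). The html.parser choice and the variable names are my own, not from the question.

```python
import re
from bs4 import BeautifulSoup

html = "<p><em>code of Drink<br></em>\nBudweiser: 4BDB1CD96<br>\nprice: 10$</p>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(re.escape("4BDB1CD96"))

# <p> has several children, so p.string is None and the filter never matches.
by_string = soup.find(["li", "div", "p", "em"], string=pattern)

# Searching text nodes directly works; then walk up to the nearest wanted tag.
text_node = soup.find(string=pattern)
parent = text_node.find_parent(["li", "div", "p", "em"])

print(by_string)
print(parent.name)
```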

Get real text with beautifulSoup after unwrap()

I need your help : I have <p> tag with many other tags in like in the example below :
<p>I <strong>AM</strong> a <i>text</i>.</p>
I would like to get only "I AM a text.", so I unwrap() the strong and i tags with the code below:
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
Next, if I print soup.p everything looks right, but if I don't know the name of the tag my string is in, the problems start!
The code below should make this clearer:
from bs4 import BeautifulSoup
html = '''
<html>
<header></header>
<body>
<p>I <strong>AM</strong> a <i>text</i>.</p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
print(soup.p)
# output :
# <p>I AM a text.</p>
for s in soup.stripped_strings:
    print(s)
# output
'''
I
AM
a
text
.
'''
Why does BeautifulSoup keep my strings separate when I already joined them with unwrap()?

If you .unwrap() the tag, you remove the tag, and put the content in the parent tag. But the text is not merged, as a result, you obtain a list of NavigableStrings (a subclass of str):
>>> [(c,type(c)) for c in soup.p.children]
[('I ', <class 'bs4.element.NavigableString'>), ('AM', <class 'bs4.element.NavigableString'>), (' a ', <class 'bs4.element.NavigableString'>), ('text', <class 'bs4.element.NavigableString'>), ('.', <class 'bs4.element.NavigableString'>)]
Each of these elements thus is a separated text element. So although you removed the tag itself, and injected the text, these strings are not concatenated. This seems logical, since the elements on the left and the right could still be tags: by unwrapping <strong> you have not unwrapped <i> at the same time.
You can, however, use .get_text() (or the equivalent .text property) to obtain the full text:
>>> soup.p.get_text()
'I AM a text.'
Or you can decide to join the elements together:
>>> ''.join(soup.p.strings)
'I AM a text.'

A somewhat hacky (yet easy) way I found to solve this is to convert the "clean" soup into a string and then parse it again.
So, in your code:
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
string_soup = str(soup)
new_soup = BeautifulSoup(string_soup, 'lxml')
for s in new_soup.stripped_strings:
    print(s)
This should give the desired output.
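Another option that avoids re-parsing, sketched here on the assumption that Beautiful Soup 4.8.0 or newer is installed: smooth() consolidates adjacent NavigableStrings in place. The html.parser choice and the before/after names are mine.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I <strong>AM</strong> a <i>text</i>.</p>", "html.parser")
for elem in soup.find_all(["strong", "i"]):
    elem.unwrap()

before = list(soup.stripped_strings)  # still five separate text nodes
soup.smooth()                         # merge neighbouring strings in place
after = list(soup.stripped_strings)   # now a single text node

print(before)
print(after)
```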

Matching a group with OR condition in pattern

I am trying to extract the text between <th> tags from an HTML table. The following code shows the problem:
import re
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to catch the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.

Why not use an HTML parser instead? BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']
Also note that str is a bad choice for a variable name - you are shadowing a built-in str.

You can reduce the regex to a single capturing group, as below; the \b after th keeps it from matching <thead>:
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
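A quick check of that pattern, as a sketch. The sample string with a wrapping <thead> is made up to exercise the case the asker was worried about. The second half shows why the original alternation printed None: each alternative fills its own group, so you would have to pick whichever one matched.

```python
import re

# Hypothetical input that includes <thead>, which must not be treated as <th>.
searchstr = '<thead><tr><th class="c1">data 1</th><th>data 2</th></tr></thead>'

single = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
cells = single.findall(searchstr)

# The two-branch pattern from the question: group 1 or group 2 is set, never both.
# Note that "or" would also skip past an empty cell in group 1.
alt = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
either = [m.group(1) or m.group(2) for m in alt.finditer(searchstr)]

print(cells)
print(either)
```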

get a string between a tag (TEST in <div><p>p1</p>TEST<p>p2</p></div>)

Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', 'html.parser')
print(soup.div())
Result:
[<p>p1</p>, <p>p2</p>]
How come the string TEST isn't in the result set? How can I get it?

soup.div() is a shortcut for soup.div.find_all() which would find you all tags inside the div tag - as you can see, it does the job. TEST is a text between the p tags, or, in other words, the tail of the first p tag.
You can get the TEST string by getting the first p tag and using .next_sibling:
>>> soup.div.p.next_sibling
u'TEST'
Or, by getting the second element of the div's .contents:
>>> soup.div.contents[1]
u'TEST'
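If the position of the text inside the div is not known in advance, a more general sketch is to keep only the div's direct children that are text nodes. The direct_text and own_text names are my own.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<div><p>p1</p>TEST<p>p2</p></div>", "html.parser")

# Direct children that are plain text, skipping the nested <p> tags.
direct_text = [
    str(child)
    for child in soup.div.children
    if isinstance(child, NavigableString)
]
own_text = "".join(direct_text).strip()

print(own_text)
```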

from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', 'html.parser')
print(soup.div.text)
p1TESTp2

how to eliminate a specific part of an html file in python

I am working on a html file which has item 1, item 2, and item 3. I want to delete all the text that comes after the LAST item 2. There may be more than one item 2 in the file. I am using this but it does not work:
>>> import re
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> a = re.search('(?<=<B>)Item 2.', text)
>>> b = a.group(0)
>>> newText = text.partition(b)[0]
>>> newText
'<A href="#106">'
It deletes the text after the first item 2, not the second one.

I'd use BeautifulSoup to parse the HTML and modify it. You might want to use the decompose() or extract() method.
BeautifulSoup is nice because it's pretty good at parsing malformed HTML.
For your specific example:
>>> import bs4
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> soup = bs4.BeautifulSoup(text)
>>> soup.b.next_sibling.extract()
u' this is an example this is an example'
>>> soup
<html><body>Item 2. <b>Item 2. Properties</b></body></html>
If you really wanna use regular expressions, a non-greedy regex would work for your example:
>>> import re
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> m = re.match(r".*?Item 2\.", text)
>>> m.group(0)
'<A href="#106">Item 2.'
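Note that the non-greedy match stops at the first "Item 2.". If the goal is to cut after the last one, a plain-string sketch with str.rfind() may be enough. This assumes "Item 2." is a literal marker; the marker, idx, and kept names are mine.

```python
text = '<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example'
marker = "Item 2."

# rfind() returns the index of the last occurrence, or -1 if there is none.
idx = text.rfind(marker)
kept = text[: idx + len(marker)] if idx != -1 else text

print(kept)
```

This works on raw characters, so it can leave tags such as <B> unclosed; parsing with BeautifulSoup afterwards (or instead) is the safer route when the result must stay valid HTML.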
