avoid outer element wrap in lxml - python

>>> from lxml import html
>>> html.tostring(html.fromstring('<div>1</div><div>2</div>'))
'<div><div>1</div><div>2</div></div>' # I don't want the outer <div>
>>> html.tostring(html.fromstring('I am pure text'))
'<p>I am pure text</p>' # I don't need the extra <p>
How can I avoid the outer <div> and <p> in lxml?

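By default, html.fromstring() has to return a single element, so lxml creates a parent <div> when the string contains multiple top-level elements (and, as in your second example, wraps bare text in a <p>).
You can work with the individual fragments instead, using html.fragments_fromstring():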
from lxml import html
test_cases = ['<div>1</div><div>2</div>', 'I am pure text']
for test_case in test_cases:
    fragments = html.fragments_fromstring(test_case)
    print(fragments)
    output = ''
    for fragment in fragments:
        if isinstance(fragment, str):
            output += fragment
        else:
            output += html.tostring(fragment).decode('UTF-8')
    print(output)
output:
[<Element div at 0x3403ea8>, <Element div at 0x3489368>]
<div>1</div><div>2</div>
['I am pure text']
I am pure text

Related

BeautifulSoup cannot extract item using find_all()

I am trying to get the location of some text in HTML like the one below using BeautifulSoup. Here is my HTML:
<p><em>code of Drink<br></em>
Budweiser: 4BDB1CD96<br>
price: 10$</p>
with this code:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
result = re.escape('4BDB1CD96')
tag = soup.find(['li', 'div', 'p', 'em'], string=re.compile(result))
I cannot extract the tag, but when I change the find() call to:
tag = soup.find(string=re.compile(result))
then I can get the result:
Budweiser: 4BDB1CD96
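So I want to know why this happens, and how to get the result as a tag.
The problem is that the string argument is compared against each candidate tag's .string, and .string is None for any tag that has more than one child. Your <p> contains nested tags as well as text, so it never matches, even though the text you are searching for is inside it.
So, the easiest approach is to pass a lambda to .find() that checks the tag name and whether the tag's .text property contains your pattern. Here, you do not even need a regex: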
>>> tag = soup.find(lambda t: t.name in ['li','div','p','em'] and '4BDB1CD96' in t.text)
>>> tag
<p><em>code of Drink<br/></em>
Budweiser: 4BDB1CD96<br/>
price: 10$</p>
>>> tag.string
>>> tag.text
'code of Drink\nBudweiser: 4BDB1CD96\nprice: 10$'
Of course, you may use a regex for more complex searches:
r = re.compile('4BDB1CD96') # or whatever the pattern is
tag = soup.find(lambda t: t.name in ['li','div','p','em'] and r.search(t.text))
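Another option, sketched below, is to keep the plain string search that already worked for you and step up from the matched text node to its enclosing tag with .parent. This assumes the text sits directly inside the tag you want (as it does in your <p>); the example uses the built-in 'html.parser' only so that it runs without lxml installed.

```python
import re
from bs4 import BeautifulSoup

html = '<p><em>code of Drink<br></em>\nBudweiser: 4BDB1CD96<br>\nprice: 10$</p>'
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(re.escape('4BDB1CD96'))

# <p> has several children, so its .string is None and string= cannot match it.
print(soup.p.string)

# Searching text nodes alone finds the NavigableString; .parent is the enclosing tag.
text_node = soup.find(string=pattern)
tag = text_node.parent
print(tag.name)
```

If the text could be nested more deeply, the lambda version is the safer choice, because it tests the combined text of each candidate tag.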

Get real text with beautifulSoup after unwrap()

I need your help: I have a <p> tag with several other tags inside it, as in the example below:
<p>I <strong>AM</strong> a <i>text</i>.</p>
I would like to get only "I am a text", so I unwrap() the strong and i tags with the code below:
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
Next, if I print soup.p everything is fine, but if I don't know the name of the tag where my string is, the problems start!
The code below should make this clearer:
from bs4 import BeautifulSoup
html = '''
<html>
<header></header>
<body>
<p>I <strong>AM</strong> a <i>text</i>.</p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
print(soup.p)
# output :
# <p>I AM a text.</p>
for s in soup.stripped_strings:
    print(s)
# output
'''
I
AM
a
text
.
'''
Why does BeautifulSoup still split my strings when I have already merged them with unwrap()?
If you .unwrap() a tag, you remove the tag and put its content in the parent tag. But the text is not merged; as a result, you obtain a list of NavigableStrings (a subclass of str):
>>> [(c,type(c)) for c in soup.p.children]
[('I ', <class 'bs4.element.NavigableString'>), ('AM', <class 'bs4.element.NavigableString'>), (' a ', <class 'bs4.element.NavigableString'>), ('text', <class 'bs4.element.NavigableString'>), ('.', <class 'bs4.element.NavigableString'>)]
Each of these elements is thus a separate text node. So although you removed the tag itself and injected its text, these strings are not concatenated. This seems logical, since the elements on the left and the right could still be tags: by unwrapping <strong> you have not unwrapped <i> at the same time.
You can, however, use .get_text() (or the .text property) to obtain the full text:
>>> soup.p.get_text()
'I AM a text.'
Or you can decide to join the elements together:
>>> ''.join(soup.p.strings)
'I AM a text.'
A kinda hacky (yet easy) way I found to solve this is to convert the "clean" soup into a string and then parse it again.
So, in your code:
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
string_soup = str(soup)
new_soup = BeautifulSoup(string_soup, 'lxml')
for s in new_soup.stripped_strings:
    print(s)
This should give the desired output.
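If re-parsing feels too heavy, newer Beautiful Soup releases offer smooth(), which merges adjacent text nodes in place. The sketch below assumes bs4 4.8.0 or later (where smooth() was added) and uses 'html.parser' with a trimmed-down document purely for illustration.

```python
from bs4 import BeautifulSoup

html = '<p>I <strong>AM</strong> a <i>text</i>.</p>'
soup = BeautifulSoup(html, 'html.parser')

for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()

# After unwrap() the <p> holds five adjacent text nodes.
pieces_before = list(soup.stripped_strings)

# smooth() merges adjacent NavigableStrings in place.
soup.smooth()
pieces_after = list(soup.stripped_strings)

print(pieces_before)
print(pieces_after)
```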

Python count number of letters on scraped page

I'm making requests in Python with requests.
I then use bs4 to select the wanted div. I now want to count the length of the text in that div, but the string I get out of it includes all the tags too, for example:
<div><a class="some_class">Text here!</a></div>
I want to only count the Text here!, without all the div and a tags.
Anyone have any idea how I could do that?
Do you mean:
tag.text
or
tag.string
Here, tag is the tag that you found using soup.find(). Check the documentation for more details.
Here is a simple demo that helps you understand what I mean:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div><a class="some_class">Text here!</a></div></body></html>', "html.parser")
>>> tag = soup.find('div')
>>> tag
<div><a class="some_class">Text here!</a></div>
>>> tag.string
'Text here!'
>>> tag.text
'Text here!'
>>>
As for counting the length of the text, do you mean using len() here?
>>> tag.text
'Text here!'
>>> len(tag.text)
10
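If "letters" should exclude spaces and punctuation, the count differs from len(). The following is a small sketch under that assumption; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '<div><a class="some_class">Text here!</a></div>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('div')

# strip=True trims leading and trailing whitespace from each piece of text.
text = tag.get_text(strip=True)

# Every character, including the space and the exclamation mark.
total_chars = len(text)

# Alphabetic characters only.
letters_only = sum(ch.isalpha() for ch in text)

print(total_chars, letters_only)
```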

How to grab item outside of tag using python+beautifulsoup

Using python+beautifulsoup, let's say I have a <class 'bs4.element.Tag'> object, a:
<div class="class1"><em>text1</em> text2</div>
I can use the following command to extract text1 text2 and put it in b:
b = a.text
I can use the following command to extract text1 and put it in c:
c = a.findAll("em")[0].text
But how can I extract just text2?
I edited your HTML snippet slightly to have more than one word inside and outside the <em> tag, so that calling getText() to extract all the text from your <div> container leads to the following output:
'text1 foo bar text2 foobar baz'
As you can see, this is just a string where the <em> tags have been removed. As far as I understood, you want to remove the contents of the <em> tag from the contents of your <div> container.
My solution is not very nice, but this can be done by using .replace() to replace the contents of the <em> tag with an empty string ''. Since this could leave leading or trailing spaces, you could call .lstrip() to get rid of those:
#!/usr/bin/env python3
# coding: utf-8
from bs4 import BeautifulSoup
html = '<div class="class1"><em>text1 foo bar</em> text2 foobar baz</div>'
soup = BeautifulSoup(html, 'html.parser')
result = soup.getText().replace(soup.em.getText(), '').lstrip()
print(result)
Output of print statement:
text2 foobar baz
You can remove all children of the div parent and then get the content of the parent like this (out_div holds the HTML string from the question):
>>> a = BeautifulSoup(out_div, 'html.parser')
>>> for child in a.div.findChildren():
... child.replace_with('')
...
<em>text1</em>
>>> a.get_text()
' text2'
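Both approaches above either depend on the <em> text being unique or modify the tree. A non-destructive alternative, sketched here as one possible way, is to keep only the text nodes that are direct children of the <div>. The variable names are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<div class="class1"><em>text1</em> text2</div>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.div

# Keep only the text nodes that sit directly inside the <div>;
# text nested in child tags such as <em> is skipped.
direct_text = ''.join(
    child for child in a.children if isinstance(child, NavigableString)
)

print(repr(direct_text.strip()))
```

Since nothing is replaced or removed, a.em is still available afterwards.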

Using regex to extract all the html attrs

I want to use the re module to extract all the HTML nodes from a string, including all their attrs. However, I want each attr to be a group, so that I can use matchobj.group() to get them. The number of attrs in a node is flexible, and this is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I can only get two groups, [('a'), ('style="bbb")].
I know there are some good HTML parsers. But actually I am not going to extract the values of the attrs. I need to modify the raw string.
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}
Description
To capture an arbitrary number of attributes, it needs to be a two-step process: first you pull out each entire element, then you iterate through the elements and collect the matched attributes of each one.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""
# Step 1: match each opening tag as a whole.
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) : ", matchElementObj.group(0))
    # Step 2: match the attributes inside the element that was just found.
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print("matchAttributesObj.group(0) : ", matchAttributesObj.group(0))
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
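To get each attribute's name and value as separate groups, which is what the question asks for, capture groups can be added to the attribute pattern. The sketch below reuses the two patterns from this answer; the capture groups and the names ELEMENT_RE, ATTR_RE and attributes_per_element are additions made here for illustration, not part of the original answer.

```python
import re

# Pattern for a whole opening tag, taken from the answer above.
ELEMENT_RE = re.compile(
    r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>'
)
# Attribute pattern from the answer, with groups added for name and raw value.
ATTR_RE = re.compile(
    r'\s(\w+)=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)'
)


def attributes_per_element(markup):
    """Return (element_text, [(name, raw_value), ...]) for each opening tag."""
    result = []
    for element in ELEMENT_RE.finditer(markup):
        # Search only inside the element that was just matched.
        attrs = ATTR_RE.findall(element.group(0))
        result.append((element.group(0), attrs))
    return result


for element_text, attrs in attributes_per_element("<a href='aaa' style='bbb'>"):
    print(element_text, attrs)
```

The raw values keep their surrounding quotes, which may be convenient when the goal is to modify the original string rather than to interpret the values.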
