How to eliminate a specific part of an HTML file in Python

I am working on an HTML file which has Item 1, Item 2, and Item 3. I want to delete all the text that comes after the LAST Item 2. There may be more than one Item 2 in the file. I am using this, but it does not work:
>>> import re
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> a = re.search('(?<=<B>)Item 2.', text)
>>> b = a.group(0)
>>> newText = text.partition(b)[0]
>>> newText
'<A href="#106">'
It deletes the text after the first Item 2, not the last one.

I'd use BeautifulSoup to parse the HTML and modify it. You might want to use the decompose() or extract() method.
BeautifulSoup is nice because it's pretty good at parsing malformed HTML.
For your specific example:
>>> import bs4
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> soup = bs4.BeautifulSoup(text)
>>> soup.b.next_sibling.extract()
u' this is an example this is an example'
>>> soup
<html><body><a href="#106">Item 2. <b>Item 2. Properties</b></a></body></html>
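For the actual goal in the question (dropping everything after the last Item 2 in the file), one approach is to locate the last matching text node and extract every node that follows it. A rough sketch, assuming a reasonably recent bs4 (where text= is spelled string=) and that the markers sit in text nodes:
import re
import bs4

html = '<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example'
soup = bs4.BeautifulSoup(html, "html.parser")

# last text node containing "Item 2."
last = soup.find_all(string=re.compile(r"Item 2\."))[-1]

# drop every node that comes after it in document order
for node in list(last.next_elements):
    node.extract()

print(soup)   # <a href="#106">Item 2. <b>Item 2. Properties</b></a>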
If you really want to use regular expressions, note that a non-greedy regex stops at the first Item 2:
>>> import re
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> m = re.match(".*?Item 2\.", text)
>>> m.group(0)
'<A href="#106">Item 2.'
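A greedy .* backtracks from the end of the string, so a variant along these lines should anchor to the last occurrence, which is what the question asks for:
>>> m = re.match(r".*Item 2\.", text)   # greedy: runs to the LAST "Item 2."
>>> m.group(0)
'<A href="#106">Item 2. <B>Item 2.'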

Related

How to parse data from nested HTML tags with beautifulsoup [duplicate]


Matching a group with OR condition in pattern

I am trying to extract the text between <th> tags from an HTML table. The following code explains the problem
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile('<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):print i.group(1)
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to catch the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.
Why not use an HTML parser instead? BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']
Also note that str is a bad choice for a variable name, since it shadows the built-in str.
You can reduce the regex to a single capturing group, as below:
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
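With the original alternation, each match fills exactly one of the two groups, so you can take whichever is not None; the single-group pattern avoids that bookkeeping altogether. For example, something like:
import re

searchstr = '<th class="c1">data 1</th><th>data 2</th>'

# Alternation: per match, exactly one of the two groups is filled
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
print([m.group(1) if m.group(1) is not None else m.group(2)
       for m in p.finditer(searchstr)])    # ['data 1', 'data 2']

# Single group: [^>]* absorbs any attributes, and \b keeps <thead> from matching
p2 = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
print(p2.findall(searchstr))               # ['data 1', 'data 2']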

Python: count the number of letters on a scraped page

I'm making requests in Python with requests.
I then use bs4 to select the wanted div. I now want to count the length of the text in that div, but the string I get out of it includes all the tags too, for example:
<div><a class="some_class">Text here!</a></div>
I want to only count the Text here!, without all the div and a tags.
Anyone have any idea how I could do that?
Do you mean:
tag.text
or
tag.string
tag here means the tag you found with soup.find(). Check the documentation for more details.
Here is a simple demo that helps you understand what I mean:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div><a class="some_class">Text here!</a></div></body></html>', "html.parser")
>>> tag = soup.find('div')
>>> tag
<div><a class="some_class">Text here!</a></div>
>>> tag.string
'Text here!'
>>> tag.text
'Text here!'
>>>
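One caveat: tag.string only has a value when the tag contains a single string; with nested tags it is None, while tag.text always concatenates the descendant text. For example:
>>> soup = BeautifulSoup('<div>Text <b>here</b>!</div>', "html.parser")
>>> print(soup.div.string)
None
>>> soup.div.text
'Text here!'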
As for counting the length of the text, do you mean using len()?
>>> tag.text
'Text here!'
>>> len(tag.text)
10
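Putting it together with requests (the URL and the div selection here are just placeholders to show the shape of it):
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/page").text    # hypothetical URL
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")                      # the div you selected with bs4
print(len(div.get_text(strip=True)))        # length of the text only, no tags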

What do group() and contents[] mean?

I'm learning about the re and BeautifulSoup modules. I have a doubt about a few lines of the following code: I don't know what group() is used for, or what the number inside the brackets in contents[] means.
from bs4 import BeautifulSoup
import urllib2
import re
url = 'http://www.ebay.es/itm/LOTE-5-BOTES-CERVEZAARGUS-SET-5-BEER-CANSLOT-5-CANETTES-BIRES-LATTINE-BIRRA-/321162173293' #raw_input('URL: ')
code = urllib2.urlopen(url).read();
soup = BeautifulSoup(code)
tag = soup.find('span', id='v4-27').contents[0]
price_string = re.search('(\d+,\d+)', tag).group(1)
precio_final = float(price_string.replace(',' , '.'))
print precio_final
.contents returns a list of a tag's direct children. For example:
>>> from bs4 import BeautifulSoup as BS
>>> soup = BS('<span class="foo"> bar baz <a>link</a></span>')
>>> print soup.find('span').contents
[u' bar baz ', <a>link</a>]
[0] accesses the first element of the list that .contents returns. In the example above, that is u' bar baz '.
.group(1) returns the first parenthesised group of the match; group(0) is the entire match, and numbered groups start at 1. With the pattern (\d+,\d+), group(1) is the whole matched price, something that looks like n1,n2.
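A quick illustration of the difference, with a made-up price string:
>>> import re
>>> m = re.search(r'EUR (\d+,\d+)', 'Precio: EUR 12,34')
>>> m.group(0)   # the entire match
'EUR 12,34'
>>> m.group(1)   # the first parenthesised group
'12,34'
>>> float(m.group(1).replace(',', '.'))
12.34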

Only extracting text from this element, not its children

I want to extract only the text from the top-most element of my soup; however soup.text gives the text of all the child elements as well:
I have
import BeautifulSoup
soup=BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text
The output to this is yesno. I want simply 'yes'.
What's the best way of achieving this?
Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.
What about .find(text=True)?
>>> BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>').find(text=True)
u'yes'
>>> BeautifulSoup.BeautifulSoup('<html><b>no</b>yes</html>').find(text=True)
u'no'
EDIT:
I think that I've understood what you want now. Try this:
>>> BeautifulSoup.BeautifulSoup('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
u'yes'
>>> BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
u'yes'
You could use .contents:
>>> print soup.html.contents[0]
yes
or, to get all the text nodes that are direct children of html, use findAll(text=True, recursive=False):
>>> soup = BeautifulSoup.BeautifulSoup('<html>x<b>no</b>yes</html>')
>>> soup.html.findAll(text=True, recursive=False)
[u'x', u'yes']
The above, joined into a single string:
>>> ''.join(soup.html.findAll(text=True, recursive=False))
u'xyes'
This works for me in bs4:
import bs4
node = bs4.BeautifulSoup('<html><div>A<span>B</span>C</div></html>').find('div')
print "".join([t for t in node.contents if type(t)==bs4.element.NavigableString])
output:
AC
You might want to look into lxml's soupparser module, which has support for XPath:
>>> from lxml.html.soupparser import fromstring
>>> s1 = '<html>yes<b>no</b></html>'
>>> s2 = '<html><b>no</b>yes</html>'
>>> soup1 = fromstring(s1)
>>> soup2 = fromstring(s2)
>>> soup1.xpath("text()")
['yes']
>>> soup2.xpath("text()")
['yes']
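For completeness, roughly the same thing in current bs4, where text= is spelled string= (using the built-in html.parser, which does not insert a <body> element):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><b>no</b>yes</html>', "html.parser")
>>> soup.html.find(string=True, recursive=False)
'yes'
>>> ''.join(soup.html.find_all(string=True, recursive=False))
'yes'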
