How to not capture a string with regex - python

I have this string:
<div class"ewSvNa"><a class="ugP" href="link">Description</a><span data-testid=""><small>$</small><span>0,00</span></div>
and this regex /ewS.*?ugP\".*?f=\"(.*?)\">(.*?)<.*?<s.*?n>(.*?)</g. The result is:
Group 1 = 'link'
Group 2 = 'Description'
Group 3 = '0,00'
My question is: is it possible to get the result of Group 3 as '$0,00'?
Thank you, guys =]

It's recommended not to use regex to parse HTML; use a proper parser such as Beautiful Soup instead.
Then your code becomes:
from bs4 import BeautifulSoup
text = '<div class"ewSvNa"><a class="ugP" href="link">Description</a><span data-testid=""><small>$</small><span>0,00</span></div>'
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
soup = BeautifulSoup(text, 'html.parser')
amount = soup.select_one('span[data-testid]').get_text()
# '$0,00'
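If a regex is still wanted, keep in mind that a single capture group can only hold one contiguous piece of the input, so it cannot skip the tags that sit between '$' and '0,00'. A minimal sketch, assuming the markup always contains the <small> and <span> elements in this order, is to capture the two pieces separately and join them afterwards. The extended pattern and the variable names are illustrative and not part of the question.

```python
import re

text = ('<div class"ewSvNa"><a class="ugP" href="link">Description</a>'
        '<span data-testid=""><small>$</small><span>0,00</span></div>')

# Same idea as the question's regex, but the currency symbol and the number
# get one group each, because one group cannot skip the tags between them.
pattern = re.compile(
    r'ewS.*?ugP".*?f="(.*?)">(.*?)<.*?<small>(.*?)</small><span>(.*?)<'
)

match = pattern.search(text)
link, description, currency, number = match.groups()
amount = currency + number  # join the two captured pieces

print(link, description, amount)
```

The concatenation happens in Python rather than in the pattern, which keeps the regex itself close to the original.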

Related

Python regex: getting text from html elements with similar structure [duplicate]

For some reason I need to use regular expressions to extract some data from a web site. The data has similar HTML structure, only text differs.
For simplicity I show it this way:
p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6">Herr, Dam</div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6">Boots</div>'
s = p + t
I am only interested in 'Gender', which means I want to extract only 'Herr' and 'Dam'.
So far I have come up with two options, neither of which works:
m = re.findall("Gender.+?<div.+?>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)
gives:
['Herr']
I guess that is because it is non-greedy.
But if I make it greedy:
re.findall("Gender.+?<div.+>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)
It returns:
['Boots']
So I am struggling to figure out how to get both 'Herr' and 'Dam' and nothing more?

You can use BeautifulSoup like this:
from bs4 import BeautifulSoup
a = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6">Herr, Dam</div>'
soup = BeautifulSoup(a, "html.parser")
if 'Gender' in str(soup.find_all('div')):
    for ana in soup.find_all('div'):
        # Each value is expected to sit in its own <a> tag
        for i in ana.find_all('a'):
            print(i.next_element)
Output (when each value is wrapped in an <a> tag, which the simplified string above leaves out):
Herr
Dam
I would recommend adding a name attribute to the divs so that the correct tags are easier to find:
p = '<div name="Gender" class="col-xs-6"><p>Gender:</p></div><div name="Gender" class="col-xs-6">Herr, Dam</div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6">Boots</div>'
a = p + t
soup = BeautifulSoup(a, "html.parser")
for ana in soup.find_all('div', {"name": "Gender"}):
    for i in ana.find_all('a'):
        print(i.next_element)
Output, provided the values are wrapped in <a> tags:
Herr
Dam
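If a regex really is required, as the question says, one possible sketch for the simplified string is to capture the whole text of the first div after the 'Gender:' label and split it on commas in Python. It assumes the value div contains plain text with no nested tags; the pattern and the names match and values are illustrative.

```python
import re

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6">Herr, Dam</div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6">Boots</div>'
s = p + t

# [^<]+ stops at the next tag, so the capture cannot run on into the
# 'Kategori' block the way a greedy .+ does.
match = re.search(r'Gender:.*?<div[^>]*>([^<]+)</div>', s, re.DOTALL)
values = [v.strip() for v in match.group(1).split(',')] if match else []

print(values)
```

Splitting after the match avoids having to express "one or more comma-separated words" inside the pattern itself.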

Matching a group with OR condition in pattern

I am trying to extract the text between <th> tags from an HTML table. The following code shows the problem:
import re
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to capture the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.

Why not use an HTML parser instead? With BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
['data 1', 'data 2']
Also note that str is a bad choice for a variable name, because it shadows the built-in str.

You can reduce the regex to a single capturing group, like this:
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
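A short sketch of that pattern in use. The \b after th requires a word boundary, so <thead> is not mistaken for <th>. The sample string with the added <thead> and <tr> wrappers is made up for illustration.

```python
import re

# Sample input; the <thead>/<tr> wrappers are added here to exercise \b.
searchstr = '<thead><tr><th class="c1">data 1</th><th>data 2</th></tr></thead>'

th_pattern = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')

# findall returns the single group for every match, with or without attributes.
cells = th_pattern.findall(searchstr)
print(cells)
```

Because there is only one group, every match reports its text in group(1) and no alternative branch is left as None.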

get a string between a tag (TEST in <div><p>p1</p>TEST<p>p2</p></div>)

Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', 'html.parser')
print(soup.div())
Result:
[<p>p1</p>, <p>p2</p>]
How come the string TEST isn't in the result set? How can I get it?

soup.div() is a shortcut for soup.div.find_all(), which finds all tags inside the div tag; as you can see, it does its job. TEST is text between the p tags or, in other words, the tail of the first p tag.
You can get the TEST string by getting the first p tag and using .next_sibling:
>>> soup.div.p.next_sibling
'TEST'
Or, by getting the second element of the div's .contents:
>>> soup.div.contents[1]
'TEST'
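When the position of the text inside the div is not known in advance, another option is to ask for the div's direct string children only. This is a sketch that assumes bs4 is installed and uses the built-in html.parser; the name direct_text is illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', 'html.parser')

# string=True matches text nodes; recursive=False limits the search to
# direct children, so the text inside the <p> tags is skipped.
direct_text = soup.div.find_all(string=True, recursive=False)
print(direct_text)
```

Unlike .contents[1], this does not depend on the text being the second child of the div.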

from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', 'html.parser')
print(soup.div.text)
p1TESTp2

Finding the second of the same two words in a line

I am using line.rfind() to find a certain line in an html page and then I am splitting the line to pull out individual numbers. For example:
position1 = line.rfind('Wed')
This finds this particular line of html code:
<strong class="temp">79<span>°</span></strong><span class="low"><span>Lo</span> 56<span>°</span></span>
First I want to pull out the '79', which is done with the following code:
if position1 > 0:
    self.high0 = lines[line_number + 4].split('<span>')[0].split('">')[-1]
This works perfectly. The problem I am encountering is extracting the '56' from that line of html code. I can't split it between '<span>' and '</span>', since the first '<span>' it finds in the line is after the '79'. Is there a way to tell the script to look for the second occurrence of '<span>'?
Thanks for your help!

Concerns about parsing HTML with regex aside, I've found that regex tends to be fairly useful for grabbing information from limited, machine-generated HTML.
You can pull out both values with a regex like this:
import re
matches = re.findall(r'<strong class="temp">(\d+).*?<span>Lo</span> (\d+)', lines[line_number+4])
if matches:
    high, low = matches[0]
Consider this quick-and-dirty: if you rely on it for a job, you may want to use a real parser like BeautifulSoup.
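For reference, here is a self-contained sketch of that approach run against the sample line from the question. It uses re.search instead of re.findall and \s* instead of a literal space, both of which are small variations and not part of the answer above; the conversion to int is also an addition.

```python
import re

line = ('<strong class="temp">79<span>°</span></strong>'
        '<span class="low"><span>Lo</span> 56<span>°</span></span>')

high = low = None
# Group 1: the number right after the "temp" tag; group 2: the number
# that follows the <span>Lo</span> label.
match = re.search(r'<strong class="temp">(\d+).*?<span>Lo</span>\s*(\d+)', line)
if match:
    high, low = (int(g) for g in match.groups())

print(high, low)
```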

import re
html = """
<strong class="temp">79<span>°</span></strong><span class="low"><span>Lo</span> 56<span>°</span></span>
"""
numbers = re.findall(r"\d+", html)
print(numbers)
--output:--
['79', '56']
With BeautifulSoup:
from bs4 import BeautifulSoup
html = """
<strong class="temp">
79
<span>°</span>
</strong>
<span class="low">
<span>Lo</span>
56
<span>°</span>
</span>
"""
soup = BeautifulSoup(html, 'html.parser')
low_span = soup.find('span', class_="low")
for string in low_span.stripped_strings:
    print(string)
--output:--
Lo
56
°
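To answer the literal question about finding the second occurrence: str.find accepts a start index, so the search can resume just past the first hit. A sketch using only string methods, assuming the line always has the layout shown in the question; the variable names are illustrative.

```python
line = ('<strong class="temp">79<span>°</span></strong>'
        '<span class="low"><span>Lo</span> 56<span>°</span></span>')

# First '<span>' is the degree sign after 79; resume the search after it.
first = line.find('<span>')
second = line.find('<span>', first + 1)  # this one is '<span>Lo</span>'

# The low temperature sits between the end of that label and the next '<span>'.
after_label = line.index('</span>', second) + len('</span>')
next_span = line.find('<span>', after_label)
low = line[after_label:next_span].strip()

print(low)
```

Note that '<span class="low">' does not count as an occurrence here, because the search string includes the closing '>'.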

Using regex to extract all the html attrs

I want to use the re module to extract all the HTML nodes from a string, including all their attrs. However, I want each attr to be its own group, so that I can use matchobj.group() to get them. The number of attrs in a node is flexible, and this is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I only get two groups, [('a'), ('style="bbb"')].
I know there are some good HTML parsers, but I am not actually trying to extract the values of the attrs; I need to modify the raw string.

Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}

Description
To capture an arbitrary number of attributes, this needs to be a two-step process: first pull out each entire element, then iterate over the elements and collect the matched attributes of each one.
Regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
Regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""
# Step 1: find each opening tag, skipping over quoted attribute values.
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) :", matchElementObj.group(0))
    # Step 2: find the attributes inside this element only.
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        # strip() drops the leading whitespace that the pattern includes
        print("matchAttributesObj.group(0) :", matchAttributesObj.group(0).strip())
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
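Python's re module keeps only the last match of a repeated group, which is why a single pattern cannot return every attribute as its own group. The two-step idea can be wrapped so that each attribute comes back as a (name, value) pair. This is a sketch under assumptions: the capture groups added to both patterns, the $ in the lookahead, and the names ELEMENT_RE, ATTR_RE and parse_tags are illustrative additions, not part of the answer above.

```python
import re

# Group 1: tag name; group 2: the raw attribute text before the closing '>'.
ELEMENT_RE = re.compile(
    r'''<(\w+)(?=\s|>)((?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?)>'''
)
# Group 1: attribute name; group 2: value, still carrying its quotes if any.
ATTR_RE = re.compile(
    r'''\s(\w+)=('[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>|$)'''
)


def parse_tags(html):
    """Return [(tag, [(attr_name, attr_value), ...]), ...] for opening tags."""
    result = []
    for element in ELEMENT_RE.finditer(html):
        attrs = []
        for name, value in ATTR_RE.findall(element.group(2)):
            # Remove the surrounding quotes from quoted values.
            if value[0] in '\'"':
                value = value[1:-1]
            attrs.append((name, value))
        result.append((element.group(1), attrs))
    return result


sample = '''<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>'''
tags = parse_tags(sample)
print(tags)
```

Since the goal in the question is to modify the raw string, element.start() and element.end() from the first step give the offsets of each tag if they are needed for rewriting in place.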
