Python regex: getting text from html elements with similar structure [duplicate] - python

For some reason I need to use regular expressions to extract some data from a web site. The data has similar HTML structure, only text differs.
For simplicity I show it this way:
p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6">Herr, Dam</div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6">Boots</div>'
s = p + t
I am only interested in 'Gender' which means I want to extract 'Herr' and 'Dam' only.
So far I came up with two options - both not working:
m = re.findall(r"Gender.+?<div.+?>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)
gives:
['Herr']
I guess because it is non-greedy
But if I make it greedy:
re.findall(r"Gender.+?<div.+>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)
It returns:
['Boots']
So I am struggling to figure out how to get both 'Herr' and 'Dam' and nothing more?

You can use BeautifulSoup like this:
from bs4 import BeautifulSoup
a='<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6">Herr, Dam</div>'
soup = BeautifulSoup(a,"html.parser")
if 'Gender' in (str(soup.findAll('div'))):
    for ana in soup.findAll('div'):
        for i in ana.findAll('a'):
            print(i.next_element)
Output:
Herr
Dam
I would recommend adding a name attribute to the divs, so that the correct tags are easier to identify:
p = '<div name="Gender" class="col-xs-6"><p>Gender:</p></div><div name="Gender" class="col-xs-6">Herr, Dam</div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6">Boots</div>'
a = p + t
soup = BeautifulSoup(a,"html.parser")
for ana in soup.findAll('div',{"name":"Gender"}):
    for i in ana.findAll('a'):
        print(i.next_element)
Output:
Herr
Dam
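If regex really is a requirement: on the sample string as posted, the class [\w ]+ cannot match the comma in "Herr, Dam", and the value is plain text rather than a link. A minimal regex sketch for that exact string, assuming the value always sits in the <div> directly after the one holding the label:

```python
import re

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6">Herr, Dam</div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6">Boots</div>'
s = p + t

# [^>]* stays inside a single tag, so the match cannot run on into the next row.
# [^<]+ captures everything up to the next tag, commas included.
match = re.search(r'Gender:</p>\s*</div>\s*<div[^>]*>([^<]+)</div>', s)
genders = [g.strip() for g in match.group(1).split(',')] if match else []
print(genders)  # ['Herr', 'Dam']
```

Anchoring on the literal label and using negated character classes instead of .+ is what keeps 'Boots' out of the result. It will still break as soon as the markup changes shape.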

Related

How not capture a string with regex

I have this string:
<div class"ewSvNa"><a class="ugP" href="link">Description</a><span data-testid=""><small>$</small><span>0,00</span></div>
and this regex /ewS.*?ugP\".*?f=\"(.*?)\">(.*?)<.*?<s.*?n>(.*?)</g. The result is:
Group 1 = 'link'
Group 2 = 'Description'
Group 3 = '0,00'
My question is: is it possible to get Group 3 as '$0,00'?
Thank you!
It's recommended not to use regex to parse HTML; instead, use a proper parser such as Beautiful Soup.
Then your code becomes:
from bs4 import BeautifulSoup
text = '<div class"ewSvNa"><a class="ugP" href="link">Description</a><span data-testid=""><small>$</small><span>0,00</span></div>'
soup = BeautifulSoup(text, "html.parser")
amount = soup.select_one('span[data-testid]').get_text()
# '$0,00'
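As for the regex itself: a single group can only capture one contiguous slice of the input, and here '$' and '0,00' are separated by '</small><span>'. One way around that, sketched below on the string from the question, is to capture the two pieces in separate groups and join them afterwards. The extra group and the variable names are my own additions.

```python
import re

text = '<div class"ewSvNa"><a class="ugP" href="link">Description</a><span data-testid=""><small>$</small><span>0,00</span></div>'

# Same pattern as in the question up to the description, then one group
# for the currency symbol and one for the number.
pattern = re.compile(
    r'ewS.*?ugP".*?f="(.*?)">(.*?)<.*?<small>(.*?)</small>\s*<span>(.*?)</span>'
)
m = pattern.search(text)
if m:
    link, description = m.group(1), m.group(2)
    amount = m.group(3) + m.group(4)  # join the two captured pieces
    print(link, description, amount)
```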

Extract non numeric chars between html tags [closed]

I've been trying unsuccessfully to extract words up to the numeric chars from the below:
<div class="text">hello there 234 44</div>
Here is what I am doing:
regex_name = re.compile(r'<div class="text">([^\d].+)</div>')
As a starting point, I'd use BeautifulSoup HTML parser to find the desired element in the HTML input and extract the element's text.
Then, I'd use itertools.takewhile() to get all the characters in a string until a digit is met:
In [1]: from itertools import takewhile
In [2]: from bs4 import BeautifulSoup
In [3]: data = """<div class="text">hello there 234 44</div>"""
In [4]: soup = BeautifulSoup(data, "html.parser")
In [5]: text = soup.find("div", class_="text").get_text()
In [6]: ''.join(takewhile(lambda x: not x.isdigit(), text))
Out[6]: 'hello there '
You may want to use a positive lookbehind assertion:
(?<=">)[^\d]+
^^^^^^^
In Python:
import re
s = """<div class="text">A hawking party 64 x 48 1/2in (163 x 123.3cm)</div>"""
r = r"(?<=\">)[^\d]+"
o = re.findall(r, s)
print(o)
# ['A hawking party ']
data = '<div class="text">A hawking party 64 x 48 1/2in (163 x 123.3cm)</div>'
final = ''
# Strip the wrapping tags, then copy characters until the first digit
for i in data.replace('<div class="text">', '').replace('</div>', ''):
    if not i.isdigit():
        final += i
    else:
        break
print(final)
results in
A hawking party
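For reference, the pattern in the question only requires the first captured character to be a non-digit; the .+ after it then matches everything else, digits included. A minimal sketch of a corrected pattern, assuming the text never contains a "<" before the first digit:

```python
import re

s = '<div class="text">hello there 234 44</div>'

# [^\d<]+ matches one or more characters that are neither digits nor "<",
# so the capture stops at the first digit (or at the closing tag).
m = re.search(r'<div class="text">([^\d<]+)', s)
words = m.group(1).strip() if m else ''
print(words)  # hello there
```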

Matching a group with OR condition in pattern

I am trying to extract the text between <th> tags from an HTML table. The following code explains the problem
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to catch the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.
Why not use an HTML parser instead? BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
['data 1', 'data 2']
Also note that str is a bad choice for a variable name - you are shadowing a built-in str.
You can also reduce the regex to a single capturing group:
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
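To see why \b matters here: it requires a word boundary right after "th", so <thead> (where "th" is followed by "e") is skipped, while both <th> and <th class=...> match. A quick check, using a made-up sample that wraps the string from the question in a <thead> row:

```python
import re

# Sample from the question, wrapped in a hypothetical <thead> row for illustration
searchstr = '<thead><tr><th class="c1">data 1</th><th>data 2</th></tr></thead>'

# \b rejects <thead>; [^>]* skips any attributes; (?s) lets .*? span newlines
pattern = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
cells = pattern.findall(searchstr)
print(cells)  # ['data 1', 'data 2']
```

With the original two-alternative pattern each alternative gets its own group number, so `m.group(1) or m.group(2)` would also work, but a single group is simpler to maintain.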

Finding the second of the same two words in a line

I am using line.rfind() to find a certain line in an html page and then I am splitting the line to pull out individual numbers. For example:
position1 = line.rfind('Wed')
This finds this particular line of html code:
<strong class="temp">79<span>°</span></strong><span class="low"><span>Lo</span> 56<span>°</span></span>
First I want to pull out the '79', which is done with the following code:
if position1 > 0:
    self.high0 = lines[line_number + 4].split('<span>')[0].split('">')[-1]
This works perfectly. The problem I am encountering is trying to extract the '56' from that line of html code. I can't split it between '<span>' and '</span>', since the first '<span>' it finds in the line is after the '79'. Is there a way to tell the script to look for the second occurrence of '<span>'?
Thanks for your help!
Concerns about parsing HTML with regex aside, I've found that regex tends to be fairly useful for grabbing information from limited, machine-generated HTML.
You can pull out both values with a regex like this:
import re
matches = re.findall(r'<strong class="temp">(\d+).*?<span>Lo</span> (\d+)', lines[line_number+4])
if matches:
    high, low = matches[0]
Consider this quick-and-dirty: if you rely on it for a job, you may want to use a real parser like BeautifulSoup.
import re
html = """
<strong class="temp">79<span>°</span></strong><span class="low"><span>Lo</span> 56<span>°</span></span>
"""
numbers = re.findall(r"\d+", html, re.X|re.M|re.S)
print(numbers)
--output:--
['79', '56']
With BeautifulSoup:
from bs4 import BeautifulSoup
html = """
<strong class="temp">
79
<span>°</span>
</strong>
<span class="low">
<span>Lo</span>
56
<span>°</span>
</span>
"""
soup = BeautifulSoup(html, "html.parser")
low_span = soup.find('span', class_="low")
for string in low_span.stripped_strings:
    print(string)
--output:--
Lo
56
°
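To answer the literal question about the second '<span>': str.split returns every piece, so the text after the second '<span>' is simply index 2 of the result. A small sketch on the sample line, assuming the markup always has exactly this shape (note that '<span class="low">' is not a split point, because it is not the exact string '<span>'):

```python
line = ('<strong class="temp">79<span>°</span></strong>'
        '<span class="low"><span>Lo</span> 56<span>°</span></span>')

parts = line.split('<span>')
# parts[0] ends with the high temperature, as in the question's own code
high = parts[0].split('">')[-1]
# parts[2] is 'Lo</span> 56': the piece after the second '<span>'
low = parts[2].split('</span>')[-1].strip()
print(high, low)  # 79 56
```

This is as brittle as any positional split: one extra '<span>' earlier in the line shifts the index.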

Using regex to extract all the html attrs

I want to use the re module to extract all the HTML nodes from a string, including all their attrs. However, I want each attr to be its own group, so that I can use matchobj.group() to get them. The number of attrs in a node is flexible, and this is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I only get two groups: [('a'), ('style="bbb")].
I know there are some good HTML parsers. But actually I am not going to extract the values of the attrs. I need to modify the raw string.
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, "html.parser")
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}
Description
To capture an arbitrary number of attributes, it needs to be a two-step process: first you pull out each entire element, then you iterate through the elements and collect the matched attributes from each one.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""
# Step 1: find each opening tag, attributes included
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) :", matchElementObj.group(0))
    # Step 2: find each attribute inside the tag that was just matched
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print("matchAttributesObj.group(0) :", matchAttributesObj.group(0))
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
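In Python's re module a repeated capturing group keeps only its last match, which is why the single-regex attempt in the question returned just the final attribute. The two-step idea can be wrapped up as below. This is only a sketch: the patterns are the ones above with capture groups added, and extract_attrs is a hypothetical helper name, not part of any library.

```python
import re

# Element pattern from the answer, with groups for the tag name and the raw attribute text
ELEMENT_RE = re.compile(
    r"""<(\w+)(?=\s|>)((?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?)>"""
)
# Attribute pattern from the answer, with groups for name and raw value;
# $ is added to the lookahead because the closing ">" is no longer part of the text
ATTR_RE = re.compile(
    r"""\s(\w+)=('[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>|$)"""
)

def extract_attrs(html):
    """Return a list of (tag_name, {attr_name: raw_value}) pairs, quotes kept as written."""
    result = []
    for element in ELEMENT_RE.finditer(html):
        tag, attr_text = element.group(1), element.group(2)
        attrs = {m.group(1): m.group(2) for m in ATTR_RE.finditer(attr_text)}
        result.append((tag, attrs))
    return result

sample = """<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>"""
for tag, attrs in extract_attrs(sample):
    print(tag, attrs)
```

Values keep their original quotes, which seems to fit the stated goal of modifying the raw string rather than interpreting it.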
