I want to use the re module to extract all the HTML nodes from a string, including all their attributes. However, I want each attribute to be a separate group, so that I can use matchobj.group() to get them. The number of attributes in a node is flexible, which is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I only get two groups, ('a', " style='bbb'"), because a repeated group only keeps its last repetition.
I know there are some good HTML parsers, but I am not actually going to extract the values of the attrs; I need to modify the raw string.
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, "html.parser")
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}
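Since the question says the real goal is to modify the raw string rather than read attribute values, note that BeautifulSoup can also rewrite attributes in place and serialize the result back to a string. A minimal sketch (the attribute values "ccc" etc. are just illustrative):

```python
from bs4 import BeautifulSoup

html = """<a href='aaa' style='bbb'>link</a>"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("a"):
    tag["href"] = "ccc"   # rewrite an attribute in place
    del tag["style"]      # or drop one entirely
print(str(soup))  # <a href="ccc">link</a>
```

str(soup) gives you the modified markup back, so there is no need to capture every attribute in a regex group first.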
Description
To capture an arbitrary number of attributes you need a two-step process: first pull out each entire element, then iterate through the elements and extract an array of matched attributes from each one.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re

string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""

for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M | re.I | re.S):
    print("-------")
    print("matchElementObj.group(0) : ", matchElementObj.group(0))
    # search for attributes inside the matched element, not the whole string
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M | re.I | re.S):
        print("matchAttributesObj.group(0) : ", matchAttributesObj.group(0))
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
Related
I need to extract the data between a closing tag and the next tag in the code snippet below:
<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
What I need is : W, 65, 3
But the problem is that these values can be empty too, like:
<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>
I want to get these values if present, else an empty string.
I tried making use of nextSibling and find_next('br'), but when the values (W, 65, 3) are not present between the </b> and <br> tags it returned
<br><b>Second Type :</b><br><b>Third Type :</b></br></br>
and
<br><b>Third Type :</b></br>
All I need is for it to return an empty string if nothing is present between those tags.
I would go <b> tag by <b> tag, looking at what type of info each tag's next_sibling contains.
I would just check whether next_sibling.string is None, and append to the list accordingly :)
>>> html = """<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> b = soup.find_all("b")
>>> data = []
>>> for tag in b:
if tag.next_sibling.string == None:
data.append(" ")
else:
data.append(tag.next_sibling.string)
>>> data
[' ', u'65', u'3'] # Having removed the first string
Hope this helps!
I would search for a td object then use a regex pattern to filter the data that you need, instead of using re.compile in the find_all method.
Like this:
import re
from bs4 import BeautifulSoup
example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third
Type :</b>3</td>
<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""
soup = BeautifulSoup(example, "html.parser")
for o in soup.find_all('td'):
    match = re.findall(r'</b>\s*(.*?)\s*(<br|</br)', str(o))
    print("%s,%s,%s" % (match[0][0], match[1][0], match[2][0]))
This pattern finds all text between the </b> tag and <br> or </br> tags. The </br> tags are added when converting the soup object to string.
This example outputs:
W,65,3
,69,6
Just an example, you can alter to return an empty string if one of the regex matches is empty.
In [5]: [child for child in soup.td.children if isinstance(child, str)]
Out[5]: ['W', '65', '3']
That text and those tags are the td's children; you can access them using .contents (a list) or .children (a generator):
In [4]: soup.td.contents
Out[4]:
[<b>First Type :</b>,
'W',
<br/>,
<b>Second Type :</b>,
'65',
<br/>,
<b>Third Type :</b>,
'3']
Then you can get the text by testing whether each child is an instance of str.
I think this works:
from bs4 import BeautifulSoup
html = '''<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>'''
soup = BeautifulSoup(html, 'lxml')
td = soup.find('td')
string = str(td)
list_tags = string.split('</b>')
list_needed = []
for i in range(1, len(list_tags)):
    if list_tags[i][0] == '<':
        list_needed.append('')
    else:
        # take everything up to the next tag, not just the first character
        list_needed.append(list_tags[i].split('<')[0])
print(list_needed)
# ['W', '65', '3']
Because the values you want always come right after the closing </b> tags, it's easy to catch them this way, with no need for re.
In python I have a list with string items that look like that:
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
...
And what I want to do is to substitute (in every list item) the href syntax leaving only the link as text, so my list would look like:
My website is WEBSITE1
The link is LINK1
...
I was thinking about matching and replacing this regex:
<a href="(.*?)" target='_blank'><u>(.*?)</u></a>
with:
(.*?)
but it doesn't work; it seems too complex. Is there an easy way to get a list object with cleaned items as output?
You can also process the string with an HTML parser, e.g. BeautifulSoup and its replace_with() - finding all a elements in a string and replacing them with the link texts:
>>> from bs4 import BeautifulSoup
>>> l = [
... """My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>""",
... """The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>"""
... ]
>>> for item in l:
...     soup = BeautifulSoup(item, "html.parser")
...     for a in soup("a"):
...         a.replace_with(a.text)
...     print(str(soup))
...
My website is WEBSITE1
The link is LINK1
Or, as pointed out by @user3100115 in the comments, just getting the text of the "soup" object also works on your sample data:
>>> for item in l:
...     print(BeautifulSoup(item, "html.parser").get_text())
...
My website is WEBSITE1
The link is LINK1
This regex seems to work:
<a\s+href\s*=\s*"([^"]+).*
Python Code
p = re.compile(r'<a\s+href\s*=\s*"([^"]+).*')
test_str = ["My website is <a href=\"WEBSITE1\" target='_blank'><u>WEBSITE1</u></a>", "The link is <a href=\"LINK1\" target='_blank'><u>LINK1</u></a>"]
for x in test_str:
    print(re.sub(p, r"\1", x))
If I had to use a regex I would use something like
<a href.*?><u>(.*?)</u></a>
and then replace in a list comprehension
pattern = re.compile(r'<a href.*?><u>(.*?)</u></a>')
print([re.sub(pattern, r"\1", string) for string in my_list])
But consider using beautifulsoup or another html parser, as pointed out in other answers, which will provide you with a more generic solution
Regex Explanation
<a href.*?> match an a href tag, non greedy, up to the first closing bracket
<u> match the u tag
(.*?) match the string you want to keep
</u></a> match the closing tags
Retrieve the parenthesized capture group in your re.sub:
>>>s = """
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
"""
>>> re.sub("<a href=\"(.*?)\" target='_blank'><u>(.*?)</u></a>", r'\1', s)
'\nMy website is WEBSITE1 \nThe link is LINK1 \n'
Make sure the replacement string is a raw (r'...') string; otherwise \1 is interpreted as the escape character \x01 instead of a backreference.
As your input is a list (let's assume its name is s):
>>> for i in range(len(s)):
...     s[i] = re.sub("<a href=\"(.*?)\" target='_blank'><u>(.*?)</u></a>", r'\1', s[i])
>>> s
['My website is WEBSITE1', 'The link is LINK1']
If done regularly or on a large list, you can compile the regex before the loop.
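As a sketch of that, compiling the pattern once and cleaning the whole list in a comprehension (using the sample strings from the question):

```python
import re

items = [
    """My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>""",
    """The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>""",
]

# compile once, reuse for every item in the list
pattern = re.compile(r"""<a href="(.*?)" target='_blank'><u>(.*?)</u></a>""")
cleaned = [pattern.sub(r"\1", item) for item in items]
print(cleaned)  # ['My website is WEBSITE1', 'The link is LINK1']
```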
Please clarify: your title says to remove html href tags, but in your example, you also remove the u tags.
The answer can be simplified if we are guaranteed to have no HTML tags other than a and u (or if we want to remove all tags). In that case we can search for anything between < and >, or for anything between <a or </a and >. My answer assumes this, so it will be invalid if that guarantee does not hold.
import re

S = (
    """My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>""",
    """The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>""",
)

RE1 = re.compile(r"</?[^>]*>")      # any opening or closing tag
RE2 = re.compile(r"</?[aA][^>]*>")  # only <a ...> and </a> tags

for s in S:
    s1 = RE1.sub("", s)  # remove all tags
    s2 = RE2.sub("", s)  # remove only <a> and </a> tags
    print(s)
    print(s1)
    print(s2)
    print("")
When run, it produces
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
My website is WEBSITE1
My website is <u>WEBSITE1</u>
The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
The link is LINK1
The link is <u>LINK1</u>
The first line is the original string, the second has all HTML tags removed, and the third has only the a tags removed.
I did not include a third choice: only remove the a href tags.
I am trying to extract the text between <th> tags from an HTML table. The following code explains the problem
import re

searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to catch the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.
Why not use an HTML parser instead - BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']
Also note that str is a bad choice for a variable name - you are shadowing a built-in str.
You can reduce the regex to one with a single capturing group, like below.
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
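Applied to the search string from the question, this single-group pattern captures both cells, and the \b still rules out <thead> because "th" there is followed by more word characters:

```python
import re

searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
print([m.group(1) for m in p.finditer(searchstr)])  # ['data 1', 'data 2']
```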
I have the below URL and would like to extract prices. For that I load the page into beautifulsoup:
soup = bs(content, 'lxml')
for e in soup.find_all(class_="totalPrice"):
Now I get text that looks like this (a single element of type bs4.element.Tag):
<td class="totalPrice" colspan="3">
<div data-component="track" data-hash="OLNYSRfCbdWGffSRe" data-stage="1" data-track="view"></div>
Total: £145
</td>
How can I create another find expression that will extract the 145? Is there a way to search for "Total" and then get the text just next to it?
URL with original content that I extract
Use a regex!
>>> import re
>>> search_text = 'blah Total: result'
>>> result = re.findall(r'Total: (.*)', search_text)
>>> result
['result']
If you want to be more general and capture anything that looks like currency, try this:
>>> result = re.findall(r': (£\d*)', search_text)
This will get you the currency symbol £ and any following digits.
You can get the text from the tag
text = e.get_text().strip()
and you have a normal string 'Total: £145', so you can split it
text.split(' ')  # ['Total:', '£145']
slice it
text[8:]  # '145'
or use a regular expression, etc.
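The regular-expression route can be sketched like this, working on the text as get_text() would return it for the <td> above (whitespace included, which is why a plain split needs the strip() first):

```python
import re

# text as e.get_text() would return it for the <td> shown above,
# including the surrounding whitespace
text = "\n\nTotal: £145\n"
m = re.search(r"Total:\s*£?(\d+)", text)
print(m.group(1) if m else "")  # 145
```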
Code:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', "html.parser")
print(soup.div())
Result:
[<p>p1</p>, <p>p2</p>]
How come the string TEST isn't in the result set? How can I get it?
soup.div() is a shortcut for soup.div.find_all(), which finds all tags inside the div tag - as you can see, it does the job. TEST is text between the p tags or, in other words, the tail of the first p tag.
You can get the TEST string by getting the first p tag and using .next_sibling:
>>> soup.div.p.next_sibling
u'TEST'
Or, by getting the second element of the div's .contents:
>>> soup.div.contents[1]
u'TEST'
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', "html.parser")
print(soup.div.text)
# p1TESTp2