Python and "re"

I have a tutorial on regular expressions in Python that explains how to use the re module. I wanted to grab the URL out of an A tag, so, knowing regex, I wrote the expression, tested it in my regex testing app of choice, and confirmed that it worked. When I placed it into Python, it failed:
result = re.match("a_regex_of_pure_awesomeness", "a string containing the awesomeness")
# result is None
After much head scratching I found the issue: re.match automatically expects the pattern to be at the start of the string. I have found a fix, but I would like to know how to change:
regex = ".*(a_regex_of_pure_awesomeness)"
into
regex = "a_regex_of_pure_awesomeness"
Okay, it's a standard URL regex but I wanted to avoid any potential confusion about what I wanted to get rid of and possibly pretend to be funny.

In Python, there's a distinction between "match" and "search"; match only looks for the pattern at the start of the string, and search looks for the pattern starting at any location within the string.
Python regex docs
Matching vs searching

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html, 'html.parser')
for a in soup.find_all('a', href=True):
    # do something with `a` w/ href attribute
    print(a['href'])

>>> import re
>>> pattern = re.compile("url")
>>> string = " url"
>>> pattern.match(string)
>>> pattern.search(string)
<_sre.SRE_Match object at 0xb7f7a6e8>

Are you using the re.match() or re.search() method? My understanding is that re.match() assumes a "^" at the beginning of your expression and will only search at the beginning of the text, while re.search() acts more like the Perl regular expressions and will only match the beginning of the text if you include a "^" at the beginning of your expression. Hope that helps.
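To make the difference concrete for the URL-in-an-A-tag case, here is a minimal sketch. The sample HTML string and the simple href pattern are made up for illustration (the question's actual regex is not shown), so treat both as assumptions:

```python
import re

# Hypothetical input and pattern, chosen only to illustrate match vs. search.
html = 'Some text <a href="http://example.com/page">a link</a> and more text'
pattern = r'href="([^"]*)"'

# re.match only tries the pattern at position 0, so it finds nothing here.
at_start = re.match(pattern, html)
print(at_start)

# re.search scans forward and returns the first occurrence anywhere in the string.
anywhere = re.search(pattern, html)
if anywhere:
    print(anywhere.group(1))
```

With re.search there is no need to prefix the pattern with .* just to skip the leading text.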

Related

Regex search For HTML Tag with UUID

I'm trying to match a single HTML tag with an id attribute which is a UUID. I tested it with an external resource to make sure the regex is correct with the same input string. The UUID is extracted dynamically so the string replacement is necessary.
The output I would expect is for the last line to print:
<tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef">
This is the code I tried:
content = '<tbody><tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef"><td><b>01/08/2016 14:41:00</b></td>'
ref = 'b9060ff1-015d-4089-a193-8fef57e7c2ef'
regex = '<[^>]+?id=\"%s\"[^<]*?>' % ref
element_to_link = re.search(regex, content)
print element_to_link.string
The output I get when printing is the whole input string, which would suggest the regex is incorrect. What's going on here?
Please don't suggest that I use Beautiful Soup, this should be possible with regular expressions.

Why not use the group method? This works for me:
element_to_link.group(0)

According to the Python re module documentation, the string attribute of a match object returns "The string passed to match() or search()", which is the whole input rather than the matched part. Use one of the match object's methods instead, such as group(), groups() or groupdict().
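The difference between the string attribute and group() can be seen with the data from the question. This is only a sketch: the call to re.escape is an addition that is not in the original code (it guards against the inserted value containing regex metacharacters), and the code is written for Python 3.

```python
import re

content = '<tbody><tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef"><td><b>01/08/2016 14:41:00</b></td>'
ref = 'b9060ff1-015d-4089-a193-8fef57e7c2ef'

# re.escape is an assumption added here, not part of the question's code.
regex = '<[^>]+?id="%s"[^<]*?>' % re.escape(ref)
element_to_link = re.search(regex, content)

if element_to_link:
    # .string is the text that was searched, not the part that matched.
    print(element_to_link.string == content)
    # .group(0) is the text matched by the whole pattern.
    print(element_to_link.group(0))
```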

How to return everything in a string that is not matched by a regex?

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.
import re
string='<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email'
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'
re.findall(pattern, string)
['<font size="2px" face="Tahoma">',
'<br>',
'</font>',
'<div>',
'<br>',
'</div>',
'<div>']
desired_string = "Good Morning, As per last email"

Instead of re.findall, use re.sub to replace each match with an empty string.
re.sub(pattern, "", string)
While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432
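For the literal regex route, a small self-contained sketch using the pattern from the question might look like this. The variable names html_fragment and cleaned are illustrative only, and the input is wrapped in single quotes so that the double quotes inside the HTML do not end the Python string early:

```python
import re

html_fragment = '<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email'
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

# Replace every match with an empty string; what remains is the unmatched text.
cleaned = re.sub(pattern, "", html_fragment)
print(cleaned)
```

Note that this only removes tags whose characters all appear in the character class, so tags containing other characters would be left in place.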

Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.
from bs4 import BeautifulSoup
string="""<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email"""
soup = BeautifulSoup(string, 'lxml')
print(soup.get_text())
This outputs:
Good Morning, As per last email
One thing to notice is that the &nbsp; entity was changed to a regular space using this method.

How do I ensure that re.findall() stops at the right place?

Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?

Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you want all tags, then you should consider making the pattern non-greedy (i.e. .*?):
print(re.findall(r'<title>(.*?)</title>', s))
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.

Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy

re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.

It will be much easier using BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4

Python Regex Tokenize

I'm trying to figure out how to use regular expressions in Python to extract out certain URLs in strings. For example, I might have 'blahblahblah (a href="example.com")'. In this case I want to extract all "example.com" links. How can I do that instead of just splitting the string?
Thanks!

There is a great module called BeautifulSoup (link: http://www.crummy.com/software/BeautifulSoup/) which is great for parsing HTML. You should use this instead of using regex to get info from HTML. Here's an example of BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> html = """<p> some <a href="http://link.com">HTML</a> and <a href="http://second.com">another link</a></p>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> mylist = soup.find_all('a')
>>> for link in mylist:
...     print(link['href'])
http://link.com
http://second.com
Here is a link to the documentation, which is really easy to follow: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Regexes are very powerful tools, but they might not be the right tool in all circumstances (as others have already suggested). That said, here's a minimal example from the console that uses regex, as requested:
>>> import re
>>> s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'
>>> re.findall(r'a href="(.*?)"', s)
['example.com', 'subdomain.example2.net']
Focus on r'a href="(.*?)"'. In English it translates to: "find a string beginning with a href=", then save as a result any characters until you hit the next ". The syntax is:
the () means "save only stuff in here"
the . means "any character"
the * means "any number of times"
the ? means "non greedy" or in other terms: find the shortest string that satisfies the requirements (try without the question mark and you will see what happens).
HTH!
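For anyone who does not want to try it by hand, here is roughly what "without the question mark" looks like on the same string. The third pattern, which uses a negated character class, is an alternative added here for comparison and is not part of the answer above:

```python
import re

s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'

# Non-greedy: stop at the first closing quote after each href.
lazy = re.findall(r'a href="(.*?)"', s)

# Greedy: run to the LAST closing quote in the string, swallowing both links.
greedy = re.findall(r'a href="(.*)"', s)

# Alternative (assumption, not from the answer): forbid quotes inside the group.
negated = re.findall(r'a href="([^"]*)"', s)

print(lazy)
print(greedy)
print(negated)
```

The greedy version returns a single long string that runs from the first link to the end of the second one, which is why the question mark matters.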

Do not use regexp:
Here is why you should not think of regex in the first place when dealing with HTML or XML (or URLs).
If you wish to use regex anyway,
you can find several patterns that do the job, and several ways to fetch the strings you wish to find.
These patterns do the job:
r'\(a href="(.*?)"\)'
r'\(a href="(.*)"\)'
r'\(a href="(.+?)"\)'
1. re.findall()
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
import re
st = 'blahblahblah (a href="example.com") another bla <a href="polymer.edu">'
re.findall(r'\(a href="(.+?)"\)', st)
2. re.search()
re.search(pattern, string, flags=0)
Scan through string looking for a location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
Then use the group() method of the returned match object to go through the groups. For instance, using the regex r'\(a href="(.+?(.).+?)"\)', which also works here, you have several nested groups: group 0 is the match for the whole pattern, and group 1 is the match for the first sub-pattern enclosed in parentheses, (.+?(.).+?)
You would use search when looking for the first occurrence of the pattern only. With your example this would be
>>> st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
>>> m=re.search(r'\(a href="(.+?(.).+?)"\)', st)
>>> m.group(1)
'example.com'
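To see how the group numbering works with the nested pattern above, the following sketch prints each group in turn. It only uses the string and pattern already given in this answer:

```python
import re

st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
m = re.search(r'\(a href="(.+?(.).+?)"\)', st)

# Groups are numbered by the position of their opening parenthesis.
whole = m.group(0)      # everything the pattern matched
outer = m.group(1)      # the outer capturing group
inner = m.group(2)      # the nested (.) group, a single character
all_groups = m.groups() # every capturing group as a tuple, excluding group 0

print(whole)
print(outer)
print(inner)
print(all_groups)
```

Because search stops at the first occurrence, the second link (polymer.edu) is never reported; findall or finditer would be needed for that.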

Extract part of a regex match

I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?

Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)

Note that starting in Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and re-use it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
    # hello

Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

May I recommend Beautiful Soup? It is a very good library for parsing a whole HTML document.
soup = BeautifulSoup(html_doc)
titleName = soup.title.string

Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

The provided pieces of code do not cope with Exceptions
May I suggest
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns the first captured group, or an empty string by default if the pattern has not been found.

I'd think this should suffice:
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra, redundant work when various HTML, SGML and XML parsers are already in the standard libraries).
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.

The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
    m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with others recommending an HTML parser - not only, but also to handle non-standard use of HTML tags.
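For comparison, a parser-based version can be sketched with only the standard library's html.parser module. The class name TitleParser and the helper extract_title are hypothetical names chosen for this illustration, and the sketch keeps only the first title element it sees:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.title = None        # stays None if no <title> is found
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore any title after the first one.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title("<html><head><TITLE>first</TITLE></head></html>"))
print(extract_title("<p>no title here</p>"))
```

This is longer than the regex, but the tag matching and case handling are done by the parser rather than by the pattern.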

I needed something to match package-0.0.1 (name, version) but want to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name: {name}')
    print(f'Version: {version}')
else:
    raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1

Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing and (?<=<title>).+(?=<\/title>) works great. It will only match what's between the lookbehind and the lookahead, so you don't have to do the whole group thing.
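A short sketch of the lookaround approach follows. The two-title test string is made up for illustration, and the non-greedy variant is an addition to the pattern suggested above, since the greedy .+ runs across several title elements when more than one is present:

```python
import re

# Hypothetical input with two titles, to show the greedy pitfall.
html = '<title>a</title><title>b</title>'

# Lookbehind and lookahead assert the tags without including them in the match.
greedy = re.search(r'(?<=<title>).+(?=</title>)', html)
lazy = re.search(r'(?<=<title>).+?(?=</title>)', html)

print(greedy.group(0))
print(lazy.group(0))
```

Python requires the lookbehind to be fixed-width, which holds for a literal <title> but would not hold for a variant such as <title\s*>.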
