How to make regex with unicode symbols? - python

I need to make regex which will capture the following:
Fixed unicode text:
<br>
<strong>
text I am looking for
</strong>
I do something like
regex = re.compile(unicode('Fixed unicode text:.*','utf-8'))
How do I modify that to capture the remaining text?

Simply prefix the literal with u in Python 2.x (no prefix is needed in Python 3, where every str is already Unicode), and use parentheses to capture the remaining text, like this:
import re
haystack = u'Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>'
# Python 3 syntax; in Python 2, write the pattern as ur'...' instead (ur is a syntax error in Python 3)
match = re.search(r'Fixed unicode text:(.*)', haystack, re.DOTALL)
print(match.group(1))
However, it looks like your input is HTML. If that's the case, you should not use a regular expression, but parse the HTML with lxml, BeautifulSoup, or another HTML parser.
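As a rough illustration of the capture-group approach with genuinely non-ASCII text, here is a minimal Python 3 sketch. The Cyrillic label and the sample markup are made up for the example; they only stand in for the "fixed unicode text" from the question.

```python
import re

# Hypothetical sample input; the Cyrillic label stands in for the fixed text.
haystack = 'Цена:\n<br>\n<strong>\n42 руб.\n</strong>'

# re.DOTALL lets "." match newlines; the lazy group stops at the first </strong>.
pattern = re.compile(r'Цена:.*?<strong>\s*(.*?)\s*</strong>', re.DOTALL)

match = pattern.search(haystack)
captured = match.group(1) if match else None  # None when the label is absent
print(captured)
```

In Python 3 the pattern needs no special prefix beyond r, because str literals are Unicode by default.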

Related

How to return html content without escaping in serializer?

I have a model with a TextField that contains HTML. Let's say I have a row that contains google in TextField. The API returns "google".
How can I remove " escaping?
You can use the html module, which has a function named escape:
html.escape(s, quote=True)
Convert the characters &, < and > in string s to HTML-safe sequences. Use this if you need to display text that might contain such characters in HTML. If the optional flag quote is true, the characters (") and (') are also translated; this helps for inclusion in an HTML attribute value delimited by quotes, as in <a href="...">.
New in version 3.2.
Let s be: s = '<a href="http://example.com">example</a>' then:
from html import escape
html_line = escape(s)
Now html_line contains the s string with its special characters escaped, looking like this:
&lt;a href=&quot;http://example.com&quot;&gt;example&lt;/a&gt;
If you want to go the other way and get the characters < > & " back from escaped text, you can use the other function of the html module, called unescape:
from html import unescape
html_line = unescape(html_line)
Now the html_line will look like this:
<a href="http://example.com">example</a>
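To see how the quote flag and unescape interact, a short round trip may help. This is only a sketch; the sample value is hypothetical and stands in for whatever the TextField holds.

```python
from html import escape, unescape

# Hypothetical sample value, standing in for the TextField content.
raw = '<a href="http://example.com">example</a>'

escaped_all = escape(raw)                     # quote=True: also escapes " and '
escaped_no_quotes = escape(raw, quote=False)  # only &, < and > are escaped
restored = unescape(escaped_all)              # entities back to characters

print(escaped_all)
print(escaped_no_quotes)
print(restored)
```

If the API output already contains entities such as &quot;, unescape is the direction that removes them; escape only ever adds them.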

Python - Remove apostrophe from Regular Expression

I have the following regular expression to extract song names from a certain website:
<h2 class="chart-row__song">(.*?)</h2>
It displays the results below:
Where ' appears in the output, there is an apostrophe on the website the song name is extracted from.
How would I go about changing my regular expression to remove those characters? '
TIA
As stated in the comments, you can't do that using a regex alone. You need to unescape the HTML entities present in the match separately.
import re
import html
regex = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')
result = [html.unescape(s) for s in regex.findall(mystring)]
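For a version that runs on its own, the snippet can be fed a small sample. The markup below is invented for illustration, and &#039; is just one of the ways a site might encode an apostrophe.

```python
import html
import re

# Hypothetical page fragment; the song names are made up.
mystring = (
    '<h2 class="chart-row__song">Don&#039;t Stop</h2>'
    '<h2 class="chart-row__song">Rock &amp; Roll</h2>'
)

regex = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')

# The regex extracts the raw text; html.unescape then decodes the entities.
result = [html.unescape(s) for s in regex.findall(mystring)]
print(result)
```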

How to return everything in a string that is not matched by a regex?

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.
import re
string = '<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email'
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'
re.findall(pattern, string)
['<font size="2px" face="Tahoma">',
'<br>',
'</font>',
'<div>',
'<br>',
'</div>',
'<div>']
desired_string = "Good Morning, As per last email"
Instead of re.findall, use re.sub to replace each match with an empty string.
re.sub(pattern, "", string)
While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432
Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.
from bs4 import BeautifulSoup
string="""<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email"""
soup = BeautifulSoup(string, 'lxml')
print(soup.get_text())
This outputs:
Good Morning, As per last email
One thing to notice is that any &nbsp; entity in the input is converted to a space character by this method (strictly, a non-breaking space, U+00A0, rather than a regular space).
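If installing BeautifulSoup is not an option, the same comparison can be sketched with the standard library alone. The example below puts the re.sub approach next to a small html.parser subclass; TextCollector and strip_tags are hypothetical helper names, not part of any library.

```python
import re
from html.parser import HTMLParser

string = '<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email'

# Regex approach: delete everything the pattern recognises as a tag.
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'
regex_text = re.sub(pattern, "", string)


class TextCollector(HTMLParser):
    """Hypothetical helper that keeps only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered at the end of input
    return "".join(collector.parts)


parser_text = strip_tags(string)
print(regex_text)
print(parser_text)
```

On this input both approaches give the same text. The parser version also decodes entities and does not depend on the tag contents fitting a hand-written character class.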

python re.search (regex) to search words who have pattern like {{world}} only

I have an HTML file into which I have inserted custom tags like {{name}}, {{surname}}. Now I want to find the tags that exactly match the pattern {{world}}, and not {world}}, {{world}, {world}, { word }, {{ world }}, etc.
I wrote this small piece of code for it:
re.findall(r'\{(\w.+?)\}', html_string)
It returns the words that follow any of the patterns {{world}}, {world}, {world}}, which I don't want. I want to match exactly {{world}}. Can anybody please guide me?
Um, shouldn't the regex be:
'\{\{(\w.+?)\}\}'
Ok, after the comments, I understand your requirements more:
'\{\{\w+?\}\}'
should work for you.
Basically, you want {{any number of word characters including underscore}}. You don't even need the lazy match in this case, so you may remove the ? in the expression.
Something like {{keyword1}} other stuff {{keyword2}} will not match as a whole now.
To get only the keyword without getting the {{}} use below:
'(?<=\{\{)\w+?(?=\}\})'
How about this?
re.findall(r'{{(\w+)}}', html_string)
Or, if you want the curly braces included in the results:
re.findall(r'({{\w+}})', html_string)
If you're trying to accomplish html templating, though, I recommend using a good template engine.
This will not allow any curly braces inside the captured result; is that what you want?
'\{\{(\w[^\{\}]+?)\}\}'
http://rubular.com/r/79YwR13MS0
If you want to match doubled curly brackets, you should specify them in your regex:
re.findall(r'\{\{(\w[^}]*?)\}\}', html_string)
You say the other answers don't work, but they seem to for me:
>>> import re
>>> html_string = '{{realword}} {fake1}} {{fake2} {fake3} fake4'
>>> re.findall(r'\{\{(\w.+?)\}\}', html_string)
['realword']
If it doesn't work for you, you'll need to give more details.
Edit: How about the following? Getting rid of the dot (.) and using only \w also allows you to use greedy qualifiers and works for the example HTML from your comment:
>>> html_string = 'html>\n <head>\n </head>\n <title>\n </title>\n <body>\n <h1>\n T - Shirts\n </h1>\n <img src="March-Tshirts/skull_headphones_tshirt.jpg" />\n <img src="/March-Tshirts/star-wars-t-shirts-6.jpeg" />\n <h2>\n we - we - we\n </h2>\n {{unsubscribe}} -- {{tracking_beacon} -- {web_url}} -- {name} \n </body>\n</html>\n'
>>> re.findall(r'\{\{(\w+)\}\}', html_string)
['unsubscribe']
The \w matches alphanumeric characters and the underscore; if you need to match more characters you could add it to a set (e.g., [\w\+] to also match the plus sign).
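The variants discussed in this thread can be compared side by side. The following is a small sketch on a made-up sample string; it contrasts the capture-group form, the form that keeps the braces, and the lookaround form.

```python
import re

# Hypothetical sample mixing well-formed and malformed tags.
html_string = (
    '{{unsubscribe}} -- {{tracking_beacon} -- {web_url}} -- {name} '
    '-- {{ spaced }} -- {{first_name}}'
)

# Capture group: returns only the name between the doubled braces.
names = re.findall(r'\{\{(\w+)\}\}', html_string)

# No group: returns the whole tag, braces included.
with_braces = re.findall(r'\{\{\w+\}\}', html_string)

# Lookbehind/lookahead: the braces are checked but not part of the match.
lookaround = re.findall(r'(?<=\{\{)\w+(?=\}\})', html_string)

print(names)
print(with_braces)
print(lookaround)
```

Because \w+ cannot match a brace or a space, the greedy quantifier is safe here and malformed tags such as {{tracking_beacon} or {{ spaced }} are skipped.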

Extract part of a regex match

I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?
Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)
Note that starting in Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and re-use it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
# hello
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
May I recommend Beautiful Soup? It is a very good library for parsing a whole HTML document.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
titleName = soup.title.string  # .name would return the tag name 'title', not its text
Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
The provided pieces of code do not cope with a failed match (re.search returns None, so calling .group(1) on it raises an AttributeError).
May I suggest
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns the first captured group, or an empty string by default if the pattern has not been found.
I'd think this should suffice:
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra, redundant work when various HTML, SGML and XML parsers are already in the standard libraries.)
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
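For a dependency-free illustration of the parser route, the standard library's html.parser can be used as well. This is a minimal sketch, not a replacement for BeautifulSoup or lxml; TitleParser and extract_title are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper that collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is handled too.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the first title's text, or None when there is no complete title."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts) if parser.done else None


print(extract_title("<html><head><title>hello</title></head></html>"))
```

Returning None for a missing title mirrors what re.search does, so callers can keep the same "if result:" check.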
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
    m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I agree with the others who recommend an HTML parser, not least because it also handles non-standard use of HTML tags.
I needed something to match package-0.0.1 (name, version) but wanted to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name: {name}')
    print(f'Version: {version}')
else:
    raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1
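To check that the pattern really rejects versions with leading zeros, it can be wrapped in a small function and tried on a few inputs. parse_identifier is a hypothetical helper name, and the sample identifiers are made up.

```python
import re

RE_IDENTIFIER = re.compile(
    r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$'
)


def parse_identifier(text):
    """Return (name, version) for a valid identifier, or None otherwise."""
    match = RE_IDENTIFIER.search(text)
    return match.groups() if match else None


for candidate in ('package-0.0.1', 'package-1.20.300', 'package-0.0.010', 'package-01.0.0'):
    print(candidate, '->', parse_identifier(candidate))
```

Each version component is either a single 0 or a number starting with 1-9, which is why 0.0.010 and 01.0.0 fail to match.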
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing and (?<=<title>).+(?=<\/title>) works great. It only matches what's between the lookbehind and the lookahead, so you don't have to do the whole group thing.
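One caveat with the lookaround version is that .+ is greedy, so it behaves differently when more than one title element is present. The sketch below, on a made-up input, compares the greedy form with a lazy .+? variant; the lazy variant is a suggestion added here, not part of the original answer.

```python
import re

# Hypothetical input with two title elements.
html_text = '<title>first</title><title>second</title>'

# Greedy: runs to the last </title> in the text.
greedy = re.search(r'(?<=<title>).+(?=</title>)', html_text).group()

# Lazy: stops at the first </title>.
lazy = re.search(r'(?<=<title>).+?(?=</title>)', html_text).group()

print(greedy)
print(lazy)
```

As with the other regex answers, re.search returns None when nothing matches, so real code should check the result before calling .group().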
