Python - Remove apostrophe from Regular Expression - python

I have the following regular expression to extract song names from a certain website:
<h2 class="chart-row__song">(.*?)</h2>
It displays the results below:
Where ' appears in the output below, there is an apostrophe on the website the song name is extracted from.
How would I go about changing my regular expression to remove those characters (')?
TIA

As stated in the comments, you can't do that using a regex alone. You need to unescape the HTML entities present in the match separately.
import re
import html
regex = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')
result = [html.unescape(s) for s in regex.findall(mystring)]
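For a self-contained illustration, here is a minimal sketch of the same idea. The markup in mystring is a made-up sample (the real page content is not shown in the question), and it assumes the apostrophes arrive as HTML entities such as &#39;.

```python
import html
import re

# Hypothetical sample markup, used only for illustration.
mystring = (
    '<h2 class="chart-row__song">Don&#39;t Stop</h2>'
    '<h2 class="chart-row__song">Rock &amp; Roll</h2>'
)

regex = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')

# The regex only extracts the raw text; entities are still escaped here.
raw = regex.findall(mystring)

# html.unescape (Python 3.4+) converts entities back to characters.
result = [html.unescape(s) for s in raw]

print(raw)
print(result)
```

The regex and the unescaping stay separate steps: the pattern itself does not need to change.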

Related

Python - regex issue with getting formula consisted between ><

I'm practicing regular expressions on an HTML file.
My goal is to fetch the title of the file:
<tittle>Popular baby names</tittle>
I tried something like this:
pattern = re.compile(r'>.+<')
and instead of what I'm looking for I'm getting:
((1791, 1794), '>?<')
((2544, 2547), '>1<')
((2605, 2608), '>2<')
I've read that dot represents any character but a newline. That makes me wonder why it isn't working.
If you want to catch what's inside the tag only, use capturing groups ().
import re
s = '<tittle>Popular baby names</tittle> some text <title>Other title</title> <strong>bold</strong>'
re.findall(r'>([\w\s]+)</', s)
# ['Popular baby names', 'Other title', 'bold']
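To see why the pattern from the question behaves differently, here is a small sketch comparing it with a capturing-group version. The sample string is shortened from the one above, and the [^<>]+ character class is one possible choice for illustration, not the only one.

```python
import re

s = '<tittle>Popular baby names</tittle> <strong>bold</strong>'

# .+ is greedy: it runs from the first '>' to the last '<' on the line,
# and the surrounding '>' and '<' are part of the match.
greedy = re.findall(r'>.+<', s)

# A capturing group returns only the text inside the parentheses.
# [^<>]+ stops at the next angle bracket, so each match stays inside one tag.
captured = re.findall(r'>([^<>]+)</', s)

print(greedy)
print(captured)
```

Because . does not match a newline by default, the original pattern can only match text where a > and a later < sit on the same line, which would explain short matches like '>1<' on a multi-line file.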

How to use regex to remove string within certain HTML tag and string must contain empty space

I am trying to clean some HTML data with regular expressions in Python. Given an input string with HTML tags, I want to remove a tag and its content if the content contains a space. The requirement is shown below:
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = regexProcess(inputString)
print outputString
>>I want to remove not sole <code>word</code>
The regex re.sub("<code>.+?</code>", " ", inputString) removes all code tags, not only the ones containing a space. How can I improve it, or are there other methods?
Thanks in advance.
Using regex with HTML is fraught with issues, so you should be aware of the possible consequences. Your <code>.+?</code> regex will only work if the <code> and </code> tags are on one line and there are no nested <code> tags inside them.
Assuming there are no nested code tags you might extend your current approach:
import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)
The re.S flag enables . to match line breaks, and the lambda performs a check against each match: any code tag that contains a space in its content is replaced with a single space; otherwise it is kept unchanged.
A more common way to parse HTML in Python is to use BeautifulSoup. First, parse the HTML, then get all the code tags and then replace the code tag if the nodes contains a space:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
...     if p.string and " " in p.string:
...         p.replace_with(" ")
...
>>> print(soup)
I want to remove not sole <code>word</code>
It is a bad idea to parse HTML with regex. However, if your HTML is simple enough, you could do this:
re.sub(r"<code>[^<]*\s[^<]*</code>", " ", inputString)
We're looking for at least one whitespace character somewhere in the content. To make it work with several code tags on the same line, I've excluded the < character from the content (it cannot appear inside the tag's text, since an escaped < is written as &lt;).
OK, it's still a hack; a proper HTML parser is preferred.
small test:
inputString = "<code>hello </code> <code>world</code> <code>hello world</code> <code>helloworld</code>"
I get:
<code>world</code> <code>helloworld</code>
You can also remove tags by matching their opening and closing angle brackets:
inputString = re.sub(r"<.*?>", " ", inputString)
This works in my case.
Enjoy ...

How to return everything in a string that is not matched by a regex?

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.
import re
string = '<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email'
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'
re.findall(pattern, string)
['<font size="2px" face="Tahoma">',
'<br>',
'</font>',
'<div>',
'<br>',
'</div>',
'<div>']
desired_string = "Good Morning, As per last email"
Instead of re.findall, use re.sub to replace each match with an empty string.
re.sub(pattern, "", string)
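As a runnable sketch of that call, using the string and pattern from the question (the string is wrapped in single quotes here so the inner double quotes do not end it early):

```python
import re

string = '<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email'
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

# Replace every match with an empty string; whatever is left is the remainder.
desired_string = re.sub(pattern, "", string)

print(desired_string)
```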
While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432
Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.
from bs4 import BeautifulSoup
string="""<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email"""
soup = BeautifulSoup(string, 'lxml')
print(soup.get_text())
This outputs:
Good Morning, As per last email
One thing to notice is that the &nbsp; entity was changed to a space character using this method (BeautifulSoup decodes it to a non-breaking space, U+00A0).

How to make regex with unicode symbols?

I need to make regex which will capture the following:
Fixed unicode text:
<br>
<strong>
text I am looking for
</strong>
I do something like
regex = re.compile(unicode('Fixed unicode text:.*','utf-8'))
How to modify that to capture remaining text?
Simply prefix u (in Python 2.x, nothing in Python 3) to get a unicode string, and use parentheses to capture the remaining text, like this:
import re
haystack = u'Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>'
# Python 2 only: the ur'' prefix is a syntax error in Python 3, where a plain r'' is enough.
match = re.search(ur'Fixed unicode text:(.*)', haystack, re.DOTALL)
print(match.group(1))
However, it looks like your input is HTML. If that's the case, you should not use a regular expression, but parse the HTML with lxml, BeautifulSoup, or another HTML parser.
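For Python 3, where every str is already Unicode, the capture from this answer can be sketched as follows. The variable name remaining and the None fallback are additions for illustration.

```python
import re

haystack = 'Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>'

# re.DOTALL lets . match newlines, so (.*) captures everything after the prefix.
match = re.search(r'Fixed unicode text:(.*)', haystack, re.DOTALL)

# re.search returns None when nothing matches, so guard before reading the group.
remaining = match.group(1) if match else None

print(remaining)
```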

Python Regex 'not' to identify pattern within <a></a>

I am having trouble writing a Python regex that uses a 'not' condition to skip a certain pattern inside href attributes.
My aim is to replace all occurrences of DSS[a-z]{2}[0-9]{2} with an href link as shown below, but without replacing the same pattern where it occurs inside href tags.
Present Regex:
replaced = re.sub("[^http://*/s](DSS[a-z]{2}[0-9]{2})", "\\1", input)
I need to add this new regex using an OR operator to the existing one I have
EDIT:
I am trying to use regex just for a simple operation. I want to replace the occurrences of the pattern anywhere in the HTML using a regex, except where they occur within <a></a>.
The answer to any question having regexp and HTML in the same sentence is here.
In Python, the best HTML parser is indeed Beautiful Soup.
If you want to persist with regexp, you can try a negative lookbehind to avoid anything preceded by a ". At your own risk.
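A minimal sketch of that lookbehind idea follows. The link template and the sample text are hypothetical, since the question does not show the real link format, and the approach assumes the pattern appears inside href attributes directly after the opening double quote.

```python
import re

# Hypothetical link target, used only for illustration.
LINK_TEMPLATE = r'<a href="http://example.com/\1">\1</a>'

# (?<!") is a negative lookbehind: the match must not be
# immediately preceded by a double quote.
pattern = re.compile(r'(?<!")(DSS[a-z]{2}[0-9]{2})')

text = 'See DSSab12 and <a href="DSScd34">existing</a>.'
replaced = pattern.sub(LINK_TEMPLATE, text)

print(replaced)
```

This is fragile: an occurrence inside an href that follows a slash, or one used as the visible text of an existing link, is not preceded by a double quote and would still be replaced. An HTML parser avoids those cases.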
