Extract unicode substrings with the re module - python

I have a string like this:
s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
I want this text:
result = 'the unicode text I want with an é'
I've tried to use this code:
expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result) # just to strip out leading/trailing white space
But as long as the é is in the string s, re.search always returns None.
Note, I've tried using different combinations of .* instead of [\sa-zA-Z]+ without success.

The character ranges a-z and A-Z only capture ASCII characters. You can use . to capture Unicode characters:
>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print re.search(r'BEGIN(.+?)END', s).group(1)
the unicode text I want with an é
>>>
Note too that I simplified your pattern a bit. Here is what it does:
BEGIN # Matches BEGIN
(.+?) # Captures one or more characters non-greedily
END # Matches END
Also, you do not need Regex to remove whitespace from the ends of a string. Just use str.strip:
>>> ' a '.strip()
'a'
>>>
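For Python 3, where print is a function and str is Unicode by default, the two suggestions above could be combined roughly as follows. This is only a sketch: the helper name extract_between and its start/end parameters are hypothetical and not part of the original answer.

```python
import re

def extract_between(text, start="BEGIN", end="END"):
    """Return the stripped text between the first start marker and the next end marker, or None."""
    # re.escape guards against markers that contain regex metacharacters.
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    match = re.search(pattern, text)
    # re.search returns None when nothing matches, so check before calling .group().
    return match.group(1).strip() if match else None

s = 'something extra BEGIN the unicode text I want with an é END some more extra stuff'
result = extract_between(s)
print(result)
```

Checking for None before calling .group(1) avoids an AttributeError when the markers are missing.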

Related

Replacing everything with a backslash till next white space

As part of preprocessing my data, I want to replace everything from a backslash up to the next space with an empty string. For example, \fs24 and \qc23424 should each be replaced with an empty string. There can be multiple occurrences of such backslash tags that I want to remove. I have created a "tags to be eradicated" list which I plan to use in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this does not produce the desired result. Alternatively, I have built an eradication list and replace its entries in a loop, from the highest number to the lowest, but this is a performance killer.
Assuming a 'tag' can also occur at the very beginning of your string, and to avoid false positives, you could use:
\s?(?<!\S)\\[a-z\d]+
Replace each match with an empty string.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash.
[a-z\d]+ - 1+ (Greedy) Characters as per given class.
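The pattern explained above can be tried out with a short sketch like the one below. The names TAG_PATTERN and strip_tags are hypothetical, and the sample is the input string from the question.

```python
import re

# Optional leading whitespace, then a backslash tag that is not glued to a preceding word.
TAG_PATTERN = re.compile(r'\s?(?<!\S)\\[a-z\d]+')

def strip_tags(text):
    """Remove backslash tags (and the single space before them) from text."""
    return TAG_PATTERN.sub('', text)

sample = r"This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."
cleaned = strip_tags(sample)
print(cleaned)
```

Because of the (?<!\S) lookbehind, a backslash that directly follows a non-whitespace character, as in C:\temp, is left alone.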
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'
Does that work for you?
re.sub(
r"\\\w+\s*", # a backslash followed by alphanumerics and optional spacing;
'', # replace it with an empty string;
input_string # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
'\\' matches a literal '\' and '\w+' matches the word characters that follow it (letters, digits, and underscores):
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
I tried this and it worked fine for me:
def remover(text, state):
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    state = True if "\\" in text else False
    return text, state

text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)

Python regex: remove short lines

I have a string with multiple newline symbols:
text = 'foo\na\nb\n$\n\nxz\nbar'
I want to remove the lines that are shorter than 3 symbols. The desired output is
'foo\n\nbar'
I tried
re.sub(r'(\n([\s\S]{0,2})\n)+', '\nX\n', text, flags= re.S)
but this matches only some subset of the string and the result is
'foo\nX\nb\nX\nxz\nbar'
I need somehow to do greedy search and replace the longest string matching the pattern.
re.S makes . match everything including newline, and you don't want that. Instead use re.M so ^ matches beginning of string and after newline, and use:
>>> import re
>>> text = 'foo\na\nb\n$\n\nxz\nbar'
>>> re.findall('(?m)^.{0,2}\n',text)
['a\n', 'b\n', '$\n', '\n', 'xz\n']
>>> re.sub('(?m)^.{0,2}\n','',text)
'foo\nbar'
That's "from start of a line, match 0-2 non-newline characters, followed by a newline".
I noticed your desired output has a \n\n in it. If that isn't a mistake and blank lines should be kept, use .{1,2} instead.
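As a quick check of the .{1,2} variant, here is a minimal sketch that uses the sample text from the question:

```python
import re

text = 'foo\na\nb\n$\n\nxz\nbar'

# {1,2} instead of {0,2}: lines of one or two characters are removed,
# while completely empty lines are kept.
result = re.sub(r'(?m)^.{1,2}\n', '', text)
print(repr(result))
```

With this quantifier the empty line between $ and xz survives, which gives the \n\n from the desired output.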
You might also want to allow the final line of the string to have an optional terminating newline, for example:
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar') # 3 symbols at end, no newline
'foo\nbar'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar\n') # same, with newline
'foo\nbar\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba\n') # <3 symbols, newline
'foo\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba') # < 3 symbols, no newline
'foo\n'
Perhaps you can use re.findall instead:
text = 'foo\na\nb\n$\n\nxz\nbar'
import re
print (repr("".join(re.findall(r"\n?\w{3,}\n?",text))))
Output:
'foo\n\nbar'
You can use this regex, which looks for any run of fewer than 3 non-newline characters that follows either the start of the string or a newline and is followed by a newline or the end of the string, and replace it with an empty string:
(^|\n)[^\n]{0,2}(?=\n|$)
In python:
import re
text = 'foo\na\nb\n$\n\nxz\nbar'
print(re.sub(r'(^|\n)[^\n]{0,2}(?=\n|$)', '', text))
Output
foo
bar
There's no need to use regex for this.
raw_str = 'foo\na\nb\n$\n\nxz\nbar'
str_res = '\n'.join([curr for curr in raw_str.splitlines() if len(curr) >= 3])
print(str_res)
Output:
foo
bar

Remove trailing special characters from string

I'm trying to use a regex to clean some data before I insert the items into the database. I haven't been able to solve the issue of removing trailing special characters at the end of my strings.
How do I write this regex to only remove trailing special characters?
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = re.sub(r'([_+!##$?^])', '', item)
    print(clean_this)
outputs this:
string01 # correct
string02 # incorrect because it removes the _ inside the string
string03 # correct
string041 # incorrect because it removes the _ inside the string
string05a # incorrect because it removes the _ inside the string and not just the trailing _
You could also use the special purpose rstrip method of strings
[s.rstrip('_+!##$?^') for s in strings]
# ['string01', 'str_ing02', 'string03', 'string04_1', 'string05_a']
Repeat the character class one or more times with +, otherwise only a single special character would be removed, and then anchor the match to the end of the string with $. Note that you don't need the capturing group around the character class:
[_+!##$?^]+$
For example:
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = re.sub(r'[_+!##$?^]+$', '', item)
    print(clean_this)
If you also want to remove whitespace characters at the end you could add \s to the character class:
[_+!##$?^\s]+$
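A minimal sketch of the whitespace-aware variant follows. The name TRAILING is hypothetical, the character class is copied as it appears above, and the last sample string has been extended with trailing whitespace for illustration.

```python
import re

# One or more special or whitespace characters, anchored to the end of the string.
TRAILING = re.compile(r'[_+!##$?^\s]+$')

strings = ['string01_', 'str_ing02_^', 'string03_#_', 'string04_1', 'string05_a_ \n']
cleaned = [TRAILING.sub('', item) for item in strings]
print(cleaned)
```

Underscores inside the strings are untouched because the $ anchor only lets the match end at the end of the string.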
You need the end-of-string anchor $:
clean_this = (re.sub(r'[_+!##$?^]+$', '', item))

Detect latin characters in regex

I want to apply a regex on a Latin text, and I followed the solution in this question: How to account for accent characters for regex in Python?, where they suggest to add a # character before the regex.
def clean_str(string):
    string = re.sub(r"#(#[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
    string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
    return string.lower().strip()
My problem is that the regexes handle the Latin characters, but none of the substitutions are applied to the text.
Example:
If I have a text like "#aaa bbb các. ddd",
it should become "bbb các . ddd", with a space before the dot and the tag "#aaa" deleted.
But the output is the same as the input: "#aaa bbb các. ddd"
Did I miss something?
You have several issues in the current code:
To match any Unicode word char, use \w (rather than [A-Za-z0-9_]) with a Unicode flag
When using re.U with re.sub, remember to either pass the count argument (set it to 0 to replace all occurrences) before the flag, or just use flags=re.U / flags=re.UNICODE
To match any char that is neither a word char nor whitespace, use [^\w\s]
When you want to replace with the whole match, you do not have to wrap the whole pattern in (...); just use the \g<0> backreference in the replacement pattern.
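The point about count and flags can be illustrated with a small sketch. The signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is read as count. To make that visible, the count is written out explicitly here, and the 40-tag sample text is made up for the demonstration.

```python
import re

text = "#a " * 40  # 40 tags, more than the integer value of re.UNICODE

# What a positional re.UNICODE effectively does: it limits the number of replacements.
as_count = re.sub(r"#\w+", "", text, count=int(re.UNICODE))
# What was intended: the flag goes into the flags keyword argument.
as_flags = re.sub(r"#\w+", "", text, flags=re.UNICODE)

print(int(re.UNICODE))      # integer value of the flag
print(as_count.count("#"))  # tags left when the flag is treated as a count
print(as_flags.count("#"))  # tags left when the flag is passed as flags
```

When the flag ends up in the count slot, at most int(re.UNICODE) replacements are made and the flag itself is never applied.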
See an updated method to clean the strings:
>>> def clean_str(s):
...     s = re.sub(r'#\w+', ' ', s, flags=re.U)
...     s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)
...     s = re.sub(r'\s{2,}', ' ', s, flags=re.U)
...     return s.lower().strip()
...
>>> print(clean_str(s))
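As a self-contained check, here is the same function applied to the example string from the question. This sketch assumes Python 3, where str patterns are Unicode-aware by default, so the flags argument is left out; the variable names sample and cleaned are only for illustration.

```python
import re

def clean_str(s):
    s = re.sub(r'#\w+', ' ', s)           # drop #tags
    s = re.sub(r'[^\w\s]', r' \g<0>', s)  # put a space before each punctuation character
    s = re.sub(r'\s{2,}', ' ', s)         # collapse runs of whitespace into one space
    return s.lower().strip()

sample = "#aaa bbb các. ddd"
cleaned = clean_str(sample)
print(cleaned)
```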

How to remove non alphabetic characters from words-ends

When I have a string "Mary's!!" I want to get "Mary's!", so only one non alphabetic character is removed at the beginning and/or the end of each word in the string, not in the middle of the word.
I have this so far in Python 3
import re
s = "Mary's!! string. With. Punctuation?" # Sample string
out = re.sub(r'[^\w\d\s]','', s)
print(out)
This outputs:
"Marys string With Punctuation"
It removes everything, while it should be like this:
"Mary's! string With Punctuation"
You could require that there is a space next to it (or start/end of string):
re.sub(r'(\s|^)[^\w\d\s]|[^\w\d\s](\s|$)', r'\1\2', s)
Or, alternatively with look-around:
re.sub(r'(?<!\S)[^\w\d\s]|[^\w\d\s](?!\S)', '', s)
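The look-around version can be checked against the sample string with a short sketch like this one. The name EDGE_PUNCT is hypothetical, and \d is dropped from the character class because \w already covers digits.

```python
import re

s = "Mary's!! string. With. Punctuation?"

# One punctuation character at the start of a word (not preceded by non-whitespace)
# or at the end of a word (not followed by non-whitespace).
EDGE_PUNCT = re.compile(r'(?<!\S)[^\w\s]|[^\w\s](?!\S)')

out = EDGE_PUNCT.sub('', s)
print(out)
```

Each alternative matches exactly one character, so only the outermost punctuation mark on either side of a word is removed and the apostrophe inside Mary's is kept.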
