Detect latin characters in regex - python

I want to apply a regex on a Latin text, and I followed the solution in this question: How to account for accent characters for regex in Python?, where they suggest to add a # character before the regex.
def clean_str(string):
string = re.sub(r"#(#[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
return string.lower().strip()
My problem is, the regex work in detecting the latin characters, but nothing is applied from the regex set on the text.
example:
if I have a text like "#aaa bbb các. ddd".
it should be like "bbb các . ddd" with space "before the DOT" and with deleting the Tag "#aaa".
But it produces the same input text!: "#aaa bbb các. ddd"
Did I miss something?

You have several issues in the current code:
To match any Unicode word char, use \w (rather than [A-Za-z0-9_]) with a Unicode flag
When using a re.U with re.sub, remember to either use the count argument (set it to 0 to match all occurrences) before the flag, or just use flags=re.U/ flags=re.UNICODE
To match any non-word char but a whitespace, you may use [^\w\s]
When you want to replace with a whole match, you do not have to wrap the whole pattern with (...), just make sure you use \g<0> backreference in the replacement pattern.
See an updated method to clean the strings:
>>> def clean_str(s):
... s = re.sub(r'#\w+', ' ', s, flags=re.U)
... s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)
... s = re.sub(r'\s{2,}', ' ', s, flags=re.U)
... return s.lower().strip()
...
>>> print(clean_str(s))

Related

Splitting words before and after '/' in using Regex

Have the following string:
text = 'interesting/practical/useful'
I need to split it like interesting / practical / useful using Regex.
I'm trying this
newText = re.sub(r'[a-zA-Z](/)[a-zA-Z]', ' \g<0> ', text)
but it returns
interestin g/p ractica l/u seful
P.S. I have other texts in the Corpus that also have forward slashes that shouldn't be altered due to this operation, which is my I'm looking for regex patterns specifically with characters before and after the '/'.
Use lookarounds for the letters before and after the /, so they're not included in the match.
newText = re.sub(r'(?<=[a-zA-Z])/(?=[a-zA-Z])', ' \g<0> ', text)
I'm used to do it this way:
newText = re.sub(r'([a-zA-Z])/([a-zA-Z])', '\1 / \2', text)
\1 and \2 are the references on the first and second founded expressions in the brakets ()
In plain English it means: look for any letter, a slash, a letter and change them with the founded letter, a space, a slash, a space and the second founded letter.

how to match either word or sentence in this Python regex?

I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
print(x.group(1)) # this returns "col_name"
print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)
You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
print(x.group(1)) # this returns "col_name"
print(x.group(2)) # this returns "a sentence of words"
See the Python demo
The [^'] is a negated character class that matches any char but a single quotation mark, see the regex demo.
In case the string can contain escaped single quotes, you may consider replacing [^'] with
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*).
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).

Replacing punctuation except intra-word dashes with a space

There already is an approaching answer in R gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:
my_string = 'compactified on a calabi-yau threefold # ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)
gives 'compactified on a calab yau threefold # ,.'
So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash. And it does not remove punctuation
Expected result (string without any punctuation but intra-word dash): 'compactified on a calabi-yau threefold'
R uses TRE (POSIX) or PCRE regex engine depending on the perl option (or function used). Python uses a modified, much poorer Perl-like version as re library. Python does not support POSIX character classes, as [:alnum:] that matches alpha (letters) and num (digits).
In Python, [:alnum:] can be replaced with [^\W_] (or ASCII only [a-zA-Z0-9]) and the negated [^[:alnum:]] - with [\W_] ([^a-zA-Z0-9] ASCII only version).
The [^[:alnum:]['-] matches any 1 symbol other than alphanumeric (letter or digit), [, ', or -. That means the R question you refer to does not provide a correct answer.
You can use the following solution:
import re
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No - d'Ante compactified on a calabi-yau threefold # ,."
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)
The (\b[-']\b)|[\W_] regex matches and captures intraword - and ' and we restore them in the re.sub by checking if the capture group matched and re-inserting it with m.group(1), and the rest (all non-word characters and underscores) are just replaced with a space.
If you want to remove sequences of non-word characters with one space, use
p = re.compile(r"(\b[-']\b)|[\W_]+")

Extract unicode substrings with the re module

I have a string like this:
s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
I want this text:
result = 'the unicode text I want with an é'
I've tried to use this code:
expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result) # just to strip out leading/trailing white space
But as long as the é is in the string s, re.search always returns None.
Note, I've tried using different combinations of .* instead of [\sa-zA-Z]+ without success.
The character ranges a-z and A-Z only capture ASCII characters. You can use . to capture Unicode characters:
>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print re.search(r'BEGIN(.+?)END', s).group(1)
the unicode text I want with an é
>>>
Note too that I simplified your pattern a bit. Here is what it does:
BEGIN # Matches BEGIN
(.+?) # Captures one or more characters non-greedily
END # Matches END
Also, you do not need Regex to remove whitespace from the ends of a string. Just use str.strip:
>>> ' a '.strip()
'a'
>>>

Python regex : adding space after comma only if not followed by a number

I want to add spaces after and before comma's in a string only if the following character isn't a number (9-0). I tried the following code:
newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
But it seems like the \1 is taking the 2 matching characters and not just the comma.
Example:
>>> newLine = "abc,abc"
>>> newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
"abc ,a bc"
Expected Output:
"abc , abc"
How can I tell the sub to take only the 'comma' ?
Use this one:
newLine = re.sub(r'[,]+(?![0-9])', r' , ', newLine)
Here using negative lookahead (?![0-9]) it is checking that the comma(s) are not followed by a digit.
Your regex didn't work because you picked the comma and the next character(using ([,]+[^0-9])) in a group and placed space on both sides.
UPDATE: If it is not only comma and other things as well, then place them inside the character class [] and capture them in group \1 using ()
newLine = re.sub(r'([,/\\]+)(?![0-9])', r' \1 ', newLine)

Categories

Resources