Python regex, remove all punctuation except hyphen for unicode string

I have this code for removing all punctuation from a regex string:
import regex as re
re.sub(ur"\p{P}+", "", txt)
How would I change it to allow hyphens? If you could explain how you did it, that would be great. I understand that here, correct me if I'm wrong, P with anything after it is punctuation.

[^\P{P}-]+
\P is the complement of \p, so \P{P} matches anything that is not punctuation. The class [^\P{P}-] therefore matches anything that is not (non-punctuation or a dash), which leaves all punctuation except dashes.
Example: http://www.rubular.com/r/JsdNM3nFJ3
If you want a less convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, then check that it wasn't a dash (using a negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
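If the third-party regex module is not available, the same idea (every character in a Unicode punctuation category, minus the hyphen) can be approximated with the standard library's unicodedata module. This is only a sketch of an alternative, not the method from the answer above; the function name strip_punct_keep_hyphen and the sample string are made up for illustration.

```python
import unicodedata


def strip_punct_keep_hyphen(text):
    """Drop characters whose Unicode category starts with 'P', except '-'."""
    # unicodedata.category() returns codes such as 'Po', 'Pd', 'Pi', 'Pf';
    # anything starting with 'P' is roughly what \p{P} covers.
    return "".join(
        ch for ch in text
        if ch == "-" or not unicodedata.category(ch).startswith("P")
    )


print(strip_punct_keep_hyphen("\u00abwell-known\u00bb facts, e.g. \u201cthis\u201d!"))
```

Note that only the ASCII hyphen-minus is kept here; other dashes such as the en dash are also in a punctuation category and are removed.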

Here's how to do it with the re module, in case you have to stick with the standard libraries:
# works in python 2 and 3
import re
import string
remove = string.punctuation
remove = remove.replace("-", "") # don't remove hyphens
pattern = r"[{}]".format(re.escape(remove)) # create the pattern; re.escape keeps ], \ and ^ literal inside the class
txt = ")*^%{}[]thi's - is - ###!a !%%!!%- test."
re.sub(pattern, "", txt)
# >>> 'this - is - a - test'
If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
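A minimal sketch of the str.translate variant for Python 3, assuming only ASCII punctuation needs to go. The table is built once with str.maketrans and reused; the helper name strip_punctuation is just for illustration.

```python
import string

# ASCII punctuation to delete, with the hyphen taken out of the set.
remove = string.punctuation.replace("-", "")

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", remove)


def strip_punctuation(txt):
    return txt.translate(table)


print(strip_punctuation(")*^%{}[]thi's - is - ###!a !%%!!%- test."))
```

Building the table once matters if the function is called on many strings, since the per-call work is then a single pass over the text.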

You could either specify the punctuation you want to remove manually, as in [._,] or supply a function instead of the replacement string:
re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)

Related

Merging three regex patterns used for text cleaning to improve efficiency

Given a text I want to make some modifications:
lowercase the uppercase chars at the beginning of a sentence.
remove chars like ’ or ' (without adding whitespace)
remove unwanted chars for example ³ or ? , ! . (and replace with whitespace)
def multiple_replace(text):
    # first sub so words like can't will change to cant and not can t
    first_strip = re.sub("[’']", '', text)
    def cap(match):
        return match.group().lower()
    p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
    # second sub to change all words that begin a sentence to lowercase
    second_strip = p.sub(cap, first_strip)
    # third_strip is to remove all . from text unless they are used in decimal numbers
    third_strip = re.sub(r'(?<!\d)\.|\.(?!\d)', '', second_strip)
    # fourth strip to remove unexpected chars that might be in text, for example !,?³, and replace with whitespace
    forth_strip = re.sub('[^A-Za-z0-9##_$&%]+', ' ', third_strip)
    return forth_strip
Is there a more efficient way of doing this? I am going over the text 4 times just so it is in the right format for me to parse, which seems like a lot, especially if there are millions of documents.

You could make use of an alternation to match either an uppercase char A-Z at the start of the string, or after . ? or ! followed by a whitespace char.
I think you can also add a . to the negated character class [^A-Za-z0-9##_$&%.]+ to not remove the dot for a decimal value and change the order of operations to use cap first before removing any dots.
import re
def cap(match):
    return match.group().lower()
p = re.compile(r'(?<=[.?!]\s)[A-Z]|^[A-Z]', re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
first_strip = p.sub(cap, text)
second_strip = re.sub(r"[`']+|(?<!\d)\.|\.(?!\d)", '', first_strip)
third_strip = re.sub('[^A-Za-z0-9##_$&%.]+', ' ', second_strip)
print(third_strip)
Output
a test here this is test but keep 1.2
You could also use a lambda with all 3 patterns and 2 capturing groups, checking the group values in the callback, but I don't think that would improve readability or make the code easier to change or test.
import re
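# the quote characters sit inside group 2 so they are deleted (like the dots), not replaced by a space
p = re.compile(r"(?:((?<=[.?!]\s)[A-Z]|^[A-Z])|([`']+|(?<!\d)\.|\.(?!\d))|[^A-Za-z0-9##_$&%.]+)", re.M)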
text = "A test here. this `` (*)is. Test, but keep 1.2"
result = re.sub(p, lambda x: x.group(1).lower() if x.group(1) else ('' if x.group(2) else ' '), text)
print(result)
Output
a test here this is test but keep 1.2
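For what it's worth, the single-pass idea can also be written with a named callback instead of a nested conditional lambda, which makes the three cases easier to see. This is a sketch under my own assumptions: the names CLEANUP, _replace and clean are hypothetical, the typographic apostrophe ’ from the question is added to the quote class, and the quote characters are placed in the "delete" group so that can't becomes cant.

```python
import re

# Group 1: a capital letter that starts a sentence (to be lowercased).
# Group 2: characters to delete outright (quotes, and dots that are
#          not part of a decimal number).
# Anything else that matches is a run of unwanted characters and
# becomes a single space.
CLEANUP = re.compile(
    r"((?<=[.?!]\s)[A-Z]|^[A-Z])"
    r"|([`'\u2019]+|(?<!\d)\.|\.(?!\d))"
    r"|[^A-Za-z0-9#_$&%.]+",
    re.M,
)


def _replace(match):
    if match.group(1):
        return match.group(1).lower()
    if match.group(2):
        return ""
    return " "


def clean(text):
    return CLEANUP.sub(_replace, text)


print(repr(clean("I can't wait. Pi is 3.14, roughly!")))
```

The result can differ from the multi-pass version in edge cases (for example, an isolated dot surrounded by spaces leaves two spaces here), so it is worth comparing both on real data before switching.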

python RE white space in the pattern

I am writing a Python script to find a tag name in a string like this:
string='Tag Name =LIC100 State =TRUE'
If I use an expression like this
re.search('Name(.*)State',string)
I get " =LIC100". I would like to get just LIC100.
Any suggestions on how to set up the pattern to eliminate the whitespace and the equals sign?

That is because you get 0+ chars other than line break chars from Name up to the last State. You may restrict the pattern in Group 1 to just non-whitespaces:
import re
string='Tag Name =LIC100 State =TRUE'
m = re.search(r'Name\s*=(\S*)',string)
if m:
    print(m.group(1))
Pattern details:
Name - a literal char sequence
\s* - 0+ whitespaces
= - a literal =
(\S*) - Group 1 capturing 0+ chars other than whitespace (or \S+ can be used to match 1 or more chars other than whitespace).
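To see why the original (.*) runs "up to the last State", here is a small comparison of the greedy, lazy and restricted patterns. The input string is a made-up variant of the one in the question with a second State added, purely to make the difference visible.

```python
import re

# Hypothetical input with two occurrences of "State".
s = "Tag Name =LIC100 State =TRUE State =OLD"

# Greedy: .* grabs as much as possible, so it stops at the LAST "State".
greedy = re.search(r"Name(.*)State", s).group(1)

# Lazy: .*? grabs as little as possible, so it stops at the FIRST "State",
# but still includes the leading " =" and the trailing space.
lazy = re.search(r"Name(.*?)State", s).group(1)

# Restricted: skip whitespace and "=", then capture non-whitespace only.
restricted = re.search(r"Name\s*=(\S*)", s).group(1)

print(repr(greedy))
print(repr(lazy))
print(repr(restricted))
```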

The easiest solution would probably just be to strip it out after the fact, like so:
s = " =LIC100 "
s = s.strip('= ')
print(s)
#LIC100
If you insist on doing it within the regex, you can try something like:
reg = r'Name[ =]+([A-Za-z0-9]+)\s+State'

Your current regex is failing because (.*) captures all characters until the occurrence of State. Instead of capturing everything, you can use a positive lookbehind to describe what precedes, but is not included in, the content you actually want to capture. In this case, "Name =" precedes the match, so we can put it in the lookbehind assertion as (?<=Name =), then capture the word characters that follow, which stops at the next whitespace:
>>> import re
>>> s = 'Tag Name =LIC100 State =TRUE'
>>> r = re.compile(r"(?<=Name =)\w*")
>>> print(r.search(s))
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
>>> print(r.search(s).group(0))
LIC100

Following the tips above, I managed to find a nice solution.
Actually, the string I am trying to process has some non-printable characters. It is like this
"Tag Name\x00=LIC100\x00\tState=TRUE"
Using the concept of lookahead and lookbehind I found the following solution:
import re
s = 'Tag Name\x00=LIC100\x00\tState=TRUE'
T=re.search(r'(?<=Name\x00=)(.*)(?=\x00\tState)',s)
print(T.group(0))
The nice thing about this is that the result does not contain any non-printable characters. Printing the match object T itself shows:
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
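Another option for input like this, sketched here as an alternative rather than as the approach above: replace the control characters with spaces first, then use the simple whitespace-tolerant pattern. The names CONTROL_CHARS, TAG_NAME and extract_tag_name are hypothetical.

```python
import re

# ASCII control characters (NUL through US, plus DEL).
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")
TAG_NAME = re.compile(r"Name\s*=\s*(\S+)")


def extract_tag_name(raw):
    """Return the value after 'Name =', or None if there is no match."""
    cleaned = CONTROL_CHARS.sub(" ", raw)
    m = TAG_NAME.search(cleaned)
    return m.group(1) if m else None


print(extract_tag_name("Tag Name\x00=LIC100\x00\tState=TRUE"))
```

This keeps the pattern itself free of \x00 escapes, so the same function handles both the clean and the "dirty" form of the string.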

python regex - replace newline (\n) to something else

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?

EDIT: To replace multiple consecutive newline characters (\n) with ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) asserts that the newline characters are followed by a capital letter, without consuming it.

Let's take a look at your regex ([\n]+)([A-Z])+. The first part, ([\n]+), is fine: it matches multiple occurrences of a newline into one group (note that this won't match the carriage return \r). The second part, ([A-Z])+, leads to your error. It matches a single uppercase letter into a capturing group, multiple times if several uppercase letters follow, and each repetition resets the group to the last matched uppercase letter, which is then used for the replacement.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, ([A-Z]+), so multiple uppercase letters are matched into it. You could also just leave it out, as it makes no difference how many uppercase letters follow:
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
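Putting the variants side by side makes the "last repetition wins" behaviour of a quantified group easy to check. This is just a small demonstration; the variable names are arbitrary.

```python
import re

text = "Inc\n\nABRAXAS"

# Quantifier outside the group: group 2 is overwritten on every
# repetition, so only the last letter survives in the replacement.
repeated_group = re.sub(r'(\n+)([A-Z])+', r'____\2', text)

# Quantifier inside the group: the whole run of capitals is captured.
whole_run = re.sub(r'(\n+)([A-Z]+)', r'____\2', text)

# Lookahead: the capital letter is checked but never consumed.
lookahead = re.sub(r'\n+(?=[A-Z])', '____', text)

# Character class covering both \r and \n for Windows-style line endings.
crlf = re.sub(r'[\r\n]+(?=[A-Z])', '____', "Inc\r\n\r\nContact")

print(repr(repeated_group))
print(repr(whole_run))
print(repr(lookahead))
print(repr(crlf))
```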

Try:
import re
p = re.compile(r'\r?\n')  # optional \r followed by \n; inside [] the ? would be a literal question mark
test_str = u"(2 months)\n\nML"
subst = u"_"
result = re.sub(p, subst, test_str)
It will reduce string to
(2 months)__ML

Replacing punctuation except intra-word dashes with a space

There is already a close answer for R, gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:
my_string = 'compactified on a calabi-yau threefold # ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)
gives 'compactified on a calab yau threefold # ,.'
So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash. And it does not remove the punctuation.
Expected result (string without any punctuation but intra-word dash): 'compactified on a calabi-yau threefold'

R uses the TRE (POSIX) or PCRE regex engine depending on the perl option (or the function used). Python's re library implements a Perl-like flavor with a smaller feature set. It does not support POSIX character classes such as [:alnum:], which matches alpha (letters) and num (digits).
In Python, [:alnum:] can be replaced with [^\W_] (or ASCII only [a-zA-Z0-9]) and the negated [^[:alnum:]] - with [\W_] ([^a-zA-Z0-9] ASCII only version).
The [^[:alnum:]['-] matches any 1 symbol other than alphanumeric (letter or digit), [, ', or -. That means the R question you refer to does not provide a correct answer.
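A quick check of the two replacements for [:alnum:] mentioned above, assuming Python 3 where str patterns are Unicode-aware by default. The sample string is made up.

```python
import re

sample = "na\u00efve_caf\u00e9 42!"

# [^\W_] = "word character, but not underscore": letters and digits,
# including non-ASCII letters.
unicode_alnum = re.findall(r"[^\W_]+", sample)

# ASCII-only equivalent: accented letters split the words.
ascii_alnum = re.findall(r"[a-zA-Z0-9]+", sample)

print(unicode_alnum)
print(ascii_alnum)
```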
You can use the following solution:
import re
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No - d'Ante compactified on a calabi-yau threefold # ,."
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)
The (\b[-']\b)|[\W_] regex matches and captures intraword - and ' and we restore them in the re.sub by checking if the capture group matched and re-inserting it with m.group(1), and the rest (all non-word characters and underscores) are just replaced with a space.
If you want to replace sequences of non-word characters with one space, use
p = re.compile(r"(\b[-']\b)|[\W_]+")
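Wrapped into a helper and applied to the string from the question, the + variant can be sketched like this. The names PUNCT and strip_punctuation are hypothetical, and the final .strip() is my addition to drop the space left where trailing punctuation used to be.

```python
import re

# Group 1 captures an intra-word - or ' (word boundary on both sides);
# the second alternative matches runs of any other non-word characters.
PUNCT = re.compile(r"(\b[-']\b)|[\W_]+")


def strip_punctuation(text):
    # Re-insert the captured intra-word character, otherwise use one space.
    cleaned = PUNCT.sub(lambda m: m.group(1) or " ", text)
    return cleaned.strip()


print(strip_punctuation("compactified on a calabi-yau threefold # ,."))
```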

How to replace .. in a string in python

I am trying to replace the double dots in this string with a space:
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
The output I get is:
haha hehe hoho
Desired output:
haha hehe.hoho
What am I doing wrong?

Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside [] is an alternative: the class matches any single one of them.
The dot character has a special meaning in a regex. To avoid this meaning, prefix it with an escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified range.
. Matches any single character except "\n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = r'\.\.' # if you want to replace exactly 2 dots
pattern = r'\.{2,}' # if you want to replace runs of at least 2 dots
s = re.sub(pattern, ' ', s)
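To make the difference between the three patterns concrete, here they are on an input that also contains a run of three dots. The input string is made up for the comparison.

```python
import re

s = "haha..hehe.hoho...hihi"

# Character class: any run of "." or "+" characters, including single dots.
char_class = re.sub(r"[..+]+", " ", s)

# Exactly two literal dots: a run of three leaves one dot behind.
exactly_two = re.sub(r"\.\.", " ", s)

# Two or more literal dots: the whole run is replaced by one space.
two_or_more = re.sub(r"\.{2,}", " ", s)

print(char_class)
print(exactly_two)
print(two_or_more)
```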

Unless you are constrained to use regex, then I find the replace() function much simpler:
s = "haha..hehe.hoho"
print(s.replace('..', ' '))
gives your desired output:
haha hehe.hoho

Change:
re.sub('[..+]+',' ', s)
to:
re.sub(r'\.\.+', ' ', s)

[..+]+ means: match any of the characters in the list, one or more times. So it matches the single . in your input as well as the double one. Change it as below:
s = re.sub(r'\.\.+', ' ', s)

[] is a character class and will match any one of the characters in it (here, any single .).
I'm guessing you used it because a plain . wouldn't work, since it is a metacharacter meaning any character. You can simply escape it with a \ to mean a literal dot, like so:
s = re.sub(r'\.\.', ' ', s)

Your regex [..+]+ is a character class with a quantifier: it allows for 1 or more literal periods or plus symbols, which is not what you want.
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print(s)
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
If your input can contain runs of more than 2 periods, use the \.{2,} regex instead.
