Remove duplicated letters except in abbreviations - python

I'd like to remove duplicated letters from a string, as long as the string also contains other letters. For instance, consider the following list:
aaa --> it is untouched because all are the same letters
aa --> it is untouched because all are the same letters
a --> not touched, just one letter
broom --> brom
school --> schol
boo --> should be bo
gool --> gol
ooow --> should be ow
I use the following regex to get rid of the duplicates:
(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])
However, this fails for the string boo, which is kept as boo instead of having the double o removed. The same happens with oow, which is not reduced to ow.
Do you know why boo is not taken by the regex?

You can match and capture whole words consisting of identical chars into one capturing group, and then match repetitive consecutive letters in all other contexts, and replace accordingly:
import re
text = "aaa, aa, a,broom, school...boo, gool, ooow."
print( re.sub(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+', r'\1\3', text) )
# => aaa, aa, a,brom, schol...bo, gol, ow.
See the Python demo and the regex demo.
Regex details
\b - a word boundary
(([a-zA-Z])\2+) - Group 1: an ASCII letter (captured into Group 2) and then one or more occurrences of the same letter
\b - a word boundary
| - or
([a-zA-Z]) - Group 3: an ASCII letter captured into Group 3
\3+ - one or more occurrences of the letter captured in Group 3.
The replacement is a concatenation of Group 1 and Group 3 values.
To match any Unicode letters, replace [a-zA-Z] with [^\W\d_].
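As a rough sketch of the Unicode variant mentioned above, the same pattern can be built with [^\W\d_] in place of [a-zA-Z]. The names LETTER, DEDUPE and dedupe are made up for this illustration and are not part of the original answer.

```python
import re

# [^\W\d_] = a word character that is neither a digit nor an underscore
LETTER = r'[^\W\d_]'
DEDUPE = re.compile(rf'\b(({LETTER})\2+)\b|({LETTER})\3+')

def dedupe(text):
    # Group 1 keeps whole words made of one repeated letter as they are;
    # Group 3 collapses every other run of a repeated letter to one letter.
    return DEDUPE.sub(r'\1\3', text)

print(dedupe("aaa, aa, a,broom, school...boo, gool, ooow."))
# \u00e9 is the precomposed letter e with an acute accent
print(dedupe("cr\u00e9\u00e9"))
```

The first call prints the same result as the ASCII version, and the second collapses the doubled accented letter, which [a-zA-Z] would leave untouched.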

Your regular expression doesn't match boo because the lookbehind and the lookahead require at least one letter both before and after the duplicated run.
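To see the effect of the two lookarounds, here is a small check that runs the pattern from the question on a few of the sample words. The variable names original and results are only for this illustration.

```python
import re

# The pattern from the question: a letter is required on both sides of the run
original = re.compile(r'(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])')

results = {}
for word in ["broom", "boo", "ooow", "aaa"]:
    results[word] = original.sub(r'\1', word)
    print(word, "->", results[word])
```

Only broom is fully reduced. In boo nothing follows the run, and in ooow the first o has no letter before it, so just two of the three o's are collapsed and the result is oow.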
One possibility is to make a simpler regex to catch all duplicates and then revert if the result is one character
import re
def remove_duplicate(string):
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    return new_string if len(new_string) > 1 else string
Here is a possible solution without a regular expression. It's faster, but it also removes duplicated whitespace and punctuation, not only letters.
def remove_duplicate(string):
    new_string = ''
    last_c = None
    for c in string:
        if c == last_c:
            continue
        else:
            new_string += c
            last_c = c
    if len(new_string) > 1:
        return new_string
    else:
        return string

Related

Remove tuple based on character count

I have a dataset consisting of tuples of words. I want to remove words that contain fewer than 4 characters, but I could not figure out how to iterate over them with my code.
Here is a sample of my data:
   content                    clean4Char
0  [yes, no, never]           [never]
1  [to, every, contacts]      [every, contacts]
2  [words, tried, describe]   [words, tried, describe]
3  [word, you, go]            [word]
Here is the code that I'm working with (it keeps showing me an error).
def remove_single_char(text):
    text = [word for word in text]
    return re.sub(r"\b\w{1,3}\b"," ", word)
df['clean4Char'] = df['content'].apply(lambda x: remove_single_char(x))
df.head(3)
The problem is with your remove_single_char function. Also, there is no need for a lambda, since you can pass the function directly to apply. This will do the job:
def remove(words):
    # Keep only the words that have at least 4 characters
    return list(filter(lambda x: len(x) >= 4, words))
df['clean4Char'] = df['content'].apply(remove)
df.head(3)
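As a quick sanity check of the length filter on its own, without pandas, here is a minimal sketch run against the sample rows from the question. The helper name keep_long_words and its min_len parameter are hypothetical.

```python
rows = [
    ["yes", "no", "never"],
    ["to", "every", "contacts"],
    ["words", "tried", "describe"],
    ["word", "you", "go"],
]

def keep_long_words(words, min_len=4):
    """Return the words that have at least `min_len` characters."""
    return [word for word in words if len(word) >= min_len]

cleaned = [keep_long_words(row) for row in rows]
for row, result in zip(rows, cleaned):
    print(row, "->", result)
```

The 4-letter word in the last row is kept, which matches the expected clean4Char column.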
If the content column holds plain strings rather than lists, we can use str.replace for a Pandas option:
df["clean4Char"] = df["content"].str.replace(r'\b\w{1,3}\b,?\s*', '', regex=True)
The regex used here says to match:
\b a word boundary (only match entire words)
\w{1,3} a word with no more than 3 characters
\b closing word boundary
,? optional comma
\s* optional whitespace
We then replace with the empty string, which removes the matching words of 3 letters or fewer along with an optional trailing comma and whitespace.
Here is a regex demo showing that the replacement logic is working.
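The same pattern can be tried with the standard re module on plain strings. This is only a sketch: it assumes the rows are comma-separated strings, and the names SHORT_WORD, samples, cleaned and tidied are made up for the example.

```python
import re

SHORT_WORD = re.compile(r'\b\w{1,3}\b,?\s*')

samples = ["yes, no, never", "to, every, contacts", "word, you, go"]
cleaned = [SHORT_WORD.sub('', s) for s in samples]
print(cleaned)

# When the short words come last, the separator after the previous word
# is left behind, so a final strip of commas and spaces may be wanted.
tidied = [s.rstrip(', ') for s in cleaned]
print(tidied)
```

The last sample shows the caveat: the pattern removes the short words but leaves the comma and space that followed the kept word.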

Python regex remove dots from dot separated letters

I would like to remove the dots within a word, such that a.b.c.d becomes abcd, But under some conditions:
There should be at least 2 dots within the word, For example, a.b remains a.b, But a.b.c is a match.
This should match on 1 or 2 letters only. For example, a.bb.c is a match (because a, bb and c are 1 or 2 letters each), but aaa.b.cc is not a match (because aaa consists of 3 letters)
Here is what I've tried so far:
import re
texts = [
    'a.b.c', # Should be: 'abc'
    'ab.c.dd.ee', # Should be: 'abcddee'
    'a.b' # Should remain: 'a.b'
]
for text in texts:
    text = re.sub(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}', r'\g<word>', text)
    print(text)
This matches "any dot followed by 1 or 2 letters", repeated 2 or more times. The matching works fine, but the replacement with the named group keeps only the last repetition; the earlier ones are dropped.
So, it prints:
ac
abee
a.b
Which is not what I want. I would appreciate any help, thanks.
Starting the match with a dot does not ensure that there is a char a-zA-Z before it.
If you use the named group word in the replacement, it will contain only the value of the last iteration, because the group itself sits inside a repeated group.
Instead, you can match 1 or 2 chars a-zA-Z followed by 2 or more repetitions of a dot and 1 or 2 chars a-zA-Z, and remove the dots from the match when there is one.
To prevent aaa.b.cc from matching, you could make use of word boundaries \b
\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b
The pattern matches:
\b A word boundary to prevent the word being part of a larger word
[a-zA-Z]{1,2} Match 1 or 2 times a char a-zA-Z
(?: Non capture group
\.[a-zA-Z]{1,2} Match a dot and 1 or 2 times a char a-zA-Z
){2,} Close non capture group and repeat 2 or more times to match at least 2 dots
\b A word boundary
import re
pattern = r"\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b"
texts = [
    'a.b.c',
    'ab.c.dd.ee',
    'a.b',
    'aaa.b.cc'
]
for s in texts:
    print(re.sub(pattern, lambda x: x.group().replace(".", ""), s))
Output
abc
abcddee
a.b
aaa.b.cc
^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$
You can use this to match the string. If it's a match, you can just remove the . characters using any naive method.
See demo.
https://regex101.com/r/BrNBtk/1
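A minimal sketch of that match-then-strip idea follows, assuming each string is a single word to be tested as a whole. The names WHOLE and strip_dots are hypothetical, and the pattern is used as given, so it only accepts lowercase letters.

```python
import re

# The lookahead demands at least two dots; the rest demands 1-2 letter parts
WHOLE = re.compile(r'^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$')

def strip_dots(word):
    # Remove the dots only when the whole word matches the pattern
    return word.replace('.', '') if WHOLE.match(word) else word

for word in ['a.b.c', 'ab.c.dd.ee', 'a.b', 'aaa.b.cc']:
    print(word, '->', strip_dots(word))
```

Because the pattern is anchored with ^ and $, this variant only works on isolated words, not on words embedded in a longer text.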

Getting word of a char in string based on its location

I have a string, e.g.:
"This is my very boring string"
In addition, I have a location of a char in the string without spaces.
e.g.:
The location 13, which in this example matches the o in the word boring.
What I need is, based on the index I get (13) to return the word (boring).
This code will return the char (o):
re.findall('[a-z]',s)[13]
But for some reason I can't think of a good way to return the word boring.
Any help will be appreciated.
You can use the regex \w+ to match words and keep accumulating the lengths of the matches until the total length exceeds the target position:
import re
def get_word_at(string, position):
    length = 0
    for word in re.findall(r'\w+', string):
        length += len(word)
        if length > position:
            return word
so that get_word_at('This is my very boring string', 13) would return:
boring
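To check the word boundaries of this approach, the function can be run over several positions. The sketch below repeats the function with an explicit return None added for positions past the end, which is an addition made for this illustration.

```python
import re

def get_word_at(string, position):
    length = 0
    for word in re.findall(r'\w+', string):
        length += len(word)
        if length > position:
            return word
    return None  # position lies past the last word character

s = 'This is my very boring string'
for position in (0, 3, 4, 11, 12, 13, 17, 18, 23, 24):
    print(position, get_word_at(s, position))
```

Positions 12 to 17 all map to boring, 11 is the last letter of very, 18 is the first letter of string, and 24 is out of range.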
A non-regex solution that strives for the elegance the OP desires:
def word_out_of_string(string, character_index):
    words = string.split()
    while words and character_index >= len(words[0]):
        character_index -= len(words.pop(0))
    return words.pop(0) if words else None
print(word_out_of_string("This is my very boring string", 13))
A variable-length lookbehind, which is slow and ugly, is not required.
Using a simple lookahead with a capture group will get the word.
This regex uses non-whitespace as the character.
^(?:\s*(?=(?<!\S)(\S+))?\S){13}
Use \w if need be, but whatever character class is sought, it must be used together with its negation (the anti-character); otherwise nothing will work: the match will stop, because ALL characters must be matched.
Examples:
\w used with \W
\s used with \S
You can install and use the regex module, which supports patterns with variable-length lookbehinds, so that you can use such a pattern to assert that there are exactly the desired number of word characters, optionally surrounded by white spaces, behind the matching word:
import regex
regex.search(r'\w*(?<=^\s*(\w\s*){13})\w+', 'This is my very boring string').group()
This returns:
boring
This function takes two arguments: a string and an index into the string without spaces. It converts the index into the equivalent index in the original string, then returns the word that the character at the converted index belongs to.
def find(string, idx):
    # Find the index of the character relative to the original string
    i1 = None
    count = -1  # index of the current character once spaces are skipped
    for i, char in enumerate(string):
        if char != ' ':
            count += 1
            if count == idx:
                i1 = i
                break
    if i1 is None:
        return None
    # Find which word the index belongs to in the original string
    i2 = 0  # start index of the current word in the original string
    for word in string.split(' '):
        if i2 <= i1 < i2 + len(word):
            return word
        i2 += len(word) + 1
print(find("This is my very boring string", 13))
Output:
boring
If Python's alternative regex engine (the third-party regex module) is used, one could replace matches of the following regular expression with empty strings:
r'^(?:\s*\S){0,13}\s|(?<=(?:\s*\S){13,})\s.*'
For the example string, the 'o' in 'boring' is at index 13 after whitespace has been removed. If both 13's in the regex are changed to any number in the range 12-17, 'boring' is returned. If they are changed to 11, 'very' is returned; if they are changed to 18, 'string' is returned.
The regex engine performs the following operations.
^ : match beginning of string
(?:\s*\S) : match 0+ ws chars, then 1 non-ws char, in a non-capture group
{0,13} : execute the non-capture group 0-13 times
\s : match a ws char
| : or
(?<= : begin a positive lookbehind
(?:\s*\S) : match 0+ ws chars, then 1 non-ws char, in a non-capture group
{13,} : execute the non-capture group at least 13 times
) : end positive lookbehind
\s : match 1 ws char
.* : match 0+ chars

How to efficiently pass or ignore some tokens resolved by a python regex?

I am applying a function to a list of tokens as follows:
def replace(e):
    return e
def foo(a_string):
    l = []
    for e in a_string.split():
        l.append(replace(e.lower()))
    return ' '.join(l)
With the string:
s = 'hi how are you today 23:i ok im good 1:i'
The function foo corrects the spelling of the tokens in s. However, there are some cases that I would like to ignore, for example 12:i or 2:i. How can I apply foo to all the tokens that are not matched by the regex \d{2}\b:i\b|\d{1}\b:i\b? That is, I would like foo to ignore all the tokens of the form 23:i or 01:e or 1:i. I was thinking of a regex, but maybe there is a better way of doing this.
The expected output would be:
'hi how are you today 23:i ok im good 1:e'
In other words the function foo ignores tokens with the form nn:i or n:i, where n is a number.
You may use
import re
def replace(e):
    return e
s = 'hi how are you today 23:i ok im good 1:e'
rx = r'(?<!\S)(\d{1,2}:[ie])(?!\S)|\S+'
print(re.sub(rx, lambda x: x.group(1) if x.group(1) else replace(x.group().lower()), s))
See the Python demo online and the regex demo.
The (?<!\S)(\d{1,2}:[ie])(?!\S)|\S+ pattern matches
(?<!\S)(\d{1,2}:[ie])(?!\S) - 1 or 2 digits, : and i or e that are enclosed with whitespaces or string start/end positions (with the substring captured into group 1)
| - or
\S+ - 1+ non-whitespace chars.
Once Group 1 matches, its value is pasted back as is, else, the lowercased match is passed to the replace method and the result is returned.
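Since replace above returns its input unchanged, the effect is hard to see. The sketch below swaps in a hypothetical replace that upper-cases its token, purely to make visible which tokens are processed and which are skipped. The name SKIP_OR_TOKEN is also made up for this example.

```python
import re

def replace(token):
    # Hypothetical stand-in for the spelling corrector: mark processed tokens
    return token.upper()

SKIP_OR_TOKEN = re.compile(r'(?<!\S)(\d{1,2}:[ie])(?!\S)|\S+')

def foo(a_string):
    # Group 1 (the nn:i / n:e form) is pasted back as is; anything else is processed
    return SKIP_OR_TOKEN.sub(
        lambda m: m.group(1) if m.group(1) else replace(m.group().lower()),
        a_string,
    )

print(foo('hi how are you today 23:i ok im good 1:e'))
```

The tokens 23:i and 1:e come back untouched while every other token is upper-cased. A token such as 123:i has three digits, so it falls through to the \S+ branch and is processed.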
Another regex approach:
rx = r'(?<!\S)(?!\d{1,2}:[ie](?!\S))\S+'
s = re.sub(rx, lambda x: replace(x.group().lower()), s)
See another Python demo and a regex demo.
Details
(?<!\S) - checks if the char immediately to the left is a whitespace or asserts the string start position
(?!\d{1,2}:[ie](?!\S)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is 1 or 2 digits, :, i or e, and then a whitespace or end of string should follow
\S+ - 1+ non-whitespace chars.
Try this:
s = ' '.join([i for i in s.split() if ':e' not in i])

Python regular expression to find letters and numbers

I am given an input string.
I used findall to find the words that consist only of letters and numbers (the number of words to be found is not fixed).
I wrote:
words = re.findall(r"\w*\s", x) # x is the input string
If I enter "asdf1234 cdef11dfe a = 1 b = 2",
the result is split into asdf1234, cdef11dfe, a =, 1, b =, 2.
I would like to pick out only asdf1234 and cdef11dfe.
How should I write the regular expression?
Try /[a-zA-Z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). That would be the only way to filter out the one letter words of the string.
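A short check of this suggestion on the input from the question, written with Python's re module (so without the surrounding slashes). The second part illustrates a common slip: writing the range as A-z, which also covers the characters [ \ ] ^ _ and the backtick.

```python
import re

x = "asdf1234 cdef11dfe a = 1 b = 2"
words = re.findall(r"[a-zA-Z0-9]{2,}", x)
print(words)

# A-z spans more than the letters, so it glues words together across _ and ^
too_wide = re.findall(r"[A-z]+", "foo_bar^baz")
letters_only = re.findall(r"[a-zA-Z]+", "foo_bar^baz")
print(too_wide, letters_only)
```

Only the two multi-character alphanumeric words are returned. The single-character tokens a, 1, b and 2 are dropped by the {2,} quantifier.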
The problem with \w is that it includes underscores.
This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_]+) Anything but a non-word character or an underscore, at least one time (non-capturing group)
(?<![\"=\w]) not preceded by ", = or a word character
(?![\"=\w]) not followed by ", = or a word character
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    print(match.group())
