Remove tuple based on character count - python

I have a dataset consisting of tuples of words. I want to remove words that contain fewer than 4 characters, but I could not figure out how to apply my code to every row.
Here is a sample of my data:
   content                   clean4Char
0  [yes, no, never]          [never]
1  [to, every, contacts]     [every, contacts]
2  [words, tried, describe]  [words, tried, describe]
3  [word, you, go]           [word]
Here is the code that I'm working with (it keeps raising an error):
def remove_single_char(text):
    text = [word for word in text]
    return re.sub(r"\b\w{1,3}\b"," ", word)
df['clean4Char'] = df['content'].apply(lambda x: remove_single_char(x))
df.head(3)

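The problem is with your remove_single_char function: in Python 3 the name word does not exist outside the list comprehension, and re.sub expects a string rather than a list. There is also no need for a lambda, since you can pass the function directly to apply. This will do the job:
def remove(words):
    # keep only the words with at least 4 characters
    return list(filter(lambda x: len(x) >= 4, words))
df['clean4Char'] = df['content'].apply(remove)
df.head(3)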
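As a quick check of the filtering logic without pandas, here is a minimal sketch. The name keep_long_words and the min_len parameter are made up for illustration, and the rows are copied from the sample data in the question. Note that the comparison is >= so that a four-letter word such as "word" survives, as in the expected output.

```python
def keep_long_words(words, min_len=4):
    """Return only the words with at least `min_len` characters."""
    return [word for word in words if len(word) >= min_len]

# Rows copied from the question's sample data.
rows = [
    ["yes", "no", "never"],
    ["to", "every", "contacts"],
    ["words", "tried", "describe"],
    ["word", "you", "go"],
]

cleaned = [keep_long_words(row) for row in rows]
print(cleaned)
```

The same function can be passed to df['content'].apply, provided each cell holds a list or tuple of strings.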

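We can use str.replace here for a Pandas option (this assumes each cell holds a string such as "yes, no, never" rather than a Python list, since the .str methods return NaN for non-string values):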
df["clean4Char"] = df["content"].str.replace(r'\b\w{1,3}\b,?\s*', '', regex=True)
The regex used here says to match:
\b        a word boundary (only match entire words)
\w{1,3}   a word of one to three characters
\b        closing word boundary
,?        optional comma
\s*       optional whitespace
We then replace each match with an empty string, which removes every word of three characters or fewer along with its optional trailing comma and whitespace.
Here is a regex demo showing that the replacement logic is working.
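To see what this pattern does outside pandas, here is a minimal sketch using re.sub on plain strings. It assumes each row is a comma-separated string; the sample strings and the helper name drop_short_words are made up to mirror the question's data. A short word at the end of a row leaves a dangling separator behind, so the sketch strips it afterwards.

```python
import re

# Same pattern as in the str.replace call above.
SHORT_WORD = re.compile(r'\b\w{1,3}\b,?\s*')

def drop_short_words(text):
    """Remove words of 1-3 characters, then trim any leftover ', ' at the end."""
    cleaned = SHORT_WORD.sub('', text)
    return cleaned.rstrip(', ')

rows = ["yes, no, never", "to, every, contacts", "word, you, go"]
results = [drop_short_words(row) for row in rows]
print(results)
```

Without the final rstrip, the last row would come out as "word, " because the comma belongs to the kept word, not to the removed ones.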

Related

extract uppercase words from string

I want to extract all the words that are entirely uppercase (so not only the first letter, but every letter in the word) from the strings in columnY of dataset X.
I have the following script:
X['uppercase'] = X['columnY'].str.extract('([A-Z][A-Z]+)')
But that only extracts the first uppercase word in the string.
Then I tried extractall:
X['uppercase'] = X['columnY'].str.extractall('([A-Z][A-Z]+)')
But I got the following error:
TypeError: incompatible index of inserted column with frame index
What am I doing wrong?
We can use a regular expression with apply, as below:
import re
def extract_uppercase_words(text):
    # return every whole word made up only of uppercase letters
    return re.findall(r'\b[A-Z]+\b', text)
X['columnY'].apply(extract_uppercase_words)
Try this:
X['uppercase'] = X['columnY'].str.findall(r'\b[A-Z]+\b')
This will give you a list of all the UPPERCASE words. (The pattern must be a raw string; otherwise Python reads \b as a backspace character.)
If you want these words joined into a single string, you can use the code below.
X['uppercase'] = X['columnY'].str.findall(r'\b[A-Z]+\b').str.join(' ')
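One detail worth checking with this pattern is how the string literal is written. The sketch below uses plain re.findall on a made-up sentence to show that a non-raw '\b' is a backspace character rather than a word boundary, and that [A-Z]+ also picks up one-letter words, unlike the two-or-more-letter pattern in the question.

```python
import re

text = "The NASA and ESA teams met at CERN"

# Without the r prefix, "\b" is the backspace character (\x08), so nothing matches.
non_raw = re.findall('\b[A-Z]+\b', text)

# With a raw string, \b reaches the regex engine as a word boundary.
raw = re.findall(r'\b[A-Z]+\b', text)

# [A-Z]+ also matches single capitals such as "I" or "A".
one_plus = re.findall(r'\b[A-Z]+\b', "I saw A NASA rocket")

# Requiring two or more letters skips them.
two_plus = re.findall(r'\b[A-Z]{2,}\b', "I saw A NASA rocket")

print(non_raw, raw, one_plus, two_plus)
```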
Assuming you only have words in the column, you could try:
X["uppercase"] = (
    X["columnY"]
    .str.replace(r'\s*\b\w*[a-z]\w*\b\s*', ' ', regex=True)
    .str.replace(r'\s{2,}', ' ', regex=True)
    .str.strip()
)
The first replacement targets words that are not all uppercase (defined as any word with at least one lowercase letter), along with any surrounding spaces, and replaces each with a single space. The second replacement collapses any run of excess spaces into a single space, and str.strip removes the leading and trailing spaces.

Remove duplicated letters except in abbreviations

I'd like to remove duplicated letters from a string as long as there are more letters. For instance, consider the following list:
aaa --> it is untouched because all are the same letters
aa --> it is untouched because all are the same letters
a --> not touched, just one letter
broom --> brom
school --> schol
boo --> should be bo
gool --> gol
ooow --> should be ow
I use the following regex to get rid of the duplicates as follows:
(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])
However, this is failing in the string boo which is kept as the original boo instead of removing the double o. The same happens with oow which is not reduced to ow.
Do you know why boo is not taken by the regex?
You can match and capture whole words consisting of identical chars into one capturing group, and then match repetitive consecutive letters in all other contexts, and replace accordingly:
import re
text = "aaa, aa, a,broom, school...boo, gool, ooow."
print( re.sub(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+', r'\1\3', text) )
# => aaa, aa, a,brom, schol...bo, gol, ow.
See the Python demo and the regex demo.
Regex details
\b - a word boundary
(([a-zA-Z])\2+) - Group 1: an ASCII letter (captured into Group 2) and then one or more occurrences of the same letter
\b - a word boundary
| - or
([a-zA-Z]) - Group 3: an ASCII letter captured into Group 3
\3+ - one or more occurrences of the letter captured in Group 3.
The replacement is a concatenation of Group 1 and Group 3 values.
To match any Unicode letters, replace [a-zA-Z] with [^\W\d_].
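The pattern can be wrapped in a small helper and checked against the word list from the question. This is a sketch only: the names collapse_repeats, ASCII_REPEATS and UNICODE_REPEATS are made up here, and the second pattern applies the [^\W\d_] substitution mentioned above.

```python
import re

# ASCII-only pattern from the answer.
ASCII_REPEATS = re.compile(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+')
# Same structure with [^\W\d_] so that any Unicode letter is matched.
UNICODE_REPEATS = re.compile(r'\b(([^\W\d_])\2+)\b|([^\W\d_])\3+')

def collapse_repeats(text, pattern=ASCII_REPEATS):
    """Collapse repeated letters, leaving words made of one repeated letter alone."""
    # An unmatched group expands to an empty string in the replacement (Python 3.5+).
    return pattern.sub(r'\1\3', text)

words = ["aaa", "aa", "a", "broom", "school", "boo", "gool", "ooow"]
collapsed = {word: collapse_repeats(word) for word in words}
print(collapsed)
print(collapse_repeats("brööm", UNICODE_REPEATS))
```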
Your regular expression doesn't match boo because the lookbehind and lookahead require a letter both before and after the repeated run, and in boo nothing follows the oo.
One possibility is to use a simpler regex that catches all duplicates and then revert to the original if the result is a single character:
import re
def remove_duplicate(string):
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    return new_string if len(new_string) > 1 else string
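To see the lookaround behaviour concretely, the sketch below runs the pattern from the question on a few words. The replacement r'\1' is an assumption, since the question does not show which replacement was used.

```python
import re

# Pattern from the question: the repeated run must have a letter on both sides.
ORIGINAL = re.compile(r'(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])')

outcomes = {}
for word in ["broom", "boo", "oow", "ooow"]:
    # Assumed replacement: keep a single copy of the repeated letter.
    outcomes[word] = ORIGINAL.sub(r'\1', word)
    print(word, "->", outcomes[word])
```

broom is reduced because the oo sits between r and m. boo fails the lookahead, oow fails the lookbehind at the first o, and ooow is only partly reduced because the first o can serve as the lookbehind letter but cannot be part of the match.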
Here is a possible solution without regular expressions. It's faster, but it also collapses repeated whitespace and punctuation, not only letters.
def remove_duplicate(string):
    new_string = ''
    last_c = None
    for c in string:
        if c == last_c:
            continue
        else:
            new_string += c
        last_c = c
    if len(new_string) > 1:
        return new_string
    else:
        return string
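The difference between the two approaches shows up on input containing repeated punctuation or spaces. The sketch below restates both under made-up names (dedupe_letters for the regex version, dedupe_any for the loop) so they can be compared side by side; the sample strings are invented.

```python
import re

def dedupe_letters(string):
    """Regex version: collapses repeated ASCII letters only."""
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    return new_string if len(new_string) > 1 else string

def dedupe_any(string):
    """Loop version: collapses any repeated character, letters or not."""
    chars = []
    last_c = None
    for c in string:
        if c != last_c:
            chars.append(c)
        last_c = c
    new_string = ''.join(chars)
    return new_string if len(new_string) > 1 else string

for sample in ["good!!", "aa  bb", "aaa"]:
    print(repr(sample), repr(dedupe_letters(sample)), repr(dedupe_any(sample)))
```

Both leave "aaa" untouched, but only the loop version turns "good!!" into "god!".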

Calculate the index of the nth word in a string

Given the index of a word in a string starting at zero ("index" is position two in this sentence), and a word being defined as that which is separated by whitespace, I need to find the index of the first char of that word.
My whitespace regex pattern is "( +|\t+)+", just to cover all my bases (except newline characters, which are excluded). I used split() to separate the string into words, and then summed the lengths of those words. However, I need to account for the possibility that more than one whitespace character is used between words, so I can't simply add the number of words minus one to that figure and still be accurate every time.
Example:
>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
Change your regular expression to include the whitespace around each word to prevent it from being lost. The expression \s*\S+\s* will first consume leading whitespace, then the actual word, then trailing spaces, so only the first word in the resulting list might have leading spaces (if the string itself started with whitespace). The rest consist of the word itself potentially followed by whitespace. After you have that list, simply find the total length of all the words before the one you want, and account for any leading spaces the string may have.
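import re
def get_word_index(s, idx):
    # each item is a word together with its surrounding whitespace
    words = re.findall(r'\s*\S+\s*', s)
    # length of everything before the word, plus the word's own leading whitespace
    return sum(map(len, words[:idx])) + len(words[idx]) - len(words[idx].lstrip())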
Testing:
>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
>>> example2 = ' ' + example
>>> get_word_index(example2, 2)
9
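A different way to get the same number, sketched here as an alternative rather than as the method above, is to let re.finditer report where each word starts. The name word_start is made up, and the pattern [^ \t]+ assumes, as in the question, that only spaces and tabs separate words.

```python
import re

def word_start(s, idx):
    """Return the index of the first character of the idx-th word (0-based)."""
    # A word is any run of characters other than spaces and tabs.
    for i, match in enumerate(re.finditer(r'[^ \t]+', s)):
        if i == idx:
            return match.start()
    raise IndexError("word index out of range")

example = "This is an example sentence"
print(word_start(example, 2))
print(word_start(' ' + example, 2))
print(word_start("This   is\tan", 2))
```

Because match.start() is an absolute position in the original string, runs of several spaces or tabs need no special handling.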
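Maybe you could try str.index:
your_string.index(your_word)
Note that this returns the position of the first occurrence of the substring, so it can land inside an earlier word: "This is".index("is") returns 2, not 5.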

regex select sequences that start with specific number

I want to select all character strings that begin with 0.
x = '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']
The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]
Instead of using the boundary \b which will match between the comma and number (between any word [a-zA-Z0-9_] and non word character), you will want to match on start of string or space like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
This matches the start of the string or a whitespace character preceding the target string. It also includes that whitespace in the match if present, so I would suggest either trimming the matched string or wrapping the latter part in parentheses to make it a group and then taking group 1 from each match, like:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2
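Applied to the string from the question, the grouped pattern can be used directly with re.findall, which returns the group contents when the pattern has exactly one capturing group. The sketch also shows a negative lookbehind, (?<!\S), as an alternative that is not part of the answer above; it asserts that no non-whitespace character precedes the 0, so no group is needed.

```python
import re

x = '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'

# findall returns only group 1, so the leading whitespace is dropped.
with_group = re.findall(r'(?:^|\s)(0\S*)', x)

# Alternative: the 0 must not be preceded by a non-whitespace character.
with_lookbehind = re.findall(r'(?<!\S)0\S*', x)

print(with_group)
print(with_lookbehind)
```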

Remove a character in string if it doesn't belong to a group of matching pattern in Python

I have a string containing many words. I want to remove the closing parenthesis from each word that doesn't start with _.
Example input:
this is an example to _remove) brackets under certain) conditions.
Output:
this is an example to _remove) brackets under certain conditions.
How can I do that without splitting the words using re.sub?
re.sub accepts a callable as the second parameter, which comes in handy here:
>>> import re
>>> s = 'this is an example to _remove) brackets under certain) conditions.'
>>> re.sub(r'(\w+)\)', lambda m: m.group(0) if m.group(0).startswith('_') else m.group(1), s)
'this is an example to _remove) brackets under certain conditions.'
I wouldn't use regex here when a list comprehension can do it.
result = ' '.join([word.rstrip(")") if not word.startswith("_") else word
                   for word in words.split(" ")])
If you have possible input like:
someword))
that you want to turn into:
someword)
Then you'll have to do:
result = ' '.join([word[:-1] if word.endswith(")") and not word.startswith("_") else word
                   for word in words.split(" ")])
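The two variants differ only when a word ends in more than one closing parenthesis. The sketch below wraps each in a function (the names strip_all_parens and strip_one_paren are made up for illustration) and runs them on the example sentence and on the someword)) case.

```python
def strip_all_parens(text):
    """rstrip variant: removes every trailing ')' from words not starting with '_'."""
    return ' '.join(word.rstrip(")") if not word.startswith("_") else word
                    for word in text.split(" "))

def strip_one_paren(text):
    """Slice variant: removes only the last ')' from words not starting with '_'."""
    return ' '.join(word[:-1] if word.endswith(")") and not word.startswith("_") else word
                    for word in text.split(" "))

s = 'this is an example to _remove) brackets under certain) conditions.'
print(strip_all_parens(s))
print(strip_one_paren(s))
print(strip_all_parens("someword)) _keep))"))
print(strip_one_paren("someword)) _keep))"))
```

Both only look at the end of each space-separated word, so a parenthesis followed by punctuation, as in "certain).", is left in place by either function.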
