Remove specific characters from a pandas column? - python

Hello, I have a dataframe where I want to remove a specific set of characters, 'Fwd:', from every row that starts with it. The issue I am facing is that the code I am using also strips letters from rows that merely start with the letter 'F'.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd: Please take action on the action needed items
4 Fix all the mistakes please
When I used the code:
df['Clean Summary'] = individual_receivers['summary'].map(lambda x: x.lstrip('Fwd:'))
I end up with a dataframe that looks like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 ix all the mistakes please
I don't want the last row to lose the F in 'Fix'.

You should use a regex, remembering that ^ anchors the match to the start of the string:
df['Clean Summary'] = df['summary'].str.replace(r'^Fwd: ', '', regex=True)
Here's an example:
import pandas as pd

df = pd.DataFrame({'msg': ['Fwd: o', 'oe', 'Fwd: oj'], 'B': [1, 2, 3]})
df['clean_msg'] = df['msg'].str.replace(r'^Fwd: ', '', regex=True)
print(df)
Output:
msg B clean_msg
0 Fwd: o 1 o
1 oe 2 oe
2 Fwd: oj 3 oj

You are not only losing 'F' but also 'w', 'd', and ':'. This is the way lstrip works: it treats the argument as a set of characters and strips every leading character that belongs to that set, not the literal prefix.
You should actually use x.replace('Fwd:', '', 1)
The count argument 1 ensures that only the first occurrence of the string is removed.
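For example, a minimal sketch applying that fix to the frame from the question (assuming the same individual_receivers frame, and including the trailing space in the prefix):
# remove only the first occurrence of the literal prefix 'Fwd: '
df['Clean Summary'] = individual_receivers['summary'].map(lambda x: x.replace('Fwd: ', '', 1))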

Related

How to filter rows with non Latin characters

I am stuck on a problem with a dataframe that has a column of film names containing a bunch of non-Latin names, like Japanese or Chinese (and maybe Russian names too). My code is:
df['title'].head(5)
1 I am legend
2 wonder women
3 アライヴ
4 怪獣総進撃
5 dead sea
I just want an output that drops every title with non-Latin characters, i.e. remove every row that contains characters like those in rows 3 and 4, so my desired output is:
df['title'].head(5)
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
Any help with this code?
You can use str.match with the Latin-1 character range to identify titles containing non-Latin characters, and use the boolean output to slice the data:
df_latin = df[~df['title'].str.match(r'.*[^\x00-\xFF]')]
output:
title
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
You can encode your title column with unicode_escape and then decode it as latin1. If this round trip does not match your original data, remove the row because it contains some non-Latin characters:
df = df[df['title'] == df['title'].str.encode('unicode_escape').str.decode('latin1')]
print(df)
# Output
title
0 I am legend
1 wonder women
3 dead sea
You can use the isascii() method (if you're using Python 3.7+). Example:
"I am legend".isascii() # True
"アライヴ".isascii() # False
Even a single non-ASCII letter will make the isascii() method return False.
(Note that for strings like '34?#5' the method will return True, because those are all ASCII characters.)
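For example, this is easy to check directly:
"34?#5".isascii() # True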
We can easily make a function that returns whether a string is ASCII or not, and based on that we can then filter our dataframe.
import pandas as pd

dict_1 = {'col1': list(range(1, 6)),
          'col2': ['I am legend', 'wonder women', 'アライヴ', '怪獣総進撃', 'dead sea']}

def check_ascii(string):
    # True only when every character in the string is ASCII
    return string.isascii()

df = pd.DataFrame(dict_1)
df['is_eng'] = df['col2'].apply(check_ascii)
df2 = df[df['is_eng'] == True]
df2
Output
col1 col2 is_eng
0 1 I am legend True
1 2 wonder women True
4 5 dead sea True
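A more compact sketch of the same idea (again assuming Python 3.7+ for str.isascii) builds the boolean mask directly, without the helper column:
df2 = df[df['col2'].map(str.isascii)]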

How do you remove standalone alphabets from a column in python?

I am currently working with lyric data of several artists.
When I was working with BTS lyric data, I noticed that the data had names of the member who sang that line at the front as you can see in the example below.
Artist Lyric
bts(방탄소년단) jungkook 'cause i i i'm in the stars tonight s...
bts(방탄소년단) 방탄소년단의 fake love 가사 v jungkook 널 위해서라면 난 슬퍼도...
I tried removing their names with the str.replace() method.
However, one of their names is simply "v", and when I try to remove "v" it removes all v's from the column, as demonstrated below:
Artist Lyric
bts(방탄소년단) 'cause i i i'm in the stars tonight s...
bts(방탄소년단) 방탄소년단의 fake lo e 가사 널 위해서라면 난 슬퍼도...
It would be really appreciated if there were a way to remove a "v" that stands alone but keep the v's that are actually inside a word, such as the v in love.
Thank you in advance!!
If you want to remove all single letters in your data, let's say that you have the following dataframe:
artist lyrics
0 sdsddssd fdgdg v sdsssdvsdd
1 sffxxvxv sddsdsdvdfdf
2 vxvxvxv fdagsgvs v v v
3 zcczc xdfsddfds
4 zcczc vxvxdfdvdvd
then
df['lyrics'].map(lambda x: ' '.join(word for word in x.split() if len(word) > 1))
returns:
0 fdgdg sdsssdvsdd
1 sddsdsdvdfdf
2 fdagsgvs
3 xdfsddfds
4 vxvxdfdvdvd
Name: lyrics, dtype: object
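If you only want to drop standalone single letters (such as the lone "v") while leaving letters inside words untouched, a regex with word boundaries is another option. A minimal sketch, assuming the same 'lyrics' column as above:
# \b is a word boundary, so a single Latin letter between two boundaries stands alone;
# \s* also removes the space that follows it
df['lyrics'] = df['lyrics'].str.replace(r'\b[a-zA-Z]\b\s*', '', regex=True)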

remove words starting with "#" in a column from a dataframe

I have a dataframe called tweetscrypto and I am trying to remove all the words from the column "text" starting with the character "#" and gather the result in a new column "clean_text". The rest of the words should stay exactly the same:
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(filter(lambda x:x[0]!='#', x.split()))
it does not seem to work. Can somebody help?
Thanks in advance
Use str.replace to remove the strings starting with #.
Sample Data
text
0 News via #livemint: #RBI bars banks from links
1 Newsfeed from #oayments_source: How Africa
2 is that bitcoin? not my thing
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(\#\w+.*?)', "", regex=True)
Still, you can capture # without escaping it, as noted by baxx:
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(#\w+.*?)', "", regex=True)
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
In this case it might be better to define a method rather than using a lambda for mainly readability purposes.
def clean_text(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(clean_text)

Select strings based on numeric range between words

I am trying to write a regex that matches columns in my dataframe. All the columns in the dataframe are
cols = ['after_1', 'after_2', 'after_3', 'after_4', 'after_5', 'after_6',
'after_7', 'after_8', 'after_9', 'after_10', 'after_11', 'after_12',
'after_13', 'after_14', 'after_15', 'after_16', 'after_17', 'after_18',
'after_19', 'after_20', 'after_21', 'after_22', 'after_10_missing',
'after_11_missing', 'after_12_missing', 'after_13_missing',
'after_14_missing', 'after_15_missing', 'after_16_missing',
'after_17_missing', 'after_18_missing', 'after_19_missing',
'after_1_missing', 'after_20_missing', 'after_21_missing',
'after_22_missing', 'after_2_missing', 'after_3_missing',
'after_4_missing', 'after_5_missing', 'after_6_missing',
'after_7_missing', 'after_8_missing', 'after_9_missing']
I want to select all the columns whose number in the name ranges from 1 to 14.
This code works
df.filter(regex = '^after_[1-9]$|after_([1-9]\D|1[0-4])').columns
but I'm wondering how to write it as a single pattern instead of splitting it in two. The first part selects all strings that end in a number between 1 and 9 (i.e. 'after_1' ... 'after_9') but not their "missing" counterparts. The second part (after the |) selects any string that begins with 'after' and whose number is between 1 and 9 followed by a non-digit, or starts with 1 followed by 0-4.
Is there a better way to write this?
I already tried
df.filter(regex = 'after_([1-9]|1[0-4])').columns
But that picks up strings that begin with a 1 or a 2 (i.e. 'after_20')
Try this: after_([1-9]|1[0-4])[a-zA-Z_]*\b
import re

regexp = r'(after_)([1-9]|1[0-4])(_missing)*\b'
cols = ['after_1', 'after_14', 'after_15', 'after_14_missing', 'after_15_missing', 'after_9_missing']
for i in cols:
    print(i, re.findall(regexp, i))
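If the goal is a single pattern that can be passed straight to df.filter, one possible sketch (assuming the only suffix in your column names is _missing, as in the list above) anchors the whole name:
# matches after_1 ... after_14, optionally followed by _missing, and nothing else
df.filter(regex=r'^after_([1-9]|1[0-4])(_missing)?$').columns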

Using Reg Ex to Match Strings in a Data Frame and Replace - python

I have a data frame that looks like this:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT
I want to be able to strip ,60 ,R-12,HT using regex and also delete the moreinfo and GoCats rows from the df.
My expected Results:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
I first removed the strings
# 'del' is a reserved word in Python, so use a different name
to_remove = ['hello', 'moreinfo']
for i in to_remove:
    df = df[df['value'] != i]
Can somebody suggest a way to use regex to match the A067-M4FL-CAA-020 or MZF8-050Z-AAB pattern and delete every row that doesn't match it, so I don't have to create a list of all possible cases?
I was able to strip a single line like this, but I want to be able to strip all matching cases in the dataframe:
pattern = r',\w+ \,\w+-\w+\,\w+ *'
line = 'MRF2-050A-TFC,60 ,R-12,HT'
for i in re.findall(pattern, line):
    line = line.replace(i, '')
>>> MRF2-050A-TFC
I tried adjusting my code but it prints out the same output for each row
pattern = r',\w+ \,\w+-\w+\,\w+ *'
for d in df:
    for i in re.findall(pattern, d):
        d = d.replace(i, '')
Any suggestions will be greatly appreciated. Thanks
You may try this
(?:\w+-){2,}[^,\n]*
A Python script might be as follows:
ss="""0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT"""
import re

regx = re.compile(r'(?:\w+-){2,}[^,\n]*')
m = regx.findall(ss)
for i in range(len(m)):
    print("%d %s" % (i, m[i]))
and the output is
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
Here's a simpler approach you can try using pandas' built-in string methods instead of the re module; pandas has many built-in functions to deal with text data.
# remove unwanted values
df['value'] = df.value.str.replace(r'moreinfo|60|R-.*|HT|GoCats|\,', '', regex=True)
# drop na
df = df[(df != '')].dropna()
# print
print(df)
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
3 MZF8-050Z-AAB
5 MZA2-0580-TFD
-----------
# data used
from io import StringIO
import pandas as pd

df = pd.read_fwf(StringIO(u'''
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT'''), header=1)
I'd suggest capturing the data you DO want, since it's pretty particular, and the data you do NOT want could be anything.
Your pattern should look something like this:
^\w{4}-\w{4}-\w{3}(?:-\d{3})?
https://regex101.com/r/NtH2Ut/2
I'd recommend being more specific than \w where possible (like ^[A-Z]\w{3}) if you know the beginning four-character chunk should start with a letter.
edit
Sorry, I may not have read your input and output literally enough:
https://regex101.com/r/NtH2Ut/3
^(?:\d+\s+\w{4}-\w{4}-\w{3}(?:-\d{3})?)|^\s+.*
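To turn that capture-what-you-want idea into the dataframe operation the question asks for, here is a minimal sketch (assuming the column is named 'value' as above) using str.extract; rows that do not match the pattern become NaN and are dropped:
# keep only the part-number-shaped prefix of each value
pattern = r'^(\w{4}-\w{4}-\w{3}(?:-\d{3})?)'
df['value'] = df['value'].str.extract(pattern, expand=False)
df = df.dropna(subset=['value']).reset_index(drop=True)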
