Replace string in pandas dataframe if it contains specific substring

Replace string in pandas dataframe if it contains specific substring - python

I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] contains only strings. What I want is to check this column and if a string contains a specific substring(not really interested where they are in the string as long as they exist) to be replaced. I am using this script
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records though they contain some of the substrings I defined, do not get replaced. Is it a regex related problem?
Thank you in advance.

Create dictionary of list of substrings with key for replace strings, loop it and join all list values by | for regex OR, so possible check column by contains and replace matched rows with loc:
df = pd.DataFrame({'category':['sss mdma df','milit ss aa','aa ss']})
a = ['mdma', 'xanax' , 'kamagra']
b = ['weapon','milit','gun']
g='Drugs'
z='Weapons'
c='Flowers'
d = {g:a, z:b}
df['new_category'] = c
for k, v in d.items():
pat = '|'.join(v)
mask = df.category.str.contains(pat, case=False)
df.loc[mask, 'new_category'] = k
print (df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers

Related

How to replace last three characters of a string in a column if it starts with character

I have a pandas dataframe of postcodes which have been concatenated with the two-letter country code. Some of these are Brazilian postcodes and I want to replace the last three characters of any postcode which starts with 'BR' with '000'.
import pandas as pd
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
I have tried the below, but it is not changing any of the postcodes:
if df['postcode'].str.startswith('BR').all():
df["postcode"] = df["postcode"].str.replace(r'.{3}$', '000')

Use str.replace with a capturing group:
df['postcode'] = df['postcode'].str.replace(r'(BR.*)...', r'\g<1>000', regex=True)
# or, more generic
df['postcode'] = df['postcode'].str.replace(r'(BR.*).{3}', r'\g<1>'+'0'*3, regex=True)
Output:
postcode
0 BR86037-000
1 GBBB7
2 BR86071-000
3 BR86200-000
4 BR86026-000
5 BR86082-000
6 GBCW9
7 NO3140
regex demo

The code is not working because df['postcode'].str.startswith('BR').all() will return a boolean value indicating whether all postcodes in the column start with 'BR'.
try this
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
mask = df['postcode'].str.startswith('BR')
df.loc[mask, 'postcode'] = df.loc[mask, 'postcode'].str.replace(r'.{3}$', '000')

How to check the pattern of a column in a dataframe

I have a dataframe which has some id's. I want to check the pattern of those column values.
Here is how the column looks like-
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to to write a code that can distinguish characters and numerics in the pattern above and display an output like 'SSSS99SS' as the pattern of the column above where 'S' represents a character and '9' represents a numeric.This dataset is a large dataset so I can't predefine the position the characters and numeric will be in.I want the code to calculate the position of the characters and numerics. I am new to python so any leads will be helpful!

You can try something like:
my_string = "ASDH12HK"
def decode_pattern(my_string):
my_string = ''.join(str(9) if s.isdigit() else s for s in my_string)
my_string = ''.join('S' if s.isalpha() else s for s in my_string)
return my_string
decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well as below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
id pattern
0 ASDH12HK SSSS99SS
1 GHST67KH SSSS99SS
2 AGSH90IL SSSS99SS
3 THKI86LK SSSS99SS
4 SOMEPATTERN123 SSSSSSSSSSS999

You can use regular experssion:
st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starting with 4 Char followed by 2 numeric and again 4 char
So you can use this in your df to filter the df

You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)

remove duplicate pairs from the list in column in pandas

I would like to remove duplicate pairs from the list in column while mainting the order:
for example the input is :
cola. colb
1. [sitea,siteb,sitea,siteb;sitec,sited,sitec,sited]
the expected output is the unique elements before each ';' symbol
cola. colb
1. [sitea,siteb;sitec,sited]
I tried splitting the column based on the ; symbol and the create a set for the list but it didn't work.
df['test'] = df.e2etrail.str.split(';').map(lambda x : ','.join(sorted(set(x),key=x.index)))
I also tried the following
df['test'] = df['e2etrail'].apply(lambda x: list(pd.unique(x)))
Any idea on how to make it work

You can remove [] by strip and then split by , or ; first and then use your solution:
print (df.e2etrail.str.strip('[]').str.split('[;,]'))
dtype: object
0 [sitea, siteb, sitea, siteb, sitec, sited, sit...
Name: e2etrail, dtype: object
f = lambda x : ','.join(sorted(set(x),key=x.index))
df['test'] = df.e2etrail.str.strip('[]').str.split('[;,]').map(f)
print (df)
cola. e2etrail \
0 1.0 [sitea,siteb,sitea,siteb;sitec,sited,sitec,sited]
test
0 sitea,siteb,sitec,sited
If need output list:
f = lambda x : sorted(set(x),key=x.index)
df['test'] = df.e2etrail.str.strip('[]').str.split('[;,]').map(f)
print (df)
cola. e2etrail \
0 1.0 [sitea,siteb,sitea,siteb;sitec,sited,sitec,sited]
test
0 [sitea, siteb, sitec, sited]

eventually I did it by converting the list into series, dropped the duplicates and joined the series again as following :
df['e2etrails']=df['e2etrails'].str.split(';')
df['e2etrails']=df['e2etrails'].apply(lambda row :';'.join(pd.Series(row).str.split(',').map(lambda x : ','.join(sorted(set(x),key=x.index)))))

Remove grave accent from IDs

I have an ID column with grave accent like this `1234ABC40 and I want to remove just that character from this column but keep the dataframe form.
I tried this on the column only. I have a file name x here and has multiple columns. id is the col i want to fix.
pd.read_csv(r'C:\filename.csv', index_col = False)
id = str(x['id'])
id2 = unidecode.unidecode(id)
id3 = id2.replace('`','')
This changes to str but I want that column back in the dataframe form

DataFrames have their own replace() function. Note, for partial replacements you must enable regex=True in the parameters:
import pandas as pd
d = {'id': ["12`3", "32`1"], 'id2': ["004`", "9`99"]}
df = pd.DataFrame(data=d)
df["id"] = df["id"].replace('`','', regex=True)
print df
id id2
0 123 004`
1 321 9`99

Create pandas dataframe from string

I can easily build a pandas dataframe from a string that contains only one key value pair. For example:
string1 = '{"Country":"USA","Name":"Ryan"}'
dict1 = json.loads(string1)
df=pd.DataFrame([dict1])
print(df)
However, when I use a string that has more than one key value pair :
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
dict2 = json.loads(string2)
I get the following error:
raise JSONDecodeError("Extra data", s, end)
I am aware that string2 is not a valid JSON.
What modifications can I do on string2 programmatically so that I can convert it to a valid JSON and then get a dataframe output which is as follows:
| Country | Name |
|---------|------|
| USA | Ryan |
| Sweden | Sam |
| Brazil | Ralf |

Your error
The error says it all. The JSON is not valid. Where did you get that string2? Are you typing it in yourself?
In that case you should surround the items with brackets [] and separate the items with comma ,.
Working example:
import pandas as pd
import json
string2 = '[{"Country":"USA","Name":"Ryan"},{"Country":"Sweden","Name":"Sam"},{"Country":"Brazil","Name":"Ralf"}]'
df = pd.DataFrame(json.loads(string2))
print(df)
Returns:
Country Name
0 USA Ryan
1 Sweden Sam
2 Brazil Ralf
Interestingly, if you are extra observant, in this line here df=pd.DataFrame([dict1]) you are actually putting your dictionary inside an array with brackers[]. This is because pandas DataFrame accepts arrays of data. What you actually have in your first example is an item in which case a serie would make more sense or df = pd.Series(dict1).to_frame().T.
Or:
string1 = '[{"Country":"USA","Name":"Ryan"}]' # <--- brackets here to read json as arr
dict1 = json.loads(string1)
df=pd.DataFrame(dict1)
print(df)
And if you understood this I think it becomes easier to understand that we need , to seperate the elements.
Alternative inputs
But let's say you are creating this dataset yourself, then you could go ahead and do this:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
dict1 = [{"Country":i, "Name":y} for i,y in data] # <-- dictionaries inside arr
df = pd.DataFrame(dict1)
Or:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
df = pd.DataFrame(dict1, columns=['Country','Name'])
Or which I would prefer to use a CSV-structure:
data = '''\
Country,Name
USA,Ryan
Sweden,Sam
Brazil,Ralf'''
df = pd.read_csv(pd.compat.StringIO(data))

In the off chance that you are getting data from elsewhere in the weird format that you described, following regular expression based substitutions can fix your json and there after you can go as per #Anton vBR 's solution.
import pandas as pd
import json
import re
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
#create dict of substitutions
rd = { '^{' : '[{' , #substitute starting char with [
'}$' : '}]', #substitute ending char with ]
'}{' : '},{' #Add , in between two dicts
}
#replace as per dict
for k,v in rd.iteritems():
string2 = re.sub(r'{}'.format(k),r'{}'.format(v),string2)
df = pd.DataFrame(json.loads(string2))
print(df)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Replace string in pandas dataframe if it contains specific substring - python

Related

How to replace last three characters of a string in a column if it starts with character

How to check the pattern of a column in a dataframe

remove duplicate pairs from the list in column in pandas

Remove grave accent from IDs

Create pandas dataframe from string

Categories

Resources