Pandas: Randomize letters in a column - python

I have a data frame that looks like this:
id1 | id2
----------------------------
ab51c-ee-1a | cga--=%abd21
I am looking to randomize the letters only:
id1 | id2
----------------------------
ge51r-eq-1b | olp--=%cqw21
I think I can do something like this:
import random
import string

newid1 = []
for index, row in df.iterrows():
    new_string = ''
    for ch in row['id1']:
        if ch.isalpha():
            new_string += random.choice(string.ascii_letters)
        else:
            new_string += ch
    newid1.append(new_string)
But it doesn't seem very efficient. Is there a better way?

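Let's use apply together with str.replace, which accepts a regex and a callable, to replace only the letters:
import string
import random
letters = list(string.ascii_lowercase)
def rand(match):
    # Called once per regex match; returns a random lowercase letter
    return random.choice(letters)
# regex=True is required for a callable replacement (the default is False since pandas 2.0)
df.apply(lambda x: x.str.replace('[a-z]', rand, regex=True))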
Output :
           id1           id2
0  gp51e-id-1v  jvj--=%glw21
For one specific column, use:
df['id1'].str.replace('[a-z]', rand, regex=True)
Added by @antonvbr: For future reference, if we want to change both upper and lower case we could do this:
letters = dict(u=list(string.ascii_uppercase), l=list(string.ascii_lowercase))
(df['id1'].str.replace('[a-z]', lambda x: random.choice(letters['l']), regex=True)
          .str.replace('[A-Z]', lambda x: random.choice(letters['u']), regex=True))
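The two-pass version above can also be written as a single pass. The following is a minimal sketch, not the answerers' code: the helper name randomize_letter and the sample data with uppercase letters are made up for illustration. It replaces each letter with a random one of the same case and leaves every other character untouched.

```python
import random
import string

import pandas as pd


def randomize_letter(match):
    """Return a random letter with the same case as the matched letter."""
    ch = match.group(0)
    pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
    return random.choice(pool)


# Hypothetical sample data, based on the question but with some uppercase letters
df = pd.DataFrame({'id1': ['ab51c-EE-1a'], 'id2': ['cga--=%ABD21']})

# One regex pass per column; digits and punctuation are never matched
randomized = df.apply(
    lambda col: col.str.replace(r'[A-Za-z]', randomize_letter, regex=True)
)
print(randomized)
```

Because the replacement is random, the printed values differ between runs; only the positions of letters, digits and symbols stay fixed.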

How about this:
import pandas as pd
from string import ascii_lowercase as al
import random
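df = pd.DataFrame({'id1': ['ab51c-ee-1a'],
                   'id2': ['cga--=%abd21']})
al = list(al)
# Replace each lowercase letter with a random one; other characters are kept.
# (In pandas 2.1+, DataFrame.applymap is deprecated in favour of DataFrame.map.)
df = df.applymap(lambda x: ''.join([random.choice(al) if i in al else i for i in x]))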

Related

Search a column for a specific set of strings from a list and, if one is found, enter it in a new column

I want to search for names in column col_one, using the list of names in the variable list20. If a value in col_one matches a name in list20, that name should be put in a new column named new_col.
Most of the time the value is just the name, such as ZEN, W, WICE, but some values have a suffix after the name, such as ZEN-R, ZEN-W2, ZEN13P2302A.
My data:
import pandas as pd
list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA', 'XPG', 'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG', 'WPH', 'KCE']
data = {
    "col_one": ["ZEN", "WPH", "WICE", "YONG", "K", "XO", "WIN", "WP", "WIIK", "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG", "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)
# Each of the attempts below gives a result that is not right.
df['new_col'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
# or
import re
pattern = re.compile(r"|".join(x for x in list20))
df = (df
      .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
# or
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na' #adjust as you see fit
df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
The result obtained from the code above is not right.
Expected Output
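Try sorting the list by length first, longest names first, so that a long name such as WINNER is tried before its shorter prefixes WIN and W:
pattern = re.compile(r"|".join(x for x in sorted(list20, reverse=True, key=len)))
(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)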
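The length-sorting idea can be combined with str.extract. This is a minimal sketch under a few assumptions that are not in the original answers: the name list and column are shortened versions of the question's data, the pattern is anchored with ^ because the question says the name comes at the front, and re.escape is added in case a name ever contains a regex metacharacter.

```python
import re

import pandas as pd

# Shortened, illustrative versions of the question's list20 and col_one
list20 = ['ZEN', 'W', 'WICE', 'WIN', 'WINNER', 'WHA', 'WHAUP', 'K', 'KCE']
df = pd.DataFrame({
    'col_one': ['ZEN', 'WICE', 'WINNER', 'W-W5', 'WHAUP', 'KCE13P2302A', 'ZEN-R']
})

# Longest names first, so WINNER is tried before WIN and W
names = sorted(list20, key=len, reverse=True)

# Anchor at the start of the string and capture the matched name
pattern = '^(' + '|'.join(re.escape(name) for name in names) + ')'
df['new_col'] = df['col_one'].str.extract(pattern, expand=False)
print(df)
```

Regex alternation takes the first alternative that matches at a given position, not the longest, which is why the order of the names matters here.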
Try with str.extract
df['new'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
df
Out[121]:
      col_one   new
0        CFER  CFER
1  ABCP6P45C9   ABC
2      LOU-W5   LOU
3      CFER-R  CFER
4      ABC-W1   ABC
5  LOU13C2465   LOU
Another way to do this, less attractive in terms of efficiency, is to use a simple function with apply and a lambda:
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na' #adjust as you see fit
df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
df
Expected results:
      col_one new_col
0        CFER    CFER
1  ABCP6P45C9     ABC
2      LOU-W5     LOU
3      CFER-R    CFER
4      ABC-W1     ABC
5  LOU13C2465     LOU
Another approach:
pattern = re.compile(r"|".join(x for x in list20))
(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)

How to check the pattern of a column in a dataframe

I have a dataframe which has some ids. I want to check the pattern of those column values.
Here is how the column looks:
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to write code that can distinguish letters and digits in the values above and display an output like 'SSSS99SS' as the pattern of the column, where 'S' represents a letter and '9' represents a digit. The dataset is large, so I can't predefine the positions of the letters and digits; I want the code to work them out. I am new to Python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"
def decode_pattern(my_string):
    # Map every digit to '9' first, then every letter to 'S'
    my_string = ''.join(str(9) if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string
decode_pattern(my_string)
Output:
'SSSS99SS'
You can also apply this to the column in your dataframe, as below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
               id         pattern
0        ASDH12HK        SSSS99SS
1        GHST67KH        SSSS99SS
2        AGSH90IL        SSSS99SS
3        THKI86LK        SSSS99SS
4  SOMEPATTERN123  SSSSSSSSSSS999
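For a large column, the same mapping can be done with the vectorized string methods, and the resulting patterns can then be counted to see which one dominates. The sketch below is one possible approach, not the answerer's code; the extra id AB1234CD and the names pattern_counts and dominant_pattern are made up for illustration.

```python
import pandas as pd

# The question's ids plus one hypothetical id with a different layout
df = pd.DataFrame({'id': ['ASDH12HK', 'GHST67KH', 'AGSH90IL', 'THKI86LK', 'AB1234CD']})

# Letters become 'S', digits become '9'; any other character is left as is
df['pattern'] = (
    df['id']
    .str.replace(r'[A-Za-z]', 'S', regex=True)
    .str.replace(r'[0-9]', '9', regex=True)
)

# How often each pattern occurs, and which pattern is the most common
pattern_counts = df['pattern'].value_counts()
dominant_pattern = pattern_counts.idxmax()
print(pattern_counts)
print(dominant_pattern)
```

Rows whose pattern differs from dominant_pattern can then be inspected as possible outliers, for example with df[df['pattern'] != dominant_pattern].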
You can use a regular expression:
import re
st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It returns a match object if the string starts with 4 letters followed by 2 digits and then 4 more letters, and None otherwise.
You can use this to filter the rows of your df.
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)

Replace string in pandas dataframe if it contains specific substring

I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] contains only strings. What I want is to check this column and if a string contains a specific substring(not really interested where they are in the string as long as they exist) to be replaced. I am using this script
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records though they contain some of the substrings I defined, do not get replaced. Is it a regex related problem?
Thank you in advance.
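In the original code, an expression like ('mdma' or 'xanax' or ...) evaluates to its first truthy operand, 'mdma', so only that one substring is ever tested. Instead, create a dictionary that maps each replacement string to its list of substrings, loop over it, and join each list with | (regex OR). Then check the column with str.contains and replace the matched rows with loc: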
df = pd.DataFrame({'category':['sss mdma df','milit ss aa','aa ss']})
a = ['mdma', 'xanax' , 'kamagra']
b = ['weapon','milit','gun']
g='Drugs'
z='Weapons'
c='Flowers'
d = {g:a, z:b}
df['new_category'] = c
for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k
print (df)
      category new_category
0  sss mdma df        Drugs
1  milit ss aa      Weapons
2        aa ss      Flowers
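To see why the chained or in the question misses records, and how any() expresses the intended test, here is a small sketch. The shortened term lists, the sample rows and the function name categorize are made up for illustration.

```python
import pandas as pd

# `or` returns its first truthy operand, so this is just the string 'mdma'
first_only = ('mdma' or 'xanax' or 'weed')
print(first_only)

# Shortened, illustrative term lists
drug_terms = ['mdma', 'xanax', 'weed']
weapon_terms = ['weapon', 'milit', 'gun']


def categorize(text):
    """Return a category based on which list has a substring in text."""
    text = text.lower()
    if any(term in text for term in drug_terms):
        return 'Drugs'
    if any(term in text for term in weapon_terms):
        return 'Weapons'
    return 'Flowers'


df = pd.DataFrame({'category': ['Xanax 2mg', 'milit ss aa', 'aa ss']})
df['new_category'] = df['category'].map(categorize)
print(df)
```

With the original expression, 'Xanax 2mg' would not have been recognized, because only 'mdma' was ever checked against each record.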

Create pandas dataframe from string

I can easily build a pandas dataframe from a string that contains only one key value pair. For example:
import json
import pandas as pd
string1 = '{"Country":"USA","Name":"Ryan"}'
dict1 = json.loads(string1)
df=pd.DataFrame([dict1])
print(df)
However, when I use a string that has more than one key value pair :
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
dict2 = json.loads(string2)
I get the following error:
raise JSONDecodeError("Extra data", s, end)
I am aware that string2 is not a valid JSON.
What modifications can I do on string2 programmatically so that I can convert it to a valid JSON and then get a dataframe output which is as follows:
| Country | Name |
|---------|------|
| USA | Ryan |
| Sweden | Sam |
| Brazil | Ralf |
Your error
The error says it all. The JSON is not valid. Where did you get that string2? Are you typing it in yourself?
In that case you should surround the items with brackets [] and separate the items with comma ,.
Working example:
import pandas as pd
import json
string2 = '[{"Country":"USA","Name":"Ryan"},{"Country":"Sweden","Name":"Sam"},{"Country":"Brazil","Name":"Ralf"}]'
df = pd.DataFrame(json.loads(string2))
print(df)
Returns:
  Country  Name
0     USA  Ryan
1  Sweden   Sam
2  Brazil  Ralf
Interestingly, if you look closely at the line df=pd.DataFrame([dict1]), you are actually putting your dictionary inside a list with brackets []. This is because the pandas DataFrame constructor accepts a list of records. What you have in your first example is a single item, in which case a Series would make more sense, or df = pd.Series(dict1).to_frame().T.
Or:
string1 = '[{"Country":"USA","Name":"Ryan"}]' # <--- brackets here to read json as arr
dict1 = json.loads(string1)
df=pd.DataFrame(dict1)
print(df)
Once you see this, it becomes easier to understand why we need , to separate the elements.
Alternative inputs
But let's say you are creating this dataset yourself, then you could go ahead and do this:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
dict1 = [{"Country":i, "Name":y} for i,y in data] # <-- dictionaries inside arr
df = pd.DataFrame(dict1)
Or:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
df = pd.DataFrame(data, columns=['Country','Name'])
Or, which I would prefer, use a CSV structure:
import io
data = '''\
Country,Name
USA,Ryan
Sweden,Sam
Brazil,Ralf'''
df = pd.read_csv(io.StringIO(data))
On the off chance that you are getting the data from elsewhere in the format you described, the following regex-based substitutions can fix the JSON; after that you can proceed as in @Anton vBR's solution.
import pandas as pd
import json
import re
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
# Dict of regex substitutions
rd = {r'^\{': '[{',    # replace the leading { with [{
      r'\}$': '}]',    # replace the trailing } with }]
      r'\}\{': '},{'}  # add a comma between two adjacent dicts
# Apply each substitution in turn (dict.items() replaces Python 2's iteritems())
for k, v in rd.items():
    string2 = re.sub(k, v, string2)
df = pd.DataFrame(json.loads(string2))
print(df)
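The regex substitutions above would also rewrite a }{ that happens to occur inside a string value. An alternative that avoids editing the text is to decode one object at a time with json.JSONDecoder.raw_decode from the standard library. This is a minimal sketch; the helper name parse_concatenated_json is made up for illustration.

```python
import json

import pandas as pd


def parse_concatenated_json(text):
    """Decode back-to-back JSON values from text into a list."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here
        if text[pos].isspace():
            pos += 1
            continue
        # Returns the decoded value and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records


string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
df = pd.DataFrame(parse_concatenated_json(string2))
print(df)
```

The helper still raises json.JSONDecodeError if any individual object is malformed, so truly invalid input is not silently accepted.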

Is there a way to use str.count() function with a LIST of values instead of a single string?

I am trying to count the number of times that any string from a list_of_strings appears in a csv file cell.
For example, the following would work fine.
import pandas as pd
data_path = "SurveryResponses.csv"
df = pd.read_csv(data_path)
totalCount = 0
for row in df['rowName']:
    if type(row) == str:
        print(row.count('word_of_interest'))
However, I would like to be able to pass a list of strings (['str1', 'str2', 'str3']) rather than just one 'word_of_interest', such that if any of those strings appears the count increases by one.
Is there a way to do this?
Perhaps something along the lines of
totalCount = 0
words_of_interest = ['cat','dog','foo','bar']
for row in df['rowName']:
    if type(row) == str:
        if sum([word in row for word in words_of_interest]) > 0:
            totalCount += 1
Use the str accessor:
df['rowName'].str.count('word_of_interest')
If you need to convert the column to string first, use astype:
df['rowName'].astype(str).str.count('word_of_interest')
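The str accessor can handle a list of words if the words are joined into one regex alternation. The sketch below is an illustration under assumptions: the sample column (including a missing value) and the names words, per_row, rows_with_any and total are made up, and re.escape is used so that the words are matched literally.

```python
import re

import pandas as pd

# Hypothetical survey column with one missing value
df = pd.DataFrame({'rowName': ['cat and dog', 'no pets', None, 'foo foo bar']})
words = ['cat', 'dog', 'foo', 'bar']

# One alternation pattern that matches any of the words
pattern = '|'.join(re.escape(word) for word in words)

# Occurrences of any word in each cell (missing cells stay missing)
per_row = df['rowName'].str.count(pattern)

# Number of cells that contain at least one of the words
rows_with_any = int(df['rowName'].str.contains(pattern, na=False).sum())

# Total number of occurrences over the whole column
total = int(per_row.fillna(0).sum())
print(rows_with_any, total)
```

rows_with_any corresponds to the question's totalCount (one increment per cell with any match), while total counts every occurrence.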
Assuming list_of_strings = ['str1', 'str2', 'str3'], you can try the following inside the loop:
if any(map(lambda x: x in row, list_of_strings)):
    totalCount += 1
You can use this method to count from an external list
strings = ['string1','string2','string3']
sum([1 if sr in strings else 0 for sr in df.rowName])
Here is an example:
import io
filedata = """animal,amount
"['cat','dog']",2
"['cat','horse']",2"""
df = pd.read_csv(io.StringIO(filedata))
Returns this dataframe:
            animal  amount
0    ['cat','dog']       2
1  ['cat','horse']       2
Search for word cat (looping through all columns as series):
search = "cat"
# sums True for each Series and then wraps a sum around all sums
# sum([2,0]) in this case
sum([sum(df[cols].astype(str).str.contains(search)) for cols in df.columns])
Returns 2
