Expand pandas columns with dict while keeping fragment of column name - python

I have a dataset in which some columns have lookup values. There are several such columns in the dataset. I need to expand these columns so that the column name consists of the name of the column itself and the keys in the dict.
Example df:
df
col1 col2 col3
a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
Desired result:
df_res
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
a 1a 2a 1a 2a
b 1b 2b 1a 2a
c 1c 2c 1a 2a
How can I do that?
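For reference, a minimal reconstruction of the example frame (an assumption: the dict-like cells are stored as plain strings in the unquoted format shown above):
import pandas as pd
# hypothetical reconstruction of the question's frame
df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'col2': ['{key_1: 1a, key_2: 2a}',
             '{key_1: 1b, key_2: 2b}',
             '{key_1: 1c, key_2: 2c}'],
    'col3': ['{key_3: 1a, key_4: 2a}'] * 3,
})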

If the columns contain dictionaries, not strings, use a list comprehension with json_normalize:
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'dict'>
dfs = [pd.json_normalize(df.pop(x)).add_prefix(f'{x}_') for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2a 1a 2a
2 c 1c 2a 1a 2a
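If the dicts are flat (not nested) and the frame has a default RangeIndex, an equivalent sketch builds the expanded columns with the plain DataFrame constructor instead of json_normalize:
# sketch, assuming flat dicts and a default 0..n-1 index so join aligns correctly
dfs = [pd.DataFrame(df.pop(x).tolist()).add_prefix(f'{x}_') for x in cols]
df = df.join(pd.concat(dfs, axis=1))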
Solution for strings that can be converted to dictionaries with ast.literal_eval:
print (df)
col1 col2 col3
0 a {'key_1': '1a', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
1 b {'key_1': '1b', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
2 c {'key_1': '1c', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'str'>
import ast
dfs = [pd.json_normalize(df.pop(x).apply(ast.literal_eval)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2a 1a 2a
2 c 1c 2a 1a 2a
EDIT: Solution for original format with custom function:
print (df)
col1 col2 col3
0 a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
1 b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
2 c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
# <class 'str'>
f = lambda x: dict([x.split(': ') for x in x.strip("{'}").split(', ')])
dfs = [pd.json_normalize(df.pop(x).apply(f)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a

I would use the df[col].apply(pd.Series) method to achieve this. It would then look something like this:
def explode_dictcol(df, col):
    temp = df[col].apply(pd.Series)
    temp = temp.rename(columns={cc: col + '_' + cc for cc in temp.columns})
    return temp
df = pd.concat([df, explode_dictcol(df, 'col2'), explode_dictcol(df, 'col3')], axis=1)
df = df.drop(columns=['col2', 'col3'])
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
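Note this assumes the cells already hold dicts; if they are strings of valid Python literals, a hedged pre-processing step with ast.literal_eval (as in the earlier answer) could be applied first:
import ast
# sketch: convert string cells like "{'key_1': '1a'}" to dicts before exploding
for c in ['col2', 'col3']:
    df[c] = df[c].apply(ast.literal_eval)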

If the columns are strings, the following will do the work
df_new = pd.DataFrame(data = [
    [row['col1'],
     row['col2'].split(':')[1].split(',')[0].strip(),
     row['col2'].split(':')[2].split('}')[0].strip(),
     row['col3'].split(':')[1].split(',')[0].strip(),
     row['col3'].split(':')[2].split('}')[0].strip()]
    for index, row in df.iterrows()
]).rename(columns = {0: 'col1', 1: 'col2_key_1', 2: 'col2_key_2', 3: 'col3_key_3', 4: 'col3_key_4'})
[Out]:
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
Notes:
Check the data type with
print(type(df['col2'][0]))
# or
print(type(df['col2'].iat[0]))
The first part of the proposed solution
df_new = pd.DataFrame(data = [
    [row['col1'],
     row['col2'].split(':')[1].split(',')[0].strip(),
     row['col2'].split(':')[2].split('}')[0].strip(),
     row['col3'].split(':')[1].split(',')[0].strip(),
     row['col3'].split(':')[2].split('}')[0].strip()]
    for index, row in df.iterrows()
])
gives the following output
0 1 2 3 4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
which is almost the same; that is why one has to chain .rename() to make sure the column names are as the OP wants.
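A shorter variant (a sketch, not part of the original answer) names the columns up front by passing columns= to the constructor, so no .rename() is needed:
rows = [[row['col1'],
         row['col2'].split(':')[1].split(',')[0].strip(),
         row['col2'].split(':')[2].split('}')[0].strip(),
         row['col3'].split(':')[1].split(',')[0].strip(),
         row['col3'].split(':')[2].split('}')[0].strip()]
        for index, row in df.iterrows()]
df_new = pd.DataFrame(rows, columns=['col1', 'col2_key_1', 'col2_key_2',
                                     'col3_key_3', 'col3_key_4'])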

Related

applying pivot table on pandas dataframe instead of grouping

I have a dataframe like this and can group it by library and sample columns and create new columns:
df = pd.DataFrame({'barcode': ['b1','b2','b1','b2','b1','b2','b1','b2'],
                   'library': ['l1','l1','l1','l1','l2','l2','l2','l2'],
                   'sample': ['s1','s1','s2','s2','s1','s1','s2','s2'],
                   'category': ['c1','c2','c1','c2','c1','c2','c1','c2'],
                   'count': [10, 21, 13, 54, 51, 16, 67, 88]})
df
barcode library sample category count
0 b1 l1 s1 c1 10
1 b2 l1 s1 c2 21
2 b1 l1 s2 c1 13
3 b2 l1 s2 c2 54
4 b1 l2 s1 c1 51
5 b2 l2 s1 c2 16
6 b1 l2 s2 c1 67
7 b2 l2 s2 c2 88
I used grouping to reduce the dimensions of the df:
grp=df.groupby(['library','sample'])
df = grp.get_group(('l1','s1')).rename(columns={"count": "l1_s1_count"}).reset_index(drop=True)
df['l1_s2_count']=grp.get_group(('l1','s2'))[['count']].values
df['l2_s1_count']=grp.get_group(('l2','s1'))[['count']].values
df['l2_s2_count']=grp.get_group(('l2','s2'))[['count']].values
df=df.drop(['sample','library'],axis=1)
result
barcode category l1_s1_count l1_s2_count l2_s1_count
l2_s2_count
0 b1 c1 10 13 51 67
1 b2 c2 21 54 16 88
I think there should be a neater way to do this transformation, for example with a pivot table, which I failed to get working. Could you please suggest how this could be done with pivot_table?
Thanks.
Try the pivot_table function as below.
It will produce a MultiIndex result, which will need to be flattened.
df2 = pd.pivot_table(df,index=['barcode', 'category'], columns= ['sample', 'library'], values='count').reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:
barcode_ category_ s1_l1 s1_l2 s2_l1 s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
Or even without values='count':
df2 = pd.pivot_table(df,index=['barcode', 'category'], columns= ['sample', 'library']).reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:
barcode__ category__ count_s1_l1 count_s1_l2 count_s2_l1 count_s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
Choose as per your preference.
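If the trailing underscores that the empty index levels leave in barcode_ and category_ (or barcode__) are unwanted, a small tweak (a sketch on top of the answer above) joins only the non-empty parts of each flattened column tuple:
# sketch: skip empty parts so 'barcode' stays 'barcode' after flattening
df2.columns = ["_".join(part for part in a if part) for a in df2.columns.to_flat_index()]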

Python Data frame: If Column Name is contained in the String Row of Another Column Then 1 Otherwise 0

Column A 2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
Above is the dataframe I want to create.
The first row represents the field names. The logic I want to employ is as follows:
If the column name is in the "Column A" row, then 1 otherwise 0
I have scoured Google looking for code answering a question similar to mine so I can test it out and backward engineer a solution. Unfortunately, I have not been able to find anything.
Otherwise I would post some code that I attempted to solve this problem but I literally have no clue.
You can use a list comprehension to create the desired data based on the columns and rows:
In [39]: row =['2C 1B D2 6F ABC', '2C 1248 Bulers']
In [40]: columns=['2C', 'GAD', 'D2', '6F', 'ABCDE']
In [41]: df = pd.DataFrame([[int(k in r) for k in columns] for r in row], index = ['2C 1B D2 6F ABC','2C 1248 Bulers'], columns=['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [42]: df
Out[42]:
2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
If you want a pure pandas approach, you can use pd.Series() instead of list to hold the columns and rows, then use Series.apply and Series.str.contains to get the desired result:
In [73]: data = columns.apply(row.str.contains).astype(int).transpose()
In [74]: df = pd.DataFrame(data.values, index = ['2C 1B D2 6F ABC','2C 1248 Bulers'], columns=['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [75]: df
Out[75]:
2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
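For completeness, the In [73] snippet above assumes row and columns were rebuilt as Series rather than lists; a sketch of that setup (the variable names mirror the snippet):
import pandas as pd
# sketch: the pure-pandas version needs Series, not plain lists
row = pd.Series(['2C 1B D2 6F ABC', '2C 1248 Bulers'])
columns = pd.Series(['2C', 'GAD', 'D2', '6F', 'ABCDE'])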

combine row with different name for a column pandas python

i have a sample data set:
import pandas as pd
df = {
'columA':['1A','2A','3A','4A','5A','6A'],
'count': [ 1, 12, 34, 52, '3',2],
'columnB': ['a','dd','dd','ee','d','f']
}
df = pd.DataFrame(df)
it looks like this:
columA columnB count
1A a 1
2A dd 12
3A dd 34
4A ee 52
5A d 3
6A f 2
Update: The combined 2A and 3A name should be something arbitrary like 'SAB' or '2A plus 3A', etc. I used '2A|3A' as the example and it confused some people.
I want to sum up the count of rows 2A and 3A and give the result the name SAB.
desired output:
columA columnB count
1A a 1
SAB dd 46
4A ee 52
5A d 3
6A f 2
We can use a groupby on columnB
df = {'columA':['1A','2A','3A','4A','5A','6A'],
'count': [ 1, 12, 34, 52, '3',2],
'columnB': ['a','dd','dd','ee','d','f']}
df = pd.DataFrame(df)
df.groupby('columnB').agg({'count': 'sum', 'columA': 'sum'})
columA count
columnB
a 1A 1
d 5A 3
dd 2A3A 46
ee 4A 52
f 6A 2
If you're concerned about the index name you can write a function like so.
def join_by_pipe(s):
    return '|'.join(s)
df.groupby('columnB').agg({'count': 'sum', 'columA': join_by_pipe})
columA count
columnB
a 1A 1
d 5A 3
dd 2A|3A 46
ee 4A 52
f 6A 2
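Both variants keep the concatenated label ('2A3A' or '2A|3A'); to get an arbitrary name such as 'SAB' as in the OP's update, one hedged option is to relabel the combined group afterwards (note the result is ordered by the group key, not the original row order):
out = df.groupby('columnB', as_index=False).agg({'columA': '|'.join, 'count': 'sum'})
# sketch: map the combined label to the arbitrary name requested in the question
out['columA'] = out['columA'].replace({'2A|3A': 'SAB'})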

How to order rows within groups in the big dataframe

I have a relatively big dataframe (1.5 Gb), and I want to group rows by ID and order rows by column VAL in ascending order within each group.
df =
ID VAL COL
1A 2 BB
1A 1 AA
2B 2 CC
3C 3 SS
3C 1 YY
3C 2 XX
This is the expected result:
df =
ID VAL COL
1A 1 AA
1A 2 BB
2B 2 CC
3C 1 YY
3C 2 XX
3C 3 SS
This is what I tried, but it runs for a very long time. Is there any faster solution?
df = df.groupby("ID").apply(pd.DataFrame.sort, 'VAL')
If you have a big df and speed is important, try a little numpy
# note order of VAL first, then ID is intentional
# np.lexsort sorts by right most column first
df.iloc[np.lexsort((df.VAL.values, df.ID.values))]
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
super charged
v = df.values
i, j = np.searchsorted(df.columns.values, ['VAL', 'ID'])
s = np.lexsort((v[:, i], v[:, j]))
pd.DataFrame(v[s], df.index[s], df.columns)
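np.searchsorted only returns correct positions when the column labels happen to be sorted; a safer way to locate the two columns (a sketch, not part of the original answer) uses Index.get_indexer:
import numpy as np
# sketch: get_indexer does not require sorted column labels
v = df.values
i, j = df.columns.get_indexer(['VAL', 'ID'])
s = np.lexsort((v[:, i], v[:, j]))
pd.DataFrame(v[s], df.index[s], df.columns)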
timing
sort_values on 'ID', 'VAL' should give you
In [39]: df.sort_values(by=['ID', 'VAL'])
Out[39]:
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
Time it for your use-case
In [89]: dff.shape
Out[89]: (12000, 3)
In [90]: %timeit dff.sort_values(by=['ID', 'VAL'])
100 loops, best of 3: 2.62 ms per loop
In [91]: %timeit dff.iloc[np.lexsort((dff.VAL.values, dff.ID.values))]
100 loops, best of 3: 8.8 ms per loop

Pandas python updating values in a table based on preexisting values and conditions

I have a dataframe:
import pandas as pd
df=pd.DataFrame({
'Player': ['John','John','John','Steve','Steve','Ted', 'James','Smitty','SmittyJr','DJ'],
'Name': ['A','B', 'A','B','B','C', 'A','D','D','D'],
'Group':['2A','1B','2A','2A','1B','1C','2A','1C','1C','2A'],
'Medal':['G', '?', '?', 'S', 'B','?','?','?','G','?']
})
df = df[['Player','Group', 'Name', 'Medal']]
print(df)
I want to update all the '?' in the column Medal with values for any of the rows with matching Name & Group columns that are already filled in.
For example, since row 0 is Name: A, Group: 2A, Medal: G, the '?' on rows 2 and 6 would become 'G'.
The results should look like:
res=pd.DataFrame({
'Player': ['John','John','John','Steve','Steve','Ted', 'James','Smitty','SmittyJr','DJ'],
'Name': ['A','B', 'A','B','B','C', 'A','D','D','D'],
'Group':['2A','1B','2A','2A','1B','1C','2A','1C','1C','2A'],
'Medal':['G', 'B', 'G', 'S', 'B','?','G','G','G','?']
})
res = res[['Player','Group', 'Name', 'Medal']]
print(res)
What is the most efficient way to do this?
Another solution: replace '?' with the last value (via iloc) of the sorted Medal values (via sort_values) within each group:
df['Medal'] = (df.groupby(['Group','Name'])['Medal']
                 .apply(lambda x: x.replace('?', x.sort_values().iloc[-1])))
print(df)
Player Group Name Medal
0 John 2A A G
1 John 1B B B
2 John 2A A G
3 Steve 2A B S
4 Steve 1B B B
5 Ted 1C C ?
6 James 2A A G
7 Smitty 1C D G
8 SmittyJr 1C D G
9 DJ 2A D ?
Timings:
In [81]: %timeit (df.groupby(['Group','Name'])['Medal'].apply(lambda x: x.replace('?', x.sort_values().iloc[-1])))
100 loops, best of 3: 4.13 ms per loop
In [82]: %timeit (df.replace('?', np.nan).groupby(['Name', 'Group']).apply(lambda df: df.ffill().bfill()).fillna('?'))
100 loops, best of 3: 11.3 ms per loop
Try:
import pandas as pd
import numpy as np
myfill = lambda df: df.ffill().bfill()
df.replace('?', np.nan).groupby(['Name', 'Group']).apply(myfill).fillna('?')
Player Group Name Medal
0 John 2A A G
1 John 1B B B
2 John 2A A G
3 Steve 2A B S
4 Steve 1B B B
5 Ted 1C C ?
6 James 2A A G
7 Smitty 1C D G
8 SmittyJr 1C D G
9 DJ 2A D ?
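Both answers rely on groupby followed by apply; a hedged alternative that avoids apply builds a lookup of the known medals per (Group, Name) pair and maps it onto the '?' rows:
# sketch: fill '?' from already-known (Group, Name) medals without groupby.apply
known = (df.loc[df['Medal'] != '?']
           .drop_duplicates(['Group', 'Name'])
           .set_index(['Group', 'Name'])['Medal']
           .to_dict())
keys = pd.Series(list(zip(df['Group'], df['Name'])), index=df.index)
df['Medal'] = df['Medal'].where(df['Medal'] != '?', keys.map(known)).fillna('?')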
