I have a dataset in which several columns contain dictionary-like values. I need to expand these columns so that each new column name consists of the original column name plus the key from the dict.
Example df:
df
col1 col2 col3
a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
Desired result:
df_res
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
a 1a 2a 1a 2a
b 1b 2b 1a 2a
c 1c 2c 1a 2a
How can I do that?
If the columns contain dictionaries (not strings), use a list comprehension with json_normalize:
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'dict'>
dfs = [pd.json_normalize(df.pop(x)).add_prefix(f'{x}_') for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
Solution for strings that can be converted to dictionaries with ast.literal_eval:
print (df)
col1 col2 col3
0 a {'key_1': '1a', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
1 b {'key_1': '1b', 'key_2': '2b'} {'key_3': '1a', 'key_4': '2a'}
2 c {'key_1': '1c', 'key_2': '2c'} {'key_3': '1a', 'key_4': '2a'}
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'str'>
import ast
dfs = [pd.json_normalize(df.pop(x).apply(ast.literal_eval)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
EDIT: Solution for the original format (keys and values are unquoted, so ast.literal_eval cannot parse them) with a custom parsing function:
print (df)
col1 col2 col3
0 a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
1 b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
2 c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
# <class 'str'>
f = lambda x: dict(pair.split(': ') for pair in x.strip("{'}").split(', '))
dfs = [pd.json_normalize(df.pop(x).apply(f)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
I would use the df[col].apply(pd.Series) method to achieve this (assuming the columns already contain dictionaries). It would then look something like this:
def explode_dictcol(df, col):
    # expand the dict column into its own DataFrame and prefix the new columns
    temp = df[col].apply(pd.Series)
    temp = temp.rename(columns={cc: col + '_' + cc for cc in temp.columns})
    return temp

df = pd.concat([df, explode_dictcol(df, 'col2'), explode_dictcol(df, 'col3')], axis=1)
df = df.drop(columns=['col2', 'col3'])
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
If the columns are strings, the following will do the work
df_new = pd.DataFrame(data = [
[row['col1'],
row['col2'].split(':')[1].split(',')[0].strip(),
row['col2'].split(':')[2].split('}')[0].strip(),
row['col3'].split(':')[1].split(',')[0].strip(),
row['col3'].split(':')[2].split('}')[0].strip()]
for index, row in df.iterrows()
]).rename(columns = {0: 'col1', 1: 'col2_key_1', 2: 'col2_key_2', 3: 'col3_key_3', 4: 'col3_key_4'})
[Out]:
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
Notes:
Check the data type with
print(type(df['col2'][0]))
# or
print(type(df['col2'].iat[0]))
The first part of the proposed solution
df_new = pd.DataFrame(data = [
[row['col1'], row['col2'].split(':')[1].split(',')[0].strip(),
row['col2'].split(':')[2].split('}')[0].strip(),
row['col3'].split(':')[1].split(',')[0].strip(),
row['col3'].split(':')[2].split('}')[0].strip()]
for index, row in df.iterrows()
])
gives the following output
0 1 2 3 4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
which is almost the desired result; that is why one has to chain .rename() to make sure the column names are what the OP wants.
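A small variation (not part of the original answer) avoids the separate .rename() call by passing the target names through the columns= argument of pd.DataFrame:
df_new = pd.DataFrame(
    data=[[row['col1'],
           row['col2'].split(':')[1].split(',')[0].strip(),
           row['col2'].split(':')[2].split('}')[0].strip(),
           row['col3'].split(':')[1].split(',')[0].strip(),
           row['col3'].split(':')[2].split('}')[0].strip()]
          for index, row in df.iterrows()],
    columns=['col1', 'col2_key_1', 'col2_key_2', 'col3_key_3', 'col3_key_4'])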
Related
I have a dataframe like this and can group it by library and sample columns and create new columns:
df = pd.DataFrame({'barcode': ['b1', 'b2','b1','b2','b1',
'b2','b1','b2'],
'library': ['l1', 'l1','l1','l1','l2', 'l2','l2','l2'],
'sample': ['s1','s1','s2','s2','s1','s1','s2','s2'],
'category': ['c1', 'c2','c1','c2','c1', 'c2','c1','c2'],
'count': [10,21,13,54,51,16,67,88]})
df
barcode library sample category count
0 b1 l1 s1 c1 10
1 b2 l1 s1 c2 21
2 b1 l1 s2 c1 13
3 b2 l1 s2 c2 54
4 b1 l2 s1 c1 51
5 b2 l2 s1 c2 16
6 b1 l2 s2 c1 67
7 b2 l2 s2 c2 88
I used grouping to reduce the dimensions of the df:
grp=df.groupby(['library','sample'])
df = grp.get_group(('l1','s1')).rename(columns={"count": "l1_s1_count"}).reset_index(drop=True)
df['l1_s2_count']=grp.get_group(('l1','s2'))[['count']].values
df['l2_s1_count']=grp.get_group(('l2','s1'))[['count']].values
df['l2_s2_count']=grp.get_group(('l2','s2'))[['count']].values
df=df.drop(['sample','library'],axis=1)
result
barcode category l1_s1_count l1_s2_count l2_s1_count l2_s2_count
0 b1 c1 10 13 51 67
1 b2 c2 21 54 16 88
I think there should be a neater way to do this transformation, such as a pivot table, which I failed to get working. Could you please suggest how this could be done with pivot_table?
thanks.
Try the pivot_table function as below.
It will produce a multi-index result, which will need to be flattened.
df2 = pd.pivot_table(df,index=['barcode', 'category'], columns= ['sample', 'library'], values='count').reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:
barcode_ category_ s1_l1 s1_l2 s2_l1 s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
or even without values='count':
df2 = pd.pivot_table(df,index=['barcode', 'category'], columns= ['sample', 'library']).reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:
barcode__ category__ count_s1_l1 count_s1_l2 count_s2_l1 count_s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
as per your preference
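If you prefer the exact l1_s1_count naming from the question (library first, then sample, with a count suffix), a small sketch along the same lines (aggfunc='sum' is only there to keep integer counts, since each cell holds exactly one value here):
df2 = pd.pivot_table(df, index=['barcode', 'category'],
                     columns=['library', 'sample'], values='count', aggfunc='sum')
# the columns are now (library, sample) tuples; flatten them to library_sample_count
df2.columns = [f'{lib}_{smp}_count' for lib, smp in df2.columns]
df2 = df2.reset_index()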
Column A          2C  GAD  D2  6F  ABCDE
2C 1B D2 6F ABC    1    0   1   1      0
2C 1248 Bulers     1    0   0   0      0
Above is the dataframe I want to create.
The first row represents the field names. The logic I want to employ is as follows:
If the column name appears in the "Column A" value for that row, then 1, otherwise 0.
I have scoured Google looking for code answering a question similar to mine so I can test it out and backward engineer a solution. Unfortunately, I have not been able to find anything.
Otherwise I would post some code that I attempted to solve this problem but I literally have no clue.
You can use a list comprehension to create the desired data based on the columns and rows:
In [39]: row =['2C 1B D2 6F ABC', '2C 1248 Bulers']
In [40]: columns=['2C', 'GAD', 'D2', '6F', 'ABCDE']
In [41]: df = pd.DataFrame([[int(k in r) for k in columns] for r in row], index = ['2C 1B D2 6F ABC','2C 1248 Bulers'], columns=['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [42]: df
Out[42]:
2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
If you want a pure Pandas approach you can convert the lists to pd.Series() objects to preserve the columns and rows, then use Series.apply and Series.str.contains to get the desired result:
In [71]: row = pd.Series(row)
In [72]: columns = pd.Series(columns)
In [73]: data = columns.apply(row.str.contains).astype(int).transpose()
In [74]: df = pd.DataFrame(data.values, index = ['2C 1B D2 6F ABC','2C 1248 Bulers'], columns=['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [75]: df
Out[75]:
2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
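Another compact way to build the same 0/1 frame (a sketch, not part of the original answer) is a dict comprehension over the field names, using Series.str.contains once per column:
rows = pd.Series(['2C 1B D2 6F ABC', '2C 1248 Bulers'])
cols = ['2C', 'GAD', 'D2', '6F', 'ABCDE']
# one 0/1 column per field name, indexed by the original strings
df = pd.DataFrame({c: rows.str.contains(c).astype(int).values for c in cols},
                  index=rows.values)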
I have a sample data set:
import pandas as pd
df = {
'columA':['1A','2A','3A','4A','5A','6A'],
'count': [ 1, 12, 34, 52, '3',2],
'columnB': ['a','dd','dd','ee','d','f']
}
df = pd.DataFrame(df)
it looks like this:
columA columnB count
1A a 1
2A dd 12
3A dd 34
4A ee 52
5A d 3
6A f 2
Update: The combined 2A and 3A name should be something arbitrary like 'SAB' or '2A plus 3A'; I originally used '2A|3A' as the example and it confused some people.
I want to sum up the count of rows 2A and 3A and give the combined row the name SAB.
desired output:
columA columnB count
1A a 1
SAB dd 46
4A ee 52
5A d 3
6A f 2
We can use a groupby on columnB
df = {'columA':['1A','2A','3A','4A','5A','6A'],
'count': [ 1, 12, 34, 52, '3',2],
'columnB': ['a','dd','dd','ee','d','f']}
df = pd.DataFrame(df)
df.groupby('columnB').agg({'count': 'sum', 'columA': 'sum'})
columA count
columnB
a 1A 1
d 5A 3
dd 2A3A 46
ee 4A 52
f 6A 2
If you're concerned about how the columA values are joined, you can write a function like so.
def join_by_pipe(s):
    return '|'.join(s)
df.groupby('columnB').agg({'count': 'sum', 'columA': join_by_pipe})
columA count
columnB
a 1A 1
d 5A 3
dd 2A|3A 46
ee 4A 52
f 6A 2
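The answer above groups on columnB; if you specifically want the merged rows labelled 'SAB' (as in the update) with the original row order preserved, one sketch (not from the answer above) is to relabel first and then aggregate:
# relabel the rows that should be merged, then sum per (columA, columnB) pair
df['count'] = pd.to_numeric(df['count'])   # the sample data mixes int and str values
df['columA'] = df['columA'].replace({'2A': 'SAB', '3A': 'SAB'})
out = df.groupby(['columA', 'columnB'], as_index=False, sort=False)['count'].sum()
sort=False keeps the groups in their order of first appearance, so out matches the desired output in the question.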
I have a relatively big dataframe (1.5 Gb), and I want to group rows by ID and order rows by column VAL in ascending order within each group.
df =
ID VAL COL
1A 2 BB
1A 1 AA
2B 2 CC
3C 3 SS
3C 1 YY
3C 2 XX
This is the expected result:
df =
ID VAL COL
1A 1 AA
1A 2 BB
2B 2 CC
3C 1 YY
3C 2 XX
3C 3 SS
This is what I tried, but it runs for a very long time. Is there any faster solution?
df = df.groupby("ID").apply(pd.DataFrame.sort, 'VAL')
If you have a big df and speed is important, try a little numpy
# note order of VAL first, then ID is intentional
# np.lexsort uses the last (rightmost) key as the primary sort key
df.iloc[np.lexsort((df.VAL.values, df.ID.values))]
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
super charged
v = df.values
# positional indices of the sort keys (get_indexer works even though the labels are unsorted)
i, j = df.columns.get_indexer(['VAL', 'ID'])
s = np.lexsort((v[:, i], v[:, j]))
pd.DataFrame(v[s], df.index[s], df.columns)
sort_values on 'ID', 'VAL' should give you
In [39]: df.sort_values(by=['ID', 'VAL'])
Out[39]:
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
Time it for your use-case
In [89]: dff.shape
Out[89]: (12000, 3)
In [90]: %timeit dff.sort_values(by=['ID', 'VAL'])
100 loops, best of 3: 2.62 ms per loop
In [91]: %timeit dff.iloc[np.lexsort((dff.VAL.values, dff.ID.values))]
100 loops, best of 3: 8.8 ms per loop
I have a dataframe:
import pandas as pd
df=pd.DataFrame({
'Player': ['John','John','John','Steve','Steve','Ted', 'James','Smitty','SmittyJr','DJ'],
'Name': ['A','B', 'A','B','B','C', 'A','D','D','D'],
'Group':['2A','1B','2A','2A','1B','1C','2A','1C','1C','2A'],
'Medal':['G', '?', '?', 'S', 'B','?','?','?','G','?']
})
df = df[['Player','Group', 'Name', 'Medal']]
print(df)
I want to update all the '?' in the Medal column with the value from any row that has matching Name and Group columns and is already filled in.
For example, since row 0 is Name: A, Group: 2A, Medal: G, the '?' on rows 2 and 6 would become 'G'.
The results should look like:
res=pd.DataFrame({
'Player': ['John','John','John','Steve','Steve','Ted', 'James','Smitty','SmittyJr','DJ'],
'Name': ['A','B', 'A','B','B','C', 'A','D','D','D'],
'Group':['2A','1B','2A','2A','1B','1C','2A','1C','1C','2A'],
'Medal':['G', 'B', 'G', 'S', 'B','?','G','G','G','?']
})
res = res[['Player','Group', 'Name', 'Medal']]
print(res)
What is the most efficient way to do this?
Another solution: replace '?' with the last value (via iloc) of the sorted Medal (via sort_values) within each group:
df['Medal'] = (df.groupby(['Group','Name'])['Medal']
                 .apply(lambda x: x.replace('?', x.sort_values().iloc[-1])))
print(df)
Player Group Name Medal
0 John 2A A G
1 John 1B B B
2 John 2A A G
3 Steve 2A B S
4 Steve 1B B B
5 Ted 1C C ?
6 James 2A A G
7 Smitty 1C D G
8 SmittyJr 1C D G
9 DJ 2A D ?
Timings:
In [81]: %timeit (df.groupby(['Group','Name'])['Medal'].apply(lambda x: x.replace('?', x.sort_values().iloc[-1])))
100 loops, best of 3: 4.13 ms per loop
In [82]: %timeit (df.replace('?', np.nan).groupby(['Name', 'Group']).apply(lambda df: df.ffill().bfill()).fillna('?'))
100 loops, best of 3: 11.3 ms per loop
Try:
import pandas as pd
import numpy as np
myfill = lambda df: df.ffill().bfill()
df.replace('?', np.nan).groupby(['Name', 'Group']).apply(myfill).fillna('?')
Player Group Name Medal
0 John 2A A G
1 John 1B B B
2 John 2A A G
3 Steve 2A B S
4 Steve 1B B B
5 Ted 1C C ?
6 James 2A A G
7 Smitty 1C D G
8 SmittyJr 1C D G
9 DJ 2A D ?
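For larger frames, a vectorized variant of the replace-with-max idea above (a sketch, not from either answer; it assumes the string max of the known medals in each group is the value you want) avoids the per-group Python lambda:
import numpy as np
import pandas as pd

# mask the placeholders, take the group-wise max of the remaining medals,
# and fall back to '?' for groups where no medal is known at all
known = df['Medal'].replace('?', np.nan)
filled = known.groupby([df['Group'], df['Name']]).transform('max')
df['Medal'] = known.fillna(filled).fillna('?')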