I have a relatively big dataframe (1.5 GB), and I want to group the rows by ID and, within each group, order the rows by column VAL in ascending order.
df =
ID VAL COL
1A 2 BB
1A 1 AA
2B 2 CC
3C 3 SS
3C 1 YY
3C 2 XX
This is the expected result:
df =
ID VAL COL
1A 1 AA
1A 2 BB
2B 2 CC
3C 1 YY
3C 2 XX
3C 3 SS
This is what I tried, but it runs for a very long time. Is there a faster solution?
df = df.groupby("ID").apply(pd.DataFrame.sort, 'VAL')
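(Side note, assuming a newer pandas where pd.DataFrame.sort has been removed: a literal translation of this attempt would use sort_values inside the apply. It is shown only for reference and is still slow.)
# literal translation of the attempt for current pandas (slow; for reference only)
df = df.groupby("ID", group_keys=False).apply(lambda g: g.sort_values("VAL"))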
If you have a big df and speed is important, try a little numpy:
# note: passing VAL first and ID second is intentional,
# because np.lexsort uses its last key as the primary sort key
df.iloc[np.lexsort((df.VAL.values, df.ID.values))]
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
Super-charged version:
v = df.values
# get the positional indices of 'VAL' and 'ID'
# (np.searchsorted would require the column labels to be sorted, so use get_indexer instead)
i, j = df.columns.get_indexer(['VAL', 'ID'])
s = np.lexsort((v[:, i], v[:, j]))
pd.DataFrame(v[s], df.index[s], df.columns)
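One caveat with the rebuild above (an assumption based on the mixed dtypes in the sample): df.values yields an object array, so VAL loses its numeric dtype in the reconstructed frame. Taking the rows from the original frame with the same lexsort order keeps the dtypes:
# dtype-preserving variant (sketch), reusing s from above
df.take(s)    # equivalent to df.iloc[s], keeps the original dtypes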
Using sort_values on ['ID', 'VAL'] should give you:
In [39]: df.sort_values(by=['ID', 'VAL'])
Out[39]:
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
Time it for your use case:
In [89]: dff.shape
Out[89]: (12000, 3)
In [90]: %timeit dff.sort_values(by=['ID', 'VAL'])
100 loops, best of 3: 2.62 ms per loop
In [91]: %timeit dff.iloc[np.lexsort((dff.VAL.values, dff.ID.values))]
100 loops, best of 3: 8.8 ms per loop
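If the shuffled index in the output above is unwanted, newer pandas (assuming version 1.0 or later) can reset it in the same call:
# sketch, assuming pandas >= 1.0 where ignore_index is available
df.sort_values(by=['ID', 'VAL'], ignore_index=True)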
I have a dataset in which some columns contain dictionary-like lookup values; there are several such columns. I need to expand these columns so that each new column name consists of the original column name plus the key from the dict.
Example df:
df
col1 col2 col3
a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
Desired result:
df_res
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
a 1a 2a 1a 2a
b 1b 2b 1a 2a
c 1c 2c 1a 2a
How can I do that?
If the columns contain dictionaries, not strings, use a list comprehension with json_normalize:
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'dict'>
dfs = [pd.json_normalize(df.pop(x)).add_prefix(f'{x}_') for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2a 1a 2a
2 c 1c 2a 1a 2a
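Note that pd.json_normalize is the top-level name in pandas 1.0 and later; on older versions (an assumption about your environment) the same function can be imported from pandas.io.json:
# pandas < 1.0 variant (sketch)
from pandas.io.json import json_normalize
dfs = [json_normalize(df.pop(x)).add_prefix(f'{x}_') for x in cols]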
Solution for strings that can be converted to dictionaries with ast.literal_eval:
print (df)
col1 col2 col3
0 a {'key_1': '1a', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
1 b {'key_1': '1b', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
2 c {'key_1': '1c', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'str'>
import ast
dfs = [pd.json_normalize(df.pop(x).apply(ast.literal_eval)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2a 1a 2a
2 c 1c 2a 1a 2a
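To sanity-check the conversion on a single cell first (using a value from the sample above):
import ast
ast.literal_eval("{'key_1': '1a', 'key_2': '2a'}")
# -> {'key_1': '1a', 'key_2': '2a'}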
EDIT: Solution for the original format, using a custom parsing function:
print (df)
col1 col2 col3
0 a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
1 b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
2 c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
# <class 'str'>
f = lambda s: dict([item.split(': ') for item in s.strip("{'}").split(', ')])
dfs = [pd.json_normalize(df.pop(x).apply(f)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
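A quick check of the parser on one cell (this assumes the cell literally contains the surrounding quotes shown in the sample):
f("'{key_1: 1a, key_2: 2a}'")
# -> {'key_1': '1a', 'key_2': '2a'}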
I would use the df[col].apply(pd.Series) method to achieve this (it assumes col2 and col3 already hold dictionaries rather than strings). It would then look something like this:
def explode_dictcol(df, col):
    temp = df[col].apply(pd.Series)
    temp = temp.rename(columns={cc: col + '_' + cc for cc in temp.columns})
    return temp
df = pd.concat([df, explode_dictcol(df, 'col2'), explode_dictcol(df, 'col3')], axis=1)
df = df.drop(columns=['col2', 'col3'])
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
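For larger frames, df[col].apply(pd.Series) can be slow. A sketch of a faster variant of the same idea (it also assumes the columns already hold dictionaries) builds the expanded frame from a list of dicts:
def explode_dictcol_fast(df, col):
    # pd.DataFrame on a list of dicts turns the keys into columns in one shot
    temp = pd.DataFrame(df[col].tolist(), index=df.index)
    return temp.add_prefix(col + '_')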
If the columns are strings, the following will do the job:
df_new = pd.DataFrame(data = [
[row['col1'],
row['col2'].split(':')[1].split(',')[0].strip(),
row['col2'].split(':')[2].split('}')[0].strip(),
row['col3'].split(':')[1].split(',')[0].strip(),
row['col3'].split(':')[2].split('}')[0].strip()]
for index, row in df.iterrows()
]).rename(columns = {0: 'col1', 1: 'col2_key_1', 2: 'col2_key_2', 3: 'col3_key_3', 4: 'col3_key_4'})
[Out]:
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
Notes:
Check the data type with
print(type(df['col2'][0]))
# or
print(type(df['col2'].iat[0]))
The first part of the proposed solution
df_new = pd.DataFrame(data = [
[row['col1'], row['col2'].split(':')[1].split(',')[0].strip(),
row['col2'].split(':')[2].split('}')[0].strip(),
row['col3'].split(':')[1].split(',')[0].strip(),
row['col3'].split(':')[2].split('}')[0].strip()]
for index, row in df.iterrows()
])
gives the following output
0 1 2 3 4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
which is almost the same; that is why .rename() is chained at the end, to make sure the column names are as the OP wants.
Column A 2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
Above is the dataframe I want to create.
The first row represents the field names. The logic I want to employ is as follows:
If the column name appears in the "Column A" value for that row, then 1, otherwise 0.
I have scoured Google looking for code answering a question similar to mine so I can test it out and reverse-engineer a solution. Unfortunately, I have not been able to find anything.
Otherwise I would post some code that I attempted, but I literally have no clue.
You can use a list comprehension to create the desired data based on the columns and rows:
In [39]: row =['2C 1B D2 6F ABC', '2C 1248 Bulers']
In [40]: columns=['2C', 'GAD', 'D2', '6F', 'ABCDE']
In [41]: df = pd.DataFrame([[int(k in r) for k in columns] for r in row], index=row, columns=columns)
In [42]: df
Out[42]:
2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
If you want a pure Pandas approach, you can use pd.Series() instead of lists to hold the columns and rows, then use Series.apply and Series.str.contains to get the desired result:
In [71]: row = pd.Series(['2C 1B D2 6F ABC', '2C 1248 Bulers'])
In [72]: columns = pd.Series(['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [73]: data = columns.apply(row.str.contains).astype(int).transpose()
In [74]: df = pd.DataFrame(data.values, index = ['2C 1B D2 6F ABC','2C 1248 Bulers'], columns=['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [75]: df
Out[75]:
2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
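One caveat (an assumption about the data): both the in operator and str.contains do substring matching, so a column name like '2C' would also match a value such as '2CX'. If whole-token matches are required, word boundaries can be added to the pattern, for example:
# whole-token matching sketch, reusing the row/columns Series from above
import re
data = columns.apply(lambda k: row.str.contains(rf'\b{re.escape(k)}\b')).astype(int).transpose()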
I have a dataframe:
import pandas as pd
df=pd.DataFrame({
'Player': ['John','John','John','Steve','Steve','Ted', 'James','Smitty','SmittyJr','DJ'],
'Name': ['A','B', 'A','B','B','C', 'A','D','D','D'],
'Group':['2A','1B','2A','2A','1B','1C','2A','1C','1C','2A'],
'Medal':['G', '?', '?', 'S', 'B','?','?','?','G','?']
})
df = df[['Player','Group', 'Name', 'Medal']]
print(df)
I want to replace every '?' in the Medal column with the value from any row that has the same Name and Group and is already filled in.
For example, since row 0 has Name: A, Group: 2A, Medal: G, the '?' in rows 2 and 6 would become 'G'.
The results should look like:
res=pd.DataFrame({
'Player': ['John','John','John','Steve','Steve','Ted', 'James','Smitty','SmittyJr','DJ'],
'Name': ['A','B', 'A','B','B','C', 'A','D','D','D'],
'Group':['2A','1B','2A','2A','1B','1C','2A','1C','1C','2A'],
'Medal':['G', 'B', 'G', 'S', 'B','?','G','G','G','?']
})
res = res[['Player','Group', 'Name', 'Medal']]
print(res)
What is the most efficient way to do this?
Another solution: replace '?' with the last value (via iloc) of the sorted Medal values (sort_values) within each group:
df['Medal'] = (df.groupby(['Group', 'Name'])['Medal']
                 .apply(lambda x: x.replace('?', x.sort_values().iloc[-1])))
print(df)
Player Group Name Medal
0 John 2A A G
1 John 1B B B
2 John 2A A G
3 Steve 2A B S
4 Steve 1B B B
5 Ted 1C C ?
6 James 2A A G
7 Smitty 1C D G
8 SmittyJr 1C D G
9 DJ 2A D ?
Timings:
In [81]: %timeit (df.groupby(['Group','Name'])['Medal'].apply(lambda x: x.replace('?', x.sort_values().iloc[-1])))
100 loops, best of 3: 4.13 ms per loop
In [82]: %timeit (df.replace('?', np.nan).groupby(['Name', 'Group']).apply(lambda df: df.ffill().bfill()).fillna('?'))
100 loops, best of 3: 11.3 ms per loop
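If the groupby().apply() calls are too slow on a large frame, a transform-based sketch avoids the per-group Python lambda; it relies on '?' sorting before the medal letters in ASCII, so 'max' picks the real medal whenever one exists:
df['Medal'] = df['Medal'].where(df['Medal'].ne('?'),
                                df.groupby(['Group', 'Name'])['Medal'].transform('max'))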
Try:
import pandas as pd
import numpy as np
myfill = lambda df: df.ffill().bfill()
df.replace('?', np.nan).groupby(['Name', 'Group']).apply(myfill).fillna('?')
Player Group Name Medal
0 John 2A A G
1 John 1B B B
2 John 2A A G
3 Steve 2A B S
4 Steve 1B B B
5 Ted 1C C ?
6 James 2A A G
7 Smitty 1C D G
8 SmittyJr 1C D G
9 DJ 2A D ?
Here is my question. Take the dataframe below as an example:
The dataframe df has 8 columns, each of them holding numeric values.
What I'm going to do:
a. Loop over the dataframe by rows
b. In each row, the value of column B1, B2, B3, B4, B5, B6 will be changed to B* x A
Code like this:
for i in range(0, len(df), 1):
    col_B = ["B1", "B2", "B3", "B4", "B5", "B6"]
    for j in range(len(col_B)):
        df[col_B[j]].iloc[i] = df[col_B[j]].iloc[i] * df.A.iloc[i]
In my real data, which contains 224 rows and 9 columns, looping over all these cells costs me 0:01:03.
How can I speed up this loop in Pandas?
Any advice would be appreciated.
You can first filter the DataFrame and then multiply with mul:
print(df.filter(like='B').mul(df.A, axis=0))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B1':[4,5,6],
'B2':[7,8,9],
'B3':[1,3,5],
'B4':[5,3,6],
'B5':[7,4,3],
'B6':[1,3,7]})
print (df)
A B1 B2 B3 B4 B5 B6
0 1 4 7 1 5 7 1
1 2 5 8 3 3 4 3
2 3 6 9 5 6 3 7
print(df.filter(like='B').mul(df.A, axis=0))
B1 B2 B3 B4 B5 B6
0 4 7 1 5 7 1
1 10 16 6 6 8 6
2 18 27 15 18 9 21
If need column A use concat:
print (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
A B1 B2 B3 B4 B5 B6
0 1 4 7 1 5 7 1
1 2 10 16 6 6 8 6
2 3 18 27 15 18 9 21
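If you prefer to overwrite the B columns in place instead of building a new frame, a sketch with an explicit column list does the same multiplication:
col_B = ['B1', 'B2', 'B3', 'B4', 'B5', 'B6']
df[col_B] = df[col_B].mul(df['A'], axis=0)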
Timings:
len(df)=3:
In [416]: %timeit (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
1000 loops, best of 3: 1.01 ms per loop
In [417]: %timeit loop(df)
100 loops, best of 3: 3.28 ms per loop
len(df)=30k:
In [420]: %timeit (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
The slowest run took 4.00 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 3 ms per loop
In [421]: %timeit loop(df)
1 loop, best of 3: 35.6 s per loop
Code for timings:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
'B1':[4,5,6],
'B2':[7,8,9],
'B3':[1,3,5],
'B4':[5,3,6],
'B5':[7,4,3],
'B6':[1,3,7]})
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
print (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
def loop(df):
    for i in range(0, len(df), 1):
        col_B = ["B1", "B2", "B3", "B4", "B5", "B6"]
        for j in range(len(col_B)):
            df[col_B[j]].iloc[i] = df[col_B[j]].iloc[i] * df.A.iloc[i]
    return df
print (loop(df))
I have a question regarding a table (Table A), which contains multiple rows per combination of three key columns (ID, TIME1, TIME2) plus some value columns, as shown below:
ID TIME1 TIME2 VALUE_A VALUE_B
1 201501 201501 a 1a
1 201502 201502 a 1c
1 201502 201502 b 1d
1 201501 201501 b 2e
1 201501 201501 b 6a
1 201501 201501 b 1d
1 201502 201502 b 2e
1 201502 201502 b 6a
I have written code that creates unique key combinations from another table, which gives me a reference to the rows I want to extract from Table A. This table (Table B) looks like this:
ID TIME1 TIME2
1 201502 201502
2 201511 201511
I have managed to take out the values I want by doing a simple merge, which gives me the rows I want from Table A given the references. However, I would also like to make this happen using the isin function. My syntax is below, but it gives me duplicate values. All I want is to take out the rows from Table A that are referenced in Table B. How can I get it to do that?
The desired result (Table C) is below:
ID TIME1 TIME2 VALUE_A VALUE_B
1 201502 201502 a 1c
1 201502 201502 b 1d
1 201502 201502 b 2e
1 201502 201502 b 6a
Syntax("isin"-version):
subset = df[df.ID.isin(df2['ID']) & (df.TIME1.isin(df2['TIME1']) & df.TIME2.isin(df2['TIME2']))]
Code for creating table A and table B is below:
import pandas as pd
df = pd.DataFrame({'ID' : [1,1,1,1,1,1,1,1],
'TIME1' : [201501,201502,201502,201501,201501,201501,201502,201502],
'TIME2' : [201501,201502,201502,201501,201501,201501,201502,201502],
'VALUE_A' : ['a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
'VALUE_B' : ['1a', '1c', '1d', '2e', '6a', '1d', '2e', '6a']})
df2 = pd.DataFrame({'ID' : [1,2],
'TIME1' : [201502,201501],
'TIME2' : [201502,201501]
})
Many thanks in advance!
I believe you want to modify your boolean condition to this:
In [146]:
subset = df[df.ID.isin(df2['ID']) & (df.TIME1.isin(df2['TIME1']) | df.TIME2.isin(df2['TIME2'])) ]
subset
Out[146]:
ID TIME1 TIME2 VALUE_A VALUE_B
1 1 201502 201502 a 1c
2 1 201502 201502 b 1d
6 2 201511 201511 b 2e
7 2 201511 201511 b 6a
So this checks that the ID is present and that either Time1 or Time2 is in the other df.
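Note that per-column isin matches each column independently, so values coming from different rows of df2 can cross-match and produce the duplicates you saw. If the (ID, TIME1, TIME2) combination must match as a whole row (which is what Table C implies), a sketch using a MultiIndex avoids that:
keys = ['ID', 'TIME1', 'TIME2']
subset = df[df.set_index(keys).index.isin(df2.set_index(keys).index)]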
You can simply achieve this using isin():
In [102]:
df[df.TIME1.isin(df2.TIME1) & df.TIME2.isin(df2.TIME2)]
Out[102]:
ID TIME1 TIME2 VALUE_A VALUE_B
1 201502 201502 a 1c
1 201502 201502 b 1d
2 201511 201511 b 2e
2 201511 201511 b 6a