Here is my question. Take the dataframe below as an example:
The dataframe df has 8 columns, each of which holds finite values.
What I'm going to do:
a. Loop over the dataframe by rows
b. In each row, replace the value in each of columns B1, B2, B3, B4, B5, B6 with B* x A (the column value times that row's A value)
Code like this:
for i in range(len(df)):
    col_B = ["B1", "B2", "B3", "B4", "B5", "B6"]
    for j in range(len(col_B)):
        df[col_B[j]].iloc[i] = df[col_B[j]].iloc[i] * df.A.iloc[i]
In my real data, which contain 224 rows and 9 columns, looping over all these cells took 0:01:03.
How can I speed up this loop in pandas?
Any advice would be appreciated.
You can first filter the DataFrame and then multiply with mul:
print(df.filter(like='B').mul(df.A, axis=0))
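Here filter(like='B') selects every column whose label contains the substring 'B', and mul(df.A, axis=0) broadcasts the Series df.A down the rows, multiplying each selected column elementwise by A.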
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
                   'B1':[4,5,6],
                   'B2':[7,8,9],
                   'B3':[1,3,5],
                   'B4':[5,3,6],
                   'B5':[7,4,3],
                   'B6':[1,3,7]})
print (df)
A B1 B2 B3 B4 B5 B6
0 1 4 7 1 5 7 1
1 2 5 8 3 3 4 3
2 3 6 9 5 6 3 7
print(df.filter(like='B').mul(df.A, axis=0))
B1 B2 B3 B4 B5 B6
0 4 7 1 5 7 1
1 10 16 6 6 8 6
2 18 27 15 18 9 21
If you also need column A, use concat:
print (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
A B1 B2 B3 B4 B5 B6
0 1 4 7 1 5 7 1
1 2 10 16 6 6 8 6
2 3 18 27 15 18 9 21
Timings:
len(df)=3:
In [416]: %timeit (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
1000 loops, best of 3: 1.01 ms per loop
In [417]: %timeit loop(df)
100 loops, best of 3: 3.28 ms per loop
len(df)=30k:
In [420]: %timeit (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
The slowest run took 4.00 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 3 ms per loop
In [421]: %timeit loop(df)
1 loop, best of 3: 35.6 s per loop
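The loop is slow because every cell is read and written through Python-level .iloc calls, and each chained df[col].iloc[i] = ... assignment goes through pandas' indexing machinery, whereas the vectorized version performs one C-level multiplication per column.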
Code for timings:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
                   'B1':[4,5,6],
                   'B2':[7,8,9],
                   'B3':[1,3,5],
                   'B4':[5,3,6],
                   'B5':[7,4,3],
                   'B6':[1,3,7]})
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
print (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
def loop(df):
    for i in range(len(df)):
        col_B = ["B1", "B2", "B3", "B4", "B5", "B6"]
        for j in range(len(col_B)):
            df[col_B[j]].iloc[i] = df[col_B[j]].iloc[i] * df.A.iloc[i]
    return df
print (loop(df))
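If you want to write the result back into df instead of building a new frame, a minimal sketch of the same idea (assuming the B columns are exactly B1-B6):

col_B = ["B1", "B2", "B3", "B4", "B5", "B6"]
# one vectorized multiply per column instead of a Python-level loop over cells
df[col_B] = df[col_B].mul(df.A, axis=0)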
Related
My DataFrame has around 9K columns, and I want to remove the . from every column name, see example column names below:
`traffic.seas1`
`traffic.seas2`
`traffic.seas3`
These are just three of them; I have 9K columns in total, and while some do not contain a ., many do. How can I remove the dots efficiently, given that renaming them one by one with rename is too manual?
You can use str.replace. Note that in older pandas versions str.replace treats the pattern as a regular expression by default, and as a regex . matches any character, so either escape it or pass regex=False:
df.columns = df.columns.str.replace('.', '', regex=False)
Or a list comprehension with Python's built-in str.replace, which is always literal:
df.columns = [x.replace('.','') for x in df.columns]
Sample:
df = pd.DataFrame({'traffic.seas1':list('abcdef'),
                   'traffic.seas2':[4,5,4,5,5,4],
                   'traffic.seas3':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
D E F traffic.seas1 traffic.seas2 traffic.seas3
0 1 5 a a 4 7
1 3 3 a b 5 8
2 5 6 a c 4 9
3 7 9 b d 5 4
4 1 2 b e 5 2
5 0 4 b f 4 3
df.columns = df.columns.str.replace('.', '', regex=False)
print (df)
D E F trafficseas1 trafficseas2 trafficseas3
0 1 5 a a 4 7
1 3 3 a b 5 8
2 5 6 a c 4 9
3 7 9 b d 5 4
4 1 2 b e 5 2
5 0 4 b f 4 3
Timings:
import numpy as np

N = 9000
df = pd.DataFrame(np.random.randint(10, size=(3, N))).add_prefix('traffic.seas')
print (df)
In [161]: %timeit df.columns = df.columns.str.replace('.','')
4.4 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [162]: %timeit df.columns = [x.replace('.','') for x in df.columns]
2.53 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
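As another option (a sketch, not benchmarked above), rename accepts a function, which avoids reassigning df.columns by hand:

# apply the same literal replacement through rename
df = df.rename(columns=lambda c: c.replace('.', ''))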
You can use a list comprehension over df.columns like this:
df.columns = [c.replace('.', '') for c in df.columns]
For example:
df = pd.DataFrame({'foo': [1], 'bar.z': [2]})
>>> df.columns
Index(['bar.z', 'foo'], dtype='object')
df.columns = [c.replace('.', '') for c in df.columns]
>>> df
barz foo
0 2 1
I have a relatively big dataframe (1.5 GB), and I want to group rows by ID and, within each group, order rows by column VAL in ascending order.
df =
ID VAL COL
1A 2 BB
1A 1 AA
2B 2 CC
3C 3 SS
3C 1 YY
3C 2 XX
This is the expected result:
df =
ID VAL COL
1A 1 AA
1A 2 BB
2B 2 CC
3C 1 YY
3C 2 XX
3C 3 SS
This is what I tried, but it runs for a very long time. Is there any faster solution?
df = df.groupby("ID").apply(pd.DataFrame.sort, 'VAL')
If you have a big df and speed is important, try a little numpy:
# note: the order (VAL first, then ID) is intentional,
# because np.lexsort treats the last key as the primary sort key
df.iloc[np.lexsort((df.VAL.values, df.ID.values))]
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
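np.lexsort returns the permutation of row positions that sorts by ID first and VAL second, and iloc then applies that permutation to the frame.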
super charged
v = df.values
# np.searchsorted assumes the column labels are in sorted order;
# it returns the positions of 'VAL' and 'ID' in the column array
i, j = np.searchsorted(df.columns.values, ['VAL', 'ID'])
s = np.lexsort((v[:, i], v[:, j]))
pd.DataFrame(v[s], df.index[s], df.columns)
sort_values on ['ID', 'VAL'] should give you:
In [39]: df.sort_values(by=['ID', 'VAL'])
Out[39]:
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
Time it for your use case:
In [89]: dff.shape
Out[89]: (12000, 3)
In [90]: %timeit dff.sort_values(by=['ID', 'VAL'])
100 loops, best of 3: 2.62 ms per loop
In [91]: %timeit dff.iloc[np.lexsort((dff.VAL.values, dff.ID.values))]
100 loops, best of 3: 8.8 ms per loop
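For reference, the construction of the 12000-row test frame dff is not shown; a plausible sketch, assuming it simply tiles the small sample above:

# hypothetical reconstruction of the test frame used in the timings
dff = pd.concat([df] * 2000, ignore_index=True)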
Sometimes I manipulate some columns of a dataframe and then need to write the result back.
For example, one dataframe df has 6 columns like this:
A, B1, B2, B3, C, D
I want to transform the values in columns (B1, B2, B3) into (B1*A, B2*A, B3*A).
Compared with the loop subroutine, which is slow, df.filter(like='B') accelerates things a lot.
df.filter(like="B").mul(df.A, axis=0) produces the right answer, but I can't write it back into the B-like columns of df using:
df.filter(like="B") = df.filter(like="B").mul(df.A, axis=0)
How can I achieve this? I know that creating a new dataframe with pd.concat gets it done, but when the number of columns is huge that may be inefficient. What I want is to assign new values to columns that already exist.
Any advice would be appreciated!
Use str.contains with boolean indexing:
cols = df.columns[df.columns.str.contains('B')]
df[cols] = df[cols].mul(df.A, axis = 0)
Sample:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
                   'B1':[4,5,6],
                   'B2':[7,8,9],
                   'B3':[1,3,5],
                   'C':[5,3,6],
                   'D':[7,4,3]})
print (df)
A B1 B2 B3 C D
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
cols = df.columns[df.columns.str.contains('B')]
print (cols)
Index(['B1', 'B2', 'B3'], dtype='object')
df[cols] = df[cols].mul(df.A, axis = 0)
print (df)
A B1 B2 B3 C D
0 1 4 7 1 5 7
1 2 10 16 6 3 4
2 3 18 27 15 6 3
Timings:
len(df)=3:
In [17]: %timeit (a(df))
1000 loops, best of 3: 1.36 ms per loop
In [18]: %timeit (b(df1))
100 loops, best of 3: 2.39 ms per loop
len(df)=30k:
In [14]: %timeit (a(df))
100 loops, best of 3: 2.89 ms per loop
In [15]: %timeit (b(df1))
100 loops, best of 3: 4.71 ms per loop
Code:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
                   'B1':[4,5,6],
                   'B2':[7,8,9],
                   'B3':[1,3,5],
                   'C':[5,3,6],
                   'D':[7,4,3]})
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
def a(df):
    cols = df.columns[df.columns.str.contains('B')]
    df[cols] = df[cols].mul(df.A, axis=0)
    return df

def b(df):
    cols = df.filter(regex=r'^B').columns
    df.loc[:, cols] = df.loc[:, cols].mul(df.A, axis=0)
    return df
print (a(df))
print (b(df1))
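One caveat worth adding: str.contains('B') matches a 'B' anywhere in the label, so a column named e.g. 'SUB' would also be picked up. If names can be arbitrary, anchoring the match with df.columns.str.startswith('B') (or str.contains('^B')) is safer.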
You have almost done it:
In [136]: df.loc[:, df.filter(regex=r'^B').columns] = df.loc[:, df.filter(regex=r'^B').columns].mul(df.A, axis=0)
In [137]: df
Out[137]:
A B1 B2 B3 B4 F
0 1 4 7 1 5 7
1 2 10 16 6 6 4
2 3 18 27 15 18 3
With these two data frames
df1 = pd.DataFrame({'c1':['a','b','c','d'],'c2':[10,20,10,22]})
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],'c4':[1,2,3,4,5,6,7,8,9]})
I'm trying to add the values of c4 to df1, but only for the elements of c3 that are also present in c1:
>>> df1
c1 c2 c4
a 10 3
b 20 5
c 10 6
d 22 9
Is there a simple way of doing this in pandas?
UPDATE:
If
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],
                    'c4':[1,2,3,4,5,6,7,8,9],
                    'c5':[10,20,30,40,50,60,70,80,90]})
how can I achieve this result?
>>> df1
c1 c2 c4 c5
a 10 3 30
b 20 5 50
c 10 6 60
d 22 9 90
Doing:
>>> df1['c1'].map(df2.set_index('c3')['c4','c5'])
gives me a KeyError
You can call map on df1['c1'], passing it df2['c4'] with df2's index set to 'c3'; this performs a lookup:
In [239]:
df1 = pd.DataFrame({'c1':['a','b','c','d'],'c2':[10,20,10,22]})
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],'c4':[1,2,3,4,5,6,7,8,9]})
df1['c4'] = df1['c1'].map(df2.set_index('c3')['c4'])
df1
Out[239]:
c1 c2 c4
0 a 10 3
1 b 20 5
2 c 10 6
3 d 22 9
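For the UPDATE with two columns: df2.set_index('c3')['c4','c5'] looks up the single key ('c4','c5'), hence the KeyError, and map only accepts a single Series anyway. A sketch of one way to pull both columns over, via a join against the c3 index (assuming the corrected df2 above):

# look up c4 and c5 for each value of c1 in one shot
df1 = df1.join(df2.set_index('c3')[['c4', 'c5']], on='c1')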
I have a df and want to make a new_df of the same size but with all 1s, something in the spirit of new_df = df.replace("*", "1"). I think this is faster than creating a new df from scratch, because otherwise I would need to get the dimensions, fill the frame with 1s, and copy all the headers over. Unless I'm wrong about that.
df_new = pd.DataFrame(np.ones(df.shape), columns=df.columns)
import numpy as np
import pandas as pd

d = [
    [1,1,1,1,1],
    [2,2,2,2,2],
    [3,3,3,3,3],
    [4,4,4,4,4],
    [5,5,5,5,5],
]
cols = ["A","B","C","D","E"]
# build the sample frame that the timings below operate on
df = pd.DataFrame(d, columns=cols)
%timeit df1 = pd.DataFrame(np.ones(df.shape), columns=df.columns)
10000 loops, best of 3: 94.6 µs per loop
%timeit df2 = df.copy(); df2.loc[:, :] = 1
1000 loops, best of 3: 245 µs per loop
%timeit df3 = df * 0 + 1
1000 loops, best of 3: 200 µs per loop
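One detail worth noting (an addition, not part of the timings above): np.ones defaults to float64, so df1 ends up with float columns even when df held integers; passing a dtype keeps them integral:

# integer ones instead of the float64 default
df_int = pd.DataFrame(np.ones(df.shape, dtype=int), columns=df.columns)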
It's actually pretty easy.
import pandas as pd

d = [
    [1,1,1,1,1],
    [2,2,2,2,2],
    [3,3,3,3,3],
    [4,4,4,4,4],
    [5,5,5,5,5],
]
cols = ["A","B","C","D","E"]
df = pd.DataFrame(d, columns=cols)
print(df)
print("------------------------")
df.loc[:, :] = 1
print(df)
Result:
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
------------------------
A B C D E
0 1 1 1 1 1
1 1 1 1 1 1
2 1 1 1 1 1
3 1 1 1 1 1
4 1 1 1 1 1
Obviously, df.loc[:, :] targets all rows across all columns, and the assignment happens in place. Use df2 = df.copy() first if you want a new dataframe instead.
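A minimal sketch of that copy-first variant:

df2 = df.copy()    # leave the original df untouched
df2.loc[:, :] = 1  # overwrite every cell of the copy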