Given the following DF with multilevel columns:
import numpy as np
import pandas as pd

arrays = [['foo', 'foo', 'bar', 'bar'],
          ['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(6, 4), columns=columnValues)
df['txt'] = 'aaa'
print(df)
yields:
foo bar txt
A B C D
0 0.080029 0.710943 0.157265 0.774827 aaa
1 0.276949 0.923369 0.550799 0.758707 aaa
2 0.416714 0.440659 0.835736 0.130818 aaa
3 0.935763 0.908967 0.502363 0.677957 aaa
4 0.191245 0.291017 0.014355 0.762976 aaa
5 0.365464 0.286350 0.450263 0.509556 aaa
Question: how do I efficiently change the values in the foo sub-columns to 100 when they are < 0.5, for a huge DataFrame?
The following works:
In [41]: df.foo < 0.5
Out[41]:
A B
0 True False
1 True False
2 True True
3 False False
4 True True
5 True True
In [42]: df.foo[df.foo < 0.5]
Out[42]:
A B
0 0.080029 NaN
1 0.276949 NaN
2 0.416714 0.440659
3 NaN NaN
4 0.191245 0.291017
5 0.365464 0.286350
But if I try to change the values, it throws:
In [45]: df.foo[df.foo < 0.5] = 100
C:\Users\USER\AppData\Local\Programs\Python35\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If I try to use .loc:
In [46]: df.foo.loc[df.foo < 0.5] = 100
...
ValueError: cannot copy sequence with size 2 to array axis with dimension 6
The same error occurs for df.foo.loc[df.foo < 0.5, 'foo'] = 100.
If I try:
df.loc[df.foo < 0.5, 'foo']
I get:
KeyError: 'None of [ A B\n0 True False\n1 True False\n2 True True\n3 False False\n4 True True\n5 True True] are in the [index]'
Solutions - timeit comparison on a DataFrame with 10M rows:
In [19]: %timeit df.foo.applymap(lambda x: x if x >= 0.5 else 100)
1 loop, best of 3: 29.4 s per loop
In [20]: %timeit df.foo[df.foo >= 0.5].fillna(100)
1 loop, best of 3: 1.55 s per loop
John Galt:
In [21]: %timeit df.foo.where(df.foo < 0.5, 100)
1 loop, best of 3: 1.12 s per loop
B. M.:
In [5]: %timeit u=df['foo'].values;u[u<.5]=100
1 loop, best of 3: 628 ms per loop
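A self-contained sketch of the NumPy variant with an explicit write-back (an assumption on my part: whether df['foo'].values is a writable view or a copy depends on the pandas version and internal block layout - this assumes pre-Copy-on-Write behaviour - so relying on a silent in-place mutation is fragile):
import numpy as np
import pandas as pd

arrays = [['foo', 'foo', 'bar', 'bar'], ['A', 'B', 'C', 'D']]
df = pd.DataFrame(np.random.rand(6, 4),
                  columns=pd.MultiIndex.from_tuples(list(zip(*arrays))))
df['txt'] = 'aaa'

foo = df['foo'].copy()  # homogeneous float copy of the foo sub-frame
v = foo.values          # ndarray backed by that single float block
v[v < 0.5] = 100        # fast, vectorized in-place edit
df['foo'] = foo         # assign the sub-frame back explicitly
print(df)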
Here's one way, using where: df['foo'] = df['foo'].where(df['foo'] < 0.5, 100). Note that where keeps values where the condition is True and replaces the rest, so this form replaces values >= 0.5 with 100; to set the values below 0.5 to 100, as the question asks, flip the condition to df['foo'] >= 0.5.
In [96]: df
Out[96]:
foo bar txt
A B C D
0 0.255309 0.237892 0.491065 0.930555 aaa
1 0.859998 0.008269 0.376213 0.984806 aaa
2 0.479928 0.761266 0.993970 0.266486 aaa
3 0.078284 0.009748 0.461687 0.653085 aaa
4 0.923293 0.642398 0.629140 0.561777 aaa
5 0.936824 0.526626 0.413250 0.732074 aaa
In [97]: df['foo'] = df['foo'].where(df['foo'] < 0.5, 100)
In [98]: df
Out[98]:
foo bar txt
A B C D
0 0.255309 0.237892 0.491065 0.930555 aaa
1 100.000000 0.008269 0.376213 0.984806 aaa
2 0.479928 100.000000 0.993970 0.266486 aaa
3 0.078284 0.009748 0.461687 0.653085 aaa
4 100.000000 100.000000 0.629140 0.561777 aaa
5 100.000000 100.000000 0.413250 0.732074 aaa
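mask is the complement of where - it replaces values where the condition is True - so it expresses the question's condition directly; a minimal variant, assuming the same df as above:
# set every foo value below 0.5 to 100 and keep the rest
df['foo'] = df['foo'].mask(df['foo'] < 0.5, 100)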
Related
self.df['Regular Price'] = self.df['Regular Price'].apply(
    lambda x: int(round(x)) if isinstance(x, (int, float)) else None
)
The above code assigns None to every value of the Regular Price field whenever it encounters a non-numeric value in the dataframe. I want to assign None only to the cells that actually hold a non-numeric value.
Thanks.
First, it is impossible to have NaNs in an integer column, because NaN is a float by design.
Your solution works if the column has mixed types - numbers together with strings:
df = pd.DataFrame({
    'Regular Price': ['a', 1, 2.3, 'a', 7],
    'B': list(range(5))
})
print (df)
B Regular Price
0 0 a
1 1 1
2 2 2.3
3 3 a
4 4 7
df['Regular Price'] = df['Regular Price'].apply(
    lambda x: int(round(x)) if isinstance(x, (int, float)) else None
)
print (df)
B Regular Price
0 0 NaN
1 1 1.0
2 2 2.0
3 3 NaN
4 4 7.0
But if all the data are strings, you need to_numeric with errors='coerce' to convert the non-numeric values to NaNs:
df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5))
})
print (df)
B Regular Price
0 0 a
1 1 1
2 2 2.3
3 3 a
4 4 7
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
print (df)
B Regular Price
0 0 NaN
1 1 1.0
2 2 2.0
3 3 NaN
4 4 7.0
EDIT:
I also need to remove floating points and use int only
It is possible by converting the NaNs to None and casting the rest to int:
df['Regular Price'] = pd.to_numeric(df['Regular Price'],
                                    errors='coerce').round()
# np.where builds an object array: None where the value was NaN,
# the int-cast value everywhere else (fillna(0) only makes the
# astype(int) legal; those positions are replaced by None anyway)
df['Regular Price'] = np.where(df['Regular Price'].isnull(),
                               None,
                               df['Regular Price'].fillna(0).astype(int))
print (df)
B Regular Price
0 0 None
1 1 1
2 2 2
3 3 None
4 4 7
print (df['Regular Price'].apply(type))
0 <class 'NoneType'>
1 <class 'int'>
2 <class 'int'>
3 <class 'NoneType'>
4 <class 'int'>
Name: Regular Price, dtype: object
But this hurts performance, so it is best not to use it. There are other problems as well - some functions fail on such an object column - so it is best to stay with floats when working with NaNs:
Testing some functions like diff on a 50k-row DataFrame:
df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5))
})
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
df['Regular Price1'] = np.where(df['Regular Price'].isnull(),
                                None,
                                df['Regular Price'].fillna(0).astype(int))
In [252]: %timeit df['Regular Price2'] = df['Regular Price1'].diff()
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
In [274]: %timeit df['Regular Price3'] = df['Regular Price'].diff()
1000 loops, best of 3: 301 µs per loop
In [272]: %timeit df['Regular Price2'] = df['Regular Price1'] * 1000
100 loops, best of 3: 4.48 ms per loop
In [273]: %timeit df['Regular Price3'] = df['Regular Price'] * 1000
1000 loops, best of 3: 469 µs per loop
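For what it's worth, newer pandas (1.0+, if I remember the version right) has the nullable 'Int64' extension dtype, which stores integers with missing values natively and avoids the slow object column entirely; a sketch under that version assumption:
import pandas as pd

df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5))
})

# 'Int64' (capital I) is the nullable integer dtype; missing values
# display as <NA> and vectorized arithmetic keeps working
df['Regular Price'] = (pd.to_numeric(df['Regular Price'], errors='coerce')
                         .round()
                         .astype('Int64'))
print(df['Regular Price'].dtype)   # Int64
print(df['Regular Price'] * 1000)  # no TypeError, <NA> propagates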
EDIT:
df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5))
})
print (df)
B Regular Price
0 0 a
1 1 1
2 2 2.3
3 3 a
4 4 7
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
print (df)
B Regular Price
0 0 NaN
1 1 1.0
2 2 2.0
3 3 NaN
4 4 7.0
First, it is possible to drop the NaN rows by column Regular Price and then convert to int.
df1 = df.dropna(subset=['Regular Price']).copy()
df1['Regular Price'] = df1['Regular Price'].astype(int)
print (df1)
B Regular Price
1 1 1
2 2 2
4 4 7
Process what you need, but don't change the index.
#e.g. some process
df1['Regular Price'] = df1['Regular Price'] * 100
Last, use combine_first - it aligns on the index, keeps df1's values where present, and falls back to df elsewhere, which restores the NaN rows in the Regular Price column.
df2 = df1.combine_first(df)
print (df2)
B Regular Price
0 0.0 NaN
1 1.0 100.0
2 2.0 200.0
3 3.0 NaN
4 4.0 700.0
I am trying to create a new column 'ratioA' in a dataframe df whereby the values are related to a column A:
For a given row, df['ratioA'] is equal to the ratio between df['A'] in that row and the next row.
I iterated over the index column as a reference, but I am not sure why all the values appear as NaN - technically only the last row should be NaN.
import numpy as np
import pandas as pd
series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})
df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
df = df.reset_index()
for i in df['index']:
    df['ratioA'] = df['A'][df['index']==i]/df['A'][df['index']==i+1]
print (df)
The output is:
index A B ratioA
0 0 1 2 NaN
1 1 3 4 NaN
2 2 5 6 NaN
3 3 7 8 NaN
The desired output should be:
index A B ratioA
0 0 1 2 0.33
1 1 3 4 0.60
2 2 5 6 0.71
3 3 7 8 NaN
You can use a vectorized solution - divide column A by its shifted values with div. (Your loop yields all NaN because df['A'][df['index']==i] and df['A'][df['index']==i+1] are one-element Series with different index labels, so the division aligns on the index and produces NaN; each iteration then overwrites the whole ratioA column.)
print (df['A'].shift(-1))
0 3.0
1 5.0
2 7.0
3 NaN
Name: A, dtype: float64
df['ratioA'] = df['A'].div(df['A'].shift(-1))
print (df)
index A B ratioA
0 0 1 2 0.333333
1 1 3 4 0.600000
2 2 5 6 0.714286
3 3 7 8 NaN
In pandas, loops are very slow, so it is best to avoid them (Jeff, a pandas developer, explains it better). If you really need a loop:
for i, row in df.iterrows():
    if i != df.index[-1]:
        df.loc[i, 'ratioA'] = df.loc[i, 'A'] / df.loc[i+1, 'A']
print (df)
index A B ratioA
0 0 1 2 0.333333
1 1 3 4 0.600000
2 2 5 6 0.714286
3 3 7 8 NaN
Timings:
series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})
df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
#[4000 rows x 3 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
df = df.reset_index()
In [49]: %timeit df['ratioA1'] = df['A'].div(df['A'].shift(-1))
1000 loops, best of 3: 431 µs per loop
In [50]: %%timeit
    ...: for i, row in df.iterrows():
    ...:     if i != df.index[-1]:
    ...:         df.loc[i, 'ratioA'] = df.loc[i, 'A'] / df.loc[i+1, 'A']
    ...:
1 loop, best of 3: 2.15 s per loop
I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
For each key in the dictionary I would like to create a new column in the dataframe A with the value from the dictionary (the same value for all rows of each column).
At the end, A should be of size (1500, 8).
Is there a "pythonic" way to do this? Thanks!
You can use concat with the DataFrame constructor:
D = {'newcol1': 'a',
     'newcol2': 2,
     'newcol3': 1}
df = pd.DataFrame({'A':[1,2],
                   'B':[4,5],
                   'C':[7,8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
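The plainest alternative is a loop of scalar assignments - pandas broadcasts a scalar down the whole column, and a few assignments are cheap even on 1500 rows:
for k, v in D.items():
    df[k] = v  # the scalar is broadcast to every row of the new column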
Timings:
D = {'newcol1': 'a',
     'newcol2': 2,
     'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop
setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
    'newcol1': 'a',
    'newcol2': 2,
    'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1
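One caveat: assign returns a new DataFrame rather than modifying A in place, so bind the result back if you want to keep the columns:
A = A.assign(**d)  # persist the new columns on A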
Sometimes I manipulate some columns of a dataframe and want to assign the result back.
For example, a dataframe df has 6 columns:
A, B1, B2, B3, C, D
and I want to transform the values in columns (B1, B2, B3) into (B1*A, B2*A, B3*A).
Compared with a loop subroutine, which is slow, df.filter(like='B') speeds things up a lot.
df.filter(like="B").mul(df.A, axis=0) produces the right answer, but I can't change the B-like columns in df with:
df.filter(like="B") = df.filter(like="B").mul(df.A, axis=0)
How can I achieve this? I know that creating a new dataframe with pd.concat would get it done, but when the number of columns is huge that method may lose efficiency. What I want is to assign new values to the columns that already exist.
Any advice would be appreciated!
Use str.contains with boolean indexing:
cols = df.columns[df.columns.str.contains('B')]
df[cols] = df[cols].mul(df.A, axis = 0)
Sample:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
                   'B1':[4,5,6],
                   'B2':[7,8,9],
                   'B3':[1,3,5],
                   'C':[5,3,6],
                   'D':[7,4,3]})
print (df)
A B1 B2 B3 C D
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
cols = df.columns[df.columns.str.contains('B')]
print (cols)
Index(['B1', 'B2', 'B3'], dtype='object')
df[cols] = df[cols].mul(df.A, axis = 0)
print (df)
A B1 B2 B3 C D
0 1 4 7 1 5 7
1 2 10 16 6 3 4
2 3 18 27 15 6 3
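Note that str.contains('B') matches a 'B' anywhere in the column name, so a hypothetical column named 'SUB' would be scaled as well; if only names starting with B should match, str.startswith (or the regex '^B') is stricter:
# select only the columns whose name begins with 'B'
cols = df.columns[df.columns.str.startswith('B')]
df[cols] = df[cols].mul(df.A, axis=0)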
Timings:
len(df)=3:
In [17]: %timeit (a(df))
1000 loops, best of 3: 1.36 ms per loop
In [18]: %timeit (b(df1))
100 loops, best of 3: 2.39 ms per loop
len(df)=30k:
In [14]: %timeit (a(df))
100 loops, best of 3: 2.89 ms per loop
In [15]: %timeit (b(df1))
100 loops, best of 3: 4.71 ms per loop
Code:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
                   'B1':[4,5,6],
                   'B2':[7,8,9],
                   'B3':[1,3,5],
                   'C':[5,3,6],
                   'D':[7,4,3]})
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
def a(df):
    cols = df.columns[df.columns.str.contains('B')]
    df[cols] = df[cols].mul(df.A, axis=0)
    return df

def b(df):
    df.loc[:, df.filter(regex=r'^B').columns] = df.loc[:, df.filter(regex=r'^B').columns].mul(df.A, axis=0)
    return df
print (a(df))
print (b(df1))
You have almost done it:
In [136]: df.loc[:, df.filter(regex=r'^B').columns] = df.loc[:, df.filter(regex=r'^B').columns].mul(df.A, axis=0)
In [137]: df
Out[137]:
A B1 B2 B3 B4 F
0 1 4 7 1 5 7
1 2 10 16 6 6 4
2 3 18 27 15 18 3
I have a df and want to make a new_df of the same size but with all 1s. Something in the spirit of new_df = df.replace("*", "1"). I think this is faster than creating a new df from scratch, because otherwise I would need to get the dimensions, fill the frame with 1s, and copy all the headers over. Unless I'm wrong about that.
df_new = pd.DataFrame(np.ones(df.shape), columns=df.columns, index=df.index)
import numpy as np
import pandas as pd

d = [
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [3, 3, 3, 3, 3],
    [4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5],
]
cols = ["A", "B", "C", "D", "E"]
df = pd.DataFrame(d, columns=cols)  # the frame the timings below run on
%timeit df1 = pd.DataFrame(np.ones(df.shape), columns=df.columns)
10000 loops, best of 3: 94.6 µs per loop
%timeit df2 = df.copy(); df2.loc[:, :] = 1
1000 loops, best of 3: 245 µs per loop
%timeit df3 = df * 0 + 1
1000 loops, best of 3: 200 µs per loop
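One detail: np.ones defaults to float64, so df1 holds 1.0 even when df holds ints; if the original dtype matters, pass it explicitly (a small sketch, assuming a single common dtype):
# keep an integer dtype and carry the original index over
df1 = pd.DataFrame(np.ones(df.shape, dtype=int),
                   columns=df.columns, index=df.index)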
It's actually pretty easy.
import pandas as pd

d = [
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [3, 3, 3, 3, 3],
    [4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5],
]
cols = ["A", "B", "C", "D", "E"]
df = pd.DataFrame(d, columns=cols)
print(df)
print("------------------------")
df.loc[:, :] = 1
print(df)
Result:
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
------------------------
A B C D E
0 1 1 1 1 1
1 1 1 1 1 1
2 1 1 1 1 1
3 1 1 1 1 1
4 1 1 1 1 1
Obviously, df.loc[:,:] means you target all rows across all columns. Just use df2 = df.copy() or something if you want a new dataframe.
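Spelled out, the non-destructive variant from that last sentence:
df2 = df.copy()    # leave the original df intact
df2.loc[:, :] = 1  # overwrite every cell of the copy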