I am trying to create a new column 'ratioA' in a dataframe df whose values are derived from column A:
For a given row, df['ratioA'] should equal the ratio between df['A'] in that row and df['A'] in the next row.
I iterated over the index column as a reference, but I am not sure why every value appears as NaN. Technically, only the last row should be NaN.
import numpy as np
import pandas as pd
series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})
df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
df = df.reset_index()
for i in df['index']:
    df['ratioA'] = df['A'][df['index']==i]/df['A'][df['index']==i+1]
print (df)
The output is:
index A B ratioA
0 0 1 2 NaN
1 1 3 4 NaN
2 2 5 6 NaN
3 3 7 8 NaN
The desired output should be:
index A B ratioA
0 0 1 2 0.33
1 1 3 4 0.60
2 2 5 6 0.71
3 3 7 8 NaN
You can use a vectorized solution: divide column A by a shifted copy of itself, using div and shift(-1):
print (df['A'].shift(-1))
0 3.0
1 5.0
2 7.0
3 NaN
Name: A, dtype: float64
df['ratioA'] = df['A'].div(df['A'].shift(-1))
print (df)
index A B ratioA
0 0 1 2 0.333333
1 1 3 4 0.600000
2 2 5 6 0.714286
3 3 7 8 NaN
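As for why the original loop produced only NaN, here is a minimal sketch, assuming the cause is index alignment: each boolean selection returns a one-row Series with its own label, and pandas aligns labels before dividing. The names left, right and mismatch are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8]})

# Two one-row selections carrying different index labels (0 and 1).
left = df['A'][df.index == 0]
right = df['A'][df.index == 1]

# Division aligns on labels first; labels 0 and 1 never match,
# so every entry of the result is NaN.
mismatch = left / right
print(mismatch)

# shift(-1) moves the next row's value onto the current row's label,
# so the labels line up and the division works row by row.
ratio = df['A'] / df['A'].shift(-1)
print(ratio)
```

Assigning the all-NaN result to df['ratioA'] on every iteration then overwrites the whole column, which would explain the output in the question.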
Loops in pandas are very slow, so it is best to avoid them (Jeff, a pandas developer, explains it better). For comparison, the equivalent loop is:
for i, row in df.iterrows():
    if i != df.index[-1]:
        df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A']
print (df)
index A B ratioA
0 0 1 2 0.333333
1 1 3 4 0.600000
2 2 5 6 0.714286
3 3 7 8 NaN
Timings:
series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})
df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
#[4000 rows x 3 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
df = df.reset_index()
In [49]: %timeit df['ratioA1'] = df['A'].div(df['A'].shift(-1))
1000 loops, best of 3: 431 µs per loop
In [50]: %%timeit
    ...: for i, row in df.iterrows():
    ...:     if i != df.index[-1]:
    ...:         df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A']
    ...:
1 loop, best of 3: 2.15 s per loop
Related
I have a dataframe with columns (A, B and value) where there are missing values in the value column. And there is a Series indexed by two columns (A and B) from the dataframe. How can I fill the missing values in the dataframe with corresponding values in the series?
I think you need fillna with set_index and reset_index:
df = pd.DataFrame({'A': [1,1,3],
'B': [2,3,4],
'value':[2,np.nan,np.nan] })
print (df)
A B value
0 1 2 2.0
1 1 3 NaN
2 3 4 NaN
idx = pd.MultiIndex.from_product([[1,3],[2,3,4]])
s = pd.Series([5,6,0,8,9,7], index=idx)
print (s)
1  2    5
   3    6
   4    0
3  2    8
   3    9
   4    7
dtype: int64
df = df.set_index(['A','B'])['value'].fillna(s).reset_index()
print (df)
A B value
0 1 2 2.0
1 1 3 6.0
2 3 4 7.0
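The same idea can be wrapped so that the original column order and index are kept. This is only a sketch: fill_from_series is a hypothetical helper name, not a pandas function, and it assumes the Series has a unique index on the key pairs.

```python
import numpy as np
import pandas as pd

def fill_from_series(df, s, keys, col):
    """Hypothetical helper: fill NaN in df[col] from s, matched on keys."""
    # Index the column by the key columns so fillna can align it with s.
    keyed = df.set_index(keys)[col]
    filled = keyed.fillna(s)
    # Row order is unchanged by set_index, so values can be written back positionally.
    out = df.copy()
    out[col] = filled.to_numpy()
    return out

df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [2, 3, 4],
                   'value': [2, np.nan, np.nan]})
idx = pd.MultiIndex.from_product([[1, 3], [2, 3, 4]])
s = pd.Series([5, 6, 0, 8, 9, 7], index=idx)

result = fill_from_series(df, s, ['A', 'B'], 'value')
print(result)
```

Existing values (the 2.0 in the first row) are left alone; only the missing entries are taken from s.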
Consider the dataframe and series df and s
df = pd.DataFrame(dict(
A=list('aaabbbccc'),
B=list('xyzxyzxyz'),
value=[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9]
))
s = pd.Series(range(1, 10)[::-1])
s.index = [df.A, df.B]
We can fillna with a clever join
df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
#                  \_____________/                  \_________/
#                  make series same                 get old
#                  name as column                   column out
#                  we are filling                   of the way
A B value
0 a x 1.0
1 a y 2.0
2 a z 7.0
3 b x 4.0
4 b y 5.0
5 b z 4.0
6 c x 7.0
7 c y 8.0
8 c z 9.0
Timing
join is cute, but @jezrael's set_index is quicker
%timeit df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
100 loops, best of 3: 3.56 ms per loop
%timeit df.set_index(['A','B'])['value'].fillna(s).reset_index()
100 loops, best of 3: 2.06 ms per loop
I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
For each key in the dictionary, I would like to create a new column in dataframe A holding that key's value (the same value in every row of the column).
At the end, A should be of size (1500,8).
Is there a "pythonic" way to do this? Thanks!
You can use concat with DataFrame constructor:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame({'A':[1,2],
'B':[4,5],
'C':[7,8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
Timings:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop
setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
'newcol1': 'a',
'newcol2': 2,
'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1
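One thing to keep in mind is that assign returns a new DataFrame and leaves the original untouched. If the columns should be added in place instead, a plain loop over the dictionary also works, because a scalar assigned to a column is broadcast to every row. A small sketch on a two-row frame (df2 is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [4, 5], 'C': [7, 8]})
D = {'newcol1': 'a', 'newcol2': 2, 'newcol3': 1}

# assign returns a new DataFrame; df itself keeps its three columns.
out = df.assign(**D)

# In-place variant: each scalar is broadcast down the whole column.
df2 = df.copy()
for name, value in D.items():
    df2[name] = value

print(out)
print(df2)
```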
I have the following dataframe:
A,B,C
1,2,3
I need to convert it to the following format:
cols,vals
A,1
B,2
C,3
How to create column names as a new column in pandas?
You can transpose by T:
import pandas as pd
df = pd.DataFrame({'A': {0: 1}, 'C': {0: 3}, 'B': {0: 2}})
print (df)
A B C
0 1 2 3
print (df.T)
0
A 1
B 2
C 3
df1 = df.T.reset_index()
df1.columns = ['cols','vals']
print (df1)
cols vals
0 A 1
1 B 2
2 C 3
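For the single-row case, melt might be a shorter route to the same shape, since it turns column names into one column and their values into another. A minimal sketch (the column order A, B, C is set explicitly here):

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})

# var_name labels the column holding the old column names,
# value_name labels the column holding their values.
df1 = df.melt(var_name='cols', value_name='vals')
print(df1)
```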
If DataFrame has more rows, you can use:
import pandas as pd
df = pd.DataFrame({'A': {0: 1, 1: 9, 2: 1},
'C': {0: 3, 1: 6, 2: 7},
'B': {0: 2, 1: 4, 2: 8}})
print (df)
A B C
0 1 2 3
1 9 4 6
2 1 8 7
df.index = 'vals' + df.index.astype(str)
print (df.T)
vals0 vals1 vals2
A 1 9 1
B 2 4 8
C 3 6 7
df1 = df.T.reset_index().rename(columns={'index':'cols'})
print (df1)
cols vals0 vals1 vals2
0 A 1 9 1
1 B 2 4 8
2 C 3 6 7
What's the most efficient way to select the second to last of each duplicated set in a pandas dataframe?
For instance I basically want to do this operation:
df = df.drop_duplicates(['Person','Question'],take_last=True)
But this:
df = df.drop_duplicates(['Person','Question'],take_second_last=True)
Abstracted question: how to choose which duplicate to keep if duplicate is neither the max nor the min?
With groupby.apply:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
'B': np.arange(10), 'C': np.arange(10)})
df
Out:
A B C
0 1 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 7
8 3 8 8
9 4 9 9
(df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
.reset_index(level=0, drop=True))
Out:
A B C
2 1 2 2
5 2 5 5
7 3 7 7
9 4 9 9
With a different DataFrame, subset two columns:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})
df
Out:
A B C
0 1 1 0
1 1 1 1
2 1 2 2
3 1 1 3
4 2 2 4
5 2 2 5
6 2 2 6
7 3 3 7
8 3 3 8
9 4 4 9
(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
.reset_index(level=0, drop=True))
Out:
A B C
1 1 1 1
2 1 2 2
5 2 2 5
7 3 3 7
9 4 4 9
You could groupby/tail(2) to take the last 2 items, then groupby/head(1) to take the first item from the tail:
df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
If there is only one item in the group, tail(2) returns just the one item.
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
expected = (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
The builtin groupby methods (such as tail and head) are often much faster
than groupby/apply with custom Python functions. This is especially true if there are a lot of groups:
In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop
In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop
Alternatively, ayhan suggests a nice improvement:
alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)
In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
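Another way to express the same selection is with a boolean mask built from each row's position counted from the end of its group. This is only a sketch of that idea, not something timed against the versions above; from_end and size are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4],
                   'C': np.arange(10)})
keys = ['A', 'B']
grouped = df.groupby(keys)

# Position counted from the end of each group: 0 = last, 1 = second to last.
from_end = grouped.cumcount(ascending=False)
# Group size broadcast back onto every row.
size = grouped['C'].transform('size')

# Keep the second-to-last row, or the only row of a single-row group.
result = df[(from_end == 1) | (size == 1)]
print(result)
```

Changing the comparison value (for example from_end == 2) would pick a different position from the end, which is the general case the question asks about.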
Is there a way to apply a function over a moving window centered around the current row? For example:
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
... 'B': {0: 1, 1: 3, 2: 5},
... 'C': {0: 2, 1: 4, 2: 6}})
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
The desired result is a new column D holding the average of column C over the previous, current and following rows, that is:
row 0 => D = (2 + 4)/2 = 3
row 1 => D = (2 + 4 + 6)/3 = 4
row 2 => D = (4 + 6)/2 = 5
>>> df_final
C D
0 2 3
1 4 4
2 6 5
It looks like you just want a rolling mean, with a centred window of 3. For example:
>>> df["D"] = pd.rolling_mean(df["C"], window=3, center=True, min_periods=2)
>>> df
A B C D
0 a 1 2 3
1 b 3 4 4
2 c 5 6 5
Updated answer: pd.rolling_mean was deprecated in 0.18 and is no longer available as of pandas=0.23.4.
Window functions are now methods
Window functions have been refactored to be methods on Series/DataFrame objects, rather than top-level functions, which are now deprecated. This allows these window-type functions, to have a similar API to that of .groupby.
It needs to be called either as a method on the column (a Series):
In [55]: df['D'] = df['C'].rolling(window=3, center=True, min_periods=2).mean()
In [56]: df
Out[56]:
A B C D
0 a 1 2 3.0
1 b 3 4 4.0
2 c 5 6 5.0
Or from pandas.core.window.Rolling:
In [57]: df['D'] = pd.core.window.Rolling(df['C'], window=3, center=True, min_periods=2).mean()
In [58]: df
Out[58]:
A B C D
0 a 1 2 3.0
1 b 3 4 4.0
2 c 5 6 5.0
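Since the question asks about applying a function in general, not only the mean, the same centred window can be combined with apply. A minimal sketch, where the max-minus-min function and the column name spread are just examples:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'],
                   'B': [1, 3, 5],
                   'C': [2, 4, 6]})

# Centred window of 3 rows; min_periods=2 lets the edge rows use 2 values.
window = df['C'].rolling(window=3, center=True, min_periods=2)

df['D'] = window.mean()
# raw=True passes each window to the function as a NumPy array.
df['spread'] = window.apply(lambda w: w.max() - w.min(), raw=True)
print(df)
```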