New column in dataframe based on location of values in another column

New column in dataframe based on location of values in another column - python

I am trying to create a new column 'ratioA' in a dataframe df whereby the values are related to a column A:
For a given row, df['ratioA'] is equal to the ratio between df['A'] in that row and the next row.
I iterated over the index column as reference, but not sure why the values are appearing as NaN - Technically only the last row should appear as NaN.
import numpy as np
import pandas as pd
series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})
df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
df = df.reset_index()
for i in df['index']:
df['ratioA'] = df['A'][df['index']==i]/df['A'][df['index']==i+1]
print (df)
The output is:
index A B ratioA
0 0 1 2 NaN
1 1 3 4 NaN
2 2 5 6 NaN
3 3 7 8 NaN
The desired output should be:
index A B ratioA
0 0 1 2 0.33
1 1 3 4 0.60
2 2 5 6 0.71
3 3 7 8 NaN

You can use vectorized solution - divide by div shifted column A:
print (df['A'].shift(-1))
0 3.0
1 5.0
2 7.0
3 NaN
Name: A, dtype: float64
df['ratioA'] = df['A'].div(df['A'].shift(-1))
print (df)
index A B ratioA
0 0 1 2 0.333333
1 1 3 4 0.600000
2 2 5 6 0.714286
3 3 7 8 NaN
In pandas loops are very slow, so the best is avoid them (Jeff (pandas developer) explain it better.):
for i, row in df.iterrows():
if i != df.index[-1]:
df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A']
print (df)
index A B ratioA
0 0 1 2 0.333333
1 1 3 4 0.600000
2 2 5 6 0.714286
3 3 7 8 NaN
Timings:
series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})
df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
#[4000 rows x 3 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
df = df.reset_index()
In [49]: %timeit df['ratioA1'] = df['A'].div(df['A'].shift(-1))
1000 loops, best of 3: 431 µs per loop
In [50]: %%timeit
...: for i, row in df.iterrows():
...: if i != df.index[-1]:
...: df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A']
...:
1 loop, best of 3: 2.15 s per loop

Related

Fill NA values by a two levels indexed Series

I have a dataframe with columns (A, B and value) where there are missing values in the value column. And there is a Series indexed by two columns (A and B) from the dataframe. How can I fill the missing values in the dataframe with corresponding values in the series?

I think you need fillna with set_index and reset_index:
df = pd.DataFrame({'A': [1,1,3],
'B': [2,3,4],
'value':[2,np.nan,np.nan] })
print (df)
A B value
0 1 2 2.0
1 1 3 NaN
2 3 4 NaN
idx = pd.MultiIndex.from_product([[1,3],[2,3,4]])
s = pd.Series([5,6,0,8,9,7], index=idx)
print (s)
1 2 5
3 6
4 0
3 2 8
3 9
4 7
dtype: int64
df = df.set_index(['A','B'])['value'].fillna(s).reset_index()
print (df)
A B value
0 1 2 2.0
1 1 3 6.0
2 3 4 7.0

Consider the dataframe and series df and s
df = pd.DataFrame(dict(
A=list('aaabbbccc'),
B=list('xyzxyzxyz'),
value=[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9]
))
s = pd.Series(range(1, 10)[::-1])
s.index = [df.A, df.B]
We can fillna with a clever join
df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
# \_____________/ \_________/
# make series same get old
# name as column column out
# we are filling of the way
A B value
0 a x 1.0
1 a y 2.0
2 a z 7.0
3 b x 4.0
4 b y 5.0
5 b z 4.0
6 c x 7.0
7 c y 8.0
8 c z 9.0
Timing
join is cute, but #jezrael's set_index is quicker
%timeit df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
100 loops, best of 3: 3.56 ms per loop
%timeit df.set_index(['A','B'])['value'].fillna(s).reset_index()
100 loops, best of 3: 2.06 ms per loop

adding column in pandas dataframe containing the same value

I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
for each key in the dictionary I would like to create a new column in the dataframe A with the values in the dictionary (same value for all the rows of each column)
at the end
A should be of size (1500,8)
Is there a "python" way to do this? thanks!

You can use concat with DataFrame constructor:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame({'A':[1,2],
'B':[4,5],
'C':[7,8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
Timings:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop

setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
'newcol1': 'a',
'newcol2': 2,
'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1

how to convert header row into new columns in python pandas?

I am having following dataframe:
A,B,C
1,2,3
I have to convert above dataframe like following format:
cols,vals
A,1
B,2
c,3
How to create column names as a new column in pandas?

You can transpose by T:
import pandas as pd
df = pd.DataFrame({'A': {0: 1}, 'C': {0: 3}, 'B': {0: 2}})
print (df)
A B C
0 1 2 3
print (df.T)
0
A 1
B 2
C 3
df1 = df.T.reset_index()
df1.columns = ['cols','vals']
print (df1)
cols vals
0 A 1
1 B 2
2 C 3
If DataFrame has more rows, you can use:
import pandas as pd
df = pd.DataFrame({'A': {0: 1, 1: 9, 2: 1},
'C': {0: 3, 1: 6, 2: 7},
'B': {0: 2, 1: 4, 2: 8}})
print (df)
A B C
0 1 2 3
1 9 4 6
2 1 8 7
df.index = 'vals' + df.index.astype(str)
print (df.T)
vals0 vals1 vals2
A 1 9 1
B 2 4 8
C 3 6 7
df1 = df.T.reset_index().rename(columns={'index':'cols'})
print (df1)
cols vals0 vals1 vals2
0 A 1 9 1
1 B 2 4 8
2 C 3 6 7

Python Pandas Drop Duplicates keep second to last

What's the most efficient way to select the second to last of each duplicated set in a pandas dataframe?
For instance I basically want to do this operation:
df = df.drop_duplicates(['Person','Question'],take_last=True)
But this:
df = df.drop_duplicates(['Person','Question'],take_second_last=True)
Abstracted question: how to choose which duplicate to keep if duplicate is neither the max nor the min?

With groupby.apply:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
'B': np.arange(10), 'C': np.arange(10)})
df
Out:
A B C
0 1 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 7
8 3 8 8
9 4 9 9
(df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
.reset_index(level=0, drop=True))
Out:
A B C
2 1 2 2
5 2 5 5
7 3 7 7
9 4 9 9
With a different DataFrame, subset two columns:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})
df
Out:
A B C
0 1 1 0
1 1 1 1
2 1 2 2
3 1 1 3
4 2 2 4
5 2 2 5
6 2 2 6
7 3 3 7
8 3 3 8
9 4 4 9
(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
.reset_index(level=0, drop=True))
Out:
A B C
1 1 1 1
2 1 2 2
5 2 2 5
7 3 3 7
9 4 4 9

You could groupby/tail(2) to take the last 2 items, then groupby/head(1) to take the first item from the tail:
df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
If there is only one item in the group, tail(2) returns just the one item.
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
expected = (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
The builtin groupby methods (such as tail and head) are often much faster
than groupby/apply with custom Python functions. This is especially true if there are a lot of groups:
In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop
In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop
Alternatively, ayhan suggests a nice improvement:
alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)
In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop

Pandas moving window over rows

Is there a way to apply a function over a moving window centered around the current row?, for example:
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
... 'B': {0: 1, 1: 3, 2: 5},
... 'C': {0: 2, 1: 4, 2: 6}})
>>> df
C
0 2
1 4
2 6
Desired results generates a column D which is the average of the values of the column C of the previous, current and following rows, that is:
row 0 => D = (2 + 4)/2 = 3
row 1 => D = (2 + 4 + 6)/3 = 4
row 2 => D = (4 + 6)/2 = 5
>>> df_final
C D
0 2 3
1 4 4
2 6 5

It looks like you just want a rolling mean, with a centred window of 3. For example:
>>> df["D"] = pd.rolling_mean(df["C"], window=3, center=True, min_periods=2)
>>> df
A B C D
0 a 1 2 3
1 b 3 4 4
2 c 5 6 5

Updated answer: pd.rolling_mean was deprecated in 0.18 and is no longer available as of pandas=0.23.4.
Window functions are now methods
Window functions have been refactored to be methods on Series/DataFrame objects, rather than top-level functions, which are now deprecated. This allows these window-type functions, to have a similar API to that of .groupby.
It either needs to be called on the dataframe:
In [55]: df['D'] = df['C'].rolling(window=3, center=True, min_periods=2).mean()
In [56]: df
Out[56]:
A B C D
0 a 1 2 3.0
1 b 3 4 4.0
2 c 5 6 5.0
Or from pandas.core.window.Rolling:
In [57]: df['D'] = pd.core.window.Rolling(df['C'], window=3, center=True, min_periods=2).mean()
In [58]: df
Out[58]:
A B C D
0 a 1 2 3.0
1 b 3 4 4.0
2 c 5 6 5.0

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

New column in dataframe based on location of values in another column - python

Related

Fill NA values by a two levels indexed Series

adding column in pandas dataframe containing the same value

how to convert header row into new columns in python pandas?

Python Pandas Drop Duplicates keep second to last

Pandas moving window over rows

Categories

Resources