I have a dataframe like this (usually with far more columns and rows):
A B
0 5 10
1 15 3
Now I want to add columns A_ratio and B_ratio to this dataframe whereby the values there represent A/(A + B) and B/(A + B), respectively. So A_ratio and B_ratio should add up to 1 in each row of the dataframe.
My first attempt looked like this:
import pandas as pd
df = pd.DataFrame({'A': [5,15], 'B': [10,3]})
for coli in df:
    df[coli + '_ratio'] = df[coli]/df.sum(axis=1)
giving me the following result:
A B A_ratio B_ratio
0 5 10 0.333333 0.652174
1 15 3 0.833333 0.159292
Clearly, the columns A_ratio and B_ratio do not add up to 1. The values in A_ratio are correct, but those in B_ratio are wrong because the row sum already includes A_ratio by the time B_ratio is computed.
A workaround could be to copy the dataframe first:
df2 = pd.DataFrame({'A': [5,15], 'B': [10,3]})
df2cl = df2.copy()
for coli in df2:
    df2[coli + '_ratio'] = df2[coli]/df2cl.sum(axis=1)
which gives me the desired output:
A B A_ratio B_ratio
0 5 10 0.333333 0.666667
1 15 3 0.833333 0.166667
Is there a more efficient way of doing this which avoids copying the dataframe?
You don't need to call sum each time.
>>%timeit %run multiple_sum.py
100 loops, best of 3: 6.59 ms per loop
>>%timeit %run single_sum.py
100 loops, best of 3: 3.84 ms per loop
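If you have a big dataframe, recomputing the sum for every column is needless overhead. Computing it once up front
sums = df.sum(axis=1)
for coli in df:
    df[coli + '_ratio'] = df[coli]/sums
is sufficient, and because sums is taken before any ratio column is added, it also avoids the copy.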
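The loop can also be dropped altogether. A minimal sketch, assuming every column in df is numeric and should take part in the row sum (the names row_sums, ratios and result are illustrative and do not come from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 15], 'B': [10, 3]})

# Row totals are computed once, before any ratio column exists.
row_sums = df.sum(axis=1)

# Divide every column by its row total (axis=0 aligns on the row index),
# then rename the columns so they can sit next to the originals.
ratios = df.div(row_sums, axis=0).add_suffix('_ratio')

# join returns a new frame; df itself is left unchanged.
result = df.join(ratios)
print(result)
```

Here div with axis=0 divides each row by its own total, and add_suffix renames the columns before they are joined back.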
You can just sub-select from your df so that it only sums those 2 columns:
In [195]:
for coli in df:
    df[coli + '_ratio'] = df[coli]/df[['A','B']].sum(axis=1)
df
Out[195]:
A B A_ratio B_ratio
0 5 10 0.333333 0.666667
1 15 3 0.833333 0.166667
You can just take a copy of the column names upfront if you don't want to hardcode them:
In [197]:
cols = df.columns
for coli in df:
    df[coli + '_ratio'] = df[coli]/df[cols].sum(axis=1)
df
Out[197]:
A B A_ratio B_ratio
0 5 10 0.333333 0.666667
1 15 3 0.833333 0.166667
I have two dataframes that I want to sum along the y axis, conditionally.
For example:
df_1
a b value
1 1 1011
1 2 1012
2 1 1021
2 2 1022
df_2
a b value
9 9 99
1 2 12
2 1 21
I want to make df_1['value'] -= df_2['value'] if df_1[a] == df_2[a] & df_1[b] == df_2[b], so the output would be:
OUTPUT
a b value
1 1 1011
1 2 1000
2 1 1000
2 2 1022
Is there a way to achieve that instead of iterating the whole dataframe? (It's pretty big)
Make use of index alignment that pandas provides here, by setting a and b as your index before subtracting.
for df in [df_1, df_2]:
    df.set_index(['a', 'b'], inplace=True)
df_1.sub(df_2, fill_value=0).reindex(df_1.index)
        value
a b
1 1  1011.0
  2  1000.0
2 1  1000.0
  2  1022.0
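For reference, a self-contained sketch of the index-alignment approach that leaves the input frames untouched. The frames are rebuilt from the question's data; the names left, right and out are illustrative:

```python
import pandas as pd

df_1 = pd.DataFrame({'a': [1, 1, 2, 2],
                     'b': [1, 2, 1, 2],
                     'value': [1011, 1012, 1021, 1022]})
df_2 = pd.DataFrame({'a': [9, 1, 2],
                     'b': [9, 2, 1],
                     'value': [99, 12, 21]})

# Index both frames by the key columns without modifying the originals.
left = df_1.set_index(['a', 'b'])
right = df_2.set_index(['a', 'b'])

# sub aligns on the (a, b) index; fill_value=0 treats keys missing on
# one side as 0, and reindex drops keys that only exist in df_2.
out = left.sub(right, fill_value=0).reindex(left.index).reset_index()
print(out)
```

The value column comes back as float because the alignment step introduces missing values before they are filled.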
You could also perform a left join and subtract matching values. Here is how to do that:
(pd.merge(df_1, df_2, how='left', on=['a', 'b'], suffixes=('_1', '_2'))
   .fillna(0)
   .assign(value=lambda x: x.value_1 - x.value_2)
)[['a', 'b', 'value']]
You could let
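# carry df_1's row labels through the merge so the result can be written back
merged = pd.merge(df_1.reset_index(), df_2, on=['a', 'b']).set_index('index')
df_1.loc[merged.index, 'value'] = merged.value_x - merged.value_y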
Result:
In [37]: df_1
Out[37]:
a b value
0 1 1 1011
1 1 2 1000
2 2 1 1000
3 2 2 1022
I have a dataframe with columns (A, B and value) where there are missing values in the value column. And there is a Series indexed by two columns (A and B) from the dataframe. How can I fill the missing values in the dataframe with corresponding values in the series?
I think you need fillna with set_index and reset_index:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1,1,3],
                   'B': [2,3,4],
                   'value':[2,np.nan,np.nan] })
print (df)
A B value
0 1 2 2.0
1 1 3 NaN
2 3 4 NaN
idx = pd.MultiIndex.from_product([[1,3],[2,3,4]])
s = pd.Series([5,6,0,8,9,7], index=idx)
print (s)
1  2    5
   3    6
   4    0
3  2    8
   3    9
   4    7
dtype: int64
df = df.set_index(['A','B'])['value'].fillna(s).reset_index()
print (df)
A B value
0 1 2 2.0
1 1 3 6.0
2 3 4 7.0
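Note that set_index followed by reset_index replaces the frame's original row labels with a fresh 0..n-1 index. If those labels matter, one option is to look the series up by key and fill by position. A sketch under that assumption; the row labels 10, 20, 30 and the names keys and lookup are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [2, 3, 4],
                   'value': [2, np.nan, np.nan]},
                  index=[10, 20, 30])  # hypothetical non-default labels

idx = pd.MultiIndex.from_product([[1, 3], [2, 3, 4]])
s = pd.Series([5, 6, 0, 8, 9, 7], index=idx)

# Build the (A, B) key for every row and look it up in s.
keys = pd.MultiIndex.from_frame(df[['A', 'B']])

# Re-label the looked-up values with df's own index so fillna aligns row by row.
lookup = pd.Series(s.reindex(keys).to_numpy(), index=df.index)

df['value'] = df['value'].fillna(lookup)
print(df)
```

Key pairs that are absent from s come back as NaN from reindex, so the corresponding rows simply stay unfilled.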
Consider the dataframe and series df and s
df = pd.DataFrame(dict(
A=list('aaabbbccc'),
B=list('xyzxyzxyz'),
value=[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9]
))
s = pd.Series(range(1, 10)[::-1])
s.index = [df.A, df.B]
We can fillna with a clever join
df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
# s.rename('value'): give the series the same name as the column we are filling
# lsuffix='_':       rename the old 'value' column to get it out of the way
A B value
0 a x 1.0
1 a y 2.0
2 a z 7.0
3 b x 4.0
4 b y 5.0
5 b z 4.0
6 c x 7.0
7 c y 8.0
8 c z 9.0
Timing
join is cute, but @jezrael's set_index is quicker
%timeit df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
100 loops, best of 3: 3.56 ms per loop
%timeit df.set_index(['A','B'])['value'].fillna(s).reset_index()
100 loops, best of 3: 2.06 ms per loop
I am trying to concatenate two dataframes, above and below. Not concatenate side-by-side.
The dataframes contain the same data, however, in the first dataframe one column might have name "ObjectType" and in the second dataframe the column might have name "ObjectClass". When I do
df_total = pandas.concat ([df0, df1])
the df_total will have two column names, one with "ObjectType" and another with "ObjectClass". In each of these two columns, half of the values will be "NaN". So I have to manually merge these two columns into one which is a pain.
Can I somehow merge the two columns into one? I would like to have a function that does something like:
df_total = pandas.merge_many_columns(input=["ObjectType","ObjectClass"], output=["MyObjectClasses"])
which merges the two columns and creates a new column. I have looked into melt() but it does not really do this?
(Maybe it would be nice if I could specify what will happen if there is a collision, say that two columns contain values, in that case I supply a lambda function that says "keep the largest value", "use an average", etc)
I think you can rename the columns first so that the data aligns in both DataFrames:
df0 = pd.DataFrame({'ObjectType':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df0)
df1 = pd.DataFrame({'ObjectClass':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df1)
inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"
#dict comprehension
d = {x:output for x in inputs}
print (d)
{'ObjectType': 'MyObjectClasses', 'ObjectClass': 'MyObjectClasses'}
df0 = df0.rename(columns=d)
df1 = df1.rename(columns=d)
df_total = pd.concat([df0, df1], ignore_index=True)
print (df_total)
B C MyObjectClasses
0 4 7 1
1 5 8 2
2 6 9 3
3 4 7 1
4 5 8 2
5 6 9 3
EDIT:
Simpler is update (it works in place):
df = pd.concat([df0, df1])
df['ObjectType'].update(df['ObjectClass'])
print (df)
B C ObjectClass ObjectType
0 4 7 NaN 1.0
1 5 8 NaN 2.0
2 6 9 NaN 3.0
0 4 7 1.0 1.0
1 5 8 2.0 2.0
2 6 9 3.0 3.0
Or use fillna, but then you need to drop the original column:
df = pd.concat([df0, df1])
df["ObjectType"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop('ObjectClass', axis=1)
print (df)
B C ObjectType
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
To write the result to a new column instead, drop both original columns:
df = pd.concat([df0, df1])
df["MyObjectClasses"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop(['ObjectType','ObjectClass'], axis=1)
print (df)
B C MyObjectClasses
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
EDIT1:
Timings:
df0 = pd.DataFrame({'ObjectType':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df0)
df1 = pd.DataFrame({'ObjectClass':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
#print (df1)
df0 = pd.concat([df0]*1000).reset_index(drop=True)
df1 = pd.concat([df1]*1000).reset_index(drop=True)
inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"
#dict comprehension
d = {x:output for x in inputs}
In [241]: %timeit df_total = pd.concat([df0.rename(columns=d), df1.rename(columns=d)], ignore_index=True)
1000 loops, best of 3: 821 µs per loop
In [240]: %%timeit
...: df = pd.concat([df0, df1])
...: df['ObjectType'].update(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.18 ms per loop
In [242]: %%timeit
...: df = pd.concat([df0, df1])
...: df['MyObjectClasses'] = df['ObjectType'].combine_first(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.21 ms per loop
In [243]: %%timeit
...: df = pd.concat([df0, df1])
...: df['MyObjectClasses'] = df['ObjectType'].fillna(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.28 ms per loop
You can merge two columns separated by NaNs into one using combine_first:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df0 = pd.DataFrame({'ObjectType':[1,2,3],
...                     'B':[4,5,6],
...                     'C':[7,8,9]})
>>> df1 = pd.DataFrame({'ObjectClass':[1,2,3],
...                     'B':[4,5,6],
...                     'C':[7,8,9]})
>>> df = pd.concat([df0, df1])
>>> df['ObjectType'] = df['ObjectType'].combine_first(df['ObjectClass'])
>>> df['ObjectType']
0    1.0
1    2.0
2    3.0
0    1.0
1    2.0
2    3.0
Name: ObjectType, dtype: float64
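The question also asks what to do when both columns hold a value in the same row. A minimal sketch of three possible policies, using made-up data in which the last row collides; the names pair, largest, average and first are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 has a value in both columns (a collision).
df = pd.DataFrame({'ObjectType': [1.0, np.nan, 3.0],
                   'ObjectClass': [np.nan, 2.0, 5.0]})

pair = df[['ObjectType', 'ObjectClass']]

# Row-wise reductions skip NaN by default, so rows with a single
# value keep it and only colliding rows are actually reduced.
largest = pair.max(axis=1)    # "keep the largest value"
average = pair.mean(axis=1)   # "use an average"

# combine_first prefers the calling column and falls back to the other.
first = df['ObjectType'].combine_first(df['ObjectClass'])

df['MyObjectClasses'] = largest
df = df.drop(['ObjectType', 'ObjectClass'], axis=1)
print(df)
```

Any of the three series can be assigned to the output column; the choice only matters for rows where both inputs are present.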
I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
For each key in the dictionary I would like to create a new column in dataframe A holding the corresponding value (the same value in every row of that column), so that A ends up with size (1500,8).
Is there a Pythonic way to do this? Thanks!
You can use concat with DataFrame constructor:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame({'A':[1,2],
'B':[4,5],
'C':[7,8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
Timings:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop
setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
'newcol1': 'a',
'newcol2': 2,
'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1
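One point worth keeping in mind is that assign returns a new frame and leaves the original alone. A short sketch contrasting it with plain in-place assignment, using a small stand-in for the question's frame (the names B and shape_after_assign are illustrative):

```python
import pandas as pd

A = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
d = {'newcol1': 'a', 'newcol2': 2, 'newcol3': 1}

# assign builds a new frame; A itself still has only two columns.
B = A.assign(**d)
shape_after_assign = A.shape

# To modify A in place, set each column directly; a scalar is
# broadcast to every row.
for name, value in d.items():
    A[name] = value

print(B.shape, shape_after_assign, A.shape)
```

Unpacking with ** requires the dictionary keys to be strings, which is the case for column names like these.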
How can I extract the first and last rows of a given dataframe as a new dataframe in pandas?
I've tried to use iloc to select the desired rows and then concat as in:
df=pd.DataFrame({'a':range(1,5), 'b':['a','b','c','d']})
pd.concat([df.iloc[0,:], df.iloc[-1,:]])
but this does not produce a pandas dataframe:
a 1
b a
a 4
b d
dtype: object
I think the simplest way is .iloc[[0, -1]].
df = pd.DataFrame({'a':range(1,5), 'b':['a','b','c','d']})
df2 = df.iloc[[0, -1]]
print(df2)
a b
0 1 a
3 4 d
You can also use head and tail:
In [29]: pd.concat([df.head(1), df.tail(1)])
Out[29]:
a b
0 1 a
3 4 d
The accepted answer duplicates the first row if the frame contains only a single row. If that's a concern,
df[0::len(df)-1 if len(df) > 1 else 1]
works even for single row-dataframes.
Example: For the following dataframe this will not create a duplicate:
df = pd.DataFrame({'a': [1], 'b':['a']})
df2 = df[0::len(df)-1 if len(df) > 1 else 1]
print(df2)
a b
0 1 a
whereas this does:
df3 = df.iloc[[0, -1]]
print(df3)
a b
0 1 a
0 1 a
because the single row is the first AND last row at the same time.
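The same guard can be written more explicitly. A sketch of a hypothetical helper, first_and_last, which is not part of pandas and only wraps the .iloc selection shown earlier:

```python
import pandas as pd

def first_and_last(frame):
    """Return the first and last rows of frame without duplicating
    the row when the frame has zero or one rows."""
    if len(frame) <= 1:
        return frame.copy()
    return frame.iloc[[0, -1]]

one = pd.DataFrame({'a': [1], 'b': ['a']})
many = pd.DataFrame({'a': range(1, 5), 'b': ['a', 'b', 'c', 'd']})

print(first_and_last(one))   # a single row, not repeated
print(first_and_last(many))  # rows labelled 0 and 3
```

Compared with the slicing one-liner, the explicit length check also makes the behaviour for an empty frame obvious: it is returned as an empty copy.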
You can also add the parameter axis=1 to concat, because df.iloc[0,:] and df.iloc[-1,:] each return a Series, and then transpose the result with T:
print(df.iloc[0,:])
a 1
b a
Name: 0, dtype: object
print(df.iloc[-1,:])
a 4
b d
Name: 3, dtype: object
print(pd.concat([df.iloc[0,:], df.iloc[-1,:]], axis=1))
0 3
a 1 4
b a d
print(pd.concat([df.iloc[0,:], df.iloc[-1,:]], axis=1).T)
a b
0 1 a
3 4 d
Alternatively you can use take:
In [3]: df.take([0, -1])
Out[3]:
a b
0 1 a
3 4 d
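The same idea can mimic the truncated display pandas uses for large datasets, showing the first and last five rows with a row of dots in between: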
x = df[:5]
y = pd.DataFrame([['...']*df.shape[1]], columns=df.columns, index=['...'])
z = df[-5:]
frame = [x, y, z]
result = pd.concat(frame)
print(result)
Output:
date temp
0 1981-01-01 00:00:00 20.7
1 1981-01-02 00:00:00 17.9
2 1981-01-03 00:00:00 18.8
3 1981-01-04 00:00:00 14.6
4 1981-01-05 00:00:00 15.8
... ... ...
3645 1990-12-27 00:00:00 14
3646 1990-12-28 00:00:00 13.6
3647 1990-12-29 00:00:00 13.5
3648 1990-12-30 00:00:00 15.7
3649 1990-12-31 00:00:00 13
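The snippet above can be wrapped in a small function so that short frames are not padded with a pointless row of dots. A sketch, assuming a hypothetical helper name head_tail_preview and made-up data (the temperature file from the answer is not available here):

```python
import pandas as pd

def head_tail_preview(frame, n=5):
    """Return the first and last n rows with a '...' row in between.
    Frames with at most 2*n rows are returned whole."""
    if len(frame) <= 2 * n:
        return frame.copy()
    gap = pd.DataFrame([['...'] * frame.shape[1]],
                       columns=frame.columns, index=['...'])
    return pd.concat([frame.iloc[:n], gap, frame.iloc[-n:]])

# Made-up data standing in for the date/temp frame.
df = pd.DataFrame({'day': range(1, 21),
                   'temp': [float(i) for i in range(20)]})

preview = head_tail_preview(df, n=3)
print(preview)
```

Because the filler row contains strings, the columns of the preview become object dtype, so it is suitable for display only and not for further computation.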