I have two dataframes whose value columns I want to combine conditionally, row by row.
For example:
df_1
a b value
1 1 1011
1 2 1012
2 1 1021
2 2 1022
df_2
a b value
9 9 99
1 2 12
2 1 21
I want to do df_1['value'] -= df_2['value'] wherever df_1['a'] == df_2['a'] and df_1['b'] == df_2['b'], so the output would be:
OUTPUT
a b value
1 1 1011
1 2 1000
2 1 1000
2 2 1022
Is there a way to achieve that instead of iterating the whole dataframe? (It's pretty big)
Make use of the index alignment that pandas provides by setting a and b as the index before subtracting:
for df in [df_1, df_2]:
    df.set_index(['a', 'b'], inplace=True)

df_1.sub(df_2, fill_value=0).reindex(df_1.index)
value
a b
1 1 1011.0
2 1000.0
2 1 1000.0
2 1022.0
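Because fill_value=0 upcasts the result to float and leaves a and b in the index, you may want a small follow-up to restore the original layout; a minimal sketch, assuming the frames above:
result = df_1.sub(df_2, fill_value=0).reindex(df_1.index)
result = result.reset_index()                    # bring a and b back as columns
result['value'] = result['value'].astype(int)    # fill_value=0 upcast the values to float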
You could also perform a left join and subtract matching values. Here is how to do that:
(pd.merge(df_1, df_2, how='left', on=['a', 'b'], suffixes=('_1', '_2'))
.fillna(0)
.assign(value=lambda x: x.value_1 - x.value_2)
)[['a', 'b', 'value']]
You could also do the following:
merged = pd.merge(df_1, df_2, on=['a', 'b'], left_index=True)
df_1.value[merged.index] = merged.value_x - merged.value_y
Result:
In [37]: df_1
Out[37]:
a b value
0 1 1 1011
1 1 2 1000
2 2 1 1000
3 2 2 1022
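If your pandas version rejects combining on= with left_index=True in that merge, a variant of the same idea that keeps df_1's row labels for the assignment could look like this (a sketch, assuming the frames from the question):
merged = (df_1.reset_index()
              .merge(df_2, on=['a', 'b'], suffixes=('_1', '_2'))
              .set_index('index'))
df_1.loc[merged.index, 'value'] = merged['value_1'] - merged['value_2']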
I come from a SQL background and I use the following data processing step frequently:
Partition the table of data by one or more fields
For each partition, add a rownumber to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending
EX:
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
'data1' : [1,2,2,3,3],
'data2' : [1,10,2,3,30]})
df
data1 data2 key1
0 1 1 a
1 2 10 a
2 2 2 a
3 3 3 b
4 3 30 a
I'm looking for the pandas equivalent of this SQL window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
I've tried the following which I've gotten to work where there are no 'partitions':
def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_values(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(range(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_values(by=['data1', 'data2'], ascending=[True, False], inplace=True)).reset_index()

def nf(x):
    x['rn'] = list(range(len(x.index)))

df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I do this.
Ideally, there'd be a succinct way to replicate the window function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share the most idiomatic way to number rows like this in pandas?
You can also use sort_values(), groupby() and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
.groupby(['key1']) \
.cumcount() + 1
print(df)
yields:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
PS tested with pandas 0.18
Use the groupby rank function.
Here is a working example:
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information.
pandas.lib.fast_zip() can create an array of tuples from a list of arrays. You can use this function to create a tuple Series and then rank it:
values = {'key1' : ['a','a','a','b','a','b'],
'data1' : [1,2,2,3,3,3],
'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))
def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag * df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)
rank = df.groupby("key1").apply(lambda g: rank_multi_columns(g, ["data1", "-data2"]))
print(rank)
The result:
a 1
b 2
c 3
d 2
e 4
f 1
dtype: float64
I have a pandas dataframe (represented here as an Excel table):
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated to build a boolean mask and then set NaN values with loc, mask, or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove duplicate rows by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': [1, 2, 1, 3],
    'A': [1, 5, 7, 9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
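Both duplicated and drop_duplicates also accept a keep argument if you would rather keep the last occurrence of a repeated value, or flag every occurrence; for example, either of:
df.loc[df['B'].duplicated(keep='last'), 'B'] = np.nan   # blank earlier occurrences, keep the last
df = df.drop_duplicates(subset=['B'], keep=False)       # drop every row whose B value repeats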
I want to match two pandas Dataframes by the name of their columns.
import pandas as pd
df1 = pd.DataFrame([[0,2,1],[1,3,0],[0,4,0]], columns=['A', 'B', 'C'])
A B C
0 0 2 1
1 1 3 0
2 0 4 0
df2 = pd.DataFrame([[0,0,1],[1,5,0],[0,7,0]], columns=['A', 'B', 'D'])
A B D
0 0 0 1
1 1 5 0
2 0 7 0
If the names match, do nothing. (Keep the column of df2)
If a column is in Dataframe 1 but not in Dataframe 2, add the column in Dataframe 2 as a vector of zeros.
If a column is in Dataframe 2 but not in Dataframe 1, drop it.
The output should look like this:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
I know if I do:
df2 = df2[df1.columns]
I get:
KeyError: "['C'] not in index"
I could also add the vectors of zeros manually, but of course this is a toy example of a much longer dataset. Is there any smarter/pythonic way of doing this?
It appears that df2's columns should end up identical to df1's after this operation: columns that are in df1 but not in df2 are added, and columns only in df2 are dropped. We can simply reindex df2 to match df1's columns with fill_value=0 (a safe equivalent of df2 = df2[df1.columns] that fills newly added columns instead of raising):
df2 = df2.reindex(columns=df1.columns, fill_value=0)
df2:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
I am using Python Pandas for the following. I have three dataframes, df1, df2 and df3. Each has the same dimensions, index and column labels. I would like to create a fourth dataframe that takes elements from df1 or df2 depending on the values in df3:
df1 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df1
Out[67]:
A B
0 1.335314 1.888983
1 1.000579 -0.300271
2 -0.280658 0.448829
3 0.977791 0.804459
df2 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df2
Out[68]:
A B
0 0.689721 0.871065
1 0.699274 -1.061822
2 0.634909 1.044284
3 0.166307 -0.699048
df3 = pd.DataFrame({'A': [1, 0, 0, 1], 'B': [1, 0, 1, 0]})
df3
Out[69]:
A B
0 1 1
1 0 0
2 0 1
3 1 0
The new dataframe, df4, has the same index and column labels and takes an element from df1 if the corresponding value in df3 is 1. It takes an element from df2 if the corresponding value in df3 is a 0.
I need a solution that uses generic references (e.g. ix or iloc) rather than actual column labels and index values because my dataset has fifty columns and four hundred rows.
As your DataFrames happen to be numeric, and the selector matrix happens to be of indicator variables, you can do the following:
>>> pd.DataFrame(
...     df1.to_numpy() * df3.to_numpy() + df2.to_numpy() * (1 - df3.to_numpy()),
...     index=df1.index,
...     columns=df1.columns)
I tried it and it works. Strangely enough, @Yakym Pirozhenko's answer, which I think is superior, doesn't work for me.
df4 = df1.where(df3.astype(bool), df2) should do it.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(10, size = (4,2)))
df2 = pd.DataFrame(np.random.randint(10, size = (4,2)))
df3 = pd.DataFrame(np.random.randint(2, size = (4,2)))
df4 = df1.where(df3.astype(bool), df2)
print(df1, '\n')
print(df2, '\n')
print(df3, '\n')
print(df4, '\n')
Output:
0 1
0 0 3
1 8 8
2 7 4
3 1 2
0 1
0 7 9
1 4 4
2 0 5
3 7 2
0 1
0 0 0
1 1 0
2 1 1
3 1 0
0 1
0 7 9
1 8 4
2 7 4
3 1 2
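One caveat for the frames in the original question: where matches its condition to df1 by row and column labels, and df3 there was built with a default integer index while df1 and df2 use the string labels '0' to '3'. A minimal sketch (my own variable names) that rebuilds the mask with df1's labels before selecting:
# align the indicator frame to df1's index/columns, then pick from df1 where True, df2 where False
cond = pd.DataFrame(df3.to_numpy().astype(bool), index=df1.index, columns=df1.columns)
df4 = df1.where(cond, df2)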
Assume I have two dataframes of this format (call them df1 and df2):
+------------------------+------------------------+--------+
| user_id | business_id | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA | 4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA | 5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA | 3 |
+------------------------+------------------------+--------+
I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2. (i.e., if a user_id is in both df1 and df2, include those rows in the output dataframe)
I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.
Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.
My understanding is that this question is better answered over in this post.
But briefly, the answer to the OP with this method is simply:
s1 = pd.merge(df1, df2, how='inner', on=['user_id'])
Which gives s1 with 5 columns: user_id and the other two columns from each of df1 and df2.
If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():
In [80]: df1
Out[80]:
rating user_id
0 2 0x21abL
1 1 0x21abL
2 1 0xdafL
3 0 0x21abL
4 4 0x1d14L
5 2 0x21abL
6 1 0x21abL
7 0 0xdafL
8 4 0x1d14L
9 1 0x21abL
In [81]: df2
Out[81]:
rating user_id
0 2 0x1d14L
1 1 0xdbdcad7
2 1 0x21abL
3 3 0x21abL
4 3 0x21abL
5 1 0x5734a81e2
6 2 0x1d14L
7 0 0xdafL
8 0 0x1d14L
9 4 0x5734a81e2
In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)
In [83]: ind
Out[83]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 False
Name: user_id, dtype: bool
In [84]: df1[ind].append(df2[ind])
Out[84]:
rating user_id
0 2 0x21abL
2 1 0xdafL
3 0 0x21abL
4 4 0x1d14L
6 1 0x21abL
7 0 0xdafL
8 4 0x1d14L
0 2 0x1d14L
2 1 0x21abL
3 3 0x21abL
4 3 0x21abL
6 2 0x1d14L
7 0 0xdafL
8 0 0x1d14L
This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if
In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')
In fact, it won't give the expected output if their row indices are not equal.
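A variant that does not depend on df1 and df2 sharing row labels is to build a membership mask for each frame separately and then stack the filtered frames; a minimal sketch:
# filter each frame by whether its own user_id values appear in the other frame
common_1 = df1[df1['user_id'].isin(df2['user_id'])]
common_2 = df2[df2['user_id'].isin(df1['user_id'])]
result = pd.concat([common_1, common_2])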
In SQL, this problem could be solved by several methods:
select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)
or join and then unpivot (possible in SQL server)
select
df1.user_id,
c.rating
from df1
inner join df2 on df2.user_id = df1.user_id
outer apply (
select df1.rating union all
select df2.rating
) as c
The second one could be written in pandas with something like:
>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
user_id rating
0 3 20
0 3 30
This is a simple solution, assuming df1 and df2 are identically labeled (== compares element-wise, and dropna keeps only the rows where every column matches):
df1[df1 == df2].dropna()
You can do this for n DataFrames and k columns by using pd.Index.intersection:
import pandas as pd
from functools import reduce
from typing import Union
def dataframe_intersection(
    dataframes: list[pd.DataFrame], by: Union[list, str]
) -> list[pd.DataFrame]:
    set_index = [d.set_index(by) for d in dataframes]
    index_intersection = reduce(pd.Index.intersection, [d.index for d in set_index])
    intersected = [df.loc[index_intersection].reset_index() for df in set_index]
    return intersected
df1 = pd.DataFrame({"user_id":[1,2,3], "business_id": ['a', 'b', 'c'], "rating":[10, 15, 20]})
df2 = pd.DataFrame({"user_id":[3,4,5], "business_id": ['c', 'd', 'e'], "rating":[30, 35, 40]})
df3 = pd.DataFrame({"user_id":[3,3,3], "business_id": ['f', 'c', 'f'], "rating":[50, 70, 80]})
df_list = [df1, df2, df3]
This gives
>>> pd.concat(dataframe_intersection(df_list, by='user_id'))
user_id business_id rating
0 3 c 20
0 3 c 30
0 3 f 50
1 3 c 70
2 3 f 80
And
>>> pd.concat(dataframe_intersection(df_list, by=['user_id', 'business_id']))
user_id business_id rating
0 3 c 20
0 3 c 30
0 3 c 70