I'm working with a large dataset and have the following issue:
Let's say I'm measuring the input of a substance ("sub-input") into a medium ("id"). For each sub-input I have calculated the year in which it is going to reach the other side of the medium ("y-arrival"). Sometimes several sub-inputs arrive in the same year, and sometimes no substance arrives in a given year.
Example:
import pandas as pd
import numpy as np
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
in1 = [20,40,10,30,50,80,
       60,10,10,40,np.nan,np.nan,
       np.nan,120,30,70,60,90]
arr = [2002,2004,2004,2004,2005,np.nan,
       1991,1992,np.nan,1995,1995,np.nan,
       2001,2002,2004,2004,2005,np.nan]
dictex3 = {"id":ids, "year":year, "sub-input":in1, "y-arrival":arr}
dfex3 = pd.DataFrame(dictex3)
I have then calculated the sum of "sub-input" for each "y-arrival" using the following code:
dfex3["input_sum_tf"] = dfex3.groupby(["id","y-arrival"])["sub-input"].transform(sum)
print(dfex3)
id year sub-input y-arrival input_sum_tf
0 1 2000 20.0 2002.0 20.0
1 1 2001 40.0 2004.0 80.0
2 1 2002 10.0 2004.0 80.0
3 1 2003 30.0 2004.0 80.0
4 1 2004 50.0 2005.0 50.0
5 1 2005 80.0 NaN NaN
6 2 1990 60.0 1991.0 60.0
7 2 1991 10.0 1992.0 10.0
8 2 1992 10.0 NaN NaN
9 2 1993 40.0 1995.0 40.0
10 2 1994 NaN 1995.0 40.0
11 2 1995 NaN NaN NaN
12 3 2000 NaN 2001.0 0.0
13 3 2001 120.0 2002.0 120.0
14 3 2002 30.0 2004.0 100.0
15 3 2003 70.0 2004.0 100.0
16 3 2004 60.0 2005.0 60.0
17 3 2005 90.0 NaN NaN
Now, for each "id" the sum of the inputs that reach the destination at a "y-arrival" has been calculated.
The goal is to reorder these values so that for each id and each year, the sum of the sub-inputs that will arrive in that year can be shown. Example:
id = 1, year = 2000 --> no y-arrival = 2000 --> = NaN
id = 1, year = 2001 --> no y-arrival = 2001 --> = NaN
id = 1, year = 2002 --> y-arrival = 2002 has an input_sum_tf = 20 --> = 20
id = 1, year = 2003 --> no y-arrival = 2003 --> = NaN
id = 1, year = 2004 --> y-arrival = 2004 has an input_sum_tf = 80 --> = 80
The "input_sum_tf" is the sum of the substances that arrive in a given year. The value "80" for year 2004 is the sum of the sub-input from the years 2001, 2002, 2003 because all of these arrive in year 2004 (y-arrival = 2004).
The result ("input_sum") should look like this:
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
My approach:
I tried solving this by using the pandas merge function on two columns, but the result isn't quite right. So far my code only works for the first 5 rows.
dfex3['input_sum'] = dfex3.merge(dfex3, left_on=['id','y-arrival'],
                                 right_on=['id','year'],
                                 how='right')['input_sum_tf_x']
dfex3["input_sum"]
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 80.0
6 80.0
7 50.0
8 NaN
9 60.0
10 10.0
11 NaN
12 NaN
13 40.0
14 40.0
15 NaN
16 0.0
17 120.0
Any help would be much appreciated!
The issue is that your code is merging 'year' onto 'y-arrival', so it makes multiple matches when you only want one. E.g. row 4, where year=2004, matches three times where y-arrival=2004 (rows 1-3), hence the duplicates of 80 in output rows 4-6.
Use groupby to get the last row for each id/y-arrival combo (also looks like you don't want matches where 'input_sum_tf' is zero):
df_last = dfex3.groupby(['id', 'y-arrival']).last().reset_index()
df_last = df_last[df_last['input_sum_tf'] != 0]
Then merge:
dfex3.merge(df_last,
            left_on=['id', 'year'],
            right_on=['id', 'y-arrival'],
            how='left')['input_sum_tf_y']
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
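A merge-free alternative (a sketch of the same idea: key the sums by (id, y-arrival), then look them up by (id, year); the variable names are mine):

# one summed value per (id, y-arrival) pair; all-NaN groups sum to 0, drop them
sums = dfex3.groupby(["id", "y-arrival"])["sub-input"].sum()
sums = sums[sums != 0]
# look each (id, year) pair up against that index (year cast to float to match y-arrival)
pairs = pd.MultiIndex.from_arrays([dfex3["id"], dfex3["year"].astype(float)])
dfex3["input_sum"] = sums.reindex(pairs).to_numpy()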
Related
I have one DataFrame, df, with the four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want to map values from df.IDP1Number to df.IDP2Number, matching IDP1 to IDP2: if a value in IDP2 also appears in IDP1, replace IDP2Number with the corresponding IDP1Number; otherwise leave IDP2Number alone.
The error message that appears reads: "Reindexing only valid with uniquely valued Index objects".
The DataFrame below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do it:
# filter the data and create a mapping dict
maps = df.loc[df['IDP1'].notna()].set_index('IDP1')['IDP1Number'].to_dict()
# create the new column using an if/else condition per row
df['IDP2Number'] = df.apply(
    lambda x: maps.get(x['IDP2'], None)
    if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps)
    else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
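For reference, a vectorized sketch of the same logic (reusing the maps dict from above): map IDP2 through the dict, and fall back to the existing IDP2Number wherever no mapping exists.

mapped = df['IDP2'].map(maps)                        # NaN where IDP2 has no IDP1 match
df['IDP2Number'] = mapped.combine_first(df['IDP2Number'])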
For example, if I have a data frame like this:
20 40 60 80 100 120 140
1 1 1 1 NaN NaN NaN NaN
2 1 1 1 1 1 NaN NaN
3 1 1 1 1 NaN NaN NaN
4 1 1 NaN NaN 1 1 1
How do I find the last filled column in each run of a row and then count the column distance elapsed back from it, so that I get something like this?
20 40 60 80 100 120 140
1 40 20 0 NaN NaN NaN NaN
2 80 60 40 20 0 NaN NaN
3 60 40 20 0 NaN NaN NaN
4 20 0 NaN NaN 40 20 0
You can do this row by row: reverse each row, count consecutive filled values cumulatively, then scale by the column step.
# a slightly involved procedure: per row, build a within-run counter
def fill_values(row):
    row = row[::-1]                 # reverse so runs are counted from their end
    a = row == 1
    b = a.cumsum()
    # subtract the cumulative count at the start of each run, leaving a
    # within-run counter; reverse back and scale by the column step (20)
    return (b - b.mask(a).ffill().fillna(0).astype(int))[::-1] * 20

df.apply(fill_values, axis=1).replace(0, np.nan) - 20
Out:
20 40 60 80 100 120 140
1 40.0 20.0 0.0 NaN NaN NaN NaN
2 80.0 60.0 40.0 20.0 0.0 NaN NaN
3 60.0 40.0 20.0 0.0 NaN NaN NaN
4 20.0 0.0 NaN NaN 40.0 20.0 0.0
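If the columns are not evenly spaced by 20, here is a more explicit sketch that reads the distances straight off the (numeric) column labels, one right-to-left pass per row:

import numpy as np
import pandas as pd

def dist_to_run_end(row):
    # distance from each filled cell to the last cell of its run,
    # measured in the units of the column labels
    cols = row.index.to_numpy(dtype=float)
    filled = row.notna().to_numpy()
    out = np.full(len(row), np.nan)
    end = np.nan                      # column label at the end of the current run
    for i in range(len(row) - 1, -1, -1):
        if filled[i]:
            if np.isnan(end):
                end = cols[i]
            out[i] = end - cols[i]
        else:
            end = np.nan
    return pd.Series(out, index=row.index)

df.apply(dist_to_run_end, axis=1)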
I am attempting to transpose and merge two pandas dataframes: one contains accounts, the segment in which they received their deposit, the deposit amount, and the day they received it; the other has the accounts and withdrawal information. The issue is that, for indexing purposes, the segment information from one dataframe should line up with that of the other, regardless of whether there was a withdrawal or not.
Notes:
There will always be an account for every person
There will not always be a withdrawal for every person
The accounts and data for the withdrawal dataframe only exist if a withdrawal occurs
Account Dataframe Code
from pandas import DataFrame

accounts = DataFrame({'person':[1,1,1,1,1,2,2,2,2,2],
                      'segment':[1,2,3,4,5,1,2,3,4,5],
                      'date_received':[10,20,30,40,50,11,21,31,41,51],
                      'amount_received':[1,2,3,4,5,6,7,8,9,10]})
accounts = accounts.pivot_table(index=["person"], columns=["segment"])
Account Dataframe
amount_received date_received
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
Withdrawal Dataframe Code
withdrawals = DataFrame({'person':[1,1,1,2,2],
                         'withdrawal_segment':[1,1,5,2,3],
                         'withdraw_date':[1,2,3,4,5],
                         'withdraw_amount':[10,20,30,40,50]})
withdrawals = withdrawals.reset_index().pivot_table(index=['index', 'person'],
                                                    columns=['withdrawal_segment'])
Since segments must be unique for a person, each column may hold a given segment number only once while still keeping all of the data, which is why this dataframe looks so different.
Withdrawal Dataframe
withdraw_date withdraw_amount
withdrawal_segment 1 2 3 5 1 2 3 5
index person
0 1 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2.0 NaN NaN NaN 20.0 NaN NaN NaN
2 1 NaN NaN NaN 3.0 NaN NaN NaN 30.0
3 2 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
4 2 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
Merge
merge = accounts.merge(withdrawals, on='person', how='left')
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 5 1 2 3 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN 10.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN 20.0 NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN 3.0 NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN 40.0 NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN 50.0 NaN
The problem with the merged dataframe is that segments from the withdrawal dataframe aren't lined up with the accounts segments.
The desired dataframe should look something like:
amount_received date_received withdraw_date withdraw_amount
segment 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 2.0 NaN NaN NaN NaN 20.0 NaN NaN NaN NaN
1 1 2 3 4 5 10 20 30 40 50 NaN NaN NaN NaN 3.0 NaN NaN NaN NaN 30.0
2 6 7 8 9 10 11 21 31 41 51 NaN 4.0 NaN NaN NaN NaN 40.0 NaN NaN NaN
2 6 7 8 9 10 11 21 31 41 51 NaN NaN 5.0 NaN NaN NaN NaN 50.0 NaN NaN
My problem is that I can't seem to merge across both person and segments. I've thought about inserting a row and column, but because I don't know which segments are and aren't going to have a withdrawal this gets difficult. Is it possible to merge the dataframes so that they line up across both people and segments? Thanks!
Method 1, using reindex:
withdrawals = withdrawals.reindex(
    pd.MultiIndex.from_product([withdrawals.columns.levels[0],
                                accounts.columns.levels[1]]),
    axis=1)
merge = accounts.merge(withdrawals, on='person', how='left')
merge
Out[79]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
Method 2, using stack and unstack:
merge = accounts.merge(withdrawals, on='person', how='left')
merge.stack(dropna=False).unstack()
Out[82]:
amount_received date_received \
segment 1 2 3 4 5 1 2 3 4 5
person
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
1 1 2 3 4 5 10 20 30 40 50
2 6 7 8 9 10 11 21 31 41 51
2 6 7 8 9 10 11 21 31 41 51
withdraw_amount withdraw_date
segment 1 2 3 4 5 1 2 3 4 5
person
1 10.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 30.0 NaN NaN NaN NaN 3.0
2 NaN 40.0 NaN NaN NaN NaN 4.0 NaN NaN NaN
2 NaN NaN 50.0 NaN NaN NaN NaN 5.0 NaN NaN
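The payoff of lining the segments up is that per-segment arithmetic across the column blocks now broadcasts correctly. A quick sketch (my own example, not from the question): net amount per person and segment after withdrawals:

aligned = merge.stack(dropna=False).unstack()
net = aligned['amount_received'].sub(aligned['withdraw_amount'].fillna(0))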
I have a dataframe like this:
index = [0,1,2,3,4,5]
s = pd.Series([1,1,1,2,2,2], index=index)
t = pd.Series([2007,2008,2011,2006,2007,2009], index=index)
f = pd.Series([2,4,6,8,10,12], index=index)
pp = pd.DataFrame(np.c_[s,t,f], columns=["group","year","amount"])
pp
group year amount
0 1 2007 2
1 1 2008 4
2 1 2011 6
3 2 2006 8
4 2 2007 10
5 2 2009 12
I want to add lines in between missing years for each group. My desire dataframe is like this:
group year amount
0 1.0 2007 2.0
1 1.0 2008 4.0
2 1.0 2009 NaN
3 1.0 2010 NaN
4 1.0 2011 6.0
5 2.0 2006 8.0
6 2.0 2007 10.0
7 2.0 2008 NaN
8 2.0 2009 12.0
Is there any way to do it for a large dataframe?
First change year to datetime:
df.year = pd.to_datetime(df.year, format='%Y')
Then use set_index with groupby and resample:
df.set_index('year').groupby('group').amount.resample('Y').mean().reset_index()
group year amount
0 1 2007-12-31 2.0
1 1 2008-12-31 4.0
2 1 2009-12-31 NaN
3 1 2010-12-31 NaN
4 1 2011-12-31 6.0
5 2 2006-12-31 8.0
6 2 2007-12-31 10.0
7 2 2008-12-31 NaN
8 2 2009-12-31 12.0
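If you want plain integer years back instead of year-end timestamps, one extra line on the result does it (a sketch, storing the resampled frame in res; df here is the question's pp):

res = df.set_index('year').groupby('group').amount.resample('Y').mean().reset_index()
res['year'] = res['year'].dt.year    # 2007-12-31 -> 2007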
The data is simplified as follows:
mon site year data1 data2
1 57598 2001 58 1383
2 57598 2001 75 549
1 57598 2002 118 1337
2 57598 2002 162 2213
1 50136 2000 -282 134
2 50136 2000 -242 0
1 50136 2001 -126 102
1 50844 2000 152 411
2 50844 2000 70 117
1 50844 2002 -74 44
2 50844 2002 -173 83
I want to extract data1 and data2 and change them to the following form:
this is data1:
2000 2000 2001 2001 2002 2002
1 2 1 2 1 2
50136 -282 -242 -126 NA NA NA
50844 152 70 NA NA -74 -173
57598 58 75 NA NA 118 162
and data2 will be saved as a new file with the same form as data1.
I want to use pandas.groupby for this, but the following code raises an error:
df['data1'].groupby(df['year'],df['mon'],df['site'])
Is this easy to do using groupby?
I think the best first try is set_index with unstack:
df1 = df.set_index(['year','mon','site'])['data1'].unstack(level=[0,1]).sort_index(axis=1)
print (df1)
year 2000 2001 2002
mon 1 2 1 2 1 2
site
50136 -282.0 -242.0 -126.0 NaN NaN NaN
50844 152.0 70.0 NaN NaN -74.0 -173.0
57598 NaN NaN 58.0 75.0 118.0 162.0
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
then use another solution with groupby or pivot_table.
You can use groupby with unstack:
df1 = df.groupby(['year','mon','site'])['data1'].mean().unstack(level=[0,1])
print (df1)
year 2000 2001 2002
mon 1 2 1 2 1 2
site
50136 -282.0 -242.0 -126.0 NaN NaN NaN
50844 152.0 70.0 NaN NaN -74.0 -173.0
57598 NaN NaN 58.0 75.0 118.0 162.0
Another possible solution is pivot_table with the default aggfunc, which is np.mean, but it can be changed to another function like aggfunc='sum':
print (df.pivot_table(index='site', columns=['year','mon'], values='data1', aggfunc=np.mean))
year 2000 2001 2002
mon 1 2 1 2 1 2
site
50136 -282.0 -242.0 -126.0 NaN NaN NaN
50844 152.0 70.0 NaN NaN -74.0 -173.0
57598 NaN NaN 58.0 75.0 118.0 162.0
Last, use DataFrame.to_csv to write the file to csv:
df1.to_csv('file_out.csv')
To get the df into the shape you need:
result = df.groupby(['site','mon','year'])['data1'].mean().unstack().unstack()
Out[310]:
year 2000 2001 2002
mon 1 2 1 2 1 2
site
50136 -282.0 -242.0 -126.0 NaN NaN NaN
50844 152.0 70.0 NaN NaN -74.0 -173.0
57598 NaN NaN 58.0 75.0 118.0 162.0
To save it to csv:
df.groupby(['site','mon','year'])['data1'].mean().unstack().unstack().to_csv('data1.csv')
df.groupby(['site','mon','year'])['data2'].mean().unstack().unstack().to_csv('data2.csv')
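Since the identical pipeline runs for both value columns, a small loop keeps it in one place (a sketch):

for col in ['data1', 'data2']:
    (df.groupby(['site', 'mon', 'year'])[col]
       .mean().unstack().unstack()
       .to_csv(col + '.csv'))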