I am trying to emulate a loan with monthly payments in pandas.
The credit column contains the amount of money I borrowed from the bank.
The debit column contains the amount of money I paid back to the bank.
The total column should contain the amount left to pay to the bank (basically the running result of subtracting the debit column from the credit column).
I was able to write the following code:
import pandas as pd
# This function returns the subtraction result of credit and debit
def f(x):
    return x['credit'] - x['debit']

df = pd.DataFrame({'credit': [1000, 0, 0, 500],
                   'debit': [0, 100, 200, 0]})

for i in df:
    df['total'] = df.apply(f, axis=1)
print(df)
It works (it subtracts the debit from the credit), but it doesn't keep a running total in the total column. Please see the actual and expected results below.
Actual result:
credit debit total
0 1000 0 1000
1 0 100 -100
2 0 200 -200
3 500 0 500
Expected result:
credit debit total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
You could use cumsum:
df['total'] = (df.credit - df.debit).cumsum()
print(df)
Output
credit debit total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
You don't need apply here.
import pandas as pd
df = pd.DataFrame({'credit': [1000, 0, 0, 500],
                   'debit': [0, 100, 200, 0]})
df['Total'] = (df['credit'] - df['debit']).cumsum()
print(df)
Output
credit debit Total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
The reason apply wasn't working is that apply executes each row independently rather than keeping a running total from one subtraction to the next. Chaining cumsum() onto the subtraction keeps the running total and gives the desired result.
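For intuition, here is a rough sketch of what the cumulative sum does, written as an explicit loop over the rows (just an illustration, not the recommended approach):

import pandas as pd

df = pd.DataFrame({'credit': [1000, 0, 0, 500],
                   'debit': [0, 100, 200, 0]})

# cumsum() written out by hand: the balance is carried forward row by row
running = 0
totals = []
for credit, debit in zip(df['credit'], df['debit']):
    running += credit - debit
    totals.append(running)

df['total'] = totals
print(df)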
Related
I have a df
   Side  ref_price  price  price_diff
0     0        100    110
1     1        110    100
I want to fill price_diff based on the Side values:
if side == 0:
    df['price_diff'] = df['ref_price'] * df['price']
elif side == 1:
    df['price_diff'] = df['ref_price'] * df['price'] * -1
I tried
df.loc[df.Side == 0, 'price_diff'] = df['price'] * df['ref_price']
but it's not working and throws errors.
You could use "Side" column as a condition in numpy.where:
df['price_diff'] = np.where(df['Side'].astype(bool), df['ref_price']*df['price']*-1, df['ref_price']*df['price'])
or in this specific case, use "Side" column values as power of -1:
df['price_diff'] = df['ref_price']*df['price']*(-1)**df['Side']
Output:
Side ref_price price price_diff
0 0 100 110 11000
1 1 110 100 -11000
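For reference, a minimal self-contained sketch of the first approach, rebuilding the DataFrame from the values shown in the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Side': [0, 1],
                   'ref_price': [100, 110],
                   'price': [110, 100]})

# Side == 1 -> negate the product, Side == 0 -> keep it
df['price_diff'] = np.where(df['Side'].astype(bool),
                            df['ref_price'] * df['price'] * -1,
                            df['ref_price'] * df['price'])
print(df)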
You can use np.where:
df['price_diff'] = np.where(df['Side'] == 0,
                            df['ref_price'] * df['price'],
                            df['ref_price'] * df['price'] * -1)
print(df)
# Output
   Side  ref_price  price  price_diff
0     0        100    110       11000
1     1        110    100      -11000
I'm learning pandas and have a query about aggregate functions. Apologies for what might be a very basic question for experts on this forum :).
Here's a sample of my dataset:
EmpID Age_Range Salary
0 321 20, 35 34000
1 561 20, 35 24000
2 789 50, 65 34000
The above dataset is df. I'm saving average salary info per employee age range into a separate dataframe (df_age). I was able to successfully apply mean() on the Salary column to get the average salary per age range.
So basically what I want is the count of employees for each Age_Range.
df_age['EmpCount'] = df.groupby('Age_Range')['EmpID'].count() doesn't work and just leaves NaN in the new column.
Additionally, when I used the transform function
df_age['EmpCount'] = df.groupby('Age_Range')['EmpID'].transform(count)
it returns values, but the same value (37) across all three age ranges, which is not correct. There are 100 entries in my dataset in total.
Desired output for df_age:
    Age_Range  SalAvg  EmpCount
0    (20, 35]   50000        27
1    (35, 50]   37000        11
2    (50, 65]   65000        30
Thanks!
If I understood your question correctly, you want a new column with the count of employees for each age range. You can use an aggregation to get your answer, as follows:
# One row per Age_Range with the employee count
df_age = df.set_index(['Age_Range', 'EmpID']).groupby(level=0).size().reset_index(name='count_of_employees')
# Average salary per Age_Range, aligned positionally with the sorted groups above
df_age['Ave_Salary'] = df.groupby('Age_Range')['Salary'].mean().values
You can use size or len in a transform, just like you did with count:
import pandas as pd

# Dummy data
df = pd.DataFrame({"sample": ["sample1", "sample2", "sample2", "sample3", "sample3", "sample3"]})
# Select the column explicitly; attribute access clashes with the GroupBy.sample method
df["number_of_samples"] = df.groupby("sample")["sample"].transform("size")
df["number_of_samples_again"] = df.groupby("sample")["sample"].transform(len)
Output:
sample number_of_samples number_of_samples_again
0 sample1 1 1
1 sample2 2 2
2 sample2 2 2
3 sample3 3 3
4 sample3 3 3
5 sample3 3 3
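Applied to the question's frame, the same pattern would look like this (column names taken from the post); note that this puts a per-row count on df itself rather than on the aggregated df_age:

df['EmpCount'] = df.groupby('Age_Range')['EmpID'].transform('size')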
I have found a solution to this, but it's not neat / efficient:
df_age1 = df.groupby('Age_Range')['Salary'].mean()
df_age1 = df_age1.reset_index()
df_age1.rename(columns={'Salary':'SalAvg'}, inplace=True)
df_age2 = df.groupby('Age_Range')['EmpID'].count()
df_age2 = df_age2.reset_index()
df_age2.rename(columns={'EmpID':'EmpCount'}, inplace=True)
Then finally,
df_age = pd.merge(df_age1, df_age2, on='Age_Range')
The above gives me what I need, but across three dataframes; I'll obviously be ignoring df_age1 and df_age2, but I'm still on the lookout for an efficient answer!
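For what it's worth, a more compact variant is a single groupby with named aggregation (a sketch assuming the same column names; named aggregation needs pandas 0.25+):

import pandas as pd

# Toy data with the question's column names
df = pd.DataFrame({'EmpID': [321, 561, 789],
                   'Age_Range': ['20, 35', '20, 35', '50, 65'],
                   'Salary': [34000, 24000, 34000]})

df_age = (df.groupby('Age_Range')
            .agg(SalAvg=('Salary', 'mean'),
                 EmpCount=('EmpID', 'count'))
            .reset_index())
print(df_age)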
I basically picked up Python last week, and although I am currently learning the basics, I've been tasked with building a small program in Python at work. I would appreciate some help with this.
I would like to create a SUMIFS function similar to the Excel version. My data contains a cash flow date (CFDATE), portfolio name (PORTFOLIO) and cash flow amount (CF). I want to sum the CF based on which portfolio it belongs to and the date on which it falls.
I have managed to achieve this using the code below, however I am struggling to output the results as an array/table where the header row comprises all the portfolios, the first column is the list of dates (duplicates removed), and the CF values are summed for each (CFDATE, PORTFOLIO) combination.
Example of the desired output:
PORTFOLIO-> 'A' 'B' 'C'
CFDATE
'30/09/2017' 300 600 300
'31/10/2017' 300 0 600
Code used so far:
from pandas import Series, DataFrame
from numpy import matrix
import numpy as np
import pandas as pd

df = DataFrame(pd.read_csv("...\Test.csv"))

portfolioMapping = sorted(list(set(df.PORTFOLIO)))
cfDateMapping = list(set(df.CFDATE))

for i in range(0, len(portfolioMapping)):
    dfVar = df.CF * np.where(df.PORTFOLIO == portfolioMapping[i], 1, 0)
    for j in range(0, len(cfDateMapping)):
        dfVar1 = df.CF/df.CF * np.where(df.CFDATE == cfDateMapping[j], 1, 0)
        print([portfolioMapping[i], [cfDateMapping[j]], sum(dfVar*dfVar1)])
The data is basically in this form:
PORTFOLIO CFDATE CF
A 30/09/2017 300
A 31/10/2017 300
C 31/10/2017 300
B 30/09/2017 300
B 30/09/2017 300
C 30/09/2017 300
C 31/10/2017 300
C 31/10/2017 300
I would really appreciate some help on the matter.
You need groupby + sum + unstack:
df = df.groupby(['CFDATE', 'PORTFOLIO'])['CF'].sum().unstack(fill_value=0)
print (df)
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900
Or pivot_table:
df = df.pivot_table(index='CFDATE',
                    columns='PORTFOLIO',
                    values='CF',
                    aggfunc=sum,
                    fill_value=0)
print (df)
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900
You can simply do that with Pandas's pivot_table():
df.pivot_table(index='CFDATE', columns=['PORTFOLIO'], aggfunc=sum, fill_value=0)
The result is the following:
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900
I think the best in your case would be to use a groupby method like the following:
df.groupby(['PORTFOLIO', 'CFDATE']).sum()
                       CF
PORTFOLIO CFDATE
A         30/09/2017  300
          31/10/2017  300
B         30/09/2017  600
C         30/09/2017  300
          31/10/2017  900
Basically, once you have grouped your dataframe df, you can perform various methods on it (like sum(), mean(), min(), max(), etc.).
Also, you can store your grouped dataframe in an object like the following:
grouped = df.groupby(['PORTFOLIO', 'CFDATE'])
It makes it more flexible to perform different calculations afterward:
grouped.sum()
grouped.mean()
grouped.count()
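And if you want the wide SUMIFS-style layout from the question, the grouped sums can be reshaped with unstack, the same idea as the groupby + unstack answer above:

grouped['CF'].sum().unstack('PORTFOLIO', fill_value=0)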
I have many dataframes with individual counts (e.g. df_boston below). Each row defines a data point that is uniquely identified by its marker and its point. I have a summary dataframe (df_inventory_master) that has custom bins (the points above map to the Begin-End coordinates in the master). I want to add a column to the master for each individual city that sums the counts from that city. An example is shown.
Two quirks are that the bins in the master frame can overlap (the count should be added to both) and that some counts may not fall in any master bin (those counts should be ignored).
I can do this in pure Python but since the data are in dataframes it would be helpful and likely faster to do the manipulations in pandas. I'd appreciate any tips here!
This is the master frame:
>>> df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
... 'Begin': [100, 300, 500, 100],
... 'End': [200, 600, 900, 250]})
>>> df_inventory_master
Begin End Marker
0 100 200 1
1 300 600 1
2 500 900 1
3 100 250 2
This is data for one city:
>>> df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
... 'Point': [140, 180, 250, 500],
... 'Count': [14, 600, 1000, 700]})
>>> df_boston
Count Marker Point
0 14 1 140
1 600 1 180
2 1000 1 250
3 700 1 500
This is the desired output.
- Note that the count of 700 (Marker 1, Point 500) falls in 2 master bins and is counted for both.
- Note that the count of 1000 (Marker 1, Point 250) does not fall in a master bin and is not counted.
- Note that nothing maps to Marker 2 because df_boston does not have any Marker 2 data.
>>> desired_frame
Begin End Marker boston
0 100 200 1 614
1 300 600 1 700
2 500 900 1 700
3 100 250 2 0
What I've tried: I looked at the pd.cut() function, but with the nature of the bins overlapping, and in some cases absent, this does not seem to fit. I can add the column filled with 0 values to get part of the way there but then will need to find a way to sum the data in each frame, using bins defined in the master.
>>> df_inventory_master['boston'] = pd.Series([0 for x in range(len(df_inventory_master.index))], index=df_inventory_master.index)
>>> df_inventory_master
Begin End Marker boston
0 100 200 1 0
1 300 600 1 0
2 500 900 1 0
3 100 250 2 0
Here is how I approached it: basically a SQL-style left join using the pandas merge operation, then apply() across the row axis with a lambda to decide whether each record falls inside the band, and finally a groupby and sum:
df_merged = df_inventory_master.merge(df_boston, on=['Marker'], how='left')
# logical overwrite of Count: zero it out when the point falls outside the band
df_merged['Count'] = df_merged.apply(lambda x: x['Count'] if x['Begin'] <= x['Point'] <= x['End'] else 0, axis=1)
df_agged = df_merged[['Begin', 'End', 'Marker', 'Count']].groupby(['Begin', 'End', 'Marker']).sum()
df_agged_resorted = df_agged.sort_index(level=['Marker', 'Begin', 'End'])
df_agged_resorted = df_agged_resorted.astype(int)
df_agged_resorted.columns = ['boston']  # rename the count column to boston
print(df_agged_resorted)
And the result is
boston
Begin End Marker
100 200 1 614
300 600 1 700
500 900 1 700
100 250 2 0
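A sketch of the same left-join idea without the row-wise apply, using Series.between for the bin test (frames rebuilt from the question):

import pandas as pd

df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
                                    'Begin': [100, 300, 500, 100],
                                    'End': [200, 600, 900, 250]})
df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
                          'Point': [140, 180, 250, 500],
                          'Count': [14, 600, 1000, 700]})

merged = df_inventory_master.merge(df_boston, on='Marker', how='left')
# Zero out counts whose point falls outside the bin (NaN from the left join also becomes 0)
in_bin = merged['Point'].between(merged['Begin'], merged['End'])
merged['Count'] = merged['Count'].where(in_bin, 0)

boston = (merged.groupby(['Begin', 'End', 'Marker'])['Count']
                .sum()
                .astype(int)
                .rename('boston')
                .reset_index())
print(boston)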
I am concatenating two Pandas dataframes as below.
import numpy as np
import pandas as pd

part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.random.randn(5)})
part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})
concatenated = pd.concat([part1, part2], axis=0)
amount id
0 -0.458653 100
1 2.172348 200
2 0.072494 300
3 -0.253939 400
4 -0.061866 500
0 -1.187505 700
1 -0.810784 100
2 0.321881 800
3 -1.935284 500
4 -1.351507 300
How can I limit the operation so that a row in part2 is only included in concatenated if the row id does not already appear in part1? In a way, I want to treat the id column like a set.
Is it possible to do this during concat() or is this more a post-processing step?
Desired output for this example would be:
concatenated_desired
amount id
0 -0.458653 100
1 2.172348 200
2 0.072494 300
3 -0.253939 400
4 -0.061866 500
0 -1.187505 700
2 0.321881 800
Call drop_duplicates() after concat():
part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.arange(5)})
part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})

concatenated = pd.concat([part1, part2], axis=0)
print(concatenated.drop_duplicates(subset="id"))
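Since drop_duplicates keeps the first occurrence by default, rows from part1 take precedence over part2 for a shared id; you can make that explicit with the keep argument:

concatenated.drop_duplicates(subset="id", keep="first")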
Calculate the ids not in part1:
In [28]:
diff = part2.loc[~part2['id'].isin(part1['id'])]
diff
Out[28]:
amount id
0 -2.184038 700
2 -0.070749 800
now concat
In [29]:
concatenated = pd.concat([part1, diff], axis=0)
concatenated
Out[29]:
amount id
0 -2.240625 100
1 -0.348184 200
2 0.281050 300
3 0.082460 400
4 -0.045416 500
0 -2.184038 700
2 -0.070749 800
You can also put this in a one-liner:
concatenated = pd.concat([part1, part2.loc[~part2['id'].isin(part1['id'])]], axis=0)
If you have an id column, use it as the index; performing manipulations with a real index will make things easier. Here you can use combine_first, which does what you are looking for:
part1 = part1.set_index('id')
part2 = part2.set_index('id')
part1.combine_first(part2)
Out[38]:
amount
id
100 1.685685
200 -1.895151
300 -0.804097
400 0.119948
500 -0.434062
700 0.215255
800 -0.031562
If you really don't want that index, reset it afterwards:
part1.combine_first(part2).reset_index()
Out[39]:
id amount
0 100 1.685685
1 200 -1.895151
2 300 -0.804097
3 400 0.119948
4 500 -0.434062
5 700 0.215255
6 800 -0.031562