Can someone help me create a SUMIFS function equivalent in Python?

I basically picked up Python last week, and although I am still learning the basics, I've been tasked with building a small program in Python at work, and would appreciate some help with it.
I would like to create a SUMIFS function similar to the Excel version. My data contains a cash flow date (CFDATE), portfolio name (PORTFOLIO) and cash flow amount (CF). I want to sum the CF based on which portfolio it belongs to and the date on which it falls.
I have managed to achieve this using the code below, but I am struggling to output my results as an array/table where the header row comprises all the portfolios, the first column is a list of the dates (duplicates removed), and the CF values are grouped by each (CFDATE, PORTFOLIO) combination.
e.g. of desired output:
PORTFOLIO        A    B    C
CFDATE
30/09/2017     300  600  300
31/10/2017     300    0  900
code used so far:
from pandas import DataFrame
import numpy as np
import pandas as pd

df = DataFrame(pd.read_csv(r"...\Test.csv"))

portfolioMapping = sorted(set(df.PORTFOLIO))
cfDateMapping = list(set(df.CFDATE))

for i in range(len(portfolioMapping)):
    # zero out CF amounts that do not belong to the current portfolio
    dfVar = df.CF * np.where(df.PORTFOLIO == portfolioMapping[i], 1, 0)
    for j in range(len(cfDateMapping)):
        # 0/1 mask for rows on the current cash flow date
        dfVar1 = df.CF / df.CF * np.where(df.CFDATE == cfDateMapping[j], 1, 0)
        print([portfolioMapping[i], [cfDateMapping[j]], sum(dfVar * dfVar1)])
The data is basically in this form:
PORTFOLIO CFDATE CF
A 30/09/2017 300
A 31/10/2017 300
C 31/10/2017 300
B 30/09/2017 300
B 30/09/2017 300
C 30/09/2017 300
C 31/10/2017 300
C 31/10/2017 300
I would really appreciate some help on the matter.

You need groupby + sum + unstack:
df = df.groupby(['CFDATE', 'PORTFOLIO'])['CF'].sum().unstack(fill_value=0)
print(df)
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900
Or pivot_table:
df = df.pivot_table(index='CFDATE',
                    columns='PORTFOLIO',
                    values='CF',
                    aggfunc='sum',
                    fill_value=0)
print(df)
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900
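One caveat with both versions: CFDATE comes out of read_csv as plain strings, so the index sorts lexically rather than chronologically. A minimal sketch that parses the dates first, starting again from the raw frame and assuming the DD/MM/YYYY format shown in the question:
df['CFDATE'] = pd.to_datetime(df['CFDATE'], dayfirst=True)

# same groupby as above, but now the index sorts in true date order
out = df.groupby(['CFDATE', 'PORTFOLIO'])['CF'].sum().unstack(fill_value=0)
print(out)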

You can simply do that with pandas' pivot_table():
df.pivot_table(index='CFDATE', columns='PORTFOLIO', values='CF', aggfunc='sum', fill_value=0)
The result is the following:
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900

I think the best in your case would be to use a groupby method like the following:
df.groupby(['PORTFOLIO', 'CFDATE']).sum()
                       CF
PORTFOLIO CFDATE
A         30/09/2017  300
          31/10/2017  300
B         30/09/2017  600
C         30/09/2017  300
          31/10/2017  900
Basically, once you have grouped your dataframe df, you can then perform various methods on it (like sum(), mean(), min(), max(), etc.).
Also, you can store your grouped dataframe in an object like the following:
grouped = df.groupby(['PORTFOLIO', 'CFDATE'])
It makes it more flexible to perform different calculations afterward:
grouped.sum()
grouped.mean()
grouped.count()
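For example, agg lets you compute several of those statistics in a single pass over the groups; a small sketch on the same data:
grouped = df.groupby(['PORTFOLIO', 'CFDATE'])

# sum, mean and count of CF for every (PORTFOLIO, CFDATE) pair at once
summary = grouped['CF'].agg(['sum', 'mean', 'count'])
print(summary)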

Related

How can I build a model that will predict duplicate records in your dataset?

What are the algorithms that will predict duplicates in your dataset?
For example -
Name Marks
A 100
B 90
C 80
A 100
I need something like this -
Name Marks S/D
A 100 Single
B 90 Single
C 80 Single
A 100 Duplicate
I'm looking for some algorithms that can help in this case.
IIUC, you need this:
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C','A'],'Marks': [100, 90, 80, 100]})
df['res'] = df.duplicated().map({False:"Single", True:"Duplicated"})
Output:
>>> df
Name Marks res
0 A 100 Single
1 B 90 Single
2 C 80 Single
3 A 100 Duplicated
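If you instead want every occurrence of a repeated row flagged (including the first one), duplicated() accepts keep=False; a small sketch:
# keep=False marks all members of a duplicate group, not just the later ones
df['res_all'] = df.duplicated(keep=False).map({False: "Single", True: "Duplicated"})
print(df)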

Use dask to calculate moving average

I am trying to calculate the moving average of a very large data set. The number of rows is approximately 30M. To illustrate, using pandas:
df = pd.DataFrame({'cust_id':['a', 'a', 'a', 'b', 'b'], 'sales': [100, 200, 300, 400, 500]})
df['mov_avg'] = df.groupby("cust_id")["sales"].apply(lambda x: x.ewm(alpha=0.5, adjust=False).mean())
Here I am using pandas to calculate the moving average. With the above it takes around 20 minutes on the 30M-row dataset. Is there a way to leverage Dask here?
You can use dask.delayed for your calculation. In the example below, a standard Python function containing the pandas moving-average code is turned into a dask delayed function using the @delayed decorator.
import pandas as pd
from dask import delayed

@delayed
def mov_average(x):
    # pandas exponentially weighted moving average, computed per customer
    x['mov_avg'] = x.groupby("cust_id")["sales"].apply(
        lambda x: x.ewm(alpha=0.5, adjust=False).mean())
    return x

df = pd.DataFrame({'cust_id': ['a', 'a', 'a', 'b', 'b'],
                   'sales': [100, 200, 300, 400, 500]})

# plain pandas version for comparison
df['mov_avg'] = df.groupby("cust_id")["sales"].apply(
    lambda x: x.ewm(alpha=0.5, adjust=False).mean())

# delayed version: build the task, then compute it
df_1 = mov_average(df).compute()
Output
df
Out[22]:
cust_id sales mov_avg
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
df_1
Out[23]:
cust_id sales mov_avg
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
Alternatively, you could try converting your data frame to (or reading your file directly into) a dask dataframe. The visualization of the scheduler tasks shows how the calculations are parallelized. So, if your data frame is large enough you might get a reduction in your computation time. You could also try optimizing the number of dataframe partitions.
from dask import dataframe

ddf = dataframe.from_pandas(df, npartitions=3)
ddf['emv'] = ddf.groupby('cust_id')['sales'].apply(
    lambda x: x.ewm(alpha=0.5, adjust=False).mean()).compute().sort_index()
ddf.visualize()
ddf.compute()
cust_id sales emv
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
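Since the real data is around 30M rows, it usually makes more sense to read it straight into a dask dataframe instead of building it in pandas first. A rough sketch, where the file name, blocksize and partition count are illustrative values, not taken from the question:
import dask.dataframe as dd

# read the large file directly into a dask dataframe; blocksize controls
# how the CSV is split into partitions
ddf = dd.read_csv('sales.csv', blocksize='64MB')

# or tune the number of partitions on an existing dask dataframe
ddf = ddf.repartition(npartitions=8)
print(ddf.npartitions)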

Adding new records to a dataframe for variables extracted from the same dataframe

I am trying to consolidate variables in a data set.
I have something like this:
import pandas as pd
import numpy as np
data = np.array([[160, 90, 'skirt_trousers', 'tight_comfy'],
                 [180, 100, 'trousers_skirt', 'long_short']])
dford = pd.DataFrame(data, columns=['height', 'size', 'order', 'preference'])
and am trying to get it to something like this:
dataForTarget = np.array([['o1', 160, 90, 'skirt', 'tight'],
                          ['o2', 180, 100, 'trousers', 'long'],
                          ['o1', 160, 90, 'trousers', 'comfy'],
                          ['o2', 180, 100, 'skirt', 'short']])
Targetdford = pd.DataFrame(dataForTarget,
                           columns=['orderID', 'height', 'size', 'order', 'preference'])
As a first step, I have extracted as much data as possible from the strings,
then cleaned them:
variables = dford.columns.tolist()
variables.append('ord1')

secondord = dford.order.str.extractall(r'_(.*)')
secondord = secondord.unstack()
secondord.columns = secondord.columns.droplevel()
dford1 = dford.join(secondord)
dford1.columns = variables
dford1.order = dford1.order.str.replace(r'(_.*)', '')

variables = dford1.columns.tolist()
variables.append('pref1')

secondpref = dford.preference.str.extractall(r'_(.*)')
secondpref = secondpref.unstack()
secondpref.columns = secondpref.columns.droplevel()
dford2 = dford1.join(secondpref)
dford2.columns = variables
dford2.order = dford2.order.str.replace(r'(_.*)', '')
Which gets me here:
At this stage I am at a loss on how to add this new information as observations (in rows).
The best I could come up with follows, but it fails because the index contains duplicate entries. Even if it did not fail, I suspect it would only be useful if I were trying to fill in missing values.
So I got nowhere.
dford3 = dford2.rename(columns={'ord1': 'order', 'pref1': 'preference'})
dford3 = dford3.stack()
dford3 = dford3.unstack()
Use Series.str.split with DataFrame.stack and concat to build a new DataFrame, then add it to the original with DataFrame.join:
df = pd.concat([dford.pop('order').str.split('_', expand=True).stack().rename('order'),
                dford.pop('preference').str.split('_', expand=True).stack().rename('preference')],
               axis=1)

dford = (dford.join(df.reset_index(level=1))
              .rename_axis('orderID')
              .reset_index()
              .sort_values(['level_1', 'orderID'])
              .drop('level_1', axis=1)
              .reset_index(drop=True)
              .assign(orderID=lambda x: 'o' + x['orderID'].add(1).astype('str')))
print(dford)
orderID height size order preference
0 o1 160 90 skirt tight
1 o2 180 100 trousers long
2 o1 160 90 trousers comfy
3 o2 180 100 skirt short
Use DataFrame.apply + Series.str.split.
Concatenate the resulting dataframes with pd.concat and use Series.map to create the height and size Series:
df = pd.concat([df.T for df in dford[['order', 'preference']].apply(
        lambda x: x.str.split('_', expand=True), axis=1)]
    ).rename_axis(index='OrderID').reset_index()
df['height'] = df['OrderID'].map(dford['height'])
df['size'] = df['OrderID'].map(dford['size'])
print(df)
OrderID order preference height size
0 0 skirt tight 160 90
1 1 trousers comfy 180 100
2 0 trousers long 160 90
3 1 skirt short 180 100
Finally, add one to the OrderID column and prepend the character 'o':
df['OrderID']='o'+df['OrderID'].add(1).astype('str')
print(df)
OrderID order preference height size
0 o1 skirt tight 160 90
1 o2 trousers comfy 180 100
2 o1 trousers long 160 90
3 o2 skirt short 180 100
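For reference, on pandas 1.3+ the same reshaping can also be written with a multi-column explode; a sketch against the original dford (row order differs from the target frame, but the content matches):
out = (dford.assign(order=dford['order'].str.split('_'),
                    preference=dford['preference'].str.split('_'))
            .explode(['order', 'preference'])
            .rename_axis('orderID')
            .reset_index()
            .assign(orderID=lambda x: 'o' + (x['orderID'] + 1).astype(str)))
print(out)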

Generating Column Matrix for all columns in pandas

I have a dataframe consisting of 6 columns. What would be the fastest way to generate a matrix which does the following:
Step 1) col1*col1a, col2*col2a, col3*col3a, col4*col4a
Step 2) col_new = (col1*col1a - col2*col2a) / (col1a - col2a)
Using a for loop is one of the options, but what would be a quicker way to go about this?
import pandas as pd

df = pd.DataFrame()
df['col1'] = [100, 200, 300, 400, 500]
df['col1a'] = [6, 71, 8, 90, 10]
df['col2'] = [600, 700, 800, 1900, 100]
df['col2a'] = [6, 17, 8, 9, 10]
df['col3'] = [100, 220, 300, 440, 500]
df['col3a'] = [1, 22, 3, 44, 5]

df['1x2'] = (df['col1'] * df['col1a'] - df['col2'] * df['col2a']) / (df['col1a'] - df['col2a'])
I need to have column combinations of 1x3,1x4,1x5,2x3,2x4 and so on...
Here is how I will approach it:
def new_col(df, col1, col2):
    """
    Add a new column, modifying the dataframe in place.

    col1 : int
        column counter in the first column name
    col2 : int
        column counter in the second column name
    """
    nr = (
        df.loc[:, f"col{col1}"] * df.loc[:, f"col{col1}a"]
        - df.loc[:, f"col{col2}"] * df.loc[:, f"col{col2}a"]
    )
    dr = df.loc[:, f"col{col1}a"] - df.loc[:, f"col{col2}a"]
    df.loc[:, f"col{col1}X{col2}"] = nr / dr
I will call this function with the desired column combinations, for example:
new_col(df, 1, 2)
The call can be issued from a loop, as in the sketch below.
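For example, a small driver loop over the column numbers present in the example frame (1, 2 and 3 here):
from itertools import combinations

# call new_col for every pair of primary columns: 1x2, 1x3, 2x3
for a, b in combinations([1, 2, 3], 2):
    new_col(df, a, b)
print(df)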
So apparently my first answer only matched the original question; here is an answer for the updated question:
from itertools import combinations
from functools import partial

primary_columns = df.columns[~df.columns.str.endswith("a")]
combs = combinations(primary_columns, 2)

def column_comparison(first, second, df):
    return (df[first] * df[first + "a"] - df[second] * df[second + "a"]) / (df[first + "a"] - df[second + "a"])

dct = {'{first}X{second}'.format(first=comb[0].lstrip("col"), second=comb[1].lstrip("col")):
       partial(column_comparison, comb[0], comb[1]) for comb in combs}
So we created a dictionary that contains the name of the desired columns and the right function.
Now we can leverage assign
df.assign(**dct)
to obtain
col1 col1a col2 col2a col3 col3a 1X2 1X3 2X3
0 100 6 600 6 100 1 -inf 100.000000 700.000000
1 200 71 700 17 220 22 42.592593 191.020408 -1412.000000
2 300 8 800 8 300 3 -inf 300.000000 1100.000000
3 400 90 1900 9 440 44 233.333333 361.739130 64.571429
4 500 10 100 10 500 5 inf 500.000000 -300.000000
In a previous version I was using a lambda here, but this was not working - check here for an explanation. I only realized this after finding the solution using partial.
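For what it's worth, the pitfall being alluded to is presumably Python's late binding of closure variables: every lambda created in a comprehension ends up seeing the last value of the loop variables, while partial freezes its arguments at creation time. A minimal, self-contained sketch (the names here are purely illustrative):
from functools import partial

def show(first, second):
    return f"{first} vs {second}"

pairs = [("col1", "col2"), ("col1", "col3")]

# late binding: both lambdas close over the loop variables and see the last pair
late = {f"{a}X{b}": (lambda: show(a, b)) for a, b in pairs}
print(late["col1Xcol2"]())   # 'col1 vs col3' -- not what was intended

# partial binds the arguments immediately, one pair per entry
early = {f"{a}X{b}": partial(show, a, b) for a, b in pairs}
print(early["col1Xcol2"]())  # 'col1 vs col2'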

Using apply function to Pandas dataframe [duplicate]

This question already has an answer here:
Running sum in pandas (without loop)
(1 answer)
Closed 4 years ago.
I am trying to do an emulation of a loan with monthly payments in pandas.
The credit column contains the amount of money I borrowed from the bank.
The debit column contains the amount of money I paid back to the bank.
The total column should contain the amount which is left to pay to the bank (basically, the result of subtracting the debit column from the credit column).
I was able to write the following code:
import pandas as pd

# This function returns the subtraction result of credit and debit
def f(x):
    return (x['credit'] - x['debit'])

df = pd.DataFrame({'credit': [1000, 0, 0, 500],
                   'debit': [0, 100, 200, 0]})

for i in df:
    df['total'] = df.apply(f, axis=1)

print(df)
It works (it subtracts the debit from the credit), but it doesn't keep a running total in the total column. Please see the Actual and Expected results below.
Actual result:
credit debit total
0 1000 0 1000
1 0 100 -100
2 0 200 -200
3 500 0 500
Expected result:
credit debit total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
You could use cumsum:
df['total'] = (df.credit - df.debit).cumsum()
print(df)
Output
credit debit total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
You don't need apply here.
import pandas as pd
df = pd.DataFrame({'credit': [1000, 0, 0, 500],
'debit': [0, 100, 200, 0]})
df['Total'] = (df['credit'] - df['debit']).cumsum()
print(df)
Output
credit debit Total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
The reason apply wasn't working is that it executes on each row independently rather than keeping a running total after each subtraction. Chaining cumsum() onto the subtraction keeps the running total and gives the desired result.
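If you prefer to keep the row-wise function for readability, you can still combine apply with cumsum; a small sketch on the same frame:
import pandas as pd

def f(x):
    # net movement for one row
    return x['credit'] - x['debit']

df = pd.DataFrame({'credit': [1000, 0, 0, 500],
                   'debit': [0, 100, 200, 0]})

# apply gives the per-row difference; cumsum turns it into a running balance
df['total'] = df.apply(f, axis=1).cumsum()
print(df)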
