I am trying to calculate the moving average of a very large data set. The number of rows is approx 30M. To illustrate using pandas as follows
df = pd.DataFrame({'cust_id':['a', 'a', 'a', 'b', 'b'], 'sales': [100, 200, 300, 400, 500]})
df['mov_avg'] = df.groupby("cust_id")["sales"].apply(lambda x: x.ewm(alpha=0.5, adjust=False).mean())
Here I am using pandas to calculate the moving average. Using above it takes around 20 minutes to calculate on the 30M dataset. Is there a way to leverage DASK here?
You can use Dask.delayed for your calculation. In the example below, a standard python function which contains the pandas moving average command is turned into a dask function using a #delayed decorator.
import pandas as pd
from dask import delayed
#delayed
def mov_average(x):
x['mov_avg'] = x.groupby("cust_id")["sales"].apply(
lambda x: x.ewm(alpha=0.5, adjust=False).mean())
return x
df = pd.DataFrame({'cust_id':['a', 'a', 'a', 'b', 'b'],
'sales': [100, 200, 300, 400, 500]})
df['mov_avg'] = df.groupby("cust_id")["sales"].apply(
lambda x: x.ewm(alpha=0.5, adjust=False).mean())
df_1 = mov_average(df).compute()
Output
df
Out[22]:
cust_id sales mov_avg
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
df_1
Out[23]:
cust_id sales mov_avg
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
Alternatively, you could try converting (or reading your file) into a dask data frame. The visualization of the scheduler tasks shows the parallelization of the calculations. So, if your data frame is large enough you might get a reduction in your computation time. You could also try optimizing the number of data frame partitions.
from dask import dataframe
ddf = dataframe.from_pandas(df, npartitions=3)
ddf['emv'] = ddf.groupby('cust_id')['sales'].apply(lambda x: x.ewm(alpha=0.5, adjust=False).mean()).compute().sort_index()
ddf.visualize()
ddf.compute()
cust_id sales emv
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
Related
I try to add a new column "energy_class" to a dataframe "df_energy" which it contains the string "high" if the "consumption_energy" value > 400, "medium" if the "consumption_energy" value is between 200 and 400, and "low" if the "consumption_energy" value is under 200.
I try to use np.where from numpy, but I see that numpy.where(condition[, x, y]) treat only two condition not 3 like in my case.
Any idea to help me please?
Thank you in advance
Try this:
Using the setup from #Maxu
col = 'consumption_energy'
conditions = [ df2[col] >= 400, (df2[col] < 400) & (df2[col]> 200), df2[col] <= 200 ]
choices = [ "high", 'medium', 'low' ]
df2["energy_class"] = np.select(conditions, choices, default=np.nan)
consumption_energy energy_class
0 459 high
1 416 high
2 186 low
3 250 medium
4 411 high
5 210 medium
6 343 medium
7 328 medium
8 208 medium
9 223 medium
You can use a ternary:
np.where(consumption_energy > 400, 'high',
(np.where(consumption_energy < 200, 'low', 'medium')))
I like to keep the code clean. That's why I prefer np.vectorize for such tasks.
def conditions(x):
if x > 400: return "High"
elif x > 200: return "Medium"
else: return "Low"
func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])
Then just add numpy array as a column in your dataframe using:
df_energy["energy_class"] = energy_class
The advantage in this approach is that if you wish to add more complicated constraints to a column, it can be done easily.
Hope it helps.
I would use the cut() method here, which will generate very efficient and memory-saving category dtype:
In [124]: df
Out[124]:
consumption_energy
0 459
1 416
2 186
3 250
4 411
5 210
6 343
7 328
8 208
9 223
In [125]: pd.cut(df.consumption_energy,
[0, 200, 400, np.inf],
labels=['low','medium','high']
)
Out[125]:
0 high
1 high
2 low
3 medium
4 high
5 medium
6 medium
7 medium
8 medium
9 medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]
WARNING: Be careful with NaNs
Always be careful that if your data has missing values np.where may be tricky to use and may give you the wrong result inadvertently.
Consider this situation:
df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high',
(np.where(df.consumption_energy < 200, 'low', 'medium')))
# if we do not use this second line, then
# if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan
Alternatively, you can use one-more nested np.where for medium versus nan which would be ugly.
IMHO best way to go is pd.cut. It deals with NaNs and easy to use.
Examples:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])
# manually add another line for nans
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age <20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat'] = np.nan
# multiple nested where
df['age_cat3'] = np.where(df.age > 60, 'old',
(np.where(df.age <20, 'child',
np.where(df.age.isnull(), np.nan, 'medium'))))
# outptus
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
age age_cat age_cat2 age_cat3
0 22.0 medium medium medium
1 38.0 medium medium medium
2 26.0 medium medium medium
3 35.0 medium medium medium
4 35.0 medium medium medium
5 NaN NaN medium nan
6 54.0 medium medium medium
Let's start by creating a dataframe with 1000000 random numbers between 0 and 1000 to be used as test
df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})
[Out]:
consumption_energy
0 683
1 893
2 545
3 13
4 768
5 385
6 644
7 551
8 572
9 822
A bit of a description of the dataframe
print(df.energy.describe())
[Out]:
consumption_energy
count 1000000.000000
mean 499.648532
std 288.600140
min 0.000000
25% 250.000000
50% 499.000000
75% 750.000000
max 999.000000
There are various ways to achieve that, such as:
Using numpy.where
df_energy['energy_class'] = np.where(df_energy['consumption_energy'] > 400, 'high', np.where(df_energy['consumption_energy'] > 200, 'medium', 'low'))
Using numpy.select
df_energy['energy_class'] = np.select([df_energy['consumption_energy'] > 400, df_energy['consumption_energy'] > 200], ['high', 'medium'], default='low')
Using numpy.vectorize
df_energy['energy_class'] = np.vectorize(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))(df_energy['consumption_energy'])
Using pandas.cut
df_energy['energy_class'] = pd.cut(df_energy['consumption_energy'], bins=[0, 200, 400, 1000], labels=['low', 'medium', 'high'])
Using Python's built in modules
def energy_class(x):
if x > 400:
return 'high'
elif x > 200:
return 'medium'
else:
return 'low'
df_energy['energy_class'] = df_energy['consumption_energy'].apply(energy_class)
Using a lambda function
df_energy['energy_class'] = df_energy['consumption_energy'].apply(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))
Time Comparison
From all the tests that I've done, by measuring time with time.perf_counter() (for other ways to measure time of execution see this), pandas.cut was the fastest approach.
method time
0 np.where() 0.124139
1 np.select() 0.155879
2 numpy.vectorize() 0.452789
3 pandas.cut() 0.046143
4 Python's built-in functions 0.138021
5 lambda function 0.19081
Notes:
For the difference between pandas.cut and pandas.qcut see this: What is the difference between pandas.qcut and pandas.cut?
Try this : Even if consumption_energy contains nulls don't worry about it.
def egy_class(x):
'''
This function assigns classes as per the energy consumed.
'''
return ('high' if x>400 else
'low' if x<200 else 'medium')
chk = df_energy.consumption_energy.notnull()
df_energy['energy_class'] = df_energy.consumption_energy[chk].apply(egy_class)
I second using np.vectorize. It is much faster than np.where and also cleaner code wise. You can definitely tell the speed up with larger data sets. You can use a dictionary format for your conditionals as well as the output of those conditions.
# Vectorizing with numpy
row_dic = {'Condition1':'high',
'Condition2':'medium',
'Condition3':'low',
'Condition4':'lowest'}
def Conditions(dfSeries_element,dictionary):
'''
dfSeries_element is an element from df_series
dictionary: is the dictionary of your conditions with their outcome
'''
if dfSeries_element in dictionary.keys():
return dictionary[dfSeries]
def VectorizeConditions():
func = np.vectorize(Conditions)
result_vector = func(df['Series'],row_dic)
df['new_Series'] = result_vector
# running the below function will apply multi conditional formatting to your df
VectorizeConditions()
myassign["assign3"]=np.where(myassign["points"]>90,"genius",(np.where((myassign["points"]>50) & (myassign["points"]<90),"good","bad"))
when you wanna use only "where" method but with multiple condition. we can add more condition by adding more (np.where) by the same method like we did above. and again the last two will be one you want.
I need to apply a function on df, I used a pandarallel to parallelize the process, however, I have an issue here, I need to give func_do an N rows each call so that I can utilize a vectorization on that function.
The following will call func_do on each row. Any idea how to make a single call for each batch and keep the parallelization process.
def fun_do(value_col):
return do(value_col)
df['processed_col'] = df.parallel_apply(lambda row: fun_do(row['col']), axis=1)
A possible solution is to create virtual groups of N rows:
import pandas as pd
from pandarallel import pandarallel
# Setup MRE
pandarallel.initialize(progress_bar=False)
df = pd.DataFrame({'col1': np.linspace(0, 100, 11)})
def fun_do(sr):
return sr**2
N = 4 # size of chunk
df['col2'] = (df.groupby(pd.RangeIndex(len(df)) // N)
.parallel_apply(lambda x: fun_do(x['col1']))
.droplevel(0)) # <- remove virtual group index
Output:
>>> df
col1 col2
0 0.0 0.0
1 10.0 100.0
2 20.0 400.0
3 30.0 900.0
4 40.0 1600.0
5 50.0 2500.0
6 60.0 3600.0
7 70.0 4900.0
8 80.0 6400.0
9 90.0 8100.0
10 100.0 10000.0
Note: I don't know why groupby(...)['col'].parallel_apply(fun_do) doesn't work. It seems parallel_apply is not available with SeriesGroupBy.
This is the first time I use pandarallel, usually I used multiprocessing module
I can impute the mean and most frequent value using dask-ml like so, this works fine:
mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns = ['Weight', 'Age', 'Height'])
df.iloc[:, [0,1]] = mean_imputer.fit_transform(df.iloc[:,[0,1]])
df.iloc[:, [2]] = most_frequent_imputer.fit_transform(df.iloc[:,[2]])
print(df)
Weight Age Height
0 100.0 2.0 5.0
1 85.0 4.5 5.0
2 70.0 7.0 5.0
But what if I have 100 million rows of data it seems that dask would do two loops when it could have done only one, is it possible to run both imputers simultaneously and/or in parallel instead of sequentially? What would be a sample code to achieve that?
You can used dask.delayed as suggested in docs and Dask Toutorial to parallelise the computation if entities are independent of one another.
Your code would look like:
from dask.distributed import Client
client = Client(n_workers=4)
from dask import delayed
import numpy as np
import pandas as pd
from dask_ml import impute
mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
def fit_transform_mi(d):
return mean_imputer.fit_transform(d)
def fit_transform_mfi(d):
return most_frequent_imputer.fit_transform(d)
def setdf(a,b,df):
df.iloc[:, [0,1]]=a
df.iloc[:, [2]]=b
return df
data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns = ['Weight', 'Age', 'Height'])
a = delayed(fit_transform_mi)(df.iloc[:,[0,1]])
b = delayed(fit_transform_mfi)(df.iloc[:,[2]])
c = delayed(setdf)(a,b,df)
df= c.compute()
print(df)
client.close()
The c object is a lazy Delayed object. This object holds everything we need to compute the final result, including references to all of the functions that are required and their inputs and relationship to one-another.
Dask is useful for speeding computation by parallel processing and when the data does not fit in memory. In the example below, 300M rows of data contained in ten files are imputed using Dask. The graph of the process shows that: 1. The mean and most frequent imputers are run in parallel; 2. All ten files are processed in parallel as well.
Set-up
To prepare a large amount of data, the three rows of data in your question are replicated, to form a data frame with 30M rows. The data frame is saved in ten different files to yield a total of 300M rows with the same stats as in your question.
import numpy as np
import pandas as pd
N = 10000000
weight = np.array([100, np.nan, 70]*N)
age = np.array([2, np.nan, 7]*N)
height = np.array([5, np.nan, 5]*N)
df = pd.DataFrame({'Weight': weight, 'Age': age, 'Height': height})
# Save ten large data frames to disk
for i in range(10):
df.to_parquet(f'./df_to_impute_{i}.parquet', compression='gzip',
index=False)
Dask Imputation
import graphviz
import dask
import dask.dataframe as dd
from dask_ml.impute import SimpleImputer
# Read all files for imputation in a dask data frame from a specific directory
df = dd.read_parquet('./df_to_impute_*.parquet')
# Set up the imputers and columns
mean_imputer = SimpleImputer(strategy='mean')
mostfreq_imputer = SimpleImputer(strategy='most_frequent')
imputers = [mean_imputer, mostfreq_imputer]
mean_cols = ['Weight', 'Age']
freq_cols = ['Height']
columns = [mean_cols, freq_cols]
# Create a new data frame with imputed values, then visualize the computation.
df_list = []
for imputer, col in zip(imputers, columns):
df_list.append(imputer.fit_transform(df.loc[:, col]))
imputed_df = dd.concat(df_list, axis=1)
imputed_df.visualize(filename='imputed.svg', rankdir='LR')
# Save the new data frame to disk
imputed_df.to_parquet('imputed_df.parquet', compression='gzip')
Output
imputed_df.head()
Weight Age Height
0 100.0 2.0 5.0
1 85.0 4.5 5.0
2 70.0 7.0 5.0
3 100.0 2.0 5.0
4 85.0 4.5 5.0
# Check the summary statistics make sense - 300M rows and stats as expected
imputed_df.describe().compute()
Weight Age Height
count 3.000000e+08 3.000000e+08 300000000.0
mean 8.500000e+01 4.500000e+00 5.0
std 1.224745e+01 2.041241e+00 0.0
min 7.000000e+01 2.000000e+00 5.0
25% 7.000000e+01 2.000000e+00 5.0
50% 8.500000e+01 4.500000e+00 5.0
75% 1.000000e+02 7.000000e+00 5.0
max 1.000000e+02 7.000000e+00 5.0
I try to add a new column "energy_class" to a dataframe "df_energy" which it contains the string "high" if the "consumption_energy" value > 400, "medium" if the "consumption_energy" value is between 200 and 400, and "low" if the "consumption_energy" value is under 200.
I try to use np.where from numpy, but I see that numpy.where(condition[, x, y]) treat only two condition not 3 like in my case.
Any idea to help me please?
Thank you in advance
Try this:
Using the setup from #Maxu
col = 'consumption_energy'
conditions = [ df2[col] >= 400, (df2[col] < 400) & (df2[col]> 200), df2[col] <= 200 ]
choices = [ "high", 'medium', 'low' ]
df2["energy_class"] = np.select(conditions, choices, default=np.nan)
consumption_energy energy_class
0 459 high
1 416 high
2 186 low
3 250 medium
4 411 high
5 210 medium
6 343 medium
7 328 medium
8 208 medium
9 223 medium
You can use a ternary:
np.where(consumption_energy > 400, 'high',
(np.where(consumption_energy < 200, 'low', 'medium')))
I like to keep the code clean. That's why I prefer np.vectorize for such tasks.
def conditions(x):
if x > 400: return "High"
elif x > 200: return "Medium"
else: return "Low"
func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])
Then just add numpy array as a column in your dataframe using:
df_energy["energy_class"] = energy_class
The advantage in this approach is that if you wish to add more complicated constraints to a column, it can be done easily.
Hope it helps.
I would use the cut() method here, which will generate very efficient and memory-saving category dtype:
In [124]: df
Out[124]:
consumption_energy
0 459
1 416
2 186
3 250
4 411
5 210
6 343
7 328
8 208
9 223
In [125]: pd.cut(df.consumption_energy,
[0, 200, 400, np.inf],
labels=['low','medium','high']
)
Out[125]:
0 high
1 high
2 low
3 medium
4 high
5 medium
6 medium
7 medium
8 medium
9 medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]
WARNING: Be careful with NaNs
Always be careful that if your data has missing values np.where may be tricky to use and may give you the wrong result inadvertently.
Consider this situation:
df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high',
(np.where(df.consumption_energy < 200, 'low', 'medium')))
# if we do not use this second line, then
# if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan
Alternatively, you can use one-more nested np.where for medium versus nan which would be ugly.
IMHO best way to go is pd.cut. It deals with NaNs and easy to use.
Examples:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])
# manually add another line for nans
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age <20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat'] = np.nan
# multiple nested where
df['age_cat3'] = np.where(df.age > 60, 'old',
(np.where(df.age <20, 'child',
np.where(df.age.isnull(), np.nan, 'medium'))))
# outptus
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
age age_cat age_cat2 age_cat3
0 22.0 medium medium medium
1 38.0 medium medium medium
2 26.0 medium medium medium
3 35.0 medium medium medium
4 35.0 medium medium medium
5 NaN NaN medium nan
6 54.0 medium medium medium
Let's start by creating a dataframe with 1000000 random numbers between 0 and 1000 to be used as test
df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})
[Out]:
consumption_energy
0 683
1 893
2 545
3 13
4 768
5 385
6 644
7 551
8 572
9 822
A bit of a description of the dataframe
print(df.energy.describe())
[Out]:
consumption_energy
count 1000000.000000
mean 499.648532
std 288.600140
min 0.000000
25% 250.000000
50% 499.000000
75% 750.000000
max 999.000000
There are various ways to achieve that, such as:
Using numpy.where
df_energy['energy_class'] = np.where(df_energy['consumption_energy'] > 400, 'high', np.where(df_energy['consumption_energy'] > 200, 'medium', 'low'))
Using numpy.select
df_energy['energy_class'] = np.select([df_energy['consumption_energy'] > 400, df_energy['consumption_energy'] > 200], ['high', 'medium'], default='low')
Using numpy.vectorize
df_energy['energy_class'] = np.vectorize(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))(df_energy['consumption_energy'])
Using pandas.cut
df_energy['energy_class'] = pd.cut(df_energy['consumption_energy'], bins=[0, 200, 400, 1000], labels=['low', 'medium', 'high'])
Using Python's built in modules
def energy_class(x):
if x > 400:
return 'high'
elif x > 200:
return 'medium'
else:
return 'low'
df_energy['energy_class'] = df_energy['consumption_energy'].apply(energy_class)
Using a lambda function
df_energy['energy_class'] = df_energy['consumption_energy'].apply(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))
Time Comparison
From all the tests that I've done, by measuring time with time.perf_counter() (for other ways to measure time of execution see this), pandas.cut was the fastest approach.
method time
0 np.where() 0.124139
1 np.select() 0.155879
2 numpy.vectorize() 0.452789
3 pandas.cut() 0.046143
4 Python's built-in functions 0.138021
5 lambda function 0.19081
Notes:
For the difference between pandas.cut and pandas.qcut see this: What is the difference between pandas.qcut and pandas.cut?
Try this : Even if consumption_energy contains nulls don't worry about it.
def egy_class(x):
'''
This function assigns classes as per the energy consumed.
'''
return ('high' if x>400 else
'low' if x<200 else 'medium')
chk = df_energy.consumption_energy.notnull()
df_energy['energy_class'] = df_energy.consumption_energy[chk].apply(egy_class)
I second using np.vectorize. It is much faster than np.where and also cleaner code wise. You can definitely tell the speed up with larger data sets. You can use a dictionary format for your conditionals as well as the output of those conditions.
# Vectorizing with numpy
row_dic = {'Condition1':'high',
'Condition2':'medium',
'Condition3':'low',
'Condition4':'lowest'}
def Conditions(dfSeries_element,dictionary):
'''
dfSeries_element is an element from df_series
dictionary: is the dictionary of your conditions with their outcome
'''
if dfSeries_element in dictionary.keys():
return dictionary[dfSeries]
def VectorizeConditions():
func = np.vectorize(Conditions)
result_vector = func(df['Series'],row_dic)
df['new_Series'] = result_vector
# running the below function will apply multi conditional formatting to your df
VectorizeConditions()
myassign["assign3"]=np.where(myassign["points"]>90,"genius",(np.where((myassign["points"]>50) & (myassign["points"]<90),"good","bad"))
when you wanna use only "where" method but with multiple condition. we can add more condition by adding more (np.where) by the same method like we did above. and again the last two will be one you want.
I basically picked up Python last week, and although I am currently learning the basics, I've been tasked with building a small program in python at work. And would appreciate some help on this.
I would like to create a SUMIFS function similar to the excel version. My data contains a cash flow date (CFDATE), portfolio name (PORTFOLIO) and cash flow amount (CF). I want tot sum the CF based on which portfolio it belongs to and based on the date on which it falls.
I have managed to achieve this using the code below, however I am struggling to output my results as an array/table where the header row comprises of all the portfolios, and the initial column is a list of the dates (duplicates removed) and the CF are grouped according to each combination of (CFDATE,PORTFOLIO).
e.g of desired output:
PORTFOLIO-> 'A' 'B' 'C'
CFDATE
'30/09/2017' 300 600 300
'31/10/2017' 300 0 600
code used so far:
from pandas import Series,DataFrame
from numpy import matrix
import numpy as np
import pandas as pd
df = DataFrame(pd.read_csv("...\Test.csv"))
portfolioMapping = sorted(list(set(df.PORTFOLIO)))
cfDateMapping = list(set(df.CFDATE))
for i in range(0,len(portfolioMapping)):
dfVar = df.CF * np.where(df.PORTFOLIO == portfolioMapping[i] , 1, 0)
for j in range(0,len(cfDateMapping)):
dfVar1 = df.CF/df.CF * np.where(df.CFDATE == cfDateMapping[j] , 1, 0)
print([portfolioMapping[i],[cfDateMapping[j]],sum(dfVar*dfVar1)])
The data is basically in this form:
PORTFOLIO CFDATE CF
A 30/09/2017 300
A 31/10/2017 300
C 31/10/2017 300
B 30/09/2017 300
B 30/09/2017 300
C 30/09/2017 300
C 31/10/2017 300
C 31/10/2017 300
I would really appreciate some help on the matter.
You need groupby + sum + unstack:
df = df.groupby(['CFDATE', 'PORTFOLIO'])['CF'].sum().unstack(fill_value=0)
print (df)
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900
Or pivot_table:
df = df.pivot_table(index='CFDATE',
columns='PORTFOLIO',
values='CF',
aggfunc=sum,
fill_value=0)
print (df)
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900
You can simply do that with Pandas's pivot_table():
df.pivot_table(index='CFDATE', columns=['PORTFOLIO'], aggfunc=sum, fill_value=0)
The result is the following:
PORTFOLIO A B C
CFDATE
30/09/2017 300 600 300
31/10/2017 300 0 900
I think the best in your case would be to use a groupby method like the following:
df.groupby(['PORTFOLIO', 'CFDATE']).sum()
CF
PORTFOLIO CFDATE
A 30/09/2017 600
31/10/2017 300
B 30/09/2017 600
C 30/09/2017 300
31/10/2017 900
Basically, once you have grouped your dataframe df, you can then perform various method on it (like sum(), mean(), min(), max(), etc)
Also, you cans store you grouped dataframe in an object like the following:
grouped = df.groupby(['PORTFOLIO', 'CFDATE'])
It makes it more flexible to perform different calculations afterward:
grouped.sum()
grouped.mean()
grouped.count()