Combine three Pandas pivot tables with different column numbers

I have three pandas pivot_tables:
# First Pivot Table
# Total Sales of Product 1
Month        1      2      3
Branch
1       100.00  80.00  50.00
2       200.00  50.00  60.00
3       250.00  90.00  65.00
# Second Pivot Table
# Total Commission of Product 1
Month       1     2     3
Branch
1       10.00  8.00  5.00
2       20.00  5.00  6.00
3       25.00  9.00  6.50
# Third Pivot Table
# Sales Count (general, including other products)
Month   2  3
Branch
1       5  5
2       1  6
3       3  6
When I try to combine the three pivoted dataframes, I get the error: "cannot convert float NaN to integer".
The command I tried:
dfResult = dfTotalSales.append(dfCommission).append(dfSalesCount)
Besides the error above, how can I combine them and return a new column with the average of the net sales total?
Formula: (Total Sales - Commission) / Sales Count
Thanks in advance.

Is your data like the following: in branch 1 the first month's sales are 100, and in branch 2 the first month's sales are 80? If yes, then I think the sales count is not present for branch 1; i.e., according to my understanding, the sales count is given only for branches 2 and 3. So while combining, pandas tries to fill the missing values with NaN, but the sales-count data is integer, so it cannot convert the float NaN to int. Make everything int or float, then try again.
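For the rest of the question, here is a minimal sketch of one way to do both steps, assuming the dataframes are named as in the question and align on the shared Branch index and Month columns shown above (pd.concat replaces the deprecated DataFrame.append; dfNetAvg is a name introduced here):
import pandas as pd

# stack the three tables with labels; alignment fills missing cells with NaN,
# so the integer counts are cast to float up front to avoid the NaN-to-int error
dfResult = pd.concat(
    [dfTotalSales, dfCommission, dfSalesCount.astype(float)],
    keys=['sales', 'commission', 'count'],
)

# net sales average per cell: (Total Sales - Commission) / Sales Count
# columns missing from the count table come out as NaN after alignment
dfNetAvg = (dfTotalSales - dfCommission) / dfSalesCount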

Related

Calculate aggregate value of column row by row

My apologies for the vague title; it's complicated to put what I want into words.
I'm trying to build a filled line chart with the date on the x-axis and the total transaction value over time on the y-axis.
My data
The object is a pandas dataframe.
date       | symbol | type | qty | total
-----------------------------------------
2020-09-10 | ABC    | Buy  |   5 |    10
2020-10-18 | ABC    | Buy  |   2 |    20
2020-09-19 | ABC    | Sell |   3 |    15
2020-11-05 | XYZ    | Buy  |  10 |     8
2020-12-03 | XYZ    | Buy  |  10 |     9
2020-12-05 | ABC    | Buy  |   2 |     5
What I want
date       | symbol | type | qty | total | aggregate_total
-----------------------------------------------------------
2020-09-10 | ABC    | Buy  |   5 |    10 | 10
2020-10-18 | ABC    | Buy  |   2 |    20 | 10+20 = 30
2020-09-19 | ABC    | Sell |   3 |    15 | 10+20-15 = 15
2020-11-05 | XYZ    | Buy  |  10 |     8 | 8
2020-12-03 | XYZ    | Buy  |  10 |     9 | 8+9 = 17
2020-12-05 | ABC    | Buy  |   2 |     5 | 10+20-15+5 = 20
Where I am now
I'm working with two nested for loops: one iterating over the symbols, one iterating over each row. I store the temporary results in lists. I'm still unsure how I will add the results to the final dataframe; I could reorder the dataframe by symbol and date, then append the temp lists together and finally assign the combined list to a new column.
The code below is just the inner loop over the rows.
af = df.loc[df['symbol'] == 'ABC']
temp_agg_total = [0]  # seed with 0 so the first iteration has a previous value
temp_agg_qty = [0]
for i in range(0, af.shape[0]):
    # if the type is a buy, add the operation to the running aggregate
    if af.iloc[i, 2] == "Buy":
        temp_agg_total.append(temp_agg_total[i] + af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] + af.iloc[i, 3])
    else:  # a sell subtracts from the aggregate
        temp_agg_total.append(temp_agg_total[i] - af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] - af.iloc[i, 3])
# remove the seed element (0) from each list
temp_agg_total.pop(0)
temp_agg_qty.pop(0)
af = af.assign(agg_total=temp_agg_total,
               agg_qty=temp_agg_qty)
My question
Is there a better way to do this in pandas or numpy? It feels really heavy for something relatively simple.
The presence of the Buy/Sell type of operation complicates things.
Regards
# negate the total of Sell rows
df.loc[df['type']=='Sell', 'total'] *= -1
# cumulative sum of the total within each symbol
df['aggregate_total'] = df.groupby('symbol')['total'].cumsum()
Is this what you're looking for?
df['Agg'] = 1
df.loc[df['type'] == 'Sell', 'Agg'] = -1
df['Agg'] = df['Agg']*df['total']
df['Agg'].cumsum()
df["Type_num"] = df["type"].map({"Buy":1,"Sell":-1})
df["Num"] = df.Type_num*df.total
df.groupby(["symbol"],as_index=False)["Num"].cumsum()
pd.concat([df,df.groupby(["symbol"],as_index=False)["Num"].cumsum()],axis=1)
date symbol type qty total Type_num Num CumNum
0 2020-09-10 ABC Buy 5 10 1 10 10
1 2020-10-18 ABC Buy 2 20 1 20 30
2 2020-09-19 ABC Sell 3 15 -1 -15 15
3 2020-11-05 XYZ Buy 10 8 1 8 8
4 2020-12-03 XYZ Buy 10 9 1 9 17
5 2020-12-05 ABC Buy 2 5 1 5 20
The most important thing here is the cumulative sum. The grouping ensures that the cumulative sum is performed separately for each symbol. The renaming and dropping of columns should be easy for you.
The trick is that I mapped {Buy, Sell} to {1, -1}.
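The same idea fits in one compact sketch (a hedged condensation of the answers above, not code from them; signed is a name introduced here):
# sign each total by mapping Buy -> +1, Sell -> -1, then cumsum per symbol
signed = df['total'] * df['type'].map({'Buy': 1, 'Sell': -1})
df['aggregate_total'] = signed.groupby(df['symbol']).cumsum()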

Groupby and sum by 1 column, keep all other columns, and mutate a new column, counting summed rows with pandas

I am new to Python and can see at least five similar questions, and this one is very close, but none of them work for me.
I have a dataframe with non-unique customers.
customer_id amount male age income days reward difficulty duration
0 id_1 16.06 1 45 62000.0 608 2.0 10.0 10.0
1 id_1 18.00 1 45 62000.0 608 2.0 10.0 10.0
I am trying to group them by customer_id and sum the amount while keeping all other columns, PLUS add one column, total, counting my transactions.
Desired output
customer_id amount male age income days reward difficulty duration total
0 id_1 34.06 1 45 62000.0 608 2.0 10.0 10.0 2
My best personal attempt so far does not preserve all columns
df.groupby('customer_id')['amount'].agg(total_sum='sum', total='count')
You could do it this way, include all other columns in your groupby then reset_index after aggregating:
df.groupby(df.columns.difference(['amount']).tolist())['amount']\
.agg(total_sum='sum',total='count').reset_index()
Output:
age customer_id days difficulty duration income male reward total_sum total
0 45 id_1 608 10.0 10.0 62000.0 1 2.0 34.06 2
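One caveat worth noting: grouping on all the remaining columns only collapses rows when those columns are identical within each customer_id; if any of them vary, each distinct combination becomes its own group.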
You could do:
grouper = df.groupby('customer_id')
first_dict = {col: 'first' for col in df.columns.difference(['customer_id', 'amount'])}
o = grouper.agg({
    'amount': 'sum',
    **first_dict,
})
o['total'] = grouper.size().values
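For completeness, a sketch of the same result using named aggregation (available since pandas 0.25), listing the carried-over columns explicitly and assuming they are constant within each customer_id:
# one pass: sum the amount, carry the first value of each other column,
# and count the transactions per customer
out = df.groupby('customer_id', as_index=False).agg(
    amount=('amount', 'sum'),
    male=('male', 'first'),
    age=('age', 'first'),
    income=('income', 'first'),
    days=('days', 'first'),
    reward=('reward', 'first'),
    difficulty=('difficulty', 'first'),
    duration=('duration', 'first'),
    total=('amount', 'count'),
)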
Based on @Scott Boston's answer, I found an answer myself too, and I acknowledge that my solution is not elegant (maybe someone can help clean it up). But it gives me an expanded solution for when I have non-unique rows (for instance, each customer_id has five different transactions).
df_grouped = df.groupby('customer_id').agg({'amount': ['sum'], 'reward': ['sum'], 'difficulty': ['mean'],
                                            'duration': ['mean'], 'male': ['mean'],
                                            'income': ['mean'], 'days': ['mean'], 'age': ['mean'],
                                            'customer_id': ['count']}).reset_index()
df_grouped = df_grouped.droplevel(1, axis=1)

Aggregation of pandas rows

I am trying to parse and organize a trading-history file. I am trying to aggregate every 3 or 4 rows that have the same Type (BUY or SELL), but only if they come directly after one another; if they don't, I want to keep just the single row.
As you can see in the example below, I want the multi-buy trades aggregated into one row, followed by one row for the sell trade that comes after them,
in a new df with aggregated trade prices and amounts.
link for csv: https://drive.google.com/file/d/1GoDRdI7G8uJzuLoFrm5InbDg23mAwW6o/view?usp=sharing
You can use this to get the results you are looking for. I am using a cumulative sum that increments whenever the current value differs from the previous one.
dictionary = { "BUY": 1, "SELL": 0}
df['id1'] = df['Type'].map(dictionary)
df['grp'] = (df['id1']!=df['id1'].shift()).cumsum()
Now you can aggregate the values using a simple groupby like below. This will sum the amount for each consecutive buy and sell
df.groupby(['grp'])['Amount'].sum()
This is the output of the grp column.
Type grp
0 BUY 1
1 BUY 1
2 BUY 1
3 BUY 1
4 SELL 2
5 SELL 2
6 SELL 2
7 SELL 2
8 BUY 3
9 SELL 4
10 BUY 5
11 SELL 6
12 BUY 7
13 SELL 8
14 BUY 9
15 BUY 9
16 SELL 10
17 SELL 10
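To actually collapse each run into a single aggregated row, here is a hedged sketch building on the grp column above; the 'Price' column name is an assumption about the CSV's headers, so adjust it to the real one:
# one row per consecutive run: keep the Type, sum amounts, average prices
agg = (df.groupby('grp')
         .agg(Type=('Type', 'first'),
              Amount=('Amount', 'sum'),
              Price=('Price', 'mean'))  # 'Price' is an assumed column name
         .reset_index(drop=True))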

Applying function to Pandas Groupby

I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups with multiple time periods), the .groupby and .apply() functions are taking way too long to compute (has been running over an hour and still nothing -- entire data set only contains around 300k observations).
I'm ultimately wanting to iterate over multiple columns, doing the following:
1. Compute a rolling average for each time step in a given column, per group ID.
2. Create a new column containing the difference between the original value and the moving average [x_t - (x_{t-1} + x_t)/2].
3. Store the column in a new DataFrame, which would be identical to the original data set, except that it has the residual from #2 instead of the original value.
4. Repeat and append the new residuals to df_resid (as seen below).
df_resid
date id rev_resid exp_resid
2005-09-01 1 NaN NaN
2005-12-01 1 -10000 -5500
2006-03-01 1 -352584 -262058.5
2006-06-01 1 240000 190049.5
2006-09-01 1 82648.75 37724.25
2005-09-01 2 NaN NaN
2005-12-01 2 4206.5 24353
2006-03-01 2 -302574 -331951
2006-06-01 2 103179 117405.5
2006-09-01 2 -52650 -72296.5
Here's small sample of the original data.
df
date id rev exp
2005-09-01 1 745168.0 545168.0
2005-12-01 1 725168.0 534168.0
2006-03-01 1 20000.0 10051.0
2006-06-01 1 500000.0 390150.0
2006-09-01 1 665297.5 465598.5
2005-09-01 2 956884.0 736987.0
2005-12-01 2 965297.0 785693.0
2006-03-01 2 360149.0 121791.0
2006-06-01 2 566507.0 356602.0
2006-09-01 2 461207.0 212009.0
And the (very slow) code:
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x:x.rolling(center=False,window=2).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), and could be extended to multiple columns.
Any help would be truly appreciated.
To speed up the calculation: if the dataframe is already sorted on 'id', you don't have to do the rolling within a groupby (if it isn't sorted, sort it first). Then, since your window is only length 2, we can mask the result by checking where id == id.shift(). This works because the frame is sorted.
d1 = df[['rev', 'exp']]
df.join(
    d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)
date id rev exp rev_resid exp_resid
0 2005-09-01 1 745168.0 545168.0 NaN NaN
1 2005-12-01 1 725168.0 534168.0 -10000.00 -5500.00
2 2006-03-01 1 20000.0 10051.0 -352584.00 -262058.50
3 2006-06-01 1 500000.0 390150.0 240000.00 190049.50
4 2006-09-01 1 665297.5 465598.5 82648.75 37724.25
5 2005-09-01 2 956884.0 736987.0 NaN NaN
6 2005-12-01 2 965297.0 785693.0 4206.50 24353.00
7 2006-03-01 2 360149.0 121791.0 -302574.00 -331951.00
8 2006-06-01 2 566507.0 356602.0 103179.00 117405.50
9 2006-09-01 2 461207.0 212009.0 -52650.00 -72296.50

pandas group dates to quarterly and sum sales column

I am learning Python and at the moment I am playing with some sales data. The data is in csv format and shows weekly sales.
I have the columns below, with some sample data:
store# dept# dates weeklysales
1 1 01/01/2005 50000
1 1 08/01/2005 120000
1 1 15/01/2005 75000
1 1 22/01/2005 25000
1 1 29/01/2005 18000
1 2 01/01/2005 15000
1 2 08/01/2005 12000
1 2 15/01/2005 75000
1 2 22/01/2005 35000
1 2 29/01/2005 28000
1 1 01/02/2005 50000
1 1 08/02/2005 120000
1 1 15/02/2005 75000
1 1 22/03/2005 25000
1 1 29/03/2005 18000
I want to sum the weeklysales by month for each department and display the records.
I have tried the groupby function in Pandas, following the links below:
how to convert monthly data to quarterly in pandas
Pandas group by and sum two columns
Pandas group-by and sum
But what happens with the above is that I get a sum over all the columns, including the store and dept numbers, giving the following output:
store# dept# dates weeklysales
4 3 01/2005 288000
4 1 01/2005 165000
4 3 02/2005 245000
4 3 03/2005 43000
I do not want to add store and dept numbers but want to just add the weeklysales figure by each month and want the display like:
store# dept# dates weeklysales
1 1 01/2005 288000
1 2 01/2005 165000
1 1 02/2005 245000
1 1 03/2005 43000
I will be grateful for a solution to this.
Cheers,
Is this what you are after?
Convert dates to month/year format and then group and sum sales.
(df.assign(dates=df.dates.dt.strftime('%m/%Y'))
   .groupby(['store#', 'dept#', 'dates'])
   .sum()
   .reset_index()
)
Out[243]:
store# dept# dates weeklysales
0 1 1 01/2005 288000
1 1 1 02/2005 245000
2 1 1 03/2005 43000
3 1 2 01/2005 165000
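Note that .dt.strftime only works once dates has been parsed as a datetime. A hedged variant that parses the day-first dates and groups by a Period, which keeps the months sortable rather than turning them into strings:
import pandas as pd

# parse day-first dates like 01/01/2005, then group by calendar month
df['dates'] = pd.to_datetime(df['dates'], format='%d/%m/%Y')
monthly = (df.groupby(['store#', 'dept#', df['dates'].dt.to_period('M')])
             ['weeklysales'].sum()
             .reset_index())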
