pandas group dates to quarterly and sum sales column - python

I am learning Python and at the moment I am playing with some sales data. The data is in CSV format and shows weekly sales.
The columns, with some sample data, are shown below:
store#  dept#  dates       weeklysales
1       1      01/01/2005  50000
1       1      08/01/2005  120000
1       1      15/01/2005  75000
1       1      22/01/2005  25000
1       1      29/01/2005  18000
1       2      01/01/2005  15000
1       2      08/01/2005  12000
1       2      15/01/2005  75000
1       2      22/01/2005  35000
1       2      29/01/2005  28000
1       1      01/02/2005  50000
1       1      08/02/2005  120000
1       1      15/02/2005  75000
1       1      22/03/2005  25000
1       1      29/03/2005  18000
I want to sum weeklysales to a monthly level for each department and display the records.
I have tried the groupby function in Pandas, following these links:
how to convert monthly data to quarterly in pandas
Pandas group by and sum two columns
Pandas group-by and sum
But what happens is that I get the sum of all the columns, including the store and dept numbers, giving output like this:
store#  dept#  dates    weeklysales
4       3      01/2005  288000
4       1      01/2005  165000
4       3      02/2005  245000
4       3      03/2005  43000
I do not want to sum the store and dept numbers; I only want to sum the weeklysales figure for each month, displayed like this:
store#  dept#  dates    weeklysales
1       1      01/2005  288000
1       2      01/2005  165000
1       1      02/2005  245000
1       1      03/2005  43000
I would be grateful for a solution to this.
Cheers,

Is this what you are after?
Convert dates to month/year format and then group and sum sales.
df['dates'] = pd.to_datetime(df['dates'], format='%d/%m/%Y')  # parse the day-first date strings first

(df.assign(dates=df.dates.dt.strftime('%m/%Y'))
 .groupby(['store#', 'dept#', 'dates'])
 .sum()
 .reset_index()
)
Out[243]:
   store#  dept#    dates  weeklysales
0       1      1  01/2005       288000
1       1      1  02/2005       245000
2       1      1  03/2005        43000
3       1      2  01/2005       165000
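Since the title mentions quarterly: the same pattern works for quarters, a sketch assuming dates has already been converted with pd.to_datetime as above:

(df.assign(dates=df.dates.dt.to_period('Q').astype(str))  # e.g. '2005Q1'
 .groupby(['store#', 'dept#', 'dates'])['weeklysales']
 .sum()
 .reset_index()
)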

Related

Calculate aggregate value of column row by row

My apologies for the vague title; it's hard to put what I want into words.
I'm trying to build a filled line chart with the date on the x-axis and the total transactions over time on the y-axis.
My data
The object is a pandas dataframe.
date       | symbol | type | qty | total
-----------------------------------------
2020-09-10   ABC      Buy    5     10
2020-10-18   ABC      Buy    2     20
2020-09-19   ABC      Sell   3     15
2020-11-05   XYZ      Buy    10    8
2020-12-03   XYZ      Buy    10    9
2020-12-05   ABC      Buy    2     5
What I want
date       | symbol | type | qty | total | aggregate_total
------------------------------------------------------------
2020-09-10   ABC      Buy    5     10      10
2020-10-18   ABC      Buy    2     20      10+20 = 30
2020-09-19   ABC      Sell   3     15      10+20-15 = 15
2020-11-05   XYZ      Buy    10    8       8
2020-12-03   XYZ      Buy    10    9       8+9 = 17
2020-12-05   ABC      Buy    2     5       10+20-15+5 = 20
Where I am now
I'm working with 2 nested for loops: one iterating over the symbols, and one iterating over each row. I store the temporary results in lists, and I'm still unsure how I will add the results to the final dataframe. I could reorder the dataframe by symbol and date, then append the temp lists together, and finally assign that combined list to a new column.
The code below is just the inner loop over the rows.
af = df.loc[df['symbol'] == 'ABC']
temp_agg_total = [0]  # seed with 0 so the first row has a previous value to add to
temp_agg_qty = [0]
for i in range(af.shape[0]):
    # if the operation is a buy, add it to the running aggregate
    if af.iloc[i, 2] == 'Buy':  # note: the data uses 'Buy', not 'BUY'
        temp_agg_total.append(temp_agg_total[i] + af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] + af.iloc[i, 3])
    else:
        temp_agg_total.append(temp_agg_total[i] - af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] - af.iloc[i, 3])
# remove the seed element (0)
temp_agg_total.pop(0)
temp_agg_qty.pop(0)
af = af.assign(agg_total=temp_agg_total,
               agg_qty=temp_agg_qty)
My question
Is there a better way to do this in pandas or numpy? It feels really heavy for something relatively simple.
The presence of the Buy/Sell type of operation complicates things.
Regards
# negate the total on Sell rows
df.loc[df['type'] == 'Sell', 'total'] *= -1
# cumulative sum of the total within each symbol
df['aggregate_total'] = df.groupby('symbol')['total'].cumsum()
Is this what you're looking for?
df['Agg'] = 1
df.loc[df['type'] == 'Sell', 'Agg'] = -1
df['Agg'] = df['Agg'] * df['total']
df['aggregate_total'] = df.groupby('symbol')['Agg'].cumsum()  # group by symbol so each accumulates separately
df["Type_num"] = df["type"].map({"Buy":1,"Sell":-1})
df["Num"] = df.Type_num*df.total
df.groupby(["symbol"],as_index=False)["Num"].cumsum()
pd.concat([df,df.groupby(["symbol"],as_index=False)["Num"].cumsum()],axis=1)
date symbol type qty total Type_num Num CumNum
0 2020-09-10 ABC Buy 5 10 1 10 10
1 2020-10-18 ABC Buy 2 20 1 20 30
2 2020-09-19 ABC Sell 3 15 -1 -15 15
3 2020-11-05 XYZ Buy 10 8 1 8 8
4 2020-12-03 XYZ Buy 10 9 1 9 17
5 2020-12-05 ABC Buy 2 5 1 5 20
The most important thing here is the cumulative sum. The groupby makes sure that the cumulative sum is performed separately for each symbol. The renaming and dropping of columns should be easy for you.
The trick is mapping {Buy, Sell} onto {1, -1}.
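For completeness, that cleanup might look like this (a sketch; out is the concatenated result from above, and the final column name is my assumption):

out = pd.concat([df, df.groupby('symbol')['Num'].cumsum().rename('CumNum')], axis=1)
out = (out.rename(columns={'CumNum': 'aggregate_total'})  # assumed final name
          .drop(columns=['Type_num', 'Num']))  # drop the helper columns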

Python select data for top 3 values per group in dataframe

From the given dataframe sorted by ID and Date:
ID  Date        Value
1   12/10/1998  0
1   04/21/2002  21030
1   08/16/2013  56792
1   09/18/2014  56792
1   09/14/2016  66354
2   06/16/2015  46645
2   12/08/2015  47641
2   12/11/2015  47641
2   04/13/2017  47641
3   07/29/2009  28616
3   03/31/2011  42127
3   03/17/2013  56000
I would like to get the values for the top 3 Dates, grouped by ID:
56792
56792
66354
47641
47641
47641
28616
42127
56000
I only need the values.
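For reference, here is the sample data as a DataFrame (a sketch so the answer below runs as-is; these dates parse month-first by default):

import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    'Date':  ['12/10/1998', '04/21/2002', '08/16/2013', '09/18/2014',
              '09/14/2016', '06/16/2015', '12/08/2015', '12/11/2015',
              '04/13/2017', '07/29/2009', '03/31/2011', '03/17/2013'],
    'Value': [0, 21030, 56792, 56792, 66354, 46645, 47641, 47641,
              47641, 28616, 42127, 56000],
})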
You could sort_values both by ID and Date, and use GroupBy.tail to take the values for the top 3 dates:
df.Date = pd.to_datetime(df.Date)
df.sort_values(['ID','Date']).groupby('ID').Value.tail(3).to_numpy()
# array([56792, 56792, 66354, 47641, 47641, 47641, 28616, 42127, 56000])

How to check date and time of max values in large data set Python

I have data sets that are roughly 30-60 million lines each. Each Name has one or more unique IDs associated with it for every day in the data set. For some OP_DATE and OP_HOUR values, the unique IDs can have 0 or blank values for Load1, Load2, and Load3.
I'm looking for a way to find the maximum values of these columns over all OP_DATE values, for data that looks like this:
Name    ID  OP_DATE     OP_HOUR  OP_TIME  Load1  Load2  Load3
OMI     1   2001-01-01  1        1        11     10     12
OMI     1   2001-01-01  2        0.2      1      12     10
...
OMI     2A  2001-01-01  1        0.4      5
...
OMI     2A  2001-01-01  24       0.6      2      7      12
...
Kain 2  01  2002-01-01  1        0.1      6      12
Kain 2  01  2002-01-01  2        0.98     3      14     7
...
OMI     1   2018-01-01  1        0.89     12     10     20
...
I want to find the maximum values of Load1, Load2, and Load3, and the OP_DATE, OP_HOUR, and OP_TIME on which each occurred.
The output I want is:
Name    ID  max OP_DATE  max OP_HOUR  max OP_TIME  max Load1  max Load2  max Load3
OMI     1   2011-06-11   22           .....        max values on dates
OMI     2A  2012-02-01   12           .....        max values on dates
Kain 2  01  2006-01-01   1            .....        max values on dates
Is there a way I can do this easily?
I've tried:
unique_MAX = df.groupby(['Name', 'ID'])[['Load1', 'Load2', 'Load3']].max().reset_index()
But this gives me only the overall maximum for each group; I'd like the associated dates, hours, and times as well.
To get the full row of information for any given field's max:
Get the index locations for the max of each group you desire.
Use those locations to return the full row for each one.
An example for finding the max Load1 for each Name & ID pair:
idx = df.groupby(['Name', 'ID'])['Load1'].transform('max') == df['Load1']  # True on the rows holding each group's max
df[idx]
Out[14]:
    name   ID   dt        x  y
1   Fred   050  1/2/2018  2  4
4   Dave   001  1/3/2018  6  1
5   Carly  002  1/3/2018  5  7
(The output above comes from a small example frame, not the question's data, but the pattern is the same.)
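Applied to the question's columns, the same pattern might look like this (a sketch; each Load can peak on a different row, so you get one set of rows per column):

for col in ['Load1', 'Load2', 'Load3']:
    idx = df.groupby(['Name', 'ID'])[col].transform('max') == df[col]
    print(df.loc[idx, ['Name', 'ID', 'OP_DATE', 'OP_HOUR', 'OP_TIME', col]])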

Combine three Pandas Pivot Tables with different column numbers

I have three pandas pivot_tables:
# First Pivot Table
# Total Sales of Product 1
Month        1      2      3
Branch
1       100.00  80.00  50.00
2       200.00  50.00  60.00
3       250.00  90.00  65.00
# Second Pivot Table
# Total Commission of Product 1
Month       1     2     3
Branch
1       10.00  8.00  5.00
2       20.00  5.00  6.00
3       25.00  9.00  6.50
# Third Pivot Table
# Sales Count (general, including other products)
Month   2  3
Branch
1       5  5
2       1  6
3       3  6
When I try to combine the three pivoted dataframes, I get the error: "cannot convert float NaN to integer".
The command I tried:
dfResult = dfTotalSales.append(dfCommission).append(dfSalesCount)
Besides the error above, how can I combine them and return a new column with the average of the net sales total?
Formula: (Total Sales - Commission) / Sales Count
Thanks in advance.
Is your data like the following: in branch 1, month 1 sales are 100 and month 2 sales are 80?
If yes, then I think the sales count is simply not present for month 1; i.e., as I understand it, the sales count is given only for months 2 and 3.
So while combining, pandas tried to fill the missing values, but other than the sales count the data is integer, so it cannot convert the float NaN to int.
Make everything int or float, then try again.
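A sketch of one way to line the three tables up and compute the requested average (variable names taken from the question; the month missing from dfSalesCount comes back as NaN):

# align the sales-count table to the same month columns, as floats
counts = dfSalesCount.reindex(columns=dfTotalSales.columns).astype(float)

# (Total Sales - Commission) / Sales Count, aligned by Branch and Month
net_avg = (dfTotalSales - dfCommission) / counts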

Pandas: Assign multi-index DataFrame with DataFrame by index-level-0

Please suggest a more suitable title for this question.
I have a two-level indexed DF (created via groupby):
                     clicks  yield
country report_date
AD      2016-08-06        1     31
        2016-12-01        1      0
AE      2016-10-11        1      0
        2016-10-13        2      0
I need:
To take the data country by country, process it, and put it back:
for country in set(DF.get_level_values(0)):
    DF_country = process(DF.loc[country])
    DF[country] = DF_country
where process adds new rows to DF_country.
The problem is in the last line:
ValueError: Wrong number of items passed 2, placement implies 1
I just modified your code; I changed process to add. Based on my understanding, process is a self-defined function, right?
for country in set(DF.index.get_level_values(0)):  # change here
    DF_country = DF.loc[country].add(1)
    DF.loc[country] = DF_country.values  # and here
DF
Out[886]:
                     clicks  yield
country report_date
AD      2016-08-06        2     32
        2016-12-01        2      1
AE      2016-10-11        2      1
        2016-10-13        3      1
EDIT:
l = []
for country in set(DF.index.get_level_values(0)):
    DF1 = DF.loc[country]
    DF1.loc['2016-01-01'] = [1, 2]  # adding a row here
    l.append(DF1)
pd.concat(l, axis=0, keys=set(DF.index.get_level_values(0)))
Out[923]:
                    clicks  yield
   report_date
AE 2016-10-11            1      0
   2016-10-13            2      0
   2016-01-01            1      2
AD 2016-08-06            1     31
   2016-12-01            1      0
   2016-01-01            1      2
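One caveat: set() does not guarantee a stable order, which is why AE comes before AD above. A variant sketch that keeps the original country order (and copies each slice to avoid SettingWithCopy warnings):

l = []
countries = DF.index.get_level_values(0).unique()  # preserves order of appearance
for country in countries:
    DF1 = DF.loc[country].copy()
    DF1.loc['2016-01-01'] = [1, 2]  # example new row, as above
    l.append(DF1)
result = pd.concat(l, axis=0, keys=countries)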
