Group column data into Week in Python

I have 4 columns containing Date, Account #, Quantity and Sale respectively. I have daily data, but I want to be able to show weekly Sales per customer along with the Quantity.
I have been able to group the data by week, but I also want to group it by OracleNumber and sum the Quantity and Sale columns. How would I get that to work without losing the week grouping?
import pandas as pd
names = ['Date', 'OracleNumber', 'Quantity', 'Sale']
sales = pd.read_csv("CustomerSalesNVG.csv", names=names)
sales['Date'] = pd.to_datetime(sales['Date'])
# this groups by week number only; OracleNumber is not yet part of the key
grouped = sales.groupby(sales['Date'].map(lambda x: x.week))
print(grouped.head())

IIUC, you could group by the week of the Date column and by OracleNumber together, by passing a list of keys for the groupby object to use, and then take the sum:
sales.groupby([sales['Date'].dt.week, 'OracleNumber']).sum()
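Note: Series.dt.week was deprecated in pandas 1.1 and removed in 2.0. On recent versions the same idea can be written with isocalendar(); a sketch, assuming the column names above:
week = sales['Date'].dt.isocalendar().week.rename('Week')  # ISO week number per row
weekly = sales.groupby([week, 'OracleNumber'])[['Quantity', 'Sale']].sum()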

Related

Using a Rolling Function in Pandas based on Date and a Categorical Column

I'm currently working on a dataset where I am using the pandas rolling function to
create features.
The functions rely on three columns: a DaysLate numeric column from which the mean is calculated, an InvoiceDate column from which the date is derived, and a customerID column which denotes the customer of a row.
I'm trying to get a rolling mean of DaysLate over the last 30 days, limited to invoices raised to a specific customerID.
The following two functions are working.
Mean of DaysLate for the last five invoices raised for the row's customer
df["CustomerDaysLate_lastfiveinvoices"] = df.groupby("customerID").rolling(window = 5,min_periods = 1).\
DaysLate.mean().reset_index().set_index("level_1").\
sort_index()["DaysLate"]
Mean of DaysLate for all invoices raised in the last 30 days
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window = '30d', on = "InvoiceDate").DaysLate.mean()
I just can't seem to find the code to get the mean of the last 30 days by customerID. Any help on the above is greatly appreciated.
Set the date column as the index, sort it to ensure ascending order, then group the sorted dataframe by customer id and calculate the 30-day rolling mean for each group.
mean_30d = (
    df
    .set_index('InvoiceDate')  # !important
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)
# merge the rolling mean back into the original dataframe
result = df.merge(mean_30d)
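Note that merge() without on= joins on every shared column, here presumably customerID and InvoiceDate. Being explicit is safer; a sketch, assuming those column names (duplicated customer/timestamp pairs would still fan out rows):
# explicit keys and a left join keep df's rows and order intact
result = df.merge(mean_30d, on=['customerID', 'InvoiceDate'], how='left')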

Python DataFrame: when a date lies within a row's date range, extract and aggregate column values date by date

I have one data frame with start_date and end_date columns (e.g. 01-02-2020). Based on these two dates a record can be daily (if start and end are one day apart), and similarly quarterly or yearly.
Then there is a value column (e.g. 3.5).
Now suppose there exists one monthly record with value 2.5, one quarterly record with 4.5, multiple daily records like 1.5, and one yearly record like 0.5.
Then I need to get one row per date, like 01-01-2020, summing those values to give the aggregate (2.5 + 4.5 + 1.5 + 0.5 = 9), so 9 is the total_value on 01-01-2020.
There are years of data like this, with multiple records existing for the same time period, and I need the aggregated value for each date, for all distinct names.
I have been trying to do this in Python with no success so far. Any help is appreciated.
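One possible approach (a sketch, assuming columns named name, start_date, end_date and value): expand each record into one row per date it covers, then sum per name and date.
import pandas as pd
# hypothetical records for one name: a daily, a monthly, a quarterly and a yearly span
df = pd.DataFrame({
    'name': ['a', 'a', 'a', 'a'],
    'start_date': pd.to_datetime(['2020-01-01'] * 4),
    'end_date': pd.to_datetime(['2020-01-01', '2020-01-31', '2020-03-31', '2020-12-31']),
    'value': [1.5, 2.5, 4.5, 0.5],
})
# one row per covered date, then aggregate per name and date
df['date'] = [pd.date_range(s, e) for s, e in zip(df['start_date'], df['end_date'])]
daily = (df.explode('date')
           .groupby(['name', 'date'], as_index=False)['value'].sum()
           .rename(columns={'value': 'total_value'}))
# total_value for 2020-01-01 is 1.5 + 2.5 + 4.5 + 0.5 = 9.0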

Counting Distinct Items That Equal 0 Based On Another Column

I'm trying to count the number of 0s in a column based on conditions in another column. I have three columns in the spreadsheet: DATE, LOCATION, and SALES. Column 1 is the date column. Column 2 is the location column (there are 5 different locations). Column 3 is the sales volume for the day.
I want to count the number of instances where the different locations have 0 sales for the day.
I have tried a number of groupby combinations and cannot get an answer.
df_summary = df.groupby(['Location']).count()['Sales'] == 0
Any help is appreciated.
Try filtering first:
(df[df['Sales'].eq(0)].groupby('Location').size()
   .reindex(df['Location'].unique(), fill_value=0)  # fill in locations with no zero-sales days
   .reset_index(name='No Sales Days')  # convert to dataframe
)
Or
df['Sales'].eq(0).groupby(df['Location']).sum()
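A quick check on toy data, with the column names taken from the question:
import pandas as pd
df = pd.DataFrame({'Location': ['A', 'A', 'B', 'B', 'C'],
                   'Sales': [0, 5, 0, 0, 7]})
print(df['Sales'].eq(0).groupby(df['Location']).sum())
# Location
# A    1
# B    2
# C    0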

How to aggregate daily sales orders into monthly totals and have them in datetime format (e.g. January 2015) in Python?

I have a pandas dataframe with 3 columns:
OrderID_new (integer)
OrderTotal (float)
OrderDate_new (string or datetime sometimes)
Sales order IDs are in the first column, order values (totals) are in the second column, and order dates, in mm/dd/yyyy format, are in the last column.
I need to do two things:
1. Aggregate the order totals:
a) first into total sales per day, and then
b) into total sales per calendar month.
2. Convert values in OrderDate_new from mm/dd/yyyy format (e.g. 01/30/2015) into Month YYYY (e.g. January 2015) format.
The problem is that some input files already have the third (date) column in datetime format while others have it as strings, so sometimes string-to-datetime parsing is needed and other times just reformatting.
I have been trying a two-step aggregation with groupby, but I'm getting some strange daily and monthly totals that make no sense.
What I need as the final stage is a time series with 2 columns: 1. monthly sales and 2. month (Month Year)...
Then I will need to select and train some model for monthly sales time series forecasting (out of scope for this question)...
What am I doing wrong?
How do I do this effectively in Python?
You did not provide usable sample data, hence I've synthesized some.
resample() lets you roll up a date column; daily and monthly versions are shown below.
pd.to_datetime() gives you what you want for parsing.
import datetime as dt
import numpy as np
import pandas as pd

def mydf(size=10):
    return pd.DataFrame({
        "OrderID_new": np.random.randint(100, 200, size),
        "OrderTotal": np.random.randint(200, 10000, size),
        "OrderDate_new": np.random.choice(
            pd.date_range(dt.date(2019, 8, 1), dt.date(2020, 1, 1)), size),
    })

# smash orderdate to be a string for some rows
df = pd.concat([mydf(5),
                mydf(5).assign(OrderDate_new=lambda dfa: dfa.OrderDate_new.dt.strftime("%Y/%m/%d"))])
# make sure everything is a date
df.OrderDate_new = pd.to_datetime(df.OrderDate_new)
# totals per day and per calendar month
df.resample("1d", on="OrderDate_new")["OrderTotal"].sum()
df.resample("1M", on="OrderDate_new")["OrderTotal"].sum()  # "1M" = month end

How to extract certain rows under a specific condition in pandas? (Sentiment analysis)

The picture shows what my dataframe looks like. I have user_name, movie_name and time columns. I want to extract only the rows from the first day a given movie appears. For example, if movie a's first date in the time column is 2018-06-27, I want all the rows on that date, and if movie b's first date in the time column is 2018-06-12, I only want those rows. How would I do that with pandas?
I assume that the time column is of datetime type. If not, convert it by calling pd.to_datetime.
Then run:
df.groupby('movie_name').apply(
    lambda grp: grp[grp.time.dt.date == grp.time.min().date()])
groupby splits the source DataFrame into groups, one per film.
Then grp.time.min().date() computes the minimal (first) date in the current group.
And finally the whole lambda function returns only the rows from that date (within the current group).
The same happens for the other groups of rows (films).
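If apply proves slow on a large frame, a transform-based boolean mask (a sketch, assuming the same column names) selects the same rows:
# normalize() truncates timestamps to midnight, so calendar dates compare cleanly
first_day = df['time'].dt.normalize().eq(
    df.groupby('movie_name')['time'].transform('min').dt.normalize())
result = df[first_day]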
