I'm trying to count the number of 0s in a column based on conditions in another column. I have three columns in the spreadsheet: DATE, LOCATION, and SALES. Column 1 is the date column. Column 2 is the location column (there are 5 different locations). Column 3 is the sales volume for the day.
I want to count the number of instances where the different locations have 0 sales for the day.
I have tried a number of groupby combinations and cannot get an answer.
df_summary = df.groupby(['Location']).count()['Sales'] == 0
Any help is appreciated.
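For reference, a toy frame like the one below (hypothetical dates and values, only two locations for brevity, with columns named Date, Location and Sales as described above) reproduces the setup:
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02']),
    'Location': ['North', 'South', 'North', 'South'],
    'Sales': [0, 12, 7, 0],
})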
Try filtering first:
(df[df['Sales'] == 0].groupby('Location').size()
   .reindex(df['Location'].unique(), fill_value=0)  # include locations with no zero-sales days
   .reset_index(name='No Sales Days')               # convert to a dataframe
)
Or
df['Sales'].eq(0).groupby(df['Location']).sum()
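On the toy frame above both versions agree, for example:
df['Sales'].eq(0).groupby(df['Location']).sum()
# Location
# North    1
# South    1
The first version additionally keeps locations that never had a zero-sales day (with a count of 0), thanks to the reindex with fill_value=0.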
I'm currently working on a dataset where I am using the pandas rolling function to create features.
The features rely on three columns: a numeric DaysLate column from which the mean is calculated, an InvoiceDate column from which the date is derived, and a customerID column which identifies the customer of a row.
I'm trying to get a rolling mean of DaysLate over the last 30 days, limited to invoices raised to a specific customerID.
The following two functions are working.
Mean of DaysLate for the last five invoices raised for the row's customer
df["CustomerDaysLate_lastfiveinvoices"] = df.groupby("customerID").rolling(window = 5,min_periods = 1).\
DaysLate.mean().reset_index().set_index("level_1").\
sort_index()["DaysLate"]
Mean of DaysLate for all invoices raised in the last 30 days
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window = '30d', on = "InvoiceDate").DaysLate.mean()
I just can't seem to find the code to get the mean of the last 30 days by customerID. Any help on the above is greatly appreciated.
Set the date column as the index, sort it to ensure ascending order, then group the sorted dataframe by customerID and calculate the 30-day rolling mean for each group.
mean_30d = (
df
.set_index('InvoiceDate') # !important
.sort_index()
.groupby('customerID')
.rolling('30d')['DaysLate'].mean()
.reset_index(name='GlobalDaysLate_30days')
)
# merge the rolling mean back to original dataframe
result = df.merge(mean_30d)
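As a quick sanity check, here is a minimal sketch with made-up data (hypothetical customer IDs, dates and DaysLate values) showing the per-customer 30-day window and the merge in action:
import pandas as pd

df = pd.DataFrame({
    'customerID': ['A', 'A', 'A', 'B'],
    'InvoiceDate': pd.to_datetime(['2023-01-01', '2023-01-20', '2023-03-01', '2023-01-10']),
    'DaysLate': [10, 20, 30, 5],
})

mean_30d = (
    df
    .set_index('InvoiceDate')
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)
result = df.merge(mean_30d)
# Customer A's 2023-01-20 invoice averages its own value with the 2023-01-01 one (15.0),
# while the 2023-03-01 invoice falls outside that 30-day window and averages only itself (30.0).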
I am looking to calculate last 3 months of unique employee ID count using pandas. I am able to calculate unique employee ID count for current month but not sure how to do it for last 3 months.
df['DateM'] = df['Date'].dt.to_period('M')
df.groupby("DateM")["EmpId"].nunique().reset_index().rename(columns={"EmpId":"One Month Unique EMP count"}).sort_values("DateM",ascending=False).reset_index(drop=True)
testdata.xlsx Google Drive link:
https://docs.google.com/spreadsheets/d/1Kaguf72YKIsY7rjYfctHop_OLIgOvIaS/edit?usp=sharing&ouid=117123134308310688832&rtpof=true&sd=true
After using the above groupby command I get output for 1-month groups based on the DateM column, which is correct.
Similarly, I'm looking for another column where a 3-month unique active user count based on EmpId is calculated.
I tried calculating the same using a rolling window but it doesn't help. I also tried creating a period for the last 3 months, and I searched before asking this question. Thanks for your help in advance; otherwise I'll have to calculate it manually.
I don't know if you are looking for 3 consecutive months or something else, because your dates are discontinuous between 2022-09 and 2022-10.
I also don't know your purpose, so I give a general solution here: it produces the list of unique EmpId values for every 3 consecutive months. (If you only want to count uniques for every 3 consecutive months, that is much easier; see the sketch after the code below.) Note that this means for 2022-08 I count the 3 consecutive months as 2022-08, 2022-09, and 2022-10, and so on.
# Sort data:
df.sort_values(by='DateM', inplace=True, ignore_index=True)
# Create `dfu`, which is `df` reduced to one row per unique `EmpId` within each `DateM`:
dfu = df.groupby(['DateM', 'EmpId']).count().reset_index()
dfu.rename(columns={'Date': 'count'}, inplace=True)
dfu.sort_values(by=['DateM', 'EmpId'], inplace=True, ignore_index=True)
dfu
# Obtain the list of unique periods:
unique_period = dfu['DateM'].unique()
# Collect one dataframe per starting period:
frames = []
for p in unique_period:
    # Create a range of 3 consecutive months:
    tem_range = pd.period_range(start=p, freq='M', periods=3)
    # Extract the rows of `dfu` whose period falls in the wanted range:
    tem_dfu = dfu.loc[dfu['DateM'].isin(tem_range), :].copy()
    # Some cleaning: keep each EmpId once and drop the helper count column
    tem_dfu = tem_dfu.drop_duplicates(subset='EmpId', keep='first')
    tem_dfu.drop(columns='count', inplace=True)
    tem_dfu['start_period'] = p
    frames.append(tem_dfu)
# Concat and obtain the desired output:
dfe = pd.concat(frames, ignore_index=True)
dfe
Hope this is what you are looking for
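If only the count of unique EmpId values per 3-consecutive-month window is needed (rather than the list itself), a shorter sketch along the same lines, reusing df and unique_period from above (the output column name is just illustrative):
# Count distinct EmpId values in each starting period plus the following two months.
counts = pd.DataFrame({
    'start_period': unique_period,
    '3 Month Unique EMP count': [
        df.loc[df['DateM'].isin(pd.period_range(start=p, freq='M', periods=3)), 'EmpId'].nunique()
        for p in unique_period
    ],
})
counts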
I want to extract specific information from this CSV file.
I need to make a list of the days with the lowest visibility and give an overview of other time parameters for those days, and I also need the total number of foggy days.
You're looking for DataFrame.resample. Based on a specific column, it will group the rows of the dataframe by a specific time interval.
First you need to do this, if you haven't already:
data['Date/Time'] = pd.to_datetime(data['Date/Time'])
Get the 5 days with the lowest average visibility:
>>> data.resample(rule='D', on='Date/Time')['Visibility (km)'].mean().nsmallest(5)
Date/Time
2012-03-01 2.791667
2012-03-14 5.350000
2012-12-27 6.104167
2012-01-17 6.433333
2012-02-01 6.795833
Name: Visibility (km), dtype: float64
Basically what that does is this:
Groups all the rows by day
Converts each group to the average value of all the Visibility (km) items for that day
Returns the 5 smallest
Count the number of foggy days:
>>> data.resample(rule='D', on='Date/Time').apply(lambda x: x['Weather'].str.contains('Fog').any()).sum()
78
Basically what that does is this:
Groups all the rows by day
For each day, adds a True if any row inside that day contains 'Fog' in the Weather column, False otherwise
Counts how many Trues there were, and thus the number of foggy days.
This will get you an array of all unique foggy days; you can then use its shape attribute (or len) to get the count.
data[data["Weather"].str.contains("Fog")]["Date/Time"].dt.date.unique()
I need to make a list of days with the lowest visibility and give an overview of other time parameters for those days in tabular form.
Since your Date/Time column represents a particular hour, you'll need to do some grouping to get the minimum visibility for a particular day. The following will find the 5 least-visible days.
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Group on the new "Date" column and get the minimum values of
# each column for each group.
>>> min_by_day = data.groupby("Date").min()
# Now we can use nsmallest, since 1 row == 1 day in min_by_day.
# Since `nsmallest` returns a pandas.Series with "Date" as the index,
# we have to use `.index` to pull the date objects from the result.
>>> least_visible_days = min_by_day.nsmallest(5, "Visibility (km)").index
Then you can limit your original dataset to the least-visible days with
data[data["Date"].isin(least_visible_days)]
I also need the total number of foggy days.
We can use the extracted date in this case too:
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Filter on hours which have foggy weather
>>> foggy = data[data["Weather"].str.contains("Fog")]
# Count number of unique days
>>> len(foggy["Date"].unique())
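A small equivalent, if preferred: Series.nunique returns the same count directly.
>>> foggy["Date"].nunique()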
I have 4 columns which hold Date, Account #, Quantity and Sale respectively. I have daily data, but I want to be able to show Weekly Sales per Customer and the Quantity.
I have been able to group the data by week, but I also want to group it by OracleNumber and sum the Quantity and Sale columns. How would I get that to work without messing up the week format?
import pandas as pd
names = ['Date','OracleNumber','Quantity','Sale']
sales = pd.read_csv("CustomerSalesNVG.csv",names=names)
sales['Date'] = pd.to_datetime(sales['Date'])
grouped = sales.groupby(sales['Date'].map(lambda x: x.week))
print(grouped.head())
IIUC, you could group by both the week and the OracleNumber column by providing an extra key in the list of keys for the Groupby object to use, and perform the sum operation afterwards:
sales.groupby([sales['Date'].dt.week, 'OracleNumber']).sum()
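Note that Series.dt.week is deprecated in newer pandas (and removed in 2.x). A sketch of an equivalent grouping using pd.Grouper, which also keeps the week as an actual week-ending date instead of a bare week number (column names as in the question):
weekly = (
    sales.groupby([pd.Grouper(key='Date', freq='W'), 'OracleNumber'])[['Quantity', 'Sale']]
         .sum()
         .reset_index()
)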
I have a DataFrame df. This dataframe has 1000 rows and 20 columns. Rows are dates and 20 columns are different shares.
This is 1000 dates of stock prices from old to new (the last row is today's price).
I want to calculate, for each column, the percentile rank of today's price (the last element in the column) against the full history of that particular column. So, for example, the first value of the output would be the final value of column 1 percent-ranked against all the values in column 1, and so on.
So the output would be just 20 values: the percentage rank of today's price against the entire history of each stock.
Tough to tell what you're looking for in terms of output, but this would give you a 1x20 DataFrame where the only row in the index is the latest day and the value in each column is the percentage rank of that column's latest value versus its full history. Assuming df is your 1000 x 20 DataFrame sorted from oldest to newest...
df.rank().tail(1).div(len(df))
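The pct argument of rank gives the same result (identical when there are no missing values, since pct divides by the count of non-NA entries per column). A minimal sketch with made-up prices, just to illustrate the shape of the output:
import numpy as np
import pandas as pd

# 1000 days x 20 stocks of made-up prices, oldest to newest
prices = pd.DataFrame(np.random.rand(1000, 20),
                      index=pd.date_range('2020-01-01', periods=1000),
                      columns=[f'stock_{i}' for i in range(20)])

latest_pct_rank = prices.rank(pct=True).tail(1)   # 1 x 20: today's price vs. each column's history
# Equivalent to: prices.rank().tail(1).div(len(prices))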