Output certain groupby-rows in a pandas dataframe as columns? - python

I'm trying to show the sales numbers from the last 5 years of a dataframe in additional columns, so I can see the number of items sold per year.
Currently my code looks like this:
import pandas as pd
data = [
    [1, 'Apples', '2017-02-23', 10, 0.4],
    [2, 'Oranges', '2017-03-06', 20, 0.7],
    [1, 'Apples', '2017-09-23', 8, 0.5],
    [1, 'Apples', '2018-05-14', 14, 0.5],
    [1, 'Apples', '2019-04-27', 7, 0.6],
    [2, 'Apples', '2018-09-10', 14, 0.4],
    [1, 'Oranges', '2018-07-12', 9, 0.7],
    [1, 'Oranges', '2018-12-07', 4, 0.7]]
df = pd.DataFrame(data, columns=['CustomerID', 'Product', 'Invoice Date', 'Amount', 'Price'])
df['Invoice Date'] = pd.to_datetime(df['Invoice Date']).dt.strftime('%Y')
grpyear = df.groupby(['CustomerID', 'Product', 'Invoice Date', 'Price'])
grpyear[['Amount']].sum()
How can I get the years to show in columns looking like this:
Customer ID | Product | Amount in 2017 | Amount in 2018 | etc.

I think you did not mean to group by price. Please correct me if I'm wrong though.
In order to get a dataset like you asked:
# Removed `Price` from group
grpyear = df.groupby(['CustomerID','Product','Invoice Date'])
# Sum amounts by group
grpyear = grpyear[['Amount']].sum()
# Pivot the result so the years become columns, and fill NAs with 0
grpyear.reset_index().pivot(index=['CustomerID', 'Product'],
                            columns='Invoice Date',
                            values='Amount').fillna(0)
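For what it's worth, pivot_table can collapse the grouping and pivoting into a single step; a minimal sketch, assuming the same df as above (the "Amount in ..." renaming is just to match the headers you asked for):
# One-step alternative: pivot_table groups, aggregates and pivots at once
out = df.pivot_table(index=['CustomerID', 'Product'],
                     columns='Invoice Date',
                     values='Amount',
                     aggfunc='sum',
                     fill_value=0)
# Optional: rename the year columns to match the desired headers
out.columns = [f'Amount in {year}' for year in out.columns]
out.reset_index()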

Related

Calculate 3 months unique Emp count for a given month from last 3 months data using pandas

I am looking to calculate the unique employee ID count over the last 3 months using pandas. I am able to calculate the unique employee ID count for the current month, but I am not sure how to do it for the last 3 months.
df['DateM'] = df['Date'].dt.to_period('M')
df.groupby("DateM")["EmpId"].nunique()\
  .reset_index()\
  .rename(columns={"EmpId": "One Month Unique EMP count"})\
  .sort_values("DateM", ascending=False)\
  .reset_index(drop=True)
testdata.xlsx Google Drive link:
https://docs.google.com/spreadsheets/d/1Kaguf72YKIsY7rjYfctHop_OLIgOvIaS/edit?usp=sharing&ouid=117123134308310688832&rtpof=true&sd=true
After using the above groupby command I get output for 1-month groups based on the DateM column, which is correct.
Similarly, I'm looking for another column with the unique active user (EmpId) count over the last 3 months.
Sample output: (shown as an image in the original post)
I tried calculating the same using a rolling window, but it didn't help. I also tried creating a period for the last 3 months, and I searched for this before asking the question. Thanks for your help in advance; otherwise I'll have to calculate it manually.
I don't know if you are looking for 3 consecutive months or something else, because your dates have a gap between 2022-09 and 2022-10.
I also don't know your purpose, so I give a general solution here. If you only want to count uniques for every 3 consecutive months, it is much easier. The solution here gives you the list of unique empid values for every 3 consecutive months. Note that this means for 2022-08, I count the 3 consecutive months as 2022-08, 2022-09, and 2022-10, and so on.
# Sort data:
df.sort_values(by='datem', inplace=True, ignore_index=True)
# Create `dfu`, which is `df` reduced to unique `empid` per `datem`:
dfu = df.groupby(['datem', 'empid']).count().reset_index()
dfu.rename(columns={'date': 'count'}, inplace=True)
dfu.sort_values(by=['datem', 'empid'], inplace=True, ignore_index=True)
dfu
# Obtain the list of unique periods:
unique_period = dfu['datem'].unique()
# Create an empty dataframe:
dfe = pd.DataFrame(columns=['datem', 'empid', 'start_period'])
for p in unique_period:
    # Create a range of 3 consecutive months:
    tem_range = pd.period_range(start=p, freq='M', periods=3)
    # Extract the rows of `dfu` whose period falls in the wanted range:
    tem_dfu = dfu.loc[dfu['datem'].isin(tem_range), :].copy()
    # Some cleaning; drop_duplicates must be assigned back to take effect:
    tem_dfu = tem_dfu.drop_duplicates(subset='empid', keep='first')
    tem_dfu.drop(columns='count', inplace=True)
    tem_dfu['start_period'] = p
    # Concat to build the desired output:
    dfe = pd.concat([dfe, tem_dfu])
dfe
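If you only need the count rather than the list of IDs, you can reduce dfe afterwards; a small follow-up, assuming the dfe built above:
# Unique empid count per 3-month window:
dfe.groupby('start_period')['empid'].nunique()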
Hope this is what you are looking for

Remove Date Grouping from Data

Looking to clean multiple data sets in a more automated way. The current format has the year in one column and each month as its own column, with the numbers as values.
Below is an example of the current format, the original data has multiple years/months.
Current Format:
Year | Jan | Feb
2022 | 300 | 200
Below is an example of how I would like the new format to look like. It combines month and year into one column and transposes the number into another column.
How would I go about doing this in excel or python? Have files with many years and multiple months.
New Format:
Date | Number
2022-01 | 300
2022-02 | 200
Check the solution below. You will need to extend month_df with the remaining months; it currently only covers the example.
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})
month_df = pd.DataFrame({'Char_Month': ['Jan', 'Feb'], 'Int_Month': ['01', '02']})

melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'],
                    var_name='Char_Month', value_name='Number')
# Build the 'YYYY-MM' date from the merged frame itself, so the string
# concatenation does not rely on index alignment across different frames
pd.merge(melted_df, month_df, on='Char_Month')\
  .assign(Date=lambda d: d['Year'].astype(str) + '-' + d['Int_Month'])\
  [['Date', 'Number']]
Output:
      Date  Number
0  2022-01     300
1  2022-02     200
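As an aside, pandas can parse the month abbreviations itself, which avoids maintaining a lookup table; a sketch under that assumption (the Month and Date column names are mine):
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})

# Melt every month column at once, then let to_datetime parse the '%b' month names
melted = df.melt(id_vars='Year', var_name='Month', value_name='Number')
melted['Date'] = pd.to_datetime(
    melted['Year'].astype(str) + '-' + melted['Month'], format='%Y-%b'
).dt.strftime('%Y-%m')
melted[['Date', 'Number']]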

Group column data into Week in Python

I have 4 columns: Date, Account #, Quantity, and Sale. I have daily data, but I want to be able to show weekly Sales and Quantity per customer.
I have been able to group the column by week, but I also want to group it by OracleNumber and sum the Quantity and Sale columns. How would I get that to work without messing up the week format?
import pandas as pd
names = ['Date','OracleNumber','Quantity','Sale']
sales = pd.read_csv("CustomerSalesNVG.csv",names=names)
sales['Date'] = pd.to_datetime(sales['Date'])
grouped = sales.groupby(sales['Date'].map(lambda x: x.week))
print(grouped.head())
IIUC, you could group by both the week and the OracleNumber column by passing a list of keys for the GroupBy object to use, and then perform the sum:
sales.groupby([sales['Date'].dt.week, 'OracleNumber']).sum()
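Note that Series.dt.week was deprecated in pandas 1.1 and removed in 2.0; on a recent pandas the equivalent is isocalendar().week:
# Modern replacement for .dt.week:
sales.groupby([sales['Date'].dt.isocalendar().week, 'OracleNumber'])[['Quantity', 'Sale']].sum()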

putting .size() into new column python pandas

I am very new to python (and to stack overflow!) so hopefully this makes sense!
I have a dataframe which contains years and names (amongst other things; however, this is all I am interested in working with).
I have done df = df.groupby(['year', 'name']).size() to get the number of times each name appears in each year.
it returns something similar to this:
year name
2001 nameone 2
2001 nametwo 3
2002 nameone 1
2002 nametwo 5
What I'm trying to do is put the size data into a new column called 'count'.
(Eventually I intend to plot this on graphs.)
Any help would be greatly appreciated!
Here is the raw code (I have condensed it a bit for convenience):
hso_df = pd.read_csv('HibernationSurveyObservationsCleaned.csv')
# Keep only the columns of interest
hso_df = hso_df[["startDate", "endDate", "commonName"]]
year_df = hso_df
year_df['startDate'] = pd.to_datetime(hso_df['startDate'])
year_df['year'] = year_df['startDate'].dt.year
year_df = year_df[["year", "commonName"]].sort_values('year')
year_df = year_df.groupby(['year', 'commonName']).size()
Here is an image of the first 3 rows of the data displayed with .head() (image in the original post).
The only columns of interest from this data are commonName and year (which I have taken from startDate).
IIUC you want transform to add the result of the groupby with its index aligned to the original df:
df['count'] = df.groupby(['year', 'name'])['name'].transform('size')
EDIT
Looking at your requirements, I suggest calling reset_index on the groupby result and then merging this back to your main df:
year_df = year_df.reset_index()
hso_df.merge(year_df).rename(columns={0: 'count'})
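A small variant: since .size() returns a Series, reset_index accepts a name argument, which names the count column directly and skips the rename:
year_df = hso_df.groupby(['year', 'commonName']).size().reset_index(name='count')
hso_df.merge(year_df)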

How to apply a function to each column of a pivot table in pandas?

Code:
df = pd.read_csv("example.csv", parse_dates=['ds'])
df2 = df.set_index(['ds', 'city']).unstack('city')
rm = df2.rolling(3).mean()  # rolling mean over a 3-row window
sd = df2.rolling(3).std()   # rolling std dev over a 3-row window
df2 output: (shown as an image in the original post)
What I want: for each city and each date, I want to see whether the number of bookings is more than 1 std dev away from the rolling mean of bookings for that city. For example, in pseudocode:
for each (city column):
    for each (date):
        see whether the (number of bookings) - (same date and city rolling mean) > (same date and city std dev)
        print that date and city and number of bookings
What the problem is: I'm having trouble figuring out how to access the data I need from each of the dataframes. The parts of the pseudocode in parentheses are what I need help figuring out.
What I tried:
df2['city']
list(df2)
Both give me errors.
df2[1:2]
Slicing works, but I feel like that's not the best way to access it.
You should use the apply function of the DataFrame API. A demo is below:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})
df['C'] = df.apply(lambda row: row['A'] * row['B'], axis=1)
Output:
>>> df
A B C
0 1 1 1
1 2 2 4
2 3 3 9
3 4 4 16
4 5 5 25
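As an aside, for an elementwise product like this, plain vectorized arithmetic is simpler and faster than apply:
# Vectorized equivalent of the apply above
df['C'] = df['A'] * df['B']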
More concretely for your case:
You have to precompute the "same date and city rolling mean" and "same date and city std dev". You can use the groupby function for this: it lets you aggregate the data by city and date, after which you can calculate the std dev and mean.
Put the std dev and mean into your table; a dictionary works for this: some_dict = {('city', 'date'): [std_dev, mean], ..}. To put the data into the dataframe, use the apply function.
Then you have all the necessary data to run your check with the apply function.
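For reference, here is one way the whole check might look with modern rolling syntax; a sketch, assuming df2 has dates as the index and one column of bookings per city (as unstack('city') produces for a single value column):
# Rolling statistics per city column (window of 3 dates, as in the question)
rm = df2.rolling(3).mean()
sd = df2.rolling(3).std()

# True wherever bookings exceed the rolling mean by more than one std dev
flags = (df2 - rm) > sd

# Print every flagged (date, city, bookings) triple
for city in df2.columns:
    for date in df2.index[flags[city]]:
        print(date, city, df2.loc[date, city])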
