Add count column to Pandas pivot table - python

I have a pandas DataFrame that keeps data for checkouts of laptops in my department. The DataFrame has columns for the time checked out (column name Out), the time checked in (In), the name of the person checking out (Name), and the number of machines checked out by that person (Number). I want to create a new DataFrame that shows both the number of checkouts that occurred in a given week and the number of machines checked out in that week. The original DataFrame is called cb.
I was able to make a pivot table that gets me the number of machines checked out by week:
dates = pd.pivot_table(cb, values="Number", index="Out", aggfunc=sum)
I'm wondering what I can add to this line of code to add a new column that calculates the number of times machines were checked out. For example if two people checked out laptops in a given week, person 1 checked out 10 laptops, and person 2 checked out 5, then there should be a "Number" column that reads "15" for this week and another column "Frequency" that reads "2".
Is this possible with a single pivot_table line or is there more to it? Thanks in advance.
EDIT: Here's what I hope is a small example of what I am looking for. First, here's raw data from the CSV I am reading:
Name Number DateOut TimeOut DateIn TimeIn
C 1 8/31/2017 2:00p 9/1/2017 3:40p
Ma 2 8/31/2017 3:30p . .
S 1 9/6/2017 10:50a 9/6/2017 1:55p
S 3 9/7/2017 10:00a 9/7/2017 3:00p
C 1 9/7/2017 2:20p 9/8/2017 11:00a
Ma 2 9/7/2017 4:00p 9/8/2017 10:00a
S 4 9/8/2017 10:50a 9/8/2017 3:15p
W 6 9/11/2017 8:15a 9/11/2017 11:00a
B 4 9/11/2017 10:45a 9/11/2017 1:00p
S 4 9/11/2017 10:55a 9/11/2017 3:55p
S 3 9/12/2017 12:55p 9/12/2017 3:00p
Ma 2 9/12/2017 4:00p 9/15/2017 10:00a
S 1 9/13/2017 11:00a 9/13/2017 1:00p
T 1 9/13/2017 1:00p . .
K 1 9/13/2017 2:00p 9/14/2017 10:00a
F 2 9/13/2017 4:00p 9/14/2017 11:45a
S 3 9/14/2017 1:00p 9/14/2017 3:00p
C 1 9/14/2017 3:50p 9/15/2017 10:00a
F 4 9/15/2017 9:35a 9/15/2017 3:00p
(Names redacted for privacy.)
The code for reading it in (parsing the given dates into a correct DateTime index):
import pandas as pd
cb = pd.read_csv("chromebookdata.csv", na_values=".",
                 parse_dates={"Out": [2, 3], "In": [4, 5]})
cb['In'] = pd.to_datetime(cb['In'], errors="coerce")
cb['Out'] = pd.to_datetime(cb['Out'], errors="coerce")
Creating a pivot table that gives the number of machines each week:
dates = pd.pivot_table(cb, values="Number", index="Out", aggfunc=sum)
dates_weekly = dates.resample("W").sum()
This pivot table gives me the number of machines checked out per week:
            Number
Out
2017-09-03     3.0
2017-09-10    11.0
2017-09-17    33.0
What I want is a new column for the number of times checkouts occurred, so for these data it would look like:
            Number  Count
Out
2017-09-03     3.0      2
2017-09-10    11.0      5
2017-09-17    33.0     12

Assuming your dates_weekly and cb DataFrames are sorted by date:
# Count the checkout rows in each week; resampling at the same weekly
# frequency makes the counts line up with dates_weekly's index
dates_weekly['Frequency'] = cb.set_index('Out').resample('W').size()

You can pass a list to aggfunc. Try aggfunc=['sum', 'count'] when you create the pivot table.
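A minimal sketch of that suggestion, reusing the cb frame from the question (flattening the MultiIndex columns at the end is optional):
dates = pd.pivot_table(cb, values="Number", index="Out",
                       aggfunc=["sum", "count"])
# resample to weekly totals; summing the per-timestamp counts gives the
# number of checkout rows in each week
dates_weekly = dates.resample("W").sum()
# the columns come out in the order of the aggfunc list: sum, then count
dates_weekly.columns = ["Number", "Count"]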

How to create a new column showing when a change to an observation occurred?

I have a data-frame formatted like so:
Contract  Agreement_Date  Date
A         2017-02-10      2020-02-03
A         2017-02-10      2020-02-04
A         2017-02-11      2020-02-09
A         2017-02-11      2020-02-10
A         2017-02-11      2020-04-21
B         2017-02-14      2020-08-01
B         2017-02-15      2020-08-11
B         2017-02-17      2020-10-14
C         2017-02-11      2020-12-12
C         2017-02-11      2020-12-16
In this data-frame I have multiple observations for each contract. For some of the contracts, the Agreement_Date changes as new amendments occur. As an example, Contract A had its Agreement_Date change from 2017-02-10 to 2017-02-11, and Contract B had its Agreement_Date change twice. Contract C had no change to Agreement_Date.
What I would like is an output that looks like so:
Contract  Date        Number_of_Changes
A         2020-02-09  1
B         2020-08-11  2
B         2020-10-14  2
Where the Date column shows when the change to Agreement_Date occurs (e.g. for contract A the Agreement_Date first went from 2017-02-10 to 2017-02-11 on 2020-02-09); these are the rows of the first table where the new Agreement_Date first appears. I then want a Number_of_Changes column which simply shows how many times the Agreement_Date changed for that contract.
I have been working on this for a few hours to no avail, so any help would be appreciated.
Thanks :)
I posted a previous answer, but realised it's not what you expected. Would this one work out, though?
# Create a 'progressive' number-of-changes column per Contract
df['Changes'] = df.groupby('Contract')['Agreement_Date'].transform(lambda x: (x != x.shift()).cumsum()) - 1
# Assign to a new df, filter for changes and drop duplicates, assuming it's already sorted by 'Date'
newdf = df[df['Changes'] > 0].drop_duplicates(subset=['Contract', 'Changes'])[['Contract', 'Date', 'Changes']]
# Reassign values of 'Changes' to the max 'Changes' per Contract
newdf['Changes'] = newdf.groupby('Contract')['Changes'].transform('max')
newdf
This problem revolves around setting up some pieces for later computational use. You'll need multiple passes to:
shift the dates and retrieve the records where the changes occur
calculate the number of changes that occurred
We can do this by working with the groupby object in two steps.
contract_grouped = df.groupby('Contract')['Agreement_Date']
# subset data based on whether or not a date change occurred
shifted_dates = contract_grouped.shift()
changed_df = df.loc[
    shifted_dates.ne(df['Agreement_Date']) & shifted_dates.notnull()
].copy()
# calculate counts and assign back to df
changed_df['count'] = changed_df['Contract'].map(contract_grouped.nunique() - 1)
del changed_df['Date']  # unneeded column
print(changed_df)
Contract Agreement_Date count
2 A 2017-02-11 1
6 B 2017-02-15 2
7 B 2017-02-17 2
Here is the same approach written out with method chaining & assignment expression syntax. If the above is more readable to you, please use that. I put this here mainly because I enjoy writing my pandas answers both ways.
changed_df = (
    df.groupby('Contract')['Agreement_Date']
    .pipe(lambda grouper:
        df.loc[
            (shifted := grouper.shift()).ne(df['Agreement_Date'])
            & shifted.notnull()
        ]
        .assign(count=lambda d: d['Contract'].map(grouper.nunique().sub(1)))
        .drop(columns='Date')
    )
)
print(changed_df)
Contract Agreement_Date count
2 A 2017-02-11 1
6 B 2017-02-15 2
7 B 2017-02-17 2
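If the exact layout requested in the question (the Date of each change plus a Number_of_Changes column) is wanted instead, a small variant of the first block above could be used, reusing contract_grouped and shifted_dates; the changes_out name is just illustrative:
changes_out = df.loc[
    shifted_dates.ne(df['Agreement_Date']) & shifted_dates.notnull(),
    ['Contract', 'Date']
].copy()
# one fewer than the number of distinct agreement dates per contract
changes_out['Number_of_Changes'] = changes_out['Contract'].map(contract_grouped.nunique() - 1)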
Another approach that gives the desired output: first, take the difference between each row and the one before it within each contract and locate the rows where the value is neither 0 nor NaT; then create a 'Change' column from the per-contract count:
import numpy as np
import pandas as pd

df.Agreement_Date = pd.to_datetime(df.Agreement_Date)
out = df.loc[np.where(
    (df.groupby('Contract')['Agreement_Date'].diff().notna())
    & (df['Agreement_Date'].diff() != '0 days')
)][['Contract', 'Date']]
out['Change'] = out.groupby('Contract')['Date'].transform('count').values
out.set_index('Contract', drop=True, inplace=True)
Output:
Date Change
Contract
A 2020-02-09 1
B 2020-08-11 2
B 2020-10-14 2

Uniqueness Test on Dataframe column and cross reference with value in second column - Python

I have a dataframe of daily license_type activations (either full or trial) as shown below. Basically, I am trying to see the monthly count of Trial to Full License conversions. I am trying to do this by taking into consideration the daily data and the user_email column.
Date User_Email License_Type P.Letter Month (conversions)
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
4 2017-04-08 761179767639020420 full g 2017-04
The logic I have is to iteratively check the User_Email column. If the User_Email value is a duplicate, then check the License_Type column. If the value in License_Type is 'full', return 1 in a new column called 'Conversion'; otherwise return 0. This would be the amendment to the original dataframe above.
Then group the 'Date' column by month so I have an aggregate monthly total of conversions in the 'Conversion' column. It should look something like below:
Date
2017-Apr 1
2017-Feb 2
2017-Jan 1
2017-Jul 0
2017-Mar 1
Name: Conversion
Below is my attempt at getting the desired output above:
# attempt to create a new column Conversion and fill with 1 and 0 for whether converted or not
for values in df['User_email']:
    if value.is_unique:
        df['Conversion'] = 0  # because there is no chance to go from trial to Full
    else:
        if df['License_type'] = 'full':  # check if license type is full
            df['Conversion'] = 1  # if full, I assume it was originally trial and now is full

# Grouping daily data by month to get monthly total of conversions
converted = df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Your sample data doesn't have the features you say you are looking for. Rather than loop (always a pandas anti-pattern), use a simple function that operates row by row:
for the uniqueness test, I get a count of each email address first and set the number of times it occurs on each row
your logic I've then transcribed in a slightly different way
data = """ Date User_Email License_Type P.Letter Month
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
3 2017-03-13 2475366081966194134 full c 2017-03
3 2017-03-13 2475366081966194 full c 2017-03
4 2017-04-08 761179767639020420 full g 2017-04"""
# strip the leading row numbers, then split each line on whitespace
a = [
    [t.strip() for t in re.split(" ", l) if t.strip() != ""]
    for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]
]
df = pd.DataFrame(a[1:], columns=a[0])
df["Date"] = pd.to_datetime(df["Date"])
df = df.assign(
    emailc=df.groupby("User_Email")["User_Email"].transform("count"),
    Conversion=lambda dfa: dfa.apply(
        lambda r: 0 if r["emailc"] == 1 or r["License_Type"] == "trial" else 1, axis=1
    ),
).drop("emailc", axis=1)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Output:
Date
2017-Apr 0
2017-Feb 1
2017-Jan 0
2017-Jul 0
2017-Mar 1
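The same logic can also be written without apply; a fully vectorized sketch, assuming the df built above:
# an email seen more than once whose row is 'full' counts as a conversion
df['Conversion'] = (
    df['User_Email'].duplicated(keep=False) & df['License_Type'].eq('full')
).astype(int)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()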

Splitting a dataframe's column by ';' copying all static rows and doing math on others

So I have a dataframe that has been read from a CSV. It has 36 columns and 3000+ rows. I want to split the dataframe on a column that contains items separated by a semicolon.
It is purchasing data and has mostly rows I would want to just copy down for the split; for example: Invoice Number, Sales Rep, etc. That is the first step and I have found many answers for this on SO, but none that solve for the second part.
There are other columns: Quantity, Extended Cost, Extended Price, and Extended Gross Profit that would need to be recalculated based on the split. The quantity, for the rows with values in the column in question, would need to be 1 for each item in the list; the subsequent columns would need to be recalculated based on that column.
See below for an example DF:
How would I go about this?
A lot of implementations use str.split(';') and some use df.apply, but unfortunately I am not understanding the process from front to back.
Edit: This is the output I am looking for:
Proposed output
Using pandas 0.25.1+ you can use explode:
import pandas as pd

df = pd.DataFrame({
    'Quantity': [6, 50, 25, 4],
    'Column in question': ['1;2;3;4;5;6', '', '', '7;8;9;10'],
    'Price': ['$1.00', '$10.00', '$0.10', '$25.00'],
    'Invoice Close Date': ['9/3/2019', '9/27/2019', '9/18/2019', '9/30/2019'],
})

df_out = (
    df.assign(ciq=df['Column in question'].str.split(';'))
      .explode('ciq')
      .drop('Column in question', axis=1)
      .rename(columns={'ciq': 'Column in question'})
)
# split the original quantity evenly across the exploded rows
df_out['Quantity'] = df_out['Quantity'] / df_out.groupby(level=0)['Quantity'].transform('size')
df_out
Output:
Quantity Price Invoice Close Date Column in question
0 1.0 $1.00 9/3/2019 1
0 1.0 $1.00 9/3/2019 2
0 1.0 $1.00 9/3/2019 3
0 1.0 $1.00 9/3/2019 4
0 1.0 $1.00 9/3/2019 5
0 1.0 $1.00 9/3/2019 6
1 50.0 $10.00 9/27/2019
2 25.0 $0.10 9/18/2019
3 1.0 $25.00 9/30/2019 7
3 1.0 $25.00 9/30/2019 8
3 1.0 $25.00 9/30/2019 9
3 1.0 $25.00 9/30/2019 10
Details:
First, create a column containing a list using str.split and assign.
Next, use explode, drop the original column, and rename the new column to the old name.
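The question also mentions Extended Cost, Extended Price and Extended Gross Profit; assuming those columns are numeric and should be split evenly across the exploded rows (they are not in the toy frame above, so this is only a sketch), the same group-size division applies:
split_sizes = df_out.groupby(level=0)['Quantity'].transform('size')
for col in ['Extended Cost', 'Extended Price', 'Extended Gross Profit']:
    if col in df_out.columns:  # only present in the real data, not the example df
        df_out[col] = df_out[col] / split_sizes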

Applying function to Pandas Groupby

I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups with multiple time periods), the .groupby and .apply() functions are taking way too long to compute (has been running over an hour and still nothing -- entire data set only contains around 300k observations).
I'm ultimately wanting to iterate over multiple columns, doing the following:
1. Compute a rolling average for each time step in a given column, per group ID
2. Create a new column containing the difference between the original value and the moving average [x_t - (x_t-1 + x_t)/2]
3. Store the column in a new DataFrame, which would be identical to the original data set, except that it has the residual from #2 instead of the original value.
4. Repeat and append the new residuals to df_resid (as seen below)
df_resid
date id rev_resid exp_resid
2005-09-01 1 NaN NaN
2005-12-01 1 -10000 -5500
2006-03-01 1 -352584 -262058.5
2006-06-01 1 240000 190049.5
2006-09-01 1 82648.75 37724.25
2005-09-01 2 NaN NaN
2005-12-01 2 4206.5 24353
2006-03-01 2 -302574 -331951
2006-06-01 2 103179 117405.5
2006-09-01 2 -52650 -72296.5
Here's small sample of the original data.
df
date id rev exp
2005-09-01 1 745168.0 545168.0
2005-12-01 1 725168.0 534168.0
2006-03-01 1 20000.0 10051.0
2006-06-01 1 500000.0 390150.0
2006-09-01 1 665297.5 465598.5
2005-09-01 2 956884.0 736987.0
2005-12-01 2 965297.0 785693.0
2006-03-01 2 360149.0 121791.0
2006-06-01 2 566507.0 356602.0
2006-09-01 2 461207.0 212009.0
And the (very slow) code:
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x:x.rolling(center=False,window=2).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), and could be extended to multiple columns.
Any help would be truly appreciated.
To speed up the calculation: if the dataframe is already sorted on 'id', you don't have to do the rolling within a groupby (if it isn't sorted, do so first). Then, since your window is only length 2, we mask the result by checking where id == id.shift(). This works because the frame is sorted.
d1 = df[['rev', 'exp']]
df.join(
    d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)
date id rev exp rev_resid exp_resid
0 2005-09-01 1 745168.0 545168.0 NaN NaN
1 2005-12-01 1 725168.0 534168.0 -10000.00 -5500.00
2 2006-03-01 1 20000.0 10051.0 -352584.00 -262058.50
3 2006-06-01 1 500000.0 390150.0 240000.00 190049.50
4 2006-09-01 1 665297.5 465598.5 82648.75 37724.25
5 2005-09-01 2 956884.0 736987.0 NaN NaN
6 2005-12-01 2 965297.0 785693.0 4206.50 24353.00
7 2006-03-01 2 360149.0 121791.0 -302574.00 -331951.00
8 2006-06-01 2 566507.0 356602.0 103179.00 117405.50
9 2006-09-01 2 461207.0 212009.0 -52650.00 -72296.50
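Since the question asks about extending this to more columns, the same pattern only needs a longer column list; a minimal sketch, still assuming the frame is sorted by id:
cols = ['rev', 'exp']  # add further column names here
d1 = df[cols]
resid = d1.rolling(2).mean().rsub(d1).add_suffix('_resid')
# join back onto the identifying columns; rows at the start of each id stay NaN
df_resid = df[['date', 'id']].join(resid[df.id.eq(df.id.shift())])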

Grouping records with close DateTimes in Python pandas DataFrame

I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window would begin with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combine transactions within 50 ms) but stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data with your Time data.
You can use a list of datetimes with the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use it for slicing (as suggested in the questions python pandas dataframe slicing by date conditions and Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval).
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you can adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np
# Create some test data
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)},
                  index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))
# Define periods to define groups (ticks)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())
# find the nearest index positions matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))
# make a dataframe with the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])
# set the group ids between consecutive ticks (positional slicing)
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i
# update the original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
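The fixed 5-millisecond ticks above differ slightly from the rule described in the question, where a new window starts with the next transaction once the threshold has passed. A minimal sketch of that rule (assuming a DataFrame df sorted by a datetime64 column named Time; the names and the 2-second threshold are illustrative):
import pandas as pd

threshold = pd.Timedelta(seconds=2)
group_ids = []
group_id = 0
window_start = None
for t in df['Time']:
    # start a new window when this transaction falls outside the current one
    if window_start is None or t - window_start > threshold:
        group_id += 1
        window_start = t
    group_ids.append(group_id)
df['Grouped ID'] = group_ids
Rows that end up alone in a window (like transactions 3, 4 and 10 in the question) keep their own id here; they could be blanked out afterwards by masking groups of size 1 if the blank cells in the example are required.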
