Fill Missing Dates in DataFrame with Duplicate Dates in Groupby - python

I am trying to get a daily status count from the following DataFrame (it's a subset, the real data set is ~14k jobs with overlapping dates, only one status at any given time within a job):
Date / Time          Job  Status  User
1/24/2011 10:58:04   1    A       Ted
1/24/2011 10:59:20   1    C       Bill
2/11/2011 6:53:14    1    A       Ted
2/11/2011 6:53:23    1    B       Max
2/15/2011 9:43:13    1    C       Bill
2/21/2011 15:24:42   1    F       Jim
3/2/2011 15:55:22    1    G       Phil Jr.
3/4/2011 14:57:45    1    H       Ted
3/7/2011 14:11:02    1    I       Jim
3/9/2011 9:57:34     1    J       Tim
8/18/2014 11:59:35   2    A       Ted
8/18/2014 13:56:21   2    F       Bill
5/21/2015 9:30:30    2    G       Jim
6/5/2015 13:17:54    2    H       Jim
6/5/2015 14:40:38    2    I       Ted
6/9/2015 10:39:15    2    J       Tom
1/16/2015 7:45:58    3    A       Phil Jr.
1/16/2015 7:48:23    3    C       Jim
3/6/2015 14:09:42    3    A       Bill
3/11/2015 11:16:04   3    K       Jim
My initial thought (from the following link) was to group by the Job column, fill in the missing dates for each group, and then forward-fill (ffill) the statuses.
Pandas reindex dates in Groupby
I was able to make this work, sort of: if two statuses occurred on the same date, one of them was dropped from the output, and consequently some statuses were missing.
I then found the following, which supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path in thinking that filling in the missing dates and then forward-filling the statuses is the correct way to ultimately capture daily counts of individual statuses? Is there another method that makes better use of pandas features that I'm missing?
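The approach described above (fill in missing dates per job, then forward-fill) can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample data is made up, "last status of the day wins" is an assumed rule for resolving same-day duplicates, and each job is only filled between its own first and last date.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data (assumption)
df = pd.DataFrame(
    {"Job": [1, 1, 1, 2, 2], "Status": ["A", "C", "F", "A", "B"]},
    index=pd.to_datetime(
        [
            "2011-01-24 10:58:04",
            "2011-01-24 10:59:20",  # same day as the row above
            "2011-01-27 15:24:42",
            "2011-01-25 11:59:35",
            "2011-01-26 13:56:21",
        ]
    ),
)

frames = []
for job, g in df.groupby("Job"):
    # One row per calendar day: keep the last status seen that day,
    # then carry it forward over days with no events.
    s = g["Status"].sort_index().resample("D").last().ffill()
    frames.append(pd.DataFrame({"Date": s.index, "Job": job, "Status": s.values}))

daily = pd.concat(frames, ignore_index=True)

# Rows are days, columns are statuses, values are the number of jobs in that status
counts = pd.crosstab(daily["Date"], daily["Status"])
print(counts)
```

Taking the last status per day sidesteps the duplicate-date problem, because each job ends up with exactly one row per day before the forward-fill.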

Related

Using .size() to sum 2 columns in groupby?

I have created a groupby object with 2 columns: the names of directors and how much revenue their films have returned. I would like to sum the total revenue for each director into a data frame, but the output my code returns is a bit off.
Any suggestions?
gross_revised.groupby(['director','updated_inflation_values']).size()
and the output it's returning looks like this:
director        updated_inflation_values
Art Stevens     9088096.00                  1
Ben Sharpsteen  528279994.00                1
                2187090808.00               1
Clyde Geronimi  91305448.00                 1
                138612686.00                1
                920608730.00                1
David Hand      1078510579.00               1
                5228953251.00
Is there any way to have the sum show as one number per director?
Thank you for your help!
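A possible sketch, assuming the goal is one total per director: group by director only and sum the revenue column, since .size() counts rows rather than adding values. The three rows below are copied from the output shown in the question; the rest of the frame is an assumption.

```python
import pandas as pd

# Small frame using a few values from the question's output
gross_revised = pd.DataFrame(
    {
        "director": ["Art Stevens", "Ben Sharpsteen", "Ben Sharpsteen"],
        "updated_inflation_values": [9088096.0, 528279994.0, 2187090808.0],
    }
)

# Group by director only, then sum the revenue column.
# as_index=False keeps 'director' as a regular column in the result.
totals = gross_revised.groupby("director", as_index=False)[
    "updated_inflation_values"
].sum()
print(totals)
```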

Compare different df's row by row and return changes

Every month I collect data that contains details of employees to be stored in our database.
I need a way to compare the data stored in the previous month with the data just received and return, in a new dataframe, every row in which any column changed.
I would also need to know which columns changed in each row of this returned dataframe.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row has no match on Index, it should not be returned in the new dataframe;
The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to do this comparison?
I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows of the old and new dataframes via the index (.loc raises a KeyError if old_df contains index labels that are missing from new_df):
new_df_to_compare=new_df.loc[old_df.index]
When the data types don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
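Putting these steps together, a rough end-to-end sketch might look like the following. The two frames are trimmed-down versions of the question's df1/df2. Restricting both sides to the shared Index values and casting everything to str are simplifying assumptions made here so that compare() receives identically labelled, same-typed frames.

```python
import pandas as pd

# Trimmed-down versions of the question's df1 (old) and df2 (new)
old_df = pd.DataFrame(
    {
        "Index": [2, 3, 4],
        "Department": ["HR", "MKT", "I'T"],
        "Salary": ["7000", "$7600", "8000"],
    }
).set_index("Index")
new_df = pd.DataFrame(
    {
        "Index": [2, 3, 4, 5],
        "Department": ["HR", "MKT", "IT", "IT"],
        "Salary": ["7000", "7600", "8000", "9000"],
    }
).set_index("Index")

# Keep only rows present in both frames, so unmatched rows are ignored
common = old_df.index.intersection(new_df.index)

# Cast to str on both sides (assumption: string comparison is acceptable)
old_aligned = old_df.loc[common].astype(str)
new_aligned = new_df.loc[common, old_df.columns].astype(str)

# 'self' holds the old value and 'other' the new one; unchanged cells are NaN
difference_df = old_aligned.compare(new_aligned)

# Full new rows that had at least one change
changed_rows = new_df.loc[difference_df.index]
print(difference_df)
```

The non-NaN cells of difference_df indicate which columns changed in each row.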

Conditional copy of values from one column to another columns

I have a pandas dataframe that looks something like this:
name job jobchange_rank date
Thisguy Developer 1 2012
Thisguy Analyst 2 2014
Thisguy Data Scientist 3 2015
Anotherguy Developer 1 2018
The jobchange_rank represents each individual's (based on name) ranked change in position, where rank 1 is his/her first position, rank 2 is his/her second position, and so on.
Now for the fun part. I want to create a new column where I can see a person's previous job, something like this:
name job jobchange_rank date previous_job
Thisguy Developer 1 2012 None
Thisguy Analyst 2 2014 Developer
Thisguy Data Scientist 3 2015 Analyst
Anotherguy Developer 1 2018 None
I've created the following code to get the "None" values where there was no job change:
df.loc[df['jobchange_rank'].sub(df['jobchange_rank'].min()) == 0, 'previous_job'] = 'None'
Sadly, I can't seem to figure out how to get the values from the other column where the needed condition applies.
Any help is more than welcome!
Thanks in advance.
This answer assumes that your DataFrame is sorted by name and jobchange_rank. If that is not the case, sort first.
# df = df.sort_values(['name', 'jobchange_rank'])
m = df['name'].eq(df['name'].shift())
df['job'].shift().where(m)
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Or using a groupby + shift (assuming at least sorted by jobchange_rank)
df.groupby('name')['job'].shift()
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Although the groupby + shift is more concise, on larger inputs, if your data is already sorted like your example, it may be faster to avoid the groupby and use the first solution.
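As a self-contained sketch of the groupby + shift variant, the snippet below rebuilds the question's frame and assigns the result to the new column. Sorting by jobchange_rank before grouping is an added safeguard (an assumption about unsorted input); the assignment aligns on the original index, so the row order of df is unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Thisguy", "Thisguy", "Thisguy", "Anotherguy"],
        "job": ["Developer", "Analyst", "Data Scientist", "Developer"],
        "jobchange_rank": [1, 2, 3, 1],
        "date": [2012, 2014, 2015, 2018],
    }
)

# Within each name, shift the job column down by one row.
# The first job of every person has no predecessor and becomes NaN.
df["previous_job"] = (
    df.sort_values("jobchange_rank").groupby("name")["job"].shift()
)
print(df)
```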

Find events recurring every week

I am trying to find the keys that are recurring at a weekly cadence in a set of events, similar to the following:
_index _time key
0 2018-12-01T23:59:56.000+0000 mike
1 2018-12-04T23:59:36.000+0000 mike
2 2018-12-13T23:59:05.000+0000 mike
3 2018-12-20T23:57:45.000+0000 mike
4 2018-12-31T23:57:21.000+0000 jerry
5 2018-12-31T23:57:15.000+0000 david
6 2018-12-31T23:55:13.000+0000 tom
7 2018-12-31T23:54:28.000+0000 mike
8 2018-12-31T23:54:21.000+0000 john
I have tried creating groups by date, using the following:
df = [g for n, g in df.groupby(pd.Grouper(key='_time',freq='W'), as_index=False)]
but I have been unable to find the intersection of the various groups using set.intersection(), reduce & pd.merge, or df.join.
One option is to group by key and then check whether each key shows up in every week:
# %W is the week number of the year (Monday as the first day of the week)
s=df['_time'].dt.strftime('%Y-%W').groupby(df['key']).nunique()
nweek=df['_time'].dt.strftime('%Y-%W').nunique()
s[s==nweek]
key
mike 5
Name: _time, dtype: int64
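The same idea can be sketched with ISO calendar weeks via Series.dt.isocalendar() (available in pandas 1.1 and later). In strftime, %w is the day of the week while %W and %U are week numbers, so the week label has to be built from a week field. The frame below re-creates the question's sample; the week_id label format is an arbitrary choice made here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "_time": pd.to_datetime(
            [
                "2018-12-01 23:59:56", "2018-12-04 23:59:36",
                "2018-12-13 23:59:05", "2018-12-20 23:57:45",
                "2018-12-31 23:57:21", "2018-12-31 23:57:15",
                "2018-12-31 23:55:13", "2018-12-31 23:54:28",
                "2018-12-31 23:54:21",
            ],
            utc=True,
        ),
        "key": ["mike", "mike", "mike", "mike", "jerry",
                "david", "tom", "mike", "john"],
    }
)

# ISO year and ISO week number for every event
iso = df["_time"].dt.isocalendar()
week_id = iso["year"].astype(str) + "-W" + iso["week"].astype(str)

# Number of distinct weeks each key appears in
weeks_per_key = week_id.groupby(df["key"]).nunique()

# Keep the keys that appear in every week present in the data
recurring = weeks_per_key[weeks_per_key == week_id.nunique()]
print(recurring)
```

This only checks that a key appears in every week that has at least one event; weeks with no events at all are not counted as gaps.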

Pandas - make a large DataFrame into several small DataFrames and run each through a function

I have a huge dataset with about 60,000 rows. I would first group the whole dataset by some criteria. Next, I want to split it into many small datasets according to those criteria and automatically run a function on each small dataset to get a parameter for each one. I have no idea how to do this. Is there any code that makes it possible?
This is what I have
Date name number
20100101 John 1
20100102 Kate 3
20100102 Kate 2
20100103 John 3
20100104 John 1
And I want it to be split into two small ones
Date name number
20100101 John 1
20100103 John 3
20100104 John 1
Date name number
20100102 Kate 3
20100102 Kate 2
I think groupby() is more efficient than filtering the original data set by subsetting. As a demo:
for _, g in df.groupby('name'):
    print(g)
# Date name number
#0 20100101 John 1
#3 20100103 John 3
#4 20100104 John 1
# Date name number
#1 20100102 Kate 3
#2 20100102 Kate 2
So to get a list of small data frames, you can do [g for _, g in df.groupby('name')].
To expand on this answer, we can see more clearly what df.groupby() returns as follows:
for k, g in df.groupby('name'):
    print(k)
    print(g)
# John
# Date name number
# 0 20100101 John 1
# 3 20100103 John 3
# 4 20100104 John 1
# Kate
# Date name number
# 1 20100102 Kate 3
# 2 20100102 Kate 2
Each element returned by groupby() is a pair: a key and a data frame whose name column holds only that key. In the above solution we don't need the key, so we can assign it to a placeholder (_) and discard it.
Unless your function is really slow, this can probably be accomplished by slicing (e.g. df_small = df[a:b] for some indices a and b). The only trick is to choose a and b. I use range in the code below but you could do it other ways:
param_list = []
n = 10000  # size of each smaller dataframe
# step through the rows n at a time; stopping at len(df) keeps the final chunk
for i in range(0, len(df), n):
    # take a slice of the big dataframe and apply function to get 'param'
    df_small = df.iloc[i:i+n]
    param = function(df_small)
    # keep our results in a list
    param_list.append(param)
EDIT: Based on the update, you could do something like this:
# loop through the unique names
for i in df['name'].unique():
    # take a slice of big dataframe and apply function to get 'param'
    df_small = df[df['name'] == i]
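To tie the pieces together, here is a minimal sketch that runs a function on every group and collects one parameter per name. The function below is a hypothetical stand-in (it just averages the number column), since the question does not say what the real function computes.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": [20100101, 20100102, 20100102, 20100103, 20100104],
        "name": ["John", "Kate", "Kate", "John", "John"],
        "number": [1, 3, 2, 3, 1],
    }
)


def function(df_small):
    # Hypothetical stand-in for the real per-group computation
    return df_small["number"].mean()


# One parameter per group, keyed by the group's name
params = {name: function(g) for name, g in df.groupby("name")}
print(params)
```

A dict keeps each result attached to its group key, which is easy to lose when appending to a plain list.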
