Find events recurring every week - python

I am trying to find the keys that are recurring at a weekly cadence in a set of events, similar to the following:
_index _time key
0 2018-12-01T23:59:56.000+0000 mike
1 2018-12-04T23:59:36.000+0000 mike
2 2018-12-13T23:59:05.000+0000 mike
3 2018-12-20T23:57:45.000+0000 mike
4 2018-12-31T23:57:21.000+0000 jerry
5 2018-12-31T23:57:15.000+0000 david
6 2018-12-31T23:55:13.000+0000 tom
7 2018-12-31T23:54:28.000+0000 mike
8 2018-12-31T23:54:21.000+0000 john
I have tried creating groups by date, using the following:
df = [g for n, g in df.groupby(pd.Grouper(key='_time',freq='W'), as_index=False)]
but I have been unable to find the intersection of the resulting groups using set.intersection(), functools.reduce with pd.merge, or df.join.

One approach is to group by key and then check whether each name shows up in every week of the data ('%Y-%W' labels each timestamp with its year and week-of-year):
s = df['_time'].dt.strftime('%Y-%W').groupby(df['key']).nunique()
nweek = df['_time'].dt.strftime('%Y-%W').nunique()
s[s == nweek]
key
mike    5
Name: _time, dtype: int64
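For reference, a minimal runnable sketch of that approach, assuming the sample data above with '_time' parsed to datetime (times shortened to dates for brevity; week and weeks_per_key are names introduced here):

import pandas as pd

# sample data from the question, times truncated to dates
df = pd.DataFrame({
    '_time': ['2018-12-01', '2018-12-04', '2018-12-13', '2018-12-20',
              '2018-12-31', '2018-12-31', '2018-12-31', '2018-12-31', '2018-12-31'],
    'key': ['mike', 'mike', 'mike', 'mike', 'jerry', 'david', 'tom', 'mike', 'john'],
})
df['_time'] = pd.to_datetime(df['_time'])

# label each event with its year and week-of-year, count distinct weeks per key,
# and keep the keys that appear in every week present in the data
week = df['_time'].dt.strftime('%Y-%W')
weeks_per_key = week.groupby(df['key']).nunique()
print(weeks_per_key[weeks_per_key == week.nunique()])
# key
# mike    5
# Name: _time, dtype: int64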

Related

Pandas Number of Unique Values from 2 Fields

I am trying to find the number of unique values that span 2 fields. A typical example would be last name and first name. I have a data frame.
When I do the following, I just get the number of unique values for each column, in this case Last and First, not the composite count.
df[['Last Name','First Name']].nunique()
Thanks!
Group by both columns first, and then use nunique; the length of the result (or GroupBy.ngroups) gives the number of unique composites:
>>> df.groupby(['First Name', 'Last Name']).nunique()
IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the length of that result will give you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5
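A minimal sketch pulling these options together on the extended frame above (DataFrame.value_counts needs pandas 1.1 or newer; the other two work on older versions as well):

import pandas as pd

df = pd.DataFrame({
    'Last Name':  ['Smith', 'Johnson', 'Smith', 'Curtis', 'Taylor', 'Smith', 'Johnson', 'Smith'],
    'First Name': ['Bill', 'Bill', 'John', 'Tony', 'Elizabeth', 'Bill', 'Bill', 'Bill'],
})

# all three give the number of unique (Last Name, First Name) pairs: 5
print(df[['Last Name', 'First Name']].value_counts().size)     # counts per pair, then how many pairs
print(df.groupby(['Last Name', 'First Name']).ngroups)         # number of groups = number of unique pairs
print(len(df[['Last Name', 'First Name']].drop_duplicates()))  # drop duplicate rows, count what remains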

Pivot table rank by Name(Index) and Title(Column)

I have a dataset that looks like this:
The count represents the number of times they worked.
Title Name Count
Coach Bob 4
teacher sam 5
driver mark 8
Coach tina 10
teacher kate 3
driver frank 2
I want to create a table, which I think will have to be a pivot, that sorts the name and count by times worked within each title, so for example the output would look like this:
coach teacher driver
tina 10 sam 5 mark 8
bob 4 kate 3 frank 2
I am familiar with general pivot table code, but I think I'm going to need something a little bit more comprehensive.
DF_PIV = pd.pivot_table(DF, values=['count'], index=['title','Name'], columns=['title'],
                        aggfunc=np.max)
I get an error ValueError: Grouper for 'view_title' not 1-dimensional, but I do not think I am even on the right track here.
You can try:
(df.set_index(['Title', df.groupby('Title').cumcount()])
   .unstack(0)
   .astype(str)
   .T
   .groupby(level=1).agg(' '.join)
   .T)
Output:
Title Coach driver teacher
0 Bob 4 mark 8 sam 5
1 tina 10 frank 2 kate 3
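For reference, a runnable sketch of that chain on the sample data; the extra sort_values step and the df_sorted name are additions of mine (not part of the answer above) that put the highest Count first within each Title, matching the ordering the question asks for:

import pandas as pd

df = pd.DataFrame({
    'Title': ['Coach', 'teacher', 'driver', 'Coach', 'teacher', 'driver'],
    'Name':  ['Bob', 'sam', 'mark', 'tina', 'kate', 'frank'],
    'Count': [4, 5, 8, 10, 3, 2],
})

# sort so the largest Count comes first within each Title, then number the rows
# within each Title and pivot the Titles out to columns
df_sorted = df.sort_values('Count', ascending=False)
out = (df_sorted
       .set_index(['Title', df_sorted.groupby('Title').cumcount()])
       .unstack(0)
       .astype(str)
       .T
       .groupby(level=1).agg(' '.join)
       .T)
print(out)
# Title    Coach   driver teacher
# 0      tina 10   mark 8   sam 5
# 1        Bob 4  frank 2  kate 3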

Checking unique value for a variable in a different column

I currently have a dataframe which looks like this:
Owner Vehicle_Color
0 James Red
1 Peter Green
2 James Blue
3 Sally Blue
4 Steven Red
5 James Blue
6 James Red
7 Peter Blue
And I am trying to verify whether each Owner has one or multiple vehicle colors assigned to them. Keeping in mind that my dataframe has more than a million entries for owners (which can be duplicates), what would be the best solution?
One way may be to use groupby and nunique:
df.groupby('Owner')['Vehicle_Color'].nunique()
Results:
Owner
James 2
Peter 2
Sally 1
Steven 1
Name: Vehicle_Color, dtype: int64
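If the end goal is just to flag the owners with more than one color, a small follow-up sketch (colors_per_owner and multi_color are names introduced here, not from the question):

import pandas as pd

df = pd.DataFrame({
    'Owner': ['James', 'Peter', 'James', 'Sally', 'Steven', 'James', 'James', 'Peter'],
    'Vehicle_Color': ['Red', 'Green', 'Blue', 'Blue', 'Red', 'Blue', 'Red', 'Blue'],
})

colors_per_owner = df.groupby('Owner')['Vehicle_Color'].nunique()
multi_color = colors_per_owner[colors_per_owner > 1]   # owners assigned more than one color
print(multi_color)
# Owner
# James    2
# Peter    2
# Name: Vehicle_Color, dtype: int64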

Pandas - make a large DataFrame into several small DataFrames and run each through a function

I have a huge dataset with about 60000 rows. I would first use some criteria to group the whole dataset, and what I want to do next is to separate the whole dataset into many small datasets according to those criteria and automatically run a function on each small dataset to get a parameter for each one. I have no idea how to do this. Is there any code that makes this possible?
This is what I have
Date name number
20100101 John 1
20100102 Kate 3
20100102 Kate 2
20100103 John 3
20100104 John 1
And I want it to be split into two small ones
Date name number
20100101 John 1
20100103 John 3
20100104 John 1
Date name number
20100102 Kate 3
20100102 Kate 2
I think a more efficient way than repeatedly filtering the original data set by subsetting is groupby(); as a demo:
for _, g in df.groupby('name'):
    print(g)
# Date name number
#0 20100101 John 1
#3 20100103 John 3
#4 20100104 John 1
# Date name number
#1 20100102 Kate 3
#2 20100102 Kate 2
So to get a list of small data frames, you can do [g for _, g in df.groupby('name')].
To expand on this answer, we can see more clearly what df.groupby() returns as follows:
for k, g in df.groupby('name'):
    print(k)
    print(g)
# John
# Date name number
# 0 20100101 John 1
# 3 20100103 John 3
# 4 20100104 John 1
# Kate
# Date name number
# 1 20100102 Kate 3
# 2 20100102 Kate 2
Each element returned by groupby() is a pair: a key and a data frame whose name column holds that single key value. In the solution above we don't need the key, so we just use a placeholder variable and discard it.
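To go from the split straight to one parameter per small dataset, a minimal sketch; compute_param is a placeholder name for whatever function you actually want to run on each group:

import pandas as pd

df = pd.DataFrame({
    'Date': [20100101, 20100102, 20100102, 20100103, 20100104],
    'name': ['John', 'Kate', 'Kate', 'John', 'John'],
    'number': [1, 3, 2, 3, 1],
})

def compute_param(group):
    # placeholder: here, the mean of the 'number' column for this group
    return group['number'].mean()

params = {name: compute_param(g) for name, g in df.groupby('name')}
print(params)   # {'John': 1.6666666666666667, 'Kate': 2.5}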
Unless your function is really slow, this can probably be accomplished by slicing (e.g. df_small = df[a:b] for some indices a and b). The only trick is to choose a and b. I use range in the code below but you could do it other ways:
param_list = []
n = 10000  # size of the smaller dataframes

# loop over the 60000 rows, n at a time
for i in range(0, 60000, n):
    # take a slice of the big dataframe and apply the function to get 'param'
    df_small = df[i:i+n]
    param = function(df_small)
    # keep our results in a list
    param_list.append(param)
EDIT: Based on the update, you could do something like this:
# loop through the unique names
for i in df['name'].unique():
    # take the rows for this name and apply the function to get 'param'
    df_small = df[df['name'] == i]
    param_list.append(function(df_small))

Fill Missing Dates in DataFrame with Duplicate Dates in Groupby

I am trying to get a daily status count from the following DataFrame (it's a subset, the real data set is ~14k jobs with overlapping dates, only one status at any given time within a job):
Job Status User
Date / Time
1/24/2011 10:58:04 1 A Ted
1/24/2011 10:59:20 1 C Bill
2/11/2011 6:53:14 1 A Ted
2/11/2011 6:53:23 1 B Max
2/15/2011 9:43:13 1 C Bill
2/21/2011 15:24:42 1 F Jim
3/2/2011 15:55:22 1 G Phil Jr.
3/4/2011 14:57:45 1 H Ted
3/7/2011 14:11:02 1 I Jim
3/9/2011 9:57:34 1 J Tim
8/18/2014 11:59:35 2 A Ted
8/18/2014 13:56:21 2 F Bill
5/21/2015 9:30:30 2 G Jim
6/5/2015 13:17:54 2 H Jim
6/5/2015 14:40:38 2 I Ted
6/9/2015 10:39:15 2 J Tom
1/16/2015 7:45:58 3 A Phil Jr.
1/16/2015 7:48:23 3 C Jim
3/6/2015 14:09:42 3 A Bill
3/11/2015 11:16:04 3 K Jim
My initial thought (from the following link) was to groupby the job column, fill in the missing dates for each group and then ffill the statuses down.
Pandas reindex dates in Groupby
I was able to make this work... kind of... if two statuses occurred on the same date, one would not be included in the output, and consequently some statuses were missing.
I then found the following, it supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path thinking that filling in the missing dates and then ffill down the statuses is the correct way to ultimately capture daily counts of individual statuses? Is there another method that might better use pandas features that I'm missing?
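A sketch of the approach described in the question might look like the following: keep the last status per job per calendar day, resample each job to a daily index, forward-fill the quiet days, then count statuses per day. The small stand-in frame and the daily_status/daily/counts names are assumptions for illustration, not a confirmed solution:

import pandas as pd

# small stand-in for the real data: a DatetimeIndex plus Job / Status / User columns
df = pd.DataFrame(
    {'Job': [1, 1, 1, 1, 1],
     'Status': ['A', 'C', 'A', 'B', 'C'],
     'User': ['Ted', 'Bill', 'Ted', 'Max', 'Bill']},
    index=pd.to_datetime(['2011-01-24 10:58:04', '2011-01-24 10:59:20',
                          '2011-02-11 06:53:14', '2011-02-11 06:53:23',
                          '2011-02-15 09:43:13']))
df.index.name = 'Date'

# last status recorded per job per day; quiet days become NaN and are then
# forward-filled from the previous known status within the same job
daily = (df.sort_index()
           .groupby('Job')['Status']
           .resample('D')
           .last()
           .groupby(level='Job')
           .ffill())

# daily count of each status across all jobs
counts = (daily.reset_index()
               .groupby(['Date', 'Status'])
               .size()
               .unstack(fill_value=0))
print(counts.head())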
