Grouping data based on time interval - Python

I have to group a dataset with multiple participants. Each participant works for a specific amount of time on a specific tablet. If consecutive rows are on the same tablet and the time difference between them is no more than 10 minutes, the rows belong to one participant. I would like to create a new column ("Participant") that numbers the participants. I know some Python, but this goes over my head. Thanks a lot!
Dataframe:
ID, Time, Tablet
1, 9:12, a
2, 9:14, a
3, 9:17, a
4, 9:45, a
5, 9:49, a
6, 9:51, a
7, 9:13, b
8, 9:15, b
...
Goal:
ID, Time, Tablet, Participant
1, 9:12, a, 1
2, 9:14, a, 1
3, 9:17, a, 1
4, 9:45, a, 2
5, 9:49, a, 2
6, 9:51, a, 2
7, 9:13, b, 3
8, 9:15, b, 3
...

You can use groupby first, then a cumsum to get the participant column the way you want. Make sure the Time column is in datetime format, and sort by Tablet and Time before you do this.
df['Time'] = pd.to_datetime(df['Time'])
df['time_diff'] = df.groupby('Tablet')['Time'].diff().dt.total_seconds() / 60
df['Participant'] = np.where(df['time_diff'].isnull() | (df['time_diff'] > 10), 1, 0).cumsum()
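For reference, a self-contained sketch of this approach on the question's sample data (to_datetime attaches a dummy date to the minute-resolution times, which is harmless here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Time': ['9:12', '9:14', '9:17', '9:45', '9:49', '9:51', '9:13', '9:15'],
                   'Tablet': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b']})
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M')
df = df.sort_values(['Tablet', 'Time'])
# Minutes since the previous row on the same tablet (NaN at each tablet's first row)
df['time_diff'] = df.groupby('Tablet')['Time'].diff().dt.total_seconds() / 60
df['Participant'] = np.where(df['time_diff'].isnull() | (df['time_diff'] > 10), 1, 0).cumsum()
print(df[['ID', 'Tablet', 'Participant']])  # participants 1, 1, 1, 2, 2, 2, 3, 3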

I've done something similar before, using a combination of a groupby and the pandas shift function.
df = df.sort_values(["Tablet", "Time"])
df["Time_Period"] = df["Time"] - df.groupby("Tablet")["Time"].shift(1)
df["Time_Period"] = df["Time_Period"].dt.total_seconds()
df["New_Participant"] = df["Time_Period"].isna() | (df["Time_Period"] > 10 * 60)  # 10 minutes, or the first row of a tablet
df["Participant_ID"] = df["New_Participant"].cumsum()
Basically I flag every row that starts a new session (a gap of over 10 minutes since the previous row on the same tablet, or the first row for a tablet), then take a cumulative sum to give each participant a unique ID.
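The flag-then-cumsum step is worth seeing in isolation; a minimal sketch:
import pandas as pd

# True marks the first row of each new participant; cumsum turns the
# flags into running group IDs (True counts as 1)
flags = pd.Series([True, False, False, True, False, False, True, False])
print(flags.cumsum().tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]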

Is it possible to choose the right rows in an ordered sequence of events without loops?

Example rows:
B1
S1
B2
B3/S2
B4
B5
B6/S4
S3
Rules:
A row can be B (buy), S (sell), or both
It is known which sell belongs to which buy, and vice versa
Buys are ordered; sells are possibly not ordered
When a buy has no matching sell, all the subsequent buys are discarded
We want all the buy rows such that, if there is a sell for a row, all the buy rows from that point up to the respective sell row are discarded
This can be done with a simple loop that skips the overlapping buys, but implementing it with vectors has been challenging, and I am wondering if it is possible.
The most promising method I tried was padding the index of the buys and backfilling the indexes of the sells, then making sense of the possible combinations, although I am not sure they give a unique view of the state...
Output from example would be:
B1
B2
B4
Here is a suggestion, using pandas. I don't know if it is more efficient than what you are doing, but if the goal is to avoid looping, I think this will do it.
I will assume your buy/sell data can be split into two dataframes, one for buys and one for sells. I also add a 'Time' column to each frame, i.e. when the order to buy/sell is placed. Putting your data in a dataframe and splitting it into the two dataframes above is probably an easy exercise, but I will skip it.
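For completeness, here is one possible version of that split, assuming the raw rows are strings like 'B3/S2' whose positions give the time steps; it produces exactly the two frames used below:
import pandas as pd

rows = ['B1', 'S1', 'B2', 'B3/S2', 'B4', 'B5', 'B6/S4', 'S3']

buys, sells = [], []
for t, row in enumerate(rows):
    for part in row.split('/'):
        # 'B3' -> kind 'B', num 3; the row's position is its time step
        kind, num = part[0], int(part[1:])
        (buys if kind == 'B' else sells).append({'Num': num, 'Time': t})

df_buy = pd.DataFrame(buys)    # Num 1..6 at times 0, 2, 3, 4, 5, 6
df_sell = pd.DataFrame(sells)  # Num 1, 2, 4, 3 at times 1, 3, 6, 7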
import pandas as pd
# Your data split into two frames (for instance, in df_buy, Num=2 would be
# equivalent to B2 occurring at the second, zero-indexed, time step)
df_buy = pd.DataFrame({'Num': [1, 2, 3, 4, 5, 6],
                       'Time': [0, 2, 3, 4, 5, 6]})
# S1, S2, S4, S3 happening at times 1, 3, 6 and 7
df_sell = pd.DataFrame({'Num': [1, 2, 4, 3],
                        'Time': [1, 3, 6, 7]})
# Merge buy/sell on Num to find all possible trades
df_trades = pd.merge(df_buy, df_sell, on='Num', suffixes=['_Buy', '_Sell'])
# Order all trades by the time they would happen, i.e. Time_Sell
# (or perhaps at max(Time_Sell, Time_Buy)?)
df_trades.sort_values(by='Time_Sell', inplace=True)
# Only trades that happen in increasing Num order are allowed, so filter
# out the trades that come in decreasing order (i.e. trade 3 cannot come
# after trade 4)
df_final = df_trades[df_trades['Num'].sub(df_trades['Num'].shift(), fill_value=0) >= 0]
# Here we have Num = 1, 2, 4, i.e. B1/S1, B2/S2 and B4/S4
# Here we have Num = 1, 2, 4 i.e. B1/S1, B2/S2 and B4/S4
Out[11]:
Num Time_Buy Time_Sell
0 1 0 1
1 2 2 3
3 4 4 6

Sum of lists without changing the length of the lists - pandas

I would like to group by one column and sum the lists in another column of a dataframe, but the following code does not seem to work: the length of each user's vector changes after I use the sum function.
dt2 = dt.groupby(['user']).sum()
the data like this:
user vector
1 [1,2,3,4,5]
2 [1,3,2,4,5]
1 [3,3,3,4,4]
1 [1,2,2,1,1]
2 [1,1,2,0,0]
The expected table should be
user vector
1 [5,7,8,9,10]
2 [2,4,4,4,5]
Here is one way: build a new dataframe from the vector column, group it by user and sum, then aggregate back to lists along axis=1:
(pd.DataFrame(df['vector'].tolist())
   .groupby(df['user']).sum()
   .agg(list, axis=1)
   .reset_index(name='vector'))
user vector
0 1 [5, 7, 8, 9, 10]
1 2 [2, 4, 4, 4, 5]
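For reference, a self-contained version using the question's data (dt as the frame name, as in the question):
import pandas as pd

dt = pd.DataFrame({'user': [1, 2, 1, 1, 2],
                   'vector': [[1, 2, 3, 4, 5], [1, 3, 2, 4, 5], [3, 3, 3, 4, 4],
                              [1, 2, 2, 1, 1], [1, 1, 2, 0, 0]]})
# Expand the lists into columns, sum element-wise per user,
# then collapse each row back into a single list
dt2 = (pd.DataFrame(dt['vector'].tolist())
         .groupby(dt['user']).sum()
         .agg(list, axis=1)
         .reset_index(name='vector'))
print(dt2)  # user 1 -> [5, 7, 8, 9, 10], user 2 -> [2, 4, 4, 4, 5]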

Python Rolling Mean of Dataframe row

So basically I just need advice on how to calculate a 24-month rolling mean over each row of a dataframe. Every row indicates a particular city, and the columns are the respective sales for each month. If anyone could help me figure this out, it would be much appreciated.
Edit: Clearly I failed to explain myself properly. I know that pandas has a rolling method built in. The problem is that I don't want to take the moving average of a single column; I want to take it across the columns of each row.
Sample Dataset
State - M1 - M2 - M3 - M4 - ..... - M48
UT - 40 - 20 - 30 - 60 -..... 60
CA - 30 - 60 - 20 - 40 -..... 70
So I want to find the rolling average for each state's most recent 24 months (M24-M48 columns)
What I've tried:
Data['24_Month_Moving_Average'] = Data.rolling(window=24, win_type='triang', min_periods=1, axis=1).mean()
error: Wrong number of items passed 139, placement implies 1
edit 2, Sample Dataset:
Data = pd.DataFrame({'M1': [1, 2], 'M2': [3, 5], 'M3': [5, 6]}, index=['UT', 'CA'])
# need code that will add a column that is the rolling 24-month average for each state
You can use rolling() with mean() and specify the parameters you want, window and min_periods, as follows:
df.col1.rolling(n, win_type='triang', min_periods=1).mean()
I don't know what your expected output should be, but here is a sample that uses apply() to generate the rolling mean for each row (make the state column the index of your dataframe first). Hope it helps:
import pandas as pd
df = pd.DataFrame({'B': [6, 1, 2, 20, 4], 'C': [1, 1, 2, 30, 4], 'D': [10, 1, 2, 5, 4]})
def test_roll(data):
    return data.rolling(window=2, win_type='triang', min_periods=1).mean()
print(df.apply(test_roll, axis=1))
pandas.DataFrame.rolling
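Note that the asker's original attempt failed because rolling(axis=1).mean() returns a whole dataframe (one column per month), which cannot be assigned to a single column; hence the "Wrong number of items passed" error. If what is wanted is one value per state, i.e. the mean of the most recent 24 monthly columns, a minimal sketch (window shrunk to 2 to fit the tiny sample; use 24 on the real data, assuming the columns are in chronological order):
import pandas as pd

Data = pd.DataFrame({'M1': [1, 2], 'M2': [3, 5], 'M3': [5, 6]}, index=['UT', 'CA'])
N = 2  # use N = 24 on the real 48-month data
# Mean of the last N monthly columns, computed row-wise
Data['Moving_Average'] = Data.iloc[:, -N:].mean(axis=1)
print(Data)  # UT -> 4.0, CA -> 5.5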

Pandas conditions across multiple series

Let's say I have some data like this:
category = pd.Series(np.ones(4))
job1_days = pd.Series([1, 2, 1, 2])
job1_time = pd.Series([30, 35, 50, 10])
job2_days = pd.Series([1, 3, 1, 3])
job2_time = pd.Series([10, 40, 60, 10])
job3_days = pd.Series([1, 2, 1, 3])
job3_time = pd.Series([30, 15, 50, 15])
Each entry represents an individual (so 4 people total). xxx_days represents the number of days an individual did something, and xxx_time represents the number of minutes spent doing that job on a single day.
I want to assign a 2 to category for an individual if, across all jobs, they spent at least 3 days of at least 20 minutes each. So, for example, person 1 does not meet the criteria because they only spent 2 total days with at least 20 minutes (their job 2 day count does not count toward the total because its time is < 20). Person 2 does meet the criteria, as they spent 5 total days (jobs 1 and 2).
After replacement, category should look like this:
[1, 2, 2, 1]
My current attempt requires a for loop that manually indexes into each series and calculates the total days where time is greater than 20. However, this approach doesn't scale well to my actual dataset. I haven't included the code here, as I'd like to approach it from a pandas perspective instead.
What's the most efficient way to do this in pandas? The thing that stumps me is checking conditions across multiple series and acting on the result after summing the days.
Put days and time in two data frames with column positions kept in correspondence, then do the calculation in a vectorized way:
import pandas as pd
time = pd.concat([job1_time, job2_time, job3_time], axis=1)
days = pd.concat([job1_days, job2_days, job3_days], axis=1)
((days * (time >= 20)).sum(1) >= 3) + 1
#0 1
#1 2
#2 2
#3 1
#dtype: int64
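To write the result back into category, the expression can simply be assigned (it already evaluates to the desired [1, 2, 2, 1]):
# 2 where an individual has at least 3 qualifying days, else 1
category = ((days * (time >= 20)).sum(axis=1) >= 3) + 1
print(category.tolist())  # [1, 2, 2, 1]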

Python: sorting a dictionary according to value? Please review my algorithm

I have some files to parse. Each has time info followed by a label and value for each label that was modified in that time frame. A very simple example is:
Time 1:00
a 1
b 1
c 2
d 4
Time 2:00
d 2
a 4
c 5
e 7
Time 3:00
c 3
Time 4:00
e 3
a 2
b 5
I need to put this into CSV file so I will plot afterwards. The CSV file should look like
Time, a, b, c, d, e
1:00, 1, 1, 2, 4, 0
2:00, 4, 1, 5, 4, 7
3:00, 4, 1, 3, 4, 7
4:00, 2, 5, 3, 4, 3
Also, I need to find the max value of each label so I can sort my graphs.
Since the max values are a:4, b:5, c:5, d:4, e:7, I would like to have a list such as:
['e', 'b', 'c', 'a', 'd']
What I am doing is going through the log once to read all the labels, since I don't know in advance which labels can occur.
Then I go through the whole file a second time to parse it. My algorithm is simply:
for label in labelList:
    currentValues[label] = 0
    maxValues[label] = 0
for line in content:
    if endOfCurrentTimeStamp:
        put_current_values_to_CSV()
    else:
        label, value = line.split()
        value = int(value)
        currentValues[label] = value
        if maxValues[label] < value:
            maxValues[label] = value
I have the max value of each label in the dictionary. What should I do next to get a list sorted from max to min value, as described above?
Also, let me know if you can think of an easier way to do the whole thing.
By the way, my data is big: an input file can easily be hundreds of megabytes with thousands of different labels, so every time I finish a time frame, I write the data to the CSV.
Regards
Dictionaries are by nature unsorted, so you will have to convert yours to a different data type.
This is probably a little inefficient, but you could try the following:
to_sort = []
for key in maxValues:
    to_sort.append((maxValues[key], key))
to_sort.sort()
A list of tuples sorts based on the first element of each tuple, if I'm not mistaken.
If the tuples won't sort, try using itemgetter.
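For what it's worth, sorted() with a key function gives the max-to-min label list in one step:
maxValues = {'a': 4, 'b': 5, 'c': 5, 'd': 4, 'e': 7}
# Sort the keys by their values, largest first; ties keep insertion order
labels = sorted(maxValues, key=maxValues.get, reverse=True)
print(labels)  # ['e', 'b', 'c', 'a', 'd']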
Use pandas once you've created your CSV. I'll emulate your file with StringIO; you'll feed read_csv a real file name:
import io
import pandas
df = pandas.read_csv(io.StringIO("""Time, a, b, c, d, e
1:00, 1, 1, 2, 4, 0
2:00, 4, 1, 5, 4, 7
3:00, 4, 1, 3, 4, 7
4:00, 2, 5, 3, 4, 3"""), index_col=0, skipinitialspace=True)
df.max().sort_values()
Output:
a 4
d 4
b 5
c 5
e 7
dtype: int64
Plotting is easy too:
df.plot()
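Both answers assume the CSV already exists. For the parsing step itself, here is a minimal sketch along the lines of the asker's two-pass algorithm, assuming the input format shown in the question (parse_log_to_csv and its paths are hypothetical names): labels are discovered on the first pass; on the second pass, values are carried forward and one CSV row is written per time frame, so memory stays bounded even for very large files:
import csv

def parse_log_to_csv(in_path, out_path):
    # First pass: discover every label that appears
    labels = set()
    with open(in_path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] != 'Time':
                labels.add(parts[0])
    labels = sorted(labels)

    # Second pass: carry values forward, write one CSV row per time frame
    current = {label: 0 for label in labels}
    max_values = {label: 0 for label in labels}
    with open(in_path) as f, open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['Time'] + labels)
        time = None
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == 'Time':
                if time is not None:  # flush the previous frame
                    writer.writerow([time] + [current[l] for l in labels])
                time = parts[1]
            else:
                label, value = parts[0], int(parts[1])
                current[label] = value
                max_values[label] = max(max_values[label], value)
        if time is not None:  # flush the last frame
            writer.writerow([time] + [current[l] for l in labels])
    return max_values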
