I have a time series with irregularly spaced samples. To get useful data out of it, I need to find 10-minute periods with roughly evenly spaced samples (which I have defined as: the average timedelta between two consecutive samples is less than 20s).
Example Data:
(For the sake of this example, I will make it 10s intervals with avg 2s deltas.)
| timestamp | speed |
| ------------------- | ----- |
| 2010-01-01 09:20:12 | 10 |
| 2010-01-01 09:20:14 | 14 |
| 2010-01-01 09:20:16 | 12 |
| 2010-01-01 09:20:27 | 18 |
| 2010-01-01 09:20:28 | 19 |
| 2010-01-01 09:20:29 | 19 |
The result I am hoping for is a grouping like the following. Note that the second group does not get included because the samples are bunched together at the end of its 10s period (27, 28, 29), which means an implicit leading gap of 7s and therefore an average delta of 3s.
| timestamp | avg | std | std_over_avg |
| ------------------- | --- | ---- | ------------ |
| 2010-01-01 09:20:10 | 12 | 1.63 | 0.136 |
EDIT:
I think I was combining multiple things in my question (and some incorrectly) so I would like to correct/clarify what I am looking for.
Referring back to the example data, I would like to group it into irregular periods of 10s; that is, if there is a gap in the data, the next 10s period should start from the timestamp of the next viable record. (Please ignore the previous mention of evenly spaced samples; it turns out I misinterpreted that requirement, and I can always filter on it at a later stage if need be.) So I would want something like this:
| period | count | avg | std | std_over_avg |
| ----------------------------------------- | ----- | ---- | ----- | ------------ |
| 2010-01-01 09:20:12 - 2010-01-01 09:20:22 | 3 | 12 | 1.63 | 0.136 |
| 2010-01-01 09:20:27 - 2010-01-01 09:20:37 | 3 | 18.6 | 0.577 | 0.031 |
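To make the intended period logic concrete, here is a minimal sketch (assuming the timestamps are the sorted DatetimeIndex of df and speed is a column; the exact stats call is only illustrative):

import pandas as pd

period_starts = []
current_start = None
for ts in df.index:
    # a new period starts at the first record outside the current 10s window
    if current_start is None or ts >= current_start + pd.Timedelta(seconds=10):
        current_start = ts
    period_starts.append(current_start)

df["period"] = period_starts
stats = df.groupby("period")["speed"].agg(["count", "mean", "std"])
stats["std_over_avg"] = stats["std"] / stats["mean"]
print(stats)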
I have found a method for achieving most of what I wanted but it is ugly and slow. Hopefully someone can use this as a starting point to develop something more useful:
import datetime

group_num = 0
cached_future_time = None

def group_by_time(df, ind):
    global group_num
    global cached_future_time
    curr_time = ind
    future_time = ind + datetime.timedelta(minutes=10)
    # Assume records are sorted chronologically ascending for this to work.
    end = df.index.get_loc(future_time, method='pad')
    start = df.index.get_loc(curr_time)
    num_records = end - start
    if cached_future_time is not None and curr_time < cached_future_time:
        pass
    elif cached_future_time is not None and curr_time >= cached_future_time:
        group_num += 1
        # Only move the cached_future_time mark forward if we have enough
        # data points to make this group useful.
        if num_records >= 30:
            cached_future_time = future_time
    elif cached_future_time is None:
        cached_future_time = future_time
    return group_num

grp = df.groupby(lambda x: group_by_time(df, x))
Edit:
Ok, I found a much more pandas-like way to do this which is also significantly faster than the ugly loop above. My downfall in the attempt above was thinking that I needed to do most of the work of calculating the groups inside the groupby function (and that there wasn't a way to apply such a calculation across all the rows intelligently).
# Add 10min to our timestamp and shift the values in that column 30 records
# into the future. We can then find all the timestamps that are 30 records
# newer but still within 10min of the original timestamp (ensuring that we
# have a 10min group with at least 30 records).
records["future"] = records["timestamp"] + datetime.timedelta(minutes=10)
# group_num needs to exist (all null) before the filter below picks unassigned rows.
records["group_num"] = None
starts = list(records[(records["timestamp"] <= records.future.shift(30)) & records.group_num.isnull()].index)

group_num = 1
# For each of those starting timestamps, grab a slice up to 10min in the future
# and apply a group number.
for start in starts:
    window = records.loc[start:start + datetime.timedelta(minutes=10), 'group_num']
    unassigned = window[window.isnull()].index
    if len(unassigned) >= 30:
        # Only apply group_num to null values so that we get disjoint groups (no overlaps).
        records.loc[unassigned, 'group_num'] = group_num
        group_num += 1
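From there, the summary table from the edit (count, avg, std, std_over_avg) is just an aggregation over the new column; a minimal sketch, assuming records still has the speed column from the example:

grouped = records.dropna(subset=["group_num"]).groupby("group_num")
summary = grouped.agg(
    period_start=("timestamp", "min"),
    count=("speed", "size"),
    avg=("speed", "mean"),
    std=("speed", "std"),
)
summary["std_over_avg"] = summary["std"] / summary["avg"]
print(summary)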
Related
I have a dataframe which contains Dates, Visitor_ID and Pages columns. The Pages column has multiple row-wise entries for each date. Please refer to the table below to understand the data.
| Dates | Visitor_ID | Pages |
|:--------- |:---------:| ---------: |
| 10/1/2021 | 1 | xy |
| 10/1/2021 | 1 | step2 |
| 10/1/2021 | 1 | xx |
| 10/1/2021 | 1 | NetBanking |
| 10/1/2021 | 2 | step1 |
| 10/1/2021 | 2 | xy |
| 10/1/2021 | 3 | step1 |
| 10/1/2021 | 3 | NetBanking |
| 11/1/2021 | 4 | step1 |
| 12/1/2021 | 4 | NetBanking |
Desired output:
| Dates | Visitor_ID |
|:--------- |:---------:|
| 10/1/2021 | 1 |
| 10/1/2021 | 3 |
The output should be a subset of the actual data, with the condition that if, for the same Visitor_ID on the same date, a page containing the string "step" appears before a page containing the string "NetBanking", then that Visitor_ID is returned.
To initialise your dataframe you could do:
import pandas as pd
columns = ["Dates", "Visitor_ID", "Pages"]
records = [
    ["10/1/2021", 1, "xy"],
    ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"],
    ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"],
    ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"],
    ["12/1/2021", 4, "NetBanking"],
]
data = pd.DataFrame.from_records(records, columns=columns)
data["Dates"] = pd.DatetimeIndex(data["Dates"])
index_names = columns[:2]
data.set_index(index_names, drop=True, inplace=True)
Note that I have left out the third line of your records, as otherwise I cannot reproduce your desired output. I have made this a multi-index dataframe in order to easily loop over the 'date/visitor' groups. The structure of the dataframe looks like:
print(data)
Pages
Dates Visitor_ID
2021-10-01 1 xy
1 step2
1 NetBanking
2 step1
2 xy
3 step1
3 NetBanking
2021-11-01 4 step1
2021-12-01 4 NetBanking
Now to select the customers from the same date and from the same group, I am going to loop over these groups and use 2 masks to select the required records:
for date_time, data_per_date in data.groupby(level=0):
    for visitor, data_per_visitor in data_per_date.groupby(level=1):
        # select the column with the Pages
        pages = data_per_visitor["Pages"].str
        # make 2 boolean masks, for the records with step and netbanking
        has_step = pages.contains("step")
        has_netbanking = pages.contains("NetBanking")
        # to get the records after each 'step' record, apply a diff on 'has_step'.
        # Convert to int first for the correct result; each diff with outcome -1
        # fulfills this requirement. Make a mask based on this requirement.
        diff_step = has_step.astype(int).diff()
        records_after_step = diff_step == -1
        # combine the 2 masks to create your final mask to make a selection
        mask = records_after_step & has_netbanking
        # select the records and print to screen
        selection = data_per_visitor[mask]
        if not selection.empty:
            print(selection.reset_index()[index_names])
This gives the following matches (each date/visitor group is printed as its own small frame; combined, they are):
Dates Visitor_ID
0 2021-10-01 1
1 2021-10-01 3
EDIT:
I was reading your question again. The solution above assumes that only a 'NetBanking' record directly following a 'step' record is valid. That is why I thought your example input did not correspond with your desired output. However, if you do allow rows in between an occurrence of 'step' and the first 'NetBanking', the solution does not work. In that case, it is better to explicitly iterate over the rows of your dataframe per date and visitor id. An example would be:
for date_time, data_per_date in data.groupby(level=0):
    for visitor, data_per_visitor in data_per_date.groupby(level=1):
        after_step = False
        index_selection = list()
        data_per_visitor.reset_index(inplace=True)
        for index, record in data_per_visitor.iterrows():
            page = record["Pages"]
            if "step" in page and not after_step:
                after_step = True
            if "NetBanking" in page and after_step:
                index_selection.append(index)
                after_step = False
        selection = data_per_visitor.reindex(index_selection)
        if not selection.empty:
            print(selection.reset_index()[index_names])
Normally I would not recommend using 'iterrows' as it is really slow, but in this case I don't see another easy solution. The output of the second algorithm is the same as that of the first for my data. If you do include the third line from your example data, the second algorithm still gives the same output.
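For completeness, here is a vectorised sketch of the 'rows in between are allowed' variant (my own addition, reusing the multi-index frame data and index_names built above): a cumulative maximum over the 'step' mask marks every row from the first 'step' onwards within each date/visitor group, and combining that with the 'NetBanking' mask flags the matching visitors:

pages = data["Pages"].str.lower()
has_step = pages.str.contains("step")
has_netbanking = pages.str.contains("netbanking")

# within each (date, visitor) group: True from the first 'step' row onwards
seen_step = has_step.astype(int).groupby(level=[0, 1]).cummax().astype(bool)

# any 'NetBanking' row at or after a 'step' row, per group
hit = (seen_step & has_netbanking).groupby(level=[0, 1]).any()

print(hit[hit].reset_index()[index_names])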
I'm trying to create a column where I sum the previous x rows of another column, where x is given by the parm value of that row.
I have a solution, but it's really slow, so I was wondering if anyone could help me do this a lot faster.
| time | price |parm |
|--------------------------|------------|-----|
|2020-11-04 00:00:00+00:00 | 1.17600 | 1 |
|2020-11-04 00:01:00+00:00 | 1.17503 | 2 |
|2020-11-04 00:02:00+00:00 | 1.17341 | 3 |
|2020-11-04 00:03:00+00:00 | 1.17352 | 2 |
|2020-11-04 00:04:00+00:00 | 1.17422 | 3 |
And here is the slow, slow code:
import numpy as np

#jit
def rolling_sum(x, w):
    return np.convolve(x, np.ones(w, dtype=int), 'valid')

#jit
def rol(x, y):
    for i in range(len(x)):
        res[i] = rolling_sum(x, y[i])[0]
    return res

dfa = df[:500000]
res = np.empty(len(dfa))
r = rol(dfa.l_x.values, abs(dfa.mb).values + 1)
r
Maybe something like this could work. I have made up an example with to_be_summed being the column of values that should be summed up and lookback holding the number of rows to look back:
df = pd.DataFrame({"to_be_summed": range(10), "lookback":[0,1,2,3,2,1,4,2,1,2]})
summed = df.to_be_summed.cumsum()
result = [summed[i] - summed[max(0,i - lookback - 1)] for i, lookback in enumerate(df.lookback)]
What I did here is to first do a cumsum over the column that should be summed up. Now, for the i-th entry I can take the entry of this cumsum and subtract the one lookback + 1 steps back. Note that this includes the i-th value in the sum; if you don't want to include it, you just have to change summed[i] to summed[i - 1]. Also note that the if i - lookback - 1 >= 0 else 0 part keeps you from accidentally looking back past the start of the frame (when the window reaches the start, the whole prefix is summed).
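Applied to the frame from the question (assuming parm means "this row plus the parm previous rows", mirroring the abs(...) + 1 in the question's code), the same idea would look like:

import pandas as pd

df = pd.DataFrame({
    "price": [1.17600, 1.17503, 1.17341, 1.17352, 1.17422],
    "parm": [1, 2, 3, 2, 3],
})

summed = df["price"].cumsum()
df["rolling_sum"] = [
    summed[i] - (summed[i - n - 1] if i - n - 1 >= 0 else 0)
    for i, n in enumerate(df["parm"])
]
print(df)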
I have a simple data frame which might look like this:
| Label | Average BR_1 | Average BR_2 | Average BR_3 | Average BR_4 |
| ------- | ------------ | ------------ | ------------ | ------------ |
| Label 1 | 50 | 30 | 50 | 50 |
| Label 2 | 60 | 20 | 50 | 50 |
| Label 3 | 65 | 50 | 50 | 50 |
What I would like to be able to do is to add a % symbol to the values in every one of these columns.
I know that I can do something like this for every column:
df['Average BR_1'] = df['Average BR_1'].astype(str) + '%'
However, the problem is that I read the data in from a CSV file which might contain more of these columns, so instead of Average BR_1 to Average BR_4, it might contain, say, Average BR_1 to Average BR_10.
So I would like this change to happen automatically for every column which contains Average BR_ in its column name.
I have been reading about .loc, but I only managed to change the column values to an entirely new value, like so:
df.loc[:, ['Average BR_1', 'Average BR_2']] = "Hello"
Also, I haven't yet been able to implement regex here.
I tried with a list:
colsArr = [c for c in df.columns if 'Average BR_' in c]
print(colsArr)
But I did not manage to implement this with .loc.
I suppose I could do this using a loop, but I feel like there must be a better pandas solution; I just cannot figure it out.
Could you help and point me in the right direction?
Thank you
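For what it's worth, the colsArr list from the question can be used for the assignment directly (a minimal sketch, assuming df is the frame read from the CSV):

colsArr = [c for c in df.columns if 'Average BR_' in c]
df[colsArr] = df[colsArr].astype(str) + '%'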
# extract the column names that need to be updated
cols = df.columns[df.columns.str.startswith('Average BR')]
# update the columns
df[cols] = df[cols].astype(str).add('%')
print(df)
Label Average BR_1 Average BR_2 Average BR_3 Average BR_4
0 Label 1 50% 30% 50% 50%
1 Label 2 60% 20% 50% 50%
2 Label 3 65% 50% 50% 50%
You can use df.update and df.filter
df.update(df.filter(like='Average BR_').astype('str').add('%'))
df
Out:
Label Average BR_1 Average BR_2 Average BR_3 Average BR_4
0 Label 1 50% 30% 50% 50%
1 Label 2 60% 20% 50% 50%
2 Label 3 65% 50% 50% 50%
I have a dataframe with various events (id) and the following structure; the df is grouped by id and sorted on timestamp:
id | timestamp | A | B
1 | 02-05-2016|bla|bla
1 | 04-05-2016|bla|bla
1 | 05-05-2016|bla|bla
2 | 11-02-2015|bla|bla
2 | 14-02-2015|bla|bla
2 | 18-02-2015|bla|bla
2 | 31-03-2015|bla|bla
3 | 02-08-2016|bla|bla
3 | 07-08-2016|bla|bla
3 | 27-09-2016|bla|bla
Each timestamp-id combo indicates a different stage in the process of the event with that particular id. Each new record for a specific id indicates the start of a new stage for that event-id.
I would like to add a new column Duration that calculates the duration of each stage for each event (see the desired df below). This is easy, as I can simply calculate the difference between the timestamp of the next stage for the same event id and the timestamp of the current stage, as follows:
df['Start'] = pd.to_datetime(df['timestamp'])
df['End'] = pd.to_datetime(df['timestamp'].shift(-1))
df['Duration'] = df['End'] - df['Start']
My problem appears on the last stage of each event id, where I want to simply display NaNs or dashes, as the stage has not finished yet and the end time is unknown. My solution simply takes the timestamp of the next row, which is not always correct, as it might belong to a completely different event.
Desired output:
id | timestamp | A | B | Duration
1 | 02-05-2016|bla|bla| 2 days
1 | 04-05-2016|bla|bla| 1 days
1 | 05-05-2016|bla|bla| ------
2 | 11-02-2015|bla|bla| 3 days
2 | 14-02-2015|bla|bla| 4 days
2 | 18-02-2015|bla|bla| 41 days
2 | 31-03-2015|bla|bla| -------
3 | 02-08-2016|bla|bla| 5 days
3 | 07-08-2016|bla|bla| 51 days
3 | 27-09-2016|bla|bla| -------
I think this does what you want:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Duration'] = df.groupby('id')['timestamp'].diff().shift(-1)
If I understand correctly: groupby('id') tells pandas to apply .diff() within each group as if it were a miniature DataFrame independent of the other rows, and .shift(-1) then moves each duration up one row so that it sits on the row where that stage started (each group's first diff is NaT, so after the shift the last row of each group correctly ends up with NaT). I tested it on this fake data:
import pandas as pd
import numpy as np
# Generate some fake data
df = pd.DataFrame()
df['id'] = [1]*5 + [2]*3 + [3]*4
df['timestamp'] = pd.to_datetime('2017-01-1')
duration = sorted(np.random.randint(30,size=len(df)))
df['timestamp'] += pd.to_timedelta(duration, unit='d')
df['A'] = 'spam'
df['B'] = 'eggs'
but double-check just to be sure I didn't make a mistake!
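After re-running the two Duration lines from the start of this answer on the fake frame, a quick sanity check (my addition, not part of the original answer) is that the last stage of every id is left open:

# the last stage of every id should have no Duration (NaT)
print(df.groupby('id').tail(1)['Duration'].isna().all())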
Here is one approach using apply
def timediff(group):
    # each group holds the rows of one id; parse the timestamps and take the
    # within-group difference, shifted up one row so the last stage stays NaT
    ts = pd.to_datetime(group['timestamp'], format='%d-%m-%Y')
    return ts.diff().shift(-1)

res = df.assign(duration=df.groupby('id', group_keys=False).apply(timediff))
Output:
id timestamp duration
0 1 02-05-2016 2 days
1 1 04-05-2016 1 days
2 1 05-05-2016 NaT
3 2 11-02-2015 3 days
4 2 14-02-2015 4 days
5 2 18-02-2015 41 days
6 2 31-03-2015 NaT
7 3 02-08-2016 5 days
8 3 07-08-2016 51 days
9 3 27-09-2016 NaT
I have a dataset like this:
Policy | Customer | Employee | CoverageDate | LapseDate
123 | 1234 | 1234 | 2011-06-01 | 2015-12-31
124 | 1234 | 1234 | 2016-01-01 | ?
125 | 1234 | 1234 | 2011-06-01 | 2012-01-01
124 | 5678 | 5555 | 2014-01-01 | ?
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
Policy | Customer | Employee
123 | 1234 | 1234
because policy 123's lapse date was within 5 days of policy 124's covered date.
So far, I've used this code:
import pandas
import datetime

# Pull in data from query
wd = pandas.read_csv('DATA')
wd = wd.set_index('Policy#')
wd = wd.rename(columns={'Policy#': 'Policy'})

Resultlist = []
for EMPID in wd.groupby(['EMPID', 'Customer']):
    for Policy in wd.groupby(['EMPID', 'Customer']):
        EffDate = pandas.to_datetime(wd['CoverageEffDate'])
        for Policy in wd.groupby(['EMPID', 'Customer']):
            check = wd['LapseDate'].astype(str)
            if check.any() == '?':  # here lies the problem - it's evaluating if ANY of the items == '?'
                print(check)
                continue
            else:
                LapseDate = pandas.to_datetime(wd['LapseDate']) + datetime.timedelta(days=5)
                if EffDate < LapseDate:
                    Resultlist.append(wd['Policy', 'Customer'])
print(Resultlist)
I'm trying to use the pandas .any() function to evaluate if the current row is a '?' (which means null data, i.e. the policy hasn't lapsed). However, it appears that this statement just evaluates if there is a '?' row in the entire column, not the current row. I need to determine this because if I compare the '?' value against a date I get an error.
Is there a way to reference just the row I'm iterating on for a conditional check? To my knowledge, I can't use the pandas apply function technique because I need each employee's policy data compared against any other policies they hold.
Thank you!
check.str.contains('?', regex=False) would return a boolean array showing which entries have a '?' in them (regex=False is needed because ? is a regex metacharacter). Otherwise you might consider just iterating through, i.e.
check = wd['LapseDate'].astype(str)
for row in check:
    if row == '?':
        print(check)
but there's really no practical difference between checking whether any entry matches and then acting on it, and iterating through all the entries and acting when you hit a match.
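As a sketch of the mask-based route (my addition; it assumes the wd frame and column names from the question): drop the open-ended '?' rows first, then the remaining lapse dates can be parsed and compared against other dates without errors:

import pandas
import datetime

check = wd['LapseDate'].astype(str)
has_lapsed = check != '?'          # a row-wise mask, not a whole-column .any()

lapsed = wd[has_lapsed].copy()
lapsed['LapseDate'] = pandas.to_datetime(lapsed['LapseDate']) + datetime.timedelta(days=5)
# lapsed['LapseDate'] can now be compared row by row against CoverageEffDate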