Count repeating events in pandas timeseries - python

I've been working with basic pandas for a few days but am struggling with my current task:
I have a (non-normalized) timeseries whose items each contain a user id per timestamp, so something like (date, userid, payload). Think of a server logfile in which I would like to find how many IPs return within a certain time period.
Now I want to find how many users have multiple items within an interval, for example 4 weeks. So it's more a sliding window than constant intervals on the t-axis.
So my approaches were:
df_users reindexed on userids,
or a multiindex?
Sadly, I didn't find a way to generate the results successfully.
So all in all I'm not sure how I realize that kind of search with Pandas, or maybe this is easier to implement in pure Python? Or do I just lack some keywords for that problem?

Some dummy data that I think fits your problem.
df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'time': ['2013-1-1', '2013-1-2', '2013-1-3',
                            '2013-1-1', '2013-1-5', '2013-1-7',
                            '2013-1-1', '2013-1-7', '2013-1-12']})
df['time'] = pd.to_datetime(df['time'])
This approach requires some kind of non-missing numeric column to count with, so just add a dummy one.
df['dummy_numeric'] = 1
My approach to the problem is this. First, group by the id and iterate so we are working with one user id's worth of data at a time. Next, resample the irregular data up to daily values so it is normalized.
Then count the number of observations in each X-day window (using 3 here) with a rolling count. This works because the upsampled data will be filled with NaN and not counted. Notice that only the numeric column is passed to the rolling count, and also note the use of double brackets (which selects a DataFrame rather than a Series). The original pd.rolling_count function has since been removed from pandas; .rolling(...).count() is the current spelling.
window_days = 3
ids = []
for _, df_gb in df.groupby('id'):
    df_gb = df_gb.set_index('time').resample('D').asfreq()
    df_gb = df_gb[['dummy_numeric']].rolling(window_days, min_periods=1).count().reset_index()
    ids.append(df_gb)
Combine all the data back together and mark the spans with more than one observation:
df_stack = pd.concat(ids, ignore_index=True)
df_stack['multiple_requests'] = (df_stack['dummy_numeric'] > 1).astype(int)
Then groupby and sum, and you should have the right answer.
df_stack.groupby('time')['multiple_requests'].sum()
Out[356]:
time
2013-01-01 0
2013-01-02 1
2013-01-03 1
2013-01-04 0
2013-01-05 0
2013-01-06 0
2013-01-07 1
2013-01-08 0
2013-01-09 0
2013-01-10 0
2013-01-11 0
2013-01-12 0
Name: multiple_requests, dtype: int32
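For the 4-week-style sliding window in the original question, newer pandas versions can also do the per-user count directly with a time-based rolling window, without upsampling to daily values first. A minimal sketch on the same dummy data (the 3-day window stands in for the 4-week one):

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'time': pd.to_datetime(['2013-1-1', '2013-1-2', '2013-1-3',
                                           '2013-1-1', '2013-1-5', '2013-1-7',
                                           '2013-1-1', '2013-1-7', '2013-1-12'])})
df['dummy_numeric'] = 1

# For each event, count that user's events in the trailing 3-day window.
counts = (df.set_index('time')
            .groupby('id')['dummy_numeric']
            .rolling('3D')
            .count())
```

counts is indexed by (id, time); comparing it against 1 flags the users with repeat events inside the window.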


Filling missing time values in a multi-indexed dataframe

Problem and what I want
I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10
Important note: the actual time column is of datetime type.
What I want is to be able to zero-order hold (forward fill) values for every sensor for any time steps where that sensor does not exist, and either set to zero or back fill any sensors that are not read at the earliest time steps. What I want is a dataframe that looks like it was read from:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10
Pandas attempts so far
I initialize my dataframe and set my indices:
df = pd.read_csv(filename, dtype=int)
df.set_index(['ID', 'time'], inplace=True)
I try to mess with things like:
filled = df.reindex(method='ffill')
or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".
I also tried:
df.update(df.groupby(level=0).ffill())
or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.
Numpy attempt so far
I have had some luck with numpy and non-integer indexing using something like:
data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[int(i)] for i in np.linspace(0, arr.shape[0] - 1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]
But this has two problems:
It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.
The application
Ultimately this is just the data-cleaning step for training a recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).
Thank you for your help!
Here is one way, using reindex and a categorical time column (the old astype('category', categories=...) call has been replaced by CategoricalDtype in current pandas):
df.time = df.time.astype(pd.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5]))
new_df = df.groupby('time', as_index=False).apply(lambda x: x.set_index('ID').reindex([0, 1, 2])).reset_index()
new_df['data'] = new_df.groupby('ID')['data'].ffill()
new_df.drop(columns='time').rename(columns={'level_0': 'time'})
Out[311]:
time ID data
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 1 0 4.0
4 1 1 2.0
5 1 2 5.0
6 2 0 6.0
7 2 1 2.0
8 2 2 7.0
9 3 0 6.0
10 3 1 2.0
11 3 2 8.0
12 4 0 6.0
13 4 1 2.0
14 4 2 8.0
15 5 0 6.0
16 5 1 9.0
17 5 2 10.0
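A related approach that stays with plain reindexing: build the full (ID, time) grid with MultiIndex.from_product and forward-fill per sensor. A sketch on the question's data; unlike the categorical version above, this grid only contains time steps that actually occur somewhere in the file:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2, 0, 2, 0, 2, 2, 1, 2],
                   'time': [0, 0, 0, 1, 1, 2, 2, 3, 5, 5],
                   'data': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Build the full (ID, time) grid from the values actually present, then
# forward-fill each sensor independently.
full = pd.MultiIndex.from_product([sorted(df['ID'].unique()),
                                   sorted(df['time'].unique())],
                                  names=['ID', 'time'])
out = (df.set_index(['ID', 'time'])
         .reindex(full)
         .groupby(level='ID')
         .ffill()
         .reset_index())
```

Every sensor here reports at time 0, so forward-filling suffices; add a .bfill() after the .ffill() if some sensor first appears later.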
You can keep a dictionary of last readings for each sensor. You'll have to pick some initial value; the most logical choice is probably to back-fill the earliest reading to earlier times. Once you've populated your last_reading dictionary, you can sort all the readings by time, update the dictionary for each reading, and then fill in rows according to the dictionary. So, after you have your last_reading dictionary initialized:
last_time = readings[0]['time']
for reading in readings:
    if reading['time'] > last_time:
        for ID in ID_list:
            df.loc[last_time, ID] = last_reading[ID]
        last_time = reading['time']
    last_reading[reading['ID']] = reading['data']
# The loop above doesn't write out the final time step,
# so handle that separately:
for ID in ID_list:
    df.loc[last_time, ID] = last_reading[ID]
This assumes that you have only one reading for each time/sensor pair, and that 'readings' a list of dictionaries sorted by time. It also assumes that df has the different sensors as columns and different times as index. Adjust the code as necessary if otherwise. You can also probably optimize it a bit more by updating a whole row at once instead of using a for loop, but I didn't want to deal with making sure I had the Pandas syntax right.
Looking at the application, though, you might want each cell in the dataframe to hold not a number but a tuple of the last value and the time it was read, so replace last_reading[reading['ID']] = reading['data'] with last_reading[reading['ID']] = [reading['data'], reading['time']]. Your neural net can then decide how to weight data based on how old it is.
I got this to work with the following, which I think is pretty general for any case like this where the time index for which you want to fill values is the second in a multi-index with two indices:
# Remove duplicate time indices (happens some in the dataset, pandas freaks out).
df = df[~df.index.duplicated(keep='first')]
# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill()) # first ZOH forward
df.update(df.bfill()) # now back fill values that are not seen at the beginning
# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()
This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this.
You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application.
I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.
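Putting the question's sample data through that unstack/ffill/bfill/stack recipe, as a self-contained sketch:

```python
from io import StringIO

import pandas as pd

csv = StringIO("""ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5
0,2,6
2,2,7
2,3,8
1,5,9
2,5,10""")

df = pd.read_csv(csv, dtype=int).set_index(['ID', 'time'])
df = df[~df.index.duplicated(keep='first')]  # drop duplicate (ID, time) pairs

wide = df.unstack(level=0)   # one column per sensor ID
wide = wide.ffill().bfill()  # zero-order hold, then back-fill leading gaps
clean = wide.stack(level=1).swaplevel().sort_index()
```

clean is indexed by (ID, time) and has one held-forward value for every sensor at every time step present in the file.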

Plot Count of Pandas Dataframe with Start_Date and End_Date

I am trying to plot a daily follower count for various twitter handles. The result being something like what you see below, but filterable by more than 1 twitter handle:
Usually, I would do this by simply appending a new dataset pulled from Twitter to the original table, with the date of the log being pulled. However, this would make me end up with a million lines in just a few days. And it wouldn't allow me to clearly see when a user has dropped off.
As an alternative, after pulling my data from Twitter, I structured my pandas dataframe like this:
Follower_ID Handles Start_Date End_Date
100 x 30/05/2017 NaN
101 x 21/04/2017 29/05/2017
201 y 14/06/2017 NaN
100 y 16/06/2017 28/06/2017
Where:
Handles: the accounts I am pulling the followers for
Follower_ID: the user following a handle
So, for example, if I were Follower_ID 100, I could follow both handle x and handle y.
I am wondering what would be the best way to prepare the data (pivot, clean through a function, groupby) so that then it can be plotted accordingly. Any ideas?
I ended up using iterrows in a naïve approach, so there could be a more efficient way that takes advantage of pandas reshaping, etc. But my idea was to make a function that takes in your dataframe and the handle you want to plot, and then returns another dataframe with that handle's daily follower counts. To do this, the function
filters the df to the desired handle only,
takes each date range (for example, 21/04/2017 to 29/05/2017),
turns that into a pandas date_range, and
puts all the dates in a single list.
At that point, collections.Counter on the single list is a simple way to tally up the results by day.
One note is that the null End_Dates should be coalesced to whatever end date you want on your graph. I call that the max_date when I wrangle the data. So altogether:
from io import StringIO
from collections import Counter
import pandas as pd
def get_counts(df, handle):
    """Inputs: your dataframe and the handle
    you want to plot.
    Returns a dataframe of daily follower counts.
    """
    # filter the df to the desired handle only
    df_handle = df[df['Handles'] == handle]
    all_dates = []
    for _, row in df_handle.iterrows():
        # Take each date range (for example, 21/04/2017 to 29/05/2017),
        # turn that into a pandas `date_range`, and
        # put all the dates in a single list
        all_dates.extend(pd.date_range(row['Start_Date'],
                                       row['End_Date']).tolist())
    counts = pd.DataFrame.from_dict(Counter(all_dates), orient='index') \
               .rename(columns={0: handle}) \
               .sort_index()
    return counts
That's the function. Now reading and wrangling your data ...
data = StringIO("""Follower_ID Handles Start_Date End_Date
100 x 30/05/2017 NaN
101 x 21/04/2017 29/05/2017
201 y 14/06/2017 NaN
100 y 16/06/2017 28/06/2017""")
df = pd.read_csv(data, sep=r'\s+')
# fill in missing end dates
max_date = pd.Timestamp('2017-06-30')
df['End_Date'] = df['End_Date'].fillna(max_date)
# pandas timestamps (so that we can use pd.date_range);
# dayfirst=True because the dates are dd/mm/yyyy
df['Start_Date'] = pd.to_datetime(df['Start_Date'], dayfirst=True)
df['End_Date'] = pd.to_datetime(df['End_Date'], dayfirst=True)
print(get_counts(df, 'y'))
The last line prints this for handle y:
y
2017-06-14 1
2017-06-15 1
2017-06-16 2
2017-06-17 2
2017-06-18 2
2017-06-19 2
2017-06-20 2
2017-06-21 2
2017-06-22 2
2017-06-23 2
2017-06-24 2
2017-06-25 2
2017-06-26 2
2017-06-27 2
2017-06-28 2
2017-06-29 1
2017-06-30 1
You can plot this dataframe with your preferred package.
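The iterrows approach above is easy to follow; a reshaping alternative is to expand each row's date range with explode (available since pandas 0.25) and tally with groupby. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Handles': ['x', 'x', 'y', 'y'],
                   'Start_Date': pd.to_datetime(['2017-05-30', '2017-04-21',
                                                 '2017-06-14', '2017-06-16']),
                   'End_Date': pd.to_datetime(['2017-06-30', '2017-05-29',
                                               '2017-06-30', '2017-06-28'])})

# One row per (handle, active day), then a size count per handle per day.
daily = (df.assign(date=[pd.date_range(s, e) for s, e in
                         zip(df['Start_Date'], df['End_Date'])])
           .explode('date')
           .groupby(['Handles', 'date'])
           .size())
```

daily.unstack('Handles') then gives one column per handle, ready to plot.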

Aligning and adding columns in multiple Pandas dataframes based on Date column

I have a number of dataframes all which contain columns labeled 'Date' and 'Cost' along with additional columns. I'd like to add the numerical data in the 'Cost' columns across the different frames based on lining up the dates in the 'Date' columns to provide a timeseries of total costs for each of the dates.
There are different numbers of rows in each of the dataframes.
This seems like something that Pandas should be well suited to doing, but I can't find a clean solution.
Any help appreciated!
Here are two of the dataframes:
df1:
Date Total Cost Funded Costs
0 2015-09-30 724824 940451
1 2015-10-31 757605 940451
2 2015-11-15 788051 940451
3 2015-11-30 809368 940451
df2:
Date Total Cost Funded Costs
0 2015-11-30 3022 60000
1 2016-01-15 3051 60000
I want to have the resulting dataframe have five rows (there are five different dates) and a single column with the total of the 'Total Cost' column from each of the dataframes. Initially I used the following:
totalFunding = df1['Total Cost'].values + df2['Total Cost'].values
This worked fine until there were different dates in each of the dataframes.
Thanks!
The solution posted below works great, except that I need to do this recursively as I have a number of data frames. I created the following function:
def addDataFrames(f_arg, *argv):
    dfTotal = f_arg
    for arg in argv:
        dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value=0)
    return dfTotal
Which works fine when adding the first two dataframes. However, the addition method appears to convert my Date column into an index in the resulting sum and therefore subsequent passes through the function fail. Here is what dfTotal looks like after the first two data frames are added together:
Total Cost Funded Costs Remaining Cost Total Employee Hours
Date
2015-09-30 1449648 1880902 431254 7410.6
2015-10-31 1515210 1880902 365692 7874.4
2015-11-15 1576102 1880902 304800 8367.2
2015-11-30 1618736 1880902 262166 8578.0
2015-12-15 1671462 1880902 209440 8945.2
2015-12-31 1721840 1880902 159062 9161.2
2016-01-15 1764894 1880902 116008 9495.0
Note that what was originally a column in the dataframe called 'Date' is now listed as the index causing df.set_index('Date') to generate an error on subsequent passes through my function.
DataFrame.add does exactly what you're looking for; it matches the DataFrames based on index, so:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)
should do the trick. If you just want the Total Cost column and you want it as a DataFrame:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)[['Total Cost']]
See also the documentation for DataFrame.add at:
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.add.html
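As a quick check of the fill_value behavior, here is a minimal sketch with the two frames from the question reduced to their Total Cost columns; dates present in only one frame keep their own value, because the missing side is treated as 0:

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['2015-09-30', '2015-10-31', '2015-11-15', '2015-11-30'],
                    'Total Cost': [724824, 757605, 788051, 809368]})
df2 = pd.DataFrame({'Date': ['2015-11-30', '2016-01-15'],
                    'Total Cost': [3022, 3051]})

# Align on Date; only 2015-11-30 appears in both frames and is truly summed.
summed = df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)
```

The result has five rows, one per distinct date, matching the question's expectation.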
Solution found. As mentioned, the add method converted the 'Date' column into the dataframe index. This was resolved using:
dfTotal['Date'] = dfTotal.index
The complete function is then:
def addDataFrames(f_arg, *argv):
    dfTotal = f_arg
    for arg in argv:
        dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value=0)
        dfTotal['Date'] = dfTotal.index
    return dfTotal
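An alternative that sidesteps the index juggling entirely is to concatenate however many frames there are and aggregate once with groupby. This is a sketch, not the poster's code; the addDataFrames name is reused from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['2015-09-30', '2015-10-31', '2015-11-15', '2015-11-30'],
                    'Total Cost': [724824, 757605, 788051, 809368]})
df2 = pd.DataFrame({'Date': ['2015-11-30', '2016-01-15'],
                    'Total Cost': [3022, 3051]})

def addDataFrames(*frames):
    # Stack all frames, then sum every numeric column per date;
    # 'Date' stays a regular column thanks to as_index=False.
    return pd.concat(frames).groupby('Date', as_index=False).sum()

total = addDataFrames(df1, df2)
```

Because no frame is ever re-indexed in place, the function can be called on any number of frames without the Date-column/index round trip.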

Search in pandas dataframe

Potentially a slightly misleading title but the problem is this:
I have a large dataframe with multiple columns. This looks a bit like
df =
id date value
A 01-01-2015 1.0
A 03-01-2015 1.2
...
B 01-01-2015 0.8
B 02-01-2015 0.8
...
What I want to do is, within each of the IDs, identify the date one week earlier and place the value from that date into e.g. a 'lagvalue' column. The problem comes with not all dates existing for all ids, so a simple .shift(7) won't pull the correct value (in that case I guess I should put in a NaN).
I can do this with a lot of horrible iterating over the dates and ids to find the value, for example some rough idea
[
    df[
        df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1)
    ][
        df['id'] == df['id'].iloc[i]
    ]['value']
    for i in range(len(df.index))
]
but I'm certain there is a 'better' way to do it that cuts down on time and processing that I just can't think of right now.
I could write a function using a groupby on the id and then look within that and I'm certain that would reduce the time it would take to perform the operation - is there a much quicker, simpler way [aka am I having a dim day]?
Basic strategy is, for each id, to:
Use date index
Use reindex to expand the data to include all dates
Use shift to shift 7 spots
Use ffill to do last value interpolation. I'm not sure if you want this, or possibly bfill which will use the last value less than a week in the past. But simple to change. Alternatively, if you want NaN when not available 7 days in the past, you can just remove the *fill completely.
Drop unneeded data
This algorithm gives NaN when the lag is too far in the past.
There are a few assumptions here. In particular that the dates are unique inside each id and they are sorted. If not sorted, then use sort_values to sort by id and date. If there are duplicate dates, then some rules will be needed to resolve which values to use.
import pandas as pd
import numpy as np

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})
dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})
df = pd.concat([A, B])

with_lags = []
for id, group in df.groupby('id'):
    group = group.set_index(group.date)
    index = group.index
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()
    group['lag_value'] = group.value.shift(7)
    group = group.loc[index]
    with_lags.append(group)
with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])
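A different way to get the NaN-when-missing behavior without any reindexing is a self-merge: shift each row's date forward by a week and join it back as the lag lookup table. A sketch (column names taken from the question; it assumes (id, date) pairs are unique):

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B'],
                   'date': pd.to_datetime(['2015-01-01', '2015-01-03', '2015-01-08',
                                           '2015-01-01', '2015-01-02']),
                   'value': [1.0, 1.2, 1.5, 0.8, 0.8]})

# Each row, re-dated one week later, is the "value seven days ago" for that date.
lookup = (df.rename(columns={'value': 'lagvalue'})
            .assign(date=lambda d: d['date'] + pd.Timedelta(weeks=1)))
out = df.merge(lookup, on=['id', 'date'], how='left')
```

Rows with no observation exactly seven days earlier get NaN in lagvalue, which is exactly the fallback the question asks for.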

A Multi-Index Construction for Intraday TimeSeries (10 min price data)

I have a file with intraday prices every ten minutes. [0:41] times in a day. Each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten-minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded, data):
    '''This function accepts a column called rounded from 'data'
    The 2nd input 'data' is a dataframe
    '''
    df = rounded.shift(1)
    idf = data.set_index(['date', 'time'])
    data['diff'] = ['000']
    for i in range(0, len(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1] != 1620:
                    data['diff'] = rounded[i] - df[i]
                else:
                    day += 1
                    time += 2
    data[['date', 'time', 'price', 'II', 'diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data = pd.read_csv('file.csv')
rounded = roundSeries(data['price'], 5)
diffSeries(rounded, data)
On the traceback - I get an Assertion Error.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time

# Create date range with 10 minute intervals, and filter out irrelevant times
times = pd.bdate_range(start=datetime(2012, 11, 14), end=datetime(2012, 11, 17), freq='10T')
filtered_times = [x for x in times if time(9, 30) <= x.time() <= time(16, 20)]
prices = randn(len(filtered_times))

# Create MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
prices
date time
2012-11-14 09:30:00 0.696054
09:40:00 -1.263852
09:50:00 0.196662
10:00:00 -0.942375
10:10:00 1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
prices
date time
2012-11-14 09:30:00 NaN
09:40:00 -1.959906
09:50:00 1.460514
10:00:00 -1.139036
10:10:00 2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
You might want to consider using a plain Series with a DatetimeIndex (what older pandas called a TimeSeries) instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it and perform a similar function to get the differences:
from datetime import datetime

dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date, i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See here for more about time series in pandas.
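In current pandas, the same per-day difference can also be written with the built-in diff, grouping on the date level of the MultiIndex, which makes the "no overnight differences" rule explicit. A small sketch with made-up prices:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('2012-11-14', '09:30'), ('2012-11-14', '09:40'), ('2012-11-14', '09:50'),
     ('2012-11-15', '09:30'), ('2012-11-15', '09:40')],
    names=['date', 'time'])
data = pd.DataFrame({'prices': [1.0, 2.5, 2.0, 4.0, 3.0]}, index=idx)

# diff() restarts inside each date: the first row of every day is NaN,
# so no difference ever spans the 16:20 -> 09:30 overnight gap.
diffs = data.groupby(level='date')['prices'].diff()
```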
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007-8/30/2012 for 42 times a day (10-minute intervals). Observe the data cleaning issues.
Here the data of prices coming from the csv is length: 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have the full times as bdate_time will generate (half days in the market, etc, holidays).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second suggestion, the TimeSeries, seems to still require construction of a datetime index similar to the first one. For example, if I were to use the following lines to get the actual data of interest:
DF = pd.read_csv('MR10min.csv')
data = DF.set_index(['date', 'time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_time array completely informed by the actual data available?
Thank you to @MattiJohn and to anyone with interest in continuing this discussion.
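One way to make the index "completely informed by the actual data" is to skip bdate_range altogether and build the DatetimeIndex from the CSV's own date and time strings before setting any multi-index. A sketch; the column formats below are hypothetical stand-ins for whatever MR10min.csv actually contains:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('MR10min.csv'): date/time arrive as strings.
DF = pd.DataFrame({'date': ['2007-01-03', '2007-01-03', '2007-01-04'],
                   'time': ['09:30', '09:40', '09:30'],
                   'price': [100.0, 100.5, 101.0]})

# Combine the string columns; only timestamps that really occur end up in the index,
# so half days and holidays cause no mismatch.
DF.index = pd.to_datetime(DF['date'] + ' ' + DF['time'])
diffs = DF.groupby(DF.index.date)['price'].diff()
```

Because the index is derived from the data, the 62,034 real rows never have to be reconciled with an a-priori 86,772-slot grid.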
