I am looking for a way to check the frequency of dates in a column. The dates occur roughly every week, but sometimes there is a gap of 2 or 3 weeks, and the pd.infer_freq method returns NaN.
My data:
2022-01-01
2022-01-08
2022-01-23
2022-01-30
Your sample data is too small for pd.infer_freq to infer a frequency. You could instead find the most common time difference between consecutive dates and use that as the frequency:
import pandas as pd
dates = pd.to_datetime(["2022-01-01", "2022-01-08", "2022-01-23", "2022-01-30"])
s = pd.Series(dates)
print((s - s.shift(1)).mode())
Output
0 7 days
dtype: timedelta64[ns]
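If it helps, the modal spacing can then be used to build a regular DatetimeIndex over the observed range. A minimal sketch continuing from the snippet above (the names step and full_index are just illustrative):
step = s.diff().mode()[0]  # Timedelta('7 days')
full_index = pd.date_range(dates.min(), dates.max(), freq=step)
# full_index: 2022-01-01, 2022-01-08, 2022-01-15, 2022-01-22, 2022-01-29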
I have a df that looks like this, it contains frequencies recorded at some specific time and place.
Time Latitude Longitude frequency
0 2022-07-07 00:47:49 31.404463 73.117654 -88.599998
1 2022-07-09 00:13:13 31.442087 73.051086 -88.400002
2 2022-07-13 14:25:45 31.433669 73.118194 -87.500000
3 2022-07-13 17:50:53 31.411087 73.094298 -90.199997
4 2022-07-13 17:50:55 31.411278 73.094554 -89.000000
5 2022-07-14 10:49:13 31.395443 73.108911 -88.000000
6 2022-07-14 10:49:15 31.395436 73.108902 -87.699997
7 2022-07-14 10:49:19 31.395379 73.108847 -87.300003
8 2022-07-14 10:50:29 31.393905 73.107315 -88.000000
9 2022-07-14 10:50:31 31.393879 73.107283 -89.000000
10 2022-07-14 10:50:33 31.393858 73.107265 -89.800003
I want to group all the rows which are just 2 seconds apart (like there are 3 rows index 5-7 which have a time difference of just 2 seconds). Similarly, index 8-10 also have the same difference and I want to place them in a separate group and keep only these unique groups.
So far I have tried this:
df.groupby([pd.Grouper(key='Time', freq='25S')]).frequency.count()
It helps a little, but I have to manually specify the time window in which to look for close timestamps. In my case there is no fixed interval: there can be 50 or more consecutive rows, each 2 seconds apart, spanning the next two minutes. I just want to keep all of these rows in a single group.
My solution is to create a column Group that labels consecutive rows for which the time difference is small.
First sort the column Time (if necessary): df = df.sort_values('Time').
Now create the groups:
n = 2  # maximum gap in seconds within a group
df['Group'] = df.Time.diff().dt.total_seconds().gt(n).cumsum()  # total_seconds() also handles gaps longer than a day
Now you can do
df.groupby('Group').frequency.count()
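If you then want to keep only one row per group (the "unique groups" the question mentions), something along these lines could work. This is a sketch building on the Group column above; whether to keep the first row or aggregate each group is an assumption on my part:
representatives = df.sort_values('Time').drop_duplicates('Group')  # first row of each burst
summary = df.groupby('Group').agg(
    start=('Time', 'first'),            # when the burst began
    rows=('frequency', 'size'),         # how many rows it contains
    mean_frequency=('frequency', 'mean'),
)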
I have a dataset:
ride_completion_time ride_id
0 2022-08-27 11:42:02 1
1 2022-08-24 05:59:26 2
2 2022-08-23 17:40:05 3
3 2022-08-28 23:06:01 4
4 2022-08-27 03:21:29 5
I would like to find out, on average, how many rides are completed in a 4-hour time span.
I run df3.dtypes to get my data types.
output:
dropoff_datetime datetime64[ns]
ride_id object
dtype: object
Then I've tried the following:
Option 1)
df3 = df3.groupby(df3.ride_completion_time.dt.floor('2H')).mean()
Result: Dataframe object has no attribute dropoff_date_time
Option 2)
df3.groupby(df3.index.floor('4H').time).sum()
Result: It gives me the right grouping (I can see it bucketing my times into 4-hour windows), but then it doesn't really sum anything. I tried using an average, but I don't think that is supported.
Can someone point me in the right direction?
ride_id is object dtype (probably string), so sum and mean exclude this column. Since you want the number of rides, use size:
df3.groupby(df3.index.floor('4H').time).size()
As to why option 2 works but option 1 doesn't: you probably set ride_completion_time as the index somewhere in your code.
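For the "on average" part of the question, one reading is to count rides per 4-hour window and then average those counts. A minimal sketch, assuming ride_completion_time is the DatetimeIndex of df3 and grouping by the actual 4-hour window rather than the time of day:
rides_per_window = df3.groupby(df3.index.floor('4H')).size()  # rides completed in each 4-hour window
print(rides_per_window.mean())                                # average rides per 4-hour span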
I need to evaluate the change of one Pandas column that occurs while another column fulfills a certain condition.
Assuming a DataFrame df with a DateTimeIndex and two columns:
timestamp operating_time sensor_values
2022-03-23 23:57:59.802000+00:00 8.172000e+06 398.15
2022-03-23 23:57:59.818000+00:00 8.172000e+06 397.85
2022-03-23 23:58:59.805000+00:00 8.172000e+06 397.5
2022-03-23 23:58:59.821000+00:00 8.172000e+06 NaN
2022-03-23 23:59:59.793000+00:00 8.172000e+06 397.15
...
Now I would like to know how much operating_time passed while sensor_values < 398 and how much operating_time passed while sensor_values >= 398.
I tried to divide the data into two DataFrames like this:
df_low = df[df['sensor_values'] < 398]
df_high = df[df['sensor_values'] >= 398]
However if I then calculate by how much the operating_time changes for each DataFrame with
df_low['operating_time'].diff().sum()
df_high['operating_time'].diff().sum()
I get basically the same value for both, since diff() simply takes the difference between whichever rows remain after filtering and so bridges the gaps.
How can I find out how much operating time sensor_values was above and below a certain value?
Plot of sensor_values in blue and operating_time in red:
The expected output would be two numbers representing the operating time that was spent above the threshold and below the threshold. So in the example image the operating time increases from roughly 1e6 minutes to roughly 8e6 minutes. The two numbers should therefore add up to 7e6 minutes.
IIUC, you could compute the diff on the full frame first and then group it by the condition, so that each increment is attributed to the side of the threshold its row falls on:
import numpy as np

out = (df['operating_time']
       .diff()
       .groupby(np.where(df['sensor_values'].gt(398), '>398', '≤398'))
       .sum()
       )
output (here with limited example):
>398 0.0
≤398 0.0
Name: operating_time, dtype: float64
or, directly from the timestamps:
out = (pd.to_datetime(df['timestamp'])
       .diff()
       .groupby(np.where(df['sensor_values'].gt(398), '>398', '≤398'))
       .sum()
       )
output:
>398 0 days 00:00:00
≤398 0 days 00:01:59.991000
Name: timestamp, dtype: timedelta64[ns]
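As a quick sanity check on the first variant (assuming out is the result computed from operating_time above), the two group totals should add up to the overall increase in operating_time over the frame, which is what the question expects (roughly 7e6 minutes in the plotted example):
total_change = df['operating_time'].iloc[-1] - df['operating_time'].iloc[0]
assert abs(out.sum() - total_change) < 1e-6  # sum of diffs equals last minus first reading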
I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but only rows for 1-hour before and 1-hour after a specified timestamp, and so only rows within this specified 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter that drops all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
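Alternatively, the slicing approach attempted in the question works once timestamp is a sorted DatetimeIndex; the syntax error comes from calling df(...) with a bare slice instead of using df.loc[...]. A minimal sketch of that route (assuming the df with the timestamp column from the question):
import pandas as pd
hour = pd.Timedelta(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp").sort_index()
window = df.loc[date - hour : date + hour]  # rows within 1 hour either side of date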
I'm running into problems when taking a lower-frequency time series in pandas, such as monthly or quarterly data, and upsampling it to a weekly frequency. For example,
import numpy as np
from pandas import Series, DatetimeIndex, date_range
data = np.arange(3, dtype=np.float64)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN')
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(DatetimeIndex(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I opened a GitHub issue regarding your question; the relevant feature needs to be added to pandas.
Case 3 is achievable directly via fill_method:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W', fill_method='ffill')
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for the others you'll have to do some contorting for now, which will hopefully be remedied by the GitHub issue before the next release.
Also, it looks like you want the upcoming 'span' resampling convention as well, which will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex, but it should at least be there for PeriodIndex.
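On recent pandas versions the fill_method keyword has been removed, and case (3) is written as s.resample('W').ffill(). Case (1) can be approximated along the lines of the sketch below, which places each monthly value on the last W-SUN date falling inside its month; this is only a sketch, not the built-in feature discussed above:
import numpy as np
import pandas as pd
data = np.arange(3, dtype=np.float64)
s = pd.Series(data, index=pd.date_range('2012-01-01', periods=len(data), freq='M'))  # 'M' may be spelled 'ME' on newer pandas
weekly = pd.date_range(s.index[0].replace(day=1), s.index[-1], freq='W-SUN')  # weekly index spanning the data
out = pd.Series(np.nan, index=weekly)
for month_end, value in s.items():
    in_month = weekly[weekly.to_period('M') == month_end.to_period('M')]  # weeks ending in this month
    if len(in_month):
        out[in_month[-1]] = value  # case (1): last week ending in the month gets the monthly value
print(out)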