So I have some sea surface temperature anomaly data that have already been filtered so that only values below a certain threshold remain. I am now trying to identify cold spells, that is, to isolate events that last 5 or more consecutive days. A sample of my data is below (I've been working between xarray Datasets/DataArrays and pandas DataFrames). Note that 'day' is the day number of the month I am looking at (eventually this will be expanded to the whole year). I have been scouring SO and the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding, so my first thought was looping over the rows of the 'day' column, but I'm not sure. Any insight is appreciated.
Here's what some of my data look like as a pandas df:
lat lon time day ssta
5940 24.125 262.375 1984-06-03 3 -1.233751
21072 24.125 262.375 1984-06-04 4 -1.394495
19752 24.125 262.375 1984-06-05 5 -1.379742
10223 24.125 262.375 1984-06-27 27 -1.276407
47355 24.125 262.375 1984-06-28 28 -1.840763
... ... ... ... ... ...
16738 30.875 278.875 2015-06-30 30 -1.345640
3739 30.875 278.875 2020-06-16 16 -1.212824
25335 30.875 278.875 2020-06-17 17 -1.446407
41891 30.875 278.875 2021-06-01 1 -1.714249
27740 30.875 278.875 2021-06-03 3 -1.477497
64228 rows × 5 columns
As a filtered xarray:
xarray.Dataset
Dimensions: lat: 28, lon: 68, time: 1174
Coordinates:
time (time) datetime64[ns] 1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)
TL;DR: I want to identify (and retain the information of) events that span 5+ consecutive days, i.e. a run from day 3 through day 8, or day 21 through day 30, etc.
Rather than filtering your original data, I think you should do it the pandas way, which in this case means obtaining a Series of True/False values based on your condition.
Your data doesn't seem to include plain temperatures, so here is my own example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})
This will generate a DataFrame with a single column containing random temperatures between 10 and 40 degrees. Note that I just work with the auto-generated index here; you may have to make a column such as time or date the index using .set_index. Say we are interested in runs of consecutive days with more than 30 degrees.
is_over_30 = df['temp'] > 30
will give us a True/False Series with that information. This format is very useful because we can index with it: e.g. df[is_over_30] gives us the rows of the dataframe for days where the temperature is over 30 degrees. Now we want to shift the True/False values in is_over_30 one position and build a new series that is True only where both are True, like so:
is_over_30 & np.roll(is_over_30, -1)
Basically we are done here; we could just write three more of those & rolls, but there is a more concise way to write it.
from functools import reduce
is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])
Keep in mind that even though the last 4 days cannot start a 5-day run, they may still come out True here, because roll wraps the first values around to the end. You can simply set the last 4 values to False to resolve this:
is_consecutively_over_30[-4:] = False
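At this point is_consecutively_over_30 only marks the first day of each qualifying run. If you also want to retain every row inside a run (the "retain the information" part of the question), one hedged option, my own sketch rather than part of the answer above, is to OR together forward rolls of that start mask; the wrap-around is harmless only because the last four values were just set to False:

# Day j belongs to a 5+ day run if a run starts at any of j-4 .. j.
is_in_spell = reduce(lambda a, b: a | b,
                     [np.roll(is_consecutively_over_30, i) for i in range(5)])
spell_rows = df[is_in_spell]  # the rows of the original frame inside runs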
You can pull the day ranges of the spells using this approach:
import numpy as np
import pandas as pd

min_spell_days = 6
days = {'day': [1,2,5,6,7,8,9,10,17,19,21,22,23,24,25,26,27,31]}
df = pd.DataFrame(days)
Find the number of days between consecutive entries:
diff = df['day'].diff()
Mark the last day of a spell:
df['last'] = (diff == 1) & (diff.shift(-1) > 1)
Accumulate the number of days in each spell:
df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')
Mark the last entry as the last day of a spell if applicable:
if diff.iat[-1] == 1:
    df.loc[df.index[-1], 'last'] = True  # avoids chained assignment
Select the last day of all qualifying spells:
df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()
Identify the start, end and duration of each spell:
df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1
Resulting df:
df_spells[['start_day','end_day','spell_days']].astype('int')
start_day end_day spell_days
7 5 10 6
16 21 27 7
Also, using date arithmetic, 'day' could represent a serial day number relative to some base date, like 1900-01-01. That way spells that span month and year boundaries would be handled correctly, and it would be trivial to convert the serial number back to a date.
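For instance, applied to the question's original frame (which has a 'time' column), a hedged sketch of that idea, with 1900-01-01 as an arbitrary base date:

base = pd.Timestamp('1900-01-01')
df['day'] = (pd.to_datetime(df['time']) - base).dt.days  # serial day number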
Related
I am trying to filter a DataFrame to show only the values one hour before and one hour after a specified time/date, but I am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see about masking by date mostly cover masking rows between a specified start and end date, but I am having trouble finding help on masking rows around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame containing only the rows of the initial DataFrame from one hour before to one hour after a specified timestamp, i.e. only the rows within this two-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that contains only the rows of the input DataFrame within one hour before and one hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter that drops all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is for the user to simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
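Incidentally, the original attempt in the question was close to working: with the timestamp column set as a sorted DatetimeIndex, label-based slicing through .loc does the same job. A minimal sketch of that route, using the question's df:

hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp").sort_index()
window = df.loc[date - hour : date + hour]  # both endpoints inclusive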
I have a large dataset spanning many years, and I want to subset this data frame by selecting data for a specific day of the month using Python.
This is simple enough, and I have achieved it with the following line of code:
df[df.index.day == 12]
This selects data from the 12th of each month for all years in the data set. Great.
The problem I have, however, is that the original data set is based on working-day data. The 12th might therefore be a weekend or national holiday and thus not appear in the data set, in which case nothing is returned for that month.
What I would like to happen is to select the 12th where available, else select the next working day in the data set.
All help appreciated!
Here's a solution that looks at three days of every month (the 12th, 13th, and 14th) and then picks the minimum. If the 12th is a weekend it won't exist in the original dataframe and you'll get the 13th; if the 13th is missing too, you'll get the 14th.
Here's the code:
import pandas as pd

# Create dummy data - initial range
df = pd.DataFrame(pd.date_range("2018-01-01", "2020-06-01"), columns=["date"])
# Create dummy data - Drop weekends
df = df[df.date.dt.weekday.isin(range(5))]
# get only the 12, 13, and 14 of every month
# group by year and month.
# get the minimum
df[df.date.dt.day.isin([12, 13, 14])].groupby(by=[df.date.dt.year, df.date.dt.month], as_index=False).min()
Result:
date
0 2018-01-12
1 2018-02-12
2 2018-03-12
3 2018-04-12
4 2018-05-14
5 2018-06-12
6 2018-07-12
7 2018-08-13
8 2018-09-12
9 2018-10-12
...
Edit
Per a question in the comments about national holidays: the same solution applies. Instead of picking 3 days (12, 13, 14), pick a larger window (e.g. the 12th through the 18th). Then take the minimum of those that actually exist in the dataframe; that is the first working day starting from the 12th.
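A hedged sketch of that variant, reusing the dummy df from above (the 12th-18th window is an assumption about how many non-working days can stack up):

df[df.date.dt.day.isin(range(12, 19))].groupby(
    by=[df.date.dt.year, df.date.dt.month], as_index=False).min()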
You can backfill the dataframe first to fill in the missing days, then select the date you want:
df = df.asfreq('d', method='bfill')
Then you can do df[df.index.day == 12]
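A minimal sketch of that route, assuming the dates are a DatetimeIndex (dummy working-day data for illustration):

import pandas as pd

# Dummy working-day data: a date index with weekends dropped.
idx = pd.date_range("2018-01-01", "2018-06-30")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)
df = df[df.index.weekday < 5]

# Reinstate every calendar day, filling gaps from the next working day,
# then pick the 12th of each month.
df = df.asfreq("D", method="bfill")
print(df[df.index.day == 12])

Note the selected rows keep the 12th as their label but carry the next working day's value.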
This is my approach; I will explain each line below the code. Please feel free to add a comment if something is unclear:
!pip install workalendar #Install the module
import pandas as pd #Import pandas
from workalendar.usa import NewYork #Import the required country and city
df = pd.DataFrame(pd.date_range(start='1/1/2018', end='12/31/2018')).rename(columns={0:'Dates'}) #Create a dataframe with dates for the year 2018
cal = NewYork() #Instance the calendar
df['Is_Working_Day'] = df['Dates'].map(lambda x: cal.is_working_day(x)) #Create an extra column, True for working days, False otherwise
df[(df['Dates'].dt.day >= 12) & (df['Is_Working_Day'] == True)].groupby(df['Dates'].dt.month)['Dates'].first()
Essentially, this last line takes all working days whose day number is 12 or higher (day >= 12 and Is_Working_Day == True), groups them by month, and returns the first such day for each month.
Output:
Dates
1 2018-01-12
2 2018-02-13
3 2018-03-12
4 2018-04-12
5 2018-05-14
6 2018-06-12
7 2018-07-12
8 2018-08-13
9 2018-09-12
10 2018-10-12
11 2018-11-13
12 2018-12-12
I am working on a large dataset that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is minute data from the first day of the year to the last day of the year.
I want to use Pandas to compute, for each day, the average over the trailing 5 days.
For example:
The average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the average for 05.01.2018.
The next average, from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, is the average for 06.01.2018.
The next average, from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, is the average for 07.01.2018.
And so on: the day increments by 1, but each average covers the trailing 5 days, including the current date.
For a given day, there are 24hours * 60minutes = 1440 data points. So I need to get the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with Time in DD.MM.YYYY format (without hh:mm:ss) and Value being the 5-day average including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line: for each day, calculate the average of the data from that day back over the past 5 days, presented as above.
I tried iterating with a Python loop, but I'd like something better that Pandas can offer.
Perhaps this will work?
import numpy as np
import pandas as pd
# Create one year of random data spaced evenly in 1 minute intervals.
np.random.seed(0) # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1-minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps, using the .dt accessor to grab the date, and take the last rolling_5d_avg value for each date.
df = (
df
.assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
.groupby(df['Time'].dt.date)['rolling_5d_avg']
.last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
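If the minute series ever has gaps, the fixed 7200-row window will quietly cover more than five calendar days. A hedged variant using a time-based window (minute_df here stands for the frame built at the top of this answer, before it was overwritten); note it also yields values for the first four days, from partial windows, instead of NaN:

daily = (
    minute_df.assign(rolling_5d_avg=minute_df.rolling('5D', on='Time')['Value'].mean())
             .groupby(minute_df['Time'].dt.date)['rolling_5d_avg']
             .last()
)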
I'm trying to make a program that will distribute employees' days off equally. There are 4 groups, and each group has its own weekmask for each week of the month. So far I've written code that changes the weekmask when it finds a 0 in the DataFrame (Sunday). I'm stuck on structuring the np.busday_count(start, end, weekmask=...) call so that the start and end dates change automatically.
My DataFrame looks like this:
And here's my code:
a: int = 0
week_mask: str = '1100111'
def _change_week_mask():
global a, week_mask
a += 1
if a == 1:
week_mask = '1111000'
elif a == 2:
week_mask = '1111111'
elif a == 3:
week_mask = '0011111'
else:
a = 0
for line in rows['Workday']:
    if line == '0':  # 'is' compares identity; use == for string equality
_change_week_mask()
Edit: changed the value of start week from 6 to 0.
OK, so to answer your problem I have created a sample data frame with the code below.
Then I added the following columns to it:
dayofweek - to get to data similar to yours, where you set every Sunday to zero; here Monday is zero and Sunday is six.
week - instead of counting and then changing the week mask, I assign each row a week value from 0 to 3 (week of year modulo 4), from which we can calculate the mask.
weekmask - calculated from the value of week; you may need to adjust this mapping to match your logic.
weekenddate - the end date, calculated by adding up to 7 days to the start date; if the month changes mid-week, this holds the month-end date instead.
After this we create a new data frame with only the end-of-week entries (Monday is 0 here, so I filter on dayofweek 0, plus the first day of each month).
Then you can apply the busday_count function and store the result in the data frame.
import datetime
import pandas as pd
import numpy as np
df_ = pd.DataFrame({'startdate':pd.date_range(pd.to_datetime('2018-10-01'), pd.to_datetime('2018-11-30'))})
df_['dayofweek'] = df_.startdate.dt.dayofweek
df_['remaining_days_in_month'] = df_.startdate.dt.days_in_month - df_.startdate.dt.day
df_['week'] = df_.startdate.dt.isocalendar().week % 4  # .dt.week was removed in newer pandas; isocalendar().week is equivalent
df_['day'] = df_.startdate.dt.day
df_['weekmask'] = df_.week.map({0 : '1100111', 1 : '1111000' , 2 : '1111111', 3: '0011111'})
df_['weekenddate'] = [x[0] + datetime.timedelta(days=(7-x[1])) if x[2] > 7-x[1] else x[0] + datetime.timedelta(days=(x[2])) for x in df_[['startdate','dayofweek','remaining_days_in_month']].values]
final_df = df_[(df_['dayofweek']==0) | ( df_['day']==1)][['startdate','weekenddate','weekmask']]
final_df['numberofdays'] = [ np.busday_count((x[0]).astype('<M8[D]'), x[1].astype('<M8[D]'), weekmask=x[2]) for x in final_df.values.astype(str)]
Output:
startdate weekenddate weekmask numberofdays
0 2018-10-01 2018-10-08 1100111 5
7 2018-10-08 2018-10-15 1111000 4
14 2018-10-15 2018-10-22 1111111 7
21 2018-10-22 2018-10-29 0011111 5
28 2018-10-29 2018-10-31 1100111 2
31 2018-11-01 2018-11-05 1100111 3
35 2018-11-05 2018-11-12 1111000 4
42 2018-11-12 2018-11-19 1111111 7
49 2018-11-19 2018-11-26 0011111 5
56 2018-11-26 2018-11-30 1100111 2
Let me know if this needs changes to fit your requirements.
I'm looping through a DataFrame of 200k rows. It does what I want, but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames, so I wonder if I'm doing this very inefficiently. It's quite simple; here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
(df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
(df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
for each row, get the rows of the dataframe that:
match on the group id, and...
have a beginning date within the last 3 years of this row's start date, and...
have an ending date before this row's beginning date.
Then sum up those rows' GAP numbers, add this row's GAP number, and append the result to a list.
So, is there a faster way to express this logic as some kind of merge or join that could speed up the process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser

df = pd.DataFrame( columns = ['ID_NBR','GROUP_ID','BEG_DATE','END_DATE','THREE_YEAR_AGO','GAP'],
data = [['09','185',parser.parse('2008-08-13'),parser.parse('2009-07-01'),parser.parse('2005-08-13'),44],
['10','185',parser.parse('2009-08-04'),parser.parse('2010-01-18'),parser.parse('2006-08-04'),35],
['11','185',parser.parse('2010-01-18'),parser.parse('2011-01-18'),parser.parse('2007-01-18'),0],
['12','185',parser.parse('2014-09-04'),parser.parse('2015-09-04'),parser.parse('2011-09-04'),0]])
and here's what I wrote at the top of the script; it may help:
The purpose of this script is to extract gap counts over the last 3-year period. It uses gaps.sql as its source extract; that query returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The Python code then looks back at the previous 3 years: those earlier rows that have the same GROUP_ID, whose beginning dates come on or after this row's THREE_YEAR_AGO, and whose end dates come before this row's beginning date. Those rows' GAP values are added up and a new column called GAP_THREE is made. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row ID_NBR 11 has 79 for the last 3 years, but ID_NBR 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date of 2014.
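For what it's worth, a hedged sketch of one loop-free direction (my own illustration against the four-row sample above, not a tested drop-in answer): do the pairwise date comparison once per GROUP_ID group with numpy broadcasting, then collapse it with a matrix product.

import numpy as np

def add_gap_three(df):
    out = np.zeros(len(df))
    for _, g in df.groupby('GROUP_ID'):
        beg = g['BEG_DATE'].values
        end = g['END_DATE'].values
        ago = g['THREE_YEAR_AGO'].values
        gap = g['GAP'].values.astype(float)
        # mask[i, j]: row j counts toward row i's three-year window
        mask = (beg[None, :] >= ago[:, None]) & (end[None, :] <= beg[:, None])
        out[df.index.get_indexer(g.index)] = mask @ gap + gap
    return df.assign(GAP_THREE=out.astype(int))

This is still quadratic within each group, but it swaps 200k single-row scans for a handful of vectorized group-level operations.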