Create a Pandas Dataframe Date Column to Day of Year - python

I know this should be easy, but for some reason I cannot get the result that I need. I have a column 'raw_time' that is read into a df in the date format yyyy-mm-dd hh:mm:ss.
It looks like this:
dfdates =
1429029 1992-01-03 02:00:00
1429030 1992-01-03 01:00:00
1429031 1992-01-03 00:00:00
1429032 1992-01-02 23:00:00
1429033 1992-01-02 22:00:00
1429034 1992-01-02 21:00:00
1429035 1992-01-02 20:00:00
1429036 1992-01-02 19:00:00
1429037 1992-01-02 18:00:00
1429038 1992-01-02 17:00:00
1429039 1992-01-02 16:00:00
1429040 1992-01-02 15:00:00
1429041 1992-01-02 14:00:00
1429042 1992-01-02 13:00:00
1429043 1992-01-02 12:00:00
1429044 1992-01-02 11:00:00
I just need to convert each row to day of year. So the result in a new df would look like:
df_doy:
index day_of_year
1429029 3
1429030 3
1429031 3
1429032 2
1429033 2
1429034 2
1429035 2
1429036 2
1429037 2
1429038 2
1429039 2
1429040 2
1429041 2
1429042 2
1429043 2
1429044 2
thank you,

Convert the column to datetime and use the dt.dayofyear accessor:
df['day_of_year'] = pd.to_datetime(df[col]).dt.dayofyear
Or, if you want the day of the month instead:
df['day_of_month'] = pd.to_datetime(df[col]).dt.day
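To make the difference between the two accessors concrete, here is a minimal sketch. It assumes the timestamps live in a column named raw_time (the name used in the question). The last row is hypothetical and not part of the original data; it is added only so that day of year and day of month give different results.

```python
import pandas as pd

# Three rows from the question plus one hypothetical February row (the last one),
# added only to show where day-of-year and day-of-month diverge.
dfdates = pd.DataFrame(
    {"raw_time": ["1992-01-03 02:00:00", "1992-01-02 23:00:00",
                  "1992-01-02 11:00:00", "1992-02-01 00:00:00"]},
    index=[1429029, 1429032, 1429044, 1429045],
)

# Parse the strings once, then derive both columns from the same Series
ts = pd.to_datetime(dfdates["raw_time"])

df_doy = pd.DataFrame({
    "day_of_year": ts.dt.dayofyear,   # counted from 1 January
    "day_of_month": ts.dt.day,        # restarts at 1 every month
})
print(df_doy)
```

The original index is carried over to df_doy, which matches the layout asked for in the question.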

Assuming the dfdates columns are ["index", "date"], you can use dt.dayofyear this way (note that pop removes the "date" column from dfdates itself):
df_doy = dfdates.assign(day_of_year = pd.to_datetime(dfdates.pop("date")).dt.dayofyear)
Output :
print(df_doy)
index day_of_year
0 1429029 3
1 1429030 3
2 1429031 3
3 1429032 2
4 1429033 2
.. ... ...
11 1429040 2
12 1429041 2
13 1429042 2
14 1429043 2
15 1429044 2
[16 rows x 2 columns]

It looks like pandas.Period also has a dayofyear attribute (with day_of_year as an alias).
https://pandas.pydata.org/docs/reference/api/pandas.Period.dayofyear.html
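A short sketch of how the Period attribute could be used, assuming daily periods are what you want; the two timestamps are taken from the question, and the variable names are only illustrative.

```python
import pandas as pd

# A single Period with daily frequency
p = pd.Period("1992-01-03", freq="D")
print(p.dayofyear)

# A whole column: convert the timestamps to daily periods first,
# then read the same attribute through the .dt accessor
ts = pd.Series(pd.to_datetime(["1992-01-03 02:00:00", "1992-01-02 23:00:00"]))
doy = ts.dt.to_period("D").dt.dayofyear
print(doy.tolist())
```

For a datetime column this gives the same numbers as ts.dt.dayofyear, so the detour through Period is mainly useful when the data is already stored as periods.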

Related

Drop all rows for the month if a column has more than one value that crossed the threshold

I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
Suppose the value -9999 is repeated 200 times in the month of January and the threshold is 150. In that case the entire month of January, that is, all of its rows, must be deleted.
date values repeated
1 2013-02 0
2 2013-03 2
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
With this I think I can drop the rows that repeat, but I want to drop the whole month.
import numpy as np
df['month'] = df['date'].dt.to_period('M')
df['new_value'] = np.where((df['values'] == -9999) & (df['n_missing'] > 150),np.nan,df['values'])
df.dropna()
How can I do that?
One way using pandas.to_datetime with pandas.DataFrame.groupby.filter.
Here's a sample with months that have -9999 repeated 2, 1, 0, 2 times each:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999.0
3 2013-01-01 03:00:00 -9999.0
4 2013-01-01 04:00:00 0.0
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999.0
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999.0
8759 2016-12-31 23:00:00 0.0
Then we do filtering:
date = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
new_df = df.groupby(date).filter(lambda x: x["values"].eq(-9999).sum() < 2)
print(new_df)
Output:
date values
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
You can see the months with 2 or more repeats are deleted.
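The same idea can be sketched with a per-row count instead of groupby.filter, which keeps the intermediate count visible. This is an alternative, not the method used in the answer above. The data is the answer's sample, and the names MISSING, threshold and n_missing are illustrative; the threshold is set to 2 so the result is comparable with the output above.

```python
import pandas as pd

MISSING = -9999
threshold = 2  # the question's real threshold is 150

df = pd.DataFrame(
    {
        "date": pd.to_datetime([
            "2013-01-01 00:00:00", "2013-01-01 01:00:00", "2013-01-01 02:00:00",
            "2013-01-01 03:00:00", "2013-01-01 04:00:00", "2013-02-01 12:00:00",
            "2013-03-01 12:00:00", "2016-12-31 18:00:00", "2016-12-31 19:00:00",
            "2016-12-31 20:00:00", "2016-12-31 21:00:00", "2016-12-31 22:00:00",
            "2016-12-31 23:00:00",
        ]),
        "values": [0.0, 0.0, MISSING, MISSING, 0.0, MISSING, 0.0,
                   427.5, 194.9, MISSING, 237.6, MISSING, 0.0],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 8754, 8755, 8756, 8757, 8758, 8759],
)

# Year-month of every row, e.g. 2013-01
month = df["date"].dt.to_period("M")

# For every row, how many -9999 values its month contains
n_missing = df["values"].eq(MISSING).astype(int).groupby(month).transform("sum")

# Keep only rows whose month stays below the threshold
new_df = df[n_missing < threshold]
print(new_df)
```

Whether a month that sits exactly at the threshold is kept depends on using < or <= in the last step, so pick the comparison that matches your rule.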

Filtering out another dataframe based on selected hours

I'm trying to filter my dataframe down to a 3-hourly frequency, meaning only rows at 0000hr, 0300hr, 0600hr, 0900hr, 1200hr, 1500hr, 1800hr, 2100hr, and so on.
A sample of my dataframe would look like this
Time A
2019-05-25 03:54:00 1
2019-05-25 03:57:00 2
2019-05-25 04:00:00 3
...
2020-05-25 03:54:00 4
2020-05-25 03:57:00 5
2020-05-25 04:00:00 6
Desired output:
Time A
2019-05-25 06:00:00 1
2019-05-25 09:00:00 2
2019-05-25 12:00:00 3
...
2020-05-25 00:00:00 4
2020-05-25 03:00:00 5
2020-05-25 06:00:00 6
2020-05-25 09:00:00 6
2020-05-25 12:00:00 6
2020-05-25 15:00:00 6
2020-05-25 18:00:00 6
2020-05-25 21:00:00 6
2020-05-26 00:00:00 6
...
You can define a date range with 3 hours interval with pd.date_range() and then filter your dataframe with .loc and isin(), as follows:
date_rng_3H = pd.date_range(start=df['Time'].dt.date.min(), end=df['Time'].dt.date.max() + pd.DateOffset(days=1), freq='3H')
df_out = df.loc[df['Time'].isin(date_rng_3H)]
Input data:
date_rng = pd.date_range(start='2019-05-25 03:54:00', end='2020-05-25 04:00:00', freq='3T')
np.random.seed(123)
df = pd.DataFrame({'Time': date_rng, 'A': np.random.randint(1, 6, len(date_rng))})
Time A
0 2019-05-25 03:54:00 3
1 2019-05-25 03:57:00 5
2 2019-05-25 04:00:00 3
3 2019-05-25 04:03:00 2
4 2019-05-25 04:06:00 4
... ... ...
175678 2020-05-25 03:48:00 2
175679 2020-05-25 03:51:00 1
175680 2020-05-25 03:54:00 2
175681 2020-05-25 03:57:00 2
175682 2020-05-25 04:00:00 1
175683 rows × 2 columns
Output:
print(df_out)
Time A
42 2019-05-25 06:00:00 4
102 2019-05-25 09:00:00 2
162 2019-05-25 12:00:00 1
222 2019-05-25 15:00:00 3
282 2019-05-25 18:00:00 5
... ... ...
175422 2020-05-24 15:00:00 1
175482 2020-05-24 18:00:00 5
175542 2020-05-24 21:00:00 2
175602 2020-05-25 00:00:00 3
175662 2020-05-25 03:00:00 3
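If building the full date range is not convenient, the same rows can be selected with a boolean mask on the time components. This is a sketch of an alternative, not the approach used in the answer above, and it assumes the timestamps carry no sub-second part. The small input frame is made up for illustration.

```python
import pandas as pd

# Illustrative input: a mix of on-grid and off-grid timestamps
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2019-05-25 03:54:00",  # hour is a multiple of 3, but minute is not 0
        "2019-05-25 06:00:00",  # on the 3-hour grid
        "2019-05-25 07:00:00",  # hour is not a multiple of 3
        "2019-05-25 09:00:00",  # on the 3-hour grid
        "2019-05-25 09:00:30",  # seconds are not 0
    ]),
    "A": [1, 2, 3, 4, 5],
})

t = df["Time"]

# A row is on the grid when the hour is a multiple of 3
# and both minute and second are zero
on_grid = (t.dt.hour % 3 == 0) & (t.dt.minute == 0) & (t.dt.second == 0)

df_out = df.loc[on_grid]
print(df_out)
```

Like the isin approach, this only keeps rows that already exist in the data; it does not create rows for grid times that are missing.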

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column, consecutive_hour, such that if the value at a particular timestamp is less than 1000, the row gets 3 hours, and consecutive such rows accumulate to 6, 9 and so on, as shown above.
Lastly, I want to summarize the table counting consecutive hours occurrence and number of days such that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in "How to groupby consecutive values in pandas DataFrame"
and more, and spent several days on it, but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
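Because groupby(...).apply can return differently shaped indexes depending on the pandas version, here is a sketch that reaches the same two tables without apply. It is an alternative to the answer's code, not a restatement of it. The names ok, day, block, hours and summary are illustrative, and the 3-hour step is taken from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-11-10 00:00", "2017-11-10 03:00", "2017-11-10 06:00",
        "2017-11-10 09:00", "2017-11-10 12:00",
        "2017-11-11 00:00", "2017-11-11 03:00", "2017-11-11 06:00",
        "2017-11-11 09:00", "2017-11-11 12:00",
        "2017-11-12 00:00", "2017-11-12 03:00", "2017-11-12 06:00",
        "2017-11-12 09:00", "2017-11-12 12:00",
    ]),
    "value": [850, np.nan, np.nan, np.nan, np.nan,
              500, 650, 780, np.nan, 800,
              350, 690, 780, np.nan, np.nan],
})

ok = df["value"] < 1000           # NaN compares as False
day = df["date"].dt.normalize()   # calendar day of each row
block = (~ok).cumsum()            # increases at every non-qualifying row

# Count qualifying rows within each (day, block) group, 3 hours per row.
# The non-qualifying row that opens a block contributes 0.
hours = ok.astype(int).groupby([day, block]).cumsum() * 3
df["consecutive_hour"] = hours

# A run ends where the next row of the same day has a smaller count
# (or where the day ends, which shift(-1) leaves as NaN)
next_hours = hours.groupby(day).shift(-1).fillna(0)
run_end = hours > next_hours

# Count how often each run length occurs
summary = hours[run_end].value_counts().reindex([3, 6, 9, 12], fill_value=0)
print(df)
print(summary)
```

As in the answer's summary, the counts are per run rather than per calendar day, so a day with two separate runs contributes twice.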

Filtering pandas dataframe by difference of adjacent rows

I have a dataframe indexed by datetime. I want to filter out rows based on the difference between their index and the index of the previous row.
So, if my criterion is "remove all rows that are more than one hour later than the previous row", the second row in the example below should be removed:
2005-07-15 17:00:00
2005-07-17 18:00:00
While in the following case, both rows stay:
2005-07-17 23:00:00
2005-07-18 00:00:00
It seems you need boolean indexing with diff for difference and compare with 1 hour Timedelta:
dates=['2005-07-15 17:00:00','2005-07-17 18:00:00', '2005-07-17 19:00:00',
'2005-07-17 23:00:00', '2005-07-18 00:00:00']
df = pd.DataFrame({'a':range(5)}, index=pd.to_datetime(dates))
print (df)
a
2005-07-15 17:00:00 0
2005-07-17 18:00:00 1
2005-07-17 19:00:00 2
2005-07-17 23:00:00 3
2005-07-18 00:00:00 4
diff = df.index.to_series().diff().fillna(pd.Timedelta(0))
print (diff)
2005-07-15 17:00:00 0 days 00:00:00
2005-07-17 18:00:00 2 days 01:00:00
2005-07-17 19:00:00 0 days 01:00:00
2005-07-17 23:00:00 0 days 04:00:00
2005-07-18 00:00:00 0 days 01:00:00
dtype: timedelta64[ns]
mask = diff <= pd.Timedelta(1, unit='h')
print (mask)
2005-07-15 17:00:00 True
2005-07-17 18:00:00 False
2005-07-17 19:00:00 True
2005-07-17 23:00:00 False
2005-07-18 00:00:00 True
dtype: bool
df = df[mask]
print (df)
a
2005-07-15 17:00:00 0
2005-07-17 19:00:00 2
2005-07-18 00:00:00 4
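A compact variant of the steps above, sketched without filling the first gap: the first row has no predecessor, so its difference is NaT, and it is kept explicitly with isna(). The names gap, keep and filtered are illustrative.

```python
import pandas as pd

dates = ["2005-07-15 17:00:00", "2005-07-17 18:00:00", "2005-07-17 19:00:00",
         "2005-07-17 23:00:00", "2005-07-18 00:00:00"]
df = pd.DataFrame({"a": range(5)}, index=pd.to_datetime(dates))

# Time elapsed since the previous row; NaT for the first row
gap = df.index.to_series().diff()

# Keep the first row, plus every row at most one hour after its predecessor
keep = gap.isna() | (gap <= pd.Timedelta(hours=1))

filtered = df[keep]
print(filtered)
```

The gaps are measured against the original neighbours, not against the rows that survive the filter, which is the same behaviour as the code above.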

Conditional selection before certain time of day - Pandas dataframe

I have the above dataframe (snippet) and want to create a new dataframe which is a conditional selection where I keep only the rows that are timestamped with a time before 15:00:00.
I'm still somewhat new to Pandas / python and have been stuck on this for a while :(
You can use DataFrame.between_time:
start = pd.to_datetime('2015-02-24 11:00')
rng = pd.date_range(start, periods=10, freq='14h')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
print (df)
Date a
0 2015-02-24 11:00:00 0
1 2015-02-25 01:00:00 1
2 2015-02-25 15:00:00 2
3 2015-02-26 05:00:00 3
4 2015-02-26 19:00:00 4
5 2015-02-27 09:00:00 5
6 2015-02-27 23:00:00 6
7 2015-02-28 13:00:00 7
8 2015-03-01 03:00:00 8
9 2015-03-01 17:00:00 9
df = df.set_index('Date').between_time('00:00:00', '15:00:00')
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-25 15:00:00 2
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
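If you need to exclude 15:00:00, start again from the original df and add the parameter include_end=False. In pandas 1.4 and later this parameter is deprecated in favour of inclusive='left', and it was removed in pandas 2.0:
df = df.set_index('Date').between_time('00:00:00', '15:00:00', include_end=False)
# pandas >= 1.4: df.set_index('Date').between_time('00:00:00', '15:00:00', inclusive='left')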
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
You can check the hours of the date column and use it for subsetting:
df['date'] = pd.to_datetime(df['date']) # optional if the date column is of datetime type
df[df.date.dt.hour < 15]
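The hour check works because the cutoff falls exactly on an hour. For a cutoff such as 15:30, one option is to compare the time of day directly. This is a minimal sketch assuming a datetime column named Date; the four rows are a subset of the sample data in the answer above.

```python
from datetime import time

import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-02-24 11:00:00", "2015-02-25 15:00:00",
                            "2015-02-26 19:00:00", "2015-02-28 13:00:00"]),
    "a": [0, 2, 4, 7],
})

# .dt.time yields datetime.time objects, which compare elementwise
# against a single datetime.time cutoff
cutoff = time(15, 0)
before_cutoff = df[df["Date"].dt.time < cutoff]
print(before_cutoff)
```

With a strict < the row stamped exactly 15:00:00 is excluded; use <= if it should be kept.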
