filter pandas dataframe by time - python

I have a pandas dataframe which I want to subset on time greater or less than 12pm. First I convert my string datetime to a datetime64[ns] object in pandas.
segments_data['time'] = pd.to_datetime((segments_data['time']))
Then I separate time,date,month,year & dayofweek like below.
import datetime as dt
segments_data['date'] = segments_data.time.dt.date
segments_data['year'] = segments_data.time.dt.year
segments_data['month'] = segments_data.time.dt.month
segments_data['dayofweek'] = segments_data.time.dt.dayofweek
segments_data['time'] = segments_data.time.dt.time
My time column looks like the following.
segments_data['time']
Out[1906]:
07:43:00
07:52:00
08:00:00
08:42:00
09:18:00
09:18:00
09:18:00
09:23:00
12:32:00
12:43:00
12:55:00
Name: time, dtype: object
Now I want to subset dataframe with time greater than 12pm and time less than 12pm.
segments_data.time[segments_data['time'] < 12:00:00]
It doesn't work because time is a string object.

Update
From pandas docs at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html. Thanks to Frederick in the comments.
Create dataframe with datetimes in it:
i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
ts
A
2018-04-09 00:00:00 1
2018-04-10 00:20:00 2
2018-04-11 00:40:00 3
2018-04-12 01:00:00 4
Use between_time:
ts.between_time('0:15', '0:45')
A
2018-04-10 00:20:00 2
2018-04-11 00:40:00 3
You get the times that are not between two times by setting start_time later than end_time:
ts.between_time('0:45', '0:15')
A
2018-04-09 00:00:00 1
2018-04-12 01:00:00 4
Old Answer
Leave a column as the raw datetime, call it ts:
segments_data['ts'] = pd.to_datetime((segments_data['time']))
Next you can cast the datetime to an H:M:S string and use between(start, end), which seems to work:
In [227]:
segments_data=pd.DataFrame(x,columns=['ts'])
segments_data.ts = pd.to_datetime(segments_data.ts)
segments_data
Out[227]:
ts
0 2016-01-28 07:43:00
1 2016-01-28 07:52:00
2 2016-01-28 08:00:00
3 2016-01-28 08:42:00
4 2016-01-28 09:18:00
5 2016-01-28 09:18:00
6 2016-01-28 09:18:00
7 2016-01-28 09:23:00
8 2016-01-28 12:32:00
9 2016-01-28 12:43:00
10 2016-01-28 12:55:00
In [228]:
segments_data[segments_data.ts.dt.strftime('%H:%M:%S').between('00:00:00','12:00:00')]
Out[228]:
ts
0 2016-01-28 07:43:00
1 2016-01-28 07:52:00
2 2016-01-28 08:00:00
3 2016-01-28 08:42:00
4 2016-01-28 09:18:00
5 2016-01-28 09:18:00
6 2016-01-28 09:18:00
7 2016-01-28 09:23:00

Even though this post is 5 years old I just ran into this same problem and decided to post what I was able to get to work. I tried the between_time function but that did not work for me because the index on the dataframe had to be a datetime and I wanted to use one of the dataframe time columns to filter.
# Import datetime libraries
from datetime import datetime, date, time
avail_df['Start'].dt.time
1 08:36:44
2 08:49:14
3 09:26:00
5 08:34:22
7 08:34:19
8 09:09:05
9 12:27:43
10 12:29:14
12 09:05:55
13 09:14:11
14 09:21:41
15 11:28:26
16 12:25:10
17 16:02:52
18 08:53:51
# Use the time() constructor to create the start/end parameter; I used 9:00am for this example
avail_df.loc[avail_df['Start'].dt.time > time(9,00)]
3 09:26:00
8 09:09:05
9 12:27:43
10 12:29:14
12 09:05:55
13 09:14:11
14 09:21:41
15 11:28:26
16 12:25:10
17 16:02:52
20 09:04:50
21 09:21:35
22 09:22:05
23 09:47:05
24 09:55:05
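Applied to the original question's split around 12pm, the same comparison could be sketched as below. The sample timestamps and the names noon, before_noon, and from_noon are made up for illustration; the sketch assumes the datetime64 column is kept as it is and .dt.time is used only to build the mask.

```python
from datetime import time

import pandas as pd

# Hypothetical sample data; only the time of day matters for the filter.
segments_data = pd.DataFrame({
    "ts": pd.to_datetime([
        "2016-01-28 07:43:00",
        "2016-01-28 09:18:00",
        "2016-01-28 12:00:00",
        "2016-01-28 12:32:00",
        "2016-01-29 08:00:00",
    ])
})

noon = time(12, 0)

# .dt.time yields datetime.time objects, which compare against another time.
is_before_noon = segments_data["ts"].dt.time < noon

before_noon = segments_data[is_before_noon]
from_noon = segments_data[~is_before_noon]  # 12:00:00 itself lands here

print(before_noon)
print(from_noon)
```

Keeping the full datetime in the column means the date, year, and weekday stay available for later filters, which is not the case once the column has been overwritten with .dt.time.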

Related

From hours to String

I have this df:
Index Dates
0 2017-01-01 23:30:00
1 2017-01-12 22:30:00
2 2017-01-20 13:35:00
3 2017-01-21 14:25:00
4 2017-01-28 22:30:00
5 2017-08-01 13:00:00
6 2017-09-26 09:39:00
7 2017-10-08 06:40:00
8 2017-10-04 07:30:00
9 2017-12-13 07:40:00
10 2017-12-31 14:55:00
The goal was to create a new df that says 'morning' for rows whose time falls between 5:00 and 11:59. To achieve this I converted those hours to booleans:
hour_morning=(pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00','11:59:00'))
and then passed them to a list with "morning" str
text_morning=[str('morning') for x in hour_morning if x==True]
The problem is in the last line: it only returns 'morning' string values, as if x ignored the if condition. Why is this happening and how do I fix it?
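A trailing if in a list comprehension filters the elements, so only the True entries are kept and each one yields 'morning'. To label every row, use a conditional expression instead:
text_morning=[str('morning') if x==True else 'not_morning' for x in hour_morning ]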
You can also use np.where:
text_morning = np.where(hour_morning, 'morning', 'not morning')
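The difference between the filtering form, the conditional expression, and np.where can be seen side by side in a small sketch. The boolean Series here is a made-up stand-in for hour_morning, and the variable names filtered, mapped, and vectorised are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical boolean mask standing in for hour_morning.
hour_morning = pd.Series([False, True, True, False])

# Trailing `if` filters: False entries are dropped, so only 'morning' remains.
filtered = ["morning" for x in hour_morning if x]

# Conditional expression maps: every entry produces a label.
mapped = ["morning" if x else "not morning" for x in hour_morning]

# Vectorised equivalent of the mapping form.
vectorised = np.where(hour_morning, "morning", "not morning")

print(filtered)
print(mapped)
print(vectorised.tolist())
```

Only the last two forms keep one label per input row, which is what is needed to assign the result back as a dataframe column.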
Given:
Dates values
0 2017-01-01 23:30:00 0
1 2017-01-12 22:30:00 1
2 2017-01-20 13:35:00 2
3 2017-01-21 14:25:00 3
4 2017-01-28 22:30:00 4
5 2017-08-01 13:00:00 5
6 2017-09-26 09:39:00 6
7 2017-10-08 06:40:00 7
8 2017-10-04 07:30:00 8
9 2017-12-13 07:40:00 9
10 2017-12-31 14:55:00 10
Doing:
# df.Dates = pd.to_datetime(df.Dates)
df = df.set_index("Dates")
Now we can use pd.DataFrame.between_time:
new_df = df.between_time('05:00:00','11:59:00')
print(new_df)
Output:
values
Dates
2017-09-26 09:39:00 6
2017-10-08 06:40:00 7
2017-10-04 07:30:00 8
2017-12-13 07:40:00 9
Or use it to update the original dataframe:
df.loc[df.between_time('05:00:00','11:59:00').index, 'morning'] = 'morning'
# Output:
values morning
Dates
2017-01-01 23:30:00 0 NaN
2017-01-12 22:30:00 1 NaN
2017-01-20 13:35:00 2 NaN
2017-01-21 14:25:00 3 NaN
2017-01-28 22:30:00 4 NaN
2017-08-01 13:00:00 5 NaN
2017-09-26 09:39:00 6 morning
2017-10-08 06:40:00 7 morning
2017-10-04 07:30:00 8 morning
2017-12-13 07:40:00 9 morning
2017-12-31 14:55:00 10 NaN
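If moving Dates into the index is not desirable, one possible alternative is DatetimeIndex.indexer_between_time, which returns the integer positions of the matching rows. The sketch below uses a shortened, made-up version of the data and keeps Dates as a regular column; the 'not morning' default label is an assumption rather than part of the answer above.

```python
import pandas as pd

# Shortened, hypothetical version of the data with Dates as a regular column.
df = pd.DataFrame({
    "Dates": pd.to_datetime([
        "2017-01-01 23:30:00",
        "2017-09-26 09:39:00",
        "2017-10-08 06:40:00",
        "2017-12-31 14:55:00",
    ]),
    "values": [0, 6, 7, 10],
})

# Integer positions of the rows whose time of day is within the window
# (both end points are included by default).
positions = pd.DatetimeIndex(df["Dates"]).indexer_between_time("05:00", "11:59")

df["morning"] = "not morning"
df.iloc[positions, df.columns.get_loc("morning")] = "morning"

print(df)
```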

Stacking multiple dataframes together for different timestamp format into one timestamp

I have multiple dataframes, each holding one day of data, from minute 1 to minute 1440. The dataframes all have the same columns and the same length. The time column values are in hhmm format.
Let's say df_A has the data of the 1st day, 2021-05-06. It looks like this:
>df_A
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
And the next day's data is in df_B which is also the same. The date is 2021-05-07
>df_B
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
How could I stack these one under another into a single dataframe, while identifying each row with a column whose values are in a format like YYYYMMDD HH:mm? It would look somewhat like this:
>df
timestamp col1 col2..... col80
20210506 0000
20210506 0001
.
.
20210506 2359
20210507 0000
.
.
20210507 2359
How could I achieve this while dealing with multiple dataframes at once?
df_A = pd.DataFrame(range(0, 10), columns=['timestamp'])
df_B = pd.DataFrame(range(0, 10), columns=['timestamp'])
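# zero-pad the hhmm value to four digits, then parse it together with the day
df_A['date'] = pd.to_datetime('2021-05-06 ' +
    df_A['timestamp'].astype(str).str.zfill(4), format='%Y-%m-%d %H%M')
# use df_B's own timestamp column for the second day
df_B['date'] = pd.to_datetime('2021-05-07 ' +
    df_B['timestamp'].astype(str).str.zfill(4), format='%Y-%m-%d %H%M')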
df_final = pd.concat([df_A, df_B])
df_final
timestamp date
0 0 2021-05-06 00:00:00
1 1 2021-05-06 00:01:00
2 2 2021-05-06 00:02:00
3 3 2021-05-06 00:03:00
4 4 2021-05-06 00:04:00
5 5 2021-05-06 00:05:00
6 6 2021-05-06 00:06:00
7 7 2021-05-06 00:07:00
8 8 2021-05-06 00:08:00
9 9 2021-05-06 00:09:00
0 0 2021-05-07 00:00:00
1 1 2021-05-07 00:01:00
2 2 2021-05-07 00:02:00
3 3 2021-05-07 00:03:00
4 4 2021-05-07 00:04:00
5 5 2021-05-07 00:05:00
6 6 2021-05-07 00:06:00
7 7 2021-05-07 00:07:00
8 8 2021-05-07 00:08:00
9 9 2021-05-07 00:09:00
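For more than two frames, the same steps can be wrapped in a loop. This is a minimal sketch that assumes the frames are available in a dict keyed by their date string; frames_by_date, col1, and the label column are hypothetical names, and the sample rows are invented.

```python
import pandas as pd

# Hypothetical input: one frame per day, keyed by its date string.
frames_by_date = {
    "2021-05-06": pd.DataFrame({"timestamp": [0, 1, 59, 100, 2359], "col1": range(5)}),
    "2021-05-07": pd.DataFrame({"timestamp": [0, 1, 59, 100, 2359], "col1": range(5, 10)}),
}

stacked = []
for day, frame in frames_by_date.items():
    frame = frame.copy()  # leave the caller's frames untouched
    # Pad hhmm to four digits (59 -> "0059") before parsing.
    hhmm = frame["timestamp"].astype(str).str.zfill(4)
    frame["date"] = pd.to_datetime(day + " " + hhmm, format="%Y-%m-%d %H%M")
    stacked.append(frame)

df_final = pd.concat(stacked, ignore_index=True)

# Optional: render the datetime in the YYYYMMDD HHMM layout from the question.
df_final["label"] = df_final["date"].dt.strftime("%Y%m%d %H%M")
print(df_final)
```

Passing ignore_index=True gives the stacked frame a fresh 0..n-1 index instead of repeating each day's original row labels.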

How to plot daily plots from yearly time series

I have hourly ozone data over a multi year period in a pandas dataframe. I need to create plots of the ozone data for every day of the year (i.e. 365 plots for the year). The time series is in the following format:
time_lt
3 1980-04-24 17:00:00
4 1980-04-24 18:00:00
5 1980-04-24 19:00:00
6 1980-04-24 20:00:00
7 1980-04-24 21:00:00
8 1980-04-24 22:00:00
9 1980-04-24 23:00:00
10 1980-04-25 00:00:00
11 1980-04-25 01:00:00
12 1980-04-25 02:00:00
13 1980-04-25 03:00:00
14 1980-04-25 04:00:00
How would I group the data by day in order to plot each one? What is the most efficient way of coding this?
Thanks!
Find comments inline
df['time_lt'] = pd.to_datetime(df['time_lt'])
# you can extract day, month, year
df['day'] = df['time_lt'].dt.day
df['month'] = df['time_lt'].dt.month
df['year'] = df['time_lt'].dt.year
#then use groupby
grouped = df.groupby(['day', 'month', 'year'])
# now you can plot individual groups
You can group on the fly:
import pandas as pd
from io import StringIO
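# the columns are separated by two or more spaces, so split on a space
# followed by further whitespace; regex separators need the python engine
df = pd.read_csv(StringIO(
"""id  time_lt
3  1980-04-24 17:00:00
4  1980-04-24 18:00:00
5  1980-04-24 19:00:00
6  1980-04-24 20:00:00
7  1980-04-24 21:00:00
8  1980-04-24 22:00:00
9  1980-04-24 23:00:00
10  1980-04-25 00:00:00
11  1980-04-25 01:00:00
12  1980-04-25 02:00:00
13  1980-04-25 03:00:00
14  1980-04-25 04:00:00"""), sep=r" \s+", engine="python")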
df['time_lt'] = pd.to_datetime(df['time_lt'])
>>> df.groupby(df.time_lt.dt.floor('1D')).count()
id time_lt
time_lt
1980-04-24 7 7
1980-04-25 5 5
In theory, you can write a plotting function and apply it directly to the groupby result, but then it will be harder to control. Since plotting itself will still be the slowest operation in this chain, you can safely do a simple iteration over dates.
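A simple iteration over the daily groups might look like the sketch below. It assumes matplotlib is installed and that the measurements live in a column called ozone; that column name, the sample values, the temporary output directory, and the file naming are all assumptions made for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw to files; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical hourly data; the 'ozone' column and its values are invented.
df = pd.DataFrame({
    "time_lt": pd.date_range("1980-04-24 17:00", periods=12, freq="h"),
    "ozone": [float(v) for v in range(12)],
})

out_dir = tempfile.mkdtemp()  # assumed output location
saved = []

# Group on the calendar day without adding helper columns.
for day, day_df in df.groupby(df["time_lt"].dt.floor("D")):
    fig, ax = plt.subplots()
    ax.plot(day_df["time_lt"], day_df["ozone"])
    ax.set_title(day.strftime("%Y-%m-%d"))
    ax.set_xlabel("time_lt")
    ax.set_ylabel("ozone")
    path = os.path.join(out_dir, f"ozone_{day:%Y%m%d}.png")
    fig.savefig(path)
    plt.close(fig)  # release the figure; matters when looping over many days
    saved.append(path)

print(saved)
```

Closing each figure inside the loop keeps memory use flat, which becomes relevant when a multi-year series produces hundreds of plots.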

Filtering Pandas DataFrame on last n dates

I have a Pandas DF that looks like this:
df
I want to filter the DF using a locally defined int parameter, 'days', so that when days = 10 my filtered DF only has the data for the last 10 available dates.
Until now, I have tried the following:
days=10
cutoff_date = df["SeriesDate"][-1:] - datetime.timedelta(days=days)
However, then trying to output the filtered DF using:
df[df['SeriesDate'] > cutoff_date]
I get the following error:
ValueError: Can only compare identically-labeled Series objects
I am still learning Python so will appreciate any help that I can get with this.
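The slice [-1:] returns a one-row Series rather than a single timestamp, so the comparison tries to align two differently labeled Series and fails. I think you need to select the last value of column SeriesDate with iloc: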
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=15, freq='20H')
df = pd.DataFrame({'SeriesDate': rng, 'Value_1': np.random.random(15)})
print (df)
SeriesDate Value_1
0 2015-02-24 00:00:00 0.849160
1 2015-02-24 20:00:00 0.332487
2 2015-02-25 16:00:00 0.687638
3 2015-02-26 12:00:00 0.310326
4 2015-02-27 08:00:00 0.660795
5 2015-02-28 04:00:00 0.354475
6 2015-03-01 00:00:00 0.061312
7 2015-03-01 20:00:00 0.443908
8 2015-03-02 16:00:00 0.708326
9 2015-03-03 12:00:00 0.257419
10 2015-03-04 08:00:00 0.618363
11 2015-03-05 04:00:00 0.121625
12 2015-03-06 00:00:00 0.637324
13 2015-03-06 20:00:00 0.058292
14 2015-03-07 16:00:00 0.047624
days=10
cutoff_date = df["SeriesDate"].iloc[-1] - pd.Timedelta(days=days)
print (cutoff_date)
2015-02-25 16:00:00
df1 = df[df['SeriesDate'] > cutoff_date]
print (df1)
SeriesDate Value_1
3 2015-02-26 12:00:00 0.310326
4 2015-02-27 08:00:00 0.660795
5 2015-02-28 04:00:00 0.354475
6 2015-03-01 00:00:00 0.061312
7 2015-03-01 20:00:00 0.443908
8 2015-03-02 16:00:00 0.708326
9 2015-03-03 12:00:00 0.257419
10 2015-03-04 08:00:00 0.618363
11 2015-03-05 04:00:00 0.121625
12 2015-03-06 00:00:00 0.637324
13 2015-03-06 20:00:00 0.058292
14 2015-03-07 16:00:00 0.047624
Another alternative is to use max (thanks, Pocin):
cutoff_date = df["SeriesDate"].max() - pd.Timedelta(days=days)
print (cutoff_date)
2015-02-25 16:00:00
And if you want to filter by dates only:
days=10
cutoff_date = df["SeriesDate"].dt.date.iloc[-1] - pd.Timedelta(days=days)
print (cutoff_date)
2015-02-25
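The approaches above count back a number of calendar days. If "the last available 10 dates" is instead read as the last n distinct dates that actually occur in the data, a sketch along these lines may fit better. The sample data, n_dates, and df_last are hypothetical, and this reading of the question is an assumption.

```python
import pandas as pd

# Hypothetical data with gaps, so "last n dates" differs from "last n days".
df = pd.DataFrame({
    "SeriesDate": pd.to_datetime([
        "2015-02-24 09:00", "2015-02-24 15:00",
        "2015-02-27 09:00",
        "2015-03-02 09:00", "2015-03-02 15:00",
        "2015-03-06 09:00",
    ]),
    "Value_1": [1, 2, 3, 4, 5, 6],
})

n_dates = 2

# Strip the time of day, then keep the n most recent distinct dates.
dates = df["SeriesDate"].dt.normalize()
last_dates = dates.drop_duplicates().sort_values().tail(n_dates)

df_last = df[dates.isin(last_dates)]
print(df_last)
```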
EDIT:
You can filter out weekend dates with dayofweek and then use isin:
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=15)
df = pd.DataFrame({'SeriesDate': rng, 'Value_1': np.random.random(15)})
print (df)
SeriesDate Value_1
0 2015-02-24 0.498387
1 2015-02-25 0.435767
2 2015-02-26 0.299233
3 2015-02-27 0.489286
4 2015-02-28 0.892167
5 2015-03-01 0.507436
6 2015-03-02 0.360427
7 2015-03-03 0.903886
8 2015-03-04 0.718148
9 2015-03-05 0.645489
10 2015-03-06 0.251285
11 2015-03-07 0.139275
12 2015-03-08 0.756845
13 2015-03-09 0.565863
14 2015-03-10 0.148077
days=10
last_day = df["SeriesDate"].dt.date.iloc[-1]
cutoff_date = last_day - pd.Timedelta(days=days)
rng = pd.date_range(cutoff_date, last_day)
rng = rng[rng.dayofweek < 5]  # keep Monday (0) to Friday (4); 5 and 6 are the weekend
print (rng)
DatetimeIndex(['2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',
'2015-03-06', '2015-03-09', '2015-03-10'],
dtype='datetime64[ns]', freq=None)
df1 = df[df['SeriesDate'].isin(rng)]
print (df1)
SeriesDate Value_1
6 2015-03-02 0.360427
7 2015-03-03 0.903886
8 2015-03-04 0.718148
9 2015-03-05 0.645489
10 2015-03-06 0.251285
13 2015-03-09 0.565863
14 2015-03-10 0.148077

Pandas select columns and data dependant on header

I have a large .csv file. I want to select only the column with the time/date and 20 other columns which I know by header.
As a test I try to take only the column with the header 'TIMESTAMP'. I know this is 4207823 rows long in the .csv and it only contains dates and times. The code below selects the TIMESTAMP column but also carries on to take values from other columns, as shown below:
import pandas
# convert file to variable so it can be edited; the raw string keeps the
# backslashes in the Windows path from being read as escape sequences
f = pandas.read_csv(r'C:\Users\mmso2\Google Drive\MABL Wind\_Semester 2 2016\Wind Farm Info\DataB\DataB - NaN2.csv', dtype=object, low_memory=False)
time = f[['TIMESTAMP']]
time = time[0:4207823]  # test to see if this stops time taking other data
print(time)
output
TIMESTAMP
0 2007-08-15 21:10:00
1 2007-08-15 21:20:00
2 2007-08-15 21:30:00
3 2007-08-15 21:40:00
4 2007-08-15 21:50:00
5 2007-08-15 22:00:00
6 2007-08-15 22:10:00
7 2007-08-15 22:20:00
8 2007-08-15 22:30:00
9 2007-08-15 22:40:00
10 2007-08-15 22:50:00
11 2007-08-15 23:00:00
12 2007-08-15 23:10:00
13 2007-08-15 23:20:00
14 2007-08-15 23:30:00
15 2007-08-15 23:40:00
16 2007-08-15 23:50:00
17 2007-08-16 00:00:00
18 2007-08-16 00:10:00
19 2007-08-16 00:20:00
20 2007-08-16 00:30:00
21 2007-08-16 00:40:00
22 2007-08-16 00:50:00
23 2007-08-16 01:00:00
24 2007-08-16 01:10:00
25 2007-08-16 01:20:00
26 2007-08-16 01:30:00
27 2007-08-16 01:40:00
28 2007-08-16 01:50:00
29 2007-08-16 02:00:00 #these are from the TIMESTAMP column
... ...
679302 221.484 #This is from another column
679303 NaN
679304 2015-09-23 06:40:00
679305 NaN
679306 NaN
679307 2015-09-23 06:50:00
679308 NaN
679309 NaN
679310 2015-09-23 07:00:00
The problem was due to an error in the input file, so a simple use of usecols in pandas.read_csv worked.
The code below demonstrates the selection of a few columns of data:
import pandas
# read only the selected columns, keeping every value as a string
df = pandas.read_csv('DataB - Copy - Copy.csv', delimiter=',', dtype=object,
    usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m'])
print(df)  # see what the data looks like
# save the selection to a new .csv; to_csv creates the file itself
df.to_csv('DataB_GreaterGabbardOnly.csv')
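The effect of usecols can be checked on a small in-memory file. In this sketch the CSV text, its values, and the unwanted column named other are invented; parse_dates is added as an optional extra so that TIMESTAMP is read as datetimes rather than strings, which the answer above does not do.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the large file; 'other' is an unwanted column.
csv_text = """TIMESTAMP,igmmx_U_77m,igmmx_U_58m,other
2007-08-15 21:10:00,7.1,6.4,221.484
2007-08-15 21:20:00,7.3,6.6,220.010
2007-08-15 21:30:00,,6.2,219.500
"""

wanted = ["TIMESTAMP", "igmmx_U_77m", "igmmx_U_58m"]

# usecols limits parsing to the named headers; parse_dates converts the timestamps.
df = pd.read_csv(StringIO(csv_text), usecols=wanted, parse_dates=["TIMESTAMP"])

print(df.dtypes)
print(df)
```

Because the unwanted columns are never parsed, this also reduces memory use on a file with millions of rows.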
