How to plot daily plots from yearly time series - python

I have hourly ozone data over a multi year period in a pandas dataframe. I need to create plots of the ozone data for every day of the year (i.e. 365 plots for the year). The time series is in the following format:
time_lt
3 1980-04-24 17:00:00
4 1980-04-24 18:00:00
5 1980-04-24 19:00:00
6 1980-04-24 20:00:00
7 1980-04-24 21:00:00
8 1980-04-24 22:00:00
9 1980-04-24 23:00:00
10 1980-04-25 00:00:00
11 1980-04-25 01:00:00
12 1980-04-25 02:00:00
13 1980-04-25 03:00:00
14 1980-04-25 04:00:00
How would I group the data by day in order to plot each one? What is the most efficient way of coding this?
Thanks!

Find the comments inline:
df['time_lt'] = pd.to_datetime(df['time_lt'])
# you can extract day, month, year
df['day'] = df['time_lt'].dt.day
df['month'] = df['time_lt'].dt.month
df['year'] = df['time_lt'].dt.year
#then use groupby
grouped = df.groupby(['day', 'month', 'year'])
# now you can plot individual groups
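For example, a single day can be pulled out with get_group and plotted. This is only a sketch: 'ozone' stands in for whatever your measurement column is actually called, since the question does not name it.
import matplotlib.pyplot as plt

# key order matches the groupby keys ['day', 'month', 'year']
one_day = grouped.get_group((24, 4, 1980))
plt.plot(one_day['time_lt'], one_day['ozone'])  # 'ozone' is a placeholder column name
plt.title('Ozone on 1980-04-24')
plt.show()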

You can group on the fly:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(
"""id time_lt
3 1980-04-24 17:00:00
4 1980-04-24 18:00:00
5 1980-04-24 19:00:00
6 1980-04-24 20:00:00
7 1980-04-24 21:00:00
8 1980-04-24 22:00:00
9 1980-04-24 23:00:00
10 1980-04-25 00:00:00
11 1980-04-25 01:00:00
12 1980-04-25 02:00:00
13 1980-04-25 03:00:00
14 1980-04-25 04:00:00"""), sep=r"\s+(?!\d{2}:)", engine="python")  # split on whitespace except before the HH: part, so the timestamp stays in one column
df['time_lt'] = pd.to_datetime(df['time_lt'])
>>> df.groupby(df.time_lt.dt.floor('1D')).count()
id time_lt
time_lt
1980-04-24 7 7
1980-04-25 5 5
In theory, you can write a plotting function and apply it directly to the groupby result, but that makes it harder to control. Since plotting will still be the slowest operation in this chain, you can safely do a simple iteration over the dates.
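Putting that together, a minimal sketch of such an iteration, saving one figure per day. As above, 'ozone' is an assumed column name and the output filenames are just illustrative.
import matplotlib.pyplot as plt

for day, day_df in df.groupby(df['time_lt'].dt.floor('1D')):
    fig, ax = plt.subplots()
    ax.plot(day_df['time_lt'], day_df['ozone'])   # 'ozone' is a placeholder column name
    ax.set_title(day.strftime('%Y-%m-%d'))
    fig.savefig('ozone_' + day.strftime('%Y%m%d') + '.png')
    plt.close(fig)  # close each figure to avoid holding hundreds open at once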

Related

Select groups using slicing based on the group index in pandas DataFrame

I have a DataFrame with users indicated by the column 'user_id'. Each of these users has several entries in the dataframe based on the date on which they did something, which is also a column. The dataframe looks something like
df:
user_id date
0 2019-04-13 02:00:00
0 2019-04-13 03:00:00
3 2019-02-18 22:00:00
3 2019-02-18 23:00:00
3 2019-02-19 00:00:00
3 2019-02-19 02:00:00
3 2019-02-19 03:00:00
3 2019-02-19 04:00:00
8 2019-04-05 04:00:00
8 2019-04-05 05:00:00
8 2019-04-05 06:00:00
8 2019-04-05 15:00:00
15 2019-04-28 19:00:00
15 2019-04-28 20:00:00
15 2019-04-29 01:00:00
23 2019-06-24 02:00:00
23 2019-06-24 05:00:00
23 2019-06-24 06:00:00
24 2019-03-27 12:00:00
24 2019-03-27 13:00:00
What I want to do is, for example, select the first 3 users. I wanted to do this with a code like this:
df.groupby('user_id').iloc[:3]
I know that groupby doesn't have an iloc, so how could I achieve the same kind of slicing on the groups?
I found a way based on crayxt's answer:
df[df['user_id'].isin(df['user_id'].unique()[:3])]
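An equivalent one-liner, sketched here as an alternative (not part of the original answer), uses ngroup to number each user's group and keep the first three in order of appearance:
# sort=False numbers groups in order of first appearance, matching unique()[:3]
first_three_users = df[df.groupby('user_id', sort=False).ngroup() < 3]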

Conditional selection before certain time of day - Pandas dataframe

I have the above dataframe (snippet) and want to create a new dataframe which is a conditional selection where I keep only the rows that are timestamped with a time before 15:00:00.
I'm still somewhat new to Pandas / python and have been stuck on this for a while :(
You can use DataFrame.between_time:
start = pd.to_datetime('2015-02-24 11:00')
rng = pd.date_range(start, periods=10, freq='14h')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
print (df)
Date a
0 2015-02-24 11:00:00 0
1 2015-02-25 01:00:00 1
2 2015-02-25 15:00:00 2
3 2015-02-26 05:00:00 3
4 2015-02-26 19:00:00 4
5 2015-02-27 09:00:00 5
6 2015-02-27 23:00:00 6
7 2015-02-28 13:00:00 7
8 2015-03-01 03:00:00 8
9 2015-03-01 17:00:00 9
df = df.set_index('Date').between_time('00:00:00', '15:00:00')
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-25 15:00:00 2
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
If you need to exclude 15:00:00, add the parameter include_end=False:
df = df.set_index('Date').between_time('00:00:00', '15:00:00', include_end=False)
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
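In pandas 1.4 and later, include_end is deprecated in favour of the inclusive parameter; the equivalent call should look roughly like this:
# inclusive='left' keeps rows at the start time and drops rows exactly at the end time
df = df.set_index('Date').between_time('00:00:00', '15:00:00', inclusive='left')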
You can also check the hours of the date column and use that for subsetting:
df['date'] = pd.to_datetime(df['date'])  # only needed if the column is not already datetime
df[df.date.dt.hour < 15]
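If the cutoff were not a whole hour, comparing against a datetime.time object is a more general variant of the same idea (a sketch; the 14:30 cutoff is just an example):
from datetime import time

before_cutoff = df[df['date'].dt.time < time(14, 30)]  # rows strictly before 14:30:00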

Pandas select columns and data dependant on header

I have a large .csv file. I want to select only the time/date column and 20 other columns which I know by header.
As a test, I try to take only the column with the header 'TIMESTAMP'. I know this column is 4207823 rows long in the .csv and contains only dates and times. The code below selects the TIMESTAMP column but also carries on to take values from other columns, as shown below:
import csv
import numpy as np
import pandas

# read the file into a DataFrame so it can be edited
f = pandas.read_csv(r'C:\Users\mmso2\Google Drive\MABL Wind\_Semester 2 2016\Wind Farm Info\DataB\DataB - NaN2.csv',
                    dtype=object, low_memory=False)
time = f[['TIMESTAMP']]
time = time[0:4207823]  # test to see if this stops time taking other data
print(time)
output
TIMESTAMP
0 2007-08-15 21:10:00
1 2007-08-15 21:20:00
2 2007-08-15 21:30:00
3 2007-08-15 21:40:00
4 2007-08-15 21:50:00
5 2007-08-15 22:00:00
6 2007-08-15 22:10:00
7 2007-08-15 22:20:00
8 2007-08-15 22:30:00
9 2007-08-15 22:40:00
10 2007-08-15 22:50:00
11 2007-08-15 23:00:00
12 2007-08-15 23:10:00
13 2007-08-15 23:20:00
14 2007-08-15 23:30:00
15 2007-08-15 23:40:00
16 2007-08-15 23:50:00
17 2007-08-16 00:00:00
18 2007-08-16 00:10:00
19 2007-08-16 00:20:00
20 2007-08-16 00:30:00
21 2007-08-16 00:40:00
22 2007-08-16 00:50:00
23 2007-08-16 01:00:00
24 2007-08-16 01:10:00
25 2007-08-16 01:20:00
26 2007-08-16 01:30:00
27 2007-08-16 01:40:00
28 2007-08-16 01:50:00
29 2007-08-16 02:00:00 #these are from the TIMESTAMP column
... ...
679302 221.484 #This is from another column
679303 NaN
679304 2015-09-23 06:40:00
679305 NaN
679306 NaN
679307 2015-09-23 06:50:00
679308 NaN
679309 NaN
679310 2015-09-23 07:00:00
The problem was due to an error in the input file, so simple use of usecols in pandas.read_csv worked.
The code below demonstrates the selection of a few columns of data:
import pandas

# read only the selected columns
df = pandas.read_csv('DataB - Copy - Copy.csv', delimiter=',', dtype=object, low_memory=False,
                     usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m'])
print(df)  # see what the data looks like
df.to_csv('DataB_GreaterGabbardOnly.csv')  # save the selection to a new .csv file
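If the TIMESTAMP column should come out as real datetimes rather than strings, parse_dates can be combined with usecols; a sketch using the same file and column names as above:
import pandas as pd

df = pd.read_csv('DataB - Copy - Copy.csv',
                 usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m'],
                 parse_dates=['TIMESTAMP'])  # TIMESTAMP becomes datetime64[ns]
print(df.dtypes)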

filter pandas dataframe by time

I have a pandas dataframe which I want to subset on time greater or less than 12pm. First I convert my string datetime to a datetime64[ns] object in pandas.
segments_data['time'] = pd.to_datetime((segments_data['time']))
Then I separate time,date,month,year & dayofweek like below.
import datetime as dt
segments_data['date'] = segments_data.time.dt.date
segments_data['year'] = segments_data.time.dt.year
segments_data['month'] = segments_data.time.dt.month
segments_data['dayofweek'] = segments_data.time.dt.dayofweek
segments_data['time'] = segments_data.time.dt.time
My time column looks like following.
segments_data['time']
Out[1906]:
07:43:00
07:52:00
08:00:00
08:42:00
09:18:00
09:18:00
09:18:00
09:23:00
12:32:00
12:43:00
12:55:00
Name: time, dtype: object
Now I want to subset the dataframe into times greater than 12pm and times less than 12pm.
segments_data.time[segments_data['time'] < 12:00:00]
This doesn't work: 12:00:00 isn't valid Python syntax, and after .dt.time the column holds plain time objects rather than a datetime dtype.
Update
From pandas docs at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html. Thanks to Frederick in the comments.
Create dataframe with datetimes in it:
i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
ts
A
2018-04-09 00:00:00 1
2018-04-10 00:20:00 2
2018-04-11 00:40:00 3
2018-04-12 01:00:00 4
Use between_time:
ts.between_time('0:15', '0:45')
A
2018-04-10 00:20:00 2
2018-04-11 00:40:00 3
You get the times that are not between two times by setting start_time later than end_time:
ts.between_time('0:45', '0:15')
A
2018-04-09 00:00:00 1
2018-04-12 01:00:00 4
Old Answer
Leave a column as the raw datetime, call it ts:
segments_data['ts'] = pd.to_datetime((segments_data['time']))
Next, you can cast the datetime to an H:M:S string and use between(start, end):
In [227]:
segments_data=pd.DataFrame(x,columns=['ts'])
segments_data.ts = pd.to_datetime(segments_data.ts)
segments_data
Out[227]:
ts
0 2016-01-28 07:43:00
1 2016-01-28 07:52:00
2 2016-01-28 08:00:00
3 2016-01-28 08:42:00
4 2016-01-28 09:18:00
5 2016-01-28 09:18:00
6 2016-01-28 09:18:00
7 2016-01-28 09:23:00
8 2016-01-28 12:32:00
9 2016-01-28 12:43:00
10 2016-01-28 12:55:00
In [228]:
segments_data[segments_data.ts.dt.strftime('%H:%M:%S').between('00:00:00','12:00:00')]
Out[228]:
ts
0 2016-01-28 07:43:00
1 2016-01-28 07:52:00
2 2016-01-28 08:00:00
3 2016-01-28 08:42:00
4 2016-01-28 09:18:00
5 2016-01-28 09:18:00
6 2016-01-28 09:18:00
7 2016-01-28 09:23:00
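Another option, sketched here (not part of the original answer): between_time from the update above needs a DatetimeIndex, but you can set the ts column as the index temporarily and restore it afterwards:
# filter on a datetime column by temporarily making it the index
before_noon = (segments_data.set_index('ts')
                            .between_time('00:00', '12:00')
                            .reset_index())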
Even though this post is 5 years old, I just ran into the same problem and decided to post what I was able to get to work. I tried the between_time function, but that did not work for me because the index of the dataframe had to be a datetime and I wanted to filter using one of the dataframe's time columns.
# Import datetime libraries
from datetime import datetime, date, time
avail_df['Start'].dt.time
1 08:36:44
2 08:49:14
3 09:26:00
5 08:34:22
7 08:34:19
8 09:09:05
9 12:27:43
10 12:29:14
12 09:05:55
13 09:14:11
14 09:21:41
15 11:28:26
16 12:25:10
17 16:02:52
18 08:53:51
# Use "time()" function to create start/end parameter I used 9:00am for this example
avail_df.loc[avail_df['Start'].dt.time > time(9,00)]
3 09:26:00
8 09:09:05
9 12:27:43
10 12:29:14
12 09:05:55
13 09:14:11
14 09:21:41
15 11:28:26
16 12:25:10
17 16:02:52
20 09:04:50
21 09:21:35
22 09:22:05
23 09:47:05
24 09:55:05
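A self-contained sketch of the same idea with made-up data (the 'Start' column name follows the example above):
import pandas as pd
from datetime import time

avail_df = pd.DataFrame({'Start': pd.to_datetime(
    ['2021-05-03 08:36:44', '2021-05-03 09:26:00', '2021-05-03 12:27:43'])})

# keep rows whose clock time falls between 9:00am and 12:00pm
morning = avail_df[(avail_df['Start'].dt.time >= time(9, 0)) &
                   (avail_df['Start'].dt.time <= time(12, 0))]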

matplotlib plots strange horizontal lines on graph

I have used openpyxl to read data from an Excel spreadsheet into a pandas data frame, called 'tides'. The dataset contains over 32,000 rows of data (of tide times in the UK measured every 15 minutes). One of the columns contains date and time information (variable called 'datetime') and another contains the height of the tide (called 'tide'):
I want to plot datetime along the x-axis and tide on the y axis using:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import openpyxl
import datetime as dt
from matplotlib.dates import date2num
<-- Data imported from Excel spreadsheet into DataFrame using openpyxl. -->
<-- Code omitted for ease of reading. -->
# Convert datatime variable to datetime64 format:
tides['datetime'] = pd.to_datetime(tides['datetime'])
# Plot figure of 'datetime' vs 'tide':
fig = plt.figure()
ax_tides = fig.add_subplot(1,1,1)
ax_tides.plot_date(date2num(tides['datetime']), tides['tide'], '-', xdate=True, label='Tides 2011', linewidth=0.5)
min_datetime = dt.datetime.strptime('01/01/2011 00:00:00',"%d/%m/%Y %H:%M:%S")
max_datetime = dt.datetime.strptime('03/01/2011 23:59:45',"%d/%m/%Y %H:%M:%S")
ax_tides.set_xlim( [min_datetime, max_datetime] )
plt.show()
The plot shows just the first few days of data. However, at the change from one day to the next, something strange happens; after the last point of day 1, the line disappears off to the right and then returns to plot the first point of the second day - but the data is plotted incorrectly on the y axis. This happens throughout the dataset. A printout shows that the data seems to be OK.
number datetime tide
0 1 2011-01-01 00:00:00 4.296
1 2 2011-01-01 00:15:00 4.024
2 3 2011-01-01 00:30:00 3.768
3 4 2011-01-01 00:45:00 3.521
4 5 2011-01-01 01:00:00 3.292
5 6 2011-01-01 01:15:00 3.081
6 7 2011-01-01 01:30:00 2.887
7 8 2011-01-01 01:45:00 2.718
8 9 2011-01-01 02:00:00 2.577
9 10 2011-01-01 02:15:00 2.470
10 11 2011-01-01 02:30:00 2.403
11 12 2011-01-01 02:45:00 2.389
12 13 2011-01-01 03:00:00 2.417
13 14 2011-01-01 03:15:00 2.492
14 15 2011-01-01 03:30:00 2.611
15 16 2011-01-01 03:45:00 2.785
16 17 2011-01-01 04:00:00 3.020
17 18 2011-01-01 04:15:00 3.314
18 19 2011-01-01 04:30:00 3.665
19 20 2011-01-01 04:45:00 4.059
20 21 2011-01-01 05:00:00 4.483
[21 rows x 3 columns]
number datetime tide
90 91 2011-01-01 22:30:00 7.329
91 92 2011-01-01 22:45:00 7.014
92 93 2011-01-01 23:00:00 6.690
93 94 2011-01-01 23:15:00 6.352
94 95 2011-01-01 23:30:00 6.016
95 96 2011-01-01 23:45:00 5.690
96 97 2011-02-01 00:00:00 5.366
97 98 2011-02-01 00:15:00 5.043
98 99 2011-02-01 00:30:00 4.729
99 100 2011-02-01 00:45:00 4.426
100 101 2011-02-01 01:00:00 4.123
101 102 2011-02-01 01:15:00 3.832
102 103 2011-02-01 01:30:00 3.562
103 104 2011-02-01 01:45:00 3.303
104 105 2011-02-01 02:00:00 3.055
105 106 2011-02-01 02:15:00 2.827
106 107 2011-02-01 02:30:00 2.620
107 108 2011-02-01 02:45:00 2.434
108 109 2011-02-01 03:00:00 2.268
109 110 2011-02-01 03:15:00 2.141
110 111 2011-02-01 03:30:00 2.060
[21 rows x 3 columns]
number datetime tide
35020 35021 2011-12-31 19:00:00 5.123
35021 35022 2011-12-31 19:15:00 4.838
35022 35023 2011-12-31 19:30:00 4.551
35023 35024 2011-12-31 19:45:00 4.279
35024 35025 2011-12-31 20:00:00 4.033
35025 35026 2011-12-31 20:15:00 3.803
35026 35027 2011-12-31 20:30:00 3.617
35027 35028 2011-12-31 20:45:00 3.438
35028 35029 2011-12-31 21:00:00 3.278
35029 35030 2011-12-31 21:15:00 3.141
35030 35031 2011-12-31 21:30:00 3.019
35031 35032 2011-12-31 21:45:00 2.942
35032 35033 2011-12-31 22:00:00 2.909
35033 35034 2011-12-31 22:15:00 2.918
35034 35035 2011-12-31 22:30:00 2.923
35035 35036 2011-12-31 22:45:00 2.985
35036 35037 2011-12-31 23:00:00 3.075
35037 35038 2011-12-31 23:15:00 3.242
35038 35039 2011-12-31 23:30:00 3.442
35039 35040 2011-12-31 23:45:00 3.671
I am at a loss to explain this. Can anyone explain what is happening, why it is happening and how can I correct it?
Thanks in advance.
Phil
Doh! Finally found the answer. The original workflow was quite complicated. I stored the data in an Excel spreadsheet and used openpyxl to read data from a named cell range. This was then converted to a pandas DataFrame. The date-and-time variable was converted to datetime format using pandas' .to_datetime() function, and finally the data were plotted using matplotlib. As I was preparing the data to post to this forum (as suggested by rauparaha) and paring down the script to its bare essentials, I noticed that Day 1 data were plotted on 01 Jan 2011 but Day 2 data were plotted on 01 Feb 2011. If you look at the output in the original post, the dates are in mixed formats: the last date given is '2011-12-31' (i.e. year-month-day) but the 2nd date, representing 2nd Jan 2011, is '2011-02-01' (i.e. year-day-month).
So it looks like I misunderstood how the pandas .to_datetime() function interprets datetime information. I had purposely not set the infer_datetime_format attribute (default=False) and had assumed any problems would have been flagged up. But it seems pandas assumes dates are in a month-first format unless that fails, in which case it switches to a day-first format. I should have picked that up!
I have corrected the problem by providing a string that explicitly defines the datetime format. All is fine again.
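The corrected call isn't shown in the post, but based on the day-first format used elsewhere in the question it presumably looks something like this:
# an explicit format (or dayfirst=True) stops pandas from guessing the order
# of day and month on a per-value basis
tides['datetime'] = pd.to_datetime(tides['datetime'], format='%d/%m/%Y %H:%M:%S')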
Thanks again for your suggestions. And apologies for any confusion.
Cheers.
I have been unable to replicate your error but perhaps my working dummy code can help diagnose the problem. I generated dummy data and plotted it with this code:
import pandas as pd
import numpy as np
ydata = np.sin(np.linspace(0, 10, num=200))
time_index = pd.date_range(start='2000-01-01 00:00', periods=200, freq='15min')
df = pd.DataFrame({'tides': ydata, 'datetime': time_index})
df.plot(x='datetime', y='tides')
My data looks like this
datetime tides
0 2000-01-01 00:00:00 0.000000
1 2000-01-01 00:15:00 0.050230
2 2000-01-01 00:30:00 0.100333
3 2000-01-01 00:45:00 0.150183
4 2000-01-01 01:00:00 0.199654
[200 rows]
and generates the following plot
