Box plotting a time series with pandas and matplotlib


I have a time series DataFrame and want to draw it as a box plot, with the data grouped by day.
This is what I did:
groups = df.groupby(pd.Grouper(freq='D'))
data = pd.DataFrame(pd.concat([pd.DataFrame(x[1].values) for x in groups], axis=1))
data.boxplot(figsize=(20,5))
plt.show()
This is the result,
Why have I lost the dates on the x-axis? Can I achieve my goal in an easier way? I feel my code is not Pythonic.
I also tried the code below, but it stacks the same day of different months together.
sns.boxplot(x=d1.index.day, y=d1['Temperature'])
An example of my dataframe:
Temp
2019/01/01 00:00:00 25.3
2019/01/01 00:30:00 22.0
2019/01/01 01:00:00 22.1
2019/01/01 01:30:00 28.1
2019/01/01 02:00:00 26.8
2019/01/01 02:30:00 25.3
...
2019/01/02 00:00:00 20.2
2019/01/02 00:30:00 27.0
2019/01/02 01:00:00 27.5
2019/01/02 01:30:00 28.1
2019/01/02 02:00:00 28.8
2019/01/02 02:30:00 26.3
...
2019/02/10 23:30:00 21.6
Thank you!
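One possible approach, as a minimal sketch rather than a definitive answer: the dates disappear because pd.DataFrame(x[1].values) discards the index, so every concatenated column ends up named 0. Reshaping the frame so that each calendar day becomes its own column keeps the dates as column names, and DataFrame.boxplot then uses them as tick labels. The synthetic data and the helper name plot_daily_boxes are assumptions for illustration; only the Temp column name comes from the sample above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data: two full days at 30-minute steps.
idx = pd.date_range("2019-01-01", periods=96, freq="30min")
df = pd.DataFrame({"Temp": np.linspace(20.0, 30.0, 96)}, index=idx)

# One column per calendar day, one row per time of day.
wide = (
    df.assign(date=df.index.date, time=df.index.time)
      .pivot(index="time", columns="date", values="Temp")
)


def plot_daily_boxes(frame):
    """Hypothetical helper: draw one box per column, labelled with the column name."""
    import matplotlib.pyplot as plt  # imported lazily so the reshaping runs without it

    ax = frame.boxplot(figsize=(20, 5), rot=90)
    ax.set_ylabel("Temp")
    plt.show()
```

Calling plot_daily_boxes(wide) should draw one box per day with the date on the x-axis. Days with missing readings simply contain NaN in their column, which the box plot ignores.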

Related

How to loop through dates column and assign values according to a certain condition?

I have a df as follows
dates winter summer rest Final
2020-01-01 00:15:00 65.5 71.5 73.0 NaN
2020-01-01 00:30:00 62.6 69.0 70.1 NaN
2020-01-01 00:45:00 59.6 66.3 67.1 NaN
2020-01-01 01:00:00 57.0 63.5 64.5 NaN
2020-01-01 01:15:00 54.8 60.9 62.3 NaN
2020-01-01 01:30:00 53.1 58.6 60.6 NaN
2020-01-01 01:45:00 51.7 56.6 59.2 NaN
2020-01-01 02:00:00 50.5 55.1 57.9 NaN
2020-01-01 02:15:00 49.4 54.2 56.7 NaN
2020-01-01 02:30:00 48.5 53.7 55.6 NaN
2020-01-01 02:45:00 47.9 53.4 54.7 NaN
2020-01-01 03:00:00 47.7 53.3 54.2 NaN
2020-01-01 03:15:00 47.9 53.1 54.1 NaN
2020-01-01 03:30:00 48.7 53.2 54.6 NaN
2020-01-01 03:45:00 50.2 54.1 55.8 NaN
2020-01-01 04:00:00 52.3 56.1 57.9 NaN
2020-04-28 12:30:00 225.1 200.0 209.8 NaN
2020-04-28 12:45:00 215.7 193.8 201.9 NaN
2020-04-28 13:00:00 205.6 186.9 193.4 NaN
2020-04-28 13:15:00 195.7 179.9 185.0 NaN
2020-04-28 13:30:00 186.7 173.4 177.4 NaN
2020-04-28 13:45:00 179.2 168.1 170.9 NaN
2020-04-28 14:00:00 173.8 164.4 166.3 NaN
2020-04-28 14:15:00 171.0 163.0 163.9 NaN
2020-04-28 14:30:00 170.7 163.5 163.6 NaN
2020-12-31 21:15:00 88.5 90.2 89.2 NaN
2020-12-31 21:30:00 85.2 88.5 87.2 NaN
2020-12-31 21:45:00 82.1 86.3 85.0 NaN
2020-12-31 22:00:00 79.4 84.1 83.2 NaN
2020-12-31 22:15:00 77.6 82.4 82.1 NaN
2020-12-31 22:30:00 76.4 81.2 81.7 NaN
2020-12-31 22:45:00 75.6 80.3 81.6 NaN
2020-12-31 23:00:00 74.7 79.4 81.3 NaN
2020-12-31 23:15:00 73.7 78.4 80.6 NaN
2020-12-31 23:30:00 72.3 77.2 79.5 NaN
2020-12-31 23:45:00 70.5 75.7 77.9 NaN
2021-01-01 00:00:00 68.2 73.8 75.7 NaN
The dates column runs from 2020-01-01 00:15:00 to 2021-01-01 00:00:00 in 15-minute steps.
I also have the following date range conditions:
Winter: 01.11 - 20.03
Summer: 15.05 - 14.09
Rest: 21.03 - 14.05 & 15.09 - 31.10
I want to create a new column named season that checks every date in the dates column and assigns winter if the date falls in the Winter range, summer if it falls in the Summer range, and rest if it falls in the Rest range.
Then the Final column must be filled based on the season column: where season is 'winter', take the value from the winter column; where it is 'summer', take the value from the summer column; and so on.
How can this be done?

The idea is to normalize all datetimes to the same year, filter with Series.between, and set the new column with numpy.select:
d = pd.to_datetime(df['dates'].dt.strftime('%m-%d-2020'))
m1 = d.between('2020-11-01','2020-12-31') | d.between('2020-01-01','2020-03-20')
m2 = d.between('2020-05-15','2020-09-14')
df['Final'] = np.select([m1, m2], ['Winter','Summer'], default='Rest')
print (df)
dates winter summer rest Final
0 2020-01-01 00:15:00 65.5 71.5 73.0 Winter
1 2020-06-15 00:30:00 62.6 69.0 70.1 Summer
2 2020-12-25 00:45:00 59.6 66.3 67.1 Winter
3 2020-10-10 01:00:00 57.0 63.5 64.5 Rest
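The snippet above writes the season label into Final, whereas the question asks for a separate season column and for Final to hold the value from the matching column. A minimal sketch of that variant follows; it reuses the same masks, and the four sample rows are taken from the printed output above. Passing the value columns to numpy.select as the choices is my own extension, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dates": pd.to_datetime([
        "2020-01-01 00:15:00", "2020-06-15 00:30:00",
        "2020-12-25 00:45:00", "2020-10-10 01:00:00",
    ]),
    "winter": [65.5, 62.6, 59.6, 57.0],
    "summer": [71.5, 69.0, 66.3, 63.5],
    "rest": [73.0, 70.1, 67.1, 64.5],
})

# Map every timestamp onto the same (leap) year so month/day ranges can be compared.
d = pd.to_datetime(df["dates"].dt.strftime("2020-%m-%d"))
m1 = d.between("2020-11-01", "2020-12-31") | d.between("2020-01-01", "2020-03-20")
m2 = d.between("2020-05-15", "2020-09-14")

# Label column first, then pick the value from the matching column.
df["season"] = np.select([m1, m2], ["winter", "summer"], default="rest")
df["Final"] = np.select([m1, m2], [df["winter"], df["summer"]], default=df["rest"])
print(df)
```

Using 2020 as the common year keeps 29 February valid, which matters if the real data ever spans a leap day.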

How to extract hourly data from a df in python?

I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The full dataset runs until 2021-01-01 00:00:00 (value 95.6) in 15-minute steps.
Since the frequency is 15 minutes, I would like to change it to 1 hour, perhaps by dropping the values in between.
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks

Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
# if 'dates' is not already a datetime column, convert it first:
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5

If you're doing data analysis or data science, I don't think dropping the in-between values is a good approach. I don't know your use case, but with time series data you would usually aggregate them instead (for example a sum or a mean per hour), so that no measurements are thrown away.
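As a rough sketch of the aggregation alternative, assuming an hourly mean is what is wanted: DataFrame.resample with closed='right' and label='right' puts the readings at :15, :30, :45 and :00 into the bin labelled with the full hour that ends it, which matches the timestamps in the expected output. The sample values are the twelve rows from the question; the choice of mean over sum is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "dates": pd.date_range("2020-01-01 00:15:00", periods=12, freq="15min"),
    "Final": [94.7, 94.1, 94.1, 95.0, 96.6, 98.4,
              99.8, 99.8, 98.0, 95.1, 91.9, 89.5],
})

# Bins are (00:00, 01:00], (01:00, 02:00], ... and carry the right edge as label.
hourly = (
    df.resample("60min", on="dates", closed="right", label="right")["Final"]
      .mean()
)
print(hourly)
```

Replacing .mean() with .sum() or .last() gives an hourly total or reproduces the "keep only the full-hour value" behaviour, respectively.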

Data grouping into weekly, monthly and yearly for large datasets using python?

I have a dataset that records an 'X' value over 20 years in a DataFrame. X is recorded as a 3-hour average; a sample of the data is given below.
Time_stamp X
1992-01-01 03:00:00 10.2
1992-01-01 06:00:00 10.4
1992-01-01 09:00:00 11.8
1992-01-01 12:00:00 12.0
1992-01-01 15:00:00 10.4
1992-01-01 18:00:00 9.4
1992-01-01 21:00:00 10.4
1992-01-02 00:00:00 13.6
1992-01-02 03:00:00 13.2
1992-01-02 06:00:00 11.8
1992-01-02 09:00:00 12.0
1992-01-02 12:00:00 12.8
1992-01-02 15:00:00 12.6
1992-01-02 18:00:00 11.0
1992-01-02 21:00:00 12.2
1992-01-03 00:00:00 13.8
1992-01-03 03:00:00 14.0
1992-01-03 06:00:00 13.4
1992-01-03 09:00:00 14.2
1992-01-03 12:00:00 16.2
1992-01-03 15:00:00 13.2
1992-01-03 18:00:00 13.4
1992-01-03 21:00:00 13.8
1992-01-04 00:00:00 14.8
1992-01-04 03:00:00 13.8
1992-01-04 06:00:00 7.6
1992-01-04 09:00:00 5.8
1992-01-04 12:00:00 4.4
1992-01-04 15:00:00 5.6
1992-01-04 18:00:00 6.0
1992-01-04 21:00:00 7.0
1992-01-05 00:00:00 6.8
1992-01-05 03:00:00 3.4
1992-01-05 06:00:00 5.8
1992-01-05 09:00:00 10.6
1992-01-05 12:00:00 9.2
1992-01-05 15:00:00 10.6
1992-01-05 18:00:00 9.8
1992-01-05 21:00:00 11.2
1992-01-06 00:00:00 12.0
1992-01-06 03:00:00 10.2
1992-01-06 06:00:00 9.0
1992-01-06 09:00:00 9.0
1992-01-06 12:00:00 8.6
1992-01-06 15:00:00 8.4
1992-01-06 18:00:00 8.2
1992-01-06 21:00:00 8.8
1992-01-07 00:00:00 10.0
1992-01-07 03:00:00 9.6
1992-01-07 06:00:00 8.0
1992-01-07 09:00:00 9.6
1992-01-07 12:00:00 10.8
1992-01-07 15:00:00 10.2
1992-01-07 18:00:00 9.8
1992-01-07 21:00:00 10.2
1992-01-08 00:00:00 9.4
1992-01-08 03:00:00 11.4
1992-01-08 06:00:00 12.6
1992-01-08 09:00:00 12.8
1992-01-08 12:00:00 10.4
1992-01-08 15:00:00 11.2
1992-01-08 18:00:00 9.0
1992-01-08 21:00:00 10.2
1992-01-09 00:00:00 8.2
I would like to create separate DataFrames that calculate and record the yearly, weekly and daily means of this dataset. I am new to Python and have just started working with time series data. I found some related questions here on Stack Overflow but no suitable answer, and I don't know where to start. Any help with this?
I wrote this code so far,
import pandas as pd
import numpy as np
datasets['date_minus_time'] = df["Time_stamp"].apply( lambda df :
datetime.datetime(year=datasets.year, month=datasets.month,
day=datasets.day))
datasets.set_index(df["date_minus_time"],inplace=True)
df['count'].resample('D', how='sum')
df['count'].resample('W', how='sum')
df['count'].resample('M', how='sum')
But I have no clue how to account for the fact that the data is recorded every 3 hours, or what to do next to get my desired result.

Use to_datetime to convert the column to datetimes (which also improves performance), then DataFrame.resample with the on parameter to specify the datetime column:
df['Time_stamp'] = pd.to_datetime(df['Time_stamp'])
df_daily = df.resample('D', on='Time_stamp').mean()
df_monthly = df.resample('M', on='Time_stamp').mean()
df_weekly = df.resample('W', on='Time_stamp').mean()

You may use:
df['Time_stamp'] = pd.to_datetime(df['Time_stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index('Time_stamp',inplace=True)
df_monthly = df.resample('M').mean()
df_monthly outputs:
X
Time_stamp
1992-01-31 10.403125
For daily mean use: df_daily = df.resample('D').mean() which outputs:
X
Time_stamp
1992-01-01 10.657143
1992-01-02 12.400000
1992-01-03 14.000000
1992-01-04 8.125000
1992-01-05 8.425000
1992-01-06 9.275000
1992-01-07 9.775000
1992-01-08 10.875000
1992-01-09 8.200000
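Neither answer shows the yearly mean the question asks for, so here is a minimal sketch covering all four granularities on the first eight sample rows. Grouping by Series.dt.to_period and Series.dt.year for the monthly and yearly means is my own choice; as far as I know, recent pandas releases (2.2 and later) warn about the 'M' resample alias in favor of 'ME', and grouping this way sidesteps that difference.

```python
import pandas as pd

df = pd.DataFrame({
    "Time_stamp": pd.date_range("1992-01-01 03:00:00", periods=8, freq="180min"),
    "X": [10.2, 10.4, 11.8, 12.0, 10.4, 9.4, 10.4, 13.6],
})
df["Time_stamp"] = pd.to_datetime(df["Time_stamp"])  # no-op here; needed for string input

# Daily and weekly means via resample on the timestamp column.
daily = df.resample("D", on="Time_stamp")["X"].mean()
weekly = df.resample("W", on="Time_stamp")["X"].mean()

# Monthly and yearly means via groupby on a derived period / year key.
monthly = df.groupby(df["Time_stamp"].dt.to_period("M"))["X"].mean()
yearly = df.groupby(df["Time_stamp"].dt.year)["X"].mean()

print(daily, weekly, monthly, yearly, sep="\n\n")
```

The 3-hour spacing needs no special handling: each reading is simply assigned to the day, week, month or year that contains its timestamp.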

Python pandas DataFrame loc selection for a range of rows and columns

Here is a head() of my DataFrame df:
Temperature DewPoint Pressure
Date
2010-01-01 00:00:00 46.2 37.5 1.0
2010-01-01 01:00:00 44.6 37.1 1.0
2010-01-01 02:00:00 44.1 36.9 1.0
2010-01-01 03:00:00 43.8 36.9 1.0
2010-01-01 04:00:00 43.5 36.8 1.0
I want to select from August 1 to August 15 2010 and display only the Temperature column.
What I am trying to do is:
df.loc[['2010-08-01','2010-08-15'],'Temperature']
But this is throwing me an error.
Generally speaking, what I want to learn is how to use the loc method to take a range of rows i to k and columns j to p and show the result as a DataFrame:
df.loc[[i:k],[j:p]]
Thank you very much in advance!!!
Steve

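I think if you want to be able to pass a slice for both the index and the columns, you can use loc to achieve this (older answers suggest ix, but that indexer was deprecated in pandas 0.20 and removed in 1.0):
In [19]:
df.loc['2010-01-01':, 'DewPoint':]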
Out[19]:
DewPoint Pressure
Date
2010-01-01 00:00:00 37.5 1.0
2010-01-01 01:00:00 37.1 1.0
2010-01-01 02:00:00 36.9 1.0
2010-01-01 03:00:00 36.9 1.0
2010-01-01 04:00:00 36.8 1.0
The docs detail numerous ways of selecting data
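To make the two cases from the question concrete, here is a small sketch on synthetic hourly data (the values are made up; only the column names and the Date index name come from the question). The error in the question comes from passing a list of two labels, which asks for exactly those two rows; a slice with a colon asks for the whole range.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for 2010 with the question's column names.
idx = pd.date_range("2010-01-01", "2010-12-31 23:00:00", freq="60min", name="Date")
df = pd.DataFrame(
    {
        "Temperature": np.arange(len(idx), dtype=float),
        "DewPoint": 37.0,
        "Pressure": 1.0,
    },
    index=idx,
)

# Label-based: a date-string slice includes both end days in full.
aug = df.loc["2010-08-01":"2010-08-15", "Temperature"]

# Position-based: rows i..k-1 and columns j..p-1 use iloc with integer slices.
block = df.iloc[0:5, 1:3]

print(aug.head())
print(block)
```

Note the difference in end points: loc slices include the last label, while iloc slices exclude the stop position, as ordinary Python slices do.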

read excel and convert index to DatetimeIndex pandas

I read an excel in pandas like this
df = pd.read_excel("Test.xlsx", index_col=[0])
The dataframe look like this with the index containing a date and time and one column:
01.01.2015 00:15:00 47.2
01.01.2015 00:30:00 46.6
01.01.2015 00:45:00 19.4
01.01.2015 01:00:00 14.8
01.01.2015 01:15:00 14.8
01.01.2015 01:30:00 16.4
01.01.2015 01:45:00 16.2
...
I want to convert the index to a DatetimeIndex. I tried
df.index = pd.to_datetime(df.index)
and got: "ValueError: Unknown string format"
What is the best way to convert the index to a datetime format containing date and time, so that I can use datetime-based functions?

I think you need to add the format parameter (see http://strftime.org/):
df.index = pd.to_datetime(df.index, format='%d.%m.%Y %H:%M:%S')
print (df)
a
2015-01-01 00:15:00 47.2
2015-01-01 00:30:00 46.6
2015-01-01 00:45:00 19.4
2015-01-01 01:00:00 14.8
2015-01-01 01:15:00 14.8
2015-01-01 01:30:00 16.4
2015-01-01 01:45:00 16.2
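A self-contained sketch of the same conversion, built from an in-memory frame instead of Test.xlsx so it runs without the file. The 02.01.2015 row is an addition of mine, included to show that the explicit format reads the string as day.month.year (2 January) rather than guessing.

```python
import pandas as pd

# String index in day.month.year order, as it might arrive from the Excel sheet.
df = pd.DataFrame(
    {"a": [47.2, 46.6, 19.4, 14.8]},
    index=[
        "01.01.2015 00:15:00",
        "01.01.2015 00:30:00",
        "01.01.2015 00:45:00",
        "02.01.2015 01:00:00",
    ],
)

# An explicit format removes any day/month ambiguity.
df.index = pd.to_datetime(df.index, format="%d.%m.%Y %H:%M:%S")

# Datetime-based operations now work, e.g. a daily mean.
daily = df.resample("D").mean()
print(df)
print(daily)
```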
