Plot several curves from a dataframe in Python

I am trying to plot several appliances' temperatures on one plot.
The data comes from the dataframe df below, and I first set the Date column as the index.
df = df.set_index('Date')
Date                 Appliance    Value (degrees)
2016-07-05 03:00:00  Thermometer  22
2016-08-06 16:00:00  Thermometer  19
2016-12-07 21:00:00  Thermometer  25
2016-19-08 23:00:00  Thermostat   21
2016-25-09 06:00:00  Thermostat   20
2016-12-10 21:00:00  Thermometer  18
2016-10-11 21:00:00  Thermostat   21
2016-10-12 04:00:00  Thermometer  20
2017-01-01 07:00:00  Thermostat   19
2017-01-02 07:00:00  Thermometer  23
We want to show 2 curves over time, one for the thermometer's temperatures and one for the thermostat's temperatures, in 2 different colours. This is what I have tried so far:
plt.plot(df.index, [df.value for i in range(len(appliance)]
ax = df.plot()
ax.set_xlim(pd.Timestamp('2016-07-05'), pd.Timestamp('2015-11-30'))
Is ggplot better for this?
I cannot manage to make this work.

There are of course several ways to plot the data. So assume we have a dataframe like this:
import pandas as pd
dates = ["2016-07-05 03:00:00", "2016-08-06 16:00:00", "2016-12-07 21:00:00",
"2016-19-08 23:00:00", "2016-25-09 06:00:00", "2016-12-10 21:00:00",
"2016-10-11 21:00:00", "2016-10-12 04:00:00", "2017-01-01 07:00:00",
"2017-01-02 07:00:00"]
app = ["Thermometer","Thermometer","Thermometer","Thermostat","Thermostat","Thermometer",
"Thermostat","Thermometer","Thermostat","Thermometer"]
values = [22,19,25,21,20,18,21,20,19,23]
df = pd.DataFrame({"Date" : dates, "Appliance" : app, "Value":values})
df.Date = pd.to_datetime(df['Date'], format='%Y-%d-%m %H:%M:%S')
df=df.set_index('Date')
Using matplotlib pyplot.plot()
import matplotlib.pyplot as plt
df1 = df[df["Appliance"] == "Thermostat"]
df2 = df[df["Appliance"] == "Thermometer"]
plt.plot(df1.index, df1["Value"].values, marker="o", label="Thermostat")
plt.plot(df2.index, df2["Value"].values, marker="o", label="Thermometer")
plt.gcf().autofmt_xdate()
plt.legend()
Using pandas DataFrame.plot()
df1 = df[df["Appliance"] == "Thermostat"]
df2 = df[df["Appliance"] == "Thermometer"]
ax = df1.plot(y="Value", label="Thermostat")
df2.plot(y="Value", ax=ax, label="Thermometer")
ax.legend()
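A third option, not shown above but a common pattern, is to loop over a groupby so that the filtering and labelling happen in one place. A minimal sketch, assuming the df built above:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for name, group in df.groupby("Appliance"):
    # each group keeps the Date index, so plot() puts time on the x-axis
    group["Value"].plot(ax=ax, marker="o", label=name)
ax.legend()
plt.show()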

Related

How to plot time only of pandas datetime64[ns] attribute

I have a dataframe covering a long time range, with a datetime64[ns] column and an int value column.
The data looks like this:
MIN_DEP DELAY
0 2018-01-01 05:09:00 0
1 2018-01-01 05:13:00 0
2 2018-01-01 05:39:00 0
3 2018-01-01 05:43:00 0
4 2018-01-01 06:12:00 34
... ... ...
77005 2020-09-30 23:42:00 0
77006 2020-09-30 23:43:00 0
77007 2020-09-30 23:43:00 43
77008 2020-10-01 00:18:00 0
77009 2020-10-01 00:59:00 0
[77010 rows x 2 columns]
MIN_DEP datetime64[ns]
DELAY int64
dtype: object
The target is to plot all the data on a single 00:00 - 24:00 range on the x-axis, with no dates anymore.
When I try to plot it, the timeline shows 00:00 at every point. How can I fix this?
import matplotlib.dates as mdates
fig, ax = plt.subplots()
ax.plot(pd_to_stat['MIN_DEP'],pd_to_stat['DELAY'])
xfmt = mdates.DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(xfmt)
plt.show()
I tried converting the timestamps to dt.time first and then plotting:
pd_to_stat['time'] = pd.to_datetime(pd_to_stat['MIN_DEP'], format='%H:%M').dt.time
fig, ax = plt.subplots()
ax.plot(pd_to_stat['time'],pd_to_stat['DELAY'])
plt.show()
Plotting does not allow that:
TypeError: float() argument must be a string or a number, not 'datetime.time'
According to your requirement, I guess you don't need the dates or the seconds field in your timestamps, so you need a little bit of preprocessing first.
Remove the seconds field (keeping hours and minutes as strings) with the code below:
dataset['MIN_DEP'] = dataset['MIN_DEP'].dt.strftime("%H:%M")
Alternatively, you can remove the date from your timestamps while keeping datetime.time objects, in the following manner:
dataset['MIN_DEP'] = pd.Series([val.time() for val in dataset['MIN_DEP']])
Then you can plot your data in the usual manner.
This seems to work now. I had not noticed that the plot was still splitting up by date. To work around it I had to replace all the dates with the same date, then plot it while hiding the date using DateFormatter:
import matplotlib.dates as mdates
pd_to_stat['MIN_DEP'] = pd_to_stat['MIN_DEP'].map(lambda t: t.replace(year=2020, month=1, day=1))
fig, ax = plt.subplots()
ax.plot(pd_to_stat['MIN_DEP'],pd_to_stat['DELAY'])
xfmt = mdates.DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(xfmt)
plt.show()
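As an optional refinement (a suggestion beyond the original answers), an HourLocator can control how densely the hour-only ticks are spaced once the dates have been collapsed to a single day as above:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(pd_to_stat['MIN_DEP'], pd_to_stat['DELAY'])
ax.xaxis.set_major_locator(mdates.HourLocator(interval=2))  # one tick every 2 hours
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.show()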

How to plot only the dates inside my df and not all the dates in between

I have the following df:
date values
2020-08-06 08:00:00 5
2020-08-06 09:00:00 10
2020-08-06 10:00:00 0
2020-08-17 08:00:00 8
2020-08-17 09:00:00 15
I want to plot this df, so I do df.set_index('date')['values'].plot(kind='line'), but it shows all the dates between the 6th and the 17th.
How can I plot the graph only with the dates inside my df ?
I assume that the date column is of datetime type.
To draw only the selected dates, the index must be built on the principle "day number from a unique list of dates + hour".
But to suppress the default x tick labels, you have to define your own, e.g. one every 8 hours within each date drawn.
Start by adding the needed imports and converting your DataFrame as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = df['date'].dt.normalize().unique()
dateMap = pd.Series(np.arange(idx.size) * 24, index=idx)
df.set_index(df.date.dt.date.map(dateMap) + df.date.dt.hour, inplace=True)
df.index.rename('HourNo', inplace=True); df
Now, for your data sample, it has the following content:
date values
HourNo
8 2020-08-06 08:00:00 5
9 2020-08-06 09:00:00 10
10 2020-08-06 10:00:00 0
32 2020-08-17 08:00:00 8
33 2020-08-17 09:00:00 15
Then generate your plot and x ticks positions and labels:
fig, ax = plt.subplots(tight_layout=True)
df.loc[:, 'values'].plot(style='o-', rot=30, ax=ax)
xLoc = np.arange(0, dateMap.index.size * 24, 8)
xLbl = pd.concat([pd.Series(d + pd.timedelta_range(start=0, freq='8H', periods=3))
                  for d in dateMap.index]).dt.strftime('%Y-%m-%d\n%H:%M')
plt.xticks(ticks=xLoc, labels=xLbl, ha='right')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.set_title('Set the proper heading')
ax.grid()
plt.show()
I also added a grid. The resulting plot has x ticks only within the dates present in the data.
And a final remark: avoid column names that are the same as existing pandas methods or attributes (e.g. values). They are sometimes the cause of "stupid" errors (you intend to refer to a column, but actually refer to a method or attribute).
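A simpler sketch of the same idea, assuming the original 'date' and 'values' columns are still available: plot against the row position instead of the timestamp and label the ticks with the actual dates, so matplotlib never inserts the missing calendar days.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(tight_layout=True)
ax.plot(range(len(df)), df['values'], marker='o')   # positional x, no calendar gaps
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df['date'].dt.strftime('%Y-%m-%d\n%H:%M'), rotation=30, ha='right')
plt.show()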

Smart way of creating multiple graphs using matplotlib

I have an Excel worksheet, let us say its name is 'ws_actual'. The data looks like this:
Project Name Date Paid Actuals Item Amount Cumulative Sum
A 2016-04-10 00:00:00 124.2 124.2
A 2016-04-27 00:00:00 2727.5 2851.7
A 2016-05-11 00:00:00 2123.58 4975.28
A 2016-05-24 00:00:00 2500 7475.28
A 2016-07-07 00:00:00 38374.6 45849.88
A 2016-08-12 00:00:00 2988.14 48838.02
A 2016-09-02 00:00:00 23068 71906.02
A 2016-10-31 00:00:00 570.78 72476.8
A 2016-11-09 00:00:00 10885.75 83362.55
A 2016-12-08 00:00:00 28302.95 111665.5
A 2017-01-19 00:00:00 4354.3 116019.8
A 2017-02-28 00:00:00 3469.77 119489.57
A 2017-03-29 00:00:00 267.75 119757.32
B 2015-04-27 00:00:00 2969.93 2969.93
B 2015-06-02 00:00:00 118.8 3088.73
B 2015-06-18 00:00:00 2640 5728.73
B 2015-06-26 00:00:00 105.6 5834.33
B 2015-09-03 00:00:00 11879.7 17714.03
B 2015-10-22 00:00:00 5303.44 23017.47
B 2015-11-08 00:00:00 52000 75017.47
B 2015-11-25 00:00:00 2704.13 77721.6
B 2016-03-09 00:00:00 59752.85 137474.45
B 2016-03-13 00:00:00 512.73 137987.18
...
Let us say there are many more projects besides A and B, each with Date Paid and Amount information. I would like to create a plot per project where the x axis is 'Date Paid' and the y axis is 'Cumulative Sum', but when I run the following code, it just combines every project and plots every 'Cumulative Sum' on one graph. I wonder if I need to split the table by project, save each piece, and then bring them in one by one to plot the graph. That is a lot of work, so I am wondering if there is a smarter way to do it. Please help me, genius.
import pandas as pd
import matplotlib.pyplot as plt
ws_actual = pd.read_excel(actual_file[0], sheet_name=0)
ax = ws_actual.plot(x='Date Paid', y='Cumulative Sum', color='g')
Right now you are connecting all of the points, regardless of group. A simple loop will work here allowing you to group the DataFrame and then plot each group as a separate curve. If you want you can define your own colorcycle if you have a lot of groups, so that colors do not repeat.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8,8))
for id, gp in ws_actual.groupby('Project Name'):
    gp.plot(x='Date Paid', y='Cumulative Sum', ax=ax, label=id)
plt.show()
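If there are many groups, a custom colour cycle keeps the curves distinguishable; a sketch, where the tab20 colormap is an arbitrary choice:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))
ax.set_prop_cycle(color=plt.cm.tab20.colors)   # 20 distinct colours before repeating
for name, gp in ws_actual.groupby('Project Name'):
    gp.plot(x='Date Paid', y='Cumulative Sum', ax=ax, label=name)
plt.show()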
You could just iterate the projects:
for proj in ws_actual['Project Name'].unique():
    ws_actual[ws_actual['Project Name'] == proj].plot(x='Date Paid', y='Cumulative Sum', color='g')
plt.show()
Or check out seaborn for an easy way to make a facet grid for which you can set a rows variable. Something along the lines of:
import seaborn as sns
g = sns.FacetGrid(ws_actual, row="Project Name")
g = g.map(plt.scatter, "Date Paid", "Cumulative Sum", edgecolor="w")
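If a single set of axes is preferred over a facet grid, a hue-based sketch along the same lines (assuming a reasonably recent seaborn) would be:
import seaborn as sns
import matplotlib.pyplot as plt

# one axes, one coloured line per project; seaborn handles the grouping and legend
sns.lineplot(data=ws_actual, x='Date Paid', y='Cumulative Sum', hue='Project Name')
plt.show()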

Plotting time stamp column versus another column in Python

I have an .xls file and it has a timestamp column in the following format:
2018-04-01 00:01:45
2018-04-01 00:16:45
2018-04-01 00:31:46
2018-04-01 00:46:45
2018-04-01 01:01:46
2018-04-01 01:16:45
2018-04-01 01:31:50
2018-04-01 01:46:45
2018-04-01 02:01:46
I have another column in the same .xls file, named temperature, in the following format:
34
34
34
33
33
33
33
33
33
33
33
33
33
33
33
33
33
33
33
I want to plot the values versus time. I tried to plot it, but I am having issues because the timestamps are not being read correctly.
My code:
# changing timestamp data from object to datetime
w = df['Timestamp']
# the column header "Timestamp" was creating an issue, so I had to remove that row
w = w.drop(w.index[0])
# converting timestamp from object to datetime
w = pd.to_datetime(w)
area = (12 * np.random.rand(N))**2 # 0 to 15 point radii
plt.xlabel('Temperature')
plt.ylabel('DateTime')
plt.title('Temperature and DateTime Relation')
plt.scatter(t, w, s=area, c='purple', alpha=0.5)
plt.show()
It's giving me the error "TypeError: invalid type promotion".
I believe you need to_datetime with the format parameter first, and then for 15-minute data add resample with some aggregation function like mean:
# adjust the column names to match the sheet's headers
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y-%m-%d %H:%M:%S')
s = df.resample('15Min', on='Timestamp')['temperature'].mean()
s.plot()
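For completeness, a minimal end-to-end sketch that skips the resampling and just plots the raw readings, assuming the sheet has columns named 'Timestamp' and 'temperature' (adjust to the real headers; the file name is a placeholder):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('data.xls')                     # placeholder file name
df['Timestamp'] = pd.to_datetime(df['Timestamp'])  # ISO-like strings parse without an explicit format
df = df.set_index('Timestamp')
df['temperature'].plot()
plt.xlabel('DateTime')
plt.ylabel('Temperature')
plt.title('Temperature and DateTime Relation')
plt.show()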

Data changes while interpolating data frame using Pandas and numpy

I am trying to calculate degree hours based on hourly temperature values.
The data that I am using has some missing days and I am trying to interpolate it. Below is part of the data:
2012-06-27 19:00:00 24
2012-06-27 20:00:00 23
2012-06-27 21:00:00 23
2012-06-27 22:00:00 16
2012-06-27 23:00:00 15
2012-06-29 00:00:00 15
2012-06-29 01:00:00 16
2012-06-29 02:00:00 16
2012-06-29 03:00:00 16
2012-06-29 04:00:00 17
2012-06-29 05:00:00 17
2012-06-29 06:00:00 18
....
2014-12-14 20:00:00 1
2014-12-14 21:00:00 0
2014-12-14 22:00:00 -1
2014-12-14 23:00:00 8
The full code is:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
filename = 'Temperature12.xls'
df_temp = pd.read_excel(filename)
df_temp = df_temp.set_index('datetime')
ts_temp = df_temp['temp']
def inter_lin_nan(ts_temp, rule):
    ts_temp = ts_temp.resample(rule)
    mask = np.isnan(ts_temp)
    # interpolating missing values
    ts_temp[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), ts_temp[~mask])
    return ts_temp

ts_temp = inter_lin_nan(ts_temp, '1H')
print(ts_temp['2014-06-28':'2014-06-29'])

def HDH(Tcurr, Tref=15.0):
    if Tref >= Tcurr:
        return (Tref - Tcurr) / 24
    else:
        return 0

df_temp['H-Degreehours'] = df_temp.apply(lambda row: HDH(row['temp']), axis=1)
df_temp['CDD-CUMSUM'] = df_temp['C-Degreehours'].cumsum()
df_temp['HDD-CUMSUM'] = df_temp['H-Degreehours'].cumsum()
df_temp1 = df_temp['H-Degreehours'].resample('H', how=sum)
print(df_temp1)
Now I have two questions. First, while using the inter_lin_nan function, it does interpolate the data, but it also changes the next day's data, which ends up totally different from what is in the Excel file. Is this common, or have I missed something?
Second question: at the end of the code I am trying to sum the hourly degree-hour values, which is why I created another DataFrame, but when I print that DataFrame it still has NaN values, as in the original data file. Could you please tell me why this is happening?
I may be missing something very obvious, as I am new to Python.
Don't use numpy when pandas has its own version.
df = pd.read_csv(filepath, index_col=0, parse_dates=True)  # asfreq needs a DatetimeIndex
df = df.asfreq('1d')  # get a time series with an index timestamp for each day
df['somelabel'] = df['somelabel'].interpolate(method='linear')  # interpolate NaN values
Use asfreq to add the required frequency of timestamps to your time series, and interpolate() to fill only the NaN values.
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.interpolate.html
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.asfreq.html
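Putting the advice together with the question's degree-hour computation, a minimal pandas-only sketch (assuming the Excel file's 'datetime' column is parsed as datetimes, as in the question, and Tref = 15.0 as in the HDH function):
import pandas as pd

df_temp = pd.read_excel('Temperature12.xls', index_col='datetime')
ts_temp = df_temp['temp'].asfreq('1H')           # insert the missing hours as NaN
ts_temp = ts_temp.interpolate(method='linear')   # fill only the NaN values
hdh = (15.0 - ts_temp).clip(lower=0) / 24        # heating degree-hours per hour
print(hdh.cumsum().tail())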
