Plotting Pandas GroupBy data - python

I read a dataset into Pandas and filtered the data using df_new=df.query("parent=='pr1'") to create a new DataFrame which looks like this:
child parent date pres
101 ch05 pr1 2004-06-01 2760.35
102 ch05 pr1 2004-07-08 2758.83
103 ch09 pr1 2004-08-04 2759.13
.. ... ... ...
317 ch12 pr1 2021-03-15 1737.09
318 ch12 pr1 2021-03-17 1730.98
183 ch05 pr1 2021-04-30 1777.09
I am trying to calculate the daily average so tried this: pobs = df.groupby('date')['pres'].mean(). This seems to work because print(pobs) gives something like this:
date
2004-06-01 2760.35
2004-07-08 2758.83
2004-08-04 2759.13
However I want to plot date against pres using matplotlib to make sure but have not been able to extract the two arrays separately. I tried tweaking the solution here Plotting pandas groupby but have got myself tied up in knots. I suspect the answer is one or two lines of code but I just can't find them - all suggestions appreciated. Thanks!

You just need to reset the index
pobs = df.groupby('date')['pres'].mean().reset_index()
output:
date pres
0 2004-06-01 2760.35
1 2004-07-08 2758.83
2 2004-08-04 2759.13
In this way, pobs is now a DataFrame and can be plotted as such, for example:
import matplotlib.pyplot as plt
plt.plot(pobs.date,pobs.pres)
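Resetting the index is not the only option. As a minimal sketch, assuming a small made-up DataFrame in the same shape as the one in the question (the sample values below are hypothetical), the two arrays can also be pulled straight out of the grouped Series: the dates are its index and the daily means are its values.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's DataFrame
df = pd.DataFrame({
    "child": ["ch05", "ch09", "ch05", "ch12"],
    "parent": ["pr1"] * 4,
    "date": pd.to_datetime(["2004-06-01", "2004-06-01", "2004-07-08", "2004-08-04"]),
    "pres": [2760.0, 2762.0, 2758.83, 2759.13],
})

# The result is a Series whose index holds the dates
pobs = df.groupby("date")["pres"].mean()

dates = pobs.index       # x values (a DatetimeIndex)
means = pobs.to_numpy()  # y values (a NumPy array of daily means)

print(dates)
print(means)
```

With these two arrays, plt.plot(dates, means) should draw the same line as the reset_index() version, and pobs.plot() is a shorter route to a similar chart.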

Related

Issue combining two data sets based on date and daily energy using pandas

Hopefully this is a basic one for most.
I have created two datasets using random data: one for the days of the year and the other for the energy per day:
import numpy as np
import pandas as pd
np.random.seed(2)
start2018 = pd.Timestamp(2018, 1, 1)  # pd.datetime has been removed from recent pandas versions
end2018 = pd.Timestamp(2018, 12, 31)
dates2018 = pd.date_range(start2018, end2018, freq='d')
synEne2018 = np.random.normal(loc=66.883795, scale=5.448145, size=365)
syn2018data = pd.DataFrame({'Date': [dates2018], 'Total Daily Energy': [synEne2018]})
syn2018data
When I run this code I was hoping to get the daily energy for each date on separate rows. However, what I get is one row similar to below:
Date Total Daily Energy
0 DatetimeIndex(['2018-01-01', '2018-01-02', '20... [64.61323781744713, 66.57724516658102, 55.2454...
Can someone suggest an edit to get this to display as described above?
Remove the square brackets around dates2018 and synEne2018. Wrapping them in square brackets turns each one into a single-element list, so pandas builds one row whose cells hold the whole array. Pass them as they are and you should be good to go.
syn2018data = pd.DataFrame({'Date': dates2018, 'Total Daily Energy': synEne2018})
Prints:
Date Total Daily Energy
0 2018-01-01 64.613238
1 2018-01-02 66.577245
2 2018-01-03 55.245489
3 2018-01-04 75.820228
4 2018-01-05 57.112898
.. ... ...
360 2018-12-27 73.685533
361 2018-12-28 60.096896
362 2018-12-29 65.973035
363 2018-12-30 63.742335
364 2018-12-31 69.150342
[365 rows x 2 columns]
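The reason the brackets matter can be checked without building the broken frame. This is a small sketch under the assumption that pandas creates one row per element of each column value; the variable names are illustrative and the random values will not match the output above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
np.random.seed(2)
energy = np.random.normal(loc=66.883795, scale=5.448145, size=len(dates))

# Wrapping in [] gives a list with ONE element (the whole index),
# while the bare object has one element per day.
print(len([dates]), len(dates))

# One value per day in each column, so one row per day
flat = pd.DataFrame({"Date": dates, "Total Daily Energy": energy})
print(flat.shape)
```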

Plotting time series information with missing date values

I have the following dataset:
dataset.head(7)
Transaction_date Product Product Code Description
2019-01-01 A 123 A123
2019-01-02 B 267 B267
2019-01-09 B 267 B267
2019-02-11 C 139 C139
2019-02-11 A 125 C125
2019-02-12 C 139 C139
2019-02-12 A 123 A123
The dataset stores transaction information, for which a transaction date is available. In other words, data is not available for every day.
Ultimately, I want to create a time series plot, showing me the number of transactions per day.
So far, I have done a simple countplot:
ax = sns.countplot(x=dataset["Transaction_date"],data=dataset)
This plot shows me the dates on which a transaction happened. But I would also like to see the dates on which no transaction happened, preferably shown as 0.
I have tried the following, but I get an error message:
groupbydate = dataset.groupby("Transaction_date")
ax = sns.tsplot(x="Transaction_date",y="Product",data=groupbydate.fillna(0))
But I get the error
cannot label index with a null key
Due to restrictions, I can only use seaborn 0.8.1
I believe reindex should work for you:
# First convert the index to datetime
dataset.index = pd.DatetimeIndex(dataset.index)
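# Then reindex! You can also select the min and max of the index for the limits.
# Leave fill_value unset so that missing dates are filled with real NaN values
# (not the string "NaN"); note that reindex needs an index without duplicate dates.
dataset = dataset.reindex(pd.date_range("2019-01-01", "2019-02-12"))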
You can drop the rows containing NaN values using pandas.DataFrame.dropna, and then plot the chart. For example:
dataset.dropna(thresh=2)
will drop all rows that have fewer than two non-NaN values.
You may also want to fill the NaN values using pandas.DataFrame.fillna
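Because the sample data has several transactions on the same date, one way to get a zero for every empty day is to count first and reindex afterwards. The following is a sketch, not the answerer's code: it assumes Transaction_date is a regular column and rebuilds only two columns of the question's sample.

```python
import pandas as pd

# Sample rows copied from the question (only the columns needed here)
dataset = pd.DataFrame({
    "Transaction_date": pd.to_datetime([
        "2019-01-01", "2019-01-02", "2019-01-09",
        "2019-02-11", "2019-02-11", "2019-02-12", "2019-02-12",
    ]),
    "Product": ["A", "B", "B", "C", "A", "C", "A"],
})

# Number of transactions on each date that appears in the data
counts = dataset.groupby("Transaction_date").size()

# Every calendar day between the first and last transaction
all_days = pd.date_range(counts.index.min(), counts.index.max(), freq="D")

# Days without transactions get a count of 0
daily = counts.reindex(all_days, fill_value=0)
print(daily.head(10))
```

The resulting Series can be plotted with daily.plot() or with plt.plot(daily.index, daily.values), neither of which depends on the seaborn version.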

How can I call individual columns in pandas?

I understand this must be a very basic question, but oddly enough, the resources I've read online don't seem very clear on how to do the following:
How can I index specific columns in pandas?
For example, after importing data from a csv, I have a pandas Series object with individual dates, along with a corresponding dollar amount for each date.
Now, I'd like to group the dates by month (and add their respective dollar amounts for that given month). I plan to create an array where the indexing column is the month, and the next column is the sum of dollar amounts for that month. I would then take this array and create another pandas Series object out of it.
My problem is that I can't seem to call the specific columns from the current pandas series object I have.
Any help?
Edited to add:
from pandas import Series
from matplotlib import pyplot
import numpy as np
series = Series.from_csv('FCdata.csv', header=0, parse_dates = [0], index_col =0)
print(series)
pyplot.plot(series)
pyplot.show() # this successfully plots the x-axis (date) with the y-axis (dollar amount)
dates = series[0] # this is where I try to call the column, but with no luck
This is what my data looks like in a csv:
Dates Amount
1/1/2015 112
1/2/2015 65
1/3/2015 63
1/4/2015 125
1/5/2015 135
1/6/2015 56
1/7/2015 55
1/12/2015 84
1/27/2015 69
1/28/2015 133
1/29/2015 52
1/30/2015 91
2/2/2015 144
2/3/2015 114
2/4/2015 59
2/5/2015 95
2/6/2015 72
2/9/2015 73
2/10/2015 119
2/11/2015 133
2/12/2015 128
2/13/2015 141
2/17/2015 105
2/18/2015 107
2/19/2015 81
2/20/2015 52
2/23/2015 135
2/24/2015 65
2/25/2015 58
2/26/2015 144
2/27/2015 102
3/2/2015 95
3/3/2015 98
You are reading the CSV file into a Series. A Series is a one-dimensional object - there are no columns associated with it. You see the index of that Series (dates) and probably think that's another column but it's not.
You have two alternatives: you can convert it to a DataFrame (by calling either reset_index() or to_frame()) or use it as a Series.
series.resample('M').sum()
Out:
Dates
2015-01-31 1040
2015-02-28 1927
2015-03-31 193
Freq: M, Name: Amount, dtype: int64
Since you already have an index formatted as date, grouping by month with resample is very straightforward so I'd suggest keeping it as a Series.
However, you can always convert it to a DataFrame with:
df = series.to_frame('Value')
Now, you can use df['Value'] to select that single column. Resampling can be done on both the DataFrame and the Series:
df.resample('M').sum()
Out:
Value
Dates
2015-01-31 1040
2015-02-28 1927
2015-03-31 193
And you can access the index if you want to use that in plotting:
series.index # df.index would return the same
Out:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-12',
'2015-01-27', '2015-01-28', '2015-01-29', '2015-01-30',
'2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',
'2015-02-06', '2015-02-09', '2015-02-10', '2015-02-11',
'2015-02-12', '2015-02-13', '2015-02-17', '2015-02-18',
'2015-02-19', '2015-02-20', '2015-02-23', '2015-02-24',
'2015-02-25', '2015-02-26', '2015-02-27', '2015-03-02',
'2015-03-03'],
dtype='datetime64[ns]', name='Dates', freq=None)
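Note: For basic time-series charts, you can use pandas' plotting tools.
df.plot() produces a line chart of the values against the date index. [figure omitted]
And df.resample('M').sum().plot() produces a line chart of the monthly totals. [figure omitted]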
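For readers on a pandas version where Series.from_csv is no longer available, the same monthly totals can be sketched with read_csv. The comma-separated layout and the shortened sample below are assumptions; grouping by the index's monthly period is swapped in for resample so that the example does not depend on a particular frequency alias.

```python
import io
import pandas as pd

# A few rows from the question's data, assumed to be comma-separated
csv_text = """Dates,Amount
1/1/2015,112
1/2/2015,65
1/30/2015,91
2/2/2015,144
2/27/2015,102
3/2/2015,95
3/3/2015,98
"""

# Parse the dates and use them as the index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Dates"], index_col="Dates")

amount = df["Amount"]  # select the single column as a Series

# Group by calendar month of the index and add up the amounts
monthly = amount.groupby(amount.index.to_period("M")).sum()
print(monthly)
```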

How to plot data from csv for specific date and time using matplotlib?

I have written a python program to get data from csv using pandas and plot the data using matplotlib. My code is below with result:
import pandas as pd
import datetime
import csv
import matplotlib.pyplot as plt
headers = ['Sensor Value','Date','Time']
df = pd.read_csv(r'C:\Users\Lala Rushan\Downloads\DataLog.CSV', parse_dates={"Datetime": [1, 2]}, names=headers)  # raw string keeps the backslashes literal
#pd.to_datetime(df['Date'] + ' ' + df['Time'])
#df.apply(lambda r : pd.datetime.combine(r['Date'],r['Time']),)
print (df)
#f = plt.figure(figsize=(10, 10))
df.plot(x='Datetime', y='Sensor Value')
plt.title('Title here!', color='black')
plt.tight_layout()
plt.show()
Now, as you can see, the x-axis looks horrible. How can I plot the x-axis for a single date and time interval so that the labels do not overlap each other? I have stored both date and time as one column in my DataFrame.
My Dataframe looks like this:
Datetime Sensor Value
0 2017/02/17 19:06:17.188 2
1 2017/02/17 19:06:22.360 72
2 2017/02/17 19:06:27.348 72
3 2017/02/17 19:06:32.482 72
4 2017/02/17 19:06:37.515 74
5 2017/02/17 19:06:42.580 70
Hacky way
Try this:
import pylab as pl
pl.xticks(rotation = 90)
It will rotate the labels by 90 degrees, thus eliminating overlap.
Cleaner way
Use fig.autofmt_xdate(), which lets matplotlib rotate and align the date labels for you.
Pandas way
Use to_datetime() and set_index with DataFrame.plot():
df.Datetime=pd.to_datetime(df.Datetime)
df = df.set_index('Datetime')  # set_index returns a new DataFrame, so assign it back
df['Sensor Value'].plot()
pandas will then take care of plotting the dates nicely for you. [figure omitted]
For reference, my DataFrame looks like this:
Datetime Sensor Value
0 2017/02/17 19:06:17.188 2
1 2017/02/17 19:06:22.360 72
2 2017/02/17 19:06:27.348 72
3 2017/02/17 19:06:32.482 72
4 2017/02/17 19:06:37.515 74
5 2017/02/17 19:06:42.580 70
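The "cleaner" and "pandas" approaches can be combined. Here is a hedged sketch using the six rows shown above; the explicit format string and the Agg backend (chosen only so the code runs without a display) are assumptions, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Datetime": [
        "2017/02/17 19:06:17.188", "2017/02/17 19:06:22.360",
        "2017/02/17 19:06:27.348", "2017/02/17 19:06:32.482",
        "2017/02/17 19:06:37.515", "2017/02/17 19:06:42.580",
    ],
    "Sensor Value": [2, 72, 72, 72, 74, 70],
})

# Convert the strings to real timestamps and make them the index
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%Y/%m/%d %H:%M:%S.%f")
df = df.set_index("Datetime")

fig, ax = plt.subplots()
ax.plot(df.index, df["Sensor Value"], marker="o")
ax.set_ylabel("Sensor Value")
fig.autofmt_xdate()  # rotate and right-align the date labels
```

Calling plt.show() afterwards (with an interactive backend) displays the figure.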

Interpolate (upsample) non-equispaced timeseries into equispaced with Pandas version 18.0rc1

I want to interpolate (upscale) nonequispaced time-series to obtain equispaced time-series.
Currently I am doing it in following way:
1. Take the original timeseries.
2. Create a new timeseries with NaN values at 30-second intervals (using resample('30S').asfreq()).
3. Concatenate the original timeseries and the new timeseries.
4. Sort the timeseries to restore the order of times (I do not like this: sorting has complexity O(n log n)).
5. Interpolate.
6. Remove the original points from the timeseries.
Is there a simpler way with Pandas version 18.0rc1? In MATLAB, for example, you have the original timeseries and pass the new times as a parameter to the interpolate() function to receive values at the desired times.
Note that the times of the original timeseries might not be a subset of the times of the desired timeseries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
values = [271238, 329285, 50, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:04',
'2015-01-04 08:37:05',
'2015-01-04 08:41:07',
'2015-01-04 08:43:05',
'2015-01-04 08:49:05'])
ts = pd.Series(values, index=timestamps)
ts
ts[ts==-1] = np.nan
newFreq=ts.resample('60S').asfreq()
new=pd.concat([ts,newFreq]).sort_index()
new=new.interpolate(method='time')
ts.plot(marker='o')
new.plot(marker='+',markersize=15)
new[newFreq.index].plot(marker='.')
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['original values (nonequispaced)', 'original + interpolated at new frequency (nonequispaced)', 'interpolated values without original values (equispaced!)']
plt.legend(lines, labels, loc='best')
plt.show()
There have been several requests for a simpler way to interpolate at desired values (I'll edit in links later, but search the issue tracker for interpolate issues). So in the future there will be an easier way.
For now you can write the option a bit more cleanly as
In [9]: (ts.reindex(ts.index.union(newFreq.index))
.interpolate(method='time')
.loc[newFreq.index])
Out[9]:
2015-01-04 08:29:00 NaN
2015-01-04 08:30:00 277996.070686
2015-01-04 08:31:00 285236.860707
2015-01-04 08:32:00 292477.650728
2015-01-04 08:33:00 299718.440748
...
2015-01-04 08:45:00 261362.402778
2015-01-04 08:46:00 261937.569444
2015-01-04 08:47:00 262512.736111
2015-01-04 08:48:00 263087.902778
2015-01-04 08:49:00 263663.069444
Freq: 60S, dtype: float64
This still involves all the steps you listed above, but taking the union of the indexes is cleaner than concatenating and dropping.
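The union, interpolate and select steps can be wrapped in a small helper that takes the desired times as an argument, which is close to the MATLAB-style call the question asks about. The function name interpolate_at is hypothetical and this is only a sketch of the approach above, not a pandas API.

```python
import pandas as pd

def interpolate_at(ts, new_index):
    """Time-weighted linear interpolation of ts at the times in new_index."""
    combined = ts.index.union(new_index)  # sorted union of old and new times
    return ts.reindex(combined).interpolate(method="time").loc[new_index]

timestamps = pd.to_datetime([
    "2015-01-04 08:29:04", "2015-01-04 08:37:05", "2015-01-04 08:41:07",
    "2015-01-04 08:43:05", "2015-01-04 08:49:05",
])
ts = pd.Series([271238, 329285, 50, 260260, 263711],
               index=timestamps, dtype="float64")

# Equispaced target times, one per minute
grid = pd.date_range("2015-01-04 08:29:00", "2015-01-04 08:49:00",
                     freq=pd.Timedelta(minutes=1))

regular = interpolate_at(ts, grid)
print(regular.head())
```

The first grid point lies before the first observation, so it stays NaN, as in the output above.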
