Managing hourly data into monthly data for the whole year Pandas

Managing hourly data into monthly data for the whole year Pandas - python

Background
The data links which I uploaded is the time series data of the con for the whole year at a monitoring station. The format of the data is shown like this:
My target
To investigate the temporal pattern of the samples, I want to plot the variation of the monthly sample.
Like the figure below which I downloaded from plot.ly. Each box represent the daily average sample of the raw data. And the monthly average values are outlined by the lines.
With groupby function or pd.pivot function, I can get the subset of certain month or daily data easily.
But I found out that it's hard to generate a bunch of dataframes. Each one should contains the daily average data for certain month.
By pre-defining 12 empty dataframes, I can generate 12 dataframes which feed my need. But is there any neat way to divide the original dataframe and then generate multliple dataframes by user-defined conditions.
EDIT
Inspired by the answer of #alexis. I tried to achieve my target with these code. And it works for me.
## PM is the original dataset with date, hour, and values.
position = np.arange(1,13,1)
monthDict = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
pm['label'] = np.nan
for i in range(0,len(pm),1):
pm['label'].iloc[i] = monthDict.get(int(pm['date'].str[4:6].iloc[i]))
## Create an empty dataframe for containing the daily mean value.
df = pd.DataFrame(np.nan, index=np.arange(0,31,1), columns=['A'])
for i,t in enumerate(pm.label.unique()):
df[str(t)] = np.nan
df = df.drop(['A'],1)
mean_ = []
for i in range(0,len(pm.label.unique()),1):
month_data = pm.groupby(['label']).get_group(pm.label.unique()[i]).groupby(pm['date'].str[6:8])['pm25'].mean()
mean_.append(month_data.mean())
for j in range(0,len(month_data),1):
df[pm.label.unique()[i]].iloc[j] = month_data[j]
#### PLOT
fig = plt.figure(figsize=(12,5))
ax = plt.subplot()
bp = ax.boxplot( df.dropna().values, patch_artist=True, showfliers=False)
mo_me = plt.plot(position,mean_, marker = 'o', color ='k',markersize =6, label = 'Monthly Mean', lw = 1.5,zorder =3)
cs = ['#9BC4E1','k']
for box in bp['boxes']:
box.set(color = 'b', alpha = 1)
box.set(facecolor = cs[0], alpha = 1)
for whisker in bp['whiskers']:
whisker.set(color=cs[1], linewidth=1,linestyle = '-')
for cap in bp['caps']:
cap.set(color=cs[1], linewidth=1)
for median in bp['medians']:
median.set(color=cs[1], linewidth=1.5)
ax.set_xticklabels(pm.label.unique(), fontsize = 14)
# ax.set_yticklabels(ax.get_yticks(), fontsize = 12)
for label in ax.yaxis.get_ticklabels()[::2]:
label.set_visible(False)
for tick in ax.yaxis.get_major_ticks():
tick.label.set_fontsize(14)
plt.ylabel('Concentration', fontsize = 16, labelpad =14)
plt.xlabel('Month', fontsize = 16, labelpad =14)
plt.legend(fontsize = 14, frameon = False)
ax.set_ylim(0.0, 178)
plt.grid()
plt.show()
And this is my output figure.
Any suggestion about my code on data management or visualization would be appreciate!

Don't generate 12 dataframes. Instead of splitting your data into multiple similar variables, add a column that indicates which group each row should belong to. This is standard practice (with good reason) for database tables, dataframes, etc.
Use groupby on your dataset to group the data by month, then use apply() on the resulting DataFrameGroupBy object to restrict whatever analysis you want (e.g., the average to each group. This will also make it easy to plot the monthly results together.
You don't provide any code, so it's hard to be more specific than that. Show how you group your data by month and what you want to do to the monthly dataframes, and I'll show you how to restrict it to each month via the groupby object.

Related

How to add a condition to legend labelling to show certain entries only

is it possible to create plot labels based on conditions? As one can see below, I am creating plots within a loop. The issue is that not all locations of the dataset contain data to plot as time series. Hence I want to label only those timesteps which contain data. I have 8 locations of which only 4 contain data. Currently, a legend for all locations is plotted:
I would like to do something like:
lineplot = plt.plot(timesteps, skin_celsius, label='%s. storm location' % i with condition if time series contains no NaN data)
is something like this possible?
This is the whole code for the plot
fig2 = plt.figure(figsize=(20, 20), dpi=300)
# change this loop to extract time series data for day -10 + 10days around stormtrack data.
for i, dummy in enumerate(lati):
dsloc = SSTskin_subfile_masked.sel(lon=loni[i], lat=lati[i], method='nearest')
dstime = SSTskin_subfile_masked.sel(time=timei[i], lon=loni[i], lat=lati[i], method='nearest') #find point data within timeseries data
skin_celsius = (dsloc['analysed_sst']) - 273.15
timesteps = dsloc.time.values
timestep = dstime.time.values
timevalue = ((dstime['analysed_sst']).values) - 273.15
#Here I want to add a condition to the labelling.
lineplot = plt.plot(timesteps, skin_celsius, label='%s. storm location' % I)
dotplot = plt.plot(timestep, timevalue, "or")
plt.title('Skin SST over time at storm track locations', fontsize = 20 )
plt.xlabel('Date', fontsize = 16)
plt.ylabel('Skin SST in $^{\circ}C$', fontsize = 16)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend()

Stacked Area Chart in Python

I'm working on an assignment from school, and have run into a snag when it comes to my stacked area chart.
The data is fairly simple: 4 columns that look similar to this:
Series id
Year
Period
Value
LNS140000
1948
M01
3.4
I'm trying to create a stacked area chart using Year as my x and Value as my y and breaking it up over Period.
#Stacked area chart still using unemployment data
x = d.Year
y = d.Value
plt.stackplot(x, y, labels = d['Period'])
plt.legend(d['Period'], loc = 'upper left')
plt.show()enter code here`
However, when I do it like this it only picks up M01 and there are M01-M12. Any thoughts on how I can make this work?

You need to preprocess your data a little before passing them to the stackplot function. I took a look at this link to work on an example that could be suitable for your case.
Since I've seen one row of your data, I add some random values to the dataset.
import pandas as pd
import matplotlib.pyplot as plt
dd=[[1948,'M01',3.4],[1948,'M02',2.5],[1948,'M03',1.6],
[1949,'M01',4.3],[1949,'M02',6.7],[1949,'M03',7.8]]
d=pd.DataFrame(dd,columns=['Year','Period','Value'])
years=d.Year.unique()
periods=d.Period.unique()
#Now group them per period, but in year sequence
d.sort_values(by='Year',inplace=True) # to ensure entire dataset is ordered
pds=[]
for p in periods:
pds.append(d[d.Period==p]['Value'].values)
plt.stackplot(years,pds,labels=periods)
plt.legend(loc='upper left')
plt.show()
Is that what you want?

So I was able to use Seaborn to help out. First I did a pivot table
df = d.pivot(index = 'Year',
columns = 'Period',
values = 'Value')
df
Then I set up seaborn
plt.style.use('seaborn')
sns.set_style("white")
sns.set_theme(style = "ticks")
df.plot.area(figsize = (20,9))
plt.title("Unemployment by Year and Month\n", fontsize = 22, loc = 'left')
plt.ylabel("Values", fontsize = 22)
plt.xlabel("Year", fontsize = 22)

It seems to me that the problem you are having relates to the formatting of the data. Look how the values are formatted in this matplotlib example. I would try to groupby the data by period, or pivot it in the correct format, and then graphing again.

How to make a line plot from a dataframe with multiple categorical columns in matplotlib

I want to make line chart for the different categories where one is a different country, and one is a different country for weekly based line charts. Initially, I was able to draft line plots using seaborn but it is not quite handy like setting its label, legend, color palette and so on. I am wondering is there any way to easily reshape this data with multiple categorical variables and render line charts. In initial attempt, I tried seaborn.relplot but it is not easy to tune its parameter and hard to customize the resulted plot. Can anyone point me to any efficient way to reshape dataframe with multiple categorical columns and render a clear line chart? Any thoughts?
reproducible data & my attempt:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
dff = pd.read_csv(url, parse_dates=['weekly'])
dff.drop('Unnamed: 0', axis=1, inplace=True)
df2_bf = dff.groupby(['destination', 'weekly'])['FCF_Beef'].sum().unstack()
df2_bf = df2_bf.fillna(0)
mm = df2_bf.T
mm.columns.name = None
mm = mm[~(mm.isna().sum(1)/mm.shape[1]).gt(0.9)].fillna(0)
#Total sum per column:
mm.loc['Total',:]= mm.sum(axis=0)
mm1 = mm.T
mm1 = mm1.nlargest(6, columns=['Total'])
mm1.drop('Total', axis=1, inplace=True)
mm2 = mm1.T
mm2.reset_index(inplace=True)
mm2['weekly'] = pd.to_datetime(mm2['weekly'])
mm2['year'] = mm2['weekly'].dt.year
mm2['week'] = mm2['weekly'].dt.isocalendar().week
df = mm2.melt(id_vars=['weekly','week','year'], var_name='country')
df_ = df.groupby(['country', 'year', 'week'], as_index=False)['value'].sum()
sns.relplot(data=df_, x='week', y='value', hue='year', row='country', kind='line', height=6, aspect=2, facet_kws={'sharey': False, 'sharex': False}, sizes=(20, 10))
current plot
this is one of current plot that I made with seaborn.relplot
structure of plot is okay for me, but in seaborn.replot, it is hard to tune parameter and it is as flexible as using matplotlib. Also, I realized that the way of aggregating my data is not very efficient. I think there might be a shortcut to make the above code snippet more efficient like:
plt_data = []
for i in dff.loc[:, ['FCF_Beef','FCF_Beef']]:
...
but doing this way I faced a couple of issues to make the right plot. Can anyone point me out how to make this simple and efficient in order to make the expected line chart with matplotlib? Does anyone know any better way of doing this? Any idea? Thanks
desired output
In my desired plot, first I need to iterate list of countries, where each country has one subplot, in each subplot, x-axis shows 52 weeks and y-axis shows weeklyExport amount of different years for each country. Here is draft plot that I made with seaborn.relplot.
note that, I don't like the output from seaborn.relplot, so I am wondering how can I make above attempt more efficient with matplotlib attempt. Any idea?

As requested by the OP, following is an iterative way to plot the data.
The following example plots each year, for a given 'destination' in a single figure
This is similar to the answer for this question.
import pandas as pd
import matplotlib.pyplot as plt
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
df = pd.read_csv(url, parse_dates=['weekly'], usecols=range(1, 6))
# groupby destination and iterate through for plotting
for g, d in df.groupby(['destination']):
# create the figure
fig, ax = plt.subplots(figsize=(7, 4))
# add lines for specific years
for year in d.weekly.dt.year.unique():
data = d[d.weekly.dt.year == year].copy() # select the data from d, by year
data['week'] = data.weekly.dt.isocalendar().week # create a week column
data.sort_values('weekly', inplace=True)
display(data.head()) # display is for jupyter, if it causes an error, use pring
data.plot(x='week', y='FCF_Beef', ax=ax, label=year)
plt.show()
Single sample plot
If we look at the tail of one of the dataframes, data.weekly.dt.isocalendar().week as putting the last day of the year as week 1, so a line is drawn back to the last data point being placed at week 1.
This function rests on datetime.datetime(2018, 12, 31).isocalendar() and is the expected behavior from the datetime module, as per this closed pandas bug.
Removing the last row with .iloc[:-1, :], is a work around
Alternatively, replace data['week'] = data.weekly.dt.isocalendar().week with data['week'] = data.weekly.dt.strftime('%W').astype('int')
data.iloc[:-1, :].plot(x='week', y='FCF_Beef', ax=ax, label=year)
Updated with all code from OP
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
dff = pd.read_csv(url, parse_dates=['weekly'], usecols=range(1, 6))
df2_bf = dff.groupby(['destination', 'weekly'])['FCF_Beef'].sum().unstack()
df2_bf = df2_bf.fillna(0)
mm = df2_bf.T
mm.columns.name = None
mm = mm[~(mm.isna().sum(1)/mm.shape[1]).gt(0.9)].fillna(0)
#Total sum per column:
mm.loc['Total',:]= mm.sum(axis=0)
mm1 = mm.T
mm1 = mm1.nlargest(6, columns=['Total'])
mm1.drop('Total', axis=1, inplace=True)
mm2 = mm1.T
mm2.reset_index(inplace=True)
mm2['weekly'] = pd.to_datetime(mm2['weekly'])
mm2['year'] = mm2['weekly'].dt.year
mm2['week'] = mm2['weekly'].dt.strftime('%W').astype('int')
df = mm2.melt(id_vars=['weekly','week','year'], var_name='country')
# groupby destination and iterate through for plotting
for g, d in df.groupby(['country']):
# create the figure
fig, ax = plt.subplots(figsize=(7, 4))
# add lines for specific years
for year in d.weekly.dt.year.unique():
data = d[d.weekly.dt.year == year].copy() # select the data from d, by year
data.sort_values('weekly', inplace=True)
display(data.head()) # display is for jupyter, if it causes an error, use pring
data.plot(x='week', y='value', ax=ax, label=year, title=g)
plt.show()

Monthly shaded error/std plot in matplotlib from daily timeseries data

I have a pandas dataframe with 5 years daily time series data. I want to make a monthly plot from whole datasets so that the plot should shows variation (std or something else) within monthly data. Simillar figure I tried to create but did not found a way to do that:
for example, I have a sudo daily precipitation data:
date = pd.to_datetime("1st of Dec, 1999")
dates = date+pd.to_timedelta(np.arange(1900), 'D')
ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum()
df = pd.DataFrame({'pre':ppt},index=dates)
Manually I can do it like:
one = df['pre']['1999-12-01':'2000-11-29'].values
two = df['pre']['2000-12-01':'2001-11-30'].values
three = df['pre']['2001-12-01':'2002-11-30'].values
four = df['pre']['2002-12-01':'2003-11-30'].values
five = df['pre']['2003-12-01':'2004-11-29'].values
df = pd.DataFrame({'2000':one,'2001':two,'2002':three,'2003':four,'2004':five})
std = df.std(axis=1)
lw = df.mean(axis=1)-std
up = df.mean(axis=1)+std
plt.fill_between(np.arange(365), up, lw, alpha=.4)
I am looking for the more pythonic way to do that instead of doing it manually!
Any helps will be highly appreciated

If I'm understanding you correctly you'd like to plot your daily observations against a monthly periodic mean +/- 1 standard deviation. And that's what you get in my screenshot below. Nevermind the lackluster design and color choice. We'll get to that if this is something you can use. And please notice that I've replaced your ppt = np.random.rand(1900) with ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum() just to make the data look a bit more like your screenshot.
Here I've aggregated the daily data by month, and retrieved mean and standard deviation for each month. Then I've merged that data with the original dataframe so that you're able to plot both the source and the grouped data like this:
# imports
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates
import numpy as np
# Data that matches your setup, but with a random
# seed to make it reproducible
np.random.seed(42)
date = pd.to_datetime("1st of Dec, 1999")
dates = date+pd.to_timedelta(np.arange(1900), 'D')
#ppt = np.random.rand(1900)
ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum()
df = pd.DataFrame({'ppt':ppt},index=dates)
# A subset
df = df.tail(200)
# Add a yearmonth column
df['YearMonth'] = df.index.map(lambda x: 100*x.year + x.month)
# Create aggregated dataframe
df2 = df.groupby('YearMonth').agg(['mean', 'std']).reset_index()
df2.columns = ['YearMonth', 'mean', 'std']
# Merge original data and aggregated data
df3 = pd.merge(df,df2,how='left',on=['YearMonth'])
df3 = df3.set_index(df.index)
df3 = df3[['ppt', 'mean', 'std']]
# Function to make your plot
def monthplot():
fig, ax = plt.subplots(1)
ax.set_facecolor('white')
# Define upper and lower bounds for shaded variation
lower_bound = df3['mean'] + df3['std']*-1
upper_bound = df3['mean'] + df3['std']
fig, ax = plt.subplots(1)
ax.set_facecolor('white')
# Source data and mean
ax.plot(df3.index,df3['mean'], lw=0.5, color = 'red')
ax.plot(df3.index, df3['ppt'], lw=0.1, color = 'blue')
# Variation and shaded area
ax.fill_between(df3.index, lower_bound, upper_bound, facecolor='grey', alpha=0.5)
fig = ax.get_figure()
# Assign months to X axis
locator = mdates.MonthLocator() # every month
# Specify the format - %b gives us Jan, Feb...
fmt = mdates.DateFormatter('%b')
X = plt.gca().xaxis
X.set_major_locator(locator)
X.set_major_formatter(fmt)
fig.show()
monthplot()
Check out this post for more on axis formatting and this post on how to add a YearMonth column.

In your example, you have a few mistakes, but I think it isn't important.
Do you want all years to be on the same graphic (like in your example)? If you do, this may help you:
df['month'] = df.index.strftime("%m-%d")
df['year'] = df.index.year
df.set_index(['month']).drop(['year'],1).plot()

Box plot for continuous data in Python

I have a csv file with 2 columns:
col1- Timestamp data(yyyy-mm-dd hh:mm:ss.ms (8 months data))
col2 : Heat data (continuous variable) .
Since there are almost 50k record, I would like to partition the col1(timestamp col) into months or weeks and then apply box plot on the heat data w.r.t timestamp.
I tried in R,it takes a long time. Need help to do in Python. I think I need to use seaborn.boxplot.
Please guide.

Group by Frequency then plot groups
First Read your csv data into a Pandas DataFrame
import numpy as np
import Pandas as pd
from matplotlib import pyplot as plt
# assumes NO header line in csv
df = pd.read_csv('\file\path', names=['time','temp'], parse_dates=[0])
I will use some fake data, 30 days of hourly samples.
heat = np.random.random(24*30) * 100
dates = pd.date_range('1/1/2011', periods=24*30, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
Set the timestamps as the DataFrame's index
df = df.set_index('time')
Now group by by the period you want, seven days for this example
gb = df.groupby(pd.Grouper(freq='7D'))
Now you can plot each group separately
for g, week in gb2:
#week.plot()
week.boxplot()
plt.title(f'Week Of {g.date()}')
plt.show()
plt.close()
And... I didn't realize you could do this but it is pretty cool
ax = gb.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=30)
plt.show()
plt.close()
heat = np.random.random(24*300) * 100
dates = pd.date_range('1/1/2011', periods=24*300, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
df = df.set_index('time')
To partition the data in five time periods then get weekly boxplots of each:
Determine the total timespan; divide by five; create a frequency alias; then groupby
dt = df.index[-1] - df.index[0]
dt = dt/5
alias = f'{dt.total_seconds()}S'
gb = df.groupby(pd.Grouper(freq=alias))
Each group is a DataFrame so iterate over the groups; create weekly groups from each and boxplot them.
for g,d_frame in gb:
gb_tmp = d_frame.groupby(pd.Grouper(freq='7D'))
ax = gb_tmp.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
plt.show()
plt.close()
There might be a better way to do this, if so I'll post it or maybe someone will fill free to edit this. Looks like this could lead to the last group not having a full set of data. ...
If you know that your data is periodic you can just use slices to split it up.
n = len(df) // 5
for tmp_df in (df[i:i+n] for i in range(0, len(df), n)):
gb_tmp = tmp_df.groupby(pd.Grouper(freq='7D'))
ax = gb_tmp.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
plt.show()
plt.close()
Frequency aliases
pandas.read_csv()
pandas.Grouper()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Managing hourly data into monthly data for the whole year Pandas - python

Related

How to add a condition to legend labelling to show certain entries only

Stacked Area Chart in Python

How to make a line plot from a dataframe with multiple categorical columns in matplotlib

Monthly shaded error/std plot in matplotlib from daily timeseries data

Box plot for continuous data in Python

Categories

Resources