I have a CSV file with 2 columns:
col1: timestamp data (yyyy-mm-dd hh:mm:ss.ms, covering 8 months)
col2: heat data (a continuous variable).
Since there are almost 50k records, I would like to partition col1 (the timestamp column) into months or weeks and then box-plot the heat data against those time bins.
I tried this in R, but it takes a long time. I need help doing it in Python; I think I need to use seaborn.boxplot.
Please guide.
Group by Frequency then plot groups
First, read your CSV data into a pandas DataFrame.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# assumes NO header line in csv
df = pd.read_csv(r'\file\path', names=['time','temp'], parse_dates=[0])  # raw string avoids backslash escapes
I will use some fake data, 30 days of hourly samples.
heat = np.random.random(24*30) * 100
dates = pd.date_range('1/1/2011', periods=24*30, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
Set the timestamps as the DataFrame's index
df = df.set_index('time')
Now group by the period you want, seven days for this example:
gb = df.groupby(pd.Grouper(freq='7D'))
Now you can plot each group separately
for g, week in gb:
    # week.plot()
    week.boxplot()
    plt.title(f'Week Of {g.date()}')
    plt.show()
    plt.close()
And... I didn't realize you could do this, but it is pretty cool:
ax = gb.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=30)
plt.show()
plt.close()
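Since you mentioned seaborn.boxplot: here is a minimal sketch of the same monthly grouping done with seaborn (it assumes the df from above; the month column is one I add just for the x axis):
import seaborn as sns

tmp = df.reset_index()
tmp['month'] = tmp['time'].dt.to_period('M').astype(str)  # e.g. '2011-01'
sns.boxplot(data=tmp, x='month', y='temp')
plt.xticks(rotation=30)
plt.show()
plt.close()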
Now a longer fake dataset, 300 days of hourly samples:
heat = np.random.random(24*300) * 100
dates = pd.date_range('1/1/2011', periods=24*300, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
df = df.set_index('time')
To partition the data in five time periods then get weekly boxplots of each:
Determine the total timespan; divide by five; create a frequency alias; then groupby
dt = df.index[-1] - df.index[0]
dt = dt/5
alias = f'{int(dt.total_seconds())}S'  # whole seconds for the frequency alias
gb = df.groupby(pd.Grouper(freq=alias))
Each group is a DataFrame so iterate over the groups; create weekly groups from each and boxplot them.
for g, d_frame in gb:
    gb_tmp = d_frame.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(), rotation=90)
    plt.show()
    plt.close()
There might be a better way to do this; if so, I'll post it, or maybe someone will feel free to edit this. Looks like this could lead to the last group not having a full set of data. ...
If you know that your data is periodic you can just use slices to split it up.
n = len(df) // 5
for tmp_df in (df[i:i+n] for i in range(0, len(df), n)):
    gb_tmp = tmp_df.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(), rotation=90)
    plt.show()
    plt.close()
Frequency aliases
pandas.read_csv()
pandas.Grouper()
I want to overlay some graphs from CSV data (two datasets).
The graph I got from my dataset is shown below.
Is there any way to plot these datasets aligned on specific points? I would like to overlay the plots using the "big drop" as an anchor, to compare them more easily.
The code used:
import pandas as pd
import matplotlib.pyplot as plt
# Read the data
data1 = pd.read_csv('data1.csv', delimiter=";", decimal=",")
data2 = pd.read_csv('data2.csv', delimiter=";", decimal=",")
data3 = pd.read_csv('data3.csv', delimiter=";", decimal=",")
data4 = pd.read_csv('data4.csv', delimiter=";", decimal=",")
# Plot the data
plt.plot(data1['Zeit'], data1['Kanal A'])
plt.plot(data2['Zeit'], data2['Kanal A'])
plt.plot(data3['Zeit'], data3['Kanal A'])
plt.plot(data4['Zeit'], data4['Kanal A'])
plt.show()
plt.close()
I would like to share some data here:
Link to data
Part 1: Anchor times
A simple way is to find the times of interest (lowest point) in each frame, then plot each series with x=t - t_peak instead of x=t. Two ways come to mind to find the desired anchor points:
Simply using the global minimum (in your plots, that would work fine), or
Using the most prominent local minimum, either from first principles, or using scipy's find_peaks().
But first of all, let us attempt to build a reproducible example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
def make_sample(t_peak, tmax_approx=17.5, n=100):
    # unevenly spaced sample times
    t = np.random.uniform(0, 2*tmax_approx/n, n).cumsum()
    y = -1 / (0.1 + 2 * np.abs(t - t_peak))
    trend = 4 * np.random.uniform(-1, 1) / n
    level = np.random.uniform(10, 12)
    y += np.random.normal(trend, 1/n, n).cumsum() + level
    return pd.DataFrame({'t': t, 'y': y})
poi = [2, 2.48, 2.6, 2.1]
np.random.seed(0)
frames = [make_sample(t_peak) for t_peak in poi]
plt.rcParams['figure.figsize'] = (6,2)
fig, ax = plt.subplots()
for df in frames:
    ax.plot(*df.values.T)
In this case, we made the problem maximally inconvenient by giving each time series its own, independent, unevenly distributed time sampling.
Now, retrieving the "maximum drop" by global minimum:
peaks = [df.loc[df['y'].idxmin(), 't'] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
fig, ax = plt.subplots()
for t_peak, df in zip(peaks, frames):
    ax.plot(df['t'] - t_peak, df['y'])
But imagine a case where the global minimum is not suitable. For example, add a large sine wave to each series:
frames = [df.assign(y=df['y'] + 5 * np.sin(df['t'])) for df in frames]
# just plotting the first series
df = frames[0]
plt.plot(*df.values.T)
Clearly, there are several local minima, and the one we want ("sharpest drop") is not the global one.
A simple way to find the desired sharpest drop time is by looking at the difference from each point to its two neighbors:
def arg_steepest_min(v):
    # simply find the minimum that is most separated from its two neighbors
    diff = np.diff(v)
    i = np.argmin(diff[:-1] - diff[1:]) + 1
    return i
peaks = [df['t'].iloc[arg_steepest_min(df['y'])] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
# just plotting the first curve and the peak found
df = frames[0]
plt.plot(*df.values.T)
plt.plot(*df.iloc[arg_steepest_min(df['y'])].T, 'x')
There are more complex cases where you want to bring in the full power of find_peaks(). Here is an example that uses the most prominent minimum, with a window of a certain number of samples as the neighborhood:
from scipy.signal import find_peaks, peak_prominences
def arg_most_prominent_min(v, prominence=1, wlen=10):
    peaks, details = find_peaks(-v, prominence=prominence, wlen=wlen)
    i = peaks[np.argmax(details['prominences'])]
    return i
peaks = [df['t'].iloc[arg_most_prominent_min(df['y'])] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
In this case, the peaks found by both methods are the same. Aligning the curves gives:
fig, ax = plt.subplots()
for t_peak, df in zip(peaks, frames):
    ax.plot(df['t'] - t_peak, df['y'])
Part 2: aligning the time series for numeric operations
Having found the anchor times and plotted the time series by shifting the x-axis accordingly, suppose now that we want to align all the time series, for example to somehow compare them to one another (e.g.: differences, correlation, etc.). In this example we made up, the time samples are not equidistant and all series have their own sampling.
We can use resample() to achieve our goal. Let us convert the frames into actual time series, transforming the column t (assumed to be in seconds) into a DatetimeIndex, after shifting the time using the previously found t_peak and an arbitrary "0" date:
frames = [
    pd.Series(
        df['y'].values,
        index=pd.Timestamp(0) + (df['t'] - t_peak) * pd.Timedelta(1, 's'))
    for t_peak, df in zip(peaks, frames)]
>>> frames[0]
t
1969-12-31 23:59:58.171107267 11.244308
1969-12-31 23:59:58.421423545 12.387291
1969-12-31 23:59:58.632390727 13.268186
1969-12-31 23:59:58.823099841 13.942224
1969-12-31 23:59:58.971379021 14.359900
...
1970-01-01 00:00:14.022717327 10.422229
1970-01-01 00:00:14.227996854 9.504693
1970-01-01 00:00:14.235034496 9.489011
1970-01-01 00:00:14.525163506 8.388377
1970-01-01 00:00:14.526806922 8.383366
Length: 100, dtype: float64
At this point, the sampling is still uneven, so we use resample to get a fixed frequency. One strategy is to oversample and interpolate:
frames = [df.resample('100ms').mean().interpolate() for df in frames]
for df in frames:
    df.plot()
At this point, we can compare the Series. Here are the pairwise differences and correlations:
fig, axes = plt.subplots(nrows=len(frames), ncols=len(frames), figsize=(10, 5))
for axrow, a in zip(axes, frames):
    for ax, b in zip(axrow, frames):
        (b - a).plot(ax=ax)
        ax.set_title(fr'$\rho = {b.corr(a):.3f}$')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
plt.tight_layout()
I'm working on an assignment from school, and have run into a snag when it comes to my stacked area chart.
The data is fairly simple: 4 columns that look similar to this:
Series id    Year  Period  Value
LNS140000    1948  M01     3.4
I'm trying to create a stacked area chart using Year as my x and Value as my y and breaking it up over Period.
# Stacked area chart still using unemployment data
x = d.Year
y = d.Value
plt.stackplot(x, y, labels=d['Period'])
plt.legend(d['Period'], loc='upper left')
plt.show()
However, when I do it like this it only picks up M01 and there are M01-M12. Any thoughts on how I can make this work?
You need to preprocess your data a little before passing them to the stackplot function. I took a look at this link to work on an example that could be suitable for your case.
Since I've only seen one row of your data, I added some made-up values to the dataset.
import pandas as pd
import matplotlib.pyplot as plt
dd = [[1948,'M01',3.4],[1948,'M02',2.5],[1948,'M03',1.6],
      [1949,'M01',4.3],[1949,'M02',6.7],[1949,'M03',7.8]]
d = pd.DataFrame(dd, columns=['Year','Period','Value'])
years=d.Year.unique()
periods=d.Period.unique()
# Now group them per period, but in year sequence
d.sort_values(by='Year', inplace=True)  # to ensure the entire dataset is ordered
pds = []
for p in periods:
    pds.append(d[d.Period == p]['Value'].values)
plt.stackplot(years,pds,labels=periods)
plt.legend(loc='upper left')
plt.show()
Is that what you want?
So I was able to use Seaborn to help out. First I did a pivot table
df = d.pivot(index='Year', columns='Period', values='Value')
df
Then I set up seaborn
import seaborn as sns

plt.style.use('seaborn')
sns.set_style("white")
sns.set_theme(style = "ticks")
df.plot.area(figsize = (20,9))
plt.title("Unemployment by Year and Month\n", fontsize = 22, loc = 'left')
plt.ylabel("Values", fontsize = 22)
plt.xlabel("Year", fontsize = 22)
It seems to me that the problem you are having relates to the formatting of the data. Look at how the values are formatted in this matplotlib example. I would try to group the data by period, or pivot it into the correct format, and then graph again.
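For instance, a minimal sketch of that pivot approach (assuming the Year/Period/Value columns shown in the question and that d is already loaded):
wide = d.pivot(index='Year', columns='Period', values='Value')
wide = wide.fillna(0)  # stackplot cannot handle missing values
plt.stackplot(wide.index, wide.T.values, labels=wide.columns)
plt.legend(loc='upper left')
plt.show()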
I have a pandas DataFrame with 5 years of daily time series data. I want to make a monthly plot from the whole dataset so that the plot shows the variation (std or something else) within each month's data. I tried to create a similar figure but could not find a way to do it:
For example, here is some pseudo daily precipitation data:
date = pd.to_datetime("1st of Dec, 1999")
dates = date+pd.to_timedelta(np.arange(1900), 'D')
ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum()
df = pd.DataFrame({'pre':ppt},index=dates)
Manually, I can do it like this:
one = df['pre']['1999-12-01':'2000-11-29'].values
two = df['pre']['2000-12-01':'2001-11-30'].values
three = df['pre']['2001-12-01':'2002-11-30'].values
four = df['pre']['2002-12-01':'2003-11-30'].values
five = df['pre']['2003-12-01':'2004-11-29'].values
df = pd.DataFrame({'2000':one,'2001':two,'2002':three,'2003':four,'2004':five})
std = df.std(axis=1)
lw = df.mean(axis=1)-std
up = df.mean(axis=1)+std
plt.fill_between(np.arange(365), up, lw, alpha=.4)
I am looking for a more pythonic way to do this instead of doing it manually!
Any help will be highly appreciated.
If I'm understanding you correctly, you'd like to plot your daily observations against a monthly periodic mean +/- 1 standard deviation. That's what you get in my screenshot below. Never mind the lackluster design and color choices; we'll get to those if this is something you can use. And please notice that I've replaced your ppt = np.random.rand(1900) with ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum() just to make the data look a bit more like your screenshot.
Here I've aggregated the daily data by month, and retrieved mean and standard deviation for each month. Then I've merged that data with the original dataframe so that you're able to plot both the source and the grouped data like this:
# imports
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates
import numpy as np
# Data that matches your setup, but with a random
# seed to make it reproducible
np.random.seed(42)
date = pd.to_datetime("1st of Dec, 1999")
dates = date+pd.to_timedelta(np.arange(1900), 'D')
#ppt = np.random.rand(1900)
ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum()
df = pd.DataFrame({'ppt':ppt},index=dates)
# A subset
df = df.tail(200)
# Add a yearmonth column
df['YearMonth'] = df.index.map(lambda x: 100*x.year + x.month)
# Create aggregated dataframe
df2 = df.groupby('YearMonth').agg(['mean', 'std']).reset_index()
df2.columns = ['YearMonth', 'mean', 'std']
# Merge original data and aggregated data
df3 = pd.merge(df,df2,how='left',on=['YearMonth'])
df3 = df3.set_index(df.index)
df3 = df3[['ppt', 'mean', 'std']]
# Function to make your plot
def monthplot():
    fig, ax = plt.subplots(1)
    ax.set_facecolor('white')

    # Define upper and lower bounds for the shaded variation
    lower_bound = df3['mean'] - df3['std']
    upper_bound = df3['mean'] + df3['std']

    # Source data and mean
    ax.plot(df3.index, df3['mean'], lw=0.5, color='red')
    ax.plot(df3.index, df3['ppt'], lw=0.1, color='blue')

    # Variation as a shaded area
    ax.fill_between(df3.index, lower_bound, upper_bound, facecolor='grey', alpha=0.5)

    # Assign months to the x axis
    locator = mdates.MonthLocator()  # every month
    fmt = mdates.DateFormatter('%b')  # %b gives us Jan, Feb...
    ax.xaxis.set_major_locator(locator)
    ax.xaxis.set_major_formatter(fmt)
    fig.show()
monthplot()
Check out this post for more on axis formatting and this post on how to add a YearMonth column.
There are a few mistakes in your example, but I don't think they matter here.
Do you want all years to be on the same graphic (like in your example)? If you do, this may help you:
df['month'] = df.index.strftime("%m-%d")
df['year'] = df.index.year
df.set_index(['month', 'year'])['pre'].unstack().plot()
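Building on the same idea, a sketch that reproduces the mean +/- std band from the question without the manual slicing (assuming numpy as np and matplotlib.pyplot as plt are imported):
wide = df.set_index(['month', 'year'])['pre'].unstack()  # one column per year
mean, std = wide.mean(axis=1), wide.std(axis=1)
plt.fill_between(np.arange(len(wide)), mean - std, mean + std, alpha=.4)
plt.show()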
I have a huge CSV file of data; it looks like this:
STAID, SOUID, DATE, TX, Q_TX
162,100522,19010101, -31, 0
162,100522,19010102, -13, 0
TX is the temperature (in tenths of a degree); the data goes on for a few thousand more lines, to give you an idea.
For every year, I want to plot the number of days with a temperature above 25 degrees.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("klimaat.csv")
zomers = data.index[data["TX"] > 250].tolist()
x_values = []
y_values = []
plt.xlabel("Years")
plt.ylabel("Amount of days with TX > 250")
plt.title("Zomerse Dagen Per Jaar")
plt.plot(x_values, y_values)
# save plot
plt.savefig("zomerse_dagen.png")
The x-axis should be the years, say 1900-2010 or something, and the y-axis should be the number of days with a temperature higher than 250 in that year.
How do I go about this? >_< I can't quite get a grasp on how to extract the number of days from the data and use it in a plot.
You can create the data points separately to make it a little easier to comprehend. Then use pandas.pivot_table to aggregate. Here is a working example that should get you going.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("klimaat.csv", parse_dates=["DATE"])
data.sort_values("DATE", inplace=True)
data["above_250"] = data.TX > 250
data["year"] = data.apply(lambda x: x["DATE"].year, axis=1).astype("category")
plot_df = pd.pivot_table(data, index="year", values="above_250", aggfunc="sum")
years = plot_df.index
y_pos = np.arange(len(years))
values = plot_df.above_250
plt.bar(y_pos, values, align='center', alpha=0.5)
plt.xticks(y_pos, years)
plt.ylabel("Amount of days with TX > 250")
plt.xlabel("Year")
plt.title("Zomerse Dagen Per Jaar")
plt.show()
You can use the datetime module from the python standard library to parse the dates, in particular, have a look at the strptime function. You can then use the datetime.year attribute to aggregate your data.
You can also use an OrderedDict to keep track of your aggregation before you assign OrderedDict.keys() and OrderedDict.values() to x_values and y_values respectively.
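A minimal sketch of that approach (assuming the file layout matches the sample above, with spaces after the commas; years with no qualifying days simply won't appear):
from collections import OrderedDict
from csv import DictReader
from datetime import datetime

counts = OrderedDict()
with open('klimaat.csv') as f:
    for row in DictReader(f, skipinitialspace=True):
        year = datetime.strptime(row['DATE'], '%Y%m%d').year
        if int(row['TX']) > 250:
            counts[year] = counts.get(year, 0) + 1

x_values = list(counts.keys())
y_values = list(counts.values())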
Background
The data link I uploaded is a year-long time series of concentration measurements at a monitoring station. The format of the data is shown like this:
My target
To investigate the temporal pattern of the samples, I want to plot the variation of the monthly samples.
Like the figure below, which I downloaded from plot.ly. Each box represents the daily average samples of the raw data for one month, and the monthly average values are outlined by the line.
With the groupby function or pd.pivot, I can easily get the subset of a certain month or the daily data.
But I found it hard to generate a bunch of DataFrames, each containing the daily average data for a certain month.
By pre-defining 12 empty DataFrames, I can generate 12 DataFrames that fit my need. But is there any neat way to divide the original DataFrame and generate multiple DataFrames by user-defined conditions?
EDIT
Inspired by the answer from @alexis, I tried to achieve my target with the code below, and it works for me.
## PM is the original dataset with date, hour, and values.
position = np.arange(1,13,1)
monthDict = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
pm['label'] = np.nan
for i in range(0, len(pm), 1):
    pm['label'].iloc[i] = monthDict.get(int(pm['date'].str[4:6].iloc[i]))
## Create an empty dataframe for containing the daily mean value.
df = pd.DataFrame(np.nan, index=np.arange(0,31,1), columns=['A'])
for i, t in enumerate(pm.label.unique()):
    df[str(t)] = np.nan
df = df.drop(columns=['A'])
mean_ = []
for i in range(0, len(pm.label.unique()), 1):
    month_data = pm.groupby(['label']).get_group(pm.label.unique()[i]).groupby(pm['date'].str[6:8])['pm25'].mean()
    mean_.append(month_data.mean())
    for j in range(0, len(month_data), 1):
        df[pm.label.unique()[i]].iloc[j] = month_data[j]
#### PLOT
fig = plt.figure(figsize=(12,5))
ax = plt.subplot()
bp = ax.boxplot(df.dropna().values, patch_artist=True, showfliers=False)
mo_me = plt.plot(position, mean_, marker='o', color='k', markersize=6,
                 label='Monthly Mean', lw=1.5, zorder=3)
cs = ['#9BC4E1','k']
for box in bp['boxes']:
    box.set(color='b', alpha=1)
    box.set(facecolor=cs[0], alpha=1)
for whisker in bp['whiskers']:
    whisker.set(color=cs[1], linewidth=1, linestyle='-')
for cap in bp['caps']:
    cap.set(color=cs[1], linewidth=1)
for median in bp['medians']:
    median.set(color=cs[1], linewidth=1.5)
ax.set_xticklabels(pm.label.unique(), fontsize = 14)
# ax.set_yticklabels(ax.get_yticks(), fontsize = 12)
for label in ax.yaxis.get_ticklabels()[::2]:
    label.set_visible(False)
for tick in ax.yaxis.get_major_ticks():
    tick.label.set_fontsize(14)
plt.ylabel('Concentration', fontsize = 16, labelpad =14)
plt.xlabel('Month', fontsize = 16, labelpad =14)
plt.legend(fontsize = 14, frameon = False)
ax.set_ylim(0.0, 178)
plt.grid()
plt.show()
And this is my output figure.
Any suggestions about my code, on data management or visualization, would be appreciated!
Don't generate 12 dataframes. Instead of splitting your data into multiple similar variables, add a column that indicates which group each row should belong to. This is standard practice (with good reason) for database tables, dataframes, etc.
Use groupby on your dataset to group the data by month, then use apply() on the resulting DataFrameGroupBy object to restrict whatever analysis you want (e.g., the average) to each group. This will also make it easy to plot the monthly results together.
You don't provide any code, so it's hard to be more specific than that. Show how you group your data by month and what you want to do to the monthly dataframes, and I'll show you how to restrict it to each month via the groupby object.
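For illustration, a minimal sketch of that single-column approach, reusing the names from the edit above (pm with a string 'date' column and a 'pm25' column, and matplotlib.pyplot imported as plt; all assumed):
pm['month'] = pm['date'].str[4:6].astype(int)
day = pm['date'].str[6:8]
daily = pm.groupby(['month', day])['pm25'].mean()  # daily means, keyed by (month, day)
monthly_mean = daily.groupby(level='month').mean()  # monthly means for the overlay line
daily.unstack(level=0).boxplot()  # one box per month, no intermediate DataFrames
plt.show()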