Plot multiple values as ranges - matplotlib - python

I'm trying to determine the most efficient way to produce a group of line plots displayed as a range. I'm hoping to produce something like:
I'll try to explain as much as possible. Sorry if I miss any information. I'm envisaging the x-axis as a range of hourly timestamps (8am, 9am, 10am, etc.). The total range would be between 8:00:00 and 27:00:00. The y-axis is a count of values occurring at any point in time. The range in the plot would represent the max, min, and average values occurring.
An example df is listed below:
import pandas as pd
import matplotlib.pyplot as plt
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
})
df = pd.DataFrame(data = d)
So this df represents 3 different sets of data. The times, the occurring values, and even the number of entries can vary between sets.
I'm unsure if I need to rethink my approach. Would a rolling calculation work here, i.e. something that assesses the max, min, and avg number of values occurring for each hour in the df (8:00:00-9:00:00)? (A resample-based sketch of that idea appears after the code below.)
Below is a full initial attempt:
import pandas as pd
import matplotlib.pyplot as plt
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
})
df = pd.DataFrame(data = d)
fig, ax = plt.subplots(figsize = (10,6))
ax.plot(df['Time1'], df['Occurring1'])
ax.plot(df['Time2'], df['Occurring2'])
ax.plot(df['Time3'], df['Occurring3'])
plt.show()
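For the hourly grouping idea mentioned above, a minimal sketch might look like the following. It assumes the three Time/Occurring pairs are first stacked into one long frame with the time strings parsed as datetimes; the names long_df and hourly are just illustrative.
# Stack the three Time/Occurring pairs into one long frame
long_df = pd.concat(
    [pd.DataFrame({"time": pd.to_datetime(df["Time%i" % i]),
                   "occurring": df["Occurring%i" % i].astype(int)})
     for i in (1, 2, 3)]
)
# Hourly min / max / mean of the occurrence counts
hourly = long_df.set_index("time")["occurring"].resample("1H").agg(["min", "max", "mean"])
print(hourly)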

To get the desired result, you'd need to jump through a few hoops. First you need to create a regular time grid, onto which you interpolate the y-data (the occurrences). Then, you can get the min, max, and mean of the interpolated data. The code below demonstrates how to do this:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import griddata
# Example data
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
})
# Create dataframe, explicitly convert dtypes (times to datetimes, counts to ints)
df = pd.DataFrame(data=d)
for i in (1, 2, 3):
    df["Time%i" % i] = pd.to_datetime(df["Time%i" % i])
    df["Occurring%i" % i] = df["Occurring%i" % i].astype(int)
# Collect all time data into one array
all_times = df[["Time1", "Time2", "Time3"]].values
# Representation of 1 minute as a timedelta
t_min = np.timedelta64(int(60*1e9), "ns")
# Create a regular time grid with 10 minute spacing
time_grid = np.arange(all_times.min(), all_times.max(), 10*t_min, dtype="datetime64")
# Storage buffer for interpolated occurrence data
occurrences_grid = np.zeros((3, len(time_grid)))
# Loop over all occurrence data and interpolate onto the regular grid
for i in range(3):
    occurrences_grid[i] = griddata(
        points=df["Time%i" % (i+1)].values.astype("float"),
        values=df["Occurring%i" % (i+1)],
        xi=time_grid.astype("float"),
        method="linear",
    )
# Get min, max, and mean values of interpolated data
occ_min = np.min(occurrences_grid, axis=0)
occ_max = np.max(occurrences_grid, axis=0)
occ_mean = np.mean(occurrences_grid, axis=0)
# Plot interpolated data
plt.fill_between(time_grid, occ_min, occ_max, color="slategray")
plt.plot(time_grid, occ_mean, c="white")
plt.xticks(rotation=60)
plt.tight_layout()
plt.show()
Result (x-labels not formatted properly):
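On the unformatted x-labels: since time_grid holds datetime64 values, matplotlib's date locators and formatters can be applied before plt.show(). A minimal sketch (the hourly tick spacing is just an illustrative choice):
import matplotlib.dates as mdates

ax = plt.gca()
ax.xaxis.set_major_locator(mdates.HourLocator())              # one tick per hour
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))   # e.g. 09:00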

Related

Fourier Result on Time Series explained python

I have passed my time series data, which is essentially pressure measurements from a sensor, through a Fourier transformation, similar to what is described in https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101.
The file used can be found here:
https://docs.google.com/spreadsheets/d/1MLETSU5Trl5gLGO6pv32rxBsR8xZNkbK/edit?usp=sharing&ouid=110574180158524908052&rtpof=true&sd=true
The related code is this:
import pandas as pd
import numpy as np
file='test.xlsx'
df=pd.read_excel(file,header=0)
#df=pd.read_csv(file,header=0)
df.head()
df.tail()
# drop ID
df=df[['JSON_TIMESTAMP','ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB','ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_ADH_COATWEIGHT_SP']]
# extract year month
df["year"] = df["JSON_TIMESTAMP"].str[:4]
df["month"] = df["JSON_TIMESTAMP"].str[5:7]
df["day"] = df["JSON_TIMESTAMP"].str[8:10]
df = df.sort_values(['year', 'month', 'day'], ascending=[True, True, True])
df['JSON_TIMESTAMP'] = df['JSON_TIMESTAMP'].astype('datetime64[ns]')
df = df.sort_values(by='JSON_TIMESTAMP', ascending=True)
df1=df.copy()
df1 = df1.set_index('JSON_TIMESTAMP')
df1 = df1[["ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB"]]
import matplotlib.pyplot as plt
#plt.figure(figsize=(15,7))
plt.rcParams["figure.figsize"] = (25,8)
df1.plot()
#df.plot(style='k. ')
plt.show()
df1.hist(bins=20)
from scipy.fft import rfft,rfftfreq
## https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101
# convert into x and y
x = list(range(len(df1.index)))
y = df1['ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB']
# apply fast fourier transform and take absolute values
f=abs(np.fft.fft(df1))
# get the list of frequencies
num=np.size(x)
freq = [i / num for i in list(range(num))]
# get the list of spectrums
spectrum=f.real*f.real+f.imag*f.imag
nspectrum=spectrum/spectrum[0]
# plot nspectrum per frequency, with a semilog scale on nspectrum
plt.semilogy(freq,nspectrum)
nspectrum
type(freq)
freq= np.array(freq)
freq
type(nspectrum)
nspectrum = nspectrum.flatten()
# improve the plot by adding periods in number of days rather than frequency
import pandas as pd
results = pd.DataFrame({'freq': freq, 'nspectrum': nspectrum})
results['period'] = results['freq'] / (1/365)
plt.semilogy(results['period'], results['nspectrum'])
# improve the plot by converting the data into groups per day to avoid peaks
results['period_round'] = results['period'].round()
grouped_day = results.groupby('period_round')['nspectrum'].sum()
plt.semilogy(grouped_day.index, grouped_day)
#plt.xticks([1, 13, 26, 39, 52])
My end result is this:
Result of Fourier Transformation for Data
My question is, what does this eventually show for our data, and intuitively what does the spike in the last section mean? What can I do with such a result?
Thanks in advance all!
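As an aside on the code above: rfft and rfftfreq are imported from scipy.fft but never used, and they would give the one-sided spectrum and its frequency axis directly instead of building the frequency list by hand. A minimal sketch of that route (reusing y and the normalisation from the question, and assuming a sample spacing of one):
from scipy.fft import rfft, rfftfreq

f = rfft(y.values)                           # one-sided FFT of the real-valued series
freq = rfftfreq(len(y), d=1.0)               # matching frequencies, d = sample spacing
nspectrum = np.abs(f)**2 / np.abs(f[0])**2   # normalised power spectrum
plt.semilogy(freq, nspectrum)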

How to control the order of facet grid rows and/or columns in xarray?

I am trying to change the order of the variables I use to make a facet grid in xarray. For example, I have [a,b,c,d] as column names and I want to reorder it to [c,d,a,b]. Unfortunately, unlike seaborn, I could not find parameters such as col_order or row_order in the xarray plot function (https://xarray.pydata.org/en/stable/generated/xarray.plot.FacetGrid.html).
Update:
To help myself better explain what I need, I took the example below from the xarray user guide.
In the following example, I need to change the placement of the months: for example, I want to put month 7 as the first column, month 2 as the fifth, and so on.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature.nc").rename({"air": "Tair"})
# we will add a gradient field with appropriate attributes
ds["dTdx"] = ds.Tair.differentiate("lon") / 110e3 / np.cos(ds.lat * np.pi / 180)
ds["dTdy"] = ds.Tair.differentiate("lat") / 105e3
ds.dTdx.attrs = {"long_name": "$∂T/∂x$", "units": "°C/m"}
ds.dTdy.attrs = {"long_name": "$∂T/∂y$", "units": "°C/m"}
monthly_means = ds.groupby("time.month").mean()
# xarray's groupby reductions drop attributes. Let's assign them back so we get nice labels.
monthly_means.Tair.attrs = ds.Tair.attrs
fg = monthly_means.Tair.plot(
    col="month",
    col_wrap=4,  # each row has a maximum of 4 columns
)
plt.show()
Any help is highly appreciated.
xarray will respect the shape of your data, so you can rearrange the data prior to plotting:
In [2]: ds = xr.tutorial.open_dataset("air_temperature.nc")
In [3]: ds_mon = ds.groupby("time.month").mean()
In [4]: # order the data by month, descending
...: ds_mon.air.sel(month=list(range(12, 0, -1))).plot(
...: col="month", col_wrap=4,
...: )
Out[4]: <xarray.plot.facetgrid.FacetGrid at 0x16b9a7700>
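The same idea covers the arbitrary ordering asked about in the update: passing the desired month sequence to .sel() lays out the facets in that order. A sketch (the month list below is just an example, not from the original answer):
# put month 7 first, then any order you like for the rest
custom_order = [7, 8, 9, 10, 2, 1, 3, 4, 5, 6, 11, 12]
ds_mon.air.sel(month=custom_order).plot(col="month", col_wrap=4)
plt.show()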

How to retrieve all data from seaborn distribution plot with multiple distributions?

The post Get data points from Seaborn distplot describes how you can get data elements using sns.distplot(x).get_lines()[0].get_data(), sns.distplot(x).patches, and [h.get_height() for h in sns.distplot(x).patches].
But how can you do this if you've used multiple layers by plotting the data in a loop, such as:
Snippet 1
for var in list(df):
    print(var)
    distplot = sns.distplot(df[var])
Plot
Is there a way to retrieve the X and Y values for both linecharts and the bars?
Here's the whole setup for an easy copy&paste:
#%%
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pylab
pylab.rcParams['figure.figsize'] = (8, 4)
import seaborn as sns
from collections import OrderedDict
# Function to build synthetic data
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 1999")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date + pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index=dates)
    return df
# Dataframe with synthetic data
df = sample(rSeed = 123, colNames = ['X1', 'X2'], periodLength = 50)
# sns.distplot with multiple layers
for var in list(df):
    myPlot = sns.distplot(df[var])
Here's what I've tried:
Y-values for histogram:
If I run:
barX = [h.get_height() for h in myPlot.patches]
Then I get the following list of length 11:
[0.046234272703757885,
0.1387028181112736,
0.346757045278184,
0.25428849987066837,
0.2542884998706682,
0.11558568175939472,
0.11875881712519201,
0.3087729245254993,
0.3087729245254993,
0.28502116110046083,
0.1662623439752689]
And this seems reasonable, since there seem to be 6 values for the blue bars and 5 values for the red bars. But how do I tell which values belong to which variable?
Y-values for line:
This seems a bit easier than the histogram part since you can use myPlot.get_lines()[0].get_data() AND myPlot.get_lines()[1].get_data() to get:
Out[678]:
(array([-4.54448949, -4.47612134, -4.40775319, -4.33938504, -4.27101689,
...
3.65968859, 3.72805675, 3.7964249 , 3.86479305, 3.9331612 ,
4.00152935, 4.0698975 , 4.13826565]),
array([0.00042479, 0.00042363, 0.000473 , 0.00057404, 0.00073097,
0.00095075, 0.00124272, 0.00161819, 0.00208994, 0.00267162,
...
0.0033384 , 0.00252219, 0.00188591, 0.00139919, 0.00103544,
0.00077219, 0.00059125, 0.00047871]))
myPlot.get_lines()[1].get_data()
Out[679]:
(array([-3.68337423, -3.6256517 , -3.56792917, -3.51020664, -3.4524841 ,
-3.39476157, -3.33703904, -3.27931651, -3.22159398, -3.16387145,
...
3.24332952, 3.30105205, 3.35877458, 3.41649711, 3.47421965,
3.53194218, 3.58966471, 3.64738724]),
array([0.00035842, 0.00038018, 0.00044152, 0.00054508, 0.00069579,
0.00090076, 0.00116922, 0.00151242, 0.0019436 , 0.00247792,
...
0.00215912, 0.00163627, 0.00123281, 0.00092711, 0.00070127,
0.00054097, 0.00043517, 0.00037599]))
But the whole thing still seems a bit cumbersome. So does anyone know of a more direct approach to perhaps retrieve all data to a dictionary or dataframe?
I just ran into the same need of retrieving data from a seaborn distribution plot; what worked for me was to call the method .findobj() on each iteration's graph. One can then notice that the matplotlib.lines.Line2D objects have a get_data() method, which is similar to what you mentioned before with myPlot.get_lines()[1].get_data().
Following your example code:
data = []
for idx, var in enumerate(list(df)):
    myPlot = sns.distplot(df[var])
    # Find Line2D objects
    lines2D = [obj for obj in myPlot.findobj() if str(type(obj)) == "<class 'matplotlib.lines.Line2D'>"]
    # Retrieve x, y data
    x, y = lines2D[idx].get_data()[0], lines2D[idx].get_data()[1]
    # Store as dataframe
    data.append(pd.DataFrame({'x': x, 'y': y}))
Notice here that the data for the first sns.distplot is stored at the first index of lines2D and the data for the second sns.distplot is stored at the second index. I'm not really sure why it happens this way, but if you were to consider more than two plots, you would access each sns.distplot's data through lines2D at its respective index.
Finally, to verify, one can plot each distplot:
plt.plot(data[0].x, data[0].y)
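For the histogram bars asked about above, one option (a sketch, not part of the original answer) is to record how many patches exist before and after each sns.distplot call, so each variable's bars can be sliced out of myPlot.patches:
bars = {}
n_seen = 0
for var in list(df):
    myPlot = sns.distplot(df[var])
    new_patches = myPlot.patches[n_seen:]            # bars added by this call only
    bars[var] = [p.get_height() for p in new_patches]
    n_seen = len(myPlot.patches)                     # remember total for the next iteration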

Using python to take 32x32 matrices, append many of these matrices to a single array, then add a timestamp index to each matrix

I am very new to coding in Python and I am working with a .CSV file that gives me a 32x32 matrix in a 1024-column row with a time stamp. I reshaped the data to give me 32x32 arrays and looped through each row, appending the matrices to a numpy array.
i = 0
while i < len(df_array):
    if i == 0:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = spec
    else:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = np.concatenate((spectrum_matrix, spec), axis=0)
    i = i + 1
print("job done")
What I would like to do is take the time stamps from the original data file and add them to each of the matrices, allowing me to resample the data over a 5 minute average. I would also like to plot the bins to get a plot similar to this Drop size distribution
As a reference I am reading in the data .CSV with pandas and here is an example of a portion of the raw data: 01.06.2017;18:22:20;0.122;0.00;51;7.401;10375;18745;57;27;0.00;23.6;0.110;0;
<SPECTRUM>;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
The ;'s after <SPECTRUM> are the 32x32 matrix.
Thanks in advance for any help!
Python and associated packages can do many things without loops.
From my understanding of your data, you have a (8640 x 32 x 32) data structure (time x size x velocity).
Pandas works very well with 2D data structures; however, for higher-dimensional data I would recommend getting familiar with xarray. With this package, along with pandas, you can create and manipulate your data without having to resort to loops.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import seaborn as sns
%matplotlib inline
#create random data
data = (np.random.binomial(n =5, p =0.2, size =(8640,32,32))*1000).astype(int)
#create labels for data
sizes= np.linspace(1,5,32)
velocities = np.linspace(1,1000, num = 32)
#make time range of 24 hours with 10sec intervals
ind = pd.date_range(start='2014-01-01', periods=8640, freq='10s')
#convert data to xarray 3D data structure
df = xr.DataArray(data, coords=[ind, sizes, velocities],
                  dims=['time', 'size', 'speed'])
#make a 5 min average of the data
min_average = df.resample('300s', dim='time', how='mean')
#plot sample of data and 5 min average
my1d = min_average.isel(size = 5, speed= 10)
my1d.plot(label = '5 min avg')
plt.gca()
df.isel(size = 5, speed =10).plot(alpha = 0.3, c = 'r', label = 'raw_data')
plt.legend()
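A note on the resample call above: the dim=/how= arguments belong to older xarray releases. On current versions, the equivalent would be roughly the following (a sketch, not the original answer's code):
# groupby-style resample API used by recent xarray versions
min_average = df.resample(time='300s').mean()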
As for making a distribution plot like the one you linked, things become a bit trickier, but it is possible:
#transform your data to only have mean speed for each time and size
#and convert to pandas dataframe
mean_speed = min_average.mean(dim=['speed'])
#for some reason xarray makes you name the new column when you convert
#to a pandas dataframe. I then get rid of the extra empty variable with
#a list comprehension
df = mean_speed.to_dataframe('').unstack().T
df.index = np.array([np.array(i)[1].astype(float) for i in df.index])
#make a contourplot of your new data
plt.contourf(df.columns, df.index, df.values, cmap ='PuBu_r')
plt.title('mean speed')
plt.ylabel('size')
plt.xlabel('time')
plt.colorbar()

plotting multiple columns' values on the x-axis in python

I have a dataframe of size (100, 3) that is filled with some random float values.
Here is a sample of what the data frame looks like:
A B C
4.394966 0.580573 2.293824
3.136197 2.227557 1.306508
4.010782 0.062342 3.629226
2.687100 1.050942 3.143727
1.280550 3.328417 2.247764
4.417837 3.236766 2.970697
1.036879 1.477697 4.029579
2.759076 4.753388 3.222587
1.989020 4.161404 1.073335
1.054660 1.427896 2.066219
0.301078 2.763342 4.166691
2.323838 0.791260 0.050898
3.544557 3.715050 4.196454
0.128322 3.803740 2.117179
0.549832 1.597547 4.288621
This is how I created it
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
Note: pd is pandas
I want to plot a bar chart that has three segments on the x-axis, where each segment has 2 bars: one showing the number of values less than 2 and the other the number greater than or equal to 2.
So on the x-axis there would be two bars side by side for column A, one with the total number of values less than 2 and one with the number greater than or equal to 2, and the same for B and C.
Can anyone suggest anything?
I was thinking of using seaborn and setting the hue value to differentiate the two classes (less than 2 and greater than or equal to 2), but the hue attribute only works for categorical values and I can only set one column in the x-axis attribute.
Any tips would be appreciated.
You can apply a filter and count the values, then use plot(kind='bar'):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
dfout = pd.DataFrame({'minor': df[df <= 2].count(),
                      'major': df[df > 2].count()})
dfout.plot(kind='bar')
plt.show()
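The seaborn hue idea from the question can also work if the data is melted to long form first and the threshold turned into a categorical column. A minimal sketch (the column names 'column', 'value', and 'class' are just illustrative choices):
import seaborn as sns

long_df = df.melt(var_name='column', value_name='value')
long_df['class'] = np.where(long_df['value'] < 2, '< 2', '>= 2')
sns.countplot(data=long_df, x='column', hue='class')
plt.show()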
