Fourier Result on Time Series explained (Python)

I have passed my time series data, which is essentially pressure measurements from a sensor, through a Fourier transform, similar to what is described in https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101.
The file used can be found here:
https://docs.google.com/spreadsheets/d/1MLETSU5Trl5gLGO6pv32rxBsR8xZNkbK/edit?usp=sharing&ouid=110574180158524908052&rtpof=true&sd=true
The relevant code is this:
import pandas as pd
import numpy as np
file = 'test.xlsx'
df = pd.read_excel(file, header=0)
#df = pd.read_csv(file, header=0)
df.head()
df.tail()
# keep only the timestamp and the two signal columns
df = df[['JSON_TIMESTAMP',
         'ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB',
         'ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_ADH_COATWEIGHT_SP']]
# convert the timestamp to datetime and sort chronologically
# (sort_values returns a new frame, so the result has to be assigned)
df['JSON_TIMESTAMP'] = df['JSON_TIMESTAMP'].astype('datetime64[ns]')
df = df.sort_values(by='JSON_TIMESTAMP', ascending=True)
df1 = df.copy()
df1 = df1.set_index('JSON_TIMESTAMP')
df1 = df1[['ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB']]
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (25, 8)
df1.plot()
plt.show()
df1.hist(bins=20)
from scipy.fft import rfft, rfftfreq  # imported per the article; np.fft is used below
## https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101
# convert into x and y
x = list(range(len(df1.index)))
y = df1['ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB']
# apply the fast Fourier transform
# (pass the 1-D column, not the DataFrame, so the FFT runs along time)
f = np.fft.fft(y)
# get the list of frequencies (in cycles per sample)
num = np.size(x)
freq = np.array([i / num for i in range(num)])
# power spectrum, normalized by its zero-frequency (DC) component
spectrum = f.real * f.real + f.imag * f.imag
nspectrum = spectrum / spectrum[0]
# plot nspectrum per frequency, with a semilog scale on nspectrum
plt.semilogy(freq, nspectrum)
# improve the plot by showing periods in number of days rather than frequency
results = pd.DataFrame({'freq': freq, 'nspectrum': nspectrum})
results['period'] = results['freq'] / (1 / 365)
plt.semilogy(results['period'], results['nspectrum'])
# improve the plot by grouping the data per day to avoid spurious peaks
results['period_round'] = results['period'].round()
grouped_day = results.groupby('period_round')['nspectrum'].sum()
plt.semilogy(grouped_day.index, grouped_day)
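For reference, a small sketch of how the dominant periods could be read off nspectrum (scipy.signal.find_peaks is an illustration here, not part of my original pipeline):
from scipy.signal import find_peaks
# indices of local maxima in the normalized spectrum
peaks, _ = find_peaks(nspectrum)
# keep the five strongest peaks and print their frequency and period
strongest = peaks[np.argsort(nspectrum[peaks])[::-1][:5]]
for p in strongest:
    print(f"freq={freq[p]:.4f} cycles/sample, period={1 / freq[p]:.1f} samples")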
My end result is this:
[Image: result of the Fourier transformation of the data]
My question is: what does this eventually show for our data, and intuitively, what does the spike in the last section mean? What can I do with such a result?
Thanks in advance all!

Related

plt.scatter plot turns out blank

The code below returns a blank plot in Python:
# import libraries
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
os.chdir('file path')
# import data files
activity = pd.read_csv(r'file path\dailyActivity_merged.csv')  # raw strings avoid backslash-escape issues in Windows paths
intensity = pd.read_csv(r'file path\hourlyIntensities_merged.csv')
steps = pd.read_csv(r'file path\hourlySteps_merged.csv')
sleep = pd.read_csv(r'file path\sleepDay_merged.csv')
# ActivityDate in activity df only includes dates (no time). Rename it Dates
activity = activity.rename(columns={'ActivityDate': 'Dates'})
# ActivityHour in intensity df and steps df includes date-time. Split date-time column into dates and times in intensity. Drop the date-time column
intensity['Dates'] = pd.to_datetime(intensity['ActivityHour']).dt.date
intensity['Times'] = pd.to_datetime(intensity['ActivityHour']).dt.time
intensity = intensity.drop(columns=['ActivityHour'])
# split date-time column into dates and times in steps. Drop the date-time column
steps['Dates'] = pd.to_datetime(steps['ActivityHour']).dt.date
steps['Times'] = pd.to_datetime(steps['ActivityHour']).dt.time
steps = steps.drop(columns=['ActivityHour'])
# split date-time column into dates and times in sleep. Drop the date-time column
sleep['Dates'] = pd.to_datetime(sleep['SleepDate']).dt.date
sleep['Times'] = pd.to_datetime(sleep['SleepDate']).dt.time
sleep = sleep.drop(columns=['SleepDate', 'TotalSleepRecords'])
# add a column & calculate time_awake_in_bed before falling asleep
sleep['time_awake_in_bed'] = sleep['TotalTimeInBed'] - sleep['TotalMinutesAsleep']
# merge activity and sleep
keys = ['Id', 'Dates']  # renamed from 'list', which shadows the built-in
activity_sleep = sleep.merge(activity,
                             on=keys,
                             how='outer')
# plot relation between calories used daily vs how long it takes users to fall asleep
plt.scatter(activity_sleep['time_awake_in_bed'], activity_sleep['Calories'], s=20, c='b', marker='o')
plt.axis([0, 200, 0, 5000])
plt.show()
NOTE: max(Calories) = 4900 and min(Calories) = 0; max(time_awake_in_bed) = 150 and min(time_awake_in_bed) = 0.
Please let me know how I can get a scatter plot out of this. Thank you in advance for any help.
The same variables from the same data-frame work perfectly with geom_point() in R.
I found where the problem was. As @Redox and @cheersmate mentioned in the comments, the data frame that I created by merging included NaN values. I fixed this by merging only on 'Id'. Then I could create a scatter plot:
keys = ['Id']
activity_sleep = sleep.merge(activity,
                             on=keys,
                             how='outer')
The column "Dates" is not a good one to merge on, as in each data frame the same dates are repeated in multiple rows. Also I noticed that I get the same plot whether I outer or inner merge. Thank you.

How to control the order of facet grid rows and/or columns in xarray?

I am trying to change the order of variables I use to make a facet grid in xarray. For example, I have [a, b, c, d] as column names and I want to reorder them to [c, d, a, b]. Unfortunately, unlike seaborn, I could not find parameters such as col_order or row_order in the xarray plot function (https://xarray.pydata.org/en/stable/generated/xarray.plot.FacetGrid.html).
Update:
To help myself better explain what I need, I took the example below from the user guide of xarray:
In the following example, I need to change the placement of the months: for example, I want to put month 7 as the first column, month 2 as the 5th, and so on and so forth.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature.nc").rename({"air": "Tair"})
# we will add a gradient field with appropriate attributes
ds["dTdx"] = ds.Tair.differentiate("lon") / 110e3 / np.cos(ds.lat * np.pi / 180)
ds["dTdy"] = ds.Tair.differentiate("lat") / 105e3
ds.dTdx.attrs = {"long_name": "$∂T/∂x$", "units": "°C/m"}
ds.dTdy.attrs = {"long_name": "$∂T/∂y$", "units": "°C/m"}
monthly_means = ds.groupby("time.month").mean()
# xarray's groupby reductions drop attributes. Let's assign them back so we get nice labels.
monthly_means.Tair.attrs = ds.Tair.attrs
fg = monthly_means.Tair.plot(
    col="month",
    col_wrap=4,  # each row has a maximum of 4 columns
)
plt.show()
Any help is highly appreciated.
xarray will respect the shape of your data, so you can rearrange the data prior to plotting:
In [2]: ds = xr.tutorial.open_dataset("air_temperature.nc")
In [3]: ds_mon = ds.groupby("time.month").mean()
In [4]: # order the data by month, descending
   ...: ds_mon.air.sel(month=list(range(12, 0, -1))).plot(
   ...:     col="month", col_wrap=4,
   ...: )
Out[4]: <xarray.plot.facetgrid.FacetGrid at 0x16b9a7700>
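The same idea supports any explicit order, so the "month 7 first" layout from the question could look like this (a sketch, assuming the monthly means computed above):
# put July first, then wrap around the year
order = [7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6]
ds_mon.air.sel(month=order).plot(col="month", col_wrap=4)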

Summing time series after k-means clustering

I am trying out different values of K in K-means clustering on a set of time series data.
For each experiment I want to sum up the time series for each cluster label and perform predictions on them.
So for example:
If I cluster the time series into 3 clusters, I want to sum all the time series (column-wise) belonging to cluster 1, all the time series belonging to cluster 2, and the same for cluster 3. After that I will make predictions on each aggregated time-series cluster, but I do not need help on the prediction part.
I was thinking to add the cluster labels to the original dataframe and then use .loc and a loop to extract time series corresponding to the same clusters. But I am wondering if there is a more efficient way?
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.cluster import KMeans
# create a dataframe with 20 random hourly time series
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
for i in range(20):
    df['ts' + str(i)] = np.random.randint(0, 100, size=(len(date_rng)))
# one row per time series, one column per timestamp
df_pivot = df.pivot_table(columns='date', values=df.columns.drop('date'))
# cluster for several values of k
K = range(1, 10, 2)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_pivot)
    print(km.labels_)
    # sum/aggregate all ts in each cluster column-wise
    # forecast next step for each cluster (don't need help with this part)
You can access data points for every cluster and then sum their values.
Something like this:
labels = km.labels_
centroids = km.cluster_centers_
cluster_sums_dict = {}  # cluster number: column-wise sum of its time series
for i in range(k):
    # boolean mask selects the rows (time series) assigned to cluster i
    temp_cluster = df_pivot[labels == i]
    cluster_sums_dict[i] = temp_cluster.sum(axis=0)
Also, on a side note: instead of aggregating the cluster values, could you use the centroid of each cluster for prediction?
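As a more compact alternative (a sketch, assuming the rows of df_pivot line up with km.labels_), pandas can group directly on the label array:
# one summed time series per cluster label, summed column-wise
cluster_sums = df_pivot.groupby(km.labels_).sum()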

Plot multiple values as ranges - matplotlib

I'm trying to determine the most efficient way to produce a group of line plots displayed as a range. I'm hoping to produce something like:
[Image: a shaded min-max range band with the average plotted as a line]
I'll try to explain as much as possible; sorry if I miss any information. I'm envisaging the x-axis to be a range of hourly timestamps (8am, 9am, 10am, etc.); the total range would be between 8:00:00 and 27:00:00. The y-axis is a count of values occurring at any point in time. The range in the plot would represent the max, min, and average values occurring.
An example df is listed below:
import pandas as pd
import matplotlib.pyplot as plt
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
})
df = pd.DataFrame(data = d)
So this df represents 3 different sets of data. The times, the values occurring, and even the number of entries can vary.
I'm unsure if I need to rethink my approach: would a rolling calculation work here? Something that assesses the max, min, and average number of values occurring for each hour in the df (e.g. 8:00:00-9:00:00).
Below is a full initial attempt:
import pandas as pd
import matplotlib.pyplot as plt
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
})
df = pd.DataFrame(data = d)
fig, ax = plt.subplots(figsize = (10,6))
ax.plot(df['Time1'], df['Occurring1'])
ax.plot(df['Time2'], df['Occurring2'])
ax.plot(df['Time3'], df['Occurring3'])
plt.show()
To get the desired result, you'd need to jump through a few hoops. First you need to create a regular time grid, onto which you interpolate the y-data (the occurrences). Then, you can get the min, max, and mean of the interpolated data. The code below demonstrates how to do this:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import griddata
# Example data
d = ({
    'Time1' : ['8:00:00','9:30:00','9:40:00','10:25:00','12:30:00','1:31:00','1:35:00','2:45:00','4:50:00'],
    'Occurring1' : ['1','2','3','4','5','5','6','6','7'],
    'Time2' : ['8:10:00','9:34:00','9:48:00','10:40:00','1:30:00','2:31:00','3:35:00','3:45:00','4:55:00'],
    'Occurring2' : ['1','2','2','3','4','5','5','6','7'],
    'Time3' : ['9:00:00','9:34:00','9:58:00','10:45:00','10:50:00','12:31:00','1:35:00','2:15:00','3:55:00'],
    'Occurring3' : ['1','2','3','4','4','5','6','7','8'],
})
# Create dataframe, explicitly define dtypes
df = pd.DataFrame(data=d)
df = df.astype({
    "Time1": "datetime64[ns]",   # np.datetime64/np.int are deprecated as astype targets
    "Occurring1": int,
    "Time2": "datetime64[ns]",
    "Occurring2": int,
    "Time3": "datetime64[ns]",
    "Occurring3": int,
})
# Create 1D vectors of time data
all_times = df[["Time1", "Time2", "Time3"]].values
# Representation of 1 minute in time
t_min = np.timedelta64(int(60*1e9), "ns")
# Create a regular time grid with 10 minute spacing
time_grid = np.arange(all_times.min(), all_times.max(), 10*t_min, dtype="datetime64")
# Storage buffer for interpolated occurring data
occurrences_grid = np.zeros((3, len(time_grid)))
# Loop over all occurrence data and interpolate to regular grid
for i in range(3):
    occurrences_grid[i] = griddata(
        points=df["Time%i" % (i+1)].values.astype("float"),
        values=df["Occurring%i" % (i+1)],
        xi=time_grid.astype("float"),
        method="linear"
    )
# Get min, max, and mean values of interpolated data
occ_min = np.min(occurrences_grid, axis=0)
occ_max = np.max(occurrences_grid, axis=0)
occ_mean = np.mean(occurrences_grid, axis=0)
# Plot interpolated data
plt.fill_between(time_grid, occ_min, occ_max, color="slategray")
plt.plot(time_grid, occ_mean, c="white")
plt.xticks(rotation=60)
plt.tight_layout()
plt.show()
Result (x-labels not formatted properly):
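If needed, the x-labels could be tidied up with matplotlib's date formatting (a sketch; the hour:minute format is an assumption):
import matplotlib.dates as mdates
ax = plt.gca()
# show ticks as hour:minute and place one tick per hour
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
ax.xaxis.set_major_locator(mdates.HourLocator(interval=1))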

Using Python to take 32x32 matrices, append many of them to a single array, then add a timestamp index to each matrix

I am very new to coding in Python and I am working with a CSV file that gives me a 32x32 matrix as a 1024-column row with a timestamp. I reshaped the data to give me 32x32 arrays and looped through each row, appending the matrices to a numpy array.
i = 0
while i < len(df_array):
    if i == 0:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = spec
    else:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = np.concatenate((spectrum_matrix, spec), axis=0)
    i = i + 1
print("job done")
What I would like to do is to take the timestamps from the original data file and add them to each of the matrices, allowing me to resample the data over a 5-minute average. I would also like to plot the bins to get a plot similar to this drop size distribution.
As a reference, I am reading in the CSV data with pandas; here is an example of a portion of the raw data: 01.06.2017;18:22:20;0.122;0.00;51;7.401;10375;18745;57;27;0.00;23.6;0.110;0;
<SPECTRUM>;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
The semicolon-separated values after <SPECTRUM> are the 32x32 matrix.
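For what it's worth, I read the file in roughly like this (a sketch; the filename and options are placeholders):
import pandas as pd
# semicolon-separated file, no header row
df = pd.read_csv('data.csv', sep=';', header=None)
df_array = df.values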
Thanks in advance for any help!
Python and its associated packages can do many things without loops.
From my understanding of your data, you have an (8640 x 32 x 32) data structure (time x size x velocity).
Pandas works very well with 2D data structures; however, for higher-dimensional data I would recommend you get familiar with xarray. With this package, along with pandas, you can create and manipulate your data without having to resort to loops.
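For instance, the loop in the question collapses to a single vectorized reshape (a sketch, assuming df_array is the 2-D array from the question with the spectrum values in columns 1-1024):
# all rows at once: (n_rows, 1024) -> (n_rows, 32, 32)
spectra = df_array[:, 1:1025].astype(float).reshape(-1, 32, 32)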
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import seaborn as sns
%matplotlib inline
# create random data
data = (np.random.binomial(n=5, p=0.2, size=(8640, 32, 32)) * 1000).astype(int)
# create labels for data
sizes = np.linspace(1, 5, 32)
velocities = np.linspace(1, 1000, num=32)
# make a time range of 24 hours with 10 s intervals
ind = pd.date_range(start='2014-01-01', periods=8640, freq='10s')
# convert data to an xarray 3D data structure
df = xr.DataArray(data, coords=[ind, sizes, velocities],
                  dims=['time', 'size', 'speed'])
# make a 5 min average of the data
# (current xarray syntax; older versions used resample('300s', dim='time', how='mean'))
min_average = df.resample(time='300s').mean()
# plot a sample of the data and the 5 min average
my1d = min_average.isel(size=5, speed=10)
my1d.plot(label='5 min avg')
df.isel(size=5, speed=10).plot(alpha=0.3, c='r', label='raw_data')
plt.legend()
As for making a distribution plot like the one you linked, things become a bit trickier, but it is possible:
# transform your data to only have the mean speed for each time and size,
# and convert to a pandas dataframe
mean_speed = min_average.mean(dim=['speed'])
# xarray makes you name the new column when you convert to a pandas
# dataframe; the extra empty index level is then stripped off below
df = mean_speed.to_dataframe('').unstack().T
df.index = np.array([np.array(i)[1].astype(float) for i in df.index])
# make a contour plot of your new data
plt.contourf(df.columns, df.index, df.values, cmap='PuBu_r')
plt.title('mean speed')
plt.ylabel('size')
plt.xlabel('time')
plt.colorbar()
