Standard deviation of time series data on two columns - python

I have a dataframe with two columns of data for one day, with a time-series index. The samples are at 1-minute resolution, and I want to create a 5-minute dataframe in which a 5-minute interval is flagged False when the standard deviation of the 5 samples in that interval exceeds 5% of their mean. This needs to be done for each 5-minute interval in the day and for each column. As seen below for DF1 column X, we calculate the mean and standard deviation of the 5 samples from 16:01 to 16:05 and look at %(Std/Mean); the same is done for the next 5 samples and for column Y. DF2 is then populated: if %(Std/Mean) > 5%, the particular 5-minute interval will be False.

You can use the resample method of pandas dataframes; for that, the dataframe must be indexed with a timestamp. Here is an example:
import pandas as pd
import numpy as np
dates = pd.date_range('1/1/2020', periods=30)
df = pd.DataFrame(np.random.randn(30,2), index=dates, columns=['X','Y'])
df.head()
lbl = 'right'  # label each window with its right edge
w = '3d'       # window width
threshold = 1  # your threshold for flagging the ratio of standard deviation to mean
r = df.resample(w, label=lbl)
x = r.std()['X'] / r.mean()['X'] > threshold
y = r.std()['Y'] / r.mean()['Y'] > threshold
DF2 = pd.concat([x, y], axis=1)
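For the exact setup in the question (1-minute samples, 5-minute windows, 5% threshold), here is a hedged sketch of the same idea; per the question's convention, an interval is marked False when the ratio exceeds 5%:
import pandas as pd
import numpy as np

# Hypothetical 1-minute data for one day, with columns 'X' and 'Y' as in the question
idx = pd.date_range('2020-01-01', periods=24 * 60, freq='1min')
DF1 = pd.DataFrame(np.abs(np.random.randn(len(idx), 2)) + 10,
                   index=idx, columns=['X', 'Y'])

r = DF1.resample('5min', label='right')
ratio = r.std() / r.mean()  # %(Std/Mean) per 5-minute interval, per column
DF2 = ratio <= 0.05         # False where the ratio exceeds 5%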

Related

Calculate mean from only one variable in pandas dataframe and netcdf

I am aiming to calculate the daily climatology from a dataset, i.e. obtain the sea surface temperature (SST) for each day of the year by averaging across all the years (for example, for January 1st, the average SST of every January 1st from 1982 to 2018). To do so, I took the following steps:
DATA PREPARATION STEPS
Here is a Drive link to both datasets to make the code reproducible:
link to datasets
First, I load two datasets:
import numpy as np
import xarray as xr

ds1 = xr.open_dataset('./anomaly_dss/archive_to2018.nc')    # from 1982 to 2018
ds2 = xr.open_dataset('./anomaly_dss/realtime_from2018.nc') # from 2018 to present
Then I convert to pandas dataframe and merge both in one:
ds1 = ds1.where(ds1.time > np.datetime64('1982-01-01'), drop=True) # Grab all data since 1/1/1982
ds2 = ds2.where(ds2.time > ds1.time.max(), drop=True) # Grab all data since the end of the archive
# Convert to Pandas Dataframe
df1 = ds1.to_dataframe().reset_index().set_index('time')
df2 = ds2.to_dataframe().reset_index().set_index('time')
# Merge these datasets
df = df1.combine_first(df2)
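(As a side note on combine_first, a toy illustration with hypothetical frames: it keeps the caller's values and fills in whatever only exists in the other frame.)
import pandas as pd

a = pd.DataFrame({'v': [1.0, None]}, index=[0, 1])
b = pd.DataFrame({'v': [9.0, 2.0, 3.0]}, index=[0, 1, 2])
print(a.combine_first(b))
#      v
# 0  1.0   <- kept from a
# 1  2.0   <- filled from b
# 2  3.0   <- only present in b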
So far, this is what my dataframe looks like:
NOTE THAT LAT RANGES FROM 35 TO 37.7 AND LON FROM -10 TO -5; THIS MUST REMAIN LIKE THAT
ANOMALY CALCULATION STEPS
# Anomaly calculation
def standardize(x):
    return (x - x.mean()) / x.std()

# Calculate a daily average
df_daily = df.resample('1D').mean()
# Calculate the anomaly for each day of the year
df_daily['anomaly'] = df_daily['analysed_sst'].groupby(df_daily.index.dayofyear).transform(standardize)
I obtain the following dataframe:
As you can see, I obtain the mean values of all three variables.
QUESTION
As I want to plot the climatology data on a map, I DO NOT want the lat/lon variables to be averaged down to one point. I need the anomaly at all the lat/lon points, and I don't really know how to achieve that.
Any help would be very appreciated!!
I think you can do all that in a simpler and more straightforward way, without converting your DataArray to a dataframe:
import os
import xarray as xr

# Will open and combine the 2 datasets automatically
DS = xr.open_mfdataset(os.path.join('./anomaly_dss', '*.nc'))
da = DS.analysed_sst

# Resample to daily means
da = da.resample(time='1D').mean()

# Anomaly calculation
def standardize(x):
    return (x - x.mean()) / x.std()

da_anomaly = da.groupby(da.time.dt.dayofyear).apply(standardize)
Then you can plot the anomaly for any day with:
da_anomaly[da_anomaly.dayofyear == 1].plot()
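Since nothing here ever averages over lat/lon, da_anomaly keeps the full spatial grid. A hedged aside: selecting a single calendar day instead (assuming that date exists in the daily series) also yields a 2D lat/lon field you can plot as a map:
da_anomaly.sel(time='2010-07-15').plot()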

Generating a defined number of rows based on the max/min of another DataFrame in pandas

I have a Dataframe where I have calculated metrics below:
Metrics
I need to generate a fixed number of rows for a new dataframe (for example 1000 or 2500), where each row will have a random number no less than the minimum and no more than the maximum, ideally varying by +/- 1%.
I was trying the solution below, but without success so far:
Intervals = pd.DataFrame(np.array([[df['Close'].min(), df['Close'].max()],[0.4, 0.6],[0.4, 0.6],[0.20, 1.], [0.3, 0.4], [0.2, 0.3]]))
df = pd.DataFrame(list(Intervals.apply(lambda x: np.random.uniform(low=x[0],high=x[1], size = 2500).T, axis=1)))
print(df.T)
Any ideas how it can be approached?
You can loop over the columns in your metrics dataframe and create an array of random numbers using numpy.random:
pd.DataFrame({
    column: np.random.uniform(
        low=metrics[column].min(), high=metrics[column].max(), size=1000
    )
    for column in metrics.columns
})
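A minimal usage sketch (this metrics frame is hypothetical, standing in for the one computed in the question):
import numpy as np
import pandas as pd

# Hypothetical metrics: each column holds that metric's observed values,
# so min()/max() give the bounds to sample between
metrics = pd.DataFrame({'Close': [10.0, 12.0, 11.5],
                        'Ratio': [0.4, 0.6, 0.5]})

generated = pd.DataFrame({
    column: np.random.uniform(
        low=metrics[column].min(), high=metrics[column].max(), size=1000
    )
    for column in metrics.columns
})
print(generated.describe())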

Summing time series after k-means clustering

I am trying out different values of K in K-means clustering on a dataset of time series.
For each experiment I want to sum up the time series for each cluster label and perform predictions on them.
So for example:
If I cluster the time series into 3 clusters, I want to sum all the time series (column-wise) belonging to cluster 1, all the time series belonging to cluster 2, and the same for cluster 3. After that I will make predictions on each aggregated time-series cluster, but I do not need help with the prediction part.
I was thinking to add the cluster labels to the original dataframe and then use .loc and a loop to extract time series corresponding to the same clusters. But I am wondering if there is a more efficient way?
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.cluster import KMeans
#create dataframe with time series
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
for i in range(20):
    df['ts' + str(i)] = np.random.randint(0, 100, size=(len(date_rng)))
df_pivot = df.pivot_table(columns='date', values=df.columns)

# cluster
K = range(1, 10, 2)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_pivot)
    print(km.labels_)
    # sum/aggregate all ts in each cluster column-wise
    # forecast next step for each cluster (don't need help with this part)
You can access data points for every cluster and then sum their values.
Something like this:
labels = km.labels_
centroids = km.cluster_centers_
cluster_sums_dict = {}  # cluster number: column-wise sum of its time series
for i in range(k):
    # select the rows (time series) assigned to cluster i
    temp_cluster = df_pivot[labels == i]
    # sum those series column-wise, i.e. per timestamp
    cluster_sums_dict[i] = temp_cluster.sum(axis=0)
Also, on a side note: instead of aggregating the cluster values, could you use the centroid of each cluster for prediction?
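A minimal sketch of that centroid alternative (my own illustration, assuming km has been fit as above): for standard K-means each centroid is the mean of its members, so scaling it by the cluster size reproduces the column-wise sum:
import numpy as np

labels = km.labels_
cluster_sizes = np.bincount(labels, minlength=k)
# centroid * cluster size == column-wise sum of that cluster's series
approx_sums = {i: km.cluster_centers_[i] * cluster_sizes[i] for i in range(k)}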

Using inverse_transform MinMaxScaler from scikit_learn to force a dataframe be in a range of another

I was following this answer to apply an inverse transformation over a scaled dataframe. My question is: how can I transform a new dataframe so that its values fall in the range of the original dataframe?
So far, I did this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
cols = ['A', 'B']
data = pd.DataFrame(np.array([[2,3],[1.02,1.2],[0.5,0.3]]),columns=cols)
scaler = MinMaxScaler() # default min and max values are 0 and 1, respectively
scaled_data = scaler.fit_transform(data)
orig_data = scaler.inverse_transform(scaled_data) # obtain same as `data`
new_data = pd.DataFrame(np.array([[8,20],[11,2],[5,3]]),columns=cols)
inver_new_data = scaler.inverse_transform(new_data)
I want inver_new_data to be a dataframe with its columns in the same range of values as the data columns, for instance, column A between 0.5 and 2, and so on. However, for column A I get values between 8 and 17.
Any ideas?
MinMaxScaler applies to each column the following transformation:
Subtract column minimum;
Divide by column range (i.e. column max - column min).
The inverse transform applies the "inverse" operations in "inverse" order:
Multiply by the column range (of the data the scaler was fitted on);
Add the column min.
Therefore, for column A it computes
(df['A'] - df['A'].min()) / (df['A'].max() - df['A'].min())
In particular, the scaler stores the min 0.5 and the range 1.5.
When you apply the inverse_transform to [8, 11, 5] this becomes:
[8*1.5 + 0.5, 11*1.5 + 0.5, 5*1.5 + 0.5] = [12.5, 17, 8]
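A quick check of that arithmetic (my own snippet, not from the question):
from sklearn.preprocessing import MinMaxScaler
import numpy as np

scaler = MinMaxScaler()
scaler.fit(np.array([[2.0], [1.02], [0.5]]))  # column A of the original data
print(scaler.inverse_transform(np.array([[8.0], [11.0], [5.0]])))
# [[12.5], [17.0], [8.0]]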
Now, doing this is generally not recommended for machine learning; however, to map the ranges of the new columns onto those of the original data, you can do something like the following:
data = pd.DataFrame(np.array([[2,3],[1.02,1.2],[0.5,0.3]]),columns=cols)
# Create a Scaler for the initial data
scaler_data = MinMaxScaler()
# Fit the scaler with these data, but there is no need to transform them.
scaler_data.fit(data)
#Create new data
new_data = pd.DataFrame(np.array([[8,20],[11,2],[5,3]]),columns=cols)
# Create a Scaler for the new data
scaler_new_data = MinMaxScaler()
# Transform new data into the [0-1] range
scaled_new_data = scaler_new_data.fit_transform(new_data)
# Inverse transform new data from [0-1] to [min, max] of data
inver_new_data = scaler_data.inverse_transform(scaled_new_data)
For example this will always map the min and max of new dataframe columns to the min and max of initial dataframe columns respectively.
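A quick check of that claim (assuming the snippet above has been run):
print(inver_new_data.min(axis=0))  # equals data.min() per column: [0.5, 0.3]
print(inver_new_data.max(axis=0))  # equals data.max() per column: [2.0, 3.0]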
To explain what MinMaxScaler is doing:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
So basically every feature of your data will be between 0 and 1.
The moment you run fit_transform(data), the scaler is trained.
For transformation you have:
X_scaled = scale * X + min - X.min(axis=0) * scale
where scale = (max - min) / (X.max(axis=0) - X.min(axis=0))
where scale was learned when the scaler was fitted.
So running inverse_transform(new_data) does not help you at all. Likewise, inver_new_data = scaler.transform(new_data) will not help you.
You need to be precise about what "the same range" means to you; the MinMaxScaler approach as used will not get you there. You could only limit the columns to the min and max of the original dataframe. So for example:
dataA = new_data[['A']]
scalerA = MinMaxScaler(feature_range=(data['A'].min(), data['A'].max()))
inver_new_data_A = scalerA.fit_transform(dataA)
but this is also not the exact range; MinMaxScaler also preserves the relative distances between the points.

Python - plot numpy array with gaps in the data

I need to plot some spectral data as a 2D image, where each data point corresponds to a spectrum with a specific date/time. I need to plot all spectra as follows:
- x-axis - corresponds to the wavelength
- y-axis - corresponds to the date/time
- intensity - corresponds to the flux
If my data points were continuous/sequential in time I would just use matplotlib's imshow. However, not only are the points not all continuous/sequential in time, but there are large time gaps between them.
Here is some simulated data that mimics what I have:
import numpy as np

sampleSize = 100
data = {}
for time in np.arange(0, 5):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(14, 20):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(30, 40):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(25.5, 35.5):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(80, 120):
    data[time] = np.random.sample(sampleSize)
If I needed to plot only one of the subsets of data above, I would do:
import matplotlib.pyplot as mplt

mplt.imshow([data[time] for time in np.arange(0, 5)], cmap='Greys', aspect='auto',
            origin='lower', interpolation='none', extent=[-50, 50, 0, 5])
mplt.show()
However, I have no idea how I can plot all the data in the same figure, while showing the gaps and keeping the y-axis as the time. Any ideas?
thanks,
Jorge
Or you can use pandas to help you with sorting the keys, then reindex; the times missing from the index become NaN rows, which imshow renders as gaps:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(data).T
plt.imshow(df.reindex(np.arange(df.index.max())),
           cmap='Greys',
           aspect='auto',
           origin='lower',
           interpolation='none',
           extent=[-50, 50, 0, 5])
Output:
In the end I went with a different approach:
1) Re-index the time in my data so that no two arrays have the same time and I avoid non-integer indexes
nTimes = 1
timeIndexes = [int(float(index)) for index in data.keys()]
while len(timeIndexes) != len(set(timeIndexes)):
    nTimes += 1
    timeIndexes = [int(nTimes * float(index)) for index in data.keys()]
timeIndexesDict = {str(int(nTimes * float(index))): data[index] for index in data.keys()}
lenData2Plot = max([int(key) for key in timeIndexesDict.keys()])
2) Create an array of zeros with the same number of columns as my data and a number of rows corresponding to my maximum re-indexed time
data2Plot = np.zeros((int(lenData2Plot)+1,sampleSize))
3) Replace the rows in my array of zeros corresponding to my re-indexed times
for index in timeIndexesDict.keys():
    data2Plot[int(index)][:] = timeIndexesDict[index]
4) Plot as I normally would plot an array with no gaps
mplt.imshow(data2Plot,
            cmap='Greys', aspect='auto', origin='lower', interpolation='none',
            extent=[-50, 50, 0, 120])
mplt.show()
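One small variation on step 2 (my own suggestion, not from the original post): initializing the array with NaN instead of zeros makes the gap rows render as blanks rather than zero-intensity spectra:
data2Plot = np.full((int(lenData2Plot) + 1, sampleSize), np.nan)  # NaN rows show as gaps in imshow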
