Merging two tensors based on time series - python

I have two tensors:
A -> (128,19,3,99,99) #(batch, date, data, data, data)
B -> (128,9,223) #(batch, date, data)
After encoding I will have something like this:
A2 -> (128,19,10) # (batch, date, encoded data with CNN_1)
B2 -> (128,9,10) # (batch, date, encoded data with CNN_2)
C -> (128,28,10) # merge of A2 and B2
I have the records of the time series of each tensor (A, B),
and I want to merge them on the date axis into a single tensor.
Here is an example of the time series of tensor A:
['2021-12-13', '2021-12-28', '2022-01-02', '2022-01-10', '2022-01-20', '2022-01-26', '2022-02-06', '2022-02-14', '2022-02-22', '2022-03-02', '2022-03-17', '2022-03-21', '2022-03-27', '2022-03-30', ...]
Tensor B has different dates that do not necessarily fall on the exact dates of tensor A.
So in the end, the axis of the date in tensor C will be something like this:
A2,A2,B2,A2,B2,A2,B2,B2 depending on the original order of dates.
Any help will be great!

I got an answer : )
import torch

# Example tensors to merge (the dates are assumed to live in the first feature channel)
A0_RGB = torch.rand((128, 19, 3))
A1_Hyper = torch.rand((128, 9, 3))

# Concatenate along the date axis, then find the chronological order
A2_concat = torch.cat((A0_RGB, A1_Hyper), dim=1)
A3_sorted_v, A3_sorted_i = A2_concat[0, :, 0].sort()

# Reorder the merged tensor by the sorted date indices
A5_test_selection = A2_concat[:, A3_sorted_i, :]
Just the concept...
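Fleshing that concept out, here is a minimal sketch that merges the two encoded tensors using their actual date lists; the dates_A and dates_B arrays below are made-up placeholders, not the real data:
import numpy as np
import torch

# Hypothetical date lists, one entry per step on each tensor's date axis
dates_A = np.array(['2021-12-13', '2021-12-28', '2022-01-02'], dtype='datetime64[D]')
dates_B = np.array(['2021-12-20', '2022-01-05'], dtype='datetime64[D]')

A2 = torch.rand((128, len(dates_A), 10))  # encoded with CNN_1
B2 = torch.rand((128, len(dates_B), 10))  # encoded with CNN_2

# Concatenate on the date axis, then reorder chronologically
C = torch.cat((A2, B2), dim=1)
order = np.argsort(np.concatenate((dates_A, dates_B)))
C = C[:, torch.from_numpy(order), :]  # (128, 5, 10), in date order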

Related

Is it possible to select a dataset by a time range when the range is different for every pixel in Python's xarray module

I am trying to select only the part of the data within a specific time range that is different for every pixel.
For indexing, I have two np.datetime64[ns] xr.DataArrays with shape (lat: 152, lon: 131), named time_range_min and time_range_max.
One holds the start dates and the other the end dates.
I tried this for selecting the data:
dataset = data.sel(time=slice(time_range_min, time_range_max))
but it yields
cannot use non-scalar arrays in a slice for xarray indexing:
<xarray.DataArray 'NDVI' (lat: 152, lon: 131)>
If I cannot use non-scalar arrays, does that mean this is not possible in general, or can I transform my arrays somehow?
If "time" is a list of dates in string that is ordered from past to present (e.g. ["10-20-2021", "10-21-2021", ...]:
import numpy as np

listOfMinMaxTimeRanges = [time_range_min, time_range_max]
specifiedRangeOfTimeIndexedList = []
# For each pixel, collect every time index between its start and end date
for i in range(np.shape(listOfMinMaxTimeRanges)[1]):
    start = time.index(listOfMinMaxTimeRanges[0][i])
    end = time.index(listOfMinMaxTimeRanges[1][i])
    specifiedRangeOfTimeIndexedList.extend(range(start, end + 1))
Depending on how your dataset is structured:
dataset = data.sel(time = specifiedRangeOfTimeIndexedList)
or
dataset = data.sel(time = time[specifiedRangeOfTimeIndexedList])
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[:, time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList], :, :]
or
dataset = dataset[specifiedRangeOfTimeIndexedList]
...
I found a way to group every cell with stacking in xarray, where time_range_min and time_range_max now mark a single date per cell:
stack = dataset.value.stack(gridcell=['lat', 'lon'])
for unique_value, grouped_array in stack.groupby('gridcell'):
    grouped_array.sel(time=slice(time_range_min, time_range_max))
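If the per-cell loop becomes too slow, a vectorized alternative is to mask instead of slice; this is a sketch under the assumption that time_range_min and time_range_max are the (lat, lon) DataArrays described above:
# Broadcast a boolean mask over (time, lat, lon); values outside each
# pixel's own range become NaN instead of being dropped
mask = (data['time'] >= time_range_min) & (data['time'] <= time_range_max)
selected = data.where(mask)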

labels column shape is inconsistent after splitting data into features and labels

I have a dataset as a pandas DataFrame that needs to be divided into a feature set and labels. As of now, I am dividing the columns as below:
features = df2.drop('case_of_injury_group', axis=1)
labels = df2['case_of_injury_group']
but the shape of labels is not what I expected:
features.shape
gives (39778, 12) and
labels.shape
gives (39778,), but I want it as (39778, 1). Please let me know what I am doing wrong here.
If you want a one-column DataFrame, select with a one-element nested list:
labels = df2[['case_of_injury_group']]
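If a downstream estimator later insists on a 2-D NumPy array rather than a DataFrame, another option (a sketch, not part of the original answer) is to reshape the underlying array:
# Keep the Series, reshape only where 2-D input is required
labels_2d = df2['case_of_injury_group'].to_numpy().reshape(-1, 1)  # (39778, 1)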

applying a generalized additive model to an xarray

I have a netCDF file which I have read with xarray. The array contains time, latitude, longitude, and only one data variable (i.e. index values):
import xarray as xr

# read the netCDF files
with xr.open_mfdataset('wet_tropics.nc') as wet:
    print(wet)
Out[]:
<xarray.Dataset>
Dimensions: (time: 1437, x: 24, y: 20)
Coordinates:
* y (y) float64 -1.878e+06 -1.878e+06 -1.878e+06 -1.878e+06 ...
* x (x) float64 1.468e+06 1.468e+06 1.468e+06 1.468e+06 ...
* time (time) object '2013-03-29T00:22:28.500000000' ...
Data variables:
index_values (time, y, x) float64 dask.array<shape=(1437, 20, 24), chunksize=(1437, 20, 24)>
So far, so good.
Now I need to apply a generalized additive model to each grid cell in the array. The model I want to use comes from Facebook Prophet (https://facebook.github.io/prophet/) and I have successfully applied it to a pandas DataFrame before. For example:
from fbprophet import Prophet

cns_ap['y'] = cns_ap['av_index']  # Prophet requires the specific column names 'y' and 'ds'
cns_ap['ds'] = cns_ap['Date']
cns_ap['cap'] = 1
m1 = Prophet(weekly_seasonality=False,   # disables weekly seasonality
             daily_seasonality=False,    # disables daily seasonality
             growth='logistic',          # logistic because indices have a maximum
             yearly_seasonality=4,       # Fourier order, int between 1-10
             changepoint_prior_scale=0.5).fit(cns_ap)
future1 = m1.make_future_dataframe(periods=60,           # 5-year prediction
                                   freq='M',             # monthly predictions
                                   include_history=True) # fits model to all historical data
future1['cap'] = 1  # sets cap at maximum index value
forecast1 = m1.predict(future1)
# m1.plot_components(forecast1, plot_cap=False)
# m1.plot(forecast1, plot_cap=False, ylabel='CNS index', xlabel='Year')
The problem is that now I have to:
1) iterate through every cell of the netCDF file,
2) get all the values for that cell through time,
3) apply the GAM (using fbprophet), and then export and plot the results.
The question: do you have any ideas on how to loop through the raster and get the index_values of each pixel for all times so that I can run the GAM?
I think a nested for loop would be feasible, although I don't know how to write one that goes through every cell.
Any help is appreciated
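For what it's worth, a minimal sketch of that nested loop might look like the following; it is untested against the actual file, and the ds column assumes the object-typed time coordinate parses with pd.to_datetime:
import pandas as pd
from fbprophet import Prophet

# Loop over every (y, x) cell, pull its full time series as a DataFrame,
# and fit Prophet on the two columns it expects
for iy in range(wet.sizes['y']):
    for ix in range(wet.sizes['x']):
        cell = wet['index_values'].isel(y=iy, x=ix).to_dataframe().reset_index()
        df = pd.DataFrame({'ds': pd.to_datetime(cell['time'].astype(str)),
                           'y': cell['index_values']})
        df['cap'] = 1  # required for growth='logistic'
        m = Prophet(growth='logistic', weekly_seasonality=False,
                    daily_seasonality=False).fit(df)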

What is the meaning of the error ValueError: cannot copy sequence with size 205 to array axis with dimension 26 and how to solve it

This is the code I wrote; I am trying to convert the non-numerical data to numeric. However, it returns the error ValueError: cannot copy sequence with size 205 to array axis with dimension 26.
The data is from http://archive.ics.uci.edu/ml/datasets/Automobile
import pandas as pd
from sklearn import preprocessing

automobile = pd.read_csv('imports-85.csv', names=[
    "symboling", "normalized-losses", "make", "fuel", "aspiration",
    "num-of-doors", "body-style", "drive-wheels", "engine-location",
    "wheel-base", "length", "width", "height", "curb-weight",
    "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
    "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
    "city-mpg", "highway-mpg", "price"])
X = automobile.drop('symboling', axis=1)
y = automobile['symboling']
le = preprocessing.LabelEncoder()
le.fit([automobile])  # raises the ValueError
print(le)
The fit method takes an array of shape [n_samples]; see the docs. You're passing the entire DataFrame inside a list. If you print the shape of your DataFrame (automobile.shape) it will show (205, 26), which is where the 205 and 26 in the error come from.
If you want to encode your data you need to do it one column at a time, e.g.
le.fit(automobile['make'])
Note that this is not the correct way to encode categorical input data: as the name suggests, LabelEncoder is designed for labels, not input features. In scikit-learn's current state you should use OneHotEncoder. There are plans for a categorical encoder in the next release.
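As a quick illustration of the column-at-a-time approach (a sketch using the variable names from the question):
from sklearn.preprocessing import LabelEncoder

# Encode each non-numeric column separately; fine for a quick experiment,
# but prefer one-hot encoding (e.g. pd.get_dummies) for input features
le = LabelEncoder()
for col in automobile.select_dtypes(include='object').columns:
    automobile[col] = le.fit_transform(automobile[col])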

converting a sparse dataframe to dense Dataframe in python efficiently

I have a problem at hand: I have a dataframe which looks like the one below.
Input Dataframe:
VEHICLE_HASH LS_ID UPPER_BOUND LS_RATIO
00061E31E25B36 PROMISELS103 2500.0 0.000684
00061E31E25B36 PROMISELS103a 3000.0 0.002001
00061E31E25B36 PROMISELS104 3500.0 0.004128
0006254DB52066 PROMISELS104 4000.0 0.003216
0006254DB52066 PROMISELS103 4500.0 0.001114
0006254DB52066 PROMISELS105 5000.0 0.020767
This is a sample dataframe; the actual dataframe has size (53526122 x 4). I wanted to convert this dataframe to a one-hot-encoded matrix whose features are drawn from the string formed by combining the LS_ID and UPPER_BOUND columns. I was able to do the one-hot encoding, convert the matrix to a sparse matrix, and then multiply the sparse matrix by LS_RATIO to get the resulting input sparse matrix for my xgboost classifier.
Now I want to convert the dataframe into this dense format, with one unique HASH per row and multiple feature columns, so I can do PCA with this data. But I get an out-of-memory error. Can this be done efficiently?
Expected Output:
HASH PROMISELS103a_3000.0 PROMISELS103_2500.0 PROMISELS103_4500.0 PROMISELS104_3500.0 PROMISELS104_4000.0 PROMISELS105_5000.0
00061E31E25B36 0.002001 0.000684 0 0 0.004128 0
0006254DB52066 0 0 0.001114 0.003216 0 0.020767
You can try to concatenate the LS_ID and UPPER_BOUND columns with the separator '_', construct a cross-tabulation (this assumes each pair of constructed value and VEHICLE_HASH is unique), and fill NaN values with zeros:
import pandas as pd
import numpy as np

df = pd.DataFrame()  # here should be your initial dataframe
df['ID_AND_BOUND'] = df['LS_ID'] + '_' + df['UPPER_BOUND'].astype(str)
df_processed = pd.crosstab(index=df['VEHICLE_HASH'],
                           columns=df['ID_AND_BOUND'],
                           values=df['LS_RATIO'],
                           aggfunc=np.mean)
df_processed = df_processed.reset_index().fillna(0)
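If the dense cross-tabulation still exhausts memory at 53 million rows, one alternative sketch (not part of the original answer) is to build a scipy CSR matrix directly from category codes and use TruncatedSVD, which accepts sparse input, in place of PCA:
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Map hashes and features to integer row/column indices
rows = df['VEHICLE_HASH'].astype('category').cat.codes
cols = df['ID_AND_BOUND'].astype('category').cat.codes
sparse = csr_matrix((df['LS_RATIO'], (rows, cols)))

# TruncatedSVD works on sparse matrices, unlike plain PCA
reduced = TruncatedSVD(n_components=50).fit_transform(sparse)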
