Normalizing my time series dataset, then setting the timestamp as index - Python

Here is my code for normalizing my dataset. The code works, but when I create the new data frame (the last line of my code), it does not include the timestamp column; it only contains the scaled values.
import pandas as pd
from sklearn import preprocessing
data_consumption2 = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
data_consumption2['Timestamp'] = pd.to_datetime(data_consumption2['Timestamp'], unit='s')
data_consumption2.fillna(0,inplace=True)
data_consumption2 = data_consumption2.set_index('Timestamp')
#returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(data_consumption2.values)
data_consumption2 = pd.DataFrame(x_scaled)
I hope someone can help me get a data frame that has both the original timestamps and the scaled values.

You have to set the index of the new dataframe you create.
min_max_scaler.fit_transform returns a NumPy array of the scaled values, so the index is lost.
You can pass the original index back in:
data_consumption2 = pd.DataFrame(data=x_scaled, index=data_consumption2.index)
If you also want to keep the column names, pass them along as well:
data_consumption2 = pd.DataFrame(data=x_scaled,
                                 index=data_consumption2.index,
                                 columns=data_consumption2.columns)
More details are in the DataFrame documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
These are basic pandas operations, and the documentation covers them in detail.
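For a version that can be run without the Excel file, here is a minimal sketch of the same idea. The small frame, its column names (`load_a`, `load_b`) and the epoch values are made up for illustration and stand in for the "Consumption" sheet.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the "Consumption" sheet:
# epoch seconds plus two value columns, one with a missing value.
raw = pd.DataFrame({
    "Timestamp": [1600000000, 1600003600, 1600007200],
    "load_a": [10.0, 20.0, 30.0],
    "load_b": [5.0, None, 15.0],
})
raw["Timestamp"] = pd.to_datetime(raw["Timestamp"], unit="s")
raw = raw.fillna(0).set_index("Timestamp")

# fit_transform returns a plain NumPy array, so the index and the column
# names have to be passed back in when rebuilding the DataFrame.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(raw),
    index=raw.index,
    columns=raw.columns,
)
print(scaled)
```

The result keeps the DatetimeIndex of `raw`, and every column is scaled to the range 0 to 1 independently.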

Related

Is it possible to select data by time range in Python's xarray module when the range is different for every pixel?

I am trying to select only the part of the data that falls within a specific time range, where the range is different for every pixel.
For indexing, I have two xr.DataArrays of dtype np.datetime64[ns] with shape (lat: 152, lon: 131), named time_range_min and time_range_max.
One holds the start dates and the other holds the end dates.
I tried this to select the data:
dataset = data.sel(time=slice(time_range_min, time_range_max))
but it yields
cannot use non-scalar arrays in a slice for xarray indexing:
<xarray.DataArray 'NDVI' (lat: 152, lon: 131)>
Does the fact that I cannot use non-scalar arrays mean that this is not possible in general, or can I transform my arrays?
If "time" is a list of date strings ordered from past to present (e.g. ["10-20-2021", "10-21-2021", ...]):
import numpy as np
listOfMinMaxTimeRanges = [time_range_min, time_range_max]
specifiedRangeOfTimeIndexedList = []
# loop over every (start, end) pair
for i in range(np.shape(listOfMinMaxTimeRanges)[1]):
    start = time.index(listOfMinMaxTimeRanges[0][i])
    end = time.index(listOfMinMaxTimeRanges[1][i])
    # collect the positions in `time` that lie between start and end (inclusive)
    specifiedRangeOfTimeIndexedList.extend(range(start, end + 1))
Depending on how your dataset is structured:
dataset = data.sel(time = specifiedRangeOfTimeIndexedList)
or
dataset = data.sel(time = time[specifiedRangeOfTimeIndexedList])
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[:, time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList], :, :]
or
dataset = dataset[specifiedRangeOfTimeIndexedList]
...
I found a way to handle every grid cell separately by stacking in xarray.
time_range_min and time_range_max now each mark a single date:
stack = dataset.value.stack(gridcell=['lat', 'lon'])
for unique_value, grouped_array in stack.groupby('gridcell'):
    grouped_array.sel(time=slice(time_range_min, time_range_max))
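An alternative that avoids looping over grid cells is to build a boolean mask by broadcasting and pass it to `where`. The following is only a sketch on made-up data: the array sizes, the dates and the contents of `data` are assumptions, and only the names `time_range_min` and `time_range_max` come from the question. Values outside each pixel's window become NaN instead of being dropped, because a rectangular array cannot have a different time length per pixel.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: 6 daily time steps on a 2 x 3 grid.
time = pd.date_range("2021-10-20", periods=6, freq="D")
coords = {"lat": [0, 1], "lon": [0, 1, 2]}
data = xr.DataArray(
    np.arange(6 * 2 * 3, dtype=float).reshape(6, 2, 3),
    dims=("time", "lat", "lon"),
    coords={"time": time, **coords},
)

# Per-pixel start and end dates, shape (lat, lon).
# Pixel (0, 0) gets a different window from the other pixels.
time_range_min = xr.DataArray(
    np.array([["2021-10-23", "2021-10-21", "2021-10-21"],
              ["2021-10-21", "2021-10-21", "2021-10-21"]], dtype="datetime64[ns]"),
    dims=("lat", "lon"), coords=coords,
)
time_range_max = xr.DataArray(
    np.array([["2021-10-24", "2021-10-23", "2021-10-23"],
              ["2021-10-23", "2021-10-23", "2021-10-23"]], dtype="datetime64[ns]"),
    dims=("lat", "lon"), coords=coords,
)

# Comparing (time,) with (lat, lon) broadcasts to (time, lat, lon).
mask = (data.time >= time_range_min) & (data.time <= time_range_max)

# Keep values inside each pixel's own window; everything else becomes NaN.
selected = data.where(mask)
print(selected.notnull().sum("time").values)
```

If needed, `selected.dropna("time", how="all")` then removes the time steps that fall outside every pixel's window.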

scikit preprocessing across entire dataframe

I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data are the average responses to the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so, I want to preprocess it first by either standardizing or normalizing it.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column using the code below, but I am struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As described in the ColumnTransformer documentation, each step is given as a tuple containing:
a name for the step
the chosen transformer (e.g. StandardScaler), or a Pipeline
a list of the columns to which the transformation is applied
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a NumPy array of the transformed values, with the scaler fitted on the input dataframe df. Although the column names are gone, the array columns keep the same order as in the input dataframe, so it is easy to convert the array back to a pandas dataframe if you need to.
In addition to @RicS's answer, note that scikit-learn functions return a NumPy array, not a dataframe. The Company column is also not included. You can convert the result back to a dataframe like this:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
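Both answers above scale each quarter column on its own. If "across the entire dataframe" is instead taken to mean one common scale for all quarterly values (this is only one possible reading of the question), a sketch is to flatten the numeric block into a single column, fit the scaler once, and reshape the result back. The variable names `quarters`, `values` and `scaled` are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})

quarters = df.columns[1:]           # every column except Company
values = df[quarters].to_numpy()    # shape (3, 4)

# Treat all 12 numbers as one feature so that a single min and max are used.
scaler = MinMaxScaler()             # or StandardScaler
flat_scaled = scaler.fit_transform(values.reshape(-1, 1))

# Put the scaled numbers back in their original positions.
scaled = df.copy()
scaled[quarters] = flat_scaled.reshape(values.shape)
print(scaled)
```

With this reading, the smallest value in the whole table (6.3) maps to 0 and the largest (9.05) maps to 1, so companies and quarters stay comparable on one scale.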

How is pandas converting my data into categories in this line of code?

I need help understanding this line of code:
y_train2 = train_target2.astype('category').cat.codes
Am I right in saying that train_target2 is converted to a categorical variable using astype('category'), and that cat.codes is then used to turn it into integers?
Below is the full block of code.
# Train data pre-processing
train_target2 = df_train_01['class_2']
train_target5 = df_train_01['class_5']
df_train_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
# convert text labels to integers
y_train2 = train_target2.astype('category').cat.codes
y_train5 = train_target5.astype('category').cat.codes
# Test data pre-processing
test_target2 = df_test_01['class_2']
test_target5 = df_test_01['class_5']
# drop 'class_2' and 'class_5' columns
df_test_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
y_test2 = test_target2.astype('category').cat.codes
y_test5 = test_target5.astype('category').cat.codes
I think your understanding of the method and the attribute is correct: Series.astype('category') turns the values into categorical data, and Series.cat.codes (backed by pandas.Categorical.codes) is an attribute, not a method, that gives the integer code of each value, starting at 0.
Try the short snippet below to see how they work.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris()
pdf = pd.DataFrame(iris.data, columns=['s-length', 's-width', 'p-length', 'p-width'])
print(
    pdf['s-length'].astype('category'),
    len(np.unique(pdf['s-length'])),  # -> 35
    len(set(pdf['s-length'].astype('category').cat.codes)),  # -> 35
    np.unique(pdf['s-length'].astype('category').cat.codes),  # -> array([ 0, 1, ..., 34], dtype=int8)
)
In essence, a pandas categorical data type is a mapping between values that do not have a numeric interpretation and a unique number for each value.
Let's break down your code:
# Take the series `train_target2` and convert it to categorical type
train_target2.astype('category')
# Access the attributes or methods of a categorical series
train_target2.astype('category').cat
# Take the `codes` attribute
train_target2.astype('category').cat.codes
In reality, .codes is not converting the data into numbers. Rather, you are only taking the numeric equivalent of each category. Strictly speaking, .astype('category') is the part that converted your data to categorical.
You can find the attributes and methods of this data type in the pandas documentation on categorical data.
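To see the mapping between categories and codes directly, here is a small sketch. The label values are made up and stand in for a column such as class_2; `train_labels` and `test_labels` are hypothetical names.

```python
import pandas as pd

# Hypothetical text labels, including one missing value.
train_labels = pd.Series(["normal", "attack", "normal", None, "attack"])

cat = train_labels.astype("category")
codes = cat.cat.codes                           # one integer per row
mapping = dict(enumerate(cat.cat.categories))   # code -> original label

print(codes.tolist())   # [1, 0, 1, -1, 0]  (missing values get code -1)
print(mapping)          # {0: 'attack', 1: 'normal'}

# Reusing the training categories keeps the test codes consistent
# with the training codes; labels unseen in training get code -1.
test_labels = pd.Series(["attack", "unknown", "normal"])
test_codes = pd.Categorical(test_labels, categories=cat.cat.categories).codes
print(test_codes.tolist())   # [0, -1, 1]
```

The second half matters for code like the block in the question: calling astype('category') separately on the train and test columns only yields matching integers if both columns contain the same set of labels.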

Value Error and problem with shape during creation of Data Frame in Python?

I would like to combine the coefficients from a linear regression model with values from the test dataset, but I get the error below. My code is also below. Do you know what the problem is and what I can do?
I need something like the output below, where the index comes from X.columns and the numbers come from LR.coef_.
In the following example, values is a dataframe with the same shape as your LR.coef_. To use its first row as the column values of another dataframe, create a dict and pass it to pandas.DataFrame().
import pandas as pd
import numpy as np
values = pd.DataFrame(np.zeros((1, 689)))
X = pd.DataFrame(np.zeros((2096, 689)))
frame = { 'coefficient': values.iloc[0] }
coefficient = pd.DataFrame(frame, index=X.columns)
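Since the original code and error message are not shown, the following is only a guess at the setup: a fitted LinearRegression called LR and a feature frame X with named columns. The data and the column names (`age`, `income`, `score`) are made up. Flattening `coef_` first avoids a shape mismatch between a (1, n_features) array and an index of length n_features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: three named features and a target that is an
# exact linear combination of them.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["age", "income", "score"])
y = 2.0 * X["age"] - 1.0 * X["income"] + 0.5 * X["score"]

LR = LinearRegression().fit(X, y)

# coef_ has shape (n_features,) for a 1-D target and (1, n_features) for a
# 2-D target with one column; np.ravel flattens both to one value per column of X.
coefficient = pd.DataFrame({"coefficient": np.ravel(LR.coef_)}, index=X.columns)
print(coefficient)
```

Passing a plain NumPy array rather than a Series also sidesteps index alignment, which would otherwise produce NaN when the Series index (0, 1, 2, ...) does not match the column names.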

How to scale all columns except last column?

I'm using python 3.7.6.
I'm working on classification problem.
I want to scale my data frame (df) features columns.
The dataframe contains 56 columns (55 feature columns and the last column is the target column).
I want to scale the feature columns.
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but this seems inefficient, because I need to recreate the dataframe (add the target column back to the scaling result).
Is there an efficient way to scale just some columns without changing the others?
(i.e. the result of scaling would contain the scaled columns plus one column that is not scaled)
Using column positions for scaling or other pre-processing operations is not a good idea, because the code breaks every time you add a new feature. Use column names instead, e.g.
using scikit-learn:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2-D numpy.array; wrap it in a pd.DataFrame that
# reuses df's index so that concat aligns the rows correctly
standardized_features = pd.DataFrame(scaler.fit_transform(df[features]), columns=features, index=df.index)
old_shape = df.shape
# drop the unscaled features from the dataframe
df.drop(features, axis=1, inplace=True)
# join the standardized features back
df = pd.concat([df, standardized_features], axis=1)
assert old_shape == df.shape, "something went wrong!"
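Or, if you prefer not to split the data and join it back, you can use a function like this:
import numpy as np
def normalize(x):
    # x is a whole column (pd.Series); guard against division by zero
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)
for col in features:
    # call normalize on the whole column; Series.map would apply it value by value
    df[col] = normalize(df[col])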
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
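The `feature_scaling` module in the question is not shown, so here is a sketch of the same in-place pattern with scikit-learn's MinMaxScaler swapped in. The tiny frame and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame: two feature columns followed by the target column.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0],
    "f2": [10.0, 20.0, 40.0],
    "target": [0, 1, 0],
})

# Every column except the last one is a feature.
feature_cols = df.columns[:-1]

# Assign the scaled array straight back into those columns;
# the target column is never touched, so nothing has to be re-attached.
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
print(df)
```

Selecting the feature columns by name through `df.columns[:-1]` keeps the assignment aligned with the frame's own labels, while the target stays in place as the last column.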
