Pandas Series: Log Normalize - python

I have a Pandas Series, that needs to be log-transformed to be normal distributed. But I can´t log transform yet, because there are values =0 and values below 1 (0-4000). Therefore I want to normalize the Series first. I heard of StandardScaler(scikit-learn), Z-score standardization and Min-Max scaling(normalization).
I want to cluster the data later, which would be the best method?
StandardScaler and Z-score standardization use mean, variance etc. Can I use them on "not yet normal distibuted" data?

To transform to logarithms, you need positive values, so translate your range of values (-1,1] to normalized (0,1] as follows
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-1,1,(10,1)))
df['norm'] = (1+df[0])/2 # (-1,1] -> (0,1]
df['lognorm'] = np.log(df['norm'])
results in a dataframe like
0 norm lognorm
0 0.360660 0.680330 -0.385177
1 0.973724 0.986862 -0.013225
2 0.329130 0.664565 -0.408622
3 0.604727 0.802364 -0.220193
4 0.416732 0.708366 -0.344795
5 0.085439 0.542719 -0.611163
6 -0.964246 0.017877 -4.024232
7 0.738281 0.869141 -0.140250
8 0.558220 0.779110 -0.249603
9 0.485144 0.742572 -0.297636

If your data is in the range (-1;+1) (assuming you lost the minus in your question) then log transform is probably not what you need. At least from a theoretical point of view, it's obviously the wrong thing to do.
Maybe your data has already been preprocessed (inadequately)? Can you get the raw data? Why do you think log transform will help?
If you don't care about what is the meaningful thing to do, you can call log1p, which is the same as log(1+x) and which will thus work on (-1;∞).

Related

How to extract the x and y values from yf.download

i have a simnple issue. I want to download Data from yfinance and store it in a Dataframe. That works.
Now how can i additionally extract the X and Y Values, that are stored in that Dataframe?
I mean, just from the fact, that the Data is plottable i conclude, that there are x and y values for every datapoint on the plot.
Here ist the simple code
import yfinance as yf
import matplotlib.pyplot as plt
stock = 'TSLA'
start = '2020-11-01'
df = yf.download(stock , start=start)
What i finally want to achieve ´would be to use the X and Y values to feed them into a polyfit function.
In that way i am trying to do a regression on the pricechart data from a stock, to finally be able to take derivatives and apply some analysis on that function.
Anybody has a good idea?
I appreciate, thanks a lot,
Benjamin
You can save the date and close price like this:
X=df.index
Y=df.Close
And if you want to plot the Closeprice accordind to the date:
df.reset_index().plot(x='Date', y='Close')
If you want to use all the data except the close column to predict the Closeprice, you can keep them with:
X=df.drop(columns='Close')

Apply Mann Whitney U test on multidimensional array and replace single values of variable of xarray data array in Python?

I'm new to Python and need some help with xarray.
I have two 3 dimensional data arrays (rlon, rlat, time) for future and past climate. I want to compute the Mann-Whitney-U-test for each grid point to analyse significance of temperature change in future compared to past. I already got the Mann-Whitney-U-test work with selecting a time serie from one grid point of historical and future data each. Example:
import numpy as np
import xarray as xr
import scipy.stats as sts
#selecting time period and grid point of past and future data
tp = fileHis['tas']
tf = fileFut['tas']
gridpoint_past=tp.sel(rlon=-6.375, rlat=1.375, time=slice('1999-01-01', '1999-01-31'))
gridpoint_future=tf.sel(rlon=-6.375, rlat=1.375, time=slice('2099-01-01', '2099-01-31'))
#mannwhintey-u-test
result=sts.mannwhitneyu(gridpoint_past, gridpoint_future, alternative='two-sided')
print('pvalue =',result[1])
Output:
pvalue = 0.05922372345359562
My problem now is that I need to do this for each grid point and each month and in the end I would like to have a data array with pvalues for each grid point and each month of a year.
I was thinking about looping through all rlat, rlon and months and run the Mann-Whitney-U-test for each, unless there is a better way to do.?
And how can I write the pvalues one by one into a new data array with the same rlat, rlon dimension?
I was trying this, but it does not work:
I created a data array pvalue_mon, which has the same rlat, rlon as tp and tf and has 12 months as time steps.
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=th.time.dt.month.isin([1])) = result[1]
SyntaxError: can't assign to function call
or this:
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=pvalue_mon.time.dt.month.isin([1])).update(result[1])
TypeError: 'numpy.float64' object is not iterable
How can I replace a single value of an existing variable?
Instead of using the .sel() function, try using .loc[ ] as described here:
http://xarray.pydata.org/en/stable/indexing.html#assigning-values-with-indexing

scaling data between -1 and 1 centred on zero

Apologies in advance for any incorrect wording. The reason I am not finding answers to this might be because I am not using the right terminology.
I have a dataframe that looks something like
0 -0.004973 0.008638 0.000264 -0.021122 -0.017193
1 -0.003744 0.008664 0.000423 -0.021031 -0.015688
2 -0.002526 0.008688 0.000581 -0.020937 -0.014195
3 -0.001322 0.008708 0.000740 -0.020840 -0.012715
4 -0.000131 0.008725 0.000898 -0.020741 -0.011249
5 0.001044 0.008738 0.001057 -0.020639 -0.009800
6 0.002203 0.008748 0.001215 -0.020535 -0.008368
7 0.003347 0.008755 0.001373 -0.020428 -0.006952
8 0.004476 0.008758 0.001531 -0.020319 -0.005554
9 0.005589 0.008758 0.001688 -0.020208 -0.004173
10 0.006687 0.008754 0.001845 -0.020094 -0.002809
...
For each column I would like to scale the data to a float between -1.0 and 1.0 for this column's min and max.
I have tried scikit learn's minmax scaler with scaler = MinMaxScaler(feature_range = (-1, 1)) but some values change sign as a result, which I need to preserve.
Is there a way to 'centre' the scaling on zero?
Have you tried using StandardScaler from sklearn ?
It has with_mean and with_std option, which you can use to get data you want.
The problem with scaling the negative values to the column's minimum value and the positive values to the column's maximum value is that the scale of the positive numbers may be different than the scale of the positive numbers. If you want to use the same scale for both negative and positive values, try the following:
def zero_centered_min_max_scaling(dataframe):
"""
Scale the numerical values in the dataframe to be between -1 and 1, preserving the
signal of all values.
"""
df_copy = dataframe.copy(deep=True)
for column in df_copy.columns:
max_absolute_value = df_copy[column].abs().max()
df_copy[column] = df_copy[column] / max_absolute_value
return df_copy

Put a fixed quantity of missing values in a dataset - Azure ML

I'm dealing with Azure ML and my goal is to see what happens if I have a fixed quantity(in percentage) of missing values in my dataset.
My idea could be:
Starting from the dataset(take in example Adult dataset) ,duplicate the original dataset and call it for convention X. Dataset X will contain randomly missing value in the percentage of the 20%. Once we have the original dataset and the duplicated dataset X we can use a Neural Net algo , create training and test set and then train this neural net with the dataset X in input . What it could be interesting to see is the global error produced. After we can imagine to expand the range of missing values in the dataset X. Starting from 20%,after 40% and so on... I think the hardest part is to duplicate the original dataset and so create the dataset X with this missing values.
In which way I can do it? Using modules in Azure ML or maybe R/Python scripts?
Just Sharing my idea, please see the sample code & comments as below.
import numpy as np
import pandas as pd
# Origin DataFrame
df = pd.DataFrame(np.random.randn(6,4))
# Copy data via flatten data matrix as an array
array = df.values.flatten()
# insert missing data by percent
# Define the percent of missing data
percent = 0.2
size = len(array)
# generate a random list for indexing data which will be assigned NaN
chosen = np.random.choice(size, int(size*percent))
array[chosen] = np.nan
# Create a new DataFrame with missing data
df2 = pd.DataFrame(np.reshape(array, (6,4)))
Hope it helps.

Pandas/Statsmodel OLS predicting future values

I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in Pandas) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodel-I realize I must not be using .predict properly and I've read the multiple other problems people have had but they do not seem to apply to my case.
edit I believe 'endog' as defined is incorrect-I should be passing the values for which I want to predict; therefore I've created a date range of 12 periods past the last recorded value. But still I miss something as I am getting the error:
matrices are not aligned
edit here is a snippet of data, the last column (in red) of numbers is the date delta which is a difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085
I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To solve it in your code would be something like this:
dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()
Also, for what it's worth, I think the statsmodel formula api is much nicer to work with when dealing with DataFrames, and adds an intercept by default (add a - 1 to remove). See below, it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict() For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept - there's nothing encoded in the number 1 it's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)

Categories

Resources