I'm trying to convert the dataset into a classification dataset by:
Step 1: Split the range of target values into three equal parts: low, mid, and high.
Step 2: Reassign the target values to three categorical values 0, 1, and 2, representing the low, mid, and high range of values, respectively.
I tried a different approach using the method suggested in this post: How to automatically categorise data in panda dataframe? but didn't get the result I wanted. Any suggestions?
Dataset in question:
from sklearn.datasets import load_boston
data = load_boston()
X = data.data
y = data.target
Let's find the lowest value and assign it a placeholder (100) that is higher than max(y) (50 in your example). We repeat this until we have done it for 33% of your y, then do the same for the next 33% with another placeholder (200), and for the remaining third with a third placeholder (300), both also higher than max(y).
Then we use a function to map your 100, 200 and 300 to 0, 1 and 2.
from sklearn.datasets import load_boston

data = load_boston()
X = data.data
y = data.target
y = list(y)
print(y)

# Replace the lowest third of the values with 100, the middle third with 200,
# and the highest third with 300 (all placeholders above max(y)).
for i in range(len(y)):
    index = y.index(min(y))
    if i < len(y) / 3:
        y[index] = 100
    elif i > len(y) / 3 and i < 2 * (len(y) / 3):
        y[index] = 200
    else:
        y[index] = 300

# Map the placeholders 100, 200, 300 to the class labels 0, 1, 2.
def split_in_3(y):
    if y == 100:
        return 0
    elif y == 200:
        return 1
    else:
        return 2

y2 = map(split_in_3, y)
print(list(y2))
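Note that this assigns classes by sample count (terciles), not by the value range. If the goal really is three equal-width parts of the range, a rough alternative sketch (not the approach above) uses np.digitize with equally spaced edges:

import numpy as np
from sklearn.datasets import load_boston

data = load_boston()
y = data.target

# Two interior edges split [y.min(), y.max()] into three equal-width bins.
edges = np.linspace(y.min(), y.max(), num=4)[1:-1]
y_class = np.digitize(y, edges)  # 0 = low, 1 = mid, 2 = high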
This is my code, but when I run it I am not getting the correct shape. I need it to return a numpy array of shape (4, 100).
To give an idea of what I'm doing: I am fitting a polynomial LinearRegression model on the training data for each of the specified degrees, then generating predictions for the polynomial's values by transposing the 100-row, single-column output into a single-row, 100-column array.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
C = 15
n = 60
x = np.linspace(0, 20, n)  # x is drawn from a fixed range
y = x ** 3 / 20 - x ** 2 - x + C * np.random.randn(n)
x = x.reshape(-1, 1)  # convert x and y from 1-D arrays to 1-column matrices for input to sklearn regression
y = y.reshape(-1, 1)

# Create the training and testing sets and their targets
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

def model():
    degs = (1, 3, 7, 11)

    # Reshape your data either using array.reshape(-1, 1) if your data has a single feature
    # or array.reshape(1, -1) if it contains a single sample.
    def poly_y(i):
        poly = PolynomialFeatures(degree=i)
        x_poly = poly.fit_transform(X_train.reshape(-1, 1))
        linreg = LinearRegression().fit(x_poly, y_train)
        # x_orig = np.linspace(0, 20, 100)
        y_pred = linreg.predict(poly.fit_transform(np.linspace(0, 20, 100).reshape(-1, 1)))
        y_pred = y_pred.T
        return y_pred.reshape(-1, 1)

    ans = poly_y(1)
    for i in degs:
        temp = poly_y(i)
        ans = np.vstack([ans, temp])
    return ans

model()
Combining the comments on your question, with a brief explanation:
You're currently doing
ans = poly_y(1)
for i in degs:
    temp = poly_y(i)
    ans = np.vstack([ans, temp])
You set ans to the result for a degree of one, then loop through all degrees and stack those onto ans. But the degrees include 1 as well, so you get degree 1 twice and end up with a 500 by 1 array. Thus, you can remove the first line. That leaves the loop where you repeatedly stack onto ans, which can be done in one go with a list comprehension (e.g., [poly_y(deg) for deg in degs]). Stacking that with vstack results in a 400 by 1 array, which is still not what you want. You could reshape it, or you could use hstack instead: that returns a 100 by 4 array, and transposing it gives the 4 by 100 array you need.
So the final solution would be to replace the above four lines with
ans = np.hstack([poly_y(deg) for deg in degs]).T
(and if you want to get more fancy, replace those lines and the return ans line with
return np.hstack([poly_y(deg) for deg in degs]).T
)
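As a quick, standalone sanity check of the shapes involved (dummy arrays, not the question's data):

import numpy as np

cols = [np.zeros((100, 1)) for _ in range(4)]  # four (100, 1) prediction columns
print(np.vstack(cols).shape)    # (400, 1) -- stacked vertically
print(np.hstack(cols).shape)    # (100, 4) -- stacked side by side
print(np.hstack(cols).T.shape)  # (4, 100) -- transposed to the desired shape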
I want to test the low volatility factor for some market other than equities. Contradicting finance 101, it has been shown that low volatility stocks outperform high volatility stocks (see, for example, Baker, Malcolm, Brendan Bradley, and Jeffrey Wurgler (2011), “Benchmarks as Limits to Arbitrage: Understanding the Low-Volatility Anomaly”, Financial Analysts Journal, Vol. 67, No. 1, pp. 40–54).
So what I want to do is construct the low volatility factor following the methodology of Jegadeesh and Titman (1993): rank stocks according to their historical volatility over the previous j months, short the top 30% (the most volatile), go long the bottom 30% (the least volatile), and hold that long-short portfolio for k periods. A 3-3 j-k portfolio would therefore mean looking at the past 3 months of historical volatility (j) and holding the portfolio for the following 3 months (k).
I have written some code, and the j part can be easily managed by simply increasing or decreasing the window of the rolling volatility calculation. The part I am struggling with is the k part and how it could be done. Unfortunately, I couldn't find many examples online.
In addition, I was wondering if my code is correct or if I made any mistakes, since it surprisingly did not work regardless of the dataset I used. I am not sure whether this is the right place to ask, but if someone could take a look at it, that would be great and might be helpful to others planning to implement a strategy like this as well.
Below is a simple working example with just 10 stocks. As I said, I want to implement it for some other assets, but this code should work. You just have to use your own API key in line 16. Thanks a lot!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import quandl
import pickle
import scipy.optimize as sco
from scipy.ndimage.interpolation import shift
##################
# Low volatility #
##################
quandl.ApiConfig.api_key = 'Your key here'
stocks = ['MSFT','AAPL','AMZN','FB','BRK.B','JPM','GOOG','JNJ','V','PG','XOM']
data = quandl.get_table('WIKI/PRICES', ticker=stocks,
                        qopts={'columns': ['date', 'ticker', 'adj_close']},
                        date={'gte': '2016-1-1', 'lte': '2019-11-3'}, paginate=True)
# with open("data.pkl", "wb") as pickle_file:
# pickle.dump(data, pickle_file)
# with open("data.pkl", "rb") as pickle_file:
# data = pickle.load(pickle_file)
data = data.pivot_table(index='date', columns='ticker', values='adj_close')
data = data.groupby(pd.Grouper(freq="M")).mean() # convert from daily to monthly prices
returns = (np.log(data) - np.log(data.shift(1))).dropna()
stds = returns.rolling(12).std()
stds = stds.values # convert to numpy array
list = []
for x in range(0, stds.shape[0]):  # for each row in the std matrix, create decile buckets (dec -> breakpoint to the next bucket)
    for y in range(0, 100, 10):
        dec = np.percentile(stds[x], y)
        list.append(dec)
list = np.array(list)  # convert list to numpy array
list = np.reshape(list, (stds.shape[0], -1))  # reshape the array such that it has the same format as returns (here: (26,10))

inds = []
for x in range(0, stds.shape[0]):  # if the volatility is in the lower 30%, allocate a 1 to the asset; if it is in the upper 30%, allocate a -1; 0 otherwise
    ind = np.digitize(stds[x], list[x])
    for i in range(0, ind.shape[0]):
        if ind[i] <= 3:
            ind[i] = 1
        elif ind[i] >= 8:
            ind[i] = -1
        else:
            ind[i] = 0
    inds.append(ind)
inds = np.array(inds)
inds = inds.astype(np.float32)

for x in inds:  # divide the -1, 1 and 0 entries by their respective counts, such that the weights sum up to -1 and 1 (beta neutral long-short)
    ones = np.count_nonzero(x == 1)  # count the number of 1
    minus_ones = np.count_nonzero(x == -1)  # count the number of -1
    zeros = np.count_nonzero(x == 0)  # count the number of 0
    for y in range(0, inds.shape[1]):
        if x[y] == 1:
            x[y] = x[y] / ones
        elif x[y] == -1:
            x[y] = x[y] / minus_ones
        else:
            x[y] = x[y] / zeros
returns = returns.shift(periods=-1).values # shift returns one period back, and create numpy array
pf_returns = np.sum((inds*returns), axis=1) # multiply returns with weights, and sum up
pf_returns = pd.DataFrame(pf_returns)
print("---")
print(pf_returns.describe())
# Plot
pf_returns_indexed = 100 * (1 + pf_returns).cumprod()
pf_returns_indexed = pf_returns_indexed.plot(linewidth=1.2) # change line width
plt.show()
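For the k part, one option (a minimal sketch, not part of the original code) is to keep only every k-th month's weights and forward-fill them, so each long-short portfolio is held for k periods before the next rebalancing:

import pandas as pd

def hold_for_k_periods(weights, k):
    """Keep only every k-th period's weights and forward-fill them, so each
    long-short portfolio is held for k periods before rebalancing.
    weights: (periods x assets) DataFrame of portfolio weights."""
    return weights.iloc[::k].reindex(weights.index).ffill()

# Hypothetical usage with the monthly weights computed above, assuming
# monthly_index is the DatetimeIndex of the monthly returns:
# weights = pd.DataFrame(inds, index=monthly_index, columns=data.columns)
# held = hold_for_k_periods(weights, k=3)
# pf_returns = (held.values * returns).sum(axis=1)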
I have 2 dataframes, temperature (y) and ratio (x). In each dataframe I have 60 columns corresponding to 60 different machines that measure both parameters.
For now I have a plot of y vs. x for each machine, as follows:
for column in ratio.columns:
    x = ratio[column]
    y = temperature[column]
    if len(x) != len(y):
        x_ind = x.index
        y_ind = y.index
        common_ind = x_ind.intersection(y_ind)
        x = x[common_ind]
        y = y[common_ind]
    plt.scatter(x, y)
    plt.savefig("plot" + column + ".png")
    plt.clf()
Because I have a lot of data points, I want to do binning for each machine and average within each bin, so that I get one average value of y per bin.
x is between 0 and 1 and I want a bin every 0.05, which gives 20 bins.
I got a histogram for each machine by doing:
for x in ratio.columns:
    ratio.hist(column=x, bins=20)
but this only gives the number of events vs. ratio.
How can I link the temperature dataframe? I am new to pandas and can't figure out how to do this.
Create a mask that groups every 20 rows into one bin:
mask = my_df.index // 20
then use groupby and agg:
my_df.groupby(mask).agg(['mean'])
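If the bins should be on the x values themselves (every 0.05 between 0 and 1) rather than on row position, a sketch using pd.cut might look like this (it assumes ratio and temperature share a common index, as handled in the question's loop):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

bin_edges = np.arange(0, 1.05, 0.05)  # 20 bins of width 0.05 on [0, 1]

for column in ratio.columns:
    x = ratio[column]
    y = temperature[column]
    bins = pd.cut(x, bin_edges)      # assign each x value to its bin
    y_mean = y.groupby(bins).mean()  # average temperature per bin
    centers = bin_edges[:-1] + 0.025  # bin centers for plotting
    plt.scatter(centers, y_mean.values)
    plt.savefig("binned_plot" + column + ".png")
    plt.clf()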
I have a dataframe X with several columns and a dataframe y with only one column (series). The rows in X represent timesteps and I want to find the interval I need to shift each column of X to obtain the highest correlation with y. I wrote a function that loops over all columns and then loops over all timesteps and correlates the X column with y. If the R² is better than before I store the timestep. However, with over 300 columns this routine is really taking some time and I need to increase the performance. Is there a nice way to simplify this code?
(In the example I used the iris data set which is of course not a timeseries...)
from sklearn import datasets
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
from copy import deepcopy

def get_best_shift(dfX, dfy, ti=60, maxt=1440):
    """
    Determines the best correlation for the last maxt minutes based on a
    timestep of ti minutes. Creates a dataframe with the shifted variables
    based on the best match (strongest correlation).
    """
    df_out = deepcopy(dfX)
    for xcol in dfX:
        bestshift = 0
        Rmax = 0
        for ishift in range(0, int(maxt / ti)):
            xvals = dfX[xcol].iloc[0:(dfX.shape[0] - ishift)].values
            yvals = np.array([val[0] for val in dfy.iloc[ishift:dfy.shape[0]].values])
            selector = np.array([str(val) != "nan" for val in (xvals * yvals)], dtype=bool)
            xvals = xvals[selector]
            yvals = yvals[selector]
            R = np.corrcoef(xvals, yvals)[0][1]
            # plt.figure()
            # plt.plot(xvals, yvals, 'k.')
            # plt.show()
            if R ** 2 > Rmax:
                Rmax = R ** 2
                # print(Rmax)
                bestshift = ishift
        df_out[xcol] = list(np.zeros(bestshift)) + list(dfX[xcol].iloc[0:dfX.shape[0] - bestshift].values)
        df_out = df_out.rename(columns={xcol: ''.join([str(xcol), '_t-', str(bestshift)])})
    return df_out

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)
df = get_best_shift(X, y)
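One possible simplification (a sketch, not guaranteed to be faster on every dataset) is to let pandas handle the shifting and NaN filtering via Series.shift and Series.corr; it keeps the same brute-force search over shifts but removes most of the manual bookkeeping:

import numpy as np
import pandas as pd

def get_best_shift_simple(dfX, dfy, ti=60, maxt=1440):
    """For each column of dfX, find the shift (in steps of ti minutes, up to maxt)
    that maximizes R^2 against dfy, and return the shifted, renamed columns."""
    yser = dfy.iloc[:, 0]
    out = {}
    for xcol in dfX:
        # Series.corr drops NaN pairs, so no manual selector is needed.
        r2 = {s: np.nan_to_num(dfX[xcol].corr(yser.shift(-s))) ** 2
              for s in range(int(maxt / ti))}
        bestshift = max(r2, key=r2.get)
        out['{}_t-{}'.format(xcol, bestshift)] = dfX[xcol].shift(bestshift).fillna(0)
    return pd.DataFrame(out)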
I have a dataset (numpy arrays) with 50 classes and 9000 training examples:
x_train=(9000,2048)
y_train=(9000,) # Classes are strings
classes=list(set(y_train))
I would like to build a sub-dataset such that each class has 5 examples, which means I get 5*50 = 250 training examples. Hence my sub-dataset will take this form:
sub_train_data=(250,2048)
sub_train_labels=(250,)
Remark: we take 5 examples at random from each class (total number of classes = 50).
Thank you
Here is a solution for that problem:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

def balanced_sample_maker(X, y, sample_size, random_seed=42):
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}
    if random_seed is not None:
        np.random.seed(random_seed)
    # find the observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)
    data_train = X[balanced_copy_idx]
    labels_train = y[balanced_copy_idx]
    if len(data_train) == sample_size * len(uniq_levels):
        print('number of sampled examples', sample_size * len(uniq_levels),
              'number of samples per class', sample_size,
              '#classes:', len(list(set(uniq_levels))))
    else:
        print('number of samples is wrong')
    labels, values = zip(*Counter(labels_train).items())
    print('number of classes', len(list(set(labels_train))))
    check = all(x == values[0] for x in values)
    print(check)
    if check:
        print('Good, all classes have the same number of examples')
    else:
        print('Repeat your sampling, the classes are not balanced')
    # bar plot of the per-class counts in the sample
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
    return data_train, labels_train

X_train, y_train = balanced_sample_maker(X, y, 10)
inspired by Scikit-learn balanced subsampling
Pure numpy solution:
import numpy as np

def sample(X, y, samples):
    unique_ys = np.unique(y, axis=0)
    result = []
    for unique_y in unique_ys:
        val_indices = np.argwhere(y == unique_y).flatten()
        random_samples = np.random.choice(val_indices, samples, replace=False)
        result.append(X[random_samples])
    return np.concatenate(result)
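Assuming the question's shapes, the matching labels can be collected alongside the data with a small variant of the function above (just a sketch):

def sample_with_labels(X, y, samples):
    # Same idea as above, but also return the matching labels.
    unique_ys = np.unique(y, axis=0)
    xs, ys = [], []
    for unique_y in unique_ys:
        val_indices = np.argwhere(y == unique_y).flatten()
        chosen = np.random.choice(val_indices, samples, replace=False)
        xs.append(X[chosen])
        ys.append(y[chosen])
    return np.concatenate(xs), np.concatenate(ys)

# sub_train_data, sub_train_labels = sample_with_labels(x_train, y_train, 5)
# shapes: (250, 2048) and (250,)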
I usually use a trick from scikit-learn for this. I use the StratifiedShuffleSplit function. So if I have to select 1/n fraction of my train set, I divide the data into n folds and set the proportion of test data (test_size) as 1-1/n. Here is an example where I use only 1/10 of my data.
from sklearn.model_selection import StratifiedShuffleSplit

sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=seed)
for train_index, _ in sp.split(x_train, y_train):
    x_train, y_train = x_train[train_index], y_train[train_index]
You can use a dataframe as input (as in my case) and use the simple code below:
col = target  # name of the column that holds the class labels
nsamples = min(t4m[col].value_counts().values)
res = pd.DataFrame()
for val in t4m[col].unique():
    t = t4m.loc[t4m[col] == val].sample(nsamples)
    res = pd.concat([res, t], ignore_index=True).sample(frac=1)
col is the name of your column with classes. The code finds the size of the minority class, takes a sample of that size from each class, and shuffles the resulting dataframe.
Then you can convert the result back to a np.array.
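For example (hypothetical variable names, with res and col as above), the balanced dataframe could be converted back like this:

# res is the balanced dataframe built above, col the name of the class column
sub_train_labels = res[col].to_numpy()
sub_train_data = res.drop(columns=[col]).to_numpy()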