Finding all the variables that give the highest Adjusted R squared value - python

I have a dataframe which stores different variables. I'm using OLS linear regression and using all of the variables to predict the 'price' column.
import pandas as pd
import statsmodels.api as sm
data = {'accommodates':[2, 2, 3, 2, 2, 6, 8, 4, 3, 2],
'bedrooms':[1, 2, 1, 1, 3, 4, 2, 2, 2, 3],
'instant_bookable':[1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
'availability_365':[123, 3, 33, 14, 15, 16, 3, 41, 61, 74],
'minimum_nights':[3, 12, 1, 4, 6, 7, 2, 3, 6, 10],
'beds':[2, 2, 3, 4, 1, 5, 6, 2, 3, 2],
'price':[59, 234, 15, 162, 56, 42, 28, 52, 22, 31]}
df = pd.DataFrame(data, columns = ['accommodates', 'bedrooms', 'instant_bookable', 'availability_365',
'minimum_nights', 'beds', 'price'])
I have a for loop which calculates the Adjusted R squared value for each variable:
fit_d = {}
for column in [x for x in df.columns if x != 'price']:
    Y = df['price']
    X = df[column]
    X = sm.add_constant(X)
    model = sm.OLS(Y, X, missing='drop').fit()
    # rsquared_adj is the adjusted R squared; rsquared is the plain one
    fit_d[column] = model.rsquared_adj
fit_d
How can I modify my code in order to find the combination of variables that give the largest Adjusted R squared value? Ideally the function would find the variable with the largest adj. R squared value first, then using the 1st variable iterate with the remaining variables to get 2 variables that give the highest value, then 3 variables etc. until the value cannot be increased further. I'd like the output to be something like
Best variables: {'accommodates', 'availability', 'bedrooms'}

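Here is a brute-force way: try every possible combination (from itertools) of every length and keep the one with the highest R squared value. The idea is to use two loops, one over the number of variables to try and one over all the combinations with that number of variables. Note that the code ranks by model.rsquared, and plain R squared never decreases when a variable is added, so the full set of variables wins; use model.rsquared_adj instead to rank by adjusted R squared.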
from itertools import combinations
# all possible columns for X
cols = [x for x in df.columns if x != 'price']
# Y is the same across the loops
Y = df['price']
# result dictionary
fit_d = {}
# loop over every combination length
for i in range(1, len(cols)+1):
    # loop over every combination of length i
    for comb in combinations(cols, i):
        # define X from the combination
        X = df[list(comb)]
        X = sm.add_constant(X)
        # perform the OLS operation
        model = sm.OLS(Y, X, missing='drop').fit()
        # save the R squared in a dictionary
        fit_d[comb] = model.rsquared
# extract the key for the max R squared value
key_max = max(fit_d, key=fit_d.get)
print(f'Best variables {key_max} for a R-value of {round(fit_d[key_max], 5)}')
# Best variables ('accommodates', 'bedrooms', 'instant_bookable', 'availability_365', 'minimum_nights', 'beds') for a R-value of 0.78506
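The step-by-step search described in the question (best single variable first, then the best pair containing it, and so on until the score stops improving) could be sketched roughly as follows. This is only a sketch under some assumptions: the helper names adjusted_r2 and forward_select are made up for illustration, the data has no missing values, and the adjusted R squared is computed directly with NumPy rather than through statsmodels (model.rsquared_adj should give the same number). A greedy search like this is not guaranteed to find the overall best combination.

```python
import numpy as np
import pandas as pd

def adjusted_r2(df, predictors, target):
    """Adjusted R squared of an OLS fit (with intercept) of target on predictors."""
    y = df[target].to_numpy(dtype=float)
    columns = [np.ones(len(df))] + [df[c].to_numpy(dtype=float) for c in predictors]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    n, p = len(y), len(predictors)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)

def forward_select(df, target):
    """Greedily add the predictor that raises adjusted R squared the most."""
    remaining = [c for c in df.columns if c != target]
    selected, history = [], []
    best = -np.inf
    while remaining:
        scores = {c: adjusted_r2(df, selected + [c], target) for c in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            break  # no remaining variable improves the score
        best = scores[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
        history.append(best)
    return selected, history

# same data as in the question
df = pd.DataFrame({
    'accommodates': [2, 2, 3, 2, 2, 6, 8, 4, 3, 2],
    'bedrooms': [1, 2, 1, 1, 3, 4, 2, 2, 2, 3],
    'instant_bookable': [1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
    'availability_365': [123, 3, 33, 14, 15, 16, 3, 41, 61, 74],
    'minimum_nights': [3, 12, 1, 4, 6, 7, 2, 3, 6, 10],
    'beds': [2, 2, 3, 4, 1, 5, 6, 2, 3, 2],
    'price': [59, 234, 15, 162, 56, 42, 28, 52, 22, 31],
})
selected, history = forward_select(df, 'price')
print('Best variables:', selected)
```

history holds the adjusted R squared after each added variable, which makes it easy to see where the improvement levels off.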

Related

How to edit all data value given in a dataframe except for the values of a particular index?

I have a dataframe consisting of float64 values. I have to divide each value by 100, except for the values in the row with index 388. For that I wrote the following code.
Preprocessing:
df = pd.read_csv('state_cpi.csv')
d = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
df['Month']=df['Name'].map(d)
r = {'Rural':1, 'Urban':2, 'Rural+Urban':3}
df['Region_code']=df['Sector'].map(r)
df['Himachal Pradesh'] = df['Himachal Pradesh'].str.replace('--','NaN')
df['Himachal Pradesh'] = df['Himachal Pradesh'].astype('float64')
Extracting the data of use:
data = df.iloc[:,3:-2]
Applying the division on the data dataframe
data[:,:388] = (data[:,:388] / 100).round(2)
data[:,389:] = (data[:,389:] / 100).round(2)
It returned a dataframe in which the data of row 388 had also been divided by 100.
As an example, take the dataframe created below. All index labels except 10 are collected into aaa. These labels are then passed to .loc, and 1 is added to every element in those rows. The row with index 10 remains unchanged.
df = pd.DataFrame({'a': [1, 23, 4, 5, 7, 7, 8, 10, 9],
'b': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
index=[1, 2, 5, 7, 8, 9, 10, 11, 12])
aaa = df[df.index != 10].index
df.loc[aaa, :] = df.loc[aaa, :] + 1
In your case, the code will be as follows:
aaa = data[data.index != 388].index
data.loc[aaa, :] = (data.loc[aaa, :] / 100).round(2)
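The same idea can also be written with a boolean mask instead of a list of labels. A minimal sketch, using a small made-up float dataframe (the column names and values are invented for illustration) in place of the real data:

```python
import pandas as pd

# small stand-in for the real data; row 388 must stay untouched
data = pd.DataFrame({'a': [100.0, 250.0, 300.0],
                     'b': [50.0, 75.0, 125.0]},
                    index=[387, 388, 389])

# True for every row except the one labelled 388
mask = data.index != 388
data.loc[mask] = (data.loc[mask] / 100).round(2)
print(data)
```

Only the rows where mask is True are divided; row 388 keeps its original values.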

how to delete rows and columns in numpy python?

I am having trouble creating a function which takes a matrix M as an input and deletes BOTH rows and columns containing the number 0 and giving an output containing the remaining numbers. Any help is much appreciated as I have my programming exam coming up soon.
By "deleting both rows and columns" this is what I mean:
import numpy as np
x = np.array([[1,2,3,4,5],
[6,0,8,9,10],
[11,12,13,14,15],
[16,0,0,19,20]])
idxs_array = list(np.where(x==0))
idxs_array = [list(dict.fromkeys(x)) for x in idxs_array]
for axis, idxs in enumerate(idxs_array):
    sub_factor = 0
    for idx in idxs:
        x = np.delete(x,idx-sub_factor,axis)
        sub_factor += 1
print(x)
# x = [[ 1, 4, 5],
# [11, 14, 15]]
1. Locate zero elements
First of all, we need to identify the location of the zero elements in the matrix, which can be done easily with np.where().
np.where returns the row and column indices of the elements that match a given condition.
row_idx, col_idx = np.where(arr == 0)
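To make the return value concrete, here is what np.where gives for the matrix from the question (a small illustration only):

```python
import numpy as np

arr = np.array([[ 1,  2,  3,  4,  5],
                [ 6,  0,  8,  9, 10],
                [11, 12, 13, 14, 15],
                [16,  0,  0, 19, 20]])

# one array of row indices and one of column indices, paired element by element
row_idx, col_idx = np.where(arr == 0)
print(row_idx)  # [1 3 3]
print(col_idx)  # [1 1 2]
```

The zeros sit at positions (1, 1), (3, 1) and (3, 2), so rows 1 and 3 and columns 1 and 2 have to go.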
2. Remove corresponding rows/columns
To remove the corresponding rows and columns, there is an easy way: boolean indexing.
That is, you mark each row (or column) you want to keep with True and each one you want to drop with False.
print(np.arange(4)[[True, False, True, False]])
# [0 2]
3. Put two things together
Here is a minimal example.
arr = np.array([[ 1, 2, 3, 4, 5],
[ 6, 0, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 0, 0, 19, 20]])
row_idx, col_idx = np.where(arr == 0)
rm_row_idx = set(row_idx.tolist())
rm_col_idx = set(col_idx.tolist())
row_mask = [i not in rm_row_idx for i in range(arr.shape[0])]
col_mask = [i not in rm_col_idx for i in range(arr.shape[1])]
arr = arr[row_mask, :]
arr = arr[:, col_mask]
print(arr)
# Prints:
# [[ 1  4  5]
#  [11 14 15]]
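The masks can also be built without np.where or Python loops. A possible shorter variant, assuming the same example matrix, uses any() along each axis:

```python
import numpy as np

arr = np.array([[ 1,  2,  3,  4,  5],
                [ 6,  0,  8,  9, 10],
                [11, 12, 13, 14, 15],
                [16,  0,  0, 19, 20]])

is_zero = arr == 0
keep_rows = ~is_zero.any(axis=1)  # rows that contain no zero
keep_cols = ~is_zero.any(axis=0)  # columns that contain no zero
result = arr[keep_rows][:, keep_cols]
print(result)
# [[ 1  4  5]
#  [11 14 15]]
```

Indexing in two steps (rows first, then columns) avoids pairing the two masks element by element, which is what arr[keep_rows, keep_cols] would try to do.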

Storing Categorical from codes from Dataframe

I have a dataframe in which every block of 50 rows is labelled, alternating between A and B. I don't really understand the function Categorical.from_codes. The dataframe holds my features, which are simply 20 pixels from each of 50 images, giving a 50x20 matrix. The Y values are simply the index values, for instance pixel 0, 1, 2, 3 and so forth. Below is my dataframe and its enumeration. For the given dataframe, how can I extract X and Y, where X is my data and Y is my categories?
import numpy as np
import pandas as pd
my_array = np.zeros((700, 20))
indices = sorted(list(range(0,int(my_array.shape[0]/50)))*50)
pixel_index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
df = pd.DataFrame(my_array, columns=pixel_index)
class_names = list('AB')
target_names = ["Class_" + c for c in class_names]
n_sets = df.shape[0]//50
class_col = []
for name in target_names:
    class_col += [name]*50
n_sets = df.shape[0]//(50*len(target_names))
class_col = class_col*n_sets
df['class'] = class_col
X = pd.DataFrame(my_array, columns= pixel_index)
y = pd.Categorical.from_codes(indices,target_names)
It's a little difficult to understand what you're trying to achieve. If you're trying to create a Y series which is 0/1, corresponding to the class you create for every row, replace this line:
y = pd.Categorical.from_codes(indices,target_names)
With
y = pd.Categorical(df["class"]).codes
The value of y would then be 50 zeros, 50 ones, 50 zeros, 50 ones, etc.
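As a rough illustration of how the two directions relate: from_codes turns integer codes into labels, while .codes turns labels back into integers. Each code must be a valid position in the categories list (or -1 for missing), which is probably why passing codes 0 to 13 with only two categories fails in the question. The short lists below are made up for the example:

```python
import pandas as pd

target_names = ["Class_A", "Class_B"]

# from_codes: integer codes -> labels (each code indexes into categories)
y_labels = pd.Categorical.from_codes([0, 0, 1, 1, 0], categories=target_names)
print(list(y_labels))    # ['Class_A', 'Class_A', 'Class_B', 'Class_B', 'Class_A']

# .codes: labels -> integer codes (categories are sorted, so Class_A becomes 0)
y_codes = pd.Categorical(["Class_A", "Class_A", "Class_B", "Class_B", "Class_A"]).codes
print(y_codes.tolist())  # [0, 0, 1, 1, 0]
```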

Generating random numbers to obtain a fixed sum(python) [duplicate]

I have the following list:
Sum=[54,1536,36,14,9,360]
I need to generate 4 other lists, where each list will consist of 6 random numbers starting from 0, and the numbers will add up to the values in Sum. For example:
l1=[a,b,c,d,e,f] where a+b+c+d+e+f=54
l2=[g,h,i,j,k,l] where g+h+i+j+k+l=1536
and so on up to l6. I need to do this in Python. Can it be done?
Generating a list of random numbers that sum to a certain integer is harder than it looks. Keeping track of the remaining quantity and generating items sequentially from what is left results in a non-uniform distribution, where the first numbers in the series are generally much larger than the others. On top of that, the last one will always be different from zero, because the previous items in the list can never sum up to the desired total (random generators usually exclude the upper bound of the interval). Shuffling the list after generation might help a bit but won't generally give good results either.
A solution could be to generate random numbers and then normalize the result, rounding afterwards if you need them to be integers.
import numpy as np
totals = np.array([54,1536,36,14]) # named totals rather than Sum to avoid confusion with the built-in sum()
a = np.random.random((6, 4)) # create random numbers
a = a/np.sum(a, axis=0) * totals # force each column to sum to its total
# Ignore the following if you don't need integers
a = np.round(a) # round to whole numbers (the dtype stays float)
remainings = totals - np.sum(a, axis=0) # check if there are corrections to be done
for j, r in enumerate(remainings): # implement the correction
    step = 1 if r > 0 else -1
    while r != 0:
        i = np.random.randint(6)
        if a[i,j] + step >= 0:
            a[i, j] += step
            r -= step
Each column of a represents one of the lists you want.
Hope this helps.
This might not be the most efficient way but it will work
import numpy as np
totals = [54, 1536, 36, 14]
nums = []
for i in totals:
    # redraw until the six numbers happen to sum to i
    x = np.random.randint(0, i, size=(6,))
    while sum(x) != i:
        x = np.random.randint(0, i, size=(6,))
    nums.append(x)
print(nums)
[array([ 3, 19, 21, 11, 0, 0]), array([111, 155, 224, 511, 457, 78]),
array([ 8, 5, 4, 12, 2, 5]), array([3, 1, 3, 2, 1, 4])]
This is a much more efficient way to do it:
import numpy as np
totals = [54,1536,36,14,9,360, 0]
nums = []
for total in totals:
    if total == 0:
        nums.append([0] * 6)
        continue
    temp = []
    for _ in range(5):
        # draw from [0, total), so the remainder stays at least 1
        val = int(np.random.randint(0, total))
        temp.append(val)
        total -= val
    # the last number is whatever is left
    temp.append(total)
    nums.append(temp)
print(nums)
[[22, 4, 16, 0, 2, 10], [775, 49, 255, 112, 185, 160], [2, 10, 18, 2, 0, 4],
[10, 2, 1, 0, 0, 1], [8, 0, 0, 0, 0, 1], [330, 26, 1, 0, 2, 1],
[0, 0, 0, 0, 0, 0]]
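If the numbers only need to be non-negative integers with a fixed sum, another option that might be worth considering is a multinomial draw: it distributes total units over 6 slots at random, so the sum is exact by construction and no correction step is needed. A minimal sketch, assuming equal probabilities for the six slots (that choice is an assumption, not something the question specifies):

```python
import numpy as np

rng = np.random.default_rng()
totals = [54, 1536, 36, 14, 9, 360]

# each draw splits `total` units over 6 equally likely slots
lists = [rng.multinomial(total, [1 / 6] * 6).tolist() for total in totals]
for total, numbers in zip(totals, lists):
    print(total, numbers)
```

With equal probabilities the six numbers tend to be of similar size; different pvals would skew them.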

Getting the indexes of a Dataframe after a numpy array function

I have a function which implements the k-means algorithm and I want to use it with DataFrames in order to take the indexes into account. For the moment I use DataFrame.values and it works, but I don't get the indexes in the output.
import random
import numpy as np
def cluster_points(X, mu):
    clusters = {}
    for x in X:
        # key of the center closest to x
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t:t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters
def reevaluate_centers(mu, clusters):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(np.mean(clusters[k], axis = 0))
    return newmu
def has_converged(mu, oldmu):
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))
def find_centers(X, K):
    # Initialize to K random centers
    # (random.sample needs a sequence, so sample from a list of rows)
    rows = list(np.asarray(X))
    oldmu = random.sample(rows, K)
    mu = random.sample(rows, K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers
        mu = reevaluate_centers(oldmu, clusters)
    return(mu, clusters)
For instance, with this minimal example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(10, 5)), index = list(range(10)), columns=list(range(5)))
df.index.name = 'subscriber_id'
df.columns.name = 'ad_id'
I get:
find_centers(df.values, 2)
([array([ 3.8, 3. , 3.6, 2. , 3.6]),
array([ 6.8, 3.6, 5.6, 6.8, 6.8])],
{0: [array([2, 0, 5, 6, 4]),
array([1, 1, 2, 3, 3]),
array([6, 0, 4, 0, 3]),
array([7, 9, 4, 1, 7]),
array([3, 5, 3, 0, 1])],
1: [array([6, 2, 5, 9, 6]),
array([8, 9, 7, 2, 8]),
array([7, 5, 3, 7, 8]),
array([7, 1, 5, 7, 6]),
array([6, 1, 8, 9, 6])]})
I have the values but don't have the indexes.
If you want to get the array of values including the index, you can simply add the index to the columns with reset_index():
values_with_index = df.reset_index().values
Update
If what you want is to have the index on the output, but not use it during the actual clustering, you can do the following. First, pass the actual data frame object to find_centers:
find_centers(df, 2)
Then change cluster_points as follows:
def cluster_points(X, mu):
    clusters = {}
    for _, x in X.iterrows():
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t:t[1])[0]
        # You can replace this try/except block with
        # clusters.setdefault(bestmukey, []).append(x)
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters
The centers in the output will still be arrays, but the clusters will contain series objects with each row. The name property of each of these series is the index value in the data frame.
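If only the index labels of each cluster are needed, rather than the rows themselves, a variant along these lines might be enough. This is a sketch, not part of the original answer: cluster_labels is a hypothetical helper, and the tiny dataframe and centers are made up so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

def cluster_labels(df, mu):
    """Map each center's position in mu to the index labels of its nearest rows."""
    clusters = {}
    for label, row in zip(df.index, df.to_numpy()):
        distances = [np.linalg.norm(row - center) for center in mu]
        clusters.setdefault(int(np.argmin(distances)), []).append(label)
    return clusters

df = pd.DataFrame([[0, 0], [0, 1], [10, 10], [10, 11]],
                  index=['s1', 's2', 's3', 's4'])
mu = [np.array([0.0, 0.5]), np.array([10.0, 10.5])]
labels = cluster_labels(df, mu)
print(labels)  # {0: ['s1', 's2'], 1: ['s3', 's4']}
```

The labels can then be used with df.loc[labels[0]] to pull the corresponding rows back out of the dataframe.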
