I have a dataframe whose rows alternate, in blocks of 50, between class A and class B, and I don't really understand the function Categorical.from_codes. The dataframe holds my features, which are simply 20 pixels per image, i.e. a matrix with 20 columns; the column labels are just the pixel indices (1, 2, 3, and so forth). Given the dataframe and its class enumeration below, how can I extract X and Y, where X is my data and Y holds my categories?
import numpy as np
import pandas as pd
my_array = np.zeros((700, 20))
indices = sorted(list(range(0, my_array.shape[0] // 50)) * 50)  # codes 0..13, one per block of 50 rows
pixel_index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
               11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
df = pd.DataFrame(my_array, columns=pixel_index)
class_names = list('AB')
target_names = ["Class_" + c for c in class_names]
n_sets = df.shape[0]//50
class_col = []
for name in target_names:
    class_col += [name] * 50
n_sets = df.shape[0]//(50*len(target_names))
class_col = class_col*n_sets
df['class'] = class_col
X = pd.DataFrame(my_array, columns=pixel_index)
y = pd.Categorical.from_codes(indices, target_names)
It's a little difficult to understand what you're trying to achieve. If you're trying to create a Y series of 0/1 codes corresponding to the class you created for every row, replace this line:
y = pd.Categorical.from_codes(indices,target_names)
With
y = pd.Categorical(df["class"]).codes
The value of y would then be 50 zeros, then 50 ones, then 50 zeros, and so on.
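For reference, Categorical.from_codes itself works here once the codes line up with the two categories; the indices built in the question run 0 through 13 (one code per block of 50), which is why from_codes fails with only two category names. A minimal sketch of the intended call, assuming the A/B blocks alternate every 50 rows:
# 0-based codes indexing into target_names, alternating in blocks of 50
codes = ([0] * 50 + [1] * 50) * (df.shape[0] // 100)
y = pd.Categorical.from_codes(codes, target_names)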
I have a dataframe consisting of float64 values. I have to divide each value by 100, except for the values in the row with index no. 388. For that I wrote the following code.
Preprocessing:
df = pd.read_csv('state_cpi.csv')
d = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
df['Month']=df['Name'].map(d)
r = {'Rural':1, 'Urban':2, 'Rural+Urban':3}
df['Region_code']=df['Sector'].map(r)
df['Himachal Pradesh'] = df['Himachal Pradesh'].str.replace('--', 'NaN')  # '--' placeholders parse as NaN on the cast below
df['Himachal Pradesh'] = df['Himachal Pradesh'].astype('float64')
Extracting the data of use:
data = df.iloc[:,3:-2]
Applying the division to the data dataframe:
data[:,:388] = (data[:,:388] / 100).round(2)
data[:,389:] = (data[:,389:] / 100).round(2)
It returned a dataframe in which the data of row no. 388 was also divided by 100.
As an example, take the dataframe created below. All indices except 10 are copied into the aaa list. Those index labels are then supplied to .loc and 1 is added to each element; the row with index 10 remains unchanged.
df = pd.DataFrame({'a': [1, 23, 4, 5, 7, 7, 8, 10, 9],
                   'b': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
                  index=[1, 2, 5, 7, 8, 9, 10, 11, 12])
aaa = df[df.index != 10].index
df.loc[aaa, :] = df.loc[aaa, :] + 1
In your case, the code will be as follows:
aaa = data[data.index != 388].index
data.loc[aaa, :] = (data.loc[aaa, :] / 100).round(2)
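Equivalently, you can skip the intermediate index list and use the boolean mask directly:
# keep every row except the one with index 388
mask = data.index != 388
data.loc[mask, :] = (data.loc[mask, :] / 100).round(2)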
I have four given variables:
group size
total of groups
partial sum
1-D tensor
and I want to add zeros once the running sum within a group reaches the partial sum. For example:
groupsize = 4
totalgroups = 3
partialsum = 15
d1tensor = torch.tensor([ 3, 12, 5, 5, 5, 4, 11])
The expected result is:
[ 3, 12, 0, 0, 5, 5, 5, 0, 4, 11, 0, 0]
I have no clue how I can achieve that in pure PyTorch. In plain Python it would be something like this:
target = [0] * (groupsize * totalgroups)
cursor = 0
current_count = 0
d1tensor = [3, 12, 5, 5, 5, 4, 11]
for idx, ele in enumerate(target):
    subgroup_start = (idx // groupsize) * groupsize
    subgroup_end = subgroup_start + groupsize
    if sum(target[subgroup_start:subgroup_end]) < partialsum:
        target[idx] = d1tensor[cursor]
        cursor += 1
Can anyone help me with that? I have already googled it but couldn't find anything.
Some logic, NumPy, and list comprehensions are sufficient here.
I will break it down step by step, you can make it slimmer and prettier afterwards:
import numpy as np
my_val = 15
block_size = 4
total_groups = 3
d1 = [3, 12, 5, 5, 5, 4, 11]
d2 = np.cumsum(d1)
d3 = d2 % my_val == 0  # True where the running sum hits 15 (or a multiple of it)
split_points = [i + 1 for i, x in enumerate(d3) if x]  # indices where the cumsum reaches my_val
#### Option 1
split_array = np.split(d1, split_points, axis=0)
padded_arrays = [np.pad(array, (0, block_size - len(array)), mode='constant')
                 for array in split_array]  # pad each group with zeros up to block_size
padded_d1 = np.concatenate(padded_arrays[:total_groups])  # put them together, discard the extra empty group if present
#### Option 2
split_points = [el for el in split_points if el < len(d1)]  # make sure we don't split on the last element of d1
split_array = np.split(d1, split_points, axis=0)
padded_arrays = [np.pad(array, (0, block_size - len(array)), mode='constant')
                 for array in split_array]  # pad each group with zeros up to block_size
padded_d1 = np.concatenate(padded_arrays)
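As a quick sanity check on the example, and since the question asks for a tensor, a conversion back to PyTorch (assuming torch is imported):
import torch

print(padded_d1)  # [ 3 12  0  0  5  5  5  0  4 11  0  0]
result = torch.as_tensor(padded_d1)  # back to a 1-D tensor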
I have a dataframe which stores different variables. I'm using OLS linear regression and using all of the variables to predict the 'price' column.
import pandas as pd
import statsmodels.api as sm
data = {'accommodates': [2, 2, 3, 2, 2, 6, 8, 4, 3, 2],
        'bedrooms': [1, 2, 1, 1, 3, 4, 2, 2, 2, 3],
        'instant_bookable': [1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
        'availability_365': [123, 3, 33, 14, 15, 16, 3, 41, 61, 74],
        'minimum_nights': [3, 12, 1, 4, 6, 7, 2, 3, 6, 10],
        'beds': [2, 2, 3, 4, 1, 5, 6, 2, 3, 2],
        'price': [59, 234, 15, 162, 56, 42, 28, 52, 22, 31]}
df = pd.DataFrame(data, columns=['accommodates', 'bedrooms', 'instant_bookable', 'availability_365',
                                 'minimum_nights', 'beds', 'price'])
I have a for loop which calculates the Adjusted R squared value for each variable:
fit_d = {}
for column in [x for x in df.columns if x != 'price']:
    Y = df['price']
    X = df[column]
    X = sm.add_constant(X)
    model = sm.OLS(Y, X, missing='drop').fit()
    fit_d[column] = model.rsquared
fit_d
How can I modify my code in order to find the combination of variables that gives the largest adjusted R-squared value? Ideally the function would first find the single variable with the largest adjusted R-squared, then iterate over the remaining variables together with the first to find the best pair, then the best triple, and so on, until the value cannot be increased further. I'd like the output to be something like
Best variables: {'accommodates', 'availability', 'bedrooms'}
Here is a "brute force" way to try all possible combinations (from itertools) of different lengths and find the variables with the highest R value. The idea is two nested loops: one over the number of variables to try, and one over all combinations with that number of variables.
from itertools import combinations
# all possible columns for X
cols = [x for x in df.columns if x != 'price']
# define Y once, the same across all loops
Y = df['price']
# define result dictionary
fit_d = {}
# loop over every possible combination length
for i in range(1, len(cols) + 1):
    # loop over all combinations of length i
    for comb in combinations(cols, i):
        # build X from the combination
        X = df[list(comb)]
        X = sm.add_constant(X)
        # perform the OLS operation
        model = sm.OLS(Y, X, missing='drop').fit()
        # save the R-squared in a dictionary
        fit_d[comb] = model.rsquared
# extract the key with the max R value
key_max = max(fit_d, key=fit_d.get)
print(f'Best variables {key_max} for a R-value of {round(fit_d[key_max], 5)}')
# Best variables ('accommodates', 'bedrooms', 'instant_bookable', 'availability_365', 'minimum_nights', 'beds') for a R-value of 0.78506
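Note that plain R-squared never decreases when a variable is added, which is why the brute-force maximum above ends up selecting all six columns; for the stepwise procedure the question describes, the adjusted value (model.rsquared_adj in statsmodels) is the right score. A minimal sketch of that greedy forward selection, with forward_select as a hypothetical helper name:
import statsmodels.api as sm

def forward_select(df, target='price'):
    """Greedy forward selection on adjusted R-squared (hypothetical helper)."""
    remaining = [c for c in df.columns if c != target]
    selected = []
    best_adj = -float('inf')
    Y = df[target]
    while remaining:
        # score every candidate added to the current selection
        scores = {}
        for col in remaining:
            X = sm.add_constant(df[selected + [col]])
            scores[col] = sm.OLS(Y, X, missing='drop').fit().rsquared_adj
        best_col = max(scores, key=scores.get)
        # stop as soon as no candidate improves the adjusted R-squared
        if scores[best_col] <= best_adj:
            break
        best_adj = scores[best_col]
        selected.append(best_col)
        remaining.remove(best_col)
    return selected, best_adj

selected, score = forward_select(df)
print(f'Best variables {selected} for an adjusted R-value of {round(score, 5)}')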
I would like to know, if I have generated the 3 arrays in the manner below, how I can sum up all the numbers from all 3 arrays without double-counting the ones that appear in more than one array.
(I would like to sum 10 only once, but I can't simply add arrays X_1 and X_2 because they both contain 10 and 20; I only want to count those numbers once.)
Maybe this can be done by creating a new array out of X_1, X_2 and X_3 that leaves out the duplicates?
import numpy as np

def get_divisible_by_n(arr, n):
    return arr[arr % n == 0]

x = np.arange(1, 21)
X_1 = get_divisible_by_n(x, 2)
# we get array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])
X_2 = get_divisible_by_n(x, 5)
# we get array([ 5, 10, 15, 20])
X_3 = get_divisible_by_n(x, 3)
# we get array([ 3,  6,  9, 12, 15, 18])
It is me again!
Here is my solution using numpy, since I had more time this time:
import numpy as np
arr = np.arange(1,21)
divisable_by = lambda x: arr[np.where(arr % x == 0)]
n_2 = divisable_by(2)
n_3 = divisable_by(3)
n_5 = divisable_by(5)
what_u_want = np.unique(np.concatenate((n_2, n_3, n_5)))
# [ 2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 15, 16, 18, 20]
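Since the question ultimately asks for the sum of the de-duplicated values:
print(what_u_want.sum())  # 142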
Not really efficient and not using numpy, but here is one solution:
def get_divisible_by_n(arr, n):
    return [i for i in arr if i % n == 0]
x = [i for i in range(21)]
X_1 = get_divisible_by_n(x, 2)
X_2 = get_divisible_by_n(x, 5)
X_3 = get_divisible_by_n(x, 3)
X_all = X_1+X_2+X_3
y = set(X_all)
print(sum(y)) # 142
I have a Dataframe which I want to transform into a multidimensional array using one of the columns as the 3rd dimension.
As an example:
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 3, 3],
    'date': np.random.randint(1, 6, 6),
    'value1': [11, 12, 13, 14, 15, 16],
    'value2': [21, 22, 23, 24, 25, 26]
})
I would like to transform it into a 3D array with dimensions (id, date, values).
The problem is that the 'id's do not have the same number of occurrences so I cannot use np.reshape().
For this simplified example, I was able to use:
ra = np.full((3, 3, 3), np.nan)
for i, value in enumerate(df['id'].unique()):
    rows = df.loc[df['id'] == value].shape[0]
    ra[i, :rows, :] = df.loc[df['id'] == value, 'date':'value2']
This produces the needed result, but the original DataFrame contains millions of rows. Is there a vectorized way to accomplish the same result?
Approach #1
Here's one vectorized approach, after sorting the id column with df.sort_values('id', inplace=True) as suggested by @Yannis in the comments -
count_id = df.id.value_counts().sort_index().values  # number of rows per id
mask = count_id[:, None] > np.arange(count_id.max())  # True for every (id, slot) pair that holds data
vals = df.loc[:, 'date':'value2'].values
out_shp = mask.shape + (vals.shape[1],)
out = np.full(out_shp, np.nan)
out[mask] = vals
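To see what the intermediate pieces look like on the example frame (counts per id are 1, 2 and 3):
print(count_id)  # [1 2 3] -> rows available per id
print(mask)      # row i has count_id[i] leading True values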
Approach #2
Another with factorize that doesn't require any pre-sorting -
x = df.id.factorize()[0]  # dense 0-based codes for id
y = df.groupby(x).cumcount().values  # running row position within each id
vals = df.loc[:, 'date':'value2'].values
out_shp = (x.max() + 1, y.max() + 1, vals.shape[1])
out = np.full(out_shp, np.nan)
out[x, y] = vals
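A quick check on the example frame, for either approach:
print(out.shape)  # (3, 3, 3): 3 ids, at most 3 rows per id, 3 value columns
print(out[0])     # the single row for id 1, the rest padded with NaN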