Feature scaling converts different values in columns to the same scale - python

Scaling (for example with StandardScaler) converts columns with different values onto a common scale. When a model is built on the scaled data, values that were different before are transformed so that each column has mean = 0 and std = 1, so I would expect this to affect the model fit and results.
I have taken a toy pandas DataFrame with the 1st column running from 1 to 10 and the 2nd column running from 5 to 14, and scaled both using StandardScaler.
import numpy as np
import pandas as pd
ls1 = np.arange(1,10)
ls2 = np.arange(5,14)
before_scaling = pd.DataFrame()
before_scaling['a'] = ls1
before_scaling['b'] = ls2
'''
a b
0 1 5
1 2 6
2 3 7
3 4 8
4 5 9
5 6 10
6 7 11
7 8 12
8 9 13
'''
from sklearn.preprocessing import StandardScaler,MinMaxScaler
ss = StandardScaler()
after_scaling = pd.DataFrame(ss.fit_transform(before_scaling), columns=['a','b'])
'''
a b
0 -1.549193 -1.549193
1 -1.161895 -1.161895
2 -0.774597 -0.774597
3 -0.387298 -0.387298
4 0.000000 0.000000
5 0.387298 0.387298
6 0.774597 0.774597
7 1.161895 1.161895
8 1.549193 1.549193
'''
If a regression model (linear regression) is built using the above 2 independent variables, then I believe that fitting it on the before_scaling and after_scaling DataFrames will produce different fits and results.
If yes, then why do we use feature scaling? And if we apply feature scaling to individual columns one by one, will it also produce the same results?

This happens because the fit_transform function works as follows:
For each feature you have ('a' and 'b' in your case) it applies this equation:
X = (X - MEAN) / STD
where MEAN is the mean of the feature and STD is the standard deviation.
The first feature a has a mean of 5 and a std of 2.738613, while feature b has a mean of 9 and the same std of 2.738613 (strictly speaking, StandardScaler divides by the population standard deviation, ddof=0, roughly 2.582 here, but what matters is that it is identical for both columns). So if you subtract from each value the mean of its corresponding feature, you get two identical columns, and since the std is also equal for both features you end up with an identical transformation.
before_scaling['a'] = before_scaling['a'] - before_scaling['a'].mean()
before_scaling['b'] = before_scaling['b'] - before_scaling['b'].mean()
print(before_scaling)
a b
0 -4.0 -4.0
1 -3.0 -3.0
2 -2.0 -2.0
3 -1.0 -1.0
4 0.0 0.0
5 1.0 1.0
6 2.0 2.0
7 3.0 3.0
8 4.0 4.0
Finally be aware that the last value in the arange function is not included.
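As a quick sanity check, the transformation can be reproduced by hand (a minimal sketch; note that StandardScaler divides by the population standard deviation, ddof=0):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1, 10), 'b': np.arange(5, 14)})

# (x - mean) / std with ddof=0, which is what StandardScaler does per column
manual = (df - df.mean()) / df.std(ddof=0)
print(manual.round(6))
# Both columns come out identical; the first row is -1.549193 in both 'a' and 'b',
# matching the StandardScaler output above.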

After waiting for some time and not getting an answer, I tried it myself, and now I have the answer.
After scaling, different columns may end up with the same values if those columns have the same distribution. The reason the model is able to produce the same results even though the feature values changed after scaling is that the model adjusts the weights (coefficients) accordingly.
# After scaling with StandardScaler.
# X_a is the scaled feature matrix (shown below); 0.5, 0.5 and b appear to be the
# fitted coefficients and intercept, and sc2 a scaler fitted on the target.
b = -1.38777878e-17
t = 0.5 * X_a[0,0] + 0.5 * X_a[0,1] + b
t = np.array(t).reshape(-1,1)
sc2.inverse_transform(t)
# out 31.5
'''
X_a
array([[-1.64750894, -1.64750894],
[-1.47408695, -1.47408695],
[-1.30066495, -1.30066495],
[-1.12724296, -1.12724296],
[-0.95382097, -0.95382097],
[-0.78039897, -0.78039897],
[-0.60697698, -0.60697698],
[-0.43355498, -0.43355498],
[-0.26013299, -0.26013299],
[-0.086711 , -0.086711 ],
[ 0.086711 , 0.086711 ],
[ 0.26013299, 0.26013299],
[ 0.43355498, 0.43355498],
[ 0.60697698, 0.60697698],
[ 0.78039897, 0.78039897],
[ 0.95382097, 0.95382097],
[ 1.12724296, 1.12724296],
[ 1.30066495, 1.30066495],
[ 1.47408695, 1.47408695],
[ 1.64750894, 1.64750894]])
'''
# Before scaling: the coefficients fitted on the raw data give the same prediction
2.25 * X_b[0,0] + 2.25 * X_b[0,1] + 6.75
# out 31.5
'''
X_b
array([[ 1, 10],
[ 2, 11],
[ 3, 12],
[ 4, 13],
[ 5, 14],
[ 6, 15],
[ 7, 16],
[ 8, 17],
[ 9, 18],
[10, 19],
[11, 20],
[12, 21],
[13, 22],
[14, 23],
[15, 24],
[16, 25],
[17, 26],
[18, 27],
[19, 28],
[20, 29]], dtype=int64)
'''
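To make the experiment reproducible end to end, here is a minimal self-contained sketch (the target y below is an arbitrary linear combination chosen only for illustration): the fitted coefficients differ between the raw and the standardized features, but the predictions are identical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({'a': np.arange(1, 21), 'b': np.arange(10, 30)})
y = 3 * X['a'] + 1.5 * X['b'] + 7          # arbitrary target, for illustration only

lr_raw = LinearRegression().fit(X, y)      # fit on the raw features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lr_scaled = LinearRegression().fit(X_scaled, y)   # fit on the standardized features

print(lr_raw.coef_, lr_raw.intercept_)            # coefficients on the raw scale
print(lr_scaled.coef_, lr_scaled.intercept_)      # different coefficients after scaling
print(np.allclose(lr_raw.predict(X), lr_scaled.predict(X_scaled)))   # True: identical predictions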

Related

Finding the summation of values from two pandas dataframe columns

I have a pandas dataframe like below
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
I am trying to compute the value of this expression.
I don't know how to multiply the first value in one column by the 2nd value in another column, as the expression requires.
Try pd.DataFrame.shift() but I think you need to enter -1 into shift judging by the summation notation you posted. i + 1 implies using the next x or y, so shift needs to use a negative integer to shift 1 number ahead. Positive integers in shift go backwards.
Can you confirm 320 is the right answer?
0.5 * ((df.x * df.y.shift(-1)) - (df.x.shift(-1) + df.y)).sum()
>>>320
I think the code below produces the correct value in expression_end:
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
df["x+1"]=df["x"].shift(periods=-1)
df["y+1"]=df["y"].shift(periods=-1)
df["exp"]=df["x"]*df["y+1"]-df["x+1"]*df["y"]
expression_end = 0.5 * df["exp"].sum()
You can use pandas.DataFrame.shift(). You can compute shift(-1) once and use it for both 'x' and 'y'.
>>> df_tmp = df.shift(-1)
>>> (df['x']*df_tmp['y'] - df_tmp['x']*df['y']).sum() * 0.5
-202.5
# Explanation
>>> df[['x+1', 'y+1']] = df.shift(-1)
>>> df
x y x+1 y+1
0 5 10 4.0 20.0 # x*(y+1) - y*(x+1) = 5*20 - 10*4
1 4 20 15.0 30.0
2 15 30 20.0 15.0
3 20 15 12.0 14.0
4 12 14 5.0 5.0
5 5 5 NaN NaN
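The same summation can also be written directly with numpy slicing, which sidesteps the NaN row that shift(-1) introduces (a small sketch using the df defined above):
import numpy as np

x = df['x'].to_numpy()
y = df['y'].to_numpy()
print(0.5 * np.sum(x[:-1] * y[1:] - x[1:] * y[:-1]))
# -202.5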

Searching in numpy array

I have a 2D numpy array, say A sorted with respect to Column 0. e.g.
Col.0    Col.1    Col.2
10       2.45     3.25
11       2.95     4
12       3.45     4.25
15       3.95     5
18       4.45     5.25
21       4.95     6
23       5.45     6.25
27       5.95     7
29       6.45     7.25
32       6.95     8
35       7.45     8.25
The entries in each row are unique, i.e. Col.0 is the identification number of a coordinate in the xy plane, and Columns 1 and 2 are the x and y coordinates of these points.
I have another array B (rows can contain duplicate data). Column 0 and Column 1 store x and y co-ordinates.
Col.0    Col.1
2.45     3.25
4.45     5.25
6.45     7.25
2.45     3.25
My aim is to find the row index number in array A corresponding to each row of data in array B, without using a for loop. So, in this case, my output should be [0,4,8,0].
Now, I know that with numpy searchsorted a lookup for multiple values can be done in one shot. But it can only compare against a single column of A, not multiple columns. Is there a way to do this?
Pure numpy solution:
My intuition is to take the difference between a[:,1:] and b by broadcasting, giving an array of shape (11, 4, 2); the rows that match are all zeros. Comparing that difference to False (i.e. zero) gives a boolean mask c. Then c.all(2) results in a boolean array of shape (11, 4), where True elements represent matches between a and b. Finally I use nonzero to obtain the indices of those elements.
import numpy as np
a = np.array([
[10, 2.45, 3.25],
[11, 2.95, 4],
[12, 3.45, 4.25],
[15, 3.95, 5],
[18, 4.45, 5.25],
[21, 4.95, 6],
[23, 5.45, 6.25],
[27, 5.95, 7],
[29, 6.45, 7.25],
[32, 6.95, 8],
[35, 7.45, 8.25],
])
b = np.array([
[2.45, 3.25],
[4.45, 5.25],
[6.45, 7.25],
[2.45, 3.25],
])
c = (a[:,np.newaxis,1:]-b) == False
rows, cols = c.all(2).nonzero()
print(rows[cols.argsort()])
# [0 4 8 0]
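Note that the mask above relies on the float differences being exactly zero; a tolerance-based variant with np.isclose is slightly more robust (a sketch reusing a and b from above):
c = np.isclose(a[:, np.newaxis, 1:], b)   # boolean array of shape (11, 4, 2)
rows, cols = c.all(2).nonzero()
print(rows[cols.argsort()])
# [0 4 8 0]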
You can use merge in pandas (treating A and B as the DataFrames df1 and df2):
df2.merge(df1.reset_index(),how='left',left_on=['Col.0','Col.1'],right_on=['Col.1','Col.2'])['index']
output:
0 0
1 4
2 8
3 0
Name: index, dtype: int64
and if you want it as an array:
df2.merge(df1.reset_index(),how='left',left_on=['Col.0','Col.1'],right_on=['Col.1','Col.2'])['index'].to_numpy()
#array([0, 4, 8, 0])

Checking if the points fall within circles

I have a long list of H-points with known coordinates. I also have a list of TP-points. I'd like to know whether each H-point falls within any(!) TP-point's circle of a certain radius (e.g. r=5).
import pandas as pd

dfPoints = pd.DataFrame({'H-points': ['a', 'b', 'c', 'd', 'e'],
                         'Xh': [10, 35, 52, 78, 9],
                         'Yh': [15, 5, 11, 20, 10]})
dfTrafaPostaje = pd.DataFrame({'TP-points': ['a', 'b', 'c'],
                               'Xt': [15, 25, 35],
                               'Yt': [15, 25, 35],
                               'M': [5, 2, 3]})
def inside_circle(x, y, a, b, r):
    return (x - a)*(x - a) + (y - b)*(y - b) < r*r
I've started, but it would be much easier to check this against only one TP-point. If I have e.g. 1500 of them and 30,000 H-points, then I need a more general solution.
Can anyone help?
Another option is to use distance_matrix from scipy.spatial:
from scipy.spatial import distance_matrix

dist_mat = distance_matrix(dfPoints[['Xh', 'Yh']], dfTrafaPostaje[['Xt', 'Yt']])
dfPoints[np.min(dist_mat, axis=1) < 5]
Took about 2s for 1500 dfPoints and 30000 dfTrafaPostaje.
Update: to get the index of the reference point with the highest score:
dist_mat = distance_matrix(dfPoints[['Xh', 'Yh']], dfTrafaPostaje[['Xt', 'Yt']])
# get the M scores of those within range
M_mat = pd.DataFrame(np.where(dist_mat <= 5, dfTrafaPostaje['M'].values[None, :], np.nan),
                     index=dfPoints['H-points'],
                     columns=dfTrafaPostaje['TP-points'])
# get the points with the largest M values,
# masked with np.nan for those outside the range
dfPoints['TP'] = np.where(M_mat.notnull().any(1), M_mat.idxmax(1), np.nan)
For the included sample data:
H-points Xh Yh TP
0 a 10 15 a
1 b 35 5 NaN
2 c 52 11 NaN
3 d 78 20 NaN
4 e 9 10 NaN
You could use cdist from scipy to compute the pairwise distances, then create a mask with True where distance is less than radius, and finally filter:
import pandas as pd
from scipy.spatial.distance import cdist
dfPoints = pd.DataFrame({'H-points': ['a', 'b', 'c', 'd', 'e'],
'Xh': [10, 35, 52, 78, 9],
'Yh': [15, 5, 11, 20, 10]})
dfTrafaPostaje = pd.DataFrame({'TP-points': ['a', 'b', 'c'],
'Xt': [15, 25, 35],
'Yt': [15, 25, 35]})
radius = 5
distances = cdist(dfPoints[['Xh', 'Yh']].values, dfTrafaPostaje[['Xt', 'Yt']].values, 'sqeuclidean')
mask = (distances <= radius*radius).sum(axis=1) > 0 # create mask
print(dfPoints[mask])
Output
H-points Xh Yh
0 a 10 15
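The same check can also be done with plain numpy broadcasting, essentially vectorizing the inside_circle test from the question over all point pairs (a rough sketch using the DataFrames defined above):
import numpy as np

h = dfPoints[['Xh', 'Yh']].to_numpy()             # (n_H, 2)
tp = dfTrafaPostaje[['Xt', 'Yt']].to_numpy()      # (n_TP, 2)

r = 5
# squared distances between every H-point and every TP-point, shape (n_H, n_TP)
sq_dist = ((h[:, np.newaxis, :] - tp[np.newaxis, :, :]) ** 2).sum(axis=2)
inside_any = (sq_dist < r * r).any(axis=1)        # same condition as inside_circle

print(dfPoints[inside_any])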

New column containing header of another column based on conditionals

I'm relatively new to Python and I feel this is a complex task.
From dfa:
I'm trying to return the smallest and second-smallest values from a range of columns (dist1 through dist5) and return the name of the column these values came from (e.g. "dist3"), placing this information into 4 new columns. A given distX column will contain a mix of numbers and NaN, either as the string 'NaN' or as np.nan.
dfa = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
'dist1': ['NaN',2,'NaN','NaN', 30],
'dist2': [20, 21, 22, 23, 'NaN'],
'dist3': [120, 'NaN', 122, 123, 11],
'dist4': [40, 'NaN', 42, 43, 'NaN'],
'dist5': ['NaN',1,'NaN','NaN', 70]})
Task 1) I want to add two new columns "fir_closest" and "fir_closest_dist".
fir_closest_dist should contain the smallest value from columns dist1 through to dist5 (i.e. 20 for row 1, 11 for row 5).
fir_closest should contain the name of the column that the value in fir_closest_dist came from (i.e. "dist2" for the first row)
Task 2) Repeat the above but for the second/next smallest value to create two new columns "sec_closest" and "sec_closest_dist"
Output table needs to look like dfb
dfb = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
'dist1': ['NaN',2,'NaN','NaN', 30],
'dist2': [20, 21, 22, 23, 'NaN'],
'dist3': [120, 'NaN', 122, 123, 11],
'dist4': [40, 'NaN', 42, 43, 'NaN'],
'dist5': ['NaN',1,'NaN','NaN', 70],
'fir_closest': ['dist2','dist5','dist2','dist2', 'dist3'],
'fir_closest_dist': [20,1,22,23,11],
'sec_closest': ['dist4','dist1','dist4','dist4', 'dist1'],
'sec_closest_dist': [40,2,42,43,30]})
Please can you show code or explain how best to approach this. What is the name for this method of populating new columns?
Thanks in advance
I think this may do what you need.
import pandas as pd
import numpy as np
#Reproducibility and data generation for example
np.random.seed(0)
X = np.random.randint(low = 0, high = 10, size = (5,5))
#Your data
df = pd.DataFrame(X, columns = [f'dist{j}' for j in range(5)])
# Row indices (one per row of the data)
ix = range(df.shape[0])
col_names = df.columns.values
#Find arg of kth smallest
arg_row_min,arg_row_min2,*rest = np.argsort(df.values, axis = 1).T
df['dist_min'] = col_names[arg_row_min]
df['num_min'] = df.values[ix,arg_row_min]
df['dist_min2'] = col_names[arg_row_min2]
df['num_min2'] = df.values[ix,arg_row_min2]
Assuming your DataFrame is named df, and you have run import pandas as pd and import numpy as np:
# Example data
df = pd.DataFrame({'date': pd.date_range('2017-04-15', periods=5),
'name': ['Mullion']*5,
'dist1': [np.nan, np.nan, 30, 20, 15],
'dist2': [40, 30, 20, 15, 16],
'dist3': [101, 100, 98, 72, 11]})
df
date dist1 dist2 dist3 name
0 2017-04-15 NaN 40 101 Mullion
1 2017-04-16 NaN 30 100 Mullion
2 2017-04-17 30.0 20 98 Mullion
3 2017-04-18 20.0 15 72 Mullion
4 2017-04-19 15.0 16 11 Mullion
# Select only those columns with numeric data types. In your case, this is
# the same as:
# df_num = df[['dist1', 'dist2', ...]].copy()
df_num = df.select_dtypes(np.number)
# Get the column index of each row's minimum distance. First, fill NaN with
# numpy's infinity placeholder to ensure that NaN distances are never chosen.
idxs = df_num.fillna(np.inf).values.argsort(axis=1)
# The 1st column of idxs (which is idxs[:, 0]) contains the column index of
# each row's smallest distance.
# The 2nd column of idxs (which is idxs[:, 1]) contains the column index of
# each row's second-smallest distance.
# Convert the index of each row's closest distance to a column name.
# (df.columns is a list-like that holds the column names of df.)
df['closest_name'] = df_num.columns[idxs[:, 0]]
# Now get the distances themselves by indexing the underlying numpy array
# of values. There may be a more pandas-specific way of doing this, but
# this should be very fast.
df['closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 0]]
# Same idea for the second-closest distances.
df['second_closest_name'] = df_num.columns[idxs[:, 1]]
df['second_closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 1]]
df
date dist1 dist2 dist3 name closest_name closest_dist \
0 2017-04-15 NaN 40 101 Mullion dist2 40.0
1 2017-04-16 NaN 30 100 Mullion dist2 30.0
2 2017-04-17 30.0 20 98 Mullion dist2 20.0
3 2017-04-18 20.0 15 72 Mullion dist2 15.0
4 2017-04-19 15.0 16 11 Mullion dist3 11.0
second_closest_name second_closest_dist
0 dist3 101.0
1 dist3 100.0
2 dist1 30.0
3 dist1 20.0
4 dist1 15.0
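One practical note: in the question's dfa the missing values are the string 'NaN', so the dist columns need to be converted to numeric before either approach above will work, for example:
dist_cols = ['dist1', 'dist2', 'dist3', 'dist4', 'dist5']
dfa[dist_cols] = dfa[dist_cols].apply(pd.to_numeric, errors='coerce')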

Manipulate pandas.DataFrame with multiple criteria

For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
'DayofWeek': [1, 1, 3, 2, 4, 2],
'Hour_Bucket': [1, 5, 7, 4, 3, 12],
'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to perform functions on 'Values' where "Value_Bucket" equals 5, for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), and each cell is filled with the result of a function (say, the average). I can use a groupby for one criterion; can someone explain how to group by two criteria and tabulate the result in a table?
query to subset
groupby
unstack
df.query('Value_Bucket == 5').groupby(
['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek 1 3
Hour_Bucket
1 1.0 NaN
5 1.5 NaN
7 NaN 2.0
If you want to have zeros instead of NaN
df.query('Value_Bucket == 5').groupby(
['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek 1 3
Hour_Bucket
1 1.0 0.0
5 1.5 0.0
7 0.0 2.0
Pivot tables seem more natural to me than groupby paired with unstack though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
index='Hour_Bucket',
columns='DayofWeek',
values='Values',
aggfunc='mean',
fill_value=0)
Output
DayofWeek 1 3
Hour_Bucket
1 1.0 0
5 1.5 0
7 0.0 2
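If you need the full 24 x 7 grid described in the question (every Hour_Bucket and DayofWeek, even those not present in the data), you can reindex the result; a sketch assuming hours 0-23 and weekdays 1-7:
full = (df.query('Value_Bucket == 5')
          .groupby(['Hour_Bucket', 'DayofWeek']).Values.mean()
          .unstack(fill_value=0)
          .reindex(index=range(24), columns=range(1, 8), fill_value=0))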
