I have a 2D numpy array, say A sorted with respect to Column 0. e.g.
Col.0  Col.1  Col.2
10     2.45   3.25
11     2.95   4
12     3.45   4.25
15     3.95   5
18     4.45   5.25
21     4.95   6
23     5.45   6.25
27     5.95   7
29     6.45   7.25
32     6.95   8
35     7.45   8.25
The entries in each row are unique; Col. 0 is the identification number of a co-ordinate in the xy plane, and Columns 1 and 2 are the x and y co-ordinates of these points.
I have another array B (rows can contain duplicate data). Column 0 and Column 1 store x and y co-ordinates.
Col.0  Col.1
2.45   3.25
4.45   5.25
6.45   7.25
2.45   3.25
My aim is to find, for each row of array B, the row index in array A that holds the matching coordinates, without using a for loop. In this case the output should be [0, 4, 8, 0].
Now, I know that numpy's searchsorted can look up multiple values in one shot. But it compares against a single column of A, not multiple columns. Is there a way to do this?
Pure numpy solution:
My intuition is to take the difference between a[:, 1:] and b by broadcasting, giving an array of shape (11, 4, 2) in which matching rows are all zeros. Comparing that difference to False (i.e. 0) yields a boolean mask, and .all(2) reduces it to a boolean array of shape (11, 4) where True marks a match between a row of a and a row of b. Then I simply use np.nonzero to obtain the indices of those elements.
import numpy as np

a = np.array([
    [10, 2.45, 3.25],
    [11, 2.95, 4],
    [12, 3.45, 4.25],
    [15, 3.95, 5],
    [18, 4.45, 5.25],
    [21, 4.95, 6],
    [23, 5.45, 6.25],
    [27, 5.95, 7],
    [29, 6.45, 7.25],
    [32, 6.95, 8],
    [35, 7.45, 8.25],
])
b = np.array([
    [2.45, 3.25],
    [4.45, 5.25],
    [6.45, 7.25],
    [2.45, 3.25],
])
c = (a[:, np.newaxis, 1:] - b) == False
rows, cols = c.all(2).nonzero()
print(rows[cols.argsort()])
# [0 4 8 0]
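One caveat: the comparison above relies on the float differences being exactly zero, which can be fragile with real-world coordinates. A tolerance-based variant (a sketch using np.isclose) looks like this:
# Compare coordinates with a tolerance instead of exact equality
c = np.isclose(a[:, np.newaxis, 1:], b)   # shape (11, 4, 2), True where coordinates match
rows, cols = c.all(2).nonzero()
print(rows[cols.argsort()])
# [0 4 8 0]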
You can use merge in pandas:
df2.merge(df1.reset_index(),how='left',left_on=['Col.0','Col.1'],right_on=['Col.1','Col.2'])['index']
output:
0 0
1 4
2 8
3 0
Name: index, dtype: int64
and if you want it as an array:
df2.merge(df1.reset_index(),how='left',left_on=['Col.0','Col.1'],right_on=['Col.1','Col.2'])['index'].to_numpy()
#array([0, 4, 8, 0])
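For completeness, df1 and df2 here are assumed to be DataFrames built from the arrays a and b shown in the numpy answer above, with df1 keeping its default 0..10 index so that 'index' gives the row number (a minimal sketch):
import pandas as pd

df1 = pd.DataFrame(a, columns=['Col.0', 'Col.1', 'Col.2'])   # lookup table A
df2 = pd.DataFrame(b, columns=['Col.0', 'Col.1'])            # query coordinates B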
I have a pandas dataframe like below
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
I am trying to compute the value of this expression: 0.5 * Σ (x[i] * y[i+1] - x[i+1] * y[i]).
I haven't got an idea how to multiply the first value in one column by the second value in another column, as the expression requires.
Try pd.DataFrame.shift(), but I think you need to pass -1 to shift, judging by the summation notation you posted: i + 1 implies using the next x or y, so shift needs a negative integer to pull values one row ahead. Positive integers in shift go backwards.
Can you confirm -202.5 is the right answer?
0.5 * ((df.x * df.y.shift(-1)) - (df.x.shift(-1) * df.y)).sum()
>>> -202.5
I think the code below computes the correct value in expresion_end:
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
df["x+1"]=df["x"].shift(periods=-1)
df["y+1"]=df["y"].shift(periods=-1)
df["exp"]=df["x"]*df["y+1"]-df["x+1"]*df["y"]
expresion_end=0.5*df["exp"].sum()
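Running this on the example data gives:
print(expresion_end)
# -202.5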
You can use pandas.DataFrame.shift(). You only need to compute shift(-1) once and can reuse it for both 'x' and 'y'.
>>> df_tmp = df.shift(-1)
>>> (df['x']*df_tmp['y'] - df_tmp['x']*df['y']).sum() * 0.5
-202.5
# Explanation
>>> df[['x+1', 'y+1']] = df.shift(-1)
>>> df
x y x+1 y+1
0 5 10 4.0 20.0 # x*(y+1) - y*(x+1) = 5*20 - 10*4
1 4 20 15.0 30.0
2 15 30 20.0 15.0
3 20 15 12.0 14.0
4 12 14 5.0 5.0
5 5 5 NaN NaN
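As a sanity check, the same value can be computed with a plain Python loop over consecutive rows (a quick sketch reusing the data list from above):
total = 0.0
for i in range(len(data) - 1):            # pair each row with the next one
    (x0, y0), (x1, y1) = data[i], data[i + 1]
    total += x0 * y1 - x1 * y0
print(0.5 * total)                        # -202.5, matching the shift-based versions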
I'm trying to calculate a rolling average on some incomplete data. I want to average values in column 2 across windows of size 1.0 of the value in column 1 (miles). I've tried .rolling(), but (from my limited understanding) this only creates windows based on the index, and not on column values.
import pandas as pd
import numpy as np
df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
])
averages = []
for index in range(len(df)):
    nearby = df.loc[np.abs(df[0] - df.loc[index][0]) <= 0.5]
    averages.append(nearby[1].mean())
df['rollingAve'] = averages
Gives the desired output:
0 1 rollingAve
0 4.5 10 10.0
1 4.6 11 10.0
2 4.8 9 10.0
3 5.5 6 6.0
4 5.6 6 6.0
5 8.1 10 11.5
6 8.2 13 11.5
But this slows down substantially for big dataframes. Is there a way to implement .rolling() with varying window sizes, or something similar?
Pandas' BaseIndexer is quite handy, although it takes a little bit of head-scratching to get right.
In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:
import numpy as np
from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        self.val = val.values
        self.width = width

    def get_window_bounds(self, num_values, min_periods, center, closed):
        if min_periods is None: min_periods = 0
        if closed is None: closed = 'left'
        w = (-self.width / 2, self.width / 2) if center else (0, self.width)
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)
        return ix0, ix1
Some deluxe options: min_periods, center, and closed are implemented according to what the DataFrame.rolling specifies.
Application:
df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
], columns='a b'.split())
df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()
# gives:
0 10.0
1 10.0
2 10.0
3 6.0
4 6.0
5 11.5
6 11.5
Name: b, dtype: float64
Timing:
df = pd.DataFrame(
np.random.uniform(0, 1000, size=(1_000_000, 2)),
columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)
%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()
CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
Update on performance:
Following a comment from @anon01, I was wondering whether one could go faster when the rolling involves large windows. It turns out I should have measured pandas' rolling mean and sum performance first... (premature optimization, anyone?) See the end for why.
Anyway, the idea was to do a cumsum just once, then take the difference of the elements addressed by the window endpoints:
# both below working on numpy arrays:
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width / 2, side='left')
    ix1 = np.searchsorted(a, a + width / 2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width / 2, side='left')
    ix1 = np.searchsorted(a, a + width / 2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)
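Applied to the small example frame from the question, this reproduces the desired output (a quick sketch; df_small is just the question's data under an assumed name):
df_small = pd.DataFrame(
    [[4.5, 10], [4.6, 11], [4.8, 9], [5.5, 6], [5.6, 6], [8.1, 10], [8.2, 13]],
    columns=['a', 'b'],
)
print(fast_rolling_mean(df_small['a'].values, df_small['b'].values, width=1.0))
# [10.  10.  10.   6.   6.  11.5 11.5]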
With this (and the 1-million rows df above), I see:
%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
versus:
%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
However!!! Pandas is likely already doing such an optimization (it's a pretty obvious one). The timings don't increase with larger windows (which is why I was saying I should have checked first).
df.rolling and Series.rolling do allow value-based windows if the index is a DatetimeIndex or TimedeltaIndex. You can use this to get close to the desired result:
df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)
output:
0 1 rolling_mean
0 4.5 10 10.000000
1 4.6 11 10.500000
2 4.8 9 10.000000
3 5.5 6 8.666667
4 5.6 6 7.000000
5 8.1 10 10.000000
6 8.2 13 11.500000
Advantages
This is a three-line solution that should have great performance, leveraging pandas' datetime backend.
Disadvantages
This is definitely a hack, casting your miles column to time-delta seconds, and the average isn't centered (center isn't implemented for datetimelike and offset based windows).
Overall: if you value performance and can live with a non-centered mean, this would be a great way to go with a comment or two.
Scaling (for example with StandardScaler) converts columns with different values into columns with mean = 0 and std = 1, so values that were different before scaling can end up identical. Shouldn't this affect the model fit and the results when a model is built on the scaled data?
I have taken a toy pandas DataFrame with the first column running from 1 to 10 and the second from 5 to 14, and scaled both using StandardScaler.
import numpy as np
import pandas as pd

ls1 = np.arange(1,10)
ls2 = np.arange(5,14)
before_scaling = pd.DataFrame()
before_scaling['a'] = ls1
before_scaling['b'] = ls2
'''
a b
0 1 5
1 2 6
2 3 7
3 4 8
4 5 9
5 6 10
6 7 11
7 8 12
8 9 13
'''
from sklearn.preprocessing import StandardScaler,MinMaxScaler
ss = StandardScaler()
after_scaling = pd.DataFrame(ss.fit_transform(before_scaling), columns=['a', 'b'])
'''
a b
0 -1.549193 -1.549193
1 -1.161895 -1.161895
2 -0.774597 -0.774597
3 -0.387298 -0.387298
4 0.000000 0.000000
5 0.387298 0.387298
6 0.774597 0.774597
7 1.161895 1.161895
8 1.549193 1.549193
'''
If a regression model (say, linear regression) is built using the above two independent variables, I believe fitting it on the before_scaling and after_scaling DataFrames will produce different fits and results.
If yes, then why do we use feature scaling? And if we apply feature scaling to individual columns one by one, will that also produce the same results?
This happens because the fit_transform function works as follows:
For each feature ('a' and 'b' in your case) it applies this equation:
X = (X - MEAN) / STD
where MEAN is the mean of the feature and STD is its standard deviation.
The first feature a has a mean of 5 and a std of 2.738613, while feature b has a mean of 9 and a std of 2.738613. So if you subtract from each value the mean of its corresponding feature you get two identical features, and since the std is the same for both features you end up with an identical transformation.
before_scaling['a'] = before_scaling['a'] - before_scaling['a'].mean()
before_scaling['b'] = before_scaling['b'] - before_scaling['b'].mean()
print(before_scaling)
a b
0 -4.0 -4.0
1 -3.0 -3.0
2 -2.0 -2.0
3 -1.0 -1.0
4 0.0 0.0
5 1.0 1.0
6 2.0 2.0
7 3.0 3.0
8 4.0 4.0
Finally, be aware that the last value passed to arange is not included.
After waiting for some time and not getting an answer, I tried it myself and found the answer.
After scaling, different columns can end up with the same values if their distributions are the same. The reason the model is able to produce the same results with the changed feature values after scaling is that it adjusts the weights of the coefficients.
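The snippets below refer to X_a, X_b and sc2 that were created earlier in my session. A minimal sketch of that kind of setup (the target y and the variable names are assumptions for illustration, not the exact data I used):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Raw features: two perfectly correlated columns (before scaling)
X_b = np.column_stack([np.arange(1, 21), np.arange(10, 30)])
y = X_b.sum(axis=1) + 9.0                      # an arbitrary illustrative target

# Scale the features and the target separately
sc1, sc2 = StandardScaler(), StandardScaler()
X_a = sc1.fit_transform(X_b)                   # scaled features
y_a = sc2.fit_transform(y.reshape(-1, 1))      # scaled target

# Fit once on the raw data and once on the scaled data
lr_raw = LinearRegression().fit(X_b, y)
lr_scaled = LinearRegression().fit(X_a, y_a)

# The coefficients differ, but the predictions agree once the target scaling is undone
pred_scaled = sc2.inverse_transform(lr_scaled.predict(X_a[:1]))
print(lr_raw.predict(X_b[:1]), pred_scaled)    # same prediction either way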
# After scaling with Standard Scaler
b = -1.38777878e-17                        # learned intercept (effectively zero)
t = 0.5 * X_a[0,0] + 0.5 * X_a[0,1] + b    # prediction for the first scaled row; both coefficients are 0.5
t = np.array(t).reshape(-1,1)
sc2.inverse_transform(t)                   # undo the target scaling to recover the original-scale prediction
# out 31.5
'''
X_a
array([[-1.64750894, -1.64750894],
[-1.47408695, -1.47408695],
[-1.30066495, -1.30066495],
[-1.12724296, -1.12724296],
[-0.95382097, -0.95382097],
[-0.78039897, -0.78039897],
[-0.60697698, -0.60697698],
[-0.43355498, -0.43355498],
[-0.26013299, -0.26013299],
[-0.086711 , -0.086711 ],
[ 0.086711 , 0.086711 ],
[ 0.26013299, 0.26013299],
[ 0.43355498, 0.43355498],
[ 0.60697698, 0.60697698],
[ 0.78039897, 0.78039897],
[ 0.95382097, 0.95382097],
[ 1.12724296, 1.12724296],
[ 1.30066495, 1.30066495],
[ 1.47408695, 1.47408695],
[ 1.64750894, 1.64750894]])
'''
# Before scaling
2.25 * X_b[0,0] + 2.25 * X_b[0,1] + 6.75   # same first-row prediction from the unscaled coefficients and intercept
# out 31.5
'''
X_b
array([[ 1, 10],
[ 2, 11],
[ 3, 12],
[ 4, 13],
[ 5, 14],
[ 6, 15],
[ 7, 16],
[ 8, 17],
[ 9, 18],
[10, 19],
[11, 20],
[12, 21],
[13, 22],
[14, 23],
[15, 24],
[16, 25],
[17, 26],
[18, 27],
[19, 28],
[20, 29]], dtype=int64)
'''
I'm new to numpy and am trying to do some slicing and indexing with arrays. My goal is to take an array, and use slicing and indexing to square the last column, and then subtract the first column from that result. I then want to put the new column back into the old array.
I've been able to figure out how to slice and index the column to get the result I want for the last column. My problem however is that when I try to put it back into my original array, I get the wrong output (as seen below).
theNumbers = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
sliceColumnOne = theNumbers[:,0]
sliceColumnThree = theNumbers[:,3]**2
editColumnThree = sliceColumnThree - sliceColumnOne
newArray = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[editColumnThree]])
print("nums:\n{}".format(newArray))
I want the output to be
[[ 1 2 3 15]
[ 5 6 7 59]
[ 9 10 11 135]
[ 13 14 15 243]]
However mine becomes:
[list([1, 2, 3, 4]) list([5, 6, 7, 8]) list([9, 10, 11, 12])
list([array([ 15, 59, 135, 243])])]
Any suggestions on how to fix this?
Just assign editColumnThree to the last row of the array: theNumbers[3] = editColumnThree
Code:
import numpy as np
theNumbers = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
sliceColumnOne = theNumbers[:,0]
sliceColumnThree = theNumbers[:,3]**2
editColumnThree = sliceColumnThree - sliceColumnOne
theNumbers[3] = editColumnThree
print("nums:\n{}".format(theNumbers))
Output:
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
[ 15 59 135 243]]
newArray = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[editColumnThree]])
print("nums:\n{}".format(newArray))
written this way, editColumnThree ends up as the last row, not the last column. You can instead use
newArray = theNumbers.copy() # if a copy is needed
newArray[:,-1] = editColumnThree # replace last (-1) column
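Putting it together with the array from the question (a quick check):
import numpy as np

theNumbers = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
editColumnThree = theNumbers[:,3]**2 - theNumbers[:,0]
newArray = theNumbers.copy()
newArray[:,-1] = editColumnThree
print(newArray)
# [[  1   2   3  15]
#  [  5   6   7  59]
#  [  9  10  11 135]
#  [ 13  14  15 243]]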
If you just want to stack the vectors on top of each other, use vstack:
import numpy as np
theNumbers = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
newNumbers = np.vstack(theNumbers)
print(newNumbers)
>>>[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
[13 14 15 16]]
But the issue here isn't just that you need to stack these numbers: you are mixing up columns and rows, changing a row instead of a column. To change the column, update the last element in each row:
import numpy as np

theNumbers = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
LastColumn = theNumbers[:,3]**2
FirstColumn = theNumbers[:,0]
editColumnThree = LastColumn - FirstColumn
for i in range(4):
    theNumbers[i,3] = editColumnThree[i]
print(theNumbers)
>>>[[ 1 2 3 15]
[ 5 6 7 59]
[ 9 10 11 135]
[ 13 14 15 243]]
I have a data frame with the columns {'day', 'measurement'}.
There might be several measurements in a day, or no measurements at all.
For example:
day | measurement
1 | 20.1
1 | 20.9
3 | 19.2
4 | 20.0
4 | 20.2
and a dictionary of coefficients:
coef={-1:0.2, 0:0.6, 1:0.2}
My goal is to resample the data and average it using the coefficients (missing days should be left out).
This is the code I wrote to calculate that, run once for every day d:
window = [-1, 0, 1]
df['resampled_measurement'][df['day']==d] = sum(coef[i]*df['measurement'][df['day']==d-i].mean() for i in window if df['measurement'][df['day']==d-i].shape[0] > 0)
df['resampled_measurement'][df['day']==d] /= sum(coef[i] for i in window if df['measurement'][df['day']==d-i].shape[0] > 0)
For the example above, the output should be:
Day measurement
1 20.500
2 19.850
3 19.425
4 19.875
The problem is that the code runs forever, and I'm pretty sure that there's a better way to resample with coefficients.
Any advice would be highly appreciated!
Here's a possible solution to what you're looking for:
# This is your data
In [2]: data = pd.DataFrame({
...: 'day': [1, 1, 3, 4, 4],
...: 'measurement': [20.1, 20.9, 19.2, 20.0, 20.2]
...: })
# Pre-compute every day's average, filling the gaps
In [3]: measurement = data.groupby('day')['measurement'].mean()
In [4]: measurement = measurement.reindex(range(data.day.min(), data.day.max() + 1))
In [5]: coef = pd.Series({-1: 0.2, 0: 0.6, 1: 0.2})
# Create a matrix with the time-shifted measurements
In [6]: matrix = pd.DataFrame({key: measurement.shift(key) for key in coef.index})
In [7]: matrix
Out[7]:
-1 0 1
day
1 NaN 20.5 NaN
2 19.2 NaN 20.5
3 20.1 19.2 NaN
4 NaN 20.1 19.2
# Take a weighted average of the matrix
In [8]: (matrix * coef).sum(axis=1) / (matrix.notnull() * coef).sum(axis=1)
Out[8]:
day
1 20.500
2 19.850
3 19.425
4 19.875
dtype: float64
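Wrapped up as a small reusable function, the same steps look like this (a sketch; the function name is mine, and data and coef are as defined above):
import numpy as np
import pandas as pd

def coef_resample(data, coef):
    # Daily means, with missing days reindexed in as NaN
    daily = data.groupby('day')['measurement'].mean()
    daily = daily.reindex(np.arange(data['day'].min(), data['day'].max() + 1))
    weights = pd.Series(coef)
    # One column per offset, holding the time-shifted daily means
    matrix = pd.DataFrame({k: daily.shift(k) for k in weights.index})
    # Weighted average, renormalised by the weights of the non-missing entries
    return (matrix * weights).sum(axis=1) / (matrix.notnull() * weights).sum(axis=1)

print(coef_resample(data, {-1: 0.2, 0: 0.6, 1: 0.2}))
# matches the expected output: 20.500, 19.850, 19.425, 19.875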