I want to use scipy or pandas to interpolate on a table like this one:
df = pd.DataFrame({'x':[1,1,1,2,2,2],'y':[1,2,3,1,2,3],'z':[10,20,30,40,50,60] })
df =
x y z
0 1 1 10
1 1 2 20
2 1 3 30
3 2 1 40
4 2 2 50
5 2 3 60
I want to be able to interpolate for an x value of 1.5 and a y value of 2.5 and obtain 40.
The process would be:
Starting from the first interpolation parameter (x), find the values that surround the target value. In this case the target is 1.5 and the surrounding values are 1 and 2.
Interpolate in y for a target of 2.5 considering x=1. In this case between rows 1 and 2, obtaining 25.
Interpolate in y for a target of 2.5 considering x=2. In this case between rows 4 and 5, obtaining 55.
Interpolate the values from the previous steps to the target x value. In this case I have 25 for x=1 and 55 for x=2. The interpolated value for x=1.5 is 40.
The order in which interpolation is to be performed is fixed and the data will be correctly sorted.
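For reference, a rough sketch of that manual two-step process, using np.interp for each 1-D interpolation (column names as in the frame above):
import numpy as np

def interp_xy(df, x_t, y_t):
    # step 1: interpolate in y within each of the two x slices surrounding x_t
    x_lo, x_hi = df.x[df.x <= x_t].max(), df.x[df.x >= x_t].min()
    z_lo = np.interp(y_t, df.loc[df.x == x_lo, 'y'], df.loc[df.x == x_lo, 'z'])
    z_hi = np.interp(y_t, df.loc[df.x == x_hi, 'y'], df.loc[df.x == x_hi, 'z'])
    # step 2: interpolate the two partial results in x
    return np.interp(x_t, [x_lo, x_hi], [z_lo, z_hi])

interp_xy(df, 1.5, 2.5)   # 40.0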
I've found this question but I'm wondering if there is a standard solution already available in those libraries.
You can use scipy.interpolate.interp2d:
import scipy.interpolate
f = scipy.interpolate.interp2d(df.x, df.y, df.z)
f([1.5], [2.5])
[40.]
The first line creates an interpolation function z = f(x, y) using three arrays for x, y, and z. The second line uses this function to interpolate for z given values for x and y. The default is linear interpolation.
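Note that interp2d was deprecated in SciPy 1.10 and removed in later releases, so on a current SciPy you may need an alternative. A minimal sketch with scipy.interpolate.RegularGridInterpolator, assuming the data forms a complete regular grid as in the example:
import numpy as np
from scipy.interpolate import RegularGridInterpolator

xs = np.unique(df.x)                              # [1, 2]
ys = np.unique(df.y)                              # [1, 2, 3]
zs = df.z.to_numpy().reshape(len(xs), len(ys))    # rows follow x, columns follow y

f = RegularGridInterpolator((xs, ys), zs)         # linear interpolation by default
f([[1.5, 2.5]])                                   # array([40.])
For scattered (non-gridded) points, scipy.interpolate.griddata works as well.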
Define your interpolate function:
def interpolate(x, y, df):
    # select the four grid points surrounding (x, y) and average their z values;
    # the plain mean matches bilinear interpolation only when (x, y) sits at the
    # centre of the cell, as (1.5, 2.5) does here
    cond = df.x.between(int(x), int(x) + 1) & df.y.between(int(y), int(y) + 1)
    return df.loc[cond].z.mean()
interpolate(1.5,2.5,df)
40.0
I created two random variables (x and y) with certain properties. Now, I want to create a dataframe from scratch out of these two variables. Unfortunately, what I type seems to be wrong. How can I do this correctly?
import pandas as pd
from scipy.stats import bernoulli, binom, norm

# creating variable x with Bernoulli distribution
x = bernoulli.rvs(size=100, p=0.6)
# form a column vector (n, 1)
x = x.reshape(-100, 1)
print(x)
# creating variable y with normal distribution
y = norm.rvs(size=100,loc=0,scale=1)
# form a column vector (n, 1)
y = y.reshape(-100, 1)
print(y)
# creating a dataframe from scratch and assigning x and y to it
df = pd.DataFrame()
df.assign(y = y, x = x)
df
There are a lot of ways to go about this.
According to the documentation, pd.DataFrame accepts ndarray (structured or homogeneous), Iterable, dict, or DataFrame. Your issue is that x and y are 2d numpy arrays
>>> x.shape
(100, 1)
where it expects either one 1d array per column or a single 2d array.
One way would be to stack the arrays into one before calling the DataFrame constructor:
>>> pd.DataFrame(np.hstack([x,y]))
0 1
0 0.0 0.764109
1 1.0 0.204747
2 1.0 -0.706516
3 1.0 -1.359307
4 1.0 0.789217
.. ... ...
95 1.0 0.227911
96 0.0 -0.238646
97 0.0 -1.468681
98 0.0 1.202132
99 0.0 0.348248
The alternatives mostly revolve around calling numpy.ndarray.flatten(), e.g. to construct a dict:
>>> pd.DataFrame({'x': x.flatten(), 'y': y.flatten()})
x y
0 0 0.764109
1 1 0.204747
2 1 -0.706516
3 1 -1.359307
4 1 0.789217
.. .. ...
95 1 0.227911
96 0 -0.238646
97 0 -1.468681
98 0 1.202132
99 0 0.348248
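If you want to stay close to your original attempt, note that df.assign returns a new DataFrame rather than modifying df in place, so the result has to be captured; a small sketch (still flattening the columns to 1-D first):
# assign returns a new DataFrame, so capture the result instead of discarding it
df = pd.DataFrame().assign(x=x.flatten(), y=y.flatten())
print(df.head())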
I'm trying to remove rows from a DataFrame that are within a Euclidean distance threshold of other points listed in the DataFrame. So for example, in the small DataFrame provided below, two rows would be removed if a threshold value was set equal to 0.001 (1 mm: thresh = 0.001), where X and Y are spatial coordinates:
import pandas as pd
data = {'X': [0.075, 0.0791667,0.0749543,0.0791184,0.075,0.0833333, 0.0749543],
'Y': [1e-15, 0,-0.00261746,-0.00276288, -1e-15,0,-0.00261756],
'T': [12.57,12.302,12.56,12.292,12.57,12.052,12.56]}
df = pd.DataFrame(data)
df
# X Y T
# 0 0.075000 1.000000e-15 12.570
# 1 0.079167 0.000000e+00 12.302
# 2 0.074954 -2.617460e-03 12.560
# 3 0.079118 -2.762880e-03 12.292
# 4 0.075000 -1.000000e-15 12.570
# 5 0.083333 0.000000e+00 12.052
# 6 0.074954 -2.617560e-03 12.560
The rows with indices 4 and 6 need to be removed because they are spatial duplicates of rows 0 and 2, respectively, since they are within the specified threshold distance of previously listed points. Also, I always want to remove the 2nd occurrence of a point that is within the threshold distance of a previous point. What's the best way to approach this?
Let's try it with this one. Calculate the Euclidean distance for each pair of (X,Y), which creates a symmetric matrix. Then mask the upper half; then for the lower half, filter out the rows where there is a value less than thresh:
import numpy as np

thresh = 0.001
m = np.tril(np.sqrt(np.power(df[['X']].to_numpy() - df['X'].to_numpy(), 2) +
                    np.power(df[['Y']].to_numpy() - df['Y'].to_numpy(), 2)))
m[np.triu_indices(m.shape[0])] = np.nan
out = df[~np.any(m < thresh, axis=1)]
We could also write it a bit more concisely and legibly (taking a leaf out of @BENY's elegant solution) by using the k parameter of numpy.tril to directly mask the upper half of the symmetric matrix:
distances = np.sqrt(np.sum([(df[[c]].to_numpy() - df[c].to_numpy())**2
                            for c in ('X', 'Y')], axis=0))
msk = np.tril(distances < thresh, k=-1).any(axis=1)
out = df[~msk]
Output:
X Y T
0 0.075000 1.000000e-15 12.570
1 0.079167 0.000000e+00 12.302
2 0.074954 -2.617460e-03 12.560
3 0.079118 -2.762880e-03 12.292
5 0.083333 0.000000e+00 12.052
You mentioned the key word distance, so we can use cdist from scipy:
import numpy as np
from scipy.spatial.distance import cdist

v = df[['X', 'Y']]
ary = cdist(v, v, metric='euclidean')
df[~np.tril(ary < 0.001, k=-1).any(axis=1)]
Out[100]:
X Y T
0 0.075000 1.000000e-15 12.570
1 0.079167 0.000000e+00 12.302
2 0.074954 -2.617460e-03 12.560
3 0.079118 -2.762880e-03 12.292
5 0.083333 0.000000e+00 12.052
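If the frame gets large, building the full pairwise distance matrix can get expensive. A sketch of the same keep-the-first-occurrence rule using scipy.spatial.cKDTree, which only returns the pairs that fall within the threshold:
from scipy.spatial import cKDTree

tree = cKDTree(df[['X', 'Y']].to_numpy())
# query_pairs yields index pairs (i, j) with i < j whose distance is at most r;
# dropping every j keeps the earliest occurrence of each cluster of near-duplicates
drop = {j for i, j in tree.query_pairs(r=0.001)}
out = df.drop(df.index[sorted(drop)])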
This code gives the result I want:
It takes the value at n-1 and calculates the value at n from it:
take the previous value in column 'y' (let's call it y-1) and calculate a value which becomes the new y; then, in the next row, take that new y as y-1 and calculate another new y, and so on.
import pandas as pd

size = 10
x = range(1, size + 1)
df = pd.DataFrame(data={'x': x, 'y': size})
for n in range(1, len(x)):
    df['y'].iloc[n] = df['y'].iloc[n-1] * 2
out:
x y
0 1 10
1 2 20
2 3 40
... ... ...
9 10 5120
I want to put it into a lambda but somehow fail to get it right:
b=2
df['y'].loc[1::] = df['y'].shift(-1).apply(lambda x: x*b)
out:
x y
0 1 10.0
1 2 20.0
2 3 20.0
... ... ...
The lambda function takes the pre-populated value (10) in each row instead of shifting one step back and taking the previous value as the base for the multiplication.
I looked at some threads, but it's above my comprehension whether I am dealing with recursion here and whether this is simply not possible with lambdas:
recursive lambda-expressions possible?
Python Recursion on Pandas df
Can a lambda function call itself recursively in Python?
Edit:
I want subsequent entries in 'y' to be calculated from previous 'y' entries, starting from idx 1.
DataFrame at start:
idx | y |
0 10
DataFrame after 1st calc:
y1 = y0 *2
# *2 is a placeholder; it could be mx+b, or something else
idx | y |
0 10
1 20
I'm not sure you need any recursion; unless I'm misunderstanding, this is just exponentiation (each y is the seed times a power of 2).
Not sure what your actual use case is, but something like one of these should work:
[v*(2**i) for i,v in enumerate(df.y)]
df.apply(lambda j: j.y*(2**(j.x-1)), axis=1)
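Along the same lines, a vectorised equivalent of the original loop, assuming the recurrence really is just multiply-the-previous-value-by-b at every step:
import numpy as np

b = 2
# y_n = y_0 * b**n, so the whole column can be filled in one shot
df['y'] = df['y'].iloc[0] * b ** np.arange(len(df))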
Let us assume we are given the below function:
def f(x, y):
    y = x + y
    return y
The function f(x,y) sums two numbers (but it could be any more or less complicated function of two arguments). Let us now consider the following:
import pandas as pd
import random
import numpy as np
random.seed(1234)
df = pd.DataFrame({'first': random.sample(range(0, 9), 5),
                   'second': np.nan}, index=None)
y = 1
df
first second
0 7 NaN
1 1 NaN
2 0 NaN
3 6 NaN
4 4 NaN
For the scope of the question the second column of the data frame is irrelevant, so without loss of generality we can assume it to be NaN. Let us apply f(x,y) to each row of the data frame, considering that the variable y has been initialised to 1. The first iteration returns 7+1 = 8; now, when applying the function to the second row, we want y to have been updated to the previously calculated 8, and therefore the final result to be 1+8 = 9, and so on and so forth.
What is the pythonic way to handle this? I want to avoid looping and re-assigning the variables inside the loop, thus my guess would be something along the lines of
def apply_to_df(df, y):
    result = df['first'].apply(lambda s: f(s, y))
    return result
However, one can easily see that the above does not use the updated values; it performs all calculations with the original value y = 1.
print(apply_to_df(df,y))
0 8
1 2
2 1
3 7
4 5
Note, you can probably solve this specific case with an existing cumulative function. However, in the general case, you could just hack it by relying on global state:
In [7]: y = 1
In [8]: def f(x):
...: global y
...: y = x + y
...: return y
...:
In [9]: df['first'].apply(lambda s: f(s))
Out[9]:
0 8
1 9
2 9
3 15
4 19
Name: first, dtype: int64
I want to avoid looping and re-assigning the variables inside the loop
Note, pd.DataFrame.apply is a vanilla Python loop under the hood, and it's actually less efficient because it does a lot of checking/validation of inputs. It is not meant to be efficient, but convenient. So if you care about performance, you've already given up by relying on .apply.
Honestly, I think I would rather write the explicit loop over the rows inside of a function than rely on global state.
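As hinted above, this particular f reduces to a cumulative sum (df['first'].cumsum() + 1 reproduces the output). For a general f, a sketch of that explicit loop wrapped in a function so no global state is needed:
import pandas as pd

def apply_with_state(series, func, y=1):
    # carry y across the rows explicitly instead of through a global variable
    out = []
    for x in series:
        y = func(x, y)
        out.append(y)
    return pd.Series(out, index=series.index)

apply_with_state(df['first'], f)   # 8, 9, 9, 15, 19 for first = [7, 1, 0, 6, 4]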
You could use a generator function to remember the prior calculation result:
def my_generator(series, foo, y_seed=0):
    y = y_seed                 # seed value for `y`
    for x in series:           # iterate over the series values
        # call the function on the next `x` together with the most recent `y`
        y = foo(x=x, y=y)
        yield y
df = df.assign(new_col=list(my_generator(series=df['first'], foo=f, y_seed=1)))
>>> df
first second new_col
0 8 NaN 9
1 3 NaN 12
2 0 NaN 12
3 5 NaN 17
4 4 NaN 21
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
You can do this without a loop by using ge (greater than or equal to) and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
it wouldn't work: firstly, you're using Data rather than data; secondly, even with that fixed, you'd be testing a scalar against an array inside the if, which raises an error because the truth value of the resulting boolean Series is ambiguous; thirdly, you're assigning to the entire column each time, overwriting it.
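For illustration, a minimal sketch of that ambiguity (the exact message may differ slightly between pandas versions):
import pandas as pd

s = pd.Series([3, 23, 34])
x = 10
try:
    if x >= s:            # x >= s is a boolean Series, which `if` cannot evaluate
        pass
except ValueError as err:
    print(err)            # "The truth value of a Series is ambiguous. ..."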
You need to access the index label, which your loop didn't; you can use iteritems (renamed to items in newer pandas versions) to do this:
In [125]:
for idx, x in df["X"].iteritems():
if x >= df['Y'].loc[idx]:
df.loc[idx, 'Z'] = 1
else:
df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method available (shown above).
Firstly, note that you capitalized your dataframe name as 'Data' instead of 'data'.
However, for efficient code, EdChum has a great answer above. Another vectorised method that is easy to remember uses np.where:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)