How to perform interpolation on 2d grid from dataframe in python?

I would like to know how to interpolate values on a 2D grid using data from a DataFrame.
My data is below:
500 1000 1400 1600
0 78.0 86.4 89.1 90.0
10 78.8 87.0 89.6 90.5
20 79.6 87.5 90.0 90.9
How do I get the value at [X=15, Y=750]?

You can use either scipy.interpolate.RegularGridInterpolator or scipy.interpolate.interpn:
You are talking about a dataframe, so I am putting your data into a pandas dataframe first:
import numpy as np
import pandas as pd
from scipy.interpolate import interpn, RegularGridInterpolator
values = np.array([
    [78.0, 86.4, 89.1, 90.0],
    [78.8, 87.0, 89.6, 90.5],
    [79.6, 87.5, 90.0, 90.9],
])
df = pd.DataFrame(
    values, index=[0, 10, 20], columns=[500, 1000, 1400, 1600]
)
print(df)
# 500 1000 1400 1600
# 0 78.0 86.4 89.1 90.0
# 10 78.8 87.0 89.6 90.5
# 20 79.6 87.5 90.0 90.9
Now, run the interpolation:
point = (15, 750)
# Method 1 (RegularGridInterpolator)
interp = RegularGridInterpolator(
    (df.index, df.columns), df.values, method='linear'
)
print(interp(point))
# 83.225
# Method 2 (interpn)
print(interpn([df.index, df.columns], df.values, point, method='linear'))
# [83.225]
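If you need values at several points at once, both interfaces also accept an array of query points. A minimal sketch reusing the interp object from Method 1 (the extra points below are just illustrative):
# Query several (X, Y) points in one call; the interpolator takes an
# (n, 2) array and returns one interpolated value per row.
points = np.array([[15, 750], [5, 1200], [20, 1500]])
print(interp(points))
# The first value is 83.225 again; the others are bilinear interpolations
# computed the same way.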

Related

Create new columns in pandas df by grouping and performing operations on an existing column

I have a dataframe that looks like this (Minimal Reproducible Example)
import pandas as pd

thermometers = ['T-10000_0001', 'T-10000_0002', 'T-10000_0003', 'T-10000_0004',
                'T-10001_0001', 'T-10001_0002', 'T-10001_0003', 'T-10001_0004',
                'T-10002_0001', 'T-10002_0002', 'T-10002_0003', 'T-10002_0004']
temperatures = [15.1, 14.9, 12.7, 10.8,
                19.8, 18.3, 17.7, 18.1,
                20.0, 16.4, 17.6, 19.3]
df_set = {'thermometers': thermometers,
          'Temperatures': temperatures}
df = pd.DataFrame(df_set)
Index  Thermometer   Temperature
0      T-10000_0001  14.9
1      T-10000_0002  12.7
2      T-10000_0003  12.7
3      T-10000_0004  10.8
4      T-10001_0001  19.8
5      T-10001_0002  18.3
6      T-10001_0003  17.7
7      T-10001_0004  18.1
8      T-10002_0001  20.0
9      T-10002_0002  16.4
10     T-10002_0003  17.6
11     T-10002_0004  19.3
I am trying to group the thermometers (i.e. 'T-10000', 'T-10001', 'T-10002') and create new columns with the min, max and average of each thermometer's readings, so my final data frame would look like this:
Index  Thermometer  min_temp  average_temp  max_temp
0      T-10000      10.8      12.8          14.9
1      T-10001      17.7      18.5          19.8
2      T-10002      16.4      18.3          20.0
I tried creating a separate function, which I think requires a regular expression, but I'm unable to figure out how to go about it. Any help will be much appreciated.
Use groupby by splitting with your delimiter _. Then, just aggregate with whatever functions you need.
>>> df.groupby(df['thermometers'].str.split('_').str.get(0))['Temperatures'].agg(['min', 'mean', 'max'])
min mean max
thermometers
T-10000 10.8 13.375 15.1
T-10001 17.7 18.475 19.8
T-10002 16.4 18.325 20.0
Another approach with str.extract to avoid the call to str.get:
(df['Temperatures']
 .groupby(df['thermometers'].str.extract('(^[^_]+)', expand=False))
 .agg(['min', 'mean'])
)
Output:
min mean
thermometers
T-10000 10.8 13.375
T-10001 17.7 18.475
T-10002 16.4 18.325
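If you want the exact column names and flat layout from the desired output, a small sketch using named aggregation (pandas >= 0.25) on the same grouping key; the numbers below follow from the lists posted in the question:
out = (df.groupby(df['thermometers'].str.split('_').str.get(0))['Temperatures']
       .agg(min_temp='min', average_temp='mean', max_temp='max')
       .reset_index()
       .rename(columns={'thermometers': 'Thermometer'}))
print(out)
#   Thermometer  min_temp  average_temp  max_temp
# 0     T-10000      10.8        13.375      15.1
# 1     T-10001      17.7        18.475      19.8
# 2     T-10002      16.4        18.325      20.0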

calculating the minimum, mean, and maximum values of the expanding window in a time series dataset

I found the following code for my task, where I need to compute the mean, min, and max of a time series dataframe up to each time step.
For instance, the value at time step 10 should include all the information from time step 0 to time step 10.
The following code seems to work for a series of data; I was wondering if there is a pythonic way to do this for a dataframe.
from pandas import DataFrame
from pandas import concat
from pandas import read_csv
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0)
temps = DataFrame(series.values)
window = temps.expanding()
dataframe = concat([window.min(), window.mean(), window.max(), temps.shift(-1)], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))
IIUC:
import pandas as pd
df = pd.read_csv('daily-min-temperatures.csv', header=0, index_col=0)
out = (df['Temp'].expanding().agg(['min', 'mean', 'max'])
       .assign(**{'t+1': df['Temp'].shift(-1)}))
Output:
>>> out
min mean max t+1
Date
1981-01-01 20.7 20.700000 20.7 17.9
1981-01-02 17.9 19.300000 20.7 18.8
1981-01-03 17.9 19.133333 20.7 14.6
1981-01-04 14.6 18.000000 20.7 15.8
1981-01-05 14.6 17.560000 20.7 15.8
... ... ... ... ...
1990-12-27 0.0 11.174712 26.3 13.6
1990-12-28 0.0 11.175377 26.3 13.5
1990-12-29 0.0 11.176014 26.3 15.7
1990-12-30 0.0 11.177254 26.3 13.0
1990-12-31 0.0 11.177753 26.3 NaN
[3650 rows x 4 columns]
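If the frame has several value columns rather than just Temp, the same expanding/agg call works column-wise and returns MultiIndex columns, one (column, statistic) pair per original column. A minimal sketch with a small made-up frame:
# Hypothetical two-column frame, just to show the column-wise result.
df2 = pd.DataFrame({'Temp': [20.7, 17.9, 18.8], 'Humidity': [60.0, 58.0, 61.0]})
print(df2.expanding().agg(['min', 'mean', 'max']))
#    Temp                    Humidity
#     min       mean   max       min       mean   max
# 0  20.7  20.700000  20.7      60.0  60.000000  60.0
# 1  17.9  19.300000  20.7      58.0  59.000000  60.0
# 2  17.9  19.133333  20.7      58.0  59.666667  61.0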

Why is Pandas interpolate() only finding the middle of my two columns and not being more accurate?

I have a table:
-60 -40 -20 0 20 40 60
100 520.0 440.0 380.0 320.0 280.0 240.0 210.0
110 600.0 500.0 430.0 370.0 320.0 280.0 250.0
I add the column to the dataframe like so:
wind_comp = -35
if int(wind_comp) not in df.columns:
    new_col = df.columns.to_list()
    new_col.append(int(wind_comp))
    new_col.sort()
    df = df.reindex(columns=new_col)
Which returns this:
-60 -40 -35 -20 0 20 40 60
100 520 440 NaN 380 320 280 240 210
110 600 500 NaN 430 370 320 280 250
I interpolate using pandas interpolate() method like this:
df.interpolate(axis=1).interpolate('linear')
If I add a new column of say, -35 it just finds the middle of the -40 and the -20 columns and doesn't get any more accurate. So it returns this:
-60 -40 -35 -20 0 20 40 60
100 520.0 440.0 410.0 380.0 320.0 280.0 240.0 210.0
110 600.0 500.0 465.0 430.0 370.0 320.0 280.0 250.0
Obviously this row would be correct if I had added a column of -30, but I didn't. I need it to give back more accuracy. I want to be able to enter -13 for example and it give me back that interpolated exact number.
How can I do this? Am I doing something wrong in my code or and I missing something? Please help.
EDIT:
It seems that pandas.interpolate() will only take the midpoint of the two numbers it sits between and doesn't take the column headers into account.
I can't find anything that really applies to working with a table using scipy but maybe I'm interpreting it wrong. Is it possible to use that or something different?
Here's an example of interp1d with your values. Now, I'm glossing over a huge number of details here, like how to get values from your DataFrame into a list like this. In many cases, it is easier to do manipulation like this with lists before it becomes a DataFrame.
import scipy.interpolate
x = [ -60, -40, -20, 0 , 20, 40, 60]
y1 = [ 520.0, 440.0, 380.0, 320.0, 280.0, 240.0, 210.0]
y2 = [ 600.0, 500.0, 430.0, 370.0, 320.0, 280.0, 250.0]
f1 = scipy.interpolate.interp1d(x,y1)
f2 = scipy.interpolate.interp1d(x,y2)
print(-35, f1(-35))
print(-35, f2(-35))
Output:
-35 425.0
-35 482.5
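For completeness, a small sketch of pulling those lists straight from the DataFrame instead of typing them out. This assumes df is the original table (before the NaN column is added by reindex), with the wind components as numeric column labels and 100/110 as the index; interp_wind is just a hypothetical helper name:
import numpy as np
from scipy.interpolate import interp1d

def interp_wind(df, row_label, wind_comp):
    # x values come from the column headers, y values from the chosen row.
    x = np.asarray(df.columns, dtype=float)
    y = df.loc[row_label].to_numpy(dtype=float)
    return float(interp1d(x, y)(wind_comp))

print(interp_wind(df, 100, -35))  # 425.0, matching the result above
print(interp_wind(df, 110, -35))  # 482.5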

transpose multiple columns in a pandas dataframe

AD AP AR MD MS iS AS
0 169.88 0.00 50.50 814.0 57.3 32.3 43.230
1 12.54 0.01 84.75 93.0 51.3 36.6 43.850
2 321.38 0.00 65.08 986.0 56.7 28.9 42.070
I would like to change the dataframe above to a transposed version where for each column, the values are put in a single row, so e.g. for columns AD and AP, it will look like this
d1_AD d2_AD d3_AD d1_AP d2_AP d3_AP
169.88 12.54 321.38 0.00 0.01 0.00
I can do a transpose, but how do I get the column names and output structure like above?
NOTE: The output is truncated for legibility but the actual output should include all the other columns like AR MD MS iS AS
We can rename to put the index in the correct form, then stack and sort_index, then collapse the MultiIndex, to_frame, and transpose:
new_df = df.rename(lambda x: f'd{x + 1}').stack().sort_index(level=1)
new_df.index = new_df.index.map('_'.join)
new_df = new_df.to_frame().transpose()
Input df:
df = pd.DataFrame({
    'AD': [169.88, 12.54, 321.38], 'AP': [0.0, 0.01, 0.0],
    'AR': [50.5, 84.75, 65.08], 'MD': [814.0, 93.0, 986.0],
    'MS': [57.3, 51.3, 56.7], 'iS': [32.3, 36.6, 28.9],
    'AS': [43.23, 43.85, 42.07]
})
new_df:
d1_AD d2_AD d3_AD d1_AP d2_AP ... d2_MS d3_MS d1_iS d2_iS d3_iS
0 169.88 12.54 321.38 0.0 0.01 ... 51.3 56.7 32.3 36.6 28.9
[1 rows x 21 columns]
If lexicographic sorting does not work, we can wait to convert the MultiIndex to strings until after sort_index:
new_df = df.stack().sort_index(level=1) # Sort level 1 (by number)
new_df.index = new_df.index.map(lambda x: f'd{x[0]+1}_{x[1]}')
new_df = new_df.to_frame().transpose()
Larger frame:
df = pd.concat([df] * 4, ignore_index=True)
Truncated output:
d1_AD d2_AD d3_AD d4_AD d5_AD ... d8_iS d9_iS d10_iS d11_iS d12_iS
0 169.88 12.54 321.38 169.88 12.54 ... 36.6 28.9 32.3 36.6 28.9
[1 rows x 84 columns]
If you need the columns in the same order as df, use melt with ignore_index=False to avoid recalculating the groups and let melt handle the ordering:
new_df = df.melt(value_name=0, ignore_index=False)
new_df = new_df[[0]].set_axis(
    # Create the new index
    'd' + (new_df.index + 1).astype(str) + '_' + new_df['variable']
).transpose()
Truncated output on the larger frame:
d1_AD d2_AD d3_AD d4_AD d5_AD ... d8_AS d9_AS d10_AS d11_AS d12_AS
0 169.88 12.54 321.38 169.88 12.54 ... 43.85 42.07 43.23 43.85 42.07
[1 rows x 84 columns]
You could try melt and set_index with groupby:
x = df.melt().set_index('variable').rename_axis(index=None).T.set_axis([0])
x.set_axis(x.columns + x.columns.to_series().groupby(level=0).transform('cumcount').add(1).astype(str), axis=1)
AD1 AD2 AD3 AP1 AP2 AP3 AR1 AR2 AR3 ... MS1 MS2 MS3 iS1 iS2 iS3 AS1 AS2 AS3
0 169.88 12.54 321.38 0.0 0.01 0.0 50.5 84.75 65.08 ... 57.3 51.3 56.7 32.3 36.6 28.9 43.23 43.85 42.07
[1 rows x 21 columns]
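If you prefer to avoid the stack/melt machinery altogether, a minimal sketch that flattens the values column by column and builds the matching labels by hand (same input df as above):
import pandas as pd

# Flatten column by column (AD's three rows, then AP's, ...) and label each
# value d{row+1}_{column} to match the requested layout.
flat = pd.DataFrame(
    [df.to_numpy().T.ravel()],
    columns=[f'd{i + 1}_{c}' for c in df.columns for i in range(len(df))]
)
print(flat.iloc[:, :6])
#     d1_AD  d2_AD   d3_AD  d1_AP  d2_AP  d3_AP
# 0  169.88  12.54  321.38    0.0   0.01    0.0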

Finding highest values "zone" in a 2d matrix in Python

I have a 2d matrix in Python like this (a 10 rows/20 columns list I use to later do an imshow):
[[-20.17 -12.88 -20.7 -25.69 -21.69 -34.22 -32.65 -31.74 -36.36 -37.65
-41.42 -41.14 -44.01 -43.19 -41.85 -39.25 -40.15 -41.31 -39.73 -28.66]
[ 14.18 53.86 70.03 64.39 72.37 39.95 30.44 28.14 20.77 17.98
25.74 25.66 27.56 37.61 42.39 42.39 35.79 41.65 41.65 41.84]
[ 33.71 68.35 69.39 66.7 59.99 40.08 40.08 40.8 26.19 19.82
19.82 18.07 20.32 19.51 24.77 22.81 21.45 21.45 21.45 23.7 ]
[103.72 55.11 32.3 29.47 16.53 15.54 9.4 8.11 5.06 5.06
13.07 13.07 12.99 13.47 13.47 13.47 12.92 12.92 14.27 20.63]
[ 59.02 18.6 37.53 24.5 13.01 34.35 8.16 13.66 12.57 8.11
8.11 8.11 8.11 8.11 8.11 5.66 5.66 5.66 5.66 7.41]
[ 52.69 14.17 7.25 -5.79 3.19 -1.75 -2.43 -3.98 -4.92 -6.68
-6.68 -6.98 -6.98 -8.89 -8.89 -9.15 -9.15 -9.15 -9.15 -9.15]
[ 29.24 10.78 0.6 -3.15 -12.55 3.04 -1.68 -1.68 -1.41 -6.15
-6.15 -6.15 -10.59 -10.59 -10.59 -10.59 -10.59 -9.62 -10.29 -10.29]
[ 6.6 0.11 2.42 0.21 -5.68 -10.84 -10.84 -13.6 -16.12 -14.41
-15.28 -15.28 -15.28 -18.3 -5.55 -13.16 -13.16 -13.16 -13.16 -14.15]
[ 3.67 -11.69 -6.99 -16.75 -19.31 -20.28 -21.5 -21.5 -34.02 -37.16
-25.51 -25.51 -26.36 -26.36 -26.36 -26.36 -29.38 -29.38 -29.59 -29.38]
[ 31.36 -2.87 0.34 -8.06 -12.14 -22.7 -24.39 -25.51 -26.36 -27.37
-29.38 -31.54 -31.54 -31.54 -32.41 -33.26 -33.26 -15.54 -15.54 -15.54]]
I'm trying to find a way to detect the "zone" of this matrix that contains the highest density of high values in it. It means it might not contain the highest single value of the whole list, obviously.
I suppose to do so I should define how big this zone is, so let's say it should be 2x2 (so I want to find what is the 'square' of 2x2 items containing the highest values).
I keep thinking I have a logical way to do this, but then I fail to follow through on how it could actually work!
Does anyone have a suggestion I could start from?
I know there might be some easier ways to do so, but this is the easiest for me. I've created the following function to perform this task which takes two arguments:
arr: a 2D numpy array.
zone_size: the size of the square zone.
And the function goes like so:
import numpy as np

def get_heighest_zone(arr, zone_size):
    max_sum = float("-inf")
    row_idx, col_idx = 0, 0
    # Slide a zone_size x zone_size window over every valid top-left corner;
    # the +1 keeps the last row/column positions from being skipped.
    for row in range(arr.shape[0] - zone_size + 1):
        for col in range(arr.shape[1] - zone_size + 1):
            curr_sum = np.sum(arr[row:row + zone_size, col:col + zone_size])
            if curr_sum > max_sum:
                row_idx, col_idx = row, col
                max_sum = curr_sum
    return arr[row_idx:row_idx + zone_size, col_idx:col_idx + zone_size]
Assuming arr is the numpy array posted in your question, applying this function over different zone_sizes will return these values:
>>> get_heighest_zone(arr, 2)
[[70.03 64.39]
[69.39 66.7 ]]
>>> get_heighest_zone(arr, 3)
[[53.86 70.03 64.39]
[68.35 69.39 66.7 ]
[55.11 32.3 29.47]]
>>> get_heighest_zone(arr, 4)
[[ 14.18 53.86 70.03 64.39]
[ 33.71 68.35 69.39 66.7 ]
[103.72 55.11 32.3 29.47]
[ 59.02 18.6 37.53 24.5 ]]
If the zone doesn't have to be square, you will need to modify the code a little. Also, you should assert that zone_size is no larger than the array dimensions.
Hopefully, this is what you were looking for!
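As a possible follow-up, the same search can be vectorized with NumPy's sliding_window_view (available in NumPy 1.20+), which computes every window sum at once; a minimal sketch under that assumption:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def get_highest_zone_vectorized(arr, zone_size):
    # Sum every zone_size x zone_size window in one shot, then locate the max.
    sums = sliding_window_view(arr, (zone_size, zone_size)).sum(axis=(2, 3))
    row_idx, col_idx = np.unravel_index(np.argmax(sums), sums.shape)
    return arr[row_idx:row_idx + zone_size, col_idx:col_idx + zone_size]
This returns the same zones as the loop version above, just without the explicit Python loops.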
