I have a 2-dimensional xarray dataset that I want to interpolate on the lon and lat coordinates so that I get a higher resolution, while the values still correspond exactly with the original values at each original coordinate.
I thought the excellent xr.interp function would be able to do this, but following the example I see some discrepancy between the original and interpolated values. I am increasing the longitude and latitude resolution by a factor of 4, and would therefore expect every air value that occurs once in the original dataset to occur 16 times in the interpolated dataset, but this is not the case.
Does anyone know what the cause is that the original and interpolated dataset do not align and how I could solve it?
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.tutorial.open_dataset("air_temperature").isel(time=0)
fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
ds_sel=ds.sel(lon=slice(250,260),lat=slice(40,30))
ds.air.plot(ax=axes[0],xlim=(250,260),ylim=(30,40))
axes[0].set_title("Raw data")
# Interpolated data
new_lon = np.linspace(ds.lon[0], ds.lon[-1], ds.dims["lon"] * 4)
new_lat = np.linspace(ds.lat[0], ds.lat[-1], ds.dims["lat"] * 4)
dsi = ds.interp(lat=new_lat, lon=new_lon,method="nearest")
dsi_sel=dsi.sel(lon=slice(250,260),lat=slice(40,30))
dsi.air.plot(ax=axes[1],xlim=(250,260),ylim=(30,40))
axes[1].set_title("Interpolated data")
Showing the unique values with
unique, counts = np.unique(ds_sel.air.values, return_counts=True)
print("original values",dict(zip(unique, counts)))
unique, counts = np.unique(dsi_sel.air.values, return_counts=True)
print("interpolated values",dict(zip(unique, counts)))
I get
original values {262.1: 1, 263.1: 1, 263.9: 1, 264.4: 1, 265.19998: 1, 266.6: 1, 266.79: 1, 266.9: 2, 268.29: 1, 269.79: 1, 270.4: 1, 273.0: 1, 273.6: 1, 275.19998: 1, 276.29: 1, 278.0: 1, 278.5: 1, 278.6: 1, 281.5: 1, 282.1: 1, 282.29: 1, 284.6: 1, 286.79: 1, 288.0: 1}
interpolated values {262.1: 4, 263.1: 8, 263.9: 8, 264.4: 8, 265.19998: 4, 266.6: 16, 266.79: 16, 266.9: 24, 268.29: 8, 269.79: 20, 270.4: 10, 273.0: 20, 273.6: 16, 275.19998: 8, 276.29: 20, 278.0: 16, 278.5: 10, 278.6: 8, 281.5: 4, 282.1: 16, 282.29: 8, 284.6: 8, 286.79: 8, 288.0: 4}
I think you're conceptually running up against a fencepost error (see the section on this page: https://en.wikipedia.org/wiki/Off-by-one_error)
You should interpret the xarray coordinates as "midpoints", not as the cell boundaries.
Your new_lon doesn't land on nice subdivisions (1/2, 1/4, 1/8, etc.) of the original spacing:
print(new_lon[:9])
[200. 200.61611374 201.23222749 201.84834123 202.46445498
203.08056872 203.69668246 204.31279621 204.92890995]
And it doesn't include all the original coordinates.
Taking the "off-by-ones" into account:
new_lon = np.linspace(ds.lon[0], ds.lon[-1], (ds.dims["lon"] - 1) * 4 + 1)
new_lat = np.linspace(ds.lat[0], ds.lat[-1], (ds.dims["lat"] - 1) * 4 + 1)
print(new_lon[:9])
[200. 200.625 201.25 201.875 202.5 203.125 203.75 204.375 205. ]
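As a quick sanity check (the 0.625-degree spacing happens to be exactly representable in floating point, so an exact comparison works here; in general you would compare with a tolerance such as np.isclose):
print(np.isin(ds.lon.values, new_lon).all())  # should print True: every original longitude is on the refined grid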
You can then, for example, inspect part of the first row of the original and the interpolated data:
selection = ds["air"][0, :3]
selection_i = dsi["air"][0, :9]
print(selection["lon"])
print(selection.values)
print(selection_i["lon"])
print(selection_i.values)
This looks good to me:
[200. 202.5 205. ]
[241.2 242.5 243.5]
[200. 200.625 201.25 201.875 202.5 203.125 203.75 204.375 205. ]
[241.2 241.2 241.2 242.5 242.5 242.5 242.5 243.5 243.5]
Of course, when doing nearest interpolation, you might end up with ties: 0.5 is equally far from 0.0 as it is from 1.0, so the lookup inevitably has to bias either "up" or "down" to pick a single nearest value.
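A tiny sketch of such a tie (a hypothetical 1-D example, not from the dataset above); which neighbour wins depends on the tie-breaking of the underlying scipy routine:
da = xr.DataArray([0.0, 1.0], dims="x", coords={"x": [0.0, 1.0]})
print(da.interp(x=[0.5], method="nearest").values)  # equidistant: returns either 0.0 or 1.0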
Also note that the .plot() command, which draws a Matplotlib QuadMesh, has to infer cell boundaries from the midpoints somehow. This can sometimes lead to boundaries being drawn slightly differently from what you might have in mind (especially if the coordinate is unevenly spaced).
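For an unevenly spaced coordinate, the inferred edges are roughly the midpoints between neighbouring coordinates. A rough sketch of that idea (my own illustration, not xarray's actual code):
coord = np.array([0.0, 1.0, 3.0])
edges = np.concatenate([[coord[0] - (coord[1] - coord[0]) / 2],     # extrapolated left edge
                        (coord[:-1] + coord[1:]) / 2,               # midpoints between coordinates
                        [coord[-1] + (coord[-1] - coord[-2]) / 2]])  # extrapolated right edge
print(edges)  # -> edges at -0.5, 0.5, 2.0, 4.0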
I have a list of tuples, where each tuple is a datetime and float. I wish to clip the float values so that they are all above a threshold value. For example if I have:
a = [
(datetime.datetime(2021, 11, 1, 0, 0, tzinfo=tzutc()), 100),
(datetime.datetime(2021, 11, 1, 1, 0, tzinfo=tzutc()), 9.0),
(datetime.datetime(2021, 11, 1, 2, 0, tzinfo=tzutc()), 100.0)
]
and if I want to clip at 10.0, this would give me:
b = [
(datetime.datetime(2021, 11, 1, 0, 0, tzinfo=tzutc()), 100),
(datetime.datetime(2021, 11, 1, 0, ?, tzinfo=tzutc()), 10.0),
(datetime.datetime(2021, 11, 1, 1, ?, tzinfo=tzutc()), 10.0),
(datetime.datetime(2021, 11, 1, 2, 0, tzinfo=tzutc()), 100.0)
]
So if I were to plot the a data (before clipping), I would get a V shaped graph. However, if I clip the data at 10.0 to give me the b data, and plot, I will have a \_/ shaped graph instead. There is a bit of math involved in calculating the new times so I'm hoping there is already functionality available to do this kind of thing. The datetimes are sorted in order and are unique. I can fix the data so the difference between consecutive times is equal, should that be necessary.
Apologies for not putting a full answer yesterday, my SO account is still rate-limited.
I have made a somewhat more complex custom dataset to showcase several values in a row being below the threshold.
import pandas as pd
from datetime import datetime
from matplotlib import pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
df = pd.DataFrame([
(datetime(2021, 10, 31, 23, 0), 0),
(datetime(2021, 11, 1, 0, 0), 80),
(datetime(2021, 11, 1, 1, 0), 100),
(datetime(2021, 11, 1, 2, 0), 6),
(datetime(2021, 11, 1, 3, 0), 105),
(datetime(2021, 11, 1, 4, 0), 70),
(datetime(2021, 11, 1, 5, 0), 200),
(datetime(2021, 11, 1, 6, 0), 0),
(datetime(2021, 11, 1, 7, 0), 7),
(datetime(2021, 11, 1, 8, 0), 0),
(datetime(2021, 11, 1, 9, 0), 20),
(datetime(2021, 11, 1, 10, 0), 100),
(datetime(2021, 11, 1, 11, 0), 0)
], columns=['time', 'whatever'])
THRESHOLD = 10
The first thing to do here is to express the index in terms of timedeltas, so that it behaves like any ordinary number we can do all kinds of calculations with. For convenience, I am also expressing the data as a Series; an even better approach would be to create it as such from the get-go, save the initial timestamp, and reindex.
start_time = df['time'][0]
df.set_index((df['time'] - start_time).dt.total_seconds(), inplace=True)
series = df['whatever']
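As a quick sanity check of the conversion (the first three timestamps in the mock data are one hour apart, starting at 23:00 the previous day):
print(series.index[:3].tolist())  # [0.0, 3600.0, 7200.0]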
Then, I've tried InterpolatedUnivariateSpline from scipy:
roots = InterpolatedUnivariateSpline(df.index, series.values - THRESHOLD).roots()
threshold_crossings = pd.Series([THRESHOLD] * len(roots), index=roots)
new_series = pd.concat([series[series > THRESHOLD], threshold_crossings]).sort_index()
Let's test it out:
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(series)
ax.plot(df.index, [THRESHOLD] * len(df.index), 'k-.', label='threshold')
ax.plot(new_series)
ax.set_xlabel('$t-t_0$, s')
axins = ax.inset_axes([0.6, 0.6, 0.35, 0.3])
axins.plot(series)
axins.plot(df.index, [THRESHOLD] * len(df.index), 'k-.')
axins.plot(new_series)
axins.set_ylim(0, 20)
ax.indicate_inset_zoom(axins, edgecolor="black")
ax.set_ylabel('whatever, a.u.')
ax.legend(loc='upper left')
ax.set_title('Roots from InterpolatedUnivariateSpline')
Not so great. The spline-root interpolation is quite a bit off (after all, it uses a cubic B-spline under the hood and can't find roots if the order is set to 1). Ah well. For monotonic functions we could simply invert the interpolation, but that is not the case here. I hope someone finds a better way to do it, but my next step was rolling out a custom function:
def my_interp(series: pd.Series, thr: float) -> pd.Series:
    needs_interp = series > thr
    # XOR means we are only considering transition points
    needs_interp = (needs_interp ^ needs_interp.shift(-1)).fillna(False)
    # The last point will never be interpolated
    x = series.index.to_series()
    # Slope k and intercept b of the straight line through each pair of consecutive points
    k = series.diff(periods=-1) / x.diff(periods=-1)
    b = series - k * x
    # Solve k * x + b = thr on the transition segments
    x_fill = ((thr - b) / k)[needs_interp]
    fill_series = pd.Series(data=[thr] * x_fill.size, index=x_fill.values)
    # NB! needs_interp is a wrong mask to use for series here
    return pd.concat([series[series > thr], fill_series]).sort_index()
new_series = my_interp(series, THRESHOLD)
It achieves what you want to do with good precision:
To get back to timestamp representation, one would simply do
new_series.index = (start_time + pd.to_timedelta(new_series.index, unit='s'))
With that said, there are a couple of caveats:
The function above assumes the timestamps are sorted (this can be achieved with sort_index) and that no duplicates are present in the series.
Edge conditions are nasty as usual. I have tested the function a little; the logic seems sound, it does not break if either end of the series is above/below the threshold, and it handles irregular data just fine. Still, watch out for NaNs in your data and consider how you should handle all the edge conditions, sorting, etc. (see the pre-cleaning sketch below).
There is no logic dedicated to handling data points exactly at the threshold or to ensuring any regularity in the new timestamps. This could lead to bugs too: e.g. if some portion of your code relies on having at least 2 data points every day, that might not hold after the transformation.
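A minimal pre-cleaning sketch for those caveats, assuming it is acceptable to simply drop NaNs and duplicate timestamps (my own addition, not part of the function above):
series = series.dropna()
series = series[~series.index.duplicated(keep='first')].sort_index()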
I have a segmentation map (numpy.ndarray) that contains objects labeled with unique numbers. I want to combine objects across multiple slices by labeling them with the same number. Specifically, I want to renumber objects based on a DataFrame containing centroid positions and the desired label value.
First, I created some mock labels and a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
"slice": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
"number": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
"x": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32],
"y": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32]
})
def make_segmap(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice in df["slice"].unique():
        masks = []
        for row in df[df["slice"] == n_slice].iterrows():
            # Create circle
            mask_circle = (x - row[1]["x"])**2 + (y - row[1]["y"])**2 < 5**2
            # Random index number (here just a multiple)
            masks.append(mask_circle * row[1]["number"] * 3)
        maps.append(np.max(masks, axis=0))
    return np.stack(maps, axis=0)
segmap = make_segmap(df)
For renumbering, this is what I came up with so far:
new_maps = []
# Iterate over slices
for n_slice in df["slice"].unique():
    new_labels = []
    for row in df[df["slice"] == n_slice].iterrows():
        # Find current value at position
        original_label = segmap[n_slice, row[1]["y"], row[1]["x"]]
        # Replace all label occurrences with the desired label from the DataFrame
        replaced_label = np.where(segmap[n_slice] == original_label, row[1]["number"], 0)
        new_labels.append(replaced_label)
    new_maps.append(np.max(new_labels, axis=0))
new_segmap = np.stack(new_maps, axis=0)
This works reasonably well but doesn't scale to larger datasets. The real dataset has thousands of objects across hundreds of slices, and this approach takes a very long time to run (an hour or so). Are there any suggestions on how to replace multiple values at once to improve performance?
Thanks in advance.
You can use groupby to replace the current quadratic search with a (quasi-)linear one. Moreover, you can take advantage of NumPy's vectorization and broadcasting to remove the inner loop and make the computation faster.
Here is a faster implementation:
def make_segmap_fast(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice, subDf in df.groupby("slice"):
        subDf_x = subDf["x"].to_numpy()[:, None, None]
        subDf_y = subDf["y"].to_numpy()[:, None, None]
        subDf_number = subDf["number"].to_numpy()[:, None, None]
        # Create circle
        mask_circle = (x - subDf_x)**2 + (y - subDf_y)**2 < 5**2
        # Random index number (here just a multiple)
        masks = mask_circle * subDf_number
        maps.append(np.max(masks, axis=0) * 3)
    return np.stack(maps, axis=0)
On my machine, this is about twice as fast on this very small example (and much faster on bigger dataframes).
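If you want to check the speed-up and correctness on your own data, a small timing sketch like this works (my own illustration; timeit would do just as well):
import time

start = time.perf_counter()
segmap_slow = make_segmap(df)
print("loop version:      ", time.perf_counter() - start, "s")

start = time.perf_counter()
segmap_fast = make_segmap_fast(df)
print("broadcast version: ", time.perf_counter() - start, "s")

# Both versions should produce the same segmentation map
print("identical results:", np.array_equal(segmap_slow, segmap_fast))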
I have two curves, defined by
X1=[9, 10.5, 11, 12, 12, 11, 10, 8, 7, 7]
Y1=[-5, -3.5, -2.5, -0.7, 1, 3, 4, 5, 5, 5]
X2=[5, 7, 9, 9.5, 10, 11, 12]
Y2=[-2, 4, 1, 0, -0.5, -0.7, -3]
They intersect each other, and a function written in the system code I am using gives me the coordinates of the intersection.
loop1=Loop([9, 10.5, 11, 12, 12, 11, 10, 8, 7, 7],[-5, -3.5, -2.5, -0.7, 1, 3, 4, 5, 5, 5])
loop2=Loop([5, 7, 9, 9.5, 10, 11, 12], [-2, 4, 1, 0, -0.5, -0.7, -3])
x_int, y_int = get_intersect(loop1,loop2)
Intersection = [[], []]
Intersection[0].append(x_int)
Intersection[1].append(y_int)
For both curves, I need to find the points which are upstream and downstream of the intersection identified by (x_int, y_int).
What I tried is something like:
for x_val, y_val, x, y in zip(Intersection[0], Intersection[1], loop1[0], loop1[1]):
    if abs(x_val - x) < 0.5 and abs(y_val - y) < 0.5:
        print(x_val, x, y_val, y)
The problem is that the result is strongly affected by the delta I choose (0.5 in this case), and this gives me wrong results, especially when I work with more decimal places (which is actually my case).
How can I make the loop more robust and actually find all and only the points which are upstream and downstream of the intersection?
Many thanks for your help
TL;DR: loop over the polyline segments and test whether the intersection lies between the segment end points.
A more robust approach (than the "delta" in the OP) is to find the segment of the polyline which contains the intersection (or a given point in general). This segment search should IMO be part of the get_intersect function, but if you do not have access to it, you have to search for the segment yourself.
Because of round-off errors, the given point does not lie exactly on the segment, so you still need some tol parameter, but the results should be almost insensitive to its (very low) value.
The approach uses simple geometry, namely dot product and cross product and their geometric meaning:
the dot product of vectors a and b, divided by |a|, is the (signed) length of the projection of b onto the direction of a. Dividing once more by |a| normalizes the value to the range [0, 1] for points that project between the segment end points.
the cross product of a and b is the area of the parallelogram with a and b as sides. Dividing it by the squared segment length turns it into a dimensionless measure of the distance from the segment's line. If a point lies exactly on the segment, the cross product is zero, but a small tolerance is needed for floating-point numbers (see the tiny numeric illustration below).
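A tiny numeric illustration with a made-up segment and point, using the same variable names as the function below:
dx, dy = 2.0, 0.0            # segment vector: from (0, 0) to (2, 0)
ix, iy = 1.0, 0.0            # vector from the segment start to the point (1, 0)
d2 = dx*dx + dy*dy
print((dx*ix + dy*iy) / d2)  # 0.5 -> the projection lands halfway along the segment
print((dx*iy - dy*ix) / d2)  # 0.0 -> the point lies exactly on the segment's line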
X1=[9, 10.5, 11, 12, 12, 11, 10, 8, 7, 7]
Y1=[-5, -3.5, -2.5, -0.7, 1, 3, 4, 5, 5, 5]
X2=[5, 7, 9, 9.5, 10, 11, 12]
Y2=[-2, 4, 1, 0, -0.5, -0.7, -3]
x_int, y_int = 11.439024390243903, -1.7097560975609765
def splitLine(X, Y, x, y, tol=1e-12):
    """Split a polyline at a point lying on it.
    X,Y ... coordinates of the polyline points
    x,y ... point on the polyline
    tol ... tolerance of the normalized distance from the segment
    returns ... (X_upstream, Y_upstream), (X_downstream, Y_downstream)
    """
    found = False
    for i in range(len(X) - 1):  # loop over segments
        # segment end points
        x1, x2 = X[i], X[i+1]
        y1, y2 = Y[i], Y[i+1]
        # segment "vector"
        dx = x2 - x1
        dy = y2 - y1
        # segment length squared
        d2 = dx*dx + dy*dy
        # vector from the 1st end point to the given point
        ix = x - x1
        iy = y - y1
        # normalized dot product
        dot = (dx*ix + dy*iy) / d2
        if dot < 0 or dot > 1:  # point projection is outside the segment
            continue
        # normalized cross product
        cross = (dx*iy - dy*ix) / d2
        if abs(cross) > tol:  # point is perpendicularly too far away
            continue
        # here, we have found the segment containing the point!
        found = True
        break
    if not found:
        raise RuntimeError("intersection not found on segments")  # or return None, according to needs
    i += 1  # the "splitting point" has one higher index than the segment
    return (X[:i], Y[:i]), (X[i:], Y[i:])
# plot
import matplotlib.pyplot as plt
plt.plot(X1,Y1,'y',linewidth=8)
plt.plot(X2,Y2,'y',linewidth=8)
plt.plot([x_int],[y_int],"r*")
(X1u,Y1u),(X1d,Y1d) = splitLine(X1,Y1,x_int,y_int)
(X2u,Y2u),(X2d,Y2d) = splitLine(X2,Y2,x_int,y_int)
plt.plot(X1u,Y1u,'g',linewidth=3)
plt.plot(X1d,Y1d,'b',linewidth=3)
plt.plot(X2u,Y2u,'g',linewidth=3)
plt.plot(X2d,Y2d,'b',linewidth=3)
plt.show()
Result:
I am working my way through: https://medium.com/analytics-vidhya/exploratory-data-analysis-of-the-hotel-booking-demand-with-python-200925230106
In a bunch of the visualization outputs the sort order is off. As I work through each question, I have been able to fix the sort order of every output -- until now.
For question #6, part two ("Let's see the stay duration trend for each hotel type."), I am getting the exact same output as is shown in the article. However, the x-axis is incorrectly sorted, and I am trying to fix it as I did for all previous outputs.
Here is my code for question #6, including the first part where I fixed the sort order:
# 6. How long do people stay in the hotel?
df_not_canceled2 = df_not_canceled.copy()
total_nights = df_not_canceled2['stays_in_weekend_nights'] + df_not_canceled2['stays_in_week_nights']
x5, y5, z5 = get_count(total_nights, limit=10)
x5 = x5[[0, 2, 1, 3, 5, 6, 4, 8, 7, 9]]
y5 = y5[[0, 2, 1, 3, 5, 6, 4, 8, 7, 9]]
z5 = z5[[0, 2, 1, 3, 5, 6, 4, 8, 7, 9]]
plot(x5, y5, x_label='Number of Nights', y_label='Booking Percentage (%)', title='Night Stay Duration (Top 10)', figsize =(10, 5))
plt.show()
# The stay duration trend for each hotel type.
df_not_canceled2.loc[:, 'total_nights'] = df_not_canceled2['stays_in_weekend_nights'] + df_not_canceled2['stays_in_week_nights']
df_not_canceled2 = df_not_canceled2.sort_values(by=['total_nights']).reset_index(drop=True)
fig1, ax = plt.subplots(figsize=(12, 6))
ax.set_xlabel('No of Nights')
ax.set_ylabel('No of Nights')
ax.set_title('Hotel wise night stay duration (Top 10)')
sns.countplot(x='total_nights', hue='hotel', data=df_not_canceled2,
order=df_not_canceled2['total_nights'].value_counts().iloc[:10].index, ax=ax)
plt.show()
First I tried sorting the df by 'total_nights'. The output did not change. Then I sorted and reset the index (this is the current state of my code). Still no change.
This is what I get (exactly the same as the article):
Notice the sort order of the x-axis (total_nights). I want 1, 2, 3, 4, 5, etc., not 1, 3, 2, 4, 7, etc.
Just figured it out: I had to remove .value_counts from the order parameter.
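For reference, a minimal sketch of an equivalent fix that keeps the ten most frequent durations but sorts them on the x-axis (this is an assumption about the intent; the final line isn't shown above):
top10_nights = df_not_canceled2['total_nights'].value_counts().iloc[:10].index
sns.countplot(x='total_nights', hue='hotel', data=df_not_canceled2,
              order=sorted(top10_nights), ax=ax)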
I have a 2D numpy array (6 x 6 elements). I want to create another 2D array from it, where each element is the average of all elements within the corresponding blocksize window. Currently, I have the following code:
import numpy

def avg_func(data, blocksize=2):
    # Takes data, and averages all positive (only numerical) numbers in blocks
    dimensions = data.shape
    height = int(numpy.floor(dimensions[0] / blocksize))
    width = int(numpy.floor(dimensions[1] / blocksize))
    averaged = numpy.zeros((height, width))
    for i in range(0, height):
        print(i * 1.0 / height)
        for j in range(0, width):
            block = data[i*blocksize:(i+1)*blocksize, j*blocksize:(j+1)*blocksize]
            if block.any():
                averaged[i][j] = numpy.average(block[block > 0])
    return averaged
arr = numpy.random.random((6,6))
avgd = avg_func(arr, 3)
Is there any way I can make it more pythonic? Perhaps numpy has something which does it already?
UPDATE
Based on M. Massias's solution below, here is an update with the fixed values replaced by variables. I am not sure if it is coded right, but it does seem to work:
dimensions = data.shape
height = int(numpy.floor(dimensions[0]/block_size))
width = int(numpy.floor(dimensions[1]/block_size))
t = data.reshape([height, block_size, width, block_size])
avrgd = numpy.mean(t, axis=(1, 3))
To compute some operation slice by slice in numpy, it is very often useful to reshape your array and use extra axes.
To explain the process we'll use here: you can reshape your array, take the mean along one axis, reshape it again, and take the mean again.
Here I assume blocksize is 2
import numpy as np

t = np.array([[0, 1, 2, 3, 4, 5],
              [0, 1, 2, 3, 4, 5],
              [0, 1, 2, 3, 4, 5],
              [0, 1, 2, 3, 4, 5],
              [0, 1, 2, 3, 4, 5],
              [0, 1, 2, 3, 4, 5]])
t = t.reshape([6, 3, 2])
t = np.mean(t, axis=2)
t = t.reshape([3, 2, 3])
np.mean(t, axis=1)
outputs
array([[ 0.5,  2.5,  4.5],
       [ 0.5,  2.5,  4.5],
       [ 0.5,  2.5,  4.5]])
Now that it's clear how this works, you can do it in one pass only:
t = t.reshape([3, 2, 3, 2])
np.mean(t, axis=(1, 3))
This works too (and should be quicker, since the means are computed only once, I guess). I'll let you substitute height/blocksize, width/blocksize, and blocksize accordingly.
See @askewcan's nice remark on how to generalize this to any number of dimensions.
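A minimal sketch of that generalization, assuming every axis length is divisible by blocksize (my own illustration, not @askewcan's code):
import numpy as np

def block_mean(data, blocksize):
    # Split every axis into (n_blocks, blocksize) pairs, then average over the blocksize axes.
    shape = []
    for dim in data.shape:
        shape += [dim // blocksize, blocksize]
    t = data.reshape(shape)
    return t.mean(axis=tuple(range(1, t.ndim, 2)))

print(block_mean(np.random.random((6, 6)), 3).shape)  # (2, 2)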