I have a pandas.DataFrame 2048 by 2048 with index and columns representing y and x coordinates respectively.
I want to make an axis transformation and get to polar coordinates, making a new pandas.DataFrame with index and columns representing radius and polar angle.
The only way I can think of is to access the dataframe values one by one, calculate the radius and angle, and then set the corresponding value in the new dataframe. That is extremely slow, since element-wise operations are not fast in pandas, and it is still slow even if I perform the operation row by row.
Is there a better way to do that without writing my own CPython extension in C?
This should not be problematic if your dataframe is built as I understand it (though I think I am wrong). To build an example:
from __future__ import division
import numpy as np, pandas as pd
index = np.arange(1,2049,dtype=float)
cols = np.arange(2050,4098,dtype=float)
df = pd.DataFrame(index=index, columns=cols)
# now calculate angle and radius, then set in new dataframe
phi = np.arctan(df.columns/df.index)
r = np.power( np.power(df.index,2) + np.power(df.columns,2), 0.5 )
df_polar = pd.DataFrame(index=r, columns=phi)
While this agrees with what you stated, I think I have missed something here. If this is not right, can you clarify?
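If the goal is instead to resample the 2D values themselves onto a (radius, angle) grid, one vectorised option is to compute the Cartesian position of every polar cell and look all of them up in a single call. The following is only a sketch under several assumptions that are not in the question: x and y are evenly spaced and ascending, the pole sits at the centre of the grid, bilinear interpolation is acceptable, and the function name and grid sizes are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.ndimage import map_coordinates


def cartesian_to_polar(df, n_r=256, n_phi=360):
    """Resample df (index = y, columns = x) onto a regular (radius, angle) grid.

    Assumes evenly spaced, ascending x and y; the grid centre is used as the pole.
    """
    y = df.index.to_numpy(dtype=float)
    x = df.columns.to_numpy(dtype=float)
    x0, y0 = x.mean(), y.mean()              # pole of the polar system
    r_max = min(x[-1] - x0, y[-1] - y0)      # largest circle that fits in the grid
    r = np.linspace(0.0, r_max, n_r)
    phi = np.linspace(-np.pi, np.pi, n_phi, endpoint=False)

    # Cartesian position of every (r, phi) cell, via broadcasting
    xs = x0 + r[:, None] * np.cos(phi[None, :])
    ys = y0 + r[:, None] * np.sin(phi[None, :])

    # Convert positions to fractional row/column indices of the source array
    rows = (ys - y[0]) / (y[1] - y[0])
    cols = (xs - x[0]) / (x[1] - x[0])

    # order=1 is bilinear interpolation; every lookup happens in one call
    values = map_coordinates(df.to_numpy(dtype=float), [rows, cols],
                             order=1, mode="nearest")
    return pd.DataFrame(values, index=r, columns=phi)


# Small demonstration on the plane z = 2x + 3y
coords = np.arange(-10.0, 11.0)
demo = pd.DataFrame(3 * coords[:, None] + 2 * coords[None, :],
                    index=coords, columns=coords)
polar = cartesian_to_polar(demo, n_r=5, n_phi=8)
```

The resulting DataFrame has the radius as its index and the angle (in radians) as its columns. For a 2048 by 2048 input this avoids any Python-level loop over elements.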
I am a python newbie, trying to understand how to work with numpy masks better.
I have two 2D data arrays plus axis values, so something like
import numpy as np
data1=np.arange(50).reshape(10,5)
data2=np.random.rand(10,5)
x=5*np.arange(5)+15
y=2*np.arange(10)
Where x contains the coordinates of the 1st axis of data1 and data2, and y gives the coordinates of the 2nd axis of data1 and data2.
I want to identify and count all the points in data1 for which
data1>D1min,
the corresponding x values are inside a given range, XRange, and
the corresponding y values are inside a given range, YRange
Then, when I am all done, I also need to do a check to make sure none of the corresponding data2 values are less than another limit, D2Max
so if
XRange = [27,38]
YRange = [2,12]
D1min = 23
D2Max = 0.8
I would want to include cells 3:4 in the x direction and 1:6 in the 2nd dimension (assuming I want to include the limiting values).
That means I would only consider data1[3:4,1:6]
Then the limits of the values in the 2D arrays come into it, so I want to identify and count the points for which data1[3:4,1:6] > 23.
Once I have done that I want to take those data locations and check to see if any of those locations have values <0.8 in data2.
In reality I don't have formulas for x and y, and the arrays are much larger. Also, x and y might not even be monotonic.
I figure I should use numpy masks for this and I have managed to do it, but the result seems really tortured. I think the code would be clearer if I just looped through the values in the 2D arrays.
I think the main problem is that I have trouble combining masks with boolean operations. The ideas I get from searching on line often don't seem to work on arrays.
I assume there is an elegant and (hopefully) understandable way to do this in just a few lines with masks. Would anyone care to explain it to me?
Well I eventually came up with something, so I thought I'd post it. I welcome suggested improvements.
#expand x and y into 2D arrays so that they can more
#easily be used for masking using tile
x2D = np.tile(x,(len(y),1))
y2D = np.tile(y,(len(x),1)).T
#mask these based on the ranges in X and Y
Xmask = np.ma.masked_outside(x2D,XRange[0],XRange[1]).mask
Ymask = np.ma.masked_outside(y2D,YRange[0],YRange[1]).mask
#then combine them
#Not sure I need the shrink=False, but it seems safer
XYmask = np.ma.mask_or(Xmask, Ymask,shrink=False)
#now mask the data1 array based on D1min (masks values below D1min).
highdat = np.ma.masked_less(data1,D1min)
#combine with XYmask
data1mask = np.ma.mask_or(highdat.mask, XYmask,shrink=False)
#apply to data1
data1masked = np.ma.masked_where(data1mask,data1)
#number of points fulfilling my criteria
print('Number of points: ',np.ma.count(data1masked))
#transfer mask from data1 to data2
data2masked = np.ma.masked_where(data1mask, data2)
#do my check based on data2
if data2masked.min() < D2Max: print('data2 values are low!')
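The same selection can also be sketched with plain boolean arrays instead of np.ma, combining conditions with & and letting broadcasting expand the 1D coordinate tests to 2D. This assumes, as in the code above, that x runs along the columns (length 5) and y along the rows (length 10); the seeded random generator is only there to make the example repeatable. It uses a strict > for data1, as the question states, whereas masked_less keeps values equal to D1min.

```python
import numpy as np

data1 = np.arange(50).reshape(10, 5)
rng = np.random.default_rng(0)
data2 = rng.random((10, 5))
x = 5 * np.arange(5) + 15      # coordinates along the columns
y = 2 * np.arange(10)          # coordinates along the rows
XRange, YRange = [27, 38], [2, 12]
D1min, D2Max = 23, 0.8

# 1D tests on the coordinates; & is the element-wise "and"
x_ok = (x >= XRange[0]) & (x <= XRange[1])
y_ok = (y >= YRange[0]) & (y <= YRange[1])

# Broadcast to 2D: rows follow y, columns follow x
selected = y_ok[:, None] & x_ok[None, :] & (data1 > D1min)

n_points = np.count_nonzero(selected)
print('Number of points:', n_points)

# Boolean indexing pulls out the matching data2 values
if (data2[selected] < D2Max).any():
    print('data2 values are low!')
```

Because the tests are applied to the coordinate values themselves, this does not require x or y to be monotonic. Here True means "keep", which is the opposite of the np.ma convention where True means "masked".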
I have one dataset of satellite based solar induced fluorescence (SIF) and one of modeled precipitation. I want to compare precipitation to SIF on a per pixel basis in my study area. My two datasets are of the same area but at slightly different spatial resolutions. I can successfully plot these values across time and compare against each other when I take the mean for the whole area, but I'm struggling to create a scatter plot of this on a per pixel basis.
Honestly I'm not sure if this is the best way to compare these two values when looking for the impact of precip on SIF so I'm open to ideas of different approaches. As for merging the data currently I'm using xr.combine_by_coords but it is giving me an error I have described below. I could also do this by converting the netcdfs into geotiffs and then using rasterio to warp them, but that seems like an inefficient way to do this comparison. Here is what I have thus far:
import netCDF4
import numpy as np
import dask
import xarray as xr
rainy_bbox = np.array([
[-69.29519955115512,-13.861261028444734],
[-69.29519955115512,-12.384786628185896],
[-71.19583431678012,-12.384786628185896],
[-71.19583431678012,-13.861261028444734]])
max_lon_lat = np.max(rainy_bbox, axis=0)
min_lon_lat = np.min(rainy_bbox, axis=0)
# this dataset is available here: ftp://fluo.gps.caltech.edu/data/tropomi/gridded/
sif = xr.open_dataset('../data/TROPO_SIF_03-2018.nc')
# the dataset is global so subset to my study area in the Amazon
rainy_sif_xds = sif.sel(lon=slice(min_lon_lat[0], max_lon_lat[0]), lat=slice(min_lon_lat[1], max_lon_lat[1]))
# this data can all be downloaded from NASA Goddard here either manually or with wget but you'll need an account on https://disc.gsfc.nasa.gov/: https://pastebin.com/viZckVdn
imerg_xds = xr.open_mfdataset('../data/3B-DAY.MS.MRG.3IMERG.201803*.nc4')
# spatial subset
rainy_imerg_xds = imerg_xds.sel(lon=slice(min_lon_lat[0], max_lon_lat[0]), lat=slice(min_lon_lat[1], max_lon_lat[1]))
# I'm not sure the best way to combine these datasets but am trying this
combo_xds = xr.combine_by_coords([rainy_imerg_xds, rainy_sif_xds])
Currently I'm getting a seemingly unhelpful RecursionError: maximum recursion depth exceeded in comparison on that final line. When I add the argument join='left', the data from rainy_imerg_xds is in combo_xds; with join='right', the rainy_sif_xds data is present; and with join='inner', no data is present. I assumed this function did some internal interpolation, but it appears not.
This documentation from xarray outlines quite simply the solution to this problem. xarray allows you to interpolate in multiple dimensions and specify another Dataset's x and y dimensions as the output dimensions. So in this case it is done with
# interpolation based on http://xarray.pydata.org/en/stable/interpolation.html
# interpolation can't be done across the chunked dimension so we have to load it all into memory
rainy_sif_xds.load()
#interpolate into the higher resolution grid from IMERG
interp_rainy_sif_xds = rainy_sif_xds.interp(lat=rainy_imerg_xds["lat"], lon=rainy_imerg_xds["lon"])
# visualize the output
rainy_sif_xds.dcSIF.mean(dim='time').hvplot.quadmesh('lon', 'lat', cmap='jet', geo=True, rasterize=True, dynamic=False, width=450).relabel('Initial') +\
interp_rainy_sif_xds.dcSIF.mean(dim='time').hvplot.quadmesh('lon', 'lat', cmap='jet', geo=True, rasterize=True, dynamic=False, width=450).relabel('Interpolated')
# now that our coordinates match, in order to actually merge we need to convert the default CFTimeIndex to datetime to merge dataset with SIF data because the IMERG rainfall dataset was CFTime and the SIF was datetime
rainy_imerg_xds['time'] = rainy_imerg_xds.indexes['time'].to_datetimeindex()
# now the merge can easily be done with
merged_xds = xr.combine_by_coords([rainy_imerg_xds, interp_rainy_sif_xds], coords=['lat', 'lon', 'time'], join="inner")
# now visualize the two datasets together // multiply SIF by 30 because values are so low
merged_xds.HQprecipitation.rolling(time=7, center=True).sum().mean(dim=('lat', 'lon')).hvplot().relabel('Precip') * \
(merged_xds.dcSIF.mean(dim=('lat', 'lon'))*30).hvplot().relabel('SIF')
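For the per-pixel scatter plot asked about in the question, the interpolate-then-merge idea can be sketched on synthetic data. The grids and values below are made up; only the variable names dcSIF and HQprecipitation are borrowed from the code above, and xr.merge is used here in place of combine_by_coords because the coordinates already match after interpolation.

```python
import numpy as np
import xarray as xr

# Hypothetical coarse (SIF-like) and fine (IMERG-like) grids over the same area
coarse_lat = np.linspace(-14.0, -12.0, 5)
coarse_lon = np.linspace(-71.0, -69.0, 5)
fine_lat = np.linspace(-13.9, -12.1, 19)
fine_lon = np.linspace(-70.9, -69.1, 19)

lat2d, lon2d = np.meshgrid(coarse_lat, coarse_lon, indexing="ij")
sif = xr.Dataset(
    {"dcSIF": (("lat", "lon"), 2.0 * lat2d + 3.0 * lon2d)},
    coords={"lat": coarse_lat, "lon": coarse_lon},
)
precip = xr.Dataset(
    {"HQprecipitation": (("lat", "lon"), np.ones((19, 19)))},
    coords={"lat": fine_lat, "lon": fine_lon},
)

# Bring SIF onto the precipitation grid (linear interpolation by default)
sif_on_fine = sif.interp(lat=precip["lat"], lon=precip["lon"])

# The coordinates now match exactly, so the two datasets merge cleanly
merged = xr.merge([precip, sif_on_fine])

# One row per pixel, with both variables as columns
pixels = merged.to_dataframe().reset_index()
```

The pixels table can then be passed to any plotting library to scatter HQprecipitation against dcSIF. With real data that also has a time dimension, each row would be one pixel at one time step.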
TL;DR: Question: Is there a fast way to interpolate a scattered 2D-dataset at specific coordinates?
And if so, could someone provide an example with the provided sample data and the variables used in "Current Solution" (as I'm apparently too stupid to implement it myself)?
Problem:
I need to interpolate (and if possible also extrapolate) a DataFrame (size = (34, 18)) of scattered data at specific coordinate points. The DataFrame always stays the same.
The interpolation needs to be fast, as it is done more than 10,000 times in a loop.
The coordinates at which to interpolate are not known in advance, as they change on every loop iteration.
Current Solution:
def Interpolation(a, b):
    #import external modules
    import pandas as pd
    from scipy import interpolate
    #reading .xlsx file into DataFrame
    file = pd.ExcelFile(file_path)
    mr_df = file.parse('Model_References')
    matrix = mr_df.set_index(mr_df.columns[0])
    #interpolation at specific coordinates
    matrix = matrix.stack().reset_index().values
    value = interpolate.griddata(matrix[:,0:2], matrix[:,2], (a, b), method='cubic')
    return(value)
This method is not acceptable for long-term use, as the two lines of code under #interpolation at specific coordinates alone account for more than 95% of the execution time.
My Ideas:
scipy.interpolate.Rbf seems like the best solution if the data needs to be interpolated and extrapolated, but to my understanding it only creates a finer mesh of the existing data and cannot output an interpolated value at specific coordinates
creating a smaller 4x4 matrix of the area around the specific coordinates (a,b) would maybe decrease the execution time per loop, but I do struggle how to use griddata with the smaller matrix. I created a 5x5 matrix with the first row and column being the indexes and the other 4x4 entries is the data with the specific coordinates in the middle.
But I get a TypeError: list indices must be integers or slices, not tuple which I do not understand as I did not change anything else.
Sample Data:
0.0 0.1 0.2 0.3
0.0 -407 -351 -294 -235
0.0001 -333 -285 -236 -185
0.0002 -293 -251 -206 -161
0.00021 -280 -239 -196 -151
Thanks to @Jdog's comment I was able to figure it out:
The creation of a spline once before the loop with scipy.interpolate.RectBivariateSpline and the read out of specific coordinates with scipy.interpolate.RectBivariateSpline.ev decreased the execution time of the interpolation from 255s to 289ms.
def Interpolation(mesh, a, b):
    #interpolation at specific coordinates
    value = mesh.ev(a, b)
    return(value)
#%%
#import external modules
import pandas as pd
from scipy import interpolate
#reading .xlsx file into DataFrame
file = pd.ExcelFile(file_path)
mr_df = file.parse('Model_References')
matrix = mr_df.set_index(mr_df.columns[0])
#assumption: the grid coordinates are the index and the column labels of the table
a_index = matrix.index.to_numpy(dtype=float)
b_index = matrix.columns.to_numpy(dtype=float)
#build the spline once, before the loop
mesh = interpolate.RectBivariateSpline(a_index, b_index, matrix.to_numpy())
for iterations in loop:
    value = Interpolation(mesh, a, b)
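As a self-contained illustration of the build-once, evaluate-many pattern, here is a sketch that uses the 4x4 sample data from the question in place of the Excel file. The query points are made up, and the default cubic spline is assumed to be acceptable; RectBivariateSpline needs strictly ascending coordinates and at least four points per axis for the cubic default.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Sample grid from the question: rows follow the first coordinate, columns the second
a_index = np.array([0.0, 0.0001, 0.0002, 0.00021])
b_index = np.array([0.0, 0.1, 0.2, 0.3])
matrix = np.array([
    [-407.0, -351.0, -294.0, -235.0],
    [-333.0, -285.0, -236.0, -185.0],
    [-293.0, -251.0, -206.0, -161.0],
    [-280.0, -239.0, -196.0, -151.0],
])

# Build the spline once, outside any loop (cubic in both directions by default)
mesh = RectBivariateSpline(a_index, b_index, matrix)

# Evaluate at a single point ...
single = mesh.ev(0.0001, 0.2)

# ... or at many points in one vectorised call (hypothetical query points)
a_query = np.array([0.00005, 0.00015, 0.00020])
b_query = np.array([0.05, 0.15, 0.25])
many = mesh.ev(a_query, b_query)
```

If all query coordinates for a batch are known at once, passing them as arrays to a single ev call avoids the Python loop altogether.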
I have two filters, shown in the plot. They are supposed to have the same structure, but they are scaled differently, and the data for the top filter is truncated before 10000. I just set the value to zero at 10000, but I would like to extrapolate the top filter so that it follows the structure of the bottom filter. The data for each filter is provided in the links. I don't know how to obtain the tail structure from the bottom filter's data and apply it to the top one, given that they are scaled differently. Note that I need to use the upper-panel filter because my other filters are calibrated accordingly.
I can obtain the interpolation for the lower filter using interp1d, but I don't know how to rescale it properly so that it can be used for the top filter.
from scipy.interpolate import interp1d
import numpy as np
u=np.loadtxt('WFI_I.res')
f=interp1d(u[:,0], u[:,1])
# np.arange replaces the scipy.arange alias, which newer SciPy versions no longer provide
x=np.arange(7050, 12000)
y=f(x)
I will be grateful for any suggestion or code to do that.
Assuming that you have two filter arrays with y values of filter1 and filter2 and x (wavelength) values of wave1 and wave2, then something like this should work (untested though):
wave_match = 9500 # wavelength for matching
index1 = np.searchsorted(wave1, wave_match)
index2 = np.searchsorted(wave2, wave_match)
match1 = filter1[index1]
match2 = filter2[index2]
scale = match1 / match2
wave12 = np.concatenate([wave1[:index1], wave2[index2:]])
filter12 = np.concatenate([filter1[:index1], scale * filter2[index2:]])
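To see the splice in action without the real filter files, here is a small runnable sketch. The two Gaussian curves are hypothetical stand-ins, not the WFI data, and they are constructed so that one is exactly four times the other.

```python
import numpy as np

# Hypothetical stand-ins for the two filter curves
wave1 = np.arange(7000.0, 10001.0, 100.0)     # top filter, truncated at 10000
filter1 = 0.5 * np.exp(-((wave1 - 8500.0) / 900.0) ** 2)
wave2 = np.arange(7000.0, 12001.0, 100.0)     # bottom filter, with the full tail
filter2 = 2.0 * np.exp(-((wave2 - 8500.0) / 900.0) ** 2)

wave_match = 9500.0                           # wavelength where the curves are joined
index1 = np.searchsorted(wave1, wave_match)
index2 = np.searchsorted(wave2, wave_match)

# Rescale the bottom filter so that both curves agree at the matching wavelength
scale = filter1[index1] / filter2[index2]

# Top filter below the match point, rescaled bottom filter from there on
wave12 = np.concatenate([wave1[:index1], wave2[index2:]])
filter12 = np.concatenate([filter1[:index1], scale * filter2[index2:]])
```

The matching wavelength should lie where both filters still have reliable, non-zero data. With noisy measurements, a ratio of averages over a small window around the match point may be more stable than a single-sample ratio.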
Since the two rasters (raster1 and raster2) overlap each other, I want to make a new raster by calculating the mean of each pair of overlapping pixels; i.e., the resulting new raster is calculated as:
new = [[mean(1,3), mean(1,3), mean(1,3), mean(1,3), mean(1,3)],[mean(2,4),mean(2,4),mean(2,4),mean(2,4),mean(2,4)]]
import numpy as np
raster1 = np.array([[1,1,1,1,1],[2,2,2,2,2]])
raster2 = np.array([[3,3,3,3,3],[4,4,4,4,4]])
new = np.mean(raster1,raster2,axis=1)
print (new.tolist())
What is wrong?
Maybe I misunderstood you, but is this what you want?
raster = (raster1 + raster2) / 2
Actually, in this case you don't even need np.mean; element-wise array arithmetic is enough.
np.mean calculates the mean of a single array along a given axis, so it covers a different situation.
It should be
new = np.mean([raster1,raster2],axis=1)
with brackets. Actually, I am guessing it should be
new = np.mean([raster1,raster2],axis=0)
The first argument to np.mean should be the whole array, see e.g. http://wiki.scipy.org/Numpy_Example_List_With_Doc#mean
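A short runnable check of the axis=0 version, using the arrays from the question. np.stack is used here only to make the extra leading axis explicit; passing the list directly to np.mean builds the same array.

```python
import numpy as np

raster1 = np.array([[1, 1, 1, 1, 1], [2, 2, 2, 2, 2]])
raster2 = np.array([[3, 3, 3, 3, 3], [4, 4, 4, 4, 4]])

# Stack into shape (2, 2, 5): axis 0 now indexes the raster,
# so averaging over it gives one mean per pixel
stacked = np.stack([raster1, raster2])
new = stacked.mean(axis=0)
print(new.tolist())
# [[2.0, 2.0, 2.0, 2.0, 2.0], [3.0, 3.0, 3.0, 3.0, 3.0]]
```

This form extends to any number of rasters of the same shape, whereas (raster1 + raster2) / 2 has to be rewritten for each additional raster.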