Is this the most efficient and accurate way to extrapolate using scipy? - python

I have a set of data points over time, but there is some missing data and the data is not at regular intervals. In order to get a full data set over time at regular intervals I did the following:
import pandas as pd
import numpy as np
from scipy import interpolate
x = data['time']
y = data['shares']
f = interpolate.interp1d(x, y, fill_value='extrapolate')
time = np.arange(0, 3780060, 600)
new_data = []
for interval in time:
    new_data.append(f(interval))
test = pd.DataFrame({'time': time, 'shares': new_data})
test = test.astype(float)
When both the original and the extrapolated data sets are plotted, they seem to line up almost perfectly, but I still wonder if there is a more efficient and/or accurate way to accomplish the above.

You should apply the interpolation function only once, like this:
new_data = f(time)
If the new time points fall within the range of the original data, fill_value='extrapolate' is redundant, because the operation is just interpolation. You only need 'extrapolate' when the new interval extends beyond the original one, and extrapolating far outside the data is generally bad practice.
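Putting that together, here is a minimal sketch of the vectorized version of the code from the question (same variable names; the final DataFrame construction is an assumption about what the last two lines of the original were meant to do):
import numpy as np
import pandas as pd
from scipy import interpolate
f = interpolate.interp1d(data['time'], data['shares'], fill_value='extrapolate')
time = np.arange(0, 3780060, 600)
new_data = f(time)  # one vectorized call instead of a Python loop
test = pd.DataFrame({'time': time, 'shares': new_data}).astype(float)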

Related

Interpolate: spectra (wavelength, counts) at a given temperature, to create grid of temperature and counts

I have a number of spectra: wavelength/counts at a given temperature. The wavelength range is the same for each spectrum.
I would like to interpolate between temperature and counts to create a large grid of spectra (temperature and counts at the given wavelength range).
The code below is my current progress. When I try to get a spectrum for a given temperature I only get one value of counts when I need a range of counts representing the spectrum (I already know the wavelengths).
I think I am confused about arrays and interpolation. What am I doing wrong?
import pandas as pd
import numpy as np
from scipy import interpolate
image_template_one = pd.read_excel("mr_image_one.xlsx")
counts = np.array(image_template_one['counts'])
temp = np.array(image_template_one['temp'])
inter = interpolate.interp1d(temp, counts, kind='linear')
temp_new = np.arange(30, 50.5, 0.5)  # steps of 0.5; note np.linspace's third argument is a count, not a step
counts_new = inter(temp_new)
I now think that I have two arrays: [wavelength, counts] and [wavelength, temperature]. Is this correct, and do I need to interpolate between the arrays?
Example data
I think what you want to achieve can be done with interp2d:
import numpy as np
import pandas as pd
from scipy import interpolate
# dummy data
data = pd.DataFrame({
'temp': [30]*6 + [40]*6 + [50]*6,
'wave': 3 * [a for a in range(400,460,10)],
'counts': np.random.uniform(.93,.95,18),
})
# make the interpolator
inter = interpolate.interp2d(data['temp'], data['wave'], data['counts'])
# scipy's interpolators return functions,
# which you need to call with the values you want interpolated.
new_x, new_y = np.linspace(30,50,100), np.linspace(400,450,100)
interpolated_values = inter(new_x, new_y)
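As a follow-up, to get the spectrum at a single temperature (which is what the question ultimately asks for), you can call the interpolator with a scalar temperature and the full wavelength grid. A minimal sketch, assuming the inter object built above (note that interp2d is deprecated in recent SciPy releases):
# counts across the wavelength grid at a single (possibly unmeasured) temperature, e.g. 35
spectrum_at_35 = inter(35.0, new_y).ravel()  # one interpolated count per wavelength in new_y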

Optimize plane of array (POA) irradiance calculation using WRF (netCDF) data

I need to calculate the plane of array (POA) irradiance using python's pvlib package (https://pvlib-python.readthedocs.io/en/stable/). For this I would like to use the output data from the WRF model (GHI, DNI, DHI). The output data is in netCDF format, which I open using the netCDF4 package and then I extract the necessary variables using the wrf-python package.
With that I get an xarray.Dataset with the variables I will use. I then use the xarray.Dataset.to_dataframe() method to transform it into a pandas DataFrame, and then transform the DataFrame into a numpy array using dataframe.values. Then I loop over the rows, and in each iteration I calculate the POA for one grid point using the function irradiance.get_total_irradiance (https://pvlib-python.readthedocs.io/en/stable/auto_examples/plot_ghi_transposition.html).
That's the way I've been doing it so far; however, I have over 160,000 grid points in the WRF domain, the data is hourly, and it spans 365 days, which amounts to a very large quantity of data. I believe that if pvlib could work directly with the xarray.Dataset it could be faster, but I could only make it work this way, transforming the data into a numpy array and looping through the rows. Could anyone tell me how I can optimize this calculation? The code I developed is very time-consuming.
If anyone can help me with this I would appreciate it. Maybe an improvement to the code, or another way to calculate the POA from the WRF data...
I'm providing the code I've built so far:
from pvlib import location
from pvlib import irradiance
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import netCDF4
import wrf
from tqdm import tqdm
Getting WRF data
variaveis = ['T2',
'U10',
'V10',
'SWDDNI',
'SWDDIF',
'SWDOWN']
netcdf_data = netCDF4.Dataset('wrfout_d02_2003-11-01_00_00_00')
first = True
for v in variaveis:
    var = wrf.getvar(netcdf_data, v, timeidx=wrf.ALL_TIMES)
    if first:
        met_data = var
        first = False
    else:
        met_data = xr.merge([met_data, var])
met_data = xr.Dataset.reset_coords(met_data, ['XTIME'], drop=True)
met_data['T2'] = met_data['T2'] - 273.15
WS10 = (met_data['U10']**2 + met_data['V10']**2)**0.5
met_data['WS10'] = WS10
df = met_data[['SWDDIF',
'SWDDNI',
'SWDOWN',
'T2',
'WS10']].to_dataframe().reset_index().drop(columns=['south_north',
'west_east'])
df.rename(columns={'SWDOWN': 'ghi',
'SWDDNI':'dni',
'SWDDIF':'dhi',
'T2':'temp_air',
'WS10':'wind_speed',
'XLAT': 'lat',
'XLONG': 'lon',
'Time': 'time'}, inplace=True)
df.set_index(['time'], inplace=True)
df = df[df.ghi>0]
df.index = df.index.tz_localize('America/Recife')
Function to get POA irradiance
def get_POA_irradiance(lon, lat, date, dni, dhi, ghi, tilt=10, surface_azimuth=0):
    site_location = location.Location(lat, lon, tz='America/Recife')
    # Get solar azimuth and zenith to pass to the transposition function
    solar_position = site_location.get_solarposition(times=date)
    # Use the get_total_irradiance function to transpose the GHI to POA
    POA_irradiance = irradiance.get_total_irradiance(
        surface_tilt=tilt,
        surface_azimuth=surface_azimuth,
        dni=dni,
        ghi=ghi,
        dhi=dhi,
        solar_zenith=solar_position['apparent_zenith'],
        solar_azimuth=solar_position['azimuth'])
    # Return DataFrame with only GHI and POA
    return pd.DataFrame({'lon': lon,
                         'lat': lat,
                         'GHI': ghi,
                         'POA': POA_irradiance['poa_global']}, index=[date])
Loop in each row (time) of the array
array = df.reset_index().values
list_poa = []
def loop_POA():
    for i in tqdm(range(len(array) - 1)):
        POA = get_POA_irradiance(lon=array[i, 6],
                                 lat=array[i, 7],
                                 dni=array[i, 2],
                                 dhi=array[i, 1],
                                 ghi=array[i, 3],
                                 date=str(array[i, 0]))
        list_poa.append(POA)
    return list_poa
poa_final = pd.concat(loop_POA())
Thanks both for a good question and for using pvlib! You're right that pvlib is intended for modeling single locations and is not designed for use with xarray datasets, although some functions might coincidentally work with them.
I strongly suspect that the majority of the runtime you're seeing is for the solar position calculations. You could switch to a faster method (see the method options here), as the default solar position method is very accurate but also quite slow when calculating bulk positions. Installing numba will help, but it still might be too slow for you, so you might check the other models (ephemeris, pyephem). There are also some fast but low-precision methods, but you will need to change your code a bit to use them. See the list under "Correlations and analytical expressions for low precision solar position calculations" here.
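For example, a minimal sketch of switching the solar-position method (the coordinates here are placeholder values; 'ephemeris' is one of the documented method options):
from pvlib import solarposition
solpos = solarposition.get_solarposition(
    time=df.index,        # the full DatetimeIndex at once
    latitude=-8.05,       # placeholder coordinates
    longitude=-34.9,
    method='ephemeris')   # faster than the default 'nrel_numpy', slightly less precise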
Like Michael Delgado suggests in the comments, parallel processing is an option. But that can be a headache in python. You will probably want multiprocessing, not multithreading.
Another idea is to use atlite, a python package designed for this kind of spatial modeling. But its solar modeling capabilities are not nearly as detailed as pvlib, so it might not be useful for your case.
One other note: I don't know if the WRF data are interval averages or instantaneous values, but if you care about accuracy you should handle them differently for transposition. See this example.
Edit to add: after looking at your code again, there might be another significant speedup to be had. Are you calling get_POA_irradiance for single combinations of position and timestamp? If so, that is unnecessary and very slow. It would be much faster to pass in the full time series for each location, i.e. scalar lat/lon but vector irradiance.
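A hedged sketch of that suggestion, grouping the dataframe by grid point and passing each location's whole time series to pvlib in one call (this assumes df still has the 'lat' and 'lon' columns produced by the rename above):
import pandas as pd
from pvlib import location, irradiance
def poa_for_site(lat, lon, site_df, tilt=10, surface_azimuth=0):
    # one solar-position and one transposition call per grid point,
    # vectorized over the whole DatetimeIndex
    site = location.Location(lat, lon, tz='America/Recife')
    solpos = site.get_solarposition(times=site_df.index)
    poa = irradiance.get_total_irradiance(
        surface_tilt=tilt,
        surface_azimuth=surface_azimuth,
        dni=site_df['dni'],
        ghi=site_df['ghi'],
        dhi=site_df['dhi'],
        solar_zenith=solpos['apparent_zenith'],
        solar_azimuth=solpos['azimuth'])
    return pd.DataFrame({'lon': lon, 'lat': lat,
                         'GHI': site_df['ghi'],
                         'POA': poa['poa_global']}, index=site_df.index)
poa_final = pd.concat(
    poa_for_site(lat, lon, site_df)
    for (lat, lon), site_df in df.groupby(['lat', 'lon'])
)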

Efficient expanding OLS in pandas

I would like to explore solutions for performing expanding OLS efficiently in pandas (or other libraries that work nicely with DataFrames/Series).
Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window while expanding uses a variable window (starting from beginning);
Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
    y = df[y_name]
    X = df[X_name]
    X = sm.add_constant(X)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return b
df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this approach does not seem to work; the method appears too inflexible, and I am not sure how one could pass the two Series as arguments.
Any suggestions would be appreciated.
One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:
import numpy as np
import statsmodels.api as sm
# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered
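Applied to the small DataFrame from the question, a sketch like the following should reproduce the expected expanding slopes (roughly 0.6667 and 1.0385 after the diffuse-prior-dominated start):
import pandas as pd
import statsmodels.api as sm
df = pd.DataFrame({'X': [1, 2.5, 3], 'y': [4, 5, 6.3]})
mod = sm.RecursiveLS(df['y'], sm.add_constant(df['X']))
res = mod.fit()
# recursive_coefficients.filtered has shape (k, nobs); row 0 holds the intercepts, row 1 the slopes
df['beta'] = res.recursive_coefficients.filtered[1]
print(df)  # beta approximately [_, 0.666667, 1.038462]; treat the first k values with caution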

what is the difference between the two datasets for numpy.fft

I am trying to find the period of a sin curve and can find the right period for sin(t).
However, for sin(k*t) the frequency shifts, and I do not know how it shifts.
I can adjust the value of interd below to get the right signal only if I already know that the dataset is sin(0.6*t).
Why can I get the right result for sin(t)?
Can anyone detect the right frequency based only on my code, or with just a small change?
The figure below is the power spectral density of sin(0.6*t).
The dataset is like:
1,sin(1*0.6)
2,sin(2*0.6)
3,sin(3*0.6)
.........
2000,sin(2000*0.6)
And my code:
import numpy as np
import matplotlib.pyplot as pl
timepoints = np.loadtxt('dataset', usecols=(0,), unpack=True, delimiter=",")
intensity = np.loadtxt('dataset', usecols=(1,), unpack=True, delimiter=",")
binshu = 300
lastime = 2000
interd = 2000.0/300
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity),d=interd)
freqnum = np.fft.fftfreq(len(intensity),d=interd).argsort()
pl.xlabel("frequency(Hz)")
pl.plot(freq[freqnum]*6.28, np.sqrt(sp.real**2+sp.imag**2)[freqnum])
I think you're making it too complicated. If you consider timepoints to be in seconds then interd is 1 (difference between values in timepoints). This works fine for me:
import numpy as np
import matplotlib.pyplot as pl
# you can do this in one line, that's what 'unpack' is for:
timepoints, intensity = np.loadtxt('dataset', usecols=(0,1), unpack=True, delimiter=",")
interd = timepoints[1] - timepoints[0] # if this is 1, it can be ignored
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity), d=interd)
pl.plot(np.fft.fftshift(freq), np.fft.fftshift(np.abs(sp)))
pl.xlabel("frequency(Hz)")
pl.show()
You'll also note that I didn't sort the frequencies, that's what fftshift is for.
Also, don't do np.sqrt(sp.imag**2 + sp.real**2), that's what np.abs is for :)
If you're not sampling densely enough (the signal frequency is above the Nyquist frequency, i.e. k > pi/interd, or equivalently k/(2*pi) > 1/(2*interd)), then there's no way for fft to know how much data you're missing, so it assumes you're not missing any. You can't expect it to know a priori; the FFT only sees the sampled points you give it.
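To actually read the dominant frequency (and hence the period) off the spectrum, a small sketch along these lines should work on the dataset above, assuming interd is the true sample spacing of 1:
import numpy as np
spec = np.abs(np.fft.rfft(intensity))              # one-sided magnitude spectrum
freqs = np.fft.rfftfreq(len(intensity), d=interd)  # matching frequency bins (Hz)
peak_freq = freqs[np.argmax(spec[1:]) + 1]         # skip the DC bin at index 0
k_est = 2 * np.pi * peak_freq                      # angular frequency, ~0.6 for sin(0.6*t)
period = 2 * np.pi / k_est                         # ~10.5 time units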

How to create a grid from LiDAR points (X,Y,Z) with GDAL python?

I'm quite new to Python programming, and I was wondering whether it is possible to create a regular grid with 0.5 by 0.5 m resolution from LiDAR points.
My data are in LAS format (read with from liblas import file as lasfile) and have the format X, Y, Z, where X and Y are coordinates.
The points are randomly positioned: some pixels are empty (NaN value) and some pixels contain more than one point. Where there is more than one point, I wish to obtain the mean value. In the end I need to save the data in TIF or ASCII format.
I am studying the osgeo module and GDAL, but honestly I don't know whether the osgeo module is the best solution.
I would really be glad for help with some code that I can study and implement. Thanks in advance for the help, I really need it.
I don't know the best way to get a grid with these parameters.
It's a bit late but maybe this answer will be useful for others, if not for you...
I have done this with Numpy and Pandas, and it's pretty fast. I was using TLS data and could do this with several million data points without any trouble on a decent 2009-vintage laptop. The key is 'binning' by rounding the data, and then using Pandas' GroupBy methods to do the aggregating and calculate the means.
If you need to round to a power of 10 you can use np.round, otherwise you can round to an arbitrary value by making a function to do so, which I have done by modifying this SO answer.
import numpy as np
import pandas as pd
# make rounding function:
def round_to_val(a, round_val):
    return np.round(np.array(a, dtype=float) / round_val) * round_val
# load data (placeholder filename; expects an array of shape (n_data, 3) with x, y, z columns)
data = np.load('your_lidar_points.npy')
n_d = data.shape[0]
# round the data
d_round = np.empty( [n_d, 5] )
d_round[:,0] = data[:,0]
d_round[:,1] = data[:,1]
d_round[:,2] = data[:,2]
del data # free up some RAM
d_round[:,3] = round_to_val( d_round[:,0], 0.5)
d_round[:,4] = round_to_val( d_round[:,1], 0.5)
# sorting data
ind = np.lexsort( (d_round[:,4], d_round[:,3]) )
d_sort = d_round[ind]
# making dataframes and grouping stuff
df_cols = ['x', 'y', 'z', 'x_round', 'y_round']
df = pd.DataFrame( d_sort)
df.columns = df_cols
df_round = df[['x_round', 'y_round', 'z']]
group_xy = df_round.groupby(['x_round', 'y_round'])
# calculating the mean, write to csv, which saves the file with:
# [x_round, y_round, z_mean] columns. You can exit Python and then start up
# later to clear memory if that's an issue.
group_mean = group_xy.mean()
group_mean.to_csv('your_binned_data.csv')
# Restarting...
import numpy as np
from scipy.interpolate import griddata
binned_data = np.loadtxt('your_binned_data.csv', skiprows=1, delimiter=',')
x_bins = binned_data[:,0]
y_bins = binned_data[:,1]
z_vals = binned_data[:,2]
pts = np.array( [x_bins, y_bins])
pts = pts.T
# make grid (with borders rounded to 0.5...)
xmin, xmax = 637000, 640000.5
ymin, ymax = 6067000, 6070000.5
grid_x, grid_y = np.mgrid[xmin:xmax:0.5, ymin:ymax:0.5]
# interpolate onto grid
data_grid = griddata(pts, z_vals, (grid_x, grid_y), method='cubic')
# save to ascii
np.savetxt('data_grid.txt', data_grid)
When I've done this, I have saved the output as a .npy and converted to a tiff with the Image library, and then georeferenced in ArcMap. There is probably a way to do that with osgeo but I haven't used it.
Hope this helps someone at least...
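For the saving-as-TIF step, here is a hedged sketch of how the interpolated grid could be written to GeoTIFF with osgeo/GDAL instead of going through ArcMap; the pixel size and origin assume the 0.5 m grid built above, and the array may need re-orienting depending on how your grid is laid out:
import numpy as np
from osgeo import gdal
nx, ny = data_grid.shape                     # x along axis 0, y along axis 1, as built by mgrid above
arr = np.where(np.isfinite(data_grid), data_grid, -9999.0)   # replace NaNs with a nodata value
driver = gdal.GetDriverByName('GTiff')
out = driver.Create('data_grid.tif', nx, ny, 1, gdal.GDT_Float32)
# geotransform = (top-left x, pixel width, 0, top-left y, 0, -pixel height)
out.SetGeoTransform((xmin, 0.5, 0.0, ymax, 0.0, -0.5))
band = out.GetRasterBand(1)
band.SetNoDataValue(-9999.0)
band.WriteArray(arr.T[::-1])                 # flip so row 0 is the northern (ymax) edge
out.FlushCache()
out = None                                   # closing the dataset flushes it to disk
# (no SetProjection here because the question does not specify a CRS)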
You can use the histogram function in NumPy to do the binning, for instance:
import numpy as np
points = np.random.random(1000)           # coordinates to bin
values = np.random.random(1000)           # values to average per bin (e.g. z)
# create 10 equal bins from 0 to 1 (11 edges)
bins = np.linspace(0, 1, 11)
means = (np.histogram(points, bins, weights=values)[0] /
         np.histogram(points, bins)[0])
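For the 2-D case in this question (mean z per 0.5 m x/y cell), the same idea extends to np.histogram2d; a small sketch with made-up x, y, z arrays:
import numpy as np
# hypothetical point cloud: x, y coordinates in metres and z heights
x = np.random.uniform(0, 100, 10000)
y = np.random.uniform(0, 100, 10000)
z = np.random.uniform(0, 5, 10000)
xedges = np.arange(0, 100.5, 0.5)            # 0.5 m cell edges
yedges = np.arange(0, 100.5, 0.5)
sums, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=z)
counts, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])
with np.errstate(invalid='ignore'):
    mean_z = sums / counts                   # NaN for cells with no points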
Try LAStools, particularly lasgrid or las2dem.
