I'm reasonably new here, so sorry if this is a basic question.
I have a dataset comprising several thousand 'hotspots': small sections of 3D data, at most a few tens or hundreds of datapoints each, and each arranged around a centre. I am trying (in Python) to fit each one to a particular radial basis function, roughly of the form exp(-r)/r, and to obtain the parameters of that fit for each hotspot. All the libraries I've seen that do this are ML-based and treat the hotspots as training data from which to obtain global parameters, but that's not what I need, because each hotspot should have different parameters, and those variations carry meaningful information. scipy.interpolate.Rbf doesn't seem to let you retrieve the parameters it uses, unless I'm misreading the documentation. Is there any library that does this easily? Any advice would be much appreciated!
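For concreteness, the per-hotspot fit I'm after looks roughly like the sketch below, using scipy.optimize.curve_fit; the exact model form A * exp(-r/s) / r and the data layout are placeholder assumptions on my part.

import numpy as np
from scipy.optimize import curve_fit

def radial_model(xyz, A, s):
    # A * exp(-r/s) / r, with r the distance from the hotspot centre
    r = np.linalg.norm(xyz, axis=1)
    return A * np.exp(-r / s) / r

def fit_hotspot(points, values, centre, p0=(1.0, 1.0)):
    # Fit one hotspot; points is (N, 3), values is (N,); returns (A, s)
    popt, _ = curve_fit(radial_model, points - centre, values, p0=p0)
    return popt

# One parameter pair per hotspot, each fitted independently, e.g.:
# params = [fit_hotspot(pts, vals, ctr) for pts, vals, ctr in hotspots]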
I am currently coding a project visualizing aperiodic tilings in the hyperbolic plane. My code takes several parameters (side length, number of corners) and, after rendering with drawSvg, returns an SVG file. I noticed that varying the side-length parameter yielded widely different results in a continuous fashion.
The closest I found was this, which, however, does not go nearly as far as I require, handling only elementary fade-ins and fade-outs. Similarly, the classic SVG animate mechanism does not do nearly enough to be of use to me.
My question is: is there some package in Python which can step the parameter over a sufficiently small discrete interval (i.e., basically stop-motion) and render all of these images in sequence, in order to attain the appearance of continuously varying the parameter?
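The kind of stop-motion pipeline I have in mind would look roughly like the following sketch, rasterizing each SVG frame with cairosvg and stitching the frames into a GIF with imageio; make_tiling_svg is a placeholder for my actual drawSvg rendering code.

import cairosvg
import imageio.v2 as imageio
import numpy as np

def make_tiling_svg(side_length):
    # Placeholder: replace with the real drawSvg rendering, e.g. drawing.as_svg()
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
            f'<circle cx="100" cy="100" r="{40 * side_length}"/></svg>')

frames = []
for s in np.linspace(0.5, 2.0, 60):  # 60 small steps of the side length
    png = cairosvg.svg2png(bytestring=make_tiling_svg(s).encode())
    frames.append(imageio.imread(png))

imageio.mimsave('tiling.gif', frames, duration=0.05)  # seconds per frame in the v2 API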
The Problem
I'm a physics graduate research assistant, and I'm trying to build a very ambitious Python code that, boiled down to the basics, evaluates a line integral in a background described by very large arrays of data.
I'm generating large sets of data that are essentially t × n × n arrays, with n and t on the order of 100. They represent the time-varying temperatures and velocities of a two-dimensional space. I need to collect many of these grids, then randomly choose a dataset and calculate a numerical integral, dependent on the grid data, along some random path through the plane (essentially 3 separate grids: x-velocity, y-velocity, and temperature, as the vector information is important). The end goal is gross amounts of statistics on the integral values for given datasets.
Visual example of time-varying background.
That essentially entails being able to sample the background at particular points in time and space, say like (t, x, y). The data begins as a big ol' table of points, with each row organized as ['time','xpos','ypos','temp','xvel','yvel'], with an entry for each (xpos, ypos) point in each timestep time, and I can massage it how I like.
The issue is that I need to sample thousands of random paths in many different backgrounds, so time is a big concern. Barring the time required to generate the grids, the biggest holdup is being able to access the data points on the fly.
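To make that concrete, one evaluation would look roughly like the sketch below, assuming the background has already been wrapped in a callable interpolator (see the plan further down) and using a straight-line path traversed at constant speed as a stand-in for the random paths.

import numpy as np
from scipy.integrate import trapezoid

def line_integral(interp, start, end, t0=0.0, t1=1.0, n=500):
    # Integrate interp along a straight path from start to end, where start
    # and end are NumPy arrays of shape (2,), sampling n (t, x, y) points
    s = np.linspace(0.0, 1.0, n)                      # path parameter
    t = t0 + s * (t1 - t0)                            # time along the path
    xy = start[None, :] + s[:, None] * (end - start)  # (n, 2) path points
    values = interp(np.column_stack([t, xy]))         # sample the background
    ds = np.linalg.norm(end - start) / (n - 1)        # arc-length element
    return trapezoid(values, dx=ds)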
I previously built a proof of concept of this project in Mathematica, which is much friendlier to the analytic mindset that I'm approaching the project from. In that prototype, I read in a collection of 10 backgrounds and used Mathematica's ListInterpolation[] function to generate a continuous function that represented the discrete grid data. I stored these 10 interpolating functions in an array and randomly called them when calculating my numerical integrals.
This method worked well enough, after some massaging, for 10 datasets, but as the project moves forward that may rapidly expand to, say, 10000 datasets. Eventually, the project is likely to be set up to generate the datasets on the fly, saving each for future analysis if necessary, but that is some time off. By then, it will be running on a large cluster machine that should be able to parallelize a lot of this.
In the meantime, I would like to generate some number of datasets and be able to sample from them at will in whatever is likely to be the fastest process. The data must be interpolated to be continuous, but beyond that I can be very flexible with how I want to do this. My plan was to do something similar to the above, but I was hoping to find a way to generate these interpolating functions for each dataset ahead of time, then save them to some file. The code would then randomly select a background, load its interpolating function, and evaluate the line integral.
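A minimal, self-contained sketch of what I mean, assuming each background lies on a regular (t, x, y) grid; the axis sizes, random data, and file name below are placeholders.

import pickle
import numpy as np
from scipy.interpolate import RegularGridInterpolator

t_ax = np.linspace(0.0, 10.0, 100)     # ~100 timesteps
x_ax = np.linspace(-5.0, 5.0, 100)     # ~100 x 100 spatial grid
y_ax = np.linspace(-5.0, 5.0, 100)
temps = np.random.rand(100, 100, 100)  # stand-in for one temperature background

interp = RegularGridInterpolator((t_ax, x_ax, y_ax), temps)

with open('background_0_temp.pkl', 'wb') as f:  # save for future runs
    pickle.dump(interp, f)

with open('background_0_temp.pkl', 'rb') as f:  # later: reload and sample
    interp = pickle.load(f)
print(interp([[1.0, 0.5, -0.5]]))               # temperature at (t, x, y)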
Initial Research
While hunting to see if someone else had already asked a similar question, I came across this:
Fast interpolation of grid data
The OP seemed interested in just getting back a tighter grid rather than a callable function, which might be useful to me if all else fails, but the solutions also seemed to rely on methods that are limited by the size of my datasets.
I've been Googling about for interpolation packages that could get at what I want. The only things I've found that seem to fit the bill are:
scipy.interpolate.griddata()
scipy.interpolate.interpn()
numpy.interp()
Attempts at a Solution
I have one sample dataset (I would make it available, but it's about 200MB or so), and I'm trying to generate and store an interpolating function for the temperature grid. Even just this step is proving pretty troubling for me, since I'm not very fluent in Python. I found that it was slightly faster to load the data through pandas, cut to the sections I'm interested in, and then stick this in a numpy array.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import griddata
# Load grid data from file
gridData = pd.read_fwf('Backgrounds\\viscous_14_moments_evo.dat', header=None, names=['time','xpos','ypos','temp','xvel','yvel'])
# Set grid parameters
# nGridSpaces is total number of grid spaces / bins.
# Will be data-dependent in the future.
nGridSpaces = 27225
# Number of timesteps is gridData's time column divided by number of grid spaces.
NT = int(len(gridData['time'])/nGridSpaces)
From here, I've tried to use Scipy's interpn() and griddata() functions, to no avail. I believe I'm just not understanding how they want to take the data. I think my issue with both is corralling the (t, x, y) "points" corresponding to the temperature values into a usable form.
The main thrust of my efforts has been trying to get them into Numpy's meshgrid(), but I believe I may be hitting the upper limit on the size of data Numpy will take for this sort of thing.
# Pull each coordinate column out as a flat 1-D array
tList = gridData['time'].to_numpy()
xList = gridData['xpos'].to_numpy()
yList = gridData['ypos'].to_numpy()

# 3D grid of points
points = np.meshgrid(tList, xList, yList)

# Flat array of temperature values
tempValues = gridData['temp'].to_numpy()

# Interpolate and spit out a value for a point somewhere central-ish as a check
point = np.array([1, 80, 80])
griddata(points, tempValues, point)
This raises a ValueError on the line calling meshgrid():
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
The Questions
First off... What limitations are there on the sizes of datasets I'm using? I wasn't able to find anything in Numpy's documentation about maximum sizes.
Next... Does this even sound like a smart plan? Can anyone think of a smarter framework to get where I want to go?
What are the biggest impacts on speed when handling these types of large arrays, and how can I mitigate them?
I have a number of time series of angular data. These values are not vectors (they have no magnitude), just angles. I need to determine how correlated the various time series are with each other (e.g., I would like to obtain a correlation matrix) over the duration of the data. For example, some are measured very close to each other, and I expect these to be highly correlated, but I'm also interested in seeing how correlated the more distant measurements are.
How would I go about adapting this angular data in order to obtain a correlation matrix? I thought about vectorizing it (i.e., converting each angle to a unit vector), but then I'm not sure how to do the correlation analysis on this two-dimensional data, as I've previously only done it with one-dimensional data. Of course, I can't simply analyze the correlation of the angles themselves, due to the nature of angular data (the wraparound at 0/360 degrees).
I'm working in Python, so if anyone has any recommendations on relevant packages I would appreciate it.
I have found a solution in the Astropy Python package. The following function is suitable for circular correlation:
https://docs.astropy.org/en/stable/api/astropy.stats.circcorrcoef.html
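For reference, a minimal sketch of building a pairwise correlation matrix with it; the four random series are placeholder data, and with plain arrays circcorrcoef expects angles in radians.

import numpy as np
from astropy.stats import circcorrcoef

rng = np.random.default_rng(0)
angles_deg = rng.uniform(0, 360, size=(4, 1000))  # 4 series, 1000 samples each
angles_rad = np.deg2rad(angles_deg)

n = angles_rad.shape[0]
corr = np.eye(n)  # self-correlation is 1 by definition
for i in range(n):
    for j in range(i + 1, n):
        corr[i, j] = corr[j, i] = float(circcorrcoef(angles_rad[i], angles_rad[j]))

print(np.round(corr, 3))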
My goal is to build a temperature gradient map over a floor plan, displaying minute changes in temperature via uniformly distributed sensors.
As far as I understand, most available heatmap tools work with point density to produce heatmaps, whereas what I'm looking for is a gradient based on the varying values of individual points (sensors) on the map, i.e., something like this...
which I nicked from here.
From what I've gathered, interpolation will definitely be involved, and it may well be Radial Basis Function interpolation, because it wouldn't require a grid, as per this post.
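To sketch what I mean: SciPy ships this directly as scipy.interpolate.RBFInterpolator (SciPy 1.7+); the sensor positions and readings below are made up.

import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
sensor_xy = rng.uniform(0, 10, size=(25, 2))    # 25 sensors on a 10 x 10 floor
sensor_temp = 20 + 2 * np.sin(sensor_xy[:, 0])  # fake temperature readings

rbf = RBFInterpolator(sensor_xy, sensor_temp, kernel='thin_plate_spline')

# Evaluate on a dense grid to get a smooth gradient map
gx, gy = np.meshgrid(np.linspace(0, 10, 200), np.linspace(0, 10, 200))
grid_temp = rbf(np.column_stack([gx.ravel(), gy.ravel()])).reshape(gx.shape)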
I've used the Anaconda distribution thus far. The data from the sensors will be extracted from TimescaleDB, and the positions of the sensors will be lat/long coordinates.
I've done some very minor experimentation with the code from the link above and got this result: Radial Basis Function Interpolation.
So here are my questions. Multiple Python libraries have interpolation as a built-in function, but which of these libraries would be best for the task described above? What parts of their documentation should I read up on to help with this specific problem? Are there any good resource recommendations for this topic? And would anything else be required, apart from interpolation?
Thanks in advance!
P.S. This is a side project I'd like to work on as a student, not commercial in any way shape or form.
I like the scipy.interpolate library. It has a lot of nice functions; the simplest that would work for you is probably scipy.interpolate.interp2d(), and if you want to go with an irregular distribution of sensors, griddata() is very useful.
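For example, here is a minimal sketch of the griddata() approach on made-up sensor data, interpolated onto a regular grid and plotted as a gradient map.

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

rng = np.random.default_rng(1)
sensor_xy = rng.uniform(0, 10, size=(25, 2))   # 25 scattered sensors
sensor_temp = 20 + 2 * np.sin(sensor_xy[:, 0]) + rng.normal(0, 0.1, 25)

# Regular grid covering the floor plan
gx, gy = np.mgrid[0:10:200j, 0:10:200j]
grid_temp = griddata(sensor_xy, sensor_temp, (gx, gy), method='cubic')

plt.imshow(grid_temp.T, extent=(0, 10, 0, 10), origin='lower', cmap='coolwarm')
plt.colorbar(label='Temperature')
plt.scatter(sensor_xy[:, 0], sensor_xy[:, 1], c='k', s=10)  # sensor positions
plt.show()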
I'm trying to build and implement a regression tree algorithm on some raster data in Python, and can't seem to find the best way to do so. I will attempt to explain what I'm trying to do:
My desired output is a raster image, whose values represent lake depth, call it depth.tif. I have a series of raster images, each representing the reflectance values in different Landsat bands, say [B1.tif, B2.tif, ..., B7.tif] that I want to use as my independent variables to predict lake depth.
For my training data, I have a shapefile of ~6000 points of known lake depth. To create a tree, I extracted the corresponding reflectance values for each of those points and exported them to a table. I then used that table in Weka, a machine-learning package, to create a 600-branch regression tree that predicts depth values based on the set of reflectance values. But because the tree is so large, I can't write it out in Python manually. I ran into the python-weka-wrapper module, so I can use Weka from Python, but I have gotten stuck on the raster part. Since my data has an extra dimension (if converted to an array, each independent variable is actually a set of ncolumns x nrows values instead of just a row of values, as in all of the examples), I don't know if it can do what I want. Among the examples for python-weka-wrapper, I can't find one that deals with spatial data, and I think this is what's throwing me off.
To clarify, I want to use the training data (which is a point shapefile/table right now but can, if necessary, be converted into a raster of the same size as the reflectance rasters, with no data in all cells except the few points where I have known depth data) to build a regression tree that will use the reflectance rasters to predict depth. Then I want to apply that tree to the same set of reflectance rasters, in order to obtain a raster of predicted depth values everywhere.
I realize this is confusing and I may not be doing the best job of explaining it. I am open to options other than implementing Weka in Python, such as sklearn, as long as they are open source. My question is: can what I described be done? I'm pretty sure it can, as it's very similar to image classification, except that the target values (depth) are continuous rather than discrete classes, but so far I have failed. If so, what is the most straightforward method, and/or are there any examples that might help?
Thanks
I have had some experience using Landsat data for the prediction of environmental properties of soil, which seems somewhat related to the problem you have described above. Although I developed my own models at the time, I can describe the general process I went through in order to map the predicted data.

For the training data, I was able to extract the Landsat values (in addition to other properties) for the spatial points where known soil samples were taken. This way, I could use the Landsat data as inputs for predicting the environmental data. A part of this data was also reserved for testing, to confirm that the trained models were not overfitting the training data and predicted the outputs well.

Once this process was complete, it was possible to map the desired area by getting the spatial information at each point of the area (matching the resolution of the desired image). From there, you should be able to feed these Landsat factors into the model for prediction, and use the output to map the predicted depth. You could likely just use Weka in this case to predict all of the cases, then use another tool to build the map from your estimates.
I believe I whipped up some code long ago to extract each of my required factors in ArcGIS, but it's been a while since I did this. There should be some good tutorials out there that could help you in that direction.
I hope this helps in your particular situation.
It sounds like you are not using any spatial information to build your tree (such as information on neighboring pixels), just reflectance. So, you can apply your decision tree to the pixels as if the pixels were all in a one-dimensional list or array.

A 600-branch tree for a 6000-point training data file seems like it may be overfit. Consider putting in an option that requires the tree to stop splitting when there are fewer than N points at a node, or something similar. There may be a pruning factor that can be set as well. You can test different settings till you find the one that gives the best statistics from cross-validation or a held-out test set.
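For what it's worth, here is a minimal scikit-learn sketch of that whole pipeline, capping the leaf size as suggested above; all shapes and values below are placeholders for the real extracted band values and depths.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Training table: one row of 7 band reflectances per known-depth point
train_bands = np.random.rand(6000, 7)  # stand-in for values at the shapefile points
train_depth = np.random.rand(6000)     # stand-in for the known lake depths

# min_samples_leaf stops splitting when a node gets too small
tree = DecisionTreeRegressor(min_samples_leaf=20).fit(train_bands, train_depth)

# Full rasters: 7 bands, each nrows x ncols (e.g. read with rasterio)
nrows, ncols = 500, 400
bands = np.random.rand(7, nrows, ncols)  # stand-in for the B1..B7 arrays

# Flatten the pixels into a (npixels, 7) table, predict, reshape to a raster
pixels = bands.reshape(7, -1).T
depth_raster = tree.predict(pixels).reshape(nrows, ncols)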