I have a number of time series of angular data. These values are not vectors (they have no magnitude), just angles. I need to determine how correlated the various time series are with each other (e.g., I would like to obtain a correlation matrix) over the duration of the data. For example, some are measured very close to each other and I expect them to be highly correlated, but I'm also interested in seeing how correlated the more distant measurements are.
How would I go about adapting this angular data in order to obtain a correlation matrix? I thought about just vectorizing it (i.e., converting to unit vectors), but then I'm not sure how to do the correlation analysis with this two-dimensional data, as I've only done it with one-dimensional data previously. Of course, I can't simply analyze the correlation of the angles themselves, due to the nature of angular data (the wrap-around at 0/360 degrees).
I'm working in Python, so if anyone has any recommendations on relevant packages I would appreciate it.
I have found a solution in the Astropy Python package. The following function is suitable for circular correlation:
https://docs.astropy.org/en/stable/api/astropy.stats.circcorrcoef.html
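For example, a minimal sketch of building a pairwise circular correlation matrix with it (the three series below are just toy stand-ins for the real measurements, converted from degrees to radians):

import numpy as np
from astropy.stats import circcorrcoef

# toy data: three angular time series in degrees, standing in for the real measurements
rng = np.random.default_rng(0)
base = rng.uniform(0, 360, size=500)
series_deg = np.vstack([
    base,
    (base + rng.normal(0, 15, size=500)) % 360,  # noisy copy of the first -> should correlate strongly
    rng.uniform(0, 360, size=500),               # unrelated series
])
series_rad = np.deg2rad(series_deg)              # circcorrcoef works on angles in radians

# pairwise circular correlation matrix
n = len(series_rad)
corr = np.array([[circcorrcoef(series_rad[i], series_rad[j]) for j in range(n)]
                 for i in range(n)])
print(corr)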
Related
With Python I want to compare a simulated light curve with a real, measured light curve. It should be mentioned that the measured data contain gaps and outliers and the time steps are not constant. The model, however, has constant time steps.
As a first step, I would like to compare with a statistical method how similar the two light curves are. Which method is best suited for this?
As a second step, I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in independent software. Basically, the model depends on four parameters, all of which are limited to a certain range, which I am currently feeding manually to the software (automating this is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that might be better answered on https://datascience.stackexchange.com/, rather than something specific to Python.
That said, speaking as a data science layperson, this may be a problem suited to gradient descent with a mean-square-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the squared error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
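For illustration, a minimal sketch of that procedure; model(t, params) is a hypothetical stand-in for however the external simulation is invoked, and the learning rate and step size are arbitrary:

import numpy as np

def mse(params, t_obs, y_obs, model):
    # mean-square error between the observations and the model evaluated at the observed times
    return np.mean((model(t_obs, params) - y_obs) ** 2)

def finite_difference_descent(params, t_obs, y_obs, model, lr=1e-3, h=1e-4, n_iter=1000):
    # gradient descent with numerically estimated gradients (forward differences)
    params = np.asarray(params, dtype=float)
    for _ in range(n_iter):
        base = mse(params, t_obs, y_obs, model)
        grad = np.zeros_like(params)
        for i in range(len(params)):
            shifted = params.copy()
            shifted[i] += h
            grad[i] = (mse(shifted, t_obs, y_obs, model) - base) / h
        params -= lr * grad   # step downhill
    return params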
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result, meaning the phase and period match, the magnitude is of the same order, and even the specular flashes should occur at the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just the phase/period/amplitude of each? In that case, what you're looking for is the Fourier transform of your signal, which is very easy to calculate with NumPy/SciPy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
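For example, assuming a uniformly sampled curve, something along these lines recovers the dominant period, amplitude, and phase (the signal here is synthetic):

import numpy as np

# synthetic light curve sampled at constant time steps dt
dt = 0.01
t = np.arange(0, 10, dt)
flux = 2.0 * np.sin(2 * np.pi * 1.3 * t + 0.4) + 0.5 * np.sin(2 * np.pi * 4.0 * t)

spectrum = np.fft.rfft(flux)
freqs = np.fft.rfftfreq(len(flux), d=dt)

peak = np.argmax(np.abs(spectrum[1:])) + 1        # skip the zero-frequency (mean) term
period = 1.0 / freqs[peak]                        # dominant period
amplitude = 2.0 * np.abs(spectrum[peak]) / len(flux)
phase = np.angle(spectrum[peak])                  # phase of the complex Fourier coefficient
print(period, amplitude, phase)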
I am modeling electrical current through various structures with the help of FiPy. To do so, I solve Laplace's equation for the electrical potential. Then I derive the field from the potential and, with the help of the conductivity (Ohm's law), obtain the current density.
FiPy stores the potential as a cell-centered variable and its gradient as a face-centered variable which makes sense to me. I have two questions concerning face-centered variables:
If I have a two- or three-dimensional problem, FiPy computes the gradient in all directions (ddx, ddy, ddz). The gradient is a FaceVariable, which is always defined on the face between two cell centers. For a structured (quadrilateral) grid, only one of the derivatives should be nonzero, since for any face the positions of the two cell centers involved should differ in only one coordinate. In my simulations, however, it frequently happens that more than one of the derivatives (ddx, ddy, ddz) is nonzero, even for a structured grid.
The manual gives the following explanation for the faceGrad method:
Return gradient(phi) as a rank-1 FaceVariable using differencing for the normal direction (second-order gradient).
I do not see how this differs from my understanding outlined above.
What makes it even more problematic: whenever "too many" derivatives are included, current does not seem to be conserved, even in the simplest structures I model...
Is there a clever way to access the data stored in the face-centered variable? Let's say I want to compute the electrical current flowing through my modeled structure.
Right now I save the data stored in the FaceVariable to a TSV file. This yields a table with (x, y, z) positions and (ddx, ddy, ddz) values. I then read the file and load the data into arrays to use in Python. This feels roundabout and really inconvenient. It would be much better to be able to access the FaceVariable along certain planes or at certain points.
The documentation does not make it clear, but .faceGrad includes tangential components which account for more than just the neighboring cell center values.
Please see this Jupyter notebook for explicit expressions for the different types of gradients that FiPy can calculate (yes, this stuff should go into the documentation: #560).
The value is accessible with myFaceVar.value and the coordinates with myFaceVar.mesh.faceCenters. FiPy is designed around unstructured meshes and so taking arbitrary slices is not trivial. CellVariable objects support interpolation by calling myCellVar((xs, ys, zs)), but FaceVariable objects do not. See this discussion.
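A minimal sketch of that access pattern (the grid, the field, and the plane x = 5 are purely illustrative):

import numpy as np
from fipy import Grid2D, CellVariable

# illustrative structured 2-D grid and potential field
mesh = Grid2D(nx=10, ny=10, dx=1.0, dy=1.0)
xc, yc = mesh.cellCenters.value                 # cell-centre coordinates as plain arrays
phi = CellVariable(mesh=mesh, value=xc + 0.5 * yc)

grad = phi.faceGrad                             # rank-1 FaceVariable: (d/dx, d/dy) on every face
values = grad.value                             # numpy array of shape (2, numberOfFaces)
fx, fy = mesh.faceCenters.value                 # coordinates of the face centres

# e.g. inspect the gradient on the faces lying in the plane x = 5
mask = np.isclose(fx, 5.0)
print(values[:, mask])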
Recently I've been trying to figure out how to calculate the entropy of a random variable X using scipy.stats.entropy() from SciPy's stats package, with this random variable X being the returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the function requires the probability values pk as an argument, and so far I'm struggling even to compute the actual empirical probabilities, seeing as I only have the observations of the random variable. I've tried different ways of normalising the data in order to obtain an array of probabilities, but my data contains negative values too, which means that when I try to do
asset1/np.sum(asset1)
where asset1 is the row array of the returns of the stock of "Company 1", I obtain a new array that adds up to 1 but contains some negative values, and as we all know, negative probabilities do not exist. Therefore, is there any way of computing the empirical probabilities of my observations (ideally with the option of choosing specific bins, or a range of values) in Python?
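Something along these lines (binning the returns with np.histogram and normalising the counts to get pk) is presumably what I'm after, though I'm not sure it is statistically the right approach:

import numpy as np
from scipy import stats

# placeholder random data standing in for the actual returns of "Company 1"
asset1 = np.random.normal(0.0, 0.02, size=4000)

# empirical probabilities from a histogram: counts per bin, normalised to sum to 1
counts, bin_edges = np.histogram(asset1, bins=50)   # or bins=np.linspace(lo, hi, n) for custom bins
pk = counts / counts.sum()

H = stats.entropy(pk)    # Shannon entropy in nats; pass base=2 for bits
print(H)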
Furthermore, I've spent countless hours looking for a Python package solely dedicated to calculating random-variable entropies, joint entropies, mutual information, etc. as an alternative to SciPy's entropy function (simply to compare), but most seem to be outdated (I currently have Python 3.5). Does anyone know of a good package compatible with my current version of Python? I know R seems to have a very compact one.
Any kind of help would be highly appreciated. Thank you very much in advance!
EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to the stock prices which are processes. Therefore, the entropy can definitely be applied in this context.
For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest-neighbour estimator for the entropy (K & L 1987) and the corresponding Kraskov, ..., Grassberger (2004) estimator for mutual information. These circumvent the intermediate step of calculating the probability density function, and estimate the entropy directly from the distances of the data points to their k-th nearest neighbours.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of the nearest-neighbour distance, one takes the k-th nearest-neighbour distance, which makes the estimate more robust.
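For concreteness, a minimal sketch of the estimator using SciPy (this is not the implementation linked below, and duplicate samples would need extra care since log(0) diverges):

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gamma

def kl_entropy(x, k=3):
    # Kozachenko-Leonenko k-nearest-neighbour entropy estimate (in nats) for samples x of shape (n, d)
    x = np.atleast_2d(x)
    if x.shape[0] == 1:                 # treat 1-D input as n samples of a scalar variable
        x = x.T
    n, d = x.shape
    tree = cKDTree(x)
    dist, _ = tree.query(x, k=k + 1)    # the query returns the point itself first
    eps = dist[:, -1]                   # distance to the k-th nearest neighbour
    log_vd = (d / 2.0) * np.log(np.pi) - np.log(gamma(d / 2.0 + 1))   # log volume of the d-dim unit ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))

print(kl_entropy(np.random.normal(size=1000)))   # should be close to 0.5*log(2*pi*e) ~ 1.42 nats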
I have implementations for both on my github:
https://github.com/paulbrodersen/entropy_estimators
The code has only been tested using python 2.7, but I would be surprised if it doesn't run on 3.x.
Background
I am trying to estimate the potential energy supply within a geographical area using spatially explicit data. For this purpose, I built a Bayesian network (HydeNet package) and attached it to a raster stack in R. The Bayesian network model reads the input data (e.g. resource supply, conversion efficiency) of each cell location from the raster stack and computes the corresponding energy supply (MCMC simulations). As a result I obtain a new raster layer with a specific probability distribution of the expected energy supply for each raster cell.
However, I am equally interested in the total energy supply within the study area. That means I need to aggregate (sum) the potential supply of all the raster cells in order to get the overall supply potential within the area.
Research
The mathematical operation I want to perform is called convolution. R provides a corresponding function called convolve that makes use of the Fast Fourier Transform.
The examples I found so far (e.g. example 1, 2) were limited to the addition of two distributions at a time. However, I would like to sum up multiple distributions (thousands, millions).
Question
How can I sum up (convolve) multiple probability distributions?
I have up to 18,000,000 probability distributions, so computational efficiency will certainly be a big issue.
Further, I am mainly interested in a solution in R, but other solutions (notably Python) are appreciated too.
I don't know whether convolving multiple distributions at once would result in a speed increase; wouldn't something like a123 = convolve(a1, a2, a3) behind the scenes simplify to a12 = convolve(a1, a2); a123 = convolve(a12, a3)? Regardless, in R you could try the foreach package and run all the convolutions in parallel. On a quad core that would (theoretically) speed up the calculations by a factor of 4. If you really want more speed, you could try the OpenCL package to see if you can run these calculations in parallel on a GPU, but programming-wise this is not easy to get into. If I were you, I would focus on these kinds of solutions rather than trying to speed up the functions that do the convolutions.
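Since Python solutions are welcome too, here is a minimal sketch of doing all the convolutions in a single pass via the convolution theorem; it assumes every distribution is tabulated as a discrete pmf on the same grid spacing:

import numpy as np

def convolve_many(pmfs):
    # distribution of the sum of independent variables whose pmfs share a common grid spacing
    out_len = sum(len(p) for p in pmfs) - len(pmfs) + 1   # support length of the full convolution
    f = np.ones(out_len // 2 + 1, dtype=complex)
    for p in pmfs:
        f *= np.fft.rfft(p, n=out_len)   # multiply the transforms instead of convolving one by one
    result = np.fft.irfft(f, n=out_len)
    result[result < 0] = 0.0             # clip tiny negative values caused by round-off
    return result / result.sum()         # renormalise

# example: sum of 1,000 small discrete distributions
rng = np.random.default_rng(0)
pmfs = [rng.dirichlet(np.ones(20)) for _ in range(1000)]
total = convolve_many(pmfs)
print(total.shape)   # (19001,)

With millions of distributions the running product of transforms can underflow, so convolving in chunks (and then convolving the chunk results) may be necessary.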
I am working with large datasets of protein-protein similarities generated in NCBI BLAST. I have stored the results in large pairwise matrices (25,000 x 25,000) and I am using multidimensional scaling (MDS) to visualize the data. These matrices were too large to work with in RAM, so I stored them on disk in HDF5 format and access them with the h5py module.
The sklearn manifold MDS method generated great visualizations for small-scale data in 3D, so that is the one I am currently using. For the calculation, it requires a complete symmetric pairwise dissimilarity matrix. However, with large datasets, a sort of "crust" forms that obscures the clusters that have formed.
I think the problem is that I am required to input a complete dissimilarity matrix. Some proteins are not related to each other, but in the pairwise dissimilarity matrix, I am forced to input a default max value of dissimilarity. In the documentation of sklearn MDS, it says that a value of 0 is considered a missing value, but inputting 0 where I want missing values does not seem to work.
Is there any way of inputting an incomplete dissimilarity matrix so unrelated proteins don't have to be inputted? Or is there a better/faster way to visualize the data in a pairwise dissimilarity matrix?
MDS requires a full dissimilarity matrix AFAIK. However, I think it is probably not the best tool for what you plan to achieve. Assuming that your dissimilarity matrix is metric (which need not be the case), it surely can be embedded in 25,000 dimensions, but "crushing" that to 3D will "compress" the data points together too much. That results in the "crust" you'd like to peel away.
I would rather run a hierarchical clustering algorithm on the dissimilarity matrix, then sort the leaves (i.e. the proteins) so that the similar ones are kept together, and then visualize the dissimilarity matrix with rows and columns permuted according to the ordering generated by the clustering. Assuming short distances are colored yellow and long distances are blue (think of the color blind! :-) ), this should result in a matrix with big yellow rectangles along the diagonal where the similar proteins cluster together.
You would have to downsample the image or buy a 25,000 x 25,000 screen :-) but I assume you want to have an "overall" low-resolution view anyway.
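A minimal sketch of that approach with SciPy and matplotlib (a toy-sized matrix stands in for the real 25,000 x 25,000 one):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# toy symmetric dissimilarity matrix with a zero diagonal
rng = np.random.default_rng(0)
D = rng.random((200, 200))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

# hierarchical clustering on the condensed form of the matrix
Z = linkage(squareform(D, checks=False), method="average")
order = leaves_list(Z)                      # leaf ordering that keeps similar items together

# permute rows and columns and plot; viridis_r maps short distances to yellow, long ones to blue
plt.imshow(D[np.ix_(order, order)], cmap="viridis_r")
plt.colorbar(label="dissimilarity")
plt.show()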
There are many algorithms that go under the name nonlinear dimensionality reduction. You can find a long list of them on Wikipedia; most were developed in recent years. If PCA doesn't work well for your data, I would try CCA or t-SNE. The latter is especially good at showing cluster structure.
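For example, scikit-learn's t-SNE accepts a precomputed dissimilarity matrix directly (toy-sized matrix here; at 25,000 points this gets expensive and may need subsampling):

import numpy as np
from sklearn.manifold import TSNE

# toy dissimilarity matrix standing in for the protein matrix
rng = np.random.default_rng(0)
D = rng.random((200, 200))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

emb = TSNE(n_components=2, metric="precomputed", init="random").fit_transform(D)
print(emb.shape)   # (200, 2)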