Suppose you have a record of a distribution for each day in some period, for example a distribution that depends on a parameter which evolves over time. Suppose we have dozens or hundreds of days. How would you visualize the change in this distribution over time? I want the plot to contain as much information as possible.
One way I can think of: estimate the density function for each day and plot the evolution of these curves in 2D. It would be some form of homotopy: the initial distribution converging to the final one through smooth intermediate steps. Of course, here I assume smoothness.
Thanks for your help.
The question is theoretical in nature, but I am aiming for a Python realization, so implementations or suggestions for libraries are also welcome.
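For reference, here is a minimal sketch of one such visualization: estimate a kernel density per day and stack the densities into a 2D image, with time on one axis and the variable's value on the other. The data below is synthetic (a normal distribution with a drifting mean); scipy's gaussian_kde and matplotlib are the only assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n_days = 100
grid = np.linspace(-5, 10, 300)

# Synthetic example: a normal distribution whose mean drifts over the period
samples = [rng.normal(loc=5 * d / n_days, scale=1.0, size=500) for d in range(n_days)]

# One KDE per day, all evaluated on a common grid
densities = np.array([gaussian_kde(s)(grid) for s in samples])

plt.imshow(densities.T, aspect='auto', origin='lower', cmap='viridis',
           extent=(0, n_days, grid[0], grid[-1]))
plt.xlabel('day')
plt.ylabel('value')
plt.colorbar(label='density')
plt.show()
```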
I have a dataset generator for machine learning purposes which, unfortunately, produces a distribution like the one shown in the attached image. Note that I can't really change the data generation process very much.
This dataset is heavily skewed towards the centre. I would like to fit this dataset to a more uniform distribution by removing some of those central data points. I am a beginner in programming and data science, so I am unaware of any methods for doing this.
If anyone can point me to a library or function I can utilise to achieve this, it would be much appreciated. I am using Python 3 and my data is stored in a CSV.
Thanks!
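For reference, a minimal sketch of the bin-and-downsample idea described above: bin the values and randomly downsample every bin to the size of the smallest non-empty bin. The CSV path and the column name value are placeholders.

```python
import pandas as pd

df = pd.read_csv('data.csv')  # placeholder path; 'value' is a hypothetical column

# Assign each row to one of 20 equal-width bins over the value range
df['bin'] = pd.cut(df['value'], bins=20)

# Downsample every bin to the size of the smallest non-empty bin
counts = df.groupby('bin', observed=True).size()
min_count = counts.min()
balanced = (df.groupby('bin', observed=True)
              .sample(n=min_count, random_state=0)
              .drop(columns='bin'))

balanced.to_csv('balanced.csv', index=False)
```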
My goal is to build a temperature gradient map over a floor plan to display minute changes in temperature via uniformly distributed sensors.
As far as I understand, most heatmap tools available work with point density to produce heatmaps, whereas what I'm looking for is a gradient based on the varying values of individual points (sensors) on the map. I.e. something like this...
which I nicked from here.
From what I've gathered, interpolation will definitely be involved, and it may well be Radial Basis Function interpolation, because it doesn't require a grid, as per this post.
I've used the Anaconda distribution thus far. The sensor data will be extracted from TimescaleDB, and the sensor positions will be lat/long coordinates.
I've done very minor experimentation with the code from the link above and got this result (a Radial Basis Function interpolation).
So here are my questions:
- Multiple Python libraries have interpolation as a built-in function, but which of them would be best for the task described above?
- Which parts of those libraries' documentation should I read up on for this specific problem?
- Any good resource recommendations for this topic?
- Would anything else be required for this apart from interpolation?
Thanks in advance!
P.S. This is a side project I'd like to work on as a student, not commercial in any way, shape or form.
I like the scipy.interpolate library. It has a lot of nice functions; the simplest that would work for you is probably scipy.interpolate.interp2d(), and if your sensors are scattered rather than laid out on a regular grid, griddata() is very useful.
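For illustration, a minimal griddata() sketch under made-up assumptions: five hypothetical sensor positions in unit-square coordinates with made-up temperature readings. Real lat/long positions pulled from TimescaleDB would slot in the same way.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

# Hypothetical sensor positions (x, y) and temperature readings
sensor_xy = np.array([[0.1, 0.2], [0.8, 0.3], [0.5, 0.9], [0.2, 0.7], [0.9, 0.8]])
temps = np.array([21.5, 22.1, 20.8, 21.0, 22.4])

# Regular grid covering the floor plan
grid_x, grid_y = np.mgrid[0:1:200j, 0:1:200j]

# Interpolate the scattered sensor values onto the grid;
# cells outside the sensors' convex hull come back as NaN
grid_t = griddata(sensor_xy, temps, (grid_x, grid_y), method='cubic')

plt.imshow(grid_t.T, extent=(0, 1, 0, 1), origin='lower', cmap='coolwarm')
plt.colorbar(label='temperature')
plt.scatter(sensor_xy[:, 0], sensor_xy[:, 1], c='k', s=10)  # sensor locations
plt.title('Interpolated temperature gradient')
plt.show()
```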
I have sorted my data with pandas so that I have this dataframe (I work with Anaconda and Jupyter Notebook):
I plotted a histogram with the column "écart G-D" (left-right gap) on the abscissa and "probabilité" (probability) on the ordinate.
I found a topic on Stack Overflow that deals with exactly what I want to do, except that it is 7 years old and the code is obsolete! I still tried it while correcting some things, but it does not work (besides, I do not even understand the code)...
Here is the link of the topic:
Fitting empirical distribution to theoretical ones with Scipy (Python)?
I would like to graphically test which probability density function best follows the shape of my histogram.
If anyone could enlighten me, it would be great because I'm really in a bind ...
Thank you.
You can fit your data manually by calculating the parameters of a distribution (mean, lambda, etc.) and using scipy to generate that distribution. Also, if your main objective is just to fit the data to a distribution and then use that distribution later, you can use other software (Stat::Fit) to automatically find the best fit for your data and plot it over the histogram.
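A minimal sketch of the manual route with scipy. The normal distribution and the synthetic sample are stand-ins; any scipy.stats distribution with a .fit() method works the same way.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic sample standing in for the "écart G-D" column
data = np.random.normal(loc=0.0, scale=1.0, size=1000)

# scipy estimates the distribution's parameters from the data
params = stats.norm.fit(data)  # (mean, std) for the normal distribution

# Overlay the fitted density on the histogram
x = np.linspace(data.min(), data.max(), 200)
plt.hist(data, bins=30, density=True, alpha=0.5, label='data')
plt.plot(x, stats.norm.pdf(x, *params), 'r-', label='fitted normal')
plt.legend()
plt.show()
```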
You can use the distfit library in Python. It will determine the best-fitting theoretical distribution for your data.
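Typical usage looks roughly like this (the sample is synthetic; see the distfit documentation for the full API):

```python
# pip install distfit
import numpy as np
from distfit import distfit

X = np.random.normal(0, 1, 1000)  # placeholder for your column

dfit = distfit()       # searches over a set of candidate distributions
dfit.fit_transform(X)  # fits each candidate and ranks them by goodness of fit
print(dfit.model)      # best distribution and its parameters
dfit.plot()            # histogram with the best fit overlaid
```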
With Python I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers, and the time steps are not constant. The model, however, uses constant time steps.
As a first step, I would like to quantify, with a statistical method, how similar the two light curves are. Which method is best suited for this?
In a second step, I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in separate software. Basically, the model data depends on four parameters, all of which are limited to a certain range, and which I am currently feeding manually to the software (automation is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that might be better answered on https://datascience.stackexchange.com/, rather than something specific to Python.
That said, speaking as a data science layperson: this may be a problem suited to gradient descent with a mean-squared-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the squared error at your known points.
Then you make a tiny change to each parameter in turn and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
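A minimal sketch of that loop, using finite differences since the gradient of an external black-box model isn't available analytically. The sine model below is just a stand-in for the OP's simulation software, and scipy.optimize.minimize would do the same job with less code:

```python
import numpy as np

# Stand-in for the external simulation: four bounded parameters in, curve out
def simulate(params, t):
    a, b, c, d = params
    return a * np.sin(b * t + c) + d

def mse(params, t, observed):
    return np.mean((simulate(params, t) - observed) ** 2)

def fit(t, observed, params, lr=1e-3, eps=1e-5, steps=5000):
    params = np.asarray(params, dtype=float)
    for _ in range(steps):
        base = mse(params, t, observed)
        grad = np.zeros_like(params)
        for i in range(len(params)):  # nudge each parameter in turn
            bumped = params.copy()
            bumped[i] += eps
            grad[i] = (mse(bumped, t, observed) - base) / eps
        params -= lr * grad  # step in the direction that lowers the cost
    return params

t = np.linspace(0, 10, 200)
observed = simulate([2.0, 1.5, 0.3, 0.1], t) + np.random.normal(0, 0.1, 200)
print(fit(t, observed, [1.8, 1.4, 0.0, 0.0]))  # start near the truth to dodge local minima
```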
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
Edit: I overlooked this part
"The simulation is not good, but by playing with the parameters one can get an acceptable result, meaning the phase and period match, the magnitude is of the same order, and even the specular flashes occur at the same period."
Is the simulated curve just a sum of sine waves, and are the parameters just the phase/period/amplitude of each? In that case, what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
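A minimal numpy sketch with synthetic, evenly sampled data. (For the gappy, unevenly sampled measurements, a Lomb-Scargle periodogram, e.g. scipy.signal.lombscargle, would be the analogous tool.)

```python
import numpy as np

# Synthetic, evenly sampled signal made of two sine components
t = np.arange(0, 10, 0.01)  # 1000 samples, dt = 0.01 s
signal = 2.0 * np.sin(2 * np.pi * 1.5 * t) + 0.5 * np.sin(2 * np.pi * 4.0 * t)

freqs = np.fft.rfftfreq(len(t), d=0.01)          # frequency axis in Hz
amps = 2 * np.abs(np.fft.rfft(signal)) / len(t)  # amplitude of each component

# The dominant peaks recover the frequencies and amplitudes of the sines
top = np.argsort(amps)[-2:]
print(freqs[top], amps[top])  # approx. [4.0, 1.5] Hz, [0.5, 2.0]
```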
I am studying the correlation between a set of input variables and a response variable, price. These are all time series.
1) Is it necessary to smooth out the curve where an input variable is cyclical (autoregressive)? If so, how?
2) Once a correlation is established, I would like to quantify exactly how the input variable affects the response variable.
Eg: "Once X increases >10% then there is an 2% increase in y 6 months later."
Which python libraries should I be looking at to implement this - in particular to figure out the lag time between two correlated occurrences?
I have already looked at statsmodels.tsa.ARMA, but it seems to deal with predicting only one variable over time. In scipy, the covariance matrix can tell me about the correlation, but it does not help with figuring out the lag time.
While part of the question is more statistics-based, the bit about how to do it in Python seems at home here. I see from your question on Cross Validated that you've since decided to do this in R, but in case you decide to move back to Python, or for the benefit of anyone else finding this question:
I think you were in the right area looking at statsmodels.tsa, but there's a lot more to it than just the ARMA package:
http://statsmodels.sourceforge.net/devel/tsa.html
In particular, have a look at statsmodels.tsa.vector_ar for modelling multivariate time series. The documentation for it is available here:
http://statsmodels.sourceforge.net/devel/vector_ar.html
The page above specifies that it's for working with stationary time series, which I presume means removing both the trend and any seasonality or periodicity. The following link is ultimately about readying a model for forecasting, but it discusses the Box-Jenkins approach for building a model, including making the series stationary:
http://www.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture16_TS3.pdf
You'll notice that link discusses looking for autocorrelations (ACF) and partial autocorrelations (PACF), and then using the Augmented Dickey-Fuller test to test whether the series is now stationary. Tools for all three can be found in statsmodels.tsa.stattools. Likewise, statsmodels.tsa.arma_process has ACF and PACF.
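As a small illustration of those stattools pieces (the series is synthetic; a real analysis would difference the actual input and price series):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, acf, pacf

# Synthetic random walk standing in for a trending series
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=300)).cumsum()

# Difference once to remove the trend, then test for stationarity
x_diff = x.diff().dropna()
adf_stat, p_value, *_ = adfuller(x_diff)
print(f'ADF p-value: {p_value:.3f}')  # a small p-value rejects the unit root

# Autocorrelations and partial autocorrelations of the stationary series
print(acf(x_diff, nlags=10))
print(pacf(x_diff, nlags=10))
```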
The above link also discusses using metrics like AIC to determine the best model; both statsmodels.tsa.var_model and statsmodels.tsa.ar_model include AIC (amongst other measures). The same measures seem to be used for calculating lag order in var_model, using select_order.
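Roughly, selecting a lag order and fitting a VAR might look like this (the two series are synthetic stand-ins for the stationarized input and price series):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Two synthetic stationary series; y roughly follows x with a 6-step lag
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.roll(x, 6) + rng.normal(scale=0.5, size=300)
df = pd.DataFrame({'x': x, 'y': y})

model = VAR(df)
print(model.select_order(maxlags=12).summary())  # AIC/BIC/HQIC per candidate lag

results = model.fit(maxlags=12, ic='aic')  # pick the lag order by AIC
print(results.summary())
```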
In addition, the pandas library is at least partially integrated into statsmodels and has a lot of time series and data analysis functionality itself, so it will probably be of interest. The time series documentation is located here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
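The links above don't show it directly, but for the specific question of finding the lag time, a quick pandas-only heuristic is to correlate the response with shifted copies of the input and look for the peak (synthetic data below):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=300))
y = x.shift(6).fillna(0) + rng.normal(scale=0.5, size=300)  # y follows x with a 6-step lag

# Correlate y with x shifted by each candidate lag; the peak suggests the lag time
lag_corr = {lag: y.corr(x.shift(lag)) for lag in range(13)}
best_lag = max(lag_corr, key=lag_corr.get)
print(best_lag, lag_corr[best_lag])  # expect a lag of about 6
```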