Plot numpy.fft.fft2 output - python

I'm trying to get a sense of the spatial frequencies present in a series of images I want to analyze. I decided to do this with the numpy.fft.fft2 function, but apparently the output can't be plotted - can you help me figure out what's wrong?
Apparently this is happening because the output contains values like 0.+0.j, which matplotlib can't deal with. But I don't know how to change them into something it can deal with either.
Here's a minimal example with my issue.
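The original snippet isn't shown, so here is a hedged reconstruction of what the failing minimal example presumably looks like (the input array is just random data standing in for one of the images):

import numpy as np
import matplotlib.pyplot as plt

image = np.random.rand(256, 256)   # stand-in for one of the images to analyze

spectrum = np.fft.fft2(image)      # complex-valued array (entries like 0.+0.j)
plt.imshow(spectrum)               # fails: matplotlib cannot convert complex data to float
plt.show()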

Matplotlib can only handle real values. Your options are to take the real or imaginary part of the result, or its magnitude and possibly its phase. These can be obtained with numpy.real or numpy.imag, or numpy.abs and numpy.angle.
Ultimately, I guess it just depends on what you want to know about your FFT. People are usually most interested in the magnitude of FFT data, which suggests abs. This gives you an idea of the "power" in the various frequencies.
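For example, a minimal sketch of plotting the magnitude (the fftshift and log scaling are just common conventions for viewing 2-D spectra, not something from the question):

import numpy as np
import matplotlib.pyplot as plt

img = np.random.rand(256, 256)                 # stand-in for one of your images

spectrum = np.fft.fft2(img)
magnitude = np.abs(np.fft.fftshift(spectrum))  # shift the zero-frequency term to the centre

plt.imshow(np.log1p(magnitude), cmap='gray')   # log scale keeps the large DC term from washing everything out
plt.colorbar()
plt.show()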

Related

Is having a large outlier value going to be a problem for lightgbm?

I have a classification task at hand. I'm using lightgbm for that.
I have a particular feature whose histogram looks like the one below:
Basically, all values are nicely concentrated on the left, with a few values on the right.
LightGBM uses approximate splits rather than exact ones. It therefore has to build a histogram and find bin edges.
Does anyone happen to know how exactly the bin ranges are defined? I'm asking because if not enough bins are built, the whole variable could become useless, since too many values would be crammed into the lower bins. Ultimately, the question is: what are the consequences of having a few very high values in a column?
Also, this seems to be a relevant piece of code, but I'm not good enough with C++ to read and understand it quickly.
UPD: just to clarify, the visualization above is misleading; the largest number is 5.02e03, not 6265.02e03.
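There is no answer recorded here, but as a hedged sketch of how one might probe the issue: max_bin controls how many histogram bins LightGBM builds per feature, and compressing the tail with a log transform is a common workaround (the data below is synthetic and the parameter values are only guesses):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=5000)      # heavy-tailed feature, most mass on the left
y = (x > np.median(x)).astype(int)

# Option 1: give LightGBM more bins so the crowded low range gets finer resolution
dense_bins = lgb.Dataset(x.reshape(-1, 1), label=y, params={"max_bin": 1023})
booster = lgb.train({"objective": "binary", "verbose": -1}, dense_bins, num_boost_round=20)

# Option 2: compress the tail so the default number of bins is spent more evenly
log_feature = lgb.Dataset(np.log1p(x).reshape(-1, 1), label=y)
booster_log = lgb.train({"objective": "binary", "verbose": -1}, log_feature, num_boost_round=20)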

Approximate maximum of an unknown curve

I have a data set that looks like this:
I used the scipy.signal.find_peaks function to determine the peaks of the data set, and it works well enough, but since this function finds the local maxima of the data, it does not ignore the noise, which causes overshoot. As a result, what I'm determining isn't actually the location of the most likely maximum, but rather the location of an 'outlier'.
Is there another, more exact way to approximate the local maxima?
I'm not sure that you can consider those points to be outliers so easily, as they look to be close to where I would expect them to be. But if you don't think they are a valid approximation, here are three other approaches you can use.
First option
I would construct a physical model of these peaks (a mathematical formula) and do a fitting analysis around them. You can, for instance, suppose that the shape of the plot is the sum of some background model (maybe constant, maybe more complicated) plus some Gaussian (or Lorentzian) peaks.
This is what we usually do in physics. Of course it will be more accurate if you bring in knowledge of the underlying processes, which I don't have.
With a good model, this approach is definitely better than taking the maximum values, because even if those points are not outliers, they still carry errors that you want to reduce.
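A hedged sketch of this first option, on made-up data, using a single Gaussian peak on a constant background and fitting only a window around a rough peak location:

import numpy as np
from scipy.optimize import curve_fit

def peak_model(x, amplitude, centre, width, background):
    # constant background plus one Gaussian peak
    return background + amplitude * np.exp(-((x - centre) ** 2) / (2.0 * width ** 2))

# made-up noisy data with one peak near x = 5
x = np.linspace(0, 10, 500)
y = 2.0 + 3.0 * np.exp(-((x - 5.0) ** 2) / (2.0 * 0.3 ** 2)) + 0.3 * np.random.randn(x.size)

# fit only a window around the rough location (e.g. the index returned by find_peaks)
window = (x > 4) & (x < 6)
p0 = [y[window].max() - y.min(), 5.0, 0.5, y.min()]   # initial guesses
popt, pcov = curve_fit(peak_model, x[window], y[window], p0=p0)

print("fitted peak position:", popt[1], "+/-", np.sqrt(pcov[1, 1]))
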
Second option
But if you want an easier way, just a rough estimate, and you have already found the locations of the three peaks programmatically, you can average a few points around each maximum. For that, np.where or np.argwhere tend to be useful.
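A hedged sketch of this second option on synthetic data (the window size is arbitrary and would need tuning to your sampling rate):

import numpy as np
from scipy.signal import find_peaks

# made-up noisy signal with three peaks
x = np.linspace(0, 30, 3000)
y = (np.exp(-(x - 5) ** 2) + np.exp(-(x - 15) ** 2) + np.exp(-(x - 25) ** 2)
     + 0.05 * np.random.randn(x.size))

peak_indices, _ = find_peaks(y, height=0.5, distance=200)

half_window = 25   # number of samples averaged on each side of a peak
for idx in peak_indices:
    window = slice(max(idx - half_window, 0), idx + half_window + 1)
    print("peak near x =", x[idx], "averaged height =", y[window].mean())
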
Third option
The easiest option is reading the value off by hand. That may sound unacceptable for academic purposes, and it probably is. Even worse, it is not a programmatic approach, and this is SO. But in the end, it depends on why and for what you need those values, and on the confidence interval you need for your measurement.

P value for Normality test very small despite normal histogram

I've looked over the normality tests in scipy.stats, both scipy.stats.mstats.normaltest and scipy.stats.shapiro, and it looks like they both take the null hypothesis to be that the data they're given is normal.
I.e., a p-value less than 0.05 would indicate that the data is not normal.
I'm doing a regression with LassoCV in SKLearn, and in order to give myself better results I log transformed the answers, which gives a histogram that looks like this:
Looks normal to me.
However, when I run the data through either of the two tests mentioned above I get very small p values that would indicate the data is not normal, and in a big way.
This is what I get when I use scipy.stats.shapiro:
scipy.stats.shapiro(y)
Out[69]: (0.9919402003288269, 3.8889791653673456e-07)
And I get this when I run scipy.stats.mstats.normaltest:
scipy.stats.mstats.normaltest(y)
NormaltestResult(statistic=25.755128535282189, pvalue=2.5547293546709236e-06)
It seems implausible to me that my data would test out as being so far from normality with the histogram it has.
Is there something causing this discrepancy, or am I not interpreting the results correctly?
If the numbers on the vertical axis are the counts of observations in the respective classes, then the sample size is about 1500. For such a large sample size, goodness-of-fit tests are rarely useful. But is it really necessary that your data be perfectly normally distributed? If you want to analyze the data with a statistical method, is that method perhaps robust to ("small") deviations from the normality assumption?
In practice the question is usually "Is the normal distribution assumption acceptable for my statistical analysis?". A perfectly normal distribution is very rarely available.
An additional comment on histograms: one has to be careful when interpreting them, because whether the data "looks normal" may depend on the width of the histogram classes. Histograms are only hints and should be treated with caution.
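To illustrate the point about sample size, here is a small hedged sketch on synthetic data: a t-distribution with 10 degrees of freedom looks practically normal in a histogram, yet as the sample grows the test becomes increasingly likely to flag its slightly heavier tails:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Student's t with 10 degrees of freedom: visually very close to a normal distribution
for n in (150, 1500, 15000):
    sample = rng.standard_t(df=10, size=n)
    stat, p = stats.normaltest(sample)   # D'Agostino-Pearson test
    print(f"n = {n:6d}   normaltest p = {p:.3g}")   # p tends to shrink as n grows, same deviation
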
If you run this n times and take the mean of the p values, you will get what you expect. Run it in a loop in a Monte Carlo way.

Scipy zoom with complex values

I have a numpy array of values and I wanted to scale (zoom) it. With floats I was able to use scipy.ndimage.zoom, but now my array contains complex values, which are not supported by scipy.ndimage.zoom. My workaround was to split the array into its real and imaginary parts, scale them independently, and then recombine them. Unfortunately this produces a lot of tiny artifacts in my 'image'. Does somebody know a better way? Maybe there is a Python library for this? I couldn't find one.
Thank you!
This is not a good answer, but it seems to work quite well. Instead of using the default parameters for the zoom method, I'm using order=0. I then deal with the real and imaginary parts separately, as described in my question. This seems to reduce the artifacts, although some smaller ones remain. It is by no means perfect, and if somebody has a better answer I would be very interested.
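A minimal sketch of the workaround described above (the helper name and test array are made up):

import numpy as np
from scipy.ndimage import zoom

def zoom_complex(array, factor, order=0):
    # scale real and imaginary parts separately, then recombine;
    # order=0 (nearest neighbour) seemed to produce the fewest artifacts
    return zoom(array.real, factor, order=order) + 1j * zoom(array.imag, factor, order=order)

data = np.exp(1j * np.linspace(0, 2 * np.pi, 64)).reshape(8, 8)   # made-up complex array
scaled = zoom_complex(data, 4)
print(scaled.shape)   # (32, 32)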

K-means algorithm suitable?

I am writing a Python script to analyse some data captured from a device. I want to automate the task of finding out whether my data matches a certain pattern. In the image linked below, I want to determine whether the captured data can be categorized into three different clusters [as shown] using a script. The ranges of these clusters are not predefined. All I want to know is whether there are three different clusters in my data that are reasonably far apart from each other; if not, my test fails. I am just trying to figure out the best data-analysis algorithm to use here. I was reading about clustering algorithms and was going to start with k-means clustering, but does anyone have a better idea?
Link to an example set of captured data (note the color-coded clusters): http://imgur.com/I4jMqpk
The better idea is to start with a good problem statement. If you cannot strictly define what you are looking for, then no method is suitable; if you can write down exactly what you need, then you can search for a solution. Clustering methods are quite weird objects: they will always "succeed", they will always cluster your data, often in a way that is completely unacceptable to a human being. If your data looks like what you plotted (a 2D case, with points forming "dense" point clouds), then something like DBSCAN/OPTICS seems most appropriate: a fairly simple method that produces more "human-like" clusters (as opposed to k-means, which won't divide your data into those "clouds" but will often split them instead).
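A hedged sketch of the DBSCAN route with scikit-learn, using synthetic points in place of the captured data (eps and min_samples are guesses that would need tuning to the real scale and density):

import numpy as np
from sklearn.cluster import DBSCAN

# made-up stand-in for the captured 2-D points: three well-separated blobs
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(5.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(2.5, 4.0), scale=0.3, size=(100, 2)),
])

labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(points)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 marks noise points
print("clusters found:", n_clusters)
print("test passes:", n_clusters == 3)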
