How to convert data into normal distribution - python

I have a data set consisting of the number of page views over 6 months for 30k customers. It also contains the following:
Number of unique OSes used
Number of unique browsers used
Number of unique cookies used
All these numbers are taken over the same six-month period.
Now I did try a normality test using:
from scipy.stats import normaltest
k2, p = normaltest(df)
print(p)
This returns 0.0, meaning the data does not follow a normal distribution.
Now I want to know why that is. I thought that, generally, as the sample size increases we see a normal distribution in the data, and since my data has 30k rows I could not understand why it is not normally distributed.
I did try converting the values into Z-scores, but still no luck. Can I transform my data so that it follows a normal distribution? Is there a method with which I can do that?

In the area I work in we typically log-transform data that is heteroscedastic, as yours probably is. In my field (mass spectrometry), small values are far more likely than large ones, so we end up with an exponential-looking distribution.
I'm guessing your data will look like mine, in which case you will need to log-transform it to make it approximately normally distributed. I would do this so that I can apply t-tests and other statistical models.
Something like
import numpy as np
df_visits = df_visits.apply(lambda x: np.log(x))
Of course, you will also need to get rid of any zeros before you can log-transform.
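As a rough illustration of that zero-handling step (df_visits here is a made-up DataFrame of count columns, and np.log1p is just one reasonable alternative):
import numpy as np
import pandas as pd

# made-up count data: page views per customer over six months
df_visits = pd.DataFrame({"page_views": [0, 3, 17, 250, 4800]})

# option 1: drop rows containing zeros, then take the log
df_log = np.log(df_visits[(df_visits > 0).all(axis=1)])

# option 2: use log1p, i.e. log(1 + x), which is defined at zero
df_log1p = np.log1p(df_visits)
Either way, re-running normaltest on the transformed columns will show whether the transform helped.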
[Image: histograms of the data pre vs. post log transform]

Related

Check for normal distribution

With the information below, can I say that my data follows a normal distribution?
If not, is there another way to check?
PS: the data are pressure values inside oil pipelines, recorded every minute by sensors.
Try increasing the number of bins in your histogram plot.
You can also try visualising your data with a Q-Q plot, looking at other statistics such as kurtosis, or performing a formal test for normality.
Given the sample size of your data, it seems that it should produce something close to a normal distribution. However, looking at your histogram, there appear to be extreme values more than 2 standard deviations from the mean. You might want to look into those to decide whether to add a dummy variable to account for them in any future analysis.
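A minimal sketch of those checks, assuming the minute-by-minute pressure readings are in a 1-D array (the pressure array below is synthetic, purely for illustration):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# synthetic stand-in for the minute-by-minute pressure readings
pressure = np.random.normal(loc=50.0, scale=2.0, size=10000)

# histogram with more bins
plt.hist(pressure, bins=100)
plt.title("Pressure histogram")
plt.show()

# Q-Q plot against a normal distribution
stats.probplot(pressure, dist="norm", plot=plt)
plt.show()

# excess kurtosis and the D'Agostino-Pearson normality test
print("kurtosis:", stats.kurtosis(pressure))
k2, p = stats.normaltest(pressure)
print("normaltest p-value:", p)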

How to correlate partial signal with dataset

I have some large datasets of sensor values, each consisting of a single sensor value sampled at a one-minute interval, like a waveform. The total dataset spans a few years.
I wish to (using Python) select an arbitrary segment of sensor data (for instance 600 values, i.e. 10 hours' worth of data) and find all timestamps where roughly the same shape occurs in these datasets.
The matches should be made by shape (relative differences), not by actual values, as different sensors are used with different biases and environments. Also, I wish to retrieve multiple matches within a single dataset for further analysis.
I've been looking into pandas, but I'm stuck at the moment... any guru here?
I don't know much about the functionality available in pandas.
I think you first need to decide the typical time span T over which the correlation is supposed to occur. What I would do is split all your time series into (possibly overlapping) segments of duration T using NumPy (see here for instance).
This will produce a long list of segments. I would then compute the correlation between all pairs of segments using e.g. corrcoef.
You get a large correlation matrix in which you can spot pairs of similar segments by applying a threshold on the absolute value of the correlation. You can estimate a suitable threshold by applying the algorithm to a data set where you don't expect any correlation, or by randomizing your data.
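A rough NumPy sketch of that idea (the series, query, hop size, and threshold below are all made up for illustration):
import numpy as np

# made-up data: a long sensor series and a 600-sample query shape
series = np.random.randn(100000)
query = series[5000:5600]          # pretend this is the shape we are searching for
seg_len = len(query)
step = 100                         # hop between overlapping segments

# build overlapping segments of the same length as the query
starts = np.arange(0, len(series) - seg_len + 1, step)
segments = np.stack([series[s:s + seg_len] for s in starts])

# correlation of each segment with the query (corrcoef standardises, so this matches by shape)
corr = np.array([np.corrcoef(query, seg)[0, 1] for seg in segments])

# keep segments whose absolute correlation exceeds a threshold
threshold = 0.8
matches = starts[np.abs(corr) > threshold]
print("matching start indices:", matches)
Because corrcoef standardises each segment, the matching is by relative shape rather than absolute level, which is what the question asks for.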

Histogram for csv data

I have a CSV file where one of the columns stores how many followers a Twitter user has:
The CSV file has about 1,000,000 rows. I'd like to create a graph showing the distribution of follower counts across the whole data set. Since the range of follower counts is quite big (from 0 followers up to hundreds of thousands), the graph should probably be fairly coarse, e.g. each bar could represent 1000 followers or even more (so the first bar would be 0-1000, then 1000-2000, etc.). I hope I'm making myself clear.
I've tried some simple code but it gives a weird result.
import pandas as pd
df = pd.read_csv(".csv", encoding='utf8', delimiter=',')
df["user.followers_count"].hist()
Here's the result:
Does it have anything to do with the size and a large range of my data?
There is a bins argument in the hist function you have called. You just need to set it to a more reasonable value.
To understand it: if the range is 1-10000 and you set bins=10, then 1-1000 is one bin, 1000-2000 is another, and so on.
Increasing the number of bins (and thus reducing the width of each bin) will give you a smoother distribution curve and get you what you are trying to achieve with this code/dataset.
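For illustration, a rough sketch with explicit 1000-wide bins (the follower counts here are randomly generated stand-ins for the real column):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# synthetic stand-in for the "user.followers_count" column
followers = pd.Series(np.random.lognormal(mean=5, sigma=2, size=1000000).astype(int))

# explicit bin edges of width 1000, from 0 up to the maximum follower count
edges = np.arange(0, followers.max() + 1000, 1000)
followers.hist(bins=edges)
plt.xlabel("followers")
plt.ylabel("number of users")
plt.show()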
Documentation link:
https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.hist.html
Given your data, I get the following output:

Negative power spectrum

I've got a set of 780 monthly temperature anomalies over 65 years and I want to analyse it for the frequencies that are driving the anomalies. I've used the spectrum package to do this; I've included pictures of the series before and after the analysis.
from spectrum import *
import matplotlib.pyplot as plt
p = Periodogram(anomalies, sampling=1/12)
p.run()
plt.title('Power Spectrum of Monthly Temperature Anomalies')
p.plot(marker='.')
plt.show()
The resulting spectrum has several clear negative spikes. I understand that a negative value in dB isn't actually a negative absolute value, but why is this happening? Does it imply there's some specific frequency missing from my data, given that a positive spike implies one is present?
Also, why are most of the displayed values negative? What is the reference value that the dB scale is relative to?
Part of me thinks I should be taking the absolute value of this spectrum, but I'd like to understand why, if that's the case. Also, I set sampling to 1/12 because the data points are monthly, so hopefully the frequency scale is in cycles per year?
Many thanks, this is my first post here, so let me know if I need to be clearer about anything.
[Image: power spectrum showing negative energies]
[Image: the series being analysed]
As you can see in the plots, the units on the y-axis are dB (decibels, https://en.wikipedia.org/wiki/Decibel), so what you see is not the raw data (in the frequency domain) but something like 10*log10(data). This explains the presence of negative values and is perfectly normal.
Here you have positive and negative values, but in general you would normalise the data (by the maximum) so that all values are negative and the highest value is set to 0. This is possible using:
p.plot(norm=True)
You can plot the data without the log function, but then you need to work with the raw data in the frequency domain yourself. For instance, to reproduce the behaviour of the p.plot function, you can use:
from pylab import plot, log10
plot(p.frequencies(), 10*log10(p.psd/p.psd.max()))
So, if you do not want to use the decibel unit, use:
from pylab import plot
plot(p.frequencies(), p.psd)
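Putting these pieces together, a rough end-to-end sketch (the anomaly series below is synthetic, made up purely to show the calls):
import numpy as np
import matplotlib.pyplot as plt
from spectrum import Periodogram

# synthetic monthly anomaly series: a yearly cycle plus noise, 780 points as in the question
months = np.arange(780)
anomalies = 0.5 * np.sin(2 * np.pi * months / 12) + 0.2 * np.random.randn(780)

p = Periodogram(anomalies, sampling=1/12)
p.run()

# dB view, normalised so the largest peak sits at 0 dB
p.plot(marker='.', norm=True)
plt.show()

# raw (linear) view of the same spectrum
plt.plot(p.frequencies(), p.psd)
plt.xlabel('frequency')
plt.ylabel('PSD')
plt.show()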
disclaimer: I'm the main author of Spectrum (http://pyspectrum.readthedocs.io/).

Python: How to properly deal with NaN's in a pandas DataFrame for feature selection in scikit-learn

This is related to a question I posted here but this one is more specific and simpler.
I have a pandas DataFrame whose index is unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited / not relevant). The matrix is mostly NaN: there are several hundred events and most users were invited to only a few tens of them at most.
I created some extra columns to measure "success", which I define simply as the percentage attended relative to invites:
my_data['invited'] = my_data.count(axis=1)                    # number of non-NaN entries, i.e. invitations, per user
my_data['attended'] = my_data.sum(axis=1)-my_data['invited']  # the row sum now includes 'invited', so subtract it back out
my_data['success'] = my_data['attended']/my_data['invited']
My goal right now is to do feature selection on the events/columns, starting with the most basic variance-based method: remove those with low variance. Then I would look at a linear regression on the events and keep only those with large coefficients and small p-values.
But my problem is that I have so many NaN's and I'm not sure how to deal with them correctly, as most scikit-learn methods give me errors because of them. One idea is to replace 'didn't attend' with -1 and 'not invited' with 0, but I'm worried this will alter the significance of the events.
Can anyone suggest the proper way to deal with all these NaN's without altering the statistical significance of each feature/event?
Edit: I'd like to add that I'm happy to change my metric for "success" from the above if there is a reasonable one which will allow me to move forward with feature selection. I am just trying to determine which events are effective in capturing user interest. It's pretty open-ended and this is mostly an exercise to practice feature selection.
Thank you!
If I understand correctly, you would like to clean your data of NaN's without significantly altering its statistical properties, so that you can run some analysis afterwards.
I actually came across something similar recently. One simple approach you might be interested in is sklearn's Imputer. As EdChum mentioned earlier, one idea is to replace missing values with the mean along an axis. Other options include replacing them with the median, for example.
Something like:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
cleaned_data = imp.fit_transform(original_data)
In this case, this will replace the NaN's with the mean along the chosen axis (for example, let's impute by event, so axis=1). You could then round the cleaned data to make sure you get 0's and 1's.
I would plot a few histograms of the data, by event, to sanity-check whether this preprocessing significantly changes your distribution, as we may be introducing too much bias by swapping so many values for the mean / mode / median along each axis. A rough sketch of these steps follows the reference link below.
Link for reference: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
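A rough sketch of the impute-then-round workflow plus the histogram sanity check (the DataFrame below is randomly generated purely for illustration, and SimpleImputer is used in place of the deprecated Imputer, as noted at the end of this answer):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer   # modern replacement for the deprecated Imputer

# made-up attendance matrix: rows = users, columns = events, NaN = not invited
original_data = pd.DataFrame(
    np.random.choice([0.0, 1.0, np.nan], size=(200, 5), p=[0.3, 0.2, 0.5]),
    columns=["event_%d" % i for i in range(5)],
)

# impute each event's column mean, then round back to 0/1
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
cleaned = pd.DataFrame(imp.fit_transform(original_data), columns=original_data.columns)
cleaned_rounded = cleaned.round().astype(int)

# sanity check: compare each event's distribution before and after imputation
for col in original_data.columns:
    plt.hist([original_data[col].dropna(), cleaned_rounded[col]], label=['original', 'imputed'])
    plt.title(col)
    plt.legend()
    plt.show()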
Taking things one step further (assuming the above is not sufficient), you could alternatively do the following:
Take each event column in your data and calculate the probability of attending ('p') vs. not attending ('1 - p'), after dropping all NaN values [i.e. p = attending / (attending + not attending)].
Then replace the NaN values in each event column with random numbers drawn from a Bernoulli distribution fitted with the 'p' you estimated, roughly something like:
import numpy as np
n = 1     # number of trials (a binomial with n=1 is a Bernoulli draw)
p = 0.25  # estimated probability of attending (i.e. replace with what you get for attended / total)
s = np.random.binomial(n, p, 1000)
# s now contains a bunch of random 1's and 0's you can use to replace the NaN values in each column
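Putting the two steps together per event column, a minimal sketch (my_data below is a made-up stand-in for the attendance matrix described in the question):
import numpy as np
import pandas as pd

# made-up attendance matrix: rows = users, columns = events, NaN = not invited
my_data = pd.DataFrame(np.random.choice([0.0, 1.0, np.nan], size=(200, 5)))

filled = my_data.copy()
for col in filled.columns:
    p = filled[col].dropna().mean()          # attended / (attended + not attended) for this event
    mask = filled[col].isna()
    # one Bernoulli draw (binomial with n=1) per missing entry, using this event's p
    filled.loc[mask, col] = np.random.binomial(1, p, mask.sum())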
Again, this in itself is not perfect, as you are still going to end up slightly biasing your data (e.g. a more accurate approach would account for dependencies across events for each user) - but by sampling from a roughly matching distribution this should at least be more robust than arbitrarily replacing values with the mean, etc.
Hope this helps!
The Imputer example above is deprecated; please use:
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
