Which correlation method do I use? - python

For a project, I have a dataset:
Eruptions Waiting
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62
4 4.533 85
and was instructed to turn it into a Seaborn pairplot.
I am then asked which correlation method I should use based on this graph; I am stuck between Pearson and Spearman and unsure which to pick. Based on that, I am also asked: are the durations correlated with the waiting time between eruptions? (Any correlation >= 0.7 counts as a correlation.) Please help.
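Not part of the original question: a minimal sketch for actually checking both coefficients with pandas, assuming df already holds the full Eruptions/Waiting dataset rather than only the five rows printed above.
import pandas as pd

# df is assumed to contain the full dataset with the two columns shown above
pearson = df['Eruptions'].corr(df['Waiting'], method='pearson')
spearman = df['Eruptions'].corr(df['Waiting'], method='spearman')
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
If the pairplot looks roughly linear, both coefficients will be similar; Spearman is the safer choice when the relationship is monotonic but not linear, or when there are outliers.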

Related

Numpy Vectorized Window Operations

I'm interested in figuring out how to do vectorized computations in a numpy array / pandas dataframe where each new cell is updated with local information.
For example, let's say I'm a weatherman interested in making predictions about the weather. My prediction algorithm will be the mean of the past 3 days. While this prediction is simple, I'd like to be able to do this with an arbitrary function.
Example data:
day temp
1 70
2 72
3 68
4 67
...
After a transformation should become
day temp prediction
1 70 None (no previous data)
2 72 70 (only one data point)
3 68 71 (two data points)
4 67 70
5 70 69
...
I'm only interested in the prediction column, so no need to make an attempt to join the data back together after achieving the prediction! Thanks!
Use rolling with a window of 3 and min_periods of 1, then shift the result so each prediction uses only the previous days:
df['prediction'] = df['temp'].rolling(window = 3, min_periods = 1).mean().shift()
df
day temp prediction
0 1 70 NaN
1 2 72 70
2 3 68 71
3 4 67 70
4 5 70 69
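For the arbitrary-function part of the question, rolling also accepts a user-defined function via apply. A minimal sketch (the median is just an illustrative stand-in for whatever prediction function you have in mind):
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 3, 4, 5],
                   'temp': [70, 72, 68, 67, 70]})

def my_prediction(window):
    # any function of the trailing values; here, their median
    return window.median()

# same trailing-window pattern as above, but with a custom function
df['prediction'] = (df['temp']
                    .rolling(window=3, min_periods=1)
                    .apply(my_prediction, raw=False)
                    .shift())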

Applying `pd.qcut` on multiple columns

I have a DataFrame containing 2 columns x and y that represent coordinates in a Cartesian system. I want to obtain groups with an even (or almost even) number of points. I was thinking about using pd.qcut(), but as far as I can tell it can only be applied to one column at a time.
For example, I would like to divide the whole set of points with 4 intervals in x and 4 intervals in y (numbers might not be equal) so that I would have roughly even number of points. I expect to see 16 intervals in total (4x4).
I tried a very direct approach, which obviously didn't produce the right result (look at the counts 51 and 99, for example). Here is the code:
df['x_bin']=pd.qcut(df.x,4)
df['y_bin']=pd.qcut(df.y,4)
grouped=df.groupby([df.x_bin,df.y_bin]).count()
print(grouped)
The output:
x_bin y_bin
(7.976999999999999, 7.984] (-219.17600000000002, -219.17] 51 51
(-219.17, -219.167] 60 60
(-219.167, -219.16] 64 64
(-219.16, -219.154] 99 99
(7.984, 7.986] (-219.17600000000002, -219.17] 76 76
(-219.17, -219.167] 81 81
(-219.167, -219.16] 63 63
(-219.16, -219.154] 53 53
(7.986, 7.989] (-219.17600000000002, -219.17] 78 78
(-219.17, -219.167] 77 77
(-219.167, -219.16] 68 68
(-219.16, -219.154] 51 51
(7.989, 7.993] (-219.17600000000002, -219.17] 70 70
(-219.17, -219.167] 55 55
(-219.167, -219.16] 77 77
(-219.16, -219.154] 71 71
Am I making a mistake in thinking it is possible to do with pandas only or am I missing something else?
The problem is that the distribution of the rows may not be the same along x as it is along y.
You are empirically mimicking a correlation analysis and finding out that there is a slight negative relation: the y values are higher at the lower end of the x scale and rather flat at the higher end of x.
So, if you want an even number of data points in each bin, I would suggest splitting the df into x bins and then applying qcut on y within each x bin (so the y bins get different cut points but even sample sizes).
Edit
Something like:
split_df = [(xbin, xdf) for xbin, xdf in df.groupby(pd.qcut(df.x, 4))]  # no aggregation so far, just splitting the df evenly on x
split_df = [(xbin, xdf.groupby(pd.qcut(xdf.y, 4)).x.size())
            for xbin, xdf in split_df]  # now each xdf is evenly cut on y
Now you need to work on each xdf separately. Attempting to concatenate all xdfs will result in an error. Index for xdfs is a CategoricalIndex, and the first xdf needs to have all categories for concat to work (i.e. split_df[0][1].index must include the bins of all other xdfs). Or you could change the Index to the center of the interval as a float64 on both xbins and ybins.
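A minimal sketch of that nested-binning idea on synthetic data (the random frame, the observed=True flag, and the use of integer y-bin labels are assumptions on my part, not taken from the answer above):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=1000), 'y': rng.normal(size=1000)})

# cut x into quartiles first, then cut y into quartiles within each x bin,
# so each of the 16 (x_bin, y_bin) cells holds roughly the same number of points
df['x_bin'] = pd.qcut(df['x'], 4)
df['y_bin'] = df.groupby('x_bin', observed=True)['y'].transform(
    lambda s: pd.qcut(s, 4, labels=False))  # integer labels sidestep the CategoricalIndex issue

print(df.groupby(['x_bin', 'y_bin'], observed=True).size())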

SKLearn Multi Classification without Knowing the Classifications in Advance Python

I have recently got into using SKLearn, especially classification models, and have a question more about use cases than about being stuck on any particular bit of code, so apologies in advance if this isn't the right place to ask.
So far I have been using sample data where one trains the model on data that has already been classified. In the 'Iris' data set, for example, every row is classified into one of the three species. But what if one wants to group/classify the data without knowing the classifications in the first place?
Let's take this imaginary data:
Name Feat_1 Feat_2 Feat_3 Feat_4
0 A 12 0.10 0 9734
1 B 76 0.03 1 10024
2 C 97 0.07 1 8188
3 D 32 0.21 1 6420
4 E 45 0.15 0 7723
5 F 61 0.02 1 14987
6 G 25 0.22 0 5290
7 H 49 0.30 0 7107
If one wanted to split the names into 4 separate classifications using the different features, is this possible, and which SKLearn model(s) would be needed? I'm not asking for any code; I'm quite able to research on my own if someone could point me in the right direction. So far I can only find examples where the classifications are already known.
In the example above, if I wanted to break the data down into 4 classifications, I would want my outcome to be something like this (note the new column denoting the class):
Name Feat_1 Feat_2 Feat_3 Feat_4 Class
0 A 12 0.10 0 9734 4
1 B 76 0.03 1 10024 1
2 C 97 0.07 1 8188 3
3 D 32 0.21 1 6420 3
4 E 45 0.15 0 7723 2
5 F 61 0.02 1 14987 1
6 G 25 0.22 0 5290 4
7 H 49 0.30 0 7107 4
Many thanks for any help
You can use k-means clustering, which groups the data into a chosen number of clusters; to get 4 classes you would fit the model with n_clusters=4. Alternatively, hierarchical clustering merges the data into fewer and fewer classes at each step until everything ends up in one group, and you can either stop early when the number of classes is what you want or step back through an already fitted model; for example, to get 4 classes you take the step at which the data are clustered into 4 groups.
sklearn.cluster.KMeans doc
Classification is a supervised approach, meaning that the training data comes with features and labels. If you want to group the data according to the features, then you can go for some clustering algorithms (unsupervised), such as sklearn.cluster.KMeans (with k = 4).
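Not taken from any of the answers: a minimal sketch of the KMeans suggestion applied to the question's example data. The scaling step is an extra assumption, added because the features sit on very different ranges.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Name':   list('ABCDEFGH'),
    'Feat_1': [12, 76, 97, 32, 45, 61, 25, 49],
    'Feat_2': [0.10, 0.03, 0.07, 0.21, 0.15, 0.02, 0.22, 0.30],
    'Feat_3': [0, 1, 1, 1, 0, 1, 0, 0],
    'Feat_4': [9734, 10024, 8188, 6420, 7723, 14987, 5290, 7107],
})

# scale the features, then assign each row to one of 4 clusters
X = StandardScaler().fit_transform(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])
df['Class'] = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
print(df)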
Start with an unsupervised method to determine clusters... use those clusters as your labels.
I recommend using sklearn's GMM instead of k-means.
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
K-means assumes roughly spherical clusters.
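A sketch of that suggestion, with GaussianMixture in place of KMeans (n_components=4 matches the question's four classes; X is the scaled feature matrix from the KMeans sketch above):
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=4, random_state=0)
df['Class'] = gmm.fit_predict(X)  # allows elongated (elliptical) clusters, unlike k-means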
This topic is called: unsupervised learning
One definition is:
Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs.[1] It is one of the main three categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques.
There are tons of algorithms out there; you need to try what fits your data best. Some examples are:
Hierarchical clustering (implemented in SciPy: https://en.wikipedia.org/wiki/Single-linkage_clustering)
kmeans (implemented in sklearn: https://en.wikipedia.org/wiki/K-means_clustering)
Dbscan (implemented in sklearn: https://en.wikipedia.org/wiki/DBSCAN)

How do I know whether to remove the column, or rows when dealing with null data?

Here is the head of my Dataframe. I am trying to remove the NaN values in the column "Type 2", but I am not sure how to decide whether to remove the entire column containing the NaN values, or remove the rows containing the NaN values. How should I decide which method to use to remove the NaN values? Is there a certain threshold to determine whether to remove the rows or the entire column, for datasets in general? My end goal is to run a machine learning algorithm on this dataset to predict whether or not a Pokemon is Legendary. Thank you
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
5 5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
9 7 Squirtle Water NaN 314 44 48 65 50 64 43 1 False
10 8 Wartortle Water NaN 405 59 63 80 65 80 58 1 False
15 12 Butterfree Bug Flying 395 60 45 50 90 80 70 1 False
Yes, we can decide a threshold for this.
If NaN values appear across all columns, it is usually best to use:
data.dropna(axis=0,inplace=True)
This drops all rows that contain NaNs; with axis=1 it would instead delete every column that has NaN values.
One thing you need to consider is what percentage of the values in a column are NaN: if more than 70% of a column is NaN and there is no other way to fill it, I would delete that column.
If the NaN values are spread across the columns, it is better to delete rows.
I hope this helps.
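A quick way to check that percentage (sketch; the 70% cut-off is just the rule of thumb from this answer):
# fraction of missing values per column
nan_frac = data.isna().mean()
cols_to_drop = nan_frac[nan_frac > 0.7].index
data = data.drop(columns=cols_to_drop)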
I would restrain from deleting whole rows in this scenario.
When deleting rows you would probably never have a pokemon in your dataset which has NaN as second type.
5 5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
Further along it is easy to think of a legendary Pokemon which does not have a second type; you would never be able to predict such a Pokemon correctly.
You could still delete the column, but you would lose information.
Other than deleting I'd rather introduce an undefined_type tag for those NaN values and go from there.
5 5 Charmeleon Fire undefined_type 405 58 64 58 80 65 80 1 False
Above those things you should do some feature analysis to find out which features actually do contribute to the information gain (e.g. random forest with elbow method). If the introduction of the undefined_type tag reduces the information gain of that feature, you'll know after this analysis.
In this case, I think your best bet would be to make the types categorical and have the NaN's in the type column be a category as well. This would make your machine learning model more robust.
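A minimal sketch of the "keep the rows, tag the missing type" suggestion (the file name pokemon.csv and the 'None' label are placeholders, not from the question):
import pandas as pd

df = pd.read_csv('pokemon.csv')                   # hypothetical source file
df['Type 2'] = df['Type 2'].fillna('None')        # keep the rows, mark "no second type"
df['Type 1'] = df['Type 1'].astype('category')    # make both type columns categorical,
df['Type 2'] = df['Type 2'].astype('category')    # as the last answer suggests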

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this will create a Series with its index aligned to the original df so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
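The last step from the question, the difference between each actual concentration and the bin mean, is then a plain vectorised subtraction (continuing from the df above):
df['difference'] = df['Concentration'] - df['meanconcentration']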
