Calculating covariance matrix amongst different features using Pandas dataframe - python

I have a dataset in a pandas dataframe with 9 features and 249 rows. I would like to get a covariance matrix amongst the 9 features (resulting in a 9 x 9 matrix); however, when I use the df.cov() function, I only get a 3 x 3 matrix. What am I doing wrong here?
Thanks!
Below is my code snippet:
# perform data preprocessing
# only keep players with MPG of at least 20 and select the required columns
MPG_df = df.loc[df['MPG'] >= 20]
processed_df = MPG_df[["FT%", "2P%", "3P%", "PPG", "RPG", "APG", "SPG", "BPG", "TOPG"]]
processed_df
And when I attempt to get the covariance matrix using the code below, I only get a 3 x 3 matrix:
#desired result
cov_processed_df = pandas.DataFrame(processed_df, columns=['FT%', '2P%', '3P%', 'PPG', 'RPG', 'APG', 'SPG', 'BPG', 'TOPG']).cov()
cov_processed_df
Thanks!

The excluded columns are probably non-numeric (even though they look like it!). Try:
cov_processed_df = processed_df.astype(float).cov()
To see the data types of the original df, you may run:
print(processed_df.dtypes)
If you see "object" appearing in the result, then it means those columns are non-numeric. (Even if they contain at least 1 non-numeric data, they are flagged as non-numeric.)

Related

Deciding between NP-hard and NP-complete for my own set of rules

I have the Iris dataset which looks something like:
1,3,1,1,0
1,1,1,1,0
1,3,1,1,0
1,2,1,1,0
1,3,1,1,0
1,2,1,1,0
2,2,2,2,1
2,2,2,2,1
2,2,2,2,1
2,1,2,2,1
1,1,2,2,1
2,1,2,2,1
2,2,3,4,2
2,1,3,4,2
3,1,3,4,2
2,1,3,4,2
2,1,3,4,2
3,1,3,4,2
I am only showing 18 rows here, but there are 150 rows in total. The first 4 columns give the 4 attribute values and the fifth column gives the class.
So 3,1,3,4,2 means: if att_1=3, att_2=1, att_3=3 and att_4=4, then class=2.
Now I have written 2 classifier algorithms with which I tried to extract rules from this dataset.
The 1st algorithm (implemented using C and Python) gives the output as:
*,*,1,*,0
*,*,2,2,1
*,*,*,3,1
*,*,*,4,2
*,*,3,*,2
With these 5 rows above I tried to keep all the characteristics of the main dataset of 150 rows. Here * stands for "don't care", and *,*,2,2,1 simply means: if the values of attributes 3 and 4 are 2, then we don't care about the values of attributes 1 and 2, and the class will be 1.
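For concreteness, here is a minimal sketch of how such a wildcard rule could be matched against a data row; representing both as lists of strings is my own assumption:
# rule = ['*', '*', '2', '2', '1'] -> last element is the class, '*' means "don't care"
def rule_matches(rule, row):
    # True if every non-'*' attribute of the rule equals the row's attribute
    return all(r == '*' or r == v for r, v in zip(rule[:-1], row[:-1]))

rule = ['*', '*', '2', '2', '1']
row = ['2', '2', '2', '2', '1']
print(rule_matches(rule, row))  # True -> the rule predicts class '1' for this row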
The 2nd algorithm (implemented using C and Python) gives the output as:
*,*,1,*,0
*,*,2,2,1
2,*,*,3,1
*,2,2,*,1
*,*,3,4,2
*,1,*,4,2
*,*,3,2,2
3,*,*,*,2
1,*,*,3,2
Now I took the union of these 2 rule sets and got the outcome as:
*,*,2,2,1
*,*,1,*,0
*,2,2,*,1
*,*,3,2,2
*,*,3,4,2
3,*,*,*,2
*,*,*,3,1
1,*,*,3,2
*,*,*,4,2
*,*,3,*,2
*,1,*,4,2
2,*,*,3,1
Now the question arising in my mind is this: there are 12 rules in that union, but there may be only 3 or 4 most effective rules that already give a clear view of the initial Iris dataset of 150 rows. So my goal is to find the top 5 most effective rules from the union above. Basically I derived those rules from the initial Iris dataset, and now I want to get the initial Iris dataset back from the best possible generated rules. Is this problem NP-hard or NP-complete? And why?

How to get p-value for each row of two columns in pandas DataFrame?

I would like to ask for suggestions on how to calculate a p-value for each row in my pandas DataFrame. My dataframe looks like this: there are columns with the means of Data1 and Data2, and also columns with the standard errors of those means. Each row represents one atom. Thus I need to calculate a p-value for each row (that is, e.g., compare the mean of atom 1 from Data1 with the mean of atom 1 from Data2).
SEM-DATA1 MEAN-DATA1 SEM-DATA2 MEAN-DATA2
0 0.001216 0.145842 0.000959 0.143103
1 0.002687 0.255069 0.001368 0.250505
2 0.005267 0.321345 0.003722 0.305767
3 0.027265 0.906731 0.033637 0.731638
4 0.029974 0.773725 0.150025 0.960804
I found here on Stack Overflow that many people recommend using scipy, but I don't know how to apply it in the way I need.
Is it possible?
Thank you.
You are comparing two samples, df['MEAN-DATA1'] and df['MEAN-DATA2'], so you should do this:
from scipy import stats
stats.ttest_ind(df['MEAN-DATA1'],df['MEAN-DATA2'])
which returns:
Ttest_indResult(statistic=0.01001479441863673, pvalue=0.9922547232600507)
or, if you only want the p-value:
a = stats.ttest_ind(df['MEAN-DATA1'],df['MEAN-DATA2'])
a[1]
which gives
0.9922547232600507
EDIT
A clarification is in order here. A t-test (or the acquisition of a "p-value") is aimed at finding out whether two samples come from the same population. Testing two single values will give NaN.
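As a minimal illustration of that last point, taking the two means from the first row of the example data (scipy may also warn about the degrees of freedom):
from scipy import stats

# Two "samples" of one observation each: the variance is undefined,
# so both the statistic and the p-value come out as nan.
print(stats.ttest_ind([0.145842], [0.143103]))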

Seaborn violin plot using arrays - Error: Neither the `x` nor `y` variable appears to be numeric

I am trying to generate several violin plots in one, using seaborn. The dataframe
I use includes several categorical values in one column (to be used on the x-axis), with an array of values for each categorical value (to be used to create the violin plot for each categorical value). A small working example would be this:
import numpy as np
import pandas as pd

foo = pd.DataFrame(columns=['Names', 'Values'])
for i in range(10):
    foo.loc[i] = ['no'+str(i), np.random.normal(i, 2, 10)]
But when trying
sns.violinplot(x='Names', y='Values', data=foo)
I get the following error
ValueError: Neither the x nor y variable appears to be numeric.
Now I could be hacky and just separate the array across several rows as such:
foo = pd.DataFrame(columns=['Names', 'Values'])
for i in range(3):
    bar = np.random.normal(i, 2, 10)
    for j, b in enumerate(bar):
        foo.loc[i*10+j] = ['no'+str(i), b]
which yields the plot I want.
But I'm guessing there is a simpler solution to this, without needing to restructure my dataframe.
pd.DataFrame.explode() helps you turn your column of lists into separate cells. After converting them to actual numeric values, sns.violinplot can plot without effort.
foo = foo.explode('Values')
foo['Values'] = foo['Values'].astype('float')
sns.violinplot(data=foo, x='Names', y='Values')
In pandas 0.25 you could use explode; for a previous version use one of the manual alternatives (a sketch is given after the output below):
result = foo.explode('Values').reset_index(drop=True)
result = result.assign(Names=result['Names'].astype('category'),
                       Values=result['Values'].astype(np.float32))
sns_plot = sns.violinplot(x='Names', y='Values', data=result)
Output
Exploding (or unnesting) will transform your data into:
Names Values
0 no0 3.352148
1 no0 2.195788
2 no0 1.234673
3 no0 0.084360
4 no0 1.778226
.. ... ...
95 no9 12.385434
96 no9 9.849669
97 no9 11.360196
98 no9 8.535900
99 no9 9.369197
[100 rows x 2 columns]
The assign transforms the dtypes into:
Names category
Values float32
dtype: object
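For reference, a hedged sketch of a manual "explode" that should work on pandas versions before 0.25, assuming foo is built exactly as in the question (one array of values per row):
import numpy as np
import pandas as pd
import seaborn as sns

# Repeat each name once per element of its array, then flatten the arrays.
lengths = foo['Values'].str.len()
result = pd.DataFrame({
    'Names': foo['Names'].repeat(lengths).values,
    'Values': np.concatenate(foo['Values'].values).astype(float),
})
sns.violinplot(x='Names', y='Values', data=result)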

Problems with a binary one-hot (one-of-K) coding in python

Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', and 'color=yellow'. I begin with data in a pandas data-frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, neither of them satisfactory to me.
Pandas and get_dummies on the categorical columns of the data-frame. This method seems excellent as long as the original data-frame contains all the data available. That is, you do the one-hot coding before splitting your data into training, validation, and test sets. However, if the data is already split into different sets, this method doesn't work very well. Why? Because one of the data sets (say, the test set) can contain fewer values for a given variable. For example, it can happen that whereas the training set contains the values red, blue, yellow, and unknown for the variable color, the test set only contains red and blue. So the test set would end up having fewer columns than the training set. (I don't know either how the new columns are sorted; even with the same columns, they could be in a different order in each set.)
Sklearn and DictVectorizer. This solves the previous issue, as we can make sure that we are applying the very same transformation to the test set. However, the outcome of the transformation is a numpy array instead of a pandas data-frame. If we want to recover the output as a pandas data-frame, we need to (or at least this is the way I do it): 1) build pandas.DataFrame(data=outcome of the DictVectorizer transformation, index=index of the original pandas data-frame, columns=DictVectorizer().get_feature_names()) and 2) join the resulting data-frame along the index with the original one containing the numerical columns. This works, but it is somewhat cumbersome.
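For reference, the DictVectorizer route described above looks roughly like this; train and test are assumed to be the already-split data-frames, and on recent scikit-learn versions the method is get_feature_names_out rather than get_feature_names:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)

# Fit on the training set so the test set is encoded with exactly the same columns.
train_encoded = vec.fit_transform(train.to_dict(orient='records'))
test_encoded = vec.transform(test.to_dict(orient='records'))

# Wrap the numpy output back into data-frames, as described above.
train_df = pd.DataFrame(train_encoded, index=train.index, columns=vec.get_feature_names())
test_df = pd.DataFrame(test_encoded, index=test.index, columns=vec.get_feature_names())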
Is there a better way to do a binary one-hot encoding within a pandas data-frame if we have our data split in training and test set?
If your columns are in the same order, you can concatenate the dfs, use get_dummies, and then split them back again, e.g.,
encoded = pd.get_dummies(pd.concat([train,test], axis=0))
train_rows = train.shape[0]
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]
If your columns are not in the same order, then you'll have challenges regardless of what method you try.
You can set your data type to categorical:
In [5]: df_train = pd.DataFrame({"car":Series(["seat","bmw"]).astype('category',categories=['seat','bmw','mercedes']),"color":["red","green"]})
In [6]: df_train
Out[6]:
car color
0 seat red
1 bmw green
In [7]: pd.get_dummies(df_train )
Out[7]:
car_seat car_bmw car_mercedes color_green color_red
0 1 0 0 0 1
1 0 1 0 1 0
See this issue of Pandas.
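Note that on newer pandas versions the categories argument to astype was removed; a hedged equivalent uses CategoricalDtype:
import pandas as pd
from pandas.api.types import CategoricalDtype

car_type = CategoricalDtype(categories=['seat', 'bmw', 'mercedes'])
df_train = pd.DataFrame({
    'car': pd.Series(['seat', 'bmw']).astype(car_type),
    'color': ['red', 'green'],
})
# get_dummies keeps an all-zero car_mercedes column because the category is declared.
pd.get_dummies(df_train)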

Creating Pandas 2d heatmap based on accumulated values (rather than actual frequency)?

Thanks for reading. I've spent 3-4 hours searching for examples to solve this but can't find any that do; the ones I did try didn't seem to work with a pandas DataFrame object. Any help would be very much appreciated! :)
OK, this is my problem.
I have a Pandas DataFrame containing 12 columns.
I have 500,000 rows of data.
Most of the columns are useless. The variables/columns I am interested in are called x, y and profit.
Many of the x and y points are the same,
so I'd like to group them into unique combinations and then add up all the profit for each unique combination.
Each unique combination is a bin (like a bin used in histograms).
Then I'd like to plot a 2D chart/heatmap of x and y for each bin, with the colour showing the total profit.
e.g.
x,y,profit
7,4,230.0
7,5,162.4
6,8,19.3
7,4,-11.6
7,4,180.2
7,5,15.7
4,3,121.0
7,4,1162.8
Note how for the values x=7, y=4 there are 4 rows that meet this criterion; the total profit should be:
230.0 - 11.6 + 180.2 + 1162.8 = 1561.4
So in bin x=7, y=4, the profit is 1561.4.
Note that for the values x=7, y=5 there are 2 instances; the total profit should be: 162.4 + 15.7 = 178.1
So in bin x=7, y=5, the profit is 178.1.
So finally, I just want to be able to plot: x,y,total_profit_of_bin
e.g. to help illustrate what I'm looking for, I found this on the internet; it is similar to what I'd like (ignore the axes & numbers):
http://2.bp.blogspot.com/-F8q_ZcI-HJg/T4_l7D0C7yI/AAAAAAAAAgE/Bqtx3eIHzRk/s1600/heatmap.jpg
Thank-you so much for taking the time to read:)
If within each 'bin' of x the values of x are equal and the values of y are equal, then you can use groupby.agg. That would look something like this:
import pandas as pd
import numpy as np
df = YourData  # your DataFrame containing the x, y and profit columns
AggDF = df.groupby('x').agg({'y' : 'max', 'profit' : 'sum'})
AggDF
That would get you the data I think you want, then you could plot as you see fit. Do you need assistance with that also?
NB: this is only going to work the way you want if, within each 'bin' (i.e. the data grouped according to the values of x), the values of y are equal. I assume this must be the case, as otherwise I don't think it would make much sense to be trying to graph x and y together.
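If y does vary within an x value, a hedged sketch that bins on the unique (x, y) pairs directly (assuming df holds the x, y and profit columns) would be:
import matplotlib.pyplot as plt

# Sum the profit for every unique (x, y) combination, then pivot into a grid.
binned = df.groupby(['x', 'y'], as_index=False)['profit'].sum()
grid = binned.pivot(index='y', columns='x', values='profit')

plt.imshow(grid.values, origin='lower', cmap='viridis')
plt.colorbar(label='total profit per (x, y) bin')
plt.xticks(range(len(grid.columns)), grid.columns)
plt.yticks(range(len(grid.index)), grid.index)
plt.xlabel('x')
plt.ylabel('y')
plt.show()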
