I have the following dataframe:
A B C D
0 4 1 1 78
1 82 2 58 41
2 53 3 31 76
3 1 45 4 12
5 5 2 4 87
6 1 74 6 11
7 1 1 6 47
8 1 1 6 8
to which I am trying to apply:
sklearn.decomposition.PCA
in order to reduce the number of columns from 4 to 2
And I can't understand which dimension, rows or columns, PCA takes as the number of vectors.
Because if I do the following:
df=
A B C D
0 4 1 1 78
pca=PCA(n_components=3)
pca.fit(df.T)
it will return the following error:
ValueError: n_components=3 must be between 0 and n_features=1 with
svd_solver='full'
Even if I have only one data point in each vector, I should still be able to reduce the number of vectors from 4 to 3.
First, the source of the error: scikit-learn treats rows as samples and columns as features, and n_components can be at most min(n_samples, n_features). Transposing your one-row frame left you with a single feature, hence the error. This is how you would do it using PCA; note I am also standardizing the values first.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
vals = df.iloc[:, :4].values  # .ix is deprecated; .iloc does positional indexing
vals_std = StandardScaler().fit_transform(vals)
sklearn_pca = PCA(n_components=2)  # set this to however many components you want
vals_pca = sklearn_pca.fit_transform(vals_std)
Then, based on however many dimensions you settled on, you can add the components back to your data frame.
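Putting it together, here is a minimal end-to-end sketch. The values are copied from the question (with a default 0-based index), and the choice of two components matches your stated goal; note that the fit is on df itself, not df.T, because rows are samples and columns are features:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
df = pd.DataFrame({'A': [4, 82, 53, 1, 5, 1, 1, 1],
                   'B': [1, 2, 3, 45, 2, 74, 1, 1],
                   'C': [1, 58, 31, 4, 4, 6, 6, 6],
                   'D': [78, 41, 76, 12, 87, 11, 47, 8]})
# standardize, then project the 4 features down to 2 components
vals_std = StandardScaler().fit_transform(df.values)
vals_pca = PCA(n_components=2).fit_transform(vals_std)
# add the two components back to the original frame
df['PC1'], df['PC2'] = vals_pca[:, 0], vals_pca[:, 1]
print(df)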
I would like to cluster X2 and X3 together, within each month group, using k-means clustering. I would also like to map cluster 0, cluster 1, and cluster 2 to "strong", "average", and "weak" according to the mean of each cluster: the cluster with the highest mean is the strong one. Below is my sample data set.
df = pd.DataFrame({'month': ['1','1','1','1','1','2','2','2','2','2','2','2'],
                   'X1': [30,42,25,32,12,10,4,6,5,10,24,21],
                   'X2': [10,76,100,23,65,94,67,24,67,54,87,81],
                   'X3': [23,78,95,52,60,76,68,92,34,76,34,12]})
df
I need to automate this, and since I have many columns I would like to do it on two columns at a time (df.iloc[:, 2:4]) in general. The label assigned to each cluster is:
cluster 2="best"
cluster 1="average"
cluster 0="weak"
To find the best cluster, take the mean of each column within the cluster and sum them: the highest sum is assigned "best", the next "average", and the lowest "weak".
Please help, thank you.
groupby and apply a clustering function
We can group the dataframe by month and cluster the columns X2 and X3 using a custom clustering function:
from sklearn.cluster import KMeans

cols = df.columns[2:4]  # ['X2', 'X3']
mapping = {0: 'weak', 1: 'average', 2: 'best'}

def cluster(X):
    # fit k-means on this month's rows of X2 and X3
    k_means = KMeans(n_clusters=3).fit(X)
    # per-cluster column means, summed, then dense-ranked 0..2 (0 = weak, 2 = best)
    return X.groupby(k_means.labels_)\
            .transform('mean').sum(1)\
            .rank(method='dense').sub(1)\
            .astype(int).to_frame()

df['Cluster_id'] = df.groupby('month')[cols].apply(cluster)
df['Cluster_cat'] = df['Cluster_id'].map(mapping)
month X1 X2 X3 Cluster_id Cluster_cat
0 1 30 10 23 0 weak
1 1 42 76 78 1 average
2 1 25 100 95 2 best
3 1 32 23 52 0 weak
4 1 12 65 60 1 average
5 2 10 94 76 2 best
6 2 4 67 68 2 best
7 2 6 24 92 1 average
8 2 5 67 34 0 weak
9 2 10 54 76 2 best
10 2 24 87 34 0 weak
11 2 21 81 12 0 weak
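One caveat, not part of the answer above: KMeans uses random initialization, so the labels (and hence the ranking) can differ between runs. If you want reproducible assignments, you can fix the seed inside the helper, for example:
def cluster(X, seed=0):
    # random_state makes the k-means initialization deterministic
    k_means = KMeans(n_clusters=3, random_state=seed).fit(X)
    return X.groupby(k_means.labels_)\
            .transform('mean').sum(1)\
            .rank(method='dense').sub(1)\
            .astype(int).to_frame()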
Hi, I'm trying to create a new column in my dataframe whose values are based on a calculation: each Student's share of the Score within their Class. There are two different students with the same name in different classes, hence why the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error "incompatible index of inserted column with frame index".
Can someone please help? Thanks.
The problem is that a groupby aggregate is indexed by the unique values of the grouping columns, which does not match your frame's original index, hence the insertion error. In your case the share is the student's total divided by the class total; the snippet below computes it into a new dataframe with each student's share score.
I understood your problem this way.
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
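If you then want this share back on the original frame, one option (my addition, not part of the answer above) is to rename the ratio column and merge on the grouping keys:
# the ratio Series inherits the name 'Score'; rename it before merging
ndf = ndf.rename(columns={'Score': 'share'})
df = df.merge(ndf, on=['Class', 'Student'])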
If I understood you correctly, given an example df like the following:
Class Student Score
1 1 1 99
2 1 2 60
3 1 3 90
4 1 4 50
5 2 1 93
6 2 2 93
7 2 3 67
8 2 4 58
9 3 1 54
10 3 2 29
11 3 3 34
12 3 4 46
Do you need the following result?
Class Student Score Score_Share
1 1 1 99 0.331104
2 1 2 60 0.200669
3 1 3 90 0.301003
4 1 4 50 0.167224
5 2 1 93 0.299035
6 2 2 93 0.299035
7 2 3 67 0.215434
8 2 4 58 0.186495
9 3 1 54 0.331288
10 3 2 29 0.177914
11 3 3 34 0.208589
12 3 4 46 0.282209
If so, that can be achieved straightforwardly with:
df['Score_Share'] = df.groupby('Class')['Score'].apply(lambda x: x / x.sum())
You can apply operations within each group's scope like that.
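An equivalent, and usually faster, way to broadcast a group aggregate back to the original rows is transform, which keeps the original index by construction; a small sketch of the same computation:
# transform('sum') returns each row's class total, aligned to the original index
df['Score_Share'] = df['Score'] / df.groupby('Class')['Score'].transform('sum')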
PS. I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. Can't make a comment because I'm a newbie here :)
I have the following dataframe
df:
group people value value_50
1 5 100 1
2 2 90 1
1 10 80 1
2 20 40 0
1 7 10 0
2 23 30 0
And I am trying to apply scikit-learn's MinMaxScaler to one of the columns, only for rows matching a condition on the dataset, and then join the result back into my original data by its pandas index.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
After copying the above data
data = pd.read_clipboard()
minmax = MinMaxScaler(feature_range=(0,10))
# filter on "group", then apply min-max scaling only to those rows
val = pd.DataFrame(minmax.fit_transform(data[data['group'] == 1][['value']]),
                   columns=['val_minmax'])
But it looks like we lose the index after the minmax
val
val_minmax
0 10.000000
1 7.777778
2 0.000000
whereas the index in my original dataset for this filter is
data[data['group'] == 1]['value']
output:
0 100
2 80
4 10
Desired dataset:
df_out:
group people value value_50 val_minmax
1 5 100 1 10
2 2 90 1 na
1 10 80 1 7.78
2 20 40 0 na
1 7 10 0 0
2 23 30 0 na
Now, how do I join this back to the matching rows in the original data, so that I get the output above?
You just need to assign it back; .loc with the same boolean mask keeps the rows aligned:
df.loc[df.group == 1, 'val_minmax'] = minmax.fit_transform(df[df['group'] == 1][['value']])
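For completeness, a minimal end-to-end sketch with the data and scaler settings from the question (rows outside group 1 are left as NaN, matching the desired output):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({'group':    [1, 2, 1, 2, 1, 2],
                   'people':   [5, 2, 10, 20, 7, 23],
                   'value':    [100, 90, 80, 40, 10, 30],
                   'value_50': [1, 1, 1, 0, 0, 0]})
minmax = MinMaxScaler(feature_range=(0, 10))
mask = df['group'] == 1
# .ravel() flattens the (n, 1) array so it matches the single target column
df.loc[mask, 'val_minmax'] = minmax.fit_transform(df.loc[mask, ['value']]).ravel()
print(df)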
In R, cbind(dataframe, new_column) will return the original dataframe with an extra column called "new_column"
What is best practice for achieving this in Python (preferably using base or pandas)?
To make the question more concrete, suppose
web_stats = {'Day': [1,2,3,4,5,6],
             'Visitors': [43,34,65,56,29,76],
             'Bounce Rate': [65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns=['Day', 'Visitors', 'Bounce Rate'])
and
new_column = [2,4,6,8,10,12]
And that the final output should be
Day Visitors Bounce Rate new_column
0 1 43 65 2
1 2 34 67 4
2 3 65 78 6
3 4 56 65 8
4 5 29 45 10
5 6 76 52 12
You can do this (note the assignment is to the dataframe df, not to the web_stats dict):
df['new_column'] = new_column
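If you'd rather not mutate df in place, two standard pandas alternatives (shown as variants, not the answer's method):
# assign returns a new frame with the extra column, leaving df untouched
df2 = df.assign(new_column=new_column)
# the closest analogue of R's cbind: concatenate along the column axis
df3 = pd.concat([df, pd.Series(new_column, name='new_column')], axis=1)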
I'd like to keep the columns in the order they were defined with pd.DataFrame. In the example below, df.info() shows that GroupId is the first column, and print also prints GroupId first.
I'm using Python version 3.6.3
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id': np.random.randint(1,100,10),
                   'GroupId': np.random.randint(1,5,10)})
df.info()
print(df.iloc[:,0])
One way is to use collections.OrderedDict, as below. Note that the OrderedDict object takes a list of tuples as an input.
from collections import OrderedDict
df = pd.DataFrame(OrderedDict([('Id', np.random.randint(1,100,10)),
('GroupId', np.random.randint(1,5,10))]))
# Id GroupId
# 0 37 4
# 1 10 2
# 2 42 1
# 3 97 2
# 4 6 4
# 5 59 2
# 6 12 2
# 7 69 1
# 8 79 1
# 9 17 1
Unless you're using python-3.7+, where dictionaries are guaranteed to preserve insertion order (in CPython 3.6 it is only an implementation detail), this just isn't possible with a (standard) dictionary. You will need to zip your items together and pass a list of tuples:
np.random.seed(0)
a = np.random.randint(1, 100, 10)
b = np.random.randint(1, 5, 10)
df = pd.DataFrame(list(zip(a, b)), columns=['Id', 'GroupId'])
Or,
data = [a, b]
df = pd.DataFrame(list(zip(*data)), columns=['Id', 'GroupId'])
df
Id GroupId
0 45 3
1 48 1
2 65 1
3 68 1
4 68 3
5 10 2
6 84 3
7 22 4
8 37 4
9 88 3
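Alternatively, pd.DataFrame accepts a columns argument that selects and orders the columns, so you can keep the dict and spell the order out explicitly (a small variant I'm adding, not from the answers above):
# the columns list fixes the order regardless of dict key ordering
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)},
                  columns=['Id', 'GroupId'])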