I would like to cluster X2 and X3 within each month group using k-means clustering; the two variables need to be clustered together. I would also like to map clusters 0, 1, and 2 to the labels "weak", "average", and "best" according to each cluster's mean, where the cluster with the highest mean is "best". Below is my sample data set.
import pandas as pd

df = pd.DataFrame({'month': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2'],
                   'X1': [30, 42, 25, 32, 12, 10, 4, 6, 5, 10, 24, 21],
                   'X2': [10, 76, 100, 23, 65, 94, 67, 24, 67, 54, 87, 81],
                   'X3': [23, 78, 95, 52, 60, 76, 68, 92, 34, 76, 34, 12]})
df
I need to automate this, and since I have so many columns I would like to do it on two columns at a time (df.iloc[:, 2:4]) in general. The label assigned to each cluster is:
cluster 2 = "best"
cluster 1 = "average"
cluster 0 = "weak"
To find the best cluster, take the mean of each column within a cluster and sum them: the cluster with the highest sum is "best", the next is "average", and the lowest is "weak".
Please help, thank you.
Group by month and apply a clustering function
We can group the dataframe by month and cluster the columns X2 and X3 using a custom clustering function:
from sklearn.cluster import KMeans

cols = df.columns[2:4]  # X2 and X3
mapping = {0: 'weak', 1: 'average', 2: 'best'}

def cluster(X):
    k_means = KMeans(n_clusters=3).fit(X)
    # Rank the clusters by the sum of their column means: 0 = lowest, 2 = highest
    return X.groupby(k_means.labels_)\
            .transform('mean').sum(1)\
            .rank(method='dense').sub(1)\
            .astype(int).to_frame()

df['Cluster_id'] = df.groupby('month')[cols].apply(cluster)
df['Cluster_cat'] = df['Cluster_id'].map(mapping)
month X1 X2 X3 Cluster_id Cluster_cat
0 1 30 10 23 0 weak
1 1 42 76 78 1 average
2 1 25 100 95 2 best
3 1 32 23 52 0 weak
4 1 12 65 60 1 average
5 2 10 94 76 2 best
6 2 4 67 68 2 best
7 2 6 24 92 1 average
8 2 5 67 34 0 weak
9 2 10 54 76 2 best
10 2 24 87 34 0 weak
11 2 21 81 12 0 weak
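KMeans labels can change from run to run, and hard-coding the grouping column and the column slice makes reuse awkward. Here is a minimal parameterized sketch of the same idea (label_clusters is a hypothetical helper; assumes scikit-learn and a recent pandas):

from sklearn.cluster import KMeans

def label_clusters(df, cols, by='month', n_clusters=3, seed=0):
    # Cluster `cols` within each group, then rank clusters 0..n-1 by the
    # sum of their column means (0 = lowest mean, n-1 = highest).
    def one_group(X):
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X).labels_
        return (X.groupby(labels).transform('mean').sum(axis=1)
                 .rank(method='dense').sub(1).astype(int))
    # group_keys=False keeps the original row index, so the result aligns with df
    return df.groupby(by, group_keys=False)[list(cols)].apply(one_group)

df['Cluster_id'] = label_clusters(df, ['X2', 'X3'])
df['Cluster_cat'] = df['Cluster_id'].map({0: 'weak', 1: 'average', 2: 'best'})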
Hi, I'm trying to create a new column in my dataframe, and I want the values to be based on a calculation: each Student's score share within their Class. There are two different students with the same name in different classes, hence why the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum') / df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error "incompatible index of inserted column with frame index".
Can someone please help. Thanks
The problem is that a groupby aggregate is indexed by the unique values of the grouping columns, not by the original rows, so the result cannot be inserted as a column of your frame. The way I understood your problem, you want a new dataframe with each student's share score:
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
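If instead you want the share as a column on the original frame, a transform-based sketch avoids the index mismatch entirely, because transform('sum') broadcasts each group total back onto the original rows:

df['share'] = (df.groupby(['Class', 'Student'])['Score'].transform('sum')
               / df.groupby('Class')['Score'].transform('sum'))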
If I understood you correctly, given an example df like the following:
Class Student Score
1 1 1 99
2 1 2 60
3 1 3 90
4 1 4 50
5 2 1 93
6 2 2 93
7 2 3 67
8 2 4 58
9 3 1 54
10 3 2 29
11 3 3 34
12 3 4 46
Do you need the following result?
Class Student Score Score_Share
1 1 1 99 0.331104
2 1 2 60 0.200669
3 1 3 90 0.301003
4 1 4 50 0.167224
5 2 1 93 0.299035
6 2 2 93 0.299035
7 2 3 67 0.215434
8 2 4 58 0.186495
9 3 1 54 0.331288
10 3 2 29 0.177914
11 3 3 34 0.208589
12 3 4 46 0.282209
If so, that can be achieved straightforwardly with:
df['Score_Share'] = df.groupby('Class')['Score'].apply(lambda x: x / x.sum())
You can apply operations within each group's scope like that.
PS. I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. Can't make a comment because I'm a newbie here :)
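A side note on pandas versions: recent releases may prepend the group keys to the index of a groupby.apply result, which breaks the direct column assignment above. An equivalent that always aligns with the original index is:

df['Score_Share'] = df['Score'] / df.groupby('Class')['Score'].transform('sum')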
Here's my data -
ID,Pay1,Pay2,Pay3,Low,High,expected_output
1,12,21,23,1,2,21
2,21,34,54,1,3,54
3,74,56,76,1,1,74
The goal is to calculate the max Pay of each row as per the Pay column index specified in Low and High columns.
For example, for row 1, calculate the max of Pay1 and Pay2 columns as Low and High are 1 and 2.
I have tried building a dynamic string and then using the eval function, which is not performing well.
The idea is to filter only the Pay columns, build boolean masks from the Low and High columns using numpy broadcasting, pass them to DataFrame.where, and finally take the row-wise max:
import numpy as np

df1 = df.filter(like='Pay')
# Compare each column's 0-based position against each row's Low-1 and High-1
m1 = np.arange(len(df1.columns)) >= df['Low'].to_numpy()[:, None] - 1
m2 = np.arange(len(df1.columns)) <= df['High'].to_numpy()[:, None] - 1
df['expected_output'] = df1.where(m1 & m2, 0).max(axis=1)
print(df)
ID Pay1 Pay2 Pay3 Low High expected_output
0 1 12 21 23 1 2 21
1 2 21 34 54 1 3 54
2 3 74 56 76 1 1 74
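To see what the combined mask does with the sample data: each row keeps exactly the Pay columns whose 1-based position lies between that row's Low and High.

print(m1 & m2)
# [[ True  True False]    row 0: Low=1, High=2 -> Pay1, Pay2
#  [ True  True  True]    row 1: Low=1, High=3 -> Pay1..Pay3
#  [ True False False]]   row 2: Low=1, High=1 -> Pay1 only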
An alternative; I expect #jezrael's solution to be faster, since it stays within numpy and pd.wide_to_long is not particularly fast:
grouping = (
    pd.wide_to_long(df.filter(regex="^Pay|Low|High"),
                    i=["Low", "High"],
                    stubnames="Pay",
                    j="num")
    # keep every Pay whose number falls inside the row's Low..High range
    .query("Low <= num <= High")
    .groupby(level=["Low", "High"])
    .Pay.max()
)
grouping
Low High
1 1 74
2 21
3 54
Name: Pay, dtype: int64
df.drop(columns="expected_output").join(grouping.rename("expected_output"), on=["Low", "High"])
ID Pay1 Pay2 Pay3 Low High expected_output
0 1 12 21 23 1 2 21
1 2 21 34 54 1 3 54
2 3 74 56 76 1 1 74
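To see what the reshape produces before the query (using the sample data), Pay1..Pay3 are stacked into a single Pay column keyed by a new index level num:

long_df = pd.wide_to_long(df.filter(regex="^Pay|Low|High"),
                          i=["Low", "High"], stubnames="Pay", j="num")
print(long_df.sort_index())
#               Pay
# Low High num
# 1   1    1    74
#          2    56
#          3    76
#     2    1    12
#          2    21
#          3    23
#     3    1    21
#          2    34
#          3    54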
I am trying to rank a large dataset using Python. I do not want duplicate ranks, and rather than using the 'first' method I would like ties to be broken by the value in another column. It should only look at the second column if the value in the first column has duplicates.
Name CountA CountB
Alpha 15 3
Beta 20 52
Delta 20 31
Gamma 45 43
I would like the ranking to end up as:
Name CountA CountB Rank
Alpha 15 3 4
Beta 20 52 2
Delta 20 31 3
Gamma 45 43 1
Currently, I am using df.rank(ascending=False, method='first')
Maybe sort on both columns and pull the rank out of the index:

import pandas as pd

df = pd.DataFrame({'Name': ['A','B','C','D'], 'CountA': [15,20,20,45], 'CountB': [3,52,31,43]})
# argsort inverts the sorted-index permutation, so each original row
# receives its position in the sorted order, i.e. its rank
df['rank'] = df.sort_values(['CountA','CountB'], ascending=False).index.argsort() + 1
Name CountA CountB rank
0 A 15 3 4
1 B 20 52 2
2 C 20 31 3
3 D 45 43 1
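Staying closer to the rank call you are already using, a sketch that sorts by the tie-breaker column first, so that method='first' resolves ties among equal CountA values in CountB order:

df['rank'] = (df.sort_values('CountB', ascending=False)['CountA']
                .rank(ascending=False, method='first')
                .sort_index()
                .astype(int))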
You can take the value counts of CountA and then, wherever a CountA value occurs more than once, fall back to CountB; otherwise keep CountA.

import pandas as pd

df = pd.DataFrame([[15, 3], [20, 52], [20, 31], [45, 43]], columns=['CountA', 'CountB'])
colAcount = df['CountA'].value_counts()
# take the CountA values that occur more than once and use them in a `where`:
# keep CountA unless it is duplicated, in which case take CountB instead
df['final'] = df['CountA'].where(~df['CountA'].isin(colAcount[colAcount > 1].index), df['CountB'])
df = df.sort_values(by='final', ascending=False).reset_index(drop=True)
# the rank is the index
CountA CountB final
0 20 52 52
1 45 43 45
2 20 31 31
3 15 3 15
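The last comment means the row position after sorting is the rank; to materialize it as a column (the name Rank is my choice):

df['Rank'] = df.index + 1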
In R, cbind(dataframe, new_column) will return the original dataframe with an extra column called "new_column".
What is best practice for achieving this in Python (preferably using base or pandas)?
To make the question more concrete, suppose
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats, columns=['Day', 'Visitors', 'Bounce Rate'])
and
new_column = [2,4,6,8,10,12]
And that the final output should be
Day Visitors Bounce Rate new_column
0 1 43 65 2
1 2 34 67 4
2 3 65 78 6
3 4 56 65 8
4 5 29 45 10
5 6 76 52 12
You can do this:
df['new_column'] = new_column
Note that the assignment has to go through the dataframe, not the web_stats dict it was built from; modifying the dict after the dataframe is created has no effect on df.
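Two other common pandas idioms for the same thing, in case they fit better:

# assign returns a new frame and leaves df untouched
df2 = df.assign(new_column=new_column)

# concat works when the new data is already a Series or DataFrame
df3 = pd.concat([df, pd.Series(new_column, name='new_column')], axis=1)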
I have the following dataframe:
A B C D
0 4 1 1 78
1 82 2 58 41
2 53 3 31 76
3 1 45 4 12
5 5 2 4 87
6 1 74 6 11
7 1 1 6 47
8 1 1 6 8
to which I am trying to apply sklearn.decomposition.PCA in order to reduce the number of columns from 4 to 2. I can't understand which dimension, rows or columns, PCA takes as the set of vectors.
Because if I do the following:
df=
A B C D
0 4 1 1 78
pca=PCA(n_components=3)
pca.fit(df.T)
it will return the following error:
ValueError: n_components=3 must be between 0 and n_features=1 with
svd_solver='full'
Even if I have only one data point in each vector, I should still be able to reduce the number of vectors from 4 to 3.
This is how you would do it using PCA; note that I am also standardizing the values.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

vals = df.iloc[:, :4].values  # .ix was removed from pandas; use .iloc for positional slicing
vals_std = StandardScaler().fit_transform(vals)  # rescale to zero mean, unit variance per column
sklearn_pca = PCA(n_components=2)  # set this to however many components you want
vals_pca = sklearn_pca.fit_transform(vals_std)
Then based on however many dimensions you settled on you can add it back to your data frame.
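For example, assuming n_components=2 as above, the components can be joined back onto the frame; the PC1/PC2 column names are arbitrary:

pcs = pd.DataFrame(vals_pca, columns=['PC1', 'PC2'], index=df.index)
df = df.join(pcs)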