I have run KMeans clustering and now I need to analyse each individual cluster, for example look at cluster 1, see which clients are in it, and draw conclusions.
dfRFM['idcluster'] = num_cluster
dfRFM.head()
   idcliente  Recencia  Frecuencia  Monetario  idcluster
1          3       251          44     -90.11          0
2          8      1011          44   87786.44          2
6         88       537          36    8589.57          0
7         98       505           2    -179.00          0
9        156        11          15   35259.50          0
How do I group so that I only see results from, let's say, idcluster 0, sorted by, let's say, Monetario? Thanks!
To filter a dataframe, the most common way is to use df[df[colname] == val]. Then you can use df.sort_values().
In your case, that would look like this:
dfRFM_id0 = dfRFM[dfRFM['idcluster']==0].sort_values('Monetario')
The way this filtering works is that dfRFM['idcluster'] == 0 returns a Series of True/False values, one per row. So we effectively have dfRFM[(True, False, True, True, ...)], and the dataframe returns only the rows where the mask is True, i.e. it selects the data where the condition holds.
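For illustration, here is roughly what that mask looks like on the sample data above (a sketch, assuming the head() shown is the whole frame):
mask = dfRFM['idcluster'] == 0
print(mask)
# 1     True
# 2    False
# 6     True
# 7     True
# 9     True
# Name: idcluster, dtype: bool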
I think you actually just need to filter your DF!
df_new = dfRFM[dfRFM.idcluster == 0]
and then sort by Monetario
df_new = df_new.sort_values(by = 'Monetario')
Group by is really best when you want to look at a cluster as a whole, for example to see the average values of Recencia, Frecuencia, and Monetario for all of group 0.
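If that is what you are after, a minimal sketch (reusing the column names from the question) would be:
dfRFM.groupby('idcluster')[['Recencia', 'Frecuencia', 'Monetario']].mean()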
I have a lot of data that I'm trying to do some basic machine learning on, kind of like the Titanic example that predicts whether a passenger survived or died (I learned this in an intro Python class) based on factors like their gender, age, fare class...
What I'm trying to predict is whether a screw fails depending on how it was made (referred to as Lot). The engineers just listed how many times a failure occurred. Here's how it's formatted.
Lot    Failed?
100    3
110    0
120    1
130    4
The values in the cells are the number of occurrences, so for example:
Lot 100 had three screws that failed
Lot 110 had zero screws that failed
Lot 120 had one screw that failed
Lot 130 had four screws that failed
I plan on doing a logistic regression using scikit-learn, but first I need each row to be listed as a failure or not. What I'd like to see is a row for every observation, listed as either a 0 (did not occur) or a 1 (did occur). Here's what it'd look like afterwards:
Lot    Failed?
100    1
100    1
100    1
110    0
120    1
130    1
130    1
130    1
130    1
Here's what I've tried and what I've gotten
df = pd.DataFrame({
    'Lot': ['100', '110', '120', '130'],
    'Failed?': [3, 0, 1, 4]
})
df.loc[df.index.repeat(df['Failed?'])].reset_index(drop = True)
When I do this it repeats the rows but keeps the same counts in the Failed? column, and Lot 110 disappears entirely since it is repeated zero times.
Lot    Failed?
100    3
100    3
100    3
120    1
130    4
130    4
130    4
130    4
Any ideas? Thank you!
You can use pandas.Index.repeat with reindex, but first you need to set aside the rows that have 0, since repeat would exclude them (they are repeated 0 times):
s = df[df['Failed?'].eq(0)]  # save the rows with a 0 value, as repeat would drop them (repeated 0 times)
df = df.reindex(df.index.repeat(df['Failed?']))  # repeat each row according to its count
df['Failed?'] = 1  # set all remaining values to 1
df = pd.concat([df, s]).sort_index()  # bring back the 0 rows saved in s and sort by index to restore order
df
# The above code as a one-liner:
(pd.concat([df.reindex(df.index.repeat(df['Failed?'])).assign(**{'Failed?': 1}),
            df[df['Failed?'].eq(0)]])
   .sort_index())
Out[1]:
Lot Failed?
0 100 1
0 100 1
0 100 1
1 110 0
2 120 1
3 130 1
3 130 1
3 130 1
3 130 1
The below will give you failure or not, but I suppose you are better served by the other answer:
df.loc[df['Failed?']>0,'Failed?'] = 1
Just as a comment: this is a bit of a strange data transformation; you might want to just keep a numerical target variable.
I need to create a dataframe containing the Manhattan distance between two dataframes with the same columns, and I need the indexes of each dataframe to be the index and column names. For example, let's say I have these two dataframes:
x_train :
index a b c
11 2 5 7
23 4 2 0
312 2 2 2
x_test :
index a b c
22 1 1 1
30 2 0 0
So the columns match but the sizes and indexes do not. The expected dataframe would look like this:
dist_dataframe:
index 11 23 312
22 11 5 3
30 12 4 4
and what I have right now is this:
def manhattan_distance(a, b):
    return sum(abs(e1-e2) for e1, e2 in zip(a, b))

def calc_distance(X_test, X_train):
    dist_dataframe = pd.DataFrame(index=X_test.index, columns=X_train.index)
    for i in X_train.index:
        for j in X_test.index:
            dist_dataframe.loc[i,j] = manhattan_distance(X_train.loc[[i]], X_test.loc[[j]])
    return dist_dataframe
what I get from the code I have is this dataframe:
dist_dataframe:
index
index 11 23 312
22 NaN NaN NaN
30 NaN NaN NaN
I get the right dataframe size, except that it has two rows called index that come from the creation of the new dataframe, and I also get an error no matter what I do on the Manhattan calculation line. Can anyone help me out here, please?
Problem in your code
There is a very small problem in your code, namely how you access values in dist_dataframe. Instead of dist_dataframe.loc[i,j], you should reverse the order of i and j and make it dist_dataframe.loc[j,i], since the rows are indexed by X_test and the columns by X_train.
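So the corrected line inside your loop is:
dist_dataframe.loc[j,i] = manhattan_distance(X_train.loc[[i]], X_test.loc[[j]])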
More efficient solution
It will work fine, but since you are a new contributor I would also like to point out the efficiency of your code. Always try to replace loops with pandas built-in functions; since they are implemented in C, they are much faster. So here is a more efficient solution:
def manhattan_distance(a, b):
    return sum(abs(e1-e2) for e1, e2 in zip(a, b))

def xtrain_distance(row):
    distances = {}
    for i, each in x_train.iterrows():
        distances[i] = manhattan_distance(each, row)
    return distances

result = x_test.apply(xtrain_distance, axis=1)

# converting into a dataframe
pd.DataFrame(dict(result)).transpose()
It produces the same output, and on your example you won't see any time difference. But when run on a larger size (the same data scaled up 20 times), i.e. 60 x_train samples and 40 x_test samples, here is the time difference:
Your solution took: 929 ms
This solution took: 207 ms
It got 4x faster just by eliminating one for loop. Note that it can be made even more efficient, but for the sake of demonstration I have used this solution.
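For completeness, here is a fully vectorized sketch using numpy broadcasting (assuming x_train and x_test are numeric dataframes with matching columns, as in the question):
import numpy as np
# |x_test[j] - x_train[i]|, summed over the columns, for every pair (j, i)
dist = np.abs(x_test.to_numpy()[:, None, :] - x_train.to_numpy()[None, :, :]).sum(axis=2)
dist_dataframe = pd.DataFrame(dist, index=x_test.index, columns=x_train.index)
This removes both Python-level loops, which is where the remaining time goes.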
I have a dataframe df with 2 columns
Sub_marks Total_marks
40 90
60 80
100 90
0 0
I need to find all rows that fail the criterion Sub_marks <= Total_marks.
Currently I am using a sympy function, as below:
def fn_validate(formula, **args):
    exp = sy.sympify(formula)
    exp = exp.subs(args)
    return exp
I am calling the above function using the apply method, as below:
df['val_check'] = df.apply(lambda row: fn_validate('X<=Y', X=row['Sub_marks'], Y=row['Total_marks']), axis=1)
I am expecting a column val_check with the True/False result of the expression validation, but in the case of 0 values I get an error:
Invalid NaN Comparison
I can't remove these values from the dataframe. Please let me know if there is any other way to do this expression validation.
You can try this:
df['val_check'] = df.Sub_marks <= df.Total_marks
df
Sub_marks Total_marks val_check
0 40 90 True
1 60 80 True
2 100 90 False
3 0 0 True
You can compare the columns directly; the result is a boolean Series. To flag the rows that fail the criterion:
condition = df['Sub_marks'] > df['Total_marks']
print(condition)
Output:
0    False
1    False
2     True
3    False
dtype: bool
In my dataframe I have one column with numeric values, let's say distance. I want to find out which group of distances (range) has the biggest number of records (rows).
Doing a simple df.distance.value_counts() returns:
74 1
90 1
94 1
893 1
889 1
885 1
877 1
833 1
122 1
545 1
What I want to achieve is something like the buckets of a histogram, so I am expecting output like this:
900 4 #all values < 900 and > 850
100 3
150 1
550 1
850 1
The one approach I've figured out so far, though I don't think it's the best or most optimal one, is to find the max and min values, divide by my step (50 in this case), and then loop over all the values, assigning each to the appropriate group.
Is there any other, better approach for that?
I'd suggest doing the following, assuming your value column is labeled val:
import numpy as np
df['bin'] = df['val'].apply(lambda x: 50*np.floor(x/50))
The result is the following:
df.groupby('bin')['val'].count()
Thanks to EdChum's suggestion and based on this example, I've figured out that the best way (at least for me) is to do something like this:
import numpy as np
step = 50
#...
max_val = df.distance.max()
bins = list(range(0,int(np.ceil(max_val/step))*step+step,step))
clusters = pd.cut(df.distance,bins,labels=bins[1:])
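To then count how many records fall into each bucket (the original goal), something like this should work:
clusters.value_counts().sort_index()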
I have three arrays as listed below:
users: contains the ids of 50000 users (all distinct)
pusers: contains the ids of users who own some posts (contains repeated ids too, since one user can own many posts) [50000 values]
score: contains the score corresponding to each value in pusers [50000 values]
Now I want to populate another array, PScore, based on the following calculation. For each occurrence of a user from users in pusers, I need to fetch the corresponding score and add it to the PScore array at the index corresponding to that user.
Example,
if users[5] = 23224
and pusers[6] = pusers[97] = 23224
then PScore[5] += score[6]+score[97]
Items of note:
score is related to pusers (e.g., pusers[5] has score[5])
PScore is expected to be related to users (e.g., the cumulative score of users[5] is PScore[5])
The ultimate aim is to assign a cumulative score of posts to the user who owns it.
The users who don't own any posts are assigned a score of 0.
Can anyone help me do this? I tried a lot, but every time I run one of my attempts the output screen stays blank until I hit Ctrl+Z to get out.
I went through all of the following posts but I couldn't use them effectively for my scenario.
Compare values of two arrays in python
how to compare two arrays in python?
Checking if any elements in one list are in another
I am new to this forum and I'm a beginner in Python too. Any help is going to be really useful to me.
Additional Information
I'm working on a small project using StackOverflow data.
I'm using Orange tool and I'm in the process of learning the tool and python.
Ok, I understand that something is wrong with my approach. So shouldn't I use lists for this scenario? Can anyone please tell me how I should proceed?
A sample of the data I have arrived at is shown below.
PUsers Score
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
13 0
77 1
77 4
77 3
77 0
77 2
77 2
77 3
102 2
105 0
108 2
108 2
117 2
Users
-1
1
2
3
4
5
8
9
10
11
13
16
17
19
20
22
23
24
25
26
27
29
30
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
48
49
50
All that I want is the total score associated with each user. Once again, the pusers list contains repetitions while the users list contains unique values. I need the total score of each user stored in such a way that, if I say PScore[6], it refers to the total score associated with users[6].
Hope I answered the queries.
Thanks in advance.
From how you described your arrays, and since you're using Python, this looks like a perfect candidate for dictionaries.
Instead of having one array for post owner and another array for post score, you should be able to make a dictionary that maps a user id to a score. When you're taking in data, look in the dictionary to see if the user already exists. If so, add the score to the current score. If not, make a new entry. When you've looped through all the data, you should have a dictionary that maps from user id to total score.
http://docs.python.org/2/tutorial/datastructures.html#dictionaries
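A minimal sketch of that idea, assuming users, pusers, and score are plain Python lists as described in the question:
totals = {u: 0 for u in users}        # every user starts at 0, including users with no posts
for owner, s in zip(pusers, score):   # one pass over all the posts
    if owner in totals:               # skip any owner id not present in users
        totals[owner] += s
PScore = [totals[u] for u in users]   # PScore[i] is the total score of users[i]
This runs in linear time, so it should finish near-instantly on 50000 entries.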
I think your algorithm is either wrong or broken.
Try to compute its complexity. If it's N^2 or more, you are likely using an inefficient algorithm: O(N^2) with 50,000 elements should take a few seconds, and O(N^3) will probably take minutes.
If you're sure of your approach try running it with some small fake data to figure out if it does the right thing or if you accidentally added some infinite loop.
You can easily get it working in linear time with dictionaries.