Ordering a dataframe using value_counts - python

I have a dataframe in which under the column "component_id", I have component_ids repeating several times.
Here is what the df looks like:
In [82]: df.head()
Out[82]:
   index  molregno      chembl_id  assay_id     tid     tid  component_id
0      0    942606  CHEMBL1518722    688422  103668  103668          4891
1      0    942606  CHEMBL1518722    688422  103668  103668          4891
2      0    942606  CHEMBL1518722    688721      78      78           286
3      0    942606  CHEMBL1518722    688721      78      78           286
4      0    942606  CHEMBL1518722    688779  103657  103657          5140

  component_synonym
0              LMN1
1              LMNA
2              LGR3
3              TSHR
4              MAPT
As can be seen, the same component_id can be linked to various component_synonyms (essentially the same gene, but different names). I wanted to find the frequency of each gene so that I can identify the top 20 most frequently hit genes, so I performed a value_counts on the column "component_id". I get something like this:
In [84]: df.component_id.value_counts()
Out[84]:
5432 804
3947 402
5147 312
3 304
2693 294
75 282
Name: component_id, dtype: int64
Is there a way for me to order the entire dataframe according to the component_id that is present the most number of times?
And also, is it possible for my dataframe to contain only the first occurrence of each component_id?
Any advice would be greatly appreciated!

I think you can make use of a per-group count to sort the rows and then drop the helper column, i.e.:
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count', ascending=False).drop(columns='count')
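For the second part of the question (keeping only the first occurrence of each component_id), a minimal sketch building on df_sorted above; using head(20) for the top-20 list is an assumption about the desired output:
# Keep only the first row for each component_id, preserving the frequency ordering
df_first = df_sorted.drop_duplicates(subset='component_id', keep='first')
# The 20 most frequently hit component_ids, straight from value_counts
top20 = df.component_id.value_counts().head(20)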

Related

Find occurences where a column from one dataframe equals another, based on condition

I have the following two dataframes, which have different size: df1 (966 rows x 2 cols), df2 (36 rows, 2 cols), where
df1:
     Video_#  Selected Joint.1
484        1     Left_shoulder
778        1     Left_shoulder
418        1    Right_shoulder
964        1    Right_shoulder
193        1    Right_shoulder
...      ...               ...
285       36       Right_elbow
267       36         Left_hand
216       36   Shoulder_centre
139       36    Right_shoulder
df2:
    Video_#            Ann.1
0         1  Shoulder_center
1         2             Head
2         3        Right_hip
...     ...              ...
33       34        Left_knee
34       35       Right_knee
35       36   Right_shoulder
Video_# goes from 1-36. In df2 the Video_# column contains a single occurrence of each, so 1-36 just once, whereas in df1 there are multiple occurrences of each value 1-36, and not the same number for each (I hope that makes sense).
What I want to check is the number of rows where df1['Selected Joint.1'] == df2['Ann.1'] for each Video_#. So the expected output is (e.g.):
Video_# Equality Occurrences
1 3
2 5
... ... ...
36 6
Is that possible?
Use DataFrame.merge with GroupBy.size:
df = (df1.merge(df2,
                left_on=['Video_#', 'Selected Joint.1'],
                right_on=['Video_#', 'Ann.1'])
         .groupby('Video_#')
         .size()
         .reset_index(name='Equality Occurrences'))
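A self-contained sketch of the same idea with made-up toy frames (the values are illustrative only); the reindex step is an optional extra so that videos with zero matching rows still appear in the result:
import pandas as pd

df1 = pd.DataFrame({'Video_#': [1, 1, 1, 2, 2],
                    'Selected Joint.1': ['Left_shoulder', 'Head', 'Head',
                                         'Right_hip', 'Left_knee']})
df2 = pd.DataFrame({'Video_#': [1, 2, 3],
                    'Ann.1': ['Head', 'Right_hip', 'Left_knee']})

counts = (df1.merge(df2, left_on=['Video_#', 'Selected Joint.1'],
                         right_on=['Video_#', 'Ann.1'])
             .groupby('Video_#')
             .size()
             .reindex(df2['Video_#'], fill_value=0)   # keep videos with zero matches
             .reset_index(name='Equality Occurrences'))
print(counts)
#    Video_#  Equality Occurrences
# 0        1                     2
# 1        2                     1
# 2        3                     0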

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The method that comes closest to mind for me is calculating the total number of users, working out 20% of that, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
Only the top 2 users generate 23.3% of total revenue.
This looks like a case for df.quantile: per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')
This prints the values at the 0.8 and 1.0 quantiles of each column, which here are its two largest values:
user_id revenue
0.8 2873 489
1.0 8942 28934
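If the goal is to keep the actual rows rather than just print the quantile values, a small sketch building on the same idea (the >= threshold rule is an assumption about the intended cut-off):
# Revenue value at the 80th percentile, then keep the users at or above it
threshold = df['revenue'].quantile(0.8)
top_rows = df[df['revenue'] >= threshold]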
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)

Sort letters in ascending order ('a-z') in Python after using value_counts

I imported my data file and isolated the first letter of each word, and provided the count of the word. My next step is to sort the letters in ascending order 'a-z'. This is the code that I have right now:
import pandas as pd
df = pd.read_csv("text.txt", names=['FirstNames'])
df
df['FirstLetter'] = df['FirstNames'].str[:1]
df
df['FirstLetter'] = df['FirstLetter'].str.lower()
df
df['FirstLetter'].value_counts()
df
df2 = df['FirstLetter'].index.value_counts()
df2
Using .index.value_counts() wasn't working for me. It returned this output:
Out[72]:
2047 1
4647 1
541 1
4639 1
2592 1
545 1
4643 1
2596 1
549 1
2600 1
2612 1
553 1
4651 1
2604 1
557 1
4655 1
2608 1
561 1
2588 1
4635 1
..
How can I fix this?
You can use the sort_index() function. This should work: df['FirstLetter'].value_counts().sort_index()
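A small end-to-end sketch of the whole flow (the file name and its one-name-per-line layout are assumptions from the question):
import pandas as pd

df = pd.read_csv("text.txt", names=['FirstNames'])        # assumed: one first name per line
df['FirstLetter'] = df['FirstNames'].str[:1].str.lower()

# value_counts() orders by frequency; sort_index() re-orders the letters a-z
letter_counts = df['FirstLetter'].value_counts().sort_index()
print(letter_counts)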

Handling Zeros or NaNs in a Pandas DataFrame operations

I have a DataFrame (df) like the one shown below, where each column is sorted from largest to smallest for frequency analysis. Because each column has a different length, the remaining values are either zeros or NaN values.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column had its own length or number of records (i.e., ignoring the zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns:
shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as appears to be the case from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
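Putting that filter back into the original loop, a minimal sketch (assuming the zeros are padding to be ignored and scipy is available):
from scipy import stats

shape_list1, location_list1, scale_list1 = [], [], []
for column in df.columns:
    data = df[column][df[column] > 0]      # drop the zero padding before fitting
    shape1, location1, scale1 = stats.genpareto.fit(data)
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)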
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a numpy array that holds the index labels where df is nonzero. To index df[column] by these nonzero labels, you can use df[column][df[column].nonzero()[0]].
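Note that Series.nonzero() was deprecated in pandas 0.24 and removed in 1.0, so this line will fail on recent pandas versions; a sketch of a positional equivalent using NumPy:
import numpy as np

# Integer positions of the nonzero entries, then positional indexing with .iloc
nonzero_positions = np.nonzero(df[column].to_numpy())[0]
shape1, location1, scale1 = stats.genpareto.fit(df[column].iloc[nonzero_positions])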

Printing Matrix of dataset into Table using Pandas with single column transposed

I'm using the Movie Lens Dataset in Python Pandas. I need to print the matrix of u.data, a tab-separated file, in the following way:
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating Rating
UserID2 Rating Rating Rating
I've already been through the following links:
One - the dataset is too huge to put it into a Series
Two - the transpose of a row is not mentioned
Three - tried with reindex so as to get NaN values in one column
Four - df.iloc and df.ix didn't work either
I need the output to show me the rating, and NaN when not rated, for movies with respect to users.
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating NaN
UserID2 Rating NaN Rating
P.S. I won't mind having solutions with numpy, crab, recsys, csv or any other python package
EDIT 1 - Sorted the data and exported but got an additional field
df2 = df.sort_values(['UserID','MovieID'])
print type(df2)
df2.to_csv("sorted.csv")
print df2
This produces the following sorted.csv file:
,UserID,MovieID,Rating,TimeStamp
32236,1,1,5,874965758
23171,1,2,3,876893171
83307,1,3,4,878542960
62631,1,4,3,876893119
47638,1,5,3,889751712
5533,1,6,5,887431973
70539,1,7,4,875071561
31650,1,8,1,875072484
20175,1,9,5,878543541
13542,1,10,3,875693118
EDIT 2 - As asked in Comments
Here's the format of the data in the u.data file, which acts as input:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
One method:
Use pivot_table; if there is only one value per user and movie id then the aggfunc doesn't matter, but if there are multiple values, choose your aggregation.
df.pivot_table(values='Rating',index='UserID',columns='MovieID', aggfunc='mean')
Second method (no duplicate userid, movieid records):
df.set_index(['UserID','MovieID'])['Rating'].unstack()
Third method (no duplicate userid, movieid records):
df.pivot(index='UserID',columns='MovieID',values='Rating')
Fourth method (like the first you can choose your aggregation method):
df.groupby(['UserID','MovieID'])['Rating'].mean().unstack()
Output:
MovieID 1 2 3 4 5 6 7 8 9 10
UserID
1 5 3 4 3 3 5 4 1 5 3
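A minimal end-to-end sketch, assuming u.data is tab-separated with no header row and that the four columns are UserID, MovieID, Rating, TimeStamp as in EDIT 1 (the column names are an assumption):
import pandas as pd

# u.data: tab-separated, no header row; column names assumed
df = pd.read_csv('u.data', sep='\t',
                 names=['UserID', 'MovieID', 'Rating', 'TimeStamp'])

# One row per user, one column per movie, NaN where a movie was not rated
ratings_matrix = df.pivot_table(values='Rating', index='UserID',
                                columns='MovieID', aggfunc='mean')
print(ratings_matrix.head())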
