Ranking columns by a single integer - python

Hi, I am basically trying to rank a column of a dataframe into ranking positions.
It looks something like the table below, and I am trying to create the ranking column. Persons with the same number of fruits sold should have the same ranking, so that when I sort by ranking there are no decimals. Can anyone advise me?
person | number of fruits sold | ranking
A | 5 | 2
B | 6 | 1
C | 2 | 4
D | 5 | 2
E | 3 | 3

You can use pd.factorize. A few tricks here: take care to negate your series, specify sort=True, and add 1 for your desired result.
df['ranking'] = pd.factorize(-df['number of fruits sold'], sort=True)[0] + 1
Result:
person number of fruits sold ranking
0 A 5 2
1 B 6 1
2 C 2 4
3 D 5 2
4 E 3 3
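For reference, pd.factorize returns a tuple of integer codes and the unique values; with sort=True the codes follow the sorted order of the (negated) uniques, so the largest sales figure gets code 0 and adding 1 turns the codes into 1-based ranks. A quick check with the sample frame above:
codes, uniques = pd.factorize(-df['number of fruits sold'], sort=True)
print(list(uniques))  # [-6, -5, -3, -2] -> largest sales first after negation
print(list(codes))    # [1, 0, 3, 1, 2]  -> adding 1 gives the rankings 2, 1, 4, 2, 3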

Use Series.rank:
df['ranking'] = df['number of fruits sold'].rank(method='dense', ascending=False).astype(int)
print (df)
person number of fruits sold ranking
0 A 5 2
1 B 6 1
2 C 2 4
3 D 5 2
4 E 3 3
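The method='dense' argument is what avoids the decimals the question mentions: the default method='average' splits tied ranks (both persons selling 5 would get 2.5), while 'dense' gives ties the same integer rank without leaving gaps. A small comparison, assuming the same sample frame:
s = df['number of fruits sold']
print(s.rank(ascending=False).tolist())                  # [2.5, 1.0, 5.0, 2.5, 4.0] default 'average'
print(s.rank(method='min', ascending=False).tolist())    # [2.0, 1.0, 5.0, 2.0, 4.0] ties share the lower rank, gap after
print(s.rank(method='dense', ascending=False).tolist())  # [2.0, 1.0, 4.0, 2.0, 3.0] matches the desired output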

Related

Pandas, Python - merging columns with same key, but with different values

From my for-loop, the resulting lists are as follows:
# The lists below are plain Python lists and already ordered/structured.
key=[1234,2345,2223,6578,9976]
index0=[1,4,6,3,4,5,6,2,1]
index1=[4,3,2,1,6,8,5,3,1]
index2=[9,4,6,4,3,2,1,4,1]
How do I merge them all into a table by pandas? Below is the expectation.
key | index0 | index1 | index2
1234 | 1 | 4 | 9
2345 | 4 | 3 | 4
... | ... | ... | ...
9976 | 1 | 1 | 1
I tried using pandas, but ran into an error about the data type. I then set the dtype to int64 and int32, but hit the same data type error again.
And as an optional question: should I have approached assembling a table from list data like this with SQL instead? I am just learning SQL with MySQL and wonder whether it would be more convenient than pandas for record keeping and persistent storage.
Just use a dict and pass it to pd.DataFrame:
dct = {
    'key': pd.Series(key),
    'index0': pd.Series(index0),
    'index1': pd.Series(index1),
    'index2': pd.Series(index2),
}
df = pd.DataFrame(dct)
Output:
>>> df
key index0 index1 index2
0 1234.0 1 4 9
1 2345.0 4 3 4
2 2223.0 6 2 6
3 6578.0 3 1 4
4 9976.0 4 6 3
5 NaN 5 8 2
6 NaN 6 5 1
7 NaN 2 3 4
8 NaN 1 1 1
Here is another way:
First load data into a dictionary:
d = dict(key=[1234,2345,2223,6578,9976],
         index0=[1,4,6,3,4,5,6,2,1],
         index1=[4,3,2,1,6,8,5,3,1],
         index2=[9,4,6,4,3,2,1,4,1])
Then convert to a df:
df = pd.DataFrame({i:pd.Series(j) for i,j in d.items()})
Output:
key index0 index1 index2
0 1234.0 1 4 9
1 2345.0 4 3 4
2 2223.0 6 2 6
3 6578.0 3 1 4
4 9976.0 4 6 3
5 NaN 5 8 2
6 NaN 6 5 1
7 NaN 2 3 4
8 NaN 1 1 1
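A note on the error the question mentions: passing the raw lists straight to pd.DataFrame raises a ValueError because the lists have different lengths (5 keys versus 9 index values), and wrapping each list in pd.Series pads the shorter ones with NaN instead. That padding is also why the key column comes out as float. If you want the keys to stay integers, one option (a sketch, assuming a pandas version with nullable integer dtypes, i.e. 1.0+) is:
df['key'] = df['key'].astype('Int64')  # nullable integer dtype keeps the missing keys as <NA> instead of forcing floats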

pandas groupby excluding when a column takes some value

Is there a way to exclude rows that take certain values when aggregating?
For example:
ID | Company | Cost
1 | Us | 2
1 | Them | 1
1 | Them | 1
2 | Us | 1
2 | Them | 2
2 | Them | 1
I would like to do a groupby and sum, but ignore the rows where Company="Us".
The result should be something like this:
ID | Sum of cost
1 | 2
2 | 3
I solved it by doing this, but I want to know if there's a smarter solution:
df_agg = df[df['Company']!="Us"][['ID','Cost']].groupby(['ID']).sum()
Use:
print (df)
ID Company Cost
0 1 Us 2
1 1 Them 1
2 1 Them 1
3 2 Us 1
4 2 Them 2
5 2 Them 1
6 3 Us 1 <- added a new row to show the difference
If you need to filter first and the unmatched groups (if any exist) are not important, use:
df1 = df[df.Company!="Us"].groupby('ID', as_index=False).Cost.sum()
print (df1)
ID Cost
0 1 2
1 2 3
Or the same with DataFrame.query:
df1 = df.query('Company!="Us"').groupby('ID', as_index=False).Cost.sum()
print (df1)
ID Cost
0 1 2
1 2 3
If you need all ID groups, with Cost=0 for the Us rows, first set Cost to 0 and then aggregate:
df2 = (df.assign(Cost = df.Cost.where(df.Company!="Us", 0))
         .groupby('ID', as_index=False).Cost
         .sum())
print (df2)
ID Cost
0 1 2
1 2 3
2 3 0
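For completeness, a minimal reproducible sketch of the filter-first variant, assuming the sample data above (including the extra ID 3 row):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2, 3],
    'Company': ['Us', 'Them', 'Them', 'Us', 'Them', 'Them', 'Us'],
    'Cost': [2, 1, 1, 1, 2, 1, 1],
})

# Filter first: rows with Company == "Us" never reach the groupby,
# so ID 3 (which only has an "Us" row) drops out of the result entirely.
print(df[df.Company != 'Us'].groupby('ID', as_index=False).Cost.sum())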

Count occurrences of specific value in column based on categories of another column

I have a dataset that looks like this:
Categories | Clicks
1 | 1
1 | 3
1 | 2
2 | 2
2 | 1
2 | 1
2 | 2
3 | 1
3 | 2
3 | 3
4 | 2
4 | 1
And to make some bar plots I would like for it to look like this:
Categories | Clicks_count | Clicks_prob
1 | 1 | 33%
2 | 2 | 50%
3 | 1 | 33%
4 | 1 | 50%
So basically: group by Categories, with Clicks_count holding the number of times per category that Clicks takes the value 1, and Clicks_prob the probability of Clicks being 1 (i.e. the count of Clicks==1 divided by the number of observations in category i).
How could I do this? To get the second column I tried:
df.groupby("Categories")["Clicks"].count().reset_index()
but the result is:
Categories | Clicks
1 | 3
2 | 4
3 | 3
4 | 2
Try sum and mean on the condition Clicks==1. Since you're aggregating per category, pass that column to groupby:
df['Clicks'].eq(1).groupby(df['Categories']).agg(['sum','mean'])
Output:
sum mean
Categories
1 1 0.333333
2 2 0.500000
3 1 0.333333
4 1 0.500000
To match output's naming, use named aggregation:
df['Clicks'].eq(1).groupby(df['Categories']).agg(Click_counts='sum', Clicks_prob='mean')
Output:
Click_counts Clicks_prob
Categories
1 1 0.333333
2 2 0.500000
3 1 0.333333
4 1 0.500000
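If you also want the probability shown as a percentage string like in the desired output, one possibility (a sketch; rounding to whole percent is an assumption) is to format the mean afterwards:
out = (df['Clicks'].eq(1)
         .groupby(df['Categories'])
         .agg(Clicks_count='sum', Clicks_prob='mean')
         .reset_index())
out['Clicks_prob'] = (out['Clicks_prob'] * 100).round().astype(int).astype(str) + '%'
print(out)  # Clicks_prob becomes 33%, 50%, 33%, 50%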

Python Pandas: keep the first occurrence of a specific value and drop the rest of the rows with that value

I cannot figure out how to drop the rows that contain a specific value while keeping only its first occurrence.
I tried using drop_duplicates, but that deduplicates every value. I just want to drop the repeated rows for one specific value (within the same column).
Data is formatted like so:
Col_A | Col_B
5 | 1
5 | 2
1 | 3
5 | 4
1 | 5
5 | 6
I want it like (based on Col_A):
Col_A | Col_B
5 | 1
5 | 2
1 | 3
5 | 4
5 | 6
Use idxmax and check the index. This of course assumes your index is unique.
m = df.Col_A.eq(1) # replace 1 with your desired bad value
df.loc[~m | (df.index == m.idxmax())]
Col_A Col_B
0 5 1
1 5 2
2 1 3
3 5 4
5 5 6
Try this:
df1 = df.copy()
mask = df['Col_A'] == 5
# make the "good" 5s unique so drop_duplicates only collapses the repeated 1s
df1.loc[mask, 'Col_A'] = df1.loc[mask, 'Col_A'] + range(len(df1.loc[mask, 'Col_A']))
df1 = df1.drop_duplicates(subset='Col_A', keep='first')
# iloc works here because the index is the default RangeIndex
print(df.iloc[df1.index])
Output:
Col_A Col_B
0 5 1
1 5 2
2 1 3
3 5 4
5 5 6
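Another option (a sketch, assuming 1 is the value to deduplicate): combine a value mask with Series.duplicated, which marks everything after the first occurrence of each value:
bad = df['Col_A'].eq(1)
# keep a row unless it is a 1 and an earlier 1 already exists
print(df[~(bad & df['Col_A'].duplicated())])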

descriptive stats for two categorical variables (pandas)

I need to get the mean and median for frequencies between two categorical variables. E.g.:
Label | Letter | Num
Foo | A | 1
Foo | B | 2
Foo | C | 4
Bar | A | 2
Bar | G | 3
Bar | N | 1
Bar | P | 2
Cee | B | 1
Cee | B | 2
Cee | C | 4
Cee | D | 5
For instance, what's the mean and median number of letters per label? Here there are 11 cases across three possible labels (mean = 3.667) and the median is 4 (3 Foo, 4 Bar, 4 Cee). How can I calculate this in pandas? Is it possible with a groupby statement? My data set is much larger than this.
You need value_counts for one column, or groupby + size (or count if you need to omit NaNs):
a = df['Label'].value_counts()
print (a)
Cee 4
Bar 4
Foo 3
Name: Label, dtype: int64
#alternative
#a = df.groupby('Label').size()
print (a.mean())
3.6666666666666665
print (a.median())
4.0
For the frequencies per Label and Letter pair, group by both columns:
a = df.groupby(['Label','Letter']).size()
print (a)
Label Letter
Bar A 1
G 1
N 1
P 1
Cee B 2
C 1
D 1
Foo A 1
B 1
C 1
dtype: int64
print (a.mean())
1.1
print (a.median())
1.0
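If you only need the two statistics, they can also be taken in one chain with Series.agg (a small sketch, using the per-label counts from above):
print(df['Label'].value_counts().agg(['mean', 'median']))  # mean 3.666667, median 4.0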
