Python Pandas : Count the number of occurrences of a number - python

I've searched for a long time and I need your help; I'm a newbie with Python and the pandas library. I have a dataframe like this, loaded from a CSV file:
ball_1,ball_2,ball_3,ball_4,ball_5,ball_6,ball_7,extraball_1,extraball_2
10,32,25,5,8,19,21,3,4
43,12,8,19,4,37,12,1,5
12,16,43,19,4,28,40,2,4
ball_X is an int between 1 and 50 and extraball_X is an int between 1 and 9. I want to count how many times each number appears, in two other dataframes like this:
First DF, ball:
Number,Score
1,128
2,34
3,12
4,200
....
50,145
Second DF, extraball:
Number,Score
1,340
2,430
3,123
4,540
....
9,120
I have the algorithm in my head, but I'm too much of a pandas novice to translate it into code.
I hope it's clear enough and someone will be able to help me. Don't hesitate if you have questions.

groupby on columns with value_counts
def get_before_underscore(x):
    return x.split('_', 1)[0]

val_counts = {
    k: d.stack().value_counts()
    for k, d in df.groupby(get_before_underscore, axis=1)
}
print(val_counts['ball'])
12 3
19 3
4 2
8 2
43 2
32 1
5 1
10 1
37 1
40 1
16 1
21 1
25 1
28 1
dtype: int64
print(val_counts['extraball'])
4 2
1 1
2 1
3 1
5 1
dtype: int64
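A note for newer pandas: groupby(..., axis=1) is deprecated in recent releases. Here is a sketch of the same counts without it, selecting columns by a regex on their prefix (the column names and values are copied from the question):

```python
import pandas as pd

# Toy frame with the question's three CSV rows.
df = pd.DataFrame({
    'ball_1': [10, 43, 12], 'ball_2': [32, 12, 16], 'ball_3': [25, 8, 43],
    'ball_4': [5, 19, 19], 'ball_5': [8, 4, 4], 'ball_6': [19, 37, 28],
    'ball_7': [21, 12, 40],
    'extraball_1': [3, 1, 2], 'extraball_2': [4, 5, 4],
})

# filter(regex=...) keeps only the wanted columns ('^ball_' will not
# accidentally match 'extraball_1'); stack() flattens them into one
# Series so value_counts() can tally every draw at once.
ball_counts = df.filter(regex=r'^ball_').stack().value_counts()
extra_counts = df.filter(regex=r'^extraball_').stack().value_counts()
```

A further ball_counts.reindex(range(1, 51), fill_value=0) would turn the tally into the asker's full 1-50 Number/Score table.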

Related

Take a random sample from a dataframe making sure that I will keep at least one row for each column that has a value different from zero

So I have a dataframe that looks like this:
Player   Points  Assists  Rebounds  Steals  Blocks  Wins
Bryant       35        5         5       1       0     1
James        24       11         9       2       1     0
Durant       31        2        12       0       0     0
Curry        29        4         2       2       0     0
Harden       13       12         0       0       1     0
Doncic       12        5         3       0       0     1
Buttler      24        0         2       1       0     0
Paul          0       12         3       3       0     1
And I want to take a random sample from that dataframe, but in such a way that in the resulting sample, each column has at least one value different from 0. So, for example, if I decide to take a random sample of 3 players, those 3 players can't be James, Durant and Curry, since all three of them have zeros in the Wins column. They also couldn't be Bryant, Doncic and Paul, since they all have zero blocks.
How can I do this?
FYI: this dataframe is just a simplification; mine has many more rows and columns, hence I need a generic answer or method.
Thanks!
Try this. I took the liberty of adding a new player:
import pandas as pd
df = pd.read_csv('./data/players.csv')
_cols = list(df.columns)
_cols.remove('Player')
df['sum'] = df[_cols].sum(axis=1)
df
samples = 3
df[(df['sum']!=0)].sample(samples)
Unfortunately Marcello will never be sampled.
IIUC, you can try something like this:
def sample_df(df, n=3):
    while True:
        dfs = df.sample(n)
        # resample until no stat column sums to zero
        if not dfs.iloc[:, 1:].sum().eq(0).any():
            return dfs
sample_df(df)
Output:
Player Points Assists Rebounds Steals Blocks Wins
1 James 24 11 9 2 1 0
0 Bryant 35 5 5 1 0 1
2 Durant 31 2 12 0 0 0
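One caveat with the rejection loop: if no valid sample of size n exists, while True never terminates. Below is a sketch with a retry cap; the max_tries parameter is my own addition, not part of the original answer, and the frame is the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['Bryant', 'James', 'Durant', 'Curry',
               'Harden', 'Doncic', 'Buttler', 'Paul'],
    'Points':   [35, 24, 31, 29, 13, 12, 24, 0],
    'Assists':  [5, 11, 2, 4, 12, 5, 0, 12],
    'Rebounds': [5, 9, 12, 2, 0, 3, 2, 3],
    'Steals':   [1, 2, 0, 2, 0, 0, 1, 3],
    'Blocks':   [0, 1, 0, 0, 1, 0, 0, 0],
    'Wins':     [1, 0, 0, 0, 0, 1, 0, 1],
})

def sample_df(df, n=3, max_tries=1000):
    # Resample until every stat column has a nonzero sum,
    # but fail loudly instead of spinning forever.
    for _ in range(max_tries):
        dfs = df.sample(n)
        if not dfs.iloc[:, 1:].sum().eq(0).any():
            return dfs
    raise ValueError(f'no valid sample of size {n} found in {max_tries} tries')
```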

Pandas groupby results assign to a new column

Hi, I'm trying to create a new column in my dataframe, and I want the values to be based on a calculation: each Student's share of the Score within their Class. There are two different students with the same name in different classes, hence the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error: incompatible index of inserted column with frame index.
Can someone please help? Thanks.
The problem is that the result of a groupby aggregate is indexed by the unique values of the columns you grouped on, which doesn't match your dataframe's index, so pandas can't insert it as a column. This instead builds a new dataframe with each student's share score:
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
If I understood you correctly, given an example df like the following:
Class Student Score
1 1 1 99
2 1 2 60
3 1 3 90
4 1 4 50
5 2 1 93
6 2 2 93
7 2 3 67
8 2 4 58
9 3 1 54
10 3 2 29
11 3 3 34
12 3 4 46
Do you need the following result?
Class Student Score Score_Share
1 1 1 99 0.331104
2 1 2 60 0.200669
3 1 3 90 0.301003
4 1 4 50 0.167224
5 2 1 93 0.299035
6 2 2 93 0.299035
7 2 3 67 0.215434
8 2 4 58 0.186495
9 3 1 54 0.331288
10 3 2 29 0.177914
11 3 3 34 0.208589
12 3 4 46 0.282209
If so, that can be achieved straightforwardly with:
df['Score_Share'] = df.groupby('Class')['Score'].apply(lambda x: x / x.sum())
You can apply operations within each group's scope like that.
PS: I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. I can't comment because I'm a newbie here :)
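For what it's worth, the same result can be had without apply, using transform('sum'), which broadcasts each class total back onto the original index. The numbers below are a made-up toy frame for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Class':   [1, 1, 2, 2],
    'Student': [1, 2, 1, 2],
    'Score':   [60, 40, 30, 70],
})

# transform returns a Series aligned with df's index, so the division
# needs no merging or index resets.
df['Score_Share'] = df['Score'] / df.groupby('Class')['Score'].transform('sum')
```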

use groupby and custom agg in a dataframe pandas

I have this dataframe :
id start end
1 1 2
1 13 27
1 30 35
1 36 40
2 2 5
2 8 10
2 25 30
I want to group by id and merge consecutive rows where the difference between the end of row n-1 and the start of row n is less than 10, for example. I already found a way using a loop, but it's far too slow with over a million rows.
So the expected outcome would be :
id start end
1 1 2
1 13 40
2 2 10
2 25 30
First, I can get the required difference using df['diff'] = df['start'].shift(-1) - df['end']. How can I gather the rows based on this condition within each id?
Thanks !
I believe you can create the groups by subtracting the per-id shifted end (DataFrameGroupBy.shift) from start, testing whether the gap is greater than 10, taking the cumulative sum, and passing the result to GroupBy.agg:
g = df['start'].sub(df.groupby('id')['end'].shift()).gt(10).cumsum()
df = (df.groupby(['id', g])
        .agg({'start': 'first', 'end': 'last'})
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
id start end
0 1 1 2
1 1 13 40
2 2 2 10
3 2 25 30
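To make the grouper easier to follow, here is the same recipe as a self-contained sketch on the question's data, with the intermediate gap spelled out:

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 2, 2, 2],
    'start': [1, 13, 30, 36, 2, 8, 25],
    'end':   [2, 27, 35, 40, 5, 10, 30],
})

# Gap between this row's start and the previous row's end, per id.
gap = df['start'].sub(df.groupby('id')['end'].shift())

# A new group starts wherever the gap exceeds 10; NaN > 10 is False,
# so the first row of each id never opens an extra group on its own.
g = gap.gt(10).cumsum()

out = (df.groupby(['id', g])
         .agg({'start': 'first', 'end': 'last'})
         .reset_index(level=1, drop=True)
         .reset_index())
```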

Pandas Count values across rows that are greater than another value in a different column

I have a pandas dataframe like this:
X a b c
1 1 0 2
5 4 7 3
6 7 8 9
I want to print a column called 'count' which outputs the number of values greater than the value in the first column('x' in my case). The output should look like:
X a b c Count
1 1 0 2 2
5 4 7 3 1
6 7 8 9 3
I would like to refrain from using lambda functions, for loops, or any other kind of looping technique, since my dataframe has a large number of rows. I tried something like this but couldn't get what I wanted:
df['count'] = df[df.iloc[:, 1:] > df.iloc[:, 0]].count(axis=1)
I also tried numpy.where() but didn't have any luck with that either, so any help will be appreciated. I also have NaN in my dataframe, and I'd like those ignored when counting the values.
Thanks for your help in advance!
You can use ge (>=) with sum:
df.iloc[:,1:].ge(df.iloc[:,0],axis = 0).sum(axis = 1)
Out[784]:
0 2
1 1
2 3
dtype: int64
Then assign it back:
df['Count'] = df.iloc[:, 1:].ge(df.iloc[:, 0], axis=0).sum(axis=1)
df
Out[786]:
X a b c Count
0 1 1 0 2 2
1 5 4 7 3 1
2 6 7 8 9 3
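Note that the question says "greater than" while the expected output actually matches >= (for the first row, only c=2 is strictly greater than X=1, yet the wanted count is 2). If a strict comparison is really intended, the same pattern works with gt, and NaNs compare as False on either side, so they are ignored as the asker requested:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 5, 6],
                   'a': [1, 4, 7],
                   'b': [0, 7, 8],
                   'c': [2, 3, 9]})

# gt is strictly 'greater than'; ties such as a=1 vs X=1 no longer count.
df['Count'] = df.iloc[:, 1:].gt(df.iloc[:, 0], axis=0).sum(axis=1)
```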
In case anyone needs such a solution: you can just add the outputs you get from .le and .ge in one line. Thanks to @Wen for the answer to my question, though!
df['count'] = (df.iloc[:, 2:5].le(df.iloc[:, 0], axis=0).sum(axis=1)
               + df.iloc[:, 2:5].ge(df.iloc[:, 1], axis=0).sum(axis=1))

Python random sampling in multiple indices

I have a data frame according to below:
id_1 id_2 value
1 0 1
1 1 2
1 2 3
2 0 4
2 1 1
3 0 5
3 1 1
4 0 5
4 1 1
4 2 6
4 3 7
11 0 8
11 1 14
13 0 10
13 1 9
I would like to take a random sample of size n, without replacement, from this table, based on id_1. Each sampled row needs to be unique with respect to the id_1 column, i.e. each id_1 can only occur once.
End result something like:
id_1 id_2 value
1 1 2
2 0 4
4 3 7
13 0 10
I have tried to do a groupby and use the indices to take out a row through random.sample, but it doesn't go all the way.
Can someone give me a pointer on how to make this work? Code for the DF below!
As always, thanks for your time and input!
/swepab
df = pd.DataFrame({'id_1': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 11, 11, 13, 13],
                   'id_2': [0, 1, 2, 0, 1, 0, 1, 0, 1, 2, 3, 0, 1, 0, 1],
                   'value_col': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8, 14, 10, 9]})
You can do this using vectorized functions (no loops):
import numpy as np
uniqued = df.id_1.reindex(np.random.permutation(df.index)).drop_duplicates()
df.loc[np.random.choice(uniqued.index, 1, replace=False)]
uniqued is created by a random shuffle that keeps one row per unique id_1. Then a random sample (without replacement) is drawn from its index.
This samples one random row per id:
for id_ in sorted(set(df["id_1"])):
    print(df[df["id_1"] == id_].sample(1))
PS: the above solution translated into a Python list comprehension, returning a list of indices:
idx = [df[df["id_1"] == val].sample(1).index[0] for val in sorted(set(df["id_1"]))]
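On pandas 1.1 or newer there is a built-in for exactly this, GroupBy.sample, which draws per group without any explicit loop:

```python
import pandas as pd

df = pd.DataFrame({'id_1': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 11, 11, 13, 13],
                   'id_2': [0, 1, 2, 0, 1, 0, 1, 0, 1, 2, 3, 0, 1, 0, 1],
                   'value_col': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8, 14, 10, 9]})

# One random row per unique id_1.
sampled = df.groupby('id_1').sample(n=1)
```

To then keep only n of the ids rather than all of them, a further .sample(n) on the result would do.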
