How to find the occurrence of a cell in a sequence (pandas) - python

I have a column with names.
df = pd.DataFrame({"Names":['Bob','Rob','John','Bob','Jacob']})
I want to increment the occurrence number by 1 if the name is repeated. How to do that in pandas?
I want the output like below
Names Occurance
0 Bob 1
1 Rob 1
2 John 1
3 Bob 2
4 Jacob 1

Use GroupBy.cumcount and add 1:
df['Occurance'] = df.groupby('Names').cumcount() + 1
print (df)
Names Occurance
0 Bob 1
1 Rob 1
2 John 1
3 Bob 2
4 Jacob 1
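As a detail, cumcount on its own is zero-based; a minimal runnable sketch showing the intermediate values:

```python
import pandas as pd

df = pd.DataFrame({"Names": ['Bob', 'Rob', 'John', 'Bob', 'Jacob']})

# cumcount numbers each occurrence within its group starting at 0
counts = df.groupby('Names').cumcount()

# adding 1 turns it into a 1-based occurrence number
df['Occurance'] = counts + 1
```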

Related

subtracting rows using diff()

I have a dataframe similar to this:
name group val
cici a 3
john b 2
john a 1
john c 5
ian a 2
ian a 3
I am trying to 1) group by name 2) calculate the difference among vals.
the returned column should be:
name group val delta
cici a 3 0
john b 2 0
john a 1 -1
john c 5 3
ian a 2 0
ian a 3 1
I used diff() to calculate this. However, for john I am trying to get b-b, a-b, c-b, but with diff() I get b-b, a-b, c-a... Is there any way I could use diff to compute each row's difference from the first row in its group?
my code:
df.groupby('name')['val'].transform('diff')
Anyway to fix this?
You do not need diff here; subtract each group's first value instead:
df['dif'] = df['val'] - \
df.groupby('name')['val'].transform('first')
df
Out[222]:
name group val dif
0 cici a 3 0
1 john b 2 0
2 john a 1 -1
3 john c 5 3
4 ian a 2 0
5 ian a 3 1
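As a detail, transform('first') broadcasts each group's first value back onto every row; a runnable sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['cici', 'john', 'john', 'john', 'ian', 'ian'],
                   'group': ['a', 'b', 'a', 'c', 'a', 'a'],
                   'val': [3, 2, 1, 5, 2, 3]})

# broadcast the first val of each name back to all of that name's rows
first = df.groupby('name')['val'].transform('first')

# difference of each row from the first row of its group
df['dif'] = df['val'] - first
```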

Add or subtract values in a dataframe on the basis of a column?

I have a df with three columns: name, amount and type.
I'm trying to add or subtract amounts per user on the basis of type.
Here's my sample df
name amount type
0 John 10 ADD
1 John 20 ADD
2 John 50 ADD
3 John 50 SUBRACT
4 Adam 15 ADD
5 Adam 25 ADD
6 Adam 5 ADD
7 Adam 30 SUBRACT
8 Mary 100 ADD
My resultant df
name amount
0 John 30
1 Adam 15
2 Mary 100
The idea is to multiply by 1 for ADD and by -1 for SUBRACT, then aggregate with sum:
df1 = (df['amount'].mul(df['type'].map({'ADD':1, 'SUBRACT':-1}))
.groupby(df['name'], sort=False)
.sum()
.reset_index(name='amount'))
print (df1)
name amount
0 John 30
1 Adam 15
2 Mary 100
Detail:
print (df['type'].map({'ADD':1, 'SUBRACT':-1}))
0 1
1 1
2 1
3 -1
4 1
5 1
6 1
7 -1
8 1
Name: type, dtype: int64
It is also possible to flag only the SUBRACT rows with numpy.where, multiplying those by -1 and all others by 1:
import numpy as np

df1 = (df['amount'].mul(np.where(df['type'].eq('SUBRACT'), -1, 1))
.groupby(df['name'], sort=False)
.sum()
.reset_index(name='amount'))
print (df1)
name amount
0 John 30
1 Adam 15
2 Mary 100
One idea could be to use Series.where to change the sign of amount accordingly and then groupby.sum:
df.amount.where(df.type.eq('ADD'), -df.amount).groupby(df.name).sum().reset_index()
name amount
0 Adam 15
1 John 30
2 Mary 100
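For reference, the map-and-sum idea end to end as a runnable sketch (data from the question, including its SUBRACT spelling):

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'John', 'John', 'John',
                            'Adam', 'Adam', 'Adam', 'Adam', 'Mary'],
                   'amount': [10, 20, 50, 50, 15, 25, 5, 30, 100],
                   'type': ['ADD', 'ADD', 'ADD', 'SUBRACT',
                            'ADD', 'ADD', 'ADD', 'SUBRACT', 'ADD']})

# map each type to a sign and apply it to the amounts
signed = df['amount'].mul(df['type'].map({'ADD': 1, 'SUBRACT': -1}))

# sum the signed amounts per name, keeping the original name order
df1 = signed.groupby(df['name'], sort=False).sum().reset_index(name='amount')
```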

Disproportionate stratified sampling in Pandas

How can I randomly select one row from each group (column Name) in the following dataframe:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
Expected result:
Distance Name Time Order
4 31 John 9 1
0 23 Kate 3 0
2 32 Peter 2 0
You can use a groupby on the Name column and apply sample:
df.groupby('Name',as_index=False).apply(lambda x:x.sample()).reset_index(drop=True)
Distance Name Time Order
0 31 John 9 1
1 15 Kate 7 1
2 32 Peter 2 0
You can shuffle all rows using, for example, numpy's random.permutation, then group by Name and take the first row from each group:
import numpy as np

df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)
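On pandas 1.1 or newer, GroupBy.sample does this in one step; a sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Distance': [16, 31, 23, 15, 32, 26],
                   'Name': ['John', 'John', 'Kate', 'Kate', 'Peter', 'Peter'],
                   'Time': [5, 9, 3, 7, 2, 4],
                   'Order': [0, 1, 0, 1, 0, 1]},
                  index=[1, 4, 0, 3, 2, 5])

# pandas >= 1.1: draw one random row from each Name group
picked = df.groupby('Name').sample(n=1)
```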
You can achieve that by shuffling the dataframe and then keeping one row per name.
Shuffle the dataframe:
df.sample(frac=1)
And then drop duplicated rows:
df.drop_duplicates(subset=['Name'])
df.drop_duplicates(subset='Name')
Distance Name Time Order
1 16 John 5 0
0 23 Kate 3 0
2 32 Peter 2 0
This keeps the first row for each name, so on its own it is not a random choice; shuffle first if you need randomness.
How about using the random module? Like this:
Import your provided data,
df=pd.read_csv('random_data.csv', header=0)
which looks like this,
Distance Name Time Order
1 16 John 5 0
4 3 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
then get a random column name,
import random

colname = df.columns[random.randint(1, 3)]
and in this run it happened to select 'Name':
print(df[colname])
1 John
4 John
0 Kate
3 Kate
Name: Name, dtype: object
Of course I could have condensed this to,
print(df[df.columns[random.randint(1, 3)]])
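For the question as asked (one random row per Name), the stdlib random module can also pick a position within each group; a small sketch using GroupBy.indices:

```python
import random
import pandas as pd

df = pd.DataFrame({'Distance': [16, 31, 23, 15, 32, 26],
                   'Name': ['John', 'John', 'Kate', 'Kate', 'Peter', 'Peter'],
                   'Time': [5, 9, 3, 7, 2, 4],
                   'Order': [0, 1, 0, 1, 0, 1]})

# GroupBy.indices maps each Name to the positional indices of its rows;
# pick one position at random per group
rows = [random.choice(idx) for idx in df.groupby('Name').indices.values()]
result = df.iloc[sorted(rows)]
```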

pandas groupby rank removes index, returns all 1s

My dataframe looks like this:
name1 name2 value
1 Jane Foo 2
2 Jane Bar 4
3 John Foo 7
4 John Bar 1
If I do df.groupby(['name1', 'name2']).count() I get:
value
name1 name2
Jane Foo 1
Jane Bar 1
John Foo 1
John Bar 1
But I'm trying to find the rank of each value within each multiindex group. Ideally, if I use df.groupby(['name1', 'name2']).rank() I should get:
value
name1 name2
Jane Foo 2
Jane Bar 1
John Foo 1
John Bar 2
But instead I simply get:
value
1 1
2 1
3 1
4 1
with the names of the grouped columns removed, only the index numbers as the index, and the rank value for all rows equaling 1. What am I doing wrong?
I think you need to work with the numeric column - it seems you need to group by the first column, name1, and rank value within each group:
df['rank'] = df.groupby('name1')['value'].rank(method='dense', ascending=False).astype(int)
print (df)
name1 name2 value rank
1 Jane Foo 2 2
2 Jane Bar 4 1
3 John Foo 7 1
4 John Bar 1 2
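A runnable sketch of that fix with the question's data; since each (name1, name2) pair occurs only once, ranking inside those groups can only return 1, whereas ranking value within name1 compares each person's rows:

```python
import pandas as pd

df = pd.DataFrame({'name1': ['Jane', 'Jane', 'John', 'John'],
                   'name2': ['Foo', 'Bar', 'Foo', 'Bar'],
                   'value': [2, 4, 7, 1]},
                  index=[1, 2, 3, 4])

# every (name1, name2) group holds a single row, so its rank is always 1
single = df.groupby(['name1', 'name2'])['value'].rank()

# rank within name1 alone: dense, highest value first
df['rank'] = (df.groupby('name1')['value']
                .rank(method='dense', ascending=False)
                .astype(int))
```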

groupby a column and count items above 5 in another pandas

So I have a df like this:
NAME TRY SCORE
Bob 1st 3
Sue 1st 7
Tom 1st 3
Max 1st 8
Jay 1st 4
Mel 1st 7
Bob 2nd 4
Sue 2nd 2
Tom 2nd 6
Max 2nd 4
Jay 2nd 7
Mel 2nd 8
Bob 3rd 3
Sue 3rd 5
Tom 3rd 6
Max 3rd 3
Jay 3rd 4
Mel 3rd 6
I want to count how many times each person scores more than 5,
into a new df2 that looks like this:
NAME COUNT
Bob 0
Sue 1
Tom 2
Max 1
Jay 1
Mel 3
My attempts have been many - here is the latest
df2 = df.groupby('NAME')[['SCORE'] > 5].count().reset_index(name="count")
Just use groupby and sum:
df.assign(SCORE=df.SCORE.gt(5)).groupby('NAME')['SCORE'].sum().astype(int).reset_index()
Out[524]:
NAME SCORE
0 Bob 0
1 Jay 1
2 Max 1
3 Mel 3
4 Sue 1
5 Tom 2
Or use set_index with a level-wise sum (note: sum(level=0) is deprecated in newer pandas; prefer .groupby(level=0).sum()):
df.set_index('NAME').SCORE.gt(5).sum(level=0).astype(int)
First create a boolean mask and then aggregate by sum - True values are treated as 1:
df2 = (df['SCORE'] > 5).groupby(df['NAME']).sum().astype(int).reset_index(name="count")
print (df2)
NAME count
0 Bob 0
1 Jay 1
2 Max 1
3 Mel 3
4 Sue 1
5 Tom 2
Detail:
print (df['SCORE'] > 5)
0 False
1 True
2 False
3 True
4 False
5 True
6 False
7 False
8 True
9 False
10 True
11 True
12 False
13 False
14 True
15 False
16 False
17 True
Name: SCORE, dtype: bool
One way to do this is to write a custom aggregation function that takes the scores of each group and counts those greater than 5, like this:
df.groupby('NAME')['SCORE'].agg(lambda x: (x > 5).sum())
NAME
Bob 0
Jay 1
Max 1
Mel 3
Sue 1
Tom 2
Name: SCORE, dtype: int64
If you want counts as a dictionary, you can use collections.Counter:
from collections import Counter
c = Counter(df.loc[df['SCORE'] > 5, 'NAME'])
For a dataframe you can map counts from unique names:
res = pd.DataFrame({'NAME': df['NAME'].unique(), 'COUNT': 0})
res['COUNT'] = res['NAME'].map(c).fillna(0).astype(int)
print(res)
NAME COUNT
0 Bob 0
1 Sue 1
2 Tom 2
3 Max 1
4 Jay 1
5 Mel 3
Filter the dataframe first, then groupby with size, and reindex to fill in names with no qualifying rows as 0:
df[df['SCORE'] > 5].groupby('NAME')['SCORE'].size()\
.reindex(df['NAME'].unique(), fill_value=0)
Output:
NAME
Bob 0
Sue 1
Tom 2
Max 1
Jay 1
Mel 3
Name: SCORE, dtype: int64
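End to end, the filter-and-reindex approach as a runnable sketch (data from the question):

```python
import pandas as pd

df = pd.DataFrame({'NAME': ['Bob', 'Sue', 'Tom', 'Max', 'Jay', 'Mel'] * 3,
                   'TRY': ['1st'] * 6 + ['2nd'] * 6 + ['3rd'] * 6,
                   'SCORE': [3, 7, 3, 8, 4, 7,
                             4, 2, 6, 4, 7, 8,
                             3, 5, 6, 3, 4, 6]})

# keep only rows above 5, count them per NAME,
# then reindex so names with no qualifying rows get 0
counts = (df[df['SCORE'] > 5].groupby('NAME')['SCORE'].size()
            .reindex(df['NAME'].unique(), fill_value=0))
```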
