Disproportionate stratified sampling in Pandas - python

How can I randomly select one row from each group (column Name) in the following dataframe:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
Expected result:
Distance Name Time Order
4 31 John 9 1
0 23 Kate 3 0
2 32 Peter 2 0

You can use a groupby on the Name column and apply sample:
df.groupby('Name',as_index=False).apply(lambda x:x.sample()).reset_index(drop=True)
Distance Name Time Order
0 31 John 9 1
1 15 Kate 7 1
2 32 Peter 2 0
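As a side note, on pandas 1.1 or newer the sampling can be done directly on the groupby object, without apply. A minimal sketch:
df.groupby('Name').sample(n=1).reset_index(drop=True)  # GroupBy.sample, pandas >= 1.1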

You can shuffle all rows using, for example, the NumPy function random.permutation, then group by Name and take the first N rows of each group:
import numpy as np
df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)

You can get the distinct names using unique:
df['Name'].unique()
(Note this returns only the names themselves, not a random full row per group.)

Shuffle the dataframe:
df.sample(frac=1)
And then drop duplicated rows:
df.drop_duplicates(subset=['Name'])
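Since drop_duplicates keeps the first occurrence, shuffling first makes that kept row random, and the two steps chain into one expression:
df.sample(frac=1).drop_duplicates(subset=['Name']).reset_index(drop=True)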

df.drop_duplicates(subset='Name')
Distance Name Time Order
1 16 John 5 0
0 23 Kate 3 0
2 32 Peter 2 0
This should help, but note it is not a random choice: drop_duplicates keeps the first row for each name.

How about using the random module? Like this.
First import your provided data (note that random and pandas need to be imported):
import random
import pandas as pd
df = pd.read_csv('random_data.csv', header=0)
which looks like this,
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
then get a random column name:
colname = df.columns[random.randint(1, 3)]
In this run it happened to select 'Name':
print(df[colname])
1 John
4 John
0 Kate
3 Kate
Name: Name, dtype: object
Of course, I could have condensed this to:
print(df[df.columns[random.randint(1, 3)]])

Related

Adding value to each row

I have a pandas dataframe, and for each row I would like to add 5 to the value in the Num column. Meaning I want to take each original number and add 5 to it.
Dataframe:
import pandas as pd
info= {"Num":[12,14,13,12,14,13,15], "NAME":['John','Camili','Rheana','Joseph','Amanti','Alexa','Siri']}
data = pd.DataFrame(info)
print("Original Data frame:\n")
print(data)
Output:
Original Data frame:
Num NAME
0 12 John
1 14 Camili
2 13 Rheana
3 12 Joseph
4 14 Amanti
5 13 Alexa
6 15 Siri
Desired output:
Num NAME
0 17 John
1 19 Camili
2 18 Rheana
3 17 Joseph
4 19 Amanti
5 18 Alexa
6 20 Siri
Attempt to solve:
for i, e in enumerate(data['Num']):
    data.at[i, 'Num'] = +5
output:
data
Out[391]:
Num NAME
0 5 John
1 5 Camili
2 5 Rheana
3 5 Joseph
4 5 Amanti
5 5 Alexa
6 5 Siri
Would appreciate an example with a for loop
You simply need:
data['Num'] += 5
without a for-loop. (Your attempt assigns the literal value +5, a unary plus, to every row instead of adding 5.)
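For completeness, a tiny runnable sketch of the vectorized version (the sample values are abbreviated from the question):
import pandas as pd
data = pd.DataFrame({"Num": [12, 14, 13], "NAME": ['John', 'Camili', 'Rheana']})
data['Num'] += 5              # adds 5 to every row at once
print(data['Num'].tolist())   # [17, 19, 18]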
import pandas as pd
info= {"Num":[12,14,13,12,14,13,15], "NAME":['John','Camili','Rheana','Joseph','Amanti','Alexa','Siri']}
data = pd.DataFrame(info)
Answer:
for index in range(len(data)):
    # use .loc to write back directly; chained indexing such as
    # data['Num'].iloc[index] += 5 may not modify the frame
    data.loc[index, 'Num'] += 5
Output:
data
Out[617]:
Num NAME
0 17 John
1 19 Camili
2 18 Rheana
3 17 Joseph
4 19 Amanti
5 18 Alexa
6 20 Siri

Add or Subtract two columns in a dataframe on the basis of a column?

I have a df which has three columns: name, amount and type.
I'm trying to add or subtract values per user on the basis of type.
Here's my sample df
name amount type
0 John 10 ADD
1 John 20 ADD
2 John 50 ADD
3 John 50 SUBRACT
4 Adam 15 ADD
5 Adam 25 ADD
6 Adam 5 ADD
7 Adam 30 SUBRACT
8 Mary 100 ADD
My resultant df
name amount
0 John 30
1 Adam 15
2 Mary 100
The idea is to multiply the amount by 1 if type is ADD and by -1 if SUBRACT, and then aggregate with sum:
df1 = (df['amount'].mul(df['type'].map({'ADD':1, 'SUBRACT':-1}))
.groupby(df['name'], sort=False)
.sum()
.reset_index(name='amount'))
print (df1)
name amount
0 John 30
1 Adam 15
2 Mary 100
Detail:
print (df['type'].map({'ADD':1, 'SUBRACT':-1}))
0 1
1 1
2 1
3 -1
4 1
5 1
6 1
7 -1
8 1
Name: type, dtype: int64
It is also possible to flag only the SUBRACT rows with numpy.where, multiplying them by -1 and everything else by 1:
import numpy as np
df1 = (df['amount'].mul(np.where(df['type'].eq('SUBRACT'), -1, 1))
.groupby(df['name'], sort=False)
.sum()
.reset_index(name='amount'))
print (df1)
name amount
0 John 30
1 Adam 15
2 Mary 100
One idea could be to use Series.where to change the sign of amount accordingly and then groupby.sum:
df.amount.where(df.type.eq('ADD'), -df.amount).groupby(df.name).sum().reset_index()
name amount
0 Adam 15
1 John 30
2 Mary 100
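One more angle, not used in the answers above: aggregate ADD and SUBRACT separately with pivot_table and subtract. A sketch (note that pivot_table sorts names alphabetically, unlike the outputs above):
p = df.pivot_table(index='name', columns='type', values='amount',
                   aggfunc='sum', fill_value=0)
df1 = (p['ADD'] - p['SUBRACT']).reset_index(name='amount')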

Split a Pandas DataFrame where one factor column is evenly distributed among the splits

I'm trying to split a Pandas DataFrame into multiple separate DataFrames where one of the columns is evenly distributed among the resulting DataFrames. For example, the following DataFrame could be split into 3 distinct DataFrames, where each one contains one record of each sector (selected at random).
So a df that looks like this:
id Name Sector
1 John A
2 Steven A
3 Jane A
4 Kyle A
5 Ashley B
6 Ken B
7 Tom B
8 Peter B
9 Elaine C
10 Tom C
11 Adam C
12 Simon C
13 Stephanie D
14 Jan D
15 Marsha D
16 David D
17 Drew E
18 Kit E
19 Corey E
20 James E
Would yield two DataFrames, one of which could look like this, while the other consists of the remaining records.
id Name Sector
1 John A
2 Steven A
7 Tom B
8 Peter B
10 Tom C
11 Adam C
13 Stephanie D
16 David D
19 Corey E
20 James E
I know np.array_split(df, 2) will get me part way there, but it may not evenly distribute the sectors like I need.
(Edited for clarity)
Update per comments and updated question:
df_1=df.groupby('Sector', as_index=False, group_keys=False).apply(lambda x: x.sample(n=2))
df_2 = df[~df.index.isin(df_1.index)]
print(df_1)
id Name Sector
2 3 Jane A
3 4 Kyle A
7 8 Peter B
5 6 Ken B
11 12 Simon C
9 10 Tom C
12 13 Stephanie D
15 16 David D
19 20 James E
17 18 Kit E
print(df_2)
id Name Sector
0 1 John A
1 2 Steven A
4 5 Ashley B
6 7 Tom B
8 9 Elaine C
10 11 Adam C
13 14 Jan D
14 15 Marsha D
16 17 Drew E
18 19 Corey E
Here is a "funky" method, using sequential numbering and random sampling:
df['grp'] = df.groupby('Sector')['Sector']\
.transform(lambda x: x.notna().cumsum().sample(frac=1))
dd = dict(tuple(df.groupby('grp')))
Output:
dd[1]
id Name Sector grp
0 1 John A 1
4 5 Ken B 1
6 7 Elaine C 1
dd[2]
id Name Sector grp
2 3 Jane A 2
5 6 Tom B 2
7 8 Tom C 2
dd[3]
id Name Sector grp
1 2 Steven A 3
3 4 Ashley B 3
8 9 Adam C 3
Details:
Create a sequence of numbers in each sector group starting from 1, then randomize that numbering within the group to create a grouping key, grp.
Use grp to groupby, then build a dictionary with a key for each grp value.
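If the goal is k roughly even splits rather than this fixed dictionary, the same shuffle-and-number idea generalizes. A minimal sketch, using a helper name split_even of my own (not from the answers):
import pandas as pd

def split_even(frame, k, group_col='Sector', seed=None):
    # Shuffle all rows, then restore group order with a stable sort
    # so rows stay randomly ordered within each group.
    shuffled = frame.sample(frac=1, random_state=seed).sort_values(group_col, kind='mergesort')
    # Number rows 0..n-1 within each group and deal them round-robin into k folds.
    fold = shuffled.groupby(group_col).cumcount() % k
    return [shuffled[fold == i] for i in range(k)]

parts = split_even(df, 3, seed=0)  # three frames with sectors spread evenly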
Here's my way: you can groupby Sector and randomly select from each group in a loop, using the sample function:
for sector, group in df.groupby('Sector'):
    print(group.sample())
If you need multiple random selections, pass sample the number of items you want. For example:
for sector, group in df.groupby('Sector'):
    print(group.sample(2))
will return 2 random rows from each group.
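To keep the draws instead of just printing them, the per-group samples can be concatenated into one frame, for example:
import pandas as pd
one_per_sector = pd.concat([group.sample(1) for _, group in df.groupby('Sector')])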

How can we fill the empty values in the column?

I have table A with 3 columns. The column val has some empty values. The question is: is there any way to fill the empty values based on the previous value using Python? For example, Alex and John take value 20 and Sam takes value 100.
A = [ id name val
1 jack 10
2 mec 20
3 alex
4 john
5 sam 250
6 tom 100
7 sam
8 hellen 300]
You can read the data in as a pandas DataFrame and use the built-in function fillna() to solve your problem. For example:
df = # your data
df.fillna(method='pad')
Would return a dataframe like,
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
You can refer to the pandas fillna documentation for more information.
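One caveat worth noting: fillna(method='pad') is deprecated in recent pandas releases. The equivalent forward fill, here applied just to the val column, is:
df['val'] = df['val'].ffill()  # same effect as fillna(method='pad') on val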

groupby a column and count items above 5 in another column - pandas

So I have a df like this:
NAME TRY SCORE
Bob 1st 3
Sue 1st 7
Tom 1st 3
Max 1st 8
Jay 1st 4
Mel 1st 7
Bob 2nd 4
Sue 2nd 2
Tom 2nd 6
Max 2nd 4
Jay 2nd 7
Mel 2nd 8
Bob 3rd 3
Sue 3rd 5
Tom 3rd 6
Max 3rd 3
Jay 3rd 4
Mel 3rd 6
I want to count how many times each person scored more than 5, putting the result into a new df2 that looks like this:
NAME COUNT
Bob 0
Sue 1
Tom 2
Max 1
Jay 1
Mel 3
My attempts have been many - here is the latest
df2 = df.groupby('NAME')[['SCORE'] > 5].count().reset_index(name="count")
Just use groupby and sum:
df.assign(SCORE=df.SCORE.gt(5)).groupby('NAME')['SCORE'].sum().astype(int).reset_index()
Out[524]:
NAME SCORE
0 Bob 0
1 Jay 1
2 Max 1
3 Mel 3
4 Sue 1
5 Tom 2
Or use set_index with sum:
df.set_index('NAME').SCORE.gt(5).sum(level=0).astype(int)
(Note: Series.sum(level=0) is deprecated in newer pandas; use .groupby(level=0).sum() instead.)
First create a boolean mask and then aggregate by sum; True values are treated as 1:
df2 = (df['SCORE'] > 5).groupby(df['NAME']).sum().astype(int).reset_index(name="count")
print (df2)
NAME count
0 Bob 0
1 Jay 1
2 Max 1
3 Mel 3
4 Sue 1
5 Tom 2
Detail:
print (df['SCORE'] > 5)
0 False
1 True
2 False
3 True
4 False
5 True
6 False
7 False
8 True
9 False
10 True
11 True
12 False
13 False
14 True
15 False
16 False
17 True
Name: SCORE, dtype: bool
One way to do this is to write a custom groupby aggregation that takes the scores of each group and counts those greater than 5 by summing a boolean mask, like this:
df.groupby('NAME')['SCORE'].agg(lambda x: (x > 5).sum())
NAME
Bob 0
Jay 1
Max 1
Mel 3
Sue 1
Tom 2
Name: SCORE, dtype: int64
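To land exactly in the df2 shape from the question, the same aggregation can be finished with reset_index. A sketch:
df.groupby('NAME')['SCORE'].agg(lambda x: (x > 5).sum()).reset_index(name='COUNT')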
If you want counts as a dictionary, you can use collections.Counter:
from collections import Counter
c = Counter(df.loc[df['SCORE'] > 5, 'NAME'])
For a dataframe you can map counts from unique names:
res = pd.DataFrame({'NAME': df['NAME'].unique(), 'COUNT': 0})
res['COUNT'] = res['NAME'].map(c).fillna(0).astype(int)
print(res)
NAME COUNT
0 Bob 0
1 Sue 1
2 Tom 2
3 Max 1
4 Jay 1
5 Mel 3
Filter the dataframe first, then groupby with size, and reindex to fill the missing names with 0:
df[df['SCORE'] > 5].groupby('NAME')['SCORE'].size()\
.reindex(df['NAME'].unique(), fill_value=0)
Output:
NAME
Bob 0
Sue 1
Tom 2
Max 1
Jay 1
Mel 3
Name: SCORE, dtype: int64
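A pandas-only cousin of the Counter approach: filter, count the names with value_counts, and reindex in the original name order:
df.loc[df['SCORE'] > 5, 'NAME'].value_counts().reindex(df['NAME'].unique(), fill_value=0)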
