I have a dataframe with a data column, and a value column, as in the example below. The value column is always binary, 0 or 1:
data,value
173,1
1378,0
926,0
643,0
1279,0
472,0
706,0
1345,0
1167,1
1401,1
1236,0
447,1
1204,1
398,0
714,0
734,0
1732,0
98,0
1696,0
160,0
1611,0
274,1
562,0
625,0
1028,0
1766,0
511,0
1691,0
898,1
I need to sample the dataset so that I end up with an equal number of both values. So, if the 1 class is originally the smaller one, I'll need to use it as the reference; conversely, if the 0 class is smaller, I need to use that one.
Any clues on how to do this? I'm working in a Jupyter notebook with Python 3.6 (I cannot upgrade).
Sample data
data = [173,926,634,706,398]
value = [1,0,0,1,0]
df = pd.DataFrame({"data": data, "value": value})
print(df)
# data value
# 0 173 1
# 1 926 0
# 2 634 0
# 3 706 1
# 4 398 0
Filter to two DFs
ones = df[df['value'] == 1]
zeros = df[df['value'] == 0]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
# data value
# 1 926 0
# 2 634 0
# 4 398 0
Truncate as required
Find which group is smaller, then truncate the larger one to that length (take its first n rows):
if len(ones) <= len(zeros):
    zeros = zeros.iloc[:len(ones), :]
else:
    ones = ones.iloc[:len(zeros), :]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
#
#
# data value
# 1 926 0
# 2 634 0
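Note that iloc truncation always keeps the first n rows of the larger class. If you would rather drop rows at random, a small variation (a sketch using DataFrame.sample; the random_state is arbitrary and only there for reproducibility) is:
if len(ones) <= len(zeros):
    zeros = zeros.sample(n=len(ones), random_state=0)  # random rows instead of the first ones
else:
    ones = ones.sample(n=len(zeros), random_state=0)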
Group your dataframe by values, and then take a sample of the smallest count from each group.
grouped = df.groupby(['value'])
smallest = grouped.count().min().values
try:  # Pandas 1.1.0+
    print(grouped.sample(smallest))
except AttributeError:  # Pre-Pandas 1.1.0
    print(grouped.apply(lambda df: df.sample(smallest)))
Output:
data value
25 1766 0
3 643 0
10 1236 0
1 1378 0
14 714 0
6 706 0
24 1028 0
8 1167 1
9 1401 1
0 173 1
12 1204 1
11 447 1
28 898 1
21 274 1
This should do it.
df.groupby('value').sample(df.groupby('value').size().min())
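If you want the draw to be reproducible, groupby.sample also accepts random_state; a small sketch (the random_state value is arbitrary):
n_min = df.groupby('value').size().min()                       # size of the smaller class
balanced = df.groupby('value').sample(n_min, random_state=42)  # n_min rows from each class
print(balanced['value'].value_counts())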
I want to combine rows in a pandas df with the following logic:
the dataframe is grouped by user
rows are ordered by start_at_min
rows are combined when:
Case A:
if start_at_min <= 200:
combine when row2[start_at_min] - row1[stop_at_min] < 5
(e.g. 101 - 100 = 1 -> combine; 200 - 100 = 100 -> don't combine)
Case B: if 200 < start_at_min < 400:
change the threshold to 3
Case C: if start_at_min > 400:
never combine
Example df
user start_at_min stop_at_min
0 1 100 150
1 1 152 201 #row0 with row1 combine
2 1 205 260 #row1 with row 2 NO -> start_at_min above 200 -> threshold = 3
3 2 65 100 #no
4 2 200 265 #no
5 2 300 451 #no
6 2 452 460 #no -> start_at_min above 400-> never combine
Expected output:
user start_at_min stop_at_min
0 1 100 201 #row1 with row2 combine
2 1 205 260 #row2 with row 3 NO -> start_at_min above 200 -> threshold = 3
3 2 65 100 #no
4 2 200 265 #no
5 2 300 451 #no
6 2 452 460 #no -> start_at_min above 400-> never combine
I have written the function combine_rows, which takes in 2 Series and applies this logic:
def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min of row2 - stop_at_min of row1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min']
        })
    else:
        return pd.concat([s1, s2], axis=1).T
However, I am unable to apply this function to the dataframe.
This was my attempt:
df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows) # this not working
Here is the full code:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"user" : [1, 1, 2,2],
'start_at_min': [60, 101, 65, 200],
'stop_at_min' : [100, 135, 100, 265]
})
def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min of row2 - stop_at_min of row1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min']
        })
    else:
        return pd.concat([s1, s2], axis=1).T
df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows) # this not working
version 1: one condition
Perform a custom groupby.agg:
threshold = 5
# if the successive stop/start per group are above threshold
# start a new group
group = (df['start_at_min']
.sub(df.groupby('user')['stop_at_min'].shift())
.ge(threshold).cumsum()
)
# groupby.agg
out = (df.groupby(['user', group], as_index=False)
.agg({'start_at_min': 'min',
'stop_at_min': 'max'
})
)
Output:
user start_at_min stop_at_min
0 1 60 135
1 2 65 100
2 2 200 265
Intermediate:
(df['start_at_min']
.sub(df.groupby('user')['stop_at_min'].shift())
)
0 NaN
1 1.0 # below threshold, this will be merged
2 NaN
3 100.0 # above threshold, keep separate
dtype: float64
version 2: multiple conditions
# define variable threshold
threshold = np.where(df['start_at_min'].le(200), 5, 3)
# array([5, 5, 3, 5, 5, 3, 3])
# compute the new starts of group like in version 1
# but using the now variable threshold
m1 = (df['start_at_min']
.sub(df.groupby('user')['stop_at_min'].shift())
.ge(threshold)
)
# add a second restart condition (>400)
m2 = df['start_at_min'].gt(400)
# if either mask is True, start a new group
group = (m1|m2).cumsum()
# groupby.agg
out = (df.groupby(['user', group], as_index=False)
.agg({'start_at_min': 'min',
'stop_at_min': 'max'
})
)
Output:
user start_at_min stop_at_min
0 1 100 201
1 1 205 260
2 2 65 100
3 2 200 265
4 2 300 451
5 2 452 460
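For reference, a self-contained sketch of the version 2 pipeline, using the 7-row example df from the question (it should reproduce the output above):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'user':         [1, 1, 1, 2, 2, 2, 2],
    'start_at_min': [100, 152, 205, 65, 200, 300, 452],
    'stop_at_min':  [150, 201, 260, 100, 265, 451, 460],
})
threshold = np.where(df['start_at_min'].le(200), 5, 3)           # 5 up to 200, 3 above
gap = df['start_at_min'].sub(df.groupby('user')['stop_at_min'].shift())
m1 = gap.ge(threshold)                                           # gap at or above threshold -> new group
m2 = df['start_at_min'].gt(400)                                  # above 400 -> never combine
group = (m1 | m2).cumsum()
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min', 'stop_at_min': 'max'}))
print(out)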
def check_isnull(self):
    df = pd.read_csv(self.table_name)
    for j in df.values:
        for k in j[0:]:
            try:
                k = float(k)
                Flag = 1
            except ValueError:
                Flag = 0
                break
    if Flag == 1:
        QMessageBox.information(self, "Information",
                                "Dataset is ready to train.",
                                QMessageBox.Close | QMessageBox.Help)
    elif Flag == 0:
        QMessageBox.information(self, "Information",
                                "There are one or more non-integer values.",
                                QMessageBox.Close | QMessageBox.Help)
Greetings, here are only 40 rows of the dataset I am trying to train on. I want to replace the null or string values that exist here. My existing function for the replacement operation works without any problems. I also wanted to write a function that only detects them and reports the result. My function sometimes gives an error; where is the problem?
User ID Age EstimatedSalary Female Male Purchased
0 15624510 19.000000 19000 0 1 0
1 1581qqsdasd0944 35.000000 qweqwe 0 1 0
2 15668575 37.684211 43000 1 0 0
3 NaN 27.000000 57000 1 0 0
4 15804002 19.000000 69726.81704 0 1 0
.. ... ... ... ... ... ...
395 15691863 46.000000 41000 1 0 1
396 15706071 51.000000 23000 0 1 1
397 15654296 50.000000 20000 1 0 1
398 15755018 36.000000 33000 0 1 0
399 15594041 49.000000 36000 1 0 1
Try
pd.to_numeric(df['estimated'], errors='coerce')
then use this to get rid of those rows, and also of the rows that already had NaNs:
df.dropna(subset=['estimated'])
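Put together, that looks roughly like the sketch below; it assumes the columns are called EstimatedSalary and User ID as in the question's sample, and that the coerced columns are written back before dropping:
df['EstimatedSalary'] = pd.to_numeric(df['EstimatedSalary'], errors='coerce')  # strings become NaN
df['User ID'] = pd.to_numeric(df['User ID'], errors='coerce')                  # same for the ID column
df = df.dropna(subset=['EstimatedSalary', 'User ID'])                          # drop rows with NaN in either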
This should do the job:
mask = df.apply(lambda col: col.astype(str).str.replace('.', '', n=1, regex=False).str.isdigit(), axis=0)
This first removes the decimal point in float numbers, then checks whether every remaining character of a value is a digit. It returns a dataframe where any non-numeric value is False and any numeric value is True.
If you need to delete rows that have any of these values you can use:
df = df[mask.all(axis=1)]
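For example, on a tiny frame with one bad cell (a sketch; the column names and values are placeholders):
import pandas as pd
df = pd.DataFrame({'age': ['19.0', '35.0', '37.5'],
                   'salary': ['19000', 'qweqwe', '43000']})
mask = df.apply(lambda col: col.astype(str).str.replace('.', '', n=1, regex=False).str.isdigit(), axis=0)
print(df[mask.all(axis=1)])  # keeps only the rows in which every cell is numeric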
def check_isnull(self):
    df = pd.read_csv(self.table_name)
    Flag = 1
    for j in df.values:
        for k in j[0:]:
            if k == "nan" or k == "NaN" or k == "NAN" or type(k) == str:
                Flag = 0
I created a structure that iterates over all of the data. Run times are comparable.
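Two caveats about that check: the three string comparisons are already covered by type(k) == str, and genuinely missing values parsed by read_csv arrive as float('nan'), which none of the conditions catch. A minimal sketch that keeps the same loop but also flags real NaN values (the csv path is a placeholder):
import pandas as pd
df = pd.read_csv("dataset.csv")  # placeholder path
Flag = 1
for j in df.values:
    for k in j:
        if isinstance(k, str) or pd.isna(k):  # leftover string or missing value
            Flag = 0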
I am trying to create 10 different subsets of 5 Members without replacement from this data (in Python):
Member CIN Needs Assessment Network Enrolled
117 CS38976K 1 1
118 GN31829N 1 1
119 GD98216H 1 1
120 VJ71307A 1 1
121 OX22563R 1 1
122 YW35494W 1 1
123 QX20765B 1 1
124 NO50548K 1 1
125 VX90647K 1 1
126 RG21661H 1 1
127 IT17216C 1 1
128 LD81088I 1 1
129 UZ49716O 1 1
130 UA16736M 1 1
131 GN07797S 1 1
132 TN64827F 1 1
133 MZ23779M 1 1
134 UG76487P 1 1
135 CY90885V 1 1
136 NZ74233H 1 1
137 CB59280X 1 1
138 LI89002Q 1 1
139 LO64230I 1 1
140 NY27508Q 1 1
141 GU30027P 1 1
142 XJ75065T 1 1
143 OW40240P 1 1
144 JQ23187C 1 1
145 PQ45586F 1 1
146 IM59460P 1 1
147 OU17576V 1 1
148 KL75129O 1 1
149 XI38543M 1 1
150 PO09602E 1 1
151 PS27561N 1 1
152 PC63391R 1 1
153 WR70847S 1 1
154 XL19132L 1 1
155 ZX27683R 1 1
156 MZ63663M 1 1
157 FT35723P 1 1
158 NX90823W 1 1
159 SC16809F 1 1
160 TX83955R 1 1
161 JA79273O 1 1
162 SK66781D 1 1
163 UK69813N 1 1
164 CX01143B 1 1
165 MT45485A 1 1
166 LJ25921O 1 1
I tried using MANY variations of random.sample() inside for _ in range() loops. Nothing is working, and nothing so far on Stack Overflow seems to give me the result I need.
Here is a solution using pandas.
Say that master is your master dataframe created with pandas; you can do:
shuffled = master.sample(frac=1)
This creates a copy of your master dataframe with rows randomly reordered. See this answer on stackoverflow or the docs for the sample method.
Then you can simply build 10 smaller dataframes of five rows going in order.
subsets = []
for i in range(10):
    subdf = shuffled.iloc[(i*5):((i+1)*5)]
    subsets.append(subdf)
subsets is the list containing your small dataframes. Do:
for sub in subsets:
    print(sub)
to print them all and verify by eye that there are no repetitions.
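If you prefer a programmatic check over eyeballing, a quick sketch (it relies on the original index surviving the shuffle):
combined = pd.concat(subsets)
print(combined.index.is_unique)  # True -> no member appears in two subsets
print(len(combined))             # 50 here, given 10 subsets of 5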
This seems like a combination problem. Here is a solution:
You should create your list, say L. Then you decide the size of the subset, say r. After that here is the code:
from itertools import combinations
combinations(L,r)
However, if you don't want to fix the subset size in advance, you can use the random module as follows:
import random
from itertools import combinations
r = random.randint(a, b)
combinations(L, r)
In this case this will create the combinations of r elements from the list L, where r is a random integer between a and b. If you want to do that 10 times, you can make a for loop (spelled out in the sketch below).
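One way to spell out that loop (a minimal sketch; it uses random.sample instead of combinations and removes each draw from the pool, so the 10 subsets of 5 share no members):
import random
members = ['member_{}'.format(i) for i in range(50)]  # placeholder IDs; use the real CIN values here
pool = list(members)
subsets = []
for _ in range(10):
    picked = random.sample(pool, 5)  # 5 distinct members for this subset
    subsets.append(picked)
    for p in picked:
        pool.remove(p)               # drop them so later subsets cannot reuse them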
I hope that works for you.
Let's assume that we have a lines variable holding a list of the rows of your dataset (random.sample needs a sequence to draw from, not a bare iterator). Then:
from random import sample
# Chunk length
chunk_len = 2
# Number of chunks
num_of_chunks = 5
# Get the sample with data for all chunks. It guarantees us that there will
# be no repetitions
random_sample = sample(lines, num_of_chunks*chunk_len)
# Construct the list with chunks
result = [random_sample[i::num_of_chunks] for i in range(num_of_chunks)]
result
Will return:
[['123 QX20765B 1 1',
'118 GN31829N 1 1'],
['127 IT17216C 1 1',
'122 YW35494W 1 1'],
['138 LI89002Q 1 1',
'126 RG21661H 1 1'],
['120 VJ71307A 1 1',
'121 OX22563R 1 1'],
['143 OW40240P 1 1',
'142 XJ75065T 1 1']]
I have a cumulative sum % column in my data frame.
I would like a function that iterates through each cell of that column and returns a value into a newly created column M_quintile.
cumsum cumsumperc M_quintile
465 0.001320 a number between 1-5
439 0.002499 a number between 1-5
213 0.003624 a number between 1-5
616 0.004583 a number between 1-5
527 0.005468 a number between 1-5
Here's the function I currently have:
def score(x):
    if x <= 0.20:
        return 5
    elif x <= 0.40:
        return 4
    elif x <= 0.60:
        return 3
    elif x <= 0.80:
        return 2
    else:
        return 1
How do I apply this function to a specific column, specifically the cumsumperc column?
I think you're looking for pd.cut(). In your case:
import numpy as np
df['M_quintile'] = pd.cut(df.cumsumperc, bins=[-np.inf, 0.2, 0.4, 0.6, 0.8, np.inf], labels=[5, 4, 3, 2, 1])
>>> df
cumsum cumsumperc M_quintile
0 465 0.001320 5
1 439 0.002499 5
2 213 0.003624 5
3 616 0.004583 5
4 527 0.005468 5
This says: if cumsumperc is between negative infinity and 0.2 (the first 2 values in the bins argument), assign it 5 (the first value in your labels argument), if it's between 0.2 and 0.4, assign it 4, and so on until if it's between 0.8 and infinity, assign it 1.
In your case, all values are between negative infinity and 0.2, so they all get assigned 5. Just for illustration, look what happens if you add another value:
>>> df
cumsum cumsumperc
0 465 0.001320
1 439 0.002499
2 213 0.003624
3 616 0.004583
4 527 0.005468
5 999 0.720000
>>> df['M_quintile'] = pd.cut(df.cumsumperc, bins=[-np.inf,0.2,0.4,0.6,0.8,np.inf], labels=[5,4,3,2,1])
>>> df
cumsum cumsumperc M_quintile
0 465 0.001320 5
1 439 0.002499 5
2 213 0.003624 5
3 616 0.004583 5
4 527 0.005468 5
5 999 0.720000 2
I think there are better ways to do this through Pandas, but if you want to use your own function, you can use the apply function.
import pandas as pd
def score(x):
    if x <= 0.20:
        return 5
    elif x <= 0.40:
        return 4
    elif x <= 0.60:
        return 3
    elif x <= 0.80:
        return 2
    else:
        return 1
df['M_quintile'] = df['cumsumperc'].apply(score)
Output:
cumsum cumsumperc M_quintile
0 465 0.001320 5
1 439 0.002499 5
2 213 0.003624 5
3 616 0.004583 5
4 527 0.005468 5
I have the results of a TensorFlow multi-class prediction, and I have been able to get the top value for each row and its corresponding column header (which is the most likely predicted class) to append to the original data for further analysis, like so:
The original results df with the prediction odds looks roughly like the following, but with 260 columns. The column headers are the first row of ints; the likelihoods are in row 0, row 1, and so on, for millions of rows.
0 1 2 3 4 5 6 7 8 9 10 11 ....... 259
0 8.840584e-08 0.000115 0.000210 0.001662 0.002789
1 0.000312 0.000549 0.002412 0.000630 0.000077
The code that worked to get the top value (contained in the row) is:
eval_datan['odds']=predsdf.max(axis=1) #gets the largest value in the row
And to get the corresponding column header and append it to the original DF:
eval_datan['pred']=predsdf.idxmax(axis=1) #gets the column header for the largest value
I can't figure out how to get the top "n" (in this case the top 5, say) and add them to the original DF.
The result currently looks like:
agegrp gender race marital_status region ccs1 ccs2 ccs3 ccs4 ccs5 odds pred
0 272 284 298 288 307 101 164 53 98 200 0.066987 102
1 272 285 300 290 307 204 120 147 258 151 0.196983 47
2 272 284 298 289 307 197 2 39 253 259 0.109894 259
So what I want is the top 5 preds and the top 5 odds appended to the end of the original data.
I've looked at nlargest in pandas, but so far no luck.
You can pick your top N features by changing the variable n below.
import pandas as pd
df = pd.read_table('your_sample_data.txt', delimiter=r'\s+')
n=3 # Top N features
frames = []
# for each row, record the column labels of its n largest values, in descending order
df.T.apply(lambda x: frames.append(x.sort_values(ascending=False).head(n).index.tolist()), axis=0)
print(df)
print(df.join(pd.DataFrame(frames, columns=['ccs{}'.format(n+1) for n in range(n)])))
0 1 2 3 4
0 8.840584e-08 0.000115 0.000210 0.001662 0.002789
1 3.120000e-04 0.000549 0.002412 0.000630 0.000077
0 1 2 3 4 ccs1 ccs2 ccs3
0 8.840584e-08 0.000115 0.000210 0.001662 0.002789 4 3 2
1 3.120000e-04 0.000549 0.002412 0.000630 0.000077 2 3 1
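If you also want the top-n odds next to the top-n labels (as the question asks), here is a hedged alternative sketch using numpy argsort; it assumes predsdf and eval_datan from the question line up row for row:
import numpy as np
n = 5
vals = predsdf.values
order = np.argsort(-vals, axis=1)[:, :n]  # column positions of the n largest odds per row
for rank in range(n):
    eval_datan['pred{}'.format(rank + 1)] = predsdf.columns.values[order[:, rank]]      # class label
    eval_datan['odds{}'.format(rank + 1)] = vals[np.arange(len(vals)), order[:, rank]]  # its odds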