random sample per group, with min_rows - python

I have a dataframe and I want to sample it. However, while sampling it randomly I want to have at least 1 sample from every element in the column. I also want the distribution to have an effect (e.g. values with more rows in the original should have more rows in the sampled df).
Similar to this and this question, but with a minimum sample size per group.
Let's say this is my df:
df = pd.DataFrame(columns=['class'])
df['class'] = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,2]
df_sample = df.sample(n=4)
And when I sample this I want the df_sample to look like:
Class
0
0
1
2
Thank you.

As suggested by @YukiShioriii you could:
1 - sample one row of each group of values
2 - randomly sample over the remaining rows regardless of the values

Following YukiShioriii's and mprouveur's suggestion
sample_size = 4  # total number of rows wanted
# random_state for reproducibility, remove in production code
sample = df.groupby('class').sample(1, random_state=1)
sample = pd.concat([  # DataFrame.append was removed in pandas 2.0, so use pd.concat
    sample,
    df[~df.index.isin(sample.index)]            # only rows that have not been selected yet
      .sample(n=sample_size - sample.shape[0])  # sample as many more rows as needed
]).sort_index()
Output
    class
2       0
4       0
13      1
14      2
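Wrapped up as a reusable sketch (the function name and the min_rows parameter are mine, not from the answer; assumes pandas >= 1.1 for groupby(...).sample(...)):
import pandas as pd

def sample_with_min_per_group(df, group_col, n, min_rows=1, random_state=None):
    # step 1: take min_rows rows from every group so no group is missing
    base = df.groupby(group_col).sample(min_rows, random_state=random_state)
    # step 2: top up from the rows not selected yet, so frequent groups stay frequent
    remaining = df[~df.index.isin(base.index)]
    extra = remaining.sample(n=n - len(base), random_state=random_state)
    return pd.concat([base, extra]).sort_index()

df = pd.DataFrame({'class': [0,0,0,0,0,0,0,0,0,0,0,0,0,1,2]})
print(sample_with_min_per_group(df, 'class', n=4, random_state=1))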

Related

Balance dataset using pandas

This is for a machine learning program.
I am working with a dataset that has a csv which contains an id for a .tif image in another directory, and a label, 1 or 0. There are 220,025 rows in the csv. I have loaded this csv as a pandas dataframe. Currently in the dataframe, there are 220,025 rows, with 130,908 rows with label 0 and 89,117 rows with label 1.
There are 41,791 more rows with label 0 than label 1. I want to randomly drop the extra rows with label 0. After that, I want to decrease the sample size from 178,234 to just 50,000, with 25,000 ids for each label.
Another approach might be to randomly drop 105,908 rows with label 0 and 64,117 rows with label 1.
How can I do this using pandas?
I have already looked at using .groupby and then using .sample, but that samples an equal number of rows from both labels, while I only want to drop rows from one label.
Sample of the csv:
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
Personally, I would break it up into the following steps:
Since you have more 0s than 1s, we're first going to ensure that we even out the number of each. Here, I'm using the sample data you pasted in as df
Count the number of 1s (since this is our smaller value)
ones_subset = df.loc[df["label"] == 1, :]
number_of_1s = len(ones_subset)
print(number_of_1s)
3
Sample only the zeros to match number_of_1s
zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(number_of_1s)
print(sampled_zeros)
Stick these 2 chunks (all of the 1s from our ones_subset and our matched sampled_zeros) together to make one clean dataframe that has an equal number of 1 and 0 labels:
clean_df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
print(clean_df)
                                         id  label
0  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
1  a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da      1
2  7f6ccae485af121e0b6ee733022e226ee6b0c65f      1
3  559e55a64c9ba828f700e948f6886f4cea919261      0
4  f38a6374c348f90b587e046aac6079959adf3835      0
5  068aba587a4950175d04c680d38943fd488d6a9d      0
Now that we have a cleaned up dataset, we can proceed with the last step:
Use the groupby(...).sample(...) approach you mentioned to further downsample this dataset, taking it from a dataset that has 3 of each label (three 1s and three 0s) to a smaller matched size (two 1s and two 0s):
downsampled_df = clean_df.groupby("label").sample(2)
print(downsampled_df)
                                         id  label
4  f38a6374c348f90b587e046aac6979959adf3835      0
5  068aba587a4950175d04c680d38943fd488d6a9d      0
1  a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da      1
0  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
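On the full 220,025-row dataset the two steps can also be collapsed into a single groupby(...).sample(...) call, since sampling 25,000 rows per label both balances the classes and hits the 50,000-row target. A hedged sketch (assumes pandas >= 1.1 and that both labels have at least 25,000 rows; random_state is arbitrary):
# 25,000 rows from each label in one step
downsampled_df = df.groupby("label").sample(n=25_000, random_state=0)
print(downsampled_df["label"].value_counts())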

Random values of rows with minimum unique values of a column pandas

I have a huge df (~1 million rows) with a bunch of columns. One of these columns contains some categorical data, like Name:
   Code   Regione  CodeProv Origin      Name
0     1  Piemonte                1    Torino
1     1  Piemonte                2  Vercelli
2     1  Piemonte                2  Vercelli
What I want to do is get a random sample of rows, say 10k, but these rows should contain at least 20 unique values of the Name column; it does not matter whether each unique category has the same number of rows.
If your number of names is >> 20 and your distribution of names is not concentrated amongst fewer than 20 names, then don't overcomplicate it and just do this:
number_of_unique_names_in_sample = 0
while number_of_unique_names_in_sample < 20:
    df_sample = df.sample(n=10_000)
    number_of_unique_names_in_sample = df_sample["Name"].nunique()
And maybe add in a counter to limit the number of iterations in case your distribution changes (like in a small test sample for example).
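A sketch of that safeguard, with an arbitrary cap of 100 attempts:
max_tries = 100                       # arbitrary cap on the number of attempts
for attempt in range(max_tries):
    df_sample = df.sample(n=10_000)
    if df_sample["Name"].nunique() >= 20:
        break
else:
    raise RuntimeError(f"no sample with 20 unique names after {max_tries} tries")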
This might be what you're asking for:
name_cols = list_of_names  # the 20 Name values you want to keep
samples_per_name = 500
df[df['Name'].isin(name_cols)].groupby('Name').apply(lambda x: x.sample(samples_per_name))
The result will be 10,000 rows in total: len(name_cols) groups (20 in your example), each containing 500 rows.
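Note that groupby(...).apply(lambda x: x.sample(...)) leaves Name as an extra index level; if a flat index is preferred and pandas >= 1.1 is available, the per-group sampling can also be written as:
# same 500 rows per name, but keeping the original flat index
df[df['Name'].isin(name_cols)].groupby('Name').sample(n=samples_per_name)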

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as a reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out this:
   index      Mean
0      0  4.407498
1      1  4.463597
2      2  4.611886
3      3  4.710751
4      4  4.742491
5      5  4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby:
moy = base.sort_values('Mean').tail(1)
# if base is the plain Series returned by df.mean(axis=1), use base.sort_values().tail(1) instead
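If more than the single highest row is wanted, here is a small sketch working directly from the row means (the random example data and the choice of 5 rows are mine, for illustration only):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4))   # stand-in for the scan data
row_means = df.mean(axis=1)                  # one mean per row, as a Series
top5 = row_means.nlargest(5)                 # the 5 rows with the highest means
print(df.loc[top5.index])                    # the corresponding rows of the original frame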
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)

Pandas - get first n-rows based on percentage

I have a dataframe from which I want to pop a certain number of records, but instead of a number I want to pass a percentage value.
For example,
df.head(n=10)
pops out the first 10 records from the data set. I want a small change: instead of 10 records I want to pop the first 5% of records from my data set.
How do I do this in pandas?
I'm looking for code like this:
df.head(frac=0.05)
Is there any simple way to get this?
I want to pop first 5% of record
There is no built-in method but you can do this:
You can multiply the total number of rows by your percentage and use the result as the parameter for the head method.
n = 5
df.head(int(len(df)*(n/100)))
So if your dataframe contains 1000 rows and n = 5% you will get the first 50 rows.
I've extended Mihai's answer for my usage and it may be useful to people out there.
The purpose is automated top-n records selection for time series sampling, so you're sure you're taking old records for training and recent records for testing.
# having
# import pandas as pd
# df = pd.DataFrame...
def sample_first_prows(data, perc=0.7):
    return data.head(int(len(data) * perc))

train = sample_first_prows(df)
test = df.iloc[max(train.index) + 1:]  # +1 so the boundary row is not in both train and test
I also had the same problem and @mihai's solution was useful. For my case I rewrote it to:
percentage_to_take = 5/100
rows = int(df.shape[0]*percentage_to_take)
df.head(rows)
I presume df.tail(rows) would similarly give the last percentage of rows (while df.head(-rows) would give everything except them).
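A quick sketch of that, reusing the rows variable computed above:
last_pct = df.tail(rows)          # the last 5% of rows
all_but_last = df.head(-rows)     # everything except the last 5%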
Maybe this will help:
# take the first 5% of rows within each id group (tmp is your dataframe)
tt = tmp.groupby('id').apply(lambda x: x.head(int(len(x)*0.05))).reset_index(drop=True)
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2))
print(df)
0 1
0 0.375727 -1.297127
1 -0.676528 0.301175
2 -2.236334 0.154765
3 -0.127439 0.415495
4 1.399427 -1.244539
5 -0.884309 -0.108502
6 -0.884931 2.089305
7 0.075599 0.404521
8 1.836577 -0.762597
9 0.294883 0.540444
# 70% of the DataFrame
part_70 = df.sample(frac=0.7, random_state=10)
print(part_70)
0 1
8 1.836577 -0.762597
2 -2.236334 0.154765
5 -0.884309 -0.108502
6 -0.884931 2.089305
3 -0.127439 0.415495
1 -0.676528 0.301175
0 0.375727 -1.297127

Split a dataframe into two files based on the values of a column

I need to split a dataframe into 2 parts. For example, if the dataframe below is split randomly based on Col1, both files should contain samples from each category 1, 2 and 3.
Col1  col2
1     a
1     b
2     c
2     d
3     e
So far I am able to split the data into the desired ratio by using train_test_split from sklearn.cross_validation. But I am not able to figure out how the splitting should be done so that samples are picked up from every category.
All help will be appreciated. Thanks.
Take a look at the StratifiedKFold object:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html
There is a short example in the docs showing how to use it.
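If a single stratified split into two files is all that is needed, train_test_split also accepts a stratify argument (in current scikit-learn it lives in sklearn.model_selection rather than the deprecated sklearn.cross_validation). A minimal sketch, assuming every category has at least two rows (stratification fails otherwise):
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Col1': [1, 1, 2, 2, 3, 3], 'col2': list('abcdef')})
part1, part2 = train_test_split(df, test_size=0.5, stratify=df['Col1'], random_state=0)
part1.to_csv('part1.csv', index=False)   # each file contains samples from every category
part2.to_csv('part2.csv', index=False)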
