How can 1:1 stratified sampling be performed in Python?
Assume the pandas DataFrame df is heavily imbalanced. It contains a binary group column and multiple columns of categorical sub-groups.
df = pd.DataFrame({'id':[1,2,3,4,5], 'group':[0,1,0,1,0], 'sub_category_1':[1,2,2,1,1], 'sub_category_2':[1,2,2,1,1], 'value':[1,2,3,1,2]})
display(df)
display(df[df.group == 1])
display(df[df.group == 0])
df.group.value_counts()
For each member of the main group (group == 1) I need to find a single matching row from group == 0.
A StratifiedShuffleSplit from scikit-learn will only return a random portion of data, not a 1:1 match.
If I understood correctly you could use np.random.permutation:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0], 'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})
# create new column with an identifier for a combination of categories
columns = ['sub_category_1', 'sub_category_2']
labels = df.loc[:, columns].apply(lambda x: ''.join(map(str, x.values)), axis=1)
values, keys = pd.factorize(labels)
df['label'] = labels.map(dict(zip(keys, values)))
# build distribution of sub-categories combinations
distribution = df[df.group == 1].label.value_counts().to_dict()
# select from group 0 only those rows that are in the same sub-categories combinations
mask = (df.group == 0) & (df.label.isin(distribution))
# do random sampling
selected = np.concatenate([np.random.permutation(group.index)[:distribution[name]]
                           for name, group in df.loc[mask].groupby('label')])
# display result
result = df.drop('label', axis=1).loc[selected]
print(result)
Output
   group  id  sub_category_1  sub_category_2  value
4      0   5               1               1      2
2      0   3               2               2      3
Note that this solution assumes that, for each possible sub-category combination, group 1 has no more rows than the corresponding sub-group in group 0. A more robust version uses np.random.choice with replacement:
selected = np.concatenate([np.random.choice(group.index, distribution[name], replace=True)
                           for name, group in df.loc[mask].groupby('label')])
The choice version does not carry the size assumption of the permutation version, although it still requires at least one group-0 row for each sub-category combination.
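If you prefer to stay inside pandas, roughly the same 1:1 matching can be sketched with DataFrame.sample; this is a minimal sketch under the assumption that the mask and distribution objects built above are still in scope:
matched = pd.concat(
    grp.sample(n=distribution[name], replace=True, random_state=42)
    for name, grp in df.loc[mask].groupby('label')
)
print(matched.drop(columns='label'))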
Related
I have an array of NaNs, 10 columns wide and 5 rows long.
I also have a 5x3 array of Poisson random draws. It represents 5 runs each of A, B, and C, where each column uses a different lambda value for the Poisson distribution.
A B C
[1, 1, 2,
1, 2, 2,
2, 1, 4,
1, 2, 3,
0, 1, 2]
Each row represents the number of events. That is, the first row would produce one event of type A, one event of type B, and two events of type C.
I would like to loop through each row and produce a set of uniform random numbers. For A, the numbers would be between 1 and 100; for B, between 101 and 200; and for C, between 201 and 300.
The output of the first row would have four numbers, one number between 1 and 100, one number between 101 and 200, and two numbers between 201 and 300. So a sample output of the first row might be:
[34, 105, 287, 221]
The second output row would have five numbers in it, the third row would have seven, etc. I would like to store it in my array of NaNs by overwriting the NaNs that get replaced in each row. Can anyone please help with this? Thanks!
I've got a rather inefficient/unvectorised method which may or may not be what you're looking for, because one part of your question is unclear to me. Do you want the final array to have rows of different sizes, or to be the same size but padded with nans?
This solution assumes padding with nans, since you talked about the nans being overwritten and didn't mention the extra/unused nans being deleted. I'm also assuming that your ABC data is stored in a numpy array of shape (5, 3), and I'm calling the array of nans results_arr.
import numpy as np
from random import randint
# Initializing the arrays
results_arr = np.full((5,10), np.nan)
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])
# Loops through each row in ABC
for row_idx in range(len(abc)):
    a, b, c = abc[row_idx]
    # Here, I'm getting a number in the specified uniform distribution as many
    # times as is specified in the A column. The other 2 loops do the same for
    # the B and C columns.
    for i in range(0, a):
        results_arr[row_idx, i] = randint(1, 100)
    for j in range(a, a+b):
        results_arr[row_idx, j] = randint(101, 200)
    for k in range(a+b, a+b+c):
        results_arr[row_idx, k] = randint(201, 300)
Hope that helps!
P.S. Here's a solution with uneven rows. The result is stored in a list of lists because numpy doesn't support ragged arrays (i.e. rows of different lengths).
import numpy as np
from random import randint
# Initializations
results_arr = []
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])
# Same code logic as before, just storing the results differently
for row_idx in range(len(abc)):
    a, b, c = abc[row_idx]
    results_this_row = []
    for i in range(0, a):
        results_this_row.append(randint(1, 100))
    for j in range(a, a+b):
        results_this_row.append(randint(101, 200))
    for k in range(a+b, a+b+c):
        results_this_row.append(randint(201, 300))
    results_arr.append(results_this_row)
I hope these two solutions cover what you're looking for!
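As a follow-up, here is a slightly tidier sketch of the nan-padded version that draws each row's numbers in one call per column; the Generator API, the seed, and the 10-column padding width are my own assumptions, not part of the question:
import numpy as np

rng = np.random.default_rng(0)
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])
results_arr = np.full((abc.shape[0], 10), np.nan)

for row_idx, counts in enumerate(abc):
    # One draw per column: 1-100 for A, 101-200 for B, 201-300 for C (inclusive).
    draws = np.concatenate([
        rng.integers(low, low + 100, size=n)
        for low, n in zip((1, 101, 201), counts)
    ])
    results_arr[row_idx, :len(draws)] = draws

print(results_arr)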
I wanted to create a simple script that counts how many values in one column are higher than the corresponding values in another column:
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
print(df)
   a  b
1  1  0
2  3  2
My function:
def diff(dataframe):
    a_counter = 0
    b_counter = 0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i > ii:
                a_counter += 1
            elif ii > i:
                b_counter += 1
    return a_counter, b_counter
However
diff(df)
returns (3, 1) instead of (2, 0). I know the problem is that every single value of one column gets compared to every value of the other column (e.g. 1 gets compared to both 0 and 2 of column b). There is probably a built-in function for my problem, but can you help me fix my script?
I would suggest adding helper columns, which makes it intuitive to compute the sum of each condition (a > b and b > a).
A working example based on your code:
import numpy as np
import pandas as pd
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
def diff(dataframe):
    dataframe['a>b'] = np.where(dataframe['a'] > dataframe['b'], 1, 0)
    dataframe['b>a'] = np.where(dataframe['b'] > dataframe['a'], 1, 0)
    return dataframe['a>b'].sum(), dataframe['b>a'].sum()
print(diff(df))
>>> (2, 0)
Basically, the way I used np.where() here, it produces 1 where the condition is met and 0 otherwise. You can then add those columns up with a simple sum() on the desired columns.
Update
Maybe you can use:
>>> df['a'].gt(df['b']).sum(), df['b'].gt(df['a']).sum()
(2, 0)
IIUC, to fix your code:
def diff(dataframe):
    a_counter = 0
    b_counter = 0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i > ii:
                a_counter += 1
            elif ii > i:
                b_counter += 1
    # Subtract the minimum of the two counters
    m = min(a_counter, b_counter)
    return a_counter - m, b_counter - m
Output:
>>> diff(df)
(2, 0)
IIUC, you can use the sign of the difference and count the values:
d = {1: 'a', -1: 'b', 0: 'equal'}
(np.sign(df['a'].sub(df['b']))
   .map(d)
   .value_counts()
   .reindex(list(d.values()), fill_value=0)
)
output:
a 2
b 0
equal 0
dtype: int64
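For reference, the intermediate sign Series that the snippet above builds looks like this (a small self-contained sketch for the example data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3], 'b': [0, 2]}, index=[1, 2])
signs = np.sign(df['a'].sub(df['b']))   # +1 where a > b, -1 where b > a, 0 where equal
print(signs.tolist())                   # [1, 1] -> two rows where a > b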
I have a labeled dataset:
data = np.array([5.2, 4, 5, 2, 5.3, 10, 0])
labels = np.array([1, 0, 1, 2, 1, 3, 4])
I want to pick the data 5.2, 5 and 5.3 that have label 1 and resample from it, as follows:
datalabel1 = data[(labels == 1)]
Then I want to do a random.choice(), for example (pseudo):
# indices are the indices from label 1
random_choices = np.random.choice(indices, size = 5)
And get as output different values with different indices:
# indices are the different indices of the data from the pool out of random choice
data: [5.3 5.2 5.2 5.2 5]
indices: [4 0 0 2 2]
My goal is to sample from the pool of data with label 1.
labels == 1 is a boolean mask. You need to apply it to data, not back to labels, to get the data elements labeled 1:
np.random.choice(data[labels == 1], ...)
You can also convert labels == 1 to an array of indices and choose from those before indexing:
indices = np.flatnonzero(labels == 1)
data[np.random.choice(indices, ...)]
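A minimal end-to-end sketch combining the two steps (the seed and sample size are arbitrary; the legacy np.random.choice works the same way as the Generator used here):
import numpy as np

rng = np.random.default_rng(42)
data = np.array([5.2, 4, 5, 2, 5.3, 10, 0])
labels = np.array([1, 0, 1, 2, 1, 3, 4])

indices = np.flatnonzero(labels == 1)   # indices of the label-1 entries: [0 2 4]
chosen = rng.choice(indices, size=5)    # sampling with replacement (the default)
print("data:   ", data[chosen])
print("indices:", chosen)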
I have a pandas dataframe.
input_data = {'col1': [1, 2, 3], 'col2': [3, 4, 5]}
d = pd.DataFrame(data=input_data)
anotherdata = magic(d)
df = pd.DataFrame(data=anotherdata)
I use DBSCAN to cluster df.
As a result I have cluster_labels. The labels can take values from -1 (outlier) to 2 in this case.
I want to be able to show the data from one particular cluster separately and still have access to the initial dataframe d by index.
For example, I have an element with index 1 in input_data.
The element is assigned to cluster 0, and there are no other elements in cluster 0.
How can I find this element in input_data by index?
You probably want to use
d[cluster_labels == 0]
Unless your magic function changed the index.
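A small sketch of the idea; the hand-written cluster_labels array here stands in for the output of DBSCAN(...).fit_predict(df) and is assumed to be aligned with d's rows:
import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5]})
cluster_labels = np.array([0, -1, 1])   # stand-in for DBSCAN's labels

print(d[cluster_labels == 0])           # rows of the original frame assigned to cluster 0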
Right now I have to do calculations on dataframe_one, then create a new column on dataframe_two and fill in the results. dataframe_one is multi-indexed, while the second one is not, but it has columns that match the index levels of dataframe_one.
This is what I'm currently doing:
import pandas as pd
import numpy as np
dataframe_two = {}
dataframe_two['project_id'] = [1, 2]
dataframe_two['scenario'] = ['hgh', 'low']
dataframe_two = pd.DataFrame(dataframe_two)
dataframe_one = {}
dataframe_one['ts_project_id'] = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
dataframe_one['ts_scenario'] = ['hgh', 'hgh', 'hgh', 'hgh', 'hgh', 'low', 'low', 'low', 'low', 'low']
dataframe_one['ts_economics_atcf'] = [-2, 2, -3, 4, 5 , -6, 3, -3, 4, 5]
dataframe_one = pd.DataFrame(dataframe_one)
dataframe_one.index = [dataframe_one['ts_project_id'], dataframe_one['ts_scenario']]
project_scenario = zip(dataframe_two['project_id'], dataframe_two['scenario'])
dataframe_two['econ_irr'] = np.zeros(len(dataframe_two.index))
i = 0
for project, scenario in project_scenario:
    # Grab the corresponding series from dataframe_one
    atcf = dataframe_one.loc[(project, scenario), 'ts_economics_atcf']
    irr = np.irr(atcf.values)
    dataframe_two.loc[i, 'econ_irr'] = irr
    i = i + 1
print(dataframe_two)
Is there an easier way to do this?
Cheers!
If I understood right, you want the pandas equivalent of SQL's GROUP BY and aggregation functions. They are essentially the groupby method of a DataFrame/Series and the aggregate method of the resulting SeriesGroupBy object.
>>> dataframe_one['ts_economics_atcf'].groupby(level=[0,1]).aggregate(np.irr)
ts_project_id  ts_scenario
1              hgh            0.544954
2              low            0.138952
dtype: float64
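One caveat: np.irr was deprecated in NumPy 1.18 and removed in 1.20; the same function now lives in the separate numpy-financial package. A sketch of the equivalent call under that assumption:
import numpy_financial as npf

result = (dataframe_one['ts_economics_atcf']
          .groupby(level=[0, 1])
          .agg(lambda s: npf.irr(s.values)))
print(result)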