Given a dataframe, my goal is to sample rows such that values in one column are as balanced as possible.
Say I have the dataframe below, the sample size is 3, and the target column is c:
a | b | c
1 | 2 | 0
3 | 4 | 0
5 | 6 | 1
7 | 8 | 2
9 | 10| 2
11| 12| 2
One of possible samples would be
a | b | c
1 | 2 | 0
5 | 6 | 1
7 | 8 | 2
In case the sample size is not a multiple of the number of unique classes, it is fine for the class counts to differ by 1 item or so.
How would I approach this in pandas?
EDIT: I provided the solution that worked for me in the answers below.
I first generated sample sizes for each unique value of column c so that the result is balanced. The remainder is distributed over the first few classes:
unique_values = df['c'].unique()
k = 3  # desired total sample size
sample_sizes = [k // len(unique_values)] * len(unique_values)
# distribute the remainder over the first few classes
i = 0
while i < k % len(unique_values):
    sample_sizes[i] += 1
    i = i + 1
This bit draws the samples based on the generated sample sizes:
# sample each class once with its full per-class size, so no row is drawn twice
df2 = pd.concat([df.loc[df['c'] == unique_values[i]].sample(sample_sizes[i])
                 for i in range(len(sample_sizes))])
You can just get a random sample of the dataframe based on the minimum count of the target column.
column = 'c'
df = df.groupby(column).sample(n=df[column].value_counts().min(), random_state=42)
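Note that this draws the same number of rows from every class, capped by the count of the rarest class; on the example frame class 1 occurs only once, so you get one row per class (a total of 3). It does not let you request an arbitrary total sample size k.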
First, we create your example dataframe
columns = ['a', 'b', 'c']
data = [[1, 2, 0], [3, 4, 0], [5, 6, 1], [7, 8, 2], [9, 10, 2], [11, 12, 2]]
df = pd.DataFrame(data = data, columns = columns)
Now, with the following function you can do what you want
def balanced_sample(dataframe, sample_size, target_column):
    # extract existing possible classes
    target_columns_values = dataframe.loc[:, target_column].unique().tolist()
    # count number of classes
    target_columns_unique_classes_size = len(target_columns_values)
    # checking if sample size is a multiple of the number of classes
    if sample_size % target_columns_unique_classes_size != 0:
        print('Sample size is not a multiple of the number of unique classes')
    # rounding keeps the difference within 1 item or so;
    # another possibility is to use
    # sample_size // target_columns_unique_classes_size instead of round(...),
    # but then instances_per_class will always be <= than
    # sample_size / target_columns_unique_classes_size
    instances_per_class = round(sample_size / target_columns_unique_classes_size)
    # checking if there are enough examples per class
    values_per_class = dataframe.loc[:, target_column].value_counts()
    for idx in values_per_class.index:
        if instances_per_class > values_per_class[idx]:
            print('Class {} has only {} examples, so it is impossible to use a '
                  'sample size of {}, i.e., {} per class'.format(
                      idx, values_per_class[idx], sample_size, instances_per_class))
            return pd.DataFrame(columns=dataframe.columns)
    # creating the result dataframe
    data = []
    for classes in target_columns_values:
        class_values = dataframe[
            dataframe.loc[:, target_column] == classes
        ].sample(instances_per_class).values.tolist()
        data += class_values
    result_dataframe = pd.DataFrame(columns=dataframe.columns, data=data)
    return result_dataframe
Now we check the function:
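(a possible run on the example dataframe; the class rows are drawn at random, so your exact rows may differ)
print(balanced_sample(df, 3, 'c'))
   a  b  c
0  1  2  0
1  5  6  1
2  7  8  2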
And with other options:
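(for instance, a sample size the rarest class cannot support triggers the guard, since class 1 has only one row)
print(balanced_sample(df, 6, 'c'))
Class 1 has only 1 examples, so it is impossible to use a sample size of 6, i.e., 2 per class
Empty DataFrame
Columns: [a, b, c]
Index: []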
I hope you find it useful; if you have any doubts, comment here and I will try to answer.
The question is a bit ambiguous, but say you want to randomly select 1 row for each category in column c; one could do:
import pandas as pd
data = [
[1, 2, 0], [1, 4, 0], [2, 2, 1],
[4, 5, 1], [3, 7, 2], [3, 3, 2],
[1, 2, 6], [3, 2, 6], [5, 2, 6]
]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
sample = df.groupby('c').apply(lambda x: x.sample(n=1).squeeze())
a b c
c
0 1 4 0
1 2 2 1
2 3 3 2
6 1 2 6
I am posting the solution that works for me. It is not the most beautiful or efficient code. But that's honest work.
import pandas as pd

df = pd.read_csv(path)
target_col = 't'
unique_values = df[target_col].unique()
k = 8  # sample size

per_class_sample_size = int(k / unique_values.shape[0])
arr_samples_per_class = [0] * len(unique_values)

# distribute the leftover (k minus the evenly divisible part) over classes
# that have enough rows to spare
leftover = k - (per_class_sample_size * len(unique_values))
for i, v in enumerate(unique_values):
    occ = df[df[target_col] == v].shape[0]
    if leftover > 0 and occ > per_class_sample_size:
        sz = per_class_sample_size + 1
        leftover -= 1
    else:
        sz = per_class_sample_size if occ >= per_class_sample_size else occ
    arr_samples_per_class[i] = sz

# draw each class's sample and concatenate into the final frame
fdf = None
for v, sz in zip(unique_values, arr_samples_per_class):
    ss = df.loc[df[target_col] == v].sample(sz)
    fdf = ss if fdf is None else pd.concat([fdf, ss], axis=0)
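fdf now holds the balanced sample. One caveat: if a class has fewer rows than its per-class share, it contributes all of its rows and the shortfall is not redistributed to the other classes, so the total can come out below k.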
I have a data frame consisting of lists as elements. I want to find the closest matching values within a percentage of a given value.
My code:
df = pd.DataFrame({'A':[[1,2],[4,5,6]]})
df
           A
0     [1, 2]
1  [4, 5, 6]
import numpy as np

# in each row, let's find the values and their index that match 5 within 20% tolerance
val = 5
tol = 0.2  # find values matching 5, or within 20% of 5 (i.e., 4 to 6)
df['Matching_index'] = (df['A'].map(np.array) - val).map(abs).map(np.argmin)
Present solution:
df
A Matching_index
0 [1, 2] 1 # 2 matches closely with 5 but this is wrong
1 [4, 5, 6] 1 # 5 matches with 5, correct.
Expected solution:
df
A Matching_index
0 [1, 2] NaN # No matching value, hence NaN
1 [4, 5, 6] 1 # 5 matches with 5, correct.
The idea is to get the difference from val, replace values that do not match the tolerance with NaN, and finally use np.nanargmin, which raises an error if all values are missing, hence the extra condition with m.any():
def f(x):
    a = np.abs(np.array(x) - val)
    m = a <= val * tol
    return np.nanargmin(np.where(m, a, np.nan)) if m.any() else np.nan

df['Matching_index'] = df['A'].map(f)
print(df)
A Matching_index
0 [1, 2] NaN
1 [4, 5, 6] 1.0
Pandas solution:
df1 = pd.DataFrame(df['A'].tolist(), index=df.index).sub(val).abs()
df['Matching_index'] = df1.where(df1 <= val * tol).dropna(how='all').idxmin(axis=1)
I'm not sure if you want all the indexes or just a count.
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[[1,2],[4,5,6,7,8]]})
val = 5
tol = 0.3
def closest(arr, val, tol):
    idxs = [idx for idx, el in enumerate(arr) if (np.abs(el - val) < val*tol)]
    result = len(idxs) if len(idxs) != 0 else np.nan
    return result
df['Matching_index'] = df['A'].apply(closest, args=(val,tol,))
df
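With the data above, this should give (row 0 has no value within tolerance, row 1 has three: 4, 5, and 6):

                 A  Matching_index
0           [1, 2]             NaN
1  [4, 5, 6, 7, 8]             3.0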
If you want all the indexes, just return idxs instead of len(idxs).
Normally when you want to turn a set of data into a Data Frame, you make a list for each column, create a dictionary from those lists, then create a data frame from the dictionary.
The data frame I want to create has 75 columns, all with the same number of rows. Defining lists one by one isn't going to work. Instead I decided to make a single list and iteratively put a certain chunk of it into each column of the Data Frame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a':[], 'b':[], 'c':[], 'd':[], 'e':[]}
df = pd.DataFrame(dict)
# Here is my test data frame, it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list; it looks like this: lst = [0, 1, …, 9]
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# Second column i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index = True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
a b c d e
0 0 NaN NaN NaN NaN
1 1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.
IIUC:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
# order='F' fills the reshaped array column by column, so consecutive
# pairs from the list become the columns a..e
df = pd.DataFrame(alst.reshape(2, -1, order='F'), columns=[*'abcde'])
print(df)
Output:
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
I need to make an n-by-n matrix (or array) from my data.
I have data like this:
type a | type b
1 | 1
1 | 2
1 | 3
2 | 1
2 | 4
3 | 1
4 | 2
and I want to make it like this:
a/b |  1  2  3  4
----+------------
 1  |  1  1  1  0
 2  |  1  0  0  1
 3  |  1  0  0  0
 4  |  0  1  0  0
Isn't there something for this in a Python library (pandas, etc.)?
This is fairly simple. The table you have just denotes the indices of a 2-dim array. For the sake of simplicity, you can use NumPy arrays:
import numpy as np
data = np.array([
[1, 1],
[1, 2],
[1, 3],
[2, 1],
[2, 4],
[3, 1],
[4, 2]
]) - 1 # Index starts at 0
n = 4
matrix = np.zeros((n, n), dtype=int)  # int dtype so the output matches the desired 0/1 table
#       -a index-   -b index-
matrix[data[:, 0], data[:, 1]] = 1
If you put your data in a pandas dataframe you can use crosstab: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
This will give you the frequency table that you are looking for
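A minimal sketch on the example data (the column names a and b are assumed):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 4],
                   'b': [1, 2, 3, 1, 4, 1, 2]})
print(pd.crosstab(df['a'], df['b']))

b  1  2  3  4
a
1  1  1  1  0
2  1  0  0  1
3  1  0  0  0
4  0  1  0  0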
import pandas as pd
d = {'col1': [1,1,1,2,2,3,4], 'col2': [1,2,3,1,2,1,2]}
df = pd.DataFrame(data=d)
df = df.groupby('col1')['col2'].apply(lambda x: pd.Series(x.values)).unstack().reset_index()
df = df.fillna(0)
df.columns = ['col1','1','2','3']
df[df[['1','2','3']] != 0] = 1
df
Below is another solution, using the text CountVectorizer from the sklearn package; it works for any type of data:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
d = {'col1': [1,1,1,2,2,3,4], 'col2': [1,2,3,1,2,1,2]}
df = pd.DataFrame(data=d)
df['col2'] = df['col2'].astype(str)
df = df.groupby(['col1'])['col2'].apply(' '.join).reset_index()
corpus = list(df['col2'])
df = pd.DataFrame(data=corpus, columns=['cols'])
vectorizer = CountVectorizer(vocabulary=['1','2','3','4'], token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(df['cols'].values)
df = pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out())
df.index = df.columns
df
I have defined a function to create a dataframe, but I get whole lists in each column. How could I get each element of the list as a separate row in the dataframe, as shown below?
a = [1, 2, 3, 4]
def function():
    result = []
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
        result.append({'number': number, 'operation': operation})
        df = pd.DataFrame(result, columns=['number','operation'])
    return df
function()
Result:
number operation
0 [1, 2, 3, 4] [8, 16, 24, 32]
What I really want to:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
Can anyone help me please? :)
Your problems are twofold: firstly, you are pushing the entire list of values (instead of the "current" value) into the result array on each pass through your for loop; secondly, you are overwriting the dataframe each time as well. It would be simpler to use a list comprehension to generate the values for the dataframe:
import pandas as pd
a = [1, 2, 3, 4]
def function():
    result = [{'number': i, 'operation': 8*i} for i in a]
    df = pd.DataFrame(result)
    return df
print(function())
Output:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
import numpy as np
import pandas as pd

a = [1, 2, 3, 4]

def function():
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
    # rotate the 2 x N array into N x 2, then flip vertically to restore order
    v = np.rot90(np.array((number, operation)))
    result = np.flipud(v)
    df = pd.DataFrame(result, columns=['number','operation'])
    return df

print(function())
number operation
0 1 8
1 2 16
2 3 24
3 4 32
You are almost there. Just replace number = [i for i in a] with number = a[i] and operation = [8*i for i in a] with operation = 8 * a[i]
(FYI: there is no need to create the pandas dataframe inside the loop; you can get the same output by creating it outside the loop.)
Refer to the below code:
a = [1, 2, 3, 4]

def function():
    result = []
    for i in range(0, len(a)):
        number = a[i]
        operation = 8*a[i]
        result.append({'number': number, 'operation': operation})
    df = pd.DataFrame(result, columns=['number','operation'])
    return df

function()
number operation
0 1 8
1 2 16
2 3 24
3 4 32
I have a table that looks like:
Group Name
1 A
1 B
2 R
2 F
3 B
3 C
And I need to group these records by the following rule:
If a group contains at least one Name that is also contained in another group, then those two groups belong to the same group. In my case, group 1 contains A and B, and group 3 contains B and C. They share the name B, so they must be in the same group.
As a result I want to get something like this:
Group Name ResultGroup
1 A 1
1 B 1
2 R 2
2 F 2
3 B 1
3 C 1
I already found a solution, but my table has about 200k records, so it takes too much time (more than 12 hours). Is there a way to optimize it? Maybe using pandas or something like that?
def printList(l, head=""):
    if head != "":
        print(head)
    for i in l:
        print(i)

def find_group(groups, vals):
    for k in groups.keys():
        for v in vals:
            if v in groups[k]:
                return k
    return 0

task = [ [1, "AAA"], [1, "BBB"], [3, "CCC"], [4, "DDD"], [5, "JJJ"], [6, "AAA"], [6, "JJJ"], [6, "CCC"], [9, "OOO"], [10, "OOO"], [10, "DDD"], [11, "LLL"], [12, "KKK"] ]
ptrs = {}
groups = {}
group_id = 1

printList(task, "Initial table")

for i in range(0, len(task)):
    itask = task[i]
    resp = itask[1]
    val = [x[0] for x in task if x[1] == resp]
    minval = min(val)
    for v in val:
        if not v in ptrs.keys(): ptrs[v] = minval
    myGroup = find_group(groups, val)
    if myGroup == 0:
        groups[group_id] = list(set(val))
        myGroup = group_id
        group_id += 1
    else:
        groups[myGroup].extend(val)
        groups[myGroup] = list(set(groups[myGroup]))
    itask.append(myGroup)
    task[i] = itask

print()
printList(task, "Result table")
You can groupby 'Name' and keep the first Group:
df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3], 'Name': ['A', 'B', 'R', 'F', 'B', 'C']})
df2 = df.groupby('Name').first().reset_index()
Then merge with the original data-frame and drop duplicates of the original group:
df3 = df.merge(df2, on='Name', how='left')
df3 = df3[['Group_x', 'Group_y']].drop_duplicates('Group_x')
df3.columns = ['Group', 'ResultGroup']
One more merge will give you the result:
df.merge(df3, on='Group', how='left')
Group Name ResultGroup
1 A 1
1 B 1
2 R 2
2 F 2
3 B 1
3 C 1
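Note that if the chains are longer (say group 1 shares a name with group 3, which shares a different name with group 5), a single merge will not propagate the label all the way; the general problem is connected components. A sketch of that approach using networkx (an assumption, not part of the answer above), which should scale to 200k rows:

import pandas as pd
import networkx as nx

df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3],
                   'Name': ['A', 'B', 'R', 'F', 'B', 'C']})

# Bipartite graph: each Group node is linked to its Name nodes, so groups
# sharing a Name end up in the same connected component
G = nx.Graph()
G.add_edges_from((('g', g), ('n', n)) for g, n in zip(df['Group'], df['Name']))

# Label every group with the smallest group number in its component
mapping = {}
for comp in nx.connected_components(G):
    comp_groups = sorted(node[1] for node in comp if node[0] == 'g')
    for g in comp_groups:
        mapping[g] = comp_groups[0]

df['ResultGroup'] = df['Group'].map(mapping)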