Algorithm for grouping records - python

I have a table that looks like this:
Group Name
1 A
1 B
2 R
2 F
3 B
3 C
I need to group these records by the following rule:
If a group contains at least one Name that also appears in another group, the two groups belong to the same result group. In my case, group 1 contains A and B, and group 3 contains B and C. They share the name B, so they must end up in the same group.
As a result I want to get something like this:
Group Name ResultGroup
1 A 1
1 B 1
2 R 2
2 F 2
3 B 1
3 C 1
I already found a solution, but my table has about 200k records, so it takes too much time (more than 12 hours). Is there a way to optimize it, maybe using pandas or something similar?
def printList(l, head=""):
    print(head)
    for i in l:
        print(i)

def find_group(groups, vals):
    # return the id of the first group that already holds one of vals, else 0
    for k in groups.keys():
        for v in vals:
            if v in groups[k]:
                return k
    return 0

task = [ [1, "AAA"], [1, "BBB"], [3, "CCC"], [4, "DDD"], [5, "JJJ"], [6, "AAA"], [6, "JJJ"], [6, "CCC"], [9, "OOO"], [10, "OOO"], [10, "DDD"], [11, "LLL"], [12, "KKK"] ]
ptrs = {}
groups = {}
group_id = 1
printList(task, "Initial table")
for i in range(0, len(task)):
    itask = task[i]
    resp = itask[1]
    # rescans the whole table for every row, which is what makes this quadratic
    val = [ x[0] for x in task if x[1] == resp ]
    minval = min(val)
    for v in val:
        if not v in ptrs.keys(): ptrs[v] = minval
    myGroup = find_group(groups, val)
    if(myGroup == 0):
        groups[group_id] = list(set(val))
        myGroup = group_id
        group_id += 1
    groups[myGroup] = list(set(groups[myGroup]))
    task[i] = itask
printList(task, "Result table")

You can groupby 'Name' and keep the first Group:
df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3], 'Name': ['A', 'B', 'R', 'F', 'B', 'C']})
df2 = df.groupby('Name').first().reset_index()
Then merge with the original data-frame and drop duplicates of the original group:
df3 = df.merge(df2, on='Name', how='left')
df3 = df3[['Group_x', 'Group_y']].drop_duplicates('Group_x')
df3.columns = ['Group', 'ResultGroup']
One more merge will give you the result:
df.merge(df3, on='Group', how='left')
Group Name ResultGroup
1 A 1
1 B 1
2 R 2
2 F 2
3 B 1
3 C 1
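
If groups can be linked through chains of shared names (group 1 shares a name with group 2, which shares a different name with group 3), the task becomes a connected-components search, and as far as I can tell the merge-based approach only follows the first name of each group. The sketch below uses a small union-find over the group ids. `result_groups`, `find` and `union` are hypothetical names, not part of the answer above; the column names `Group` and `Name` are taken from the question. It makes a single pass over the rows, so it should scale to 200k records much better than the nested loops, though I have not timed it.

```python
import pandas as pd


def result_groups(df):
    """Label each row with the smallest Group id of its connected component."""
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as root so the label is the minimum group id.
            if ra < rb:
                parent[rb] = ra
            else:
                parent[ra] = rb

    first_group_for_name = {}
    for group, name in zip(df["Group"], df["Name"]):
        find(group)  # register the group even if it shares no name
        if name in first_group_for_name:
            union(first_group_for_name[name], group)
        else:
            first_group_for_name[name] = group

    return df.assign(ResultGroup=[find(g) for g in df["Group"]])


df = pd.DataFrame({"Group": [1, 1, 2, 2, 3, 3],
                   "Name": ["A", "B", "R", "F", "B", "C"]})
labelled = result_groups(df)

# A chain: 1 and 2 share B, 2 and 3 share C, so all three collapse into 1.
chain = pd.DataFrame({"Group": [1, 1, 2, 2, 3, 3],
                      "Name": ["A", "B", "B", "C", "C", "D"]})
chain_labelled = result_groups(chain)
```

Keeping the smaller id as the root is only a labelling choice; any consistent representative per component would satisfy the rule in the question.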


how to apply multiplication within pandas dataframe

Please advise how to get the following output:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([['1, 2', '2, 2', '3, 2', '1, 1', '2, 1', '3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
# relabel rows and columns so that they start at 1 instead of 0
df22 = df2.rename(index=lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), axis=1)
# look up the value stored at each "row, col" address
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print(df)
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 holds 'addresses' into df2 in row, col format: 1,2 is the first row, second column, which is 2; 2,2 is 4; 3,2 is 6; and so on.
I need to bring in the values from the 3rd and 4th columns to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the share of y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear in your question why df1 is a Dataframe or if it can have more than one row, therefore I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
    i, j = map(int, c.split(','))
    df22.loc[i, 'val'] += df22.loc[i, j]*df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.
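
For reference, here is a self-contained sketch of the address-based calculation that works on the original zero-based `df2` without relabelling it. The names `addresses`, `totals`, `grand_total` and `y_share` are my own, and I am assuming the third column is always the multiplier and the fourth the label, as in the example.

```python
import pandas as pd

df1 = pd.DataFrame([['1, 2', '2, 2', '3, 2', '1, 1', '2, 1', '3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])

# Parse every "row, col" string (1-based) into a pair of integers.
addresses = [tuple(int(part) for part in cell.split(','))
             for cell in df1.to_numpy().ravel()]

# Accumulate value * multiplier per label (column 3 and 4 of df2, 1-based).
totals = {}
for row, col in addresses:
    value = df2.iat[row - 1, col - 1]
    multiplier = df2.iat[row - 1, 2]
    label = df2.iat[row - 1, 3]
    totals[label] = totals.get(label, 0) + value * multiplier

grand_total = sum(totals.values())      # 5000 for the example data
y_share = totals['y'] / grand_total     # 1400 / 5000
```

Addresses that are missing from df1 simply never contribute, which matches the behaviour of the loop above.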

Given a dataframe, how do I bucket columns according to their names and merge columns in the same bucket into one?

Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x, d,f,g into y, e,h,i into z and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
a b c d e f g h i j
0 1 - - - 2 - 4 3 - -
1 - 6 - 0 4 - - - - 2
2 - 3 2 - - - - 1 - 9
The output will be:
x y z j
0 1 4 [2,3] -
1 6 0 4 2
2 [3,2] - 1 9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
One way is to create a dictionary with your mappings, rename the columns with it, stack, apply a groupby aggregation, and unstack back to the original shape.
I couldn't see any logic in your mappings, so I'm afraid it will have to be a manual operation.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map({i: x for x, y in buckets.items() for i in y})
out = df.stack().groupby(level=[0,1]).agg(list).unstack(1)[list(buckets.keys())]
x y z j
0 [1] [4] [2, 3] NaN
1 [6] [0] [4] [2]
2 [3, 2] NaN [1] [9]
First create the dict for the mapping, then groupby:
d = {'a':'x','b':'x','c':'x','d':'y','f':'y','g':'y','e':'z','h':'z','i':'z','j':'j'}
out = df.groupby(d,axis=1).agg(lambda x : [y[y!='-']for y in x.values])
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
Starting with a very basic approach, let's define our buckets, iterate over them, and then clean up:
buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}
def clean(val):
    # drop the NaN entries from one row of a bucket
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    return val
new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    new_df[new_col] = [clean(row) for row in df[old_cols].values.tolist()]
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row on a dataframe (here, a sub-dataframe, if you will, since we'll get the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd
bucket_cols = [["a", "b", "c"], ["d", "f", "g"], ["e", "h", "i"], ["j"]]
bucket_names = ["x", "y", "z", "j"]
buckets = {}
def bucket_rows(row):
    row = row.dropna().to_list()  # pd.Series.dropna removes the NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # return the bucketed row
    return row
# loop through the buckets and perform the bucketing operation
for name, cols in zip(bucket_names, bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[name] = bucket
# create the bucketed df from the buckets dict
df_bucketted = pd.DataFrame(buckets)
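
To check the bucketing against the example in the question, here is a self-contained sketch that rebuilds the input frame and collapses each bucket to a scalar, a list, or NaN. The helper name `collapse` and the variable `out` are hypothetical; the bucket mapping is the one stated in the question.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[1, nan, nan, nan, 2, nan, 4, 3, nan, nan],
     [nan, 6, nan, 0, 4, nan, nan, nan, nan, 2],
     [nan, 3, 2, nan, nan, nan, nan, 1, nan, 9]],
    columns=list("abcdefghij"),
)
buckets = {"x": ["a", "b", "c"], "y": ["d", "f", "g"],
           "z": ["e", "h", "i"], "j": ["j"]}


def collapse(values):
    """Return NaN for no values, the value itself for one, a list otherwise."""
    kept = [v for v in values if not pd.isna(v)]
    if not kept:
        return np.nan
    return kept[0] if len(kept) == 1 else kept


# Build one object-dtype column per bucket so scalars and lists can coexist.
out = pd.DataFrame({
    name: pd.Series([collapse(row) for row in df[cols].to_numpy().tolist()],
                    index=df.index, dtype=object)
    for name, cols in buckets.items()
})
```

This still loops over rows in Python, so it is a readability sketch rather than a fast path; for wide frames the stack and groupby approach is likely to do less Python-level work.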

Balanced row sample from dataframe with pandas given categorical target column

Given a dataframe my goal is to sample rows such that values in one column are as balanced as possible.
Say I have a dataframe below, the sample size is 3 and target column is c
a | b | c
1 | 2 | 0
3 | 4 | 0
5 | 6 | 1
7 | 8 | 2
9 | 10| 2
11| 12| 2
One possible sample would be:
a | b | c
1 | 2 | 0
5 | 6 | 1
7 | 8 | 2
If the sample size is not a multiple of the number of unique classes, a difference of one item or so is fine.
How would I approach this in pandas?
EDIT: I posted the solution that worked for me in the answers.
I first generate a sample size for each unique value of column c so that the sample is balanced. The remainder is distributed over the first few classes (k is the sample size):
unique_values = df['c'].unique()
sample_sizes = [k // len(unique_values)] * len(unique_values)
i = 0
while i < k % len(unique_values):
    sample_sizes[i] += 1
    i += 1
This bit generates the samples based on the generated sample sizes
df2= pd.concat([df.loc[df['c'] == unique_values[i]].sample() for i in range(len(sample_sizes)) for j in range(sample_sizes[i])])
You can just get a random sample of the dataframe based on the minimum count of the target column.
column = 'c'
df = df.groupby(column).sample(n=df[column].value_counts().min(), random_state=42)
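
As a quick check of that idea on the frame from the question, the sketch below downsamples every class to the size of the rarest one. The variable names `n_per_class` and `balanced` are mine, and `GroupBy.sample` needs pandas 1.1 or newer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11],
                   'b': [2, 4, 6, 8, 10, 12],
                   'c': [0, 0, 1, 2, 2, 2]})

# The rarest class (c == 1) has a single row, so every class contributes one row.
n_per_class = df['c'].value_counts().min()
balanced = df.groupby('c').sample(n=n_per_class, random_state=42)
```

Note that this fixes the sample size at (number of classes) x (rarest class count), so it does not take an arbitrary sample size.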
First, we create your example dataframe:
import pandas as pd
columns = ['a', 'b', 'c']
data = [[1, 2, 0], [3, 4, 0], [5, 6, 1], [7, 8, 2], [9, 10, 2], [11, 12, 2]]
df = pd.DataFrame(data=data, columns=columns)
Now, with the following function you can do what you want
def balanced_sample(dataframe, sample_size, target_column):
    # extract the existing classes
    target_columns_values = dataframe.loc[:, target_column].unique().tolist()
    # count the number of classes
    target_columns_unique_classes_size = len(target_columns_values)
    # check whether the sample size is a multiple of the number of classes
    if sample_size % target_columns_unique_classes_size != 0:
        print('Sample size is not a multiple of the number of unique classes')
    # allow a difference of one item or so
    instances_per_class = round(sample_size / target_columns_unique_classes_size)
    # another possibility is to use
    # sample_size // target_columns_unique_classes_size instead of round(...),
    # but then instances_per_class will always be <=
    # sample_size / target_columns_unique_classes_size
    # check that there are enough examples per class
    values_per_class = dataframe.loc[:, target_column].value_counts()
    for idx in values_per_class.index:
        if instances_per_class > values_per_class[idx]:
            print('Class {} has only {} example, so it is impossible to use {} '
                  'sample size, i.e., {} per class'.format(idx, values_per_class[idx],
                                                           sample_size, instances_per_class))
            return pd.DataFrame(columns=dataframe.columns)
    # create the result dataframe
    data = []
    for classes in target_columns_values:
        class_values = dataframe[dataframe.loc[:, target_column] == classes]
        # assumed completion of the truncated loop: sample this class and collect its rows
        data.extend(class_values.sample(n=instances_per_class).values.tolist())
    result_dataframe = pd.DataFrame(columns=dataframe.columns, data=data)
    return result_dataframe
Now we check the function:
And with other options:
I hope you find it useful. If you have any questions, leave a comment here and I will try to answer.
The question is a bit ambiguous, but let's say you want to randomly select one row for each category in column c. You could do:
import pandas as pd
data = [
    [1, 2, 0], [1, 4, 0], [2, 2, 1],
    [4, 5, 1], [3, 7, 2], [3, 3, 2],
    [1, 2, 6], [3, 2, 6], [5, 2, 6]
]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
sample = df.groupby('c').apply(lambda x: x.sample(n=1).squeeze())
a b c
0 1 4 0
1 2 2 1
2 3 3 2
6 1 2 6
I am posting the solution that works for me. It is not the most beautiful or efficient code. But that's honest work.
df = pd.read_csv(path)
target_col = 't'
unique_values = df[target_col].unique()
k = 8  # sample size
per_class_sample_size = int(k/unique_values.shape[0])
arr_samples_per_class = [0] * len(unique_values)
leftover = k - (per_class_sample_size * len(unique_values))
for i, v in enumerate(unique_values):
    occ = df[df[target_col] == v].shape[0]
    if leftover > 0 and occ > per_class_sample_size:
        # hand one leftover slot to a class that has rows to spare
        sz = per_class_sample_size + 1
        leftover -= 1
    else:
        sz = per_class_sample_size if occ >= per_class_sample_size else occ
    arr_samples_per_class[i] = sz
fdf = None
for v, sz in zip(unique_values, arr_samples_per_class):
    ss = df.loc[df[target_col] == v].sample(sz)
    fdf = ss if fdf is None else pd.concat([fdf, ss], axis=0)
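
The quota logic above can also be written as a round-robin hand-out, which keeps the class counts within one of each other as long as no class runs out of rows. This is a sketch under my own assumptions: `round_robin_sample` is a hypothetical helper, the `seed` parameter is mine, and the frame is the one from the question.

```python
import pandas as pd


def round_robin_sample(df, k, target, seed=0):
    """Sample up to k rows, handing out one slot per class in turn."""
    counts = df[target].value_counts()
    quota = {cls: 0 for cls in counts.index}
    remaining = k
    while remaining > 0:
        progressed = False
        for cls in quota:
            if remaining == 0:
                break
            # Skip classes that have no unsampled rows left.
            if quota[cls] < counts.loc[cls]:
                quota[cls] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # k is larger than the number of rows in df
    parts = [df[df[target] == cls].sample(n=n, random_state=seed)
             for cls, n in quota.items() if n > 0]
    return pd.concat(parts) if parts else df.iloc[0:0]


df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11],
                   'b': [2, 4, 6, 8, 10, 12],
                   'c': [0, 0, 1, 2, 2, 2]})
sample_3 = round_robin_sample(df, 3, 'c')
sample_5 = round_robin_sample(df, 5, 'c')
```

Because `value_counts` lists the most frequent class first, any extra slots go to the classes with the most rows to spare.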

Take the difference of all elements of a series with the previous ones in python pandas

I have a dataframe with sorted values labeled by ids and I want to take the difference of the value for the first element of an id with the value of the last elements of the all previous ids. The code below does what I want:
import pandas as pd
a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
                  columns=['id', 'value'])
# # take the last value for a particular id
# last_value_for_id = df.loc[df.id != df.id.shift(-1), :]
# print(last_value_for_id)
current_id = ''; prev_values = {}; diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id != t.id:
        current_id = t.id
    else: continue
    for k, v in prev_values.items():
        if k == current_id: continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))
id value
0 a 1
1 a 2
2 a 3
3 b 5
4 b 6
5 c 7
6 a 8
a b 2
c 4
b c 1
a 2
c a 1
I want to do this in a vectorized manner, however. I have found a way of getting the series of last elements:
# take the last value for a particular id
last_value_for_id = df.loc[df.id != df.id.shift(-1), :]
which gives me:
id value
2 a 3
4 b 6
5 c 7
but I can't find a way of using this to take the diffs in a vectorized manner.
Depending on how many ids you have, this works with a few thousand:
import numpy as np
# enumerate the ids (be careful here)
ids = [a, b, c]
num_ids = len(ids)
# compute first and last
f = df.groupby('id').value.agg(['first', 'last'])
# lower triangle mask
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])
# compute diff of first and last, then mask
diff = np.where(mask, None, f['first'].to_numpy()[None, :] - f['last'].to_numpy()[:, None])
diff = pd.DataFrame(diff,
                    index=ids,
                    columns=ids)
# stack
diff.stack()
a b 2
c 4
b c 1
dtype: object
Edit for updated data:
For the updated data, the approach is similar if we can create the f table:
# create blocks of consecutive id
blocks = df['id'].ne(df['id'].shift()).cumsum()
# groupby
groups = df.groupby(blocks)
# create first and last values
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')
# the above f and ids
# note the column name change
f = df[['id','fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)
a b 2
c 4
a 5
b c 1
a 2
c a 1
dtype: object
If you want to go further and drop the index (a,a), well, I'm so lazy :D.
My method
Everything below is just reshaping to match your output:
t[np.triu_indices(t.shape[1], 0)] = np.nan
first first
a b 2.0
c 4.0
b c 1.0
a 2.0
c a 1.0
dtype: float64
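
Putting the pieces together for the updated data (where `a` reappears at the end), here is one possible end-to-end sketch. It is my own combination of the block trick and the broadcasting idea above, not code from either answer; the names `runs`, `pairs`, `prev_id`, `cur_id` and `cur_run` are hypothetical. It builds all run pairs, so memory grows with the square of the number of runs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list('aaabbca'), 'value': [1, 2, 3, 5, 6, 7, 8]})

# Number the runs of consecutive identical ids: a, b, c, a -> 1, 2, 3, 4.
run = df['id'].ne(df['id'].shift()).cumsum().rename('run')
runs = df.groupby(run).agg(id=('id', 'first'),
                           first=('value', 'first'),
                           last=('value', 'last'))

ids = runs['id'].to_numpy()
first = runs['first'].to_numpy()
last = runs['last'].to_numpy()

# All pairs (i, j) with run i strictly before run j, skipping equal ids.
i, j = np.triu_indices(len(runs), k=1)
keep = ids[i] != ids[j]
pairs = pd.DataFrame({
    'prev_id': ids[i][keep],
    'cur_id': ids[j][keep],
    'cur_run': j[keep],
    'diff': first[j][keep] - last[i][keep],
})

# For each current run keep only the most recent earlier run of every other id.
pairs = pairs.drop_duplicates(['cur_run', 'prev_id'], keep='last')
```

On the example this yields the five differences printed by the loop in the question, in the same order.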

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. I also have a few lists containing index values sampled according to certain criteria. Now I want to create columns with labels based on whether or not the index of a given row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This works, but it is quite slow. Each call takes about 3 minutes, and since I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
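
If more label levels are added later, the nested numpy.where calls can be flattened with numpy.select, which checks the conditions in order and takes the first match. This is an alternative sketch, not part of the answer above; `idx_1_model` is given made-up values here, and the other lists reuse the sample values from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(10)})

idx_1_model = [1, 3, 5]          # made-up values for illustration
random_m, random_y = [0, 1], [2, 3]
model_m, model_y = [7, 4], [5, 6]

# One list: a single vectorized membership test.
df['1_name'] = np.where(df.index.isin(idx_1_model), 'A', 'B')

# Several lists: the first condition that is True decides the label.
conditions = [
    df.index.isin(random_m + model_m),
    df.index.isin(random_y + model_y),
]
df['2_name'] = np.select(conditions, ['A', 'B'], default='not_assigned')
```

Both versions replace a per-row Python membership test on a list with a single `Index.isin` call per condition, which is where the speed-up should come from.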

