I have one dataset with 7 variables and 5 indicators (df1):
   A   B   C   D  E   F  G R_1 R_2  R_3  R_4  R_5
0  4  16   5   7  1  12  9   B   C    D    F    A
1  8   4  10  14  4   5  9   B   E    A  NaN  NaN
A second key-value dataset (df2) gives the cut-off value for each variable:
  Variable  Value
0        A     11
1        B     15
2        C     22
3        D     25
4        E      3
5        F     14
6        G     15
I want to add another 5 columns, R_new_1 to R_new_5, built on this condition:
if R_1 = B and the value of B in df1 (16) > 15 (B's cut-off from df2), then B is dropped and the remaining indicators shift left:
df1['R_new_1'] = "C" (from R_2)
df1['R_new_2'] = "D" (from R_3)
df1['R_new_3'] = "F" (from R_4)
df1['R_new_4'] = "A" (from R_5)
df1['R_new_5'] = np.nan
The same check then repeats for each remaining value (now in R_new_1, R_new_2, ...), and so on.
  R_new_1 R_new_2 R_new_3 R_new_4 R_new_5
0       C       D       F       A     NaN
1       B       A     NaN     NaN     NaN
I have tried the following to automate this:
var_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
for col in var_list:
    df1[col + "_val"] = df2.loc[df2['Variable'] == col, 'Value'].iloc[0]
for col in var_list:
    # elementwise comparison; a plain `if` on a whole Series raises ValueError
    df1[col + "_ind"] = np.where(df1[col + "_val"] > df1[col], "OK", "NOK")
The first run checks for R_1. I am unable to recursively replace B with C in R_new_1, C with D in R_new_2, D with F in R_new_3, and F with A in R_new_4, and then continue checking for R_new_2 and so on.
import numpy as np  # needed for np.nan below
import pandas as pd
## dataframe constructors
# input
data1 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9, 'R_1':'B', 'R_2':'C', 'R_3':'D', 'R_4':'F', 'R_5':'A'},
         {'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9, 'R_1':'B', 'R_2':'E', 'R_3':'A', 'R_4':np.nan, 'R_5':np.nan}]
df1 = pd.DataFrame(data1)
# input
data2 = [['A', 11], ['B', 15], ['C', 22], ['D', 25], ['E', 3], ['F', 14], ['G', 15]]
df2 = pd.DataFrame(data2, columns=['Variable', 'Value'])
# desired output
data3 = [{'A': 4, 'B': 16, 'C': 5, 'D': 7, 'E': 1, 'F': 12, 'G': 9, 'R_1':'B', 'R_2':'C', 'R_3':'D', 'R_4':'F', 'R_5':'A', 'R_new_1':'C', 'R_new_2':'D', 'R_new_3':'F', 'R_new_4':'A', 'R_new_5':np.nan},
         {'A': 8, 'B': 4, 'C': 10, 'D': 14, 'E': 4, 'F': 5, 'G': 9, 'R_1':'B', 'R_2':'E', 'R_3':'A', 'R_4':np.nan, 'R_5':np.nan, 'R_new_1':'B', 'R_new_2':'A', 'R_new_3':np.nan, 'R_new_4':np.nan, 'R_new_5':np.nan}]
df3 = pd.DataFrame(data3)
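One way to produce df3 from df1 and df2 under the rule described above is a row-wise filter: map each variable to its cut-off, keep only the indicator labels whose measured value is within the cut-off, and pad the rest with NaN. A minimal sketch (shift_out_exceeding is my own helper name, not from the question):

cutoff = dict(zip(df2['Variable'], df2['Value']))  # variable -> cut-off value

def shift_out_exceeding(row, n=5):
    # keep a label only if this row's measured value does not exceed its cut-off
    kept = [lab for lab in row[[f'R_{i}' for i in range(1, n + 1)]]
            if pd.notna(lab) and row[lab] <= cutoff[lab]]
    kept += [np.nan] * (n - len(kept))  # pad back to n slots
    return pd.Series(kept, index=[f'R_new_{i}' for i in range(1, n + 1)])

result = df1.join(df1.apply(shift_out_exceeding, axis=1))

On the sample data, result matches df3 above.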
I have a large 1D NumPy array a of some comparable dtype; some of its elements may be repeated.
How do I find the sorting indexes ix that will stable-sort (stability in the sense described here) a by the frequencies of its values, in descending or ascending order?
I want the fastest and simplest way to do this. Maybe there is an existing standard NumPy function for it.
There is another related question here, but it asked specifically to remove duplicates, i.e. to output only unique sorted values; I need all values of the original array, including duplicates.
I've coded my first attempt at the task, but it is not the fastest (it uses a Python loop) and probably not the shortest/simplest form possible. The Python loop can be very expensive when equal elements repeat rarely and the array is huge. It would also be nice to have a short function that does all of this, if one is available in NumPy (e.g. an imaginary np.argsort_by_freq()).
Try it online!
import numpy as np
np.random.seed(1)
hi, n, desc = 7, 24, True
a = np.random.choice(np.arange(hi), (n,), p = (
    lambda p = np.random.random((hi,)): p / p.sum()
)())
us, cs = np.unique(a, return_counts = True)
af = np.zeros(n, dtype = np.int64)
for u, c in zip(us, cs):
    af[a == u] = c  # frequency of each element's value
if desc:
    ix = np.argsort(-af, kind = 'stable')  # Descending sort
else:
    ix = np.argsort(af, kind = 'stable')  # Ascending sort
print('rows: i_col(0) / original_a(1) / freqs(2) / sorted_a(3)')
print(' / sorted_freqs(4) / sorting_ix(5)')
print(np.stack((
    np.arange(n), a, af, a[ix], af[ix], ix,
), 0))
outputs:
rows: i_col(0) / original_a(1) / freqs(2) / sorted_a(3)
/ sorted_freqs(4) / sorting_ix(5)
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 1 1 1 1 3 0 5 0 3 1 1 0 0 4 6 1 3 5 5 0 0 0 5 0]
[ 7 7 7 7 3 8 4 8 3 7 7 8 8 1 1 7 3 4 4 8 8 8 4 8]
[ 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 5 5 5 5 3 3 3 4 6]
[ 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 4 4 4 4 3 3 3 1 1]
[ 5 7 11 12 19 20 21 23 0 1 2 3 9 10 15 6 17 18 22 4 8 16 13 14]]
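As an aside, for this particular example, where a holds small non-negative integers, the frequency array can be built without the Python loop (a sketch assuming that dtype; the general case is handled in the answers below):

af = np.bincount(a)[a]  # count of each value, broadcast back to every element
ix = np.argsort(-af if desc else af, kind = 'stable')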
I might be missing something, but it seems that with a Counter you can sort the indexes of each element according to the count of that element's value, using the element value and then the index to break ties. For example:
from collections import Counter
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
t = [(counts[v], v, i) for i, v in enumerate(a)]
t.sort()
print([v[2] for v in t])
t.sort(reverse=True)
print([v[2] for v in t])
Output:
[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[23, 21, 20, 19, 12, 11, 7, 5, 15, 10, 9, 3, 2, 1, 0, 22, 18, 17, 6, 16, 8, 4, 14, 13]
If you want to maintain ascending order of indexes within groups with equal counts, you can just use a lambda function for the descending sort:
t.sort(key = lambda x:(-x[0],-x[1],x[2]))
print([v[2] for v in t])
Output:
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 14, 13]
If you want elements with the same count to keep the order in which they first appeared in the array, then rather than sorting on the values, sort on the index of their first occurrence in the array:
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
idxs = {}
t = []
for i, v in enumerate(a):
    if v not in idxs:
        idxs[v] = i  # index of the first occurrence of this value
    t.append((counts[v], idxs[v], i))
t.sort()
print([v[2] for v in t])
t.sort(key = lambda x: (-x[0], x[1], x[2]))
print([v[2] for v in t])
Output:
[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 13, 14]
To sort according to count, and then position in the array, you don't need the value or the first index at all:
from collections import Counter
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
t = [(counts[v], i) for i, v in enumerate(a)]
t.sort()
print([v[1] for v in t])
t.sort(key = lambda x:(-x[0],x[1]))
print([v[1] for v in t])
This produces the same output as the prior code for the sample data. For your string array:
a = ['g', 'g', 'c', 'f', 'd', 'd', 'g', 'a', 'a', 'a', 'f', 'f', 'f',
'g', 'f', 'c', 'f', 'a', 'e', 'b', 'g', 'd', 'c', 'b', 'f' ]
it produces:
[18, 19, 23, 2, 4, 5, 15, 21, 22, 7, 8, 9, 17, 0, 1, 6, 13, 20, 3, 10, 11, 12, 14, 16, 24]
[3, 10, 11, 12, 14, 16, 24, 0, 1, 6, 13, 20, 7, 8, 9, 17, 2, 4, 5, 15, 21, 22, 19, 23, 18]
I just figured out what is probably a very fast solution for any dtype, using just NumPy functions and no Python looping; it works in O(N log N) time. The NumPy functions used are np.unique, np.argsort, and array indexing.
Although it wasn't asked for in the original question, I implemented an extra flag equal_order_by_val. If it is False, array elements with the same frequency are sorted as one stable range, so the output can contain c d d c d c as in the dumps below, because that is the order in which the elements of equal frequency appear in the original array. When the flag is True, such elements are additionally sorted by the value in the original array, resulting in c c c d d d. In other words, with False we sort stably just by the key freq, and with True we sort by (freq, value) for ascending order and by (-freq, value) for descending order.
Try it online!
import string, math
import numpy as np
np.random.seed(0)
# Generating input data
hi, n, desc = 7, 25, True
letters = np.array(list(string.ascii_letters), dtype = np.object_)[:hi]
a = np.random.choice(letters, (n,), p = (
    lambda p = np.random.random((letters.size,)): p / p.sum()
)())
for equal_order_by_val in [False, True]:
    # Solving task
    us, ui, cs = np.unique(a, return_inverse = True, return_counts = True)
    af = cs[ui]  # frequency of each element's value
    sort_key = -af if desc else af
    if equal_order_by_val:
        # pack (freq, value rank) into a single integer key
        shift_bits = max(1, math.ceil(math.log(us.size) / math.log(2)))
        sort_key = ((sort_key.astype(np.int64) << shift_bits) +
            np.arange(us.size, dtype = np.int64)[ui])
    ix = np.argsort(sort_key, kind = 'stable')  # Do sorting itself
    # Printing results
    print('\nequal_order_by_val:', equal_order_by_val)
    for name, val in [
        ('i_col', np.arange(n)), ('original_a', a),
        ('freqs', af), ('sorted_a', a[ix]),
        ('sorted_freqs', af[ix]), ('sorting_ix', ix),
    ]:
        print(name.rjust(12), ' '.join([str(e).rjust(2) for e in val]))
outputs:
equal_order_by_val: False
i_col 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
original_a g g c f d d g a a a f f f g f c f a e b g d c b f
freqs 5 5 3 7 3 3 5 4 4 4 7 7 7 5 7 3 7 4 1 2 5 3 3 2 7
sorted_a f f f f f f f g g g g g a a a a c d d c d c b b e
sorted_freqs 7 7 7 7 7 7 7 5 5 5 5 5 4 4 4 4 3 3 3 3 3 3 2 2 1
sorting_ix 3 10 11 12 14 16 24 0 1 6 13 20 7 8 9 17 2 4 5 15 21 22 19 23 18
equal_order_by_val: True
i_col 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
original_a g g c f d d g a a a f f f g f c f a e b g d c b f
freqs 5 5 3 7 3 3 5 4 4 4 7 7 7 5 7 3 7 4 1 2 5 3 3 2 7
sorted_a f f f f f f f g g g g g a a a a c c c d d d b b e
sorted_freqs 7 7 7 7 7 7 7 5 5 5 5 5 4 4 4 4 3 3 3 3 3 3 2 2 1
sorting_ix 3 10 11 12 14 16 24 0 1 6 13 20 7 8 9 17 2 15 22 4 5 21 19 23 18
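As a side note, np.lexsort can express the same (freq, value) composite key without the bit-packing trick, because it applies its keys with stable sorts. A sketch (argsort_by_freq echoes the imaginary function named in the question):

import numpy as np

def argsort_by_freq(a, desc = True, equal_order_by_val = False):
    # per-element frequency, with no Python loop
    us, ui, cs = np.unique(a, return_inverse = True, return_counts = True)
    af = cs[ui]
    key = -af if desc else af
    if equal_order_by_val:
        # the last lexsort key is primary: frequency first, then value rank;
        # stability keeps the original order on full ties
        return np.lexsort((ui, key))
    return np.argsort(key, kind = 'stable')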
Which is the best (fastest) way to perform a union between sets from different rows in the same column of a DataFrame?
For example for the following dataframe:
df_input=pd.DataFrame([[1,{1,2,3}],[1,{11,12}],[2,{1111,2222}],[2,{0,99}]], columns=['name', 'set'])
   name           set
0     1     {1, 2, 3}
1     1      {11, 12}
2     2  {2222, 1111}
3     2       {0, 99}
I would like to get:
   name                  set
0     1    {1, 2, 3, 11, 12}
1     2  {0, 99, 2222, 1111}
And in case I have two columns with different sets, how can I join both columns?
For example, for this dataframe:
df_input=pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])
   name          set1    set2
0     1     {1, 2, 3}  {b, a}
1     1      {11, 12}     {j}
2     2  {2222, 1111}  {m, n}
3     2       {0, 99}     {p}
I am looking for a way to get this as output:
   name                 set1       set2
0     1    {1, 2, 3, 11, 12}  {b, j, a}
1     2  {0, 99, 2222, 1111}  {m, p, n}
Thank you.
I am really not very knowledgeable in Pandas, and I'm sure there's a better way (if you have time, you should probably wait for a better answer), but something like this seems to do the trick:
import pandas as pd
df_input = pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])
new = pd.DataFrame()
for name, agg_df in df_input.groupby('name'):
    data = {
        'name': name,
        'set1': set(),
        'set2': set(),
    }
    # apply is used here only for its side effect of updating the accumulator sets
    agg_df['set1'].apply(lambda c: data['set1'].update(c))
    agg_df['set2'].apply(lambda c: data['set2'].update(c))
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    new = new.append(data, ignore_index=True)
print(new.head())
prints:
   name                 set1       set2
0   1.0    {1, 2, 3, 11, 12}  {b, j, a}
1   2.0  {0, 99, 2222, 1111}  {p, n, m}
There is more Python syntactic sugar you could certainly use, but that's not really pandas...
import pandas as pd
df_input = pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])
SET_COLUMNS = ('set1', 'set2')
new = pd.DataFrame()
for name, agg_df in df_input.groupby('name'):
    data = {**{'name': name}, **{set_col: set() for set_col in SET_COLUMNS}}
    for set_col in SET_COLUMNS:
        agg_df[set_col].apply(lambda c: data[set_col].update(c))
    new = new.append(data, ignore_index=True)
print(new.head())
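For what it's worth, a groupby/agg reduction can replace the explicit loop entirely; set().union(*s) folds a whole column of sets into one set. A sketch:

out = (df_input.groupby('name', as_index=False)
       .agg({col: (lambda s: set().union(*s)) for col in ('set1', 'set2')}))
print(out)

This also keeps name as an integer rather than the float that append produces.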