Split "weighted" list/array into equal size chunks

I have an array of items with a weight assigned to each item. I want to split it into equally sized chunks of approximately equal cumulative weight. There is an answer here that does this using numpy: https://stackoverflow.com/a/33555976/10690958
Is there a simple way to accomplish this using pure Python?
Example array:
[ ['bob', 12],
  ['jack', 6],
  ['jim', 33],
  ....
]
or
a, 11
b, 2
c, 5
d, 3
e, 3
f, 2
Here the correct output would be (assuming 2 chunks needed)
[a,11],[b,2] - cumulative weight of 13
and
[c,5],[d,3],[e,3],[f,2] - cumulative weight of 13
To further clarify the question, imagine sorting 100 people into 10 elevators, where we want each elevator to carry approximately the same total weight (the sum of the weights of all people in that elevator). The first list would then hold names and weights. It's a load-balancing problem.

You just have to mimic cumsum: build a list of running sums of the weights; its last entry is the total weight. Then scan the list of cumulative weights and start a new chunk each time you reach the next multiple of total_weight/number_of_chunks. The code could be:
def split(w_list, n_chunks):
    # mimic a cumsum
    s = [0, []]
    for i in w_list:
        s[0] += i[1]
        s[1].append(s[0])
    # now scan the lists populating the chunks
    index = 0
    splitted = []
    stop = 0
    chunk = 1
    stop = s[0] / n_chunks
    for i in range(len(w_list)):
        # print(stop, s[1][i])  # uncomment for traces
        if s[1][i] >= stop:  # reached a stop?
            splitted.append(w_list[index:i+1])  # register a new chunk
            index = i+1
            chunk += 1
            if chunk == n_chunks:  # ok we can stop
                break
            stop = s[0] * chunk / n_chunks  # next stop
    splitted.append(w_list[index:])  # do not forget last chunk
    return splitted
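
The same cumulative-sum scan can also be sketched with the standard library's itertools.accumulate and bisect. This is only an illustrative variant, not the answer's own code: the name split_by_weight is made up here, and it assumes non-negative weights so that the running totals are sorted.

```python
from bisect import bisect_left
from itertools import accumulate


def split_by_weight(items, n_chunks):
    """Cut items (pairs of name, weight) into n_chunks consecutive runs
    of roughly equal total weight. Assumes non-negative weights."""
    cumulative = list(accumulate(weight for _, weight in items))
    total = cumulative[-1] if cumulative else 0
    chunks, start = [], 0
    for k in range(1, n_chunks):
        target = total * k / n_chunks
        # first position (at or after start) whose running total reaches the target
        end = min(bisect_left(cumulative, target, lo=start) + 1, len(items))
        chunks.append(items[start:end])
        start = end
    chunks.append(items[start:])  # whatever is left forms the last chunk
    return chunks


people = [['a', 11], ['b', 2], ['c', 5], ['d', 3], ['e', 3], ['f', 2]]
result = split_by_weight(people, 2)
print(result)
```

With the six-item example from the question and two chunks, the cut falls after b, where the running total first reaches 13.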

You need something like this split:
array = [ ['bob', 12],
          ['jack', 6],
          ['jim', 33],
          ['bob2', 1],
          ['jack2', 16],
          ['jim2', 3],
          ['bob3', 7],
          ['jack3', 6],
          ['jim3', 1],
        ]
array = sorted(array, key=lambda pair: pair[1])
summ = sum(pair[1] for pair in array)
chunks = 4
splmitt = summ // chunks
print(array)
print(summ)
print(splmitt)
def split(array, split):
    splarr = []
    tlist = []
    summ = 0
    for pair in array:
        summ += pair[1]
        tlist.append(pair)
        if summ > split:
            splarr.append(tlist)
            tlist = []
            summ = 0
    if tlist:
        splarr.append(tlist)
    return splarr
spl = split(array, splmitt)
import pprint
pprint.pprint(spl)
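
If the original order does not have to be preserved, as in the elevator example from the question, a common greedy heuristic is to take the items from heaviest to lightest and always put the next one into the currently lightest bin. The sketch below is not taken from either answer; the names balance and n_bins are illustrative, and the result is a heuristic rather than a guaranteed optimum.

```python
import heapq


def balance(items, n_bins):
    """Greedy load balancing: heaviest item first, into the lightest bin."""
    bins = [[] for _ in range(n_bins)]
    heap = [(0, i) for i in range(n_bins)]  # (current total weight, bin index)
    heapq.heapify(heap)
    for name, weight in sorted(items, key=lambda pair: pair[1], reverse=True):
        total, idx = heapq.heappop(heap)  # lightest bin so far
        bins[idx].append([name, weight])
        heapq.heappush(heap, (total + weight, idx))
    return bins


array = [['bob', 12], ['jack', 6], ['jim', 33], ['bob2', 1], ['jack2', 16],
         ['jim2', 3], ['bob3', 7], ['jack3', 6], ['jim3', 1]]
bins = balance(array, 4)
totals = [sum(weight for _, weight in b) for b in bins]
print(totals)
```

One very heavy item (jim, 33) still dominates its bin, which no assignment can avoid; the remaining bins end up within one unit of each other.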

Related

How to generate a number sequence 1111222233334444....9999...?

I want to generate a sequence like 111122223333...: each number appears the same number of times, up to a certain end number.
I use Python for loops to generate the sequence, but it takes too much time when the end number is 7000.
import pandas as pd
startNum = 1
endNum = 7000
sequence = []
for i in range(endNum):
    for j in range(endNum):
        sequence.append(i)
print(i)
So what should I do to reduce the time and get my sequence? Any method is fine, as long as it doesn't involve Excel. Thanks!
I'd like to get the number sequence 111122223333.

Your code does not really do what you describe, so I've added comments explaining what it actually does, followed by a version that does what you want.
import pandas as pd
startNum = 1
endNum = 7000
sequence = []
for i in range(endNum):  # loops over 0 .. endNum - 1 (7000 values)
    for j in range(endNum):  # also runs 7000 times for every i
        sequence.append(i)  # so each i is appended 7000 times (49 million appends in total)
print(i)  # you probably meant to print sequence?
You probably want the inner loop to run once per repetition of each number (4 in your example) and the outer loop to start at startNum.
Here's the code that does what you want:
startNum = 1
endNum = 7000
sequence = []
repeat = 4
for i in range(startNum, endNum + 1):  # startNum .. endNum inclusive
    for _ in range(repeat):
        sequence.append(i)
print(sequence)
In your case, I'd prefer using extend with a list comprehension (both versions are equivalent):
startNum = 1
endNum = 7000
sequence = []
repeat = 4
for i in range(startNum, endNum + 1):
    sequence.extend([i for _ in range(repeat)])
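
As a further option, the standard library's itertools can build the same flat list without an explicit nested loop. This is a small sketch, assuming the sequence should run from 1 to the end number inclusive; the helper name repeated_sequence is made up for illustration.

```python
from itertools import chain, repeat


def repeated_sequence(start, end, times):
    """Return a flat list in which every integer start..end appears `times` times in a row."""
    return list(chain.from_iterable(repeat(i, times) for i in range(start, end + 1)))


sequence = repeated_sequence(1, 7000, 4)
print(sequence[:12])
print(len(sequence))
```

The list has only end * times entries (28000 here), so it is built almost instantly, unlike the 49 million appends of the original double loop.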

I don't know what it's for, but:
endnum = 7
result = ''.join(str(i) * 4 for i in range(endnum))
print(result)
Result:
0000111122223333444455556666
It takes less than 1 s with endnum = 7000:
0:00:00.006550

You can try this:
startNum = 1
endNum = 5
seq = [ (str(i+1))*endNum for i in range(endNum) ]
print("".join(seq))

How to split a series by the longest repetition of a number in python?

import pandas as pd
df = pd.DataFrame({
    'label': [f"subj_{i}" for i in range(28)],
    'data': [i for i in range(1, 14)] + [1,0,0,0,2] + [0,0,0,0,0,0,0,0,0,0]
})
I have a dataset something like this.
I want to cut it where the longest repetition of 0s occurs, so I want to cut at index 18 but leave indices 14-16 intact. So far I've tried things like:
# Counters
cad_recorder = 0
new_index = []
for i, row in tqdm(temp_df.iterrows()):
    if row['cadence'] == 0:
        cad_recorder += 1
        new_index.append(i)
But obviously that won't work, since an index is recorded at every occurrence of zero.
I also tried a dictionary, but I'm not sure how to compare previous and next values using iterrows.
I also took the rolling mean over X rows at a time and recorded an index whenever it was zero. But then I got stuck on actually inferring the range of indices, or finding the longest sequence of zeroes.
Edit: A friend of mine suggested the following logic, which gave the same results as #shubham-sharma. The poster's solution is much more pythonic and elegant.
def find_longest_zeroes(df):
    '''
    Finds the index at which the longest repetitions of <1 values begin
    '''
    current_length = 0
    max_length = 0
    start_idx = 0
    max_idx = 0
    for i in range(len(df['data'])):
        if df.iloc[i,9] <= 1:
            if current_length == 0:
                start_idx = i
            current_length += 1
            if current_length > max_length:
                max_length = current_length
                max_idx = start_idx
        else:
            current_length = 0
    return max_idx
The code I went with following #shubham-sharma's solution:
cut_us_sof = {}
og_df_sof = pd.DataFrame()
cut_df_sof = pd.DataFrame()
for lab in df['label'].unique():
    temp_df = df[df['label'] == lab].reset_index(drop=True)
    mask = temp_df['data'] <= 1  # some values in actual dataset were 0.0000001
    counts = temp_df[mask].groupby((~mask).cumsum()).transform('count')['data']
    idx = counts.idxmax()
    # my dataset's trailing zeroes are usually after 200th index. But I also didn't want to remove trailing zeroes < 500 in length
    if (idx > 2000) & (counts.loc[idx] > 500):
        cut_us_sof[lab] = idx
        og_df_sof = og_df_sof.append(temp_df)
        cut_df_sof = cut_df_sof.append(temp_df.iloc[:idx,:])

We can use boolean masking and cumsum to identify the blocks of zeros, then groupby and transform these blocks using count, followed by idxmax to get the starting index of the block with the most consecutive zeros:
m = df['data'].eq(0)
idx = m[m].groupby((~m).cumsum()).transform('count').idxmax()
print(idx)
18
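
For comparison, the same lookup of where the longest run starts can be sketched without pandas, using itertools.groupby on a plain list. The function name longest_run_start and its target parameter are illustrative only and do not come from the answer above.

```python
from itertools import groupby


def longest_run_start(values, target=0):
    """Return the index where the longest run of `target` begins, or None if absent."""
    best_start, best_len, pos = None, 0, 0
    for value, run in groupby(values):  # groups consecutive equal values
        length = sum(1 for _ in run)
        if value == target and length > best_len:
            best_start, best_len = pos, length
        pos += length
    return best_start


data = list(range(1, 14)) + [1, 0, 0, 0, 2] + [0] * 10
start = longest_run_start(data)
print(start)
```

On the sample data the short run of zeros at indices 14-16 is skipped and the function reports 18, matching the pandas result.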

How to get percentage of combinations computed?

I have this password generator, which computes combinations with a length of 2 to 6 characters from a list containing lowercase letters, capital letters and numbers (without 0), 61 characters in total.
All I need is to show the percentage (in steps of 5) of the combinations already created. I tried to compute the number of all combinations of the selected length, derive a boundary value from that number (the 5 % step values), count each combination written to the text file and, when the count meets the boundary value, print "xxx % complete", but this code doesn't seem to work.
Do you know how to easily show the percentage please?
Sorry for my english, I'm not a native speaker.
Thank you all!
def pw_gen(characters, length):
    """generate all characters combinations with selected length and export them to a text file"""
    # counting number of combinations according to a formula in documentation
    k = length
    n = len(characters) + k - 1
    comb_numb = math.factorial(n)/(math.factorial(n-length)*math.factorial(length))
    x = 0
    # first value
    percent = 5
    # step of percent done to display
    step = 5
    # 'step' % of combinations
    boundary_value = comb_numb/(100/step)
    try:
        # output text file
        with open("password_combinations.txt", "a+") as f:
            for p in itertools.product(characters, repeat=length):
                combination = ''.join(p)
                # write each combination and create a new line
                f.write(combination + '\n')
                x += 1
                if boundary_value <= x <= comb_numb:
                    print("{} % complete".format(percent))
                    percent += step
                    boundary_value += comb_numb/(100/step)
                elif x > comb_numb:
                    break

First of all - I think you are using incorrect formula for combinations because itertools.product creates variations with repetition, so the correct formula is n^k (n to power of k).
Also, you overcomplicated percentage calculation a little bit. I just modified your code to work as expected.
import math
import itertools
def pw_gen(characters, length):
    """generate all characters combinations with selected length and export them to a text file"""
    k = length
    n = len(characters)
    comb_numb = n ** k
    x = 0
    next_percent = 5
    percent_step = 5
    with open("password_combinations.txt", "a+") as f:
        for p in itertools.product(characters, repeat=length):
            combination = ''.join(p)
            # write each combination and create a new line
            f.write(combination + '\n')
            x += 1
            percent = 100.0 * x / comb_numb
            if percent >= next_percent:
                # skip ahead to the highest step already reached
                while next_percent + percent_step <= percent:
                    next_percent += percent_step
                print(f"{next_percent} % complete")
                # move on so the same step is never printed twice
                next_percent += percent_step
The tricky part is a while loop that makes sure that everything will work fine for very small sets (where one combination is more than step percentage of results).
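
The same idea can be sketched with integer arithmetic only, which avoids float comparisons at exact boundaries such as 25 % or 50 %. This is an illustrative variant rather than the answer's code: the name report_progress is made up, and the file writing is left out so that only the progress logic is shown.

```python
import itertools


def report_progress(characters, length, step=5):
    """Walk through all products and return the percentages that were reported."""
    total = len(characters) ** length  # product() yields n ** k results
    reported = 0  # highest multiple of step announced so far
    marks = []
    for x, _combo in enumerate(itertools.product(characters, repeat=length), start=1):
        percent = 100 * x // total  # integer percent, rounded down
        bucket = percent - percent % step  # round down to a multiple of step
        if bucket > reported:
            reported = bucket
            marks.append(bucket)
            print(f"{bucket} % complete")
    return marks


marks = report_progress("ab", 2)
```

With only four results, each one is worth 25 %, so the report jumps straight to 25, 50, 75 and 100 instead of printing every 5 % step.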

Removed try:, since you are not handling any errors with except.
Also removed elif:, since this condition is never met anyway.
Besides, your formula for comb_numb is not the right one, since itertools.product generates ordered results with repetition. With those changes, your code is good.
import itertools, string
def pw_gen(characters, length):
    """generate all characters combinations with selected length and export them to a text file"""
    # number of results: product() yields len(characters) ** length tuples
    comb_numb = len(characters) ** length
    x = 0
    # first value
    percent = 5
    # step of percent done to display
    step = 5
    # 'step' % of combinations
    boundary_value = comb_numb/(100/step)
    # output text file
    with open("password_combinations.txt", "a+") as f:
        for p in itertools.product(characters, repeat=length):
            combination = ''.join(p)
            # write each combination and create a new line
            f.write(combination + '\n')
            x += 1
            if boundary_value <= x:
                print("{} % complete".format(percent))
                percent += step
                boundary_value += comb_numb/(100/step)
pw_gen(string.ascii_letters, 4)
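
To see why the count matters, here is a quick check on a tiny alphabet (the three-letter string is just an example). The factorial formula from the question counts unordered selections with repetition, which is what itertools.combinations_with_replacement produces, whereas itertools.product yields n ** k ordered results.

```python
import itertools
import math

chars = "abc"
k = 2

# ordered results with repetition: aa, ab, ac, ba, ...
n_product = sum(1 for _ in itertools.product(chars, repeat=k))

# unordered selections with repetition: aa, ab, ac, bb, bc, cc
n_unordered = sum(1 for _ in itertools.combinations_with_replacement(chars, k))

print(n_product, len(chars) ** k)
print(n_unordered, math.comb(len(chars) + k - 1, k))
```

Because the loop runs over product, a total based on the smaller count is reached long before the loop ends, which is why the original progress report was off.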

Summation from sub list

Given n = 4 and m = 3, I have to take n elements (here 4) from the start of a list and n elements from the end. In the example below these sublists are [17,12,10,2] and [2,11,20,8].
From these two sublists I have to pick the highest-valued element and then delete it from the original list.
This step has to be performed m times, summing the elements picked.
A = [17,12,10,2,7,2,11,20,8], n = 4, m = 3
O/P: 20+17+12=49
I have written the following code. However, it performs poorly and times out for larger lists. Could you please help?
A = [17,12,10,2,7,2,11,20,8]
m = 3
n = 4
scoreSum = 0
count = 0
firstGrp = []
lastGrp = []
while(count<m):
    firstGrp = A[:n]
    lastGrp = A[-n:]
    maxScore = max(max(firstGrp), max(lastGrp))
    scoreSum = scoreSum + maxScore
    if(maxScore in firstGrp):
        A.remove(maxScore)
    else:
        # index of the last occurrence of maxScore in A
        ai = len(A) - 1 - A[::-1].index(maxScore)
        A.pop(ai)
    count = count + 1
    firstGrp.clear()
    lastGrp.clear()
print(scoreSum)

I would do it this way; you can generalize it later:
a = [17,12,10,2,7,2,11,20,8]
a.sort(reverse=True)
sums = 0
for i in range(3):
    sums += a[i]
print(sums)

If you are concerned about performance, you should use specific libraries like numpy. This will be much faster !

A = [17,12,10,2,7,2,11,20,8]
n = 4
m = 3
score = 0
for _ in range(m):
    sublist = A[:n] + A[-n:]
    subidx = [x for x in range(n)] + [x for x in range(len(A) - n, len(A))]
    sub = zip(sublist, subidx)
    maxval = max(sub, key=lambda x: x[0])
    score += maxval[0]
    del A[maxval[1]]
print(score)
Your method uses a lot of max() calls. Combining the slices of the front and back lists allows you to reduce the amounts of those max() searches to one pass and then a second pass to find the index at which it occurs for removal from the list.
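
For long lists the repeated slicing and deletion can itself become the bottleneck. One possible way around it, sketched here under the assumption that n is at least 1, keeps the two windows in heaps and the untouched middle part in a deque, so each pick costs O(log n). The name top_window_sum is made up, and ties are resolved in favour of the front window, as in the question's code.

```python
import heapq
from collections import deque


def top_window_sum(values, n, m):
    """Sum of m picks, each the largest value among the first n and last n remaining items."""
    if len(values) <= 2 * n:
        # the two windows already cover the whole list, so just take the m largest
        return sum(heapq.nlargest(m, values))
    front = [-v for v in values[:n]]  # negated values give max-heap behaviour
    back = [-v for v in values[-n:]]
    heapq.heapify(front)
    heapq.heapify(back)
    middle = deque(values[n:-n])  # items not yet inside either window
    total = 0
    for _ in range(m):
        if not front and not back:
            break  # nothing left to pick
        take_front = bool(front) and (not back or front[0] <= back[0])
        heap = front if take_front else back
        total += -heapq.heappop(heap)
        if middle:
            # the window that lost an item grows by one item from its side of the middle
            heapq.heappush(heap, -(middle.popleft() if take_front else middle.pop()))
    return total


print(top_window_sum([17, 12, 10, 2, 7, 2, 11, 20, 8], 4, 3))
```

On the example from the question the picks are 20 (back window), 17 and 12 (front window), for the same total of 49.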

More pythonic way to handle this logic structure

I need to break up a run of numbers into chunks of 100 plus whatever is left over, and then add them to a final dictionary at the end.
I am able to do it with loops, but I feel I might be missing something that would make this a much cleaner and more efficient operation.
l = 238 # length of list to process
i = 0 # setting up count for while loop
screenNames = {} # output dictionary
count = 0 # count of total numbers processed
while i < l:
    toGet = {}
    if l - count > 100: # blocks off in chunks of 100
        for m in range(0, 100):
            toGet[count] = m
            count = count + 1
    else:
        k = count
        for k in range(0, (l - count)): # takes the remainder of the numbers
            toGet[count] = k
            count = count + 1
        i = l # kills loop
    screenNames.update(toGet)
# This logic structure breaks up the list of numbers in chunks of 100 or their
# remainder and adds them into a dictionary with their count number as the
# index value
print 'returning:'
print screenNames
The above code works, but it feels clunky. Does anyone have a better way of handling this?

As far as I can see, you map a key n to the value n % 100, so this might as well be written as
screenNames = dict((i, i%100) for i in range(238))
print screenNames

Running your code, it looks like you're just doing modular arithmetic:
l = 238
sn = {}
for i in xrange(l):
    sn[i] = i % 100
print sn
Or more succinctly:
l = 238
print dict((i, i % 100) for i in xrange(l))
That works by constructing a dictionary from (key, value) tuples.
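
If the chunk boundaries themselves are needed, for example to process 100 items at a time, a stepped range gives them directly. The sketch below uses Python 3 syntax, unlike the Python 2 snippets above, and the names chunk_bounds and size are illustrative.

```python
def chunk_bounds(length, size=100):
    """Return (start, stop) pairs covering range(length) in blocks of at most `size`."""
    return [(start, min(start + size, length)) for start in range(0, length, size)]


bounds = chunk_bounds(238)
print(bounds)

# the dictionary from the question, built chunk by chunk
screen_names = {i: i - start for start, stop in bounds for i in range(start, stop)}

# the same dictionary via modular arithmetic
by_modulo = {i: i % 100 for i in range(238)}
print(screen_names == by_modulo)
```

For a length of 238 this yields two full blocks of 100 and a final block of 38, and both ways of building the dictionary agree.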
