Fill or reduce list intermediate values in Python

I have a lot of lists of sequence coordinates, but they have different lengths. I need to create a function to preprocess such a list and return a list of length 30, either filled with intermediate values or reduced (and recalculated).
For example, for an output of length 10 (fill):
input = [1, 3, 5, 7, 9, 8, 5, 4] # length = 8
output = [1, 2.3, 3, 5, 7, 9, 8, 6.3, 5, 4] # length = 10
For an output of length 10 (reduce):
input = [1, 3, 4, 5, 6, 7, 8, 9, 8, 6, 5, 4] # length = 12
output = [1, 2.5, 3.9, 5, 6.5, 7.9, 8.8, 6.5, 5, 4] # length = 10

I wanted to do something like this (sorry for the bad explanation):
import numpy as np

def normalize_length_of_sequence(sequence, sequence_length=30):
    if len(sequence) < sequence_length:
        while len(sequence) != sequence_length:
            # insert the mean of the adjacent pair with the largest difference
            diff_in_seq_ind = np.diff(sequence).argmax()
            new_mean_value = (sequence[diff_in_seq_ind] + sequence[diff_in_seq_ind + 1]) / 2
            sequence.insert(diff_in_seq_ind + 1, new_mean_value)
    elif len(sequence) > sequence_length:
        while len(sequence) != sequence_length:
            # drop the first element of the pair with the smallest absolute difference
            diff_in_seq_ind = np.abs(np.diff(sequence)).argmin()
            sequence.pop(diff_in_seq_ind)
    return sequence
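If evenly spaced resampling is acceptable (rather than always splitting the largest gap, as the attempt above does), a shorter route is linear interpolation with np.interp. This is only a sketch of that alternative; its numbers differ slightly from the example outputs because the new points are spaced evenly:
import numpy as np

def resample_sequence(sequence, target_length=30):
    # map old and new indices onto [0, 1] and interpolate linearly
    old_positions = np.linspace(0, 1, num=len(sequence))
    new_positions = np.linspace(0, 1, num=target_length)
    return np.interp(new_positions, old_positions, sequence).tolist()

print(resample_sequence([1, 3, 5, 7, 9, 8, 5, 4], 10))               # grows 8 -> 10
print(resample_sequence([1, 3, 4, 5, 6, 7, 8, 9, 8, 6, 5, 4], 10))   # shrinks 12 -> 10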

Related

Get random slice/fraction of a list in Python

Imagine we have the following list:
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Now we want a random fraction of the previous list, i.e., a random sublist of length len(l) * frac. Therefore, if frac=0.2 the output list should have length 2 (0.2 x 10).
The following results are expected:
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
l_new = sample(list=l, frac=0.3) # [4, 6, 8]
l_new = sample(list=l, frac=0.6) # [0, 1, 4, 6, 7, 8]
How can I achieve this behaviour? I have looked at random.sample; however, it works by supplying a number of elements rather than a fraction of elements.
You can use random.sample with k = int(len(l) * frac), like so:
import random

l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
frac = 0.3
l = random.sample(l, int(len(l) * frac))
print(l)
>>> [4, 7, 9]
Using it as a function:
# Randomly sample a fraction of elements from a list
def random_sample(l, frac=0.5):
    return random.sample(l, int(len(l) * frac))
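Note that random.sample does not preserve the original order, while the expected results in the question do. If order matters, one option (a sketch, not part of the answer above) is to sample indices and sort them:
import random

def random_fraction(l, frac=0.5):
    # sample positions, then sort them so the chosen elements keep their original order
    k = int(len(l) * frac)
    indices = sorted(random.sample(range(len(l)), k))
    return [l[i] for i in indices]

print(random_fraction([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 0.3))  # e.g. [1, 4, 8]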

Saving the output of a for loop in a labeled list in Python

Consider the following loop:
import random

for m in range(0, 2):
    rand_list = []
    n = 10
    for i in range(n):
        rand_list.append(random.randint(3, 9))
    print(rand_list)
That outputs two lists as:
[6, 9, 8, 7, 4, 8, 8, 4, 9, 9]
[9, 5, 3, 8, 3, 4, 8, 9, 3, 3]
How can I label them with the dummy variable m defined in the loop, for example lst[0], lst[1] or such, in order to compute the mean result
lst_mean = (lst[0] + lst[1]) / 2
or similar quantities?
Note: I know that the indexing here is not correct. The idea is that, at the end of the loop, for each m I have a list labeled by m that contains the corresponding result of rand_list.
You need a list of lists.
import random

lst = []
for m in range(2):
    rand_list = []
    n = 10
    for i in range(n):
        rand_list.append(random.randint(3, 9))
    lst.append(rand_list)
print(lst)
Example output:
[[5, 7, 9, 7, 8, 9, 9, 5, 3, 9], [8, 8, 5, 7, 7, 9, 8, 7, 4, 7]]
Now you can calculate the element-wise mean result.
result = [sum(values) / len(values) for values in zip(*lst)]
print(result)
Output
[6.5, 7.5, 7.0, 7.0, 7.5, 9.0, 8.5, 6.0, 3.5, 8.0]
If you want more lists you can change the 2 in for m in range(2): to a different number.
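If numpy is available, the element-wise mean can also be computed in one call. A small sketch, reusing the lst built above:
import numpy as np

# every inner list has the same length here, so lst converts cleanly to a 2-D array
result = np.array(lst).mean(axis=0).tolist()
print(result)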

Creating a New List Using a Percent Amount of a Pre-existing One - Python

Essentially, I have to take a pre-existing list and a percentage and return a new list containing the given percentage of items from the first list. Here is what I have:
def select_stop_words(percent, list):
    possible_stop_words = []
    l = len(list)
    new_words_list = l//(percent/100)
    x = int(new_words_list - 1)
    possible_stop_words = [:x]
    return possible_stop_words
But this always yields the same result as the original list. Help??
You might want to multiply l by percent / 100:
def select_stop_words(percent, lst):
    return lst[:len(lst) * percent // 100]
lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(select_stop_words(50, lst)) # [1, 2, 3, 4, 5]
print(select_stop_words(20, lst)) # [1, 2]
print(select_stop_words(99, lst)) # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(select_stop_words(100, lst)) # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
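The floor division above always rounds the item count down. If you would rather round to the nearest whole number of items, a small variant (an alternative, not what the answer above uses):
def select_stop_words_rounded(percent, lst):
    # round() instead of floor division, e.g. 15% of 10 items keeps 2 rather than 1
    return lst[:round(len(lst) * percent / 100)]

print(select_stop_words_rounded(15, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # [1, 2]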

'List index out of range' while reversing list

The problem is about reversing a list A of size N in groups of K. For example, if A = [1, 2, 3, 4, 5] and K = 3:
Output = [3,2,1,5,4]
The error I get when I run this is 'list index out of range' on line 4.
def reverseInGroups(A, N, K):
    arr1 = []
    for i in range(K):
        arr1.append(A[(N-i)%K]) #line 4
    for j in range(N-K):
        arr1.append(A[N-j-1])
    return arr1
This will implement what you are trying to achieve:
def reverseInGroups(A, K):
    N = len(A)
    arr1 = []
    for i in range(0, N, K):
        arr1.extend(A[i : i+K][::-1])
    return arr1

print(reverseInGroups([1,2,3,4,5], 3))
Interestingly, the code in the question actually works in the example case, but it is not general. The case where it works is where N = 2*K - 1 (although where it does not work, the elements are in the wrong order rather than an IndexError).
Can't seem to reproduce your 'list index out of range' error, but your logic is faulty:
def reverseInGroups(A, N, K):
    arr1 = []
    for i in range(K):
        arr1.append(A[(N-i)%K]) #line 4
    for j in range(N-K):
        arr1.append(A[N-j-1])
    return arr1

print(reverseInGroups([1,2,3,4,5], 5, 3)) # works, others get wrong result
print(reverseInGroups([1,2,3,4,5,6], 6, 3)) # wrong result: [1, 3, 2, 6, 5, 4]
prints:
[3, 2, 1, 5, 4] # correct
[1, 3, 2, 6, 5, 4] # wrong
You can fix this and make it smaller by packing it into a list comprehension:
def revv(L, k):
    return [w for i in (L[s:s+k][::-1] for s in range(0, len(L), k)) for w in i]

for gr in range(2, 8):
    print(gr, revv([1,2,3,4,5,6,7,8,9,10,11], gr))
to get:
2 [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 11]
3 [3, 2, 1, 6, 5, 4, 9, 8, 7, 11, 10]
4 [4, 3, 2, 1, 8, 7, 6, 5, 11, 10, 9]
5 [5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 11]
6 [6, 5, 4, 3, 2, 1, 11, 10, 9, 8, 7]
7 [7, 6, 5, 4, 3, 2, 1, 11, 10, 9, 8]
You can also try this:
def reverse(l, n):
    result = []
    for i in range(0, len(l)-1, n):
        for item in reversed(l[i:i+n]):
            result.append(item)
    for item in reversed(l[i+n:]):
        result.append(item)
    return result
You can reverse the array up to index K, reverse the remaining part, and concatenate the two (note this matches the example because the remainder here is shorter than K; it does not reverse a longer tail in further groups of K).
def reverseInGroups(A, N, K):
    return A[:K][::-1] + A[K:][::-1]

A = [1,2,3,4,5]
N = 5
K = 3
res = reverseInGroups(A, N, K)
print(res)

Find Top N Most Frequent Sequence of Numbers in List of Lists

Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in position i = M (or i = M + 1, or i = M + 2, ..., or i = L), then we count the subsequence of length M where target is in the final position of the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
This is super simplified but, in reality, my actual list-of-lists
consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there efficient algorithms or relevant area of research for performing this kind of analysis for various combinations of N and M?
Here is an idea, based on a generalized suffix tree structure. Your list of lists can be seen as a list of strings, where the alphabet would consist of integers (so about 10k characters in the alphabet with the info you provided).
The construction of a generalized suffix tree is done in linear time w.r.t. the string length, so this should not be an issue since, in any case, you will have to go through your lists at some point.
First, store all your strings in the suffix tree. This requires 2 small adaptations of the structure.
You need to keep a counter of the number of occurrences of a given suffix, since your goal is ultimately to find the most common subsequence respecting certain properties.
Then, you also want to have a lookup table from (i, d) (where i is the integer you're looking for, the target, and d is the depth in your tree, the M) to the set of nodes of your suffix tree that are labeled with the 'letter' i (your alphabet is not made of chars, but of integers) and located at depth d. This lookup table can be built by traversing your suffix tree (BFS or DFS). You can even possibly store only the node that corresponds to the highest counter value.
From there, for some query (target, M), you would first look in your lookup table, and then find the node in the tree with the highest counter value. This would correspond to the most frequently encountered 'suffix' (or subsequence) in the list of lists.
The implementation is quite complex, since the generalized suffix tree is not a trivial structure (at all), and implementing it correctly, with modifications, would not be a small feat. But I think that this would allow for a very efficient query time.
For a suffix tree implementation, I would recommend that you read only the original papers on the matter (like this or that; sc*-h*b can be your friend) until you get a deep and real understanding of them, and not the online 'explanations', which are riddled with approximations and mistakes (even this post can help to get a first idea, but will misdirect you at some point if your goal is to implement a correct version).
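The full structure is more than can be sketched here, but the flavour of the counter-plus-lookup-table idea can be illustrated with a much simpler (and far less efficient) uncompressed trie over suffixes, truncated at the maximum M of 100. This is only a rough sketch of the query side, not the linear-time generalized suffix tree described above, and it ignores the question's tie-handling rule:
from collections import defaultdict

MAX_M = 100  # M is bounded by 100 in the question

class Node:
    def __init__(self, value=None, parent=None):
        self.children = {}
        self.count = 0       # occurrences of the window spelled by the path to this node
        self.value = value
        self.parent = parent

def build_trie(lists):
    lookup = defaultdict(list)  # (value, depth) -> nodes: the lookup table described above
    root = Node()
    for seq in lists:
        for start in range(len(seq)):  # insert every suffix, truncated at MAX_M
            node = root
            for depth, value in enumerate(seq[start:start + MAX_M], start=1):
                child = node.children.get(value)
                if child is None:
                    child = node.children[value] = Node(value, node)
                    lookup[(value, depth)].append(child)
                child.count += 1
                node = child
    return lookup

def most_frequent_windows(lookup, target, M, N):
    # nodes at depth M labeled `target` correspond exactly to length-M windows ending in `target`
    best = sorted(lookup.get((target, M), []), key=lambda n: n.count, reverse=True)[:N]
    results = []
    for node in best:
        window, cur = [], node
        while cur.value is not None:  # walk the parent pointers back up to the root
            window.append(cur.value)
            cur = cur.parent
        results.append((window[::-1], node.count))
    return results
With the question's x, most_frequent_windows(build_trie(x), target=5, M=4, N=2) returns the windows [2, 3, 4, 5] and [12, 12, 6, 5], each with a count of 2.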
To answer your first question: you can put all the lists in an array, fixing the length by padding with zeros, so the array becomes something you can work with. From an answer here:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
import numpy as np

lens = np.fromiter(map(len, x), int)
n1, n2 = len(lens), lens.max()
arr = np.zeros((n1, n2), dtype=int)
mask = np.arange(n2) < lens[:,None]
arr[mask] = np.concatenate(x)
arr
>> [[ 1 2 3 4 5 6 7 0 0 0 0]
[ 6 5 10 11 0 0 0 0 0 0 0]
[ 9 8 2 3 4 5 0 0 0 0 0]
[12 12 6 5 0 0 0 0 0 0 0]
[ 5 8 3 4 2 0 0 0 0 0 0]
[ 1 5 0 0 0 0 0 0 0 0 0]
[ 2 8 8 3 5 9 1 4 12 5 6]
[ 7 1 7 3 4 1 2 0 0 0 0]
[ 9 4 12 12 6 5 1 0 0 0 0]]
For the second question: use np.where to find the positions matching your condition. Then you can broadcast the row and column indices by adding dimensions, so each match expands to the 5 and the three elements preceding it:
M = 4
N = 5  # N here is the target value from the question, not the top-N count
r, c = np.where(arr[:, M-1:] == N)
arr[r[:,None], (c[:,None] + np.arange(M))]
>>array([[ 2, 3, 4, 5],
[ 2, 3, 4, 5],
[12, 12, 6, 5],
[ 8, 8, 3, 5],
[ 1, 4, 12, 5],
[12, 12, 6, 5]])
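The answer stops at extracting the subsequences; a possible follow-up (a sketch, not part of the original answer) counts the rows with np.unique and picks the most frequent ones. Ties beyond the top N are not handled here:
subseqs = arr[r[:, None], (c[:, None] + np.arange(M))]  # the 6 x 4 result shown above
rows, counts = np.unique(subseqs, axis=0, return_counts=True)
top_n = 2
order = np.argsort(-counts)[:top_n]
print(rows[order])    # the two most frequent subsequences
print(counts[order])  # their counts (both 2 here)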
There are two parts to your question:
To generate the subsequences you wanted, you can use a generator to help you:
def gen_m(lst, m, val):
    '''
    lst = sub_list to parse
    m = length required
    val = target value
    '''
    found = 0 # starts with 0 index
    for i in range(lst[m-1:].count(val)): # repeat by the count of val
        found = lst.index(val, found) + 1 # set and find the next index of val
        yield tuple(lst[found-m: found]) # yield the sliced sub_list of m length as a tuple
Then, using another generator, you can create a Counter of your sub lists:
from collections import Counter
target = 5
req_len = 4
# the yielded sub_lists need to be tuples to be hashable for the Counter
counter = Counter(sub_tup for lst in x for sub_tup in gen_m(lst, req_len, target))
Then, create a generator that checks the Counter object and returns the required top N (including ties):
req_N = 2

def gen_common(counter, n):
    s = set()
    for i, (item, count) in enumerate(counter.most_common()):
        if i < n or count in s:
            yield item
        else:
            return
        s.add(count)

result = list(gen_common(counter, req_N))
Results where N == 2:
[(2, 3, 4, 5), (12, 12, 6, 5)]
Results where N == 3:
[(2, 3, 4, 5), (12, 12, 6, 5), (8, 8, 3, 5), (1, 4, 12, 5)]
With a larger sample:
x = [[1, 2, 3, 4, 5, 6, 7],
[6, 5, 10, 11],
[9, 8, 2, 3, 4, 5],
[12, 12, 6, 5],
[5, 8, 3, 4, 2],
[1, 5],
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],
[7, 1, 7, 3, 4, 1, 2],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 2, 3, 4, 5, 1],
[9, 4, 8, 8, 3, 5, 1],
[9, 4, 7, 8, 9, 5, 1],
[9, 4, 1, 2, 2, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 1, 4, 12, 5],
[9, 1, 4, 12, 5, 1]
]
The Counter is now:
Counter({(12, 12, 6, 5): 5, (2, 3, 4, 5): 3, (1, 4, 12, 5): 3, (8, 8, 3, 5): 2, (7, 8, 9, 5): 1, (1, 2, 2, 5): 1})
You can get results such as these:
for i in range(6):
    # testing req_N from 0 to 5
    print(list(gen_common(counter, i)))
# req_N = 0: []
# req_N = 1: [(12, 12, 6, 5)]
# req_N = 2: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 3: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 4: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5)]
# req_N = 5: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5), (7, 8, 9, 5), (1, 2, 2, 5)]
Since N, M, and target are not fixed, I assume the data arrives in chunks of lists. Here is an approach with O(N + M) time complexity (where N is the number of lists in a chunk and M is the total number of elements):
from collections import Counter

def get_seq(x, M, target):
    index_for_length_m = M - 1
    for v in [l for l in x if len(l) >= M]:
        for i in [i for i, v in enumerate(v[index_for_length_m:], start=index_for_length_m) if v == target]:
            # convert to str to be hashable
            yield str(v[i - index_for_length_m : i + 1])

def process_chunk(x, M, N, target):
    return Counter(get_seq(x, M, target)).most_common(N)
With your example (M = 4, target = 5):
process_chunk(x, M, 2, target)
output:
[('[2, 3, 4, 5]', 2), ('[12, 12, 6, 5]', 2)]
The performance:
%timeit process_chunk(x, M, 2, target)
# 25 µs ± 713 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
