Given an iterable of iterables of iterables it_it_it (i.e. a lazy representation of a 3D array), you can lazily transpose dimensions 0 and 1 with zip(*it_it_it), and dimensions 1 and 2 with map(lambda it_it: zip(*it_it), it_it_it).
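For concreteness, a quick sketch of the two easy swaps on a tiny 2×2×2 nested list (variable names mine):

abc = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

bac = zip(*abc)                            # swap dimensions 0 and 1
print([list(t) for t in bac])              # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]

acb = map(lambda it_it: zip(*it_it), abc)  # swap dimensions 1 and 2
print([[list(t) for t in m] for m in acb]) # [[[1, 3], [2, 4]], [[5, 7], [6, 8]]]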
However, the last combination (0 and 2) is trickier. It seems you must fully evaluate the outer two iterators before yielding anything, and the type yielded must be List[List], not a lazy Iterable[Iterable]; the innermost iterator is the only one that can be lazily evaluated (i.e. Iterable[List[List]] is the best you can do).
I'm going to give an answer but am interested in a more elegant answer.
Aside:
I'm interested in this question for understanding the problem with statically typed iterators, e.g. in Rust and C++. Do you make sure to set up your data so you never have to do this operation? Is the best thing to do just to fully evaluate the iterators to a List[List[List]] and then transpose C-style?
Solution
def transpose_(it_it_it):
    return zip(*map(zip, *it_it_it))
Attempt This Online!
Derivation
My first version was this, and then I just minified it. I first swap the outer two dimensions, then the inner two, then the outer two again, using your ways of doing those swaps. Note the variable names: they reflect the dimensions / swaps:
def transpose_(xyz):
    yxz = zip(*xyz)
    yzx = map(lambda xz: zip(*xz), yxz)
    zyx = zip(*yzx)
    return zyx
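A quick sanity check of the derivation, using numpy only to build the input and the reference result:

import numpy

a = numpy.arange(3*2*4).reshape((3, 2, 4))
materialized = [[list(x) for x in yx] for yx in transpose_(a.tolist())]
assert materialized == a.transpose(2, 1, 0).tolist()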
Iteration / exhaustion visualization
For visualization, I turned the lists of the 3D input into iterators that report when they've been exhausted, and iterated the transposed data, printing its structure and values. We can see the outer two dimensions get exhausted up front, and the innermost dimension is exhausted only at the end:
X-iterator exhausted
Y-iterator exhausted
Y-iterator exhausted
Y-iterator exhausted
[
(
0
8
16
)
(
4
12
20
)
]
[
(
1
9
17
)
(
5
13
21
)
]
[
(
2
10
18
)
(
6
14
22
)
]
[
(
3
11
19
)
(
7
15
23
)
]
Z-iterator exhausted
Z-iterator exhausted
Z-iterator exhausted
Z-iterator exhausted
Z-iterator exhausted
Z-iterator exhausted
Code:
import numpy

def transpose_(it_it_it):
    return zip(*map(zip, *it_it_it))

# Versions of zip and map that exhaust all inputs
_zip = zip
def zip(*its):
    yield from _zip(*its)
    for it in its[1:]:
        next(it, None)

_map = map
def map(f, *its):
    yield from _map(f, *its)
    for it in its[1:]:
        next(it, None)

# Iterators that report when they've been exhausted
def X(it):
    yield from map(Y, it)
    print('X-iterator exhausted')

def Y(it):
    yield from map(Z, it)
    print('Y-iterator exhausted')

def Z(it):
    yield from it
    print('Z-iterator exhausted')

# Test data
a = numpy.arange(3*2*4).reshape((3, 2, 4))
b = X(a.tolist())

# Iterate the transposed data
zyx = transpose_(b)
for yx in zyx:
    print('[')
    for x in yx:
        print(' (')
        for value in x:
            print(' ', value)
        print(' )')
    print(']')
Attempt This Online!
Benchmark
Time and memory of creating and iterating the transposed data, given data of size 100×100×100 (three attempts):
317 ms 845540 bytes transpose_Tom
117 ms 825400 bytes transpose_Kelly
351 ms 840144 bytes transpose_Tom
127 ms 824984 bytes transpose_Kelly
324 ms 844120 bytes transpose_Tom
116 ms 824984 bytes transpose_Kelly
Code:
import numpy
import tracemalloc as tm
from timeit import repeat
import itertools

def transpose_Tom(it_it_it):
    blocks = [[iter(it) for it in it_it] for it_it in it_it_it]
    it = iter(itertools.cycle(itertools.chain.from_iterable(zip(*blocks))))
    while True:
        try:
            yield [[next(next(it)) for _ in range(len(blocks))]
                   for _ in range(len(blocks[0]))]
        except StopIteration:
            break

def transpose_Kelly(it_it_it):
    return zip(*map(zip, *it_it_it))

def iterate(iii):
    for ii in iii:
        for i in ii:
            for _ in i:
                pass

n = 100
a = numpy.arange(n**3).reshape((n, n, n))
b = a.tolist()

for _ in range(3):
    for func in transpose_Tom, transpose_Kelly:
        tm.start()
        iterate(func(b))
        zyx = func(b)
        memory = tm.get_traced_memory()[1]
        tm.stop()
        time = min(repeat(lambda: iterate(func(b)), number=1))
        print(f'{round(time*1e3)} ms ', memory, 'bytes ', func.__name__)
    print()
Attempt This Online!
import itertools

def transpose_(it_it_it):
    blocks = [[iter(it) for it in it_it] for it_it in it_it_it]
    it = iter(itertools.cycle(itertools.chain.from_iterable(zip(*blocks))))
    while True:
        try:
            yield [[next(next(it)) for _ in range(len(blocks))]
                   for _ in range(len(blocks[0]))]
        except StopIteration:
            break
Here's the code to test it:
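A minimal sketch of such a test (my reconstruction), using numpy only to build the input and the reference result:

import numpy

a = numpy.arange(3*2*4).reshape((3, 2, 4))
assert list(transpose_(a.tolist())) == a.transpose(2, 1, 0).tolist()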
You can't really avoid the materialization you're trying to avoid.
Consider iterating over the transposed result you want:
for iter1 in transpose_layers_0_and_2(iterator):
    for iter2 in iter1:
        for iter3 in iter2:
            pass
At the end of the first iteration of the outer loop, you've accessed every element of every sub-sub-iterator of the first sub-iterator of transpose_layers_0_and_2(iterator).
In the original pre-transpose iterator, these elements come from the first element of every sub-sub-iterator of the original iterator.
That means that by the time the first iteration of the outer loop is complete, you must have materialized the entire first two layers of the original iterator to produce every sub-sub-iterator. There's no way around it. Plus, you've only used one element of each sub-sub-iterator, so you still have to retain all those sub-sub-iterators in memory to produce the remaining elements. You can't discard them yet.
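A tiny sketch of that effect (instrumentation mine): the innermost iterators announce their first access, and producing just the first transposed plane touches all six of them.

def noisy_row(x, y):
    print(f'row ({x},{y}) first accessed')
    yield from range(4)

data = ((noisy_row(x, y) for y in range(2)) for x in range(3))
transposed = zip(*map(zip, *data))  # the zip-based 0<->2 transpose
first_plane = next(transposed)      # prints the message for all 6 rows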
Related
We have a 2D list, which we can convert into anything if necessary. Each row contains some positive integers (deltas of the original increasing numbers). In total there are 2 billion numbers, more than half of them equal to 1. Using Elias gamma coding, we can encode the 2D list row by row (we'll be accessing arbitrary rows by row index later) at around 3 bits per number, based on a calculation from the distribution. However, our program has been running for 12 hours and still hasn't finished the encoding.
Here's what we are doing:
from math import floor, log
from typing import List
from bitstring import BitArray

def _compress_2d_list(input: List[List[int]]) -> List[BitArray]:
    res = []
    for row in input:
        res.append(sum(_elias_gamma_compress_number(num) for num in row))
    return res

def _elias_gamma_compress_number(x: int) -> BitArray:
    n = _log_floor(x)
    return BitArray(bin="0" * n) + BitArray(uint=x, length=_log_floor(x) + 1)

def _log_floor(num: int) -> int:
    return floor(log(num, 2))
Called by:
input_2d_list: List[List[int]] # containing 1.5M lists, total 2B numbers
compressed_list = _compress_2d_list(input_2d_list)
How can I optimize my code to make it run faster? I mean, MUCH faster... I am OK with using any reliable, popular library or data structure.
Also, how do we decompress faster with BitStream? Currently I read the prefix 0's one by one, then read the binary of the compressed number in a while loop. It's not very fast either...
If you are OK with numpy "bitfields", you can get the compression done in a matter of minutes. Decoding is slower by a factor of three, but still a matter of minutes.
Sample run:
# create example (1'000'000 numbers)
a = make_example()
a
# array([2, 1, 1, ..., 3, 4, 3])
b,n = encode(a) # takes ~100 ms on my machine
c = decode(b,n) # ~300 ms
# check round trip
(a==c).all()
# True
Code:
import numpy as np

def make_example():
    a = np.random.choice(2000000, replace=False, size=1000001)
    a.sort()
    return np.diff(a)

def encode(a):
    a = a.view(f'u{a.itemsize}')     # reinterpret as unsigned of the same width
    l = np.log2(a).astype('u1')      # floor(log2(x)): the number of leading zeros
    L = ((l << 1) + 1).cumsum()      # end offset of each (2l+1)-bit code
    out = np.zeros(L[-1], 'u1')      # one entry per bit, packed below
    for i in range(l.max() + 1):     # deposit bit i of every number at once
        out[L-i-1] += (a >> i) & 1
    return np.packbits(out), out.size

def decode(b, n):
    b = np.unpackbits(b, count=n).view(bool)
    s = b.nonzero()[0]               # positions of the 1-bits
    s = (s << 1).repeat(np.diff(s, prepend=-1))
    s -= np.arange(-1, len(s)-1)     # jump table: s[start of a code] = start of the next
    s = s.tolist()                   # list has faster __getitem__
    ns = len(s)
    def gen():                       # walk the start offsets of all codes
        idx = 0
        yield idx
        while idx < ns:
            idx = s[idx]
            yield idx
    offs = np.fromiter(gen(), int)
    sz = np.diff(offs) >> 1          # leading-zero count of each code
    mx = sz.max() + 1
    out = np.zeros(offs.size-1, int)
    for i in range(mx):              # set bit i of every number that has it
        out[b[offs[1:]-i-1] & (sz >= i)] += 1 << i
    return out
Some simple optimizations result in a factor of three improvement:
def _compress_2d_list(input):
    res = []
    for row in input:
        res.append(BitArray('').join(
            BitArray(uint=x, length=2*x.bit_length()-1) for x in row))
    return res
However, I think you'll need something better than that. On my machine, this would finish in about 12 hours on 1.5 million lists with 1400 deltas each.
In C it takes about a minute to encode. About 15 seconds to decode.
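As a side note on why the one-liner above is a valid Elias gamma encoding: for x >= 1 with n = x.bit_length() - 1, the code is n zeros followed by the (n+1)-bit binary form of x, and zero-padding a uint to 2*n + 1 bits produces exactly those leading zeros. A pure-Python sketch:

def elias_gamma_bits(x: int) -> str:
    n = x.bit_length() - 1
    return "0" * n + bin(x)[2:]  # n zeros, then the n+1 binary digits of x

assert elias_gamma_bits(9) == "0001001"  # 9 = 0b1001, n = 3
assert all(len(elias_gamma_bits(x)) == 2*x.bit_length() - 1
           for x in range(1, 1000))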
For example, given:
On_A_Line = [2,2,3]
Lengths_Of_Lines = [5,2,4,3,2,3,2]
Characters = ['a','t','i','e','u','w','x']
I want it to print:
aaaaatt
iiiieee
uuwwwxx
So far I have tried:
iteration = 0
for number in Lengths_Of_Lines:
    s = Lengths_Of_Lines[iteration]*Characters[iteration]
    print(s, end="")
    iteration += 1
which prints what I want without the line spacing:
aaaaattiiiieeeuuwwwxx
I just don't have the Python knowledge to know what to do from there.
Solution using a generator and itertools:
import itertools

def repeat_across_lines(chars, repetitions, per_line):
    gen = (c * r for c, r in zip(chars, repetitions))
    return '\n'.join(
        ''.join(itertools.islice(gen, n))
        for n in per_line
    )
Example:
>>> repeat_across_lines(Characters, Lengths_Of_Lines, On_A_Line)
'aaaaatt\niiiieee\nuuwwwxx'
>>> print(_)
aaaaatt
iiiieee
uuwwwxx
The generator gen yields each character repeated the appropriate number of times. These are joined together n at a time with itertools.islice, where n comes from per_line. Those results are then joined with newline characters. Because gen is a generator, the next call to islice yields the next n of them that haven't been consumed yet, rather than the first n.
You need to loop over the On_A_Line list. This tells you how many iterations of the inner loop to perform before printing a newline.
iteration = 0
for count in On_A_Line:
    for _ in range(count):
        s = Lengths_Of_Lines[iteration]*Characters[iteration]
        print(s, end="")
        iteration += 1
    print("")  # Print newline
I've encountered a problem and hope someone can give me a tip to overcome it.
I have a 2D Python list (83 rows and 3 columns). The first 2 columns are the start and end positions of an interval. The 3rd column is a numeric index (e.g. 9.68). The list is reverse-sorted by the 3rd column.
I want to get all the non-overlapping intervals with the highest indices.
Here is an example of the sorted list:
504 789 9.68
503 784 9.14
505 791 8.78
499 798 8.73
1024 1257 7.52
1027 1305 7.33
507 847 5.86
Here is what I tried:
# Define a function that tests if 2 intervals overlap
def overlap(start1, end1, start2, end2):
    return not (end1 < start2 or end2 < start1)

best_list = []  # Create a list that will store the best intervals
best_list.append([sort[0][0], sort[0][1]])  # Append the first interval of the sorted list

# Loop through the sorted list
for line in sort:
    local_start, local_end = line.rsplit("\s", 1)[0].split()
    for i in range(len(best_list)):
        best_start = best_list[i][0]
        best_end = best_list[i][1]
        test = overlap(int(best_start), int(best_end), int(local_start), int(local_end))
        if test is False:
            best_list.append([local_start, local_end])
And I get:
best_list = [(504, 789),(1024, 1257),(1027, 1305)]
But I want:
best_list = [(504, 789),(1024, 1257)]
Thanks!
Well, I have a question about your code: since sort contains strings, does the line best_list.append([sort[0][0], sort[0][1]]) do what you expect?
Anyway, on to the main part. Your problem is that when multiple elements exist in your list, it is sufficient for just one of them to pass the overlap test for the candidate to be added (not what you want). E.g. when both (504, 789) and (1024, 1257) exist, (1027, 1305) will be inserted into the list because it passes the test when compared to (504, 789).
So, I made a few changes and now it seems to work as expected:
best_list = []  # Create a list that will store the best intervals
best_list.append(sort[0].rsplit(" ", 1)[0].split())  # Append the first interval of the sorted list

# Loop through the sorted list
for line in sort:
    local_start, local_end = line.rsplit("\s", 1)[0].split()
    flag = False  # <- flag to check the overall overlapping
    for i in range(len(best_list)):
        best_start = best_list[i][0]
        best_end = best_list[i][1]
        test = overlap(int(best_start), int(best_end), int(local_start), int(local_end))
        print(test)
        if test:
            flag = False
            break
        flag = True
    if flag:
        best_list.append([local_start, local_end])
The main idea is to check each element against every interval already kept, and add it only if it passes all the overlapping tests (the last line of my code). Not before.
Suppose you parse your csv and already have a list of [(start, stop, index), ...] tuples as [(int, int, float), ...]; then you can sort it with the following:
from operator import itemgetter
data = sorted(data, key=itemgetter(2), reverse=True)
This means that you sort by the third position and return the result in reverse order, from max to min.
def nonoverlap(data):
    result = [data[0]]
    for cand in data[1:]:
        start, stop, _ = cand
        current_span = range(start, stop+1)
        for item in result:
            i, j, _ = item
            span = range(i, j+1)
            if (start in span) or (stop in span):
                break
            elif (i in current_span) or (j in current_span):
                break
        else:
            result.append(cand)
    return result
Then with the above function you will obtain the desired result. For the provided snippet you will obtain [(504, 789, 9.68), (1024, 1257, 7.52)]. I use here the fact that 1 in range(0, 10) returns True. While this is a naive implementation, you can use it as a starting point. If you want to return only the starts and stops, replace the return line with return [i[:2] for i in result].
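A quick usage sketch with the rows from the question (parsed inline here rather than from a csv):

rows = ["504 789 9.68", "503 784 9.14", "505 791 8.78", "499 798 8.73",
        "1024 1257 7.52", "1027 1305 7.33", "507 847 5.86"]
data = [(int(a), int(b), float(c)) for a, b, c in (r.split() for r in rows)]
data = sorted(data, key=itemgetter(2), reverse=True)
print(nonoverlap(data))  # [(504, 789, 9.68), (1024, 1257, 7.52)]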
Note: I also want to add that your code has a logical mistake. You make a decision after each comparison, but you must make the decision only after comparing against all the elements already present in your best_list. That is why (504, 789) and (1027, 1305) pass your test, but should not. I hope this note helps you.
I have a startnumber and an endnumber.
From these numbers I need to pick a sequence of numbers.
The sequence is not always the same.
Example:
startnumber = 1
endnumber = 32
I need to create a list of numbers with a certain sequence
e.g. 3 numbers yes, 2 numbers no, 3 numbers yes, 2 numbers no, etc.
Expected output:
[[1-3],[6-8],[11-13],[16-18],[21-23],[26-28],[31-32]]
(at the end there are only 2 numbers remaining (31 and 32))
Is there a simple way in Python to select such sequences of numbers from a range?
numbers = range(1,33)
take = 3
skip = 2
seq = [list(numbers[idx:idx+take]) for idx in range(0, len(numbers), take+skip)]
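This produces the requested grouping, including the short final group:

print(seq)
# [[1, 2, 3], [6, 7, 8], [11, 12, 13], [16, 17, 18],
#  [21, 22, 23], [26, 27, 28], [31, 32]]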
Extrapolating this out:
def get_data(data, filterfunc=None):
    if filterfunc is None:
        filterfunc = lambda: True  # take every line
    result = []
    sub_ = []
    for line in data:
        if filterfunc():
            sub_.append(line)
        else:
            if sub_:
                result.append(sub_)
            sub_ = []
    if sub_:  # flush the trailing group (e.g. the final [31, 32])
        result.append(sub_)
    return result
# Example filterfunc
def example_filter(take=1, leave=1):
    """example_filter is a less-fancy version of itertools.cycle"""
    while True:
        for _ in range(take):
            yield True
        for _ in range(leave):
            yield False

# Your example (pass the generator's __next__ so filterfunc is callable)
final = get_data(range(1, 33), example_filter(take=3, leave=2).__next__)
As alluded to in the docstring of example_filter, the filterfunc for get_data really just expects a True or False per call (hence passing the generator's __next__ above). You could easily change it to have the signature:
def filterfunc(some_data: object) -> bool:
so that you can decide whether to take or leave based on the value (or even the index), but as written it takes no arguments and just functions as a less magic itertools.cycle, returning its value on call rather than on iteration.
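A hedged sketch of that value-based variant (get_data_by_value is a name introduced here for illustration):

def get_data_by_value(data, predicate):
    result, sub_ = [], []
    for line in data:
        if predicate(line):  # decide per item, based on its value
            sub_.append(line)
        elif sub_:
            result.append(sub_)
            sub_ = []
    if sub_:
        result.append(sub_)
    return result

# e.g. group consecutive numbers that are not multiples of 4
print(get_data_by_value(range(1, 13), lambda x: x % 4))
# [[1, 2, 3], [5, 6, 7], [9, 10, 11]]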
from itertools import islice

def grouper(iterable, n, min_chunk=1):
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if len(chunk) < min_chunk:
            return
        yield chunk

def pick_skip_seq(seq, pick, skip, skip_first=False):
    if skip_first:
        ret = [x[skip:] for x in grouper(seq, pick+skip, skip+1)]
    else:
        ret = [x[:pick] for x in grouper(seq, pick+skip)]
    return ret
pick_skip_seq(range(1,33), 3, 2) gives the required list.
In pick_skip_seq(seq, pick, skip, skip_first=False): seq is the sequence to pick/skip from, pick and skip are the numbers of elements to pick and skip, and skip_first should be set to True if skipping should come first.
grouper returns chunks of n elements; it ignores the last group if it has fewer than min_chunk elements. It is derived from https://stackoverflow.com/a/8991553/1921546.
Demo:
# pick 3, skip 2
for i in range(30, 35):
    print(pick_skip_seq(range(1, i), 3, 2))

# skip 2, pick 3
for i in range(30, 35):
    print(pick_skip_seq(range(1, i), 3, 2, True))
An alternative implementation of pick_skip_seq:
from itertools import chain, cycle, repeat, compress

def pick_skip_seq(seq, pick, skip, skip_first=False):
    if skip_first:
        c = cycle(chain(repeat(0, skip), repeat(1, pick)))
    else:
        c = cycle(chain(repeat(1, pick), repeat(0, skip)))
    return list(grouper(compress(seq, c), pick))
Everything used is documented here: https://docs.python.org/3/library/itertools.html#itertools.compress
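To see what compress contributes before grouper chunks the stream:

from itertools import compress, cycle

print(list(compress(range(1, 33), cycle([1, 1, 1, 0, 0]))))
# [1, 2, 3, 6, 7, 8, 11, 12, 13, 16, 17, 18, 21, 22, 23, 26, 27, 28, 31, 32]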
I am trying to generate primes endlessly by filtering out composite numbers. Using a list to store and test all primes makes the whole thing slow, so I tried to use generators.
from itertools import count

def chk(it, num):
    for i in it:
        if i % num:
            yield i

genStore = [count(2)]
primeStore = []
while 1:
    prime = next(genStore[-1])
    primeStore.append(prime)
    genStore.append(chk(genStore[-1], prime))
It works quite well, generating primes, until it hits the maximum recursion depth.
So I found ifilter (or filter in Python 3).
From the documentation of the Python standard library:
Make an iterator that filters elements from iterable returning only those for which the predicate is True. If predicate is None, return the items that are true. Equivalent to:
def ifilter(predicate, iterable):
    # ifilter(lambda x: x%2, range(10)) --> 1 3 5 7 9
    if predicate is None:
        predicate = bool
    for x in iterable:
        if predicate(x):
            yield x
So I wrote the following:
from itertools import count

genStore = [count(2)]
primeStore = []
while 1:
    prime = next(genStore[-1])
    primeStore.append(prime)
    genStore.append(filter(lambda x: x % prime, genStore[-1]))
I expected to get:
2
3
5
7
11
13
17
...
What I get is:
2
3
4
5
6
7
...
It seems next() only iterates through count(), not the filters. Objects in the list should point to the same underlying objects, so I expected it to work like filter(lambda x: x%n, (... filter(lambda x: x%3, filter(lambda x: x%2, count(2))))). I did some experiments and noticed the following characteristics:
filter(lambda x: x%2, filter(lambda x: x%3, count(0))) works, filtering out all 2*n and 3*n.
genStore = [count(2)]; genStore.append(filter(lambda x: x%2, genStore[-1])); genStore.append(filter(lambda x: x%3, genStore[-1])) also works, filtering out all 2*n and 3*n.
next(filter(lambda x: x%2, filter(lambda x: x%3, count(2)))) works, printing out 5.
In contrast:
from itertools import count

genStore = [count(2)]
primeStore = []
while 1:
    prime = next(genStore[-1])
    print(prime)
    primeStore.append(prime)
    genStore.append(filter(lambda x: x % prime, genStore[-1]))
    if len(genStore) == 3:
        for i in genStore[-1]:
            print(i)
# It doesn't work, only filtering out 3*n.
Questions:
Why doesn't it work?
Is it a feature of python, or I made mistakes somewhere?
Is there any way to fix it?
I think your problem stems from the fact that the lambdas are not evaluated until it's 'too late', and by then prime is the same for all of them, as they all point at the same variable.
You can try to add a custom filter and use a normal function instead of a lambda:
def myfilt(f, i, p):
    for n in i:
        print("gen:", n, i)
        if f(n, p):
            yield n

def byprime(x, p):
    if x % p:
        print("pri:", x, p)
        return True

f = myfilt(byprime, genStore[-1], prime)
This way you avoid the problem of the lambdas all being the same.
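Another common fix for the same late-binding problem (a sketch against the question's loop, not part of this answer's approach) is to freeze the current prime with a default argument, so each lambda keeps its own copy:

# bind the current value of prime at definition time
genStore.append(filter(lambda x, p=prime: x % p, genStore[-1]))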