How do I extract from a list in Python?

If I have a list that is made up of 1MM ids, how would I pull from that list in intervals of 50k?
For example:
cusid = df['customer_id'].unique().tolist()   # 1,000,500 unique ids
If I want to pull in chunks, is the below correct for 50k?
cusid=cusid[:50000] - first 50k ids
cusid=cusid[50000:100001] - the next 50k of ids
cusid=cusid[100001:150001] - the next 50k
are my interval selections correct?
Thanks!

cusid2 = [cusid[a:a+50000] for a in range(0, len(cusid), 50000)]
This is a list comprehension: for each a from 0 up to len(cusid), stepping by 50k, it appends the slice cusid[a:a+50000] to the new list, so you get consecutive 50k chunks plus a final, shorter chunk holding whatever is left over.
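As a quick sanity check, the same slicing pattern on a small list (chunk size 3 instead of 50k, purely for illustration):
small = list(range(10))
chunks = [small[a:a+3] for a in range(0, len(small), 3)]
# chunks == [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]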

Couple of things to mention:
It seems you're using the "data science" stack for your work, so there's a good chance you have NumPy available; take a look at numpy.array_split. You can compute the number of chunks once and let NumPy's view machinery do the splitting, which is most likely a lot faster than converting NumPy arrays into native Python lists. For example:
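This is only a sketch, and it assumes cusid is kept as a NumPy array (skipping the .tolist() call) rather than converted to a list:
import numpy as np

cusid = df['customer_id'].unique()          # keep it as a NumPy array, skip .tolist()
n_chunks = -(-len(cusid) // 50000)          # ceiling division: number of 50k chunks needed
chunks = np.array_split(cusid, n_chunks)    # list of sub-arrays, each holding at most ~50k ids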
An idiomatic Python approach (IMO) would be to leverage iterators + islice:
from itertools import islice
# create an iterator from your array/list; this is a cheap operation
iterator = iter(cusid)
# if you only need element-wise operations, you can feed your chunk to loops or functions that accept iterables
# this is really memory-efficient, as you don't put the whole chunk in memory
chunk = islice(iterator, 50000)
s = sum(chunk)
# in case you really need the whole chunk in memory, just turn the islice into a list
chunk = list(islice(iterator, 50000))
last_in_chunk = chunk[-1]
# and you always use same code to consume next chunk from your source
# without maintaining any counters
next_chunk = list(islice(iterator, 50000))
When your iterator is exhausted (there are no values left) you will get empty chunk(s). When there aren't enough elements to fill a chunk, you will get however many are left.
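If you need this chunking more than once, the pattern above is easy to wrap in a small helper; the name chunked below is just illustrative (Python 3.12 ships a similar itertools.batched):
from itertools import islice

def chunked(iterable, size):
    # yield successive lists of at most `size` items until the source runs out
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

for chunk in chunked(cusid, 50000):
    ...   # process one 50k block at a time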

Related

Memory problems for multiple large arrays

I'm trying to do some calculations on over 1000 (100, 100, 1000) arrays. But as you might imagine, it doesn't take more than about 150-200 arrays before my memory is used up and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above code so I have one list with patients having toxicity and one where they have not.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which takes each patient's already saved-to-disk (100, 100, 1000) array and removes some indices (also loaded from a saved file) that either won't work later on or simply need to be removed, so this step is essential. The result is a final list of all patients and their 3D arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)

remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
Next step is to create two lists that I need for my calculations. I take the final list, with all the patients, and remove either patients that have toxicity or not, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, with the non-tox list on the right side and the with-tox list on the left. For each individual index it sums across all 1000 patients (i.e. the same index in each patient's 3D array), so I end up with one large array of values.
log_likely = np.sum(np.log(patients_with_tox_list), axis=1) + \
             np.sum(np.log(1 - patients_no_tox_list), axis=1)
My problem, as stated, is that when I get to around 150-200 in the patients range my memory is used up and it shuts down.
I have obviously tried saving things to disk and loading them back (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could feed one array at a time into the log_likely function, but in the end, before summing, I would probably have just as large an array, plus the computation might be a lot slower if I can't use NumPy's sum feature and such.
So is there any way I could optimize/improve on this, or is the only way to buy a hell of a lot more RAM?
Each time you use a list comprehension, you create a new copy of the data in memory. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to utilize generator expressions, which process items one at a time. To form a generator expression, surround your for ... in ... expression with parentheses instead of square brackets. This might look something like:
import itertools
import functools
import numpy as np

with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)        # np.log is element-wise; it takes no axis argument
no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)
final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. The fastest way to aggregate all the results in this case is to use reduce:
log_likely = functools.reduce(np.add, final_data)
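For a sense of how the reduce() call behaves, here is a toy illustration (the arrays below are stand-ins, not the real patient data): reduce() pulls one array at a time from the generator and keeps only the running element-wise sum in memory.
import functools
import numpy as np

gen = (np.full(3, i, dtype=float) for i in range(4))   # stand-ins for the per-patient 1-D arrays
total = functools.reduce(np.add, gen)                  # array([6., 6., 6.]) == 0 + 1 + 2 + 3 at each position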

How to optimize my writing from RAM to disc?

I have some python code for reading data from RAM of an FPGA and writing it to disk on my computer. The code's runtime is 2.56sec. I need to bring it down to 2sec.
mem = device.getNode("udaq.readout_mem").readBlock(16384)
device.dispatch()
ram.append(mem)
ram.reverse()
memory = ram.pop()
for j in range(16384):
    if 0 < j < 4096:
        f.write('0x%05x\t0x%08x\n' % (j, memory[j]))
    if 8192 < j < 12288:
        f.write('0x%05x\t0x%08x\n' % (j, memory[j]))
Your loop is very inefficient. You're literally iterating for nothing when values aren't in range, and you're spending a lot of time testing the indices.
Don't do one loop and 2 tests. Just create 2 loops without index tests (note that the first index of each range is skipped if we respect your tests):
for j in range(1, 4096):
    f.write('0x%05x\t0x%08x\n' % (j, memory[j]))
for j in range(8193, 12288):
    f.write('0x%05x\t0x%08x\n' % (j, memory[j]))
Maybe more Pythonic and more concise (and not indexing memory[j], so it has a chance to be faster):
import itertools

for start, end in ((1, 4096), (8193, 12288)):
    sl = itertools.islice(memory, start, end)
    for j, m in enumerate(sl, start):
        f.write('0x%05x\t0x%08x\n' % (j, m))
The outer loop replaces the 2 separate loops (so if there are more offset ranges, just add them to the tuple list). The islice object iterates over a slice of memory without making a copy and without checking the indices for out-of-bounds access each time, so it can be faster. It has yet to be benchmarked, though, and the writing to disk is probably taking a lot of the time as well.
Jean-François Fabre's observations on the loops are very good, but we can go further. The code is performing around 8000 write operations, of constant size, and with nearly the same content. We can prepare a buffer to do that in one operation.
# Prepare buffer with static portions
addresses = list(range(1, 4096)) + list(range(8193, 12288))
dataoffset = 2 + 5 + 1 + 2
linelength = dataoffset + 8 + 1
buf = bytearray(b"".join(b'0x%05x\t0x%08x\n' % (j, 0)
                         for j in addresses))

# Later on, fill in data
for line, address in enumerate(addresses):
    offset = linelength * line + dataoffset
    buf[offset:offset + 8] = b"%08x" % memory[address]
f.write(buf)
This means far fewer system calls. It's likely we can go even further, e.g. by reading the memory as a buffer and using b2a_hex or similar rather than one string-formatting call per word. It might also make sense to precalculate the offsets rather than using enumerate.
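A rough, untested sketch of the b2a_hex idea might look like this; it reuses addresses, buf, linelength and dataoffset from the block above and assumes memory can be reinterpreted as unsigned 32-bit words:
from binascii import b2a_hex
import numpy as np

# view the selected words as big-endian uint32 so the hex digits come out most-significant first
words = np.asarray(memory, dtype=">u4")[np.r_[1:4096, 8193:12288]]
hexdata = b2a_hex(words.tobytes())              # 8 lowercase hex chars per word, concatenated
for line in range(len(addresses)):
    offset = linelength * line + dataoffset
    buf[offset:offset + 8] = hexdata[8 * line:8 * line + 8]
f.write(buf)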

Computing with a large data file

I have a very large (say a few thousand) list of partitions, something like:
[[9,0,0,0,0,0,0,0,0],
[8,1,0,0,0,0,0,0,0],
...,
[1,1,1,1,1,1,1,1,1]]
What I want to do is apply to each of them a function (which outputs a small number of partitions), then put all the outputs in a list and remove duplicates.
I am able to do this, but the problem is that my computer gets very slow if I put the above list directly into the Python file (especially when scrolling). What is making it slow? If it is the memory used to load the whole list, is there a way to put the partitions in another file and have the function read the list term by term?
EDIT: I am adding some code. My code is probably very inefficient because I'm quite an amateur. What I really have is a list of lists of partitions that I want to add to:
listofparts3 = [[[3],[2,1],[1,1,1]],
[[6],[5,1],...,[1,1,1,1,1,1]],...]
def addtolist3(n):
    a = int(n/3) - 2
    counter = 0
    added = []
    for i in range(len(listofparts3[a])):
        first = listofparts3[a][i]
        if len(first) < n:
            for i in range(n - len(first)):
                first.append(0)
        answer = lowering1(fock(first), -2)[0]
        for j in range(len(answer)):
            newelement = True
            for k in range(len(added)):
                if (answer[j] == added[k]).all():
                    newelement = False
                    break
            if newelement == True:
                added.append(answer[j])
        print(counter)
        counter = counter + 1
    for i in range(len(added)):
        added[i] = partition(added[i]).tolist()
    return(added)
fock, lowering1 and partition are all functions defined in earlier code; they are pretty simple. The above function, say addtolist3(24), takes all the partitions of 21 that I have and returns the desired list of partitions of 24, which I can then append to the end of listofparts3.
A few thousand partitions use only a modest amount of memory, so that likely isn't the source of your problem.
One way to speed up function application is to use map() in Python 3 or itertools.imap() in Python 2.
The fastest way to eliminate duplicates is to feed them into a Python set() object.
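A minimal sketch of that combination, where apply_func is a stand-in for your own function and each partition is converted to a tuple so it can live in a set:
results = set()
for outputs in map(apply_func, partitions):        # apply_func returns a few partitions per input
    results.update(tuple(p) for p in outputs)      # tuples are hashable, so the set drops duplicates
unique_outputs = [list(p) for p in results]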

Limiting the number of combinations /permutations in python

I was going to generate some combinations using itertools when I realized that as the number of elements increases, the time taken will increase exponentially. Can I limit or indicate the maximum number of permutations to be produced, so that itertools would stop after that limit is reached?
What I mean to say is:
Currently I have
#big_list is a list of lists
permutation_list = list(itertools.product(*big_list))
Currently this permutation list has over 6 million permutations. I am pretty sure that if I add another list, this number would hit the billion mark.
What I really need is a significant number of permutations (let's say 5000). Is there a way to limit the size of the permutation_list that is produced?
You need to use itertools.islice, like this
itertools.islice(itertools.product(*big_list), 5000)
It doesn't create the entire list in memory, but it returns an iterator which consumes the actual iterable lazily. You can convert that to a list like this
list(itertools.islice(itertools.product(*big_list), 5000))
itertools.islice has many benefits, such as the ability to set start and step. The solutions below aren't that flexible, and you should use them only if start is 0 and step is 1. On the other hand, they don't require any imports.
You could create a tiny wrapper around itertools.product
it = itertools.product(*big_list)
pg = (next(it) for _ in range(5000)) # generator expression
(next(it) for _ in range(5000)) returns a generator that is not capable of producing more than 5000 values. Convert it to a list by using the list constructor
pl = list(pg)
or by wrapping the generator expression with square brackets (instead of round ones)
pl = [next(it) for _ in range(5000)] # list comprehension
Another solution, which is just as efficient as the first one, is
pg = (p for p, _ in zip(itertools.product(*big_list), range(5000)))
This works in Python 3+, where zip returns an iterator that stops when the shortest iterable is exhausted. Conversion to a list is done as in the first solution.
You can try this method to get a particular number of permutations. The number of results permutations() produces is n!, where n stands for the number of elements in the list. For example, if you want to get only 2 results, you can try the following:
Use any temporary variable and limit it
from itertools import permutations

m = ['a', 'b', 'c', 'd']
per = permutations(m)   # lazy iterator; wrapping it in list() would build all n! tuples up front
temp = 1
for i in per:
    if temp <= 2:       # 2 is the limit set
        print(i)
        temp = temp + 1
    else:
        break

How can I pop() lots of elements from a deque?

I have a deque object that holds a large amount of data. I want to extract, say, 4096 elements from the front of the queue (I'm using it as a kind of FIFO). It seems like there should be a way of doing this without having to iterate over 4096 pop requests.
Is this correct/efficient/stupid?
A = arange(100000)
B = deque()
C = [] # List will do
B.extend(A) # Nice large deque
# extract 4096 elements
for i in xrange(4096):
    C.append(A.popleft())
There is no multi-pop method for deques. You're welcome to submit a feature request to bugs.python.org and I'll consider adding it.
I don't know the details of your use case, but if your data comes in blocks of 4096, consider storing the blocks in tuples or lists and then adding the blocks to the deque:
block = data[:4096]
d.append(block)
...
someblock = d.popleft()
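A minimal runnable sketch of that block-at-a-time pattern, with data standing in for whatever actually produces your values:
from collections import deque

data = list(range(100000))               # placeholder source
d = deque()
for i in range(0, len(data), 4096):      # producer side: append whole 4096-element blocks
    d.append(data[i:i + 4096])

someblock = d.popleft()                  # consumer side: one popleft() hands back an entire block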
Where you're using a deque, the .popleft() method really is the best way of getting elements off the front. You can index into it, but indexing performance degrades toward the middle of the deque (as opposed to a list, which has fast indexed access but slow pops from the front). You could get away with this, though (it saves a few lines of code):
A = arange(100000)
B = deque(A)
C = [B.popleft() for _i in xrange(4096)]
