Computing with a large data file - python

I have a very large (say a few thousand) list of partitions, something like:
[[9,0,0,0,0,0,0,0,0],
[8,1,0,0,0,0,0,0,0],
...,
[1,1,1,1,1,1,1,1,1]]
What I want to do is apply to each of them a function (which outputs a small number of partitions), then put all the outputs in a list and remove duplicates.
I am able to do this, but the problem is that my computer gets very slow if I put the above list directly into the python file (esp. when scrolling). What is making it slow? If it is the memory used to load the whole list, is there a way to put the partitions in another file and have the function just read the list term by term?
EDIT: I am adding some code. My code is probably very inefficient because I'm quite an amateur. So what I really have is a list of lists of partitions, that I want to add to:
listofparts3 = [[[3],[2,1],[1,1,1]],
                [[6],[5,1],...,[1,1,1,1,1,1]],...]
def addtolist3(n):
    a=int(n/3)-2
    counter = 0
    added = []
    for i in range(len(listofparts3[a])):
        first = listofparts3[a][i]
        if len(first)<n:
            for i in range(n-len(first)):
                first.append(0)
        answer = lowering1(fock(first),-2)[0]
        for j in range(len(answer)):
            newelement = True
            for k in range(len(added)):
                if (answer[j]==added[k]).all():
                    newelement = False
                    break
            if newelement==True:
                added.append(answer[j])
        print(counter)
        counter = counter+1
    for i in range(len(added)):
        added[i]=partition(added[i]).tolist()
    return(added)
fock, lowering1, and partition are all functions defined in earlier code; they are pretty simple. The above function, say addtolist3(24), takes all the partitions of 21 that I have and returns the desired list of partitions of 24, which I can then append to the end of listofparts3.

A few thousand partitions uses only a modest amount of memory, so that likely isn't the source of your problem.
One way to speed up function application is to use map() in Python 3 or itertools.imap() in Python 2.
The fastest way to eliminate duplicates is to feed them into a Python set() object. Set elements must be hashable, so convert each partition from a list to a tuple first.
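As a minimal sketch of the map() plus set() idea: the function grow below is a hypothetical stand-in for the asker's own function (it is not from the question), and the two input partitions are made up for illustration.

```python
from itertools import chain

def grow(partition):
    """Hypothetical stand-in: return the partitions obtained by adding 1 to one part."""
    results = []
    for i in range(len(partition)):
        candidate = list(partition)
        candidate[i] += 1
        results.append(sorted(candidate, reverse=True))
    return results

partitions = [[2, 0, 0], [1, 1, 0]]

# map() is lazy in Python 3; chain.from_iterable flattens the per-partition outputs.
outputs = chain.from_iterable(map(grow, partitions))

# Lists are unhashable, so each partition becomes a tuple before it enters the set.
unique = {tuple(p) for p in outputs}
result = sorted(unique, reverse=True)
print(result)
```

The set replaces the nested "have I seen this already" loops, so each duplicate check takes roughly constant time instead of a scan over everything collected so far.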

Related

How do I extract from a list in python?

If I have a list that is made up of 1MM ids, how would I pull from that list in intervals of 50k?
For example:
[1]cusid=df['customer_id'].unique().tolist()
[1]1,000,500
If I want to pull in chunks, is the below correct for 50k?
cusid=cusid[:50000] - first 50k ids
cusid=cusid[50000:100001] - the next 50k of ids
cusid=cusid[100001:150001] - the next 50k
are my interval selections correct?
Thanks!
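cusid2 = [cusid[a:a+50000] for a in range(0, len(cusid), 50000)]
This is a list comprehension: it collects the slice cusid[a:a+50000] for every a from 0 up to the length of the list, in steps of 50k. A slice excludes its end index, so the second chunk is cusid[50000:100000], not cusid[50000:100001] (which holds 50,001 ids and makes the next chunk skip one). The last chunk is simply shorter when the length is not a multiple of 50k. Also, slice into new names rather than reassigning cusid, otherwise the later slices operate on the already truncated list.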
Couple of things to mention:
It seems that you're using the "data science" stack for your work, so there's a good chance you have numpy available; please take a look at numpy.array_split. You can calculate the number of chunks once and rely on numpy's view machinery. Most probably this is a lot faster than converting numpy arrays into native Python lists.
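A small sketch of the numpy route, assuming the ids are available as a one-dimensional array (np.arange here is only a stand-in for the real customer ids). Passing split points rather than a section count keeps every chunk at exactly 50k except the last one.

```python
import numpy as np

ids = np.arange(1_000_500)          # stand-in for the unique customer ids
chunk_size = 50_000

# Split points at 50k, 100k, ...; the final chunk holds whatever is left over.
split_points = np.arange(chunk_size, len(ids), chunk_size)
chunks = np.array_split(ids, split_points)

print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

If an integer is passed instead of split points, array_split divides the array into that many nearly equal parts, which would not give 50k-sized chunks here.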
Idiomatic python approach (IMO) would be leveraging iterators + islice:
from itertools import islice
# create iterator from your array/list, this is cheap operation
iterator = iter(cusid)
# if you want element-wise operations, you can use your chunk in loops or function that require iterations
# this is really memory-efficient, as you don't put whole chunk in memory
chunk = islice(iterator, 50000)
s = sum(chunk)
# in case you really need the whole chunk in memory, just turn the islice into a list
chunk = list(islice(iterator, 50000))
last_in_chunk = chunk[-1]
# and you always use same code to consume next chunk from your source
# without maintaining any counters
next_chunk = list(islice(iterator, 50000))
When your iterator is exhausted (there are no values left) you will get empty chunks. When there are not enough elements left to fill a whole chunk, you will get however many remain.
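The behaviour described above can be wrapped in a small generator. This is only a sketch; iter_chunks is a hypothetical helper name, and range(1_000_500) stands in for the real list of ids.

```python
from itertools import islice

def iter_chunks(iterable, size):
    """Yield successive lists of at most `size` items until the source is exhausted."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:          # an empty chunk means nothing is left
            return
        yield chunk

sizes = [len(chunk) for chunk in iter_chunks(range(1_000_500), 50_000)]
print(len(sizes), sizes[0], sizes[-1])
```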

Memory problems for multiple large arrays

I'm trying to do some calculations on over 1000 (100, 100, 1000) arrays. But as I could imagine, it doesn't take more than about 150-200 arrays before my memory is used up, and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above code so I have one list with patients having toxicity and one where they have not.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which takes an already saved-to-disk array ((100, 100, 1000)) for each patient, and then remove some indexes (which is also loaded from a saved file) on each array that will not work later on, or just needs to be removed. So it is essential to do so. The result is a final list of all patients and their 3D arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)
remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
Next step is to create two lists that I need for my calculations. I take the final list, with all the patients, and remove either patients that have toxicity or not, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, where I put the non-tox list into the right side of the equation, and with tox on the left side. It then sums up for all 1000 patients for each individual index in the 3D array of all patients, i.e. same index in each 3D array/patient, and then I end up with a large list of values pretty much.
log_likely = (np.sum(np.log(patients_with_tox_list), axis=1) +
              np.sum(np.log(1 - patients_no_tox_list), axis=1))
My problem, as stated, is that when I get to around 150-200 (in the patients range) my memory is used up, and it shuts down.
I have obviously tried to save stuff on the disk to load (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could feed one array at a time into the log_likely function, but in the end, before summing, I would probably have just as large an array, plus the computation might be a lot slower if I can't use the numpy sum feature and such.
So is there any way I could optimize/improve on this, or is the only way to buy a hell of a lot more RAM?
Each time you use a list comprehension, you create a new copy of the data in memory. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
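contains the complete data for all 1000 patients! If the arrays are stored as float64, each (100, 100, 1000) array takes about 80 MB, so 1000 of them need roughly 80 GB.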
The better choice is to utilize generator expressions, which process items one at a time. To form a generator, surround your for...in...: expression with parentheses instead of brackets. This might look something like:
import functools
import itertools
with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)
no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)
final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. A simple way to aggregate all the results in this case is to use reduce, which adds the arrays element by element and keeps only the running total in memory:
log_likely = functools.reduce(np.add, final_data)
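To check that the streaming sum matches the stacked computation from the question, here is a self-contained sketch. load_patient is a hypothetical stand-in for the np.load plus np.delete step (it generates random probabilities instead of reading files), and the patient indices are made up.

```python
import functools
import itertools
import numpy as np

def load_patient(patient):
    """Hypothetical stand-in for loading one patient's flattened array from disk."""
    rng = np.random.default_rng(patient)
    return rng.uniform(0.1, 0.9, size=1000)

with_tox = [0, 2]
no_tox = [1, 3, 4]

# Generators: only one patient's array is alive at any moment.
with_tox_log = (np.log(load_patient(p)) for p in with_tox)
no_tox_log = (np.log(1 - load_patient(p)) for p in no_tox)
log_likely = functools.reduce(np.add, itertools.chain(with_tox_log, no_tox_log))

# Reference: the stacked version, which holds every array in memory at once.
stacked_with = np.column_stack([load_patient(p) for p in with_tox])
stacked_no = np.column_stack([load_patient(p) for p in no_tox])
expected = np.sum(np.log(stacked_with), axis=1) + np.sum(np.log(1 - stacked_no), axis=1)

print(np.allclose(log_likely, expected))
```

Peak memory in the streaming version is about two arrays (the running total and the current patient) rather than one array per patient.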

Which is better: deque or list slicing?

If I use the code
from collections import deque
q = deque(maxlen=2)
while step <= step_max:
    calculate(item)
    q.append(item)
    another_calculation(q)
how does it compare in efficiency and readability to
q = []
while step <= step_max:
    calculate(item)
    q.append(item)
    q = q[-2:]
    another_calculation(q)
calculate() and another_calculation() are not real in this case but in my actual program are simply two calculations. I'm doing these calculations every step for millions of steps (I'm simulating an ion in 2-d space). Because there are so many steps, q gets very long and uses a lot of memory, while another_calculation() only uses the last two values of q. I had been using the latter method, then heard deque mentioned and thought it might be more efficient; thus the question.
I.e., how do deques in python compare to just normal list slicing?
q = q[-2:]
This is a costly operation because it creates a new list every time (and copies the references). A side effect is that it rebinds q to a new list object, although you can write q[:] = q[-2:] to avoid that.
The deque object just moves its start pointer and "forgets" the oldest item. So it's faster, and this is one of the uses it was designed for.
Of course, for 2 values, there isn't much difference, but for a bigger number there is.
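A short sketch of how a bounded deque behaves; the integers stand in for the simulation's per-step values.

```python
from collections import deque

q = deque(maxlen=2)
history = []
for item in range(5):
    q.append(item)             # once full, appending discards the oldest item
    history.append(list(q))    # snapshot of the window after each step

print(history)
```

The deque never grows beyond two entries, so no slicing or copying is needed inside the loop.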
If I interpret your question correctly, you have a function, that calculates a value, and you want to do another calculation with this and the previous value. The best way is to use two variables:
while step <= step_max:
    item = calculate()
    another_calculation(previous_item, item)
    previous_item = item
If the calculations are some form of vector math, you should consider using numpy.

Increment numbers in list from a certain point

I have a list of numbers, e.g. [50,100,150,200,250]. I need to increment (or decrement) each number from a specified index and by a specified amount. I have been able to do this in two ways:
from itertools import islice
l = [50,100,150,200,250]
start_increment_index = 3
l[start_increment_index:] = [e+100 for e in l[start_increment_index:]]
print (l)
l = [50,100,150,200,250]
l[start_increment_index:] = [e+100 for e in islice(l,start_increment_index,len(l))]
print (l)
Both print: [50, 100, 150, 300, 350].
However, my real list contains millions of numbers and this operation is performed repeatedly with different indexes and different increments/decrements. Would there be a faster way of doing this using a Python list? I have been considering writing my own C/C++ extension to deal with this.
Edit: Would this be a useful module for Python in general? Having a function written in C which can take parameters (python_list_object, increment_amount, start_index, end_index)?
The main problem with your solution is that it creates two lists (allocating memory and copying each time): first the list comprehension itself, and second the slice l[start_increment_index:] inside it.
If your data source is a Python list, you can do the operation in place, in O(n) time and without any extra allocation:
for i in range(start_increment_index, len(l)):
    l[i] += increment
NB: define increment first.
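Wrapped in a function, the in-place loop might look like the sketch below; increment_from is a hypothetical helper name, and the alias check only demonstrates that no new list is created.

```python
def increment_from(values, start, amount):
    """Add `amount` to every element of `values` from index `start` onward, in place."""
    for i in range(start, len(values)):
        values[i] += amount

l = [50, 100, 150, 200, 250]
alias = l                       # second reference to the same list object
increment_from(l, 3, 100)
print(l)
```

If the data can live in a numpy array instead, arr[start:] += amount performs the same in-place update with the loop running in C.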
It depends on your goals. I suppose you could use a segment tree for this case. For more information see https://en.m.wikipedia.org/wiki/Segment_tree.
A brief description: this structure represents an array on which range operations are performed (such as adding a number to, or subtracting it from, every element of a subarray). It is optimized for the case where you have a very large number of such range queries.
Note: if you want to use only Python lists, you can store the tree implicitly in a flat list instead of using linked node objects.
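As an illustration of the flat-list idea, the sketch below uses a Fenwick tree (binary indexed tree) over a difference array rather than a full segment tree; it is a simpler structure that also supports range addition, with both updates and single-element reads taking O(log n). The class name RangeAddList and its methods are hypothetical.

```python
class RangeAddList:
    """Range-add / point-read over a fixed-length list, backed by a Fenwick tree."""

    def __init__(self, values):
        self.base = list(values)          # original values, never modified
        self.n = len(self.base)
        self.tree = [0] * (self.n + 1)    # 1-based Fenwick tree over the differences

    def _add(self, pos, amount):
        # Record `amount` at difference position `pos` (0-based).
        i = pos + 1
        while i <= self.n:
            self.tree[i] += amount
            i += i & -i

    def add_range(self, start, stop, amount):
        # Add `amount` to every element with index in [start, stop).
        if start >= stop:
            return
        self._add(start, amount)
        if stop < self.n:
            self._add(stop, -amount)

    def __getitem__(self, index):
        # Current value = base value + prefix sum of the recorded differences.
        total = 0
        i = index + 1
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return self.base[index] + total

r = RangeAddList([50, 100, 150, 200, 250])
r.add_range(3, 5, 100)
after_first = [r[i] for i in range(5)]
r.add_range(1, 3, -10)
after_second = [r[i] for i in range(5)]
print(after_first, after_second)
```

The trade-off is that reading a single element now costs O(log n) instead of O(1), so this only pays off when range updates are far more frequent than full scans of the list.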

Searching values in a large matrix

I'm working with python 3.5 and I'm writing a script that handles large spreadsheet files. Each row of the spreadsheet contains a phrase and several other relevant values. I'm parsing the file as a matrix; the example file has over 3000 rows (and even larger files are to be expected). I also have a list of 100 words. For each word, I need to find which rows of the matrix contain it in their string, and print some averages based on that.
Currently I'm iterating over each row of the matrix and checking whether the string contains any of the mentioned words, but this process takes 3000 iterations, with 100 checks for each one. Is there any better way to accomplish this task?
In the long run, I would encourage you to use something more suitable for the task. A SQL database, for instance.
But if you stick with writing your own python solution, here are some things you can do to optimize it:
Use sets. Sets have a very efficient membership check.
wordset_100 = set(wordlist_100)
for row in data_3k:
    word_matches = wordset_100.intersection(row.phrase.split(" "))
    for match in word_matches:
        # add to accumulator
        # this loop runs at most len(row.phrase.split(" ")) times
        pass
Parallelize.
from functools import partial
from multiprocessing import Pool
from collections import defaultdict
def matches(wordset_100, row):
    return wordset_100.intersection(row.phrase.split(" ")), row
if __name__ == "__main__":
    accu = defaultdict(int)
    wordset_100 = set(wordlist_100)
    with Pool() as p:
        # partial() binds the word set so that map() only has to supply each row
        for m, r in p.map(partial(matches, wordset_100), data_3k):
            for word in m:
                accu[word] += r.number
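A self-contained sketch of the set-based approach, extended to compute per-word averages. The Row layout (phrase, number) and the sample data are assumptions made for illustration; the real spreadsheet rows will look different.

```python
from collections import defaultdict, namedtuple

Row = namedtuple("Row", ["phrase", "number"])   # hypothetical row layout

rows = [Row("red apple pie", 10), Row("green apple", 20), Row("blue sky", 5)]
words = {"apple", "sky", "pear"}

totals = defaultdict(int)
counts = defaultdict(int)
for row in rows:
    # One set intersection per row replaces a separate check for every word.
    for word in words.intersection(row.phrase.split()):
        totals[word] += row.number
        counts[word] += 1

averages = {word: totals[word] / counts[word] for word in totals}
print(averages)
```

Note that this matches whole words only; if a word should also match inside a longer token (for example "apple" in "apples"), a substring test per word is still needed.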
