I am trying to form an optimized approach to splitting a list of file names (examples shortly) in an x:y ratio based on the file names. This file list was procured using os.scandir (better performance than os.listdir; src: Python docs, os.scandir).
Example -
Files (extension disregarded)-
A_1,A_2,...A_10 (here A is filename and 1 is the sample number of the file)
B_1,B_2,...B_10
and so on
Let's say the x:y ratio is 7:3
So I would like 70% of the file names (A_1..A_7, B_1..B_7) in one list and 30% (A_8..A_10, B_8..B_10) in another. The order within each list does not matter, meaning list 1 could contain A_1, A_9, A_5, etc., as long as the split is 7 files in list 1 to 3 files in list 2 for each filename.
Now it must be noted that this directory is huge (~150k files) and the number of samples per filename varies, i.e. files with filename A may have 1000 samples or only 5. Also, there are about 400 unique filenames.
My current approach should not be called a solution at all, as it defeats the purpose of an accurate ratio for each filename. It simply splits the list of fileObjects (basically: a name like A, a number like 1, and the data within file A_1, and so on) as a whole in the x:y ratio, taking advantage of the fact that entries are yielded in arbitrary order by os.scandir.
ratio_number = int(len(list_of_fileObjects) *.7)
list_70 = list_of_fileObjects[:ratio_number]
list_30 = list_of_fileObjects[ratio_number:]
My second approach, which would at least be a valid solution, was to create a separate list for each filename (which involves sorting the whole list of files), split it in the ratio, and repeat this for each filename. I am looking for a more pythonic/elegant solution to this problem. Any suggestions or help would be appreciated, especially considering the size of the data being dealt with.
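For reference, my second approach would look roughly like this (only a sketch; it assumes each fileObject exposes its filename as obj.name, per the description above, and groups with a defaultdict instead of sorting):

from collections import defaultdict

groups = defaultdict(list)
for obj in list_of_fileObjects:
    groups[obj.name].append(obj)      # bucket the objects by filename (A, B, ...); 'name' attribute assumed

list_70, list_30 = [], []
for name, objs in groups.items():
    cut = int(len(objs) * 0.7)        # 70% of this filename's samples
    list_70.extend(objs[:cut])
    list_30.extend(objs[cut:])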
If I understand the situation correctly, you're trying to partition the same proportion of each filename prefix's files. Your current method selects the correct proportion from the whole set of files, but it doesn't consider the different filename prefixes, so it may not get them in the correct proportion (though it will probably be somewhat close most of the time).
Your second approach avoids that issue by first separating the filenames by prefix, then partitioning each sublist. But if you want a combined list with all the prefixes together, this approach may end up wasting time copying data around, since you have to separate out and then recombine the separate lists by prefix.
I think you can do what you want with a single loop over the filenames. You'll need to keep track of two data points for each filename prefix: The number of files with that prefix you've selected for the first sample and the total number of files with that prefix that you've seen.
ratio = 0.7
prefix_dict = {}     # values are lists: [number_selected_for_first_list, total_number_seen]
first_sample = []    # gets a proportion of the files equal to ratio (for each prefix)
second_sample = []   # gets the rest of the files

for filename in list_of_files:
    prefix = filename.split("_", 1)[0]
    selected_seen = prefix_dict.setdefault(prefix, [0, 0])
    selected_seen[1] += 1
    if selected_seen[0] < round(ratio * selected_seen[1]):
        first_sample.append(filename)
        selected_seen[0] += 1
    else:
        second_sample.append(filename)
The only tricky part of this code is the use of dict.setdefault to fetch the selected_seen list. If the requested prefix didn't yet exist in the dictionary, a new value ([0, 0]) is added to the dictionary under that key (and returned). The later code modifies the list in place.
Depending on how exactly you want to handle inexact proportions, you can change the if condition a bit. I put in a round call (which I think will partition most accurately), but the code would work OK without it (biasing the selection towards the second sample) or with selected_seen[0] <= int(ratio * selected_seen[1]) (biasing towards the first sample).
Note that whichever way you choose to round when partitioning each prefix, there's the possibility that the separate prefixes will all end up unbalanced in the same direction, making the overall samples unbalanced by more than you'd normally expect. For instance, if you had ten prefixes with ten files each (100 files total), a ratio of 0.75 would result in final sample lists of 80 and 20 files rather than 75 and 25. That happens because each prefix gets partitioned 8 and 2 (since 7.5 rounds up to 8). If every file had a unique prefix, you'd end up with everything in the first sample! If it's very important that the overall samples be the right sizes, you might need to fudge the sampling of the items a bit, based on the overall sample sizes.
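If exact overall sizes matter, one way to fudge it (a minimal sketch, assuming first_sample and second_sample were built as above) is to move surplus files between the lists after the per-prefix pass:

target = round(ratio * (len(first_sample) + len(second_sample)))
while len(first_sample) > target:
    second_sample.append(first_sample.pop())   # overall first sample too big: move files out
while len(first_sample) < target:
    first_sample.append(second_sample.pop())   # overall first sample too small: move files in

Of course, any file you move slightly disturbs its own prefix's ratio; that's the unavoidable trade-off.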
I figured out a good solution to this problem.
all_file_names = {}

# ObjList is a list of objects, but we only need
# file_name from each object for this solution
for x in ObjList:
    if x.file_name not in all_file_names:
        all_file_names[x.file_name] = 1
    else:
        all_file_names[x.file_name] += 1

trainingData = []
testData = []
temp_dict = {}

for x in ObjList:
    ratio = int(0.7 * all_file_names[x.file_name]) + 1
    if x.file_name not in temp_dict:
        temp_dict[x.file_name] = 1
        trainingData.append(x)
    else:
        temp_dict[x.file_name] += 1
        if temp_dict[x.file_name] < ratio:
            trainingData.append(x)
        else:
            testData.append(x)
I'm trying to do some calculations on over 1000 arrays of shape (100, 100, 1000). But, as I could have imagined, it doesn't take more than about 150-200 arrays before my memory is used up and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above code so I have one list with patients having toxicity and one where they have not.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which takes an already saved-to-disk array of shape (100, 100, 1000) for each patient and removes some indexes (the index list is also loaded from a saved file) that will not work later on or simply need to be removed, so doing this is essential. The result is a final list of all patients and their 3D arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)

remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
Next step is to create two lists that I need for my calculations. I take the final list, with all the patients, and remove either patients that have toxicity or not, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, where I put the non-tox list into the right side of the equation and the with-tox list on the left side. It sums over all ~1000 patients for each individual index in the 3D arrays, i.e. the same index in each patient's 3D array, and I end up with one large list of values.
log_likely = (np.sum(np.log(patients_with_tox_list), axis=1) +
              np.sum(np.log(1 - patients_no_tox_list), axis=1))
My problem, as stated, is that when I get to around 150-200 patients (in the patients range), my memory is used up and it shuts down.
I have obviously tried to save stuff to disk and load it back (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could go one array at a time into the log_likely function, but in the end, before summing, I would probably have just as large an array anyway; plus, the computation might be a lot slower if I can't use numpy's sum feature and such.
So is there any way I could optimize/improve on this, or is the only way to buy a hell of a lot more RAM?
Each time you use a list comprehension, you create a new copy of the data in memory. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to utilize generator expressions, which process items one at a time. To form a generator, surround your for...in...: expression with parentheses instead of brackets. This might look something like:
import itertools

with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)      # np.log is element-wise; no axis argument needed

no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)

final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. The fastest way to aggregate all the results in this case is to use reduce:
import functools

log_likely = functools.reduce(np.add, final_data)
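If functools.reduce reads awkwardly, an equivalent explicit loop over the same final_data generator (just a sketch of the alternative) keeps only one patient's array in memory at a time:

log_likely = None
for arr in final_data:
    log_likely = arr if log_likely is None else log_likely + arr   # running element-wise sum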
I have a very large (say a few thousand) list of partitions, something like:
[[9,0,0,0,0,0,0,0,0],
[8,1,0,0,0,0,0,0,0],
...,
[1,1,1,1,1,1,1,1,1]]
What I want to do is apply to each of them a function (which outputs a small number of partitions), then put all the outputs in a list and remove duplicates.
I am able to do this, but the problem is that my computer gets very slow if I put the above list directly into the python file (esp. when scrolling). What is making it slow? If it is memory being used to load the whole list,
Is there a way to put the partitions in another file, and have the function just read the list term by term?
EDIT: I am adding some code. My code is probably very inefficient because I'm quite an amateur. So what I really have is a list of lists of partitions, that I want to add to:
listofparts3 = [[[3], [2,1], [1,1,1]],
                [[6], [5,1], ..., [1,1,1,1,1,1]], ...]
def addtolist3(n):
    a = int(n/3) - 2
    counter = 0
    added = []
    for i in range(len(listofparts3[a])):
        first = listofparts3[a][i]
        if len(first) < n:
            for i in range(n - len(first)):
                first.append(0)
        answer = lowering1(fock(first), -2)[0]
        for j in range(len(answer)):
            newelement = True
            for k in range(len(added)):
                if (answer[j] == added[k]).all():
                    newelement = False
                    break
            if newelement == True:
                added.append(answer[j])
        print(counter)
        counter = counter + 1
    for i in range(len(added)):
        added[i] = partition(added[i]).tolist()
    return added
fock, lowering1, and partition are all functions defined in earlier code; they are pretty simple functions. The above function, say addtolist3(24), takes all the partitions of 21 that I have and returns the desired list of partitions of 24, which I can then append to the end of listofparts3.
A few thousand partitions uses only a modest amount of memory, so that likely isn't the source of your problem.
One way to speed up function application is to use map() in Python 3 or itertools.imap() in Python 2.
The fastest way to eliminate duplicates is to feed them into a Python set() object.
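To tie those two suggestions to the question, here is a rough sketch; partitions.txt and expand_partition are placeholder names for your own input file (one partition per line, comma-separated) and for the function that outputs a small number of partitions:

def read_partitions(path):
    with open(path) as f:
        for line in f:                                   # reads the file lazily, one partition at a time
            yield [int(x) for x in line.split(",")]

unique_outputs = set()
for outputs in map(expand_partition, read_partitions("partitions.txt")):
    unique_outputs.update(tuple(p) for p in outputs)     # tuples are hashable, lists are not

results = [list(p) for p in unique_outputs]

This also keeps the huge literal list out of your source file, which is what was making the editor sluggish.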
I have two text files both consisting of approximately 700,000 lines.
The second file consists of responses to the statements in the first file on the corresponding lines.
I need to calculate Fisher's Exact Score for each word pair that appears on matching lines.
For example, if nth lines in the files are
how are you
and
fine thanx
then I need to calculate Fisher's score for (how,fine), (how,thanx), (are,fine), (are,thanx), (you,fine), (you,thanx).
In order to calculate Fisher's Exact Score, I used collections module's Counter to count the number of appearances of each word, and their co-appearances throughout the two files, as in
with open("finalsrc.txt") as f1, open("finaltgt.txt") as f2:
for line1, line2 in itertools.izip(f1, f2):
words1 = list(set(list(find_words(line1))))
words2 = list(set(list(find_words(line2))))
counts1.update(words1)
counts2.update(words2)
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
Then I calculate Fisher's exact score for each pair using the scipy module:
from scipy import stats

def calculateFisher(s, t):
    sa = counts1[s]
    ta = counts2[t]
    st = counts_pair[s, t]
    snt = sa - st
    nst = ta - st
    nsnt = n - sa - ta + st
    oddsratio, pvalue = stats.fisher_exact([[st, snt], [nst, nsnt]])
    return pvalue
This works fast and fine for small text files,
but since my files contain 700,000 lines each, I think the Counter gets too large to retrieve the values quickly, and this becomes very, very slow.
(Assuming 10 words per sentence, counts_pair would have roughly (10^2) * 700,000 = 70,000,000 entries.)
It would take tens of days to finish the computation for all word pairs in the files.
What would be the smart workaround for this?
I would greatly appreciate your help.
How exactly are you calling the calculateFisher function? Your counts_pair will not have 70 million entries: a lot of word pairs will be seen more than once, so 70 million is the sum of their counts, not the number of keys. You should only be calculating the exact test for pairs that do co-occur, and the best place to find those is counts_pair itself. But that means you can just iterate over it; and if you do, you never have to look anything up in counts_pair:
for (s, t), count in counts_pair.iteritems():
    sa = counts1[s]
    ta = counts2[t]
    st = count
    # Continue with Fisher's exact calculation
I've factored out the calculate_fisher function for clarity; I hope you get the idea. So if dictionary look-ups were what's slowing you down, this will save you a whole lot of them. If not, ... do some profiling and let us know what's really going on.
But note that simply looking up keys in a huge dictionary shouldn't slow things down too much. However, "retrieving values quickly" will be difficult if your program has to swap most of its data to disk. Do you have enough memory in your computer to hold the three counters simultaneously? Does the first loop complete in a reasonable amount of time? Find the bottleneck and you'll know more about what needs fixing.
Edit: From your comment it sounds like you are calculating Fisher's exact score over and over during a subsequent step of text processing. Why do that? Break your program up into two steps: first, calculate all word pair scores as I describe, writing each pair and score out to a file as you calculate it. When that's done, use a separate script to read them back in (now the memory contains nothing but this one large dictionary of pairs and their Fisher's exact scores) and work away. You should do that anyway: if it takes you ten days just to get the scores (and you still haven't given us any details on what's slow, and why), get started and in ten days you'll have them forever, to use whenever you wish.
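A minimal sketch of that two-step split, reusing the counters and calculateFisher from the question (the file name is just an example):

# Step 1: compute each co-occurring pair's score once and stream it to disk.
with open("fisher_scores.tsv", "w") as out:
    for (s, t), count in counts_pair.iteritems():
        out.write("%s\t%s\t%g\n" % (s, t, calculateFisher(s, t)))

# Step 2 (a separate script): load the scores back into one dictionary for fast lookups.
scores = {}
with open("fisher_scores.tsv") as f:
    for line in f:
        s, t, p = line.rstrip("\n").split("\t")
        scores[s, t] = float(p)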
I did a quick experiment, and a python process with a list of a million ((word, word), count) tuples takes just 300MB (on OS X, but the data structures should be about the same size on Windows). If you have 10 million distinct word pairs, you can expect it to take about 2.5 GB of RAM. I doubt you'll have even this many word pairs (but check!). So if you've got 4GB of RAM and you're not doing anything wrong that you haven't told us about, you should be all right. Otherwise, YMMV.
I think that your bottleneck is in how you manipulate the data structures other than the counters.
words1 = list(set(list(find_words(line1)))) creates a list from a set from a list from the result of find_words. Each of these operations requires allocating memory to hold all of your objects, and copying. Worse still, if the type returned by find_words does not include a __len__ method, the resulting list will have to grow and be recopied as it iterates.
I'm assuming that all you need is an iterable of unique words in order to update your counters, for which set will be perfectly sufficient.
for line1, line2 in itertools.izip(f1, f2):
    words1 = set(find_words(line1))   # words1 now has the set of unique words from line1
    words2 = set(find_words(line2))   # words2 now has the set of unique words from line2
    counts1.update(words1)            # counts1 increments words from line1 (once per word)
    counts2.update(words2)            # counts2 increments words from line2 (once per word)
    counts_pair.update(itertools.product(words1, words2))
Note that you don't need to deduplicate the output of itertools.product before passing it to counts_pair: there are no repeated elements in words1 or words2, so the Cartesian product will not have any repeated elements either.
Sounds like you need to generate the cross-products lazily - a Counter with 70 million elements will take a lot of RAM and suffer from cache misses on virtually every access.
So how about instead saving a dict that maps a "file 1" word to a list of sets of corresponding "file 2" words?
Initial:
word_to_sets = collections.defaultdict(list)
Replace:
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
with:
for w1 in words1:
    word_to_sets[w1].append(words2)
Then in your Fisher function, replace this:
st = counts_pair[s, t]
with:
st = sum(t in w2set for w2set in word_to_sets.get(s, []))
That's as lazy as I can get - the cross-products are never computed at all ;-)
EDIT: Or map a "file 1" word to its own Counter:
Initial:
word_to_counter = collections.defaultdict(collections.Counter)
Replace:
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
with:
for w1 in words1:
    word_to_counter[w1].update(words2)
In Fisher function:
st = word_to_counter[s][t]
I want to generate a 'bag of words' matrix containing the documents with the corresponding counts for the words in each document. To do this I run the code below to initialise the bag of words matrix. Unfortunately I receive a memory error after x amount of documents on the line where I read the document. Is there a better way of doing this, so that I can avoid the memory error? Please be aware that I would like to process a very large number of documents, ~2,000,000, with only 8 GB of RAM.
def __init__(self, paths, words_count, normalize_matrix = False, trainingset_size = None, validation_set_words_list = None):
    '''
    Open all documents from the given paths.
    Initialize the variables needed in order
    to construct the word matrix.

    Parameters
    ----------
    paths: paths to the documents.
    words_count: number of words in the bag of words.
    trainingset_size: the proportion of the data that should be set to the training set.
    validation_set_words_list: the attributes for validation.
    '''
    print '################ Data Processing Started ################'

    self.max_words_matrix = words_count

    print '________________ Reading Docs From File System ________________'
    timer = time()
    for folder in paths:
        self.class_names.append(folder.split('/')[len(folder.split('/'))-1])
        print '____ dataprocessing for category '+folder
        if trainingset_size == None:
            docs = os.listdir(folder)
        elif not trainingset_size == None and validation_set_words_list == None:
            docs = os.listdir(folder)[:int(len(os.listdir(folder))*trainingset_size-1)]
        else:
            docs = os.listdir(folder)[int(len(os.listdir(folder))*trainingset_size+1):]
        count = 1
        length = len(docs)
        for doc in docs:
            if doc.endswith('.txt'):
                d = open(folder+'/'+doc).read()
                # Append a filtered version of the document to the document list.
                self.docs_list.append(self.__filter__(d))
                # Append the name of the document to the list containing document names.
                self.docs_names.append(doc)
                # Increase the class indices counter.
                self.class_indices.append(len(self.class_names)-1)
            print 'Processed '+str(count)+' of '+str(length)+' in category '+folder
            count += 1
What you're asking for isn't possible. Also, Python doesn't automatically get the space benefits you're expecting from BoW. Plus, I think you're doing the key piece wrong in the first place. Let's take those in reverse order.
Whatever you're doing in this line:
self.docs_list.append(self.__filter__(d))
… is likely wrong.
All you want to store for each document is a count vector. In order to get that count vector, you will need to append to a single dict of all words seen. Unless __filter__ is modifying a hidden dict in-place, and returning a vector, it's not doing the right thing.
The main space savings in the BoW model come from not having to store copies of the string keys for each document, and from being able to store a simple array of ints instead of a fancy hash table. But an integer object is nearly as big as a (short) string object, and there's no way to predict or guarantee when you get new integers or strings vs. additional references to existing ones. So, really, the only advantage you get is 1/hash_fullness; if you want any of the other advantages, you need something like an array.array or numpy.ndarray.
For example:
a = np.zeros(len(self.word_dict), dtype='i2')
for word in split_into_words(d):
    try:
        idx = self.word_dict[word]
    except KeyError:
        idx = len(self.word_dict)
        self.word_dict[word] = idx
        a = np.resize(a, idx+1)   # np.resize returns a new array; it does not resize in place
        a[idx] = 1
    else:
        a[idx] += 1
self.doc_vectors.append(a)
But this still won't be enough. Unless you have on the order of 1K unique words, you can't fit all those counts in memory.
For example, if you have 5000 unique words, you've got 2M arrays, each of which has 5000 2-byte counts, so the most compact possible representation will take 20GB.
Since most documents won't have most words, you will get some benefit by using sparse arrays (or a single 2D sparse array), but there's only so much benefit you can get. And, even if things happened to be ordered in such a way that you get absolutely perfect RLE compression, if the average number of unique words per doc is on the order of 1K, you're still going to run out of memory.
So, you simply can't store all of the document vectors in memory.
If you can process them iteratively instead of all at once, that's the obvious answer.
If not, you'll have to page them in and out to disk (whether explicitly, or by using PyTables or a database or something).
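To make the sparse-array idea mentioned above concrete, here is a hedged sketch using scipy.sparse. It assumes scipy is available and that split_into_words works as in the earlier snippet; documents is an assumed list of raw texts, and word_dict mirrors self.word_dict. Only the non-zero counts are stored, which is where the savings come from:

from scipy import sparse

word_dict = {}                       # word -> column index (plays the role of self.word_dict above)
rows, cols, vals = [], [], []
for doc_idx, d in enumerate(documents):
    counts = {}
    for word in split_into_words(d):
        idx = word_dict.setdefault(word, len(word_dict))
        counts[idx] = counts.get(idx, 0) + 1
    for idx, c in counts.items():
        rows.append(doc_idx)
        cols.append(idx)
        vals.append(c)

# One CSR matrix holds every document; zero counts take no space at all.
bow = sparse.csr_matrix((vals, (rows, cols)),
                        shape=(len(documents), len(word_dict)))

As noted above, if documents average around 1K unique words each, even the sparse form won't fit in 8 GB; this only pushes the limit out rather than removing it.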
I have a file with a column of values I would like to compare against a dictionary whose entries contain two values that together form a range.
for instance:
File A:
Chr1 200 ....
Chr3 300
File B:
Chr1 200 300 ...
Chr2 300 350 ...
For now I created a dictionary of values for File B:
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]
    Ranges[Chr].append(LineB)
For the comparison:
for Line in MethylationFile:
    Line = Line.strip("\n")
    Info = Line.split("\t")
    Chr = Info[0]
    Location = int(Info[1])
    Annotation = ""
    for i, r in enumerate(Ranges[Chr]):
        n = i + 1
        while (n < len(Ranges[Chr])):
            if (int(Ranges[Chr][i][1]) <= Location <= int(Ranges[Chr][i][2])):
                Annotation = '\t'.join(Ranges[Chr][i][4:])
            n += 1
    OutFile.write(Line + '\t' + Annotation + '\n')
If I leave the while loop in, the program does not seem to run (or is probably running too slowly to produce results), since I have over 7,000 values in each dictionary. If I change the while loop to an if statement, the program runs, but at an incredibly slow pace.
I'm looking for a way to make this program faster and more efficient.
Dictionaries are great when you want to look up a key by exact match. In particular, the hash of the lookup key has to be the same as the hash of the stored key.
If your ranges are consistent, you could fake this by writing a hash function that returns the same value for a range, and for every value within that range. But if they're not, this hash function would have to keep track of all of the known ranges, which takes you back to the same problem you're starting with.
In that case, the right data structure here is probably some kind of sorted collection. If you only need to build up the collection, and then use it many times without ever modifying it, just sorting a list and using the bisect module will do it for you. If you need to modify the collection after creation, you'll want something built around a binary tree or B-tree variant of some kind, like blist or bintrees.
This will reduce the time to find a range from N/2 to log2(N). So, if you've got 10000 ranges, instead of 5000 comparisons, you'll do 14.
While we're at it, it would help to convert the range start and stop values to ints once, instead of doing it each time. Also, if you want to use the stdlib bisect, you unfortunately can't pass a key to most functions, so let's reorganize the ranges into comparable order too. So:
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]
    # keep (start, stop, remaining fields) so the tuples compare by position first
    Ranges[Chr].append((int(LineB[1]), int(LineB[2]), LineB[3:]))
for r in Ranges.values():
    r.sort()
Now, instead of this loop:
for i, r in enumerate(Ranges[Chr]):
    # ...
Do this:
import bisect

i = bisect.bisect(Ranges[Chr], (Location, Location, None))
if i:
    r = Ranges[Chr][i-1]
    if r[0] <= Location < r[1]:
        pass  # do whatever you wanted with r
    else:
        pass  # there is no range that includes Location
else:
    pass  # Location is before all ranges
You have to be careful thinking about bisect, and it's possible I've got this wrong on the first attempt, so… read the docs on what it does, and experiment with your data (printing out the results of the bisect function), before trusting this.
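For example, a tiny self-contained experiment along those lines (made-up ranges, just to see where bisect lands):

import bisect

ranges = [(100, 200, 'a'), (250, 300, 'b'), (400, 500, 'c')]   # already sorted by start
for loc in (50, 150, 225, 275, 450):
    i = bisect.bisect(ranges, (loc, loc, None))
    candidate = ranges[i-1] if i else None
    hit = candidate if candidate and candidate[0] <= loc < candidate[1] else None
    print("%d -> %r" % (loc, hit))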
If your ranges can overlap, and you want to be able to find all ranges that contain a value rather than just one, you'll need a bit more than this to keep things efficient. There's no way to fully-order overlapping ranges, so bisect won't cut it.
If you're expecting more than log N matches per average lookup, you can do it with two sorted lists and bisect.
But otherwise, you need a more complex data structure, and more complex code. For example, if you can spare N^2 space, you can keep the time at log N by having, for each range in the first list, a second list, sorted by end, of all the values with a matching start.
And at this point, I think it's getting complex enough that you want to look for a library to do it for you.
However, you might want to consider a different solution.
If you use numpy or a database instead of pure Python, this can't cut the algorithmic complexity from N to log N… but it can cut the constant overhead by a factor of 10 or so, which may be good enough. In fact, if you're doing tons of searches on a medium-small list, it may even be better.
Plus, it looks a lot simpler, and once you get used to array operations or SQL, it may even be more readable. So:
RangeArrays = [np.array([a[:2] for a in value]) for value in Ranges]
… or, if Ranges is a dict mapping strings to values, instead of a list:
RangeArrays = {key: np.array([a[:2] for a in value]) for key, value in Ranges.items()}
Then, instead of this:
for i, r in enumerate(Ranges[Chr]):
    # ...
Do:
comparisons = Location < RangeArrays[Chr]
matches = comparisons[:, 0] < comparisons[:, 1]   # True exactly where start <= Location < stop
indices = matches.nonzero()[0]
for index in indices:
    r = Ranges[Chr][index]
    # Do stuff with r
(You can of course make things more concise, but it's worth doing it this way and printing out all of the intermediate steps to see why it works.)
Or, using a database:
cur = db.execute('''SELECT Start, Stop, Chr FROM Ranges
                    WHERE Start <= ? AND Stop > ?''', (Location, Location))
for (Start, Stop, Chr) in cur:
    pass  # do stuff
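For completeness, a sketch of setting up that table with sqlite3 (the database, table, and index names are just examples); an index on Start and Stop helps the range query, and the rows come straight from FileB as parsed earlier (re-opened if it has already been read through):

import sqlite3

db = sqlite3.connect('ranges.db')
db.execute('CREATE TABLE IF NOT EXISTS Ranges (Chr TEXT, Start INTEGER, Stop INTEGER)')
db.execute('CREATE INDEX IF NOT EXISTS idx_start_stop ON Ranges (Start, Stop)')
with db:   # commits all the inserts in one transaction
    db.executemany('INSERT INTO Ranges (Chr, Start, Stop) VALUES (?, ?, ?)',
                   ((fields[0], int(fields[1]), int(fields[2]))
                    for fields in (line.rstrip('\n').split('\t') for line in FileB)))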