Speed up creating a graph from 2.92M data points - python

I have 2.92M data points in a 3.0GB CSV file and I need to loop through it twice to create a graph which I want to load into NetworkX. At the current rate it will take me days to generate this graph. How can I speed this up?
similarity = 8
graph = {}
topic_pages = {}

CSV.foreach("topic_page_node_and_edge.csv") do |row|
  topic_pages[row[0]] = row[1..-1]
end

CSV.open("generate_graph.csv", "wb") do |csv|
  i = 0
  topic_pages.each do |row|
    i += 1
    row = row.flatten
    topic_pages_attributes = row[1..-1]
    graph[row[0]] = []
    topic_pages.to_a[i..-1].each do |row2|
      row2 = row2.flatten
      topic_pages_attributes2 = row2[1..-1]
      num_matching_attributes = (topic_pages_attributes2 & topic_pages_attributes).count
      if num_matching_attributes >= similarity or num_matching_attributes == topic_pages_attributes2.count or num_matching_attributes == topic_pages_attributes.count
        graph[row[0]].push(row2[0])
      end
    end
    csv << [row[0], graph[row[0]]].flatten
  end
end

A) Benchmark, for example with cProfile, which comes with Python. It's easy to have costly inefficiencies in your code, and in intensive applications they can easily come at a 10x performance cost.
Pretty code such as
(topic_pages_attributes2 & topic_pages_attributes).count
may turn out to be a major factor in your runtime, and it can often be reduced by rewriting it as more traditional code.
B) Use a more efficient language. For example, on benchmarksgame.alioth, across a number of intensive problems, the fastest Python 3 program is a median of 63x slower than the fastest C program (Ruby is at 67x, JRuby at 33x). Yes, the performance gap can be big even with well-optimized Python code; if you haven't optimized your code it may be bigger still, and you may be able to get a 100x-1000x speedup by using a more efficient language and carefully optimizing your code.
C) Consider more clever formulations of your problem. For example, instead of iterating over each node, iterate over each edge once. In your case, that would probably mean building an inverted index, topic -> pages. This is very similar to the way text search engines work, and a popular way to compute such operations on clusters: the individual topics can be split across separate nodes. This approach benefits from the sparsity of your data, and it can cut the runtime of your algorithm drastically.
You have about 3 million documents. Judging from your total data size, they probably have fewer than 100 topics each on average? Your pairwise comparison approach needs (3 million)^2 comparisons; that is what hurts you. If the more popular topics are used on only 30,000 documents each, you may get away with computing only 30k^2 * number of topics. Assuming you have 100 such very popular topics (rare topics don't matter much), this would be a 100x speedup.
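A minimal sketch of the inverted-index idea in Python, with toy data (the variable names and the tiny pages dict are my own illustration, not the asker's CSV):

from collections import defaultdict
from itertools import combinations

# toy stand-in for the real data: page_id -> list of topics (read from the CSV in practice)
pages = {
    "p1": ["ruby", "graphs", "csv"],
    "p2": ["ruby", "csv"],
    "p3": ["networkx", "graphs"],
}

# inverted index: topic -> pages that mention it
topic_to_pages = defaultdict(list)
for page_id, topics in pages.items():
    for topic in topics:
        topic_to_pages[topic].append(page_id)

# count shared topics only for pairs that actually co-occur under some topic
shared = defaultdict(int)
for topic, page_ids in topic_to_pages.items():
    for a, b in combinations(sorted(page_ids), 2):
        shared[(a, b)] += 1

# pairs sharing at least `similarity` topics become edges
similarity = 2
edges = [pair for pair, count in shared.items() if count >= similarity]
print(edges)  # [('p1', 'p2')]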
D) Simplify your problem. For example, first merge all documents that have exactly the same topics by sorting. To make this more effective, also eliminate all topics that occur exactly once. Probably there are only some 10,000-100,000 distinct topic sets. This step can be solved easily with sorting, and it will make your problem some 900-90,000 times smaller (assuming the value range above).
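A sketch of that merge step with hypothetical variable names (grouping by the exact topic set; in practice page_id -> topics comes from the CSV):

from collections import defaultdict

pages = {
    "p1": ["ruby", "csv"],
    "p2": ["ruby", "csv"],   # same topic set as p1, so it can be merged with it
    "p3": ["graphs"],
}

groups = defaultdict(list)   # frozen topic set -> all pages carrying exactly that set
for page_id, topics in pages.items():
    groups[frozenset(topics)].append(page_id)

# keep one representative per distinct topic set; members of a group form a clique anyway
representatives = {members[0]: topic_set for topic_set, members in groups.items()}
print(len(pages), "pages ->", len(groups), "distinct topic sets")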
Some of these numbers may be too optimistic - for example, I/O was not taken into account at all, and if your problem is I/O bound, using C/Java may not help much. There may be some highly popular topics that hurt the approach discussed in C). For D) you need O(n log n) time to sort your data, but there are very good implementations available. It is definitely a simplification that you should do; these duplicate documents will also form fully connected cliques in your final graph, which likely hurts other analyses as well.

I believe most of the time is spent loading the data from disk. Parallelize reading the data across multiple threads/processes, then create the graph.
You could also create subset graphs on different machines and combine them later.

Related

Get combinations in C rather than in Python itertools.combinations

Is switching to C an adequate approach to get the combinations faster than in Python, or am I on the wrong track? I only "speak" Python and hope for some guidance on deciding the next programming steps for my self-chosen learning project. I am working on a data science project, and based on your answers I will either recommend inviting a computer scientist to the project or drop my approach.
I have a list of 69 strings where I need all possible combinations of 8 elements.
At the moment I can do this in python with itertools.combinations()
for i in itertools.combinations(DictAthleteObjects.keys(), 8):
    # do stuff here on instances of classes
In Python, itertools.combinations works perfectly fine for a few combinations, but due to the huge number of combinations it is not time efficient, and it sometimes crashes (I think because of too little memory) when I don't break the loop after a few iterations. Generally, the time complexity is very large.
According to this StackOverflow discussion it could be a valid approach to generate the combinations in C, and also to move the Python code that works on them to C, because it's much faster.
On the other hand, I have received a comment that itertools.combinations uses C itself, but I cannot find any sources on that.
The comments you've received so far basically answer your question (profiling, minor gains for lots of C hassle, code redesign) but in the past handful of days I went through a similar dilemma for a home project and wanted to throw in my thoughts.
On the profiling front I just used Python's time module and a global start time variable to get basic benchmarks as my program ran. For highly complex scenarios I recommend using one of the Python profilers mentioned in the comments instead.
import time

start_time = time.process_time()
# ... the work being timed ...
print(f'runtime(sec): {time.process_time() - start_time}')
This allowed me to get the 10,000ft view of how long my code was taking to do various things, then I found a workable size of input data that didn't take too long to run but was a good representation of the larger dataset and tried to make incremental improvements.
After a lot of messing around I figured out what the most expensive things needed to be done were, and at the top of the list was generating those unique combinations. So what I ended up doing was splitting things up into a pipeline of sorts that allowed me to cut down the total runtime to the amount of time it takes to perform the most expensive work.
In the case of itertools.combinations it's not actually generating the real output values so it runs insanely quickly, but when it comes time to do the for loop and actually produce those values things get interesting. On my machine it took about 3ms to return the generator that would produce ~31.2B combinations if I looped over that.
# Code to check how long itertools.combinations() takes to hand back its generator
import itertools
import math
import time

data = list(range(250000))

for num_items in range(2, 9):
    start_time = time.process_time()
    g = itertools.combinations(data, num_items)
    ncombos = math.comb(250000, num_items)  # combinations the generator would yield if consumed
    print(f'combo_sz:{num_items} num_combos:{ncombos} elapsed(sec):{time.process_time() - start_time}')
In my case I couldn't find a way to nicely split up a generator into parts so I decided to use the multiprocessing module (Process, Queue, Lock) to pass off data as it came in (this saves big on memory as well). In summary it was really helpful to look at things from the subtask perspective because each subtask may require something different.
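To make that concrete, here is a minimal sketch of the idea (my own illustration, not the original project code): one producer process generates combinations lazily and pushes them to a bounded Queue in chunks, and several worker processes consume them.

import itertools
import multiprocessing as mp

def producer(queue, data, r, chunk_size, n_workers):
    """Generate combinations lazily and push them to the queue in chunks."""
    chunk = []
    for combo in itertools.combinations(data, r):
        chunk.append(combo)
        if len(chunk) == chunk_size:
            queue.put(chunk)
            chunk = []
    if chunk:
        queue.put(chunk)
    for _ in range(n_workers):
        queue.put(None)            # one sentinel per worker

def worker(queue, results):
    total = 0
    while True:
        chunk = queue.get()
        if chunk is None:
            break
        total += len(chunk)        # "do stuff" with each combination here
    results.put(total)

if __name__ == "__main__":
    n_workers = 4
    queue = mp.Queue(maxsize=16)   # bounded queue keeps memory usage flat
    results = mp.Queue()
    workers = [mp.Process(target=worker, args=(queue, results)) for _ in range(n_workers)]
    for w in workers:
        w.start()
    producer(queue, range(100), 3, chunk_size=1000, n_workers=n_workers)
    for w in workers:
        w.join()
    print(sum(results.get() for _ in range(n_workers)))  # 161700 == C(100, 3)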
Also don't be like me and skim the documentation too quickly haha, many problems can be resolved by reading that stuff. I hope you found something useful out of this reply, good luck!

How to implement a fast fuzzy-search engine using BK-trees when the corpus has 10 billion unique DNA sequences?

I am trying to use the BK-tree data structure in python to store a corpus with ~10 billion entries (1e10) in order to implement a fast fuzzy search engine.
Once I add over ~10 million (1e7) values to a single BK-tree, I start to see a significant degradation in the performance of querying.
I was thinking to store the corpus into a forest of a thousand BK-trees and to query them in parallel.
Does this idea sound feasible? Should I create and query 1,000 BK-trees simultaneously? What else can I do in order to use BK-trees for this corpus?
I use pybktree.py and my queries are intended to find all entries within an edit distance d.
Is there some architecture or database which will allow me to store those trees?
Note: I don’t run out of memory, rather the tree begins to be inefficient (presumably each node has too many children).
FuzzyWuzzy
Since you mention that you use FuzzyWuzzy as your distance metric, I will concentrate on efficient ways to implement the fuzz.ratio algorithm used by FuzzyWuzzy. FuzzyWuzzy provides the following two implementations of fuzz.ratio:
difflib, which is completely implemented in Python
python-Levenshtein, which uses a weighted Levenshtein distance with weight 2 for substitutions (a substitution counts as a deletion + an insertion). python-Levenshtein is implemented in C and is a lot faster than the pure Python implementation.
Implementation in python-Levenshtein
python-Levenshtein works as follows:
It removes the common prefix and suffix of the two strings, since they do not influence the end result. This can be done in linear time, so matching similar strings is very fast.
The Levenshtein distance between the trimmed strings is then computed with quadratic runtime and linear memory usage.
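For illustration, a pure-Python sketch of that affix trimming (my own code, not the library's):

def strip_common_affix(s1, s2):
    # strip the common prefix
    prefix = 0
    while prefix < min(len(s1), len(s2)) and s1[prefix] == s2[prefix]:
        prefix += 1
    s1, s2 = s1[prefix:], s2[prefix:]
    # strip the common suffix
    suffix = 0
    while suffix < min(len(s1), len(s2)) and s1[-1 - suffix] == s2[-1 - suffix]:
        suffix += 1
    return s1[:len(s1) - suffix], s2[:len(s2) - suffix]

print(strip_common_affix("ACGTTTAC", "ACGAATAC"))  # ('TT', 'AA')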
RapidFuzz
I am the author of the library RapidFuzz which implements the algorithms used by FuzzyWuzzy in a more performant way. RapidFuzz uses the following interface for fuzz.ratio:
def ratio(s1, s2, processor = None, score_cutoff = 0)
The additional score_cutoff parameter can be used to provide a score threshold as a float between 0 and 100. For ratio < score_cutoff, 0 is returned instead. This allows the implementation to select a more optimized algorithm in some cases. In the following, I describe the optimizations used by RapidFuzz depending on the input parameters; max distance refers to the maximum distance that is possible without the ratio dropping below the score threshold.
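A quick usage sketch (the strings are illustrative only):

from rapidfuzz import fuzz

print(fuzz.ratio("ACGTTTAC", "ACGAATAC"))                   # similarity as a float in [0, 100]
print(fuzz.ratio("ACGTTTAC", "ACGAATAC", score_cutoff=90))  # 0.0, because the real score is below 90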
max distance == 0
The similarity can be calculated using a direct comparison,
since no difference between the strings is allowed. The time complexity of
this algorithm is O(N).
max distance == 1 and len(s1) == len(s2)
The similarity can be calculated using direct comparisons as well, since a substitution would cause an edit distance higher than max distance. The time complexity of this algorithm is O(N).
Remove common prefix
A common prefix/suffix of the two compared strings does not affect
the Levenshtein distance, so the affix is removed before calculating the similarity. This step is performed for any of the following algorithms.
max distance <= 4
The mbleven algorithm is used. This algorithm
checks all edit operations that are possible under the threshold max distance. A description of the original algorithm can be found here. I changed this algorithm to support the weight of 2 for substitutions. Unlike the normal Levenshtein distance, this algorithm can be used up to a threshold of 4 here, since the higher weight of substitutions decreases the number of possible edit operations. The time complexity of this algorithm is O(N).
len(shorter string) <= 64 after removing common affix
The BitPAl algorithm is used, which calculates the Levenshtein distance in
parallel. The algorithm is described here and is extended with support
for UTF32 in this implementation. The time complexity of this algorithm is O(N).
Strings with a length > 64
The Levenshtein distance is calculated using
Wagner-Fischer with Ukkonen's optimization. The time complexity of this algorithm is O(N * M).
This could be replaced with a blockwise implementation of BitPal in the future.
Improvements to processors
FuzzyWuzzy provides multiple processors like process.extractOne that are used to calculate the similarity between a query and multiple choices. Implementing this in C++ as well allows two more important optimizations:
When a scorer that is itself implemented in C++ is used, we can call the C++ implementation of the scorer directly and do not have to go back and forth between Python and C++, which provides a massive speedup.
We can preprocess the query depending on the scorer that is used. For example, when fuzz.ratio is used as the scorer, the query only has to be stored in the 64-bit blocks used by BitPAl once, which saves around 50% of the runtime when calculating the Levenshtein distance.
So far only extractOne and extract_iter are implemented in C++, while extract, which you would use, is still implemented in Python and uses extract_iter. So it can already benefit from the second optimization, but it still has to switch a lot between Python and C++, which is not optimal (this will probably be addressed in v1.0.0 as well).
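For reference, a small usage sketch of the processor interface (the data is illustrative):

from rapidfuzz import process, fuzz

choices = ["ACGTTTAC", "ACGAATAC", "TTTTTTTT"]
# best match above the cutoff, or None if nothing reaches it
best = process.extractOne("ACGTTTAA", choices, scorer=fuzz.ratio, score_cutoff=80)
print(best)  # ('ACGTTTAC', 87.5, 0) -> (choice, score, index)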
Benchmarks
I performed benchmarks for extractOne and the individual scorers during development that show the performance difference between RapidFuzz and FuzzyWuzzy. Keep in mind that the performance for your case (all strings of length 20) is probably not as good, since many of the strings in the dataset used are very short.
The source of the data (for reproducibility):
words.txt (a dataset with 99,171 words)
The hardware the benchmarks were run on:
CPU: single core of a i7-8550U
RAM: 8 GB
OS: Fedora 32
Benchmark Scorers
The code for this benchmark can be found here
Benchmark extractOne
For this benchmark the code of process.extractOne is slightly changed to remove the score_cutoff parameter. This is done because in extractOne the score_cutoff is increased whenever a better match is found (and it exits early once it finds a perfect match). In the future it would make more sense to benchmark process.extract, which does not have this behavior (the benchmark is performed using process.extractOne, since process.extract is not fully implemented in C++ yet). The benchmark code can be found here.
This shows that, when possible, the scorers should not be used directly but through the processors, which can perform many more optimizations.
Alternative
As an alternative, you could use the C++ implementation. The RapidFuzz library is available for C++ here, and the implementation in C++ is relatively simple as well.
// function to load words into vector
std::vector<std::string> choices = load("words.txt");
std::string query = choices[0];
std::vector<double> results;
results.reserve(choices.size());
rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);
for (const auto& choice : choices)
{
    results.push_back(scorer.ratio(choice));
}
or in parallel using OpenMP (with an indexed loop and a pre-sized results vector, so that each iteration writes to its own slot instead of racing on push_back):
// function to load words into vector
std::vector<std::string> choices = load("words.txt");
std::string query = choices[0];
std::vector<double> results(choices.size());
rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);

#pragma omp parallel for
for (std::size_t i = 0; i < choices.size(); ++i)
{
    results[i] = scorer.ratio(choices[i]);
}
On my machine (see the benchmark above) this evaluates 43 million words/sec, and 123 million words/sec in the parallel version. This is around 1.5 times as fast as the Python implementation (due to conversions between Python and C++ types). However, the main advantage of the C++ version is that you are relatively free to combine algorithms whichever way you want, while in the Python version you are forced to use the process functions implemented in C++ to achieve good performance.
Few thoughts
BK-trees
Kudos to Ben Hoyt and his link to the issue, which I will draw from. That said, the first observation from the mentioned issue is that the BK-tree isn't exactly logarithmic. From what you told us, your usual d is ~6, which is 3/10 of your string length. Unfortunately, that means that if we look at the tables from the issue, you will get a complexity somewhere between O(N^0.8) and O(N). In the optimistic case of the exponent being 0.8 (it will likely be slightly worse), you get an improvement factor of ~100 on your 10B entries. So if you have a reasonably fast implementation of BK-trees, it can still be worth it to use them, or to use them as a basis for further optimization.
The downside is that even if you use 1,000 trees in parallel, you will only get the improvement from the parallelization, since the performance of the trees depends on d rather than on the number of nodes within each tree. However, even if you run all 1,000 trees at once on a massive machine, we are at the ~10M nodes/tree that you reported as slow. Still, computation-wise, this seems doable.
A brute force approach
If you don't mind paying a little, I would look into something like Google Cloud BigQuery, if that doesn't clash with some kind of data confidentiality. They will brute-force the solution for you - for a fee. The current rate is $5/TB scanned per query. Your dataset is ~10B rows * 20 chars; taking one byte per char, one query would scan 200GB, so ~$1 per query if you went the lazy way.
However, since the charge is per byte of data in a column and not per complexity of the question, you could improve on this by storing your strings as bits - 2 bits per letter - which would save you 75% of the expenses.
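For instance, a small Python sketch of that 2-bits-per-base packing (my own illustration):

ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {v: k for k, v in ENCODE.items()}

def pack(seq):
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value.to_bytes((2 * len(seq) + 7) // 8, "big")

def unpack(data, length):
    value = int.from_bytes(data, "big")
    bases = [DECODE[(value >> (2 * i)) & 0b11] for i in range(length)]
    return "".join(reversed(bases))

packed = pack("ACGTTTACGGATCAGGCATT")  # 20 bases -> 5 bytes instead of 20
print(len(packed), unpack(packed, 20))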
Improving further, you can write your query in such a way that it asks for a dozen strings at once. You might need to be a bit careful to batch similar strings together, though, to avoid clogging the result with too many one-offs.
Brute forcing of the BK-trees
If you go with the route above, you will have to pay in proportion to the volume scanned, so the ~100-fold decrease in the computations needed becomes a ~100-fold decrease in price, which might be useful, especially if you have a lot of queries to run.
However, you would need to figure out a way to store this tree across several layers of tables to query recursively, since BigQuery pricing depends on the volume of data in the queried table.
Building a smart batch engine for recursive processing of the queries to minimize the costs could be a fun optimization exercise.
A choice of language
One more thing. While I think that Python is a good language for fast prototyping, analysis and thinking about code in general, you are past that stage. You are currently looking for a way to run a specific, well-defined and well-thought-out operation as fast as possible. Python is not a great language for this, as this example shows. While I used all the tricks I could think of in Python, the Java and C solutions were still several times faster. (Not to mention the Rust one that beat us all - but it beat us on the algorithm as well, so it's hard to compare.) So if you go from Python to a faster language, you might gain another factor of ten, or maybe even more, in performance. This could be another fun optimization exercise.
Note: I am being rather conservative with the estimate, since fuzzywuzzy already offers to use a C library in the background, so I'm not too sure how much of the work still depends on Python. My experience in similar cases is that the performance gain from pure Python (or worse, pure R) to a compiled language can be a factor of 100.
Quite late to the party, but here is a possible solution which
I would implement if I were in your situation:
Save the dataset as a text file, and put that file in a very fast disk region (preferably on tmpfs).
Prepare a beefy computer with many physical CPU cores (such as a Threadripper 3990X, which has 64 cores).
Use this implementation and GNU parallel to grok the dataset.
Here is a bit of technical info behind this solution:
The optimized version of Myers' algorithm (linked above) can process about 14 million entries per second on a single CPU core.
If you can fully utilize all 64 physical cores, you can achieve a throughput of 896 million entries per second (= 14M * 64 cores).
At this speed, you can run a single query against a dataset of 10 billion entries in about 12 seconds on a single machine.
I posted a more detailed analysis in this article. As shown in the article, I could run a query against a dataset of 100 million records in 1.04 s on my cheap desktop machine.
By using a more performant CPU (or splitting the task between multiple computers), I believe you can achieve the desired result.
Hope this helps.

all (2^m−2)/2 possible ways to partition list

Each sample is an array of features (ints). I need to split my samples into two separate groups by figuring out what the best feature, and the best splitting value for that feature, is. By "best", I mean the split that gives me the greatest entropy difference between the pre-split set and the weighted average of the entropies of the left and right sides. I need to try all (2^m−2)/2 possible ways to partition these items into two nonempty lists, where m is the number of distinct values (all samples with the same value for that feature move together as a group).
The following is extremely slow so I need a more reasonable/ faster way of doing this.
sorted_by_feature is a list of (feature_value, 0_or_1) tuples.
same_vals = {}
for ele in sorted_by_feature:
    if ele[0] not in same_vals:
        same_vals[ele[0]] = [ele]
    else:
        same_vals[ele[0]].append(ele)

l = same_vals.keys()
orderings = list(itertools.permutations(l))
for ordering in orderings:
    list_tups = []
    for dic_key in ordering:
        list_tups += same_vals[dic_key]
    left_1 = 0
    left_0 = 0
    right_1 = num_one
    right_0 = num_zero
    for index, tup in enumerate(list_tups):
        # 0's or 1's on the left +/- 1
        # calculate entropy on left/right, calculate entropy drop, etc.
Trivial details (continuing the code above):
if index == len(sorted_by_feature) - 1:
    break
if tup[1] == 1:
    left_1 += 1
    right_1 -= 1
if tup[1] == 0:
    left_0 += 1
    right_0 -= 1
# only calculate entropy if values to left and right of split are different
if list_tups[index][0] != list_tups[index + 1][0]:
tl;dr
You're asking for a miracle. No programming language can help you out of this one. Use better approaches than what you're considering doing!
Your Solution has Exponential Time Complexity
Let's assume a perfect algorithm: one that can give you a new partition in constant O(1) time. In other words, no matter what the input, a new partition can be generated in a guaranteed constant amount of time.
Let's in fact go one step further and assume that your algorithm is only CPU-bound and is operating under ideal conditions. Under ideal circumstances, a high-end CPU can process upwards of 100 billion instructions per second. Since this algorithm takes O(1) time, we'll say, oh, that every new partition is generated in a hundred billionth of a second. So far so good?
Now you want this to perform well. You say you want it to handle an input of size m. You know that this means you need about pow(2,m) iterations of your algorithm - that's the number of partitions you need to generate - and since generating each partition takes a finite amount of time O(1), the total time is just pow(2,m) times O(1). Let's take a quick look at the numbers here:
m = 20 means your time taken is pow(2,20)*10^-11 seconds = 0.00001 seconds. Not bad.
m = 40 means your time taken is pow(2,40) * 10^-11 seconds ≈ 1 trillion / 100 billion = about 10 seconds. Also not bad, but note how small m = 40 is. In the vast panopticon of numbers, 40 is nothing. And remember we're assuming ideal conditions.
m = 100 means pow(2,100) * 10^-11 seconds ≈ 10^19 seconds - roughly 400 billion years! What happened?
You're a victim of algorithmic theory. Simply put, a solution that has exponential time complexity - any solution that takes 2^m time to complete - cannot be sped up by better programming. Generating pow(2,m) outputs is always going to take time proportional to pow(2,m).
Note further that 100 billion instructions/sec is an ideal for high-end desktop computers - your CPU also has to worry about processes other than this program you're running, in which case kernel interrupts and context switches eat into processing time (especially when you're running a few thousand system processes, which you no doubt are). Your CPU also has to read and write from disk, which is I/O bound and takes a lot longer than you think. Interpreted languages like Python also eat into processing time since each line is dynamically converted to bytecode, forcing additional resources to be devoted to that. You can benchmark your code right now and I can pretty much guarantee your numbers will be way higher than the simplistic calculations I provide above. Even worse: storing 2^40 permutations requires 1000 GBs of memory. Do you have that much to spare? :)
Switching to a lower-level language, using generators, etc. is all a pointless affair: they're not the main bottleneck, which is simply the large and unreasonable time complexity of your brute force approach of generating all partitions.
What You Can Do Instead
Use a better algorithm. Generating pow(2,m) partitions and investigating all of them is an unrealistic ambition. You want, instead, to consider a dynamic programming approach. Instead of walking through the entire space of possible partitions, you want to only consider walking through a reduced space of optimal solutions only. That is what dynamic programming does for you. An example of it at work in a problem similar to this one: unique integer partitioning.
Dynamic programming approaches work best on problems that can be formulated as linearized directed acyclic graphs (Google it if you're not sure what I mean!).
If a dynamic approach is out, consider investing in parallel processing with a GPU instead. Your computer already has a GPU - it's what your system uses to render graphics - and GPUs are built to be able to perform large numbers of calculations in parallel. A parallel calculation is one in which different workers can do different parts of the same calculation at the same time - the net result can then be joined back together at the end. If you can figure out a way to break this into a series of parallel calculations - and I think there is good reason to suggest you can - there are good tools for GPU interfacing in Python.
Other Tips
Be very explicit on what you mean by best. If you can provide more information on what best means, we folks on Stack Overflow might be of more assistance and write such an algorithm for you.
Using a bare-metal compiled language might help reduce the amount of real time your solution takes in ordinary situations, but the difference in this case is going to be marginal. Compiled languages are useful when you have to do operations like searching through an array efficiently, since there's no instruction-compilation overhead at each iteration. They're not all that more useful when it comes to generating new partitions, because that's not something that removing the dynamic bytecode generation barrier actually affects.
A couple of minor improvements I can see:
Use try/except instead of if not in to avoid a double lookup of keys
if ele[0] not in same_vals:
    same_vals[ele[0]] = [ele]
else:
    same_vals[ele[0]].append(ele)

# Should be changed to

try:
    same_vals[ele[0]].append(ele)  # most of the time this will work
except KeyError:
    same_vals[ele[0]] = [ele]
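An alternative I would suggest (not part of the original answer) is collections.defaultdict, which removes the lookup-or-create branch entirely:

from collections import defaultdict

sorted_by_feature = [(3, 0), (3, 1), (7, 0)]  # toy (feature_value, 0_or_1) tuples

same_vals = defaultdict(list)
for ele in sorted_by_feature:
    same_vals[ele[0]].append(ele)  # missing keys are created as empty lists automatically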
Don't explicitly convert a generator to a list if you don't have to. I don't immediately see any need for your casting to a list, which would slow things down.
orderings = list(itertools.permutations(l))
for ordering in orderings:
# Should be changed to
for ordering in itertools.permutations(l):
But, like I said, these are only minor improvements. If you really needed this to be faster, consider using a different language.

Running parallel iterations

I am trying to run a sort of simulation where there are fixed parameters I need to iterate over in order to find the combination that has the least cost. I am using Python multiprocessing for this purpose, but the time consumed is too high. Is there something wrong with my implementation? Or is there a better solution? Thanks in advance.
import multiprocessing

class Iters(object):
    # parameters for iterations
    iters = {}
    iters['cwm'] = {'min': 100, 'max': 130, 'step': 5}
    iters['fx'] = {'min': 1.45, 'max': 1.45, 'step': 0.01}
    iters['lvt'] = {'min': 106, 'max': 110, 'step': 1}
    iters['lvw'] = {'min': 9.2, 'max': 10, 'step': 0.1}
    iters['lvk'] = {'min': 3.3, 'max': 4.3, 'step': 0.1}
    iters['hvw'] = {'min': 1, 'max': 2, 'step': 0.1}
    iters['lvh'] = {'min': 6, 'max': 7, 'step': 1}

    def run_mp(self):
        mps = []
        m = multiprocessing.Manager()
        q = m.list()
        cmain = self.iters['cwm']['min']
        while cmain <= self.iters['cwm']['max']:
            t2 = multiprocessing.Process(target=mp_main, args=(cmain, self.iters, q))
            mps.append(t2)
            t2.start()
            cmain = cmain + self.iters['cwm']['step']
        for mp in mps:
            mp.join()
        r1 = sorted(q, key=lambda x: x['costing'])
        returning = r1[:20]
        self.counter = len(q)
        return returning

def mp_main(cmain, iters, q):
    fmain = iters['fx']['min']
    while fmain <= iters['fx']['max']:
        lvtmain = iters['lvt']['min']
        while lvtmain <= iters['lvt']['max']:
            lvwmain = iters['lvw']['min']
            while lvwmain <= iters['lvw']['max']:
                lvkmain = iters['lvk']['min']
                while lvkmain <= iters['lvk']['max']:
                    hvwmain = iters['hvw']['min']
                    while hvwmain <= iters['hvw']['max']:
                        lvhmain = iters['lvh']['min']
                        while lvhmain <= iters['lvh']['max']:
                            test = {'cmain': cmain, 'fmain': fmain, 'lvtmain': lvtmain,
                                    'lvwmain': lvwmain, 'lvkmain': lvkmain,
                                    'hvwmain': hvwmain, 'lvhmain': lvhmain}
                            y = calculations(test, q)
                            lvhmain = lvhmain + iters['lvh']['step']
                        hvwmain = hvwmain + iters['hvw']['step']
                    lvkmain = lvkmain + iters['lvk']['step']
                lvwmain = lvwmain + iters['lvw']['step']
            lvtmain = lvtmain + iters['lvt']['step']
        fmain = fmain + iters['fx']['step']

def calculations(test, que):
    # perform huge number of calculations here
    output = {}
    output['data'] = test
    output['costing'] = 'foo'
    que.append(output)

if __name__ == '__main__':
    x = Iters()
    x.run_mp()
From a theoretical standpoint:
You're iterating over every possible combination of 6 different variables. Unless your search space is very small, or you only want a very rough solution, there's no way you'll get meaningful results within a reasonable time.
i need to iterate on and find out the combinations which has the least cost
This very much sounds like an optimization problem.
There are many different efficient ways of dealing with these problems, depending on the properties of the function you're trying to optimize. If it has a straightforward "shape" (for example, if it is unimodal), you can use a greedy algorithm such as hill climbing or gradient descent. If it's more complex, you can try shotgun hill climbing.
There are a lot more complex algorithms, but these are the basics, and they will probably help you a lot in this situation.
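A toy sketch of hill climbing over a discrete parameter grid (the cost function and parameter names are hypothetical, not the asker's calculations):

import random

grid = {"cwm": list(range(100, 131, 5)), "lvt": list(range(106, 111))}

def cost(params):
    # hypothetical cost with its optimum at cwm=115, lvt=108
    return (params["cwm"] - 115) ** 2 + (params["lvt"] - 108) ** 2

def neighbours(params):
    """All points differing from `params` in exactly one parameter by one grid step."""
    for name, values in grid.items():
        i = values.index(params[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**params, name: values[j]}

current = {name: random.choice(values) for name, values in grid.items()}
while True:
    best = min(neighbours(current), key=cost)
    if cost(best) >= cost(current):
        break  # no neighbour improves: local optimum reached
    current = best
print(current, cost(current))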
From a more practical programming standpoint:
You are using very large steps - so large, in fact, that you'll only probe the function about 19,200 times. If this is what you want, it seems very feasible. In fact, if I comment out the y=calculations(test,q) line, it returns instantly on my computer.
As you indicate, there's a "huge number of calculations" there - so maybe that is your real problem, and not the code you're asking for help with.
As to multiprocessing, my honest advice is not to use it until your code already executes reasonably fast. Unless you're running a supercomputing cluster (you're not programming a supercomputing cluster in Python, are you??), parallel processing will get you speedups of 2-4x. That's absolutely negligible compared to the gains you can get from the kind of algorithmic changes I mentioned.
As an aside, I don't think I've ever seen that many nested loops in my life (excluding code jokes). If you don't want to switch to another algorithm, you might want to consider using itertools.product together with numpy.arange.
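Something along these lines could replace the nested while loops (the ranges are copied from the question; note that float steps make inclusive upper bounds slightly fragile, so numpy.linspace may be safer):

import itertools
import numpy as np

ranges = {
    'cwm': np.arange(100, 131, 5),
    'fx':  np.arange(1.45, 1.46, 0.01),
    'lvt': np.arange(106, 111, 1),
    'lvw': np.arange(9.2, 10.05, 0.1),
    'lvk': np.arange(3.3, 4.35, 0.1),
    'hvw': np.arange(1, 2.05, 0.1),
    'lvh': np.arange(6, 8, 1),
}

for combo in itertools.product(*ranges.values()):
    test = dict(zip(ranges.keys(), combo))
    # calculations(test, q) would go here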

Minimising reading from and writing to disk in Python for a memory-heavy operation

Background
I am working on a fairly computationally intensive project in computational linguistics, but the problem I have is quite general, so I expect a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and build a data-analysis pipeline. The big problem is operation 3 (and, by extension, 4).
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have its basis weights updated.
If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
Attempted Solutions
There are three extrema when it comes to how to do this:
I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on disk, using shelve, pickle or some such library.
I could build the vectors in memory one at a time and writing it to disk, passing through the corpus once per vector.
All these options are fairly intractable. Option 1 just uses up all the system memory, and it panics and slows to a crawl. Option 2 is way too slow, as I/O operations aren't fast. Option 3 is possibly even slower than 2, for the same reasons.
Goals
A good solution would involve:
Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from, or written to, disk?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?
Take a look at PyTables. One of its advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.
Edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seek times. The size of your project is perfect for today's affordable SSD drives.
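A minimal sketch of that idea (the file name and shapes are made up for illustration; the real matrix would be far larger):

import numpy as np
import tables

h5 = tables.open_file("vectors.h5", mode="w")
# chunked, compressed on-disk array standing in for the real ~100k x 4M matrix
vectors = h5.create_carray(h5.root, "vectors",
                           atom=tables.Float32Atom(),
                           shape=(1000, 10000),
                           filters=tables.Filters(complevel=5, complib="blosc"))

# read-modify-write one row slice; PyTables handles the disk I/O
row = vectors[42, :]
row[:100] += 1.0
vectors[42, :] = row
h5.close()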
A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.
Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then another pass over all the data with second 1M vectors.
It seems you're at the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly RAM) is available to you for this task. If there are 100k vectors, each with 1M ints, this gives ~370GB. If the multiple-passes method is viable and you've got a machine with 16GB RAM, that is about ~25 passes - which should be easy to parallelize if you've got a cluster.
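A sketch of the first idea with deliberately small, hypothetical dimensions (the asker's update rule of "increment these indices by these amounts" maps directly onto np.add.at):

import numpy as np

n_vectors, dim = 1000, 10000  # illustrative sizes; the real dimensionality is ~4M
vectors = np.zeros((n_vectors, dim), dtype=np.float32)

# one processed corpus line -> (vector id, indices to bump, amounts)
vec_id = 7
indices = np.array([3, 99, 3])
amounts = np.array([1.0, 2.5, 1.0], dtype=np.float32)
np.add.at(vectors[vec_id], indices, amounts)  # handles repeated indices correctly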
Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone, and the tricks to tweak that process, should already be solved there, and there is a Python client as well.
Moreover, this solution could scale vertically without much effort.
You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of your code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much bigger than the available RAM.
Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.
I'd suggest to do it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB (Redis and MongoDB are good candidates).
3) Determine how much memory this procedure consumes and parallelize accordingly (or, even better, use a map/reduce approach or a distributed task queue like Celery).
Plus all the tips mentioned before (numPy etc..)
Hard to say exactly, because a few details are missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.
From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of ram disk with file system, or a database. No idea, which one is best for you.
Have a smallish shell script monitor RAM usage and, every second, spawn another process of the following, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
iterate through the corpus and build and write some vectors
In the end you can collect and combine all vectors, if needed (this would be the reduce part).
Split the corpus evenly in size between parallel jobs (one per core) - process in parallel, ignoring any incomplete line (or, if you cannot tell whether it is incomplete, ignoring the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines, where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture those lines for processing.
Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
import psutil

while someCondition:
    if psutil.virtual_memory().percent > 80.0:  # psutil.phymem_usage() in older psutil releases
        dumpQueue(myQueue, somefile)
    else:
        addSomeStufftoQueue(myQueue, stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to Sean for suggesting this module.
