I am writing a program that reads millions of academic paper abstracts and collects bits of data from them. I have been having issues with running out of memory and have scaled down almost everything I can.
My next idea was to delete from memory an abstract after my program was finished reading it. Here is my loop:
for i in range(0, len(abstracts)):
    abstract = abstracts[i]
    name = abstract.id
    self.Xdict[name] = self.Xdata.getData(abstract)
    self.Ydict[name] = self.Ydata.getData(abstract)
    sys.stdout.write("\rScanned Papers: %d" % count)  # A visual counter
    sys.stdout.flush()
    count += 1
sys.stdout.write("\rScanned Papers: %d" % count)
sys.stdout.flush()
This is my code without any sort of method for removing items from memory. I have tried using:
del abstracts[0] # This is too slow
abstracts = abstracts[1:] # This is way too slow
abstract = abstracts.pop(0) # Doesn't seem to free up any memory
Any help would be fantastic.
Thank you!
To free the memory associated with each abstract in O(1) you can do
abstracts[i] = None
after processing it; the list then keeps only a reference to None in that slot, which is very fast.
Much better, however, would be to not read all the abstracts in upfront in the first place, unless you really need that for reasons not specified in the question.
Note also that the Python data structure that supports fast appending/deleting of elements at both ends of a sequence is collections.deque, not list.
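If it helps, here is a minimal sketch of that O(1) release applied to the loop from the question (the names Xdict, Ydata, getData and so on are copied from the question; whether anything else still holds a reference to each abstract is an assumption):

for i in range(len(abstracts)):
    abstract = abstracts[i]
    name = abstract.id
    self.Xdict[name] = self.Xdata.getData(abstract)
    self.Ydict[name] = self.Ydata.getData(abstract)
    # Drop the list's reference; the abstract can now be garbage collected,
    # provided nothing else still refers to it.
    abstracts[i] = None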
If possible, you can split your data: with, say, 10 GB of abstracts, read the first 1 GB, process it, then the next 1 GB, and so on. Processed in chunks like that it will be easier to handle and won't use as much memory.
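A sketch of that chunked approach, assuming the abstracts come from a text file with one abstract per line (the file name, batch size, and process() are placeholders, not from the question):

def read_in_batches(path, batch_size=10000):
    # Yield lists of lines so that only one batch is in memory at a time.
    batch = []
    with open(path) as fh:
        for line in fh:
            batch.append(line)
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

for batch in read_in_batches("abstracts.txt"):
    process(batch)  # hypothetical per-batch processing; the batch can be freed afterwards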
Related
I have some code that is slow (30-60 minutes by last count) that I need to optimize; it is a data extraction script for Abaqus, for a structural engineering model. The worst part of the script is the loop where it iterates through the object model database, first by frame (i.e. the time in the time history of the simulation) and, nested under this, over each of the nodes. The silly thing is that there are ~100k 'nodes' but only about ~20k useful ones. Luckily for me the nodes are always in the same order, meaning I do not need to look up each node's uniqueLabel; I can do that once in a separate loop and then filter what I get at the end. That is why I have dumped everything into one list and then remove all the nodes that are repeats. But as you can see from the code:
timeValues = []
peeqValues = []
for frame in frames:  # 760 loops
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    for value in setValues:  # 100k loops
        peeqValues.append(value.data)
It still needs to make the value.data calls unnecessarily, about ~80k times. If anyone is familiar with Abaqus odb (object database) objects, they're super slow under Python. To add insult to injury they only run in a single thread, under Abaqus, which has its own Python version (2.6.x) and packages (so e.g. numpy is available, pandas is not). Another annoyance is that you can address the objects by position, e.g. frames[-1] gives you the last frame, but you cannot slice, so you can't do for frame in frames[0:10]: to iterate over the first 10 elements.
I don't have any experience with itertools but I'd want to provide it a list of nodeIDs (or list of True/False) to map onto the setValues. The length and pattern of setValues to skip is always the same for each of the 760 frames. Maybe something like:
for frame in frames:  # still 760 calls
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    # nodeSet_IDs_TF = [True, True, False, False, False, ...] same length as
    # setValues
    filteredSetValues = ifilter(nodeSet_IDs_TF, setValues)
    for value in filteredSetValues:  # only 20k calls
        peeqValues.append(value.data)
Any other tips are also appreciated; after this I wanted to "avoid the dots" by removing the .append() from the loop, and then put the whole thing in a function to see if it helps. The whole script already runs in under 1.5 hours (down from 6, and at one point 21 hours), but once you start optimizing there is no way to stop.
Memory considerations are also appreciated; I run these on a cluster and I believe I once got away with 80 GB of RAM. The scripts definitely work with 160 GB, the issue is getting the resources allocated to me.
I've searched around for a solution but maybe I'm using the wrong keywords, I'm sure this is not an uncommon issue in looping.
EDIT 1
Here is what I ended up using:
# there is no compress under 2.6.x ... so use the equivalent recipe:
from itertools import izip

def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> ACEF
    return (d for d, s in izip(data, selectors) if s)

def iterateOdb(frames, selectors):  # minor speed up
    peeqValues = []
    timeValues = []
    append = peeqValues.append  # minor speed up
    for frame in frames:
        setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
            region=abaqusSet, position=ELEMENT_NODAL).values
        timeValues.append(frame.frameValue)
        for value in compress(setValues, selectors):  # massive speed up
            append(value.data)
    return peeqValues, timeValues

peeqValues, timeValues = iterateOdb(frames, selectors)
The biggest improvement came from using the compress(values, selectors) method (the whole script, including the odb portion, went from ~1:30 hours to 25 minutes). There was also a minor improvement from append = peeqValues.append, as well as from enclosing everything in def iterateOdb(frames, selectors):.
I used tips from: https://wiki.python.org/moin/PythonSpeed/PerformanceTips
Thanks to everyone for answering & helping!
If you're not confident with itertools, try using an if statement in your for loop first.
e.g.
for index, item in enumerate(values):
    if not selectors[index]:
        continue
    ...

# where selectors is a truth array like nodeSet_IDs_TF
This way you can be more sure that you are getting the correct behaviour, and you will get most of the performance increase you would get from using itertools.
The itertools equivalent is compress.
for item in compress(values, selectors):
    ...
I'm not familiar with Abaqus, but the best optimisation you could achieve would be to see if there is any way to give Abaqus your selectors so it doesn't have to waste time creating each value, only for it to be thrown away. If Abaqus is used for doing large array-based manipulations of data then it's likely this is the case.
Another variant in addition to those in Dunes's solution:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
If you want to keep the output list length the same as the length of setValues, then add an else clause:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
    else:
        peeqValues.append(None)
Here selectors is a vector of True/False values with the same length as setValues.
In this case it is really a matter of taste which variant you prefer. If the full iteration over 76 million nodes (760 x 100 000) takes 30 minutes, the time is not spent in Python's loops.
I tried this:
def loopit(a):
    for i in range(760):
        for j in range(100000):
            a = a + 1
    return a
IPython's %timeit reports the loop time as 3.54 s. So, the looping spends maybe 0.1 % of the total time.
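For reference, the same measurement can be reproduced outside IPython with the standard timeit module (a sketch only; the loop is the same 76-million-iteration no-op as above):

import timeit

def loopit(a):
    for i in range(760):
        for j in range(100000):
            a = a + 1
    return a

# Run the bare loop once and print the elapsed time in seconds.
print(timeit.timeit("loopit(0)", setup="from __main__ import loopit", number=1))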
I have a program which imports a text file through standard input and aggregates the lines into a dictionary. However, the input file is very large (on the order of 1 TB) and I won't have enough space to store the whole dictionary in memory (running on a 64 GB RAM machine). Currently I've got a very simple clause which outputs the dictionary once it has reached a certain length (in this case 100) and clears the memory. The output can then be aggregated at a later point.
So I want to output the dictionary once memory is full. What is the best way of managing this? Is there a function which gives me the current memory usage? Is it costly to keep checking? Am I using the right tactic?
import sys

X_dic = dict()

# Used to print the dictionary in required format
def print_dic(dic):
    for key, value in dic.iteritems():
        print "{0}\t{1}".format(key, value)

for line in sys.stdin:
    value, key = line.strip().split(",")
    if (not key in X_dic):
        X_dic[key] = []
    X_dic[key].append(value)

    # Limit size of dic.
    if (len(X_dic) == 100):
        print_dic(X_dic)  # Print and clear dictionary
        X_dic = dict()

# Now output
print_dic(X_dic)
The module resource provides some information on how many resources (memory, etc.) you are using. See here for a nice little usage example.
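A minimal sketch of checking the process's own peak memory usage with resource (note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS):

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
# Peak resident set size of this process so far (kB on Linux).
print("peak RSS: {0}".format(usage.ru_maxrss))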
On a Linux system (I don't know where you are) you can watch the contents of the file /proc/meminfo. As part of the proc file system it is updated automatically.
But I object to the whole strategy of monitoring the memory and using up as much of it as possible, actually. I'd rather propose to dump the dictionary regularly (after 1M entries have been added or so). It will probably speed up your program to keep the dict small; it also presumably has advantages for later processing if all dumps are of similar size. If you dump a huge dict which only fit into memory because nothing else was using memory at the time, then you will later have trouble re-reading that dict if something else is using some of your memory. Then you would have to create a situation in which nothing else is using memory (e.g. reboot or similar). Not very convenient.
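A sketch of that approach on top of the question's loop, dumping by entry count instead of checking memory (the 1M threshold is just an illustration):

import sys

MAX_ENTRIES = 1000000  # dump threshold; tune to the available memory

def print_dic(dic):
    for key, value in dic.items():
        print("{0}\t{1}".format(key, value))

X_dic = {}
for line in sys.stdin:
    value, key = line.strip().split(",")
    X_dic.setdefault(key, []).append(value)
    if len(X_dic) >= MAX_ENTRIES:  # dump and clear before memory fills up
        print_dic(X_dic)
        X_dic = {}

print_dic(X_dic)  # flush whatever is left at the end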
I am trying to write a list of the numbers from 0 to 1000000000 as strings, directly to a text file. I would also like each number to have leading zeros up to ten digit places, e.g. 0000000000, 0000000001, 0000000002, 0000000003, ... n. However I find that it is taking much too long for my taste.
I can use seq, but there is no support for leading zeros and I would prefer to avoid using awk and other auxiliary tools to handle these tasks. I am aware of dramatic speed-up benefits just from coding this in C, but I don't want to resort to it. I was considering mapping some functions to a large list and executing them in a loop, however I only have 2GB of RAM available, so please keep this in mind when approaching my problem.
I am using Python-Progressbar, and I am getting an ETA of approximately 2 hours. I would appreciate it if someone can offer me some advice as to how to approach this problem:
pbar = ProgressBar(widgets=[Percentage(), Bar(), ' ', ETA(), ' ', FileTransferSpeed()], maxval=1000000000).start()
with open('numlistbegin', 'w') as numlist:
    limit, nw, pu = 1000000000, numlist.write, pbar.update
    for x in range(limit):
        nw('%010d\n' % (x,))
        pu(x)
pbar.finish()
EDIT: So I have discovered that the formatting (regardless of what programming language you are using) creates vast amounts of overhead. Seq gets the job done quickly, but is much slower with the formatting option (-f). However, if anyone would like to offer a Python solution nonetheless, it would be most welcome.
FWIW seq does have a formatting option:
$ seq -f "%010g" 1 5
0000000001
0000000002
0000000003
0000000004
0000000005
Your long run time might be memory-related. With large ranges, using xrange is more memory-efficient since it doesn't try to calculate and store the entire range in memory before it starts. See the post titled Should you always favor xrange() over range()?
Edit: Using Python 3 means xrange vs. range usage is irrelevant.
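For illustration, the only change to the question's loop would be the range call (nw and pu are the write/update aliases from the question; Python 2 shown, since that is where the distinction matters):

# Python 2: xrange yields numbers lazily instead of building a billion-element list.
for x in xrange(limit):
    nw('%010d\n' % (x,))
    pu(x)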
I did some experimenting and noticed that writing in larger batches improved the performance by about 30%. I'm not sure why your code is taking 2 hours to generate the file -- unless the progress bar is killing performance. If so, you should apply the same batching logic to updating the progress bar. My old Windows box will create a file one-tenth the required size in about 73 seconds.
# Python 2.
# Change xrange to range for Python 3.
import time

start_time = time.time()
limit = 100000000  # 1/10 your limit.
skip = 1000        # Batch size.

with open('numlistbegin', 'w') as fh:
    for i in xrange(0, limit, skip):
        batch = ''.join('%010d\n' % j for j in xrange(i, i + skip, 1))
        fh.write(batch)

print time.time() - start_time  # 73 sec. (106 sec. without batching).
I am currently selecting a large list of rows from a database using pyodbc. The result is then copied to a large list, and then I am trying to iterate over the list. Before I abandon Python and try to recreate this in C#, I wanted to know if there was something I was doing wrong.
clientItemsCursor.execute("Select ids from largetable where year =?", year)
allIDRows = clientItemsCursor.fetchall()  # takes maybe 8 seconds
for clientItemRow in allIDRows:
    aID = str(clientItemRow[0])
    # Do something with str -- Removed because I was trying to determine what was slow
    count = count + 1
Some more information:
The for loop is currently running at about 5 loops per second, and that seems insanely slow to me.
The total rows selected is ~489,000.
The machine it's running on has lots of RAM and CPU. It seems to only use one or two cores, and RAM usage is 1.72 GB of 4 GB.
Can anyone tell me what's wrong? Do scripts just run this slow?
Thanks
This should not be slow with Python native lists - but maybe ODBC's driver is returning a "lazy" object that tries to be smart but just gets slow. Try just doing
allIDRows = list(clientItemsCursor.fetchall())
in your code and post further benchmarks.
(Python lists can get slow if you start inserting things into their middle, but just iterating over a large list should be fast.)
It's probably slow because you load all the results into memory first and then iterate over the list. Try iterating over the cursor instead.
And no, scripts shouldn't be that slow.
clientItemsCursor.execute("Select ids from largetable where year =?", year)
for clientItemrow in clientItemsCursor:
    aID = str(clientItemrow[0])
    count = count + 1
More investigation is needed here... consider the following script:
bigList = range(500000)
doSomething = ""
count = 0
arrayList = [[x] for x in bigList]  # takes a few seconds
for x in arrayList:
    doSomething += str(x[0])
    count += 1
This is pretty much the same as your script, minus the database stuff, and takes a few seconds to run on my not-terribly-fast machine.
When you connect to your database directly (I mean, you get an SQL prompt), how many seconds does this query take?
When query ends, you get a message like this:
NNNNN rows in set (0.01 sec)
So, if that time is large and your query is slow even when run "natively", you may need to create an index on that table.
This is slow because you are
Getting all the results
Allocating memory and assigning the values to that memory to create the list allIDRows
Iterating over that list and counting.
If execute gives you back a cursor then use the cursor to its advantage and start counting as you get stuff back, saving time on the memory allocation.
clientItemsCursor.execute("Select ids from largetable where year =?", year)
for clientItemrow in clientItemsCursor:
    count += 1
Other hints:
create an index on year
use 'select count(*) from ...' to get the count for the year; this will probably be optimised on the db side (a sketch is below)
Remove the aID line if it is not needed; it converts the first item of the row to a string even though it's not used.
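A minimal sketch of the count(*) suggestion with pyodbc (the connection string and the example year are placeholders; the table and column names come from the question's query):

import pyodbc

conn = pyodbc.connect("DSN=mydatasource")  # placeholder connection string
cursor = conn.cursor()

year = 2010  # example parameter value
# Let the database do the counting instead of fetching ~489k rows into Python.
cursor.execute("select count(*) from largetable where year = ?", year)
count = cursor.fetchone()[0]
print(count)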
I'm working on a Python script to go through two files - one containing a list of UUIDs, the other containing a large number of log entries, each line containing one of the UUIDs from the other file. The purpose of the program is to create a list of the UUIDs from file1 and then, each time one of those UUIDs is found in the log file, increment the associated value.
So long story short, count how many times each UUID appears in the log file.
At the moment, I have a list which is populated with UUID as the key, and 'hits' as the value. Then another loop which iterates over each line of the log file, and checking if the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle):  # start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize):  # check progress
        print logFunc.progress(lineCount, logSize)  # print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1:  # for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1  # if matched, increment the relevant value in the uidHits list
            break  # as we've already found the match, don't process the rest
    lineCount += 1
It works as it should - but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading the file in chunks rather than line by line would improve performance by reducing the amount of disk I/O time, but the performance difference on a ~200 MB test file was negligible. If anyone has any other methods I would be very grateful :)
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
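The uuid helper itself is left to the reader; a minimal sketch, assuming the UUID is the first comma-separated field of each log line (the question doesn't specify the log format, so adjust the split accordingly):

import collections

def uuid(line):
    # Assumed format: the UUID is the first comma-separated field of the line.
    return line.split(",")[0].strip()

with open("log.txt") as log:
    hits = collections.Counter(map(uuid, log))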
A couple comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
Your code is slow because you are using the wrong data structures. A dict is what you want here.
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10 GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In Python 2.x it'll look something like
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure if you'll see a performance gain since I've not processed 10 GB of data with it yet, but you might explore this framework.
Try measuring where most time is spent, using a profiler http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data: if the list of uuids isn't very long, you may find, for example, that a large proportion of the time is spent on the "if logFunc.progress(lineCount, logSize)". If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more.
In any case, you can eliminate the lineCount variable, and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
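A minimal sketch of profiling with the standard-library cProfile module (processLog is a made-up wrapper name; put the matching loop from the question inside it):

import cProfile

def processLog():
    # Hypothetical wrapper around the UUID-matching loop from the question.
    for logLine in open("log.txt"):
        pass  # ... match UUIDs here ...

# Sort the report by cumulative time to see where most of the time is spent.
cProfile.run("processLog()", sort="cumulative")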