Writing numbers from 0 to 1000000000 - python

I am trying to write a list of the numbers from 0 to 1000000000 as strings, directly to a text file. I would also like each number to have leading zeros up to ten digit places, e.g. 0000000000, 0000000001, 0000000002, 0000000003, ... n. However I find that it is taking much too long for my taste.
I can use seq, but there is no support for leading zeros and I would prefer to avoid using awk and other auxiliary tools to handle these tasks. I am aware of dramatic speed-up benefits just from coding this in C, but I don't want to resort to it. I was considering mapping some functions to a large list and executing them in a loop, however I only have 2GB of RAM available, so please keep this in mind when approaching my problem.
I am using Python-Progressbar, and I am getting an ETA of approximately 2 hours. I would appreciate it if someone can offer me some advice as to how to approach this problem:
from progressbar import ProgressBar, Percentage, Bar, ETA, FileTransferSpeed

pbar = ProgressBar(widgets=[Percentage(), Bar(), ' ', ETA(), ' ', FileTransferSpeed()], maxval=1000000000).start()
with open('numlistbegin', 'w') as numlist:
    limit, nw, pu = 1000000000, numlist.write, pbar.update
    for x in range(limit):
        nw('%010d\n' % (x,))
        pu(x)
pbar.finish()
EDIT: So I have discovered that the formatting (regardless of the programming language you are using) creates a vast amount of overhead. seq gets the job done quickly on its own, but much more slowly with the formatting option (-f). However, if anyone would like to offer a Python solution nonetheless, it would be most welcome.

FWIW seq does have a formatting option:
$ seq -f "%010g" 1 5
0000000001
0000000002
0000000003
0000000004
0000000005
Your long run time might be memory-related. With large ranges, using xrange is more memory-efficient since it doesn't try to calculate and store the entire range in memory before it starts. See the post titled Should you always favor xrange() over range()?
Edit: Using Python 3 means the xrange vs. range distinction is irrelevant.
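For illustration, a minimal sketch of the question's loop written to stay memory-friendly (the progress-bar calls are left out):
# Python 2: xrange is lazy, so the billion-element range is never materialized in memory.
# On Python 3, plain range() already behaves this way.
with open('numlistbegin', 'w') as numlist:
    for x in xrange(1000000000):
        numlist.write('%010d\n' % x)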

I did some experimenting and noticed that writing in larger batches improved the performance by about 30%. I'm not sure why your code is taking 2 hours to generate the file -- unless the progress bar is killing performance. If so, you should apply the same batching logic to updating the progress bar. My old Windows box will create a file one-tenth the required size in about 73 seconds.
# Python 2.
# Change xrange to range for Python 3.
import time
start_time = time.time()

limit = 100000000          # 1/10 your limit.
skip = 1000                # Batch size.

with open('numlistbegin', 'w') as fh:
    for i in xrange(0, limit, skip):
        batch = ''.join('%010d\n' % j for j in xrange(i, i + skip, 1))
        fh.write(batch)

print time.time() - start_time   # 73 sec. (106 sec. without batching).

Related

Efficiently delete first element of a list

I am writing a program that reads millions of academic paper abstracts and collects bits of data from them. I have been having issues with running out of memory and have scaled down almost everything I can.
My next idea was to delete from memory an abstract after my program was finished reading it. Here is my loop:
for i in range(0, len(abstracts)):
    abstract = abstracts[i]
    name = abstract.id
    self.Xdict[name] = self.Xdata.getData(abstract)
    self.Ydict[name] = self.Ydata.getData(abstract)
    sys.stdout.write("\rScanned Papers: %d" % count)  # A visual counter
    sys.stdout.flush()
    count += 1
sys.stdout.write("\rScanned Papers: %d" % count)
sys.stdout.flush()
This is my code without any sort of method for removing items from memory. I have currently tried using:
del abstracts[0] # This is too slow
abstracts = abstracts[1:] # This is way too slow
abstract = abstracts.pop(0) # Doesn't seem to free up any memory
Any help would be fantastic.
Thank you!
To free the memory associated with each abstract in O(1) you can do
abstracts[i] = None
after processing it; this will keep just a pointer and will be very fast.
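In the question's loop that would look something like this (a sketch; the getData calls and counter are kept as in the question):
for i in range(len(abstracts)):
    abstract = abstracts[i]
    name = abstract.id
    self.Xdict[name] = self.Xdata.getData(abstract)
    self.Ydict[name] = self.Ydata.getData(abstract)
    abstracts[i] = None  # drop the reference so the abstract can be garbage-collected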
Much better, however, would be to not even read in all the abstracts upfront, unless you really need that for reasons not specified in the question.
Note also that the Python data structure that supports quickly appending/deleting elements at both ends of a sequence is collections.deque, not list.
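For completeness, a minimal sketch of the deque approach (assuming the abstracts can simply be handed to a deque up front):
from collections import deque

queue = deque(abstracts)        # popping from the left end is O(1), unlike with a list
while queue:
    abstract = queue.popleft()  # no reference is kept once the loop moves on
    # ... process the abstract ...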
If possible, you can also split the work into chunks: with, say, 10 GB of data, read the first 1 GB, process it, then read the next 1 GB, and so on. Processed this way it won't take much time or memory.
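A sketch of that chunking idea, assuming the abstracts can be read lazily from a file (the filename and one-record-per-line layout here are hypothetical):
def iter_chunks(path, chunk_size=10000):
    """Yield lists of raw abstract records, chunk_size at a time."""
    chunk = []
    with open(path) as fh:
        for line in fh:
            chunk.append(line)      # or parse the line into an abstract object here
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

# Only chunk_size records are ever held in memory at once:
# for chunk in iter_chunks("abstracts.txt"):
#     for abstract in chunk:
#         ...process...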

How can I make this Python file scan faster?

EDIT: Python 2.7.8
I have two files. p_m has a few hundred records that contain acceptable values in column 2. p_t has tens of millions of records in which I want to make sure that column 14 is from the set of acceptable values already mentioned. So in the first while loop I'm reading in all the acceptable values, making a set (for de-duping), and then turning that set into a list (I didn't benchmark to see if a set would have been faster than a list, actually...). I got it down to about as few lines as possible in the second loop, but I don't know if they are the FASTEST few lines (I'm using the [14] index twice because exceptions are so very rare that I didn't want to bother with an assignment to a variable). Currently it takes about 40 minutes to do a scan. Any ideas on how to improve that?
import sets

def contentScan(p_m, p_t):
    """ """
    vcont = sets.Set()
    i = 0
    h = open(p_m, "rb")
    while(True):
        line = h.readline()
        if not line:
            break
        i += 1
        vcont.add(line.split("|")[2])
    h.close()
    vcont = list(vcont)
    vcont.sort()
    i = 0
    h = open(p_t, "rb")
    while(True):
        line = h.readline()
        if not line:
            break
        i += 1
        if line.split("|")[14] not in vcont:
            print "%s is not defined in the matrix." % line.split("|")[14]
            return 1
    h.close()
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Checking for membership in a set of a few hundred items is much faster than checking for membership in the equivalent list. However, given your staggering 40-minute running time, the difference may not be that meaningful. E.g.:
ozone:~ alex$ python -mtimeit -s'a=list(range(300))' '150 in a'
100000 loops, best of 3: 3.56 usec per loop
ozone:~ alex$ python -mtimeit -s'a=set(range(300))' '150 in a'
10000000 loops, best of 3: 0.0789 usec per loop
So if you're checking "tens of millions of times", using the set should save you tens of seconds -- better than nothing, but barely measurable.
The same consideration applies to other very advisable improvements, such as turning the loop structure:
h = open(p_t, "rb")
while(True):
    line = h.readline()
    if not line:
        break
    ...
h.close()
into a much-sleeker:
with open(p_t, 'rb') as h:
    for line in h:
        ...
again, this won't save you as much as a microsecond per iteration -- so over, say, 50 million lines, that's less than one of those 40 minutes. Ditto for the removal of the completely unused i += 1 -- it makes no sense for it to be there, but taking it away will make little difference.
One answer focused on the cost of the split operation. That depends on how many fields per record you have, but, for example:
ozone:~ alex$ python -mtimeit -s'a="xyz|"*20' 'a.split("|")[14]'
1000000 loops, best of 3: 1.81 usec per loop
so, again, whatever optimization here could save you maybe at most a microsecond per iteration -- again, another minute shaved off, if that.
Really, the key issue here is why reading and checking e.g. 50 million records should take as much as 40 minutes -- 2400 seconds -- 48 microseconds per line; no doubt it would still be more than 40 microseconds per line even with all the optimizations mentioned here and in the other answers and comments.
So once you have applied all the optimizations (and confirmed the code is still just too slow), try profiling the program -- see e.g. http://ymichael.com/2014/03/08/profiling-python-with-cprofile.html -- to find out exactly where the time is going.
Also, just to make sure it's not just I/O to some peculiarly slow disk, do a run with the meaty part of the big loop "commented out" -- just reading the big file and doing no processing or checking at all on it. This will tell you what the "irreducible" I/O overhead is. If I/O is responsible for the bulk of your elapsed time, then you can't do much to improve things, though changing the open to open(thefile, 'rb', HUGE_BUFFER_SIZE) might help a bit, and you may want to consider improving the hardware set-up instead -- defragment the disk, use a local rather than remote filesystem, and so on.
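A minimal sketch of that I/O-only timing run (the filename and buffer size here are placeholders):
import time

HUGE_BUFFER_SIZE = 16 * 1024 * 1024      # a large read buffer, as suggested above

start = time.time()
with open("p_t.dat", "rb", HUGE_BUFFER_SIZE) as h:   # placeholder filename
    for line in h:
        pass                             # no splitting, no checking -- just reading
print time.time() - start                # the "irreducible" I/O time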
The list lookup was the issue (as you correctly noticed). Searching a list has O(n) time complexity, where n is the number of items stored in the list. On the other hand, finding a value in a hashtable (which is what a Python dictionary actually is) has O(1) complexity. Since you have hundreds of items in the list, the list lookup is about two orders of magnitude more expensive than the dictionary lookup. This is in line with the 34x improvement you saw when replacing the list with the dictionary.
To further reduce execution time by 5-10x you can use a Python JIT. I personally like PyPy (http://pypy.org/features.html). You do not need to modify your script; just install pypy and run:
pypy [your_script.py]
EDIT: Made more pythony.
EDIT 2: Using set builtin rather than dict.
Based on the comments, I decided to try using a dict instead of a list to store the acceptable values against which I'd be checking the big file (I did keep a watchful eye on .split but did not change it). Based on just changing the list to a dict, I saw an immediate and HUGE improvement in execution time.
Using timeit and running 5 iterations over a million-line file, I get 884.2 seconds for the list-based checker, and 25.4 seconds for the dict-based checker! So like a 34x improvement for changing 2 or 3 lines.
Thanks all for the inspiration! Here's the solution I landed on:
def contentScan(p_m, p_t):
    """ """
    vcont = set()
    with open(p_m, 'rb') as h:
        for line in h:
            vcont.add(line.split("|")[2])
    with open(p_t, "rb") as h:
        for line in h:
            if line.split("|")[14] not in vcont:
                print "%s is not defined in the matrix." % line.split("|")[14]
                return 1
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Yes, it's not optimal at all. split is EXPENSIVE like hell: it creates a new list plus N new strings and appends them all to the list. Instead, scan for the "|" separator that starts field 14, then for the next "|", and slice the text between the two positions (a concrete sketch follows below).
Pretty sure you can make it run 2-10x faster with this small change. To go further, you don't even have to extract the string: loop through the valid strings and, for each one, compare character by character starting just after the separator while the characters match; if you land exactly on a "|" at the end of one of the valid strings, it's a match. It also helps a bit to sort the list of valid strings by their frequency in the data file. But not creating a list of dozens of strings for every line is by far the more important change.
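A minimal sketch of the no-split extraction (the helper name is mine; it returns the same value as line.split("|")[14] when the line is well formed):
def field14(line):
    # Skip past the 14 separators that precede field index 14.
    pos = -1
    for _ in xrange(14):
        pos = line.find("|", pos + 1)
        if pos == -1:
            return None                  # malformed line
    end = line.find("|", pos + 1)
    return line[pos + 1:end] if end != -1 else line[pos + 1:]

# ... if field14(line) not in vcont: ...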
Here are your tests:
a generator (you can adjust it to produce some realistic data)
a no-op run (just reading)
the original solution
the no-split solution
I wasn't able to get the code to format here, so it is in this gist:
https://gist.github.com/iced/16fe907d843b71dd7b80
Test conditions: a VBox VM with 2 cores and 1GB of RAM running Ubuntu 14.10 with the latest updates. Each variation was executed 10 times, rebooting the VM before each run and discarding the lowest and highest run times.
Results:
no_op: 25.025
original: 26.964
no_split: 25.102
original - no_op: 1.939
no_split - no_op: 0.077
In this case, though, this particular optimization is mostly useless, as the majority of the time is spent in IO; I was unable to find a test setup where IO was less than 70% of the total. In any case, split IS expensive and should be avoided when it's not needed.
PS. Yes, I understand that with even 1K of good items it's much better to use a hash (in fact, it wins as soon as the hash computation is faster than the list lookup, probably at around 100 elements); my point is that split is the expensive part here.

Skipping a pattern of elements using itertools and accompanying list

I have some code that is slow (30-60 mins by last count) that I need to optimize; it is a data extraction script for Abaqus, for a structural engineering model. The worst part of the script is the loop where it iterates through the object model database, first by frame (i.e. the time in the time history of the simulation) and, nested under this, by each of the nodes. The silly thing is that there are ~100k 'nodes' but only about ~20k useful ones. Luckily for me the nodes are always in the same order, so I do not need to look up each node's uniqueLabel; I can do that once in a separate loop and then filter what I get at the end. That is why I have dumped everything into one list and then remove all the nodes that are repeats. But as you can see from the code:
timeValues = []
peeqValues = []
for frame in frames:   # 760 loops
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    for value in setValues:   # 100k loops
        peeqValues.append(value.data)
It still needs to make the value.data calls unnecessarily, about ~80k times. If anyone is familiar with Abaqus odb (object database) objects, they're super slow under python. To add insult to injury they only run in a single thread, under Abaqus which has its own python version (2.6.x) and packages (so e.g. numpy is available, pandas is not). Another thing that may be annoying is the fact that you can address the objects by position e.g. frames[-1] gives you the last frame, but you cannot slice, so e.g. you can't do this: for frame in frames[0:10]: # iterate first 10 elements.
I don't have any experience with itertools but I'd want to provide it a list of nodeIDs (or list of True/False) to map onto the setValues. The length and pattern of setValues to skip is always the same for each of the 760 frames. Maybe something like:
for frame in frames:   # still 760 calls
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    # nodeSet_IDs_TF = [True, True, False, False, False, ...] same length as
    # setValues
    filteredSetValues = ifilter(nodeSet_IDs_TF, setValues)
    for value in filteredSetValues:   # only 20k calls
        peeqValues.append(value.data)
Any other tips are also appreciated; after this I wanted to "avoid the dots" by removing the .append() from the loop, and then putting the whole thing in a function to see if it helps. The whole script already runs in under 1.5 hours (down from 6, and at one point 21 hours), but once you start optimizing there is no way to stop.
Memory considerations also appreciated, I run these on a cluster and I believe I got away once with 80 GB of RAM. The scripts definitely work on 160 GB, the issue is getting the resources allocated to me.
I've searched around for a solution but maybe I'm using the wrong keywords, I'm sure this is not an uncommon issue in looping.
EDIT 1
Here is what I ended up using:
# there is no compress under 2.6.x ... so use the equivalent recipe:
from itertools import izip

def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> ACEF
    return (d for d, s in izip(data, selectors) if s)

def iterateOdb(frames, selectors):        # minor speed up
    peeqValues = []
    timeValues = []
    append = peeqValues.append            # minor speed up
    for frame in frames:
        setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
            region=abaqusSet, position=ELEMENT_NODAL).values
        timeValues.append(frame.frameValue)
        for value in compress(setValues, selectors):   # massive speed up
            append(value.data)
    return peeqValues, timeValues

peeqValues, timeValues = iterateOdb(frames, selectors)
The biggest improvement came from using the compress(values, selectors) approach (the whole script, including the odb portion, went from ~1.5 hours to 25 mins). There was also a minor improvement from append = peeqValues.append, as well as from enclosing everything in def iterateOdb(frames, selectors):.
I used tips from: https://wiki.python.org/moin/PythonSpeed/PerformanceTips
Thanks to everyone for answering & helping!
If you're not confident with itertools, try using an if statement in your for loop first, e.g.:
for index, item in enumerate(values):
    if not selectors[index]:
        continue
    ...
# where selectors is a truth array like nodeSet_IDs_TF
This way you can be more sure that you are getting the correct behaviour, and you will get most of the performance increase you would get from using itertools.
The itertools equivalent is compress.
for item in compress(values, selectors):
    ...
I'm not familiar with abaqus, but the best optimisation you could achieve would be to see if there is any way to give abaqus your selectors, so it doesn't have to waste time creating each value only for it to be thrown away. If abaqus is used for doing large array-based manipulations of data then it's likely this is the case.
Another variant in addition to those in Dunes's solution:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
If you want to keep the output list length the same as the setValue length, then add an else clause:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
    else:
        peeqValues.append(None)
selector is here a vector with True/False, and it has the same length as setValues.
In this case it is really a matter of taste which one you like. If the full iteration of 76 million nodes (760 x 100 000) takes 30 minutes, the time is not spent in python's loops.
I tried this:
def loopit(a):
    for i in range(760):
        for j in range(100000):
            a = a + 1
    return a
IPython's %timeit reports the loop time as 3.54 s. So, the looping spends maybe 0.1 % of the total time.

Why is it much slower to print an answer to a calculation when compared to just computing it?

I'm computing the answers to some very big division problems and wonder why b = a/c (where a and c are both positive whole numbers) is so much faster to compute than it is to print: b = a/c on its own is way faster than b = a/c followed by print b.
Very slow:
from datetime import datetime
startTime = datetime.now()

a = 2**1000000 - 3
b = a/13
print b
print(datetime.now() - startTime)
but without the print b it is very fast. I later typed in c=a%13 to see if anything was actually happening (I'm still pretty new to programming) and it is very fast when I type in print c (without the print b code).
As far as I understand, IO operations are slow and printing to the screen is like writing to a file and it is going to block the thread for some time.
As someone pointed out, converting the number to a string could be taking time too. Whenever I have to measure the time of something, I measure the time of the calculation alone and print any results only after measuring.
To make the program even faster, but more memory hungry, you could save every result in a list, then compile one big string and print it only once.
Repeated calls to print take more time than one big call to print.
from datetime import datetime

startTime = datetime.now()
a = 2**1000000 - 3
b = a/13
elapsedTime = datetime.now() - startTime

print "Elapsed time %s\n Number: %s" % (elapsedTime, b)
You're trying to output a pretty big number (about 10^300000), and it takes time to convert it from its internal binary format to decimal (I'd guess on the order of 300000 divisions are needed). If you really need to output the whole number in decimal format, I don't think you can speed it up much. But you can quickly print the number in hex or binary format:
hex(b)
bin(b)
You can use the Decimal type to store numbers in decimal format internally, but calculations with this type could be much slower.
IO is slow, as people have pointed out. When doing IO you do not know how your process will get scheduled -- it will probably be put on a waiting queue. I am not sure what you're timing, but simple engineering logic dictates that you want to make the IO a small (i.e. insignificant) portion of the overall run time. So I suggest you do 1*10^6 divisions (the more the better) and only then do the IO, assuming you are testing just the mathematics. Note that your timings also involve how the computer and OS behave.

Best seed for parallel process

I need to run Monte Carlo simulations in parallel on different machines. The code is in C++, but the program is set up and launched with a Python script that sets a lot of things, in particular the random seed. The function setseed takes a 4-byte unsigned integer.
Using a simple
import time
setseed(int(time.time()))
is not very good because I submit the jobs to a queue on a cluster; they remain pending for some minutes and then start, but the start time is unpredictable and two jobs may start in the same second, so I switched to:
setseed(int(time.time()*100))
but I'm not happy with it. What is the best solution? Maybe I can combine information from the time, the machine id and the process id. Or maybe the best solution is to read from /dev/random (these are Linux machines)?
How to read 4 bytes from /dev/random?
f = open("/dev/random","rb")
f.read(4)
gives me a string, but I want an integer!
Reading from /dev/random is a good idea. Just convert the 4-byte string into an integer:
f = open("/dev/random","rb")
rnd_str = f.read(4)
Either using struct:
import struct
rand_int = struct.unpack('I', rnd_str)[0]
Update: the uppercase I is needed.
Or multiply and add:
rand_int = 0
for c in rnd_str:
    rand_int <<= 8
    rand_int += ord(c)
You could simply copy over the four bytes into an integer, that should be the least of your worries.
But parallel pseudo-random number generation is a rather complex topic and very often not done well. Usually you generate seeds on one machine and distribute them to the others.
Take a look at SPRNG, which handles exactly your problem.
If this is Linux or a similar OS, you want /dev/urandom -- it always produces data immediately.
/dev/random may stall waiting for the system to gather randomness. It does produce cryptographic-grade random numbers, but that is overkill for your problem.
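A minimal sketch of that (os.urandom reads from the kernel's random source, so no file handling is needed; setseed is the asker's own function):
import os
import struct

seed = struct.unpack("I", os.urandom(4))[0]   # unsigned 32-bit integer
setseed(seed)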
You can use a random number as the seed; this has the advantage of being operating-system agnostic (no /dev/random needed), with no conversion from string to int. Why not simply use
random.randrange(-2**31, 2**31)
as the seed of each process? Slightly different starting times give wildly different seeds, this way…
You could alternatively use the random.jumpahead method, if you know roughly how many random numbers each process is going to use (the documentation of random.WichmannHill.jumpahead is useful).
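If you do want to fold the time, machine id and process id into the seed as the question suggests, one possible sketch (hashing them down to 4 bytes; the choice of md5 is arbitrary) would be:
import hashlib
import os
import socket
import time

raw = "%r-%s-%d" % (time.time(), socket.gethostname(), os.getpid())
seed = int(hashlib.md5(raw).hexdigest()[:8], 16)   # first 4 bytes of the digest as an unsigned int
setseed(seed)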
