I have the following code for producing a big text file:
import random

d = 3
n = 100000
f = open("input.txt", 'a')
s = ""
for j in range(0, d-1):
    s += str(round(random.uniform(0, 1000), 3)) + " "
s += str(round(random.uniform(0, 1000), 3))
f.write(s)
for i in range(0, n-1):
    s = ""
    for j in range(0, d-1):
        s += str(round(random.uniform(0, 1000), 3)) + " "
    s += str(round(random.uniform(0, 1000), 3))
    f.write("\n" + s)
f.close()
But it seems pretty slow even to generate 5 GB of this.
How can I make it faster? I want the output to look like:
796.802 691.462 803.664
849.483 201.948 452.155
144.174 526.745 826.565
986.685 238.462 49.885
137.617 416.243 515.474
366.199 687.629 423.929
Well, of course, the whole thing is I/O bound. You can't output the file
faster than the storage device can write it. Leaving that aside, there
are some optimizations that could be made.
Your method of building up a long string from several shorter strings is
suboptimal. You're saying, essentially, s = s1 + s2. When you tell
Python to do this, it concatenates two string objects to make a new
string object. This is slow, especially when repeated.
A much better way is to collect the individual string objects in a list
or other iterable, then use the join method to run them together. For
example:
>>> ''.join(['a', 'b', 'c'])
'abc'
>>> ', '.join(['a', 'b', 'c'])
'a, b, c'
Instead of n-1 string concatenations to join n strings, this does
the whole thing in one step.
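If you want to see the difference for yourself, here is a small timeit sketch (the sizes and repeat counts are arbitrary):
import timeit

setup = "parts = ['12.345'] * 1000"
concat = "s = ''\nfor p in parts:\n    s += p"
joined = "s = ''.join(parts)"

# The join version should come out noticeably faster, and the gap grows
# with the number of parts being combined.
print(timeit.timeit(concat, setup=setup, number=1000))
print(timeit.timeit(joined, setup=setup, number=1000))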
There's also a lot of repeated code that could be combined. Here's a
cleaner design, still using the loops.
import random

d = 3
n = 1000
f = open('input.txt', 'w')
for i in range(n):
    nums = []
    for j in range(d):
        nums.append(str(round(random.uniform(0, 1000), 3)))
    s = ' '.join(nums)
    f.write(s)
    f.write('\n')
f.close()
A cleaner, briefer, more Pythonic way is to use a list comprehension:
import random

d = 3
n = 1000
f = open('input.txt', 'w')
for i in range(n):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')
f.close()
Note that in both cases, I wrote the newline separately. That should be
faster than concatenating it to the string, since I/O is buffered
anyway. If I were joining a list of strings without separators, I'd just
tack on a newline as the last string before joining.
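For the record, that no-separator variant might look something like this (a sketch of the same loop, with the newline included as the last piece):
import random

d = 3
with open('input.txt', 'w') as f:
    for i in range(1000):
        # Each piece carries its own trailing space; the newline is the final
        # piece, so one join and one write cover the whole line.
        pieces = [str(round(random.uniform(0, 1000), 3)) + ' ' for j in range(d - 1)]
        pieces.append(str(round(random.uniform(0, 1000), 3)))
        pieces.append('\n')
        f.write(''.join(pieces))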
As Daniel's answer says, numpy is probably faster, but maybe you don't
want to get into numpy yet; it sounds like you're kind of a beginner at
this point.
Using numpy is probably faster:
import numpy

d = 3
n = 100000
data = numpy.random.uniform(0, 1000, size=(n, d))
numpy.savetxt("input.txt", data, fmt='%.3f')
This could be a bit faster:
import random

nlines = 100000
col = 3
f = open('input.txt', 'w')
for line in range(nlines):
    f.write('{} {} {}\n'.format(*(round(random.uniform(0, 1000), 3) for e in range(col))))
f.close()
or use format specifiers to control the precision:
for line in range(nlines):
    numbers = [random.uniform(0, 1000) for e in range(col)]
    f.write('{:6.3f} {:6.3f} {:6.3f}\n'.format(*numbers))
If you want an effectively unlimited file, with no fixed size, you can keep the same loop but run it for a huge number of iterations, like this:
import random

d = 3
f = open('input.txt', 'w')
for i in range(10**9):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')
f.close()
The loop keeps writing (and the file keeps growing) until you stop it with Ctrl-C or it finally completes.
Related
I have the following code. The file file.txt contains a list of variables. Some of them should be str type and others should be int type.
var = [None] * 3
j = 0
with open("file.txt", "r") as f:
    content = f.readline().split(";")
    for i in range(2, 5):
        var[j] = int(content[i])
        j += 1
Instead of incrementing j manually, I'd like to do it in a cleaner way (e.g. within the 'instructions' of the for loop, or something like that).
What would be a shorter/better way to handle this task?
You can use enumerate:
for j, i in enumerate(range(2, 5)):
    var[j] = int(content[i])
Also, you don't need to initialize var at all - just use a list comprehension:
var = [int(content[i]) for i in range(2, 5)]
Another approach (may be less Pythonic/less efficient/less readable):
You can zip two ranges together:
for j, i in zip(range(len(range(2, 5))), range(2, 5)):
    var[j] = int(content[i])
You know that the second range is range(2, 5) and want the first range to be from zero to len(range(2, 5)) - that's range(len(range(2, 5))).
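For reference, here is what that zip actually pairs up (so j indexes var and i indexes content):
>>> list(zip(range(len(range(2, 5))), range(2, 5)))
[(0, 2), (1, 3), (2, 4)]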
The idiomatic way to count the current iteration index is by using enumerate:
for j, i in enumerate(range(2, 5)):
    var[j] = int(content[i])
(There's no need to initialize j = 0 in this case.)
However, your example code would usually just be written as:
with open("file.txt", "r") as f:
content = f.readline().split(";")
var = [int(x) for x in content[2:5]]
which uses language features such as a slice ([2:5]) to select part of a list, and a list comprehension to create a new list from an input sequence.
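A quick interactive illustration with made-up data:
>>> content = "name;flag;7;8;9;rest".split(";")
>>> [int(x) for x in content[2:5]]
[7, 8, 9]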
Right now I have a script that receives strings from stdin, and I also have a list (about 70 MB on disk, split across many partial files) which I load into memory as one list.
I then take each string as it comes in from stdin and check whether it exists in the list. I understand this is slow because of the huge list and the potentially large number of incoming strings.
It goes like this:
import re
import sys
import urllib2

def buildindex():
    # j = 0
    # while j < len(parts_list):
    #     f = urllib2.urlopen("https://s3.amazonaws.com/source123/output/" + parts_list[j])
    j = 0
    while j <= 200:
        if j < 10:
            f = urllib2.urlopen("https://s3.amazonaws.com/source123/output/part-0000" + str(j))
        if j < 100 and j >= 10:
            f = urllib2.urlopen("https://s3.amazonaws.com/source123/output/part-000" + str(j))
        if j >= 100:
            f = urllib2.urlopen("https://s3.amazonaws.com/source123/output/part-00" + str(j))
        for line in f.readlines():
            line = line.rstrip()
            yield line
            print line
        j += 1
        f.close()

linelist = list(buildindex())

for suspicious_line in sys.stdin:
    if "," in suspicious_line:
        suspicious_key, suspicious_source, suspicious_start, suspicious_end = suspicious_line.strip().split(",")
        x = re.compile(suspicious_key)
        sub_list = filter(x.match, linelist)
        # do something
I tried to run this locally and it's been over 20 minutes and it's still going. I will also use these scripts on Amazon EMR (Hadoop), where it fails for some reason; if I try a subset of the list, it works.
What performance-wise changes can I make to keep things neat and relatively fast?
The problem may not be in the for suspicious_line in sys.stdin block, but in buildindex. Reading files from S3 can be slow. Have you timed buildindex? Run the script without the for suspicious_line in sys.stdin block and see how much time it takes.
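A minimal way to time it (a sketch, reusing the buildindex generator from your code):
import time

start = time.time()
linelist = list(buildindex())
print("buildindex: %d lines in %.1f seconds" % (len(linelist), time.time() - start))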
If buildindex is the problem, download the files to the disk.
If buildindex is not the problem, you can try the simpler in test instead of a regex (compiling and matching a regex is relatively expensive):
sub_list = [line for line in linelist if suspicious_key in line]
Does anyone have a clue how to speed up this piece of Python code?
It was designed to deal with small files (with just a few lines, for which it is very fast), but I want to run it on big files (~50 GB, with millions of lines).
The main goal of the code is to take strings from a file (.txt) and search for them in an input file, printing the number of times each one occurs to the output file.
Here is the code (infile, seqList and out are set by optparse options at the beginning of the code, not shown):
def novo(infile, seqList, out):
    uDic = dict()
    rDic = dict()
    nmDic = dict()
    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]
    # Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ", "")] = 0
        rDic[i.replace(" ", "")] = 0
        nmDic[i.replace(" ", "")] = 0
    for k in lines:
        l1 = k.split("\t")
        l2 = l1[0].split(";")
        l3 = l2[0].replace(">", "")
        if len(l1) < 2:
            continue
        if l1[4] == "U":
            for k in uDic.keys():
                if k == l3:
                    uDic[k] += 1
        if l1[4] == "R":
            for j in rDic.keys():
                if j == l3:
                    rDic[j] += 1
        if l1[4] == "NM":
            for h in nmDic.keys():
                if h == l3:
                    nmDic[h] += 1
    f = open(out, "w")
    f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
    for i in samples:
        U = int()
        R = int()
        NM = int()
        for k, j in uDic.items():
            if k == i:
                U = j
        for o, p in rDic.items():
            if o == i:
                R = p
        for y, u in nmDic.items():
            if y == i:
                NM = u
        TOTAL = int(U + R + NM)
        try:
            f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL)+"\n")
        except:
            continue
    f.close()
With 50 GB files, the question is not how to make it faster, but how to make it runnable at all.
The main problem is that you will run out of memory: you have to modify the code so that it processes the files without holding all of their contents in memory, keeping only the line that is currently needed.
The following code from your question reads all the lines from two files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to run out of memory already

# Create dictionaries with all the samples
for i in samples:
    uDic[i.replace(" ", "")] = 0
    rDic[i.replace(" ", "")] = 0
    nmDic[i.replace(" ", "")] = 0
# a similar loop over `lines` comes later on
You should defer reading the lines until the latest possible moment, like this:
# Create dictionaries with all the samples
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ", "")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you want to use line.strip() or line.split()?
This way, you do not have to keep all the content in memory.
There are many more options for optimization, but this one will get you off the ground.
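The same idea applies to the big input file itself; here is a sketch of streaming it line by line (it assumes the counting dictionaries above are already set up, and uses a direct membership test rather than looping over keys):
with open(infile, 'r') as inf:
    for rawline in inf:
        l1 = rawline.strip().split("\t")
        if len(l1) < 5:
            continue  # skip short or malformed lines
        l3 = l1[0].split(";")[0].replace(">", "")
        # Only one line is held in memory at a time.
        if l1[4] == "U" and l3 in uDic:
            uDic[l3] += 1
        elif l1[4] == "R" and l3 in rDic:
            rDic[l3] += 1
        elif l1[4] == "NM" and l3 in nmDic:
            nmDic[l3] += 1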
It would be much easier if you provided some sample inputs. Because you haven't, I haven't tested this, but the idea is simple - iterate through each file only once, using iterators rather than reading the whole file into memory. Use the efficient collections.Counter object to handle the counting and minimise inner looping:
def novo(infile, seqList, out):
    from collections import Counter
    import csv
    # Count
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">", "")
            if len(l1) < 2:
                continue
            counts[(l1[4], l3)] += 1
    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(['Sample'] + types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            f.writerow([sample] + countrow + [total] + [float(c) / total for c in countrow])
If you convert your script into functions (it makes profiling easier), you can then see what it is doing when you profile it; I suggest using RunSnakeRun (runsnakerun) to visualise the results.
I would try replacing your loops with list & dictionary comprehensions:
For example, instead of
for i in samples:
    uDic[i.replace(" ", "")] = 0
Try:
uDic = {i.replace(" ", ""): 0 for i in samples}
and similarly for the other dicts
I don't really follow what's going on in your "for k in lines" loop, but you only need l3 (and l2) when you have certain values for l1[4]. Why not check for those values before splitting and replacing?
Lastly, instead of looping through all the keys of a dict to see if a given element is in that dict, try:
if x in myDict:
    myDict[x] = ...
for example:
for k in uDic.keys():
    if k == l3:
        uDic[k] += 1
can be replaced with:
if l3 in uDic:
    uDic[l3] += 1
Other than that, try profiling.
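For example, a minimal cProfile run might look like this (a sketch; the file names passed to novo are placeholders):
import cProfile
import pstats

cProfile.run("novo('input.txt', 'seqList.txt', 'out.txt')", 'profile.stats')
pstats.Stats('profile.stats').sort_stats('cumulative').print_stats(20)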
1) Look into profilers and adjust the code that is taking the most time.
2) You could try optimizing some methods with Cython - use the profiler data to decide what to modify.
3) It looks like you can use a Counter instead of a dict for the output file, and a set for the input file - look into them.
from collections import Counter

seen = set()          # don't shadow the built-in name `set`
counter = Counter()   # essentially a specialised dict, optimised for counting,
                      # e.g. counting occurrences of strings in a text file
4) If you are reading a 50 GB file, you won't be able to store it all in RAM (it depends on how much memory your machine has), so generators should save you memory and time.
#change list comprehension to generators
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)
I have a 3,000,000-int-long array which I want to output to a file. How can I do that?
Also, is this
for i in range(1000):
    for k in range(1000):
        (r, g, b) = rgb_im.getpixel((i, k))
        rr.append(r)
        gg.append(g)
        bb.append(b)
d.extend(rr)
d.extend(gg)
d.extend(bb)
a good way to join the arrays together?
All of the arrays are declared like this: d = array('B')
EDIT:
Managed to output all ints delimited by ' ' with this:
from PIL import Image
import array

side = 500
for j in range(1000):
    im = Image.open(r'C:\Users\Ivars\Desktop\RS\Shape\%02d.jpg' % (j))
    rgb_im = im.convert('RGB')
    d = array.array('B')
    rr = array.array('B')
    gg = array.array('B')
    bb = array.array('B')
    f = open(r'C:\Users\Ivars\Desktop\RS\ShapeData\%02d.txt' % (j), 'w')
    for i in range(side):
        for k in range(side):
            (r, g, b) = rgb_im.getpixel((i, k))
            rr.append(r)
            gg.append(g)
            bb.append(b)
    d.extend(rr)
    d.extend(gg)
    d.extend(bb)
    o = ' '.join(str(t) for t in d)
    print('#', j, ' - ', len(o))
    f.write(o)
    f.close()
If you're using Python >= 2.6, then you can use the print function from __future__:
from __future__ import print_function

# your code

# This will print out a string representation of the array to the file.
# If you need it formatted differently, you'll have to construct the string yourself.
print(d, file=open('/path/to/file.txt', 'w'))

# You can join the items (converted to str) with a space to get only the numbers.
print(' '.join(str(x) for x in d), file=open('/path/to/file.txt', 'w'))
This import has the side effect of turning print from a statement into a function, so you'll have to wrap whatever you want printed in parentheses.
You want tofile(), which requires you to open a file object. See https://docs.python.org/2/library/array.html and https://docs.python.org/2/library/stdtypes.html#bltin-file-objects. Also, have you considered using NumPy?
import array

a = array.array('B')
b = array.array('B')
a.append(3)
a.append(4)
print a
print b
with open('c:/test.dat', 'wb') as f:
    a.tofile(f)
with open('c:/test.dat', 'rb') as f:
    b.fromfile(f, 2)
print b
EDIT: Based on your edit, you can use numpy with PIL and generate the array in a line or two, without looping. See, e.g., Conversion between Pillow Image object and numpy array changes dimension for example code.
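A rough sketch of that idea (it assumes Pillow and numpy are installed; the exact pixel ordering may need adjusting to match your loop):
import numpy as np
from PIL import Image

im = Image.open(r'C:\Users\Ivars\Desktop\RS\Shape\00.jpg').convert('RGB')
arr = np.asarray(im)                  # shape (height, width, 3), dtype uint8
# Channel-planar order (all R, then all G, then all B), similar to the
# rr/gg/bb arrays in the question.
flat = arr.transpose(2, 1, 0).ravel()
# Write the values as text, separated by spaces.
flat.tofile(r'C:\Users\Ivars\Desktop\RS\ShapeData\00.txt', sep=' ')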
I have a file that looks like this
N1 1.023 2.11 3.789
Cl1 3.124 2.4534 1.678
Cl2 # # #
Cl3 # # #
Cl4
Cl5
N2
Cl6
Cl7
Cl8
Cl9
Cl10
N3
Cl11
Cl12
Cl13
Cl14
Cl15
The three numbers continue down throughout.
What I would like to do is pretty much a permutation. These are 3 data sets: set 1 is N1-Cl5, set 2 is N2-Cl10, and set 3 is N3 to the end.
I want every combination of N's and Cl's. For example the first output would be
Cl1
N1
Cl2
then everything else the same. The next set would be Cl1, Cl2, N1, Cl3... and so on.
I have some code but it won't do what I want, because it wouldn't know that there are three individual data sets. Should I have the three data sets in three different files and then combine them, using code like:
list1 = ['Cl1', 'Cl2', 'Cl3', 'Cl4', 'Cl5']
for line in file1:
    line.replace('N1', list1[0])
    list1.pop(0)
    print >> file.txt, line,
or is there a better way? Thanks in advance.
This should do the trick:
from itertools import permutations

def print_permutations(in_file):
    separators = ['N1', 'N2', 'N3']
    cur_separator = None
    related_elements = []
    with open(in_file, 'rb') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split Nx and Clx from the numbers.
            value = line.split()[0]
            # Found a new Nx. Print the previous group's permutations.
            if value in separators:
                if cur_separator and related_elements:
                    for perm in permutations([cur_separator] + related_elements):
                        print perm
                cur_separator = value
                related_elements = []
            else:
                # Found a new Clx. Append it to the list.
                related_elements.append(value)
    # Print the permutations of the final group.
    if cur_separator and related_elements:
        for perm in permutations([cur_separator] + related_elements):
            print perm
You could use regex to find the line numbers of the "N" patterns and then slice the file using those line numbers:
import re

n_pat = re.compile(r'N\d')
N_matches = []
with open(sample, 'r') as f:
    for num, line in enumerate(f):
        if re.match(n_pat, line):
            N_matches.append((num, re.match(n_pat, line).group()))
>>> N_matches
[(0, 'N1'), (12, 'N2'), (24, 'N3')]
After you figure out the line numbers where these patterns appear, you can then use itertools.islice to break the file up into a list of lists:
import itertools

first = N_matches[0][0]
final = N_matches[-1][0]
step = N_matches[1][0]
dataset = []
locallist = []
while first < final + step:
    with open(sample, 'r') as f:
        for item in itertools.islice(f, first, first + step):
            if item.strip():
                locallist.append(item.strip())
    dataset.append(locallist)
    locallist = []
    first += step
itertools.islice is a really nice way to take a slice of an iterable. Here's the result of the above on a sample:
>>> dataset
[['N1 1.023 2.11 3.789', 'Cl1 3.126 2.6534 1.878', 'Cl2 3.124 2.4534 1.678', 'Cl3 3.924 2.1134 1.1278', 'Cl4', 'Cl5'], ['N2', 'Cl6 3.126 2.6534 1.878', 'Cl7 3.124 2.4534 1.678', 'Cl8 3.924 2.1134 1.1278', 'Cl9', 'Cl10'], ['N3', 'Cl11', 'Cl12', 'Cl13', 'Cl14', 'Cl15']]
After that, I'm a bit hazy on what you're seeking to do, but I think you want permutations of each sublist of the dataset? If so, you can use itertools.permutations to find permutations on various sublists of dataset:
for item in itertools.permutations(dataset[0]):
    print(item)
etc.
Final Note:
Assuming I understand correctly what you're doing, the number of permutations is going to be huge. You can calculate how many permutations there are by taking the factorial of the number of items. Anything over 10 items (10!) is going to produce more than 3.6 million permutations.
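To get a feel for the numbers (a quick check with math.factorial):
>>> import math
>>> math.factorial(6)
720
>>> math.factorial(10)
3628800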