Does it take RAM to save a readlines() array? - python

I am using lineslist = file.readlines() on a 2GB file.
So, I guess it will create a lineslist array of 2GB or more in size. Is that basically the same as readfile = file.read(), which also creates readfile (an instance/variable?) of exactly 2GB?
Why should I prefer readlines in this case?
Adding to that, I have one more question. It is also mentioned here https://docs.python.org/2/tutorial/inputoutput.html:
readline(): a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous;
I don't understand the last point. So, does readlines() also have an unambiguous value in the last element of its array if there is no \n at the end of the file?
We are dealing with combining files (which were split on the basis of block size), so I am thinking of choosing readlines or read. The individual files may not end with a \n after splitting, and if readlines returned ambiguous values for the last line, I think it would be a problem.
PS: I haven't learnt Python, so forgive me if there is no such thing as instances in Python or if I am speaking rubbish. I am just assuming.
EDIT:
Ok, I just found out. It's not returning any ambiguous output.
>>> len(lineslist)
6923798
>>> lineslist[6923797]
"\xf4\xe5\xcf1)\xff\x16\x93\xf2\xa3-\....\xab\xbb\xcd"
So, it doesn't end with '\n', but the output isn't ambiguous either.
There is no ambiguous output with readline for the last line either.
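As a quick sanity check, here is a minimal sketch (using a hypothetical throwaway file demo.txt) showing that readlines() keeps a final line without a trailing \n intact, so nothing is lost or altered when split pieces are concatenated:
with open('demo.txt', 'w') as f:
    f.write('first\nsecond\nlast-without-newline')  # no trailing newline

with open('demo.txt') as f:
    print(f.readlines())
# ['first\n', 'second\n', 'last-without-newline']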

If I understood your issue correctly, you just want to combine (i.e. concatenate) files.
If memory is an issue, for line in f is normally the way to go.
I tried benchmarking using a 1.9GB csv file. One possible alternative is to read in large chunks of the data which fit in memory.
Code:
#read in large chunks - fastest in my test
chunksize = 2**16
with open(fn, 'r') as f:
    chunk = f.read(chunksize)
    while chunk:
        chunk = f.read(chunksize)
#1 loop, best of 3: 4.48 s per loop

#read whole file in one go - slowest in my test
with open(fn, 'r') as f:
    chunk = f.read()
#1 loop, best of 3: 11.7 s per loop

#read file using iterator over each line - most practical for most cases
with open(fn, 'r') as f:
    for line in f:
        s = line
#1 loop, best of 3: 6.74 s per loop
Knowing this you could write something like:
with open(outputfile, 'w') as fo:
    for inputfile in inputfiles:  # assuming inputfiles is a list of file paths
        with open(inputfile, 'r') as fi:
            for chunk in iter(lambda: fi.read(chunksize), ''):
                fo.write(chunk)  # write the chunk itself, not fi.read(chunk)
        fo.write('\n')  # newline between each file (might not be necessary)

file.read() will read the entire stream of data as 1 long string, whereas file.readlines() will create a list of lines from the stream.
Generally performance will suffer, especially in the case of large files, if you read in the entire thing all at once. The general approach is to iterate over the file object line by line, which it supports.
for line in file_object:
    pass  # process the line
This way of processing consumes memory only for a single line (loosely speaking), not for the entire contents of the file.

Yes, readlines() reads the whole file into a variable.
It would be much better to read the file line by line:
f = open("file_path", "r")
for line in f:
    print line
This loads only one line into RAM at a time, so you're saving about 1.99 GB of memory :)
As I understand it, you want to concatenate two files:
target = open("target_file", "w")
f1 = open("f1", "r")
f2 = open("f2", "r")
for line in f1:
    print >> target, line
for line in f2:
    print >> target, line
target.close()
Or consider using other technology like bash:
cat file1 > target
cat file2 >> target
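For the same concatenation in pure Python, here is a hedged sketch using the standard library's shutil.copyfileobj (the file names are placeholders), which copies in buffered chunks without loading a whole file:
import shutil

with open('target_file', 'wb') as target:
    for name in ('file1', 'file2'):  # placeholder input file names
        with open(name, 'rb') as src:
            shutil.copyfileobj(src, target)  # buffered chunk-by-chunk copy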

Related

When should I ever use file.read() or file.readlines()?

I noticed that if I iterate over a file that I opened, it is much faster to iterate over it without "read"-ing it.
i.e.
l = open('file','r')
for line in l:
    pass  # (or code)
is much faster than
l = open('file','r')
for line in l.read() / l.readlines():
    pass  # (or code)
The 2nd loop will take around 1.5x as much time (I used timeit over the exact same file, and the results were 0.442 vs. 0.660), and would give the same result.
So - when should I ever use the .read() or .readlines()?
Since I always need to iterate over the file I'm reading, and after learning the hard way how painfully slow the .read() can be on large data - I can't seem to imagine ever using it again.
The short answer to your question is that each of these three methods of reading bits of a file has different use cases. As noted above, f.read() reads the file as an individual string, and so allows relatively easy file-wide manipulations, such as a file-wide regex search or substitution.
f.readline() reads a single line of the file, allowing the user to parse a single line without necessarily reading the entire file. Using f.readline() also allows easier application of logic in reading the file than a complete line by line iteration, such as when a file changes format partway through.
Using the syntax for line in f: allows the user to iterate over the file line by line as noted in the question.
(As noted in the other answer, this documentation is a very good read):
https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
Note:
It was previously claimed that f.readline() could be used to skip a line during a for loop iteration. However, this doesn't work in Python 2.7, and is perhaps a questionable practice, so this claim has been removed.
Hope this helps!
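A hedged sketch contrasting the first two use cases (data.txt is a placeholder file name):
import re

# file-wide operation: read() gives one string, handy for a whole-file regex
with open('data.txt') as f:
    text = f.read()
numbers = re.findall(r'\d+', text)

# header handling: readline() pulls just the first line, then iterate the rest
with open('data.txt') as f:
    header = f.readline()
    for line in f:
        pass  # process each remaining line without loading them all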
https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory
Sorry for all the edits!
For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:
for line in f:
    print line,
This is the first line of the file.
Second line of the file
Note that readline() is not comparable to reading all lines in a for loop, since it reads line by line and has extra overhead, as others have already pointed out.
I ran timeit on two otherwise identical snippets, one with a for loop and the other with readlines(). You can see my snippets below:
def test_read_file_1():
    f = open('ml/README.md', 'r')
    for line in f.readlines():
        print(line)

def test_read_file_2():
    f = open('ml/README.md', 'r')
    for line in f:
        print(line)

def test_time_read_file():
    from timeit import timeit
    duration_1 = timeit(lambda: test_read_file_1(), number=1000000)
    duration_2 = timeit(lambda: test_read_file_2(), number=1000000)
    print('duration using readlines():', duration_1)
    print('duration using for-loop:', duration_2)
And the results:
duration using readlines(): 78.826229238
duration using for-loop: 69.487692794
The bottom line, I would say: the for loop is faster, but when both are possible, I'd still rather use readlines().
readlines() is better than for line in file when you know that the data you are interested in starts from, for example, the 2nd line: you can simply write readlines()[1:].
A typical use case is a tab/comma-separated value file whose first line is a header (and you don't want to use an additional module for tsv or csv files).
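A quick sketch of the two styles on a hypothetical data.tsv whose first line is a header:
with open('data.tsv') as f:
    rows = f.readlines()[1:]  # whole file in memory, header dropped

with open('data.tsv') as f:
    next(f)  # skip the header without loading the rest of the file
    for line in f:
        fields = line.rstrip('\n').split('\t')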
#The difference between file.read(), file.readline(), file.readlines()
file = open('samplefile', 'r')
single_string = file.read()     #Reads all the contents of the file
                                #into a single string (\n characters might be included)
line = file.readline()          #Reads the line where the cursor is currently positioned,
                                #as a string, and moves the cursor to the next line
list_strings = file.readlines() #Makes a list of strings, one per line

Why does Python require double the RAM to read a file?

I am reading a file which is 24 GB in size. I am using
lines = open(fname).read().splitlines()
and it seems that when reading the lines, it always uses ~double the amount of RAM which should be needed. It uses about 50 GB for my current script (after it jumps up to 50 it goes back down to 28) but every time I use this kind of line to read a file in Python it generally uses double the file size before dropping down to a size that I would expect.
Any idea why this is happening or how I can avoid it?
RAM usage at filesize * 1: read the entire file into memory as one string
open(fname).read()
RAM usage at filesize * 2: additionally allocate a list of line strings split out of that string
open(fname).read().splitlines()
After this operation is complete, the RAM usage drops back down to about filesize * 1, because the full text of the file isn't needed anymore and can be garbage-collected.
If you don't need the full text of the file at once, and are only operating on lines, then just iterate over the file:
with open(filename) as f:
    for line in f:
        pass  # do something with line
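To see the two copies directly, here is a minimal sketch (big.txt is a placeholder for a large file) using the standard library's tracemalloc:
import tracemalloc

tracemalloc.start()
data = open('big.txt').read()  # one copy: the whole file as a single str
lines = data.splitlines()      # second copy: the same text split into many line strs
current, peak = tracemalloc.get_traced_memory()
print('current MB:', current / 1e6, 'peak MB:', peak / 1e6)
tracemalloc.stop()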
My guess is that read() returns a string of the entire file, which is not garbage-collected until splitlines() has returned its list. If you need the whole file in memory, try the readlines() method:
with open(fname) as f:
    lines = f.readlines()
read() tries to load the whole file into memory. With overhead and buffers, this can exceed the size of the file. Then you split the contents of the file into lines, and Python allocates new memory for each line.
Can your code be refactored to read and process the lines one by one instead? This would reduce the amount of memory that your program uses at once.
with open(filename) as f:
    for line in f:
        pass  # process a single line, maybe keeping some state elsewhere
However, if you still need to load all of the lines in memory at once, use readlines() instead:
with open(filename) as f:
    lines = f.readlines()
read() is returning a single str with the whole file data in it. splitlines is returning the list of lines with the same data. The whole file data isn't cleaned up until after splitlines creates the list, so you store two copies of the data for a brief period.
If you want to minimize this overhead (and still strip newlines), you can try:
with open(fname) as f:
    lines = [line.rstrip('\r\n') for line in f]
If you can process line by line (don't need whole list at once), it's even better:
with open(fname) as f:
    for line in f:
        line = line.rstrip('\r\n')
which avoids storing more than two lines at a time.
If the file contains 25Gb of data, then file_handle.read() will return a string that is 25Gb in size. When you split that string, you create a list that holds strings that add up to 25Gb of data (plus additional string overhead for each one). So you end up using about twice the memory.
The big string will get reaped almost immediately by the garbage collector making the memory available for new python objects to occupy, but that doesn't mean that the memory is completely freed to the operating system (due to optimizations in python's memory allocator).
A better approach is to accumulate a list of lines one at a time:
with open(filename) as f:
    lines = list(f)
You'll only hold approximately one line from the file in memory at a time¹, so your memory use will be mostly just the memory needed to store the list.
¹ This isn't exactly true ... Python's internal line buffering will probably hold a couple of KB of data buffered at any given time.
Of course, there might also be the option to process the file iteratively:
with open(filename) as f:
    for line in f:
        process(line)
You read the whole file into memory with:
open(fname).read()
In a second step you create a list from this string with .splitlines(). During this time the string stays in memory, but you copy parts of the string into the list, line by line. Only after you have finished creating the list can the string be garbage-collected. So during this time you store all the information twice and hence need twice the memory.
You could use open(fname).readlines() or read the file line by line to reduce the memory footprint.

Memory issues with splitting lines in huge files in Python

I'm trying to read from disk a huge file (~2GB) and split each line into multiple strings:
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip().split() for line in f]
    return split_lines
Problem is, it tries to allocate tens and tens of GB in memory. I found out that it doesn't happen if I change my code in the following way:
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip() for line in f]  # no splitting
    return split_lines
I.e., if I do not split the lines, memory usage drastically goes down.
Is there any way to handle this problem, maybe some smart way to store split lines without filling up the main memory?
Thank you for your time.
After the split, you have multiple objects per line: a list plus some number of string objects. Each object has its own overhead in addition to the actual set of characters that made up the original string.
Rather than reading the entire file into memory, use a generator:
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

for t in get_split_lines(file_path):
    pass  # do something with the list t
This does not preclude you from writing something like
lines = list(get_split_lines(file_path))
if you really need to read the entire file into memory.
In the end, I ended up storing a list of stripped lines:
with open(file_path, 'r') as f:
    split_lines = [line.rstrip() for line in f]
And in each iteration of my algorithm, I simply recomputed the split line on the fly:
for line in split_lines:
    split_line = line.split()
    # do something with the split line
If you can afford to keep all the lines in memory like I did, and you have to go through the whole file more than once, this approach is faster than the one proposed by @chepner, as you read the file's lines just once.

Which is faster when retrieving data from a file: [with open() as, then looping over the file] or [f.readlines(), then looping over the list]?

I have big files to read and process.
Which is the faster method to read through the file and process it?
with open('file') as file:
    for line in file:
        print line
OR
file = open('file')
lines = file.readlines()
file.close()
for line in lines:
    print line
The former can use buffered reading; the latter requires reading the entire file into memory first before it can start looping.
In general, it's a better idea to use the former; it's not going to be any slower than the latter and it's better on memory usage.
If you have a large line-based file, I strongly suggest using the following lines to achieve your goal:
file = open('file')
for line in file.readlines():
    print line
file.close()
There are 2 points:
Reading all the content into memory is never a good idea; the right way is to read it chunk by chunk (line by line).
Don't call lines = f.readlines(); this will also read all the content into memory.
PS: The with statement in the former is roughly shorthand for a try/finally block that opens the file and closes it afterwards; iterating over the file object uses an iterator, so it won't eat all your memory.
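A brief sketch of that rough equivalence (the file name is a placeholder), written without with:
f = open('file')
try:
    for line in f:  # the file object is its own line iterator
        pass        # process the line here
finally:
    f.close()       # runs even if the loop raises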

Split large files using python

I have some trouble trying to split large files (say, around 10GB). The basic idea is to simply read the lines and group, say, every 40000 lines into one file.
But there are two ways of "reading" files.
1) The first one is to read the WHOLE file at once, and make it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I asked such questions before.)
In Python, the approaches to read a WHOLE file at once that I've tried include:
input1 = f.readlines()
input1 = commands.getoutput('zcat ' + file).splitlines(True)
input1 = subprocess.Popen(["cat", file],
                          stdout=subprocess.PIPE, bufsize=1)
Well, then I can easily group 40000 lines into one file by slicing: list[40000:80000] or list[80000:120000].
Another advantage of using a list is that we can easily point to specific lines.
2) The second way is to read line by line and process each line as it is read. The lines that have been read won't be kept in memory.
Examples include:
f = gzip.open(file)
for line in f:
    ...  # blablabla
or
for line in fileinput.FileInput(fileName):
I'm sure that with gzip.open, this f is NOT a list but a file object, and it seems we can only process it line by line; so how can I execute this "split" job? How can I point to specific lines of the file object?
Thanks
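As an aside on pointing to specific lines of a file object, here is a hedged sketch using itertools.islice (the file name is a placeholder) that pulls a line range out of a file object without loading the whole file:
from itertools import islice
import gzip

with gzip.open('myinput.gz', 'rt') as f:  # works the same with plain open()
    lines_40k_to_80k = list(islice(f, 40000, 80000))  # lines 40000..79999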
NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    fout = open("output0.txt", "w")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            fout.close()
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "w")
    fout.close()
If there's nothing special about having a specific number of file lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:
If given an optional parameter sizehint, it reads that many bytes from
the file and enough more to complete a line, and returns the lines
from that. This is often used to allow efficient reading of a large
file by lines, but without having to load the entire file in memory.
Only complete lines will be returned.
...so you could write that code something like this:
# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.
SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)  # buf is a list of lines, so use writelines()
        outFile.close()
        fileNumber += 1
The best solution I have found is using the library filesplit.
You only need to specify the input file, the output folder and the desired size in bytes for output files. Finally, the library will do all the work for you.
from fsplit.filesplit import Filesplit

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs = Filesplit()
fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)
For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:
1. Open the input file.
2. Open the first output file.
3. Read one line from the input file and write it to the output file.
4. Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file and open the next one.
5. Repeat steps 3-4 until you've reached the end of the input file.
6. Close both files.
import fileinput

chunk_size = 40000
fout = None
for i, line in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        if fout:
            fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
fout.close()
Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).
But reading the file into memory requires O(n) space also. Although sometimes we do need to read a 10 gb file into memory, your particular problem does not require this. We can iterate over the file object directly. Of course, the file object does require space, but we have no reason to hold the contents of the file twice in two different forms.
Therefore, I would go with your second solution.
I created this small script to split the large file in a few seconds. It took only 20 seconds to split a text file with 20M lines into 10 small files each with 2M lines.
split_length = 2_000_000

# readlines() loads the whole file into memory at once; fine for files that fit in RAM
large_file = open('large-file.txt', encoding='utf-8', errors='ignore').readlines()

file_count = 0
for split_start in range(0, len(large_file), split_length):
    file_content = ''.join(large_file[split_start:split_start + split_length])
    new_file = open(f'splitted-file-{file_count}.txt', 'w', encoding='utf-8', errors='ignore')
    new_file.write(file_content)
    new_file.close()
    file_count += 1
    print(f'created file {file_count}')
To split a file line-wise:
group every, say 40000 lines into one file
You can use module filesplit with method bylinecount (version 4.0):
import os
from filesplit.split import Split

LINES_PER_FILE = 40_000  # see PEP 515 for readable numeric literals
filename = 'myinput.txt'
outdir = 'splitted/'     # to store split files `myinput_1.txt` etc.
os.makedirs(outdir, exist_ok=True)  # make sure the output directory exists
Split(filename, outdir).bylinecount(LINES_PER_FILE)
This is similar to rafaoc's answer which apparently used outdated version 2.0 to split by size.
