I'm new to Python, coming from the R world, and I'm working on big text files structured in data columns (this is LiDAR data, so generally 60+ million records).
Is it possible to change the field separator (e.g. from tab-delimited to comma-delimited) of such a big file without having to read the file and do a for loop on the lines?
No. You have to:

1. Read the file in
2. Change separators for each line
3. Write each line back

This is easily doable with just a few lines of Python (not tested, but the general approach works):
# Python - it's so readable, the code basically just writes itself ;-)
#
with open('infile') as infile:
    with open('outfile', 'w') as outfile:
        for line in infile:
            fields = line.split('\t')
            outfile.write(','.join(fields))
I'm not familiar with R, but if it has a library function for this it's probably doing exactly the same thing.
Note that this code only reads one line at a time from the file, so the file can be larger than the physical RAM - it's never wholly loaded in.
You can use the Linux tr command to replace any character with any other character.
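For example, assuming a tab-separated input, this one-liner does the whole conversion without any Python at all:
tr '\t' ',' < input.txt > output.txt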
Actually, let's say yes, you can do it without an explicit loop, e.g.:
with open('in') as infile:
    with open('out', 'w') as outfile:
        map(lambda line: outfile.write(','.join(line.split('\t'))), infile)
(Note that the split must be on '\t', not '\n'; also, in Python 3 map is lazy, so this only works as written in Python 2.)
You can't, but I strongly advise you to check out generators.
The point is that you can write a faster, better-structured program without needing to read and store the data in memory in order to process it.
For instance:
infile = open("bigfile", "r")
some_other_file = open("outfile", "w")
j = (line.split("\t") for line in infile)
s = (",".join(fields) for fields in j)
# and now magic happens
for line in s:
    some_other_file.write(line)
This code only ever holds a single line in memory at a time.
Related
I noticed that if I iterate over a file that I opened, it is much faster to iterate over it without "read"-ing it.
i.e.
l = open('file','r')
for line in l:
    pass # (or code)

is much faster than

l = open('file','r')
for line in l.read() / l.readlines():
    pass # (or code)
The 2nd loop will take around 1.5x as much time (I used timeit over the exact same file, and the results were 0.442 vs. 0.660), and would give the same result.
So - when should I ever use the .read() or .readlines()?
Since I always need to iterate over the file I'm reading, and after learning the hard way how painfully slow the .read() can be on large data - I can't seem to imagine ever using it again.
The short answer to your question is that each of these three methods of reading a file has different use cases. As noted above, f.read() reads the entire file into a single string, which allows relatively easy file-wide manipulations, such as a file-wide regex search or substitution.
f.readline() reads a single line of the file, allowing the user to parse a line without necessarily reading the entire file. It also allows easier application of logic than a complete line-by-line iteration, such as when a file changes format partway through.
Using the syntax for line in f: allows the user to iterate over the file line by line as noted in the question.
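For example, a minimal sketch that reads a header with readline() and then iterates over the rest (process() is a hypothetical stand-in for your own per-line logic):
with open('data.txt') as f:
    header = f.readline()  # consume the first line separately
    for line in f:         # then iterate over the remaining lines
        process(line)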
(As noted in the other answer, this documentation is a very good read):
https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
Note:
It was previously claimed that f.readline() could be used to skip a line during a for loop iteration. However, this doesn't work in Python 2.7, and is perhaps a questionable practice, so this claim has been removed.
Hope this helps!
https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory
Sorry for all the edits!
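To illustrate the quoted behaviour, here is a minimal sketch of passing a size to read() so that memory stays bounded even for huge files (the 64 KiB chunk size is an arbitrary choice):
with open('bigfile.txt') as f:
    while True:
        chunk = f.read(64 * 1024)  # at most 64 KiB per call
        if not chunk:              # an empty string signals EOF
            break
        # ... process chunk here ...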
For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:
for line in f:
    print line,
This is the first line of the file.
Second line of the file
Note that readline() is not comparable to reading all lines in a for-loop, since it reads line by line and has an overhead that others have already pointed out.
I ran timeit on two identical snippets, one with a for-loop and the other with readlines(). You can see my snippet below:
def test_read_file_1():
    f = open('ml/README.md', 'r')
    for line in f.readlines():
        print(line)

def test_read_file_2():
    f = open('ml/README.md', 'r')
    for line in f:
        print(line)

def test_time_read_file():
    from timeit import timeit
    duration_1 = timeit(lambda: test_read_file_1(), number=1000000)
    duration_2 = timeit(lambda: test_read_file_2(), number=1000000)
    print('duration using readlines():', duration_1)
    print('duration using for-loop:', duration_2)
And the results:
duration using readlines(): 78.826229238
duration using for-loop: 69.487692794
The bottom line, I would say: the for-loop is faster, but where both are possible, I'd still rather use readlines().
readlines() is better than for line in file when you know that the data you are interested in starts from, say, the 2nd line: you can simply write readlines()[1:].
Such use cases come up when you have a tab- or comma-separated value file whose first line is a header (and you don't want to use an additional module for tsv or csv files).
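A minimal sketch of that pattern, assuming a tab-separated file named data.tsv with a header row:
with open('data.tsv') as f:
    rows = [line.rstrip('\n').split('\t') for line in f.readlines()[1:]]  # [1:] skips the header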
# The difference between file.read(), file.readline(), file.readlines()
file = open('samplefile', 'r')

single_string = file.read()      # reads the whole file into a single string
                                 # (\n characters are included)
line = file.readline()           # reads the line where the cursor is positioned,
                                 # as a string, and moves to the next line
list_strings = file.readlines()  # makes a list of strings, one per line
I'm trying to insert text at very specific locations in a text file. This text file can be fairly large (>> 10 GB)
The approach I am currently using to read it:
with open("my_text_file.txt") as f:
while True:
result = f.read(set_number_of_bytes)
x = process_result(result)
if x:
replace_some_characters_that_i_just_read_and write_it_back_to_same_file
However, I am unsure as to how to implement
replace_some_characters_that_i_just_read_and write_it_back_to_same_file
Is there some method I can use to determine where I have read up to in the current file, which I could then use to write back to the same file?
Performance-wise, if I were to use the approach above to write to the original file at specific locations, would there be efficiency issues with having to find the write location before writing?
Or would you recommend creating an entirely different file, appending to it on each loop iteration, and then deleting the original file once the operation is complete? Assume space is not a large concern, but performance is.
Use the fileinput module, which handles files correctly when replacing data, with the inplace flag set:
import sys
import fileinput

for line in fileinput.input('my_text_file.txt', inplace=True):
    x = process_result(line)
    if x:
        line = line.replace('something', x)
    sys.stdout.write(line)
When you use the inplace flag, the original file is moved to a backup, and anything you write to sys.stdout is written to the original filename (so, as a new file). Make sure you include all lines, altered or not.
You have to rewrite the complete file whenever your replacement data is not exactly the same number of bytes as the parts that you are replacing.
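That said, if your replacements are guaranteed to be exactly the same length in bytes, you can patch the file in place with tell() and seek(). A minimal sketch (the b"foo"/b"bar" patterns are hypothetical placeholders):
with open("my_text_file.txt", "r+b") as f:
    while True:
        pos = f.tell()                  # remember where this chunk starts
        chunk = f.read(4096)
        if not chunk:
            break
        new_chunk = chunk.replace(b"foo", b"bar")  # same-length replacement only
        if new_chunk != chunk:
            f.seek(pos)                 # jump back to the chunk's start
            f.write(new_chunk)
            f.seek(pos + len(chunk))    # restore the read position
(One caveat this sketch ignores: a pattern can straddle a chunk boundary.)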
I have some trouble trying to split large files (say, around 10GB). The basic idea is to simply read the lines and group every, say, 40000 lines into one file.
But there are two ways of "reading" files.
1) The first one is to read the WHOLE file at once, and make it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I asked such questions before.)
In Python, the approaches I've tried to read the WHOLE file at once include:
input1 = f.readlines()

input1 = commands.getoutput('zcat ' + file).splitlines(True)

input1 = subprocess.Popen(["cat", file], stdout=subprocess.PIPE, bufsize=1)
Well, then I can easily group 40000 lines into one file by slicing: list[40000:80000] or list[80000:120000].
Another advantage of using a list is that we can easily point to specific lines.
2)The second way is to read line by line; process the line when reading it. Those read lines won't be saved in memory.
Examples include:
f=gzip.open(file)
for line in f: blablabla...
or
for line in fileinput.FileInput(fileName):
I'm sure that for gzip.open, this f is NOT a list but a file object, and it seems we can only process it line by line. So how can I execute this "split" job? How can I point to specific lines of the file object?
Thanks
NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    fout = open("output0.txt", "wb")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            fout.close()
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "wb")
    fout.close()
If there's nothing special about having a specific number of file lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:
If given an optional parameter sizehint, it reads that many bytes from
the file and enough more to complete a line, and returns the lines
from that. This is often used to allow efficient reading of a large
file by lines, but without having to load the entire file in memory.
Only complete lines will be returned.
...so you could write that code something like this:
# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.
SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)  # buf is a list of lines, so writelines(), not write()
        outFile.close()
        fileNumber += 1
The best solution I have found is the library filesplit.
You only need to specify the input file, the output folder, and the desired size in bytes for the output files; the library will do all the work for you.
from fsplit.filesplit import Filesplit

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs = Filesplit()
fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)
For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:
1. Open the input file.
2. Open the first output file.
3. Read one line from the input file and write it to the output file.
4. Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file, and open the next one.
5. Repeat steps 3-4 until you've reached the end of the input file.
6. Close both files.
import fileinput

chunk_size = 40000
fout = None
for i, line in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        if fout:
            fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
fout.close()
Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).
But reading the file into memory requires O(n) space as well. Although we do sometimes need to read a 10 GB file into memory, your particular problem does not require it. We can iterate over the file object directly. Of course, the file object does require space, but there is no reason to hold the contents of the file twice in two different forms.
Therefore, I would go with your second solution.
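(To the question of pointing at specific lines of a file object: you can't index it, but itertools.islice lets you take the next N lines without loading everything. A minimal sketch:
from itertools import islice
with open('myinput.txt') as f:
    chunk = list(islice(f, 40000))  # next 40000 lines, or fewer at EOF
Each successive islice call on the same file object continues where the previous one stopped.)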
I created this small script to split the large file in a few seconds. It took only 20 seconds to split a text file with 20M lines into 10 small files each with 2M lines.
split_length = 2_000_000
file_count = 0

# note: readlines() loads the whole file into memory at once
large_file = open('large-file.txt', encoding='utf-8', errors='ignore').readlines()

for split_start_value in range(0, len(large_file), split_length):
    split_end_value = split_start_value + split_length
    file_content = ''.join(large_file[split_start_value:split_end_value])
    new_file = open(f'splitted-file-{file_count}.txt', 'w', encoding='utf-8', errors='ignore')
    new_file.write(file_content)
    new_file.close()
    file_count += 1
    print(f'created file {file_count}')
To split a file line-wise:
group every, say 40000 lines into one file
You can use module filesplit with method bylinecount (version 4.0):
import os
from filesplit.split import Split

LINES_PER_FILE = 40_000  # see PEP 515 for readable numeric literals
filename = 'myinput.txt'
outdir = 'splitted/'     # to store split files `myinput_1.txt` etc.

os.makedirs(outdir, exist_ok=True)  # the output dir must exist before splitting
Split(filename, outdir).bylinecount(LINES_PER_FILE)
This is similar to rafaoc's answer which apparently used outdated version 2.0 to split by size.
I'm new to Python, and I need to do a parsing exercise. I got a file, and I need to parse it (just the headers), but after the process I need to keep the file in the same format, with the same extension, and at the same place on disk, with only the new headers differing.
I tried this code...
for line in open('/home/name/db/str/dir/numbers/str.phy'):
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        print linepars
...and it does the job, but I don't know how to "overwrite" the file with the new parsing.
The easiest way, though not the most efficient (by far, and especially for long files), would be to rewrite the complete file.
You could do this by opening a second file handle and rewriting each line, except in the case of the header, you'd write the parsed header. For example,
fr = open('/home/name/db/str/dir/numbers/str.phy')
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w')  # Name this whatever makes sense

for line in fr:
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        fw.write(linepars)
    else:
        fw.write(line)

fw.close()
fr.close()
EDIT: Note that this does not use readlines(), so it's more memory efficient. It also does not store every output line, but only one at a time, writing each to file immediately.
Just as a cool trick, you could use the with statement on the input file to avoid having to close it (Python 2.5+):
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w')  # Name this whatever makes sense
with open('/home/name/db/str/dir/numbers/str.phy') as fr:
    for line in fr:
        if line.startswith('ENS'):
            linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
            fw.write(linepars)
        else:
            fw.write(line)
fw.close()
P.S. Welcome :-)
As others are saying here, you want to open a file and use that file object's .write() method.
The best approach would be to open an additional file for writing:
import os

current_cfg = open(...)
parsed_cfg = open(..., 'w')

for line in current_cfg:
    new_line = parse(line)
    print new_line
    parsed_cfg.write(new_line + '\n')

current_cfg.close()
parsed_cfg.close()

os.rename(....)  # Rename old file to backup name
os.rename(....)  # Rename new file into place
Additionally I'd suggest looking at the tempfile module and use one of its methods for either naming your new file or opening/creating it. Personally I'd favor putting the new file in the same directory as the existing file to ensure that os.rename will work atomically (the configuration file named will be guaranteed to either point at the old file or the new file; in no case would it point at a partially written/copied file).
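A minimal sketch of that idea (the paths are from the question; parse() is a hypothetical per-line transform):
import os
import tempfile

src = '/home/name/db/str/dir/numbers/str.phy'

# create the temp file in the same directory so the final rename is atomic
with tempfile.NamedTemporaryFile('w', dir=os.path.dirname(src), delete=False) as tmp:
    with open(src) as fr:
        for line in fr:
            tmp.write(parse(line))  # parse() is hypothetical
os.rename(tmp.name, src)  # atomically replace the original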
The following code DOES the job.
I mean it DOES overwrite the file ON ITSELF; that's what the OP asked for. That's possible because the transformations only remove characters, so the writing file pointer fo always stays BEHIND the reading file pointer fi.
import re

regx = re.compile(r'\AENS([A-Z]+)0+([0-9]{6})')

with open('bomo.phy', 'rb+') as fi, open('bomo.phy', 'rb+') as fo:
    fo.writelines(regx.sub('\\1\\2', line) for line in fi)
    fo.truncate()  # the output is shorter, so drop the leftover tail bytes
I think the writing isn't performed by the operating system one line at a time but through a buffer, so several lines are read before each batch of transformed lines is written. That's what I think, anyway.
newlines = []
for line in open('/home/name/db/str/dir/numbers/str.phy').readlines():
    if line.startswith('ENS'):
        newlines.append(re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line))
    else:
        newlines.append(line)  # keep non-header lines unchanged
open('/home/name/db/str/dir/numbers/str.phy', 'w').write(''.join(newlines))
(Non-header lines are kept, and since each line already ends in '\n', the lines are joined with '' rather than '\n'.)
(sidenote: Of course if you are working with large files, you should be aware that the level of optimization required may depend on your situation. Python by nature is very non-lazily-evaluated. The following solution is not a good choice if you are parsing large files, such as database dumps or logs, but a few tweaks such as nesting the with clauses and using lazy generators or a line-by-line algorithm can allow O(1)-memory behavior.)
import re

targetFile = '/home/name/db/str/dir/numbers/str.phy'

def replaceIfHeader(line):
    if line.startswith('ENS'):
        return re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    else:
        return line

with open(targetFile, 'r') as f:
    newText = ''.join(replaceIfHeader(line) for line in f)  # lines keep their own '\n'

try:
    # make backup of targetFile
    with open(targetFile, 'w') as f:
        f.write(newText)
except:
    # error encountered, do something to inform user where backup of targetFile is
    raise
edit: thanks to Jeff for suggestion
I want to prompt a user for a number of random numbers to be generated and saved to a file. He gave us that part. The part we have to do is to open that file, convert the numbers into a list, then find the mean, standard deviation, etc. without using the easy built-in Python tools.
I've tried using open but it gives me invalid syntax (the file name I chose was "numbers" and it saved into "My Documents" automatically, so I tried open(numbers, 'r') and open(C:\name\MyDocuments\numbers, 'r') and neither one worked).
with open('C:/path/numbers.txt') as f:
    lines = f.read().splitlines()
This will give you a list of the values (strings) you had in your file, with newlines stripped.
Also, watch your backslashes in Windows path names, as those are also escape characters in strings. You can use forward slashes or doubled backslashes instead.
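A quick illustration of the three safe spellings (the path itself is just the question's example):
path = 'C:/name/MyDocuments/numbers'     # forward slashes
path = 'C:\\name\\MyDocuments\\numbers'  # doubled backslashes
path = r'C:\name\MyDocuments\numbers'    # raw string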
Two ways to read a file into a list in Python (note that these are not mutually exclusive):
use of with - supported from Python 2.5 and above
use of list comprehensions
1. use of with
This is the pythonic way of opening and reading files.
# Sample 1 - elucidating each step, but not memory efficient
lines = []
with open(r"C:\name\MyDocuments\numbers") as file:
    for line in file:
        line = line.strip()  # or some other preprocessing
        lines.append(line)   # storing everything in memory!

# Sample 2 - a more pythonic and idiomatic way, but still not memory efficient
with open(r"C:\name\MyDocuments\numbers") as file:
    lines = [line.strip() for line in file]

# Sample 3 - a more pythonic way with efficient memory usage. Proper use of with and file iterators.
with open(r"C:\name\MyDocuments\numbers") as file:
    for line in file:
        line = line.strip()            # preprocess line
        doSomethingWithThisLine(line)  # take action on the line instead of storing it in a list;
                                       # more memory efficient at the cost of execution speed
(Note the raw strings: without the r prefix, the \n in the path would be read as a newline character.)
The .strip() is used on each line of the file to remove the \n newline character that each line might have. When the with block ends, the file is closed automatically for you. This is true even if an exception is raised inside it.
2. use of list comprehension
This could be considered inefficient as the file descriptor might not be closed immediately. Could be a potential issue when this is called inside a function opening thousands of files.
data = [line.strip() for line in open("C:/name/MyDocuments/numbers", 'r')]
Note that file closing is implementation dependent. Normally unused variables are garbage collected by the Python interpreter. In CPython (the regular interpreter from python.org), it happens immediately, since its garbage collector works by reference counting. In other interpreters, like Jython or IronPython, there may be a delay.
f = open("file.txt")
lines = f.readlines()
Look over here. readlines() returns a list containing one line per element. Note that these lines contain the \n (newline character) at the end of each line. You can strip off this newline character by using the strip() method, i.e. call lines[index].strip() to get the string without the newline character.
As joaquin noted, do not forget to f.close() the file.
Converting strings to integers is easy: int("12").
The pythonic way to read a file and put every line in a list:
from __future__ import with_statement  # for Python 2.5

with open('C:/path/numbers.txt', 'r') as f:
    lines = f.readlines()
Then, assuming that each line contains a number,
numbers = [int(e.strip()) for e in lines]
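From there, the statistics the exercise asks for can be done by hand. A minimal sketch, using only basic built-ins (and assuming the population standard deviation is wanted):
n = len(numbers)
mean = sum(numbers) / float(n)                        # float() for Python 2
variance = sum((x - mean) ** 2 for x in numbers) / n  # population variance
std_dev = variance ** 0.5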
You need to pass a filename string to open. There's an extra complication when the string has \ in it, because that's a special string escape character to Python. You can fix this by doubling up each one as \\ or by putting an r in front of the string, as follows: r'C:\name\MyDocuments\numbers'.
Edit: The edits to the question make it completely different from the original, and since none of them was from the original poster I'm not sure they're warranted. However, they do point out one obvious thing that might have been overlooked, and that's how to add "My Documents" to a filename.
In an English version of Windows XP, My Documents is actually C:\Documents and Settings\name\My Documents. This means the open call should look like:
open(r"C:\Documents and Settings\name\My Documents\numbers", 'r')
I presume you're using XP because you call it My Documents - it changed in Vista and Windows 7. I don't know if there's an easy way to look this up automatically in Python.
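One best-effort approach is to build the path from the user's home directory. A sketch, with the caveat that the folder name varies across Windows versions and localized installs:
import os
docs = os.path.join(os.path.expanduser('~'), 'Documents')  # 'My Documents' on XP
path = os.path.join(docs, 'numbers')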
hdl = open("C:/name/MyDocuments/numbers", 'r')
milist = hdl.readlines()
hdl.close()
To summarize a bit from what people have been saying:
f=open('data.txt', 'w') # will make a new file or erase a file of that name if it is present
f=open('data.txt', 'r') # will open a file as read-only
f=open('data.txt', 'a') # will open a file for appending (appended data goes to the end of the file)
If you wish to have something in place similar to a try/finally (the file is closed even if an error occurs), use with:
with open('data.txt') as f:
    for line in f:
        print line
I think #movieyoda's code is probably what you should use, however.
If you have multiple numbers per line and you have multiple lines, you can read them in like this:
#!/usr/bin/env python
from os.path import dirname

with open(dirname(__file__) + '/data/path/filename.txt') as input_data:
    input_list = [map(int, num.split()) for num in input_data.readlines()]