I have a 22 MB text file containing a list of numbers (one number per line). I am trying to have Python read each number, process it and write the result to another file. This all works, but if I have to stop the program it starts over from the beginning. I tried using a MySQL database at first, but it was way too slow; I get about 4 times as many numbers processed with the text file. I would like to be able to delete each line after its number has been processed.
import os

with open('list.txt', 'r') as file:
    for line in file:
        filename = line.rstrip('\n') + ".txt"
        if os.path.isfile(filename):
            print "File", filename, "exists, skipping!"
        else:
            # process number and write file
            # (need code to delete current line here)
            pass
As you can see, every time the script is restarted it has to check the hard drive for each filename to figure out where it left off. With 1.5 million numbers this can take a while. I found an example using truncate, but it did not work.
Are there any commands in Python similar to PHP's array_shift that work with text files?
I would use a marker file to keep the number of the last line processed instead of rewriting the input file:
start_from = 0
try:
    with open('last_line.txt', 'r') as llf:
        start_from = int(llf.read())
except (IOError, ValueError):
    pass

with open('list.txt', 'r') as file:
    for i, line in enumerate(file):
        if i < start_from:
            continue
        filename = line.rstrip('\n') + ".txt"
        if os.path.isfile(filename):
            print "File", filename, "exists, skipping!"
        else:
            pass  # process number and write file here
        with open('last_line.txt', 'w') as outfile:
            outfile.write(str(i))
This code first checks for the file last_line.txt and tries to read a number from it. That number is the index of the line that was processed during the previous run. The loop then simply skips that many lines before resuming.
I use Redis for stuff like that. Install Redis and the redis-py client and you can have a persistent set in memory. Then you can do:
import redis

r = redis.StrictRedis('localhost')

with open('list.txt', 'r') as file:
    for line in file:
        if r.sismember('done', line):
            continue
        else:
            # process number and write file
            r.sadd('done', line)
If you don't want to install Redis, you can also use the shelve module, making sure that you open it with the writeback=False option. I really recommend Redis, though; it makes things like this so much easier.
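For reference, here is a minimal sketch of that shelve variant; the done.db filename and the idea of storing each processed number as a key are just illustrative assumptions on my part:
import shelve

# minimal sketch, assuming a shelf file named 'done.db' where each processed
# number is stored as a key so membership checks survive restarts
done = shelve.open('done.db', writeback=False)
try:
    with open('list.txt', 'r') as f:
        for line in f:
            number = line.rstrip('\n')
            if number in done:
                continue
            # process the number and write its file here
            done[number] = True
finally:
    done.close()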
Reading the data file should not be a bottleneck. The following code read a 36 MB, 697,997-line text file in about 0.2 seconds on my machine:
import time

start = time.clock()
with open('procmail.log', 'r') as f:
    lines = f.readlines()
end = time.clock()
print 'Readlines time:', end - start
It produced the following result:
Readlines time: 0.1953125
Note that this code produces a list of lines in one go.
To know where you've been, just write the number of lines you've processed to a file. Then if you want to try again, read all the lines and skip the ones you've already done:
import os

# Read the data file
with open('list.txt', 'r') as f:
    lines = f.readlines()

skip = 0
try:
    # Did we try earlier? If so, skip what has already been processed.
    with open('lineno.txt', 'r') as lf:
        skip = int(lf.read())  # this should only be one number
    del lines[:skip]  # remove already processed lines from the list
except (IOError, ValueError):
    pass

with open('lineno.txt', 'w+') as lf:
    for n, line in enumerate(lines):
        # Do your processing here.
        lf.seek(0)  # go to the beginning of lf
        lf.write(str(n + skip) + '\n')  # write the line number
        lf.flush()
        os.fsync(lf.fileno())  # flush and fsync make sure lf is written to disk
lines = file.readlines()
del lines[68]
This is the code I'm using to delete the lines. I have already opened the file, and it works with lots of other stuff. When I run this code it raises an IndexError when I'm deleting lines in the middle of the txt file. I've tried many versions of deleting lines in the txt file, but none of them work. Any ideas?
In short
You can add f.seek(0) before f.readlines(), and try your code again.
In detail
I tried your code and it seems to work normally when deleting a single element from lines.
Did you call f.readlines() multiple times? In that case, the second call will return an empty list because the first call already moved the cursor to the end of the file.
To read the file again, call f.seek(0) to move the cursor back to the beginning of the file before calling f.readlines().
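A quick illustration of that behaviour (notes.txt is just a made-up example file):
f = open('notes.txt')
first = f.readlines()   # reads everything; the cursor is now at the end of the file
second = f.readlines()  # returns [] because the cursor never moved back
f.seek(0)               # rewind to the beginning
third = f.readlines()   # reads the whole file again
f.close()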
How about testing whether the txt file has 69 lines at all?
def delete_line(path, number):
    f = open(path, "r")
    lines = f.readlines()
    f.close()
    if len(lines) - 1 < number:
        print("The file %s has no line number %d" % (path, number))
    else:
        del lines[number]
    return lines
Use it like this:
lines = delete_line("path/to/file/you/want/to/delete/a/line/from.txt", 68)
If your file was long enough, lines will hold the file's contents minus the specified line; otherwise a warning is printed and lines will hold the unmodified file.
So I have a file that contains this:
SequenceName 4.6e-38 810..924
SequenceName_FGS_810..924 VAWNCRQNVFWAPLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
SequenceName 1.6e-38 887..992
SequenceName_GYQ_887..992 PLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
I want my program to read only the lines that contain these protein sequences. So far I have this, which skips the first line and reads the second one:
handle = open(filename, "r")
handle.readline()
linearr = handle.readline().split()
handle.close()
fnamealpha = fname + ".txt"
handle = open(fnamealpha, "w")
handle.write(">%s\n%s\n" % (linearr[0], linearr[1]))
handle.close()
But it only processes the first sequence, and I need it to process every line that contains a sequence, so I need a loop. How can I do it?
The part that saves to a txt file is really important too, so I need a way to combine these two objectives.
My output with the above code is:
>SequenceName_810..924
VAWNCRQNVFWAPLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
Okay, I think I understand your question--you want to iterate over the lines in the file, right? But only the second line of each record--the one with the protein sequence--matters, correct? Here's my suggestion:
# the `with` context manager takes care of file closing and error handling
with open(filename, 'r') as handle:
    for line in handle:
        if line.startswith('SequenceName_'):
            print line.split()
            # Write to file, etc.
My reasoning is that you're only interested in lines that start with SequenceName_###.
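If it helps, here is a rough sketch that combines that loop with the file-writing part of your original code; it reuses your filename and fname variables, so it assumes they are defined as in your snippet:
fnamealpha = fname + ".txt"
with open(filename, 'r') as handle, open(fnamealpha, 'w') as out:
    for line in handle:
        if line.startswith('SequenceName_'):
            linearr = line.split()
            # same ">name\nsequence\n" format as in your original code
            out.write(">%s\n%s\n" % (linearr[0], linearr[1]))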
Use readlines and throw it all into a for loop.
with open(filename, 'r') as fh:
    for line in fh.readlines():
        # do processing here
In the # do processing here section, you can just prepare another list of lines to write to the other file. (Using with handles closing the file properly.)
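For example, a small sketch of that idea (the startswith test and the output.txt name are just placeholders):
kept = []
with open(filename, 'r') as fh:
    for line in fh.readlines():
        if line.startswith('SequenceName_'):  # placeholder condition
            kept.append(line)

with open('output.txt', 'w') as out:
    out.writelines(kept)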
I have some trouble trying to split large files (say, around 10 GB). The basic idea is to simply read the lines and group every, say, 40000 lines into one file.
But there are two ways of "reading" files.
1) The first one is to read the WHOLE file at once and make it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I asked such a question before.)
In Python, the approaches I've tried for reading a whole file at once include:
input1 = f.readlines()

input1 = commands.getoutput('zcat ' + file).splitlines(True)

input1 = subprocess.Popen(["cat", file],
                          stdout=subprocess.PIPE, bufsize=1)
Well, then I can just easily group 40000 lines into one file with slices like list[40000:80000] or list[80000:120000].
Another advantage of using a list is that we can easily point to specific lines.
2) The second way is to read line by line and process each line as it is read. Those lines won't be kept in memory.
Examples include:
f=gzip.open(file)
for line in f: blablabla...
or
for line in fileinput.FileInput(fileName):
I'm sure that with gzip.open, this f is NOT a list but a file object, and it seems we can only process it line by line; so how can I do this "split" job? How can I point to specific lines of the file object?
Thanks
NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    fout = open("output0.txt", "wb")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            fout.close()
            fout = open("output%d.txt" % (i / NUM_OF_LINES + 1), "wb")
    fout.close()
If there's nothing special about having a specific number of file lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:
If given an optional parameter sizehint, it reads that many bytes from
the file and enough more to complete a line, and returns the lines
from that. This is often used to allow efficient reading of a large
file by lines, but without having to load the entire file in memory.
Only complete lines will be returned.
...so you could write that code something like this:
# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.
SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)
        outFile.close()
        fileNumber += 1
The best solution I have found is using the library filesplit.
You only need to specify the input file, the output folder and the desired size in bytes for the output files; the library does all the work for you.
from fsplit.filesplit import Filesplit

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs = Filesplit()
fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)
For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:
1. Open the input file.
2. Open the first output file.
3. Read one line from the input file and write it to the output file.
4. Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file and open the next one.
5. Repeat steps 3-4 until you've reached the end of the input file.
6. Close both files.
import fileinput

chunk_size = 40000
fout = None
for i, line in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        if fout:
            fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
fout.close()
Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).
But reading the file into memory requires O(n) space as well. Although sometimes we do need to read a 10 GB file into memory, your particular problem does not require it. We can iterate over the file object directly. Of course, the file object does require space, but we have no reason to hold the contents of the file twice, in two different forms.
Therefore, I would go with your second solution.
I created this small script to split the large file in a few seconds. It took only 20 seconds to split a text file with 20M lines into 10 small files each with 2M lines.
split_length = 2_000_000
file_count = 0

large_file = open('large-file.txt', encoding='utf-8', errors='ignore').readlines()

# write one output file per chunk of split_length lines
# (this also handles a final, possibly shorter, chunk)
while file_count * split_length < len(large_file):
    split_start_value = file_count * split_length
    split_end_value = split_length * (file_count + 1)
    file_content_list = large_file[split_start_value:split_end_value]

    new_file = open(f'splitted-file-{file_count}.txt', 'w', encoding='utf-8', errors='ignore')
    new_file.write(''.join(file_content_list))
    new_file.close()

    file_count += 1
    print(f'created file {file_count}')
To split a file line-wise:
group every, say 40000 lines into one file
You can use module filesplit with method bylinecount (version 4.0):
import os
from filesplit.split import Split
LINES_PER_FILE = 40_000 # see PEP515 for readable numeric literals
filename = 'myinput.txt'
outdir = 'splitted/' # to store split-files `myinput_1.txt` etc.
Split(filename, outdir).bylinecount(LINES_PER_FILE)
This is similar to rafaoc's answer which apparently used outdated version 2.0 to split by size.
New to python and trying to learn the ropes of file i/o.
Working with pulling lines from a large (2 million line) file in this format:
56fr4
4543d
4343d
hirh3
I've been reading that readline() is best because it doesn't pull the whole file into memory. But when I try to read the documentation on it, it seems to be Unix-only? And I'm on a Mac.
Can I use readline on the Mac without loading the whole file into memory? What would the syntax be to simply read line number 3 in the file? The examples in the docs are a bit over my head.
Edit
Here is the function to return a code:
def getCode(i):
    with open("test.txt") as file:
        for index, line in enumerate(file):
            if index == i:
                code = # what does it equal?
                break
    return code
You don't need readline:
with open("data.txt") as file:
for line in file:
# do stuff with line
This will read the entire file line-by-line, but not all at once (so you don't need all the memory). If you want to abort reading the file, because you found the line you want, use break to terminate the loop. If you know the index of the line you want, use this:
with open("data.txt") as file:
for index, line in enumerate(file):
if index == 2: # looking for third line (0-based indexes)
# do stuff with this line
break # no need to go on
+1 to SpaceC0wb0y's answer.
You could also do:
f = open('filepath')
f.readline() # first line - let it pass
f.readline() # second line - let it pass
third_line = f.readline()
f.close()
I am new to Python programming. I have a .txt file that looks like this:
0,Salary,14000
0,Bonus,5000
0,gift,6000
I want to replace the first '0' value with '1' in each line. How can I do this? Can anyone help me, ideally with sample code?
Thanks in advance.
Nimmyliji
I know that you're asking about Python, but forgive me for suggesting that perhaps a different tool is better for the job. :) It's a one-liner via sed:
sed 's/^0,/1,/' yourtextfile.txt > output.txt
This applies the regex /^0,/ (which matches any 0, that occurs at the beginning of a line) to each line and replaces the matched text with 1, instead. The output is redirected into the specified file, output.txt.
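If you'd rather stay in Python, a rough equivalent of that substitution could look like this (the file names follow the example above):
import re

with open('yourtextfile.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        # same idea as the sed command: replace a leading "0," with "1,"
        dst.write(re.sub(r'^0,', '1,', line))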
inFile = open("old.txt", "r")
outFile = open("new.txt", "w")
for line in inFile:
outFile.write(",".join(["1"] + (line.split(","))[1:]))
inFile.close()
outFile.close()
If you would like something more general, take a look at the Python csv module. It contains utilities for processing comma-separated values (abbreviated as CSV) in files, but it can work with an arbitrary delimiter, not only a comma. Since your sample is obviously a CSV file, you can use it as follows:
import csv
reader = csv.reader(open("old.txt"))
writer = csv.writer(open("new.txt", "w"))
writer.writerows(["1"] + line[1:] for line in reader)
To overwrite original file with new one:
import os
os.remove("old.txt")
os.rename("new.txt", "old.txt")
I think that writing to a new file and then renaming it is more fault-tolerant and less likely to corrupt your data than overwriting the source file directly. Imagine that your program raised an exception after the source file had already been read into memory and reopened for writing: you would lose the original data, and the new data wouldn't be saved because of the crash. With this approach you only lose the new data while preserving the original.
o=open("output.txt","w")
for line in open("file"):
s=line.split(",")
s[0]="1"
o.write(','.join(s))
o.close()
Or you can use fileinput with in-place editing:
import fileinput

for line in fileinput.FileInput("file", inplace=1):
    s = line.rstrip('\n').split(",")  # strip the newline; print adds one back
    s[0] = "1"
    print ','.join(s)
f = open(filepath, 'r')
data = f.readlines()
f.close()
edited = []
for line in data:
    edited.append('1' + line[1:])
f = open(filepath, 'w')
f.writelines(edited)
f.flush()
f.close()
Or in Python 2.5+:
with open(filepath, 'r') as f:
    data = f.readlines()
with open(outfilepath, 'w') as f:
    for line in data:
        f.write('1' + line[1:])
This should do it. I wouldn't recommend it for a truly big file though ;-)
What is going on (ex 1):
1: Open the file in read mode
2,3: Read all the lines into a list (each line is a separate index) and close the file.
4,5,6: Iterate over the list, constructing a new list where each line has its first character replaced by a '1'. The line[1:] slice takes the string from index 1 onward; we concatenate the '1' with the rest of the string.
7,8,9,10: Reopen the file in write mode, write the list to the file (overwriting it), flush the buffer, and close the file handle.
In Ex. 2:
I use the with statement, which closes the file handles automatically, but do essentially the same thing.
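And since Ex. 1 isn't great for a truly big file, here is a line-by-line sketch of the same transformation that avoids readlines() entirely (outfilepath assumed as in Ex. 2):
with open(filepath, 'r') as src:
    with open(outfilepath, 'w') as dst:
        for line in src:
            dst.write('1' + line[1:])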