Remove lines in a text file after processing them in a loop - python

I have a simple program that processes some lines in a text file (it adds some text to them) and then saves them to another file. I would like to know if you can remove a line from the source file after it has been processed in the loop. Here is an example of how my program works:
datafile = open("data.txt", "a+")
donefile = open("done.txt", "a+")
for i in datafile:
    # My program goes in here
    donefile.write(processeddata)
# end of loop
datafile.close()
donefile.close()
As you can see, it just processes some lines from a file (separated by newlines). Is there a way to remove each line at the end of the loop, so that when the program is closed it can continue where it left off?

Just so that I get the question right: you'd like to remove the line from datafile once you've processed it and stored it in donefile?
There is no need to do this, and it's also pretty risky to write to the file that is your source of reads.
Instead, why not delete the datafile after you exit the loop (i.e. after you close your files)?
A file iterator is a lazy iterator, so when you do for i in datafile it loads one line into memory at a time; you are only ever working with that one line, so memory constraints shouldn't be a concern.
Lastly, to access files, please consider using the with statement. It takes care of closing the file handles even when exceptions occur and makes your program more robust.
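For illustration, here is one way the original loop might look with with and a cleanup step afterwards; this is only a sketch, and process() is a hypothetical placeholder for whatever the loop body actually does:
import os

def process(line):
    # Hypothetical stand-in for whatever "adds some text" to a line.
    return line.rstrip("\n") + " processed\n"

with open("data.txt") as datafile, open("done.txt", "a") as donefile:
    for line in datafile:
        donefile.write(process(line))

# Only once every line has been processed and written, remove the source file.
os.remove("data.txt")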

Related

How to start reading a file from a particular line in the case of a huge text file as I cannot iterate from line one

This is an issue of trying to reach the line to start from and proceed from there in the shortest time possible.
I have a huge text file that I'm reading and performing operations on line after line. I am currently keeping track of the line number that I have parsed, so that in case of any system crash I know how far I got.
How do I restart reading the file from that point without starting over from the beginning again?
count = 0
all_parsed = os.listdir("urltextdir/")
with open(filename, "r") as readfile:
    for eachurl in readfile:
        if str(count)+".txt" not in all_parsed:
            urltext = getURLText(eachurl)
            with open("urltextdir/"+str(count)+".txt", "w") as writefile:
                writefile.write(urltext)
            result = processUrlText(urltext)
            saveinDB(result)
        count += 1
This is what I'm currently doing, but when it crashes at a million lines I have to go through all those lines again just to reach the point I want to start from. My other alternative is to use readlines and load the entire file into memory.
Is there an alternative that I can consider?
Unfortunately a line number isn't really a basic position for file objects, and the seek/tell functions are thrown off by next, which your for loop calls implicitly. You can't jump to a line, but you can jump to a byte position. So one way would be:
line = readfile.readline()  # Must use `readline`!
while line:
    lastell = readfile.tell()
    print(lastell)  # This is the location of the imaginary cursor in the file after reading the line
    print(line)     # Do with line what you would normally do
    line = readfile.readline()
print(line)  # Last line skipped by loop
Now you can easily jump back with
readfile.seek(lastell)  # You need to keep the last lastell
You would need to keep saving lastell to a file or printing it so on restart you know which byte you're starting at.
Unfortunately you can't use the written file for this, as any modification to the number of characters would throw off a count based on it.
Here is one full implementation. Create a file called tell and put 0 inside of it, and then you can run:
with open('tell', 'r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))  # Get last position
        line = fd.readline()  # Init loop
        while line:
            print(line.strip(), fd.tell())  # Action on line
            tfd.seek(0)  # Clear and
            tfd.write(str(fd.tell()))  # write new position only if successful
            line = fd.readline()  # Advance loop
        print(line)  # Last line will be skipped by loop
You can check if such a file exists and create it in the program of course.
As #Edwin pointed out in the comments, you may want to call fd.flush() and os.fsync(fd.fileno()) (import os if that isn't clear) to make sure that after every write your file contents are actually on disk. This applies to both write operations you are doing, the tell file being the quicker of the two, of course. It may slow things down considerably, so if you are satisfied with the synchronicity as it is, don't use it, or only flush the tfd. You can also specify the buffer size when calling open, so that Python flushes more often, as detailed in https://stackoverflow.com/a/3168436/6881240.
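As a rough illustration of that advice, here is the same loop with the checkpoint forced out to disk after each line; this is only a sketch reusing the tell file from above, and the truncate() call is an addition to guard against leftover digits from a previously written longer value:
import os

with open('tell', 'r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))        # Get last position
        line = fd.readline()
        while line:
            print(line.strip(), fd.tell())  # Action on line
            tfd.seek(0)
            tfd.write(str(fd.tell()))       # Record the new position
            tfd.truncate()                  # Drop any leftover digits (addition, not in the original)
            tfd.flush()                     # Flush Python's buffer...
            os.fsync(tfd.fileno())          # ...and ask the OS to put it on disk
            line = fd.readline()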
If I got it right,
you could make a simple log file to store the count in.
But I would still recommend using multiple files, or storing every line or paragraph in a database like SQL or MongoDB.
I guess it depends on what system your script is running on, and what resources (such as memory) you have available.
But with the popular saying "memory is cheap", you can simply read the file into memory.
As a test, I created a file with 2 million lines, each line 1024 characters long with the following code:
ms = 'a' * 1024
with open('c:\\test\\2G.txt', 'w') as out:
    for _ in range(0, 2000000):
        out.write(ms+'\n')
This resulted in a 2 GB file on disk.
I then read the file into a list in memory, like so:
my_file_as_list = [a for a in open('c:\\test\\2G.txt', 'r').readlines()]
I checked the Python process, and it used a little over 2 GB of memory (on a 32 GB system).
Access to the data was very fast, and can be done by list slicing methods.
You then need to keep track of the list index as you work through it, so that when your system crashes you can start again from that index, as in the sketch below.
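A minimal sketch of that idea; the progress-file name and its handling are assumptions for illustration, not part of the original code:
import os

lines = open('c:\\test\\2G.txt', 'r').readlines()

start = 0
if os.path.exists('progress.txt'):
    with open('progress.txt') as p:
        start = int(p.read() or 0)

for index in range(start, len(lines)):
    # ... process lines[index] here ...
    with open('progress.txt', 'w') as p:
        p.write(str(index + 1))  # next run resumes at the following line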
But more importantly... if your system is "crashing", then you need to find out why it is crashing... surely a couple of million lines of data is not a reason to crash these days...

How does readline() work behind the scenes when reading a text file?

I would like to understand how readline() takes in a single line from a text file. The specific details I would like to know about, with respect to how the compiler interprets the Python language and how this is handled by the CPU, are:
How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
I am a "beginner" (I have about 4 years of "simpler" programming experience), so I wouldn't be able to understand technical details, but feel free to expand if it could help others understand!
Example using the file file.txt:
fake file
with some text
in a few lines
Question 1: How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
When you open a file in python, it creates a file object. File objects act as file descriptors, which means at any one point in time, they point to a specific place in the file. When you first open the file, that pointer is at the beginning of the file. When you call readline(), it moves the pointer forward to the character just after the next newline it reads.
Calling the tell() function of a file object returns the location the file descriptor is currently pointing to.
with open('file.txt', 'r') as fd:
    print(fd.tell())
    fd.readline()
    print(fd.tell())

# output:
0
10
# Or 11, depending on the line separators in the file
Question 2: Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
First off, reading a file doesn't really have anything to do with the CPU. It has to do with the operating system and the file system; both of those determine how files can be read and written to (see "Barebones explanation of files" for background).
For random access in files, you can use the mmap module of python. The Python Module of the Week site has a great tutorial.
Example, jumping to the 2nd line in the example file and reading until the end:
import mmap
import contextlib
with open('file.txt', 'r') as fd:
    with contextlib.closing(mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
        print(mm[10:].decode())  # mmap yields bytes; decode for printing

# output:
with some text
in a few lines
This is a very broad question and it's unlikely that all details about what the CPU does would fit in an answer. But a high-level answer is possible:
readline reads each line in order. It starts by reading chunks of the file from the beginning. When it encounters a line break, it returns that line. Each successive invocation of readline returns the next line until the last line has been read. Then it returns an empty string.
with open("myfile.txt") as f:
while True:
line = f.readline()
if not line:
break
# do something with the line
Readline uses operating system calls under the hood. The file object corresponds to a file descriptor in the OS, and it has a pointer that keeps track of where in the file we are at the moment. The next read will return the next chunk of data from the file from that point on.
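To make the "operating system calls" point a bit more concrete, here is a rough sketch that bypasses Python's file object and reads through the raw descriptor directly; it is only an illustration, since readline itself adds buffering and line-splitting on top of reads like this:
import os

fd = os.open("myfile.txt", os.O_RDONLY)  # a raw OS file descriptor
chunk = os.read(fd, 4096)                # "give me the next 4096 bytes from where the descriptor points"
print(os.lseek(fd, 0, os.SEEK_CUR))      # the descriptor's current position in the file
os.close(fd)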
You would have to scan through the file first in order to know how many lines there are, and then use some way of getting to the "middle" line. If you mean some arbitrary line other than the first or last, you would have to scan the file from the beginning, identifying lines (for example, by repeatedly calling readline and throwing away the result), until you have reached the line you want. There is a ready-made module for this: linecache.
import linecache
linecache.getline("myfile.txt", 5) # we already know we want line 5

Creating a program which counts the number of words in each row of a text file (Python)

I am trying to create a program which takes an input file, counts the number of words in each row and writes that count to another output file. I managed to develop this code:
in_file = "our_input.txt"
out_file = "output.txt"
f=open(in_file)
g=open(out_file,"w")
for line in f:
if line == "\n":
g.write("0\n")
else:
g.write(str(line.count(" ")+1)+"\n")
Now, this works well, but the problem is that it only works for a certain number of lines. If my input file has 8000 lines, it will display only the first 6800; if it has 6000 lines, again only part of them are written (all numbers are rounded).
I tried creating another program, which splits each line into a list and then counts its length, but the problem remains just the same.
Any idea what could cause this?
You need to close each file after you're done with it. The safest way to do this is by using the with statement:
with open(in_file) as f, open(out_file, "w") as g:
    for line in f:
        if line == "\n":
            g.write("0\n")
        else:
            g.write(str(line.count(" ")+1)+"\n")
When reaching the end of a with block, all files you opened in the with line will be closed.
The reason for the behavior you see is that for performance reasons, reading and writing to/from files is buffered. Because of the way hard drives are constructed, data is read/written in blocks rather than in individual bytes - so even if you attempt to read/write a single byte, you have to read/write an entire block. Therefore, most programming languages' built-in file IO functions actually read (at least) one block at a time into memory and feed you data from that in-memory block until it needs to read another block. Similarly, writing is performed by actually writing into a memory block first, and only writing the block to disk when it is full. If you don't close the file writer, whatever is in the last in-memory block won't be written.
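If for some reason you don't use with, the same effect can be had by closing (or at least flushing) the writer explicitly; a minimal sketch based on the question's own code:
in_file = "our_input.txt"
out_file = "output.txt"
f = open(in_file)
g = open(out_file, "w")
for line in f:
    if line == "\n":
        g.write("0\n")
    else:
        g.write(str(line.count(" ") + 1) + "\n")
g.close()  # flushes the last partially filled buffer block to disk
f.close()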

How can I make my Python program read each line in a file

I have 2 files, passwd and dictionary. The passwd file is a test file with one word, while the dictionary has a list of a few lines of words. My program so far reads and compares only the first line of the dictionary file. For example, my dictionary file contains (egg, fish, red, blue) and my passwd file contains only (egg).
The program runs just fine, but once I move the word egg in the dictionary file to, let's say, last in the list, the program won't read it and won't pull up results.
My code is below.
#!/usr/bin/passwd
import crypt

def testPass(line):
    e = crypt.crypt(line, "HX")
    print e

def main():
    dictionary = open('dictionary', 'r')
    password = open('passwd', 'r')
    for line in dictionary:
        for line2 in password:
            if line == line2:
                testPass(line2)
    dictionary.close()
    password.close()

main()
If you do
for line in file_obj:
    ....
you are implicitly using the readline method of the file, advancing the file pointer with each call. This means that after the inner loop is done for the first time, it will no longer be executed, because there are no more lines to read.
One possible solution is to keep one -- preferably the smaller -- file in memory using readlines. This way, you can iterate over it for each line you read from the other file.
file_as_list = file_obj.readlines()
for line in file_obj_2:
    for line_from_list in file_as_list:
        ..
Once your inner loop runs once, it will have reached the end of the password file. When the outer loop hits its second iteration, there's nothing left to read in the password file because you haven't moved the file pointer back to the start of the file.
There are many solutions to the problem. You can use seek to move the file pointer back to the start. Or, you can read the whole password file once and save the data in a list. Or, you can reopen the file on every iteration of the outer loop. The choice of which is best depends on the nature of the data (how many lines there are, are they on a slow network share or fast local disk?) and what your performance requirements are.
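As a rough sketch of the seek option, applied to the loop from the question (testPass is the function defined there):
dictionary = open('dictionary', 'r')
password = open('passwd', 'r')
for line in dictionary:
    password.seek(0)  # rewind so the inner loop sees every password line again
    for line2 in password:
        if line == line2:
            testPass(line2)
dictionary.close()
password.close()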

question about splitting a large file

Hey I need to split a large file in python into smaller files that contain only specific lines. How do I do this?
You're probably going to want to do something like this:
big_file = open('big_file', 'r')
small_file1 = open('small_file1', 'w')
small_file2 = open('small_file2', 'w')
for line in big_file:
    if 'Charlie' in line: small_file1.write(line)
    if 'Mark' in line: small_file2.write(line)
big_file.close()
small_file1.close()
small_file2.close()
Opening a file for reading returns an object that allows you to iterate over the lines. You can then check each line (which is just a string of whatever that line contains) for whatever condition you want, then write it to the appropriate file that you opened for writing. It is worth noting that when you open a file with 'w' it will overwrite anything already written to that file. If you want to simply add to the end, you should open it with 'a', to append.
Additionally, if you expect there to be some possibility of error in your reading/writing code, and want to make sure the files are closed, you can use:
with open('big_file', 'r') as big_file:
<do stuff prone to error>
Do you mean breaking it down into subsections? Like if I had a file with chapter 1, chapter 2, and chapter 3, you want it to be broken down into separate files for each chapter?
The way I've done this is similar to Wilduck's response, but closes the input file as soon as it reads in the data and keeps all the lines read in.
data_file = open('large_file_name', 'r')
lines = data_file.readlines()
data_file.close()

outputFile = open('output_file_one', 'w')
for line in lines:
    if 'SomeName' in line:
        outputFile.write(line)
outputFile.close()
If you wanted to have more than one output file you could either add more loops or open more than one outputFile at a time.
I'd recommend using Wilduck's response, however, as it uses less space and will take less time with larger files, since the file is read only once.
How big is the file, and does it need to be done in Python? If this is on Unix, would split/csplit/grep suffice?
First, open the big file for reading.
Second, open all the smaller file names for writing.
Third, iterate through every line. Every iteration, check to see what kind of line it is, then write it to that file.
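A minimal sketch of those three steps, assuming hypothetical keywords decide the "kind" of each line and reusing the file names from the first answer:
# Map each keyword (the "kind" of line) to the file that should receive it.
keyword_to_file = {'Charlie': 'small_file1', 'Mark': 'small_file2'}

with open('big_file') as big_file:
    outputs = {word: open(name, 'w') for word, name in keyword_to_file.items()}
    try:
        for line in big_file:
            for word, out in outputs.items():
                if word in line:
                    out.write(line)
    finally:
        for out in outputs.values():
            out.close()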
More info on File I/O: http://docs.python.org/tutorial/inputoutput.html
