Python 3.3 readlines truncating text file - python

I am working with Python 3.3 using PyDev for Eclipse. This is my code:
countdata = open(countfilename, 'r')
countlist = countdata.readlines()
print(len(countlist))
genecountline = wordlist(countlist[-1])
print(genecountline)
countfilename refers to a rather lengthy text file of 7847 lines that is generated from a text file using a script given to me by the instructor in my machine learning class (I did have to convert said script to Python 3 using 2to3).
wordlist is a simple function I built that takes a line of text and returns the words in it as a list.
I pull the whole file into a list of lines so that I can refer to specific lines at will for my calculation. It doesn't matter whether I read them in all at once with readlines or iterate over the file and add the lines to the list one by one like this:
countdata = open(countfilename, 'r')
countlist = []
for line in countdata:
    countlist.append(line)
Either way, print(len(countlist)) gives me approximately 7630. I say approximately because sometimes it is as low as 7628 or as high as 7633. The specific line returned by countlist[-1] is always different (the file is built using a generator object; as mentioned, my instructor built that script and I am not entirely sure how exactly it works).
genecountline = wordlist(countlist[-1])
print(genecountline)
I put those lines in just to see what Python thinks the last line of the file is. When I open the file in TextPad, the line it returns is in fact at the line number returned by len(countlist). In other words, it appears to be ignoring the last approx. 210 lines of my file. So my question is: how do I fix this, and how do I prevent it from happening again?

If you're not reading from a static text file but from one that is regenerated each time you run your program, it could be that the generating code never closes that file, in which case not everything may have been written to it yet. If you don't want to close it, you could flush it instead (the .flush() method).
You should post the code that generates the file.
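As a minimal sketch of that advice: the generating script is not shown, so the file name and line format below are hypothetical stand-ins. The point is only that the writer is closed (here by a with block) before the reader opens the file.

```python
import os
import tempfile

# Hypothetical stand-in for the file produced by the instructor's script.
path = os.path.join(tempfile.mkdtemp(), "counts.txt")

# The with block closes the file on exit, which flushes buffered writes to disk.
with open(path, "w") as out:
    for i in range(7847):
        out.write("gene{} {}\n".format(i, i * 2))

# Only read the file back after the writer has been closed.
with open(path, "r") as countdata:
    countlist = countdata.readlines()

print(len(countlist))  # 7847
```

If the writer has to stay open, calling out.flush() before reading should have a similar effect.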

Related

Search for a word, and modify the whole line in Python text processing

This is my carDatabase.txt
CarID:c01 ModelName:honda VehicleType:city Price:20
CarID:c02 ModelName:honda VehicleType:x Price:30
I want to search for a CarID and modify only that whole line, without disturbing the others.
My current code is here:
# Converting txt data into a string and modify
carsDatabaseFile = open('carsDatabase.txt', 'r')
allDataFromDatabase = [line.split(',') for line in carsDatabaseFile.readlines()]
Note:
Your question has a couple of issues: your sample from carDatabase.txt looks like it is tab-delimited, but your current code looks like it is splitting the line around the ',' character. This also looks like a place where a list comprehension might be hurting you more than it is helping you. Break that up into a for-loop if you're trying to add some logic to manipulate a single line.
For looking at CSV files, I would highly recommend using pandas for general manipulation of data in comma-separated as well as a number of other formats.
That said, if you are truly restricted to only using built-in packages, or you are looking at this as a learning exercise, and your goal is to directly manipulate just one line of that file, what you are looking for is the seek method. You can use this in combination with the tell method (documented alongside seek in the Python documentation on file objects) to find where you are in the file.
Write a for loop to identify which line in the file you are looking for
From there, you can get the output of tell() to find the specific place in the file you are trying to manipulate
Using the output from the above two steps, you can set the file pointer to a specific location using the seek() method (by byte: files are really stored as one dimensional).
You can now use the write() method to directly update the file at the location you determined above.
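The steps above can be sketched roughly as follows. Note that write() overwrites bytes in place and does not shift what comes after, so this sketch also rewrites the remainder of the file to cope with a replacement line of a different length. The function name replace_line is hypothetical, and the code assumes the fields are separated by whitespace (tabs or spaces) with the CarID field first.

```python
def replace_line(path, car_id, new_line):
    """Replace the line whose first field is 'CarID:<car_id>' with new_line."""
    new_bytes = new_line.rstrip("\n").encode() + b"\n"
    wanted = ("CarID:" + car_id).encode()
    with open(path, "r+b") as f:
        while True:
            start = f.tell()       # byte offset where this line begins
            line = f.readline()
            if not line:
                return False       # reached the end of the file without a match
            fields = line.split()  # splits on tabs as well as spaces
            if fields and fields[0] == wanted:
                rest = f.read()    # everything after the matched line
                f.seek(start)      # jump back to the start of that line
                f.write(new_bytes + rest)
                f.truncate()       # drop leftover bytes if the file got shorter
                return True
```

A call might look like replace_line('carsDatabase.txt', 'c01', 'CarID:c01 ModelName:honda VehicleType:city Price:25'). For small files, reading everything into a list and writing it back is simpler.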

How to start reading a file from a particular line in the case of a huge text file as I cannot iterate from line one

This is an issue of trying to reach the line to start from, and proceeding from there, in the shortest time possible.
I have a huge text file that I'm reading and performing operations on line after line. I am currently keeping track of the line number that I have parsed so that, in case of a system crash, I know how much I'm done with.
How do I restart reading the file from that point if I don't want to start over from the beginning again?
import os

count = 0
all_parsed = os.listdir("urltextdir/")
with open(filename,"r") as readfile :
    for eachurl in readfile:
        if str(count)+".txt" not in all_parsed:
            urltext = getURLText(eachurl)
            with open("urltextdir/"+str(count)+".txt","w") as writefile:
                writefile.write(urltext)
            result = processUrlText(urltext)
            saveinDB(result)
        count += 1  # track the current line number
This is what I'm currently doing, but when it crashes at a million lines, I have to go through all these lines in the file to reach the point I want to start from. My other alternative is to use readlines and load the entire file into memory.
Is there an alternative that I can consider?
Unfortunately, a line number isn't really a basic position for file objects, and the special seeking/telling functions are ruined by next, which is called implicitly by your for loop. You can't jump to a line, but you can jump to a byte position. So one way would be:
line = readfile.readline()
while line:
    lastell = readfile.tell() #This is the location of the imaginary cursor in the file after reading the line
    print(lastell)
    print(line) #Do with line what you would normally do
    line = readfile.readline() #Must use `readline`!
Now you can easily jump back with
readfile.seek(lastell) #You need to have kept the last lastell
You would need to keep saving lastell to a file or printing it so on restart you know which byte you're starting at.
Unfortunately you can't use the written file for this, as any modification to the character amount will ruin a count based on this.
Here is one full implementation. Create a file called tell and put 0 inside of it, and then you can run:
with open('tell','r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline())) #Get last position
        line = fd.readline() #Init loop
        while line:
            print(line.strip(),fd.tell()) #Action on line
            tfd.seek(0) #Clear and
            tfd.write(str(fd.tell())) #write new position only if successful
            line = fd.readline() #Advance loop
You can check if such a file exists and create it in the program of course.
As @Edwin pointed out in the comments, you may want to call fd.flush() and os.fsync(fd.fileno()) (import os if that isn't clear) to make sure that after every write your file contents are actually on disk. This applies to both write operations you are doing; the tell file is the quicker of the two, of course. This may slow things down considerably, so if you are satisfied with the synchronicity as is, do not use it, or only flush tfd. You can also specify the buffer size when calling open so Python automatically flushes sooner, as detailed in https://stackoverflow.com/a/3168436/6881240.
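A minimal sketch of the flush-and-fsync idea, applied to the position file: the helper name save_position is hypothetical, and tfd is assumed to be a file opened in 'r+' mode as in the implementation above. The truncate() call is an extra precaution that is not in the original code.

```python
import os

def save_position(tfd, position):
    """Overwrite the checkpoint file with position and force it to disk."""
    tfd.seek(0)
    tfd.write(str(position))
    tfd.truncate()          # discard stale digits left over from a longer value
    tfd.flush()             # push Python's buffer to the operating system
    os.fsync(tfd.fileno())  # ask the OS to commit its cache to the disk
```

Inside the loop, the two lines tfd.seek(0) and tfd.write(str(fd.tell())) could then become save_position(tfd, fd.tell()), at the cost of one disk sync per line.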
If I got it right, you could make a simple log file to store the count in.
But I would still recommend using many files, or storing every line or paragraph in a database such as SQL or MongoDB.
I guess it depends on what system your script is running on, and what resources (such as memory) you have available.
But with the popular saying "memory is cheap", you can simply read the file into memory.
As a test, I created a file with 2 million lines, each line 1024 characters long with the following code:
ms = 'a' * 1024
with open('c:\\test\\2G.txt', 'w') as out:
    for _ in range(0, 2000000):
        out.write(ms+'\n')
This resulted in a 2 GB file on disk.
I then read the file into a list in memory, like so:
my_file_as_list = [a for a in open('c:\\test\\2G.txt', 'r').readlines()]
I checked the python process, and it used a little over 2 GB in memory (on a 32 GB system)
Access to the data was very fast, and can be done by list slicing methods.
You need to keep track of the index of the list; when your system crashes, you can start from that index again.
But more important... if your system is "crashing" then you need to find out why it is crashing... surely a couple of million lines of data is not a reason to crash anymore these days...
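The index-tracking idea in this answer might look like the sketch below. The names process_from and handle are hypothetical, and how the last finished index is stored between runs (a small file, a database row) is left open.

```python
def process_from(lines, start_index, handle):
    """Call handle(index, line) for every line from start_index onward."""
    for index, line in enumerate(lines[start_index:], start=start_index):
        handle(index, line)

# Usage: pretend a previous run crashed after finishing index 1.
lines = ["url-a\n", "url-b\n", "url-c\n", "url-d\n"]
seen = []
process_from(lines, 2, lambda i, line: seen.append((i, line.strip())))
print(seen)  # [(2, 'url-c'), (3, 'url-d')]
```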

Python return to line above for writing

How can I make a Python program return to the start of the output area after writing 4 lines of data? For example, the program outputs field 1 through field 4 on separate lines. After this, the program wants to add some data to the line of field 1, but the output appears on line 5. The program is for converting data into tabular form.
If you are writing to a file, you can use the seek() function to relocate the file pointer wherever you want. For example, f.seek(0,0) will take you to the beginning of the file and then you can output the next data item there. However, keep in mind that you'll need to first move the data that you already wrote to the file, otherwise it will be over-written; that is, you need to "make space" for the new data you want to write to the beginning of the file.
For a quick intro, see https://docs.python.org/3.5/tutorial/inputoutput.html, near the bottom of the page.
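A small sketch of the "make space" idea, assuming the file is small enough to hold in memory: read the existing content, seek back to the start, and write the new data followed by the old. The function name prepend is hypothetical.

```python
def prepend(path, text):
    """Insert text at the beginning of the file at path, keeping the old content."""
    with open(path, "r+") as f:
        existing = f.read()       # save what is already in the file
        f.seek(0, 0)              # move the file pointer back to the beginning
        f.write(text + existing)  # new data first, then the old data after it
```

The same pattern works for inserting in the middle: read everything from the insertion point onward, seek back to that point, then write the new data followed by the saved tail.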

How does readline() work behind the scenes when reading a text file?

I would like to understand how readline() takes in a single line from a text file. The specific details I would like to know about, with respect to how the compiler interprets the Python language and how this is handled by the CPU, are:
How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
I am a "beginner" (I have about 4 years of "simpler" programming experience), so I wouldn't be able to understand technical details, but feel free to expand if it could help others understand!
Example using the file file.txt:
fake file
with some text
in a few lines
Question 1: How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
When you open a file in Python, it creates a file object. File objects act like file descriptors, which means that at any point in time they point to a specific place in the file. When you first open the file, that pointer is at the beginning of the file. When you call readline(), it moves the pointer forward to the character just after the next newline it reads.
Calling the tell() function of a file object returns the location the file descriptor is currently pointing to.
with open('file.txt', 'r') as fd:
    print(fd.tell())
    fd.readline()
    print(fd.tell())
# output:
0
10
# Or 11, depending on the line separators in the file
Question 2: Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
First off, reading a file doesn't really have anything to do with the CPU. It has to do with the operating system and the file system. Both of those determine how files can be read and written to.
For random access in files, you can use the mmap module of python. The Python Module of the Week site has a great tutorial.
Example, jumping to the 2nd line in the example file and reading until the end:
import mmap
import contextlib
with open('file.txt', 'r') as fd:
    with contextlib.closing(mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
        # slicing an mmap yields bytes in Python 3, so decode before printing
        print(mm[10:].decode())
# output:
with some text
in a few lines
This is a very broad question and it's unlikely that all details about what the CPU does would fit in an answer. But a high-level answer is possible:
readline reads each line in order. It starts by reading chunks of the file from the beginning. When it encounters a line break, it returns that line. Each successive invocation of readline returns the next line until the last line has been read. Then it returns an empty string.
with open("myfile.txt") as f:
    while True:
        line = f.readline()
        if not line:
            break
        # do something with the line
Readline uses operating system calls under the hood. The file object corresponds to a file descriptor in the OS, and it has a pointer that keeps track of where in the file we are at the moment. The next read will return the next chunk of data from the file from that point on.
You would have to scan through the file first in order to know how many lines there are, and then use some way of starting from the "middle" line. If you meant some arbitrary line other than the first and last lines, you would have to scan the file from the beginning, identifying lines (for example, you could repeatedly call readline, throwing away the result) until you have reached the line you want. There is a ready-made module for this: linecache.
import linecache
linecache.getline("myfile.txt", 5) # we already know we want line 5
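For comparison, the manual scan described above (call readline repeatedly and throw away the results) could be sketched like this. The function name read_line_number is hypothetical; it is not a description of how linecache itself is implemented.

```python
def read_line_number(path, wanted):
    """Return line number wanted (1-based) from path, or '' if the file is too short."""
    with open(path) as f:
        for _ in range(wanted - 1):
            if not f.readline():  # hit the end of the file early
                return ""
        return f.readline()
```

Every call rescans the file from the beginning, so this only makes sense for occasional lookups.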

Python - Opening a text file for editing before modifying it in place with fileinput.input()

I have a python script used to edit a text file. Firstly, the first line of the text file is removed. After that, a line is added to the end of the text file.
I noticed a weird phenomenon, but I cannot explain the reason of this behaviour:
This script works as expected (removes the first line and adds a line at the end of the file):
import fileinput
# remove first line of text file
i = 0
for line in fileinput.input('test.txt', inplace=True):
    i += 1
    if i != 1:
        print(line.strip())
# add a line at the end of the file
f = open('test.txt', 'a+') # <= line that is moved
f.write('test5')
f.close()
But in the following script, as the text file is opened before removing, the removal occurs but the content isn't added (with the write() method):
import fileinput
# file opened before removing
f = open('test.txt', 'a+') # <= line that is moved
# remove first line of text file
i = 0
for line in fileinput.input('test.txt', inplace=True):
    i += 1
    if i != 1:
        print(line.strip())
# add a line at the end of the file
f.write('test5')
f.close()
Note that in the second example, open() is placed at the beginning, whereas in the first it is called after removing the first line of the text file.
What's the explanation of the behaviour?
When using fileinput with the inplace parameter, the modified content is saved in a backup file. The backup file is renamed to the original file when the output file is closed. In your example, you do not close the fileinput file explicitly, relying on the self-triggered closing, which is not documented and might be unreliable.
The behaviour you describe in the first example is best explained if we assume that opening the same file again triggers fileinput.close(). In your second example, the renaming only happens after f.close() is executed, thus overwriting the other changes (adding "test5").
So apparently you should explicitly call fileinput.close() in order to have full control over when your changes are written to disk. (It is generally recommended to release external resources explicitly as soon as they are not needed anymore.)
EDIT:
After more profound testing, this is what I think is happening in your second example:
You open a stream with mode a+ to the text file and bind it to the variable f.
You use fileinput to alter the same file. Under the hood, the original file is renamed to a backup file and a new file is created under the original name; the backup is deleted once fileinput is done with it. Note: this doesn't actually change the original file. Rather, the original file is made inaccessible, as its former name now points to a new file.
However, the stream f still points to the original file (which has no file name anymore). You can still write to this file and close it properly, but you cannot see it anymore (since it has no filename anymore).
Please note that I'm not an expert in this kind of low-level operation; the details might be wrong and the terminology certainly is. Also, the behaviour might differ across operating systems and Python implementations. However, it might still help you understand why things behave differently from what you expected.
In conclusion I'd say that you shouldn't be doing what you do in your second example. Why don't you just read the file into memory, alter it, and then write it back to disk? Actual in-place (on-disk) altering of files is no fun in Python, as it's too high-level a language.
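A minimal sketch of that read, alter, write-back approach for this case (drop the first line, add one at the end). The function name drop_first_and_append is hypothetical, and the sketch assumes the file fits comfortably in memory.

```python
def drop_first_and_append(path, new_line):
    """Remove the first line of the file at path and append new_line."""
    with open(path, "r") as f:
        lines = f.read().splitlines()  # whole file in memory, newlines stripped
    lines = lines[1:]                  # drop the first line
    lines.append(new_line)             # add the new line at the end
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

Calling drop_first_and_append('test.txt', 'test5') opens the file only once for reading and once for writing, so the order-of-opening problem from the question should not arise.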
