Python readline() and readlines() not working

I'm trying to read the contents of a 5GB file and then sort them and find duplicates. The file is basically just a list of numbers (each on a new line). There are no empty lines or any symbols other than digits. The numbers are all pretty big (at least 6 digits). I am currently using
for line in f:
    do something to line
to avoid memory problems. I am fine with using that. However, I am interested to know why readline() and readlines() didn't work for me. When I try
print f.readline(10)
the program always returns the same line no matter which number I use as a parameter. To be precise, if I do readline(0) it returns an empty line, even though the first line in the file is a big number. If I try readline(1) it returns 2, even though the number 2 is not in the file. When the parameter is >= 6, it always returns the same number: 291965.
Additionally, the readlines() method always returns the same lines no matter what the parameter is. Even if I try to print f.readlines(2), it's still giving me a list of over 1000 numbers.
I am not sure if I explained it very well. Sorry, English is not my first language. Anyway, I can make it work without the readline methods but I really want to know why they don't work as expected.
This is what the first 10 lines of the file look like:
548098
968516
853181
485102
69638
689242
319040
610615
936181
486052

I cannot reproduce f.readline(1) returning 2, or f.readlines(10) returning "thousands of lines", but it seems like you misunderstood what the integer parameters to those functions do.
Those numbers do not specify the number of the line to read, but the maximum number of bytes that readline will read.
>>> f = open("data.txt")
>>> f.readline(1)
'5'
>>> f.readline(100)
'48098\n'
Both commands read from the first line, which is 548098; the first reads only 1 byte, and the second reads the rest of the line, as there are fewer than 100 bytes left. If you call readline again, it will continue with the second line, and so on.
Similarly, f.readlines(10) will read full lines until the total amount of bytes read is larger than the specified number:
>>> f.readlines(10)
['968516\n', '853181\n']
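If the intent was to fetch a particular line by its number, the size argument will not do that. A minimal sketch that streams the file instead (the file name and the wanted line number are just placeholders):
wanted = 5                         # hypothetical 1-based line number
with open("data.txt") as f:
    for line_no, line in enumerate(f, start=1):
        if line_no == wanted:
            print(line)
            break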

Related

Don't understand why getting 'ValueError: Mixing iteration and read methods would lose data' [duplicate]

I'm writing a script that logs errors from another program and restarts the program where it left off when it encounters an error. For whatever reasons, the developers of this program didn't feel it necessary to put this functionality into their program by default.
Anyways, the program takes an input file, parses it, and creates an output file. The input file is in a specific format:
UI - 26474845
TI - the title (can be any number of lines)
AB - the abstract (can also be any number of lines)
When the program throws an error, it gives you the reference information you need to track the error - namely, the UI, which section (title or abstract), and the line number relative to the beginning of the title or abstract. I want to log the offending sentences from the input file with a function that takes the reference number and the file, finds the sentence, and logs it. The best way I could think of doing it involves moving forward through the file a specific number of times (namely, n times, where n is the line number relative to the beginning of the section). The way that seemed to make sense to do this is:
i = 1
while i <= lineNumber:
    print original.readline()
    i += 1
I don't see how this would make me lose data, but Python thinks it would, and says ValueError: Mixing iteration and read methods would lose data. Does anyone know how to do this properly?
You get the ValueError because your code probably has for line in original: in addition to original.readline(). An easy solution which fixes the problem without making your program slower or consuming more memory is changing
for line in original:
    ...
to
while True:
    line = original.readline()
    if not line: break
    ...
Use for and enumerate.
Example:
for line_num, line in enumerate(file):
    if line_num < cut_off:
        print line
NOTE: This assumes you are already cleaning up your file handles, etc.
Also, the takewhile function could prove useful if you prefer a more functional flavor.
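For example, a minimal sketch with itertools.takewhile, assuming the section ends at a blank line (the file name and the stopping condition are placeholders; the real condition depends on the file format):
from itertools import takewhile

with open("input.txt") as f:                        # placeholder file name
    for line in takewhile(lambda l: l.strip(), f):  # stop at the first blank line
        print(line)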
Assuming you need only one line, this could be of help
import itertools

def getline(fobj, line_no):
    "Return a (1-based) line from a file object"
    return itertools.islice(fobj, line_no - 1, line_no).next()  # 1-based!

>>> getline(open("/etc/passwd", "r"), 4)
'adm:x:3:4:adm:/var/adm:/bin/false\n'
You might want to catch StopIteration errors (if the file has fewer lines).
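For example, a small guard around the getline helper above (the line number 4000 is arbitrary):
try:
    line = getline(open("/etc/passwd", "r"), 4000)
except StopIteration:
    line = None   # the file has fewer than 4000 lines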
Here's a version without the ugly while True pattern and without other modules:
for line in iter(original.readline, ''):
    if …:  # to the beginning of the title or abstract
        for i in range(lineNumber):
            print original.readline(),
        break

Python's function readlines(n) behavior

I've read the documentation, but what does readlines(n) do? By readlines(n), I mean readlines(3) or any other number.
When I run readlines(3), it returns the same thing as readlines().
The optional argument specifies approximately how many bytes to read from the file. The file is then read a bit further, until the current line ends:
readlines([size]) -> list of strings, each a line from the file.
Call readline() repeatedly and return a list of the lines so read.
The optional size argument, if given, is an approximate bound on the
total number of bytes in the lines returned.
Another quote:
If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that.
You're right that it doesn't seem to do much for small files, which is interesting:
In [1]: open('hello').readlines()
Out[1]: ['Hello\n', 'there\n', '!\n']
In [2]: open('hello').readlines(2)
Out[2]: ['Hello\n', 'there\n', '!\n']
One might think it's explained by the following phrase in the documentation:
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
However, even when I try to read the file without buffering, it doesn't seem to change anything, which means some other kind of internal buffer is meant:
In [4]: open('hello', 'r', 0).readlines(2)
Out[4]: ['Hello\n', 'there\n', '!\n']
On my system, this internal buffer size seems to be around 5k bytes / 1.7k lines:
In [1]: len(open('hello', 'r', 0).readlines(5))
Out[1]: 1756
In [2]: len(open('hello', 'r', 0).readlines())
Out[2]: 28080
Depending on the size of the file, readlines(hint) should return a smaller set of lines. From the documentation:
f.readlines() returns a list containing all the lines of data in the file.
If given an optional parameter sizehint, it reads that many bytes from the file
and enough more to complete a line, and returns the lines from that.
This is often used to allow efficient reading of a large file by lines,
but without having to load the entire file in memory. Only complete lines
will be returned.
So, if your file has 1000s of lines, you can pass in say... 65536, and it will only read up to that many bytes at a time + enough to complete the next line, returning all the lines that are completely read.
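A minimal sketch of that pattern (the 65536 hint, the file name, and process() are placeholders):
with open("big_file.txt") as f:
    while True:
        lines = f.readlines(65536)   # roughly 64 KB worth of whole lines
        if not lines:
            break                    # reached EOF
        for line in lines:
            process(line)            # hypothetical per-line work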
It returns the lines that the given character size 'n' spans, starting from the current position in the file.
For example, in a text file with the content
one
two
three
four
open('text').readlines(0) returns ['one\n', 'two\n', 'three\n', 'four\n']
open('text').readlines(1) returns ['one\n']
open('text').readlines(3) returns ['one\n']
open('text').readlines(4) returns ['one\n', 'two\n']
open('text').readlines(7) returns ['one\n', 'two\n']
open('text').readlines(8) returns ['one\n', 'two\n', 'three\n']
open('text').readlines(100) returns ['one\n', 'two\n', 'three\n', 'four\n']

Using text in one file to search for match in second file

I'm using python 2.6 on linux.
I have two text files
first.txt has a single string of text on each line. So it looks like
lorem
ipus
asfd
The second file doesn't quite have the same format; it would look more like this:
1231 lorem
1311 assss 31 1
etc
I want to take each line of text from first.txt and determine if there's a match in the second file. If there isn't a match then I would like to save the missing text to a third file. I would like to ignore case, but that's not completely necessary. This is why I was looking at regex but didn't have much luck.
So I'm opening the files, using readlines() to create a list.
Iterating through the lists and printing out the matches.
Here's my code
first_file=open('first.txt', "r")
first=first_file.readlines()
first_file.close()
second_file=open('second.txt',"r")
second=second_file.readlines()
second_file.close()
while i < len(first):
    j = search[i]
    while k < len(second):
        m = compare[k]
        if not j.find(m):
            print m
        i = i + 1
        k = k + 1
exit()
It's definitely not elegant. Anyone have suggestions how to fix this or a better solution?
My approach is this: Read the second file, convert it into lowercase and then create a list of the words it contains. Then convert this list into a set, for better performance with large files.
Then go through each line in the first file, and if it (also converted to lowercase, and with extra whitespace removed) is not in the set we created, write it to the third file.
with open("second.txt") as second_file:
second_values = set(second_file.read().lower().split())
with open("first.txt") as first_file:
with open("third.txt", "wt") as third_file:
for line in first_file:
if line.lower().strip() not in second_values:
third_file.write(line + "\n")
set objects are a simple container type that is unordered and cannot contain duplicate values. They are designed to allow you to quickly add or remove items, or tell if an item is already in the set.
with statements are a convenient way to ensure that a file is closed, even if an exception occurs. They are enabled by default from Python 2.6 onwards; in Python 2.5 they require that you put the line from __future__ import with_statement at the top of your file.
The in operator does what it sounds like: tell you if a value can be found in a collection. When used with a list it just iterates through, like your code does, but when used with a set object it uses hashes to perform much faster. not in does the opposite. (Possible point of confusion: in is also used when defining a for loop (for x in [1, 2, 3]), but this is unrelated.)
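A tiny illustration of that difference (the numbers are arbitrary):
values_list = list(range(1000000))
values_set = set(values_list)

print(999999 in values_list)   # True, but found by scanning the whole list
print(999999 in values_set)    # True, found by a single hash lookup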
Assuming that you're looking for the entire line in the second file:
second_file=open('second.txt',"r")
second=second_file.readlines()
second_file.close()
first_file=open('first.txt', "r")
for line in first_file:
    if line not in second:
        print line
first_file.close()

How to load a big file and cut it into smaller files?

I have a file of about 4MB (which I call the big one)... this file has about 160000 lines in a specific format, and I need to cut it at regular intervals (not at equal intervals), i.e. at the end of a certain pattern, and write each part into another file.
Basically, what I want is to copy the information from the big file into many smaller files: as I read the big file, keep writing the information into one file, and once a certain pattern occurs, stop and start writing the following lines into another file, and so on.
Normally, if it were a small file, I guess it could be done: call file.readline() to read each line, check whether the pattern has ended; if not, write the line to the current file; if it has ended, change the file name and open a new file, and so on. But how do I do this for such a big file?
Thanks in advance.
I didn't mention the file format as I felt it is not necessary; I will mention it if required.
I would first read all of the allegedly-big file in memory as a list of lines:
with open('socalledbig.txt', 'rt') as f:
    lines = f.readlines()
should take little more than 4MB -- tiny even by the standard of today's phones, much less ordinary computers.
Then, perform whatever processing you need to determine the beginning and ending of each group of lines you want to write out to a smaller file (I'm not sure from your question's text whether such groups can overlap or leave gaps, so I'm offering the most general solution where they're fully allowed to -- this will also cover more constrained use cases, with no real performance penalty, though code might be a tad simpler if the constraints were very rigid).
Say that you put these numbers in lists starts (index from 0 of first line to write, included), ends (index from 0 of first line to NOT write -- may legitimately and innocuously be len(lines) or more), names (filenames to which you want to write), all lists having the same length of course.
Then, lastly:
assert len(starts) == len(ends) == len(names)
for s, e, n in zip(starts, ends, names):
    with open(n, 'wt') as f:
        f.writelines(lines[s:e])
...and that's all you need to do!
Edit: the OP seems to be confused by the concept of having these lists, so let me try to give an example: each block written out to a file starts at a line containing 'begin' (included) and ends at the first immediately succeeding line containing 'end' (also included), and the names of the files to be written are to be result0.txt, result1.txt, and so on.
It's an error if the number of "closing ends" differ from that of "opening begins" (and remember, the first immediately succeeding "end" terminates all pending "begins"); no line is allowed to contain both 'begin' and 'end'.
A very arbitrary set of conditions, to be sure, but then, the OP leaves us totally in the dark about the actual specifics of the problem, so what else can we do but guess most wildly?-)
outfile = 0
starts = []
ends = []
names = []
for i, line in enumerate(lines):
    if 'begin' in line:
        if 'end' in line:
            raise ValueError('Both begin and end: %r' % line)
        starts.append(i)
        names.append('result%d.txt' % outfile)
        outfile += 1
    elif 'end' in line:
        ends.append(i + 1)  # remember ends are EXCLUDED, hence the +1
That's it -- the assert about the three lists having identical lengths will take care of checking that the constraints are respected.
As the constraints and specs are changed, so of course will this snippet of code change accordingly -- as long as it fills the three equal-length lists starts, ends, and names, exactly how it does so matters not in the least to the rest of the code.
A 4MB file is very small, it fits in memory for sure. The fastest approach would be to read it all and then iterate over each line searching for the pattern, writing out the line to the appropriate file depending on the pattern (your approach for small files.)
I'm not going to get into the actual code, but pseudo code would do this.
BIGFILE = "filename"
SMALLFILE = "smallfile1"
while (readline(bigfile)) {
    write(SMALLFILE, line)
    if (line matches pattern) {
        SMALLFILE = "smallfile++"
    }
}
Which is really bad code, but maybe you get the point. I should also have said that it doesn't matter how big your file is since you have to read the file anyway.
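A runnable Python sketch of the same idea (the 'END' marker and the file names are assumptions here, since the actual pattern wasn't given):
part = 0
outfile = open('smallfile0.txt', 'w')
with open('bigfile.txt') as bigfile:      # placeholder input file name
    for line in bigfile:
        outfile.write(line)
        if 'END' in line:                 # assumed end-of-block marker
            outfile.close()
            part += 1
            outfile = open('smallfile%d.txt' % part, 'w')
outfile.close()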

Read the last line of the file

Imagine I have a file with
Xpto,50,30,60
Xpto,a,v,c
Xpto,1,9,0
Xpto,30,30,60
That txt file can be appended to many times, and when I open the file I only want to get the values from the last line. How can I do that in Python? By reading the last line?
I think my answer from the last time this came up was sadly overlooked. :-)
If you're on a unix box, os.popen("tail -10 " + filepath).readlines() will probably be the fastest way. Otherwise, it depends on how robust you want it to be. The methods proposed so far will all fall down, one way or another. For robustness and speed in the most common case you probably want something like a logarithmic search: use file.seek to go to end of the file minus 1000 characters, read it in, check how many lines it contains, then to EOF minus 3000 characters, read in 2000 characters, count the lines, then EOF minus 7000, read in 4000 characters, count the lines, etc. until you have as many lines as you need. But if you know for sure that it's always going to be run on files with sensible line lengths, you may not need that.
You might also find some inspiration in the source code for the unix tail command.
f.seek(pos, 2) seeks to 'pos' relative to the end of the file.
Try a reasonable value for pos, then readlines() and take the last line.
You have to account for when 'pos' is not a good guess, i.e. suppose you choose 300, but the last line is 600 chars long! In that case, just try again with a larger guess, until you capture the entire line. (This worst case should be very rare.)
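A minimal sketch of that guess-and-retry idea (the function name and the 300-byte starting guess are just for illustration; the file is opened in binary mode so seeking is safe):
def last_line(path, guess=300):
    with open(path, 'rb') as f:
        f.seek(0, 2)                      # jump to the end to learn the file size
        size = f.tell()
        while True:
            f.seek(max(size - guess, 0))  # back up by the current guess
            lines = f.readlines()
            # The last element is complete if the chunk contained an earlier
            # newline, or if we already read from the very start of the file.
            if len(lines) > 1 or guess >= size:
                return lines[-1]
            guess *= 2                    # guess was too small, try a bigger one
(For an empty file this would raise an IndexError; guarding against that is left out for brevity.)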
Um, why not just seek to the end of the file and read back until you hit a newline?
i = -2                     # start just before any trailing newline at EOF
while True:
    f.seek(i, 2)           # seek relative to the end of the file
    c = f.read(1)
    if c == '\n':
        break
    i -= 1                 # step one byte further back and try again
print f.readline()         # everything after that newline is the last line
Not sure about a python specific implementation, but in a more language agnostic fashion, what you would want to do is skip (seek) to the end of the file, and then read each character in backwards order until you reach the line feed character that your file is using, usually a character with value 10. Just read forward from that point to the end of the file, and you will have the last line in the file.
