I've read the documentation, but what does readlines(n) do? By readlines(n), I mean readlines(3) or any other number.
When I run readlines(3), it returns the same thing as readlines().
The optional argument is an approximate limit on how many bytes are read from the file. Reading continues past that limit until the current line ends:
readlines([size]) -> list of strings, each a line from the file.
Call readline() repeatedly and return a list of the lines so read.
The optional size argument, if given, is an approximate bound on the
total number of bytes in the lines returned.
Another quote:
If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that.
You're right that it doesn't seem to do much for small files, which is interesting:
In [1]: open('hello').readlines()
Out[1]: ['Hello\n', 'there\n', '!\n']
In [2]: open('hello').readlines(2)
Out[2]: ['Hello\n', 'there\n', '!\n']
One might think it's explained by the following phrase in the documentation:
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
However, even when I try to read the file without buffering, it doesn't seem to change anything, which means some other kind of internal buffer is meant:
In [4]: open('hello', 'r', 0).readlines(2)
Out[4]: ['Hello\n', 'there\n', '!\n']
On my system, this internal buffer size seems to be around 5k bytes / 1.7k lines:
In [1]: len(open('hello', 'r', 0).readlines(5))
Out[1]: 1756
In [2]: len(open('hello', 'r', 0).readlines())
Out[2]: 28080
Depending on the size of the file, readlines(hint) should return a smaller set of lines. From the documentation:
f.readlines() returns a list containing all the lines of data in the file.
If given an optional parameter sizehint, it reads that many bytes from the file
and enough more to complete a line, and returns the lines from that.
This is often used to allow efficient reading of a large file by lines,
but without having to load the entire file in memory. Only complete lines
will be returned.
So, if your file has thousands of lines, you can pass in, say, 65536, and each call will read only about that many bytes plus enough to complete the current line, returning only the lines that were read completely.
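The batching pattern the documentation describes could be sketched like this. It is a minimal Python 3 sketch, not a canonical recipe: the helper name iter_line_batches and the default of 65536 are made up for illustration.

```python
def iter_line_batches(path, sizehint=65536):
    """Yield lists of complete lines, each list totalling roughly sizehint characters."""
    with open(path, "r") as f:
        while True:
            batch = f.readlines(sizehint)
            if not batch:      # an empty list means end of file
                break
            yield batch
```

Each batch can then be processed and dropped before the next one is read, so only one batch of lines is held in memory at a time.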
It returns whole lines, starting from the current position, and stops once the lines read so far add up to more than n characters (n <= 0 means no limit).
Example: given a text file with the content
one
two
three
four
open('text').readlines(0) returns ['one\n', 'two\n', 'three\n', 'four\n']
open('text').readlines(1) returns ['one\n']
open('text').readlines(3) returns ['one\n']
open('text').readlines(4) returns ['one\n', 'two\n']
open('text').readlines(7) returns ['one\n', 'two\n']
open('text').readlines(8) returns ['one\n', 'two\n', 'three\n']
open('text').readlines(100) returns ['one\n', 'two\n', 'three\n', 'four\n']
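The example above can be reproduced with a short script. This is only a sketch, assuming CPython 3, where the hint counts characters in text mode. The helper name readlines_with_hint and the temporary file are illustrative. Results exactly at a line boundary may differ between implementations, so only clear-cut hints are tried here.

```python
import os
import tempfile

def readlines_with_hint(text, hint):
    """Write text to a temporary file and return readlines(hint) from a fresh handle."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w", newline="") as f:
            f.write(text)
        with open(path, "r", newline="") as f:
            return f.readlines(hint)
    finally:
        os.remove(path)

content = "one\ntwo\nthree\nfour\n"
for hint in (0, 1, 5, 100):
    print(hint, readlines_with_hint(content, hint))
```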
In this code.
Why is it :
for i in range(0, len(outputdata)):
    connectionSocket.send(outputdata[i].encode())
and not simply :
connectionSocket.send(outputdata.encode())
I tried the simpler version and it worked too, so why do it character by character?
Using the first approach, you are actually writing the content of outputdata to the client socket one character per iteration.
Assuming the actual length of outputdata variable is 100, here's what happens:
range() function is executed once
len() function is executed once
encode() function is executed 100 times
connectionSocket.send() is executed 100 times
So there, you have 202 invocations (1 + 1 + 100 + 100) in order to send just 100 bytes...
It doesn't look good.
Also, in this particular case it is done by the Python interpreter which also adds a significant overhead to executing those function calls.
Now, using the second approach:
encode() function is executed once
connectionSocket.send() is executed once
Thus, sending 100 bytes of data requires only two invocations in the code you've written.
So, generally, the second approach is clearly better.
Now, for the code you provided, neither the first nor the second approach is actually good, because outputdata is a variable that contains the ENTIRE FILE CONTENT IN MEMORY! (see fp.read() at line 16).
So, fp.read() reads all the content into the process memory and it WILL fail should the size of the file exceed the available memory.
The encode() function then creates a bytes object holding the encoded outputdata, so the process has to hold two copies at once: the text that was read from the file and the encoded version returned by encode(). That roughly doubles the amount of memory needed.
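One way to avoid both problems is to read and send the file in fixed-size chunks. The following is a minimal sketch, not the code from the link: the function name send_file and the 4096-byte chunk size are assumptions for illustration. It opens the file in binary mode, so no encode() step is needed, and uses sendall(), because plain send() may transmit only part of the data it is given.

```python
def send_file(sock, path, chunk_size=4096):
    """Send the file at path over a connected socket in chunks; return bytes sent."""
    sent = 0
    with open(path, "rb") as fp:       # binary mode yields bytes directly
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:              # empty bytes object means end of file
                break
            sock.sendall(chunk)        # keeps sending until the whole chunk is out
            sent += len(chunk)
    return sent
```

With this shape, memory use is bounded by chunk_size regardless of the file size, and the number of send calls is about the file size divided by chunk_size.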
P.S. And really, (and I mean, like, REALLY), the code by your link shouldn't ever be used in any kind of production environment under the penalty of death :)
You need to use realloc() or a similar function, which changes the size of an already allocated block. The program should look something like this:
Allocate a default-sized block with malloc.
Read the next number from your input.
If you got a number and this is not EOF (end of file), use realloc to extend the allocated size by 1 and put the new number at the end.
Keep doing this until you reach EOF.
Of course this is just one solution, and there may be others.
Another solution is some kind of a trick without using realloc(). You can read your file twice.
Open a file
Iterate through its content and find a size of the future array
Close your file
Allocate memory
Open a file again
Read numbers from the file and fill your array
P.S. In the future, try to be more specific when writing question titles.
I'm trying to read the contents of a 5GB file and then sort them and find duplicates. The file is basically just a list of numbers (each on a new line). There are no empty lines or any symbols other than digits. The numbers are all pretty big (at least 6 digits). I am currently using
for line in f:
    do something to line
to avoid memory problems. I am fine with using that. However, I am interested to know why readline() and readlines() didn't work for me. When I try
print f.readline(10)
the program always returns the same line no matter which number I use as a parameter. To be precise, if I do readline(0) it returns an empty line, even though the first line in the file is a big number. If I try readline(1) it returns 2, even though the number 2 is not in the file. When the parameter is >= 6, it always returns the same number: 291965.
Additionally, the readlines() method always returns the same lines no matter what the parameter is. Even if I try to print f.readlines(2), it's still giving me a list of over 1000 numbers.
I am not sure if I explained it very well. Sorry, English is not my first language. Anyway, I can make it work without the readline methods but I really want to know why they don't work as expected.
This is what the first 10 lines of the file look like:
548098
968516
853181
485102
69638
689242
319040
610615
936181
486052
I can not reproduce f.readline(1) returning 2, or f.readlines(10) returning "thousands of lines", but it seems like you misunderstood what the integer parameters to those functions do.
Those numbers do not specify the number of the line to read, but the maximum number of bytes readline will read.
>>> f = open("data.txt")
>>> f.readline(1)
'5'
>>> f.readline(100)
'48098\n'
Both commands will read from the first line, which is 548098; the first will only read 1 byte, and the second command reads the rest of the line, as there are fewer than 100 bytes left. If you call readline again, it will continue with the second line, etc.
Similarly, f.readlines(10) will read full lines until the total number of bytes read is larger than the specified number:
>>> f.readlines(10)
['968516\n', '853181\n']
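The same session can be written as a standalone script. This is a small sketch assuming Python 3; the temporary file stands in for data.txt and holds only the first four lines from the question.

```python
import os
import tempfile

# Throwaway file with the first lines of the sample data.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("548098\n968516\n853181\n485102\n")

with open(path) as f:
    first = f.readline(1)      # at most 1 character of the current line
    rest = f.readline(100)     # the remainder of that same line
    second = f.readline()      # no limit: the whole next line
os.remove(path)

print(repr(first), repr(rest), repr(second))
# prints: '5' '48098\n' '968516\n'
```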
Goal
Reading in a massive binary file approx size 1.3GB and change certain bits and then writing it back to a separate file (cannot modify original file).
Method
When I read in the binary file it gets stored in a massive string encoded in hex format which is immutable since I am using python.
My algorithm loops through the entire file and stores in a list all the indexes of the string that need to be modified. The catch is that all the indexes in the string need to be modified to the same value. I cannot do this in place due to immutable nature. I cannot convert this into a list of chars because that blows up my memory constraints and takes a hell lot of time. The viable thing to do is to store it in a separate string, but due to the immutable nature I have to make a ton of string objects and keep on concatenating to them.
I used some ideas from https://waymoot.org/home/python_string/ however it doesn't give me good performance. Any ideas? The goal is to copy an existing super long string exactly into another, except for certain positions determined by the values in the index list.
So, to be honest, you shouldn't be reading your file into a string. You especially shouldn't be writing anything but the bytes you actually change.
That is just a waste of resources, since you only seem to be reading linearly through the file, noting down the places that need to be modified.
On all OSes with some level of mmap support (that is, Unixes, among them Linux, OS X, *BSD and other OSes like Windows), you can use Python's mmap module to just open the file in read/write mode, scan through it and edit it in place, without the need to ever load it to RAM completely and then write it back out. Stupid example, replacing all 12-valued bytes with something position-dependent:
Note: this code is mine, and not MIT-licensed. It's for text-enhancement purposes and thus covered by CC-by-SA. Thanks SE for making this stupid statement necessary.
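import mmap

# Python 3: indexing an mmap yields ints, and files must be opened in binary mode
with open("infilename", "rb") as in_f, open("outfilename", "w+b") as out_f:
    # length = 0: complete file mapping (read-only)
    in_view = mmap.mmap(in_f.fileno(), 0, access=mmap.ACCESS_READ)
    length = in_view.size()
    # the output file must already have its final size before it can be mapped
    out_f.truncate(length)
    out_view = mmap.mmap(out_f.fileno(), length)
    for i in range(length):
        if in_view[i] == 12:
            out_view[i] = in_view[i] + i % 10
        else:
            out_view[i] = in_view[i]
    out_view.close()
    in_view.close()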
What about slicing the string, modifying each slice, and writing it back to disk before moving on to the next slice? Too intensive for the disk?
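The slice-by-slice idea could look roughly like this. It is a sketch under assumptions, not the asker's algorithm: the function name copy_with_replacements is invented, positions is assumed to be a collection of non-negative byte offsets, and new_value is assumed to be a single byte value (0 to 255) written at every one of them. Only one chunk is mutable and in memory at any time.

```python
def copy_with_replacements(src_path, dst_path, positions, new_value, chunk_size=1 << 20):
    """Copy src to dst, writing new_value at each byte offset listed in positions."""
    positions = sorted(positions)
    next_idx = 0      # index of the next offset that still has to be patched
    offset = 0        # file offset at which the current chunk starts
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = bytearray(src.read(chunk_size))   # mutable copy of this slice only
            if not chunk:
                break
            end = offset + len(chunk)
            # Patch every offset that falls inside the current chunk.
            while next_idx < len(positions) and positions[next_idx] < end:
                chunk[positions[next_idx] - offset] = new_value
                next_idx += 1
            dst.write(chunk)
            offset = end
```

Since both files are accessed sequentially, the disk cost is one read pass and one write pass, which is what any copy to a separate file needs anyway.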
Consider the following simple python code:
f=open('raw1', 'r')
i=1
for line in f:
    line1=line.split()
    for word in line1:
        print word,
    print '\n'
In the first for loop, i.e. "for line in f:", how does Python know that I want to read a line and not a word or a character?
The second loop is clearer, as line1 is a list, so the second loop will iterate over the list elements.
Python has a notion of what are called "iterables". They're things that know how to let you traverse some data they hold. Some common iterables are lists, sets, dicts, pretty much every data structure. Files are no exception to this.
The way things become iterable is by defining a method (__iter__) that returns an object with a next method (named __next__ in Python 3). This next method is meant to be called repeatedly and return the next piece of data each time. A for foo in bar loop is actually just calling the next method repeatedly behind the scenes.
For files, the next method returns lines, that's it. It doesn't "know" that you want lines, it's just always going to return lines. The reason for this is that ~50% of cases involving file traversal are by line, and if you want words,
for word in (word for line in f for word in line.split(' ')):
    ...
works just fine.
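What the for loop does behind the scenes can be spelled out by hand. This is an illustrative Python 3 sketch (there, the built-in next() calls the object's __next__ method); the helper names lines_via_iterator_protocol and iter_words are made up for this example.

```python
import io

def lines_via_iterator_protocol(f):
    """Collect lines the way a for loop does: call next() until StopIteration."""
    it = iter(f)
    lines = []
    while True:
        try:
            line = next(it)
        except StopIteration:   # raised by the file once EOF is reached
            break
        lines.append(line)
    return lines

def iter_words(f):
    """Yield words instead of lines by splitting each line on whitespace."""
    for line in f:
        for word in line.split():
            yield word

print(lines_via_iterator_protocol(io.StringIO("a b\nc\n")))
print(list(iter_words(io.StringIO("a b\nc\n"))))
```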
In Python, the for..in syntax is used over iterables (elements that can be iterated upon). For a file object, the iterator is the file itself.
Please refer to the documentation of the next() method; an excerpt is pasted below:
A file object is its own iterator, for example iter(f) returns f
(unless f is closed). When a file is used as an iterator, typically in
a for loop (for example, for line in f: print line), the next() method
is called repeatedly. This method returns the next input line, or
raises StopIteration when EOF is hit when the file is open for reading
(behavior is undefined when the file is open for writing). In order to
make a for loop the most efficient way of looping over the lines of a
file (a very common operation), the next() method uses a hidden
read-ahead buffer. As a consequence of using a read-ahead buffer,
combining next() with other file methods (like readline()) does not
work right. However, using seek() to reposition the file to an
absolute position will flush the read-ahead buffer. New in version
2.3.
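The first sentences of the excerpt can be checked directly. This is a small sketch assuming Python 3, where the built-in next(f) takes the place of calling f.next(); the temporary file is only there to have something to read.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("first\nsecond\n")

with open(path) as f:
    same_object = iter(f) is f    # a file object is its own iterator
    first = next(f)               # each call returns the next input line
    second = next(f)
    try:
        next(f)
        hit_eof = False
    except StopIteration:         # raised once EOF is hit
        hit_eof = True
os.remove(path)

print(same_object, repr(first), repr(second), hit_eof)
```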