I have written some code that keeps appending to a file. Here is the code:
count = 0
writel = open('able.csv', 'a', encoding='utf-8', errors='ignore')
with open('test', 'r', encoding='utf-8', errors='ignore') as file:
    for i in file.readlines():
        data = functionforprocess(i)
        if data != "":
            writel.write(data)
        count += 1
        if count % 10000 == 0:
            log = open('log', 'w')
            log.write(str(count))
            log.close()
My question is: is the file that I have opened in append mode held in RAM? Is it acting like a buffer, i.e. is storing the data in a variable and then writing that variable to the file equivalent to opening the file in append mode and writing to it directly?
Please help me clear up this confusion.
Appending is a basic function of file I/O and is carried out by the operating system. For instance, fopen with mode a or a+ is part of the POSIX standard. With file I/O, the OS will also tend to buffer reads and writes; for instance, for most purposes it's not necessary to make sure that the data that you've passed to write is actually on the disk all the time. Sometimes it sits in a buffer somewhere in the OS; sometimes the OS dumps these buffers out to disk. You can force writes using fsync if it's important to you; this is also a really good reason to make sure that you always call close on your open file objects when you're done with them (or use a context manager); if you forget, you might get weird behaviour because of those buffers hanging around in the OS.
So, to answer your question. The file that you opened is most likely in RAM at any given moment. However, as far as I know, it's not available to you. You can interact with the data in the file using file I/O methods, but it's not like there's a buffer that you can get the memory address of, and read back what you just wrote. As to if append-mode writing is equivalent to storing something in a buffer and then writing to disk, I guess I would say no. Any kind of file I/O write will probably be buffered the same way by the OS, and the reason this is efficient is that the OS gets to make the decision on when to flush the buffers. If you store things in a variable and then write them out atomically to disk, you get to decide when the writes take place.
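As a rough sketch of how the layers line up (reusing the file name from the question): write puts data into Python's buffer, flush pushes it to the OS, and os.fsync asks the OS to commit it to disk; closing the file, or leaving a with block, also flushes:
import os

with open('able.csv', 'a', encoding='utf-8') as out:
    out.write('some processed row\n')   # lands in Python's internal buffer
    out.flush()                          # Python's buffer -> OS buffers
    os.fsync(out.fileno())               # OS buffers -> disk
# leaving the with block closes the file, which flushes any remaining buffer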
The signature of the open function is:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
If you open in "a" (append) mode, it means: open for writing, appending to the end of the file if it exists. There is nothing about buffering.
Buffering can be customized with the buffering parameter. Quoting the doc:
buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
“Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.
In your example, your file is opened for append in text mode.
So, only a chunk of your data is stored in RAM during writing. If you write "big" data, it will be divided into several chunks.
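A small, illustrative check of that policy (the file name is arbitrary): the text layer wraps a buffered binary stream, and io.DEFAULT_BUFFER_SIZE shows the fallback chunk size mentioned above:
import io

print(io.DEFAULT_BUFFER_SIZE)      # commonly 8192

with open('able.csv', 'a', encoding='utf-8') as f:
    print(type(f.buffer))          # the underlying io.BufferedWriter that holds the chunk buffer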
I'm running a test and found that the file doesn't actually get written until I press Ctrl-C to abort the program. Can anyone explain why that would happen?
I expected it to write at the same time, so I could read the file in the middle of the process.
import os
from time import sleep

f = open("log.txt", "a+")
i = 0
while True:
    f.write(str(i))
    f.write("\n")
    i += 1
    sleep(0.1)
Writing to disk is slow, so many programs store up writes into large chunks which they write all-at-once. This is called buffering, and Python does it automatically when you open a file.
When you write to the file, you're actually writing to a "buffer" in memory. When it fills up, Python will automatically write it to disk. You can tell it "write everything in the buffer to disk now" with
f.flush()
This isn't quite the whole story, because the operating system will probably buffer writes as well. You can tell it to write the buffer of the file with
os.fsync(f.fileno())
Finally, you can tell Python not to buffer a particular file with open("log.txt", "wb", 0) (unbuffered mode is only allowed for binary files), or to keep a one-line buffer with open("log.txt", "w", 1). Naturally, this will slow down all operations on that file, because writes are slow.
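Applied to the loop in the question, a line-buffered variant might look like this (a sketch, bounded so it terminates); each line is handed to the OS as soon as the newline is written, so log.txt can be read while the program runs:
from time import sleep

with open("log.txt", "a", buffering=1) as f:   # 1 = line buffering, text mode only
    for i in range(100):
        f.write(str(i) + "\n")
        sleep(0.1)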
You need to call f.close() to flush the file write buffer out to the file. Or, in your case, you might just want to do f.flush(); os.fsync(f.fileno()) so you can keep looping with the opened file handle.
Don't forget to import os.
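A sketch of that suggestion applied to the loop from the question, flushing and fsyncing on every iteration so the data is on disk before the next sleep (the loop is bounded here just so the sketch terminates):
import os
from time import sleep

f = open("log.txt", "a+")
for i in range(100):
    f.write(str(i) + "\n")
    f.flush()                  # Python's buffer -> OS
    os.fsync(f.fileno())       # OS buffers -> disk
    sleep(0.1)
f.close()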
You have to force the write, so I use the following lines to make sure the file is written:
# flush() empties Python's buffer into the OS; fsync() forces the OS to write it to disk
f.flush()
os.fsync(f.fileno())
You will want to check out file.flush() - although take note that this might not write the data to disk, to quote:
Note:
flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior.
Closing the file (file.close()) will also ensure that the data is written - using with will do this implicitly, and is generally a better choice for more readability and clarity - not to mention solving other potential problems.
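One of those "other potential problems" is an exception raised mid-write: a with block still closes (and therefore flushes) the file on the way out, which a bare close() call placed after the write would miss. A minimal sketch:
try:
    with open("output.txt", "w", encoding="utf-8") as f:
        f.write("partial result\n")
        raise RuntimeError("failure in the middle of writing")
except RuntimeError:
    pass

with open("output.txt", encoding="utf-8") as check:
    print(check.read())    # "partial result" is there: the with block closed and flushed the file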
This is a Windows-ism. If you add an explicit .close() when you're done with the file, it'll appear in Explorer at that time. Even just flushing it might be enough (I don't have a Windows box handy to test). But basically f.write does not actually write, it just appends to the write buffer - until the buffer gets flushed you won't see the data.
On Unix the file will typically show up as a 0-byte file in this situation.
The file handle needs to be flushed:
f.flush()
The file does not get written because the output buffer is not flushed until garbage collection takes effect and flushes the I/O buffer (most likely by calling f.close()).
Alternately, in your loop, you can call f.flush() followed by os.fsync(f.fileno()), as documented here.
f.flush()
os.fsync(f.fileno())
All that being said, if you ever plan on sharing the data in that file with other portions of your code, I would highly recommend using a StringIO object.
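For completeness, a tiny sketch of that idea: io.StringIO behaves like a file but keeps everything in memory, so other parts of the program can read it back without touching the disk:
import io

buf = io.StringIO()
buf.write("0\n1\n2\n")
buf.seek(0)          # rewind before reading
print(buf.read())    # the data never leaves memory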
I am sniffing network packets using Tshark (command-line Wireshark) and writing them to a file just as I receive them. My code block is similar to the following:
import json
from queue import PriorityQueue

documents = PriorityQueue(maxsize=0)
writing_enabled = True

with open("output.txt", 'w') as opened_file:
    while writing_enabled:
        try:
            data = documents.get(timeout=1)
        except Exception:
            # No document pushed by producer thread
            continue
        opened_file.write(json.dumps(data) + "\n")
If I receive packets from the Tshark thread, I put them into the queue, and then another thread writes them to a file using the code above. However, after the file reaches 600+ MB, the process slows down and then changes its status to "Not Responding". After some research, I think this is because of the default buffering mechanism of open. Is it reasonable to change with open("output.txt", 'w') as opened_file:
into with open("output.txt", 'w', 1000) as opened_file: to use a 1000-byte buffer when writing? Or is there another way to overcome this?
To write the internal buffer out to the file you can use the file's flush function. However, this is generally handled for you by Python and the operating system, which use a default buffer size. You can open your file like this if you want to specify your own buffer size:
f = open('file.txt', 'w', buffering=bufsize)
Please also see the following question: How often does Python flush to file
Alternatively to flushing the buffer you could also try to use rolling files, i.e. open a new file if the size of your currently opened file exceeds a certain size. This is generally good practice if you intend to write a lot of data.
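A minimal sketch of the rolling-file idea (the names and the size limit here are made up for illustration): start a new output file whenever the current one grows past a threshold:
import json

def rolling_writer(prefix, max_bytes=100 * 1024 * 1024):
    index = 0
    f = open("%s.%d" % (prefix, index), "w")

    def write(record):
        nonlocal index, f
        f.write(json.dumps(record) + "\n")
        if f.tell() > max_bytes:      # rough size check; tell() is close enough here
            f.close()
            index += 1
            f = open("%s.%d" % (prefix, index), "w")

    return write

write = rolling_writer("output.txt")
write({"packet": 1})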
I am trying to open one file to read and another to write to, using the syntax below.
It is not clear to me whether the source file will be read completely and written completely before the record is inserted into the table (docs), or whether only the buffered size (1024) will be read, written into the (dest) file, and then the row created in the table.
I want to read many files in a folder and its subfolders and create an entry for each file in the table.
Or is it advisable to read the file in chunks, join the chunks together, and write them into the (dest) file at once?
If I don't specify the buffer size, will reading be limited by the available RAM?
with open(os.path.join(db_path, filename), 'rb') as src, \
     open(os.path.join(upload_folder, filename), 'wb') as dest:
    for chunk in iter(lambda: src.read(4096), b''):
        dest.write(chunk)
    if 1:  # inserting record into a table
        ins = docs.insert().values(
            file_name=filename,
            up_date=datetime.datetime.utcnow())
I have modified it based on what I could understand from your comment. Can you please let me know if it is right now?
You will read one chunk of up to 1024 bytes. If you want to read the entire file, you need to call read in a loop until it returns nothing.
Reading and writing one chunk at a time is advisable since your program can re-use the same memory and buffers instead of allocating more. You can experiment with the buffer size, though. I would not go below 4096, and the optimum may be somewhere around 8k to 16k. The optimum is determined by buffering in hardware and kernel buffer and page sizes.
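As an alternative to hand-rolling the chunk loop, shutil.copyfileobj does the same chunked copy; its length argument plays the role of the buffer size discussed above (the 16 KiB value is just an example):
import os
import shutil

# db_path, upload_folder and filename are the same variables as in the question above
with open(os.path.join(db_path, filename), 'rb') as src, \
     open(os.path.join(upload_folder, filename), 'wb') as dest:
    shutil.copyfileobj(src, dest, length=16 * 1024)   # copy in 16 KiB chunks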
I have a very large text file on disk. Assume it is 1 GB or more. Also assume the data in this file has a \n character every 120 characters.
I am using python-gnupg to encrypt on this file.
Since the file is so large, I cannot read the entire file into memory at one time.
However, the gnupg.encrypt() method that I'm using requires that I send in all the data at once -- not in chunks. So how can I encrypt the file without using up all my system memory?
Here is some sample code:
import gnupg

gpg = gnupg.GPG(gnupghome='/Users/syed.saqibali/.gnupg/')
for curr_line in open("VeryLargeFile.txt", "r").xreadlines():
    encrypted_ascii_data = gpg.encrypt(curr_line, "recipient@gmail.com")
    open("EncryptedOutputFile.dat", "a").write(encrypted_ascii_data)
This sample produces an invalid output file because I cannot simply concatenate encrypted blobs together into a file.
Encrypting line by line results in a vast number of OpenPGP documents, which will not only make decryption more complicated, but also massively blow up the file size and computation effort.
The GnuPG module for Python also provides a method encrypt_file, which takes a stream as input and has an optional output parameter to write the result directly to a file.
with open("VeryLargeFile.txt", "r") as infile:
gpg.encrypt_file(
infile, "recipient#example.org",
output="EncryptedOutputFile.dat")
This results in a streaming behavior with constant and low memory requirements.
I added a "b" (for binary) to the open command and it worked great for my code. Encrypting this way for some reason is slower than half the speed of encrpyting via shell/bash command though.
By definition, "a+" mode opens the file for both appending and reading. Appending works, but what is the method for reading? I did some searches, but couldn't find it clarified anywhere.
f=open("myfile.txt","a+")
print (f.read())
Tried this, it prints blank.
Use f.seek() to set the file offset to the beginning of the file.
Note: Before Python 2.7, there was a bug that would cause some operating systems to not have the file position always point to the end of the file. This could cause some users to have your original code work. For example, on CentOS 6 your code would have worked as you wanted, but not as it should.
f = open("myfile.txt","a+")
f.seek(0)
print f.read()
When you open the file using f = open("myfile.txt", "a+"), the file can be both read and written to.
By default the file handle points to the start of the file; this can be confirmed with f.tell(), which will be 0L.
In [76]: f=open("myfile.txt","a+")
In [77]: f.tell()
Out[77]: 0L
In [78]: f.read()
Out[78]: '1,2\n3,4\n'
However, f.write will take care of moving the pointer to the end of the file before writing.
There are still OS-dependent quirks in newer versions of Python, due to differences in the implementation of the fopen() function in stdio.
Linux's man fopen:
a+ - Open for reading and appending (writing at end of file). The file is created if it does not exist. The initial file position for reading is at the beginning of the file, but output is always appended to the end of the file.
OS X:
``a+'' - Open for reading and writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.
MSDN doesn't really state where the pointer is initially set, just that it moves to the end on writes.
When a file is opened with the "a" or "a+" access type, all write operations occur at the end of the file. The file pointer can be repositioned using fseek or rewind, but is always moved back to the end of the file before any write operation is carried out. Thus, existing data cannot be overwritten.
Replicating the differences on various systems with both Python 2.7.x and 3.x is pretty straightforward with open and tell.
When dealing with anything through the OS, it's safer to take precautions like using an explicit .seek(0).
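A short sketch of that precaution: seek(0) before reading, and note that writes in "a+" mode land at the end of the file regardless of where the offset was moved:
with open("myfile.txt", "a+") as f:
    f.seek(0)
    existing = f.read()        # read from the beginning
    f.write("new line\n")      # still appended at the end of the file
    print(existing)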
MODES
r+   read and write    Starts at the beginning of the file
r    read only         Starts at the beginning of the file
a+   read and append   Preserves file content by writing to the end of the file