Flush data written to numeric file handle? - python

How can I flush the content written to a file opened as a numeric file handle?
For illustration, one can do the following in Python:
f = open(fn, 'w')
f.write('Something')
f.flush()
By contrast, I can't find a corresponding method when doing the following:
import os
fd = os.open(fn, os.O_WRONLY | os.O_CREAT)  # os.open requires flags
os.pwrite(fd, buffer, offset)
# How do I flush fd here?

Use os.fsync(fd). See the docs for os.fsync.
Be careful if you call fsync on a file descriptor obtained from a Python file object: in that case you need to flush the Python file object first, otherwise data still sitting in its buffer will not reach the OS before the fsync.
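A minimal sketch of the whole sequence, with placeholder values for fn, buffer, and offset (none of these are defined in the question; note that os.pwrite is Unix-only):
import os

fn = 'out.bin'           # placeholder path
buffer = b'Something'    # placeholder data
offset = 0

fd = os.open(fn, os.O_WRONLY | os.O_CREAT)
os.pwrite(fd, buffer, offset)  # write at the given offset
os.fsync(fd)  # block until the kernel has flushed the data to disk
os.close(fd)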


Convert bytes to a file object in python

I have a small application that reads local files using:
open(diefile_path, 'r') as csv_file
open(diefile_path, 'r') as file
and also uses the linecache module.
I need to extend this to files sent from a remote server.
The content received from the server is of type bytes.
I couldn't find much information about handling the BytesIO type, and I was wondering whether there is a way to convert a chunk of bytes to a file-like object.
My goal is to use the APIs specified above (open, linecache).
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache) on the resulting string.
A small example to illustrate:
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    line = file.readline()
    print(line)
desired output:
First line
Second line
Third line
Can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io
data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier: it has a smaller memory footprint than the list, because it splits lines off as the iterator requests them.
The StringIO answer above needs to decode with a specific encoding, which may cause a wrong conversion if the encoding is incorrect.
If you want to work with the bytes directly, the Python documentation shows BytesIO:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")
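If you want text lines but prefer to keep the raw data as bytes, one option (a sketch, assuming the content is UTF-8) is to wrap a BytesIO in io.TextIOWrapper, which decodes lazily as you read:
import io

data = b'First line\nSecond line\nThird line\n'
file = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')
for line in file:
    print(line.strip())
This gives you the same readline/iteration API as open, while the decoding happens on demand instead of all at once.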

Convert file into BytesIO object using python

I have a file and want to convert it into BytesIO object so that it can be stored in database's varbinary column.
Can anyone please help me convert it using Python?
Below is my code:
import io

f = open(filepath, "rb")
print(f.read())
myBytesIO = io.BytesIO(f)
myBytesIO.seek(0)
print(type(myBytesIO))
Opening a file with open and mode read-binary already gives you a Binary I/O object.
Documentation:
The easiest way to create a binary stream is with open() with 'b' in the mode string:
f = open("myfile.jpg", "rb")
So in normal circumstances, you'd be fine just passing the file handle wherever you need to supply it. If you really want/need to get a BytesIO instance, just pass the bytes you've read from the file when creating your BytesIO instance like so:
from io import BytesIO
with open(filepath, "rb") as fh:
    buf = BytesIO(fh.read())
This has the disadvantage of loading the entire file into memory, which might be avoidable if the code you're passing the instance to is smart enough to stream the file without keeping it in memory. Note that the example uses open as a context manager that will reliably close the file, even in case of errors.
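If you do need a BytesIO but want to avoid building one giant intermediate bytes object, here is a sketch using shutil.copyfileobj to copy in fixed-size chunks (filepath is assumed from the question; the data still ends up in memory inside the BytesIO, but no second full-size copy is held during construction):
import io
import shutil

with open(filepath, "rb") as fh:
    buf = io.BytesIO()
    shutil.copyfileobj(fh, buf, 64 * 1024)  # copy in 64 KiB chunks
buf.seek(0)  # rewind so consumers read from the start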

Fast reading of gzip (text file) using io.BufferedReader in Python 3

I'm trying to efficiently read in, and parse, a compressed text file using the gzip module. This link suggests wrapping the gzip file object with io.BufferedReader, like so:
import gzip, io
gz = gzip.open(in_path, 'rb')
f = io.BufferedReader(gz)
for line in f.readlines():
    # do stuff
gz.close()
To do this in Python 3, I think gzip must be called with mode='rb'. So the result is that line is a binary string. However, I need line to be a text/ascii string. Is there a more efficient way to read in the file as a text string using BufferedReader, or will I have to decode line inside the for loop?
You can use io.TextIOWrapper to seamlessly wrap a binary stream to a text stream instead:
f = io.TextIOWrapper(gz)
Or, as @ShadowRanger pointed out, you can simply open the gzip file in text mode instead, so that the gzip module applies the io.TextIOWrapper for you:
for line in gzip.open(in_path, 'rt'):
    # do stuff
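A fuller sketch combining both pieces, assuming the compressed file contains UTF-8 text (passing encoding explicitly rather than relying on the locale default):
import gzip
import io

with gzip.open(in_path, 'rb') as gz:
    f = io.TextIOWrapper(io.BufferedReader(gz), encoding='utf-8')
    for line in f:
        print(line.rstrip('\n'))  # line is str, not bytes
Iterating the wrapper directly, rather than calling f.readlines(), also avoids materializing every line in memory at once.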

subprocess.Popen stdin read file

I'm trying to call a process on a file after part of it has been read. For example:
import subprocess

with open('in.txt', 'r') as a, open('out.txt', 'w') as b:
    header = a.readline()
    subprocess.call(['sort'], stdin=a, stdout=b)
This works fine if I don't read anything from a before doing the subprocess.call, but if I read anything from it, the subprocess doesn't see anything. This is using python 2.7.3. I can't find anything in the documentation that explains this behaviour, and a (very) brief glance at the subprocess source didn't reveal a cause.
If you open the file unbuffered then it works:
import subprocess
with open('in.txt', 'rb', 0) as a, open('out.txt', 'w') as b:
    header = a.readline()
    rc = subprocess.call(['sort'], stdin=a, stdout=b)
The subprocess module works at the file descriptor level (the low-level unbuffered I/O of the operating system). It may work with os.pipe(), socket.socket(), pty.openpty(), or anything else with a valid .fileno() method, if the OS supports it.
It is not recommended to mix buffered and unbuffered I/O on the same file.
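To illustrate the fd-level interface, a small sketch (not from the original answer) that feeds sort through a raw OS pipe; subprocess accepts a plain integer file descriptor for stdin:
import os
import subprocess

r, w = os.pipe()
os.write(w, b"banana\napple\ncherry\n")
os.close(w)  # close the write end so sort sees EOF
subprocess.call(['sort'], stdin=r)  # sort reads straight from the fd
os.close(r)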
On Python 2, file.flush() causes the output to appear e.g.:
import subprocess
# 2nd
with open(__file__) as file:
    header = file.readline()
    file.seek(file.tell())  # synchronize (for io.open and Python 3)
    file.flush()  # synchronize (for C stdio-based file on Python 2)
    rc = subprocess.call(['cat'], stdin=file)
The issue can be reproduced without the subprocess module by using os.read():
#!/usr/bin/env python
# 2nd
import os
with open(__file__) as file:  #XXX fully buffered text file EATS INPUT
    file.readline()  # ignore header line
    os.write(1, os.read(file.fileno(), 1<<20))
If the buffer size is small then the rest of the file is printed:
#!/usr/bin/env python
# 2nd
import os
bufsize = 2 #XXX MAY EAT INPUT
with open(__file__, 'rb', bufsize) as file:
    file.readline()  # ignore header line
    os.write(2, os.read(file.fileno(), 1<<20))
It eats more input if the first line size is not evenly divisible by bufsize.
The default bufsize and bufsize=1 (line-buffered) behave similarly on my machine: the beginning of the file vanishes -- around 4KB.
For all buffer sizes, file.tell() reports the position at the beginning of the 2nd line. Using next(file) instead of file.readline() leads to a file.tell() of around 5K on my machine on Python 2, due to the read-ahead buffer bug (io.open() gives the expected 2nd-line position).
Trying file.seek(file.tell()) before the subprocess call doesn't help on Python 2 with default stdio-based file objects. It works with open() functions from io, _pyio modules on Python 2 and with the default open (also io-based) on Python 3.
Trying io, _pyio modules on Python 2 and Python 3 with and without file.flush() produces various results. It confirms that mixing buffered and unbuffered I/O on the same file descriptor is not a good idea.
It happens because the subprocess module extracts the file descriptor from the file object.
http://hg.python.org/releasing/2.7.6/file/ba31940588b6/Lib/subprocess.py
See line 1126, reached from line 701.
The file object uses buffers and has already read far ahead of the file descriptor's position by the time subprocess extracts it.
As mentioned by @jfs:
when using Popen, the file descriptor is passed to the child process, but Python has been reading from the file in chunks (e.g. 4096 bytes) in the meantime.
The result is that the position at the fd level is different from what you would expect.
I solved it in Python 2.7 by realigning the file descriptor position:
import codecs
import os
import subprocess

_file = open(some_path)
_file.read(len(codecs.BOM_UTF8))  # skip the UTF-8 BOM
os.lseek(_file.fileno(), _file.tell(), os.SEEK_SET)  # align the fd position with the buffered position
truncate_null_cmd = ['tr', '-d', '\\000']
subprocess.Popen(truncate_null_cmd, stdin=_file, stdout=subprocess.PIPE)

"an integer is required" when open()'ing a file as utf-8?

I have a file I'm trying to open up in python with the following line:
f = open("C:/data/lastfm-dataset-360k/test_data.tsv", "r", "utf-8")
Calling this gives me the error
TypeError: an integer is required
I deleted all other code besides that one line and am still getting the error. What have I done wrong and how can I open this correctly?
From the documentation for open():
open(name[, mode[, buffering]])
[...]
The optional buffering argument specifies the file’s desired buffer
size: 0 means unbuffered, 1 means line buffered, any other positive
value means use a buffer of (approximately) that size. A negative
buffering means to use the system default, which is usually line
buffered for tty devices and fully buffered for other files. If
omitted, the system default is used.
You appear to be trying to pass open() a string describing the file encoding as the third argument instead. Don't do that.
You are using the wrong open.
>>> help(open)
Help on built-in function open in module __builtin__:
open(...)
open(name[, mode[, buffering]]) -> file object
Open a file using the file() type, returns a file object. This is the
preferred way to open a file. See file.__doc__ for further information.
As you can see, it expects the buffering parameter, which must be an integer.
What you probably want is codecs.open:
open(filename, mode='rb', encoding=None, errors='strict', buffering=1)
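For example, reusing the path from the question (a sketch; codecs.open returns a file object whose reads yield already-decoded unicode strings):
import codecs

f = codecs.open("C:/data/lastfm-dataset-360k/test_data.tsv", "r", encoding="utf-8")
first_line = f.readline()  # a unicode string, decoded from UTF-8
f.close()
On Python 2, io.open is another option; it takes the same encoding parameter as the Python 3 built-in open.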
From the help docs:
open(...)
open(file, mode='r', buffering=-1, encoding=None,
errors=None, newline=None, closefd=True) -> file object
You need encoding='utf-8'; Python thinks you are passing an argument for buffering.
The last parameter to open is the size of the buffer, not the encoding of the file.
File streams are more or less encoding-agnostic (with the exception of newline translation for files not opened in binary mode), so you should handle the encoding elsewhere: e.g., when you get the data from a read() call, you can interpret it as UTF-8 using its decode method.
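A sketch of that approach, reusing the path from the question: open in binary mode, then decode the bytes yourself:
with open("C:/data/lastfm-dataset-360k/test_data.tsv", "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")  # interpret the bytes as UTF-8 yourself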
This resolved my issue, i.e. providing an encoding (utf-8) while opening the file:
with open('tomorrow.txt', mode='w', encoding='UTF-8', errors='strict', buffering=1) as file:
    file.write(result)
