I have a small application that reads local files using:
with open(diefile_path, 'r') as csv_file:
with open(diefile_path, 'r') as file:
and also uses the linecache module.
I need to extend this to files that are sent from a remote server.
The content received from the server is of type bytes.
I couldn't find much information about handling the BytesIO type, and I was wondering whether there is a way to convert the bytes chunk into a file-like object.
My goal is to use the APIs specified above (open, linecache).
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache) on it.
A small example to illustrate:
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    for line in file:
        print(line)
Desired output:
First line
Second line
Third line
Can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io
data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier; it has a smaller memory footprint than the list, since it splits lines off as the iterator requests them.
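If you also need linecache-style access to a specific line number from the in-memory data, a minimal sketch (getline_from_bytes is a hypothetical helper, assuming UTF-8 content) could look like this:
import io

def getline_from_bytes(data, lineno):
    # Return the 1-based line `lineno`, similar to linecache.getline; '' if out of range.
    stream = io.StringIO(data.decode('utf-8'))
    for i, line in enumerate(stream, start=1):
        if i == lineno:
            return line
    return ''

data = b'First line\nSecond line\nThird line\n'
print(getline_from_bytes(data, 2).strip())  # Second line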
The answer above uses StringIO, which requires decoding with a specific encoding and may therefore convert the data incorrectly.
From the Python documentation, using BytesIO:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")
Related
I have a file and want to convert it into a BytesIO object so that it can be stored in a database's varbinary column.
Can anyone help me convert it using Python?
Below is my code:
f = open(filepath, "rb")
print(f.read())
myBytesIO = io.BytesIO(f)
myBytesIO.seek(0)
print(type(myBytesIO))
Opening a file with open and mode read-binary already gives you a Binary I/O object.
Documentation:
The easiest way to create a binary stream is with open() with 'b' in the mode string:
f = open("myfile.jpg", "rb")
So in normal circumstances, you'd be fine just passing the file handle wherever you need to supply it. If you really want/need to get a BytesIO instance, just pass the bytes you've read from the file when creating your BytesIO instance like so:
from io import BytesIO
with open(filepath, "rb") as fh:
    buf = BytesIO(fh.read())
This has the disadvantage of loading the entire file into memory, which might be avoidable if the code you're passing the instance to is smart enough to stream the file without keeping it in memory. Note that the example uses open as a context manager that will reliably close the file, even in case of errors.
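If the consumer accepts any readable file object, a chunked copy avoids reading the whole file at once; a sketch, where dst stands in for a hypothetical writable binary destination:
import shutil

with open(filepath, "rb") as fh:
    # copy in 64 KiB chunks instead of loading the whole file into memory
    shutil.copyfileobj(fh, dst, length=64 * 1024)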
I'm trying to efficiently read in, and parse, a compressed text file using the gzip module. This link suggests wrapping the gzip file object with io.BufferedReader, like so:
import gzip, io
gz = gzip.open(in_path, 'rb')
f = io.BufferedReader(gz)
for line in f.readlines():
    # do stuff
gz.close()
To do this in Python 3, I think gzip must be called with mode='rb'. So the result is that line is a binary string. However, I need line to be a text/ascii string. Is there a more efficient way to read in the file as a text string using BufferedReader, or will I have to decode line inside the for loop?
You can use io.TextIOWrapper to seamlessly wrap a binary stream to a text stream instead:
f = io.TextIOWrapper(gz)
Or as @ShadowRanger pointed out, you can simply open the gzip file in text mode instead, so that the gzip module will apply the io.TextIOWrapper wrapper for you:
for line in gzip.open(in_path, 'rt'):
    # do stuff
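If the file isn't UTF-8, both approaches accept explicit encoding/errors arguments; a sketch assuming the data is plain ASCII:
import gzip

with gzip.open(in_path, 'rt', encoding='ascii', errors='replace') as f:
    for line in f:
        pass  # line is a str here; do stuff with it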
I have a csv file compressed into a bz2 file that I'm trying to load from a website, decompress, and write to a local csv file with:
# Get zip file from website
archive = StringIO()
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# Extract the training data
data = bz2.decompress(archive.read())
# Write to csv
output_file = open('dataset_' + mode + '.csv', 'w')
output_file.write(data)
On the decompress call, I get IOError: invalid data stream. As a note, the csv file contained in the archive has quite a few characters that could be causing some issues. Particularly, if I try putting the file contents in unicode, I get an error about not being able to decode 0xfd. I only have the single file within the archive, but I'm wondering if something could also be going on due to not extracting a specific file.
Any ideas?
I suspect you are getting this error because the stream you are feeding the decompress() function is not a valid bz2 stream.
You must also "rewind" your StringIO buffer after writing to it; see the notes in the code comments below. The following code (the same as yours except for the imports and the seek() fix) works if the URL points to a valid bz2 file.
from StringIO import StringIO
import urllib2
import bz2
# Get zip file from website
url = "http://www.7-zip.org/a/7z920.tar.bz2" # just an example bz2 file
archive = StringIO()
# in case the request fails (e.g. 404, 500), this will raise
# a `urllib2.HTTPError`
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# will print how much compressed data you have buffered.
print "Length of file:", archive.tell()
# important!... make sure to reset the file descriptor read position
# to the start of the file.
archive.seek(0)
# Extract the training data
data = bz2.decompress(archive.read())
# Write to csv
output_file = open('output_file', 'w')
output_file.write(data)
re: encoding issues
Generally, character encoding errors will generate UnicodeError (or one of its cousins), not IOError. IOError suggests something is wrong with the input, like truncation, or some error that would prevent the decompressor from doing its work completely.
You have omitted the imports from your question, and one of the subtle differences between StringIO and cStringIO (according to the docs) is that cStringIO cannot work with unicode strings that cannot be converted to ASCII. That no longer seems to hold (in my tests at least), but it may be at play.
Unlike the StringIO module, this module (cStringIO) is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
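For reference, a rough Python 3 equivalent of the same flow (using io.BytesIO and urllib.request, with the same example URL) might look like:
import bz2
import io
from urllib.request import urlopen

url = "http://www.7-zip.org/a/7z920.tar.bz2"  # just an example bz2 file

archive = io.BytesIO(urlopen(url).read())  # binary in-memory buffer
data = bz2.decompress(archive.getvalue())  # decompressed bytes

with open('output_file', 'wb') as output_file:
    output_file.write(data)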
Uploading a file in Django (1.7) using Python 3:
f = form.files['file']
f.__repr__()
outputs
<InMemoryUploadedFile: index.html (text/html)>
If I call f.readline() I get bytes back.
Normally that would be okay, I could just read the file and decode it, however in this case I'm passing the file on to another function that expects to call readline() on the parameter it receives, and readline() needs to return unicode rather than bytes.
Is it possible to set encoding or such on an instance of InMemoryUploadedFile, so readline would return unicode rather than bytes? Or do I have to use StringIO to first read in the entire file and then pass the instance of StringIO to my function?
The general way to handle this may be to write a custom upload handler and tell Django to use it. But I've never done this, so I'm not sure.
But a simple approach would be to just wrap the underlying file object. (If you use TextIOWrapper instead of StringIO you shouldn't need to worry about the overhead.)
from io import TextIOWrapper
f = form.files['file']
text_f = TextIOWrapper(f.file, encoding='utf-8')
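The wrapped object can then be handed directly to code that expects readline() to return str; for example (process_lines is a hypothetical consumer):
def process_lines(fobj):
    # expects readline() to return str, not bytes
    first = fobj.readline()
    return first.upper()

print(process_lines(text_f))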
I'm using urllib2 to read in a page. I need to do a quick regex on the source and pull out a few variables, but urllib2 gives me a file-like object rather than a string.
I'm new to Python, so I'm struggling to see how to use a file object for this. Is there a quick way to convert it into a string?
You can use Python in interactive mode to search for solutions.
If f is your object, you can enter dir(f) to see all of its methods and attributes. There's one called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
From the docs for file.read() (my emphasis):
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
Be aware that a regexp search on a large string object may not be efficient, and consider doing the search line-by-line, using file.next() (a file object is its own iterator).
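For example, a line-by-line search could look like this (a sketch; the filename and pattern are placeholders):
import re

pattern = re.compile(r'<title>(.*?)</title>')  # placeholder pattern
with open('page.html') as f:  # any file-like object iterates the same way
    for line in f:  # read lazily, one line at a time
        match = pattern.search(line)
        if match:
            print(match.group(1))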
Michael Foord, aka Voidspace, has an excellent tutorial on urllib2, which you can find here:
urllib2 - The Missing Manual
What you are doing should be pretty straightforward, observe this sample code:
import urllib2
import re
response = urllib2.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read()
pattern = '(V.+space)'
wordPattern = re.compile(pattern, re.IGNORECASE)
results = wordPattern.search(html)
print results.groups()
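For readers on Python 3, where urllib2 became urllib.request and read() returns bytes, a rough equivalent might be:
import re
from urllib.request import urlopen

response = urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read().decode("utf-8", errors="replace")  # bytes -> str

word_pattern = re.compile(r"(V.+space)", re.IGNORECASE)
results = word_pattern.search(html)
if results:
    print(results.groups())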