Open a file for RawIOBase in Python

I need to read in the binary of a file for a function, and from this link https://docs.python.org/2/library/io.html, it looks like I should be using a RawIOBase object to read it in. But I can't find anywhere how to open a file to use with RawIOBase. Right now I have tried this to read the binary into a string:
with (open(documentFileName+".bin", "rb")) as binFile:
    document = binFile.RawIOBase.read()
    print document
but that throws the error AttributeError: 'file' object has no attribute 'RawIOBase'
So with no open attribute in RawIOBase, how do I open the file for it to read from?

Don't delve into the implementation details of the io thicket unless you need to code your own peculiar file-oid-like types! In your case,
with open(documentFileName+".bin", "rb") as binFile:
    document = binFile.read()
will be perfectly fine!
Note in passing that I've killed the superfluous parentheses you were using -- "no unneeded pixels!!!" -- but, while important, that's a side issue to your goal here.
Now, assuming Python 2, document is a str -- an immutable array of bytes. It may be confusing that displaying document shows it as a string of characters, but that's just Py2's confusion between text and byte strings (in Py3, the returned type would be bytes).
If you prefer to work with (e.g.) a mutable array of ints, use, e.g.,
theints = map(ord, document)
or, for a compact array of bytes that displays numerically,
import array
thearray = array.array('b', document)
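For completeness, here is a minimal sketch of the same read in Python 3, where a binary-mode read() already returns bytes, an immutable sequence of ints (the file name here is hypothetical):
# Python 3: open in binary mode and read the whole file
with open("document.bin", "rb") as bin_file:
    document = bin_file.read()    # bytes -- an immutable sequence of ints

print(list(document[:10]))        # the first ten byte values, as ints
mutable = bytearray(document)     # a mutable array of bytes, if needed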

Related

Python: pickle: No code suggestion after extracting string object from pickle file

For example, this is my code:
import pickle

# extract the object from "lastringa.pickle" and save it
extracted = ""
with open("lastringa.pickle", "rb") as f:
    extracted = pickle.load(f)
Where "lasting.pickle" contains a string object with some text.
So if I type extracted. before the opening of the file, I'm able to get code suggestions, as shown in the picture.
But then, after this operation extracted = pickle.load(f), if I type extracted. I don't get code suggestion anymore.
Can somebody explain to me why that is and how to solve it?
Pickle reads and writes objects as binary files. You can confirm this from the open('lastringa.pickle', 'rb') call, where you are using the rb option, i.e. read binary.
Your IDE doesn't know the type of the object that pickle is going to read, so it cannot suggest the string methods (e.g. .split(), .strip()).
On the other hand, in the first photo, your IDE knows that extracted is a string and it knows what to suggest.
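If you want the suggestions back, a common workaround is to tell the IDE the type yourself, for example with a type annotation or an isinstance check. A minimal sketch, assuming the pickle really does contain a str:
import pickle

with open("lastringa.pickle", "rb") as f:
    extracted: str = pickle.load(f)   # the annotation tells the IDE what to expect

# alternatively, narrow the type at runtime:
assert isinstance(extracted, str)
extracted.split()                     # suggestions work again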

ValueError Reading large data set with pd.read_json

I am working through a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json, which has one JSON record per line. I have made a smaller version of the JSON file, with only 100 records, for testing.
I can read the entire test file into a pandas dataframe and examine it.
The complete dataset file, however, has about 6 million lines. The recommendation is to use chunksize and build a json reader. I'm hitting errors, even with my test input.
My code currently looks like this
path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=10)
type(review_reader)
The type call returns
pandas.io.json.json.JsonReader
which looks good.
Then I try
for chunk in review_reader:
    print(chunk)
as referenced in the pandas user guide,
and I get an error:
ValueError: Unexpected character found when decoding 'false'
Update - it has been suggested that the issue is caused by embedded (quoted) "\n" characters in the data file; that pandas is seeing the JSON records as, not one per line, but multiple lines.
The error message is VERY opaque, if that's the case. Also, with 6 million lines, how should I tell pd.read_json to ignore "\n" and only look at actual newlines in the data?
Update
It's been suggested that I fix my typo (it was a typo in this post, not a typo in my code) and use a Unix file path instead of a URL (JSON doesn't care: see docs).
When I do this but keep StringIO(), I get a different ValueError.
When I do this but remove StringIO(), the code works.
This seems to be very fragile. :-(
Note: The tutorial has an answer key. I've tried that code. The answer key uses
review_reader = pd.read_json(filename, lines=True, chunksize=10)
which throws the TypeError
sequence item 0: expected str instance, bytes found
Adding StringIO() seems to have solved that.
Input: a sample JSON record, one per line of the input file.
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
Firstly, your
path 'file://localhost/Users/.../DSC_Intro/'
is not valid python code. If you try to execute that as such, you will get an invalid syntax error. I assume, then, that this is just showing the value of the path variable. I don't know whether those ellipses are literal or the result of your environment truncating the display of path. I'll assume here that your path is a valid file URL for your system as it doesn't seem germane here to consider an incorrect path.
Either way, yes, read_json can read json from a file URL as you're specifying there (I learned something there) if you read it in one go:
pd.read_json(fname, lines=True)
But if you try to create a reader from this, by specifying
pd.read_json(fname, lines=True, chunksize=...)
then you get
TypeError: sequence item 0: expected str instance, bytes found
Secondly, yes, wrapping your file-like argument with StringIO makes this error go away, but it isn't helping for the reason you might think, and its use is based on a misreading of the pandas docs you point to.
I'll quote a couple of bits from the read_json doc here:
Signature: pd.read_json(
path_or_buf=None, ...
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
So with read_json, you can either give it an actual string that is valid JSON, or you can give it a file-like object that points to a file that contains JSON.
Notice in the pandas docs that you cite:
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:
is JSON, not a path. When their example then does:
df = pd.read_json(jsonl, lines=True)
it is merely parsing the JSON in the string - no files are involved here.
When it then wants to demonstrate reading from a file in chunks, it does
# reader is an iterator that returns `chunksize` lines each iteration
In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
In other words, they are wrapping a JSON string, not a path, by StringIO(). This is just for the purposes of the documented example, so you can see that if you treated the JSON string as if it were being read from a file you can read it in chunks. That's what StringIO() does. So when you wrap the string that describes your file URL in StringIO(), I expect that read_json is then trying to interpret that string as JSON that's being read from a file and parse it. It understandably falls over because it isn't JSON.
This brings us back to why read_json cannot read your file URL in chunks. I don't have an immediate good answer to that. I suspect it lies in the internals of how read_json opens file URLs, or what function underlies this. If you were intent upon, or forced to, do this chunking from a file URL then I suspect you'd be looking at controlling the mode in which the file is opened, or perhaps somehow providing explicit guidance to read_json how to interpret the bytestream it gets. Libraries such as urllib2 may be useful here, I'm not sure.
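One workaround along those lines, purely as a sketch: open the file yourself and hand the file object to read_json, which sidesteps the URL handling entirely (the path keeps the question's elision):
import pandas as pd

# open in text mode so read_json sees str, not bytes
with open('/Users/.../DSC_Intro/yelp_dataset/review_100.json', 'rt') as f:
    for chunk in pd.read_json(f, lines=True, chunksize=10):
        print(chunk)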
But let's cut to the best fix here. Why are we trying to specify the path as a file URL? Simply specify your path as an OS path, e.g.
path = '/path/to/my/data/'
and then
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(filename, lines=True, chunksize=10)
And I betcha it works as intended! (It does for me, as it always has).
Caveat: Windows doesn't use forward-slash path delimiters, and constructing paths by concatenating strings in the above fashion can be fragile, but usually, if you use 'proper' forward-slash delimiters, decent languages internally understand that. It's constructing paths using backslashes that is guaranteed to cause you pain. But just keep an eye on that.
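If you want to avoid that fragility altogether, pathlib can build the path for you, and read_json accepts a Path object directly. A sketch, keeping the question's elided directory:
from pathlib import Path
import pandas as pd

path = Path('/Users/.../DSC_Intro')   # '...' elided as in the question
filename = path / 'yelp_dataset' / 'review_100.json'

# pathlib inserts the right separator for the platform
review_reader = pd.read_json(filename, lines=True, chunksize=10)
for chunk in review_reader:
    print(chunk)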

how to convert string into byte array in this particular scenario

I am following the tutorial which was designed for Python 2x in Python 3.5. I have made updates to the code in order to make it compatible, but I am falling at the last hurdle.
In Python 2.x the code moves between text and binary as and when needed, whereas Python 3.x draws a clear demarcation between the two.
I am trying to produce a CSV file for submission, but it requires that the data be in int32 format.
import csv

test_file = open('/Users/williamneal/Scratch/Titanic/test.csv', 'rt')
test_file_object = csv.reader(test_file)
header = test_file_object.__next__()  # the method must be called to consume the header row
This opens a file object before making it a reader object. I modified the original code, 'rb' --> 'rt', to account for the fact that by default Python returns a string.
prediction_file = open("/Users/williamneal/Scratch/Titanic/genderbasedmodel.csv", "wt")
prediction_file_object = csv.writer(prediction_file)
This opens a file object before making it a writer object. As previously, I modified the mode.
prediction_file_object.writerow([bytearray(b"PassengerId"), bytearray(b"Survived")])
for row in test_file_object:
    if row[3] == 'female':
        prediction_file_object.writerow([row[0], int(1)])
    else:
        prediction_file_object.writerow([row[0], int(0)])
test_file.close()
prediction_file.close()
I changed the values written in the for-loop into integers, but as I attempt to cast the strings into binary I get an error at the point of submission.
I am flummoxed: I do not see how my file can both have the headers "PassengerId" and "Survived" and also be type int32.
Any suggestions?
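For what it's worth, here is a minimal sketch of the same loop with the headers written as plain strings, which is what csv.writer expects from a file opened in text mode (paths and column names are taken from the question; whether the grader then accepts the file is a separate question):
import csv

with open('/Users/williamneal/Scratch/Titanic/test.csv', 'rt') as test_file, \
     open('/Users/williamneal/Scratch/Titanic/genderbasedmodel.csv', 'wt', newline='') as prediction_file:
    reader = csv.reader(test_file)
    writer = csv.writer(prediction_file)
    next(reader)                                   # skip the input header row
    writer.writerow(["PassengerId", "Survived"])   # plain str headers in text mode
    for row in reader:
        writer.writerow([row[0], 1 if row[3] == 'female' else 0])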

What are the appropriate argument/return types for a function to take binary files/streams/filenames and convert them to readable text format?

I have a function that's intended to take a binary file format and convert it to a readable text format, e.g.:
def textualize(binary_stuff):
    # magic to turn binary stuff into text
    return text_stuff
There are a few different types I could accept as input or produce as output, and I'm unsure what to use. Here are some options and corresponding objections I can think of:
Take a bytes object as input and return a string.
Problematic if, say, the input is originating from a huge file that now has to be read into memory.
Take a file-like object as input, read it, and return a string.
Relies on the caller to open the file in the right mode.
The asymmetry of this disturbs me for reasons I can't quite put a finger on.
Take two file-like objects; read from one and write to the other instead of returning anything.
Again relies on the caller to open the files in the right mode.
Makes the most common cases (named file to named file, or bytes to string) more unwieldy than they need to be.
Take two filenames and handle opening stuff myself.
What if the caller wants to convert data that isn't in a named file?
Accept multiple possible input types.
Possibly complicated to program.
Still leaves the question of what to return.
Is there an established Right Thing to do for conversions like this? Are there additional tradeoffs I'm missing?
You could do this the way the json module does it: one function for strings and another for files, leaving the opening and closing of files to the caller -- that gives the caller more flexibility. You could then use functools.singledispatch to dispatch between the variants, e.g.:
from functools import singledispatch
from io import BytesIO, StringIO, IOBase, TextIOBase

@singledispatch
def textualise(input, output):
    if not isinstance(input, IOBase):
        raise TypeError(input)
    if not isinstance(output, TextIOBase):
        raise TypeError(output)
    data = input.read().decode("utf-8")
    output.write(data)
    output.flush()

@textualise.register(bytes)
def textualise_bytes(bytes_):
    input = BytesIO(bytes_)
    output = StringIO()
    textualise(input, output)
    return output.getvalue()

@textualise.register(str)
def textualise_filenames(in_filename, out_filename):
    with open(in_filename, "rb") as input, open(out_filename, "wt") as output:
        textualise(input, output)

s = textualise(b"some text")
assert s == "some text"

textualise("inputfile.txt", "outputfile.txt")
I would personally avoid the third form, since bytes objects are also valid filenames. For example, textualise(b"inputfile.txt", "outputfile.txt") would get dispatched to the wrong function (textualise_bytes).
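A quick demonstration of that pitfall, using the code above (the file name is hypothetical):
# b"inputfile.txt" dispatches to textualise_bytes, not textualise_filenames,
# so the bytes are decoded as text rather than treated as a path:
print(textualise(b"inputfile.txt"))   # prints "inputfile.txt", not the file's contents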

Read file object as string in python

I'm using urllib2 to read in a page. I need to do a quick regex on the source and pull out a few variables, but urllib2 presents it as a file object rather than a string.
I'm new to python so I'm struggling to see how I use a file object to do this. Is there a quick way to convert this into a string?
You can use Python in interactive mode to search for solutions.
If f is your object, you can enter dir(f) to see all methods and attributes. There's one called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
From the docs for file.read() (my emphasis):
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
Be aware that a regexp search on a large string object may not be efficient, and consider doing the search line-by-line, using file.next() (a file object is its own iterator).
Michael Foord, aka Voidspace, has an excellent tutorial on urllib2, which you can find here:
urllib2 - The Missing Manual
What you are doing should be pretty straightforward, observe this sample code:
import urllib2
import re

# fetch the page and read its source into a string
response = urllib2.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read()

# compile a case-insensitive pattern and search the page source
pattern = '(V.+space)'
wordPattern = re.compile(pattern, re.IGNORECASE)
results = wordPattern.search(html)
print results.groups()
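In Python 3 the same pattern uses urllib.request, and read() returns bytes that must be decoded before a str regex can search them. A minimal sketch:
import re
import urllib.request

response = urllib.request.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read().decode("utf-8")   # bytes -> str before using a str pattern

results = re.search(r"(V.+space)", html, re.IGNORECASE)
print(results.groups())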
