Find the encoding of this byte string - python

Okay, this is probably a vague, ill-posed question, but I'm going to give it a try anyway.
I have read the first line of a .data file using
with open('raw_000.data', 'rb') as f:
    A = f.readline()
(Opening with just 'r' assumed a UTF-8 encoding, and that failed.)
This gave me the following byte string:
b'\x10\xa2\x8f\xbc-X\x98?\xfe\xd4\x17>\xdd\xda\x0e\xbf\xdc\xc5d?e\x19\x91?\xe0m\xb0<\xe7\xa8R#=\xca\xbd>\x94\x91\xb2\xbf\xba\xb3u>)\xbe\x01\xc0\x05\x1f\x83\xbf#\x04\xe2\xbf\x80\xbd;>\xe5\x0e<\xc0\x0cS0?\xbd\xcaG?\x15\x9c\x07\xc0lX\x9d?\xc5\xa3j\xc0X+D\xc0T\x91\xad?\x13\x87\xdd\xbfjCs?m\xdd\x02#\xebBi\xbf\xfc\xd8g=*NM\xbf&\x94&\xc0\x94\x91\xb2?=\xca\xbd>\xfc\xbfm\xbf\xf5\x96\x9f?\xf4\x8b\xc0\xbfAz\x12#X\xc6\xee\xbe\x84\t\xcf\xbf\x1d\xdb\x93\xbfpw\x19\xc0\xbc\xe0\x85>|\xd5\xa1?\xe5\x0e\xbc?\x80\xbd\xbb=|\xc0\xf7\xbe\\xc5\xda\xbe\xacB\xe4\xbf\x99\xbb\r#NGB\xbf\xaa\xbd~#;\xc0\xf2\xbf\x1a\xd1\xc8>\xdc\xc5\xe4\xbfe\x19\x11\xc0\x10\xa2\x8f<-X\x98\xbf\n'
Now this should contain some meaningful data, but I have no idea how to 'decode' it, as in what type of decoding to use.
All I know is that
chardet.detect('...')
gave
{'confidence': 0.0, 'encoding': None}.
Besides that, the file raw_000.data comes from an MRI machine by Philips; however, I could not find any documentation in that area either.
What other options do I have?

Well, apparently there is something called little-endian ordering. I never knew anything about it, but luckily there is a Wikipedia page:
https://en.wikipedia.org/wiki/Endianness
And of course, we can give this information to Python by using the struct package, like so:
import struct

with open('your_byte_file', 'rb') as f:
    A = f.readline()

# '<f' describes exactly one 4-byte float, so unpack just the first 4 bytes
res = struct.unpack('<f', A[:4])
Here the < sign tells struct that we are dealing with little-endian ordering, and the f tells it that we are expecting a 4-byte float; see the struct format documentation for details.
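Since the line holds many packed floats rather than just one, here is a fuller sketch. It assumes the file really is a flat array of little-endian 4-byte floats, which is only a guess about the Philips layout, not something the documentation confirms:
import struct

with open('raw_000.data', 'rb') as f:
    raw = f.read()  # read everything; readline() can split binary data at stray b'\n' bytes

n = len(raw) // 4                      # each '<f' value occupies 4 bytes
values = struct.unpack('<%df' % n, raw[:n * 4])
print(values[:10])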

Related

ValueError Reading large data set with pd.read_json

I am working through a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json, which has one JSON record per line. I have made a smaller version of the JSON file, with only 100 records, for testing.
I can read the entire test file into a pandas dataframe and examine it.
The complete dataset file, however, has about 6 million lines. The recommendation is to use chunksize and build a json reader. I'm hitting errors, even with my test input.
My code currently looks like this
from io import StringIO
import pandas as pd

path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=10)
type(review_reader)
The type call returns
pandas.io.json.json.JsonReader
which looks good.
Then I try
for chunk in review_reader:
    print(chunk)
as referenced in the pandas user guide,
and I get an error:
ValueError: Unexpected character found when decoding 'false'
Update: it has been suggested that the issue is caused by embedded (quoted) "\n" characters in the data file; that pandas is seeing the JSON records not as one per line, but as spanning multiple lines.
The error message is VERY opaque, if that's the case. Also, with 6 million lines, how should I tell pd.read_json to ignore "\n" and only look at actual newlines in the data?
Update
It's been suggested that I fix my typo (it was a typo in this post, not a typo in my code) and use a Unix file path instead of a URL (JSON doesn't care: see docs).
When I do this but keep StringIO(), I get a different ValueError.
When I do this but remove StringIO(), the code works.
This seems to be very fragile. :-(
Note: The tutorial has an answer key. I've tried that code. The answer key uses
review_reader = pd.read_json(filename, lines=True, chunksize=10)
which throws the TypeError
sequence item 0: expected str instance, bytes found
Adding StringIO() seems to have solved that.
Input: a sample JSON record, one per line of the input file.
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
Firstly, your
path 'file://localhost/Users/.../DSC_Intro/'
is not valid Python code. If you try to execute it as such, you will get an invalid-syntax error. I assume, then, that this is just showing the value of the path variable. I don't know whether those ellipses are literal or the result of your environment truncating the display of path. I'll assume here that your path is a valid file URL for your system, as it doesn't seem germane to consider an incorrect path.
Either way, yes, read_json can read JSON from a file URL as you're specifying there (I learned something there) if you read it in one go:
pd.read_json(fname, lines=True)
But if you try to create a reader from this, by specifying
pd.read_json(fname, lines=True, chunksize=...)
then you get
TypeError: sequence item 0: expected str instance, bytes found
Secondly, yes, wrapping your file-like argument with StringIO makes this error go away, but it isn't helping for the reason you might think, and its use is based on a misreading of the pandas docs you point to.
I'll quote a couple of bits from the read_json doc here:
Signature: pd.read_json(
path_or_buf=None, ...
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
So with read_json, you can either give it an actual string that is valid JSON, or you can give it a file-like object that points to a file that contains JSON.
Notice in the pandas docs that you cite:
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:
is JSON, not a path. When their example then does:
df = pd.read_json(jsonl, lines=True)
it is merely parsing the JSON in the string - no files are involved here.
When it then wants to demonstrate reading from a file in chunks, it does
# reader is an iterator that returns `chunksize` lines each iteration
In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
In other words, they are wrapping a JSON string, not a path, in StringIO(). This is just for the purposes of the documented example, so you can see that if you treated the JSON string as if it were being read from a file, you could read it in chunks. That's what StringIO() does. So when you wrap the string that describes your file URL in StringIO(), I expect that read_json then tries to interpret that string as JSON being read from a file and parse it. It understandably falls over, because it isn't JSON.
This brings us back to why read_json cannot read your file URL in chunks. I don't have an immediate good answer to that. I suspect it lies in the internals of how read_json opens file URLs, or in what function underlies this. If you were intent upon, or forced into, doing this chunking from a file URL, then I suspect you'd be looking at controlling the mode in which the file is opened, or perhaps somehow giving read_json explicit guidance on how to interpret the bytestream it gets. Libraries such as urllib2 may be useful here; I'm not sure.
But let's cut to the best fix here. Why are we trying to specify the path as a file URL? Simply specify your path as an OS path, e.g.
path = '/path/to/my/data/'
and then
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(filename, lines=True, chunksize=10)
And I betcha it works as intended! (It does for me, as it always has).
Caveat: Windows doesn't use forward-slash path delimiters natively, and constructing paths by concatenating strings as above can be fragile. Usually, if you use 'proper' forward-slash delimiters (smile), decent languages understand them internally; it's constructing paths with backslashes that is guaranteed to cause you pain. Just keep an eye on that.
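If portability is a concern, a small sketch using pathlib (with the same placeholder path as above) sidesteps the delimiter question entirely, since pandas accepts path-like objects:
from pathlib import Path
import pandas as pd

# pathlib joins path components with the right delimiter for the OS
filename = Path('/path/to/my/data') / 'yelp_dataset' / 'review_100.json'
review_reader = pd.read_json(filename, lines=True, chunksize=10)
for chunk in review_reader:
    print(chunk.shape)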

file_put_contents and iconv equivalents in Python?

What I want is extremely simple and can be done in PHP with literally one line of code:
file_put_contents('target.txt', iconv('windows-1252', 'utf-8', file_get_contents('source.txt')));
In Python I spent a whole day trying to figure out how to achieve the same trivial thing, but to no avail. When I try to read or write files I usually get UnicodeDecodeError, "str has no method decode", and a dozen similar errors. It seems like I have scanned all the threads on SO, but I still do not know how to do this.
Are you specifying the "encoding" keyword argument when you call open?
with open('source.txt', encoding='windows-1252') as f_in:
    with open('target.txt', 'w', encoding='utf-8') as f_out:
        f_out.write(f_in.read())
Since Python 3.5 you can write:
from pathlib import Path

Path('target.txt').write_text(
    Path('source.txt').read_text(encoding='windows-1252'),
    encoding='utf-8'
)
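If the file is too large to read in one go, a streaming sketch (same filenames assumed) works as well; shutil.copyfileobj copies between the two handles in chunks, re-encoding on the fly:
import shutil

with open('source.txt', encoding='windows-1252') as f_in, \
     open('target.txt', 'w', encoding='utf-8') as f_out:
    shutil.copyfileobj(f_in, f_out)  # chunked copy; the text handles do the decoding/encoding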

Reading a binary file as plain text using Python

A friend of mine has written simple poetry using C's fprintf function. It was written using the 'wb' option so the generated file is in binary. I'd like to use Python to show the poetry in plain text.
What I'm currently getting are lots of strings like this: ��������
The code I am using:
with open("read-me-if-you-can.bin", "rb") as f:
    print f.read()
The thing is, when dealing with text written to a file, you have to know (or correctly guess) the character encoding used when writing said file. If the program reading the file is assuming the wrong encoding here, you will end up with strange characters in the text if you're lucky and with utter garbage if you're unlucky.
Don't try to guess, try to know: ask your friend what character encoding he or she used when writing the poetry to the file. You then have to open the file in Python specifying that character encoding. Let's say the answer is "UTF-16-LE" (for the sake of example); you then write:
with open("poetry.bin", encoding="utf-16-le") as f:
    print(f.read())
It seems you're on Python 2 still though, so there you write:
import io

with io.open("poetry.bin", encoding="utf-16-le") as f:
    print f.read()
You could start by trying UTF-8 first, though, as that is a commonly used encoding.
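If asking really isn't possible, a rough sketch like this (Python 3; the candidate list is just a guess, not exhaustive) lets you eyeball a few common encodings:
candidates = ['utf-8', 'utf-16-le', 'latin-1']
with open("read-me-if-you-can.bin", "rb") as f:
    raw = f.read()
for enc in candidates:
    try:
        print('--- %s ---' % enc)
        print(raw.decode(enc)[:200])  # show a sample of the decoded text
    except UnicodeDecodeError:
        print('%s: failed' % enc)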

how does pickle know which to pick?

I have my pickle function working properly
with open(self._prepared_data_location_scalar, 'wb') as output:
    # company1 = Company('banana', 40)
    pickle.dump(X_scaler, output, pickle.HIGHEST_PROTOCOL)
    pickle.dump(Y_scaler, output, pickle.HIGHEST_PROTOCOL)

with open(self._prepared_data_location_scalar, 'rb') as input_f:
    X_scaler = pickle.load(input_f)
    Y_scaler = pickle.load(input_f)
However, I am very curious: how does pickle know which object to load? Does it mean that everything has to be read back in the same sequence?
What you have is fine. It's a documented feature of pickle:
It is possible to make multiple calls to the dump() method of the same Pickler instance. These must then be matched to the same number of calls to the load() method of the corresponding Unpickler instance.
There is no magic here, pickle is a really simple stack-based language that serializes python objects into bytestrings. The pickle format knows about object boundaries: by design, pickle.dumps('x') + pickle.dumps('y') is not the same bytestring as pickle.dumps('xy').
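A quick way to see this boundary-keeping in action (a standard-library-only sketch):
import io
import pickle

# Two dumps into one buffer; two loads give the objects back in order.
buf = io.BytesIO(pickle.dumps('x') + pickle.dumps('y'))
print(pickle.load(buf))  # 'x'
print(pickle.load(buf))  # 'y'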
If you're interested to learn some background on the implementation, this article is an easy read to shed some light on the python pickler.
Wow, I did not even know you could do this, and I have been using Python for a very long time, so that's totally awesome in my book. However, you really should not do this: it will be very hard to work with later (especially if it isn't you working on it).
I would recommend just doing
pickle.dump({"X": X_scaler, "Y": Y_scaler}, output)
...
data = pickle.load(fp)
print "Y_scaler:", data['Y']
print "X_scaler:", data['X']
unless you have a very compelling reason to save and load the data the way you were in your question ...
Edit, to answer the actual question: it loads from the start of the file to the end (i.e. it loads the objects in the same order they were dumped).
Yes, pickle picks objects in the order they were saved.
Intuitively, pickle appends to the end of the file when it writes (dump), and reads the content sequentially from the file (load).
Consequently, order is preserved, allowing you to retrieve your data in the exact order you serialized it.
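If you don't know in advance how many objects a file holds, a common sketch (load_all is a hypothetical helper name, not a pickle API) is to keep loading until the file runs out:
import pickle

def load_all(path):
    # Load every pickled object from the file, in dump order.
    objects = []
    with open(path, 'rb') as f:
        while True:
            try:
                objects.append(pickle.load(f))
            except EOFError:  # raised when no pickled data remains
                break
    return objects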

Decompressing a .bz2 file in Python

So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the decompressing of that file is giving me a MAJOR headache.
I'm quite a Python newbie, so the answer is probably quite obvious, please help me.
In this bit of the script, I already have the file, and I just want to read it out to a variable, then decompress it? Is that right? I've tried all sorts of ways to do this; I usually get a "ValueError: couldn't find end of stream" error on the last line of this snippet. I've tried to open up the zipfile and write it out to a string in a zillion different ways. This is the latest.
openZip = open(zipFile, "r")
s = ''
while True:
    newLine = openZip.readline()
    if len(newLine) == 0:
        break
    s += newLine
print s
uncompressedData = bz2.decompress(s)
Hi Alex, I should've listed all the other methods I've tried, as I've tried the read() way.
METHOD A:
print 'decompressing ' + filename
fileHandle = open(zipFile)
uncompressedData = ''
while True:
    s = fileHandle.read(1024)
    if not s:
        break
    print('RAW "%s"', s)
    uncompressedData += bz2.decompress(s)
uncompressedData += bz2.flush()
newFile = open(steamTF2mapdir + filename.split(".bz2")[0], "w")
newFile.write(uncompressedData)
newFile.close()
I get the error:
uncompressedData += bz2.decompress(s)
ValueError: couldn't find end of stream
METHOD B
zipFile = steamTF2mapdir + filename
print 'decompressing ' + filename
fileHandle = open(zipFile)
s = fileHandle.read()
uncompressedData = bz2.decompress(s)
Same error :
uncompressedData = bz2.decompress(s)
ValueError: couldn't find end of stream
Thanks so much for your prompt reply. I'm really banging my head against the wall, feeling inordinately thick for not being able to decompress a simple .bz2 file.
By the by, I used 7zip to decompress it manually, to make sure the file isn't wonky or anything, and it decompresses fine.
You're opening and reading the compressed file as if it were a text file made up of lines. DON'T! It's NOT.
uncompressedData = bz2.BZ2File(zipFile).read()
seems to be closer to what you're angling for.
Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!), but they all seem to have one error in common, and I repeat the key bits from above:
opening ... the compressed file as if
it was a textfile ... It's NOT.
open(filename) and even the more explicit open(filename, 'r') open a file for reading as text -- a compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). (My recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more.)
In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes -- different types).
In Windows (and before that in DOS) it has always been indispensable to distinguish, as Windows text files are, for historical reasons, peculiar (they use two bytes rather than one to end lines and, at least in some cases, take a byte worth '\x1A' as meaning a logical end-of-file), so the low-level reading and writing code must compensate.
So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in. (though bz2.BZ2File is still simpler, whatever platform you're using!-).
openZip = open(zipFile, "r")
If you're running on Windows, you may want to say openZip = open(zipFile, "rb") here, since the file is likely to contain CR/LF combinations and you don't want them to be translated.
newLine = openZip.readline()
As Alex pointed out, this is very wrong, as the concept of "lines" is foreign to a compressed stream.
s = fileHandle.read(1024)
[...]
uncompressedData += bz2.decompress(s)
This is wrong for the same reason. 1024-byte chunks aren't likely to mean much to the decompressor, since it's going to want to work with its own block size.
s = fileHandle.read()
uncompressedData = bz2.decompress(s)
If that doesn't work, I'd say it's the new-line translation problem I mentioned above.
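For what it's worth, if you ever genuinely need to decompress in chunks (say, for files too large for memory), the stateful decompressor is the tool for that job. A Python 3 sketch, with 'map.bsp.bz2' as a stand-in filename:
import bz2

# BZ2Decompressor keeps state across calls, so arbitrary-sized raw chunks
# are fine -- unlike feeding 1024-byte slices to bz2.decompress()
decomp = bz2.BZ2Decompressor()
with open('map.bsp.bz2', 'rb') as f_in, open('map.bsp', 'wb') as f_out:
    for chunk in iter(lambda: f_in.read(64 * 1024), b''):
        f_out.write(decomp.decompress(chunk))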
This was very helpful.
44 of 2300 files gave an "end of file missing" error when opened on Windows; adding the b(inary) flag to open fixed the problem.
for line in bz2.BZ2File(filename, 'rb', 10000000):
works well. (The 10M is the buffering size that works well with the large files involved.)
Thanks!
