How do I decode garbled text from the Library of Congress? - python

I am making a Z39.50 search in Python, but have a problem with decoding the search results.
The first search result for "harry potter" is apparently a Hebrew version of the book.
How can I turn this into Unicode?
This is the minimal code I use to fetch a record:
#!/usr/bin/env python
# encoding: utf-8
from PyZ3950 import zoom
from PyZ3950 import zmarc
conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName = 'VOYAGER'
query = zoom.Query('CCL', 'ti="HARRY POTTER"')
res = conn.search(query)
print "%d hits:" % len(res)
for r in res[:1]:
    print unicode(r.data)
Running the script results in "UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 788: ordinal not in range(128)"

r.data.decode('windows-1255').encode('utf-8')
You'll have to figure out the correct encoding they used, and put that in place of 'windows-1255' (which might work, if you're right about the Hebrew guess).
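If you want to probe a few candidate encodings quickly, something like this works (illustrative only; the list of encodings is a guess):
for enc in ('windows-1255', 'iso-8859-8', 'utf-8'):
    try:
        print enc, ':', r.data.decode(enc)[:60]
    except UnicodeDecodeError as e:
        print enc, 'failed:', e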

I'm trying to reproduce your problem, but am running into the Python equivalent of "DLL Hell". Please specify which version of each of Python, PyZ3950, and PLY you are using.
You will note from the error message that there are 788 bytes of ASCII before you get a non-ASCII byte. That doesn't sound like Hebrew/Arabic/Greek/Cyrillic/etc., which use non-ASCII bytes to represent the characters most often used in those languages.
Instead of print unicode(r.data), do print type(r.data), repr(r.data) and edit your question to show the results.
Update I managed to get it running with the latest versions of PyZ3950 and PLY under Python 2.6 -- it needed from ply import lex instead of import lex in PyZ3950/ccl.py (and likewise fixed import yacc).
Here are the results of dumping hit 0 and hit 200:
>>> print repr(res[0].data)
"01688cam 22003614a 45000010009000000050017000090080041000260350018000670350020
00085906004500105925004400150955002400194010001700218020001500235040001300250041
00130026305000180027610000270029488000540032124000330037524501270040888001620053
52460070006972600092007678800200008593000029010594900019010888800045011077000029
01152880006301181700002901244880005301273\x1e16012113\x1e20091209015332.0\x1e091
208s2008 is a 000 1 heb \x1e \x1fa(DLC)16012909\x1e \x1fa(DLC)200
9664431\x1e \x1fa0\x1fbibc\x1fcorignew\x1fd3\x1fencip\x1ff20\x1fgy-nonroman\x1e
0 \x1faacquire\x1fb1 shelf copies\x1fxpolicy default\x1e \x1fbcd06 2009-12-08 I
BC\x1e \x1fa 2009664431\x1e \x1fa965511564X\x1e \x1faDLC\x1fcDLC\x1e1 \x1fah
eb\x1fheng\x1e00\x1faPZ40.R685\x1fbH+\x1e1 \x1f6880-01\x1faRowling, J. K.\x1e1 \
x1f6100-01/(2/r‏\x1fa\x1b(2xelipb, b\x1b(B'\x1b(2i. wi.\x1b(B\x1e10\x1faH
arry Potter and ??.\x1flHebrew\x1e10\x1f6880-02\x1faHari Po\xf2ter \xf2ve-misdar
\xb0of ha-\xf2hol ? /\x1fcG'e. \xf2Ke. Roling ; me-Anglit, Gili Bar-Hilel Samu
; iyurim, Mery Granpreh.\x1e10\x1f6245-02/(2/r‏\x1fa‏\x1b(2d`xi te
hx e........‏ /\x1b(B\x1fc‏\x1b(2b\x1b(B'\x1b(2i. wi. xelipb ; n`p
bliz, bili ax\x1b(B-\x1b(2dll qne ; `iexim, nxi bx`ptxd.\x1b(B\x1e1 \x1fiTitle o
n t.p. verso:\x1faHarry Potter and the order of the phoenix ?\x1e \x1f6880-03\x
1faTel-Aviv :\x1fbYedi\xb0ot a\xf2haronot :\x1fbSifre \xf2hemed :\x1fbSifre \xb0
Aliyat ha-gag,\x1fcc[2008]\x1e \x1f6260-03/(2/r‏\x1fa‏\x1b(2zl\x1
b(B-\x1b(2`aia‏ :\x1b(B\x1fb\x1b(2icirez `gxepez :‏\x1b(B\x1fb&#x2
00f;\x1b(2qtxi gnc :‏\x1b(B\x1fb‏\x1b(2qtxi rliiz dbb,‏\x1b
(B\x1fc‏‪[2008]‬\x1e \x1fa887 p. :\x1fbill. ;\x1fc21 cm.\x
1e0 \x1f6880-04\x1faProzah\x1e0 \x1f6490-04/(2/r‏\x1fa‏\x1b(2txefd
\x1b(B\x1e1 \x1f6880-05\x1faBar-Hilel, Gili.\x1e1 \x1f6700-05/(2/r‏\x1fa&
#x200f;\x1b(2ax\x1b(B-\x1b(2dll qne, bili.\x1b(B\x1e1 \x1f6880-06\x1faGrandPr\xe
2e, Mary.\x1e1 \x1f6700-06/(2/r‏\x1fa‏\x1b(2bx`ptxd, nxi.\x1b(B\x1
e\x1d"
>>> print repr(res[200].data)
"01427cam 22003614a 45000010009000000050017000090080041000269060045000679250044
00112955017900156010001700335020001800352020001500370035002400385040001800409042
00140042705000220044110000280046324501160049126000760060730000200068344000350070
35040041007386500018007796500013007976500017008106500041008276000019008686000039
00887600004800926710005900974923003201033\x1e14882660\x1e20070925153312.0\x1e070
607s2007 ie b 000 0 eng d\x1e \x1fa7\x1fbcbc\x1fccopycat\x1fd3\x1fe
ncip\x1ff20\x1fgy-gencatlg\x1e0 \x1faacquire\x1fb2 shelf copies\x1fxpolicy defau
lt\x1e \x1fanb05 2007-06-07 z-processor ; nb05 2007-06-07 to HLCD for processin
g;\x1falk21 2007-08-09 to sh00\x1fish21 2007/09-18 (telework)\x1fesh49 2007-09-2
0 to BCCD\x1fesh45 2007-09-25 (Revised)\x1e \x1fa 2007390561\x1e \x1fa9780955
492617\x1e \x1fa0955492610\x1e \x1fa(OCoLC)ocn129545188\x1e \x1faVYF\x1fcVYF\
x1fdDLC\x1e \x1falccopycat\x1e00\x1faBT1105\x1fb.H44 2007\x1e1 \x1faHederman, M
ark Patrick.\x1e10\x1faHarry Potter and the Da Vinci code :\x1fb'Thunder of a Ba
ttle fought in some other Star' /\x1fcMark Patrick Hederman.\x1e \x1faDublin :\
x1fbDublin Centre for the Study of the Platonic Tradition,\x1fc2007.\x1e \x1fa3
8 p. ;\x1fc21 cm.\x1e 0\x1faPlatonic Centre pamphlets ;\x1fv2\x1e \x1faIncludes
bibliographical references.\x1e 0\x1faChristianity.\x1e 0\x1faMystery.\x1e 0\x1
faImagination.\x1e 0\x1faPotter, Harry (Fictitious character)\x1e10\x1faRowling,
J. K.\x1e10\x1faBrown, Dan,\x1fd1964-\x1ftDa Vinci code.\x1e10\x1faYeats, W. B.
\x1fq(William Butler),\x1fd1865-1939.\x1e2 \x1faDublin Centre for the Study of t
he Platonic Tradition.\x1e \x1fd20070411\x1fn565079784\x1fsKennys\x1e\x1d"
You will notice that there are quite a few \x1e and \x1f bytes in the "ASCII" part before the place where it blew up. There's also a \x1d at the end of each dump. (GROUP|UNIT|RECORD) SEPARATORs, perhaps. You will also notice that the second record looks like gobbledegook too, but it doesn't mention Hebrew.
Conclusion: Forget Hebrew. Forget Unicode -- that stuff is NOT the result of sensible_unicode_text.encode("any_known_encoding"). Z3950 reeks of punched cards and magnetic drums and tapes. If it knows about Unicode, it's not evident in that data.
Looks like you need to read the ZOOM API docs that come with PyZ3950, and that will lead you to the ZOOM docs ... good luck.
Update 2
>>> r0 = res[0]
>>> dir(r0)
['__doc__', '__init__', '__module__', '__str__', '_rt', 'data', 'databaseName',
'get_field', 'get_fieldcount', 'is_surrogate_diag', 'syntax']
>>> r0.syntax
'USMARC'
>>>
Looks like you need to understand MARC.
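Incidentally, those control bytes are MARC's structural delimiters: \x1d is the record terminator, \x1e the field terminator, and \x1f the subfield delimiter. A rough split shows the structure (illustrative only, not a substitute for a real MARC parser):
fields = res[0].data.rstrip('\x1d').split('\x1e')
for field in fields[:5]:
    print repr(field)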
Update 3
Noticed BIDI stuff like ‏‪[2008]‬ in the first dump ... so you'll end up with Unicode eventually, AFTER you drop down through the levels of the docs working out what's wrapped in what ... again, good luck!

You need to convert the MARC data for this.
You can use the code below:
from pymarc import MARCReader
temp_list = []
for i in range(0, 2):  # you can use len(res) here for all results
    temp_list.append(res[i].data)
for data in temp_list:
    reader = MARCReader(data)
    for record in reader:
        print record.title(), record.author()
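Note that MARCReader also takes a to_unicode flag; for MARC-8 records like these Library of Congress ones, to_unicode=True asks pymarc to convert the record to Unicode (a sketch; whether it copes with every field in this particular record is untested here):
reader = MARCReader(res[0].data, to_unicode=True)
for record in reader:
    print record.title(), record.author()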

Related

Data reading - csv

I have some data in a .DFX file and I'm trying to read it as CSV with pandas. But it contains some special characters which pandas does not read; they serve as separators as well. I attached one line from it.
The "DC4" character is removed when I print the file. The SI is read as a space, correctly. I tried various encodings (utf-8, latin1, etc.), but with no success.
I attached the printed first line as well, and marked the place where the characters should be.
My code is simple:
import pandas
file_log = pandas.read_csv("file_log.DFX", header=None)
print(file_log)
I hope I was clear and someone has an idea.
Thanks in advance!
EDIT:
The input. LINK: drive.google.com/open?id=0BxMDhep-LHOIVGcybmsya2JVM28
The expected output:
88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839
30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033
By examining the example.DFX in hex (with xxd), the two separators turn out to be 0x14 and 0x0f, respectively.
Read the csv with multiple separators using python engine:
import pandas
sep1 = chr(0x14) # the one shows dc4
sep2 = chr(0x0f) # the one shows si
file_log = pandas.read_csv('example.DFX', header=None, sep='{}|{}'.format(sep1, sep2), engine='python')
print(file_log)
And you get:
0 1 2 3 4 5 6 7
0 88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839 NaN
1 30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033 NaN
It seems it has an empty column at the end. But I'm sure you can handle that.
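For instance, you can drop the trailing all-NaN column:
file_log = file_log.dropna(axis=1, how='all')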
The encoding seems to be ASCII here. DC4 stands for "device control 4" and SI for "shift in". These are control characters in an ASCII file and are not printable, so you cannot see them when you issue print(file_log), although, depending on your terminal, they might still do something when displayed (like \n produces a newline).
Try typing file_log in your interpreter to get the representation of that variable and check if those special characters are included. Chances are that you'll see DC4 in the representation as '\x14' which means hexadecimal 14.
You may then further process these strings in your program by using string manipulation like replace.
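For example, to see the raw bytes directly (using the example.DFX from the question):
with open('example.DFX', 'rb') as f:
    print(repr(f.readline()))
The control characters then show up as escapes like \x14 (DC4) and \x0f (SI).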

reading in rho delimited file

I'm trying to use Pandas to read in a delimited file. The separator is a Greek character, lowercase rho (þ).
I'm struggling to define the correct read_table parameters so that the resulting data frame is correctly formatted.
Does anyone have any experience or suggestions with this?
An example of the file is below
TimeþUser-IDþAdvertiser-IDþOrder-IDþAd-IDþCreative-IDþCreative-VersionþCreative-Size-IDþSite-IDþPage-IDþCountry-IDþState/ProvinceþBrowser-IDþBrowser-VersionþOS-IDþDMA-IDþCity-IDþZip-CodeþSite-DataþTime-UTC-Sec
03-28-2016-00:50:03þ0þ3893600þ7786669þ298662779þ67802437þ1þ300x250þ1722397þ125754620þ68þþ30þ0.0þ501012þ0þ3711þþþ1459122603
03-28-2016-00:24:29þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459121069
03-28-2016-00:13:42þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459120422
03-28-2016-00:21:09þ0þ3893600þ7352234þ290743769þ55727503þ1þ1x1þ1602646þ117915815þ68þþ31þ0.0þ501012þ0þ3711þþþ1459120869
I think what's happening is that the C engine isn't working here. If we switch to the Python engine, which is more powerful but slower, it seems to behave. For example, with the default C engine:
>>> df = pd.read_csv("out.rsv",sep="þ")
>>> df.iloc[:,:5]
TimeþUser-IDþAdvertiser-IDþOrder-IDþAd-IDþCreative-IDþCreative-VersionþCreative-Size-IDþSite-IDþPage-IDþCountry-IDþState/ProvinceþBrowser-IDþBrowser-VersionþOS-IDþDMA-IDþCity-IDþZip-CodeþSite-DataþTime-UTC-Sec
0 03-28-2016-00:50:03þ0þ3893600þ7786669þ29866277...
1 03-28-2016-00:24:29þ0þ3893600þ7352234þ29074376...
2 03-28-2016-00:13:42þ0þ3893600þ7352234þ29074376...
3 03-28-2016-00:21:09þ0þ3893600þ7352234þ29074376...
But with Python:
>>> df = pd.read_csv("out.rsv",sep="þ", engine="python")
>>> df.iloc[:,:5]
Time User-ID Advertiser-ID Order-ID Ad-ID
0 03-28-2016-00:50:03 0 3893600 7786669 298662779
1 03-28-2016-00:24:29 0 3893600 7352234 290743769
2 03-28-2016-00:13:42 0 3893600 7352234 290743769
3 03-28-2016-00:21:09 0 3893600 7352234 290743769
.. but seriously, þ? You're using þ as a delimiter? The only search hits google gives me for "rho delimited file" are all related to this question!
Note that you say lowercase rho, but it looks like thorn (þ) to me. Maybe it's a lowercase rho on your end that got mangled in posting?
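A quick way to check which character you actually have:
>>> import unicodedata
>>> unicodedata.name(u'þ')
'LATIN SMALL LETTER THORN'
>>> unicodedata.name(u'ρ')
'GREEK SMALL LETTER RHO'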

Python - splitting string into individual bytes and putting them back together

Here is a part of a python script I have:
textString = raw_input('')
text = list(textString)
print textString
try:
    for i in range(0, len(text)):
        chat_client.sock.send(text[i])
        i = i + 1
    chat_client.sock.send(0)
except:
    Exception
try:
    for i in range(0, len(text)):
        chat_server.conn.send(text[i])
        i = i + 1
    chat_server.conn.send(0)
except:
    Exception
I am then hoping to put it back together when it is received, using the int delimiter 0. Just for testing purposes, I have:
byte = self.conn.recv(1024)
if byte:
    print byte
else:
    break
just to show each byte that has been received individually.
However, when I send a string, some of the received chunks contain more than one character:
e.g. The quick brown fox jumps over the lazy dog -->
T
h
e
q
u
i
ck
b
r
o
wn
f
ox j
umps ov
er the
lazy dog
I wondered if anyone could figure out why this might be going on.
Thank you in advance.
Also, in case you are wondering why I am trying to split text like this, it is due to a suggestion from this post:
Python P2P socket chat script - only fully working on home network; connects at school but does not work
It is by design with stream sockets. From the Wikipedia page: a stream socket is a type of internet socket which provides a connection-oriented, sequenced, and unique flow of data without record boundaries. If multiple messages are already present when you read, they may be concatenated.
All that the specification guarantees is that you get all of the data, and in order.
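If you need message boundaries on top of TCP, you have to add them yourself. Here is a minimal sketch of delimiter-based reassembly, assuming the sender terminates each message with a '\0' byte (note that send(0) in the question passes an int where a one-byte string like '\0' is needed):
def recv_messages(conn):
    buf = ''
    while True:
        chunk = conn.recv(1024)
        if not chunk:
            break  # connection closed
        buf += chunk
        # yield every complete message currently in the buffer
        while '\0' in buf:
            message, buf = buf.split('\0', 1)
            yield message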

Convert binary data to web-safe text and back - Python

I want to convert a binary file (such as a jpg, mp3, etc) to web-safe text and then back into binary data. I've researched a few modules and I think I'm really close but I keep getting data corruption.
After looking at the documentation for binascii I came up with this:
from binascii import *
raw_bytes = open('test.jpg','rb').read()
text = b2a_qp(raw_bytes,quotetabs=True,header=False)
bytesback = a2b_qp(text,header=False)
f = open('converted.jpg','wb')
f.write(bytesback)
f.close()
When I try to open the converted.jpg I get data corruption :-/
I also tried using b2a_base64 with 57-byte blocks of binary data. I took each block, converted it to a string, concatenated them all together, and then converted back with a2b_base64, and got corruption again.
Can anyone help? I'm not super knowledgeable about the intricacies of bytes and file formats. I'm using Python on Windows, if that makes a difference with the \r\n stuff.
Your code looks quite complicated. Try this:
#!/usr/bin/env python
from binascii import *
raw_bytes = open('28.jpg','rb').read()
i = 0
str_one = b2a_base64(raw_bytes) # 1
str_list = b2a_base64(raw_bytes).split("\n") #2
bytesBackAll = a2b_base64(''.join(str_list)) #2
print bytesBackAll == raw_bytes #True #2
bytesBackAll = a2b_base64(str_one) #1
print bytesBackAll == raw_bytes #True #1
Lines tagged with #1 and #2 represent alternatives to each other. #1 seems most straightforward to me - just make it one string, process it and convert it back.
You should use base64 encoding instead of quoted printable. Use b2a_base64() and a2b_base64().
Quoted-printable is much bigger for binary data like pictures. In this encoding, each non-alphanumeric byte is encoded as =HEX (three characters). It is intended for text that consists mainly of alphanumeric characters, like email subjects.
Base64 is much better for mostly binary data. It packs bits across byte boundaries: the first output character takes the first 6 bits of the first byte, the second takes the last 2 bits of the first byte plus the first 4 bits of the second byte, and so on. You can recognize it by the = padding at the end of the encoded text (sometimes another character is used).
As an example I took a .jpeg of 271,700 bytes. In quoted-printable it is 627,857 bytes, while in base64 it is 362,269 bytes. The size of quoted-printable output depends on the data: letters-only text does not grow at all. Base64 output is always about orig_size * 4 / 3.
Your documentation reference is for Python 3.0.1. There is no good reason to be using Python 3.0; you should be using 3.2 or 2.7. What exactly are you using?
Suggestion: (1) change bytes to raw_bytes to avoid confusion with the bytes built-in (2) check for raw_bytes == bytes_back in your test script (3) while your test should work with quoted-printable, it is very inefficient for binary data; use base64 instead.
Update: Base64 encoding produces 4 output bytes for every 3 input bytes. Your base64 code doesn't work with 56-byte chunks because 56 is not an integral multiple of 3; each chunk is padded out to a multiple of 3. Then you join the chunks and attempt to decode, which is guaranteed not to work.
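You can see the padding in an interactive session; only the final group of a base64 stream may carry '=' padding, which is why padded chunks cannot simply be joined:
>>> from binascii import b2a_base64
>>> b2a_base64('ab')   # 2 bytes: not a multiple of 3, so '=' padding
'YWI=\n'
>>> b2a_base64('abc')  # 3 bytes: no padding
'YWJj\n'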
Your chunking loop would be much better written as:
output_string = ''.join(
    b2a_base64(raw_bytes[i:i+57]) for i in xrange(0, len(raw_bytes), 57)
)
In any case, chunking is rather slow and pointless; just do b2a_base64(raw_bytes).
@PMC's answer, copied from the question:
Here's what works:
from binascii import *
raw_bytes = open('28.jpg','rb').read()
str_list = []
i = 0
while i < len(raw_bytes):
    byteSegment = raw_bytes[i:i+57]
    str_list.append(b2a_base64(byteSegment))
    i += 57
bytesBackAll = a2b_base64(''.join(str_list))
print bytesBackAll == raw_bytes #True
Thanks for the help guys. I'm not sure why this would fail with [0:56] instead of [0:57] but I'll leave that as an exercise for the reader :P

Downsides to reading strings from Excel in python using encode('utf-8')

I am reading a large amount of data from an Excel spreadsheet, which I read (and reformat and rewrite) using the following general structure:
book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
    out.write(toprint)
    out.write("\n")
where x and y are arbitrary cells in this case, with x being less arbitrary and containing UTF-8 characters.
So far I have only been using .encode('utf-8') on cells where I know there will be errors otherwise, or where I foresee an error without it.
My question is basically this: is there a disadvantage to using .encode('utf-8') on all of the cells, even when it is unnecessary? Efficiency is not an issue. The main thing is that it works even if there is a UTF-8 character in a place there shouldn't be. If no errors would occur from just tacking .encode('utf-8') onto every cell read, I will probably end up doing that.
The XLRD documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are likely reading files newer than Excel 97, they contain Unicode codepoints anyway. It is therefore necessary to keep the content of these cells as Unicode within Python and not convert them to ASCII (which you do with the str() function). Use the code below:
book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded as UTF-8
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
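Alternatively, you can let the file object handle the encoding by opening it with codecs.open; toprint then stays unicode and the explicit .encode('UTF-8') goes away:
import codecs
out = codecs.open('output.file', 'w', encoding='UTF-8')
# ... same loop as above ...
out.write(toprint)  # the codecs wrapper encodes on write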
This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.
(1) Avoiding the SO horizontal scrollbar enhances the chance that people will read your code. Try wrapping your lines, for example:
toprint = u"".join([
u"formatting of the data im writing. "
u"important stuff is to the right -> ",
unicode(sheettwo.cell(z,y).value),
u" more formatting! ",
unicode(sheettwo.cell(z,x).value),
u" and done\n"
])
out.write(toprint.encode('UTF-8'))
(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:
>>> unicode(123456.78901234567)
u'123456.789012'
If that is a bother, you might like to try something like this:
>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
(3) xlrd builds Cell objects on the fly when demanded.
sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
