Accessing Universal-Sentence-encoder training vocabulary - python

I'm basing this question off of this similar question, but the multilingual universal embeddings have a slightly different structure:
from tensorflow.python.saved_model import loader_impl
saved_model = loader_impl.parse_saved_model("/path_to/universal_sent_encoder")
graph = saved_model.meta_graphs[0].graph_def
fns = [f for f in graph.library.function if "ptb" in str(f).lower()][0].node_def
print(len(fns))
>>> 1272
nodes = [n for n in fns if 'SentencepieceOp' in n.name]
model_string = nodes[0].attr.get('model').s
I see a byte string with what I assume is a compressed list/dict of tokens:
model_string[100:200]
>>> b"\x19\n\x10extra_token_id_3\x15\x00\x00\x00\x00\x18\x04\n\n\n\x03\xe2\x96\x81\x15_\xbaU\xc0\n\x08\n\x01,\x15~\xdac\xc0\n\x08\n\x01.\x15\x08\xf6d\xc0\n\x08\n\x01s\x15\xe8\xa8\x8b\xc0\n\x0b\n\x04\xe2\x96\x81a\x15\xaf \x9b\xc0\n\x08\n\x01'\x15j\xe9\x9b\xc0\n\r\n\x06\xe2\x96\x81th"
But I've tried multiple ways of decompressing this:
decoded_model_string = codecs.decode(model_string, 'ISO-8859-1') # decodes just fine
pickle.loads(model_string)
>>>
UnpicklingError Traceback (most recent call last)
<ipython-input-183-857101d05cb4> in <module>
----> 1 pickle.loads(model_string)
UnpicklingError: invalid load key, '\x0a'
pickle.loads(model_string.encode('utf-8'))
>>>
UnpicklingError Traceback (most recent call last)
<ipython-input-183-857101d05cb4> in <module>
----> 1 pickle.loads(model_string)
UnpicklingError: invalid load key, '\x0a'
I've also tried tensorflow.io.decode_raw, but I run into UTF decoding errors there as well.

It took a bit, but I had to load the byte string as a serialized SentencePiece model:
import sentencepiece as spm
sp_model = spm.SentencePieceProcessor()
sp_model.LoadFromSerializedProto(model_string)
vocab = {sp_model.IdToPiece(i): i for i in range(sp_model.GetPieceSize())}
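
The leading \n (0x0a) that pickle rejects is consistent with a serialized protocol buffer rather than a pickle or a compressed blob, which would explain why LoadFromSerializedProto works. As a rough sketch, the first entry from the output in the question can be decoded by hand. The field meanings (1 = piece, 2 = score, 3 = type) are an assumption based on the SentencePiece model schema, read_entry is a hypothetical helper, and the parser only handles single-byte lengths and varints, so it is not a general protobuf reader.

```python
import struct

# Bytes copied from the question's output, starting right after the
# length byte \x19 (25) that precedes this entry.
sample = b"\n\x10extra_token_id_3\x15\x00\x00\x00\x00\x18\x04"

def read_entry(buf):
    """Decode one entry into {field_number: value} (hypothetical helper)."""
    fields = {}
    pos = 0
    while pos < len(buf):
        tag = buf[pos]
        pos += 1
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 2:    # length-delimited (single-byte length assumed)
            length = buf[pos]
            pos += 1
            fields[field_number] = buf[pos:pos + length]
            pos += length
        elif wire_type == 5:  # fixed 32 bits, read here as a little-endian float
            fields[field_number] = struct.unpack("<f", buf[pos:pos + 4])[0]
            pos += 4
        elif wire_type == 0:  # varint (single-byte value assumed)
            fields[field_number] = buf[pos]
            pos += 1
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
    return fields

print(read_entry(sample))
```

For real use, the sentencepiece calls in the answer are the safer route; this only illustrates the layout of the bytes.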


How to format a Python string with multiple characters as the padding part

This is all good to pad a single character:
>>> '{:{pad}>{num}}'.format('12345',num='10', pad='a')
aaaaa12345
However, how to print out abcab12345, by using 'abc' as padding characters?
This fails:
>>> '{:{pad}>{num}}'.format('12345',num='10', pad='abc')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-85d5680ad88a> in <module>()
----> 1 '{:{pad}>{num}}'.format('hello',num='10', pad='abc')
ValueError: Invalid format specifier
I like the mini format language in python3 BTW ;-)
You can chain the format operations (tip: you need to escape your {} by doubling them, {{}}, once for each format nesting level):
baseFmtStr = "'{{{{:{{pad}}>{num}}}}}'"
resultStr = baseFmtStr.format(num=10).format(pad='-').format(12345)
This yields '-----12345' (including the quotes, which are part of the format string).
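
The format mini-language accepts only a single fill character, so multi-character padding has to be built by hand. A minimal sketch that repeats the pad string and trims it to the missing width (pad_left is a hypothetical helper name, not a standard function):

```python
def pad_left(text, width, pad):
    """Left-pad text to width by cycling through the characters of pad."""
    if not pad:
        raise ValueError("pad must not be empty")
    missing = max(width - len(text), 0)
    repeats = -(-missing // len(pad))  # ceiling division
    return (pad * repeats)[:missing] + text

print(pad_left('12345', 10, 'abc'))  # abcab12345
```

Text that is already at least as wide as the requested width is returned unchanged, which matches how the built-in width specifier behaves.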

.Thumbdata3 file extraction. TypeError: a bytes-like object is required, not 'str'

I'm aware there are similar threads and I've gone through them, but they didn't help my case:
A while ago I saved two .thumbdata3 files that are about 500mb in size each. This stackexchange thread claimed I could extract small jpegs from the files using a python script:
#!/usr/bin/python
"""extract files from Android thumbdata3 file"""
f=open('thumbdata3.dat','rb')
tdata = f.read()
f.close()
ss = '\xff\xd8'
se = '\xff\xd9'
count = 0
start = 0
while True:
    x1 = tdata.find(ss,start)
    if x1 < 0:
        break
    x2 = tdata.find(se,x1)
    jpg = tdata[x1:x2+1]
    count += 1
    fname = 'extracted%d03.jpg' % (count)
    fw = open(fname,'wb')
    fw.write(jpg)
    fw.close()
    start = x2+2
However it returned this error:
Traceback (most recent call last):
File "... extract.py", line 15, in <module>
x1 = tdata.find(ss,start)
TypeError: a bytes-like object is required, not 'str'
After searching around I thought the error might come from differences between Python 2.7 and 3.5, so I changed the 'rb' in the open() call to 'r', only to get this error:
Traceback (most recent call last):
File "...\Thumbdata\thumbadata extract.py", line 6, in <module>
tdata = f.read()
File "...\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 277960004: character maps to <undefined>
It's worth mentioning that the script and the file are both in the same folder. I'm using Atom with a Python run package, as well as Anaconda3.
Any help is appreciated.
Keep opening the file in binary mode ('rb'), as in f=open('thumbdata3.dat','rb'); that is the right way to read binary data.
The problem is that tdata is a bytes object, so in Python 3 its find method expects a bytes argument rather than a str.
ss and se were assigned string literals, so their type is str (I guess ss and se stand for string start and string end).
You need to convert those strings to bytes with encode(). Use the 'latin-1' codec so that each character maps to exactly one byte; the default UTF-8 would turn '\xff\xd8' into four bytes that never match a JPEG marker:
x1 = tdata.find(ss.encode('latin-1'),start)
x2 = tdata.find(se.encode('latin-1'),x1)
Please test this and comment with the output to confirm that it works.
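
One caveat worth checking when converting the markers with encode(): the codec matters. A small sketch of the difference, using the ss value from the question:

```python
ss = '\xff\xd8'

utf8_bytes = ss.encode()             # default codec is UTF-8
latin1_bytes = ss.encode('latin-1')  # one byte per character

print(utf8_bytes)    # b'\xc3\xbf\xc3\x98'
print(latin1_bytes)  # b'\xff\xd8'
```

Only the latin-1 result equals the two-byte marker that appears in the file, so a search with the UTF-8 result would simply find nothing.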
Just keep the same code!
This error:
Traceback (most recent call last):
File "... extract.py", line 15, in <module>
x1 = tdata.find(ss,start)
TypeError: a bytes-like object is required, not 'str'
is caused by using strings instead of bytes-like objects here:
ss = '\xff\xd8'
se = '\xff\xd9'
To fix this problem, just add a b prefix to those string literals.
This is the solution:
ss = b'\xff\xd8'
se = b'\xff\xd9'
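
Putting the bytes markers to work, here is a minimal Python 3 sketch of the same scan, written as a function so it can be tried on a small in-memory sample. The name extract_jpegs and the sample data are made up for illustration. Unlike the script in the question, it keeps both bytes of the end marker and stops if no end marker is found. It is a naive scan: real JPEG data can contain embedded markers, so some extracted images may come out truncated.

```python
def extract_jpegs(data):
    """Return every byte run that starts with FF D8 and ends with FF D9."""
    start_marker = b'\xff\xd8'
    end_marker = b'\xff\xd9'
    images = []
    pos = 0
    while True:
        begin = data.find(start_marker, pos)
        if begin < 0:
            break
        end = data.find(end_marker, begin + 2)
        if end < 0:
            break  # start marker without a matching end marker
        images.append(data[begin:end + 2])  # include both end-marker bytes
        pos = end + 2
    return images

# Hypothetical sample: two fake "images" surrounded by filler bytes.
blob = b'junk' + b'\xff\xd8AAA\xff\xd9' + b'pad' + b'\xff\xd8BB\xff\xd9'
print(len(extract_jpegs(blob)))  # 2
```

For the real file, read it with open('thumbdata3.dat', 'rb') and write each returned item to its own file opened with 'wb'.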

unpickle a python 2 object in python 3 raises ValueError

In python 2.7.6:
# the data i'm trying to pickle
>>> x[0:5]
[494.12804680901604, 641.9374923706055, 778.293918918919, 470.2265625, 237.21332017010934]
>>> y[0:5]
[236.99996948242188, 381.6793310733242, 685.0, 409.0909090909091, 658.0]
>>> z[0:5]
[23, 20, 98, 24, 78]
>>> holder = [x,y,z]
How I'm pickling:
with open('holderData.obj','wb') as f:
    pickle.dump(holder,f)
    f.close()
In python 3.6.2:
with open('holderData.obj','rb') as f:
    d = pickle.load(f, encoding='bytes')
Yet, this returns:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
ValueError: could not convert string to float
The only question/answer I could find related to this issue tells me to add the encoding='bytes' argument, which doesn't work in this instance.
The pickle itself, from print(repr(pickle.dumps(holder))):
'(lp0\n(lp1\nF494.12804680901604\naF641.9374923706055\naF778.293918918919\naF470.2265625\naF237.21332017010934\naF372.76081123737373\naF396.15337968952133\naF615.2265625\naF470.2265625\naF581.2155330882352\naF488.40675200803213\naF475.47189597315435\naF92.0511279585

Type error in Python: need a single Unicode character as parameter

When I try to convert a unicode variable to float using unicodedata.numeric(variable_name), I get this error "need a single Unicode character as parameter". Does anyone know how to resolve this?
Thanks!
Here is the code snippet I'm using :
f = urllib.urlopen("http://compling.org/cgi-bin/DAL_sentence_xml.cgi?sentence=good")
s = f.read()
f.close()
doc = libxml2dom.parseString(s)
measure = doc.getElementsByTagName("measure")
valence = unicodedata.numeric(measure[0].getAttribute("valence"))
activation = unicodedata.numeric(measure[0].getAttribute("activation"))
This is the error I'm getting when I run the code above
Traceback (most recent call last):
File "sentiment.py", line 61, in <module>
valence = unicodedata.numeric(measure[0].getAttribute("valence"))
TypeError: need a single Unicode character as parameter
Summary: Use float() instead.
The numeric function takes a single character. It does not do general conversions:
>>> import unicodedata
>>> unicodedata.numeric('½')
0.5
>>> unicodedata.numeric('12')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: need a single Unicode character as parameter
If you want to convert a string of digits to a float, use the float() function.
>>> float('12')
12.0
It won't do that Unicode magic, however:
>>> float('½')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '½'
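
If both kinds of input can occur, the two functions can be combined. A minimal sketch for Python 3, where to_number is a hypothetical helper name: it tries float() first and falls back to unicodedata.numeric() only for a single character.

```python
import unicodedata

def to_number(text):
    """Convert text to a float, accepting single numeric characters like '½'."""
    try:
        return float(text)
    except ValueError:
        if len(text) == 1:
            # Raises ValueError itself if the character is not numeric.
            return unicodedata.numeric(text)
        raise

print(to_number('12'))  # 12.0
print(to_number('½'))   # 0.5
```

Attribute values such as valence in the question are ordinary decimal strings, so plain float() is enough there.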

Numpy Correlation Error for Python

I am trying to show correlation between two individual lists. Before installing Numpy, I parsed World Bank data for GDP values and the number of internet users and stored them in two separate lists. Here is the snippet of code. This is just for gdp07. I actually have more lists for more years and other data such as unemployment.
import numpy as np
file = open('final_gdpnum.txt', 'r')
gdp07 = []
for line in file:
    fields = line.strip().split()
    gdp07.append(fields[0])
file2 = open('internetnum.txt', 'r')
netnum07 = []
for line in file2:
    fields2 = line.strip().split()
    netnum07.append(fields2[0])
print np.correlate(gdp07,netnum07,"full")
The error I get is this:
Traceback (most recent call last):
File "Project3.py", line 83, in <module>
print np.correlate(gdp07, netnum07, "full")
File "/usr/lib/python2.6/site-packages/numpy/core/numeric.py", line 645, in correlate
return multiarray.correlate2(a,v,mode)
ValueError: data type must provide an itemsize
Just for the record, I am using Cygwin with Python 2.6 on a Windows computer. I am only using Numpy along with its dependencies and other parts of its build (gcc compiler). Any help would be great. Thx
That looks like the error you get when you pass the data in as strings: according to the Python docs, strip() and split() return strings, so both of your lists hold str values.
http://docs.python.org/library/stdtypes.html
Try converting the data to the numeric type you need.
As you can see here:
In [14]:np.correlate(["3", "2","1"], [0, 1, 0.5])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/dog/<ipython-input-14-a0b588b9af44> in <module>()
----> 1 np.correlate(["3", "2","1"], [0, 1, 0.5])
/usr/lib64/python2.7/site-packages/numpy/core/numeric.pyc in correlate(a, v, mode, old_behavior)
643 return multiarray.correlate(a,v,mode)
644 else:
--> 645 return multiarray.correlate2(a,v,mode)
646
647 def convolve(a,v,mode='full'):
ValueError: data type must provide an itemsize
Try parsing the values:
In [15]: np.correlate([int("3"), int("2"),int("1")], [0, 1, 0.5])
Out[15]: array([ 2.5])
import numpy as np
file = open('final_gdpnum.txt', 'r')
gdp07 = []
for line in file:
    fields = line.strip().split()
    gdp07.append(int(fields[0]))
file2 = open('internetnum.txt', 'r')
netnum07 = []
for line in file2:
    fields2 = line.strip().split()
    netnum07.append(int(fields2[0]))
print np.correlate(gdp07,netnum07,"full")
Your other error is a character encoding problem.
I hope this works; I can't reproduce it because my Linux box supports UTF-8 by default.
I went by the IPython help(codecs) documentation:
http://code.google.com/edu/languages/google-python-class/dict-files.html
import codecs
# "utf-8-sig" decodes UTF-8 and drops a leading byte order mark (BOM) if present
f = codecs.open('final_gdpnum.txt', "r", "utf-8-sig")
for line in f:
    fields = line.strip().split()
    gdp07.append(int(fields[0]))
Try casting the data to float. It works for me!
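
A minimal sketch of that conversion, assuming each line holds the number in its first whitespace-separated field. The helper name first_column_as_floats and the sample lines are made up for illustration. float() also accepts values with a decimal point, which int() would reject.

```python
def first_column_as_floats(lines):
    """Parse the first whitespace-separated field of each line as a float."""
    values = []
    for line in lines:
        fields = line.strip().split()
        if not fields:
            continue  # skip blank lines instead of failing on fields[0]
        values.append(float(fields[0]))
    return values

# Hypothetical sample standing in for the lines of final_gdpnum.txt.
sample = ["1200.5 Albania\n", "\n", "830 Bolivia\n"]
print(first_column_as_floats(sample))  # [1200.5, 830.0]
```

An open file object can be passed in place of the sample list, and the two resulting lists of floats can then go straight into np.correlate.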
