must be unicode not string on python 2 - python

I'm trying to run Python 3 code on Python 2, but it gives me this error:
TypeError: must be unicode, not str
I've tried adding str() before chr(i) and "u" before "P", but I'm obviously doing it wrong.
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith("P"))
def remove_punctuation(text):
    return text.translate(tbl)
# initialize the stemmer
stemmer = LancasterStemmer()
# variable to hold the Json data read from the file
data = None
# read the json file and load the training data
with open('data.json') as json_data:
    data = json.load(json_data)
    print(data)

Use unichr, not chr, to create a Unicode character from an ordinal on Python 2:
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith("P"))
Switch to Python 3 if you can. Python 2 support ends next year.
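A minimal sketch of how the same table could be built so that one file runs on both Python 2 and Python 3. The try/except alias for unichr is an assumption added here for illustration and is not part of the original code. On Python 2, the text passed to remove_punctuation must itself be a unicode object, because str.translate expects a different kind of table there.

```python
import sys
import unicodedata

try:
    unichr  # exists on Python 2 only
except NameError:
    unichr = chr  # assumption: on Python 3, chr already returns a Unicode character

# Map every punctuation code point (Unicode categories starting with "P") to None,
# which tells translate() to delete that character.
tbl = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(unichr(i)).startswith("P")
)

def remove_punctuation(text):
    """Return text with all Unicode punctuation removed (text must be Unicode)."""
    return text.translate(tbl)

print(remove_punctuation(u"Hello, world!"))
```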

Related

Encoding a file with ord function

I'm trying to encode a file and write the encoded output to a new file, but I got this error:
TypeError: ord() expected string of length 1, but int found
My code:
from sys import argv, exit
def encode(data):
    encoded = ''
    while data:
        current = data[0]
        count = 1
        for i in data[1:]:
            if i == current:
                count += 1
            else:
                break
            if count == 255:
                break
        encoded += '{}{}'.format(chr(ord(current) & 255), chr(count & 255)) #error occurs here.
        data = data[count:]
    return encoded
if __name__ == '__main__':
    if len(argv) < 2:
        print('Please specify input file!')
        exit(0)
    with open(argv[1], 'rb') as f:
        data = f.read()
    with open(argv[1] + '.out', 'wb') as f:
        f.write(encode(data))
Additional question: How do I decode the encoded file?
You are reading bytes (open(..., 'rb')), so on Python 3, when you take one element of the byte string, you get an int, i.e. the byte value itself. That number already is the character code, so just leave out the ord. Alternatively, you could open the file without the b modifier (open(..., 'r')), which returns a string; I would advise keeping it as a byte string, though, or you could run into encoding issues when parsing anything non-ASCII.
You will run into a similar problem when saving your file: you cannot write a string to a file opened with the b modifier. Since you have values outside the ASCII range (> 127), writing as a string is not a good idea anyway: Python would encode your characters (e.g. as UTF-8), and you would end up with completely different bytes. The best solution is therefore probably not to concatenate your data into a string in the loop (the part where you do '{}{}'.format(...)), but to collect the values in a list (encoded = [], then encoded.append(current) and encoded.append(count)) and convert that to a byte string with bytes(encoded) after the loop. You can then pass the result to write without a problem.
As for decoding, open the file the same way you do for encoding, read two bytes b1 and b2 at a time, append [b1]*b2 to your output (again, as a list), and convert that to a byte string with bytes().
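The advice above can be sketched roughly as follows, assuming Python 3 (where indexing a bytes object yields an int). It uses a bytearray instead of a plain list, which is a small substitution made here for convenience. The function names mirror the question, but the bodies are an illustration, not the asker's final code.

```python
def encode(data):
    """Run-length encode bytes as (value, count) pairs, with count capped at 255."""
    encoded = bytearray()
    i = 0
    while i < len(data):
        current = data[i]  # an int in range 0..255 on Python 3
        count = 1
        while i + count < len(data) and data[i + count] == current and count < 255:
            count += 1
        encoded += bytes([current, count])
        i += count
    return bytes(encoded)

def decode(data):
    """Reverse encode(): expand each (value, count) pair back into a run."""
    if len(data) % 2:
        raise ValueError("encoded data must consist of (value, count) byte pairs")
    decoded = bytearray()
    for k in range(0, len(data), 2):
        value, count = data[k], data[k + 1]
        decoded += bytes([value]) * count
    return bytes(decoded)

sample = b"aaab" + b"\xff" * 300
assert decode(encode(sample)) == sample
```

Both functions take and return bytes, so they can be used directly with files opened in 'rb' and 'wb' mode.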

How can I convert the ANSI format csv to UTF-8 using python?

I have Python code that writes values to a CSV file.
The file is stored in ANSI format, but I need it in UTF-8 for my read operations.
Python code:
f = csv.writer(open("test1234.csv", "w+", encoding='utf-8'), lineterminator="\n")
fieldname = ['Param_Name','Param_Value']
f.writerow(fieldname)
instances = conn.instances.filter(Filters=[{'Name': 'instance-state-name',
                                            'Values': ['running', 'stopped']}])
for instance in instances:
    instance_count.append(instance)
instanceCount = str(len(instance_count))
f.writerow(('p_instance_count', len(instance_count)))
print('Instance count ->' + str(len(instance_count)))
error:
f = csv.writer(open("test1234.csv", "w+",encoding='utf-8'), lineterminator="\n")
TypeError: 'encoding' is an invalid keyword argument for this function
Please suggest any workaround!
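That error means the script is running on Python 2, where the built-in open() does not accept an encoding argument. Use codecs.open instead, opened for writing ("utf-8-sig" also writes a BOM, which helps some programs detect UTF-8):
import codecs
f = csv.writer(codecs.open("test1234.csv", "w", "utf-8-sig"), lineterminator="\n")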
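If moving to Python 3 is an option, the open(..., encoding='utf-8') call from the question works as written. Below is a minimal sketch, assuming Python 3. The helper write_params, the temporary path, and the sample rows are hypothetical and only stand in for the instance data in the question.

```python
import csv
import os
import tempfile

def write_params(path, rows):
    """Write rows to a UTF-8 encoded CSV file (Python 3 only)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["Param_Name", "Param_Value"])
        writer.writerows(rows)

# Hypothetical demo data written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "test1234.csv")
write_params(demo_path, [("p_instance_count", 3), ("p_owner", "Zo\u00eb")])

with open(demo_path, "rb") as handle:
    raw = handle.read()
print(raw)
```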

Python 2 vs 3. Same inputs, different results. MD5 hash

Python 3 code:
def md5hex(data):
    """ return hex string of md5 of the given string """
    h = MD5.new()
    h.update(data.encode('utf-8'))
    return b2a_hex(h.digest()).decode('utf-8')
Python 2 code:
def md5hex(data):
    """ return hex string of md5 of the given string """
    h = MD5.new()
    h.update(data)
    return b2a_hex(h.digest())
Input python 3:
>>> md5hex('bf5¤7¤8¤3')
'61d91bafe643c282bd7d7af7083c14d6'
Input python 2:
>>> md5hex('bf5¤7¤8¤3')
'46440745dd89d0211de4a72c7cea3720'
What's going on?
EDIT:
def genurlkey(songid, md5origin, mediaver=4, fmt=1):
    """ Calculate the deezer download url given the songid, origin and media+format """
    data = b'\xa4'.join(_.encode("utf-8") for _ in [md5origin, str(fmt), str(songid), str(mediaver)])
    data = b'\xa4'.join([md5hex(data), data])+b'\xa4'
    if len(data)%16:
        data += b'\x00' * (16-len(data)%16)
    return hexaescrypt(data, "jo6aey6haid2Teih").decode('utf-8')
This whole problem started with the b'\xa4' in the Python 2 code of another function. That byte doesn't work in Python 3.
And with that one I get the correct MD5 hash...
Use hashlib and a version-agnostic implementation instead:
import hashlib
text = u'bf5¤7¤8¤3'
text = text.encode('utf-8')
print(hashlib.md5(text).hexdigest())
works in Python 2/3 with the same result:
Python2:
'61d91bafe643c282bd7d7af7083c14d6'
Python3 (via repl.it):
'61d91bafe643c282bd7d7af7083c14d6'
Your code gives different results because the two versions hash different bytes: the Python 3 version encodes the text as UTF-8 before hashing, while the Python 2 version hashes the byte string as-is.
If you need it to match the unencoded Python 2 result:
import hashlib
text = u'bf5¤7¤8¤3'
print(hashlib.md5(text.encode("latin1")).hexdigest())
works:
46440745dd89d0211de4a72c7cea3720
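The Python 2 result matches Latin-1 not because Latin-1 is Python 2's default encoding, but because 'bf5¤7¤8¤3' in Python 2 is a byte string holding whatever bytes the terminal or source file supplied (here Latin-1, where ¤ is the single byte \xa4).
In Python 3, str is Unicode text and the default encoding is UTF-8. In Python 2, str is a byte string and the default encoding is ASCII. So even if the two literals look the same, they represent different byte sequences.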
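To make the difference concrete, here is a small sketch that hashes the same text under both encodings. It assumes only hashlib from the standard library, and the variable names are illustrative. According to the outputs quoted above, the UTF-8 digest should correspond to the Python 3 result and the Latin-1 digest to the Python 2 result.

```python
import hashlib

# The question's input; \u00a4 is the currency sign.
text = u"bf5\u00a47\u00a48\u00a43"

utf8_bytes = text.encode("utf-8")      # the currency sign becomes two bytes: \xc2\xa4
latin1_bytes = text.encode("latin-1")  # the currency sign becomes one byte: \xa4

utf8_digest = hashlib.md5(utf8_bytes).hexdigest()
latin1_digest = hashlib.md5(latin1_bytes).hexdigest()

# Different bytes in, different digests out.
print(repr(utf8_bytes), utf8_digest)
print(repr(latin1_bytes), latin1_digest)
```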

Unable to convert UTF-8 characters - Python

I receive a bunch of data into a variable using mechanize and urllib in Python 2.7. However, certain characters are not decoded despite using .decode("UTF-8"). The full code is as follows:
#!/usr/bin/python
import urllib
import mechanize
from time import time
total_time = 0
count = 0
def send_this(url):
    global count
    count = count + 1
    this_browser=mechanize.Browser()
    this_browser.set_handle_robots(False)
    this_browser.addheaders=[('User-agent','Chrome')]
    translated=this_browser.open(url).read().decode("UTF-8")
    return translated
def collect_this(my_ltarget,my_lhome,data):
    global total_time
    data = data.replace(" ","%20")
    get_url="http://mymemory.translated.net/api/ajaxfetch?q="+data+"&langpair="+my_lhome+"|"+my_ltarget+"&mtonly=1"
    return send_this(get_url)
ctr = 0
print collect_this("hi-IN","en-GB","This is my first proper computer program.")
The output of the print statement is:
{"responseData":{"translatedText":"\u092f\u0939 \u092e\u0947\u0930\u093e \u092a\u0939
\u0932\u093e \u0938\u092e\u0941\u091a\u093f\u0924 \u0915\u0902\u092a\u094d\u092f\u0942\u091f
\u0930 \u092a\u094d\u0930\u094b\u0917\u094d\u0930\u093e\u092e \u0939\u0948
\u0964"},"responseDetails":"","responseStatus":200,"matches":[{"id":0,"segment":"This is my
first proper computer program.","translation":"\u092f\u0939 \u092e\u0947\u0930\u093e \u092a
\u0939\u0932\u093e \u0938\u092e\u0941\u091a\u093f\u0924 \u0915\u0902\u092a\u094d\u092f\u0942
\u091f\u0930 \u092a\u094d\u0930\u094b\u0917\u094d\u0930\u093e\u092e \u0939\u0948
\u0964","quality":"70","reference":"Machine Translation provided by Google, Microsoft,
Worldlingo or MyMemory customized engine.","usage-count":0,"subject":"All","created-
by":"MT!","last-updated-by":"MT!","create-date":"2013-12-20","last-update-
date":"2013-12-20","match":0.85}]}
The sequences starting with \u... are the characters that I expected to be converted.
Where have I gone wrong?
Your data is not raw UTF-8 text that failed to decode. It is JSON containing JSON Unicode escapes (\uXXXX). Parse it with a JSON decoder:
import json
json.loads(your_json_string)
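A short sketch of what that looks like on a trimmed-down response. The raw string below is a shortened, hypothetical stand-in for the API output in the question, keeping only two of the escaped characters.

```python
import json

# Shortened stand-in for the response body; the backslashes are doubled so the
# Python string contains literal \uXXXX escapes, as the HTTP response does.
raw = '{"responseData": {"translatedText": "\\u092f\\u0939"}, "responseStatus": 200}'

parsed = json.loads(raw)  # the JSON decoder turns \uXXXX escapes into real characters
translated = parsed["responseData"]["translatedText"]

print(translated)
```

After parsing, translated is an ordinary Unicode string, so the fields of the response can be read as normal dictionary entries.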

"list index out of range" in python

I have Python code that indexes a text file containing Arabic words. I tested the code on English text and it works well, but it gives me an error when I test it on Arabic text.
Note: the text file is saved in Unicode encoding, not in ANSI encoding.
This is my code:
from whoosh import fields, index
import os.path
import csv
import codecs
from whoosh.qparser import QueryParser
# This list associates a name with each position in a row
columns = ["juza","chapter","verse","voc"]
schema = fields.Schema(juza=fields.NUMERIC,
                       chapter=fields.NUMERIC,
                       verse=fields.NUMERIC,
                       voc=fields.TEXT)
# Create the Whoosh index
indexname = "indexdir"
if not os.path.exists(indexname):
    os.mkdir(indexname)
ix = index.create_in(indexname, schema)
# Open a writer for the index
with ix.writer() as writer:
    with open("h.txt", 'r') as txtfile:
        lines=txtfile.readlines()
    # Read each row in the file
    for i in lines:
        # Create a dictionary to hold the document values for this row
        doc = {}
        thisline=i.split()
        u=0
        # Read the values for the row enumerated like
        # (0, "juza"), (1, "chapter"), etc.
        for w in thisline:
            # Get the field name from the "columns" list
            fieldname = columns[u]
            u+=1
            #if isinstance(w, basestring):
            #    w = unicode(w)
            doc[fieldname] = w
        # Pass the dictionary to the add_document method
        writer.add_document(**doc)
with ix.searcher() as searcher:
    query = QueryParser("voc", ix.schema).parse(u"بسم")
    results = searcher.search(query)
    print(len(results))
    print(results[1])
Then the error is:
Traceback (most recent call last):
  File "C:\Python27\yarab.py", line 38, in <module>
    fieldname = columns[u]
IndexError: list index out of range
This is a sample of the file:
1 1 1 كتاب
1 1 2 قرأ
1 1 3 لعب
1 1 4 كتاب
While I cannot see anything obviously wrong with that code, I would design for errors: catch any case where split() returns more elements than expected and handle it promptly (e.g. print the line and terminate). It looks like you might be dealing with ill-formatted data.
Your script is missing the source-encoding declaration. The first line should be:
# encoding: utf-8
Also, to open a file with a Unicode encoding, use:
import codecs
with codecs.open("s.txt", encoding='utf-8') as txtfile:
    lines = txtfile.readlines()
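Both suggestions (decode the file explicitly, and fail loudly on rows with an unexpected number of fields) can be combined in a sketch like the one below. It assumes Python 3 and leaves Whoosh out entirely. The helper parse_rows, the "utf-8-sig" default (which tolerates an optional BOM), and the temporary demo file are assumptions made for illustration; if the file was saved as UTF-16, which some Windows editors label "Unicode", a different encoding argument would be needed.

```python
import io
import os
import tempfile

COLUMNS = ["juza", "chapter", "verse", "voc"]

def parse_rows(path, encoding="utf-8-sig"):
    """Read whitespace-separated rows as Unicode and map them onto COLUMNS."""
    rows = []
    with io.open(path, "r", encoding=encoding) as txtfile:
        for line_number, line in enumerate(txtfile, start=1):
            values = line.split()
            if not values:
                continue  # skip blank lines
            if len(values) != len(COLUMNS):
                # Report the offending line instead of hitting an IndexError later.
                raise ValueError("line %d has %d fields, expected %d: %r"
                                 % (line_number, len(values), len(COLUMNS), line))
            rows.append(dict(zip(COLUMNS, values)))
    return rows

# Hypothetical demo file containing two rows from the sample in the question.
demo_path = os.path.join(tempfile.mkdtemp(), "h.txt")
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(u"1 1 1 \u0643\u062a\u0627\u0628\n1 1 2 \u0642\u0631\u0623\n")

rows = parse_rows(demo_path)
print(rows)
```

Each resulting dictionary could then be passed to writer.add_document(**doc) as in the question, after converting the numeric fields if the schema requires it.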
