"Prefix" before "unicode(entry.title.text, "utf-8")"

I'm editing a script that gets the title of a video from its URL. The line of code that does this is:
title = unicode(entry.title.text, "utf-8")
It can be found here. Is there a simple way to add a predefined prefix before this?
For example if there is a Youtube video named "test", the script should show "Testing Videos: test".

Just prepend a unicode string:
title = u'Testing Videos: ' + unicode(entry.title.text, "utf-8")
or use string formatting for more complex options; like adding both a prefix and a postfix:
title = u'Testing Videos: {} (YouTube)'.format(unicode(entry.title.text, "utf-8"))
All that unicode(inputvalue, codec) does is decode a byte string to a unicode value; you are free to concatenate that with other unicode values, including unicode literals.
An alternative spelling would be to use the str.decode() method on the entry.title.text object:
title = u'Testing Videos: ' + entry.title.text.decode("utf-8")
but the outcome would be the same.
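For readers on Python 3, where `unicode` no longer exists and text strings are Unicode by default, the same idea might look like the sketch below. It assumes the title arrives as UTF-8 bytes; `prefixed_title` and its `prefix` parameter are hypothetical names, not part of the original script.

```python
def prefixed_title(raw_title: bytes, prefix: str = "Testing Videos: ") -> str:
    """Decode a UTF-8 byte string and put a fixed prefix in front of it."""
    # bytes.decode() plays the role that unicode(value, "utf-8") plays in Python 2.
    return prefix + raw_title.decode("utf-8")


print(prefixed_title(b"test"))
```

If the title is already a `str` in Python 3, the decode step is unnecessary and plain concatenation is enough.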

Related

Format String of Dictionary

I have a dictionary serialized as a string, as follows:
CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"cisco123\", \"name\": \"admin\"}}}"
Now I want to format this string to replace the pwd and name dynamically. What I've tried is:
CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}".format('password', 'username')
But this gives following error:
Traceback (most recent call last):
  File ".\ll.py", line 4, in <module>
    CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}".format('password', 'username')
KeyError: '"aaaUser"'
It is possible to load the string as a dict using json.loads() and then set the attributes as required, but this is not what I want. I want to format the string, so that I can use it in other files/modules.
What am I missing here? Any help would be appreciated.

Don't try to work with the JSON string directly; decode it, update the data structure, and re-encode it:
import json
# Use single quotes instead of escaping all the double quotes
CREDENTIALS = '{"aaaUser": {"attributes": {"pwd": "cisco123", "name": "admin"}}}'
d = json.loads(CREDENTIALS)
attributes = d["aaaUser"]["attributes"]
attributes["name"] = username
attributes["pwd"] = password
CREDENTIALS = json.dumps(d)
With string formatting, you would need to change your string to look like
CREDENTIALS = '{{"aaaUser": {{"attributes": {{"pwd": "{0}", "name": "{1}"}}}}}}'
doubling all the literal braces so that the format method doesn't mistake them for placeholders.
However, formatting also means that the password needs to be pre-escaped if it contains anything that could be mistaken for JSON syntax, such as a double quote.
# This produces invalid JSON
NEW_CREDENTIALS = CREDENTIALS.format('new"password', 'bob')
# This produces valid JSON
NEW_CREDENTIALS = CREDENTIALS.format('new\\"password', 'bob')
It's far easier and safer to just decode and re-encode.
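As a rough illustration of the decode, update, re-encode approach, the sketch below wraps it in a helper and feeds it a password containing a double quote. The names `TEMPLATE` and `with_credentials` are made up for this example and are not from the question.

```python
import json

TEMPLATE = '{"aaaUser": {"attributes": {"pwd": "cisco123", "name": "admin"}}}'


def with_credentials(template: str, username: str, password: str) -> str:
    """Return a JSON string like `template` with the name and pwd replaced."""
    data = json.loads(template)
    attributes = data["aaaUser"]["attributes"]
    attributes["name"] = username
    attributes["pwd"] = password
    # json.dumps escapes quotes and backslashes in the values for us.
    return json.dumps(data)


encoded = with_credentials(TEMPLATE, "bob", 'new"password')
# Decoding the result gives back the original password, quote included.
print(json.loads(encoded)["aaaUser"]["attributes"]["pwd"])
```

The resulting string can still be shared with other files or modules; only the substitution step goes through the parsed data structure.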

str.format treats text enclosed in braces {} as a replacement field. Your CREDENTIALS string starts with a literal brace {, so str.format takes the text that follows it, "aaaUser", as the name of a replacement field, finds no matching argument, and raises the KeyError.
The string on which this method is called can contain literal text or replacement fields delimited by braces {}
To keep literal braces and substitute only the intended fields, double the braces you want to keep:
'{{ Hey Escape }} {0}'.format(12) # O/P '{ Hey Escape } 12'
If you escape the outer (parent and grandparent) braces in this way, it will work.
Example:
'{{Escape Me {n} }}'.format(n='Yes') # {Escape Me Yes}
So, following the str.format rules, I escape each literal brace in the string by doubling it:
"{{\"aaaUser\": {{\"attributes\": {{\"pwd\": \"{0}\", \"name\": \"{1}\"}}}}}}".format('password', 'username')
#O/P '{"aaaUser": {"attributes": {"pwd": "password", "name": "username"}}}'
There is another way to make string formatting work. However, it is not recommended in your case, because you must make sure the string always has the format you mentioned and contains nothing else that could be mistaken for a placeholder; otherwise the result could change drastically.
The approach is to convert each placeholder from {0} to %(0)s, so that %-style formatting works and ignores the braces.
'Hello %(0)s' % {'0': 'World'} # Hello World
Here I'm using re.sub to replace all occurrences:
import re
def myReplace(obj):
    # Turn a matched placeholder such as {0} into %(0)s
    found = obj.group(0)
    if found:
        found = found.replace('{', '%(')
        found = found.replace('}', ')s')
    return found
CREDENTIALS = re.sub(r'\{\d{1}\}', myReplace, "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}") % {'0': 'password', '1': 'username'}
print(CREDENTIALS) # It should print the desired result
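The same placeholder conversion can be sketched more compactly with a capture group in place of the callback. This is only an illustration under the same caveats as above; `TEMPLATE` and `fill_numbered` are hypothetical names.

```python
import re

TEMPLATE = '{"aaaUser": {"attributes": {"pwd": "{0}", "name": "{1}"}}}'


def fill_numbered(template: str, *values: str) -> str:
    """Replace {0}, {1}, ... with values, leaving all other braces alone."""
    # Turn each {N} into %(N)s so that %-formatting ignores the JSON braces.
    percent_template = re.sub(r"\{(\d+)\}", r"%(\1)s", template)
    mapping = {str(i): value for i, value in enumerate(values)}
    return percent_template % mapping


print(fill_numbered(TEMPLATE, "password", "username"))
```

Note that a literal % anywhere in the template would confuse the %-formatting step, and the substituted values are not escaped for JSON, so this shares the weaknesses of the str.format approach.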

How to check the Emoji property of a character in Python?

In Unicode, a character can have an Emoji property.
Is there a standard way in Python to determine if a character is an Emoji?
I know of unicodedata, but it doesn't appear to expose all these extra character details.
Note: I'm asking about the specific attribute called "Emoji" in the Unicode standard, as provided in the link. I don't want to use an arbitrary list of pattern ranges, and I'd prefer to use a standard library.

This is the code I ended up creating to load the Emoji information. The get_emoji function gets the data file, parses it, and calls the enumeration callback. The rest of the code uses this to produce a JSON file of the information I needed.
#!/usr/bin/env python3
# Generates a list of emoji characters and names in JS format
import urllib.request
import unicodedata
import re, json
'''
Enumerates the Emoji characters that match an attribute from the Unicode standard (the Emoji list).
@param on_emoji A callback that is called with each found character. Signature `on_emoji( code_point_value )`
@param attribute The attribute that is desired, such as `Emoji` or `Emoji_Presentation`
'''
def get_emoji(on_emoji, attribute):
    with urllib.request.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt') as f:
        content = f.read().decode(f.headers.get_content_charset())
    # Matches "XXXX ; Attribute # comment" and "XXXX..YYYY ; Attribute # comment"
    cldr = re.compile(r'^([0-9A-F]+)(\.\.([0-9A-F]+))?([^;]*);([^#]*)#(.*)$')
    for line in content.splitlines():
        m = cldr.match(line)
        if m is None:
            continue
        line_attribute = m.group(5).strip()
        if line_attribute != attribute:
            continue
        code_point = int(m.group(1),16)
        if m.group(3) is None:
            # Single code point
            on_emoji(code_point)
        else:
            # Inclusive range of code points
            to_code_point = int(m.group(3),16)
            for i in range(code_point,to_code_point+1):
                on_emoji(i)
# Dumps the values into a JSON format
def print_emoji(value):
    c = chr(value)
    try:
        obj = {
            'code': value,
            'name': unicodedata.name(c).lower(),
        }
        print(json.dumps(obj),',')
    except ValueError:
        # No name found: Unicode DB is likely outdated in installed Python
        pass
print( "module.exports = [" )
get_emoji(print_emoji, "Emoji_Presentation")
print( "]" )
That solved my original problem. To answer the question itself it'd just be a matter of sticking the results into a dictionary and doing a lookup.
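A minimal sketch of that lookup idea follows. To stay self-contained it parses a small hand-written sample in the same line format as emoji-data.txt instead of downloading the real file, and it uses a set because only membership is needed. `SAMPLE_DATA`, `parse_emoji_property` and `is_emoji` are illustrative names.

```python
SAMPLE_DATA = """\
# Hand-written sample in the emoji-data.txt line format (not the real file)
0023          ; Emoji                # number sign
1F600         ; Emoji                # grinning face
1F601..1F606  ; Emoji                # a range of faces
1F600         ; Emoji_Presentation   # grinning face
"""


def parse_emoji_property(text: str, attribute: str) -> set:
    """Collect the code points whose property column equals `attribute`."""
    code_points = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ";" not in line:
            continue  # blank or comment-only line
        field, prop = (part.strip() for part in line.split(";", 1))
        if prop != attribute:
            continue
        first, _, last = field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # ranges are inclusive
        code_points.update(range(start, end + 1))
    return code_points


EMOJI = parse_emoji_property(SAMPLE_DATA, "Emoji")


def is_emoji(char: str) -> bool:
    """Check a single character against the parsed property set."""
    return ord(char) in EMOJI


print(is_emoji("\U0001F600"), is_emoji("a"))
```

With the real data file, `SAMPLE_DATA` would be replaced by the downloaded content, and the set would be built once and reused for every lookup.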

I have used the following regex pattern successfully before
import re
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
Also check out this question: removing emojis from a string in Python
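For illustration, here is how such a pattern might be used to strip matches from a string in Python 3. The ranges are the ones listed above and are not an exhaustive or authoritative definition of the Emoji property; `strip_emoji` is a hypothetical helper name.

```python
import re

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)


def strip_emoji(text: str) -> str:
    """Remove every run of characters that falls in the ranges above."""
    return emoji_pattern.sub("", text)


print(strip_emoji("Hello \U0001F600 world"))
```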

Python convert file content to unicode form

For example, I have a file a.js whose content is:
Hello, 你好, bye.
It contains two Chinese characters whose escaped Unicode form is \u4f60\u597d
I want to write a Python program that converts the Chinese characters in a.js to their escaped Unicode form and writes the output to b.js, whose content should be: Hello, \u4f60\u597d, bye.
My code:
fp = open("a.js")
content = fp.read()
fp.close()
fp2 = open("b.js", "w")
result = content.decode("utf-8")
fp2.write(result)
fp2.close()
but it seems that the Chinese characters are still single characters, not the ASCII escape sequences I want.

>>> print u'Hello, 你好, bye.'.encode('unicode-escape')
Hello, \u4f60\u597d, bye.
But you should consider using JSON, via json.

You can try the codecs module:
codecs.open(filename, mode[, encoding[, errors[, buffering]]])
a = codecs.open("a.js", "r", "cp936").read() # a is a unicode object
codecs.open("b.js", "w", "utf16").write(a)

There are two ways to do this.
First, use the 'encode' method:
str1 = "Hello, 你好, bye. "
print(str1.encode("raw_unicode_escape"))
print(str1.encode("unicode_escape"))
You can also use the 'codecs' module:
import codecs
print(codecs.raw_unicode_escape_encode(str1))

I found that repr(content.decode("utf-8")) will return "u'Hello, \u4f60\u597d, bye'"
so repr(content.decode("utf-8"))[2:-1] will do the job

you can use repr:
a = u"Hello, 你好, bye. "
print repr(a)[2:-1]
or you can use encode method:
print a.encode("raw_unicode_escape")
print a.encode("unicode_escape")
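Putting the encode-based suggestions together, a Python 3 version of the whole conversion might look like the sketch below. It assumes a.js is stored as UTF-8; `escape_non_ascii` and `convert_file` are made-up helper names.

```python
def escape_non_ascii(text: str) -> str:
    """Return `text` with non-ASCII characters written as backslash escapes."""
    # unicode_escape yields ASCII bytes; decode them to get a str again.
    return text.encode("unicode_escape").decode("ascii")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-8 file and write its escaped form as ASCII."""
    with open(src_path, encoding="utf-8") as src:
        content = src.read()
    with open(dst_path, "w", encoding="ascii", newline="") as dst:
        dst.write(escape_non_ascii(content))


print(escape_non_ascii("Hello, 你好, bye."))
```

Be aware that the unicode_escape codec also escapes backslashes and control characters such as newlines, so a multi-line file comes out as a single line containing \n sequences. If the goal is a valid JavaScript or JSON string literal, the json module is probably the safer tool.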

character encoding in python

I have a byte stream that looks like this '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
str_data is written to a text file using the following code:
file = open("test_doc","w")
file.write(str_data)
file.close()
If test_doc is opened in a web browser and character encoding is set to Japanese it just works fine.
I am using reportlab to generate a PDF, with the following code:
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfgen.canvas import Canvas
from reportlab.pdfbase.cidfonts import CIDFont
pdfmetrics.registerFont(CIDFont('HeiseiMin-W3','90ms-RKSJ-H'))
pdfmetrics.registerFont(CIDFont('HeiseiKakuGo-W5','90ms-RKSJ-H'))
c = Canvas('test1.pdf')
c.setFont('HeiseiMin-W3-90ms-RKSJ-H', 6)
message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88';
c.drawString(100, 675,message1)
c.save()
Here I use the message1 variable, which gives output in Japanese. I need to use message3 instead of message1 to generate the PDF, but message3 produces garbage, probably because of improper encoding.

Here is an answer:
message1 is encoded in shift_jis; message3 and str_data are encoded in UTF-8. All appear to represent Japanese text. See the following IDLE session:
>>> message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
>>> print message1.decode('shift_jis')
これは平成明朝です。
>>> message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
>>> print message3.decode('UTF-8')
テスト
>>> str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
>>> print str_data.decode('UTF-8')
日本語
>>>
Google Translate detects the language as Japanese and translates them to the English "This is the Heisei Mincho.", "Test", and "Japanese" respectively.
What is the question?

I guess you have to learn more about the encoding of strings in general. A byte string in Python has no encoding information attached, so it's up to you to use it in the right way or convert it appropriately. Have a look at unicode strings, the encode / decode methods and the codecs module. And check whether c.drawString also accepts a unicode string, which might make your life much easier.
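As a sketch of the decode-then-encode step, written in Python 3 syntax: message3 is UTF-8, while message1 (which renders correctly) is Shift_JIS, so one option is to re-encode message3 before drawing it. Whether drawString accepts the result with the 90ms-RKSJ-H encoding is an assumption based on message1 working; `utf8_to_shift_jis` is a hypothetical helper.

```python
def utf8_to_shift_jis(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as Shift_JIS bytes."""
    text = data.decode("utf-8")      # bytes -> text
    return text.encode("shift_jis")  # text -> bytes in the target encoding


message3 = b"\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88"  # UTF-8 bytes from the question
converted = utf8_to_shift_jis(message3)
# Decoding the converted bytes as Shift_JIS recovers the same text.
print(converted.decode("shift_jis"))
```

In Python 2 the equivalent one-liner would be message3.decode('utf-8').encode('shift_jis').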

If you need to detect these encodings on the fly, you can take a look at Mark Pilgrim's excellent open source Universal Encoding Detector.
#!/usr/bin/env python
import chardet
message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
print chardet.detect(message1)
message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
print chardet.detect(message3)
str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
print chardet.detect(str_data)
Output:
{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
{'confidence': 0.87625, 'encoding': 'utf-8'}
{'confidence': 0.87625, 'encoding': 'utf-8'}

How can I understand this python error message?

Hi can you help me decode this message and what to do:
main.py", line 1278, in post
message.body = "%s %s/%s/%s" % (msg, host, ad.key().id(), slugify(ad.title.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Thanks
UPDATE having tried removing the encode call it appears to work:
class Recommend(webapp.RequestHandler):
    def post(self, key):
        ad = db.get(db.Key(key))
        email = self.request.POST['tip_email']
        host = os.environ.get("HTTP_HOST", os.environ["SERVER_NAME"])
        senderemail = users.get_current_user().email() if users.get_current_user() else 'info@monton.cl' if host.endswith('.cl') else 'info@monton.com.mx' if host.endswith('.mx') else 'info@montao.com.br' if host.endswith('.br') else 'admin@koolbusiness.com'
        message = mail.EmailMessage(sender=senderemail, subject="%s recommends %s" % (self.request.POST['tip_name'], ad.title) )
        message.to = email
        message.body = "%s %s/%s/%s" % (self.request.POST['tip_msg'],host,ad.key().id(),slugify(ad.title))
        message.send()
        matched_images=ad.matched_images
        count = matched_images.count()
        if ad.text:
            p = re.compile(r'(www[^ ]*|http://[^ ]*)')
            text = p.sub(r'\1',ad.text.replace('http://',''))
        else:
            text = None
        self.response.out.write("Message sent<br>")
        path = os.path.join(os.path.dirname(__file__), 'market', 'market_ad_detail.html')
        self.response.out.write(template.render(path, {'user_url':users.create_logout_url(self.request.uri) if users.get_current_user() else users.create_login_url(self.request.uri),
            'user':users.get_current_user(), 'ad.user':ad.user,'count':count, 'ad':ad, 'matched_images': matched_images,}))

The problem here is that your underlying model (message.body) only wants ASCII text, but you're trying to give it a Unicode string with non-ASCII characters.
Since you need a plain ASCII string here, you can make Python substitute a '?' for every character that has no ASCII representation:
u"UNICODE STRING".encode('ascii','replace').decode('ascii')
So, for your example above:
message.body = "%s %s/%s/%s" % \
    (msg.encode('ascii','replace').decode('ascii'),
     host.encode('ascii','replace').decode('ascii'),
     ad.key().id(),  # numeric ID, nothing to encode
     slugify(ad.title).encode('ascii','replace').decode('ascii'))
Or just encode/decode the one variable that has the Unicode character.
But this isn't an optimal solution. The best idea is to make message.body a unicode string. If that isn't feasible (I'm not familiar with GAE), you can use this to at least avoid the errors.
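To see what the 'replace' error handler does, here is a tiny sketch in Python 3 syntax (where every str is already Unicode). `to_ascii` is a hypothetical helper name and the sample word is made up.

```python
def to_ascii(text: str) -> str:
    """Replace every non-ASCII character in `text` with '?'."""
    return text.encode("ascii", "replace").decode("ascii")


print(to_ascii("Mont\u00e3o"))
```

The information in the replaced characters is lost, which is why this is a workaround and not a real fix.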

You've got a Unicode character in a place that you're not supposed to. Most often I find this error is having MS Word-style slanted quotes.

One of these fields contains characters that cannot be encoded. If you switch to Python 3 (it has better Unicode support), or change the encoding of the entire script, the problem should stop. The best way to change the encoding in 2.x is the encoding comment line: add # -*- coding: utf-8 -*- to the top of your script. See http://evanjones.ca/python-utf8.html for more of an explanation of using Python with UTF-8. Then handle strings like this:
s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )

I had a similar problem when using Django-nonrel and Google App Engine.
The problem was the folder containing the application. This probably isn't the problem described in this question, but maybe it helps someone avoid wasting time like I did.
First try moving your application folder, for example to /home/, and run it again; if that doesn't work, try something else.
