This question is based on a side effect of a related one.
My .py files all have the # -*- coding: utf-8 -*- encoding declaration on the first line, including my api.py.
As I mentioned in the related question, I use HttpResponse to return the API documentation, and I define the encoding with:
HttpResponse(cy_content, content_type='text/plain; charset=utf-8')
Everything is OK: when I call my API service there are no encoding problems, except in the string built from a dictionary by pprint.
Since I am using Turkish characters in some of the values in my dict, pprint converts them to escape sequences, like:
API_STATUS = {
    1: 'müşteri',
    2: 'some other status message'
}

my_str = 'Here is the documentation part that contains Turkish chars like işüğçö'
my_str += pprint.pformat(API_STATUS, indent=4, width=1)
return HttpResponse(my_str, content_type='text/plain; charset=utf-8')
And my plain text output is like:
Here is the documentation part that contains Turkish chars like işüğçö
{
    1: 'm\xc3\xbc\xc5\x9fteri',
    2: 'some other status message'
}
I have tried decoding and encoding the pprint output to different encodings, with no success... What is the best practice to overcome this problem?
pprint appears to use repr by default; you can work around this by overriding PrettyPrinter.format:
# coding=utf8
import pprint

class MyPrettyPrinter(pprint.PrettyPrinter):
    def format(self, object, context, maxlevels, level):
        if isinstance(object, unicode):
            return (object.encode('utf8'), True, False)
        return pprint.PrettyPrinter.format(self, object, context, maxlevels, level)

d = {'foo': u'işüğçö'}
pprint.pprint(d)             # {'foo': u'i\u015f\xfc\u011f\xe7\xf6'}
MyPrettyPrinter().pprint(d)  # {'foo': işüğçö}
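For the view in the question, a minimal sketch of how this could be wired up (assuming Python 2, Django's HttpResponse, and the MyPrettyPrinter class above; the view name is just illustrative):

# -*- coding: utf-8 -*-
from django.http import HttpResponse

API_STATUS = {
    1: u'müşteri',
    2: u'some other status message',
}

def api_docs(request):
    # byte string holding UTF-8 (thanks to the coding declaration above)
    my_str = 'Here is the documentation part that contains Turkish chars like işüğçö\n'
    # the overridden format() emits unicode values as UTF-8 bytes,
    # so the Turkish characters survive instead of becoming \u escapes
    my_str += MyPrettyPrinter(indent=4, width=1).pformat(API_STATUS)
    return HttpResponse(my_str, content_type='text/plain; charset=utf-8')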
You should use unicode strings instead of 8-bit ones:
API_STATUS = {
    1: u'müşteri',
    2: u'some other status message'
}

my_str = u'Here is the documentation part that contains Turkish chars like işüğçö'
my_str += pprint.pformat(API_STATUS, indent=4, width=1)
The pprint module is designed to print out all possible kinds of nested structures in a readable way. To do that it prints each object's representation rather than converting it to a string, so you'll end up with the escape syntax whether you use unicode strings or not. But if you're using Unicode in your document, then you really should be using unicode literals!
Anyway, thg435 has given you a solution for changing this behaviour of pformat.
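To see why the literal type alone doesn't change the output, a quick interactive check (Python 2; the value mirrors the dict from the question):

>>> import pprint
>>> pprint.pformat({1: 'müşteri'})    # byte string: UTF-8 bytes shown as \xXX escapes
"{1: 'm\\xc3\\xbc\\xc5\\x9fteri'}"
>>> pprint.pformat({1: u'müşteri'})   # unicode string: still escaped, now as \xXX/\uXXXX
"{1: u'm\\xfc\\u015fteri'}"

Combining unicode literals with the overridden PrettyPrinter from the other answer gives readable output both in the dict values and in the surrounding text.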
I am trying to perform a RethinkDB match query with an escaped, user-provided Unicode search param:
import re
from rethinkdb import RethinkDB
r = RethinkDB()

search_value = u"\u05e5"  # provided by user via flask
search_value_escaped = re.escape(search_value)  # results in u'\\\u05e5' ->
                                                # when encoded with "utf-8" gives "\ץ" as expected.

conn = r.connect(...)

results_cursor_a = r.db(...).table(...).order_by(index="id").filter(
    lambda doc: doc.coerce_to("string").match(search_value)
).run(conn)  # search_value works fine

results_cursor_b = r.db(...).table(...).order_by(index="id").filter(
    lambda doc: doc.coerce_to("string").match(search_value_escaped)
).run(conn)  # search_value_escaped spits an error
The error for search_value_escaped is the following:
ReqlQueryLogicError: Error in regexp `\ץ` (portion `\ץ`): invalid escape sequence: \ץ in:
r.db(...).table(...).order_by(index="id").filter(lambda var_1: var_1.coerce_to('string').match(u'\\\u05e5m'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I tried encoding with "utf-8" before/after re.escape(), but got the same results with different errors. What am I missing? Is it something in my code or some kind of a bug?
EDIT: .coerce_to('string') converts the document to a "utf-8" encoded string. RethinkDB also converts the query to "utf-8" and then matches them, hence the first query works even though it looks like a unicode match inside a string.
From what I can tell, RethinkDB rejects escaped Unicode characters, so I wrote a simple workaround with a custom escape, without implementing my own character-replacement logic (for fear that I would miss a character and create a security issue).
import re

def no_unicode_escape(u):
    escaped_list = []
    for i in u:
        if ord(i) < 128:
            escaped_list.append(re.escape(i))
        else:
            escaped_list.append(i)
    rv = "".join(escaped_list)
    return rv
or a one-liner:
import re

def no_unicode_escape(u):
    return "".join(re.escape(i) if ord(i) < 128 else i for i in u)
Which yields the required result of escaping "dangerous" characters and works with RethinkDB as I wanted.
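A minimal usage sketch with the query from the question (the ... placeholders stand for the real database and table names, exactly as above):

search_value = u"\u05e5"  # provided by user via flask
safe_value = no_unicode_escape(search_value)  # ASCII metacharacters escaped, Unicode left alone

results_cursor = r.db(...).table(...).order_by(index="id").filter(
    lambda doc: doc.coerce_to("string").match(safe_value)
).run(conn)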
I've been trying to send some integer values over a socket to a LabVIEW client using json.dumps, but as the numbers change, the size of each field may change. I would like to know if there is a way to pad the number with '0' without turning it into a string when I do the JSON dump, since that adds quotation marks around each number in the packet being sent.
Example:
data = json.dumps({"Data": str(52).zfill(4)})
self.sock.send(data.encode())
This sends
'"Data":"0052"'
I want
'"Data": 0052'
As @jsonharper mentioned, technically what you're asking for is no longer JSON; more on that here.
However, that doesn't mean you can't use the json library to do the bulk of the work for you!
You can achieve this by passing a custom encoder class to json.dumps like this:
>>> import json
>>> class MyInt(int):
...     def __str__(self):
...         return '{:0>4}'.format(self)
...
>>> class MyEncoder(json.encoder.JSONEncoder):
...     def default(self, o):
...         if isinstance(o, MyInt):
...             return str(o)
...         return super(MyEncoder, self).default(o)
...
>>> obj = {'Data': MyInt(52)}
>>> json.dumps(obj, cls=MyEncoder)
'{"Data": 0052}'
You can do this with any class, but this can result in something that can't be decoded again with a strict JSON decoder.
See if you can get LabVIEW to read standard JSON, but if not, the above should work.
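To illustrate that caveat, even Python's own parser is strict about it (a quick check; MyInt and MyEncoder are the classes defined above):

import json

json.dumps({'Data': MyInt(52)}, cls=MyEncoder)  # '{"Data": 0052}'
json.loads('{"Data": 0052}')                    # raises ValueError: numbers with
                                                # leading zeros are not valid JSON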
In Unicode, a character can have an Emoji property.
Is there a standard way in Python to determine if a character is an Emoji?
I know of unicodedata, but it doesn't appear to expose all these extra character details.
Note: I'm asking about the specific attribute called "Emoji" in the Unicode standard, as provided in the link. I don't want an arbitrary list of pattern ranges, and would prefer to use a standard library.
This is the code I ended up creating to load the Emoji information. The get_emoji function gets the data file, parses it, and calls the enumeration callback. The rest of the code uses this to produce a JSON file of the information I needed.
#!/usr/bin/env python3
# Generates a list of emoji characters and names in JS format
import urllib.request
import unicodedata
import re, json

'''
Enumerates the Emoji characters that match an attribute from the Unicode standard (the Emoji list).
#param on_emoji A callback that is called with each found character. Signature `on_emoji( code_point_value )`
#param attribute The attribute that is desired, such as `Emoji` or `Emoji_Presentation`
'''
def get_emoji(on_emoji, attribute):
    with urllib.request.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt') as f:
        content = f.read().decode(f.headers.get_content_charset())

    cldr = re.compile('^([0-9A-F]+)(..([0-9A-F]+))?([^;]*);([^#]*)#(.*)$')
    for line in content.splitlines():
        m = cldr.match(line)
        if m == None:
            continue
        line_attribute = m.group(5).strip()
        if line_attribute != attribute:
            continue
        code_point = int(m.group(1), 16)
        if m.group(3) == None:
            on_emoji(code_point)
        else:
            to_code_point = int(m.group(3), 16)
            for i in range(code_point, to_code_point + 1):
                on_emoji(i)

# Dumps the values into a JSON format
def print_emoji(value):
    c = chr(value)
    try:
        obj = {
            'code': value,
            'name': unicodedata.name(c).lower(),
        }
        print(json.dumps(obj), ',')
    except:
        # Unicode DB is likely outdated in installed Python
        pass

print("module.exports = [")
get_emoji(print_emoji, "Emoji_Presentation")
print("]")
That solved my original problem. To answer the question itself, it'd just be a matter of sticking the results into a dictionary and doing a lookup, as sketched below.
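For instance, a minimal sketch of that lookup, reusing the get_emoji helper above (a set of code points works just as well as a dict here):

# Hypothetical lookup built on top of get_emoji() defined above.
emoji_code_points = set()
get_emoji(emoji_code_points.add, "Emoji")

def is_emoji(character):
    # True if the single character carries the "Emoji" property.
    return ord(character) in emoji_code_points

print(is_emoji('\U0001F600'))  # True  -- U+1F600 has the Emoji property
print(is_emoji('a'))           # False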
I have used the following regex pattern successfully before
import re
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
Also check out this question: removing emojis from a string in Python
I'm trying to scrape from a non-English website using Scrapy. The scraped results as JSON look something like this:
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"},
This is the code I'm using:
def parse(self, response):
    for sel in response.xpath('//section[@class="items-box"]'):
        item = ShopItem()
        item['name'] = sel.xpath('a/div/h3/text()').extract()
        item['price'] = sel.xpath('a/div/div/div[1]/text()').extract().replace("$", "")
        yield item
How would I output unescaped Unicode characters onto the JSON?
Edit (2016-10-19):
With Scrapy 1.2+, you can use the FEED_EXPORT_ENCODING setting to choose the character encoding of the output JSON file, e.g. FEED_EXPORT_ENCODING = 'utf-8' (the default value being None, which means \uXXXX escaping).
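In other words, one line in your project's settings.py:

# settings.py (Scrapy 1.2+): write non-ASCII characters to the JSON feed as-is
FEED_EXPORT_ENCODING = 'utf-8'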
Note: I'm adapting what I wrote on GitHub for a similar issue I linked to in the question's comments.
Note that there's an open issue on Scrapy to make the output encoding a parameter: https://github.com/scrapy/scrapy/issues/1965
Scrapy's default JSON exporter uses (the default) ensure_ascii=True argument, so it outputs Unicode characters as \uXXXX sequences before writing to file. (This is what is used when doing -o somefile.json)
Setting ensure_ascii=False in the exporter will output Unicode strings, which will end up UTF-8-encoded in the file. See the custom exporter code at the bottom of this answer.
To illustrate, let's read your input JSON string back into some data to work on:
>>> import json
>>> test = r'''{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'''
>>> json.loads(test)
{u'price': u'13,000', u'name': u'\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc'}
The input with \uXXXX sequences is valid JSON for Python (as it should), and loads() produces a valid Python dict.
Now let's serialize to JSON again:
>>> # dumping the dict back to JSON, with default ensure_ascii=True
>>> json.dumps(json.loads(test))
'{"price": "13,000", "name": "\\u58c1\\u6bb4\\u308a\\u4ee3\\u884c\\u69d8\\u5c02\\u7528\\u2605 \\u30c6\\u30ec\\u30d3\\u672c\\u4f53 20v\\u578b \\u767d \\u9001\\u6599\\u8fbc"}'
>>>
And now with ensure_ascii=False
>>> # now dumping with ensure_ascii=False, you get a Unicode string
>>> json.dumps(json.loads(test), ensure_ascii=False)
u'{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'
>>>
Let's print to see the difference:
>>> print json.dumps(json.loads(test))
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}
>>> print json.dumps(json.loads(test), ensure_ascii=False)
{"price": "13,000", "name": "壁殴り代行様専用★ テレビ本体 20v型 白 送料込"}
If you want to write JSON items as UTF-8, you can do it like this:
1. Define a custom item exporter, e.g. in an exporters.py file in your project:
$ cat myproject/exporters.py
from scrapy.exporters import JsonItemExporter

class Utf8JsonItemExporter(JsonItemExporter):
    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)
2. Replace the default JSON item exporter in your settings.py:
FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}
Use the codecs module for text -> text decoding (In Python 2 it's not strictly necessary, but in Python 3 str doesn't have a decode method, because the methods are for str -> bytes and back, not str -> str). Using the unicode_escape codec for decoding will get you the correct data back:
import codecs
somestr = codecs.decode(strwithescapes, 'unicode-escape')
So to fix the names you're getting, you'd do:
item['name'] = codecs.decode(sel.xpath('a/div/h3/text()').extract(), 'unicode-escape')
If the problem is in the JSON you're producing, make sure the json module isn't escaping the strings down to ASCII; it does so by default because not all JSON parsers can handle true Unicode characters (they often assume data is sent as ASCII bytes with escapes). So wherever you call json.dump/json.dumps (or create a json.JSONEncoder), make sure to explicitly pass ensure_ascii=False.
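A quick illustration of both points (the sample text is just the first few characters of the name in the question):

import codecs
import json

escaped = '\\u58c1\\u6bb4\\u308a'                # text that literally contains \uXXXX sequences
name = codecs.decode(escaped, 'unicode-escape')  # the real characters again: 壁殴り
print(json.dumps({'name': name}, ensure_ascii=False))  # {"name": "壁殴り"}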
I'm faced with a situation where I'm reading a string of text and I need to detect the language code (en, de, fr, es, etc).
Is there a simple way to do this in python?
If you need to detect language in response to a user action then you could use the Google AJAX Language API:
#!/usr/bin/env python
import json
import urllib, urllib2

def detect_language(text,
                    userip=None,
                    referrer="http://stackoverflow.com/q/4545977/4279",
                    api_key=None):
    query = {'q': text.encode('utf-8') if isinstance(text, unicode) else text}
    if userip: query.update(userip=userip)
    if api_key: query.update(key=api_key)

    url = 'https://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s' % (
        urllib.urlencode(query))
    request = urllib2.Request(url, None, headers=dict(Referer=referrer))
    d = json.load(urllib2.urlopen(request))
    if d['responseStatus'] != 200 or u'error' in d['responseData']:
        raise IOError(d)
    return d['responseData']['language']

print detect_language("Python - can I detect unicode string language code?")
Output
en
Google Translate API v2
Default limit 100000 characters/day (no more than 5000 at a time).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2
from operator import itemgetter

def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks]

    url = 'https://www.googleapis.com/language/translate/v2'
    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t
           for t in chunks],
        key=api_key,
        target="en"), doseq=1)

    # the request length MUST be < 5000
    if len(data) > 5000:
        raise ValueError("request is too long, see "
                         "http://code.google.com/apis/language/translate/terms.html")

    # NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
                              headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))
    if u'error' in d:
        raise IOError(d)
    return map(itemgetter('detectedSourceLanguage'), d['data']['translations'])
Now you can request language detection explicitly:
def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks]

    url = 'https://www.googleapis.com/language/translate/v2/detect'
    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t
           for t in chunks],
        key=api_key), doseq=True)

    # the request length MUST be < 5000
    if len(data) > 5000:
        raise ValueError("request is too long, see "
                         "http://code.google.com/apis/language/translate/terms.html")

    # NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
                              headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))
    return [sorted(L, key=itemgetter('confidence'))[-1]['language']
            for L in d['data']['detections']]
Example:
print detect_language_v2(
    ["Python - can I detect unicode string language code?",
     u"матрёшка",
     u"打水"], api_key=open('api_key.txt').read().strip())
Output
[u'en', u'ru', u'zh-CN']
In my case I only need to determine two languages so I just check the first character:
import unicodedata

def is_greek(term):
    return 'GREEK' in unicodedata.name(term.strip()[0])

def is_hebrew(term):
    return 'HEBREW' in unicodedata.name(term.strip()[0])
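A quick check of how this behaves (illustrative only):

print(is_greek(u'καλημέρα'))   # True  -- first char is GREEK SMALL LETTER KAPPA
print(is_hebrew(u'שלום'))      # True  -- first char is HEBREW LETTER SHIN
print(is_greek(u'hello'))      # False -- LATIN SMALL LETTER H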
Have a look at guess-language:
Attempts to determine the natural language of a selection of Unicode (utf-8) text.
But as the name says, it guesses the language. You can't expect 100% correct results.
Edit:
guess-language is unmaintained, but there is a fork that supports Python 3: guess_language-spirit.
Look at Natural Language Toolkit and Automatic Language Identification using Python for ideas.
I would like to know whether a Bayesian filter could get the language right, but I can't write a proof of concept right now.
A useful article here suggests that this open-source library, named CLD, is the best bet for detecting language in Python.
The article shows a comparison of speed and accuracy between 3 solutions:
language-detection or its python port langdetect
Tika
Chromium Language Detection (CLD)
I wasted my time with langdetect; now I am switching to CLD, which is 16x faster than langdetect and has 98.8% accuracy.
Try Universal Encoding Detector; it's a port of the chardet module from Firefox to Python.
If you only have a limited number of possible languages, you could use a set of dictionaries (possibly only including the most common words) of each language and then check the words in your input against the dictionaries.
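A minimal sketch of that dictionary-based idea (the word lists here are purely illustrative; real ones would come from frequency dictionaries):

# Tiny common-word lists per language -- illustrative only.
COMMON_WORDS = {
    'en': {'the', 'and', 'is', 'of', 'to', 'in'},
    'de': {'der', 'die', 'und', 'ist', 'von', 'zu'},
    'fr': {'le', 'la', 'et', 'est', 'de', 'dans'},
    'es': {'el', 'la', 'y', 'es', 'de', 'en'},
}

def guess_by_dictionary(text):
    words = set(text.lower().split())
    # Pick the language whose common-word list overlaps the input the most.
    scores = {lang: len(words & vocab) for lang, vocab in COMMON_WORDS.items()}
    return max(scores, key=scores.get)

print(guess_by_dictionary("the cat is in the garden"))  # en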