Python messing up Scandinavian characters (Ö -> Ã)

I know everyone's fed up with encoding questions, but I can't figure this out.
I'm getting data from an XML file (API) in Python. Everything is fine, but when I print values that contain Scandinavian characters, such as Ö or Ä, they get messed up:
Ö -> Ã
Ä -> ä
The XML-document is encoded in UTF-8.
Here's my code. Sorry for the inconvenience.
# Get the data
from urllib2 import urlopen
ur = urlopen("http://www.leffatykki.com/xml/leffat")
data = ur.read()

# Escape bare ampersands (an unescaped '&' triggers a parse error)
data = data.replace('&', '&amp;')

# Loop over the XML
from lxml import etree
from cStringIO import StringIO

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    try:
        name = elem.xpath('name/text()')[0]
        year = elem.xpath('year/text()')[0]
        print name
    except IndexError:
        temp = '...'

context = etree.iterparse(StringIO(data), tag='movie')
fast_iter(context, process_element)

In your call to "etree.iterparse", try filling out the encoding value:
context = etree.iterparse(StringIO(data), tag='movie', encoding="utf-8")
From the etree.iterparse documentation:
"""
| Other keyword arguments:
| - encoding: override the document encoding
| - schema: an XMLSchema to validate against
"""
Better yet - forget that:
I've downloaded your file and played around - it seems to be working, at least for the first movie - maybe you have badly encoded characters in the file itself? It is either that, or everything is just fine and the mess happens only at your print statement -
try using print name.encode("utf-8") - or the correct encoding of your terminal - instead of letting Python try to guess it.
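For example, a minimal sketch (Python 2, with a hard-coded byte string standing in for the XML data) of decoding the bytes once and encoding explicitly when printing:
# -*- coding: utf-8 -*-
raw = '\xc3\x96rebro'        # UTF-8 bytes for u'Örebro'
name = raw.decode('utf-8')   # a unicode object: u'\xd6rebro'
print name.encode('utf-8')   # encode explicitly for a UTF-8 terminal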

Related

Python writing to file and json returns None/null instead of value

I'm trying to write data to a file with the following code
#!/usr/bin/python37all
print('Content-type: text/html\n\n')
import cgi
from Alarm import *
import json
htmldata = cgi.FieldStorage()
alarm_time = htmldata.getvalue('alarm_time')
alarm_date = htmldata.getvalue('alarm_date')
print(alarm_time,alarm_date)
data = {'time':alarm_time,'date':alarm_date}
# print(data['time'],data['date'])
with open('alarm_data.txt','w') as f:
    json.dump(data, f)
...
but when opening the file, I get the following output:
{'time':null,'date':null}
The print statement prints what I expect it to: 14:26 2020-12-12.
I've tried this same method with f.write(), but it returns both values as None. This is being run on a Raspberry Pi. Why aren't the correct values being written?
--EDIT--
The json string I expect to see is the following: {'time':'14:26','date':'2020-12-12'}
Perhaps you meant:
data = {'time':str(alarm_time), 'date':str(alarm_date)}
I would expect to see your file contents like this:
{"time":"14:26","date":"2020-12-12"}
Note the double quotes: ". json is very strict about these things, so don't fool yourself into having single quotes ' in a file and expecting json to parse it.
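To illustrate both points with a sketch independent of the CGI setup: if getvalue() returned None because the form field was missing, json serializes that as null; otherwise the values are written with double quotes.
import json

# json always writes double quotes, and serializes None as null.
data = {'time': '14:26', 'date': '2020-12-12'}
print(json.dumps(data))            # {"time": "14:26", "date": "2020-12-12"}
print(json.dumps({'time': None}))  # {"time": null}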

How to check the Emoji property of a character in Python?

In unicode a character can have an Emoji property.
Is there a standard way in Python to determine if a character is an Emoji?
I know of unicodedata, but it doesn't appear to expose all these extra character details.
Note: I'm asking about the specific attribute called "Emoji" in the Unicode standard, as provided in the link. I don't want an arbitrary list of pattern ranges, and would prefer to use a standard library.
This is the code I ended up creating to load the Emoji information. The get_emoji function downloads the data file, parses it, and calls the enumeration callback. The rest of the code uses this to produce a JSON file of the information I needed.
#!/usr/bin/env python3
# Generates a list of emoji characters and names in JS format
import urllib.request
import unicodedata
import re, json

'''
Enumerates the Emoji characters that match an attribute from the Unicode standard (the Emoji list).
@param on_emoji A callback that is called with each found character. Signature `on_emoji( code_point_value )`
@param attribute The attribute that is desired, such as `Emoji` or `Emoji_Presentation`
'''
def get_emoji(on_emoji, attribute):
    with urllib.request.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt') as f:
        content = f.read().decode(f.headers.get_content_charset())

    # Matches lines like "1F600 ; Emoji_Presentation # ..." or "1F601..1F610 ; ..."
    cldr = re.compile(r'^([0-9A-F]+)(\.\.([0-9A-F]+))?([^;]*);([^#]*)#(.*)$')
    for line in content.splitlines():
        m = cldr.match(line)
        if m is None:
            continue
        line_attribute = m.group(5).strip()
        if line_attribute != attribute:
            continue
        code_point = int(m.group(1), 16)
        if m.group(3) is None:
            # A single code point
            on_emoji(code_point)
        else:
            # A range of code points, e.g. 1F601..1F610
            to_code_point = int(m.group(3), 16)
            for i in range(code_point, to_code_point + 1):
                on_emoji(i)

# Dumps the values into a JSON format
def print_emoji(value):
    c = chr(value)
    try:
        obj = {
            'code': value,
            'name': unicodedata.name(c).lower(),
        }
        print(json.dumps(obj), ',')
    except ValueError:
        # Unicode DB in the installed Python is likely outdated
        pass

print("module.exports = [")
get_emoji(print_emoji, "Emoji_Presentation")
print("]")
That solved my original problem. To answer the question itself it'd just be a matter of sticking the results into a dictionary and doing a lookup.
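To illustrate, a small sketch of that lookup, reusing the get_emoji function above:
# Collect the code points into a set once, then test membership per character.
emoji_presentation = set()
get_emoji(emoji_presentation.add, "Emoji_Presentation")

def is_emoji(ch):
    return ord(ch) in emoji_presentation

print(is_emoji('\U0001F600'))  # True (grinning face)
print(is_emoji('a'))           # False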
I have used the following regex pattern successfully before
import re

emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+", flags=re.UNICODE)
Also check out this question: removing emojis from a string in Python

parsing API with Python - how to handle JSON with BOM

I'm using Python 2.7.11 on Windows to get JSON data from an API (data on trees in Warsaw, Poland, but never mind that). I want to generate an output csv file with all the data provided by the API, for further analysis. I started with a script I used for another project (also discussed here on Stack Overflow and corrected for me by @Martin Taylor). That script didn't work, so I tried to modify it using my very basic understanding, googling around and applying the pdb debugger. At the moment, the result looks like this:
import pdb
import json
import urllib2
import csv

pdb.set_trace()
url = "https://api.um.warszawa.pl/api/action/datastore_search/?resource_id=ed6217dd-c8d0-4f7b-8bed-3b7eb81a95ba"
myfile = 'C:/dane/drzewa.csv'
csv_myfile = csv.writer(open(myfile, 'wb'))
cols = ['numer_adres', 'stan_zdrowia', 'y_wgs84', 'dzielnica', 'adres', 'lokalizacja', 'wiek_w_dni', 'srednica_k', 'pnie_obwod', 'miasto', 'jednostka', 'x_pl2000', 'wysokosc', 'y_pl2000', 'numer_inw', 'x_wgs84', '_id', 'gatunek_1', 'gatunek', 'data_wyk_pom']
csv_myfile.writerow(cols)

def api_iterate(myfile):
    while True:
        global url
        print url
        json_page = urllib2.urlopen(url)
        data = json.load(json_page)
        json_page.close()
        for data_object in data['result']['records']:
            csv_myfile.writerow([data_object[col] for col in cols])
        try:
            url = data['_links']['next']
        except KeyError as e:
            break

with open(myfile, 'wb'):
    api_iterate(myfile)
I'm a very fresh Python user, so I get confused all the time. I've now got to the point where, while reading the objects in the JSON dictionary, I get a KeyError associated with the 'x_wgs84' element. I suppose it has something to do with the fact that in the source url this element is preceded by a U+FEFF unicode character. I tried to get around this but got stuck and would appreciate assistance.
I suspect the code may be corrupt in several other ways - as I mentioned, I'm a very unskilled programmer (yet).
You need to use the key with the unicode character included.
One easy way to see how is to print the keys:
>>> import requests
>>> res = requests.get('https://api.um.warszawa.pl/api/action/datastore_search/?resource_id=ed6217dd-c8d0-4f7b-8bed-3b7eb81a95ba')
>>> data = res.json()
>>> records = data['result']['records']
>>> records[0]
{u'numer_adres': u'', u'stan_zdrowia': u'dobry', u'y_wgs84': u'52.21865', u'y_pl2000': u'5787241.04475524', u'adres': u'ul. ALPEJSKA', u'x_pl2000': u'7511793.96937063', u'lokalizacja': u'Ulica ALPEJSKA', u'wiek_w_dni': u'60', u'miasto': u'Warszawa', u'jednostka': u'Dzielnica Wawer', u'pnie_obwod': u'73', u'wysokosc': u'14', u'data_wyk_pom': u'20130709', u'dzielnica': u'Wawer', u'\ufeffx_wgs84': u'21.172584', u'numer_inw': u'D386200', u'_id': 125435, u'gatunek_1': u'Quercus robur', u'gatunek': u'd\u0105b szypu\u0142kowy', u'srednica_k': u'7'}
>>> records[0].keys()
[u'numer_adres', u'stan_zdrowia', u'y_wgs84', u'y_pl2000', u'adres', u'x_pl2000', u'lokalizacja', u'wiek_w_dni', u'miasto', u'jednostka', u'pnie_obwod', u'wysokosc', u'data_wyk_pom', u'dzielnica', u'\ufeffx_wgs84', u'numer_inw', u'_id', u'gatunek_1', u'gatunek', u'srednica_k']
>>> records[0][u'\ufeffx_wgs84']
u'21.172584'
As you can see, to get your value you need to write the key as u'\ufeffx_wgs84', including the BOM character that is causing the trouble.
Note: I don't know if you are using Python 2 or 3, but in Python 2 you need to put a u before the string literal to declare it as a unicode string.
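Alternatively, a small sketch (continuing the session above) that strips the BOM from every key, so the rest of the script can use plain 'x_wgs84':
>>> clean = [{k.lstrip(u'\ufeff'): v for k, v in rec.items()} for rec in records]
>>> clean[0]['x_wgs84']
u'21.172584'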

How to detect the language type of a given text via Python? [duplicate]

I'm faced with a situation where I'm reading a string of text and I need to detect the language code (en, de, fr, es, etc).
Is there a simple way to do this in python?
If you need to detect language in response to a user action then you could use google ajax language API:
#!/usr/bin/env python
import json
import urllib, urllib2

def detect_language(text,
                    userip=None,
                    referrer="http://stackoverflow.com/q/4545977/4279",
                    api_key=None):
    query = {'q': text.encode('utf-8') if isinstance(text, unicode) else text}
    if userip: query.update(userip=userip)
    if api_key: query.update(key=api_key)

    url = 'https://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s' % (
        urllib.urlencode(query))
    request = urllib2.Request(url, None, headers=dict(Referer=referrer))
    d = json.load(urllib2.urlopen(request))
    if d['responseStatus'] != 200 or u'error' in d['responseData']:
        raise IOError(d)
    return d['responseData']['language']

print detect_language("Python - can I detect unicode string language code?")
Output
en
Google Translate API v2
Default limit 100000 characters/day (no more than 5000 at a time).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2
from operator import itemgetter

def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks]

    url = 'https://www.googleapis.com/language/translate/v2'
    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t
           for t in chunks],
        key=api_key,
        target="en"), doseq=1)

    # the request length MUST be < 5000
    if len(data) > 5000:
        raise ValueError("request is too long, see "
                         "http://code.google.com/apis/language/translate/terms.html")

    # NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
                              headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))
    if u'error' in d:
        raise IOError(d)
    return map(itemgetter('detectedSourceLanguage'), d['data']['translations'])
Now you could request detecting a language explicitly:
def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks]

    url = 'https://www.googleapis.com/language/translate/v2/detect'
    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t
           for t in chunks],
        key=api_key), doseq=True)

    # the request length MUST be < 5000
    if len(data) > 5000:
        raise ValueError("request is too long, see "
                         "http://code.google.com/apis/language/translate/terms.html")

    # NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
                              headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))
    return [sorted(L, key=itemgetter('confidence'))[-1]['language']
            for L in d['data']['detections']]
Example:
print detect_language_v2(
    ["Python - can I detect unicode string language code?",
     u"матрёшка",
     u"打水"], api_key=open('api_key.txt').read().strip())
Output
[u'en', u'ru', u'zh-CN']
In my case I only need to determine two languages so I just check the first character:
import unicodedata

def is_greek(term):
    return 'GREEK' in unicodedata.name(term.strip()[0])

def is_hebrew(term):
    return 'HEBREW' in unicodedata.name(term.strip()[0])
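For example (note these helpers only inspect the first non-whitespace character of the term):
print(is_greek(u'άλφα'))   # True
print(is_hebrew(u'שלום'))  # True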
Have a look at guess-language:
Attempts to determine the natural language of a selection of Unicode (utf-8) text.
But as the name says, it guesses the language. You can't expect 100% correct results.
Edit:
guess-language is unmaintained. But there is a fork (that supports Python 3): guess_language-spirit
Look at Natural Language Toolkit and Automatic Language Identification using Python for ideas.
I would like to know if a Bayesian filter can get language right but I can't write a proof of concept right now.
A useful article suggests that an open-source library named CLD is the best bet for detecting language in Python.
The article shows a comparison of speed and accuracy between 3 solutions:
language-detection or its python port langdetect
Tika
Chromium Language Detection (CLD)
I wasted my time with langdetect; now I am switching to CLD, which is 16x faster than langdetect and has 98.8% accuracy.
Try Universal Encoding Detector; it's a port of the chardet module from Firefox to Python.
If you only have a limited number of possible languages, you could use a set of dictionaries (possibly only including the most common words) of each language and then check the words in your input against the dictionaries.
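A minimal sketch of that dictionary approach, with tiny made-up word lists standing in for real dictionaries:
# Tiny, made-up common-word lists; real ones would be much larger.
COMMON_WORDS = {
    'en': set(['the', 'and', 'is', 'of', 'to']),
    'de': set(['der', 'die', 'und', 'ist', 'das']),
}

def guess_lang(text):
    words = set(text.lower().split())
    # Pick the language whose word list overlaps the input the most.
    return max(COMMON_WORDS, key=lambda lang: len(words & COMMON_WORDS[lang]))

print(guess_lang("das ist der Hund"))  # de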

How can I understand this python error message?

Hi, can you help me decode this message and tell me what to do about it:
main.py", line 1278, in post
message.body = "%s %s/%s/%s" % (msg, host, ad.key().id(), slugify(ad.title.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Thanks
UPDATE: after removing the encode call, it appears to work:
class Recommend(webapp.RequestHandler):
    def post(self, key):
        ad = db.get(db.Key(key))
        email = self.request.POST['tip_email']
        host = os.environ.get("HTTP_HOST", os.environ["SERVER_NAME"])
        senderemail = users.get_current_user().email() if users.get_current_user() else 'info@monton.cl' if host.endswith('.cl') else 'info@monton.com.mx' if host.endswith('.mx') else 'info@montao.com.br' if host.endswith('.br') else 'admin@koolbusiness.com'
        message = mail.EmailMessage(sender=senderemail, subject="%s recommends %s" % (self.request.POST['tip_name'], ad.title))
        message.to = email
        message.body = "%s %s/%s/%s" % (self.request.POST['tip_msg'], host, ad.key().id(), slugify(ad.title))
        message.send()
        matched_images = ad.matched_images
        count = matched_images.count()
        if ad.text:
            p = re.compile(r'(www[^ ]*|http://[^ ]*)')
            text = p.sub(r'\1', ad.text.replace('http://', ''))
        else:
            text = None
        self.response.out.write("Message sent<br>")
        path = os.path.join(os.path.dirname(__file__), 'market', 'market_ad_detail.html')
        self.response.out.write(template.render(path, {'user_url': users.create_logout_url(self.request.uri) if users.get_current_user() else users.create_login_url(self.request.uri),
                                                       'user': users.get_current_user(), 'ad.user': ad.user, 'count': count, 'ad': ad, 'matched_images': matched_images,}))
The problem here is that your underlying model (message.body) only wants ASCII text, but you're trying to give it a unicode string.
Since plain ASCII is enough here, you can make Python substitute a '?' character wherever the string has a non-ASCII character:
"UNICODE STRING".encode('ascii','replace').decode('ascii')
So like from your example above:
message.body = "%s %s/%s/%s" % \
    (msg.encode('ascii', 'replace').decode('ascii'),
     host.encode('ascii', 'replace').decode('ascii'),
     ad.key().id(),  # already numeric, nothing to re-encode
     slugify(ad.title).encode('ascii', 'replace').decode('ascii'))
Or just encode/decode on the variable that has the unicode character.
But this isn't an optimal solution. The best idea is to make message.body a unicode string. Since that doesn't seem feasible (I'm not familiar with GAE), you can use this to at least avoid the errors.
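To see what the 'replace' error handler does, a standalone sketch:
title = u'\xd6rebro caf\xe9'             # u'Örebro café'
print(title.encode('ascii', 'replace'))  # ?rebro caf?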
You've got a Unicode character in a place where you're not supposed to have one. Most often I find this error is caused by MS Word-style slanted quotes.
One of these fields has some characters that cannot be encoded. If you switch to Python 3 (it has better Unicode support), or you change the encoding of the entire script, the problem should stop. The best way to change the encoding in 2.x is the encoding comment line; see http://evanjones.ca/python-utf8.html for a fuller explanation of using Python with UTF-8 support. The best suggestion is to add # -*- coding: utf-8 -*- to the top of your script, and handle strings like this:
s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )
I had a similar problem when using Django-nonrel and Google App Engine.
The problem was with the folder containing the application. This probably isn't the problem described in this question, but maybe it helps someone avoid wasting time like I did.
First try changing your application folder, maybe to /home/, and run it again; if that doesn't work, try something else.
