I have a dictionary containing (among others) this key value pair:
'Title': '\xc3\x96lfarben'
In German this translates to Ölfarben.
I have trouble printing this string to stdout properly.
It is always printed as Ã–lfarben
I have already tried string.decode("utf-8"), string.encode("utf-8"), and many more combinations such as unicode(string.decode("utf-8")), etc.
The problem is that I still have trouble understanding unicode, UTF-8, etc.
Can anyone help?
Update
Here is some more information.
I am receiving a CSV report from the Google AdWords API (using the official Python client library). This data is presumably UTF-8 encoded and stored to disk.
Then I use csv.DictReader to read the CSV from disk and convert each row to a dict. Then I iterate over the data and print it. This is where the problem above occurs.
this is an entire line from the imported dict:
{'Destination URL': 'http://domain.com/file.html?adword={keyword}', 'Ad': 'Staffeleien', 'Campaign': '\xc3\x96 Farben', 'Ad group state': 'enabled', 'Ad state': 'enabled', 'Ad group': 'Farben', 'Campaign state': 'active'}
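In Python 2, csv.DictReader returns the cell values as byte strings, so the missing step is decoding each value to unicode before printing. A minimal sketch of that step; the hand-written row literal below stands in for one DictReader row (the 'Campaign' bytes are taken from the line above), and the same .decode('utf-8') call works on bytes objects in Python 3:

```python
# One row as DictReader would return it: keys mapped to UTF-8 byte strings.
row = {'Campaign': b'\xc3\x96 Farben', 'Ad group': b'Farben'}

# Decode every value from UTF-8 bytes to text before printing.
decoded = {key: value.decode('utf-8') for key, value in row.items()}
print(decoded['Campaign'])  # Ö Farben
```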
If you've added a u prefix to this string, don't do it; you should decode it first. As a unicode object this string looks like u'\xd6lfarben':
>>> print u'\xc3\x96lfarben'
Ã–lfarben
>>> print '\xc3\x96lfarben'.decode('utf-8')
Ölfarben
>>> '\xc3\x96lfarben'.decode('utf-8')
u'\xd6lfarben'
with unicode function:
>>> unicode('\xc3\x96lfarben', encoding='utf-8')
u'\xd6lfarben'
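One way to see the difference between the two objects: the byte string holds nine bytes (Ö is two bytes in UTF-8), while the decoded string holds eight characters. Shown here with Python 3 bytes literals; in Python 2 the plain '...' literal behaves the same way:

```python
raw = b'\xc3\x96lfarben'      # the raw value from the dict: 9 bytes
text = raw.decode('utf-8')    # the decoded text: 8 characters
print(len(raw), len(text))    # 9 8
print(text)                   # Ölfarben
```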
UPDATE: I opened an issue on GitHub based on Ivan Mainetti's suggestion. If you want to weigh in there, it is: https://github.com/orientechnologies/orientdb/issues/6757
I am working on a database based on OrientDB, using a Python interface for it. I've had pretty good luck with it, but I've run into a problem that seems to be the driver (pyorient) misbehaving when dealing with certain unicode characters.
The data structure I'm uploading to the database looks like this:
new_Node = {'#Nodes':
    {
        'Abs_Address': Ono.absolute_address,
        'Content': Ono.content,
        'Heading': Ono.heading,
        'Type': Ono.type,
        'Value': Ono.value
    }
}
I have created literally hundreds of records flawlessly on OrientDB / pyorient. I don't think this is necessarily a pyorient-specific problem, though; I think it is failing on this particular record because the Ono.absolute_address element contains a unicode character that pyorient is somehow choking on.
The record I want to create has an Abs_Address of /u/c/2/a1–2, but the node I get when I pass the value into my data structure above is this:
{'#Nodes': {'Content': '', 'Abs_Address': u'/u/c/2/a1\u20132', 'Type': 'section', 'Heading': ' Transferred', 'Value': u'1\u20132'}}
I think that somehow my problem is that Python is mixing unicode and ASCII strings/chars? I'm a bit new to Python and not declaring types, so I'm hoping this isn't an issue with pyorient per se, given that the new_Node object doesn't output the properly formatted string...? Or is this an instance of pyorient not liking unicode? I'm tearing my hair out on this one. Any help is appreciated.
In case the error is coming from pyorient and not some kind of text encoding, here's the pyorient-related info. I am creating the record using this code:
rec_position = self.pyo_client.record_create(14, new_Node)
And this is the error I'm getting:
com.orientechnologies.orient.core.storage.ORecordDuplicatedException - Cannot index record Nodes{Content:,Abs_Address:null,Type:section,Heading: Transferred,Value:null}: found duplicated key 'null' in index 'Nodes.Abs_Address' previously assigned to the record #14:558
The error is odd, as it suggests that the backend database is getting a null object for the address. Apparently it did create an entry for this "address," but it's not what I want it to do. I don't know why address strings containing unicode are coming up null in the database... I can create the record through OrientDB Studio using the exact string I fed into the new_Node data structure, but I can't use Python to do the same thing.
Someone help?
EDIT:
Thanks to Laurent, I've narrowed the problem down to something involving unicode objects and pyorient. Whenever the variable I am passing is of type unicode, the pyorient adapter sends a null value to the OrientDB database. I determined that the value causing the problem is an ndash symbol, and Laurent helped me replace it with a minus sign using this code:
.replace(u"\u2013",u"-")
Even after the replacement, however, the value is still a unicode object, which pyorient again passes as a null value... This is not good. I can fix this short term by recasting the string using str(...), and this appears to solve my immediate problem:
str(Ono.absolute_address.replace(u"\u2013",u"-"))
The problem is that I know I will have symbols and other unusual characters in my DB data. I know the database supports unicode strings because I can add them manually, or use SQL syntax to do what I cannot do via pyorient and Python... I am assuming this is a dicey casting issue somewhere, but I'm not really sure where. This seems very similar to this problem: http://stackoverflow.duapp.com/questions/34757352/how-do-i-create-a-linked-record-in-orientdb-using-pyorient-library
Any pyorient people out there? Python gods? Lucky s0bs? =)
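For what it's worth, rather than stripping the dash with replace() and str() (which loses information, and str() raises UnicodeEncodeError in Python 2 for any non-ASCII character), one option is to hand the driver an explicit UTF-8 byte string and decode it again on the way out. Whether pyorient accepts bytes here is exactly the open question above, so treat this as a sketch of the encoding step, not a verified fix:

```python
# The address containing the en dash (U+2013) that was coming back null.
address = '/u/c/2/a1\u20132'

encoded = address.encode('utf-8')    # explicit UTF-8 bytes: the dash is 0xe2 0x80 0x93
restored = encoded.decode('utf-8')   # round-trips without losing the character
print(restored == address)           # True
```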
I have tried your example on Python 3 with the development branch of pyorient and the latest version of OrientDB (2.2.11). If I pass the values without escaping them, your example seems to work for me and I get the right values back.
So this test works:
def test_test1(self):
    new_Node = {'#Nodes': {'Content': '',
                           'Abs_Address': '/u/c/2/a1–2',
                           'Type': 'section',
                           'Heading': ' Transferred',
                           'Value': u'1\u20132'}
                }
    self.client.record_create(14, new_Node)
    result = self.client.query('SELECT * FROM V where Abs_Address="/u/c/2/a1–2"')
    assert result[0].Abs_Address == '/u/c/2/a1–2'
I think you may be saving the unicode value as an escaped value and that's where things get tricky.
I don't trust replacing values myself so I usually escape the unicode values I send to orientdb with the following code:
import json

def _escape(string):
    return json.dumps(string)[1:-1]
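To see what the helper does: it turns the en dash into the six-character sequence \u2013, and decoding with unicode_escape recovers the original text. A quick round-trip check (the helper is repeated here so the snippet is self-contained):

```python
import json

def _escape(string):
    # Same helper as above: JSON-escape the text, then strip the surrounding quotes.
    return json.dumps(string)[1:-1]

escaped = _escape('a1\u20132')   # the en dash becomes the six characters \u2013
print(escaped)                   # a1\u20132
restored = escaped.encode('utf-8').decode('unicode_escape')
print(restored == 'a1\u20132')   # True
```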
The following test would fail because the escaped value won't match the escaped value in the DB so no record will be returned:
def test_test2(self):
    new_Node = {'#Nodes': {'Content': '',
                           'Abs_Address': _escape('/u/c/2/a1–2'),
                           'Type': 'section',
                           'Heading': ' Transferred',
                           'Value': u'1\u20132'}
                }
    self.client.record_create(14, new_Node)
    result = self.client.query('SELECT * FROM V where Abs_Address="%s"' % _escape('/u/c/2/a1–2'))
    assert result[0].Abs_Address.encode('UTF-8').decode('unicode_escape') == '/u/c/2/a1–2'
In order to fix this, you have to escape the value twice:
def test_test3(self):
    new_Node = {'#Nodes': {'Content': '',
                           'Abs_Address': _escape('/u/c/2/a1–2'),
                           'Type': 'section',
                           'Heading': ' Transferred',
                           'Value': u'1\u20132'}
                }
    self.client.record_create(14, new_Node)
    result = self.client.query('SELECT * FROM V where Abs_Address="%s"' % _escape(_escape('/u/c/2/a1–2')))
    assert result[0].Abs_Address.encode('UTF-8').decode('unicode_escape') == '/u/c/2/a1–2'
This test will succeed because you will now be asking for the escaped value in the DB.
I have a problem that I would like to know how to efficiently tackle.
I have data that is JSON-formatted (used with dumps / loads) and contains unicode.
This is part of a protocol implemented with JSON to send messages, so messages are sent as strings and then loaded into Python dictionaries. This means the representation as a Python dictionary afterwards will look something like:
{u"mykey": u"myVal"}
Handling such structures is no problem in itself for the system, but the trouble starts when I make a database query to store one of them.
I'm using pyorient to talk to OrientDB. The command ends up looking like:
"CREATE VERTEX TestVertex SET data = {u'mykey': u'myVal'}"
Which will end up in the data field getting the following values in OrientDB:
{'_NOT_PARSED_': '_NOT_PARSED_'}
I'm assuming this problem relates to other cases as well when you wish to make a query or somehow represent a data object containing unicode.
How could I efficiently get a representation of this data, of arbitrary depth, to be able to use it in a query?
To clarify even more, this is the string the db expects:
"CREATE VERTEX TestVertex SET data = {'mykey': 'myVal'}"
If I'm simply stating the wrong problem/question and should handle it some other way, I'm very much open to suggestions. But what I want to achieve is to have an efficient way to use python2.7 to build a db-query towards orientdb (using pyorient) that specifies an arbitrary data structure. The data property being set is of the OrientDB type EMBEDDEDMAP.
Any help greatly appreciated.
EDIT1:
To avoid confusion: the first code block shows the object as a dict AFTER being dumped/loaded with json.
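A sketch of one way to build the command (assuming the driver accepts it as plain SQL text): json.dumps produces {"mykey": "myVal"} rather than the {u'mykey': u'myVal'} repr, and that JSON-style map literal is the form the EMBEDDEDMAP field expects for simple string and number values:

```python
import json

data = {u'mykey': u'myVal'}  # the dict as it looks after json.loads
query = 'CREATE VERTEX TestVertex SET data = %s' % json.dumps(data)
print(query)  # CREATE VERTEX TestVertex SET data = {"mykey": "myVal"}
```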
Dargolith:
ok, based on your last response it seems you are simply looking for code that will dump a Python expression in a way that lets you control how unicode and other data types print. Here is a very simple function that provides this control. There are ways to make this function more efficient (for example, by using a string buffer rather than doing all of the recursive string concatenation happening here). Still, this is a very simple function, and as it stands its execution is probably still dominated by your DB lookup.
As you can see in each of the 'if' statements, you have full control of how each data type prints.
def expr_to_str(thing):
    if hasattr(thing, 'keys'):
        pairs = ['%s:%s' % (expr_to_str(k), expr_to_str(v)) for k, v in thing.iteritems()]
        return '{%s}' % ', '.join(pairs)
    if hasattr(thing, '__setslice__'):
        parts = [expr_to_str(ele) for ele in thing]
        return '[%s]' % (', '.join(parts),)
    if isinstance(thing, basestring):
        return "'%s'" % (str(thing),)
    return str(thing)

print "dumped: %s" % expr_to_str({'one': 33, 'two': [u'unicode', 'just a str', 44.44, {'hash': 'here'}]})
outputs:
dumped: {'two':['unicode', 'just a str', 44.44, {'hash':'here'}], 'one':33}
I went on to use json.dumps() as sobolevn suggested in the comments. I didn't think of it at first since I wasn't really using JSON in the driver. It turned out, however, that json.dumps() provided exactly the format I needed for all the data types I use. Some examples:
>>> json.dumps('test')
'"test"'
>>> json.dumps(['test1', 'test2'])
'["test1", "test2"]'
>>> json.dumps([u'test1', u'test2'])
'["test1", "test2"]'
>>> json.dumps({u'key1': u'val1', u'key2': [u'val21', 'val22', 1]})
'{"key2": ["val21", "val22", 1], "key1": "val1"}'
If you need to take more control of the format, quotes or other things regarding this conversion, see the reply by Dan Oblinger.
I am trying to gather weather data from the National Weather Service and read it into a Python script. They offer a JSON return, but they also offer another return which isn't valid JSON but has more variables (which I want). This set of data looks like it is formatted as a Python dictionary. It looks like this:
stations={
KAPC:
{
'id':'KAPC',
'stnid':'92',
'name':'Napa, Napa County Airport',
'elev':'33',
'latitude':'38.20750',
'longitude':'-122.27944',
'distance':'',
'provider':'NWS/FAA',
'link':'http://www.wrh.noaa.gov/mesowest/getobext.php?sid=KAPC',
'Date':'24 Feb 8:54 am',
'Temp':'39',
'TempC':'4',
'Dewp':'29',
'Relh':'67',
'Wind':'NE#6',
'Direction':'50°',
'Winds':'6',
'WindChill':'35',
'Windd':'50',
'SLP':'1027.1',
'Altimeter':'30.36',
'Weather':'',
'Visibility':'10.00',
'Wx':'',
'Clouds':'CLR',
[...]
So, to me, it looks like it's got a defined variable stations equal to a dictionary of dictionaries containing the stations and their variables. My question is: how do I access this data? Right now I am trying:
import urllib

response = urllib.urlopen(url)
r = response.read()
If I try to use the json module, it clearly fails because this isn't JSON. And if I just read the file, it comes back as one long string of characters. Any suggestions on how to extract this data? If possible, I would just like to get the dictionary as it exists in the URL return, i.e. stations={...}. Thanks!
As far as I infer from the question, you have data in the form of text which is not valid JSON. So given text like line = "stations={'KAPC':{'id':'KAPC', 'stnid':'92', 'name':'Napa, Napa County Airport'}}" (say), we can extract the dictionary by splitting the text at the = symbol and then using the eval() function, which initializes the dictionary variable with the required data.
>>> dictionary_text = line.split("=")[1]
>>> python_dictionary = eval(dictionary_text)
>>> print python_dictionary
{'KAPC': {'id': 'KAPC', 'name': 'Napa, Napa County Airport', 'stnid': '92'}}
The python_dictionary now behaves like a regular Python dictionary with key/value pairs, and you can access any attribute using python_dictionary["KAPC"]["id"].
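One caveat worth adding: eval() will execute any code embedded in the text, which is risky for data fetched from a remote URL. The standard library's ast.literal_eval parses the same literal syntax without executing code, so a safer version of the above is (note this assumes the keys are quoted, as in the example line; the raw NWS feed shown in the question has unquoted keys like KAPC, which neither eval() nor literal_eval will accept without preprocessing):

```python
import ast

line = "stations={'KAPC':{'id':'KAPC', 'stnid':'92', 'name':'Napa, Napa County Airport'}}"
dictionary_text = line.split("=", 1)[1]          # split only on the first '='
python_dictionary = ast.literal_eval(dictionary_text)
print(python_dictionary["KAPC"]["id"])           # KAPC
```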
I'm trying to use Facebook's REST API, and am encoding a JSON string/dictionary using urllib.urlencode. The result I get, however, is different from the correct encoded result (as displayed by pasting the dictionary in the attachment field here: http://developers.facebook.com/docs/reference/rest/stream.publish/). I was wondering if anyone could offer any help.
Thanks.
EDIT:
I'm trying to encode the following dictionary:
{"media": [{"type":"flash", "swfsrc":"http://shopperspoll.webfactional.com/media/flashFile.swf", "height": '100', "width": '100', "expanded_width":"160", "expanded_height":"120", "imgsrc":"http://shopperspoll.webfactional.com/media/laptop1.jpg"}]}
This is the encoded string using urllib.urlencode:
"media=%5B%7B%27swfsrc%27%3A+%27http%3A%2F%2Fshopperspoll.webfactional.com%2Fmedia%2FflashFile.swf%27%2C+%27height%27%3A+%27100%27%2C+%27width%27%3A+%27100%27%2C+%27expanded_width%27%3A+%27160%27%2C+%27imgsrc%27%3A+%27http%3A%2F%2Fshopperspoll.webfactional.com%2Fmedia%2Flaptop1.jpg%27%2C+%27expanded_height%27%3A+%27120%27%2C+%27type%27%3A+%27flash%27%7D%5D"
It's not letting me copy the result produced by the Facebook REST documentation link, but on pasting the above dictionary into the attachment field there, the result is different.
urllib.urlencode isn't meant for URL-encoding a single value (as same-named functions in many other languages are), but for encoding a dict of separate values. For example, given the dict {"a": 1, "b": 2} it would produce the string "a=1&b=2".
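A quick demonstration (shown with Python 3's urllib.parse.urlencode; in Python 2 the same function lives at urllib.urlencode):

```python
from urllib.parse import urlencode  # Python 2: from urllib import urlencode

# Each key/value pair becomes key=value, joined with '&'.
print(urlencode({"a": 1, "b": 2}))  # a=1&b=2
```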
First, you want to encode your dict as JSON.
data = {"media": [{"type":"flash", "swfsrc":"http://shopperspoll.webfactional.com/media/flashFile.swf", "height": '100', "width": '100', "expanded_width":"160", "expanded_height":"120", "imgsrc":"http://shopperspoll.webfactional.com/media/laptop1.jpg"}]}
import json
json_encoded = json.dumps(data)
You can then either use urllib.urlencode to create a complete query string:
import urllib

urllib.urlencode({"access_token": example, "attachment": json_encoded})
# produces a long string in the form "access_token=...&attachment=..."
or use urllib.quote to just encode your attachment parameter
urllib.quote(json_encoded)
# produces just the part following "&attachment="
Does anyone have a nifty way to get all the three-letter alphabetic currency codes (an example of the ones I mean is at http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm) into a list in Python 2.5? Note I don't want a screen-scraping version, as the code has to work offline; the website is just an example of the codes.
It looks like there should be a way using the locale library, but it's not clear to me from reading the documentation, and there must be a better way than copy-pasting those codes into a file!
To clarify the question a bit more: in C#, the following code solves the same problem very neatly using the built-in locale libraries:
CultureInfo.GetCultures(CultureTypes.SpecificCultures)
.Select(c => new RegionInfo(c.LCID).CurrencySymbol)
.Distinct()
I was hoping there might be an equivalent in python. And thanks to everyone who has provided an answer so far.
Not very elegant or nifty, but you can generate the list once and save it to use later:
import urllib, re
url = "http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm"
print re.findall(r'\<td valign\="top"\>\s+([A-WYZ][A-Z]{2})\s+\</td\>', urllib.urlopen(url).read())
output:
['AFN', 'EUR', 'ALL', 'DZD', 'USD', 'EUR', 'AOA', 'ARS', 'AMD', 'AWG', 'AUD',
...
'UZS', 'VUV', 'EUR', 'VEF', 'VND', 'USD', 'USD', 'MAD', 'YER', 'ZMK', 'ZWL', 'SDR']
Note that codes beginning with X are apparently reserved names, which the regex already skips; you'll still get one rogue entry (SDR, the last element), which you can just delete yourself.
You can get currency codes (and other) data from geonames. Here's some code that downloads the data (save the file locally to achieve the same result offline) and populates a list:
import urllib2

data = urllib2.urlopen('http://download.geonames.org/export/dump/countryInfo.txt')
ccodes = []
for line in data.read().split('\n'):
    if not line.startswith('#'):
        line = line.split('\t')
        try:
            if line[10]:
                ccodes.append(line[10])
        except IndexError:
            pass
ccodes = list(set(ccodes))
ccodes.sort()
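As an aside, the last two lines can be collapsed into one, since sorted() accepts any iterable and returns a list (shown here with a few sample codes rather than the downloaded data):

```python
codes = ['USD', 'EUR', 'USD', 'GBP', 'EUR', 'AFN']
ccodes = sorted(set(codes))   # dedupe and sort in one step
print(ccodes)                 # ['AFN', 'EUR', 'GBP', 'USD']
```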