General Python Unicode / ASCII Casting Issue Causing Trouble in Pyorient

General Python Unicode / ASCII Casting Issue Causing Trouble in Pyorient - python

UPDATE: I opened an issue on github based on a Ivan Mainetti's suggestion. If you want to weigh in there, it is :https://github.com/orientechnologies/orientdb/issues/6757
I am working on a database based on OrienDB and using a python interface for it. I've had pretty good luck with it, but I've run into a problem that seems to be the driver's (pyorient) wonkiness when dealing with certain unicode characters.
The data structure I'm uploading to the database looks like this:
new_Node = {'#Nodes':
{
"Abs_Address":Ono.absolute_address,
'Content':Ono.content,
'Heading':Ono.heading,
'Type':Ono.type,
'Value':Ono.value
}
}
I have created literally hundreds of records flawlessly on OrientDB / pyorient. I don't think the problem is necessarily a pyorient specific question, however, as I think the reason it is failing on a particular record is because the Ono.absolute_address element has a unicode character that pyorient is somehow choking on.
The record I want to create has an Abs_address of /u/c/2/a1–2, but the node I get when I pass the value to the my data structure above is this:
{'#Nodes': {'Content': '', 'Abs_Address': u'/u/c/2/a1\u20132', 'Type': 'section', 'Heading': ' Transferred', 'Value': u'1\u20132'}}
I think that somehow my problem is python is mixing unicode and ascii strings / chars? I'm a bit new to python and not declaring types, so I'm hoping this isn't an issue with pyorient perse given that the new_Node object doesn't output the properly formatted string...? Or is this an instance of pyorient not liking unicode? I'm tearing my hair out on this one. Any help is appreciated.
In case the error is coming from pyorient and not some kind of text encoding, here's the pyorient-related info. I am creating the record using this code:
rec_position = self.pyo_client.record_create(14, new_Node)
And this is the error I'm getting:
com.orientechnologies.orient.core.storage.ORecordDuplicatedException - Cannot index record Nodes{Content:,Abs_Address:null,Type:section,Heading: Transferred,Value:null}: found duplicated key 'null' in index 'Nodes.Abs_Address' previously assigned to the record #14:558
The error is odd as it suggests that the backend database is getting a null object for the address. Apparently it did create an entry for this "address," but it's not what I want it to do. I don't know why address strings with unicode are coming up null in the database... I can create it through orientDB studio using the exact string I fed into the new_Node data structure... but I can't use python to do the same thing.
Someone help?
EDIT:
Thanks to Laurent, I've narrowed the problem down to something to do with unicode objects and pyorient. Whenever a the variable I am passing is type unicode, the pyorient adapter sends a null value to the OrientDB database. I determined the value that is causing the problem is an ndash symbol, and Laurent helped me replace it with a minus sign using this code
.replace(u"\u2013",u"-")
When I do that, however, pyorient gets unicode objects which it then passes as null values... This is not good. I can fix this short term by recasting the string using str(...) and this appears to solve my immediate problem:
str(Ono.absolute_address.replace(u"\u2013",u"-"))
. Problem is, I know I will have symbols and other unusual characters in my DB data. I know the database supports the unicode strings because I can add them manually or use SQL syntax to do what I cannot do via pyorient and python... I am assuming this is a dicey casting issue somewhere, but I'm not really sure where. This seems very similar to this problem: http://stackoverflow.duapp.com/questions/34757352/how-do-i-create-a-linked-record-in-orientdb-using-pyorient-library
Any pyorient people out there? Python gods? Lucky s0bs? =)

I have tried your example on Python 3 with the development branch of pyorient with the latest version of OrientDB 2.2.11. If I pass the values without escaping them, your example seems to work for me and I get the right values back.
So this test works:
def test_test1(self):
new_Node = {'#Nodes': {'Content': '',
'Abs_Address': '/u/c/2/a1–2',
'Type': 'section',
'Heading': ' Transferred',
'Value': u'1\u20132'}
}
self.client.record_create(14, new_Node)
result = self.client.query('SELECT * FROM V where Abs_Address="/u/c/2/a1–2"')
assert result[0].Abs_Address == '/u/c/2/a1–2'
I think you may be saving the unicode value as an escaped value and that's where things get tricky.
I don't trust replacing values myself so I usually escape the unicode values I send to orientdb with the following code:
import json
def _escape(string):
return json.dumps(string)[1:-1]
The following test would fail because the escaped value won't match the escaped value in the DB so no record will be returned:
def test_test2(self):
new_Node = {'#Nodes': {'Content': '',
'Abs_Address': _escape('/u/c/2/a1–2'),
'Type': 'section',
'Heading': ' Transferred',
'Value': u'1\u20132'}
}
self.client.record_create(14, new_Node)
result = self.client.query('SELECT * FROM V where Abs_Address="%s"' % _escape('/u/c/2/a1–2'))
assert result[0].Abs_Address.encode('UTF-8').decode('unicode_escape') == '/u/c/2/a1–2'
In order to fix this, you have to escape the value twice:
def test_test3(self):
new_Node = {'#Nodes': {'Content': '',
'Abs_Address': _escape('/u/c/2/a1–2'),
'Type': 'section',
'Heading': ' Transferred',
'Value': u'1\u20132'}
}
self.client.record_create(14, new_Node)
result = self.client.query('SELECT * FROM V where Abs_Address="%s"' % _escape(_escape('/u/c/2/a1–2')))
assert result[0].Abs_Address.encode('UTF-8').decode('unicode_escape') == '/u/c/2/a1–2'
This test will succeed because you will now be asking for the escaped value in the DB.

Related

JSON Parsing with python from Rethink database [Python]

Im trying to retrieve data from a database named RethinkDB, they output JSON when called with r.db("Databasename").table("tablename").insert([{ "id or primary key": line}]).run(), when doing so it outputs [{'id': 'ValueInRowOfid\n'}] and I want to parse that to just the value eg. "ValueInRowOfid". Ive tried with JSON in Python, but I always end up with the typeerror: list indices must be integers or slices, not str, and Ive been told that it is because the Database outputs invalid JSON format. My question is how can a JSON format be invalid (I cant see what is invalid with the output) and also what would be the best way to parse it so that the value "ValueInRowOfid" is left in a Operator eg. Value = ("ValueInRowOfid").
This part imports the modules used and connects to RethinkDB:
import json
from rethinkdb import RethinkDB
r = RethinkDB()
r.connect( "localhost", 28015).repl()
This part is getting the output/value and my trial at parsing it:
getvalue = r.db("Databasename").table("tablename").sample(1).run() # gets a single row/value from the table
print(getvalue) # If I print that, it will show as [{'id': 'ValueInRowOfid\n'}]
dumper = json.dumps(getvalue) # I cant use `json.loads(dumper)` as JSON object must be str. Which the output of the database isnt (The output is a list)
parsevalue = json.loads(dumper) # After `json.dumps(getvalue)` I can now load it, but I cant use the loaded JSON.
print(parsevalue["id"]) # When doing this it now says that the list is a str and it needs to be an integers or slices. Quite frustrating for me as it is opposing it self eg. It first wants str and now it cant use str
print(parsevalue{'id'}) # I also tried to shuffle it around as seen here, but still the same result
I know this is janky and is very hard to comprehend this level of stupidity that I might be on. As I dont know if it is the most simple problem or something that just isnt possible (Which it should or else I cant use my data in the database.)
Thank you for reading this through and not jumping straight into the comments and say that I have to read the JSON documentation, because I have and I havent found a single piece that could help me.
I tried reading the documentation and watching tutorials about JSON and JSON parsing. I also looked for others whom have had the same problems as me and couldnt find.

It looks like it's returning a dictionary ({}) inside a list ([]) of one element.
Try:
getvalue = r.db("Databasename").table("tablename").sample(1).run()
print(getvalue[0]['id'])

How to Write a Mongo Query Where One of the Key Has a # Character as a Name in a Python / Flask Environment?

I am a rookie to programming. I use Flask/Python as backend and MongoDB as my database. My mongo's documents are uploaded by CSV files and I have no control of the header name. Thus I cannot change the header's name to remove the # character.
**Mongo Collection**
Part #: "ABC123"
Description : "DC Motor 12V"
**Flask/Python/Backend**
query = { "Part #" : "ABC123" }
Since # character denotes comment in Python, I have tried to use "Part \#" to escape # but I think when it was sent to MongoDB as a query, it sees the backslash in the key name and no results appear.
I have googled for long time but unable to find a solution to it. Can someone provide a hint on what I can do? Thank you.

After #difurious commented that there were no issue with query, I looked deeper.
Apparently the collection's field name "Part # " had a _ space after the hash. In Mongo's Compass DB, it doesn't shows up unless you click to edit it.
So it was not returning any result because of wrong field name.
To conclude this question, it was my mistake. # in a python string is acceptable for mongo query.
Thanks to #difurious

Decoding XML object from mssql in Python

I get back an XML object from a mssql server when I call a SP from Python (2.7). I get it in the following form:
{u'XML_F52E2B61-18A1-11d1-B105-00805F49916B': 'D\x02i\x00d\x00D\x05d\x00e\x00s\x00c\x00r\x00D\x0bd\x00a\x00t\x00a\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\x13e\x00n\x00u\x00m\x00e\x00r\x00a\x00t\x00i\x00o\x00n\x00_\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\rs\x00y\x00s\x00t\x00e\x00m\x00f\x00e\x00a\x00t\x00u\x00r\x00e\x00D\x04l\x00i\x00n\x00k\x00D\x07F\x00e\x00a\x00t\x00u\x00r\x00e\x00\x01\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00A\x01\x07A\x01\x01A\x03B\x01\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x1a\x00r\x00e\x00s\x00p\x00o\x00n\x00d\x00e\x00n\x00t\x00_\x00i\x00d\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x02\x00\x00\x00\x81\x01\x01\x02A\x03\x11 \x00W\x00o\x00r\x00k\x00s\x00 \x00a\x00t\x00 \x00c\x00o\x00m\x00p\x00a\x00n\x00y\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x01\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x03\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x0c\x00G\x00e\x00n\x00d\x00e\x00r\x00\x81\x02\x01\x03A\x03B\x08\x00\x00\x00\x81\x03\x01\x04A\x03B\x01\x00\x00\x00\x81\x04\x01\x05A\x03F\x00\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x81\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00'}
I have two questions:
1: What encoding is this?
2: What library should I use to decode this?
Addition:
The XML as it shows in the SQL Management Studio:
The SP:
ALTER PROCEDURE [dbo].[rdb_sql2python]
AS
BEGIN
SET NOCOUNT ON
SELECT * FROM [_rdb].[dbo].[features] FOR XML RAW ('Feature'), ROOT ('FeatureS'), ELEMENTS
SET NOCOUNT OFF
END

I try something like an answer, at least to the question: What is this:
At this JSON-viewer your string as you presented it did not work. But when I removed the "u", replaced the single quotes with double quotes and removed the "D" it worked somehow:
This string
{"XML_F52E2B61-18A1-11d1-B105-00805F49916B":
"\x02i\x00d\x00D\x05d\x00e\x00s\x00c\x00r\x00D\x0bd\x00a\x00t\x00a\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\x13e\x00n\x00u\x00m\x00e\x00r\x00a\x00t\x00i\x00o\x00n\x00_\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\rs\x00y\x00s\x00t\x00e\x00m\x00f\x00e\x00a\x00t\x00u\x00r\x00e\x00D\x04l\x00i\x00n\x00k\x00D\x07F\x00e\x00a\x00t\x00u\x00r\x00e\x00\x01\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00A\x01\x07A\x01\x01A\x03B\x01\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x1a\x00r\x00e\x00s\x00p\x00o\x00n\x00d\x00e\x00n\x00t\x00_\x00i\x00d\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x02\x00\x00\x00\x81\x01\x01\x02A\x03\x11
\x00W\x00o\x00r\x00k\x00s\x00 \x00a\x00t\x00
\x00c\x00o\x00m\x00p\x00a\x00n\x00y\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x01\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x03\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x0c\x00G\x00e\x00n\x00d\x00e\x00r\x00\x81\x02\x01\x03A\x03B\x08\x00\x00\x00\x81\x03\x01\x04A\x03B\x01\x00\x00\x00\x81\x04\x01\x05A\x03F\x00\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x81\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00"}
converts to
Name: XML_F52E2B61-18A1-11d1-B105-00805F49916B
Value: "idDdescrDdatatype_idDenumeration_type_idD systemfeatureDlinkDFeatureFeatureSAAABArespondent_idABAFAFAABA Works at companyABAFAFAABAGenderABABAFAFFeatureS"
This is - for sure - not the final solution, but it's clear, that this is BSON encoded JSON.
It might be a good idea to show (the relevant parts of) you(r) SP and the way you are calling this. Might be, that there is a completely different / better approach...

Representation of python dictionaries with unicode in database queries

I have a problem that I would like to know how to efficiently tackle.
I have data that is JSON-formatted (used with dumps / loads) and contains unicode.
This is part of a protocol implemented with JSON to send messages. So messages will be sent as strings and then loaded into python dictionaries. This means that the representation, as a python dictionary, afterwards will look something like:
{u"mykey": u"myVal"}
It is no problem in itself for the system to handle such structures, but the thing happens when I'm going to make a database query to store this structure.
I'm using pyOrient towards OrientDB. The command ends up something like:
"CREATE VERTEX TestVertex SET data = {u'mykey': u'myVal'}"
Which will end up in the data field getting the following values in OrientDB:
{'_NOT_PARSED_': '_NOT_PARSED_'}
I'm assuming this problem relates to other cases as well when you wish to make a query or somehow represent a data object containing unicode.
How could I efficiently get a representation of this data, of arbitrary depth, to be able to use it in a query?
To clarify even more, this is the string the db expects:
"CREATE VERTEX TestVertex SET data = {'mykey': 'myVal'}"
If I'm simply stating the wrong problem/question and should handle it some other way, I'm very much open to suggestions. But what I want to achieve is to have an efficient way to use python2.7 to build a db-query towards orientdb (using pyorient) that specifies an arbitrary data structure. The data property being set is of the OrientDB type EMBEDDEDMAP.
Any help greatly appreciated.
EDIT1:
More explicitly stating that the first code block shows the object as a dict AFTER being dumped / loaded with json to avoid confusion.

Dargolith:
ok based on your last response it seems you are simply looking for code that will dump python expression in a way that you can control how unicode and other data types print. Here is a very simply function that provides this control. There are ways to make this function more efficient (for example, by using a string buffer rather than doing all of the recursive string concatenation happening here). Still this is a very simple function, and as it stands its execution is probably still dominated by your DB lookup.
As you can see in each of the 'if' statements, you have full control of how each data type prints.
def expr_to_str(thing):
if hasattr(thing, 'keys'):
pairs = ['%s:%s' % (expr_to_str(k),expr_to_str(v)) for k,v in thing.iteritems()]
return '{%s}' % ', '.join(pairs)
if hasattr(thing, '__setslice__'):
parts = [expr_to_str(ele) for ele in thing]
return '[%s]' % (', '.join(parts),)
if isinstance(thing, basestring):
return "'%s'" % (str(thing),)
return str(thing)
print "dumped: %s" % expr_to_str({'one': 33, 'two': [u'unicode', 'just a str', 44.44, {'hash': 'here'}]})
outputs:
dumped: {'two':['unicode', 'just a str', 44.44, {'hash':'here'}], 'one':33}

I went on to use json.dumps() as sobolevn suggested in the comment. I didn't think of that one at first since I wasn't really using json in the driver. It turned out however that json.dumps() provided exactly the formats I needed on all the data types I use. Some examples:
>>> json.dumps('test')
'"test"'
>>> json.dumps(['test1', 'test2'])
'["test1", "test2"]'
>>> json.dumps([u'test1', u'test2'])
'["test1", "test2"]'
>>> json.dumps({u'key1': u'val1', u'key2': [u'val21', 'val22', 1]})
'{"key2": ["val21", "val22", 1], "key1": "val1"}'
If you need to take more control of the format, quotes or other things regarding this conversion, see the reply by Dan Oblinger.

Unicode Hell in Pyramid: MySQL -> SQLAlchemy -> Pyramid -> JSON

Background
I'm in a real mess with unicode and Python. It seems to be a common angst and I've tried using other solutions out there but I just can't get my head around it.
Setup
MySQL Database Setup
collation_database: utf8_general_ci
character_set_database: utf8
SQLAlchemy Model
class Product(Base):
id = Column('product_id', Integer, primary_key=True)
name = Column('product_name', String(64)) #Tried using Unicode() but didn't help
Pyramid View
#view_config(renderer='json', route_name='products_search')
def products_search(request):
json_products = []
term = "%%%s%%" % request.params['term']
products = dbsession.query(Product).filter(Product.name.like(term)).all()
for prod in products:
json_prod = {'id': prod.id, 'label': prod.name, 'value': prod.name, 'sku': prod.sku, 'price': str(prod.price[0].price)}
json_products.append(json_prod)
return json_products
Problem
I get encoding errors reported from the json module (which is called as its the renderer for this route) like so:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 37: invalid start byte
The culprit is a "-" (dash symbol) in the prod.name value. Full stack trace here. If the returned products don't have a "-" in then it all works fine!
Tried
I've tried encoding, decoding with various types before returning the json_products variable.

The above comment is right, but more specifially, you can replace 'label': prod.name with 'label': prod.name.decode("cp1252"). You should probably also do this for all of the strings in your json_prod dictionary since it's likely you'll see cp1252 encoded characters elsewhere in the real-world use of your application.
On that note, depending on the source of these strings and how widely that source is used in your app, you may run into such problems elsewhere in your app and generally when you least expect it. To investigate further you may want to figure What the source for these strings is and if you can do the decoding/re-encoding at a lower level to correct most future problems with this.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.