Decoding XML object from mssql in Python

I get back an XML object from an MSSQL server when I call a stored procedure (SP) from Python (2.7). It arrives in the following form:
{u'XML_F52E2B61-18A1-11d1-B105-00805F49916B': 'D\x02i\x00d\x00D\x05d\x00e\x00s\x00c\x00r\x00D\x0bd\x00a\x00t\x00a\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\x13e\x00n\x00u\x00m\x00e\x00r\x00a\x00t\x00i\x00o\x00n\x00_\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\rs\x00y\x00s\x00t\x00e\x00m\x00f\x00e\x00a\x00t\x00u\x00r\x00e\x00D\x04l\x00i\x00n\x00k\x00D\x07F\x00e\x00a\x00t\x00u\x00r\x00e\x00\x01\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00A\x01\x07A\x01\x01A\x03B\x01\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x1a\x00r\x00e\x00s\x00p\x00o\x00n\x00d\x00e\x00n\x00t\x00_\x00i\x00d\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x02\x00\x00\x00\x81\x01\x01\x02A\x03\x11 \x00W\x00o\x00r\x00k\x00s\x00 \x00a\x00t\x00 \x00c\x00o\x00m\x00p\x00a\x00n\x00y\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x01\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x03\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x0c\x00G\x00e\x00n\x00d\x00e\x00r\x00\x81\x02\x01\x03A\x03B\x08\x00\x00\x00\x81\x03\x01\x04A\x03B\x01\x00\x00\x00\x81\x04\x01\x05A\x03F\x00\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x81\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00'}
I have two questions:
1: What encoding is this?
2: What library should I use to decode this?
Addition:
The XML as it shows in the SQL Management Studio:
The SP:
ALTER PROCEDURE [dbo].[rdb_sql2python]
AS
BEGIN
    SET NOCOUNT ON
    SELECT * FROM [_rdb].[dbo].[features] FOR XML RAW ('Feature'), ROOT ('FeatureS'), ELEMENTS
    SET NOCOUNT OFF
END

I'll attempt something like an answer, at least to the question "What is this?":
In this JSON viewer, your string as you presented it did not work. But when I removed the "u", replaced the single quotes with double quotes, and removed the leading "D", it worked somehow:
This string
{"XML_F52E2B61-18A1-11d1-B105-00805F49916B":
"\x02i\x00d\x00D\x05d\x00e\x00s\x00c\x00r\x00D\x0bd\x00a\x00t\x00a\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\x13e\x00n\x00u\x00m\x00e\x00r\x00a\x00t\x00i\x00o\x00n\x00_\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\rs\x00y\x00s\x00t\x00e\x00m\x00f\x00e\x00a\x00t\x00u\x00r\x00e\x00D\x04l\x00i\x00n\x00k\x00D\x07F\x00e\x00a\x00t\x00u\x00r\x00e\x00\x01\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00A\x01\x07A\x01\x01A\x03B\x01\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x1a\x00r\x00e\x00s\x00p\x00o\x00n\x00d\x00e\x00n\x00t\x00_\x00i\x00d\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x02\x00\x00\x00\x81\x01\x01\x02A\x03\x11
\x00W\x00o\x00r\x00k\x00s\x00 \x00a\x00t\x00
\x00c\x00o\x00m\x00p\x00a\x00n\x00y\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x01\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x03\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x0c\x00G\x00e\x00n\x00d\x00e\x00r\x00\x81\x02\x01\x03A\x03B\x08\x00\x00\x00\x81\x03\x01\x04A\x03B\x01\x00\x00\x00\x81\x04\x01\x05A\x03F\x00\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x81\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00"}
converts to
Name: XML_F52E2B61-18A1-11d1-B105-00805F49916B
Value: "idDdescrDdatatype_idDenumeration_type_idD systemfeatureDlinkDFeatureFeatureSAAABArespondent_idABAFAFAABA Works at companyABAFAFAABAGenderABABAFAFFeatureS"
This is certainly not the final solution, but to me it looks like a binary serialization, possibly BSON-encoded JSON.
It might be a good idea to show (the relevant parts of) your SP and the way you are calling it. There may be a completely different or better approach...
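As a hedged aside on the "what encoding is this?" part: reading the sample byte by byte, each leading 'D' (0x44) is followed by a one-byte length and then that many two-byte characters, which is consistent with UTF-16LE text ('i\x00d\x00' reads as "id"). The sketch below is written for Python 3 and only walks that leading run of names. The function name extract_names is made up for illustration, and this is an observation about this one sample, not a decoder for the whole value or a statement about any documented format.

```python
def extract_names(blob):
    """Collect the length-prefixed UTF-16LE names at the start of the blob.

    Assumed layout (inferred from the sample only): 0x44 ('D'), one length
    byte N, then N characters of two bytes each.
    """
    names = []
    pos = 0
    while pos + 1 < len(blob) and blob[pos] == 0x44:
        length = blob[pos + 1]          # number of characters, not bytes
        start = pos + 2
        end = start + 2 * length
        names.append(blob[start:end].decode("utf-16-le"))
        pos = end
    return names


# The first three entries of the value shown in the question.
sample = (b'D\x02i\x00d\x00'
          b'D\x05d\x00e\x00s\x00c\x00r\x00'
          b'D\x0bd\x00a\x00t\x00a\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00')
print(extract_names(sample))
```

The loop stops at the first byte that is not 'D', so everything after the name list is left untouched.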

Related

JSON Parsing with python from Rethink database [Python]

I'm trying to retrieve data from a RethinkDB database. It outputs JSON when called with r.db("Databasename").table("tablename").insert([{ "id or primary key": line}]).run(); the output is [{'id': 'ValueInRowOfid\n'}], and I want to parse that down to just the value, e.g. "ValueInRowOfid". I've tried the json module in Python, but I always end up with TypeError: list indices must be integers or slices, not str, and I've been told that this is because the database outputs invalid JSON. My question is: how can JSON be invalid (I can't see what is invalid about the output), and what would be the best way to parse it so that the value "ValueInRowOfid" ends up in a variable, e.g. Value = ("ValueInRowOfid")?
This part imports the modules used and connects to RethinkDB:
import json
from rethinkdb import RethinkDB
r = RethinkDB()
r.connect( "localhost", 28015).repl()
This part is getting the output/value and my trial at parsing it:
getvalue = r.db("Databasename").table("tablename").sample(1).run() # gets a single row/value from the table
print(getvalue) # Printing this shows [{'id': 'ValueInRowOfid\n'}]
dumper = json.dumps(getvalue) # I can't call `json.loads` on the output directly, as the JSON object must be a str, which the database output isn't (it is a list)
parsevalue = json.loads(dumper) # After `json.dumps(getvalue)` I can load it, but I can't use the loaded JSON.
print(parsevalue["id"]) # This now says that list indices must be integers or slices, not str. Quite frustrating, as it contradicts itself: first it wants a str, and now it can't use a str
print(parsevalue{'id'}) # I also tried shuffling it around as seen here, but still the same result
I know this is janky, and it may be hard to comprehend the level of stupidity I might be on. I don't know whether this is the simplest of problems or something that just isn't possible (which it should be, or else I can't use the data in my database).
Thank you for reading this through and not jumping straight into the comments to say that I have to read the JSON documentation; I have, and I haven't found a single piece that could help me.
I tried reading the documentation and watching tutorials about JSON and JSON parsing. I also looked for others who have had the same problem and couldn't find any.

It looks like it's returning a dictionary ({}) inside a list ([]) of one element.
Try:
getvalue = r.db("Databasename").table("tablename").sample(1).run()
print(getvalue[0]['id'])
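A minimal sketch of the same idea without a database connection: the literal below is a stand-in for what .run() returned in the question (an assumption, since the real result depends on the table). It is already a Python list containing a dict, so no json.dumps / json.loads round trip is needed, and .strip() removes the trailing newline.

```python
# Stand-in for the value returned by .run() in the question.
getvalue = [{'id': 'ValueInRowOfid\n'}]

first_row = getvalue[0]          # index the list with an integer first
value = first_row['id'].strip()  # then index the dict with the key
print(value)
```

Indexing the list with a string (getvalue["id"]) is what raises "list indices must be integers or slices, not str".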

pymssql to pandas encoding

I'm aware there are zillions of posts about encoding/decoding problems on the forum, but after going through half of them I wasn't able to find one that did the trick for me. So be nice if it is somewhere in the other half...
My issue:
I have a database (MS SQL) containing multilingual data (Latin1_General_CI_AS COLLATE), and I am using pymssql and pandas to convert it to a dataframe for use outside of Python. All works fine except for the non-Latin characters, and I'm completely stuck at the moment.
This is my (simplified) Python 3 code:
import pandas as pd
import pymssql
def rm_main():
    conn = pymssql.connect(server='***', port=4133, user='***', charset='UTF-8', password='***', database='**')
    q = """
    SELECT goodmorning FROM myTable
    """
    df = pd.read_sql(q, conn)
    df['encoded_goodmorning'] = df.goodmorning.str.encode('utf-8')
    return df
My database has a field called goodmorning, which contains the following string: Dzień dobry
When calling the data as above using just pymssql, the data is retrieved correctly.
When I use the read_sql method from pandas, I get the dreadful question mark, as follows: Dzie? dobry
Using the encoding options I get a bit further in the right direction, as I get the following: b'Dzie\xc5\x84 dobry', where c5 84 is the UTF-8 hex code for my small Latin n with acute. So my content is complete, but it is not very reader-friendly.
Where I fail miserably is getting this back into the 'friendly format' (so that it just says 'Dzień dobry' again).
What am I overlooking here? Are there better approaches? It seems like something very obvious, but whatever I tried (encoding/decoding) either doesn't make a difference or simply breaks the code.
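For reference, the c5 84 byte pair mentioned above does decode back to the readable character with UTF-8. The sketch below is plain Python 3, independent of pymssql and pandas, so it only illustrates the bytes-to-text step and does not explain why read_sql returns a question mark in the first place.

```python
# The byte string from the question, with its escape sequences written out.
encoded = b'Dzie\xc5\x84 dobry'

decoded = encoded.decode('utf-8')   # bytes -> str
print(decoded)

# Encoding the readable text again yields the same bytes.
roundtrip = decoded.encode('utf-8')
```

On a pandas column holding such bytes, the equivalent step would be the .str.decode('utf-8') accessor, the counterpart of the .str.encode call used in the question.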

Representation of python dictionaries with unicode in database queries

I have a problem that I would like to know how to efficiently tackle.
I have data that is JSON-formatted (used with dumps / loads) and contains unicode.
This is part of a protocol implemented with JSON to send messages. So messages will be sent as strings and then loaded into python dictionaries. This means that the representation, as a python dictionary, afterwards will look something like:
{u"mykey": u"myVal"}
Handling such structures is no problem for the system in itself; the problem arises when I make a database query to store this structure.
I'm using pyOrient with OrientDB. The command ends up something like:
"CREATE VERTEX TestVertex SET data = {u'mykey': u'myVal'}"
Which will end up in the data field getting the following values in OrientDB:
{'_NOT_PARSED_': '_NOT_PARSED_'}
I'm assuming this problem relates to other cases as well when you wish to make a query or somehow represent a data object containing unicode.
How could I efficiently get a representation of this data, of arbitrary depth, to be able to use it in a query?
To clarify even more, this is the string the db expects:
"CREATE VERTEX TestVertex SET data = {'mykey': 'myVal'}"
If I'm simply stating the wrong problem/question and should handle it some other way, I'm very much open to suggestions. But what I want to achieve is to have an efficient way to use python2.7 to build a db-query towards orientdb (using pyorient) that specifies an arbitrary data structure. The data property being set is of the OrientDB type EMBEDDEDMAP.
Any help greatly appreciated.
EDIT1:
More explicitly stating that the first code block shows the object as a dict AFTER being dumped / loaded with json to avoid confusion.

Dargolith:
OK, based on your last response it seems you are simply looking for code that will dump a Python expression in a way that lets you control how unicode and other data types print. Here is a very simple function that provides this control. There are ways to make this function more efficient (for example, by using a string buffer rather than doing all of the recursive string concatenation happening here). Still, this is a very simple function, and as it stands its execution time is probably still dominated by your DB lookup.
As you can see in each of the 'if' statements, you have full control over how each data type prints.
def expr_to_str(thing):
    if hasattr(thing, 'keys'):
        pairs = ['%s:%s' % (expr_to_str(k), expr_to_str(v)) for k, v in thing.iteritems()]
        return '{%s}' % ', '.join(pairs)
    if hasattr(thing, '__setslice__'):
        parts = [expr_to_str(ele) for ele in thing]
        return '[%s]' % (', '.join(parts),)
    if isinstance(thing, basestring):
        return "'%s'" % (str(thing),)
    return str(thing)

print "dumped: %s" % expr_to_str({'one': 33, 'two': [u'unicode', 'just a str', 44.44, {'hash': 'here'}]})
outputs:
dumped: {'two':['unicode', 'just a str', 44.44, {'hash':'here'}], 'one':33}
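The function above relies on Python 2 names (iteritems, basestring, __setslice__, the print statement). A hedged Python 3 sketch of the same idea follows; it swaps the hasattr checks for isinstance checks (an assumption about intent, and it also treats tuples as lists), and like the original it does not escape quotes inside strings.

```python
from collections.abc import Mapping


def expr_to_str(thing):
    """Render nested dicts/lists/strings with single-quoted strings."""
    if isinstance(thing, Mapping):
        pairs = ['%s:%s' % (expr_to_str(k), expr_to_str(v))
                 for k, v in thing.items()]
        return '{%s}' % ', '.join(pairs)
    if isinstance(thing, (list, tuple)):
        parts = [expr_to_str(ele) for ele in thing]
        return '[%s]' % ', '.join(parts)
    if isinstance(thing, str):
        # No escaping of embedded quotes, same as the original function.
        return "'%s'" % thing
    return str(thing)


dumped = expr_to_str({'one': 33, 'two': ['unicode', 'just a str', 44.44, {'hash': 'here'}]})
print("dumped: %s" % dumped)
```

In Python 3.7+ dicts keep insertion order, so the keys come out in the order they were written rather than in the order shown in the Python 2 output.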

I went on to use json.dumps() as sobolevn suggested in the comment. I didn't think of that one at first since I wasn't really using json in the driver. It turned out however that json.dumps() provided exactly the formats I needed on all the data types I use. Some examples:
>>> json.dumps('test')
'"test"'
>>> json.dumps(['test1', 'test2'])
'["test1", "test2"]'
>>> json.dumps([u'test1', u'test2'])
'["test1", "test2"]'
>>> json.dumps({u'key1': u'val1', u'key2': [u'val21', 'val22', 1]})
'{"key2": ["val21", "val22", 1], "key1": "val1"}'
If you need to take more control of the format, quotes or other things regarding this conversion, see the reply by Dan Oblinger.
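A minimal sketch of how the json.dumps approach could be applied to the command string from the question. The helper name build_create_vertex is hypothetical, and the sketch assumes, as this answer reports, that OrientDB accepts the double-quoted JSON form; it only builds the string and does not talk to a database.

```python
import json


def build_create_vertex(data):
    # Hypothetical helper: serialize the dict with json.dumps so that no
    # u'' prefixes end up in the command text.
    return "CREATE VERTEX TestVertex SET data = %s" % json.dumps(data)


command = build_create_vertex({u"mykey": u"myVal"})
print(command)

# Nested structures of arbitrary depth are serialized the same way.
nested = build_create_vertex({"a": [1, {"b": "c"}]})
```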

using list instead of number or string in the query

I would like to use a list of ints in a query, as below:
db.define_table('customer',Field('name'),Field('cusnumber','integer'))
def custmr():
    listOfNumbers=[22,12,76,98]
    qry=db(db.customer.cusnumber==listOfNumbers).select(db.customer.name)
    print qry
This raises an issue: the only data types accepted in the query are int or str.
Is there any way to avoid this issue (preferably without using a for loop)?
Regards

It is really difficult to know what you're trying to ask, but from the syntax of db.define_table(...), I'll take a wild guess that you're on web2py and trying to run a query which fetches any int in your listOfNumbers.
You may use the contains method like this:
# if all=True, cusnumber needs to contain all of listOfNumbers; False means any
qry=db(db.customer.cusnumber.contains(listOfNumbers, all=False)).select(db.customer.name)
You can read more details here.
Since the OP replied that contains only works for strings, I suggest that a for loop is the better answer:
listOfNumbers=[22,12,76,98]
for each in listOfNumbers:
    qry=db(db.customer.cusnumber==each).select(db.customer.name)
    # ... do your stuff or whatever ...

Assuming you want the set of records for which the cusnumber is in listOfNumbers, you should use the .belongs method:
qry = db(db.customer.cusnumber.belongs(listOfNumbers)).select(db.customer.name)
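To illustrate what a "belongs to a list" query does, here is a sketch that uses the standard-library sqlite3 module as a stand-in, on the assumption that .belongs corresponds to an SQL IN clause. The table contents and names (Ann, Bob, Cid) are made up for the example, and none of this is web2py code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, cusnumber INTEGER)")
conn.executemany(
    "INSERT INTO customer (name, cusnumber) VALUES (?, ?)",
    [("Ann", 22), ("Bob", 13), ("Cid", 76)],  # made-up rows
)

list_of_numbers = [22, 12, 76, 98]
# One "?" placeholder per list element, so the values are bound, not pasted in.
placeholders = ", ".join("?" for _ in list_of_numbers)
rows = conn.execute(
    "SELECT name FROM customer WHERE cusnumber IN (%s) ORDER BY name" % placeholders,
    list_of_numbers,
).fetchall()
names = [row[0] for row in rows]
print(names)
```

A single query returns every matching row, which is why no for loop over the list is needed.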

Python: Matching & Stripping port number from socket data

I have data coming in to a Python server via a socket. Within this data is the string '<port>80</port>', or whichever port is being used.
I wish to extract the port number into a variable. The data coming in is not XML, I just used the tag approach to identifying data for future XML use if needed. I do not wish to use an XML python library, but simply use something like regexp and strings.
What would you recommend is the best way to match and strip this data?
I am currently using this code with no luck:
p = re.compile('<port>\w</port>')
m = p.search(data)
print m
Thank you :)

Regex can't parse XML and shouldn't be used to parse fake XML. You should do one of the following:
Use a serialization method that is nicer to work with to start with, such as JSON or an ini file with the ConfigParser module.
Really use XML and not something that just sort of looks like XML and really parse it with something like lxml.etree.
Just store the number in a file if this is the entirety of your configuration. This solution isn't really easier than just using JSON or something, but it's better than the current one.
Implementing a bad solution now for future needs that you have no way of defining or accurately predicting is always a bad approach. You will be kept busy enough trying to write and maintain software now that there is no good reason to try to satisfy unknown future needs. I have never seen a case where "I'll put this in for later" has led to less headache later on, especially when I put it in by doing something completely wrong. YAGNI!
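As to what's wrong with your snippet, other than it using an entirely wrong approach: \w matches exactly one character, so a two-digit port such as 80 never matches. You would need something like \d+ to match the whole number, plus a capturing group to extract it.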

Though Mike Graham is correct that using regex for XML is not 'recommended', the following will work:
(I have defined searchType as 'd' for numerals)
import re

data = '<port>80</port>'  # example input; in practice, the data read from the socket
searchType = 'd'
searchStr = 'port'
if searchType == 'd':
    retPattern = r'(<%s>)(\d+)(</%s>)'
else:
    retPattern = r'(<%s>)(.+?)(</%s>)'
searchPattern = re.compile(retPattern % (searchStr, searchStr))
found = searchPattern.search(data)  # search the incoming data, not the tag name
retVal = found.group(2)
(Note the complete lack of error checking; that is left as an exercise for the user.)
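One hedged way to do the error checking mentioned above, written as a sketch for Python 3: the function name extract_port is made up for illustration, and the 0 to 65535 range check is an added assumption about what counts as a valid port.

```python
import re

PORT_PATTERN = re.compile(r'<port>(\d+)</port>')


def extract_port(data):
    """Return the port as an int, or None if no usable <port> tag is found."""
    match = PORT_PATTERN.search(data)
    if match is None:
        return None              # tag missing or not numeric
    port = int(match.group(1))
    if not 0 <= port <= 65535:
        return None              # digits found, but outside the port range
    return port


print(extract_port('hello <port>80</port> world'))
print(extract_port('no port here'))
```

Returning None instead of raising keeps the caller's handling simple; raising a ValueError would be an equally reasonable choice.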
