pymssql to pandas encoding - python

I'm aware there are a zillion posts about encoding/decoding problems on the forum, but after going through half of them I wasn't able to find one that did the trick for me. So bear with me if the answer is somewhere in the other half...
My issue:
I have a database (MS SQL) containing multilingual data (COLLATE Latin1_General_CI_AS), and I am using pymssql and pandas to convert it to a dataframe for use outside of Python. All works fine except for the non-Latin characters, and I'm completely stuck at this moment.
This is my (simplified) Python 3 code:
import pandas as pd
import pymssql

def rm_main():
    conn = pymssql.connect(server='***', port=4133, user='***', charset='UTF-8', password='***', database='**')
    q = """
    SELECT goodmorning FROM myTable
    """
    df = pd.read_sql(q, conn)
    df['encoded_goodmorning'] = df.goodmorning.str.encode('utf-8')
    return df
What is in my database is a field called goodmorning, and it contains the following string: Dzień dobry
When calling the data as above, using just pymssql, the data is retrieved correctly.
When I want to use the read_sql method from pandas, I get the dreadful question mark, as follows: Dzie? dobry
Using the encoding options I get a bit further in the right direction, as I get the following: b'Dzie\xc5\x84 dobry', where c5 84 is the UTF-8 hex encoding of my small Latin letter n with acute. So my content is complete, but it is not very reader-friendly.
Now where I fail miserably is to get this into the 'friendly format' again (so that it just says 'Dzień dobry' again).
What do I overlook here? Are there better approaches to do this? It seems like something very obvious, but whatever I tried (encoding/decoding) either doesn't make a difference or it simply breaks the code.
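One way back to the friendly format, assuming the bytes really are valid UTF-8 (which the c5 84 sequence suggests), is to decode the encoded column again. A minimal sketch against a stand-in frame rather than the live connection:
import pandas as pd

# Stand-in for the frame returned by read_sql in the snippet above
df = pd.DataFrame({'encoded_goodmorning': [b'Dzie\xc5\x84 dobry']})

# .str.decode reverses .str.encode: bytes -> str, assuming valid UTF-8
df['goodmorning'] = df['encoded_goodmorning'].str.decode('utf-8')
print(df['goodmorning'][0])  # Dzień dobry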

Related

JSON Parsing with Python from RethinkDB [Python]

I'm trying to retrieve data from a database named RethinkDB. It outputs JSON when called with r.db("Databasename").table("tablename").insert([{ "id or primary key": line}]).run(); when doing so, it outputs [{'id': 'ValueInRowOfid\n'}], and I want to parse that to just the value, e.g. "ValueInRowOfid". I've tried with JSON in Python, but I always end up with the TypeError: list indices must be integers or slices, not str, and I've been told that it is because the database outputs invalid JSON format. My question is: how can a JSON format be invalid (I can't see what is invalid with the output), and also, what would be the best way to parse it so that the value "ValueInRowOfid" ends up in a variable, e.g. Value = ("ValueInRowOfid")?
This part imports the modules used and connects to RethinkDB:
import json
from rethinkdb import RethinkDB
r = RethinkDB()
r.connect("localhost", 28015).repl()
This part is getting the output/value and my trial at parsing it:
getvalue = r.db("Databasename").table("tablename").sample(1).run() # gets a single row/value from the table
print(getvalue) # If I print that, it will show as [{'id': 'ValueInRowOfid\n'}]
dumper = json.dumps(getvalue) # I can't call json.loads() on the raw output, as json.loads needs a str, which the output of the database isn't (the output is a list)
parsevalue = json.loads(dumper) # After json.dumps(getvalue) I can now load it, but I can't use the loaded JSON
print(parsevalue["id"]) # When doing this it now says that list indices must be integers or slices, not str. Quite frustrating, as it seems to contradict itself: it first wants str and now it can't use str
print(parsevalue{'id'}) # I also tried to shuffle it around as seen here (not valid syntax), but still the same result
I know this is janky, and it may be hard to comprehend the level of confusion I'm at. I don't know if it is the most simple problem or something that just isn't possible (which it should be, or else I can't use the data in my database).
Thank you for reading this through and not jumping straight into the comments to say that I have to read the JSON documentation, because I have, and I haven't found a single piece that could help me.
I tried reading the documentation and watching tutorials about JSON and JSON parsing. I also looked for others who have had the same problems as me and couldn't find any.
It looks like it's returning a dictionary ({}) inside a list ([]) of one element.
Try:
getvalue = r.db("Databasename").table("tablename").sample(1).run()
print(getvalue[0]['id'])
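Since the sample value shown earlier ends with a newline ('ValueInRowOfid\n'), stripping it is probably also wanted; a small addition to the snippet above:
value = getvalue[0]['id'].strip()  # index into the one-element list, then the dict; strip the trailing newline
print(value)  # ValueInRowOfid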

Trouble Inserting DataFrame Into InfluxDB Using Python

I'm trying to insert a very large CSV file into InfluxDB and am inserting it as such in Python:
influx_pd = influxdb.DataFrameClient(host, port, user, password, db, verify_ssl=False)
for frame in pd.read_csv(infile, chunksize=batch_count):
    frame.set_index(pd.DatetimeIndex(frame[date_pk]), inplace=True)
    frame.dropna(axis=1, how='all', inplace=True)  # note: without inplace=True, dropna returns a new frame and this line does nothing
    influx_pd.write_points(frame, 'patients')
However, on the first call to write_points, I'm receiving this error (truncated):
raise InfluxDBClientError(response.content, response.status_code)
influxdb.exceptions.InfluxDBClientError: 400: {"error":"unable to parse 'enroll_pd Pt Id=\"21.0\",Admit Date=\"2010-12-05\", ... MRSA Screening=\"Negative\" 1291507200000000000': invalid field format\nunable to parse ... (ellipses used to truncate)
I had read about issues with InfluxDB and NaN values (which my CSV file does contain), so I tried inserting placeholder values for the NaN values, but received the same result. Could someone please help me locate the issue in my code? It would be much appreciated.
I'm using an InfluxDB 1.3 Docker image just FYI.
So I realized that I had to explicitly specify the protocol to be json, as such:
influx_pd.write_points(frame, measurement='enroll_pd', protocol='json')
in addition to filling in NaN values (JSON has no support for those) with an imputation method. From the docs I was under the impression that json was the default protocol; apparently that was not the case.
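Putting both pieces together, the ingestion loop from the question would look roughly like this (a sketch; the placeholder value 0 is my assumption, and any imputation appropriate for the data would do):
for frame in pd.read_csv(infile, chunksize=batch_count):
    frame.set_index(pd.DatetimeIndex(frame[date_pk]), inplace=True)
    frame.fillna(0, inplace=True)  # JSON cannot represent NaN, so impute a placeholder first
    influx_pd.write_points(frame, measurement='enroll_pd', protocol='json')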
This, of course, might only be one solution. I welcome other, alternative solutions that work.

Decoding XML object from mssql in Python

I get back an XML object from an MS SQL server when I call a stored procedure (SP) from Python (2.7). I get it in the following form:
{u'XML_F52E2B61-18A1-11d1-B105-00805F49916B': 'D\x02i\x00d\x00D\x05d\x00e\x00s\x00c\x00r\x00D\x0bd\x00a\x00t\x00a\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\x13e\x00n\x00u\x00m\x00e\x00r\x00a\x00t\x00i\x00o\x00n\x00_\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\rs\x00y\x00s\x00t\x00e\x00m\x00f\x00e\x00a\x00t\x00u\x00r\x00e\x00D\x04l\x00i\x00n\x00k\x00D\x07F\x00e\x00a\x00t\x00u\x00r\x00e\x00\x01\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00A\x01\x07A\x01\x01A\x03B\x01\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x1a\x00r\x00e\x00s\x00p\x00o\x00n\x00d\x00e\x00n\x00t\x00_\x00i\x00d\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x02\x00\x00\x00\x81\x01\x01\x02A\x03\x11 \x00W\x00o\x00r\x00k\x00s\x00 \x00a\x00t\x00 \x00c\x00o\x00m\x00p\x00a\x00n\x00y\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x01\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x03\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x0c\x00G\x00e\x00n\x00d\x00e\x00r\x00\x81\x02\x01\x03A\x03B\x08\x00\x00\x00\x81\x03\x01\x04A\x03B\x01\x00\x00\x00\x81\x04\x01\x05A\x03F\x00\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x81\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00'}
I have two questions:
1: What encoding is this?
2: What library should I use to decode this?
Addition:
The XML as it shows in SQL Server Management Studio:
The SP:
ALTER PROCEDURE [dbo].[rdb_sql2python]
AS
BEGIN
    SET NOCOUNT ON
    SELECT * FROM [_rdb].[dbo].[features] FOR XML RAW ('Feature'), ROOT ('FeatureS'), ELEMENTS
    SET NOCOUNT OFF
END
I'll try something like an answer, at least to the question "What is this?":
In a JSON viewer, your string as you presented it did not work. But when I removed the u prefix, replaced the single quotes with double quotes and removed the leading "D", it worked somehow:
This string
{"XML_F52E2B61-18A1-11d1-B105-00805F49916B":
"\x02i\x00d\x00D\x05d\x00e\x00s\x00c\x00r\x00D\x0bd\x00a\x00t\x00a\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\x13e\x00n\x00u\x00m\x00e\x00r\x00a\x00t\x00i\x00o\x00n\x00_\x00t\x00y\x00p\x00e\x00_\x00i\x00d\x00D\rs\x00y\x00s\x00t\x00e\x00m\x00f\x00e\x00a\x00t\x00u\x00r\x00e\x00D\x04l\x00i\x00n\x00k\x00D\x07F\x00e\x00a\x00t\x00u\x00r\x00e\x00\x01\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00A\x01\x07A\x01\x01A\x03B\x01\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x1a\x00r\x00e\x00s\x00p\x00o\x00n\x00d\x00e\x00n\x00t\x00_\x00i\x00d\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x02\x00\x00\x00\x81\x01\x01\x02A\x03\x11
\x00W\x00o\x00r\x00k\x00s\x00 \x00a\x00t\x00
\x00c\x00o\x00m\x00p\x00a\x00n\x00y\x00\x81\x02\x01\x03A\x03B\x01\x00\x00\x00\x81\x03\x01\x05A\x03F\x01\x81\x05\x01\x06A\x03F\x01\x81\x06\x81\x07\x01\x07A\x01\x01A\x03B\x03\x00\x00\x00\x81\x01\x01\x02A\x03\x11\x0c\x00G\x00e\x00n\x00d\x00e\x00r\x00\x81\x02\x01\x03A\x03B\x08\x00\x00\x00\x81\x03\x01\x04A\x03B\x01\x00\x00\x00\x81\x04\x01\x05A\x03F\x00\x81\x05\x01\x06A\x03F\x00\x81\x06\x81\x07\x81\x00\x08F\x00e\x00a\x00t\x00u\x00r\x00e\x00S\x00"}
converts to
Name: XML_F52E2B61-18A1-11d1-B105-00805F49916B
Value: "idDdescrDdatatype_idDenumeration_type_idD systemfeatureDlinkDFeatureFeatureSAAABArespondent_idABAFAFAABA Works at companyABAFAFAABAGenderABABAFAFFeatureS"
This is surely not the final solution, but it's clear that this is BSON-encoded JSON.
It might be a good idea to show (the relevant parts of) your SP and the way you are calling it. It might be that there is a completely different / better approach...
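One workaround worth trying (my assumption, not something confirmed in this thread) is to make SQL Server serialize the XML to plain text before it reaches the driver, so Python receives an ordinary string instead of the opaque binary form shown above:
import pymssql

# Hypothetical connection details; the CAST makes SQL Server return the
# FOR XML result as plain NVARCHAR text instead of the binary XML type.
conn = pymssql.connect(server='***', user='***', password='***', database='***')
cur = conn.cursor()
cur.execute("""
    SELECT CAST((
        SELECT * FROM [_rdb].[dbo].[features]
        FOR XML RAW ('Feature'), ROOT ('FeatureS'), ELEMENTS
    ) AS NVARCHAR(MAX)) AS feature_xml
""")
print(cur.fetchone()[0])  # ordinary XML text, parseable with e.g. xml.etree.ElementTree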

Arcpy, select features based on part of a string

So for my example, I have a large shapefile of state parks where some of them are actual parks and others are just trails. However, there is no column defining which are trails vs. actual parks, and I would like to select those that are trails and remove them. I DO have a column for the name of each feature, which usually contains the word "trail" somewhere in the string. It's not always at the beginning or end, however.
I'm only familiar with Python at a basic level and while I could go through manually selecting the ones I want, I was curious to see if it could be automated. I've been using arcpy.Select_analysis and tried using "LIKE" in my where_clause and have seen examples using slicing, but have not been able to get a working solution. I've also tried using the 'is in' function but I'm not sure I'm using it right with the where_clause. I might just not have a good enough grasp of the proper terms to use when asking and searching. Any help is appreciated. I've been using the Python Window in ArcMap 10.3.
Currently I'm at:
arcpy.Select_analysis ("stateparks", "notrails", ''trail' is in \"SITE_NAME\"')
Although using the Select tool is a good choice, the syntax for the SQL expression can be a challenge. Consider using an Update Cursor to tackle this problem.
import arcpy

stateparks = r"C:\path\to\your\shapefile.shp"
notrails = r"C:\path\to\your\shapefile_without_trails.shp"

# Make a copy of your shapefile
arcpy.CopyFeatures_management(stateparks, notrails)

# Check if "trail" exists in the name--delete the row if so
with arcpy.da.UpdateCursor(notrails, "SITE_NAME") as cursor:
    for row in cursor:
        if "trail" in row[0].lower():  # row[0] is the SITE_NAME value of the current row; lower() catches "Trail" too
            cursor.deleteRow()  # Delete the row if the condition is true
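For completeness, the where_clause the question was reaching for would look roughly like this (a sketch; LIKE on shapefiles is case-sensitive, so differently cased names such as "Trail" would need their own clause):
# Alternative: do it in one Select_analysis call with a SQL expression
where_clause = "NOT \"SITE_NAME\" LIKE '%trail%'"
arcpy.Select_analysis("stateparks", "notrails", where_clause)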

Save FITS table: The keyword description with its value is too long

I get an error when trying to save astropy Tables retrieved using astroquery to FITS files. In some cases it complains that the description of some keywords is too long. The writeto() function seems to have an output_verify argument to avoid this kind of problem, but I cannot find how to pass it to the write() function. Does an equivalent exist?
Here is my code:
import astropy.units as u
from astroquery.vizier import Vizier
import astropy.coordinates as coord
from astropy.table import Table
akari_query = Vizier(columns=["S09", "S18", "e_S09", "e_S18", "q_S09", "q_S18"], catalog=["II/297/irc"])
result = akari_query.query_region(coord.SkyCoord(ra=200.0, dec=10.0, unit=(u.deg, u.deg), frame='icrs'), width=[2.0*u.deg, 2.0*u.deg], return_type='votable')
table = Table(result[0], masked=True)
table.write('test.fits')
It returns a long error message ending with:
ValueError: The keyword description with its value is too long
The problem is that table.meta['description'] is longer than allowed for a keyword card in the header of the FITS file you're trying to save. You can simply shorten it to anything below 80 characters and try to write test.fits again:
table.meta['description'] = u'AKARI/IRC All-Sky Survey Point Source Catalogue v. 1.0'
table.write('test.fits')
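If this comes up for many tables (e.g. in a loop over queries), a generic variant of the same fix is to truncate the metadata programmatically before writing; the 70-character cutoff below is an arbitrary choice that leaves headroom under the limit:
# Truncate an over-long description instead of retyping it by hand
desc = table.meta.get('description', '')
if len(desc) > 70:
    table.meta['description'] = desc[:70]
table.write('test.fits', overwrite=True)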
