SQLAlchemy Unicode conundrum - python

I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into a Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL).
My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.
Below I've included a minimal example that illustrates the problem.
utf8test.sql: Set up a database and create one row with a Unicode
character in it
utf8test.py: Open the DB using SQLAlchemy, insert one row with
Python's idea of a UTF-8 character, and retrieve both rows.
It turns out that Python can retrieve the data it inserted itself just fine,
but it balks at the literal 'ü' I put into the SQL import script.
Investigation of the hexdumps of both a mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF-8 character
inserted via SQL is the real deal (German umlaut 'ü' = UTF-8 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4', which I don't understand (see the hexdump below;
I've used 'xxx' and 'yyy' as markers to facilitate finding them
in the hexdump).
Can anybody shed any light on this?
This creates the test DB:
dh#jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh#jenna:~/python$ mysql < utf8test.sql
Here's the Python script:
dh#jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-
from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class X(Base):
    __tablename__ = 'x'
    id = Column(Integer, primary_key=True)
    text = Column(Unicode(10))

engine = create_engine('mysql://localhost/utftest',
                       encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)
db = Session()

x = X(text=u'yyyä')
db.add(x)
db.commit()

rs = db.query(X.text).all()
for r in rs:
    print(r.text)
db.close()
This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):
dh#jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte
Here's a hexdump to confirm that the two umlauts are indeed stored
differently in the DB. Using hd I've also confirmed that both the
Python and the SQL scripts are indeed UTF-8 encoded.
dh#jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|

c3 83 c2 a4 is the "double encoding" of ä, as Ilja points out. It is discussed further here:
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.
Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python
But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä). This implies that you have two different problems happening in the same code. Back up to ground zero and make sure you are using utf8 (or utf8mb4) at all stages. Your database may be too messed up to recover from, so consider starting over.
Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.
Bottom line: You have multiple errors.

Adding ?use_unicode=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.
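For reference, here is roughly what that connection URL looks like with the driver-level options spelled out (a sketch assuming the MySQLdb/mysqlclient driver; charset and use_unicode are MySQLdb connection arguments and were not part of the original question):

from sqlalchemy import create_engine

# Sketch, assuming the MySQLdb (mysqlclient) driver: have the driver talk utf8
# to the server and return raw bytes, letting SQLAlchemy do the decoding.
engine = create_engine(
    'mysql+mysqldb://localhost/utftest?charset=utf8&use_unicode=0',
    encoding='utf8')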

Can't upload a file to S3 with pre-signed URL no matter what I do. AWS command line works. CURL and anything else = 403

My AWS credentials in ~/.aws/credentials are correct and working. Proof?
$ aws s3api put-object --bucket <my bucket name> --key videos/uploads/yoda.jpeg --body /Users/r<my_name>/Desktop/Archive/yoda.jpeg
getting back:
{
    "ETag": "\"66bee0b7caf3d127900e0a70f2da4b5f\""
}
The upload worked from command line. And I can see my file when I see my S3 bucket in AWS's management console.
NOW - I delete the successfully uploaded file from S3 and try to upload it again, this time via a presigned URL:
$ aws s3 presign s3://<my-bucket>/videos/uploads/yoda.jpeg
for which I get:
https://<my-bucket>.s3.amazonaws.com/videos/uploads/yoda.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=3600&X-Amz-Credential=<MY-AWS-KEY-ID>%2F20210207%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-SignedHeaders=host&X-Amz-Date=20210207T222859Z&X-Amz-Signature=3a3624b9e264c119ebdf93c989efb73337f7ab8793e89554c7b000e1fc93c85c
From this moment on, any PUT attempt with this URL, whether via curl, Postman or any other tool, fails with a 403 (yes, it's not expiring, it fails immediately), and "The request signature we calculated does not match the signature you provided" is the excuse provided by AWS.
The S3 bucket has a policy allowing the user whose credentials are in ~/.aws/credentials to Put* on that very bucket.
What is going on? Why doesn't pre-signed URL work?
CURL ATTEMPT
$ curl --location --request PUT 'https://<my-bucket-name>.s3.amazonaws.com/videos/uploads/yoda.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=3600&X-Amz-Credential=<MY-AWS-KEY-ID>%2F20210207%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-SignedHeaders=host&X-Amz-Date=20210207T224403Z&X-Amz-Signature=8a8625591e6c4e0871f97bf5e15c2f93b3e373cfc1c2daddb2cf34edb10a5670%0A' \
--header 'Content-Type: image/jpeg' \
--data-binary '@/Users/<MY-NAME>/Desktop/Archive/yoda.jpeg'
to which I get:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>SignatureDoesNotMatch</Code>
<Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message>
<AWSAccessKeyId><---MY--ACCESS--KEY--ID---></AWSAccessKeyId>
<StringToSign>AWS4-HMAC-SHA256
20210207T224403Z
20210207/us-east-2/s3/aws4_request
da93cc1a0ec196fe0726ec6d5cace8c1b2b4865b20663bf0240454e276dbef6f</StringToSign>
<SignatureProvided>8a8625591e6c4e0871f97bf5e15c2f93b3e373cfc1c2daddb2cf34edb10a5670
</SignatureProvided>
<StringToSignBytes>41 57 53 34 2d 48 4d 41 43 2d 53 48 41 32 35 36 0a 32 30 32 31 30 32 30 37 54 32 32 34 34 30 33 5a 0a 32 30 32 31 30 32 30 37 2f 75 73 2d 65 61 73 74 2d 32 2f 73 33 2f 61 77 73 34 5f 72 65 71 75 65 73 74 0a 64 61 39 33 63 63 31 61 30 65 63 31 39 36 66 65 30 37 32 36 65 63 36 64 35 63 61 63 65 38 63 31 62 32 62 34 38 36 35 62 32 30 36 36 33 62 66 30 32 34 30 34 35 34 65 32 37 36 64 62 65 66 36 66</StringToSignBytes>
<CanonicalRequest>PUT
/videos/uploads/yoda.jpeg
X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=<---MY--ACCESS--KEY--ID--->%2F20210207%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20210207T224403Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host
host:<my-bucket-name>.s3.amazonaws.com
host
UNSIGNED-PAYLOAD</CanonicalRequest>
<CanonicalRequestBytes>50 55 54 0a 2f 76 69 64 65 6f 73 2f 75 70 6c 6f 61 64 73 2f 79 6f 64 61 2e 6a 70 65 67 0a 58 2d 41 6d 7a 2d 41 6c 67 6f 72 69 74 68 6d 3d 41 57 53 34 2d 48 4d 41 43 2d 53 48 41 32 35 36 26 58 2d 41 6d 7a 2d 43 72 65 64 65 6e 74 69 61 6c 3d 41 4b 49 41 51 33 44 34 36 52 4e 50 48 51 4e 4b 47 42 46 4b 25 32 46 32 30 32 31 30 32 30 37 25 32 46 75 73 2d 65 61 73 74 2d 32 25 32 46 73 33 25 32 46 61 77 73 34 5f 72 65 71 75 65 73 74 26 58 2d 41 6d 7a 2d 44 61 74 65 3d 32 30 32 31 30 32 30 37 54 32 32 34 34 30 33 5a 26 58 2d 41 6d 7a 2d 45 78 70 69 72 65 73 3d 33 36 30 30 26 58 2d 41 6d 7a 2d 53 69 67 6e 65 64 48 65 61 64 65 72 73 3d 68 6f 73 74 0a 68 6f 73 74 3a 6c 73 74 76 32 2d 70 75 62 6c 69 63 2e 73 33 2e 61 6d 61 7a 6f 6e 61 77 73 2e 63 6f 6d 0a 0a 68 6f 73 74 0a 55 4e 53 49 47 4e 45 44 2d 50 41 59 4c 4f 41 44</CanonicalRequestBytes>
<RequestId>CBJT0Y4SX9A7RB26</RequestId>
<HostId>h+5b/u8cdi34yuSDBX0Z/mZGQMtRZIMS4rvIwiKzOZSOZhRoQfak8cOdVBq2BgtU1qbqlHrO2TY=</HostId>
</Error>
TRYING TO GENERATE THE PRESIGN URL FROM PYTHON. STILL DOES NOT WORK. THE URL IS FAULTY- AWS REJECTS WITH THE SAME 403
import boto3
from botocore.exceptions import ClientError

def get_upload_pre_signed_url(bucket_name, object_name, expiration=3600):
    s3_client = boto3.client('s3')
    try:
        response = s3_client.generate_presigned_url('put_object',
                                                    Params={'Bucket': bucket_name,
                                                            'Key': object_name},
                                                    ExpiresIn=expiration)
    except ClientError as e:
        return None
    # The response contains the presigned URL
    return response
URL generated from this:
https://<my-bucket>.s3.amazonaws.com//videos/uploads/yoda.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=<my-AWS-KEY-ID>%2F20210207%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20210207T231306Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=968a3e2cab9b7e907df69e24aae24d79ea40f52a52d407591d7cbd69c86fe67b
Curling it gets the same 403. Doesn't change.
The aws s3 presign command creates URLs that can be used for downloading files. It does not create URLs that can be used for uploading. To quote the docs-
Generate a pre-signed URL for an Amazon S3 object. This allows anyone who receives the pre-signed URL to retrieve the S3 object with an HTTP GET request. All presigned URL’s now use sigv4 so the region needs to be configured explicitly.
To create upload URLs you need to jump out of the command line and into your language of choice and use the full AWS SDK.
Python example:
s3.generate_presigned_post(
    Bucket=BUCKET_NAME,
    Key=FILE_KEY,
    ExpiresIn=(5*60)
)
Note that it uses the generate_presigned_post function to do this.
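For completeness, a minimal sketch of how the returned URL and form fields would then be used to upload the file (assumes the requests library; BUCKET_NAME and FILE_KEY are reused from the snippet above, with FILE_KEY doubling as the local file name purely for illustration):

import requests

post = s3.generate_presigned_post(Bucket=BUCKET_NAME, Key=FILE_KEY, ExpiresIn=5 * 60)
# The response carries both the URL and the form fields that must accompany the upload.
with open(FILE_KEY, 'rb') as f:
    r = requests.post(post['url'], data=post['fields'], files={'file': (FILE_KEY, f)})
print(r.status_code)   # S3 returns 204 by default unless success_action_status is set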
SOLUTION: Turns out the Boto3 I was using wasn't up to date and I was using it wrong. After fixing both, the code that worked for me was:
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# THE CREDENTIALS ARE PART OF MY TESTING CODE. NO WORRIES, THEY'RE IN AN ENV VARIABLE NOW
def get_upload_pre_signed_url(bucket_name, key, expiration=3600):
    s3 = boto3.client('s3',
                      aws_access_key_id="<my_access_key_id>",
                      aws_secret_access_key="<my_secret_access_key>",
                      config=Config(region_name='us-east-2',
                                    s3={"use_accelerate_endpoint": True}))
    try:
        url = s3.generate_presigned_url('put_object',
                                        Params={'Bucket': bucket_name, 'Key': key},
                                        ExpiresIn=expiration,
                                        HttpMethod='PUT')
    except ClientError as e:
        return None
    return url
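And a quick way to exercise that URL from Python (just a sketch using the requests library, with the bucket, key and local file name reusing the placeholders from the question):

import requests

url = get_upload_pre_signed_url('<my-bucket>', 'videos/uploads/yoda.jpeg')
with open('yoda.jpeg', 'rb') as f:
    resp = requests.put(url, data=f)   # no extra headers; only 'host' is signed
print(resp.status_code)                # expect 200 when the signature matches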
The presign CLI command is only for GET requests. If you need anything else, you have to use the AWS API directly - as suggested below, you can use a short Python script for that. For one of our applications we used a Lambda which you can call to get the right URL. Also, the presigned URL uses the role that called the API, so it has the same permissions. That includes the fact that if you are using an STS assumed role and the grant expires sooner than the expiration time of the presigned URL, the URL will still fail. But if you use regular roles (like your aws cli profile), it should be ok.
https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/presign.html
Generate a pre-signed URL for an Amazon S3 object. This allows anyone who receives the pre-signed URL to retrieve the S3 object with an HTTP GET request. All presigned URL’s now use sigv4 so the region needs to be configured explicitly.
Possible to send a PUT request to aws s3 presign url?
AWS CLI doesn't support presigned PUT URLs yet. You can easily generate one using Python Boto3 though. The documentation is here. If you want a presigned PUT, you just need to set the ClientMethod parameter to put_object.
s3 presign is only for generating url for download
Generate a pre-signed URL for an Amazon S3 object. This allows anyone
who receives the pre-signed URL to retrieve the S3 object with an HTTP
GET request. All presigned URL’s now use sigv4 so the region needs to
be configured explicitly
We will still need to specify the Bucket and Key to which we want to upload.
NodeJs:
const bucketParms = {
  Bucket: "sample-temp-bucket",
  Key: "HeadShot.jpg",
  ContentType: 'image/jpeg'
};
s3.getSignedUrl("putObject", bucketParms, (error, url) => {
  if (error) console.log("error", error);
  if (url) console.log("url", url);
});
Python:
response = s3_client.generate_presigned_url('put_object',
                                            Params={'Bucket': bucket_name,
                                                    'Key': object_name,
                                                    'ContentType': 'image/jpeg'},
                                            ExpiresIn=expiration)
Then we can do a curl or Postman request:
curl --location --request PUT 'https://test-events.s3.amazonaws.com/98..?X....' \
--header 'Content-Type: image/jpeg' \
--data-binary '@/Users/user/image/path'
Your edit is correct for generating the URL using the SDK.
That said, to use the URL, the HTTP headers curl sends need to be exactly correct. Notably here, the signature requires there be no Content-Type header sent to the server. Since --data-binary forces one, the easiest way I know of to get curl to do the right thing is to use the --upload-file flag:
$ curl $URL --upload-file yoda.jpg
It was not working for me either; it was returning 403 Forbidden. Things that helped were:
Setting the signature version to v4: new AWS.S3({ signatureVersion: "v4" })
If you use Metadata, do encode the content: encodeURI("meta text with symbols##");
Here is a more complete TS/JS snippet:
const s3 = new AWS.S3({ signatureVersion: "v4" });
const s3Params = {
  Bucket: <BUCKET_NAME>,
  Key: <KEY>,
  Expires: 300,
  ContentType: <CONTENT_TYPE>,
  Metadata: {
    title: encodeURI('SomeTitle with utf-8 characters'),
  },
};
const signedPutUrl = await s3.getSignedUrlPromise(
  "putObject",
  s3Params
);
console.log('the signed url', signedPutUrl);
return {'url': signedPutUrl};

How to keep the header and trailer while zlib decompress and compress

I have raw data extracted from PDF and I decompressed the raw data and compressed it again.
I expected the same header and trailer, but the header was changed.
Original Hex Header
48 89 EC 57 ....
Converted Hex Header
78 9C EC BD ...
I dug into zlib compression and found that 48 is also a valid zlib header byte,
but 78 is the one most commonly used for zlib compression.
Here is my code, which decompresses and then recompresses:
import zlib

decompress_wbit = 12
compress_variable = 6

# raw_data is the path to the raw stream extracted from the PDF
output_data = zlib.decompress(open(raw_data, "rb").read(), decompress_wbit)
output_data = zlib.compress(output_data, compress_variable)

output_file = open(raw_data + '_', "wb")
output_file.write(output_data)
output_file.close()
I changed decompress_wbit and compress_variable, but the output still starts with 78,
so I'm not sure how to get 48 as the header.
Here is a short description of the zlib header:
CINFO (bits 12-15)
Indicates the window size as a power of two, from 0 (256 bytes) to 7 (32768 bytes). This will usually be 7. Higher values are not allowed.
CM (bits 8-11)
The compression method. Only Deflate (8) is allowed.
FLEVEL (bits 6-7)
Roughly indicates the compression level, from 0 (fast/low) to 3 (slow/high)
FDICT (bit 5)
Indicates whether a preset dictionary is used. This is usually 0. 1 is technically allowed, but I don't know of any Deflate formats that define preset dictionaries.
FCHECK (bits 0-4)
A checksum (5 bits, 0..31), whose value is calculated such that the entire 16-bit header value is divisible by 31 with no remainder.
Typically, only the CINFO and FLEVEL fields can be freely changed, and FCHECK must be calculated based on the final value. Assuming no preset dictionary, there is no choice in what the other fields contain, so a total of 32 possible headers are valid. Here they are:
           FLEVEL:  0      1      2      3
CINFO:
   0               08 1D  08 5B  08 99  08 D7
   1               18 19  18 57  18 95  18 D3
   2               28 15  28 53  28 91  28 CF
   3               38 11  38 4F  38 8D  38 CB
   4               48 0D  48 4B  48 89  48 C7
   5               58 09  58 47  58 85  58 C3
   6               68 05  68 43  68 81  68 DE
   7               78 01  78 5E  78 9C  78 DA
Please let me know how to keep the zlib header through decompression and recompression.
Thanks for your time.
I will first note that it doesn't matter. The data will be decompressed fine with that zlib header. Why do you care?
You are giving zlib.compress a small amount of data that permits a smaller window. Since it is permitted, the Python library is electing to compress with a smaller window.
A way to avoid that would be to use zlib.compressobj instead. Upon initialization it doesn't know how much data you will be feeding it, so it defaults to the largest window size.
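Building on that suggestion, if you really do want the recompressed stream to start with 48 xx, one option (a sketch, passing wbits explicitly rather than relying on the default, and using the raw_data path from the question) is to tell zlib.compressobj which window size to advertise, since CINFO in the header is derived from wbits:

import zlib

plain = zlib.decompress(open(raw_data, "rb").read(), 12)

# wbits=12 means a 4096-byte window, so CMF = ((12 - 8) << 4) | 8 = 0x48;
# level 6 sets FLEVEL to 2, so the two header bytes should come out as 48 89,
# matching the original stream in the question.
co = zlib.compressobj(6, zlib.DEFLATED, 12)
recompressed = co.compress(plain) + co.flush()
print(recompressed[:2].hex())   # expected '4889'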

Python flake8 py reporting W391 (no newline at end of file) incorrectly

W391 says that there should be one (and only one) blank line at the end of file. However, flake8 reports the error when there is at least one newline at the end of the file:
$ cat /tmp/test.py
def hello():
    print('hello')


hello()
$ hexdump -C /tmp/test.py
00000000 64 65 66 20 68 65 6c 6c 6f 28 29 3a 0a 20 20 20 |def hello():. |
00000010 20 70 72 69 6e 74 28 27 68 65 6c 6c 6f 27 29 0a | print('hello').|
00000020 0a 0a 68 65 6c 6c 6f 28 29 0a 0a |..hello()..|
0000002b
You can see above there is in fact one and only one blank line at the end of the file (0a is \n). However, when I run flake8, I get the W391 error:
$ flake8 /tmp/test.py
/tmp/test.py:6:1: W391 blank line at end of file
Why is that?
Apparently vim automatically adds a newline to every file, which fooled me into thinking that the last blank line wasn't there. Over time this implicit newline confused me into thinking that two newline characters at the end create one blank line.
So, the warning is correct. There should be one and only one \n at the end of the file.
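If you want to double-check a file without relying on the editor, here is a tiny sketch that mirrors what W391 complains about (a blank line before EOF, i.e. the content ending in two newlines):

with open('/tmp/test.py', 'rb') as f:
    data = f.read()
# A clean file ends with exactly one b'\n'; a second trailing b'\n' is the blank line W391 flags.
print(data.endswith(b'\n') and not data.endswith(b'\n\n'))   # True -> no W391 expected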

Difference in result while reading same file with node and python

I have been trying to read the contents of the genesis.block given in this file of the Node SDK in Hyperledger Fabric using Python. However, whenever I try to read the file with Python by using
data = open("twoorgs.genesis.block").read()
The value of the data variable is as follows:
>>> data
'\n'
With nodejs using fs.readFileSync() I obtain an instance of Buffer() for the same file.
var data = fs.readFileSync('./twoorgs.genesis.block');
The result is
> data
<Buffer 0a 22 1a 20 49 63 63 ac 9c 9f 3e 48 2c 2c 6b 48 2b 1f 8b 18 6f a9 db ac 45 07 29 ee c0 bf ac 34 99 9e c2 56 12 e1 84 01 0a dd 84 01 0a d9 84 01 0a 79 ... >
How can I read this file successfully using Python?
Your file has a 1a in it. This is Ctrl-Z, which is an end-of-file marker on Windows.
So try binary mode like:
data = open("twoorgs.genesis.block", 'rb').read()

Clustering Algorithm (in Python) for Data

I have thousands of data entries that looks similar to the following:
08 00 00 00 c3 85 20 65 6e 61 62 6c 65 64 2e 0d 0a 45 78 70
5c 72 88 74 80 83 82 79 68 8d 7b 73 90 7c 60 84 80 74 00 00
5d 77 84 76 7d 85 7f 7d 6c 94 7e 73 82 74 61 7f 7b 76 00 00
63 70 84 8c 95 87 80 72 65 73 70 67 85 8a 64 93 89 74 00 00
65 7c 73 6c 6c 9a a2 86 7e 4f 7e 71 7c 79 5c 7f 72 7b 00 00
...
Each entry has 20 numbers, of which each number can be any value between 0 and 255 (shown as a hex number). I have references that I can use to help pin the clusters. The references have the same template as the data.
I have already determined that I can use a Manhattan distance equation to give each one a numerical value with respect to a reference array, but I'm looking for a way to cluster the data. Based on what I know about the data, there should be approximately 50-60 clusters. I expect some of the data to be outside of a threshold and consequently not a part of any cluster.
With the way the data is set up, I can process it as it comes in (about once every 20 seconds). I haven't found a convenient library to use, and the entire thing must be written in Python (preferably with just the standard library).
I was hoping that I did not need to develop the algorithm on my own. I believe I might want a MinHash, but I am open to other possibilities.
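For concreteness, the parsing and Manhattan-distance step described above needs only the standard library; here is a small sketch using two of the sample rows as a stand-in reference and entry:

def parse_entry(line):
    # each entry is 20 whitespace-separated hex bytes
    return [int(tok, 16) for tok in line.split()]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

ref = parse_entry("5c 72 88 74 80 83 82 79 68 8d 7b 73 90 7c 60 84 80 74 00 00")
row = parse_entry("5d 77 84 76 7d 85 7f 7d 6c 94 7e 73 82 74 61 7f 7b 76 00 00")
print(manhattan(ref, row))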
So, it really depends on what kind of clustering you want. Clustering is an incredibly broad and generally quite computationally expensive operation, and there are a large number of different approaches to it.
I would go as far as to say that there is no better solution to your problem than using scikit-learn's clustering modules. They have a fantastic breakdown of their different clustering algorithms here: http://scikit-learn.org/dev/modules/clustering.html
Personally I use DBSCAN for most applications, but depending on exactly how you want to cluster this data it might not be the best choice for you. It's also worth mentioning that Manhattan distance is usually not a great choice for clustering algorithms; cosine distance and Euclidean distance can both be more performant and give a more accurate representation of your data.
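To make that concrete, a minimal DBSCAN sketch over 20-value rows might look like the following (eps and min_samples are made-up starting points, metric='manhattan' simply mirrors the distance the question mentions, and scikit-learn plus numpy are assumed to be available):

import numpy as np
from sklearn.cluster import DBSCAN

rows = np.array([
    [0x5c, 0x72, 0x88, 0x74, 0x80, 0x83, 0x82, 0x79, 0x68, 0x8d,
     0x7b, 0x73, 0x90, 0x7c, 0x60, 0x84, 0x80, 0x74, 0x00, 0x00],
    [0x5d, 0x77, 0x84, 0x76, 0x7d, 0x85, 0x7f, 0x7d, 0x6c, 0x94,
     0x7e, 0x73, 0x82, 0x74, 0x61, 0x7f, 0x7b, 0x76, 0x00, 0x00],
])
labels = DBSCAN(eps=40, min_samples=2, metric='manhattan').fit_predict(rows)
print(labels)   # -1 marks points that fall outside every cluster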
A quick Google search reveals the python-cluster package, located at https://pypi.python.org/pypi/cluster/1.1.0b1. One of the examples shows something I believe to be somewhat similar to the data setup that you want. The package does advise for large datasets to perform the clustering in a separate thread, though I believe in your specific situation that wouldn't be necessary.
>>> from cluster import *
>>> data = [12,34,23,32,46,96,13]
>>> cl = HierarchicalClustering(data, lambda x,y: abs(x-y))
>>> cl.getlevel(10) # get clusters of items closer than 10
[96, 46, [12, 13, 23, 34, 32]]
>>> cl.getlevel(5) # get clusters of items closer than 5
[96, 46, [12, 13], 23, [34, 32]]
Because you know all your data is between 0 and 255, the getlevel(5) call would separate your data into approximately 50-52 clusters. Also, you would have to convert your dataset into a list of integers.
Edit: Turns out, that won't do what you want. I assume you have enough data that you'll have at least one value for every five. This clustering algorithm will just group everything into a big nested list, as below.
>>> data = [1,2,3,4,5,6,7,8,9]
>>> x = HierarchicalClustering(data, lambda x,y: abs(x-y))
>>> x.getlevel(1)
[[1, 2, 3, 4, 5, 6, 9, 7, 8]]
