Unicode printing wrong characters - python

Unfortunately still using Python 2.7 and moving to Python 3 currently is not in the roadmap.
I am making a REST call to an external application to retrieve data which I need to then process, push into a CSV file and post that across another different REST interface.
The characters causing me problems, received from the first REST call, are
u'\xe7\x83\xad\xe7\x82\xb9\xe8\xae\xa8\xe8\xae\xba'
which when printed are
>>> print u'\xe7\x83\xad\xe7\x82\xb9\xe8\xae\xa8\xe8\xae\xba'
çƒ­ç‚¹è®¨è®º
What this should actually be is 热点讨论
This data is then processed without change, but it is written into the CSV file as 'çƒ­ç‚¹è®¨è®º' and then posted to the second REST interface.
The characters that should actually end up in the CSV file are
热点讨论
which in unicode should look like u'\u70ed\u70b9\u8ba8\u8bba'
>>> print u'\u70ed\u70b9\u8ba8\u8bba'
热点讨论
How do I either:
1. transform the initially received u'\xe7\x83\xad\xe7\x82\xb9\xe8\xae\xa8\xe8\xae\xba' into u'\u70ed\u70b9\u8ba8\u8bba' before pushing it into the CSV file and posting it to the second REST API, or
2. cause u'\xe7\x83\xad\xe7\x82\xb9\xe8\xae\xa8\xe8\xae\xba' to print as 热点讨论 correctly on the receiving side of the second REST API?
Edit: details for the first REST call.
The call is being made using requests.get()
This is the response data:
[
{"flags":1074200578,"results":{"id1":22212362104,"id2":22212362125,"fields":[{"id1":0,"id2":0,"count":0,"format":128,"type":"ip.src","flags":0,"group":0,"value":"172.18.16.74"},{"id1":0,"id2":0,"count":0,"format":65,"type":"filename","flags":0,"group":0,"value":"WinVNC6"},{"id1":0,"id2":0,"count":0,"format":65,"type":"checksum","flags":0,"group":0,"value":"5b52bc196bfc207d43eedfe585df96fcfabbdead087ff79fcdcdd4d08c7806db"},{"id1":22212362104,"id2":22212362104,"count":0,"format":32,"type":"time","flags":0,"group":0,"value":1618530308},{"id1":22212362106,"id2":22212362106,"count":0,"format":128,"type":"device.ip","flags":0,"group":0,"value":"172.18.21.197"},{"id1":22212362108,"id2":22212362108,"count":0,"format":65,"type":"device.type","flags":0,"group":0,"value":"symantecav"},{"id1":22212362111,"id2":22212362111,"count":0,"format":65,"type":"host.src","flags":0,"group":0,"value":"\u00E7\u0083\u00AD\u00E7\u0082\u00B9\u00E8\u00AE\u00A8\u00E8\u00AE\u00BA"},{"id1":22212362114,"id2":22212362114,"count":0,"format":65,"type":"checksum","flags":0,"group":0,"value":"5b52bc196bfc207d43eedfe585df96fcfabbdead087ff79fcdcdd4d08c7806db"},{"id1":22212362117,"id2":22212362117,"count":0,"format":65,"type":"filename","flags":0,"group":0,"value":"WinVNC6"},{"id1":22212362119,"id2":22212362119,"count":0,"format":65,"type":"action","flags":0,"group":0,"value":"Left alone"},{"id1":22212362124,"id2":22212362124,"count":0,"format":65,"type":"user.dst","flags":0,"group":0,"value":"Adam_Joe"},{"id1":22212362125,"id2":22212362125,"count":0,"format":128,"type":"ip.src","flags":0,"group":0,"value":"172.18.16.74"},{"id1":22212362104,"id2":22212362104,"count":1,"format":8,"type":"time","flags":2,"group":0,"value":"1"}]}},
{"flags":1074200577,"results":{"id1":22212362126,"id2":22212362136,"fields":[]}}
]
The part of this causing the issue is "\u00E7\u0083\u00AD\u00E7\u0082\u00B9\u00E8\u00AE\u00A8\u00E8\u00AE\u00BA"
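Those JSON escapes decode to one code point per UTF-8 byte, i.e. the server double-escaped UTF-8 data, and the client ended up with UTF-8 bytes mis-decoded as Latin-1. That can be repaired on the receiving side before writing the CSV; a minimal sketch (works on Python 2.7 and 3):

```python
# -*- coding: utf-8 -*-
# Each code point in the mojibake string corresponds to one raw byte.
# Re-encoding as Latin-1 recovers those bytes, and decoding the bytes
# as UTF-8 yields the intended text.
mojibake = u'\xe7\x83\xad\xe7\x82\xb9\xe8\xae\xa8\xe8\xae\xba'
fixed = mojibake.encode('latin-1').decode('utf-8')
assert fixed == u'\u70ed\u70b9\u8ba8\u8bba'  # 热点讨论
```

The same round trip works for any field that arrives this way, as long as the original data really was UTF-8.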

Related

Python read from file being written to in a different language

I know this is kind of a dumb situation, but I'm writing to a file in C# and need to be able to read from it in python. I was looking it up, and people were mostly saying to flush the text file, but that only works for python-python, and I need it to work for C#-python. If I read from the file in python and then write to it in C#, everything works fine, but the information is wrong. If I write then read, the correct information is written to the txt file, but then python generates a FileNotFound exception. I am flushing and closing the file in C#, but it still isn't working.
My C# code is as follows:
File.WriteAllText("..\\Ref.txt", "");
System.IO.FileStream s = new System.IO.FileStream("..\\Ref.txt", System.IO.FileMode.OpenOrCreate);
string text = intPtr.ToString();
Byte[] bytes = Encoding.ASCII.GetBytes(text);
s.Write(bytes, 0, bytes.Length);
s.Flush();
Console.WriteLine("Flushed");
s.Close();
And my python code looks like
os.chdir(aDir)
os.system("dotnet run")
r = open("Ref.txt", "r")
ref = r.read()
r.close()
This order generates an exception, but if it somehow didn't, the correct information would be obtained from the text file. If I open the text file first and then run dotnet run, no exception is generated, but I get outdated information (i.e. the info from the last time the program ran, since that was the last time the txt was written to).
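One possible culprit (an assumption, since the directory layout isn't shown): the C# program writes to ..\Ref.txt, i.e. the parent of its working directory, while the Python code reads Ref.txt from aDir itself after the chdir. If dotnet run inherits aDir as its working directory, the two sides are using different files, which would explain both the FileNotFound error and the stale data. A sketch that reads the same location the C# side writes (directory names and contents are hypothetical):

```python
import os
import tempfile

a_dir = os.path.join(tempfile.mkdtemp(), "project")  # hypothetical stand-in for aDir
os.makedirs(a_dir)
os.chdir(a_dir)

# Simulate what "dotnet run" does when launched from a_dir: the C# code
# opens "..\Ref.txt", which resolves to the PARENT of the working directory.
with open(os.path.join("..", "Ref.txt"), "w") as f:
    f.write("12345")

# Reading "Ref.txt" relative to a_dir raises FileNotFoundError, because the
# file actually lives one level up.
with open(os.path.join("..", "Ref.txt")) as f:
    ref = f.read()
```

Printing the absolute path on both sides (os.path.abspath in Python, Path.GetFullPath in C#) is a quick way to confirm or rule this out.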

Error (Hung process?) when using COPY INTO with ANSI file

I'm trying to load a set of public flat files (using COPY INTO from Python) that apparently are saved in ANSI format. Some of the files load with no issue, but there is at least one case where the COPY INTO statement hangs (no error is returned, and nothing is logged, as far as I can tell). I isolated the error to a particular row with a non-standard character, e.g. the ¢ character in the 2nd row -
O}10}49771}2020}02}202002}4977110}141077}71052900}R }}N}0}0}0}0}0}0}0}0}0}0}0}0}0}0}0}0}08}CWI STATE A}CENTENNIAL RESOURCE PROD, LLC}PHANTOM (WOLFCAMP)
O}10}50367}2020}01}202001}5036710}027348}73933500}R }}N}0}0}0}0}0}0}0}0}0}0}0}0}0}0}0}0}08}A¢C 34-197}APC WATER HOLDINGS 1, LLC}QUITO, WEST (DELAWARE)
Re-saving these rows into a file with UTF-8 encoding solves the issue, but I thought I'd pass this along in case someone wants to take a look at the back-end to handle these types of characters and/or return some kind of error.
Why do you save into a file?
If it is possible, just handle the strings internally in Python, decoding from the source ANSI code page (commonly cp1252) and re-encoding as UTF-8:
utf8_bytes = ansi_bytes.decode("cp1252").encode("utf-8")
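If the files do have to stay on disk, a minimal sketch of re-encoding a whole file before COPY INTO, assuming the "ANSI" encoding is cp1252 (filenames and the sample row are hypothetical):

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())  # keep the demo files out of the way

# Write a row the way the flat file stores it: in cp1252 the ¢ of "A¢C"
# is the single byte 0xA2, which is not valid UTF-8 on its own.
row = u"O}10}50367}2020}01}202001}5036710}A\xa2C 34-197\n"
with open("rows_ansi.txt", "w", encoding="cp1252") as f:
    f.write(row)

# Re-encode the whole file to UTF-8 before handing it to COPY INTO.
with open("rows_ansi.txt", "r", encoding="cp1252") as src, \
     open("rows_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read())
```

After this, the ¢ is stored as the two-byte UTF-8 sequence 0xC2 0xA2 and the loader no longer sees a lone invalid byte.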

UnicodeDecodeError when reading data from DBF database

I need to write a script that connects an ERP program to a manufacturing program. With the production program the matter is clear - I send it data via HTTP requests. It is worse with the ERP program, because in its case, the data must be read from a DBF file.
I use the dbf library because (if I'm not mistaken) it's the only one that provides the ability to filter data in a fairly simple and fast way. I open the database this way
table = dbf.Table(path).open()
dbf_index = dbf.pql(table, "select * where ident == 'M'")
I then loop through each successive record that the query returned. I need to "package" the selected data from the DBF database into json and send it to the production program api.
data = {
    "warehouse_id": parseDbfData(record['SYMBOL']),
    "code": parseDbfData(record['SYMBOL']),
    "name": parseDbfData(record['NAZWA']),
    "main_warehouse": False,
    "blocked": False
}
The parseDbfData function looks like this; it isn't the cause of the problem, though, because the script failed the same way without it. I added it while trying to fix the problem.
def parseDbfData(data):
    return str(data.strip())
When run, if the function encounters any "mismatching" character from DBF database (e.g. any Polish characters i.e. ą, ę, ś, ć) the script terminates with an error
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 15: ordinal not in range(128)
The error points to a line containing this (in building json)
"name" : parseDbfData(record['NAZWA']),
The value the script is trying to read at this point is probably "Magazyn materiałów Podgórna". As you can see, this value contains the characters "ł" and "ó". I think this makes the whole script break but I don't know how to fix it.
I'll mention that I'm using Python version 3.9. I know that there were character encoding issues in the 2.x versions, but I thought that the Python 3.x era had remedied that problem. I found out it didn't :(
I came to the conclusion that I have to specify the encoding directly when opening the DBF database. However, I could not work out from the documentation how exactly to do this.
After a thorough analysis of the dbf module itself, I concluded that I need to use the codepage parameter when opening the database. A bit of experimenting let me determine that, of all the encoding standards available in the module, cp852 suits my data best.
After the correction, the code to open a DBF database looks like this:
table = dbf.Table(path, codepage='cp852').open()
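Why cp852 fits can be checked directly: the byte 0x88 from the traceback is ł in code page 852 (DOS Latin-2), and ó is byte 0xA2, so the raw field bytes decode to exactly the expected Polish text. The sample bytes below are reconstructed for illustration:

```python
# 0x88 -> ł and 0xA2 -> ó in cp852, which is why the ascii codec
# choked on byte 0x88 in the NAZWA field.
raw = b"Magazyn materia\x88\xa2w Podg\xa2rna"  # reconstructed sample bytes
print(raw.decode("cp852"))  # Magazyn materiałów Podgórna
```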
Python 3 did fix the unicode/bytes issue, but only for Python itself. The dbf format stores the code page that should be used inside the .dbf files themselves (which is frequently not done, resulting in an ascii codec being used).
To fix the dbf files (which may mess up the other programs using them, so test carefully):
table.open()
table.codepage = dbf.CodePage('cp852')
table.close()

Why is Django FieldFile readline() returning the hex version of a text file?

Having an odd problem.
I have a Django app that opens a file (represented as a Django FieldFile) and reads each row using readline() as below:
with file.open(mode='r') as f:
row = f.readline()
# do something with row...
The file is text, utf-8 encoded and lines are terminated with \r\n.
The problem is each row is being read as the hex representation of the string, so instead of "Hello" I get "48656c6c6f".
A few stranger things:
It previously worked properly, but at some point an update has broken it (I've tried rolling back to previous commits and it is still wonky, so possibly a dependency has updated and not something from my requirements.txt). Missed it in my testing because it is in a very rarely used part of the app.
If I read the same file using readlines() instead of readline() I see the correct string representation of the file wrapped in [b'...']
The file reads normally if I do it using straight Python open() and readline() from an interpreter
Forcing text mode with mode='rt' doesn't change the behaviour, neither does mode='rb'
The file is stored in a Minio bucket, so the default storage is storages.backends.s3boto3.S3Boto3Storage from django-storages and not the default Django storage class. This means that boto3, botocore and s3fs are also in the mix, making it more confusing for me to debug.
Scratching my head at why this worked before and what I'm doing wrong.
Environment is Python 3.8, Django 2.2.8 and 3.0 (same result) running in Docker containers.
EDIT
Let me point out that the fix for this is simply using
row = f.readline().decode()
but I would still like to figure out what's happening.
EDIT 2
Further to this, FieldFile.open() is reading the file as a binary file, whereas plain Python open() is reading the file as a text file.
This seems very weird.
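For what it's worth, the symptom is consistent with the storage backend handing back bytes where text was expected and something in the stack hex-encoding them: '48656c6c6f' is exactly the hex spelling of b'Hello', which is also why calling .decode() on the raw bytes restores the text. A minimal illustration:

```python
line = b"Hello"          # what a binary-mode readline() returns
print(line.hex())        # '48656c6c6f' -- the observed garbage
print(line.decode())     # 'Hello'     -- the readable row
```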
I think you will see the solution immediately after trying the following (I will then update my answer, or delete it if it really doesn't help, but I'm quite confident).
I assume that there is some code that is monkeypatching file.open or the Django view function.
What I suggest is:
Start your code with manage.py runserver
Add the following code to manage.py (as the very first lines):
import file
print("ID of file.open at manage startup is", id(file.open))
Then add this code to your view, directly one line above the file.open call:
print("ID of file.open before opening is", id(file.open))
If both ids are different, then something monkeypatched your open function.
If both are the same, then the problem must be somewhere else.
If you do not see the output of these two prints, something might have monkeypatched your view.
If this doesn't work, then try to use
open() instead of file.open()
Is there any particular reason you use file.open()?
Addendum 1:
So what you said is that file is an object instance of a class: is it a FileField?
In any case, can you obtain the name of the file and open it with a normal open(), to see whether it is only file.open() that does funny things or whether plain open() also reads it in this strange way?
Did you try opening the file from the command line with cat filename (or, under Windows, with type filename)?
If that doesn't work we could add traces to follow each line of the source code that is being executed.
Addendum 2:
Well if you can't try this in a manage.py runserver, what happens if you try to read the file with a manage.py shell?
Just open the shell and type something like:
from <your_application>.models import <YourModel>
entry = <YourModel>.objects.get(id=<idofentry>)
line1 = entry.<filefieldname>.open("r").read().split("\n")[0]
print("line1 = %r" % line1)
If this is still not conclusive (but only if you can reproduce the issue with the management shell), then create a small file containing these lines:
from <your_application>.models import <YourModel>
entry = <YourModel>.objects.get(id=<idofentry>)
import pdb; pdb.set_trace()
line1 = entry.<filefieldname>.open("r").read().split("\n")[0]
print("line1 = %r" % line1)
And import it from the management shell.
The code should enter the debugger, and now you can single-step through the open function and see whether you end up in some weird function from a monkeypatch.

Python - Data Sent Over Socket Appears Different on Client and Server

I've got a client/server program where the client sends plaintext to the server which then runs AES encryption and returns the ciphertext. I'm using the following algorithm for the encryption/decryption:
http://anh.cs.luc.edu/331/code/aes.py
When I get the results back from the encryption and print them on the server-side I see mostly gibberish in the terminal. I can save to a file immediately and get something along these lines:
tgâY†Äô®Ø8ί6ƒlÑÝ%ŠIç°´>§À¥0Ð
I can see that this is the correct output because if I immediately decrypt it on the server, I get the original plaintext back. If I run this through the socket, send it back to the client, and print(), I get something more like this:
\rtg\xe2Y\x86\x8f\xc4\xf4\xae\xd88\xce\xaf6\x83l\xd1\xdd%\x8aI\xe7\xb0\xb4>\xa7\xc0\x18\xa50\xd0
There's an obvious difference here. I'm aware that the \x represents a hex value. If I save on the client-side, the resulting text file still contains all \x instances (i.e., it looks exactly like what I displayed directly above). What must I do to convert this into the same kind of output that I'm seeing in the first example? From what I have seen so far, it seems that this is unicode and I'm having trouble...
Relevant code from server.py
key = aes.generateRandomKey(keysizes[len(key)%3])
encryptedText = aes.encryptData(key, text)
f = open("serverTest.txt", "w")
f.write(encryptedText)
f.close()
print(encryptedText)
decryptedText = aes.decryptData(key, encryptedText)
print(decryptedText)
conn.sendall(encryptedText)
Relevant code from client.py
cipherText = repr(s.recv(16384))[1:-1]
s.close()
cipherFile = raw_input("Enter the filename to save the ciphertext: ")
print(cipherText)
f = open(cipherFile, "w")
f.write(cipherText)
Edit: To put this simply, I need to be able to send that data to the client and have it display in the same way as it shows up on the server. I feel like there's something I can do with decoding, but everything I've tried so far doesn't work. Ultimately, I'll have to send from the client back to the server, so I'm sure the fix here will also work for that, assuming I can read it from the file correctly.
Edit2: When sending normally (as in my code above) and then decoding on the client-side with "string-escape", I'm getting identical output to the terminal on both ends. The file output also appears to be the same. This issue is close to being resolved, assuming I can read this in and get the correct data sent back to the server for decrypting.
Not sure I fully understood what you're up to, but one difference between the client and the server is that on the client you're taking the repr of the byte string, while on the server you print the byte string directly.
(if I got the issue right) I'd suggest replacing
repr(s.recv(16384))[1:-1]
with a plain
s.recv(16384)
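More generally, to keep the ciphertext identical on both ends, treat it as binary everywhere: send the raw byte string and write it to files opened in binary mode, rather than storing repr() text. A sketch (the sample bytes are made up):

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())            # keep demo files out of the way

data = b"\rtg\xe2Y\x86\x8f\xc4\xf4"     # pretend this came from s.recv(16384)

# Binary mode writes the bytes untouched; text mode plus repr() would store
# the escape sequences (backslash, 'x', hex digits) as literal characters.
with open("cipher.bin", "wb") as f:
    f.write(data)

with open("cipher.bin", "rb") as f:
    assert f.read() == data             # byte-for-byte identical

# For contrast: repr() is a printable *representation*, not the data itself.
escaped = repr(data)
assert escaped != data.decode("latin-1")
```

The same file, read back with open(cipherFile, "rb"), can then be sent to the server and decrypted without any escaping tricks.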
