I'm playing a game that I believe uses a base64-encoded save, and I'm trying to figure out how to get JSON or something equivalent out of the save in Python.
First, I'm decoding it from base64 (using base64.b64decode), which gives me bytes that look like
b'Salted__\x98&v\\\x99\xa8\xbb\x10\xaa"\xbbf\xea\x99\xbf}\x95\x96\xd0\xba~\xef#\x00\x16v\x01\xa5\x11=\xfa;\xc0\xec\xcb\x95h\xfcH\...
when printed. I'm having trouble decoding the bytes, however. Doing decoded_bytes.decode() gives an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 8: invalid start byte
I've tried passing a few different encodings to decode, but none of them seem to work. It looks like the first 8 bytes are a salt, but removing those 8 and trying to decode doesn't work either. Any help would be greatly appreciated.
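One observation on the bytes shown: a leading b'Salted__' is the magic prefix written by OpenSSL's salted encryption format (other tools mimic the same layout), where 8 magic bytes are followed by an 8-byte salt and then the ciphertext. If the save follows that layout, the payload is encrypted rather than text in some encoding, so no codec passed to decode() will recover it. The sketch below only splits the header apart; the helper name and the sample blob are made up for illustration, and actual decryption would need the passphrase, cipher, and key derivation, none of which appear in the post.

```python
import base64

OPENSSL_MAGIC = b"Salted__"  # 8-byte marker at the start of the blob


def split_openssl_blob(b64_text):
    """Split a base64 blob in the salted layout into (salt, ciphertext)."""
    raw = base64.b64decode(b64_text)
    if not raw.startswith(OPENSSL_MAGIC):
        raise ValueError("no 'Salted__' header; not the salted layout")
    salt = raw[8:16]       # the 8 bytes right after the magic
    ciphertext = raw[16:]  # everything else is encrypted payload
    return salt, ciphertext


# Hypothetical sample blob, not the real save file
sample = base64.b64encode(b"Salted__" + bytes(range(8)) + b"\x98\x26\x76\x5c" * 4)
salt, ciphertext = split_openssl_blob(sample)
print(salt.hex(), len(ciphertext))
```

Note that in this layout the first 8 bytes are the marker, not the salt, which would explain why stripping only those 8 bytes changes nothing.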
I've received a .xlsx file where foreign language characters appear to have been originally encoded as utf-8 and then decoded as utf-16.
I don't have access to the original file.
Ã© and Ã³, for example, should be read as é and ó, respectively, which are encoded as 0xC3 0xA9 and 0xC3 0xB3. Instead each single utf-8 character has been decoded at some point as two utf-16 characters.
I've tried encoding them to bytes and decoding them with UTF-8, but that doesn't translate correctly.
This:
s = "Ã³".encode("utf-16")
uni = s.decode("utf-8")
print(uni)
returns this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I've tried the above encode/decode sequence with a variety of different parameters including UTF-16be and UTF-16le, and every error parameter of decode, and just slicing off the BOM, none of which worked.
Is there a way to fix this? Some library that can make this a less painful process than doing a literal replacement by reading every string and replacing Ã³ with ó?
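A possible approach, sketched under the assumption that the UTF-8 bytes were actually mis-decoded as a single-byte encoding such as Windows-1252 (one character per byte, which matches the two-characters-per-letter symptom) rather than as UTF-16: encode the garbled text back with the wrong codec to recover the original bytes, then decode those as UTF-8. The function name and the default codec are illustrative choices, not something taken from the file.

```python
def repair_mojibake(text, wrong_codec="cp1252"):
    """Undo a UTF-8 -> wrong_codec mis-decoding.

    Returns the text unchanged when the round trip does not apply,
    for example when the string was never garbled.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = "\xc3\xb3"               # the two characters U+00C3 and U+00B3
fixed = repair_mojibake(garbled)   # a single character, U+00F3
```

The encode("utf-16") attempt in the question fails because it produces a BOM plus two bytes per character, which is not the original byte sequence. A third-party library such as ftfy aims to automate this kind of repair if the damage is less uniform.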
I'm trying to receive a CSV file from an Amazon S3 bucket like this:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='BUCKET_NAME', Key='FILE_NAME')
data = obj['Body'].read().decode('utf-8').splitlines()
But this is what I get:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 10: invalid start byte
I thought I got the encoding of the file wrong, so I ran file FILE_NAME.csv and got UTF-8. I have also tried latin-1 and some other encodings, but all of them give me gibberish. Any help is appreciated, and thank you all!
This happens when the bytes contain non-ASCII characters that were written in an encoding other than UTF-8, so decoding them as UTF-8 fails. In that case you need to try a different encoding; windows-1252 is a common candidate.
The below solution works fine for me. By default, it tries to decode with UTF-8 as it is recommended to use this encoding over others. If it gets an error while decoding then it tries with windows-1252.
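message = b'some message with special characters'

def decode_message(message):
    # Try UTF-8 first; fall back to windows-1252 only if that fails.
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        return message.decode('windows-1252')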
Now you can even encode it back with UTF-8.
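As a rough illustration of that round trip, here is a sketch using a made-up byte string. In windows-1252 the bytes 0x93 and 0x94 are curly double quotes, which would be consistent with the 0x94 in the error above, though that is a guess about the file. Note that Python's windows-1252 codec still rejects a few unassigned bytes (such as 0x81), so the fallback can fail too.

```python
# Hypothetical sample: ASCII text with windows-1252 curly quotes around "hi"
raw = b"He said \x93hi\x94"

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # 0x93 is not a valid UTF-8 start byte, so this branch runs
    text = raw.decode("cp1252")

# Once the data is a str, it can be written back out as clean UTF-8
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```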
I am having a problem decoding a byte string that I have to send from one computer to another. The file is a PDF. I get this error:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas on how to remove the b' ' marking? I need to reassemble the file, but I also need to know its size in bytes before sending it, and I figured I would get it by decoding each byte string (this works for txt files but not for PDFs).
Code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes= file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
        readBytes= ''
filesize=0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue, calling len() on the byte string gives you its size; the b'' is not counted.
In Python, bytestrings are for raw binary data, and strings are for textual data. Calling decode() with no arguments tries to decode the data as UTF-8, which works for txt files but not for PDF files, since they can contain arbitrary bytes that are not valid UTF-8. You should not try to get a string, since bytestrings are designed for this purpose. You can get the length of a bytestring as usual, with len(data). Many of the string operations also apply to bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' when you print it is just because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
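A minimal sketch of the approach described above, assuming the goal is just to chunk the file and know its total size: keep every chunk as bytes and sum len() over them. The function name, chunk size, and the temporary file standing in for a PDF are all made up for the example.

```python
import os
import tempfile


def read_chunks(path, chunk_size):
    """Read a file in binary mode and return its chunks as bytes objects."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            chunks.append(chunk)
    return chunks


# Hypothetical stand-in for a PDF: a header plus bytes that are not valid UTF-8
payload = b"%PDF-1.4\n" + bytes([0xDA, 0x00, 0xFF]) * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

chunks = read_chunks(tmp_path, chunk_size=64)
filesize = sum(len(c) for c in chunks)  # size in bytes, no decoding involved
os.remove(tmp_path)
print(filesize == len(payload))
```

Because nothing is decoded, the chunks can be sent as they are and joined on the other side with b"".join(chunks).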
I have a strange issue popping up when attempting to scrape the links from a pdf file. The link appears as 'http://www.mbc.ca.gov/Licensees/License_Renewal/Physician_Survey.aspx' in the pdf file. However, it comes out as:
b'http://www.mbc.ca.gov/Licensees/License_Renewal/Physici\xe9C#|\xf2\xefw\x0e\xd3\x8d>X\x0f\xe7\xc6'
when performing a resolve() method on the PDFObjRef. Why the sudden corruption in the link there? Almost looks like a newline or something that got interpreted as a byte. Also, why is this even a byte string if it's clearly human readable? Is this normal behavior for pdfminer?
It gives this error when attempting to decode that byte string with utf-8:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 55: invalid continuation byte
I think this is a goner. The script works with all pdfs I've encountered but this one. So unless someone can come up with a reason that pdfminer would get a weird/corrupt encoding about 40-60 characters into a byte string, this is FUBAR.
So, I'm having issues with Python 3 encoding. I have a few bytes I want to work with as strings. (long story)
In short, this works
a = "\x85".encode()
print(a.decode())
But this doesn't
b = (0x85).to_bytes(1,"big")
print(b.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
I have read a handful of articles on the subject, but they insist that 'python3 is broken' or that 'you shouldn't be using strings for that'. Plenty of answers on Stack Overflow just use "workarounds" (such as "use replace on error" or "use utf-16").
Could anyone tell me where the difference lies and why the first snippet works while the second one doesn't? Shouldn't both of them work identically? Why can't utf-8 decode the byte on the second attempt?
In the first case '\x85'.encode() encodes the Unicode code point U+0085 in the Python 3 default encoding of UTF-8. So the output is the correct two-byte UTF-8 encoding of that code point:
>>> '\x85'.encode()
b'\xc2\x85'
Decode then works because it was correctly encoded in UTF-8 to begin with:
>>> b'\xc2\x85'.decode()
'\x85'
The second case is a complicated way of creating a single byte string:
>>> (0x85).to_bytes(1,'big')
b'\x85'
This byte string is not correctly encoded as UTF-8, so it fails to decode:
>>> b'\x85'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
Python 3 is definitely not "broken". It cleanly separates byte data from text.
If you have raw bytes, work with them as bytes. Raw data in Python 3 is intended to be manipulated in byte strings or byte arrays. Unicode strings are for text. Decode bytes to text to manipulate it, then encode back to bytes to serialize to file, socket, database, etc.
If for some reason you feel the need to use Unicode strings for raw data, the first 256 code points of Unicode correspond to the latin1 codec for 1:1 mapping of one to the other.
>>> '\x85'.encode('latin1')
b'\x85'
>>> b'\x85'.decode('latin1')
'\x85'
This is often used to correct programming errors due to encoding/decoding with the wrong encodings.
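To make the 1:1 claim concrete, here is a small sketch that round-trips every possible byte value through latin1 and contrasts it with UTF-8, which rejects a lone 0x85. The variable names are illustrative only.

```python
all_bytes = bytes(range(256))

# latin1 maps byte value N to code point N, so no byte can fail to decode
as_text = all_bytes.decode("latin1")
code_points = [ord(ch) for ch in as_text]

# Encoding back with latin1 restores the exact original bytes
round_tripped = as_text.encode("latin1")

# UTF-8, by contrast, rejects a lone 0x85 byte
try:
    b"\x85".decode("utf-8")
    utf8_accepts_0x85 = True
except UnicodeDecodeError:
    utf8_accepts_0x85 = False

print(round_tripped == all_bytes, utf8_accepts_0x85)
```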