decode bytes that are in string form with unknown encoding - python

So I just started my own little project to create a bot for a game, but I have only done a little coding before, so I am definitely no expert. If I get something mixed up or forget to mention some information, I apologize in advance!
So basically my Python bot connects to the server (WebSocket connection, version 13; the header says "Accept-Encoding: gzip, deflate, br"). I use the WebSocket module and that works well. The game sends messages in JSON format. However, they are filled with backslashes; I think that internally some JavaScript clears those out, splits each message into multiple ones and removes the outermost layer. So far my solution is to just clear out the backslashes, and from there on it's pretty straightforward.
The problem is that the map data is apparently encoded. The message looks like this:
{"type":"pkg","data":"[\"{\\\"type\\\":\\\"pl\\\",\\\"data\\\":[\\\"{\\\\\\\"type\\\\\\\":\\\\\\\"p\\\\\\\",\\\\\\\"id\\\\\\\":227727,\\\\\\\"tpl\\\\\\\":227727,\\\\\\\"s\\\\\\\":458
... and then at the end of the message (it's a lot longer, I just didn't want to post 30 lines of compressed data):
{\\\"type\\\":\\\"zip\\\",\\\"data\\\":\\\"{\\\\\\\"type\\\\\\\":\\\\\\\"map\\\\\\\",\\\\\\\"xî182îyî478îtilesî\\\\\\\"1:î¢î¤î£î¦526_21î¢254
This is obviously the encoded/compressed map data. Firefox dev tools, however, also shows it decompressed, and then it looks more like this:
\\\\\\\"map\\\\\\\",\\\\\\\"xî\u0080\u0086182î\u0080\u008dyî\u0080\u0086478î\u0080\u008dtilesî\u0080\u0086\\\\\\\"1:î\u0080¢î\u0080¤î\u0080£î\u0080¦526_21î\u0080¢254:36î\u0080²î\u0080´î\u0080³î\u0080µ:î\u0080¬î\u0080¸î\u0080ºî\u0080·î\u0080·î\u0080ºî\u0080¼î\u0080¶
I experimented with different commands and modules like zlib, but honestly, I'm really lost. Is that data already decoded and now in byte form, or is it still compressed zip data? If so, how can I decode it, given that right now I handle it as a raw string? Or should I write it to a data file from the get-go? What does the "xî" at the beginning stand for? Is it the encoding scheme?
Any help is greatly appreciated. I would really like to know what the heck is going on here :D
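
A minimal sketch of two observations about the samples above, not a confirmed answer. First, the backslashes look like JSON strings nested inside JSON strings, so decoding layer by layer with json.loads is safer than stripping backslashes. Second, the "î\u0080\u0086" runs in the Firefox view are what three-byte UTF-8 sequences look like when read as Latin-1 (bytes EE 80 86 are U+E006, a private-use code point). What the game does with those code points is unknown here; treating them as tokens of a custom compression scheme would be a guess. The helper names unwrap and fix_mojibake and the sample message are made up for illustration.

```python
import json


def unwrap(value):
    """Recursively decode strings that themselves hold a JSON object or array."""
    if isinstance(value, str):
        try:
            inner = json.loads(value)
        except ValueError:
            return value  # an ordinary string, not JSON
        # Only descend into containers, so a string like "458" stays a string.
        if isinstance(inner, (dict, list)):
            return unwrap(inner)
        return value
    if isinstance(value, list):
        return [unwrap(item) for item in value]
    if isinstance(value, dict):
        return {key: unwrap(item) for key, item in value.items()}
    return value


def fix_mojibake(text):
    """Reinterpret text that was UTF-8 but got decoded as Latin-1."""
    return text.encode("latin-1").decode("utf-8")


# Hypothetical message built with the same kind of nesting as in the question.
innermost = json.dumps({"type": "p", "id": 227727})
middle = json.dumps({"type": "pl", "data": [innermost]})
outer = json.dumps({"type": "pkg", "data": json.dumps([middle])})

message = unwrap(outer)
player = message["data"][0]["data"][0]

# "x" followed by the Latin-1 view of bytes EE 80 86, as in the Firefox sample.
repaired = fix_mojibake("x\u00ee\u0080\u0086182")
```

Here message is a plain nested dict/list structure with no backslashes left, and repaired contains the single code point U+E006 between "x" and "182".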

Related

Structure of base 64 encoded strings in html

I downloaded the page source (HTML) of websites with Selenium (Python), and I wish to find all base64-encoded strings in the HTML files.
Is there a known structure to all base64-encoded strings in HTML? From my observations, it seems like they start with ;base64 followed by hex-strings and finally a closing bracket ). Is that accurate?
According to Wikipedia, the hex-string must also be composed of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. Can someone also confirm that?
Thanks a lot in advance!
* Edit 1 *
Thanks a lot Tris! The link you provided is very helpful! However, from that, it seems like there is no specific format for the end of a base64 string. If I want to detect its end, what advice would you give other than )?
I mainly want to track the changes of a bunch of websites, and the base64 encodings contain a lot of data that is not relevant for my use. To save storage, I therefore intend to remove them. An example is www.amd.com, which has the following data:image/png;base64,... (after being rendered by the browser).
Since there are many different websites, I don't know all of their formats. Here are some other examples of base64 strings that I found and that are not useful to me:
data:font/truetype;base64,AAEAAA...
data:image/png;base64,iVBORw0KG...
Several of the examples that I saw ended with a closing bracket ). May I ask in which scenarios they end with ) and in which they don't?
Thanks again!

Not all base64-encoded strings will include a ;base64 at the beginning of them -- this is typically specific to data URLs. If you are specifically looking for base64-encoded images or other inline elements that would otherwise be referred to with an HTTP URL, this might be fine. The closing bracket is not typically relevant; I haven't seen it required on data URLs or other base64-encoded strings. Where you do see a ), it most likely closes a surrounding CSS url(...) and is not part of the base64 data.
Typically, base64-encoded strings use the alphabet you've mentioned -- ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. If the length of the input is not a multiple of 3 bytes, the output is padded with one or two = characters at the end, so that the encoded length is a multiple of 4 characters.
There is another commonly used base64 format on the web -- the URL-safe base64 format. In this encoding, + and / are replaced with - and _ so they can be included in URLs safely, hence the name.
This information may be irrelevant if you know more about the structure of the websites you are trying to parse, aside from just "they contain base64-encoded string data."
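
As a rough sketch of the removal the question is after, assuming the only base64 you care about sits in data URLs of the form data:<type>/<subtype>;base64,<payload>: the payload ends at the first character outside the base64 alphabet (a quote, a closing bracket, whitespace), so no explicit end marker is needed. The pattern name, the function name and the "data:," placeholder are arbitrary choices made here, and the pattern deliberately ignores data URLs without a media type and URL-safe base64.

```python
import re

# data:<type>/<subtype>[;param=value...];base64,<payload with optional padding>
DATA_URL_RE = re.compile(
    r"data:[\w.+-]+/[\w.+-]+(?:;[\w.+-]+=[\w.+-]+)*;base64,[A-Za-z0-9+/]+={0,2}"
)


def strip_base64_data_urls(html, placeholder="data:,"):
    """Replace every base64 data URL in html with a short placeholder."""
    return DATA_URL_RE.sub(placeholder, html)


# Made-up page fragment: one data URL inside CSS url(...), one in an attribute.
sample = (
    'div { background: url(data:image/png;base64,iVBORw0KGgo=) no-repeat; } '
    '<img src="data:font/truetype;base64,AAEAAA==">'
)
cleaned = strip_base64_data_urls(sample)
```

In cleaned, both payloads are gone while the closing bracket and the closing quote, which were never part of the base64 data, are left in place.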

Python decoding, base64, nbt, gzip? what is it?

I am trying to get information from a Minecraft API. From the API you can read players' inventories, but this is what it returns: here is link to pastebin
I tried to run base64 decoding on it in Python, but it gave me an output like this (only a few lines):
b'\xad\xa9\xc0d\x85\xe4\xe0\x87`\xcess\x00\x9b]e~c\xea\xaa\xb8\x9a\xa4\xdd\x958"\x8f\x0f\x10\xb9\xea\x9f2v\xdd\xcc#N\xe8x\xb4\xdd\x18\xa9\xee>\xcfM
I read a bit about it on their forums, and a few comments mentioned "base64, gzip, nbt".
Now, I haven't really worked with decoding and the like before, and I am trying to understand what it all means.
Thanks

NBT is a Minecraft-specific format: Named Binary Tag.
So you get an NBT file that is compressed in the gzip format and then Base64 encoded.
After Base64 decoding, you need to decompress the gzip data to get the NBT.
There's also an NBT parser in Python.
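
The two decoding steps can be sketched with the standard library alone. This assumes the API field really is gzip data wrapped in Base64, as described above; the function name and the tiny stand-in payload are invented for the example, and actually parsing the NBT would still need a separate parser.

```python
import base64
import gzip


def decode_inventory(b64_text):
    """Base64-decode the text, then gunzip it; returns the raw NBT bytes."""
    compressed = base64.b64decode(b64_text)
    # gzip streams start with the magic bytes 1f 8b; anything else means
    # the input is not (only) Base64-wrapped gzip.
    if compressed[:2] != b"\x1f\x8b":
        raise ValueError("decoded data does not look like gzip")
    return gzip.decompress(compressed)


# Stand-in for real API data: an empty, unnamed NBT compound tag
# (tag id 0x0a, two-byte name length 0, then the end tag 0x00).
fake_nbt = b"\x0a\x00\x00\x00"
api_field = base64.b64encode(gzip.compress(fake_nbt)).decode("ascii")

nbt_bytes = decode_inventory(api_field)
```

The resulting nbt_bytes is what you would hand to an NBT parser.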

GAE unicode character gets encoded to utf-8 bytes

In my app I'm accepting text from user inputs, where users often paste text from Microsoft Word.
A good example is the apostrophe ’, which for some reason gets converted to =E2=80=99 when posting to my handler in Google App Engine. I've tried a number of confused ways to prevent this, and I'm quite happy to simply remove these characters. Some of these methods work in plain Python but not in App Engine.
here's some of what I've tried:
problem_string = re.sub(r'[^\x00-\x7F]+', '', problem_string)  # trying to remove it
problem_string = problem_string.encode("utf-8")  # desperation...
problem_string = "".join((c if ord(c) < 128 else '' for c in problem_string))  # trying to just remove the thing
problem_string = unicode(problem_string, "utf8")  # probably fails since it's already unicode
... where I'm trying to capture the string including ’ and then later save it to the ndb datastore as a StringProperty(). Except for the last option, the apostrophe example gets converted to =E2=80=99.
If I could save the apostrophe type character and display it again that would be great, but simply removing it would also serve my needs.
*Edit - the following:
experience = re.sub(r'[^\x00-\x7F]+',' ', experience)
seems to work fine on the dev server, and successfully removes the offending apostrophe.
What may also be an issue is that the POST fields are going through the blobstore, i.e. blobstore_handlers.BlobstoreUploadHandler, which I think may be causing some problems.
I've really been banging my head against this, and I would really appreciate an explanation from some clever Stack Overflow user...

OK, I think I've vaguely stumbled upon a solution.
It had something to do with the blobstore upload handler; I guess it was encoding/decoding unicode to account for weird file characters. So I modified the handler so that the image file is uploaded via Google Cloud Storage instead of the blobstore, and it seems to work fine, i.e. the ’ gets to the datastore as ’ instead of =E2=80=99.
I won't accept my own answer for the next few days; maybe someone can clarify things better for future confused individuals.
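
For what it's worth, =E2=80=99 is what the three UTF-8 bytes of ’ (E2 80 99) look like in quoted-printable encoding, so if switching handlers is not an option, the text can in principle be decoded back. The sketch below is Python 3 and standard library only (the question's code is Python 2), the function name is made up, and whether the blobstore handler is what applies this encoding is an assumption based on the symptoms.

```python
import quopri


def decode_quoted_printable(text, charset="utf-8"):
    """Undo quoted-printable escaping, then decode the bytes as text."""
    raw = quopri.decodestring(text.encode("ascii"))
    return raw.decode(charset)


# The apostrophe from the question, embedded in a short word.
decoded = decode_quoted_printable("It=E2=80=99s")
```

After this, decoded holds the original text with the real ’ character and can be stored as unicode.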

jQuery file uploader - Django not working correctly with chunks

I've spent several days now trying to figure out how to tell Django that my jQuery file uploader is sending chunks and not x separate files.
I know that I need a custom FileUploadHandler like here in this one.
My client-side code is posted in this question.
The plugin sends each chunk as a separate AJAX call (at least that is how it looks in Firebug). The server accepts every one of them and saves them under a different name (in my case "_1", "_2", "_3"...). And yes, the handler is used; I verified it with print.
BTW: The Content-Range in the header is correct.
BTW II: This plugin unfortunately did not utilize chunking... so no solution here for me.
So, does anybody have an idea what I might be doing wrong? I found some other FileUploadHandlers, but they all seem pretty similar. So I guess the problem is not there?
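
One possible direction, sketched without Django so it stays self-contained: since every chunk arrives as its own request with a Content-Range header of the form "bytes start-end/total", the server can write each chunk at its offset into one target file instead of saving it under a new name. The function names here are hypothetical; in a Django view the header would normally be read from request.META["HTTP_CONTENT_RANGE"], and the target path would have to be derived from something stable such as the original file name.

```python
import os
import re
import tempfile

CONTENT_RANGE_RE = re.compile(r"^bytes (\d+)-(\d+)/(\d+)$")


def parse_content_range(header):
    """Return (start, end, total) from a 'bytes start-end/total' header."""
    match = CONTENT_RANGE_RE.match(header.strip())
    if match is None:
        raise ValueError("unsupported Content-Range: %r" % header)
    start, end, total = (int(group) for group in match.groups())
    return start, end, total


def write_chunk(path, header, data):
    """Write one chunk at its offset; return True once the last byte is written."""
    start, end, total = parse_content_range(header)
    if len(data) != end - start + 1:
        raise ValueError("chunk length does not match Content-Range")
    # Open for in-place update if the file exists, otherwise create it.
    mode = "r+b" if os.path.exists(path) else "wb"
    with open(path, mode) as target:
        target.seek(start)
        target.write(data)
    return end + 1 == total


# Simulate two chunk requests for an 11-byte upload.
upload_path = os.path.join(tempfile.mkdtemp(), "upload.bin")
first_done = write_chunk(upload_path, "bytes 0-4/11", b"hello")
second_done = write_chunk(upload_path, "bytes 5-10/11", b" world")
with open(upload_path, "rb") as result:
    assembled = result.read()
```

This only covers reassembly; it assumes chunks for one file are not written concurrently.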

How to handle unicode of an unknown encoding in Django?

I want to save some text to the database using the Django ORM wrappers. The problem is, this text is generated by scraping external websites and many times it seems they are listed with the wrong encoding. I would like to store the raw bytes so I can improve my encoding detection as time goes on without redoing the scrapes. But Django seems to want everything to be stored as unicode. Can I get around that somehow?

You can store the data encoded as base64, for example. Or try to analyze the HTTP headers of the response; it may be simpler to get the proper encoding from there.
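
A minimal sketch of the base64 idea, independent of Django: the raw bytes are turned into ASCII text that any text column can hold, and decoding can be retried later with whatever detection you have by then. The function names and the candidate encoding list are illustrative assumptions, not a recommendation of specific encodings.

```python
import base64


def to_storable(raw):
    """Encode raw bytes as ASCII text suitable for a text field."""
    return base64.b64encode(raw).decode("ascii")


def from_storable(text):
    """Recover the original raw bytes from the stored text."""
    return base64.b64decode(text)


def try_decode(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1"), "latin-1"


# A scraped page that is actually Latin-1/cp1252, not UTF-8.
scraped = "café".encode("latin-1")
stored = to_storable(scraped)
text, used_encoding = try_decode(from_storable(stored))
```

The trade-off is size: base64 text is about a third larger than the bytes it encodes.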

Create a file with the data. Use a Django models.FileField to hold a reference to the file.
No, it does not involve a ton of I/O. If your file is small, it adds 2 or 3 I/O operations (the directory read, the inode read and the data read).
