What does the "\x5b\x4d\x6f etc.." mean in Python? - python

This is my first post on here, so please excuse me if I have made any mistakes.
So, I was browsing around on the Metasploit page, and I found these strange types of codes. I tried searching on Google and on here, but couldn't find any other questions and answers like mine. I also noticed that Elliot used the method in "Mr. Robot" while programming in Python. I can see that the code is usually used in viruses, but I need to know why. This is the code that I found using this method:
buf += "\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d\x0d\x0a\x46\x69\x6c\x65\x4e\x61\x6d\x65\x30\x3d\x43\x3a\x5c"

It's a string, just like any other string such as "Hello World!". However, it's written in a different way. In computers, each character corresponds to a number, called a code-point, according to an encoding. One such encoding that you might have heard of is ASCII; another is UTF-8. To give an example, in both encodings the letter H corresponds to the number 72. In Python, one usually specifies a string using the matching letters, like "Hello World!". However, it is also possible to use the code-points. In Python, this can be denoted with \xab, where ab is replaced with the hexadecimal form of the code-point. So H would become '\x48', because 48 is the hexadecimal notation for 72, the code-point for the letter H. In this notation, "Hello World!" becomes "\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21".
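To see the equivalence for yourself, here is a minimal demonstration using only the two literals above:
plain = "Hello World!"
escaped = "\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"
print(plain == escaped)  # True: both literals produce the same string
print(hex(ord("H")))     # 0x48: the code-point of H in hexadecimal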
The string you specify consists of the hexadecimal code-point 5b (decimal 91, the code-point for the character [), followed by the code-point 4d (M), etc., leading to the full string [MoviePlay]\r\nFileName0=C:\\. Here \r and \n are special characters together representing a line-break, so one could also read it as:
[MoviePlay]
FileName0=C:\
In principle this notation is not necessarily found in viruses, but that kind of programming often requires very specific manipulation of numbers in memory without a lot of regard for the actual characters represented by those numbers, so that could explain why you'd see it arise there.

The code is a sequence of ASCII characters encoded in hex.
It can be printed directly.
print('\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d\x0d\x0a\x46\x69\x6c\x65\x4e\x61\x6d\x65\x30\x3d\x43\x3a\x5c')
The result is:
[MoviePlay]
FileName0=C:\

They use Metasploit, msfvenom to be more specific, to generate shellcode for specially crafted exploit files such as documents (doc, ppt, xls, etc.) with different encodings.

Related

Can I access or find Unicode values of control characters?

Is there a way to access or look up control characters in Python, like NUL, DEL, CR, LF, and BEL, in their form as single Unicode characters, so that they can be passed to the ord() built-in function to get a numeric value?
You can find their Unicode representation on the ASCII page on Wikipedia.
Key    Unicode    Unicode-Hex
NUL    ␀          \u2400
DEL    ␡          \u2421
CR     ␍          \u240d
LF     ␊          \u240a
BEL    ␇          \u2407
To "access" or "search" to the control characters, unicode database module provides access to a set of data of characters through different methods by each type or name representation. I was a bit confused as to its representation in the text data type format and how you could use it ord function; lookup returns the name of the corresponding character thus solving the unknown about the subject to "search" control characters by gap unicode characters due to little experience or knowledge of the library and ASCII standards.
It is also worth noting that the question is vaguely phrased and does not provide a reproducible example. These characters have several different representations - as raw control characters, as printable "control picture" symbols, and as names - so some clarification was needed; but that does not mean the question could not be answered, and the explanation above addresses the doubts presented in this post.

String change from Latin to ASCII

I have tried to change the format of strings from latin1 to ascii, and most of the strings were changed well, except for some characters: æ, ø, Æ, and Ø.
I have checked that the characters were changed correctly when using an R package (stringi::stri_trans_general(loc1, "latin-ascii")), but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')
It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
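To make that distinction concrete (Python 3):
s = 'Æble'               # a str: a sequence of code points, with no "format"
b = s.encode('latin1')   # bytes: b'\xc6ble' - a raw byte sequence, not text
print(type(s), type(b))  # <class 'str'> <class 'bytes'>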
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
When you normalize a string, you convert the code points into some plain format that's easier to work with, and treats distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it will use two symbols, one that represents the plain letter, and one representing the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A and an accent character. You might think that you can then just encode this as ascii, let the accent characters be ignored, and get the result you want. However, it turns out that this is not enough, because letters like ø and æ are not accented letters as far as Unicode is concerned: they have no decomposition at all, so NFKD leaves them untouched and the ascii encoding drops them entirely.
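A short demonstration of both behaviours, using only the standard library:
import unicodedata

# 'å' decomposes under NFKD into 'a' plus a combining ring, so only the
# ring gets dropped by an ascii encode with errors='ignore':
print(unicodedata.normalize('NFKD', 'å').encode('ascii', 'ignore'))  # b'a'

# 'ø' and 'æ' have no decomposition, so the whole letter is dropped:
print(unicodedata.normalize('NFKD', 'søren æble').encode('ascii', 'ignore'))  # b'sren ble'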
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)
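A minimal sketch of the look-up-table approach; the mappings chosen here (æ -> "ae", ø -> "o", and so on) are one common transliteration convention, not the only possible one:
# str.maketrans accepts a dict mapping characters to replacement strings;
# str.translate then applies it to the whole string in one pass.
table = str.maketrans({'æ': 'ae', 'ø': 'o', 'Æ': 'AE', 'Ø': 'O'})
print('Søren spiser æbler'.translate(table))  # 'Soren spiser aebler'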

Is json.dumps and json.loads safe to run on a list of any string?

Is there any danger of losing information when serialising/deserialising lists of text to and from JSON in Python?
Given a list of strings lst:
lst = ['str1', 'str2', 'str3', ...]
If I run
lst2 = json.loads(json.dumps(lst))
Will lst always be exactly the same as lst2 (i.e. will lst == lst2 always evaluate to True)? Or are there some special, unusual characters that would break either of these methods?
I'm curious because I'll be dealing with a lot of different and unusual characters from various Unicode ranges and I would like to be absolutely certain that this process is 100% robust.
Depends on what you mean by "exactly the same". We can identify three separate issues:
Semantic identity. What you read in is equivalent in meaning to what you write back out, as long as it's well-defined in the first place. Python (depending on version) might reorder dictionary keys, and will commonly prefer Unicode escapes over literal characters for some code points, and vice versa.
>>> json.loads(json.dumps("\u0050\U0001fea5\U0001f4a9"))
'P\U0001fea5💩'
Lexical identity. Nope. As shown above, the JSON representation of Unicode code points can get normalized in different ways, so that \u0050 gets turned into a literal P, and printable emoji may or may not similarly be turned into Unicode escapes, or vice versa.
(This is distinct from proper Unicode normalization, which makes sure that homoglyphs get turned into the same precise code point.)
Garbage in, same garbage out. Nope. If you have invalid input, Python will often tend to crash rather than pass it through, though you can modify some of this by catching errors and/or passing in flags to request less strict behavior.
>>> json.loads(r'"\u123"')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape
>>> print(json.loads(r'"\udcff"'))
?
>>> #!? should probably crash or moan instead!
You seem to be asking about the first case, but the third can bite your behind badly if you don't make sure you have decided what to do with invalid Unicode data.
The second case would make a difference if you care about the JSON on disk being equivalent between versions; it doesn't seem to matter to you, but future visitors of this question might well care.
To some degree, yes, it should be safe. Note however that JSON is not defined in terms of byte strings, but rather in terms of Unicode text. That means that before you pass data to json.loads, you need to decode it from whatever text encoding you're using. This encoding/decoding step may introduce inconsistencies.
The other implicit question you might have may be: will this process round-trip? The answer is that it usually will, but it depends on the encoding/decoding process. Depending on the processing step, you may be normalising different characters that are considered equivalent in Unicode but composed using different code points. For example, an accented character like å may be encoded as a composite using the letter a and a combining character for the ring, or it may be encoded as the single canonical code point for that character.
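To illustrate: json itself round-trips either form unchanged and never normalises, so any mismatch has to come from steps around it. A quick sketch:
import json, unicodedata

composed = '\u00e5'     # å as a single code point (NFC)
decomposed = 'a\u030a'  # å as 'a' plus COMBINING RING ABOVE (NFD)

# JSON round-trips each form exactly as given:
assert json.loads(json.dumps(composed)) == composed
assert json.loads(json.dumps(decomposed)) == decomposed

# ...but the two forms are different strings, even though they render alike:
print(composed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True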
There's also the issue of JSON escape sequences, which look like "\u1234". Once decoded, Python doesn't preserve whether a character was originally written as a JSON escape or as a literal Unicode character, so you'll lose that information, and the text may not round-trip byte-for-byte in that case.
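For example (the default ensure_ascii=True makes json.dumps escape everything non-ASCII on the way back out):
import json

# Both spellings decode to the same Python string:
print(json.loads(r'"\u00e5"') == json.loads('"å"'))  # True

# The escaping choice on re-serialisation is json.dumps's, not the input's:
print(json.dumps('å'))                      # "\u00e5"
print(json.dumps('å', ensure_ascii=False))  # "å"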
Apart from those issues in the deep corners of Unicode nerdery regarding equivalent characters and normalisation, encoding and decoding from/to JSON itself is pretty safe.

Beautiful Soup and Unicode Problems

I'm using BeautifulSoup to parse some web pages.
Occasionally I run into a "unicode hell" error like the following:
Looking at the source of this article on TheAtlantic.com [ http://www.theatlantic.com/education/archive/2013/10/why-are-hundreds-of-harvard-students-studying-ancient-chinese-philosophy/280356/ ]
We see this in the og:description meta property:
<meta property="og:description" content="The professor who teaches&nbsp;Classical Chinese Ethical and Political Theory claims, &quot;This course will change your life.&quot;" />
When BeautifulSoup parses it, I see this:
>>> print repr(description)
u'The professor who teaches\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'
If I try encoding it to UTF-8, like this SO comment suggests: https://stackoverflow.com/a/10996267/442650
>>> print repr(description.encode('utf8'))
'The professor who teaches\xc2\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'
Just when I thought I had all my unicode issues under control, I still don't quite understand what's going on, so I'm going to lay out a few questions:
1- why would BeautifulSoup convert the &nbsp; to \xa0 [a latin charset space character]? The charset and headers on this page are UTF-8, and I thought BeautifulSoup pulls that data for the encoding? Why wasn't it replaced with a <space>?
2- Is there a common way to normalize whitespaces for conversion ?
3- When I encoded to UTF-8, where did \xa0 become the byte sequence \xc2\xa0?
I can pipe everything through unicodedata.normalize('NFKD', string) to help get me where I want to be -- but I'd love to understand what's wrong and avoid problems like this in the future.
You aren't encountering a problem. Everything is behaving as intended.
&nbsp; indicates a non-breaking space character. This isn't replaced with a space because it doesn't represent a space; it represents a non-breaking space. Replacing it with a space would lose information: namely that where that space occurs, a text rendering engine shouldn't put a line break.
The Unicode code point for non-breaking space is U+00A0, which is written in a Unicode string in Python as \xa0.
The UTF-8 encoding of U+00A0 is, in hexadecimal, the two-byte sequence C2 A0, or written in a Python string representation, \xc2\xa0. In UTF-8, anything beyond the 7-bit ASCII set needs two or more bytes to represent it. In this case, the highest bit set is the eighth bit. That means that it can be represented by the two-byte sequence (in binary) 110xxxxx 10xxxxxx, where the x's are the bits of the binary representation of the code point. In the case of A0, that is 10100000, or when encoded in UTF-8, 11000010 10100000 or C2 A0.
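You can verify the arithmetic directly (Python 3 shown here, although the question above uses Python 2):
nbsp = '\xa0'                # U+00A0 NO-BREAK SPACE
print(f"{ord(nbsp):08b}")    # 10100000: the code point in binary
print(nbsp.encode('utf-8'))  # b'\xc2\xa0': its two-byte UTF-8 encoding
print(' '.join(f"{b:08b}" for b in nbsp.encode('utf-8')))
# 11000010 10100000: the 110xxxxx 10xxxxxx pattern with the bits filled in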
Many people use &nbsp; in HTML to get spaces which aren't collapsed by the usual HTML whitespace collapsing rules (in HTML, all runs of consecutive spaces, tabs, and newlines get interpreted as a single space unless one of the CSS white-space rules is applied), but that's not really what it is intended for; it is supposed to be used for things like names, such as "Mr. Miyagi", where you don't want there to be a line break between the "Mr." and "Miyagi". I'm not sure why it was used in this particular case; it seems out of place here, but that's more of a problem with your source, not the code that interprets it.
Now, if you don't really care about layout so you don't mind whether or not text layout algorithms choose that as a place to wrap, but would like to interpret this merely as a regular space, normalizing using NFKD is a perfectly reasonable answer (or NFKC if you prefer pre-composed accents to decomposed accents). The NFKC and NFKD normalizations map characters such that most characters that represent essentially the same semantic value in most contexts are expanded out. For instance, ligatures are expanded out (ﬃ -> ffi), archaic long s characters are converted into s (ſ -> s), Roman numeral characters are expanded into their individual letters (Ⅳ -> IV), and non-breaking space converted into a normal space. For some characters, NFKC or NFKD normalization may lose information that is important in some contexts: ℌ and ℍ will both normalize to H, but in mathematical texts can be used to refer to different things.
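A short demonstration of that mapping on the string from the question (Python 3):
import unicodedata

s = 'The professor who teaches\xa0Classical Chinese'
print('\xa0' in unicodedata.normalize('NFKD', s))  # False: the NBSP became U+0020
print(unicodedata.normalize('NFKD', '\ufb03'))     # 'ffi': the ligature expands
print(unicodedata.normalize('NFKD', '\u2163'))     # 'IV': Roman numeral four expands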

Python convert mixed ASCII code to String

I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols, but does not say how to represent those numbers as bytes. (That is the purpose of an encoding.)
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
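To make the "zillion mappings" point concrete, the same byte can decode to completely different text depending on the encoding you assume (Python 3 syntax; the question above is Python 2):
data = b'\xe9'
print(data.decode('latin-1'))  # 'é' - Latin-1 maps byte 0xE9 to e-acute
print(data.decode('cp1251'))   # 'й' - the same byte in a Cyrillic encoding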
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n' prefix, and it's always the same (you haven't put your question very clearly, so I'm not certain that's what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
string = string[len(to_trim):]
