Python convert mixed ASCII code to String - python

I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help

It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
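In Python 3 terms, that decoding step looks like this (a minimal sketch; the byte values are just an illustration of Shift-JIS data):

```python
# Raw bytes from some external source, known to be Shift-JIS encoded.
raw = b"\x93\xfa\x96\x7b\x8c\xea"  # "日本語" in Shift-JIS

# Decoding turns bytes into text, given the correct encoding.
text = raw.decode("shift_jis")
print(text)  # 日本語

# Decoding the same bytes under the wrong encoding gives different text
# (or an error), which is why knowing the encoding matters.
print(raw.decode("latin-1"))
```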
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly. (For what it's worth, \x04\x08 is the header that Ruby's Marshal serialization format writes, so if the other application is a Ruby/Rails app, that is the likely culprit.)

If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]

Related

String change from Latin to ASCII

I have tried to change the format of strings from latin1 to ascii, and most of the strings were changed well except for some characters, æ ø. Æ, and Ø.
I have checked that the characters were changed correctly when using the R package (stringi::stri_trans_general(loc1, "latin-ascii")), but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')
It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
When you normalize a string, you convert the code points into some plain format that's easier to work with, and treats distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. The NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it will use two symbols, one that represents the plain letter, and one representing the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A and an accent character. You might think that you can then just encode this as ascii, let the accent characters be ignored, and get the result you want. However, it turns out that this is not enough, because of how the normalization works.
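That limitation is easy to demonstrate (a minimal sketch): é decomposes into a plain e plus a combining accent, but ø has no decomposition in Unicode, so NFKD leaves it alone and the ascii encoder drops it entirely.

```python
import unicodedata

def to_ascii(s):
    # NFKD splits accented letters into base letter + combining mark,
    # then 'ignore' drops anything ascii can't represent.
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

print(to_ascii("café"))        # cafe      -- é decomposes to e + combining acute
print(to_ascii("barentsøya"))  # barentsya -- ø has no decomposition, so it is lost
```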
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)
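A minimal sketch of that look-up-table approach, using str.maketrans and str.translate (the table below covers only the characters from the question plus å; a real one would need more entries):

```python
import unicodedata

# Hand-built table for the characters NFKD cannot decompose.
# Extend with more entries as your data requires.
table = str.maketrans({
    "æ": "ae", "Æ": "AE",
    "ø": "o",  "Ø": "O",
    "å": "a",  "Å": "A",
})

def latinize(s):
    # Apply the manual table first, then NFKD for ordinary accented letters.
    s = s.translate(table)
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

print(latinize("Ærø"))         # AEro
print(latinize("smørrebrød"))  # smorrebrod
```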

Is json.dumps and json.loads safe to run on a list of any string?

Is there any danger in losing information when JSON serialising/deserialising lists of text in Python?
Given a list of strings lst:
lst = ['str1', 'str2', 'str3', ...]
If I run
lst2 = json.loads(json.dumps(lst))
Will lst always be exactly the same as lst2 (i.e., will lst == lst2 always be True)? Or are there some special, unusual characters that would break either of these methods?
I'm curious because I'll be dealing with a lot of different and unusual characters from various Unicode ranges and I would like to be absolutely certain that this process is 100% robust.
Depends on what you mean by "exactly the same". We can identify three separate issues:
Semantic identity. What you read in is equivalent in meaning to what you write back out, as long as it's well-defined in the first place. Python (depending on version) might reorder dictionary keys, and will commonly prefer Unicode escapes over literal characters for some code points, and vice versa.
>>> json.loads(json.dumps("\u0050\U0001fea5\U0001f4a9"))
'P\U0001fea5💩'
Lexical identity. Nope. As shown above, the JSON representation of Unicode code points can get normalized in different ways, so that \u0050 gets turned into a literal P, and printable emoji may or may not similarly be turned into Unicode escapes, or vice versa.
(This is distinct from proper Unicode normalization, which makes sure that homoglyphs get turned into the same precise code point.)
Garbage in, same garbage out. Nope. If you have invalid input, Python will often tend to crash rather than pass it through, though you can modify some of this by catching errors and/or passing in flags to request less strict behavior.
>>> json.loads(r'"\u123"')
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Invalid \uXXXX escape: line 1 column 2 (char 1)
>>> print(json.loads(r'"\udcff"'))
?
>>> #!? should probably crash or moan instead!
You seem to be asking about the first case, but the third can bite your behind badly if you don't make sure you have decided what to do with invalid Unicode data.
The second case would make a difference if you care about the JSON on disk being equivalent between versions; it doesn't seem to matter to you, but future visitors of this question might well care.
To some degree, yes, it should be safe. Note however that JSON is not defined in terms of byte strings; it's defined in terms of Unicode text. That means that before you call json.loads, you need to decode the input from whatever text encoding you're using. This encoding/decoding step may introduce inconsistencies.
The other implicit question you might have is: will this process round-trip? The answer is that it usually will, but it depends on the encoding/decoding process. Depending on the processing steps, you may be normalising characters that are considered equivalent in Unicode but composed of different code points. For example, an accented character like å may be encoded as a composite using the letter a and a combining character for the ring, or it may be encoded as the single canonical code point for that character.
There's also the issue of JSON escape sequences, which look like "\u1234". Once decoded, Python doesn't preserve whether a character was originally written as a JSON escape or as a literal character, so you'll lose that information and the JSON text may not round-trip byte-for-byte.
Apart from those issues in the deep corners of Unicode nerdery regarding equivalent characters and normalisation, encoding and decoding from/to JSON itself is pretty safe.
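To make the normalisation point concrete: json itself round-trips code points exactly, so the composed and decomposed spellings of å both survive unchanged, but they remain unequal to each other unless you normalise them yourself (a minimal sketch):

```python
import json
import unicodedata

composed = "\u00e5"     # å as a single code point
decomposed = "a\u030a"  # a + combining ring above

# json preserves each spelling exactly as given ...
assert json.loads(json.dumps(composed)) == composed
assert json.loads(json.dumps(decomposed)) == decomposed

# ... but the two spellings are different strings unless normalised.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```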

What does the "\x5b\x4d\x6f etc.." mean in Python?

this is my first post on here so please excuse me if I have made any mistakes.
So, I was browsing around on the Metasploit page, and I found these strange types of codes. I tried searching it on google and on here, but couldn't find any other questions and answers like I had. I also noticed that Elliot used the method in "Mr. Robot" while programming in Python. I can see that the code is usually used in viruses, but I need to know why. This is the code that I found using this method:
buf += "\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d\x0d\x0a\x46\x69\x6c\x65\x4e\x61\x6d\x65\x30\x3d\x43\x3a\x5c"
It's a string, just as any other string like "Hello World!". However, it's written in a different way. In computers, each character corresponds to a number, called a code-point, according to an encoding. One such encoding that you might have heard of is ASCII, another is UTF-8. To give an example, in both encodings, the letter H corresponds to the number 72. In Python, one usually specifies a string using the matching letters, like "Hello World!". However, it is also possible to use the code-points. In python, this can be denoted with \xab, where ab is replaced with the hexadecimal form of the code-point. So H would become '\x48', because 48 is the hexadecimal notation for 72, the code-point for the letter H. In this notation, "Hello World!" becomes "\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21".
The string you specify consists of the hexadecimal code-point 5b (decimal 91, the code-point for the character [), followed by the code-point 4d (M), etc., leading to the full string [MoviePlay]\r\nFileName0=C:\\. Here \r and \n are special characters together representing a line-break, so one could also read it as:
[MoviePlay]
FileName0=C:\\
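You can check the equivalence in an interpreter (a minimal sketch):

```python
# "\x48" is just another way to write "H": same string, same characters.
assert "\x48\x65\x6c\x6c\x6f" == "Hello"

# ord() and hex() go from a character to its code point and back.
print(hex(ord("H")))  # 0x48

# The first part of the string from the question decodes to "[MoviePlay]".
print("\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d")
```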
In principle this notation is not necessarily found in viruses, but that kind of programming often requires very specific manipulation of numbers in memory without a lot of regard for the actual characters represented by those numbers, so that could explain why you'd see it arise there.
The code is a sequence of ASCII characters encoded as hexadecimal escapes.
It can be printed directly:
print('\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d\x0d\x0a\x46\x69\x6c\x65\x4e\x61\x6d\x65\x30\x3d\x43\x3a\x5c')
The result is:
[MoviePlay]
FileName0=C:\
They use Metasploit (msfvenom, to be more specific) to generate shellcode, with various encodings, for specially crafted exploit files such as documents (doc, ppt, xls, etc.).

Do UTF-8 characters cover all encodings of ISO8859-xx and windows-12xx?

I am trying to write a generic document indexer from a bunch of documents with different encodings in python. I would like to know if it is possible to read all of my documents (that are encoded with utf-8,ISO8859-xx and windows-12xx) with utf-8 without character loss?
The reading part is as follows:
fin = codecs.open(doc_name, "r", "utf-8")
doc_content=fin.read()
I'm going to rephrase your question slightly. I believe you are asking, "can I open a document and read it as if it were UTF-8, provided that it is actually intended to be ISO8859-xx or Windows-12xx, without loss?". This is what the Python code you've posted attempts to do.
The answer to that question is no. The Python code you posted will mangle the documents if they contain any characters above ordinal 127. This is because the "codepages" use the numbers from 128 to 255 to represent one character each, where UTF-8 uses that number range to proxy multibyte characters. So, each character in your document which is not in ASCII will be either interpreted as an invalid string or will be combined with the succeeding byte(s) to form a single UTF-8 codepoint, if you incorrectly parse the file as UTF-8.
As a concrete example, say your document is in Windows-1252. It contains the byte sequence 0xC3 0xAE, i.e. "Ã®" (A-tilde followed by the registered-trademark sign). In UTF-8, that same byte sequence represents one character, "î" (small 'i' with circumflex). In Windows-874, that same sequence would be the two Thai characters "รฎ". These are rather different strings - a moral insult could become an invitation to play chess, or vice versa. Meaning is lost.
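You can watch this happen in Python 3 (a minimal sketch; the byte pair is the one from the example above):

```python
# The same two bytes decoded under three different encodings.
data = bytes([0xC3, 0xAE])

print(data.decode("utf-8"))   # one character: i with circumflex
print(data.decode("cp1252"))  # two characters: A-tilde, registered sign
print(data.decode("cp874"))   # two Thai characters
```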
Now, for a slightly different question - "can I losslessly convert my files from their current encoding to UTF-8?" or, "can I represent all the data from the current files as a UTF-8 bytestream?". The answer to these questions is (modulo a few fuzzy bits) yes. Unicode is designed to have a codepoint for every character in any previously existing codepage, and by and large it has succeeded in this goal. There are a few rough edges, but you will likely be well-served by using Unicode as your common interchange format (and UTF-8 is a good choice for a representation thereof).
However, to effect the conversion, you must already know and state the format in which the files exist as they are being read. Otherwise Python will incorrectly deal with non-ASCII characters and you will badly damage your text (irreparably, in fact, if you discard either the invalid-in-UTF8 sequences or the origin of a particular wrongly-converted byte range).
In the event that the text is all, 100% ASCII, you can open it as UTF-8 without a problem, as the first 128 code points are shared between the two representations.
UTF-8 covers everything in Unicode. I don't know for sure whether ISO-8859-xx and Windows-12xx are entirely covered by Unicode, but I strongly suspect they are.
I believe there are some encodings which include characters which aren't in Unicode, but I would be fairly surprised if you came across those characters. Covering the whole of Unicode is "good enough" for almost everything - that's the purpose of Unicode, after all. It's meant to cover everything we could possibly need (which is why it's grown :)
EDIT: As noted, you have to know the encoding of the file yourself, and state it - you can't just expect files to magically be read correctly. But once you do know the encoding, you could convert everything to UTF-8.
You'll need to have some way of determining which character set the document uses. You can't just open each one as "utf-8" and expect it to get magically converted. Open it with the proper character set, then convert.
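A minimal sketch of that open-with-the-right-encoding-then-convert step (the file names and the source encoding are placeholders; you must supply the real encoding for each document):

```python
def to_utf8(src_path, dst_path, src_encoding):
    # Read with the document's actual encoding ...
    with open(src_path, encoding=src_encoding) as src:
        text = src.read()
    # ... and write the same text back out as UTF-8.
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)

# e.g. to_utf8("doc.txt", "doc.utf8.txt", "cp1252")
```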
The best way to be sure would be to convert a large set of documents, then convert them back and do a comparison.

What is the fool proof way to convert some string (utf-8 or else) to a simple ASCII string in python

Inside my Python script, I get some string back from a function which I didn't write. Its encoding varies. I need to convert it to ASCII format. Is there some fool-proof way of doing this? I don't mind replacing the non-ASCII chars with blanks or something else...
If you want an ASCII string that unambiguously represents what you have got, without losing any information, the answer is simple:
Don't muck about with encode/decode, use the repr() function (Python 2.X) or the ascii() function (Python 3.x).
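For example, in Python 3 (a minimal sketch):

```python
s = "Hello\x04wörld"

# ascii() escapes every non-ASCII and non-printable character,
# so the result is pure ASCII and loses no information.
print(ascii(s))  # 'Hello\x04w\xf6rld'
```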
You say "the encoding of it varies". I guess that by "it" you mean a Python 2.x "string", which is really a sequence of bytes.
Answer part one: if you do not know the encoding of that encoded string, then no, there is no way at all to do anything meaningful with it*. If you do know the encoding, then step one is to convert your str into a unicode:
encoded_string = i_have_no_control()
the_encoding = 'utf-8' # for the sake of example
text = unicode(encoded_string, the_encoding)
Then you can re-encode your unicode object as ASCII, if you like.
ascii_garbage = text.encode('ascii', 'replace')
* There are heuristic methods for guessing encodings, but they are slow and unreliable. (The chardet library is one well-known attempt in Python.)
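In Python 3, where str is already Unicode and bytes are a separate type, the same two steps look like this (a minimal sketch; 'utf-8' stands in for whatever the real encoding is):

```python
encoded_bytes = "déjà vu".encode("utf-8")  # stand-in for the function's output

# Step one: bytes -> str, using the known encoding.
text = encoded_bytes.decode("utf-8")

# Step two: str -> ASCII bytes, replacing what ASCII can't represent.
ascii_garbage = text.encode("ascii", "replace")
print(ascii_garbage)  # b'd?j? vu'
```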
I'd try to normalize the string then encode it. What about :
import unicodedata
s = u"éèêàùçÇ"
print unicodedata.normalize('NFKD',s).encode('ascii','ignore')
This works only if you have unicode as input. Therefore, you must know what kind of encoding the function outputs and decode it first. If you don't, there are encoding-detection heuristics, but on short strings they are not reliable.
Of course, you could get lucky: the function's output may use various unknown encodings that all have ASCII as a base, so they assign the same values to the bytes from 0 to 127 (like UTF-8 does).
In that case, you can just get rid of the unwanted chars by filtering on membership in string.printable:
import string
printable = set(string.printable)  # the ASCII chars
print("".join(ch for ch in s if ch in printable))
Or if you want blanks instead :
print("".join(((char if char in string.printable else " ") for char in s )))
"translate" can help you to do the same.
The only way to know if you are this lucky is to try it out... Sometimes, a big fat lucky day is what any dev needs :-)
What's meant by "foolproof" is that the function does not fail with even the most obscure, impossible input -- meaning, you could feed the function random binary data and IT WOULD NEVER FAIL, NO MATTER WHAT. That's what "foolproof" means.
The function should then proceed do its best to convert to the destination encoding. If it has to throw away all the trash it does not understand, then that is perfectly fine and is in fact the most desirable result. Why try to salvage all the junk? Just discard the junk. Tell the user he's not merely a moron for using Microsoft anything, but a non-standard moron for using non-standard Microsoft anything...or for attempting to send in binary data!
I have just precisely this same need (though my need is in PHP), and I also have users who are at least as moronic as I am, sometimes moreso; however, they are definitely nicer and no doubt more patient.
The best, bottom-line thing I've found so far is (in PHP 5.3):
$fixed_string = iconv('ISO-8859-1', 'UTF-8//TRANSLIT//IGNORE', $in_string);
This attempts to translate whatever it can and simply throws away all the junk, resulting in a legal UTF-8 string output. I've also not been able to break it or cause it to fail or reject any incoming text or data, even by feeding it gobs of binary junk data.
Finding the iconv() and getting it to work is easy; what's so maddening and wasteful is reading through all the total garbage and bend-over-backwards idiocy that so many programmers seem to espouse when dealing with this encoding fiasco. What's become of the enviable (and respectable) "Flail and Burn The Idiots" mentality of old school programming? Let's get back to basics. Use iconv() and throw away their garbage, and don't be bashful when telling them you threw away their garbage -- in short, don't fail to flail the morons who feed you garbage. And you can tell them I told you so.
If all you want to do is preserve ASCII-compatible characters and throw away the rest, then in most encodings that boils down to removing all characters that have the high bit set -- i.e., characters with value over 127. This works because nearly all character sets are extensions of 7-bit ASCII.
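A minimal sketch of that high-bit filter:

```python
def strip_high_bit(s):
    # Keep only code points 0-127, i.e. the 7-bit ASCII range.
    return "".join(ch for ch in s if ord(ch) < 128)

print(strip_high_bit("1ä2äö3öü4ü"))  # 1234
```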
If it's a normal string (i.e., not unicode), you need to decode it in an arbitrary character set (such as iso-8859-1 because it accepts any byte values) and then encode in ascii, using the ignore or replace option for errors:
>>> orig = '1ä2äö3öü4ü'
>>> orig.decode('iso-8859-1').encode('ascii', 'ignore')
'1234'
>>> orig.decode('iso-8859-1').encode('ascii', 'replace')
'1??2????3????4??'
The decode step is necessary because you need a unicode string in order to use encode. If you already have a Unicode string, it's simpler:
>>> orig = u'1ä2äö3öü4ü'
>>> orig.encode('ascii', 'ignore')
'1234'
>>> orig.encode('ascii', 'replace')
'1??2????3????4??'
