Encode binary data so that \n is escaped

I'm trying to work out a way to encode/decode binary data in such a way that the new line character is not part of the encoded string.
It seems to be a recursive problem, but I can't seem to work out a solution.
e.g. A naive implementation:
>>> original = 'binary\ndata'
>>> encoded = original.replace('\n', '=n')
'binary=ndata'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata'
What happens if there is already a =n in the original string?
>>> original = 'binary\ndata=n'
>>> encoded = original.replace('\n', '=n')
'binary=ndata=n'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata\n' # wrong
Try to escape existing =n's, but then what happens if there is already an escaped =n?
>>> original = '++nbinary\ndata=n'
>>> encoded = original.replace('=n', '++n').replace('\n', '=n')
'++nbinary=ndata++n'
How can I get around this recursive problem?

Solution
original = 'binary\ndata \\n'
# encoded = original.encode('string_escape') # escape many chr
encoded = original.replace('\\', '\\\\').replace('\n', '\\n') # escape \n and \\
decoded = encoded.decode('string_escape')
verified
>>> print encoded
binary\ndata \\n
>>> print decoded
binary
data \n
The solution is from How do I un-escape a backslash-escaped string in python?
Edit: I wrote it also with your ad-hoc economic encoding. The original "string_escape" codec escapes backslash, apostrophe and everything below chr(32) and above chr(126). Decoding is the same for both.

The way to encode strings that might contain the "escape" character is to escape the escape character as well. In python, the escape character is a backslash, but you could use anything you want. Your cost is one character for every occurrence of newline or the escape.
To avoid confusing you, I'll use forward slash:
# original
>>> print "slashes / and /newline/\nhere"
slashes / and /newline/
here
# encoding
>>> print "slashes / and /newline/\nhere".replace("/", "//").replace("\n", "/n")
slashes // and //newline///nhere
This encoding is unambiguous, since all real slashes are doubled; but it must be decoded in a single pass, so you can't just use two successive calls to replace():
# decoding
>>> import re
>>> def decode(c):
...     # Expand this into a real mapping if you have more substitutions
...     return '\n' if c == '/n' else c[0]
...
>>> print "".join(decode(c) for c in re.findall(r"(/.|.)",
...     "slashes // and //newline///nhere"))
slashes / and /newline/
here
Note that there is an actual /n in the input (and another slash before the newline): it all works correctly anyway.
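For readers on Python 3, the same idea can be sketched with re.sub doing the single decoding pass. This is only an illustration of the approach above, not the answerer's code: the names ESCAPE, encode and decode are made up for the example, and the forward slash is kept as the escape character purely for readability.

```python
import re

ESCAPE = "/"  # hypothetical choice of escape character, as in the answer above


def encode(text):
    # Double every real escape character first, then replace newlines.
    return text.replace(ESCAPE, ESCAPE * 2).replace("\n", ESCAPE + "n")


def decode(text):
    # One left-to-right pass: each escape character and the character
    # after it are consumed together, so "//n" can never be misread.
    def expand(match):
        follower = match.group(1)
        return "\n" if follower == "n" else follower

    return re.sub(re.escape(ESCAPE) + "(.)", expand, text, flags=re.DOTALL)


sample = "slashes / and /newline/\nhere"
encoded = encode(sample)
print(encoded)                    # slashes // and //newline///nhere
print(decode(encoded) == sample)  # True
```

Because the regex always consumes the escape character together with its follower, a literal "/n" in the input (encoded as "//n") decodes back to "/n" rather than to a newline.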

If you encoded the entire string systematically, would you not end up escaping it? Say for every character you do chr(ord(char) + 1) or something trivial like that?

I don't have a great deal of experience with binary data, so this may be completely off/inefficient/both, but would this get around your issue?
In [40]: original = 'binary\ndata\nmorestuff'
In [41]: nlines = [index for index, i in enumerate(original) if i == '\n']
In [42]: encoded = original.replace('\n', '')
In [43]: encoded
Out[43]: 'binarydatamorestuff'
In [44]: decoded = list(encoded)
In [45]: map(lambda x: decoded.insert(x, '\n'), nlines)
Out[45]: [None, None]
In [46]: decoded = ''.join(decoded)
In [47]: decoded
Out[47]: 'binary\ndata\nmorestuff'
Again, I am sure there is a much better/more accurate way - this is just from a novice perspective.

If you are encoding an alphabet of n symbols (e.g. ASCII) into a smaller set of m symbols (e.g. ASCII except newline) you must allow the encoded string to be longer than the original string.
The typical way of doing this is to define one character as an "escape" character; the character following the "escape" represents an encoded character. This technique has been used since the 1940s in teletypewriters; that's where the "Esc" key you see on your keyboard came from.
Python (and other languages) already provide this in strings with the backslash character. Newlines are encoded as '\n' (or '\r\n'). Backslashes escape themselves, so the literal string '\r\n' would be encoded '\\r\\n'.
Note that the encoded length of a string that includes only the escaped character will be double that of the original string. If that is not acceptable you will have to use an encoding that uses a larger alphabet to avoid the escape characters (which may be longer than the original string) or compress it (which may also be longer than the original string).
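To make the size cost concrete, here is a small Python 3 sketch on bytes, assuming a backslash as the escape byte. The helper names escape_newlines and unescape_newlines are invented for this example; the point is only to show the worst-case doubling and that every byte value survives a round trip.

```python
import re


def escape_newlines(data):
    # Escape the escape byte first, then the newline.
    return data.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")


def unescape_newlines(data):
    # Single pass: a backslash and the byte after it are handled as a pair.
    def expand(match):
        follower = match.group(1)
        return b"\n" if follower == b"n" else follower

    return re.sub(rb"\\(.)", expand, data, flags=re.DOTALL)


worst_case = b"\n" * 8
print(len(worst_case), len(escape_newlines(worst_case)))  # 8 16

every_byte = bytes(range(256))
escaped = escape_newlines(every_byte)
print(len(escaped))                               # 258
print(unescape_newlines(escaped) == every_byte)   # True
```

Only the newline and the backslash grow, so data that rarely contains either byte pays almost nothing, while data made up entirely of them doubles in size.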

How about:
In [8]: import urllib
In [9]: original = 'binary\ndata'
In [10]: encoded = urllib.quote(original)
In [11]: encoded
Out[11]: 'binary%0Adata'
In [12]: urllib.unquote(encoded)
Out[12]: 'binary\ndata'
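In Python 3 these functions moved to urllib.parse, and there are bytes-oriented variants that suit binary data. A minimal sketch, assuming the payload is a bytes object (the sample value is made up):

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

original = b"binary\ndata\x00\xff%"

# Percent-encode every byte outside the small "safe" ASCII set.
encoded = quote_from_bytes(original)
print(encoded)  # binary%0Adata%00%FF%25

# The result is plain ASCII text with no newline in it.
print("\n" in encoded)  # False

decoded = unquote_to_bytes(encoded)
print(decoded == original)  # True
```

Note that percent-encoding costs three characters for every escaped byte, so it is considerably more expensive than a dedicated newline escape when the data is mostly non-ASCII.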

The escapeless encodings are specifically designed to trim off certain characters from binary data. In your case of removing just the \n character, the overhead will be less than 0.4%.

Related

Add a non escaped escape character to python bytearray

I have an API that demands that the quotation marks in my XML attributes are escaped, so <cmd id="1"> will not work; it requires <cmd id=\"1\">.
I have tried iterating through my string, for example:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'
Each time that I encounter a " (ascii 34) I will replace it with an escape character (ascii 92) and another quote. Infuriatingly this results in:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id=\\"1\\">SetChLevel</cmd><name>C</name><value>30</value></tx>'
where the escapes have been escaped. As a sanity check I replaced 92 with any other character and it works as expected.
temp = b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">\
SetChLevel</cmd><name>C</name><value>30</value></tx>'
i = 0
j = 0
payload = bytearray(len(temp) + 4)
for char in temp:
    if char == 34:
        payload[i] = 92
        i += 1
        payload[i] = 34
        i += 1
        j += 1
    else:
        payload[i] = temp[j]
        i += 1
        j += 1
print(bytes(payload))
I would assume that character 92 would appear once but something is escaping the escape!

Your problem is the result of a very common misunderstanding for programmers new to Python.
When printing a string (or bytes) to the console, Python escapes the escape character (\) to show a string that, when used in Python as a literal, would give you the exact same value.
So:
s = 'abc\\abc'
print(s)
Prints abc\abc, but on the interpreter you get:
>>> s = 'abc\\abc'
>>> print(s)
abc\abc
>>> s
'abc\\abc'
Note that this is correct. After all print(s) should show the string on the console as it is, while s on the interpreter is asking Python to show you the representation of s, which includes the quotes and the escape characters.
Compare:
>>> repr(s)
"'abc\\\\abc'"
repr here prints the representation of the representation of s.
For bytes, things are further complicated because the representation is printed when using print, since print prints a string and a bytes needs to be decoded first, i.e.:
>>> print(some_bytes.decode('utf-8')) # or whatever the encoding is
In short: your code was doing what you wanted it to, it does not duplicate escape characters, you only thought it did because you were looking at the representation of the bytes, not the actual bytes content.
By the way, this also means that you don't have to be paranoid and go through the trouble of writing custom code to replace characters based on their ASCII values, you can simply:
>>> example = bytes('<some attr="value">test</some>', encoding='utf-8')
>>> result = example.replace(b'"', b"\\\"")
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
I won't pretend that b"\\\"" is intuitive; perhaps b'\\"' is better. Both require that you understand the difference between the representation of a string and its printed value.
So, finally:
>>> example = b'<some attr="value">test</some>'
>>> result = example.replace(b'"', b'\\"')
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
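One way to convince yourself that only a single backslash is stored is to inspect the byte values directly instead of the printed representation. A short Python 3 sketch (the sample tag is taken from the question, the variable name payload is just for illustration):

```python
payload = b'<cmd id="1">'.replace(b'"', b'\\"')

print(payload)                # b'<cmd id=\\"1\\">'  (the repr doubles the backslash)
print(payload.decode())       # <cmd id=\"1\">      (the actual content)

# The raw byte values show one backslash (92) before each quote (34).
print(list(payload[8:10]))    # [92, 34]
print(payload.count(b"\\"))   # 2
print(len(payload))           # 14, i.e. the 12 original bytes plus 2
```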

Removing escape characters from a string

How can I remove the escape chars in Python 2.7 and Python 3?
Example:
a = "\u00E7a\u00E7a\u00E7a=http\://\u00E1\u00E9\u00ED\u00F3\u00FA\u00E7/()\=)(){[]}"
decoded = a.decode('unicode_escape')
print decoded
Result:
çaçaça=http\://áéíóúç/()\=)(){[]}
Expected result
çaçaça=http://áéíóúç/()=)(){[]}
EDIT: In order to avoid unnecessary downvotes: using .replace isn't our primary focus, since this problem was raised by a legacy solution from other teams (a db table of reference data which contains Portuguese chars and regular expressions).

You're looking for a simple str.replace
>>> print decoded.replace('\\', '')
çaçaça=http://áéíóúç/()=)(){[]}
The remaining \ is actually a literal backslash, not an escape sequence.
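If the data may also contain an intentionally escaped backslash (two backslashes in a row), a blanket replace would delete that one too. A hedged alternative, sketched here for Python 3, is to drop only the backslash of each backslash-plus-character pair in a single pass. The helper name strip_escapes is made up for this example.

```python
import re


def strip_escapes(text):
    # Replace each backslash-plus-character pair with just the character,
    # so "\:" becomes ":" and a doubled backslash becomes a single one.
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


decoded = "çaçaça=http\\://áéíóúç/()\\=)(){[]}"
print(strip_escapes(decoded))  # çaçaça=http://áéíóúç/()=)(){[]}

# Difference from a plain replace when a real backslash was escaped:
tricky = "a\\\\b"                  # the characters: a \ \ b
print(tricky.replace("\\", ""))    # ab
print(strip_escapes(tricky))       # a\b
```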

You can simply remove the unnecessary escape characters from your string, i.e.
>>> a = "\u00E7a\u00E7a\u00E7a=http://\u00E1\u00E9\u00ED\u00F3\u00FA\u00E7/()=)(){[]}"
>>> decoded = a.decode('unicode_escape')
>>> print decoded
çaçaça=http://áéíóúç/()=)(){[]}

Dealing with doubly escaped unicode string

I have a database of badly formatted strings. The data looks like this:
"street"=>"\"\\u4e2d\\u534e\\u8def\""
when it should be like this:
"street"=>"中华路"
The problem I have is that when these doubly escaped strings come from the database, they are not decoded to the Chinese characters as they should be. So suppose I have this variable: street="\"\\u4e2d\\u534e\\u8def\"". If I print it with print(street), the result is a string of code points: "\u4e2d\u534e\u8def"
What can I do at this point to convert "\u4e2d\u534e\u8def" to actual Unicode characters?

First encode this string as utf8 and then decode it with unicode-escape which will handle the \\ for you:
>>> line = "\"\\u4e2d\\u534e\\u8def\""
>>> line.encode('utf8').decode('unicode-escape')
'"中华路"'
You can then strip the " if necessary

You could remove the quotation marks with strip and split at every '\\u'. This would give you the characters as strings representing hex numbers. Then for each string you could convert it to int and back to string with chr:
>>> street = "\"\\u4e2d\\u534e\\u8def\""
>>> ''.join(chr(int(x, 16)) for x in street.strip('"').split('\\u') if x)
'中华路'

Based on what you wrote, the database appears to be storing an eval-uable ASCII representation of a string with non-ASCII chars.
>>> eval("\"\\u4e2d\\u534e\\u8def\"")
'中华路'
Python has a built-in function for this.
>>> ascii('中华路')
"'\\u4e2d\\u534e\\u8def'"
The only difference is the use of \" instead of ' for the needed internal quote.
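Since eval executes arbitrary code, it is risky on data that comes from a database. Two standard-library alternatives that only parse literals are sketched below, assuming (as in the question) that the stored value is a double-quoted string containing \uXXXX escapes, which happens to be both a valid JSON string and a valid Python string literal.

```python
import ast
import json

street = "\"\\u4e2d\\u534e\\u8def\""
print(street)                    # "\u4e2d\u534e\u8def"

# Parse the value as a JSON string literal.
print(json.loads(street))        # 中华路

# Or parse it as a Python literal without executing any code.
print(ast.literal_eval(street))  # 中华路

# Caveat for the encode/decode approach: text that already contains
# non-ASCII characters does not survive a utf8 -> unicode-escape trip.
mangled = "é".encode("utf8").decode("unicode-escape")
print(mangled == "é")            # False
```

Both parsers also remove the surrounding quotation marks, so no separate strip is needed.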

Hexadecimal file is loading with 2 back slashes before each byte instead of one

I have a hex file in this format: \xda\xd8\xb8\x7d
When I load the file with Python, it loads with two back slashes instead of one.
with open('shellcode.txt', 'r') as file:
shellcode = file.read().replace('\n', '')
Like this: \\xda\\xd8\\xb8\\x7d
I've tried using hex.replace("\\", "\"), but I'm getting an error
EOL while scanning string literal
What is the proper way to replace \\ with \?

Here is an example
>>> h = "\\x123"
>>> h
'\\x123'
>>> print h
\x123
>>>
The two backslashes are needed because \ is an escape character, and so it needs to be escaped. When you print h, it shows what you want

Backslash (\) is an escape character. It is used for changing the meaning of the character(s) following it.
For example, if you want to create a string which contains a quote, you have to escape it:
s = "abc\"def"
print s # prints: abc"def
If there was no backslash, the first quote would be interpreted as the end of the string.
Now, if you really wanted that backslash in the string, you would have to escape the backslash using another backslash:
s = "abc\\def"
print s # prints: abc\def
However, if you look at the representation of the string, it will be shown with the escape characters:
print repr(s) # prints: 'abc\\def'
Therefore, this line should include escapes for each backslash:
hex.replace("\\", "\") # wrong
hex.replace("\\\\", "\\") # correct
But that is not the solution to the problem!
There is no way that file.read().replace('\n', '') introduced additional backslashes. What probably happened is that OP printed the representation of the string with backslashes (\) which ended up printing escaped backslashes (\\).

You can make a bytes object with a utf-8 encoding, and then decode as unicode-escape.
>>> x = "\\x61\\x62\\x63"
>>> y = bytes(x, "utf-8").decode("unicode-escape")
>>> print(x)
\x61\x62\x63
>>> print(y)
abc
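If the goal is to get the raw bytes that the text file describes, rather than a string of characters, one option in Python 3 is bytes.fromhex. The sketch below uses a string in place of the file contents, and the variable names are illustrative only.

```python
# Stand-in for the text read from the file: 16 characters, 4 of them backslashes.
text = r"\xda\xd8\xb8\x7d"
print(len(text))   # 16
print(text)        # \xda\xd8\xb8\x7d  (one backslash each; repr would show two)

# Drop the "\x" markers and parse the remaining hex digits.
raw = bytes.fromhex(text.replace("\\x", ""))
print(len(raw))    # 4

# Equivalent route through the unicode_escape codec, then back to bytes.
raw_via_codec = text.encode("ascii").decode("unicode_escape").encode("latin-1")
print(raw == raw_via_codec)  # True
```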

efficiently replace bad characters

I often work with utf-8 text containing characters like:
\xc2\x99
\xc2\x95
\xc2\x85
etc
These characters confuse other libraries I work with so need to be replaced.
What is an efficient way to do this, rather than:
text.replace('\xc2\x99', ' ').replace('\xc2\x85', '...')

There is always regular expressions; just list all of the offending characters inside square brackets like so:
import re
print re.sub(r'[\xc2\x99]'," ","Hello\xc2There\x99")
This prints: 'Hello There ', with the unwanted characters replaced by spaces.
Alternately, if you have a different replacement character for each:
# remove annoying characters
chars = {
    '\xc2\x82' : ',',        # High code comma
    '\xc2\x84' : ',,',       # High code double comma
    '\xc2\x85' : '...',      # Triple dot
    '\xc2\x88' : '^',        # High caret
    '\xc2\x91' : '\x27',     # Forward single quote
    '\xc2\x92' : '\x27',     # Reverse single quote
    '\xc2\x93' : '\x22',     # Forward double quote
    '\xc2\x94' : '\x22',     # Reverse double quote
    '\xc2\x95' : ' ',
    '\xc2\x96' : '-',        # High hyphen
    '\xc2\x97' : '--',       # Double hyphen
    '\xc2\x99' : ' ',
    '\xc2\xa0' : ' ',
    '\xc2\xa6' : '|',        # Split vertical bar
    '\xc2\xab' : '<<',       # Double less than
    '\xc2\xbb' : '>>',       # Double greater than
    '\xc2\xbc' : '1/4',      # one quarter
    '\xc2\xbd' : '1/2',      # one half
    '\xc2\xbe' : '3/4',      # three quarters
    '\xca\xbf' : '\x27',     # c-single quote
    '\xcc\xa8' : '',         # modifier - under curve
    '\xcc\xb1' : ''          # modifier - under line
}
def replace_chars(match):
    char = match.group(0)
    return chars[char]
# text is the string to clean; a bare return is only valid inside a function
cleaned = re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)
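In Python 3, where text is a Unicode str, the same dictionary-driven replacement can be sketched with str.translate, which avoids building a regex. The table below is a shortened, illustrative subset keyed by code point (for example, the UTF-8 bytes \xc2\x85 are the character U+0085); the names replacements and clean are made up for this example.

```python
# Map single characters to their replacement strings.
replacements = {
    "\u0085": "...",   # was '\xc2\x85' in UTF-8
    "\u0095": " ",
    "\u0099": " ",
    "\u00a0": " ",     # no-break space
    "\u00ab": "<<",
    "\u00bb": ">>",
    "\u00bd": "1/2",
}

# str.maketrans accepts a dict whose keys are single characters.
table = str.maketrans(replacements)


def clean(text):
    # translate() walks the string once and applies every mapping.
    return text.translate(table)


print(clean("a\u0085b\u00bdc\u00a0d"))  # a...b1/2c d
```

translate only handles single-character keys, so multi-character sequences would still need the regex approach.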

I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.
\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?
Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)
>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'
If you want to do this correction for all the characters in the range 0x80–0x99 that are valid Windows-1252 characters, you can use this approach:
def restore_windows_1252_characters(s):
    """Replace C1 control characters in the Unicode string s by the
    characters at the corresponding code points in Windows-1252,
    where possible.
    """
    import re
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, s)
For example:
>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'

If you want to remove all non-ASCII characters from a string, you can use
text.encode("ascii", "ignore")

import unicodedata
# Convert to unicode
text_to_unicode = unicode(text, "utf-8")
# Convert back to ascii
text_fixed = unicodedata.normalize('NFKD',text_to_unicode).encode('ascii','ignore')
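The snippet above is Python 2. A rough Python 3 equivalent is sketched below, assuming text is already a str; the function name to_ascii is invented for the example. NFKD splits accented letters into a base letter plus combining marks, and the ASCII encode then drops whatever is left over.

```python
import unicodedata


def to_ascii(text):
    # Decompose characters, e.g. "é" becomes "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is not ASCII, then return a str again.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("café naïve"))  # cafe naive
print(to_ascii("5€"))          # 5  (no ASCII decomposition, so the symbol is lost)
```

As the second call shows, characters without an ASCII decomposition simply disappear, so this is lossy.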

These are not "Unicode characters"; this looks more like a UTF-8 encoded string (although your prefix should be \xC3, not \xC2, for most chars). In 95% of cases you should not just throw them away, unless you are communicating with a COBOL backend. The world is not limited to 26 characters, you know.
There is a concise article explaining Unicode strings (what Python 2 calls a unicode object and Python 3 simply a string) here: http://www.joelonsoftware.com/articles/Unicode.html - please, for your own sake, do read it. Even if you never plan to have anything that is not English in any of your applications, you will still stumble on symbols like € or º that won't fit in 7-bit ASCII. That article will help you.
That said, maybe the libraries you are using do accept Unicode Python objects, and you can transform your UTF-8 Python 2 strings into unicode by doing:
var_unicode = var.decode("utf-8")
If you really need 100% pure ASCII, then after decoding the string to unicode, re-encode it to ASCII with the "replace" error handler, which substitutes a ? for every character that doesn't fit in the charset (use "ignore" to drop such characters instead):
var_ascii = var_unicode.encode("ascii", "replace")

These characters are not in the ASCII character set, which is why you are getting the errors.
To avoid these errors, you can do the following while reading the file.
import codecs
f = codecs.open('file.txt', 'r',encoding='utf-8')
To know more about these kind of errors, go through this link.
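In Python 3 the built-in open accepts an encoding argument directly, so codecs.open is not needed there. A small self-contained sketch follows; the temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")

    # Write some non-ASCII text as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write("caf\u00e9 \u2022 item\n")

    # Read it back, stating the encoding explicitly instead of
    # relying on the platform default.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(text == "café • item\n")  # True
```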
