Python \0 in a string followed by a number behaves inconsistently

I can enter an octal value of 'up to 3 characters' in a string.
Is there any way to enter an octal value of only 1 character?
For instance, if I want to print \0 followed by "Hello", I can do:
"\0Hello"
but if I want to print \0 followed by "12345" I can't do
"\012345"
instead I have to do
"\00012345"
This can, in very obscure scenarios, lead to inconsistent behaviour.
def parseAsString(characters):
    # The backslash and the '0' are separate characters until the string is decoded
    output = ['H', 'I', '!', '\\', '0'] + characters
    print("".join(output).encode().decode('unicode_escape'))

parseAsString(['Y','O','U'])
#Output:
#>HI! YOU
parseAsString(['1','2','3'])
#Output:
#>HI!
#>3

The answer, when you're dealing with \0, is to do one of the following.
Always remember to write it explicitly as \000 or \x00. This may not be possible if your raw text is coming from another source.
When dealing with raw strings AND concatenating them, always decode each constituent part first, then concatenate them last, not the other way around.
For instance the parser will do this for you if you concatenate strings together:
"\0" + "Hello"
and
"\0" + "12345"
Both work consistently as expected, because "\0" is converted to "\x00" before being concatenated with the rest of the string.
Or, in the more obscure scenario:
def safeParseAsString(characters):
    # Decode the prefix on its own so that "\0" cannot absorb the digits that follow
    output = "".join(['H', 'I', '!', '\\', '0']).encode().decode('unicode_escape')
    output += "".join(characters).encode().decode('unicode_escape')
    print(output)

safeParseAsString(['Y','O','U'])
#Output:
#>HI! YOU
safeParseAsString(['1','2','3'])
#Output:
#>HI! 123
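
The escape parsing described above can be checked directly. This is a minimal Python 3 sketch; the variable names and the decode_parts helper are made up for illustration, and the unicode_escape round trip is only assumed to be safe for ASCII text.

```python
# An octal escape consumes up to three octal digits, so digits that follow
# "\0" are swallowed when they are in the range 0-7.
greedy = "\012345"      # parsed as "\012" (newline) followed by "345"
padded = "\00012345"    # "\000" (NUL) followed by "12345"
hexed = "\x0012345"     # "\x" takes exactly two hex digits: NUL, then "12345"

print(repr(greedy))     # '\n345'
print(repr(padded))     # '\x0012345'
print(padded == hexed)  # True


def decode_parts(*parts):
    """Decode each raw fragment on its own, then concatenate the results."""
    return "".join(p.encode().decode("unicode_escape") for p in parts)


# r"HI!\0" is raw text in which the backslash is still a literal character.
print(repr(decode_parts(r"HI!\0", "123")))   # 'HI!\x00123'
```

Because each fragment is decoded before joining, the digits in the second fragment never become part of the octal escape in the first.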

Related

How i can replace in a list of strings "\\" with "\" in Python

In my code I have a list of locations, and printing the list gives output like this:
['D:\\Todo\\VLC\\Daft_Punk\\One_more_time.mp4"', ...
I want to replace "\\" with "\".
(listcancion is a list with all the strings.)
I tried to replace it with this code:
remplacement = [listcancion.replace('\\', '\') for listcancion in listcancion]
or this:
remplacement = [listcancion.replace('\\\\', '\\') for listcancion in listcancion]
or also this:
remplacement = [listcancion.replace('\\', 'X') for listcancion in listcancion]
listrandom = [remplacement.replace('X', '\') for remplacement in remplacement]
I need to change only the \ character. I can't do things like ("\\Todo", "\Todo") because I have more characters to replace.
If it can be solved without imports, that would be great.

It is just a matter of string representations.
First, you have to differentiate between a string's "real" content and its representation.
A string's "real" content might be letters, digits, punctuation and so on, which makes displaying it quite easy. But imagine a string which contains a, a line break and a b. If you print that string, you get the output
a
b
which is what you expect.
But in order to make it more compact, this string's representation is a\nb: the line break is represented as \n, the \ serving as an escape character. Compare the output of print(a) (which is the same as print(str(a))) and of print(repr(a)).
Now, in order not to confuse this with a string which contains a, \, n and b, a "real" backslash in a string has a representation of \\ and the same string, which prints as a\nb, has a representation of a\\nb in order to distinguish that from the first example.
If you print a list of anything, it is displayed as a comma-separated list of the representations of its elements, even if they are strings.
If you do
for element in listcancion:
    print(element)
you'll see that the string actually contains only one \ where its representation shows \\.
(Oh, and BTW, I am not sure that things like [listcancion.<something> for listcancion in listcancion] work as intended; better use another name for the loop variable, such as [element.<something> for element in listcancion].)
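
The difference between content and representation can be seen in a few lines. This is a small sketch with made-up sample paths, not the asker's actual data.

```python
# Hypothetical sample data; each string contains single backslashes.
listcancion = ['D:\\Todo\\VLC', 'D:\\Todo\\Music']

print(listcancion)          # list display uses repr(): backslashes look doubled
for element in listcancion:
    print(element)          # print() uses str(): one backslash per separator

first = listcancion[0]
print(len('\\'))            # 1
print(first.count('\\'))    # 2
```

Nothing needs to be replaced: the doubled backslashes exist only in the list's printed representation.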

Add a non escaped escape character to python bytearray

I have an API that is demanding that the quotation marks in my XML attributes are escaped, so <cmd_id="1"> will not work, it requires <cmd_id=\"1\">.
I have tried iterating through my string, for example:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'
Each time that I encounter a " (ascii 34) I will replace it with an escape character (ascii 92) and another quote. Infuriatingly this results in:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id=\\"1\\">SetChLevel</cmd><name>C</name><value>30</value></tx>'
where the escapes have been escaped. As a sanity check I replaced 92 with any other character and it works as expected.
temp = b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">\
SetChLevel</cmd><name>C</name><value>30</value></tx>'
i = 0
j = 0
payload = bytearray(len(temp) + 4)
for char in temp:
    if char == 34:
        payload[i] = 92
        i += 1
        payload[i] = 34
        i += 1
        j += 1
    else:
        payload[i] = temp[j]
        i += 1
        j += 1
print(bytes(payload))
I would assume that character 92 would appear once but something is escaping the escape!

Your problem is the result of a very common misunderstanding for programmers new to Python.
When printing a string (or bytes) to the console, Python escapes the escape character (\) to show a string that, when used in Python as a literal, would give you the exact same value.
So:
s = 'abc\\abc'
print(s)
Prints abc\abc, but on the interpreter you get:
>>> s = 'abc\\abc'
>>> print(s)
abc\abc
>>> s
'abc\\abc'
Note that this is correct. After all print(s) should show the string on the console as it is, while s on the interpreter is asking Python to show you the representation of s, which includes the quotes and the escape characters.
Compare:
>>> repr(s)
"'abc\\\\abc'"
repr here prints the representation of the representation of s.
For bytes, things are further complicated because print shows the representation: print prints strings, so a bytes object needs to be decoded first if you want to see its content, i.e.:
>>> print(some_bytes.decode('utf-8')) # or whatever the encoding is
In short: your code was doing what you wanted it to, it does not duplicate escape characters, you only thought it did because you were looking at the representation of the bytes, not the actual bytes content.
By the way, this also means that you don't have to be paranoid and go through the trouble of writing custom code to replace characters based on their ASCII values, you can simply:
>>> example = bytes('<some attr="value">test</some>', encoding='utf-8')
>>> result = example.replace(b'"', b"\\\"")
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
I won't pretend that b"\\\"" is intuitive, perhaps b'\\"' is better - but both require that you understand the difference between the representation of a string and its printed value.
So, finally:
>>> example = b'<some attr="value">test</some>'
>>> result = example.replace(b'"', b'\\"')
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
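
One way to convince yourself that only a single backslash was added per quote is to count bytes rather than look at the printed representation. A minimal sketch, reusing the example value from above:

```python
example = b'<some attr="value">test</some>'
result = example.replace(b'"', b'\\"')

print(result)                       # repr of bytes: backslashes appear doubled
print(result.decode('utf-8'))       # actual content: one backslash per quote
print(result.count(b'\\'))          # 2
print(len(result) - len(example))   # 2
```

The length grows by exactly two bytes, one for each of the two quotation marks.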

How to remove set of characters when a string comprise of "\" and Special characters in python

a = "\Virtual Disks\DG2_ASM04\ACTIVE"
From the above string I would like to get the part "DG2_ASM04" alone. I cannot split or strip as it has the special characters "\", "\D" and "\A" in it.
Have tried the below and can't get the desired output.
a.lstrip("\Virtual Disks\\").rstrip("\ACTIVE")
the output I have got is: 'G2_ASM04' instead of "DG2_ASM04"

Simply split on the backslash (written as "\\" in a string literal) and take the second-to-last element:
>>> a.split("\\")[-2]
'DG2_ASM04'
In your case the D is removed as well because lstrip and rstrip treat their argument as a set of characters to remove, not as a prefix or suffix. D is in that set (it appears in "Disks"), so it is stripped too. If you change the string, you can see what is happening:
>>> a = "\Virtual Disks\XG2_ASM04\ACTIVE"
>>> a.lstrip('\\Virtual Disks\\').rstrip("\\ACTIVE")
'XG2_ASM04'
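
The two behaviours can be compared side by side. This sketch uses a raw string for the asker's value so that every backslash is literal.

```python
a = r"\Virtual Disks\DG2_ASM04\ACTIVE"   # raw string: backslashes are literal

# The strip family treats its argument as a set of characters, so the
# leading "D" (also present in "Disks") is removed along with the prefix.
stripped = a.lstrip("\\Virtual Disks").rstrip("\\ACTIVE")
print(stripped)             # G2_ASM04

# Splitting on the backslash keeps each path component intact.
parts = a.split("\\")
print(parts)                # ['', 'Virtual Disks', 'DG2_ASM04', 'ACTIVE']
print(parts[-2])            # DG2_ASM04
```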

Encode binary data so that \n is escaped

I'm trying to work out a way to encode/decode binary data in such a way that the new line character is not part of the encoded string.
It seems to be a recursive problem, but I can't seem to work out a solution.
e.g. A naive implementation:
>>> original = 'binary\ndata'
>>> encoded = original.replace('\n', '=n')
'binary=ndata'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata'
What happens if there is already a =n in the original string?
>>> original = 'binary\ndata=n'
>>> encoded = original.replace('\n', '=n')
'binary=ndata=n'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata\n' # wrong
Try to escape existing =n's, but then what happens if there is already an escaped =n?
>>> original = '++nbinary\ndata=n'
>>> encoded = original.replace('=n', '++n').replace('\n', '=n')
'++nbinary=ndata++n'
How can I get around this recursive problem?

Solution
original = 'binary\ndata \\n'
# encoded = original.encode('string_escape') # escape many chr
encoded = original.replace('\\', '\\\\').replace('\n', '\\n') # escape \n and \\
decoded = encoded.decode('string_escape')
verified
>>> print encoded
binary\ndata \\n
>>> print decoded
binary
data \n
The solution is from How do I un-escape a backslash-escaped string in python?
Edit: I wrote it also with your ad-hoc economic encoding. The original "string_escape" codec escapes backslash, apostrophe and everything below chr(32) and above chr(126). Decoding is the same for both.
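
The 'string_escape' codec exists only in Python 2. For readers on Python 3, the following is a rough equivalent, assuming the text is ASCII-only (the 'unicode_escape' codec is not a safe round trip for arbitrary non-ASCII text).

```python
original = 'binary\ndata \\n'

# Escape the escape character first, then the newline.
encoded = original.replace('\\', '\\\\').replace('\n', '\\n')

# For ASCII-only text, 'unicode_escape' undoes the same backslash escapes.
decoded = encoded.encode('ascii').decode('unicode_escape')

print(encoded)              # binary\ndata \\n
print('\n' in encoded)      # False
print(decoded == original)  # True
```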

The way to encode strings that might contain the "escape" character is to escape the escape character as well. In python, the escape character is a backslash, but you could use anything you want. Your cost is one character for every occurrence of newline or the escape.
To avoid confusing you, I'll use forward slash:
# original
>>> print "slashes / and /newline/\nhere"
slashes / and /newline/
here
# encoding
>>> print "slashes / and /newline/\nhere".replace("/", "//").replace("\n", "/n")
slashes // and //newline///nhere
This encoding is unambiguous, since all real slashes are doubled; but it must be decoded in a single pass, so you can't just use two successive calls to replace():
# decoding
>>> import re
>>> def decode(c):
...     # Expand this into a real mapping if you have more substitutions
...     return '\n' if c == '/n' else c[0]
...
>>> print "".join(decode(c) for c in re.findall(r"(/.|.)",
...     "slashes // and //newline///nhere"))
slashes / and /newline/
here
Note that there is an actual /n in the input (and another slash before the newline): it all works correctly anyway.
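
The same scheme can be written for Python 3 with re.sub doing the single decoding pass. This is only a sketch: the function names are invented here, and the forward slash is kept as the escape character purely for readability.

```python
import re


def encode(text):
    # Double every real slash first, then replace newlines with "/n".
    return text.replace("/", "//").replace("\n", "/n")


def decode(text):
    # One left-to-right pass: each match is a complete escape pair, so a
    # doubled slash can never be misread as the start of "/n".
    return re.sub(r"/(.)",
                  lambda m: "\n" if m.group(1) == "n" else m.group(1),
                  text)


sample = "slashes / and /newline/\nhere"
encoded = encode(sample)
print(encoded)                      # slashes // and //newline///nhere
print("\n" in encoded)              # False
print(decode(encoded) == sample)    # True
```

Two successive replace() calls would fail on an input such as "//n", which is the encoding of a literal "/n" and must not turn into a newline.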

If you encoded the entire string systematically, would you not end up escaping it? Say for every character you do chr(ord(char) + 1) or something trivial like that?

I don't have a great deal of experience with binary data, so this may be completely off/inefficient/both, but would this get around your issue?
In [40]: original = 'binary\ndata\nmorestuff'
In [41]: nlines = [index for index, i in enumerate(original) if i == '\n']
In [42]: encoded = original.replace('\n', '')
In [43]: encoded
Out[43]: 'binarydatamorestuff'
In [44]: decoded = list(encoded)
In [45]: map(lambda x: decoded.insert(x, '\n'), nlines)
Out[45]: [None, None]
In [46]: decoded = ''.join(decoded)
In [47]: decoded
Out[47]: 'binary\ndata\nmorestuff'
Again, I am sure there is a much better/more accurate way - this is just from a novice perspective.

If you are encoding an alphabet of n symbols (e.g. ASCII) into a smaller set of m symbols (e.g. ASCII except newline) you must allow the encoded string to be longer than the original string.
The typical way of doing this is to define one character as an "escape" character; the character following the "escape" represents an encoded character. This technique has been used since the 1940s in teletypewriters; that's where the "Esc" key you see on your keyboard came from.
Python (and other languages) already provide this in strings with the backslash character. Newlines are encoded as '\n' (or '\r\n'). Backslashes escape themselves, so the literal string '\r\n' would be encoded '\\r\\n'.
Note that the encoded length of a string that includes only the escaped character will be double that of the original string. If that is not acceptable you will have to use an encoding that uses a larger alphabet to avoid the escape characters (which may be longer than the original string) or compress it (which may also be longer than the original string).
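
The size cost is easy to measure. The sketch below uses a hypothetical escape_newlines helper with backslash as the escape character and compares a typical string with the worst case.

```python
def escape_newlines(text):
    """Backslash-escape newlines, escaping the backslash itself first."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


typical = "line one\nline two"
worst = "\n" * 10           # consists only of the escaped character

print(len(typical), len(escape_newlines(typical)))  # 17 18
print(len(worst), len(escape_newlines(worst)))      # 10 20
```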

How about:
In [8]: import urllib
In [9]: original = 'binary\ndata'
In [10]: encoded = urllib.quote(original)
In [11]: encoded
Out[11]: 'binary%0Adata'
In [12]: urllib.unquote(encoded)
Out[12]: 'binary\ndata'
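
In Python 3 these functions live in urllib.parse. A short sketch of the same round trip, with a percent sign added to the sample to show that the escape character itself is also escaped:

```python
from urllib.parse import quote, unquote

original = 'binary\ndata 100%'
encoded = quote(original)

print(encoded)                       # binary%0Adata%20100%25
print(unquote(encoded) == original)  # True
```

For bytes rather than str, urllib.parse also provides quote_from_bytes and unquote_to_bytes.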

The escapeless encodings are specifically designed to trim off certain characters from binary data. In your case of removing just the \n character, the overhead will be less than 0.4%.

Output first 100 characters in a string

Can't seem to find a substring function in python.
Say I want to output the first 100 characters in a string, how can I do this?
I want to do it safely also, meaning if the string is 50 characters it shouldn't fail.

print my_string[0:100]

From python tutorial:
Degenerate slice indices are handled
gracefully: an index that is too large
is replaced by the string size, an
upper bound smaller than the lower
bound returns an empty string.
So it is safe to use x[:100].
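
A quick check of the quoted behaviour in Python 3 syntax, using throwaway sample strings:

```python
short = "x" * 50
long_text = "y" * 250

print(len(short[:100]))       # 50, no exception for the short string
print(len(long_text[:100]))   # 100
print(repr(short[80:60]))     # '' because the upper bound is below the lower bound
```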

Easy:
print mystring[:100]

To answer Philipp's concern ( in the comments ), slicing works ok for unicode strings too
>>> greek=u"αβγδεζηθικλμνξοπρςστυφχψω"
>>> print len(greek)
25
>>> print greek[:10]
αβγδεζηθικ
If you want to run the above code as a script, put this line at the top
# -*- coding: utf-8 -*-
If your editor doesn't save in utf-8, substitute the correct encoding

Slicing of arrays is done with [first:last+1].
One trick I tend to use a lot of is to indicate extra information with ellipses. So, if your field is one hundred characters, I would use:
if len(s) <= 100:
    print s
else:
    print "%s..."%(s[:97])
And yes, I know () is superfluous in this case for the % formatting operator, it's just my style.
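
The same idea can be wrapped in a small function. This is a Python 3 sketch; the name truncate and its parameters are invented for illustration, and it assumes the limit is at least as long as the marker.

```python
def truncate(text, limit=100, marker="..."):
    """Return text cut to at most `limit` characters, ending in `marker` if cut."""
    if len(text) <= limit:
        return text
    # Leave room for the marker so the total length stays within the limit.
    return text[:limit - len(marker)] + marker


print(truncate("short string"))        # short string
print(len(truncate("a" * 150)))        # 100
print(truncate("a" * 150)[-5:])        # aa...
```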

String formatting using % is a great way to handle this. Here are some examples.
The formatting code '%s' converts '12345' to a string, but it's already a string.
>>> '%s' % '12345'
'12345'
'%.3s' specifies to use only the first three characters.
>>> '%.3s' % '12345'
'123'
'%.7s' says to use the first seven characters, but there are only five. No problem.
>>> '%.7s' % '12345'
'12345'
'%7s' pads to a minimum width of seven characters, filling the missing characters with spaces on the left.
>>> '%7s' % '12345'
'  12345'
'%-7s' is the same thing, except the padding goes on the right.
>>> '%-7s' % '12345'
'12345  '
'%5.3s' says use the first three characters, but pad with spaces on the left to a total of five characters.
>>> '%5.3s' % '12345'
'  123'
Same thing except padding on the right.
>>> '%-5.3s' % '12345'
'123  '
Can handle multiple arguments too!
>>> 'do u no %-4.3sda%3.2s wae' % ('12345', 6789)
'do u no 123 da 67 wae'
If you require even more flexibility, str.format() is available too. Here is documentation for both.
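
For comparison, here is how the same truncation and padding look with str.format() and f-strings (Python 3.6 or later is assumed for the f-string line). repr() is used only to make the padding visible.

```python
text = '12345'

print(repr('{:.3}'.format(text)))     # '123'
print(repr('{:>7}'.format(text)))     # '  12345'
print(repr('{:<7}'.format(text)))     # '12345  '
print(repr('{:>5.3}'.format(text)))   # '  123'
print(repr(f'{text:<5.3}'))           # '123  '
```

Note that strings are left-aligned by default in this mini-language, so right alignment needs an explicit ">".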

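Slicing does not raise an exception when the string is shorter than the requested length; only indexing a single position past the end (for example 'yourstring'[100]) raises an IndexError.
Another approach is to use
'yourstring'.ljust(100)[:100].strip()
This will give you the first 100 chars.
Note that strip() also removes leading and trailing spaces that belong to the string itself, so you might get a shorter string if your string starts or ends with spaces.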

[start:stop:step]
So If you want to take only 100 first character, use your_string[0:100] or your_string[:100]
If you want to take only the character at even position, use your_string[::2]
The default value for start is 0, for stop it is the length of the string, and for step it is 1. So when you leave one of them out and just put ':', its default value is used.
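
A few slices on a short sample string show the defaults in action:

```python
s = "abcdefghij"

print(s[0:4])    # abcd
print(s[:4])     # abcd  (start defaults to 0)
print(s[::2])    # acegi (characters at even positions)
print(s[1::2])   # bdfhj (characters at odd positions)
print(s[:] == s[0:len(s):1])  # True
```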
