How do I use unicode_escape correctly? - python

I need to add an additional \ to escape some characters. For instance, BC \ BS needs to become BC \\ BS. The following line solves the issue:
txt.encode('unicode_escape').replace("'", "\\'")
However, it messes up other characters. For example, ^@? becomes \x00?. In such a situation, I would need to remove \x00 as a subsequent step, but other characters might show up.
What is the most Pythonic way to add the escape character to a set of characters such as \, \t, \n, etc. without causing other characters to break? I have tried using translate but ran into issues because the length of \\t is not equal to that of \t.

Are you attempting to escape certain characters in text strings, raw text strings, or byte strings? What is the full set of characters you need to escape? If it's text, then there is no need to encode it first. Some things you can try:
import re
test = "Please escape this: ' and this \n"
# Put a literal backslash in front of each character that needs escaping.
for char in ["'", "\n"]:
    test = test.replace(char, f"\\{char}")
complicated_test = "This \tstring \n is \\ full of ' things to \t escape"
# re.escape puts a backslash in front of regex-special characters.
re.escape(complicated_test)
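As for the translate problem mentioned in the question: in Python 3, str.translate accepts a mapping whose values may be strings of any length, so the replacement does not have to be the same size as the character it replaces. The following is only a sketch under that assumption; the names ESCAPES and escape_text and the exact set of characters in the mapping are made up for illustration and should be adjusted to whatever set you actually need.

```python
# Hypothetical mapping: each key is one character, each value is its escaped form.
ESCAPES = str.maketrans({
    "\\": "\\\\",  # backslash -> two backslashes
    "\t": "\\t",   # real tab  -> backslash followed by t
    "\n": "\\n",   # newline   -> backslash followed by n
    "'": "\\'",    # quote     -> backslash followed by quote
})


def escape_text(text: str) -> str:
    """Escape the characters listed in ESCAPES in a single pass."""
    return text.translate(ESCAPES)


print(escape_text("BC \\ BS"))  # prints BC \\ BS
```

Because translate visits each input character exactly once, backslashes that were just inserted are not escaped a second time, and characters that are not in the mapping (such as a NUL character) pass through unchanged.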

Why does the following string work without an additional escape?

In the following:
>>> r'\d+','\d+', '\\d+'
('\\d+', '\\d+', '\\d+')
Why does the backslash in '\d+' not need to be escaped? Why does this give the same result as the other two literals?
Similarly:
>>> r'[a-z]+\1', '[a-z]+\1'
('[a-z]+\\1', '[a-z]+\x01')
Why does the \1 get converted into a hex escape?
The "String and Bytes literals" section of the Python documentation has tables showing which backslash combinations are actually escape sequences with a special meaning. Combinations outside of these tables are not escapes, so the backslash is kept and both characters are treated as regular characters. "\d" is two characters, as is r"\d". You'll find, for instance, that "\n" (a single newline character) behaves differently from "\d".
\1 is an \ooo octal escape. When printed, Python shows the same character value as a hex escape. Interestingly, \8 isn't octal, but instead of raising an error, Python just treats it as two characters (because it's not an escape).
Because \d is not an escape code. So, however you type it, it is interpreted as a literal \ then a d.
If you type \\d, then the \\ is interpreted as an escaped \, followed by a d.
The situation is different if you choose a letter part of an escape code.
r'\n+','\n+', '\\n+'
⇒ ('\\n+', '\n+', '\\n+')
The first one (because it is raw) and the last one (because the \ is escaped) are 3-character strings containing a \, an n and a +.
The second one is a 2-character string, containing a '\n' (a newline) and a +.
The second example is even more straightforward. Nothing strange here. r'\1' is a backslash followed by a one. '\1' is the character whose ASCII code is 1, whose canonical representation is '\x01'.
'\1', '\x01' and '\001' are the same thing. Python cannot remember which specific syntax you used to type it. All it knows is that it is the character with code 1. So, it displays it in the "canonical way".
Exactly like 'A' '\x41' or '\101' are the same thing. And would all be printed with the canonical representation, which is 'A'
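The points made in these answers can be checked directly. The snippet below is just a small sanity check, not part of the original answers; it uses raw strings for the non-escape case so that it runs without warnings on recent Python versions.

```python
# A backslash followed by d is not an escape: the raw and the doubled form agree.
assert r"\d" == "\\d"
assert len(r"\d") == 2

# \n is a real escape: one character, unless the literal is raw.
assert len("\n") == 1
assert len(r"\n") == 2

# Octal, hex and zero-padded octal spellings all name the same character.
assert "\1" == "\x01" == "\001"
assert "A" == "\x41" == "\101"

print(repr("\1"))    # prints '\x01'
print(repr("\101"))  # prints 'A'
```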

Python assign "\" to a variable [duplicate]

When I write print('\') or print("\") or print("'\'"), Python doesn't print the backslash \ symbol. Instead it errors for the first two and prints '' for the third. What should I do to print a backslash?
This question is about producing a string that has a single backslash in it. This is particularly tricky because it cannot be done with raw strings. For the related question about why such a string is represented with two backslashes, see Why do backslashes appear twice?. For including literal backslashes in other strings, see using backslash in python (not to escape).
You need to escape your backslash by preceding it with, yes, another backslash:
print("\\")
And for versions prior to Python 3:
print "\\"
The \ character is called an escape character: it changes how the character following it is interpreted. For example, n by itself is simply a letter, but when you precede it with a backslash, it becomes \n, which is the newline character.
As you can probably guess, \ also needs to be escaped so it doesn't function like an escape character. You have to... escape the escape, essentially.
See the Python 3 documentation for string literals.
A hacky way of printing a backslash that doesn't involve escaping is to pass its character code to chr:
>>> print(chr(92))
\
print(fr"\{''}")
Or how about this:
print(r"\ "[0])
For completeness: A backslash can also be escaped as a hex sequence: "\x5c"; or a short Unicode sequence: "\u005c"; or a long Unicode sequence: "\U0000005c". All of these will produce a string with a single backslash, which Python will happily report back to you in its canonical representation - '\\'.
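As a quick check of the spellings mentioned in this thread, the following sketch collects them in a list (the name candidates is only for illustration) and confirms that each one is the same one-character string.

```python
# Every entry below is a different way of writing a single backslash.
candidates = ["\\", chr(92), "\x5c", "\u005c", "\U0000005c", r"\ "[0]]

assert all(c == "\\" and len(c) == 1 for c in candidates)

print(candidates[0])        # prints \
print(repr(candidates[0]))  # prints '\\'
```

The doubled backslash in the last line is only the repr of the string; the string itself still contains one character.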

I need to remove all invisible characters in python

I have a long text that I need to be as clean as possible.
I have collapsed multiple spaces into a single space, removed \n and \t, and stripped the resulting string.
I then found characters like \u2003 and \u2019.
What are these? How do I make sure that I have removed all special characters from my text?
Besides \n, \t and \u2003, should I check for more characters to remove?
I am using Python 3.6.
Try this:
import re
# string contains the \u2003 character
string = u'This is a test string ’'
# this regex replaces each run of non-word characters (anything other than letters, digits and underscore) with a space
re.sub(r'\W+', ' ', string).strip()
Result
'This is a test string'
If you want to preserve ascii special characters:
re.sub('[^!-~]+',' ',string).strip()
This regex reads: select [not characters 33-126] one or more times, where characters 33-126 are the visible range of ASCII.
In a regex character class, a leading ^ means "not" and the - indicates a range. Looking at an ASCII table, 32 is the space, and every character below it is either a control character or another form of whitespace such as tab and newline. Character 33 is the ! mark, and the last displayable character in ASCII is 126, or ~.
Thank you Mike Peder, this solution worked for me. However I had to do it for both sides of the comparison
if((re.sub('[^!-~]+',' ',date).strip())==(re.sub('[^!-~]+',' ',calendarData[i]).strip())):
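To answer the "What are these?" part of the question, the standard unicodedata module can report the name of any character. The sketch below looks up the two characters from the question and also shows one possible way to collapse whitespace without a regex; collapse_whitespace is a made-up helper name, and whether typographic characters such as \u2019 should be kept or removed depends on your data.

```python
import unicodedata

# Look up the official Unicode names of the two characters from the question.
for ch in ("\u2003", "\u2019"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# prints:
# U+2003 EM SPACE
# U+2019 RIGHT SINGLE QUOTATION MARK


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and strip the ends."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes \u2003 as well as \n and \t.
    return " ".join(text.split())


print(collapse_whitespace("a\u2003b \t\n c"))  # prints a b c
```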

Wrong symbol when using escape sequences learn python the hard way ex10

When I try to print \v or \f, I get gender symbols instead.
Note also that I'm a complete beginner at programming.
Edit: It seems I didn't write clearly enough. I don't want to print the text \v or \f, but the escape sequences they create. I don't know exactly what they do, but I don't think this is their intended function.
You are trying to print special characters, e.g., "\n" == new line. You can learn more here: Python String Literals.
Excerpt:
In plain English: String literals can be enclosed in matching single quotes (') or double quotes ("). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings). The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences.
The r prefix makes the literal a "raw string", so the backslash is kept as an ordinary character.
Python 2.7ish:
print r"\v"
Or, you can escape the escape character:
print "\\v"
Or, for dynamic prints:
print "%r" % ("\v",)
You need to cancel out the \ by using \\, because the \ character is used for special cases.
Try:
print '\\t'
print '\\v'
Try print '\\v' or print r"\v"
Try this:
print (r"\n")
The r prefix creates a raw string, in which backslashes are not treated as escape characters.
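The answers above use the Python 2 print statement. For reference, here is a Python 3 sketch of the same ideas. It also shows what \v and \f actually are: the vertical tab and form feed control characters, with codes 11 and 12. Some consoles draw glyphs for these codes instead of acting on them (in the CP437 code page they happen to be the two gender symbols), which is one plausible explanation for what the question describes.

```python
# Print the two-character text backslash-v, not the control character.
print(r"\v")   # prints \v
print("\\f")   # prints \f

# Inspect the control characters themselves without sending them to the console.
print(repr("\v"))            # prints '\x0b'
print(repr("\f"))            # prints '\x0c'
print(ord("\v"), ord("\f"))  # prints 11 12
```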

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file.
But it also writes bullets. When I read the file to index each word, some escape sequences get indexed too (like '\x01'). I know it's because of the bullets (•).
I want only the text, so is there any way to remove these escape sequences? I have done something like this:
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this does not remove the escape sequences.
Thanks in advance.
The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
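The equivalence between the raw pattern and the four-backslash version can be verified with a short check. The variable names raw and plain are only for illustration, and the character class is written as [0-9a-f] for brevity.

```python
import re

raw = r'\\x[0-9a-f]+'       # two backslashes reach the regex compiler
plain = '\\\\x[0-9a-f]+'    # four typed, two left after string parsing

# Both literals produce exactly the same pattern string.
assert raw == plain

print(re.sub(raw, " ", r'+\x01+'))  # prints + +
```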
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly two or exactly four. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-capturing group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern).
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
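One detail worth knowing about this pattern: Python's re module tries alternatives from left to right and takes the first one that works, so with the two-digit alternative listed first, a four-digit sequence is only partly consumed. The sketch below compares the two orderings; the names two_first and four_first are made up for this example, and [0-9a-f] is used as shorthand for the character class above.

```python
import re

two_first = re.compile(r'\\x(?:[0-9a-f]{2}|[0-9a-f]{4})')
four_first = re.compile(r'\\x(?:[0-9a-f]{4}|[0-9a-f]{2})')

s = r'+\x0123+'
print(two_first.sub("#", s))   # prints +#23+
print(four_first.sub("#", s))  # prints +#+

# With only three hex digits available, the four-digit alternative fails
# and the regex falls back to the two-digit one.
print(four_first.sub("#", r'+\x011+'))  # prints +#1+
```

Which ordering is right depends on whether the input can really contain four-digit sequences, so check a sample of your data before choosing.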
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by @nneonneo instead of this one.
If you're working with 8-bit char values, it's possible to forgo regexes by building some simple tables beforehand and then using them in conjunction with the str.translate() method to remove unwanted characters from strings very quickly and easily (Python 2 code):
import random
import string
allords = [i for i in xrange(256)]
allchars = ''.join(chr(i) for i in allords)
printableords = [ord(ch) for ch in string.printable]
deletechars = ''.join(chr(i) for i in xrange(256) if i not in printableords)
test = ''.join(chr(random.choice(allords)) for _ in xrange(10, 40)) # random string
print test.translate(allchars, deletechars)
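The code above is Python 2 (xrange, the print statement, and the two-argument form of str.translate). A rough Python 3 equivalent, assuming the same goal of deleting every character in the range 0-255 that is not in string.printable, could look like this; DELETE_TABLE and strip_unprintable are hypothetical names.

```python
import string

# Characters with code 0-255 that are not in string.printable.
delete_chars = "".join(
    chr(i) for i in range(256) if chr(i) not in string.printable
)

# In Python 3 the characters to delete go into the third argument of str.maketrans.
DELETE_TABLE = str.maketrans("", "", delete_chars)


def strip_unprintable(text: str) -> str:
    """Remove the characters listed in delete_chars from text."""
    return text.translate(DELETE_TABLE)


print(strip_unprintable("bullet\x01 point\x7f"))  # prints bullet point
```

Note that string.printable counts whitespace such as \t, \n, \x0b and \x0c as printable, so those characters are kept, while accented Latin-1 letters such as é are removed.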
Not enough reputation to comment, but the accepted answer removes printable characters as well:
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'
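Wrapped in a function, the unicodedata approach might look like the sketch below. The function name strip_other_category is invented for this example. Keep in mind that the Unicode general categories starting with 'C' also include \n and \t, so those are removed as well unless you add an exception for them.

```python
import unicodedata


def strip_other_category(text: str) -> str:
    """Drop characters whose Unicode general category starts with 'C'
    (control, format, surrogate, private use, unassigned)."""
    return "".join(
        c for c in text if not unicodedata.category(c).startswith("C")
    )


print(strip_other_category("pörféct\x01 änßwer\x0c"))  # prints pörféct änßwer
```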
