how to match particular string by using python - python

I want to print the string. In my code i am not getting the right string.
line="\\python\001tag\file.txt"
str=re.search(r"\[(0-9)+]",line) (don't use raw_string here)
print str.group()
This gives nothing. I want to extract 001 from there.
Note: I don't want to use rawstring.because here user is getting the path from other resource. Is it possible to replace single slash by double slash to solve this problem

You need to use a raw-string so that escape sequences are not processed:
sat = r"\\Python\001tag\file.txt"
Demo:
>>> sat = r"\\Python\001tag\file.txt"
>>> sat
'\\\\Python\\001tag\\file.txt'
>>> print(sat)
\\Python\001tag\file.txt
>>>

Three errors: '\001' gives the codepoint in octal, actually the character at codepoint 1. Use double \\ or raw-strings.
Second: r'\[' escapes the '[', use double \\ instead: r'\\[+0-9()]' (I have rearranged the characters in the set, so that it doesn't look like a expression group.
Third: You want to look at str.group(0) to get the whole matched string.

Related

How to assign '\'(-inf-24.5]\'' to a python string?

s='\'(-inf-24.5]\'' #this in not working
what should be put before \ to include it?
we have to assign s '\'(-inf-24.5]\''
the last two characters are two single quotes and not a single double quote.
the string should literally contain the given single backslashes as the string is to be inserted as it is in a column.
You can try this:
>>> s="\\'(-inf-24.5]\\'"
>>> print s
\'(-inf-24.5]\'
or
>>> s="'\\'(-inf-24.5]\\''"
>>> print s
'\'(-inf-24.5]\''
Basically, you will need to escape the backslash, when you write \' normally, python treats it as the ' being escaped. Also, python strings can be either "", or '', so you can mix them togather to get the desired result.
>>> s = r"'\'(-inf-24.5]\''"
>>> s
"'\\'(-inf-24.5]\\''"
>>> print(s)
'\'(-inf-24.5]\''
Prepending r before a string denotes a raw string, basically indicating to the interpreter that that string's characters should be taken literally. The only thing it can't do is end a string with a backslash (such a backslash would have to be concatenated from a separate string).

Python putting r before unicode string variable

For static strings, putting an r in front of the string would give the raw string (e.g. r'some \' string'). Since it is not possible to put r in front of a unicode string variable, what is the minimal approach to dynamically convert a string variable to its raw form? Should I manually substitute all backslashes with double backslashes?
str_var = u"some text with escapes e.g. \( \' \)"
raw_str_var = ???
If you really need to escape a string, let's say you want to print a newline as \n, you can use the encode method with the Python specific string_escape encoding:
>>> s = "hello\nworld"
>>> e = s.encode("string_escape")
>>> e
"hello\\nworld"
>>> print s
hello
world
>>> print e
hello\nworld
You didn't mention anything about unicode, or which Python version you are using, but if you are dealing with unicode strings you should use unicode_escape instead.
>>> u = u"föö\nbär"
>>> print u
föö
bär
>>> print u.encode('unicode_escape')
f\xf6\xf6\nb\xe4r
Your post originally had the regex tag, maybe re.escape is what you're actually looking for?
>>> re.escape(u"foo\nbar\'baz")
u"foo\\\nbar\\'baz"
Not the "double escapes", ie printing the above string yields:
foo\
bar\'baz
There is nothing to convert - the r prefix is only significant in source code notation, not for program logic.
As a rule, if you use a single backslash in a normal string, it will automatically be converted to a double backslash if it doesn't start a valid escape sequence:
>>> "\n \("
'\n \\('
Since it may be difficult to remember all the valid/invalid escape sequences, raw string notation was introduced. But there is no way and no need to convert a string after it has been defined.
In your case, the correct approach would be to use
str_var = ur"some text with escapes e.g. \( \' \)"
which happens to result in the same string here, but is more explicit.

Slash replacement inside a raw string

Just a simple question concerning raw string, regex pattern and replacement:
I have a string variable defined as follow:
> print repr(foo)
'\n\t\t\n\t\tIf (GUTIAttach>=1) //In case of GUTI attach Enodeb should not ask RRCUecapa again\n\t\tUECapInfo;//Mps("( \\"rat_Type\\":0 \\"ueCapabilitiesRAT_Container\\":hex:011c0000000080 )");
My problem are characters "(" and ")", I want to replace them by "\(" and "\)" inside the raw string because it will be used after as a regular expression pattern.
I tried to use this method:
foo_tmp= [inc.replace(')', '\)') for inc in foo]
foo_tmp= [inc.replace('(', '\)') for inc in foo_tmp]
foo = "".join(foo_tmp)
the result gives:
> print repr(foo)
'\n\t\t\n\t\tIf \\(GUTIAttach>=1\\) //In case of GUTI attach Enodeb should not ask RRCUecapa again\n\t\t{\n\t\t\tUECapInfo;//Mps\\("\\( \\"rat_Type\\":0 \\"ueCapabilitiesRAT_Container\\":hex:011c0000000080 \\)"\\);
Characters "(" and ")" have been replaced by "\\(" and "//)" instead of "\(" and "\)".
That's a bit unexpected for me, so do you know how I can proceed to get just a single slash without changing the other part of the string?
Note: The method .decode('string_escape') is also not working due to the rest of string. Double slashes already present in the original raw string must not change.
Thanks a lot for your help
Use the re.escape() function to escape regular expression meta characters for you.
What you are seeing is otherwise perfectly normal Python behaviour; you are looking at a python literal representation; the output can be pasted back into a Python interpreter and recreate the value. As such, anything that could be interpreted as an escape code is escaped for you; a single \ would normally be doubled to prevent it being interpreted as the start of an escape sequence:
>>> '\('
'\\('
>>> print '\\('
\(
You can see this at work in other places in your foo string; the \n character combination represents a newline character, not two separate characters \ and n. If you wanted to include a literal \ and n in the text, you'd have to double the backslash to \\n. Further on into the value of foo you'll find \\", which is a single backslash followed by a " quote.

python replace single backslash with double backslash [duplicate]

This question already has answers here:
How can I put an actual backslash in a string literal (not use it for an escape sequence)?
(4 answers)
Closed 7 months ago.
In python, I am trying to replace a single backslash ("\") with a double backslash("\"). I have the following code:
directory = string.replace("C:\Users\Josh\Desktop\20130216", "\", "\\")
However, this gives an error message saying it doesn't like the double backslash. Can anyone help?
No need to use str.replace or string.replace here, just convert that string to a raw string:
>>> strs = r"C:\Users\Josh\Desktop\20130216"
^
|
notice the 'r'
Below is the repr version of the above string, that's why you're seeing \\ here.
But, in fact the actual string contains just '\' not \\.
>>> strs
'C:\\Users\\Josh\\Desktop\\20130216'
>>> s = r"f\o"
>>> s #repr representation
'f\\o'
>>> len(s) #length is 3, as there's only one `'\'`
3
But when you're going to print this string you'll not get '\\' in the output.
>>> print strs
C:\Users\Josh\Desktop\20130216
If you want the string to show '\\' during print then use str.replace:
>>> new_strs = strs.replace('\\','\\\\')
>>> print new_strs
C:\\Users\\Josh\\Desktop\\20130216
repr version will now show \\\\:
>>> new_strs
'C:\\\\Users\\\\Josh\\\\Desktop\\\\20130216'
Let me make it simple and clear. Lets use the re module in python to escape the special characters.
Python script :
import re
s = "C:\Users\Josh\Desktop"
print s
print re.escape(s)
Output :
C:\Users\Josh\Desktop
C:\\Users\\Josh\\Desktop
Explanation :
Now observe that re.escape function on escaping the special chars in the given string we able to add an other backslash before each backslash, and finally the output results in a double backslash, the desired output.
Hope this helps you.
Use escape characters: "full\\path\\here", "\\" and "\\\\"
In python \ (backslash) is used as an escape character. What this means that in places where you wish to insert a special character (such as newline), you would use the backslash and another character (\n for newline)
With your example string you would notice that when you put "C:\Users\Josh\Desktop\20130216" in the repl you will get "C:\\Users\\Josh\\Desktop\x8130216". This is because \2 has a special meaning in a python string. If you wish to specify \ then you need to put two \\ in your string.
"C:\\Users\\Josh\\Desktop\\28130216"
The other option is to notify python that your entire string must NOT use \ as an escape character by pre-pending the string with r
r"C:\Users\Josh\Desktop\20130216"
This is a "raw" string, and very useful in situations where you need to use lots of backslashes such as with regular expression strings.
In case you still wish to replace that single \ with \\ you would then use:
directory = string.replace(r"C:\Users\Josh\Desktop\20130216", "\\", "\\\\")
Notice that I am not using r' in the last two strings above. This is because, when you use the r' form of strings you cannot end that string with a single \
Why can't Python's raw string literals end with a single backslash?
https://pythonconquerstheuniverse.wordpress.com/2008/06/04/gotcha-%E2%80%94-backslashes-are-escape-characters/
Maybe a syntax error in your case,
you may change the line to:
directory = str(r"C:\Users\Josh\Desktop\20130216").replace('\\','\\\\')
which give you the right following output:
C:\\Users\\Josh\\Desktop\\20130216
The backslash indicates a special escape character. Therefore, directory = path_to_directory.replace("\", "\\") would cause Python to think that the first argument to replace didn't end until the starting quotation of the second argument since it understood the ending quotation as an escape character.
directory=path_to_directory.replace("\\","\\\\")
Given the source string, manipulation with os.path might make more sense, but here's a string solution;
>>> s=r"C:\Users\Josh\Desktop\\20130216"
>>> '\\\\'.join(filter(bool, s.split('\\')))
'C:\\\\Users\\\\Josh\\\\Desktop\\\\20130216'
Note that split treats the \\ in the source string as a delimited empty string. Using filter gets rid of those empty strings so join won't double the already doubled backslashes. Unfortunately, if you have 3 or more, they get reduced to doubled backslashes, but I don't think that hurts you in a windows path expression.
You could use
os.path.abspath(path_with_backlash)
it returns the path with \
Use:
string.replace(r"C:\Users\Josh\Desktop\20130216", "\\", "\\")
Escape the \ character.

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file.
But it also writes bullets, now when I'm reading the file to index each word, it also gets some escape sequence indexed(like '\x01').I know its because of bullets(•).
I want only text, so is there any way to remove this escape sequence.I have done something like this
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this do not remove escape sequence
Thanks in advance.
The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly two or exactly four. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-matching group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern).
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by #nneonneo instead of this one.
If you're working with 8-bit char values, it's possible to forgo regex's by building some simple tables beforehand and then use them inconjunction with str.translate() method to remove unwanted characters in strings very quickly and easily:
import random
import string
allords = [i for i in xrange(256)]
allchars = ''.join(chr(i) for i in allords)
printableords = [ord(ch) for ch in string.printable]
deletechars = ''.join(chr(i) for i in xrange(256) if i not in printableords)
test = ''.join(chr(random.choice(allords)) for _ in xrange(10, 40)) # random string
print test.translate(allchars, deletechars)
not enough reputation to comment, but the accepted answer removes printable characters as well.
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'

Categories

Resources