re.split not working on ^A [duplicate] - python

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 5 years ago.
I'm trying to parse lines of input that look like
8=FIX.4.2^A9=0126^A35=0^A34=000742599^A49=L3Q206N^A50=2J6L^A52=20130620-11:16:27.344^A369=000733325^A56=CME^A57=G^A142=US,IL^A1603=OMS2^A1604=0.1^A
where you have different fields of data separated by ^A. I'm trying to get at the individual data fields (like 8=FIX.4.2, 9=0126, 35=0, etc). The problem is that python sometimes interprets ^A as a single character (in vim this is ctrl-v, ctrl-a) and sometimes as the string '^A' with two characters. So I have tried doing
entries = re.split('^A|^A', str(line))
but later when i do
for entry in entries:
print entries
I just end up with the original string, with nothing split. Is this a problem with re.split?

Depends on what that line contains.
If you want to split on the 2-character string '^A', escape the special-to-regexps character ^, in this case probably meaning '\^A'.
It's more likely that this is instead the caret notation way of printing the single character with byte value 0x01, in which case you probably want to split on '\x01' instead.
(You might as well use string's own split() function, I'm guessing it's faster than using regexps for something this simple)

^ has a special meaning in regular expressions, so you should escape it first.
>>> strs = "8=FIX.4.2^A9=0126^A35=0^A34=000742599^A49=L3Q206N^A50=2J6L^A52=20130620-11:16:27.344^A369=000733325^A56=CME^A57=G^A142=US,IL^A1603=OMS2^A1604=0.1^A"
>>> re.split('\^A',strs)
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']
From docs:
'^' : (Caret.) Matches the start of the string, and in MULTILINE mode also
matches immediately after each newline.

^ is a metacharacter, it matches only at the start of a string. Escape it:
>>> re.split('\^A', line)
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']
There is no need to use a | in your expression, especially not when both 'alternate' strings are the same.
It appears however that you have the \x07 or \a control character, not the two-character ^A string. Just use .split() to split on that value, no need for a regular expression:
>>> line = line.replace('^A', '\a')
>>> line
'8=FIX.4.2\x079=0126\x0735=0\x0734=000742599\x0749=L3Q206N\x0750=2J6L\x0752=20130620-11:16:27.344\x07369=000733325\x0756=CME\x0757=G\x07142=US,IL\x071603=OMS2\x071604=0.1\x07'
>>> line.split('\a')
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']

Related

Searching for a raw string (r'\n') within a raw string full of ('\n') resulting in an empty result set [duplicate]

This question already has answers here:
Can't escape the backslash with regex?
(7 answers)
Closed 4 years ago.
I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t,n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise it is passed by and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print ("[" + s + "]")
>>> [ ] // an actual tab character after preprocessing
>>> s = '\d'
>>> print ("[" + s + "]")
>>> [\d] // '\d' after preprocessing
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with a '\'. Now when Python sees '\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print ("[" + s + "]")
>>> [\t] // '\t' after preprocessing
After the Python interpreter take a pass on both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t','a\\t') //Match is None
Here, match is None. Why? Lets look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for 'a escape-t' in the string 'a\t' which are not a match.
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t','a\\t') // Match contains 'a\t'
Again, lets look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.
Consider this.
>>> match = re.search(r'a\\t','a\\t') // match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
Python's own string parsing (partially) comes in your way.
If you want to see what re sees, type
print '\d'
print '\\d'
print '\\\d'
on the Python command prompt. You see that \d and \\d both result in \d, the latter one being taken care by the Python string parser.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the RE module.

regular expression problems in re.findall python [duplicate]

This question already has answers here:
Can't escape the backslash with regex?
(7 answers)
Closed 4 years ago.
I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t,n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise it is passed by and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print ("[" + s + "]")
>>> [ ] // an actual tab character after preprocessing
>>> s = '\d'
>>> print ("[" + s + "]")
>>> [\d] // '\d' after preprocessing
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with a '\'. Now when Python sees '\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print ("[" + s + "]")
>>> [\t] // '\t' after preprocessing
After the Python interpreter take a pass on both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t','a\\t') //Match is None
Here, match is None. Why? Lets look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for 'a escape-t' in the string 'a\t' which are not a match.
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t','a\\t') // Match contains 'a\t'
Again, lets look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.
Consider this.
>>> match = re.search(r'a\\t','a\\t') // match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
Python's own string parsing (partially) comes in your way.
If you want to see what re sees, type
print '\d'
print '\\d'
print '\\\d'
on the Python command prompt. You see that \d and \\d both result in \d, the latter one being taken care by the Python string parser.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the RE module.

Why do I need to add DOTALL to python regular expression to match new line in raw string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 1 year ago.
Why does one need to add the DOTALL flag for the python regular expression to match characters including the new line character in a raw string. I ask because a raw string is supposed to ignore the escape of special characters such as the new line character. From the docs:
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
This is my situation:
string = '\nSubject sentence is: Appropriate support for families of children diagnosed with hearing impairment\nCausal Verb is : may have\npredicate sentence is: a direct impact on the success of early hearing detection and intervention programs in reducing the negative effects of permanent hearing loss'
re.search(r"Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string ,re.DOTALL)
results in a match , However , when I remove the DOTALL flag, I get no match.
In regex . means any character except \n
So if you have newlines in your string, then .* will not pass that newline(\n).
But in Python, if you use the re.DOTALL flag(also known as re.S) then it includes the \n(newline) with that dot .
Your source string is not raw, only your pattern string.
maybe try
string = r'\n...\n'
re.search("Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string)

python replace single backslash with double backslash [duplicate]

This question already has answers here:
How can I put an actual backslash in a string literal (not use it for an escape sequence)?
(4 answers)
Closed 7 months ago.
In python, I am trying to replace a single backslash ("\") with a double backslash("\"). I have the following code:
directory = string.replace("C:\Users\Josh\Desktop\20130216", "\", "\\")
However, this gives an error message saying it doesn't like the double backslash. Can anyone help?
No need to use str.replace or string.replace here, just convert that string to a raw string:
>>> strs = r"C:\Users\Josh\Desktop\20130216"
^
|
notice the 'r'
Below is the repr version of the above string, that's why you're seeing \\ here.
But, in fact the actual string contains just '\' not \\.
>>> strs
'C:\\Users\\Josh\\Desktop\\20130216'
>>> s = r"f\o"
>>> s #repr representation
'f\\o'
>>> len(s) #length is 3, as there's only one `'\'`
3
But when you're going to print this string you'll not get '\\' in the output.
>>> print strs
C:\Users\Josh\Desktop\20130216
If you want the string to show '\\' during print then use str.replace:
>>> new_strs = strs.replace('\\','\\\\')
>>> print new_strs
C:\\Users\\Josh\\Desktop\\20130216
repr version will now show \\\\:
>>> new_strs
'C:\\\\Users\\\\Josh\\\\Desktop\\\\20130216'
Let me make it simple and clear. Lets use the re module in python to escape the special characters.
Python script :
import re
s = "C:\Users\Josh\Desktop"
print s
print re.escape(s)
Output :
C:\Users\Josh\Desktop
C:\\Users\\Josh\\Desktop
Explanation :
Now observe that re.escape function on escaping the special chars in the given string we able to add an other backslash before each backslash, and finally the output results in a double backslash, the desired output.
Hope this helps you.
Use escape characters: "full\\path\\here", "\\" and "\\\\"
In python \ (backslash) is used as an escape character. What this means that in places where you wish to insert a special character (such as newline), you would use the backslash and another character (\n for newline)
With your example string you would notice that when you put "C:\Users\Josh\Desktop\20130216" in the repl you will get "C:\\Users\\Josh\\Desktop\x8130216". This is because \2 has a special meaning in a python string. If you wish to specify \ then you need to put two \\ in your string.
"C:\\Users\\Josh\\Desktop\\28130216"
The other option is to notify python that your entire string must NOT use \ as an escape character by pre-pending the string with r
r"C:\Users\Josh\Desktop\20130216"
This is a "raw" string, and very useful in situations where you need to use lots of backslashes such as with regular expression strings.
In case you still wish to replace that single \ with \\ you would then use:
directory = string.replace(r"C:\Users\Josh\Desktop\20130216", "\\", "\\\\")
Notice that I am not using r' in the last two strings above. This is because, when you use the r' form of strings you cannot end that string with a single \
Why can't Python's raw string literals end with a single backslash?
https://pythonconquerstheuniverse.wordpress.com/2008/06/04/gotcha-%E2%80%94-backslashes-are-escape-characters/
Maybe a syntax error in your case,
you may change the line to:
directory = str(r"C:\Users\Josh\Desktop\20130216").replace('\\','\\\\')
which give you the right following output:
C:\\Users\\Josh\\Desktop\\20130216
The backslash indicates a special escape character. Therefore, directory = path_to_directory.replace("\", "\\") would cause Python to think that the first argument to replace didn't end until the starting quotation of the second argument since it understood the ending quotation as an escape character.
directory=path_to_directory.replace("\\","\\\\")
Given the source string, manipulation with os.path might make more sense, but here's a string solution;
>>> s=r"C:\Users\Josh\Desktop\\20130216"
>>> '\\\\'.join(filter(bool, s.split('\\')))
'C:\\\\Users\\\\Josh\\\\Desktop\\\\20130216'
Note that split treats the \\ in the source string as a delimited empty string. Using filter gets rid of those empty strings so join won't double the already doubled backslashes. Unfortunately, if you have 3 or more, they get reduced to doubled backslashes, but I don't think that hurts you in a windows path expression.
You could use
os.path.abspath(path_with_backlash)
it returns the path with \
Use:
string.replace(r"C:\Users\Josh\Desktop\20130216", "\\", "\\")
Escape the \ character.

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file.
But it also writes bullets, now when I'm reading the file to index each word, it also gets some escape sequence indexed(like '\x01').I know its because of bullets(•).
I want only text, so is there any way to remove this escape sequence.I have done something like this
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this do not remove escape sequence
Thanks in advance.
The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly two or exactly four. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-matching group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern).
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by #nneonneo instead of this one.
If you're working with 8-bit char values, it's possible to forgo regex's by building some simple tables beforehand and then use them inconjunction with str.translate() method to remove unwanted characters in strings very quickly and easily:
import random
import string
allords = [i for i in xrange(256)]
allchars = ''.join(chr(i) for i in allords)
printableords = [ord(ch) for ch in string.printable]
deletechars = ''.join(chr(i) for i in xrange(256) if i not in printableords)
test = ''.join(chr(random.choice(allords)) for _ in xrange(10, 40)) # random string
print test.translate(allchars, deletechars)
not enough reputation to comment, but the accepted answer removes printable characters as well.
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'

Categories

Resources