Finding "^M" using Python re - python

I have a file that contains occurrences of the characters "^M". I want to remove all such occurrences.
For this I tried using the following bit of code:
re.sub('^M', '', line)
Even after multiple attempts I am unable to remove them.
Can anyone please tell me what's wrong with what I am doing and suggest something?

^ is a meta-character in regular expressions, and it should be escaped, i.e.
re.sub('\\^M', '', line) # 2 backslashes because the backslash needs to be escaped
or
re.sub(r'\^M', '', line) # a single backslash is enough in a raw string
But then, this is a fixed string, so you should use str.replace, i.e.
line.replace('^M', '')
But then, ^ and M do not always mean a literal ^ followed by M: ^M is also the caret notation for the ASCII carriage return (U+000D, or Control-M), which in a Python string is \r:
line.replace('\r', '')
... but then there is already a utility that will strip these Windows line endings from a file: dos2unix.
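To make the two readings of ^M concrete, here is a minimal sketch. The sample strings and the helper names are made up for illustration; they are not from the question.

```python
import re

def remove_literal_caret_m(text):
    # Fixed string, so str.replace is enough; no regex involved.
    return text.replace('^M', '')

def remove_carriage_returns(text):
    # What editors display as ^M is often the single character '\r'.
    return text.replace('\r', '')

sample_literal = 'Mary had a lamb^M'   # made-up sample: a real '^' and 'M'
sample_cr = 'Mary had a lamb\r\n'      # made-up sample: Windows line ending

print(repr(re.sub('^M', '', sample_literal)))    # unescaped ^ only anchors at the start
print(repr(re.sub(r'\^M', '', sample_literal)))  # escaped ^ matches the caret itself
print(repr(remove_literal_caret_m(sample_literal)))
print(repr(remove_carriage_returns(sample_cr)))
```

Printing with repr() is a quick way to see which of the two cases a given file falls into.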

Related

How to get rid of trailing \ while reading a file in python3

I am reading a file in python and getting the lines from it.
However, after printing out the values I get, I realize that after each line there is a trailing \ at the end.
I have looked at Python strip with \n and tried everything in it but nothing has removed the trailing \.
For example
0048\
0051\
0052\
0054\
0056\
0057\
0058\
0059\
How can I get rid of these slashes?
Here is the code I have so far
for line in f:
    line = line.replace('\\n', "")
    line = line.replace('\\n', "")
    print(line)
I've even tried using regex
strings = re.findall(r"\S+", f.read())
But nothing has worked so far.
You're probably confused about what is in the lines, and as a result you're confusing me too. '\n' is a single newline character, as shown using repr() (which is your friend when you want to know what a value is exactly). A line typically ends with that (the exception being the end of file which might not). That does not contain a backslash; that backslash is part of a string literal escape sequence. Your replace argument of '\\n' contains two characters, a backslash followed by the letter n. This wouldn't match a '\n'; the easiest way to remove the newline specifically is to use str.rstrip('\n'). The line reading itself will guarantee that there's only up to one newline, and it is at the end of the string. Frequently we use strip() with no argument instead as we don't want whitespace either.
If your string really does contain backslash, you can process that as well, whether using replace, strip, re or some other string processing. Just keep in mind that it might be used for escape sequences not only at string literal level but at regular expression level too. For instance, re.sub(r'\\$', '', str) will remove a backslash from the end of a string; the backslash itself is doubled to not mean a special sequence in the regular expression, and the string literal is raw to not need another doubling of the backslashes.
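As a rough illustration of both cases, the sketch below strips the newline first and then removes one real trailing backslash. The sample lines and the name clean_line are assumptions made for the example.

```python
import re

def clean_line(raw):
    # Drop the newline that file iteration leaves at the end of each line.
    without_newline = raw.rstrip('\n')
    # Then drop one real trailing backslash, if there is one.
    return re.sub(r'\\$', '', without_newline)

# Made-up sample lines: a real backslash followed by a newline.
sample_lines = ['0048\\\n', '0051\\\n', '0052']

for raw in sample_lines:
    print(repr(raw), '->', repr(clean_line(raw)))

print(len('\n'), len('\\n'))  # one character versus two
```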

Escape ":" in Python?

I'm using Python (and Pytumblr) and trying to extract a certain string from some returned data, but the string I am searching for includes ":" in it. Whenever I run my script I get the error:
File "myfile.py", line 22
if re.search('^ion': u'..', u'b', line) :
^
SyntaxError: invalid syntax
Here is my code:
import pytumblr
import re
returned = client.submission('blog') # get the submissions for a given blog
sch = open('returned')
for line in sch:
    line = line.rstrip()
    if re.search('^ion': u'..', u'b', line) :
        print line
Is there another error in this code or is there a way to escape ":" that I don't know about? I'm pretty new to Python but I didn't think : needed to be escaped.
That's a syntax error because your colon is not part of the string. The single quote ' mark is closing off the string. Your first argument is being parsed as:
'^ion' - String 1: ^ion
: - Syntactical colon
u - The syntactical character u, indicating you intend for the following string literal to be in unicode
'..' - String 2: ..
If you want your single quote at the end of ^ion to be a part of the string, you either need to escape it with a backslash ('^ion\':) or, alternatively, use double quotes around the string itself. Python accepts both single and double quotes as string literal markers, so 'hello' and "hello" mean the same thing, and '"hello world"' and "'hello world'" are both legal strings.
If the regex is the pain point here, there's lots of literature and tooling out there to help. I recommend regex101.
Try to use double quotes:
re.search("^ion': u'..', u'b", line):
Or escape ':
re.search('^ion\': u\'..\', u\'b', line):
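The two spellings can be checked against each other. This is only a sketch; the sample line is invented to resemble the data in the question.

```python
import re

# Same pattern written two ways: double-quoted, and single-quoted with escapes.
pattern_double = "^ion': u'..', u'b"
pattern_escaped = '^ion\': u\'..\', u\'b'

sample_line = "ion': u'ab', u'blog'"   # made-up line for illustration

print(pattern_double == pattern_escaped)
print(re.search(pattern_double, sample_line) is not None)
```

Both literals produce the same string, so the choice between them is a matter of readability.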

Understanding file locations in python - unexpected errors

I am learning Python 3.3 on Windows 7. I have two text files, lines.txt and raven.txt, in a folder. Both contain the same text for the first example.
When I try to access ravens, using the code below, I get the error -
OSError: [Errno 22] Invalid argument: 'C:\\Python\raven.txt'
I know that the above error can be fixed by using an escape character like this -
C:\\Python\\raven.txt
C:\Python\\raven.txt
Why do both methods work? Strangely, when I access lines.txt in the same folder, I get no error! Why?
import re
def main():
    print('')
    fh = open('C:\Python\lines.txt')
    for line in fh:
        if re.search('(Len|Neverm)ore', line):
            print(line, end = '')
if __name__ == '__main__':main()
Also, when I use the line below, I get a completely different error - TypeError: embedded NUL character. Why ?
fh = open('C:\Python\Exercise Files\09 Regexes\raven.txt')
I can rectify this by using \ before every \ in the file path.
\r is an escape sequence (carriage return), but \l is not. So \lines is kept as a backslash followed by lines, while \raven becomes a carriage return followed by aven.
In [1]: len('\l')
Out[1]: 2
In [2]: len('\r')
Out[2]: 1
You should always escape backslashes as \\. In cases where your string doesn't contain quotes, you can also use raw strings:
In [9]: len(r'\r')
Out[9]: 2
In [10]: r'\r'
Out[10]: '\\r'
See: https://docs.python.org/3/reference/lexical_analysis.html
Maybe you can use a raw string, like this:
open(r'C:\Python\Exercise Files\09 Regexes\raven.txt')
When an 'r' or 'R' prefix is present, backslashes are still used to
quote the following character, but all backslashes are left in the
string. For example, the string literal r"\n" consists of two
characters: a backslash and a lowercase `n'. String quotes can be
escaped with a backslash, but the backslash remains in the string; for
example, r"\"" is a valid string literal consisting of two characters:
a backslash and a double quote; r"\" is not a valid string literal
(even a raw string cannot end in an odd number of backslashes).
Specifically, a raw string cannot end in a single backslash (since the
backslash would escape the following quote character). Note also that
a single backslash followed by a newline is interpreted as those two
characters as part of the string, not as a line continuation.
You can actually use forward slashes instead of backward ones, that way you don't have to escape them at all, which would save you a lot of headaches. Like this: 'C:/Python/raven.txt', I can guarantee that it works on Windows.
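The spellings discussed in this thread can be compared side by side without touching the file system. This is a minimal sketch; the variable names are made up, and PureWindowsPath is used only to compare paths, not to open anything.

```python
from pathlib import PureWindowsPath

broken = 'C:\\Python\raven.txt'      # '\r' here is one carriage-return character
escaped = 'C:\\Python\\raven.txt'    # every backslash doubled
raw = r'C:\Python\raven.txt'         # raw string, backslashes kept as typed
forward = 'C:/Python/raven.txt'      # forward slashes need no escaping

print(repr(broken))                  # matches the name shown in the OSError
print(escaped == raw)
print('\r' in broken)
print(PureWindowsPath(forward) == PureWindowsPath(raw))
print(repr('\09'))                   # '\0' is a NUL character, followed by '9'
```

The last line shows where the "embedded NUL character" error comes from: \0 in '\09 Regexes' is read as an octal escape for the NUL character.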

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file.
But it also writes bullets, so when I'm reading the file to index each word, some escape sequences get indexed too (like '\x01'). I know it's because of the bullets (•).
I want only text, so is there any way to remove these escape sequences? I have done something like this:
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this does not remove the escape sequences.
Thanks in advance.
The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly two or exactly four. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-capturing group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern).
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by @nneonneo instead of this one.
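The difference between the two situations in the note can be seen in a short sketch. The sample strings and variable names are assumptions for illustration; the character class is a shortened version of the one from the other answer.

```python
import re

literal_text = r'+\x01+'   # six characters: '+', backslash, 'x', '0', '1', '+'
control_char = '+\x01+'    # three characters: '+', U+0001, '+'

hex_escape = re.compile(r'\\x[0-9a-f]{2}')            # backslash, x, two hex digits
control_class = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

print(len(literal_text), len(control_char))
print(repr(hex_escape.sub(' ', literal_text)))    # the written-out escape is replaced
print(repr(hex_escape.sub(' ', control_char)))    # nothing to match, string unchanged
print(repr(control_class.sub('', control_char)))  # the real control character is removed
```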
If you're working with 8-bit char values, it's possible to forgo regexes by building some simple tables beforehand and then using them in conjunction with the str.translate() method to remove unwanted characters from strings very quickly and easily (the code below is Python 2):
import random
import string
allords = [i for i in xrange(256)]
allchars = ''.join(chr(i) for i in allords)
printableords = [ord(ch) for ch in string.printable]
deletechars = ''.join(chr(i) for i in xrange(256) if i not in printableords)
test = ''.join(chr(random.choice(allords)) for _ in xrange(10, 40)) # random string
print test.translate(allchars, deletechars)
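In Python 3 the same idea can be sketched with bytes.translate(), which accepts None as the table when characters are only being deleted. This is an adaptation under that assumption, not the original answer's code, and the name keep_printable is made up.

```python
import string

# Byte values that string.printable considers printable (ASCII only).
printable_bytes = set(string.printable.encode('ascii'))
# Every other 8-bit value goes into the delete table.
delete_bytes = bytes(b for b in range(256) if b not in printable_bytes)

def keep_printable(data):
    # table=None means "translate nothing, only delete".
    return data.translate(None, delete_bytes)

print(keep_printable(b'a\x01b\xe2c\x0cd'))
```

Note that string.printable includes whitespace such as \x0c (form feed), so that byte survives in the output above.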
I don't have enough reputation to comment, but the accepted answer also removes printable non-ASCII characters.
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'

Weird Python Regex Issues

whitespace_pattern = u"\s" # bug: tried to use unicode \u0020, broke regex
time_sig_pattern = \
"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
time_sig = compile(time_sig_pattern, U|M)
For some reason, adding the Verbose flag, X, to compile breaks the pattern.
Also, I wanted to use unicode for whitespace_pattern recognition (supposedly, we'll get patterns that use non-unicode spaces and we need to explicitly check for that one unicode character as a valid space), but the pattern keeps breaking.
VERBOSE gives you the ability to write comments in your regex to document it.
In order to do so, it ignores spaces, since you need to use line breaks to write comments.
Replace all spaces in your regex by \s to specify they are spaces you want to match in your pattern, and not just some spaces to format your comments.
What's more, you may want to use the r prefix for the string you use as a pattern. It tells Python not to interpret special notations such as \n and let you use backslashes without escaping them.
Always define regexes with the r prefix to indicate they are raw strings.
r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
When creating a regex to match unicode characters you do not want to use a Python unicode string. In your example the regular expression needs to see the literal characters \u0020, so you should use whitespace_pattern = r"\u0020" instead of u"\u0020". (The re module itself only understands \uXXXX escapes from Python 3.3 onward.)
As other answers have mentioned, you should also use the r prefix for time_sig_pattern. After those two changes your code should work fine.
For VERBOSE to work correctly you need to escape all whitespace in the pattern, so towards the beginning of the pattern replace the space in time signature with "\ " (quotes for clarity), \s, or [ ], as described in the documentation for re.VERBOSE.
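A verbose version of the pattern might look like the sketch below. It is a rewrite for illustration (Python 3, using \s directly rather than the %(ws)s substitution), not the asker's exact code.

```python
import re

time_sig = re.compile(r"""
    ^\s*time[ ]signature:\s*   # a bare space would be ignored here, so use [ ]
    (?P<top>\d+)\s*/\s*        # numerator, then the slash
    (?P<bottom>\d+)\s*$        # denominator
""", re.UNICODE | re.MULTILINE | re.VERBOSE)

# Without escaping, verbose mode silently drops the space from the pattern.
unescaped = re.compile(r"^time signature:", re.VERBOSE)

match = time_sig.search("time signature: 3 / 4")
print(match.group('top'), match.group('bottom'))
print(unescaped.search("time signature:") is None)
print(unescaped.search("timesignature:") is not None)
```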
