regex not matching - python

I am writing a small Python script to gather some data from a database. The only problem is that when I export data as XML from MySQL, it includes a \b character in the XML file. I wrote code to remove it, but then realized I didn't need to do that processing every time, so I put it in a method and call it if I find a \b in the XML file. Only now the regex isn't matching, even though I know the \b is there.
Here is what I am doing:
Main program:
'''Program should start here'''
#test the file to see if processing is needed before parsing
for line in xml_file:
    p = re.compile("\b")
    if(p.match(line)):
        print p.match(line)
        processing = True
        break #only one match needed
if(processing):
    print "preprocess"
    preprocess(xml_file)
Preprocessing method:
def preprocess(file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print "in preprocess"
    lines = []
    for line in xml_file:
        lines.append(re.sub("\b", "", line))
    #go to the beginning of the file
    xml_file.seek(0);
    #overwrite with correct data
    for line in lines:
        xml_file.write(line);
    xml_file.truncate()
Any help would be great,
Thanks

\b is a special sequence for the regular expression engine:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
So you will need to escape it to find it with a regex.

Escape it with backslash in regex. Since backslash in Python needs to be escaped as well (unless you use raw strings which you don't want to), you need a total of 3 backslashes:
p = re.compile("\\\b")
This will produce a pattern matching the \b character.
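For what it's worth, here is a small check of what each spelling of the pattern does (a sketch in Python 3 syntax; the sample string is made up). As far as I can tell, a plain "\b" literal is already turned into a backspace character by Python before re ever sees it, so it does match a backspace, and only the raw r"\b" acts as a word boundary. Note also that match() only looks at the start of the string, while search() scans the whole line, which may be why the test in the question never fires.

```python
import re

sample = "abc\bdef"  # hypothetical line containing a backspace (0x08)

# "\b" in a normal (non-raw) string literal is the backspace character itself.
assert "\b" == "\x08"

# match() anchors at the start of the string, so it misses the backspace here.
print(re.match("\b", sample))            # None

# search() scans the whole string and finds it at index 3.
print(re.search("\b", sample).start())   # 3

# r"\b" reaches the regex engine as backslash + b: a zero-width word boundary.
print(re.search(r"\b", sample).span())   # (0, 0)

# Both of these spellings remove the backspace character.
cleaned_a = re.sub("\b", "", sample)
cleaned_b = re.sub("\\\b", "", sample)
print(cleaned_a, cleaned_b)              # abcdef abcdef
```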

Correct me if I'm wrong, but there is no need to use a regex to replace '\b'; you can simply use the replace method for this purpose:
def preprocess(xml_file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print "in preprocess"
    lines = map(lambda line: line.replace("\b", ""), xml_file)
    #go to the beginning of the file
    xml_file.seek(0)
    #overwrite with correct data
    for line in lines:
        xml_file.write(line)
    # OR: xml_file.writelines(lines)
    xml_file.truncate()
Note that there is no need in Python to put ';' at the end of a statement.
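Here is a minimal sketch of the same idea that works on the whole file content at once. It assumes the file is opened in 'r+' mode and fits in memory; strip_backspaces and the temporary file are made-up names for illustration, and the code uses Python 3 syntax. It seeks back to the start before reading, because a loop that already consumed part of the file would otherwise leave the read position somewhere in the middle.

```python
import os
import tempfile

def strip_backspaces(xml_file):
    """Remove backspace characters from an open read/write text file in place."""
    xml_file.seek(0)                 # read from the start, wherever we were
    content = xml_file.read()
    cleaned = content.replace("\b", "")
    if cleaned != content:           # only rewrite when something changed
        xml_file.seek(0)
        xml_file.write(cleaned)
        xml_file.truncate()          # drop leftover data past the new end
    return cleaned != content

# Usage with a throwaway file standing in for the MySQL export.
path = os.path.join(tempfile.mkdtemp(), "export.xml")
with open(path, "w") as f:
    f.write("<row>\b<field>1</field></row>\n")

with open(path, "r+") as f:
    changed = strip_backspaces(f)

with open(path) as f:
    result = f.read()
print(changed, repr(result))
```

Returning whether anything changed lets the caller skip a separate detection pass over the file.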

Related

python regular_expression_ multiple expression within expression

In my Python script I need an expression like
"\[.*[ERROR].*\n.*\n.*\n.*/\n.*is for multiple time/[\]]{2}"
Please let me know how to match "\n." multiple times. I'm stuck at this point.
There is the MULTILINE flag, which lets you anchor matches at line boundaries in multi-line text.
https://docs.python.org/2/library/re.html#re.MULTILINE
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
You also have access to DOTALL, which makes . match newlines as well:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Depending on your match, those two flags let you choose how newlines are handled. In your case, you probably want to adjust your pattern like this:
import re
text = '\n[ [ERROR]\n\nsome text\nis for multiple time]'
re.findall(r"\[.*\[ERROR\].*is for multiple time\]", text, re.DOTALL)
# result: ['[ [ERROR]\n\nsome text\nis for multiple time]']
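To make the difference between the two flags concrete, here is a small sketch on made-up sample text (Python 3 syntax). MULTILINE only changes where ^ and $ can match, while DOTALL changes what . can match.

```python
import re

text = "first line\nsecond line\nthird line"  # made-up sample text

# Without flags, '^' only matches at the very start and '.' stops at a newline.
print(re.findall(r"^\w+", text))                      # ['first']
print(re.findall(r"first.*third", text))              # []

# MULTILINE: '^' also matches right after each newline.
print(re.findall(r"^\w+", text, re.MULTILINE))        # ['first', 'second', 'third']

# DOTALL: '.' also matches newlines, so one match can span several lines.
print(re.findall(r"first.*third", text, re.DOTALL))   # ['first line\nsecond line\nthird']
```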

How to get rid of trailing \ while reading a file in python3

I am reading a file in python and getting the lines from it.
However, after printing out the values I get, I realize that after each line there is a trailing \ at the end.
I have looked at "Python strip with \n" and tried everything in it, but nothing has removed the trailing \.
For example
0048\
0051\
0052\
0054\
0056\
0057\
0058\
0059\
How can I get rid of these slashes?
Here is the code I have so far
for line in f:
    line = line.replace('\\n', "")
    line = line.replace('\\n', "")
    print(line)
I've even tried using regex
strings = re.findall(r"\S+", f.read())
But nothing has worked so far.
You're probably confused about what is in the lines, and as a result you're confusing me too. '\n' is a single newline character, as shown using repr() (which is your friend when you want to know what a value is exactly). A line typically ends with that (the exception being the end of file which might not). That does not contain a backslash; that backslash is part of a string literal escape sequence. Your replace argument of '\\n' contains two characters, a backslash followed by the letter n. This wouldn't match a '\n'; the easiest way to remove the newline specifically is to use str.rstrip('\n'). The line reading itself will guarantee that there's only up to one newline, and it is at the end of the string. Frequently we use strip() with no argument instead as we don't want whitespace either.
If your string really does contain backslash, you can process that as well, whether using replace, strip, re or some other string processing. Just keep in mind that it might be used for escape sequences not only at string literal level but at regular expression level too. For instance, re.sub(r'\\$', '', str) will remove a backslash from the end of a string; the backslash itself is doubled to not mean a special sequence in the regular expression, and the string literal is raw to not need another doubling of the backslashes.
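A short sketch of the points above, using two made-up lines: one that ends in a plain newline and one that really does contain a backslash before the newline. repr() shows which case you are in.

```python
import re

line_with_newline = "0048\n"       # what iterating over a file usually yields
line_with_backslash = "0048\\\n"   # a real backslash followed by a newline

# repr() shows exactly which characters each string contains.
print(repr(line_with_newline))     # '0048\n'
print(repr(line_with_backslash))   # '0048\\\n'

# Replacing the two characters backslash + n does nothing to a real newline.
unchanged = line_with_newline.replace("\\n", "")
print(repr(unchanged))             # '0048\n'

# rstrip('\n') removes the newline itself.
no_newline = line_with_newline.rstrip("\n")
print(repr(no_newline))            # '0048'

# If a literal backslash really is there, strip the newline and then the backslash.
stripped = re.sub(r"\\$", "", line_with_backslash.rstrip("\n"))
print(repr(stripped))              # '0048'
```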

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1=re.search(r'_abc(.*)xyz',line,re.DOTALL)
print a1.group(1)
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on a pattern that contains special characters, which turns them into literals, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
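If you only want the part strictly between the two markers, a variant along these lines might do. This is a sketch, not the code above: text_between is a made-up name, the sample is a shortened version of the file from the question, and I've used a non-greedy (.*?) group so the match stops at the first suffix rather than the last.

```python
import re

def text_between(prefix, suffix, text):
    """Return the text strictly between prefix and suffix, or None if absent."""
    # re.escape makes prefix and suffix literal; (.*?) is a non-greedy capture group.
    pattern = "{}(.*?){}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None

sample = "sjaskdjajldlj_abc:\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\nxyz\n"
between = text_between("_abc:", "xyz", sample)
print(repr(between))
# '\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\n'
```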

Escaping quotes when isolating strings from input

I'm trying to parse a file in which quotation marks are used to encapsulate strings. For instance, the file might contain a line like this:
"\"Hello there, my friends,\" the tour guide says." me # swap notify
But it might also contain lines like this:
"I'm a dingus who wants to put a backslash at the end of my statements. \\" me # swap notify
In that example, the quotes shouldn't be escaped, but a single backslash should remain.
Is there any function I can use to extract that full quoted statement? \n for newline and \r for carriage return also show up on occasion, so I'd like to get those two, but only after I have the full string isolated.
Parse out the string part. You could use a regular expression or string partition
ast.literal_eval the string and assign it to a variable.
Test:
>>> import re
>>> import ast
>>> with open('test.txt.') as f:
...     for line in f:
...         m = re.match('(.*) \w+ # \w+ \w+', line)
...         print ast.literal_eval(m.group(1))
...
"Hello there, my friends," the tour guide says.
I'm a dingus who wants to put a backslash at the end of my statements. \
The regex says "Match anything and store it as group 1, up to a space, a word, a space, a #-sign, a space, and two words separated by a space". You then retrieve the group with the .group(1) syntax. The parentheses define a group; see the regex documentation.
Here's a version that tries to parse the string as greedily as possible, by failing and retrying until a match is found, or no match can be made:
import re
import ast
def match_line(line):
    while line:
        print "Trying to match:", line
        try:
            return ast.literal_eval(line)
        except SyntaxError, e:
            line = line[:e.offset - 1]
        except ValueError: # No way it would ever match
            break
    return None
with open('test.txt.') as f:
    for line in f:
        match = match_line(line.strip())
        print "Matched:", match
        print
You could use regex. It's usually not recommended for parsing though, because unless you have fairly simple inputs or inputs that follow strict rules, it's easy to make mistakes.
There is probably some sort of parsing module that handles this better (for example the csv module is fantastic for quote marks in fields & escaping, if you have a csv).
txt1 = r'"\"Hello there, my friends,\" the tour guide says." me # swap notify.'
txt2 = '"I' + "'" + r'm a dingus who wants to put a backslash at the end of my statements. \\" me # swap notify'
import re
print re.findall(r'"(?:[^"\\]|\\.)+"',txt1)[0]
# "\"Hello there, my friends,\" the tour guide says."
print re.findall(r'"(?:[^"\\]|\\.)+"',txt2)[0]
# "I'm a dingus who wants to put a backslash at the end of my statements. \\"
Note I used the r'xxxxx' syntax to avoid having to further escape my backslashes for python (they're already escaped for the regex).
The regex "(?:[^"\\]|\\.)+" says "match an opening quote, then one or more of: anything that's not a " or a backslash, OR a backslash and whatever immediately follows it; then a closing quote."
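Putting the two ideas from this thread together, here is a rough sketch that isolates the quoted string with the regex and only then decodes the escapes with ast.literal_eval. It uses Python 3 syntax; first_quoted and QUOTED are made-up names, I've used * instead of + so an empty "" also matches, and it assumes the file's escape rules are close enough to Python's string literal rules for literal_eval to apply.

```python
import ast
import re

# One quoted string: a quote, then any run of non-quote/non-backslash characters
# or backslash-escaped pairs, then the closing quote.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

def first_quoted(line):
    """Return the decoded first double-quoted string in line, or None."""
    m = QUOTED.search(line)
    if m is None:
        return None
    # literal_eval decodes escapes such as \" and \\ the way a Python literal would.
    return ast.literal_eval(m.group(0))

line1 = r'"\"Hello there, my friends,\" the tour guide says." me # swap notify'
line2 = r'"A backslash at the end. \\" me # swap notify'
decoded1 = first_quoted(line1)
decoded2 = first_quoted(line2)
print(decoded1)  # "Hello there, my friends," the tour guide says.
print(decoded2)  # the sentence, ending in a single backslash
```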

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regexes cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n, the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other, no doubt related, problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly, and all you need is the text without a newline at the end of each line so that you can then iterate over it to find a required word, then you can try the following:
data = (line for line in text.split('\n') if line.strip())  # gives you all non-empty lines without '\n' at the end
Now you can search/replace any text you need using list slicing or regex functionality.
Or you can use replace to replace every '\n' with whatever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to just trim the regex, removing the \n at the end, unless you need it for something.
Is the question mark there to prevent the regex from matching more than one line at a time? If so, then you probably want to use the MULTILINE flag instead of the DOTALL flag. The ^ sign will then match just after a newline or at the beginning of the string, and the $ sign will match just before a newline or at the end of the string.
e.g.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. But why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
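As a small illustration of the end-of-string problem, here is a sketch on made-up content whose last line has no trailing newline (Python 3 syntax). The \n? spelling is my own variant of the (\n|$) idea suggested earlier in the thread: it makes the newline optional, so the last line is handled whether or not the file ends with one.

```python
import re

content = "keep 1\nTOTAL: 10\nkeep 2\nTOTAL: 20"  # note: no final newline

# A pattern that insists on a trailing '\n' misses the last line.
strict = re.compile(r"^TOTAL:.*\n", re.MULTILINE)
strict_result = strict.sub("", content)
print(repr(strict_result))    # 'keep 1\nkeep 2\nTOTAL: 20'

# Making the newline optional handles a final line with or without '\n'.
lenient = re.compile(r"^TOTAL:.*\n?", re.MULTILINE)
lenient_result = lenient.sub("", content)
print(repr(lenient_result))   # 'keep 1\nkeep 2\n'
```

Because the newline is consumed together with the line, this version does not leave empty lines behind, so no second cleanup pass is needed for this case.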
