I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regexes cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n, the above command substitutes all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other, no doubt related, problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly, and all you need is the text without the newline at the end of each line so that you can iterate over it to find a required word, then you can try the following:
data = [line for line in text.split('\n') if line.strip()]  # all non-empty lines, without the '\n' at the end
Now you can search/replace any text you need using list slicing or regex functionality.
Or you can use replace to replace every '\n' with whatever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all runs of multiple newlines with a single newline as you read in the file. Another option might be to just drop the \n from the end of the regex, unless you need it for something.
Is the question mark there to prevent the regex from matching more than one line at a time? If so, then you probably want to use the MULTILINE flag instead of the DOTALL flag. The ^ sign will then match just after a newline or at the beginning of the string, and the $ sign will match just before a newline character or at the end of the string.
E.g.:
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. But why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
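The two steps above can be tried on a small string. This is a minimal sketch, assuming Python 3; the sample text is made up for illustration and deliberately has no trailing newline, so it also shows that $ in MULTILINE mode matches at the very end of the string.

```python
import re

# Hypothetical sample data; note there is no newline after the last line.
content = "keep 1\nTOTAL: 10\nkeep 2\nTOTAL: 20"

# Step 1: blank out every line that starts with TOTAL:
without_totals = re.sub(r"^TOTAL:.*$", "", content, flags=re.MULTILINE)

# Step 2: collapse the runs of newlines left behind by step 1
collapsed = re.sub(r"\n{2,}", "\n", without_totals)

print(repr(without_totals))
print(repr(collapsed))
```

Because . does not match a newline without DOTALL, each match stays within one line.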
I want to search for newline characters in a string using regex in Python. I don't want \r or \n to be included in the message.
I have tried a regex that detects \r\n correctly. But when I remove \r\n from the Line variable, it still prints the error.
Line="got less no of bytes than requested\r\n"
if(re.search('\\r|\\n',Line)):
    print("Do not use \\r\\n in MSG");
It should detect a \r\n that appears in the Line variable as text, not the invisible \n.
It should not print anything when Line is like below:
Line="got less no of bytes than requested"
You are looking for the re.sub function.
Try this:
import re
Line="got less no of bytes than requested\r\n"
replaced = re.sub('\n','',Line)
replaced = re.sub('\r','',replaced)  # operate on the previous result, not on Line
print(replaced)
Instead of checking for newlines, it would probably be better to just remove them. There is no need for regex here: strip removes all whitespace, including newlines, from both ends of the string:
line = 'got less no of bytes than requested\r\n'
line = line.strip()
# line = 'got less no of bytes than requested'
If you want to do it with regex you can use:
import re
line = 'got less no of bytes than requested\r\n'
line = re.sub(r'\n|\r', '', line)
# line = 'got less no of bytes than requested'
If you insist on checking for the newlines, you can do it like this:
if '\n' in line or '\r' in line:
    print(r'Do not use \r\n in MSG')
Or the same with regex:
import re
if re.search(r'\n|\r', line):
    print(r'Do not use \r\n in MSG')
Also: it's advisable to have your Python variables named with snake_case.
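The options above behave differently when a line break sits in the middle of the string. Here is a small side-by-side sketch, assuming Python 3; the sample string is invented for illustration.

```python
import re

# Hypothetical message with leading spaces, an inner line break and a trailing one
line = "  got less no of bytes\r\nthan requested\r\n"

# strip() only touches the two ends of the string
stripped = line.strip()

# re.sub removes every \r and \n, wherever it is, but keeps other whitespace
no_breaks = re.sub(r"[\r\n]", "", line)

# Detection only: True if there is a line break anywhere in the string
has_break = bool(re.search(r"[\r\n]", line))

print(repr(stripped))
print(repr(no_breaks))
print(has_break)
```

So strip is enough when line breaks can only occur at the ends, while the regex (or str.replace) is needed for breaks inside the text.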
First of all, consider using strip, as many people here have mentioned.
Second, if you want to match a newline at ANY position in the string, use search, not match.
See "What is the difference between re.search and re.match?" for more about search vs. match.
newline_regexp = re.compile("\n|\r")
newline_regexp.search(Line)  # returns a match object, or None if not found
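To make the difference concrete, here is a short sketch (assuming Python 3) that runs both methods on the string from the question:

```python
import re

newline_regexp = re.compile(r"\r|\n")
line = "got less no of bytes than requested\r\n"

# match() only tries at position 0, where the string starts with "g"
at_start = newline_regexp.match(line)

# search() scans the whole string and finds the \r near the end
anywhere = newline_regexp.search(line)

print(at_start)
print(anywhere is not None)
```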
If you just want to check for line breaks that appear in the message as literal text, you can use the string method find(). Note the use of raw strings, as indicated by the r in front of the string literals. This removes the need to escape the backslashes.
line = r"got less no of bytes than requested\r\n"
print(line)
if line.find(r'\r\n') != -1:  # find() returns -1 when there is no match; 0 is a valid position
    print("Do not use line breaks in MSG")
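It may help to see what the r prefix changes. The sketch below (assuming Python 3, with a shortened made-up message) compares a raw string, which holds a visible backslash and letter, with an ordinary string, which holds real control characters.

```python
# Raw string: the four characters \ r \ n are stored as visible text
literal_text = r"msg\r\n"

# Ordinary string: \r and \n become one control character each
real_breaks = "msg\r\n"

print(len(literal_text), len(real_breaks))

# Searching for the visible text only succeeds in the raw string
found_in_literal = literal_text.find(r"\r\n")
found_in_real = real_breaks.find(r"\r\n")

# Searching for actual control characters only succeeds in the ordinary string
real_has_breaks = "\r\n" in real_breaks
literal_has_breaks = "\r\n" in literal_text
```

Which check is right depends on whether the message contains the escape sequence as typed text or as actual line-break characters.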
As others have noted, you are probably looking for line.strip(). But in case you still want to practice regex, you could use the following code:
import re
Line="got less no of bytes than requested\r\n"
# \r\n located anywhere in the string
prog = re.compile(r'\r\n')
# \r or \n located anywhere in the string (this replaces the pattern above)
prog = re.compile(r'(\r|\n)')
if prog.search(Line):
    print('Do not use \\r\\n in MSG')
I have this CSV with the following lines in it (please note the newline, \n):
"<a>https://google.com</a>",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Dirección
I am trying to delete all those commas and move the address one row up. So, in Python, I am using this:
with open('Reutput.csv') as e, open('Put.csv', 'w') as ee:
    text = e.read()
    text = str(text)
    re.compile('<a/>*D', re.MULTILINE|re.DOTALL)
    replace = re.sub('<a/>*D','<a/>",D',text) # fix commas between fields
    replace = str(replace)
    ee.write(replace)
f.close()
As far as I know, re.MULTILINE and re.DOTALL are necessary to handle the \n. I am using re.compile because it is the only way I know to add them, but obviously compiling is not needed here.
How can I end up with this text?
"<a>https://google.com</a>",Dirección
You don't need the compile statement at all, because you aren't using it. You can put either the compiled pattern or the raw pattern in the re.sub function. You also don't need the MULTILINE flag, which has to do with the interpretation of the ^ and $ metacharacters, which you don't use.
The heart of the problem is that you are compiling the flag into a regular expression pattern, but since you aren't using the compiled pattern in your substitute command, it isn't getting recognized.
One more thing. re.sub returns a string, so replace = str(replace) is unnecessary.
Here's what worked for me:
import re
with open('Reutput.csv') as e:
    text = e.read()
s = re.compile('</a>".*D', re.DOTALL)
replace = re.sub(s, '</a>",D', text)  # fix commas between fields
print(replace)
If you just call re.sub without compiling, you need to call it like
re.sub('</a>".*D', '</a>",D', text, flags=re.DOTALL)
I don't know exactly what your application is, of course, but if all you want to do is to delete all the commas and newlines, it might be clearer to write
replace = ''.join((c for c in text if c not in ',\n'))
When you use re.compile you need to save the returned Regular Expression object and then call sub on that. You also need to have a .* to match any character instead of matching close html tags. The re.MULTILINE flag is only for the begin and end string symbols (^ and $) so you do not need it in this case.
regex = re.compile('</a>.*D',re.DOTALL)
replace = regex.sub('</a>",D',text)
That should work. You don't need to convert replace to a string since it is already a string.
Alternatively, you can write a regular expression that doesn't use the . wildcard:
replace = re.sub('"(,|\n)*D','",D',text)
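The three variants discussed in these answers can be checked against each other. This is a sketch assuming Python 3; the sample text is a shortened stand-in for the CSV in the question, and the lazy .*? is my own choice so the match stops at the first D.

```python
import re

# Shortened, hypothetical version of the two CSV lines from the question
text = '"<a>https://google.com</a>",,,,,\n,,Dirección\n'

# 1. Compile with the flag, then call sub on the compiled object
compiled = re.compile(r'</a>".*?D', re.DOTALL)
via_compiled = compiled.sub('</a>",D', text)

# 2. Skip compiling and pass the flag by keyword
via_flags = re.sub(r'</a>".*?D', '</a>",D', text, flags=re.DOTALL)

# 3. Avoid . altogether: match only commas and newlines
without_dot = re.sub(r'"[,\n]*D', '",D', text)

print(via_compiled)
```

Passing the flag by keyword matters: the fourth positional argument of re.sub is count, not flags.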
This worked for me using re.sub with multiline text:
#!/usr/bin/env python3
import re
# read the whole file once; the with block closes it afterwards
with open("myfile.txt") as infile:
    text = infile.read()
# capture the newline and indentation so the replacement can reuse them via \1
# (there is no . in the pattern, so re.DOTALL is not needed)
replace = re.sub(r"value1(\n\s +)nickname", r"value\1name", text)
with open("newFile.txt", "w") as outfile:
    outfile.write(replace)
I'm reading in a file and storing its contents as a multiline string. Then I loop through some values I get from a Django query and run regexes built from those values. My regex seems like it should be working, and it does work if I copy the values returned by the query, but for some reason it doesn't match when all the parts are working together.
My code is:
with open("/path_to_my_file") as myfile:
    data=myfile.read()
#read saved settings then write/overwrite them into the config
items = MyModel.objects.filter(some_id="s100009")
for item in items:
    regexString = "^\s*"+item.feature_key+":"
    print regexString #to verify its what I want it to be, ie debug
    pq = re.compile(regexString, re.M)
    if pq.match(data):
        #do stuff
So basically my problem is that the regex isn't matching. When I copy the file contents into a big old string, and copy the value(s) printed by the print regexString line, it does match, so I'm thinking there's some esoteric Python/Django thing going on (or maybe not so esoteric, as Python isn't my first language).
For example's sake, the output of print regexString is:
^\s*productDetailOn:
File contents:
productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
trendingWallOn:true,
searchResultOn:false,
bannersOn:true,
homeWidgetOn:true,
}
Running Python 2.7. Also, dumped the types of both item.feature and data, and both were unicode. Not sure if that matters? Anyway, I'm starting to hit my head off the desk after working this for a couple hours, so any help is appreciated. Cheers!
According to the documentation, re.match only ever matches at the beginning of the string, never at the beginning of a later line:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You need to use a re.search:
regexString = r"^\s*"+item.feature_key+":"
pq = re.compile(regexString, re.M)
if pq.search(data):
A small note on the raw string (r"^\s*"): in this case, it is equivalent to "^\s*" because \s is not a recognized string escape sequence (unlike \r or \n), so Python leaves the backslash in the string as-is. Still, it is safer to always declare regex patterns with raw string literals in Python (and with corresponding notations in other languages, too); recent Python 3 versions warn about unrecognized escapes such as "\s" in ordinary string literals.
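Here is a self-contained sketch of the match/search difference, assuming Python 3. The data string is a trimmed stand-in for the file contents, feature_key is a plain variable standing in for item.feature_key, and the re.escape call is my own addition in case a key ever contains regex metacharacters.

```python
import re

# Trimmed, hypothetical version of the file contents from the question
data = "{\n  productDetailOn:true,\n  allOff:false,\n}\n"

feature_key = "allOff"  # stand-in for item.feature_key

pattern = re.compile(r"^\s*" + re.escape(feature_key) + ":", re.M)

# match() is anchored at index 0 of data, even with re.M
matched = pattern.match(data)

# search() tries every position, so ^ can anchor at any line start
found = pattern.search(data)

print(matched)
print(repr(found.group()))
```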
I have the following regexp:
pattern = re.compile(r"HESAID:|SHESAID:")
It's working correctly. I use it to split by multiple delimiters like this:
result = pattern.split(content)
What I want to add is verification so that the split does NOT happen unless HESAID: or SHESAID: are placed on new lines. This is not working:
pattern = re.compile(r"\nHESAID:\n|\nSHESAID:\n")
Please help.
It would be helpful if you elaborated on how exactly it is not working, but I am guessing that the issue is that it does not match consecutive lines of HESAID/SHESAID. You can fix this by using beginning and end of line anchors instead of actually putting \n in your regex:
pattern = re.compile(r'^HESAID:$|^SHESAID:$', re.MULTILINE)
The re.MULTILINE flag is necessary so that ^ and $ match at beginning and end of lines, instead of just the beginning and end of the string.
I would probably rewrite the regex as follows, the ? after the S makes it optional:
pattern = re.compile(r'^S?HESAID:$', re.MULTILINE)
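A quick way to see the anchored pattern in action is to split a string in which one delimiter also appears mid-line. This is a sketch assuming Python 3, with made-up content.

```python
import re

# Hypothetical transcript; the second HESAID: sits inside a sentence
content = "HESAID:\nHello\nSHESAID:\nHi, HESAID: is not a delimiter here\n"

pattern = re.compile(r"^S?HESAID:$", re.MULTILINE)
parts = pattern.split(content)

for part in parts:
    print(repr(part))
```

The first element is an empty string because the content starts with a delimiter, and the newlines around each delimiter stay with the neighbouring parts, so you may want to strip them afterwards.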
I am writing a small Python script to gather some data from a database. The only problem is that when I export data as XML from MySQL, it includes a \b character in the XML file. I wrote code to remove it, but then realized I didn't need to do that processing every time, so I put it in a method and call it if I find a \b in the XML file. Only now the regex isn't matching, even though I know the \b is there.
Here is what I am doing:
Main program:
'''Program should start here'''
#test the file to see if processing is needed before parsing
for line in xml_file:
    p = re.compile("\b")
    if(p.match(line)):
        print p.match(line)
        processing = True
        break #only one match needed
if(processing):
    print "preprocess"
    preprocess(xml_file)
Preprocessing method:
def preprocess(file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print "in preprocess"
    lines = []
    for line in xml_file:
        lines.append(re.sub("\b", "", line))
    #go to the beginning of the file
    xml_file.seek(0);
    #overwrite with correct data
    for line in lines:
        xml_file.write(line);
    xml_file.truncate()
Any help would be great,
Thanks
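In a regex, the two-character sequence \b (a backslash followed by b) is a word-boundary assertion, not a backspace:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
That meaning only applies when the regex engine actually sees the backslash, as it does with the raw string r"\b". In an ordinary Python string literal, "\b" is already the single backspace character, and the regex engine matches that character literally, so the pattern in the question does look for a backspace.
If you want to make that explicit, you can escape the character for the regex engine with a backslash. Since a backslash in a Python string literal needs to be escaped as well (a raw string won't do here, because r"\b" is the word boundary), you need a total of 3 backslashes:
p = re.compile("\\\b")
This will produce a pattern matching the backspace character. The reason the check in the question fails is that p.match() only matches at the start of the line; use p.search() to find the character anywhere in the line.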
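The behaviour of the different spellings is easy to check directly. The following is a small sketch, assuming Python 3 and a made-up sample line; it compares "\b" in an ordinary string literal, the raw string r"\b", and match versus search.

```python
import re

# Hypothetical XML line with a backspace character (0x08) in the middle
line = "<row>abc\x08def</row>"

# Ordinary string literal: Python turns "\b" into the backspace character
backspace = re.compile("\b")

# Raw string: the regex engine sees backslash + b, i.e. a word boundary
boundary = re.compile(r"\b")

at_start = backspace.match(line)    # only tries position 0
anywhere = backspace.search(line)   # scans the whole line
first_boundary = boundary.search(line)

print("\b" == "\x08")
print(at_start, anywhere.start(), first_boundary.start())
```

In this sample, search finds the backspace while match does not, because the character is not at the start of the line.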
Correct me if I'm wrong, but there is no need to use a regex to remove '\b'; you can simply use the string replace method for this purpose:
def preprocess(xml_file):
    # exporting from MySQL query browser adds a weird
    # character to the result set, remove it
    # so the XML parser can read the data
    print("in preprocess")
    # build a list (not a lazy map) so every line is read before the file is rewound
    lines = [line.replace("\b", "") for line in xml_file]
    # go to the beginning of the file
    xml_file.seek(0)
    # overwrite with correct data
    for line in lines:
        xml_file.write(line)
    # OR: xml_file.writelines(lines)
    xml_file.truncate()
Note that there is no need in Python to put ';' at the end of a statement.
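For completeness, here is one way the check and the cleanup could be combined into a single pass, so the file is only rewritten when a backspace is actually present. This is a sketch under assumptions: Python 3, a UTF-8 file small enough to read into memory, and a hypothetical helper name strip_backspaces that does not appear in the question.

```python
import os
import tempfile


def strip_backspaces(path):
    """Remove backspace characters from the file at path.

    Returns True if the file was rewritten, False if it was already clean.
    (Hypothetical helper, not part of the original script.)
    """
    with open(path, "r+", encoding="utf-8", newline="") as xml_file:
        original = xml_file.read()
        if "\b" not in original:
            return False
        xml_file.seek(0)                      # go back to the beginning
        xml_file.write(original.replace("\b", ""))
        xml_file.truncate()                   # drop the leftover tail
    return True


# Small demonstration on a throwaway file
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<row>abc\bdef</row>\n")
demo_path = tmp.name

changed = strip_backspaces(demo_path)
with open(demo_path, encoding="utf-8") as cleaned_file:
    cleaned = cleaned_file.read()
os.remove(demo_path)

print(changed, repr(cleaned))
```

Reading the whole file at once avoids the problem of a partially consumed file iterator when the check and the rewrite share the same file object.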