python re.sub newline multiline dotall - python

I have a CSV file containing the following lines (note the newline, \n):
"<a>https://google.com</a>",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Dirección
I am trying to delete all those commas and move the address up one row. In Python I am using this:
with open('Reutput.csv') as e, open('Put.csv', 'w') as ee:
    text = e.read()
    text = str(text)
    re.compile('<a/>*D', re.MULTILINE|re.DOTALL)
    replace = re.sub('<a/>*D','<a/>",D',text) # fix commas between fields
    replace = str(replace)
    ee.write(replace)
f.close()
As far as I know, re.MULTILINE and re.DOTALL are necessary to handle the \n. I am using re.compile because it is the only way I know to add them, but obviously compiling is not needed here.
How can I end up with this text?
"<a>https://google.com</a>",Dirección

You don't need the compile statement at all, because you aren't using it. You can put either the compiled pattern or the raw pattern in the re.sub function. You also don't need the MULTILINE flag, which has to do with the interpretation of the ^ and $ metacharacters, which you don't use.
The heart of the problem is that you are compiling the flag into a regular expression pattern, but since you aren't using the compiled pattern in your substitute command, it isn't getting recognized.
One more thing. re.sub returns a string, so replace = str(replace) is unnecessary.
Here's what worked for me:
import re
with open('Reutput.csv') as e:
    text = e.read()
    text = str(text)
# DOTALL lets "." match the newline between the two rows
s = re.compile('</a>".*D', re.DOTALL)
replace = re.sub(s, '</a>",D', text) # fix commas between fields
print(replace)
If you just call re.sub without compiling, you need to call it like
re.sub('</a>".*D', '</a>",D', text, flags=re.DOTALL)
I don't know exactly what your application is, of course, but if all you want to do is to delete all the commas and newlines, it might be clearer to write
replace = ''.join((c for c in text if c not in ',\n'))
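To make the effect of the flag concrete, here is a minimal sketch that runs the same substitution with and without re.DOTALL. The sample text is a shortened, hypothetical stand-in for the CSV fragment in the question, not the real file.

```python
import re

# Hypothetical sample: a shortened version of the two CSV rows from the question.
text = '"<a>https://google.com</a>",,,,\n,,Dirección\n'

pattern = '</a>".*D'

# Without DOTALL, "." stops at the newline, so the pattern never reaches "D".
no_flag = re.sub(pattern, '</a>",D', text)

# With DOTALL, "." also matches "\n", so the commas and the line break are consumed.
with_flag = re.sub(pattern, '</a>",D', text, flags=re.DOTALL)

print(no_flag == text)   # the text is unchanged when the flag is missing
print(with_flag)
```

Passing flags= directly to re.sub has the same effect as compiling the pattern with the flag and then using the compiled object.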

When you use re.compile you need to save the returned Regular Expression object and then call sub on that. You also need to have a .* to match any character instead of matching close html tags. The re.MULTILINE flag is only for the begin and end string symbols (^ and $) so you do not need it in this case.
regex = re.compile('</a>.*D',re.DOTALL)
replace = regex.sub('</a>",D',text)
That should work. You don't need to convert replace to a string since it is already a string.
Alternatively, you can write a regular expression that doesn't use the dot at all:
replace = re.sub('"(,|\n)*D','",D',text)
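One reason to prefer the dot-free version: a greedy .* with DOTALL runs to the last D in the whole text, not the nearest one. The sketch below illustrates this on a hypothetical three-line sample (the "Done" row is invented for the demonstration), using a character class that is equivalent to (,|\n).

```python
import re

# Hypothetical sample with a second "D" further down the file.
text = '"<a>https://google.com</a>",,,,\n,,Dirección\nDone,,\n'

# Greedy ".*" with DOTALL swallows everything up to the LAST "D".
greedy = re.sub('</a>".*D', '</a>",D', text, flags=re.DOTALL)

# The character class only consumes commas and newlines, so it stops at the first "D".
narrow = re.sub('"[,\n]*D', '",D', text)

print(repr(greedy))
print(repr(narrow))
```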

This worked for me using re.sub with multiline text:
#!/usr/bin/env python3
import re

# Read the whole file at once so the pattern can span line breaks.
with open("myfile.txt") as infile:
    text = infile.read()

# Capture the newline and indentation so the replacement can reuse them;
# "\s" is not a valid escape inside a replacement string.
replace = re.sub(r"value1(\n\s +)nickname", r"value\1name", text)

with open("newFile.txt", "w") as outfile:
    outfile.write(replace)

Related

using a regex wildcard within a specific pattern match

my code:
f = open("file.bin", 'rb')
s = f.read()
str1 = ''.join(re.findall( b'\x00\x00\x00\x12\x00\x00\x00(.*?)\x00\x01\x00\x00', s )[0])
I have some binary files from which I want to extract information (strings). The information/strings in this file looks like "[DELIMITER]String1[DELIMITER]STRING2"... The delimiters used in these files are always different but the 00's are always the same so a good workaround would be to tell regex that \x12 and \x01 can be anything.
So what I would need is
str1 = ''.join(re.findall( b'\x00\x00\x00\x[ANYTHING]\x00\x00\x00(.*?)\x00\x[ANYTHING]\x00\x00', s )[0])
How can I do this in regex?
You could try
str1 = re.findall(b'\x00\x00\x00.\x00\x00\x00(.*?)\x00.\x00\x00', s, re.S)[0]
The re.S flag is passed to re.findall; it is needed for . to match absolutely any character (or byte in this case), including \n (aka \x0a). The ''.join is dropped because findall already returns the captured group as a single bytes object.
(Note that to the regular expression engine each \xnn is a single character, so you cannot place an operator inside such an escape.)
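As a rough illustration, the sketch below applies the wildcard pattern to a made-up byte string with two records. The delimiter bytes (0x12/0x01 and 0x0a/0x07) are assumptions chosen so that one of them is a newline byte, which is the case where re.S matters.

```python
import re

# Hypothetical blob: two records whose delimiter bytes differ.
s = (b'\x00\x00\x00\x12\x00\x00\x00String1\x00\x01\x00\x00'
     b'\x00\x00\x00\x0a\x00\x00\x00STRING2\x00\x07\x00\x00')

raw = b'\x00\x00\x00.\x00\x00\x00(.*?)\x00.\x00\x00'

# With re.S the "." wildcard also matches the 0x0a delimiter byte.
with_flag = re.findall(raw, s, re.S)

# Without it, the record delimited by 0x0a is silently skipped.
without_flag = re.findall(raw, s)

print(with_flag)
print(without_flag)
```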

How to perform this regex replacement more effectively in python without repeating the search?

In python, I want to search for a pattern in a given line and surround it with the html tags. I am doing it as follows:
pattern = "(boy|girl)"
line = "I am a boy"
m = re.search(pattern, line)
line = re.sub(pattern, "<strong><u>"+m.group(0)+"</u></strong>", line)
But I feel like I am repeating the search twice. In other words, I feel like I should be able to accomplish in one line, but I just don't know the right command yet in python.
Is there something like "&" from perl? that you can use to do something like:
s/pattern/<tag>&</tag>/;
Use:
line = re.sub(pattern, r'<strong><u>\1</u></strong>', line)
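The \1 is the key part: it is replaced by the text matched by the first capture group, which here is the whole pattern because it is wrapped in parentheses. (The r prefix is recommended for all RE patterns and replacement strings so that backslash escapes reach the re module as literals.)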
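For comparison, here is a small sketch of three equivalent ways to reuse the matched text in a single re.sub call. The longer sample line is invented for the demonstration; \g<0> refers to the whole match, so it also works when the pattern has no parentheses.

```python
import re

line = "I am a boy, she is a girl"

# 1. Numbered group: needs the pattern wrapped in parentheses.
via_group = re.sub("(boy|girl)", r"<strong><u>\1</u></strong>", line)

# 2. Whole-match reference: no capture group required.
via_whole = re.sub("boy|girl", r"<strong><u>\g<0></u></strong>", line)

# 3. Replacement function: receives each match object and returns the new text.
via_func = re.sub("boy|girl",
                  lambda m: "<strong><u>" + m.group(0) + "</u></strong>",
                  line)

print(via_group)
```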

Using Python regex to sanitize input string

I have the string:
text = 'href = "www.google.com" onmouseover = blahblah >'
I want 'href = "www.google.com">'
Currently, my function looks like this:
text = re.sub(r'href = \".*\".*>', 'href = \".*\">', text)
which ends up removing the website link and replacing it with the literal string '.*'. I think I'm supposed to use ?P<name> somehow, but I do not know how to write it properly so that I get the correct output.
You don't want to substitute in .*, you want to substitute in whatever the first .* matched.
To do that, you need a backreference, like \1.
And this means you need something for the backreference to refer back to—a capture group, like (.*) instead of .*.
More generally, the replacement string is not a regular expression, it's a different kind of thing—basically, it's a template that's all literal characters except for backreferences.* So, you don't want to try to escape the quotes, unless you want literal backslashes in the results.
So:
>>> re.sub(r'href = \"(.*)\".*>', r'href = "\1">', text)
'href = "www.google.com">'
This is explained in more detail in Search and Replace in the Regular Expression HOWTO.
* Or it can be a function which takes each match object and returns a string.
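Since the question mentions named groups, here is a sketch of the same substitution written with (?P<name>...) and with a replacement function. The group name url is an arbitrary choice, and [^"]* is used instead of .* so the capture cannot run past the closing quote; both are assumptions of this example rather than part of the answer above.

```python
import re

text = 'href = "www.google.com" onmouseover = blahblah >'

# Named capture group, referenced in the replacement template as \g<url>.
pattern = r'href = "(?P<url>[^"]*)".*>'
with_template = re.sub(pattern, r'href = "\g<url>">', text)

# The same thing with a function that receives the match object.
with_function = re.sub(pattern,
                       lambda m: 'href = "' + m.group('url') + '">',
                       text)

print(with_template)
```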
An alternative way to accomplish your goal is to take a substring. No regular expression is needed. The idea is to find the second double-quote character using the string method index().
For a string called input, this expression gives you the position of the second double-quote character:
input.index('"', input.index('"')+1)
If that value is k, write input[:k+1] to extract everything up to and including the second double-quote character.
Try out the following in your Python interpreter.
input = 'href = "www.google.com" onmouseover=hax0rFunction()>'
k = input.index('"', input.index('"')+1)
input[0:k+1]

Escaping quotes when isolating strings from input

I'm trying to parse a file in which quotation marks are used to encapsulate strings. For instance, the file might contain a line like this:
"\"Hello there, my friends,\" the tour guide says." me # swap notify
But it might also contain lines like this:
"I'm a dingus who wants to put a backslash at the end of my statements. \\" me # swap notify
In that example, the quotes shouldn't be escaped, but a single backslash should remain.
Is there any function I can use to extract that full quoted statement? \n for newline and \r for carriage return also show up on occasion, so I'd like to get those two, but only after I have the full string isolated.
Parse out the string part. You could use a regular expression or str.partition.
Then call ast.literal_eval on that string and assign the result to a variable.
Test:
>>> import re
>>> import ast
>>> with open('test.txt.') as f:
...     for line in f:
...         m = re.match(r'(.*) \w+ # \w+ \w+', line)
...         print(ast.literal_eval(m.group(1)))
...
"Hello there, my friends," the tour guide says.
I'm a dingus who wants to put a backslash at the end of my statements. \
The regex says "Match anything and store it as group 1, followed by a space, a word, a space, a # sign, a space, and two more words separated by a space". You then retrieve the group with the .group(1) syntax. The parentheses define a group; see the regex documentation.
Here's a version that tries to parse the string as greedily as possible, by failing and retrying until a match is found, or no match can be made:
import re
import ast

def match_line(line):
    while line:
        print("Trying to match:", line)
        try:
            return ast.literal_eval(line)
        except SyntaxError as e:
            # Cut the line just before the reported error column and retry.
            cut = (e.offset or 1) - 1
            if cut >= len(line):  # no progress possible, give up
                break
            line = line[:cut]
        except ValueError:  # No way it would ever match
            break
    return None

with open('test.txt.') as f:
    for line in f:
        match = match_line(line.strip())
        print("Matched:", match)
        print()
You could use regex. It's usually not recommended for parsing though, because unless you have fairly simple inputs or inputs that follow strict rules, it's easy to make mistakes.
There is probably some sort of parsing module that handles this better (for example the csv module is fantastic for quote marks in fields & escaping, if you have a csv).
txt1 = r'"\"Hello there, my friends,\" the tour guide says." me # swap notify.'
txt2 = '"I' + "'" + r'm a dingus who wants to put a backslash at the end of my statements. \\" me # swap notify'
import re
print(re.findall(r'"(?:[^"\\]|\\.)+"',txt1)[0])
# "\"Hello there, my friends,\" the tour guide says."
print(re.findall(r'"(?:[^"\\]|\\.)+"',txt2)[0])
# "I'm a dingus who wants to put a backslash at the end of my statements. \\"
Note I used the r'xxxxx' syntax to avoid having to further escape my backslashes for python (they're already escaped for the regex).
The regex "(?:[^"\\]|\\.)+" says "match an opening quote, then one or more of: any character that's not a " or a backslash, OR a backslash and whatever is immediately following it, then a closing quote."
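The two ideas in this thread can be combined: isolate the quoted part with the regex, then let ast.literal_eval decode the escapes (including \n and \r). The following is only a sketch; the helper name extract_quoted and the shortened second sample line are made up for illustration, and it assumes the quoted string starts at the beginning of the line.

```python
import ast
import re

# A double-quoted string: unescaped characters or backslash-escaped pairs.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

def extract_quoted(line):
    """Return the decoded quoted string at the start of line, or None."""
    m = QUOTED.match(line)
    if m is None:
        return None
    # literal_eval turns \" into ", \\ into \, \n into a newline, and so on.
    return ast.literal_eval(m.group(0))

line1 = r'"\"Hello there, my friends,\" the tour guide says." me # swap notify'
line2 = r'''"I'm a dingus. \\" me # swap notify'''

print(extract_quoted(line1))
print(extract_quoted(line2))
```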

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regex's cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other no doubt related problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly, and all you need is the text without a newline at the end of each line so that you can iterate over it to find a required word, you can try the following:
data = (line for line in text.split('\n') if line.strip())  # yields all non-empty lines without '\n' at the end
Now you can search or replace text in each line using string methods or regex functionality.
Or you can use replace to swap every '\n' for whatever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print(content)
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to drop the \n at the end of the regex unless you need it for something.
Is the question mark there to prevent the regex from matching more than one line at a time? If so, then you probably want to use the MULTILINE flag instead of the DOTALL flag. The ^ sign will then match just after a newline or at the beginning of the string, and the $ sign will match just before a newline character or at the end of the string.
e.g.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves the problem of empty lines. Why not run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
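Put together, the MULTILINE approach might look like the sketch below. The sample content is invented; note that its last line has no trailing newline, and $ still matches there.

```python
import re

# Hypothetical file content; the final line does not end with "\n".
content = 'header\nTOTAL: 12\nbody\nTOTAL: 7'

# With MULTILINE, ^ and $ anchor at every line boundary as well as at the
# start and end of the string, so the last line is handled like the others.
per_line = re.sub(r'^TOTAL:.*$', '', content, flags=re.MULTILINE)

# Removing a line leaves its newline behind; collapse runs of newlines.
collapsed = re.sub(r'\n{2,}', '\n', per_line)

print(repr(per_line))
print(repr(collapsed))
```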
