regular expression in Python to update string in a file

regular expression in Python to update string in a file - python

Anything that starts with <a class=“rms-req-link” href=“https://rms. AND ends with </a> should be replaced by TBD.
Example:
<a class=“req-link” href=“https://doc.test.com/req_view/ABC-3456">ABC-3456</a>
or:
<a class=“req-link” href=“https://doc.test.com/req_view/ABC-1234">ABC-1234</a>
Such strings should be replaced by TBD in the file.
Code I tried:
import re
output = open("regex1.txt","w")
input = open("regex.txt")
for line in input:
output.write(re.sub(r"^<a class=“req-link” .*=“https://([a-zA-Z]+(\.[a-zA-Z]+)+).*</a>$", 'TBD', line))
input.close()
output.close()

As mentioned in the comments, the pattern you mention does not match the one you use in your code, nor does it correspond to the example strings you want replaced. So you may or may not want to adjust the following pattern depending on what you actually need.
import re
from pathlib import Path
PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')
def replace_a_tags(input_file: str, output_file: str) -> None:
contents = Path(input_file).read_text()
with Path(output_file).open("w") as f:
f.write(re.sub(PATTERN, "TBD", contents))
if __name__ == "__main__":
replace_a_tags("input.txt", "output.txt")
The .*? is important to match lazily (as opposed to greedily) so that it matches any character (.) between zero and unlimited times, as few times as possible until it hits the closing anchor tag.
The pattern matches both your example strings.
The Path.read_text method obviously reads the entire file into memory, so that may be a problem, if it happens to be gigantic, but I doubt it. The benefit is that the global regex replacement is much more efficient than iterating over each line in the file individually.

Related

Problem with finding the correct match with regex

I have some data, which I'm trying to process. Basically I want to change all the commas , to semicolon ;, but some fields contain text, usernames or passwords that also contain commas. How do I change all the commas except the ones inclosed in "?
Test data:
Secret Name,URL,Username,Password,Notes,Folder,TOTP Key,TOTP Backup Codes
test1,,username,"pass,word",These are the notes,\Some\Folder,,
test2,,"user1, user2, user3","pass,word","Hello, I'm mr Notes",\Some\Folder,,
test3,http://1.2.3.4/ucsm/ucsm.jnlp,"xxxx\n(use Drop down, select Hello)",password,Use the following\nServer1\nServer2,\Some\Folder,,
What have I tried?
secrets = """Secret Name,URL,Username,Password,Notes,Folder,TOTP Key,TOTP Backup Codes
test1,,username,"pass,word",These are the notes,\Some\Folder,,
test2,,"user1, user2, user3","pass,word","Hello, I'm mr Notes",\Some\Folder,,
test3,http://1.2.3.4/ucsm/ucsm.jnlp,"xxxx\n(use Drop down, select Hello)",password,Use the following\nServer1\nServer2,\Some\Folder,,
"""
test = re.findall(r'(.+?\")(.+)(\".+)', secrets)
for line in test:
part1, part2, part3 = line
processed = "".join([part1.replace(",", ";"), part2, part3.replace(",", ";")])
print(processed)
Result:
test1;;username;"pass,word";These are the notes;\Some\Folder;;
test2;;"user1, user2, user3","pass,word","Hello, I'm mr Notes";\Some\Folder;;
It works fine, when there's only one occurence of "" in the line and no line breaks, but when there are more or there's a line break within the quotations, it's broken. How can I fix this?
FYI: Notes can contain multiple line breaks.

You don't need a regex here, take advantage of a CSV parser:
import csv, io
inp = csv.reader(io.StringIO(secrets), # or use file as input
quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)
with open('out.csv', 'w') as out:
csv.writer(out, delimiter=';').writerows(inp)
output file:
Secret Name;URL;Username;Password;Notes;Folder;TOTP Key;TOTP Backup Codes
test1;;username;pass,word;These are the notes;\Some\Folder;;
test2;;user1, user2, user3;pass,word;Hello, I'm mr Notes;\Some\Folder;;
test3;http://1.2.3.4/ucsm/ucsm.jnlp;"xxxx
(use Drop down, select Hello)";password;Use the following
Server1
Server2;\Some\Folder;;
Optionally, use the quoting=csv.QUOTE_ALL parameter in csv.writer.

This should do I believe:
import re
print( re.sub(r'("[^"]*")|,', lambda x: x.group(1) if x.group(1) else x.group().replace(",", ";"), secrets))

mozway's solution looks like the best way to resolve this, but interestingly, SM1312's regex works almost perfectly with a much more simple replacement argument for the sub function (i.e. r'\1;'):
import re
print (re.sub(r'("[^"]*")|,', r'\1;', secrets))
The only issue is this introduces an extra semicolon after a quoted entry. This happens because the first alternation member (i.e. ("[^"]*")) does not consume a comma, but the replacement argument adds a semicolon regardless of which alternation member matches. Simply adding a comma to the first alternation member resolves this and works perfectly for the sample data:
import re
print (re.sub(r'("[^"]*"),|,', r'\1;', secrets))
However, it fails if the data includes a quoted entry as the last (i.e. the TOTP Backup Codes) column of the data; any commas in the last quoted entry will be changed to semicolons. This is likely not an acceptable failure mode since it is changing the data set. The following resolves that issue, but introduces a different error that may be tolerable; it adds an extra semicolon at the end of the line:
import re
print (re.sub(r'("[^"]*")(,|(?=\s+))|,', r'\1;', secrets))
This is accomplished by changing the first part of the original alternation member to use alternation itself. That is, the part that was matching the comma after the quoted entry is changed to additionally check for nothing but whitespace (i.e. (,|(?=\s+))), which includes an end of line, after the quoted entry using the following positive lookahead assertion: (?=\s+). The positive lookahead assertion for whitespace is used instead of simply matching whitespace to avoid consuming the whitespace and eliminating it from the resulting output.

Python 2.7 find IP address and replace with text

I have used Python to extract a route table from a router and am trying to
strip out superfluous text, and
replace the destination of each route with a text string to match a different customer grouping.
At the moment I have::
infile = "routes.txt"
outfile = "output.txt"
delete_text = ["ROUTER1# sh run | i route", "ip route"]
client_list = ["CUST_A","CUST_B"]
subnet_list = ["1.2.3.4","5.6.7.8"]
fin = open(infile)
fout = open(outfile, "w+")
for line in fin:
for word in delete_text:
line = line.replace(word, "")
for word in subnet_list:
line = line.replace("1.2.3.4", "CUST_A")
for word in subnet_list:
line = line.replace("5.6.7.8", "CUST_B")
fout.write(line)
fin.close()
fout.close()
f = open('output.txt', 'r')
file_contents = f.read()
print (file_contents)
f.close()
This works to an extent but when it searches and replaces for e.g. 5.6.7.8 it also picks up that string within other IP addresses e.g. 5.6.7.88, and replaces them also which I don't want to happen.
What I am after is an exact match only to be found and replaced.

You could use re.sub() with explicit word boundaries (\b):
>>> re.sub(r'\b5.6.7.8\b', 'CUST_B', 'test 5.6.7.8 test 5.6.7.88 test')
'test CUST_B test 5.6.7.88 test'

As you found out your approach is bad because it results in false positives (i.e., undesirable matches). You should parse the lines into tokens then match the individual tokens. That might be as simple as first doing tokens = line.split() to split on whitespace. That, however, may not work if the line contains quoted strings. Consider what the result of this statement: "ab 'cd ef' gh".split(). So you might need a more sophisticated parser.
You could use the re module to perform substitutions using the \b meta sequence to ensure the matches begin and end on a "word" boundary. But that has its own, unique, failure modes. For example, consider that the . (period) character matches any character. So doing re.sub('\b5.6.7.8\b', ...) as #NPE suggested will actually match not just the literal word 5.6.7.8 but also 5x6.7y8. That may not be a concern given the inputs you expect but is something most people don't consider and is therefore another source of bugs. Regular expressions are seldom the correct tool for a problem like this one.

thanks guys, I've been testing with this and the re.sub function just seems to print out the below string in a loop : CUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTBCUSTB .
I have amended the code snippet above to :
for word in subnet_list:
line = re.sub(r'\b5.6.7.8\b', 'CUST_B', '5.6.7.88')
Ideally I would like the string element to be replaced in all of the list occurrences along with preserving the list structure ?

Regular Expression String Mangling Efficiency in Python - Explanation for Slowness?

I'm hoping someone can help explain why Python's re module seems to be so slow at chopping up a very large string for me.
I have string ("content") that is very nearly 600k bytes in size. I'm trying to hack off just the beginning part of it, a variable number of lines, delimited by the text ">>>FOOBAR<<<".
The literal completion time is provided for comparison purposes - the script that this snippet is in takes a bit to run naturally.
The first and worst method:
import re
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = re.sub(".*>>>FOOBAR<<<", ">>>FOOBAR<<<", content, flags=re.S)
Has a completion time of:
real 6m7.213s
While a wordy method:
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
newstir = ""
flag = False
for l in content.split('\n'):
if re.search(">>>FOOBAR<<<", l):
flag = True
#End if we encountered our flag line
if flag:
newstir += l
#End loop through content
content = newstir
Has an expected completion time of:
real 1m5.898s
And using a string's .split method:
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = content.split(">>>FOOBAR<<<")[1]
Also has an expected completion time of:
real 1m6.427s
What's going on here? Why is my re.sub call so ungodly slow for the same string?

There is no good way to do it with a pattern starting either with .* or .*? in particular with large data, since the first will cause a lot of backtracking and the second must test for each taken character if the following subpattern fails (until it succeeds). Using a non-greedy quantifier isn't faster than using a greedy quantifier.
I suspect that your ~600k content data are in a file at the beginning. Instead of loading the whole file and storing its content to a variable, work line by line. In this way you will preserve memory and avoid to split and to create a list of lines. Second thing, if you are looking for a literal string, don't use a regex method, use a simple string method like find that is faster:
with open('yourfile') as fh:
for line in fh:
result += line
if line.find('>>>FOOBAR<<<') > -1:
break
If >>>FOOBAR<<< isn't a simple literal string but a regex pattern, in this case compile the pattern before:
pat = re.compile(r'>>>[A-Z]+<<<')
with open('yourfile') as fh:
for line in fh:
result += line
if pat.search(line):
break

Editing a text file using python

I have an auto generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to make the program to detect such a pattern and change the citekey form to xxxxx:2009?

It's not quite clear to me which expression you want to match, but you can build everything with regex, using import re and re.sub as shown. [0-9]*4 matches exactly 4 numbers.
(Edit, to incorporate suggestions)
import re
inf = 'temp.txt'
outf = 'out.txt'
with open(inf) as f,open(outf,'w') as o:
all = f.read()
all = re.sub("xxxxx:[0-9]*4tb","xxxxx:tb",all) # match your regex here
o.write(all)
o.close()

You actually just want to remove the two letters after the year in a reference. Supposing we could uniquely identify a reference as a colon followed by four numbers and two letters, than the following regular expression would work (at least it is working in this example code):
import re
s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""
new_s = re.sub('(:[0-9]{4})\w{2}', r'\1', s)
print new_s
Explanation: "match a colon : followed by four numbers [0-9]{4} followed by any two "word" characters \w{2}. The parentheses catch just the part you want to keep, and r'\1' means you are replacing each whole match by a smaller part of it which is in the first (and only) group of parentheses. The r before the string is there because it is necessary to interpret \1 as a raw string, and not as an escape sequence.
Hope this helps!

Parsing text file in python

I have html-file. I have to replace all text between this: [%anytext%]. As I understand, it's very easy to do with BeautifulSoup for parsing hmtl. But what is regular expression and how to remove&write back text data?
Okay, here is the sample file:
<html>
[t1] [t2] ... [tood] ... [sadsada]
Sample text [i8]
[d9]
</html>
Python script must work with all strings and replace [%] -> some another string, for example:
<html>
* * ... * ... *
Sample text *
*
</html>
What I did:
import re
import codecs
fullData = ''
for line in codecs.open(u'test.txt', encoding='utf-8'):
line = re.sub("\[.*?\]", '*', line)
fullData += line
print fullData
This code does exactly I described in sample. Thanks all.

Regex does the trick if you are needing to replace any text between "[%" and "%]".
The code would look something like this:
import re
newstring = re.sub("\[%.*?%\]",newtext,oldstring)
The regex used here is lazy so it would match everything between an occurrence of "[%" and the next occurrence of "%]". You could make it greedy by removing the question mark. This would match everything between the first occurrence of of "[%" and the last occurrence of "%]"

Looks like you need to parse a generic textfile, looking for that marker to replace it -- the fact that other text outside the marker is HTML, at least from the way you phrased your task, does not seem to matter.
If so, and what you want is to replace every occurrence of [%anytext%] with loremipsum, then a simple:
thenew = theold.replace('[%anytext%]', 'loremipsum')
will serve, if theold is the original string containing the file's text -- now thenew is a new string with all occurrences of that marker replaced - no need for regex, BS or anything else.
If your task is very different from this, pls edit your Question to explain it in more detail!-)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

regular expression in Python to update string in a file - python

Related

Problem with finding the correct match with regex

Python 2.7 find IP address and replace with text

Regular Expression String Mangling Efficiency in Python - Explanation for Slowness?

Editing a text file using python

Parsing text file in python

Categories

Resources