Anything that starts with <a class=“rms-req-link” href=“https://rms. AND ends with </a> should be replaced by TBD.
Example:
<a class=“req-link” href=“https://doc.test.com/req_view/ABC-3456">ABC-3456</a>
or:
<a class=“req-link” href=“https://doc.test.com/req_view/ABC-1234">ABC-1234</a>
Such strings should be replaced by TBD in the file.
Code I tried:
import re
output = open("regex1.txt","w")
input = open("regex.txt")
for line in input:
    output.write(re.sub(r"^<a class=“req-link” .*=“https://([a-zA-Z]+(\.[a-zA-Z]+)+).*</a>$", 'TBD', line))
input.close()
output.close()
As mentioned in the comments, the pattern you mention does not match the one you use in your code, nor does it correspond to the example strings you want replaced. So you may or may not want to adjust the following pattern depending on what you actually need.
import re
from pathlib import Path
PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')
def replace_a_tags(input_file: str, output_file: str) -> None:
    contents = Path(input_file).read_text()
    with Path(output_file).open("w") as f:
        f.write(re.sub(PATTERN, "TBD", contents))
if __name__ == "__main__":
    replace_a_tags("input.txt", "output.txt")
The .*? is important to match lazily (as opposed to greedily) so that it matches any character (.) between zero and unlimited times, as few times as possible until it hits the closing anchor tag.
The pattern matches both your example strings.
The Path.read_text method reads the entire file into memory, which could be a problem if the file is gigantic, but I doubt it is. The benefit is that a single global regex replacement is much more efficient than iterating over each line in the file individually.
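To see why the lazy quantifier matters, here is a minimal sketch on a made-up line containing two anchor tags. The sample string and its straight quotes are assumptions for illustration, not taken from the question's file:

```python
import re

# Hypothetical sample: two anchor tags on one line
sample = '<a href="https://x/1">ABC-1</a> and <a href="https://x/2">ABC-2</a>'

# Greedy: .* runs to the LAST </a>, swallowing the text between the tags
greedy = re.sub(r'<a\s+href="https://.*</a>', 'TBD', sample)

# Lazy: .*? stops at the FIRST </a>, so each tag is replaced separately
lazy = re.sub(r'<a\s+href="https://.*?</a>', 'TBD', sample)

print(greedy)  # TBD
print(lazy)    # TBD and TBD
```

With the greedy version, the text between the two tags is lost, which is usually not what you want when several links share a line.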
(1) I am using Python and would like to create a function that rewrites a portion of a text file. Referencing the sample example below, I would like to be able to delete everything from [Variables] onwards and write new content from that position. I can't figure out how to achieve this using any of seek(), truncate() and/or tell().
I'm thinking I may have to read and store the file's contents up to [Variables] and write that back in before appending the new content. Is there a better way to go about this?
(2) Bonus question: How would I do this if there was content beyond the variables section that I wanted to remain unchanged? This is currently not required, but it would be helpful to know for the future.
Sample Text File:
"[Log]
This happened
That happened
etc
[Variables]
Animals: [Dog, Cat]
Number: 4"
You can try to use regex:
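import re
string = text  # text holds the file contents shown in the question
word = '[Variables]'
# The regex pattern to match all characters on and after '[Variables]'.
# re.escape keeps the brackets literal instead of forming a character class.
pattern = re.escape(word) + ".*"
# Remove '[Variables]' and everything after it; re.DOTALL lets '.' match newlines too
string = re.sub(pattern, '', string, flags=re.DOTALL)
print(string)
If text is the sample you show in your question, the output of the code will be: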
"[Log]
This happened
That happened
etc"
In order to add new text at the end you just need to concatenate a new string to the existing one like:
string += "Some Text"
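For the bonus question (keeping content that comes after the variables section), one possible approach is to replace only the section body, up to the next line that starts with [ or the end of the text. This is a sketch under assumptions: replace_section is a hypothetical helper, and it assumes section headers are the only lines that begin with [.

```python
import re

def replace_section(text: str, header: str, new_body: str) -> str:
    """Replace the body under `header`, up to the next [Section] line or end of text."""
    pattern = re.compile(
        # Lazy body match that stops before the next line starting with '['
        re.escape(header) + r"\n.*?(?=^\[|\Z)",
        flags=re.DOTALL | re.MULTILINE,
    )
    # A function replacement avoids backslash escapes being interpreted in new_body
    return pattern.sub(lambda m: header + "\n" + new_body, text, count=1)

# Hypothetical file contents with a trailing section that must survive
text = "[Log]\nThis happened\n[Variables]\nAnimals: [Dog, Cat]\nNumber: 4\n[Footer]\nkeep me\n"
updated = replace_section(text, "[Variables]", "Number: 5\n")
print(updated)
```

The result can then be written back with a normal open(path, "w") call, which sidesteps seek() and truncate() entirely at the cost of rewriting the whole file.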
I need to get the text inside the parentheses, where the text ends with .md, using a regex in Python (if you know another way, feel free to suggest it).
Original string:
[Romanian (Romania)](books/free-programming-books-ro.md)
Expected result:
books/free-programming-books-ro.md
This should work:
import re
s = '[Romanian (Romania)](books/free-programming-books-ro.md)'
result = re.findall(r'[^\(]+\.md(?=\))', s)
print(result)
# ['books/free-programming-books-ro.md']
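An alternative sketch, in case a line holds several links: anchor on the "](" that separates the link text from its target and capture the target in a group. The second link in the sample line is made up for illustration.

```python
import re

# Hypothetical line with two links; only the first target ends in .md
line = '[Romanian (Romania)](books/free-programming-books-ro.md) and [site](https://example.com)'

# \]\( matches the "](" between link text and target;
# the group captures a target without ")" or whitespace that ends in .md
targets = re.findall(r'\]\(([^)\s]+\.md)\)', line)
print(targets)  # ['books/free-programming-books-ro.md']
```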
Suppose I have the following string:
GPH_EPL_GK_FIN
I want a regex that I'll be using in Python to look for such strings in a CSV file (not relevant to this question), for records that start with GPH but DON'T contain EPL.
I know the caret ^ is used for searching at the beginning,
so I have something like this:
^GPH_.*
I want to include the NOT-contain part as well. How do I chain the regex?
i.e.
(^GPH_.*)(?!EPL)
I would eventually like to take this a step further: for any records that are returned without EPL, i.e.
GPH_ABC_JKL_OPQ
I want to insert the EPL part AFTER GPH_
i.e. desired result
GPH_EPL_ABC_JKL_OPQ
To cover both requirements:
compose a pattern to match lines that start with GPH but DONT contain EPL
insert EPL_ part into matched line to a particular position
import re
# sample string containing lines
s = '''GPH_EPL_GK_FIN
GPH_ABC_JKL_OPQ'''
pat = re.compile(r'^(GPH_)(?!.*EPL.*)')
for line in s.splitlines():
    print(pat.sub('\\1EPL_', line))
The output:
GPH_EPL_GK_FIN
GPH_EPL_ABC_JKL_OPQ
This here would do, I think:
^GPH_(?!EPL).*
This will return any string that starts with GPH_ and does not have EPL immediately after GPH_. Note that it does not check for EPL later in the string.
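The difference between checking right after GPH_ and checking the rest of the string can be sketched as follows. The middle sample string is made up to show the edge case:

```python
import re

next_is_not_epl = re.compile(r'^GPH_(?!EPL).*')    # EPL must not directly follow GPH_
no_epl_anywhere = re.compile(r'^GPH_(?!.*EPL).*')  # EPL must not appear anywhere later

samples = ['GPH_EPL_GK_FIN', 'GPH_GK_EPL_FIN', 'GPH_ABC_JKL_OPQ']

loose = [s for s in samples if next_is_not_epl.match(s)]
strict = [s for s in samples if no_epl_anywhere.match(s)]

print(loose)   # ['GPH_GK_EPL_FIN', 'GPH_ABC_JKL_OPQ']
print(strict)  # ['GPH_ABC_JKL_OPQ']
```

Which one is right depends on whether "DON'T contain EPL" means anywhere in the record or only in the second field.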
I'm just guessing that one option would be,
(?<=^GPH_(?!EPL))
and re.sub with,
EPL_
Test
import re
print(re.sub(r"(?<=^GPH_(?!EPL))", "EPL_", "GPH_ABC_JKL_OPQ"))
Output
GPH_EPL_ABC_JKL_OPQ
Simply use this:
https://regex101.com/r/GwBsg2/2
pattern: ^(?!^(?:[^_\n]+_)*EPL_?(?:[^_\n]+_?)*)(.*)GPH
substitute: \1GPH_EPL
flags: gm
I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":
I want to get the "fileAssetId" and "filename" values. I've tried to load the line with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not too sure how to get the data as it's displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
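As a minimal sketch of both calls, using the two lookaround patterns joined with | on a shortened, made-up version of the line from the question:

```python
import re

# Shortened, hypothetical version of the line from the question
text = '{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"id":3041}'

pattern = re.compile(
    r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")'  # the id between the quotes
    r'|(?<="filename":").+?(?=")'               # the filename between the quotes
)

# findall returns the matched strings directly
values = pattern.findall(text)
print(values)

# finditer yields match objects, which also expose positions
for m in pattern.finditer(text):
    print(m.start(), m.group())
```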
You can use python's walk method and check each entry with re.match.
If the string you have cannot be converted to a Python dict, you can use just a regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_pattern).group(1))
Solution for your example:
import re
example_string = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
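If the line begins with a complete JSON object, as the sample appears to, another option is to let the standard json module parse just that first object with JSONDecoder.raw_decode, which stops at the end of the first value instead of failing on whatever follows. This is a sketch on a shortened, made-up version of the line; whether it applies depends on why json failed in your case.

```python
import json

# Shortened, hypothetical version of the line: one complete object, then a truncated one
line = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"id":3041},{"__type":"WikiNode","children":[],"content":'

decoder = json.JSONDecoder()
# raw_decode returns the parsed value and the index where parsing stopped
obj, end = decoder.raw_decode(line)

print(obj["d"]["fileAssetId"])  # 034b9317-60d9-45c2-b6d6-0f24b59e1991
print(obj["d"]["filename"])     # Reports.pdf
```

The returned end index can be used to continue from the next value, for example by skipping the comma and calling raw_decode(line, end + 1) inside a try block.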
Try adding \n to the string that you are entering in to the file (\n means new line)
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party module (https://pypi.org/project/regex/), not the built-in re
json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    # JSON allows any digits in the fraction, including trailing zeros (e.g. 1.50)
    r'(?P<number>-?(0|([1-9]\d*))(\.\d+)?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
# json_document_text is the text you want to check
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects rather than whole documents by replacing (?&document) with (?&object). The regex turned out simpler than I expected, but I have not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issues when running it.