Parsing text file in python

I have an HTML file and need to replace every piece of text of the form [%anytext%]. As I understand it, parsing HTML is easy with BeautifulSoup. But what regular expression should I use, and how do I remove the text and write the data back?
Okay, here is the sample file:
<html>
[t1] [t2] ... [tood] ... [sadsada]
Sample text [i8]
[d9]
</html>
The Python script must handle every line and replace each [...] marker with another string, for example:
<html>
* * ... * ... *
Sample text *
*
</html>
What I did:
import re
import codecs
fullData = ''
for line in codecs.open(u'test.txt', encoding='utf-8'):
    # Replace each [...] marker (matched lazily) with '*'
    line = re.sub(r"\[.*?\]", '*', line)
    fullData += line
print(fullData)
This code does exactly what I described in the sample. Thanks, all.

A regex does the trick if you need to replace any text between "[%" and "%]".
The code would look something like this:
import re
newstring = re.sub(r"\[%.*?%\]", newtext, oldstring)
The regex used here is lazy, so it matches everything between an occurrence of "[%" and the next occurrence of "%]". You could make it greedy by removing the question mark. It would then match everything between the first occurrence of "[%" and the last occurrence of "%]". Note that "." does not match a newline by default, so either way a match stays within one line.
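To make the lazy/greedy difference concrete, here is a minimal sketch. The sample string and the '*' replacement are made up for illustration; only the two patterns come from the answer above.

```python
import re

# Hypothetical sample text with two [% ... %] markers on one line.
oldstring = "<p>[%first%] kept text [%second%]</p>"
newtext = "*"

# Lazy: each marker is matched and replaced separately.
lazy = re.sub(r"\[%.*?%\]", newtext, oldstring)

# Greedy: a single match runs from the first "[%" to the last "%]".
greedy = re.sub(r"\[%.*%\]", newtext, oldstring)

print(lazy)    # <p>* kept text *</p>
print(greedy)  # <p>*</p>
```

The greedy version swallows the text between the two markers, which is usually not what you want when a line can hold more than one marker.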

Looks like you need to parse a generic textfile, looking for that marker to replace it -- the fact that other text outside the marker is HTML, at least from the way you phrased your task, does not seem to matter.
If so, and what you want is to replace every occurrence of [%anytext%] with loremipsum, then a simple:
thenew = theold.replace('[%anytext%]', 'loremipsum')
will serve, if theold is the original string containing the file's text -- now thenew is a new string with all occurrences of that marker replaced - no need for regex, BS or anything else.
If your task is very different from this, pls edit your Question to explain it in more detail!-)
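If the markers vary (as in the sample file, where each [...] holds different text), a literal replace is not enough and the regex route applies. A rough sketch of the whole read, replace, write-back cycle follows. The function name replace_markers and the temporary file are assumptions for the demo, not something from the question.

```python
import re
import tempfile
from pathlib import Path

# Any text in square brackets, matched lazily (same pattern as in the question).
MARKER = re.compile(r"\[.*?\]")

def replace_markers(path: Path, replacement: str = "*") -> str:
    """Replace every [marker] in the file and write the result back."""
    text = path.read_text(encoding="utf-8")
    new_text = MARKER.sub(replacement, text)
    path.write_text(new_text, encoding="utf-8")
    return new_text

# Demo on a temporary file that mimics the sample from the question.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "test.txt"
    sample.write_text(
        "<html>\n[t1] [t2]\nSample text [i8]\n[d9]\n</html>\n",
        encoding="utf-8",
    )
    result = replace_markers(sample)
    print(result)
```

This reads the whole file into memory, which should be fine for an ordinary HTML page. For a fixed marker, text.replace(...) can stand in for MARKER.sub(...).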

Related

regular expression in Python to update string in a file

Anything that starts with <a class=“rms-req-link” href=“https://rms. AND ends with </a> should be replaced by TBD.
Example:
<a class=“req-link” href=“https://doc.test.com/req_view/ABC-3456">ABC-3456</a>
or:
<a class=“req-link” href=“https://doc.test.com/req_view/ABC-1234">ABC-1234</a>
Such strings should be replaced by TBD in the file.
Code I tried:
import re
output = open("regex1.txt","w")
input = open("regex.txt")
for line in input:
    output.write(re.sub(r"^<a class=“req-link” .*=“https://([a-zA-Z]+(\.[a-zA-Z]+)+).*</a>$", 'TBD', line))
input.close()
output.close()

As mentioned in the comments, the pattern you mention does not match the one you use in your code, nor does it correspond to the example strings you want replaced. So you may or may not want to adjust the following pattern depending on what you actually need.
import re
from pathlib import Path

PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')

def replace_a_tags(input_file: str, output_file: str) -> None:
    # Read the whole file, replace every matching anchor tag, write the result
    contents = Path(input_file).read_text()
    with Path(output_file).open("w") as f:
        f.write(re.sub(PATTERN, "TBD", contents))

if __name__ == "__main__":
    replace_a_tags("input.txt", "output.txt")
The .*? is important to match lazily (as opposed to greedily) so that it matches any character (.) between zero and unlimited times, as few times as possible until it hits the closing anchor tag.
The pattern matches both your example strings.
The Path.read_text method obviously reads the entire file into memory, so that may be a problem, if it happens to be gigantic, but I doubt it. The benefit is that the global regex replacement is much more efficient than iterating over each line in the file individually.
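As a quick check of the lazy-matching point, here is a small sketch that puts both example links on one (made-up) line and compares the lazy pattern with a greedy variant. The surrounding words "See", "and" and the greedy pattern are assumptions added for the comparison.

```python
import re

PATTERN = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*?</a>')
# Greedy variant, only here to show what goes wrong without the "?".
GREEDY = re.compile(r'<a\s+class=“req-link”\s+href=“https://.*</a>')

# Hypothetical line holding both example links from the question.
line = (
    'See <a class=“req-link” href=“https://doc.test.com/req_view/ABC-3456">ABC-3456</a>'
    ' and <a class=“req-link” href=“https://doc.test.com/req_view/ABC-1234">ABC-1234</a>.'
)

lazy_result = PATTERN.sub("TBD", line)
greedy_result = GREEDY.sub("TBD", line)

print(lazy_result)    # See TBD and TBD.
print(greedy_result)  # See TBD.
```

Keep in mind that "." does not match a newline by default, so an anchor tag that is split across lines would need the re.DOTALL flag to match.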

Rewrite a specific portion of a text file in python

(1) I am using Python and would like to create a function that rewrites a portion of a text file. Referencing the sample example below, I would like to be able to delete everything from [Variables] onwards and write new content from that position. I can't figure out how to achieve this using any of seek(), truncate() and/or tell().
I'm thinking I may have to read and store the file's contents up to [Variables] and write that back in before appending the new content. Is there a better way to go about this?
(2) Bonus question: How would I do this if there was content beyond the variables section that I wanted to remain unchanged? This is currently not required, but it would be helpful to know for the future.
Sample Text File:
"[Log]
This happened
That happened
etc
[Variables]
Animals: [Dog, Cat]
Number: 4"

You can try a regex:
import re
string = text  # text holds the file's contents
word = '[Variables]'
# Pattern matching '[Variables]' and all characters after it.
# re.escape() keeps the brackets literal (otherwise they form a character class),
# and re.DOTALL lets '.' match newlines as well.
pattern = re.escape(word) + r".*"
# Remove '[Variables]' and everything after it from the string
string = re.sub(pattern, '', string, flags=re.DOTALL)
print(string)
If text is the sample shown in your question, the output of the code will be:
"[Log]
This happened
That happened
etc"
In order to add new text at the end you just need to concatenate a new string to the existing one like:
string += "Some Text"
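For the bonus question (content after the section that must stay unchanged), one possible approach is to stop the match at the next section header instead of at the end of the text. The sketch below assumes that section headers are lines consisting only of [Name]; the function name replace_section and the [Footer] section are invented for the example.

```python
import re

def replace_section(text: str, header: str, new_body: str) -> str:
    """Replace the body under `header`, up to the next [Section] line or the end."""
    pattern = re.compile(
        # Header line, then as little as possible until a line that is
        # only "[Something]" (the next header) or the end of the text.
        re.escape(header) + r"\n.*?(?=^\[[^\]\n]+\]$|\Z)",
        flags=re.DOTALL | re.MULTILINE,
    )
    # A function as replacement avoids backslash processing in new_body.
    return pattern.sub(lambda m: header + "\n" + new_body, text, count=1)

sample = (
    "[Log]\nThis happened\nThat happened\netc\n"
    "[Variables]\nAnimals: [Dog, Cat]\nNumber: 4\n"
    "[Footer]\nkeep me\n"
)
updated = replace_section(sample, "[Variables]", "Animals: [Bird]\nNumber: 1\n")
print(updated)
```

The line "Animals: [Dog, Cat]" is not mistaken for a header because the lookahead only accepts a bracketed name that fills a whole line. The result can then be written back to the file in one go, for example with Path.write_text.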

Match text between parenthesis that end with .md

I need to get the text inside the parentheses where the text ends with .md, using a regex in Python (if you know another way, feel free to suggest it).
Original string:
[Romanian (Romania)](books/free-programming-books-ro.md)
Expected result:
books/free-programming-books-ro.md

This should work:
import re
s = '[Romanian (Romania)](books/free-programming-books-ro.md)'
result = re.findall(r'[^\(]+\.md(?=\))', s)
print(result)
Output:
['books/free-programming-books-ro.md']
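An alternative, sketched here under the assumption that the paths contain no spaces or nested parentheses, is to match the parentheses themselves and capture only what is between them. The second test string is made up to show several links in one line.

```python
import re

s = '[Romanian (Romania)](books/free-programming-books-ro.md)'

# Match "(", capture a run without parentheses or whitespace that ends
# in ".md", then require the closing ")".
MD_LINK = re.compile(r'\(([^()\s]+\.md)\)')

links = MD_LINK.findall(s)
print(links)  # ['books/free-programming-books-ro.md']

# Hypothetical line with several links; only the .md targets are returned.
many = MD_LINK.findall('[A](a.md) and [B](b.txt) and [C](docs/c.md)')
print(many)   # ['a.md', 'docs/c.md']
```

Because the pattern has one capture group, re.findall returns the captured paths rather than the full matches, so "(Romania)" is skipped and the parentheses are left out of the result.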

Regex filter containing word at beginning but not containing another word

Suppose I have the following string:
GPH_EPL_GK_FIN
I want a regex, to be used in Python, that searches a CSV file (not relevant to this question) for records that start with GPH but DON'T contain EPL.
I know the caret ^ is used for matching at the beginning,
so I have something like this:
^GPH_.*
I want to include the NOT-contain part as well. How do I chain the regex?
For example:
(^GPH_.*)(?!EPL)
Eventually I would like to take this a step further and, for any records that are returned without EPL, e.g.
GPH_ABC_JKL_OPQ
insert the EPL part AFTER GPH_,
i.e. the desired result is
GPH_EPL_ABC_JKL_OPQ

To cover both requirements, (1) compose a pattern that matches lines that start with GPH but DON'T contain EPL, and (2) insert the EPL_ part into each matched line at the required position:
import re
# sample string containing lines
s = '''GPH_EPL_GK_FIN
GPH_ABC_JKL_OPQ'''
# (?!.*EPL.*) rejects the line if EPL appears anywhere after GPH_
pat = re.compile(r'^(GPH_)(?!.*EPL.*)')
for line in s.splitlines():
    print(pat.sub('\\1EPL_', line))
The output:
GPH_EPL_GK_FIN
GPH_EPL_ABC_JKL_OPQ

This here would do, I think:
^GPH_(?!EPL).*
This matches any string that starts with GPH_ and does not have EPL immediately after GPH_.
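The two lookaheads seen so far behave differently when EPL shows up later in the record. A small sketch, where GPH_ABC_EPL_OPQ and XYZ_ABC are made-up records added for the comparison:

```python
import re

records = ["GPH_EPL_GK_FIN", "GPH_ABC_JKL_OPQ", "GPH_ABC_EPL_OPQ", "XYZ_ABC"]

# Rejects EPL only when it comes directly after "GPH_".
directly_after = re.compile(r"^GPH_(?!EPL).*")
# Rejects EPL anywhere after "GPH_".
anywhere = re.compile(r"^GPH_(?!.*EPL).*")

kept_directly = [r for r in records if directly_after.match(r)]
kept_anywhere = [r for r in records if anywhere.match(r)]

print(kept_directly)  # ['GPH_ABC_JKL_OPQ', 'GPH_ABC_EPL_OPQ']
print(kept_anywhere)  # ['GPH_ABC_JKL_OPQ']
```

Which one is right depends on whether "don't contain EPL" in the question means anywhere in the record or only in the second field.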

I'm just guessing that one option would be,
(?<=^GPH_(?!EPL))
and re.sub with,
EPL_
Test
import re
print(re.sub(r"(?<=^GPH_(?!EPL))", "EPL_", "GPH_ABC_JKL_OPQ"))
Output
GPH_EPL_ABC_JKL_OPQ

Simply use this:
https://regex101.com/r/GwBsg2/2
pattern: ^(?!^(?:[^_\n]+_)*EPL_?(?:[^_\n]+_?)*)(.*)GPH
substitute: \1GPH_EPL
flags: gm

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","author":null,"d‌escription":null,"fi‌leAssetId":"034b9317‌-60d9-45c2-b6d6-0f24‌b59e1991","filename"‌:"Reports.pdf"},"cre‌atedBy":1531,"create‌dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌bat.png","id":3041,"‌inheritedPermissions‌":false,"name":"map"‌,"permissions":[23,8‌7,35,49,65],"type":3‌,"viewLevel":2},{"__‌type":"WikiNode:http‌:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","children":[],"c‌ontent":
I want to get the "fileAssetId" and "filename" values. I've tried to load the line with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But i get the following 034b9317‌, 60d9, 45c2, b6d6, 0f24‌b59e1991
I'm not too sure how to get the data as it's displayed.

How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match, which only tries to match at the very beginning of the string.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
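Here is a short sketch of both calls with the two patterns above. The input is a trimmed, made-up version of the text from the question, so treat it as an illustration only.

```python
import re

# Shortened, hypothetical version of the single-line text from the question.
text = (
    '{"d":{"author":null,'
    '"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"}}'
)

asset_id = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
filename = re.compile(r'(?<="filename":").+?(?=")')

# findall returns the matched strings directly.
asset_ids = asset_id.findall(text)
print(asset_ids)  # ['034b9317-60d9-45c2-b6d6-0f24b59e1991']

# finditer yields match objects, which also carry positions.
names = [m.group() for m in filename.finditer(text)]
print(names)      # ['Reports.pdf']

# re.match only looks at the start of the string, so it finds nothing here.
print(asset_id.match(text))  # None
```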

You can use python's walk method and check each entry with re.match.
In case the string you have cannot be converted to a Python dict, you can use just a regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991

Try adding \n to the string that you are entering in to the file (\n means new line)

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party module: https://pypi.org/project/regex/

json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
# json_document_text holds the JSON text to match
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects rather than a whole document by replacing (?&document) with (?&object). I think the regex is simpler than I expected, but I did not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issue when running it.
