Rewrite a specific portion of a text file in Python

(1) I am using Python and would like to create a function that rewrites a portion of a text file. Referencing the sample example below, I would like to be able to delete everything from [Variables] onwards and write new content from that position. I can't figure out how to achieve this using any of seek(), truncate() and/or tell().
I'm thinking I may have to read and store the file's contents up to [Variables] and write that back in before appending the new content. Is there a better way to go about this?
(2) Bonus question: How would I do this if there was content beyond the variables section that I wanted to remain unchanged? This is currently not required, but it would be helpful to know for the future.
Sample Text File:
"[Log]
This happened
That happened
etc
[Variables]
Animals: [Dog, Cat]
Number: 4"

You can try using a regex:
import re
string = text  # text holds the file's contents
word = '[Variables]'
# Regex pattern matching all characters from '[Variables]' onwards.
# re.escape() keeps the square brackets from being read as a character class.
pattern = re.escape(word) + r".*"
# Remove '[Variables]' and everything after it; re.DOTALL lets '.' match newlines
string = re.sub(pattern, '', string, flags=re.DOTALL)
print(string)
If text is the sample shown in your question, the output of the code will be:
"[Log]
This happened
That happened
etc"
To add new text at the end, concatenate a new string to the existing one and write the result back to the file:
string += "Some Text"
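The question also asks about seek() and truncate(). A minimal sketch of that route, assuming the file is small enough to read into memory and that the marker appears at most once; the function name replace_from_marker and the demo file are hypothetical. The file is opened in binary mode so that the offset returned by find() is a valid byte position for seek().

```python
import os
import tempfile

def replace_from_marker(path, marker, new_content, encoding="utf-8"):
    """Cut the file at the first occurrence of marker and write new_content there."""
    with open(path, "r+b") as f:
        data = f.read()
        pos = data.find(marker.encode(encoding))
        if pos == -1:
            raise ValueError("marker not found: %r" % marker)
        f.seek(pos)      # jump to the start of the marker
        f.truncate()     # drop the marker and everything after it
        f.write(new_content.encode(encoding))

# Demo on a temporary file (hypothetical content based on the sample above).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"[Log]\nThis happened\netc\n[Variables]\nNumber: 4\n")
replace_from_marker(demo_path, "[Variables]", "[Variables]\nNumber: 5\n")
with open(demo_path, "rb") as f:
    result = f.read().decode("utf-8")
os.remove(demo_path)
print(result)
```

For the bonus question, where content after the section must survive, truncating in place no longer helps; one option is to read the whole file, splice the new section into the string, and write everything back.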

Related

python/regex copy paragraphs in order to another txt document

I'm working on what I initially thought would be a pretty simple program. Essentially, it should find key words then copy that paragraph to another document. What I want to do is take content from document 1 (both are .txt files) and re-order the paragraphs into a desired order.
I think I've written my python part correctly, as it works with other snippets (or seems to just fine), but the regex part (admittedly I'm very new to this) for some reason does not work.
I've tried a number of things and searched all through stack overflow. What I have currently "catches" almost the entire txt file instead of just the paragraph. This may be obvious but in addition to it catching most of the document, it's catching paragraphs without the target term (in this case, discussing) in it.
I appreciate all help in advance.
import re

def write_function():
    with open('minnar.txt', 'r') as rf, open('regexoutput.txt', 'a') as wf:
        content = rf.read()
        matches = target.findall(content)
        print(matches)
        for match in matches:
            wf.write(match + '\n \n')

target = re.compile('([^\']*(?=discussing)[^\']*)')
write_function()
If by paragraph you mean the text between quotes, then the regex should be as follows:
\'([^']+)\'
https://pythex.org/?regex=%5C%27(%5B%5E%27%5D%2B)%5C%27&test_string=This%20is%20%27the%20thing%27%20that%20I%20talked%20about.%20And%20I%20think%20this%20%27should%20be%20the%20one%20that%20they%20expected%27&ignorecase=0&multiline=0&dotall=1&verbose=0
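As a quick check, here is a small sketch that applies the same pattern with re.findall to the test string from the pythex link. The variable names are only for illustration.

```python
import re

# Test string taken from the pythex link above.
sample = ("This is 'the thing' that I talked about. "
          "And I think this 'should be the one that they expected'")

# Capture every run of non-quote characters enclosed in single quotes.
quoted = re.findall(r"'([^']+)'", sample)
print(quoted)
```

Because findall returns only the captured group, the surrounding quotes are not part of the results.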

Python, eliminating lines within angle brackets with regex

I'm writing a python script to assign grammatical categories to words in several text files. In each text file, I have file headers within angle brackets <>. Throughout the texts there are also additional lines with information such as time stamps, page numbers, and questions from the transcriber. I want to remove these lines. This is basically what the text files look like:
<title Titipuru Supay>
<speaker name>
<sex female>
<dialect Pastaza>
<register narrative>
<contributor name>
chan; payguna serenkya man chiga;
<ima?>
payguna kirina man, chiga, mana
shayachira; ninagunan shi tujsirani nira:
illaparani nira shi illapay
<173>
pasasha, ima shi kasna nin, nisha,
Even though there are the same number of headers in each file, the other <> material varies, so I can't just eliminate specific lines. So I thought I'd try something simple like a re.sub statement that removes everything in between <> and including the brackets.
with open(file, encoding='utf-8') as file_in:
    text = file_in.read()
    re.sub(r"<.*>", " ", text)
I tried <.*> on pythex.org and regex101, and it worked in both places with a test string, but not in my script (yes, I have import re). I also tried other solutions like: \<.*\>
Am I just not getting the regex right, or is there something deeper here?
From what I understand, you may have several <...> on the same line. In this case, you are much safer with a negated character class solution:
text = re.sub(r"<[^>]*>", " ", text)
The text variable, of course, should be updated as Python strings are immutable, and the regex is now matching <, then zero or more characters other than >, and then >.
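To illustrate the difference, here is a small sketch comparing the two patterns on a hypothetical line that contains two bracketed parts. It also shows that the result of re.sub has to be assigned to a name to be kept.

```python
import re

line = "<a> keep this <b>"  # hypothetical line with two <...> parts

# Greedy: '.*' runs from the first '<' to the last '>', swallowing the text between.
greedy = re.sub(r"<.*>", " ", line)

# Negated character class: each match stops at the first '>' it meets.
negated = re.sub(r"<[^>]*>", " ", line)

print(repr(greedy))
print(repr(negated))
```

The original line is unchanged after both calls, since re.sub returns a new string.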
Strings are immutable, meaning they cannot be modified, only reassigned. The re.sub(...) is working, but it's returning a new string. Try this:
text = re.sub(r"<.*>", " ", text)
If this still doesn't work, please give us more information about your problem

Need help finding the correct regex pattern for my string pattern

I'm terrible with RegEx patterns, and I'm writing a simple python program that requires splitting lines of a file into a 'content' part and a 'tags' part, and then further splitting the tags parts into individual tags. Here's a simple example of what one line of my file might look like:
The Beatles <music,rock,60s,70s>
I've opened my file and begun reading lines like this:
def Load(self, filename):
    file = open(filename, 'r')
    for line in file:
        # Ignore comments and empty lines.
        if not line.startswith('#') and line.strip():
            #...
Forgive my likely terrible Python, it's my first few days with the language. Anyway, next I was thinking it would be useful to use a regex to break my string into sections - with a variable to store the 'content' (for example, "The Beatles"), and a list/set to store each of the tags. As such, I need a regex (or two?) that can:
Split the raw part from the <> part.
And split the tags part into a list based on the commas.
Finally, I want to make sure that the content part retains its capitalization and inner spacing. But I want to make sure the tags are all lower-case and without white space.
I'm wondering if any of the regex experts out there can help me find the correct pattern(s) to achieve my goals here?
This is a solution that gets around the problem without using a regex, by relying on multiple splits.
# This separates the string into the content and the remainder
content, tagStr = line.split('<')
# This splits tagStr into individual tags. strip() drops the trailing newline
# and rstrip('>') removes the closing '>'
tags = tagStr.strip().rstrip('>').split(',')
print content
print tags
The problem with this is that it leaves trailing whitespace after the content.
You can remove it with:
content = content.rstrip()
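If a regex is preferred after all, the following is a minimal sketch in Python 3. The helper name parse_line and the pattern are assumptions made for illustration, and the sketch assumes each line holds exactly one <...> group at the end. The content keeps its capitalization and inner spacing, while each tag is lower-cased and stripped of surrounding whitespace.

```python
import re

# Content is everything before '<'; tags are everything between '<' and '>'.
LINE_RE = re.compile(r"^(?P<content>[^<]*)<(?P<tags>[^>]*)>\s*$")

def parse_line(line):
    """Return (content, tags) for a line, or None if it has no <...> part."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    content = m.group("content").strip()
    tags = [t.strip().lower() for t in m.group("tags").split(",") if t.strip()]
    return content, tags

parsed = parse_line("The Beatles <Music, rock,60s,70s>\n")
print(parsed)
```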

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is print all the text in "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
import json
d = json.loads(text[1:])
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
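The first side note, about escaping the dot, can be seen in a small sketch. The input line here is hypothetical and chosen so that "tr" also occurs before the real ".tr" ending.

```python
import re

# Hypothetical line where "tr" appears before the final ".tr".
text = '{"JOL":"abcXtrdef.tr","LAPTOP":"error"}'

# Unescaped '.' matches any character, so the match ends at "Xtr".
unescaped = re.search('JOL":"(.+?).tr', text).group(1)

# Escaped '\.' in a raw string only matches a literal full stop.
escaped = re.search(r'JOL":"(.+?)\.tr', text).group(1)

print(unescaped)
print(escaped)
```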
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

Python - RegEx match does not write to file if it contains full-stop

I'm trying to use a RegEx expression in a Python script in order to find specific variables within a webpage. I then export this using a csv file. However, if the found group contains a full-stop, it does not export at all. How do I remedy this?
In this webpage, the item displayed changes depending on a code inputted. My script automates the inputting of codes, and then records the item produced. Here are the relevant parts of my code:
import re
regName = r'The item name is (.*?)\.'
response = opener.open(
    'http://website.com/webpage.php' + itemValues)
html = response.read()
responseDecode = html.decode('utf8')
name = re.findall(regName, responseDecode)
#Convert stuff to Unicode
uniName = name[0].encode('utf8', 'replace')
with open("readable.txt", "a") as file:
    file.write("\n"*2)
    file.write(uniName + '\n')
Of note, I convert to unicode because some of the item names contain accented characters.
EDIT: an example of something that would not work would be, for instance, R.O.B.O.T . All that would be written would be R
Try using regName = r'The item name is (.*?)\.$'. The $ anchors the match at the end of the string, so the earlier full stops are not treated as the end of the name. Right now the (.*?) group is non-greedy, so the match stops at the first full stop.
Or, if the string doesn't end right there, try adding a space or some other following character after the \. in the pattern. You need to specify the kind of character that marks the end of the item string.
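A small sketch of the second suggestion, assuming the item name is followed by a full stop and then whitespace or the end of the text. The page text is hypothetical, and the lookahead is one possible way to say "a following character" without consuming it.

```python
import re

page = "The item name is R.O.B.O.T. Enjoy!"  # hypothetical page text

# Original pattern: the lazy group stops at the very first full stop.
lazy = re.findall(r"The item name is (.*?)\.", page)

# Require whitespace or end of text after the closing full stop.
bounded = re.findall(r"The item name is (.*?)\.(?=\s|$)", page)

print(lazy)
print(bounded)
```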
