Python: Multiple text processings from stdin - python

I want to write a Python script that will be integrated into a shell pipeline, so it has to read its input text from standard input and print the result to standard output. I have several text-processing steps to chain. Some apply to each line; others first need to detect blocks of lines before changing them.
I wrote one loop per processing step, but I don't see how to chain them so that each loop takes the output of the previous one as its input.
My first draft is below.
As I'm used to writing shell scripts, I have the feeling that I will have to work with temporary files, but I'm not sure that's the way to go in Python.
I also assume it would be nicer to put each loop's processing in a function.
#!/usr/bin/python3
""" Sample of pre-processing formating script """
import fileinput
import re
import sys
""" Read StdIn """
lines_in = fileinput.input()
lines_out = ""
preform_txt_regex = re.compile(r"^  ")
code_block = ""
"""
Walk through the input and replace the 'preformatted text' (starting with 2 spaces)
into 'Fixed width text' (<code>…</code>).
"""
for line in lines_in:
    if line.startswith("  "):
        code_block = code_block + preform_txt_regex.sub('', line)
    else:
        if code_block != "":
            lines_out = lines_out + "<syntaxhighlight lang='shell'>\n{}</syntaxhighlight>\n".format(code_block)
            code_block = ""
        sys.stdout.write(line)
        lines_out = lines_out + line
# Reset lines_in and lines_out
lines_in = lines_out.split("\n")
lines_out = ""
"""
Remove the all 'Category' tags
"""
for line in lines_in:
    lines_out = lines_out + re.sub(r'\[\[Cat[ée]gor.*:[^\]]*]]', r'', line)
"""
Few other string substitution
"""
for line in lines_in:
    [...]
""" Print the processed texts """
sys.stdout.write(lines_out)

@Steve and @ibra suggested using a list of lines as a buffer variable, and I got something working that way.
Here is my revised code. I moved the processing loops into functions that take the list of lines as a parameter:
#!/usr/bin/python3
# -*- encoding: utf-8 -*-
""" Sample of pre-processing formating script """
import fileinput
import re
import sys
"""
Walk through the input and replace the 'preformatted text' (starting with 2 spaces)
into 'Fixed width text' (<code>…</code>).
"""
def render_code_block(lines):
    preform_txt_regex = re.compile(r"^  ")
    code_block = []
    output = []
    for line in lines:
        if line.startswith("  "):
            code_block.append(preform_txt_regex.sub('', line))
        else:
            if code_block != []:
                # Join the collected lines; formatting the list itself would print its repr
                output.append("<syntaxhighlight lang='shell'>\n{}</syntaxhighlight>\n".format("".join(code_block)))
                code_block = []
            output.append(line)
    return output
"""
Remove the all 'Category' tags
"""
def remove_category_tags(lines):
    output = []
    for line in lines:
        output.append(re.sub(r'\[\[Cat[ée]gor.*:[^\]]*]]', r'', line))
    return output
""" Main """
lines_buffer = []
lines_buffer = fileinput.input()
lines_buffer = render_code_block(lines_buffer)
lines_buffer = remove_category_tags(lines_buffer)
for line in lines_buffer:
    sys.stdout.write(line)
And of course, I had to replace the string initialization (= "") by list initialization (= []) and append in place of concatenate (+).
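As a variation on the list-buffer approach, the same chain can be sketched with generator functions, so each stage pulls lines from the previous one without building intermediate lists. This is only a sketch: the names render_code_blocks, run_pipeline and main are made up for illustration, and it assumes preformatted lines start with exactly two spaces. Unlike the code above, it also flushes a code block that runs to the end of the input.

```python
import re
import sys

# Same pattern as in the question
CATEGORY_RE = re.compile(r'\[\[Cat[ée]gor.*:[^\]]*]]')

def render_code_blocks(lines):
    """Wrap runs of lines indented by two spaces in <syntaxhighlight> tags."""
    block = []
    for line in lines:
        if line.startswith("  "):
            block.append(line[2:])  # drop the two-space prefix
            continue
        if block:
            yield "<syntaxhighlight lang='shell'>\n{}</syntaxhighlight>\n".format("".join(block))
            block = []
        yield line
    if block:  # flush a block that runs to the end of the input
        yield "<syntaxhighlight lang='shell'>\n{}</syntaxhighlight>\n".format("".join(block))

def remove_category_tags(lines):
    """Strip [[Category:...]] tags from every line."""
    for line in lines:
        yield CATEGORY_RE.sub("", line)

def run_pipeline(lines, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        lines = stage(lines)
    return lines

def main():
    """Read stdin, apply every stage, write stdout."""
    stages = [render_code_blocks, remove_category_tags]
    sys.stdout.writelines(run_pipeline(sys.stdin, stages))
```

Calling main() at the end of the script wires the chain to stdin and stdout, so it can sit in a shell pipeline. Adding another processing step then means writing one more generator and appending it to the stages list.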

Related

How to replace a string in a file?

I have 2 numbers in two similar files. There is a new.txt and original.txt. They both have the same string in them except for a number. The new.txt has a string that says boothNumber="3". The original.txt has a string that says boothNumber="1".
I want to be able to read the new.txt, pick the number 3 out of it and replace the number 1 in original.txt.
Any suggestions? Here is what I am trying.
import re # used to replace string
import sys # some of these are use for other code in my program
def readconfig():
    with open("new.text") as f:
        with open("original.txt", "w") as f1:
            for line in f:
                match = re.search(r'(?<=boothNumber=")\d+', line)
            for line in f1:
                pattern = re.search(r'(?<=boothNumber=")\d+', line)
                if re.search(pattern, line):
                    sys.stdout.write(re.sub(pattern, match, line))
When I run this, my original.txt gets completely cleared of any text.
I did a traceback and I get this:
in readconfig
for line in f1:
io.UnsupportedOperation: not readable
UPDATE
I tried:
def readconfig(original_txt_path="original.txt",
               new_txt_path="new.txt"):
    with open(new_txt_path) as f:
        for line in f:
            if not ('boothNumber=') in line:
                continue
            booth_number = int(line.replace('boothNumber=', ''))
            # do we need check if there are more than one 'boothNumber=...' line?
            break
    with open(original_txt_path) as f1:
        modified_lines = [line.startswith('boothNumber=') if not line
                          else 'boothNumber={}'.format(booth_number)
                          for line in f1]
    with open(original_txt_path, mode='w') as f1:
        f1.writelines(modified_lines)
And I get error:
booth_number = int(line.replace('boothNumber=', ''))
ValueError: invalid literal for int() with base 10: '
(workstationID="1" "1" window=1= area="" extra parts of the line here)\n
the "1" after workstationID="1" is where the boothNumber=" " would normally go. When I open up original.txt, I see that it actually did not change anything.
UPDATE 3
Here is my code in full. Note, the file names are changed but I'm still trying to do the same thing. This is another idea or revision I had that is still not working:
import os
import shutil
import fileinput
import re # used to replace string
import sys # prevents extra lines being inputed in config
# example: sys.stdout.write
def convertconfig(pattern):
    source = "template.config"
    with fileinput.FileInput(source, inplace=True, backup='.bak') as file:
        for line in file:
            match = r'(?<=boothNumber=")\d+'
            sys.stdout.write(re.sub(match, pattern, line))
def readconfig():
    source = "bingo.config"
    pattern = r'(?<=boothNumber=")\d+' # !!!!!!!!!! This probably needs fixed
    with fileinput.FileInput(source, inplace=True, backup='.bak') as file:
        for line in file:
            if re.search(pattern, line):
                fileinput.close()
                convertconfig(pattern)
def copyfrom(servername):
    source = r'//' + servername + '/c$/remotedirectory'
    dest = r"C:/myprogram"
    file = "bingo.config"
    try:
        shutil.copyfile(os.path.join(source, file), os.path.join(dest, file))
    except:
        print ("Error")
    readconfig()
# begin here
os.system('cls' if os.name == 'nt' else 'clear')
array = []
with open("serverlist.txt", "r") as f:
    for servername in f:
        copyfrom(servername.strip())
bingo.config is my new file
template.config is my original
It's replacing the number in template.config with the literal string "r'(?<=boothNumber=")\d+'"
So template.config ends up looking like
boothNumber="r'(?<=boothNumber=")\d+'"
instead of
boothNumber="2"
To find the boothNumber value we can use the following regular expression (checked with regex101):
(?<=\sboothNumber=\")(\d+)(?=\")
Something like this should work:
import re
import sys # some of these are use for other code in my program
# Raw string so that \s and \d reach the regex engine unchanged
BOOTH_NUMBER_RE = re.compile(r'(?<=\sboothNumber=\")(\d+)(?=\")')
search_booth_number = BOOTH_NUMBER_RE.search
replace_booth_number = BOOTH_NUMBER_RE.sub
def readconfig(original_txt_path="original.txt",
               new_txt_path="new.txt"):
    with open(new_txt_path) as f:
        for line in f:
            search_res = search_booth_number(line)
            if search_res is None:
                continue
            booth_number = int(search_res.group(0))
            # do we need check if there are more than one 'boothNumber=...' line?
            break
        else:
            # no 'boothNumber=...' line was found, so next lines will fail,
            # maybe we should raise exception like
            # raise Exception('no line starting with "boothNumber" was found')
            # or assign some default value
            # booth_number = -1
            # or just return?
            return
    with open(original_txt_path) as f:
        modified_lines = []
        for line in f:
            search_res = search_booth_number(line)
            if search_res is not None:
                line = replace_booth_number(str(booth_number), line)
            modified_lines.append(line)
    with open(original_txt_path, mode='w') as f:
        f.writelines(modified_lines)
Test
# Preparation
with open('new.txt', mode='w') as f:
    f.write('some\n')
    f.write('<jack Fill workstationID="1" boothNumber="56565" window="17" Code="" area="" section="" location="" touchScreen="False" secureWorkstation="false">')
with open('original.txt', mode='w') as f:
    f.write('other\n')
    f.write('<jack Fill workstationID="1" boothNumber="23" window="17" Code="" area="" section="" location="" touchScreen="False" secureWorkstation="false">')
# Invocation
readconfig()
# Checking output
with open('original.txt') as f:
    for line in f:
        # stripping newline character
        print(line.rstrip('\n'))
gives
other
<jack Fill workstationID="1" boothNumber="56565" window="17" Code="" area="" section="" location="" touchScreen="False" secureWorkstation="false">
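The literal pattern ended up in template.config in UPDATE 3 because re.sub takes its arguments as (pattern, replacement, string), so the regex source was passed as the replacement text. A minimal sketch of the intended two-step logic on plain strings, first search for the number and then substitute the matched digits, might look like this. The function name copy_booth_number is hypothetical, and the sketch assumes the attribute is always written as boothNumber="digits" preceded by whitespace.

```python
import re

# Digits between boothNumber=" and the closing quote
BOOTH_RE = re.compile(r'(?<=\sboothNumber=")\d+(?=")')

def copy_booth_number(new_text, original_text):
    """Return original_text with its booth number replaced by the one in new_text."""
    found = BOOTH_RE.search(new_text)
    if found is None:
        # Nothing to copy, leave the original untouched
        return original_text
    # The replacement is the matched digits, not the pattern
    return BOOTH_RE.sub(found.group(0), original_text)
```

Working on strings keeps the regex logic separate from the file handling, which makes it easier to check before any file is overwritten.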

Python should return more than 1 argument

Hello, I'm a beginner in Python and I'm trying to run this program, which creates an inverted index for a collection file:
import sys
import re
from porterStemmer import PorterStemmer
from collections import defaultdict
from array import array
import gc
porter=PorterStemmer()
class CreateIndex:
    def __init__(self):
        self.index=defaultdict(list) #the inverted index
    def getStopwords(self):
        '''get stopwords from the stopwords file'''
        f=open(self.stopwordsFile, 'r')
        stopwords=[line.rstrip() for line in f]
        self.sw=dict.fromkeys(stopwords)
        f.close()
    def getTerms(self, line):
        '''given a stream of text, get the terms from the text'''
        line=line.lower()
        line=re.sub(r'[^a-z0-9 ]',' ',line) #put spaces instead of non-alphanumeric characters
        line=line.split()
        line=[x for x in line if x not in self.sw] #eliminate the stopwords
        line=[ porter.stem(word, 0, len(word)-1) for word in line]
        return line
    def parseCollection(self):
        ''' returns the id, title and text of the next page in the collection '''
        doc=[]
        for line in self.collFile:
            if line=='</page>\n':
                break
            doc.append(line)
        curPage=''.join(doc)
        pageid=re.search('<id>(.*?)</id>', curPage, re.DOTALL)
        pagetitle=re.search('<title>(.*?)</title>', curPage, re.DOTALL)
        pagetext=re.search('<text>(.*?)</text>', curPage, re.DOTALL)
        if pageid==None or pagetitle==None or pagetext==None:
            return {}
        d={}
        d['id']=pageid.group(1)
        d['title']=pagetitle.group(1)
        d['text']=pagetext.group(1)
        return d
    def writeIndexToFile(self):
        '''write the inverted index to the file'''
        f=open(self.indexFile, 'w')
        for term in self.index.iterkeys():
            postinglist=[]
            for p in self.index[term]:
                docID=p[0]
                positions=p[1]
                postinglist.append(':'.join([str(docID) ,','.join(map(str,positions))]))
            print >> f, ''.join((term,'|',';'.join(postinglist)))
        f.close()
    def getParams(self):
        '''get the parameters stopwords file, collection file, and the output index file'''
        param=sys.argv
        self.stopwordsFile=param[0]
        self.collectionFile=param[1]
        self.indexFile=param[2]
    def createIndex(self):
        '''main of the program, creates the index'''
        self.getParams()
        self.collFile=open(self.collectionFile,'r')
        self.getStopwords()
        #bug in python garbage collector!
        #appending to list becomes O(N) instead of O(1) as the size grows if gc is enabled.
        gc.disable()
        pagedict={}
        pagedict=self.parseCollection()
        #main loop creating the index
        while pagedict != {}:
            lines='\n'.join((pagedict['title'],pagedict['text']))
            pageid=int(pagedict['id'])
            terms=self.getTerms(lines)
            #build the index for the current page
            termdictPage={}
            for position, term in enumerate(terms):
                try:
                    termdictPage[term][1].append(position)
                except:
                    termdictPage[term]=[pageid, array('I',[position])]
            #merge the current page index with the main index
            for termpage, postingpage in termdictPage.iteritems():
                self.index[termpage].append(postingpage)
            pagedict=self.parseCollection()
        gc.enable()
        self.writeIndexToFile()
if __name__=="__main__":
    c=CreateIndex()
    c.createIndex()
and it says that there is only one argument in sys.argv.
How should I pass the other arguments?
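In the getParams function, you can see that your code reads three entries from sys.argv.
When you call your program like this:
python your_program.py
# sys.argv[0] = 'your_program.py'
sys.argv has only one entry, the script name. So you need to pass two more:
python your_program.py arg_1 arg_2
# sys.argv[0] = 'your_program.py'
# sys.argv[1] = 'arg_1'
# sys.argv[2] = 'arg_2'
Note that sys.argv[0] is always the script itself, so as written getParams would use the script as the stopwords file. To pass all three files on the command line, read them from sys.argv[1], sys.argv[2] and sys.argv[3].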
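As a rough sketch of reading the three file names while skipping the script name in sys.argv[0], something like the following could replace getParams. The function name get_params and the usage message are illustrative only and not part of the original program.

```python
import sys

def get_params(argv):
    """Return (stopwords_file, collection_file, index_file) from an argv-style list."""
    # argv[0] is the script name, so three parameters mean four entries
    if len(argv) != 4:
        raise SystemExit("usage: {} STOPWORDS COLLECTION INDEX".format(argv[0]))
    return argv[1], argv[2], argv[3]
```

It would be called as get_params(sys.argv), and the script would be started with three arguments after the script name.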

Copy complete sentences from text file and add to list

I am trying to extract complete sentences from a long text file and add them as strings to a list in Python 2.7. I want to automate this rather than cutting and pasting them into the list.
Here is what I have:
from sys import argv
script, filename = argv # script = alien.py; filename = roswell.txt
listed = []
text = open(filename, 'rw')
for i in text:
    lines = readline(i)
    listed.append(lines)
print listed
text.close()
Nothing loads to the list.
You can do it with a while loop:
listed = []
with open(filename,"r") as text:
    Line = text.readline()
    while Line!='':
        listed.append(Line)
        Line = text.readline()
print listed
In the previous example, I assumed that each sentence is written on its own line. If that's not the case, use this code instead:
listed = []
with open(filename,"r") as text:
    Line = text.readline()
    while Line!='':
        Line1 = Line.split(".")
        for Sentence in Line1:
            listed.append(Sentence)
        Line = text.readline()
print listed
As a side note, try using with open(...) as text: instead of text = open(...), so the file is closed automatically.
Sentences are normally separated by '. ' rather than '\n'. In that case, read the whole file and split on period plus space:
listed = []
fd = open(filename,"r")
try:
    data = fd.read()
    sentences = data.split(". ")
    for sentence in sentences:
        listed.append(sentence)
    print listed
finally:
    fd.close()
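Both answers drop the period and treat line breaks inside a sentence as part of it. A possible refinement, sketched here in Python 3 rather than the question's Python 2.7, collapses line breaks first and then splits after sentence-ending punctuation. The helper name split_sentences is made up, and the rule "a sentence ends with '.', '!' or '?' followed by a space" is a simplifying assumption that will mis-split abbreviations such as "Dr. Smith".

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    # Collapse line breaks so sentences wrapped across lines stay whole
    flat = " ".join(text.split())
    # Split on a space that follows ., ! or ?; the lookbehind keeps the punctuation
    return [s for s in re.split(r"(?<=[.!?]) ", flat) if s]
```

With a file, the call would be split_sentences(open_file.read()) inside a with open(filename) as open_file: block.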

How do I search a file for a string and replace it with multiple lines in Python?

I am running Python 3.5.1
I have a text file that I'm trying to search through and replace or overwrite text if it matches a predefined variable. Below is a simple example:
test2.txt
A Bunch of Nonsense Stuff
############################
# More Stuff Goes HERE #
############################
More stuff here
Outdated line of information that has no comment above - message_label
The last line in this example needs to be overwritten so the new file looks like below:
test2.txt after script
A Bunch of Nonsense Stuff
############################
# More Stuff Goes HERE #
############################
More stuff here
# This is an important line that needs to be copied
Very Important Line of information that the above line is a comment for - message_label
The function I have written idealAppend does not work as intended and subsequent executions create a bit of a mess. My workaround has been to separate the two lines into single line variables but this doesn't scale well. I want to use this function throughout my script with the ability to handle any number of lines. (if that makes sense)
Script
#!/usr/bin/env python3
import sys, fileinput, os
def main():
    file = 'test2.txt'
    fullData = r'''
# This is an important line that needs to be copied
Very Important Line of information that the above line is a comment for - message_label
'''
    idealAppend(file, fullData)
def idealAppend(filename, data):
    label = data.split()[-1] # Grab last word of the Append String
    for line in fileinput.input(filename, inplace=1, backup='.bak'):
        if line.strip().endswith(label) and line != data: # If a line 2 exists that matches the last word (label)
            line = data # Overwrite with new line, comment, new line, and append data.
        sys.stdout.write(line) # Write changes to current line
    with open(filename, 'r+') as file: # Open File with rw permissions
        line_found = any(data in line for line in file) # Search if Append exists in file
        if not line_found: # If data does NOT exist
            file.seek(0, os.SEEK_END) # Goes to last line of the file
            file.write(data) # Write data to the end of the file
if __name__ == "__main__": main()
Workaround Script
This seems to work perfectly as long as I only need to write exactly two lines. I'd love this to be more dynamic when it comes to number of lines so I can reuse the function easily.
#!/usr/bin/env python3
import sys, fileinput, os
def main():
    file = 'test2.txt'
    comment = r'# This is an important line that needs to be copied'
    append = r'Very Important Line of information that the above line is a comment for - message_label'
    appendFile(file, comment, append)
def appendFile(filename, comment, append):
    label = append.split()[-1] # Grab last word of the Append String
    for line in fileinput.input(filename, inplace=1, backup='.bak'):
        if line.strip().endswith(label) and line != append: # If a line 2 exists that matches the last word (label)
            line = '\n' + comment + '\n' + append # Overwrite with new line, comment, new line, and append data.
        sys.stdout.write(line) # Write changes to current line
    with open(filename, 'r+') as file: # Open File with rw permissions
        line_found = any(append in line for line in file) # Search if Append exists in file
        if not line_found: # If data does NOT exist
            file.seek(0, os.SEEK_END) # Goes to last line of the file
            file.write('\n' + comment + '\n' + append) # Write data to the end of the file
if __name__ == "__main__": main()
I am very new to Python so I'm hoping there is a simple solution that I overlooked. I thought it might make sense to try and split the fullData variable at the new line characters into a list or tuple, filter the label from the last item in the list, then output all entries but this is starting to move beyond what I've learned so far.
If I understand your issue correctly, you can just open the input and output files, then, for each line, check whether it ends with the label and does not already contain the new information, and write the appropriate content accordingly.
with open('in.txt') as f, open('out.txt', 'w') as output:
    for line in f:
        # rstrip() drops the trailing newline so endswith() can match the label
        if line.rstrip().endswith(label) and not line.startswith(new_info):
            output.write(replacement_text)
        else:
            output.write(line)
If you want to update the original file instead of creating a second one, it's easiest to just delete the original and rename the new one instead of trying to modify it in place.
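The write-then-rename idea can be sketched as follows, assuming Python 3.3 or later for os.replace. The function name replace_labelled_line is hypothetical, and the sketch assumes replacement_text ends with a newline so that a second run recognises the lines it wrote earlier and leaves them alone.

```python
import os
import tempfile

def replace_labelled_line(path, label, replacement_text):
    """Rewrite path, swapping each outdated line that ends with label for replacement_text."""
    # Lines already present from an earlier run must not be replaced again
    new_lines = replacement_text.splitlines(keepends=True)
    directory = os.path.dirname(os.path.abspath(path))
    # Write next to the original so the final rename stays on one filesystem
    with open(path) as src, tempfile.NamedTemporaryFile(
            "w", dir=directory, delete=False) as dst:
        for line in src:
            if line.rstrip().endswith(label) and line not in new_lines:
                dst.write(replacement_text)
            else:
                dst.write(line)
    # Rename the temporary file over the original
    os.replace(dst.name, path)
```

Because the replacement is a single string, it can hold any number of lines, which addresses the scaling concern in the question.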
Is this what you are looking for? It looks for a label and then replaces the whole line with whatever you want.
test2.txt
A Bunch of Nonsense Stuff
############################
# More Stuff Goes HERE #
############################
More stuff here
Here is to be replaced - to_replace
script.py
#!/usr/bin/env python3
def main():
    file = 'test2.txt'
    label_to_modify = "to_replace"
    replace_with = "# Blabla\nMultiline\nHello"
    """
    # Raw string stored in a file
    file_replace_with = 'replace_with.txt'
    with open(file_replace_with, 'r') as f:
        replace_with = f.read()
    """
    appendFile(file, label_to_modify, replace_with)
def appendFile(filename, label_to_modify, replace_with):
    new_file = []
    with open(filename, 'r') as f:
        for line in f:
            if len(line.split()) > 0 and line.split()[-1] == label_to_modify:
                new_file.append(replace_with)
            else:
                new_file.append(line)
    with open(filename + ".bak", 'w') as f:
        f.write(''.join(new_file))
if __name__ == "__main__": main()
test2.txt.bak
A Bunch of Nonsense Stuff
############################
# More Stuff Goes HERE #
############################
More stuff here
# Blabla
Multiline
Hello
After reading both answers, I've come up with the following as the best solution I can get to work. It seems to do everything I need. Thanks, everyone.
#!/usr/bin/env python3
def main():
    testConfFile = 'test2.txt' # /etc/apache2/apache2.conf
    testConfLabel = 'timed_combined'
    testConfData = r'''###This is an important line that needs to be copied - ##-#-####
Very Important Line of information that the above line is a \"r\" comment for - message_label'''
    testFormatAppend(testConfFile, testConfData, testConfLabel) # Add new test format
def testFormatAppend(filename, data, label):
    dataSplit = data.splitlines()
    fileDataStr = ''
    with open(filename, 'r') as file:
        fileData = stringToDictByLine(file)
    for key, val in fileData.items():
        for row in dataSplit:
            if val.strip().endswith(row.strip().split()[-1]):
                fileData[key] = ''
    fileLen = len(fileData)
    if fileData[fileLen] == '':
        fileLen += 1
        fileData[fileLen] = data
    else:
        fileLen += 1
        fileData[fileLen] = '\n' + data
    for key, val in fileData.items():
        fileDataStr += val
    with open(filename, 'w') as file:
        file.writelines(str(fileDataStr))
def stringToDictByLine(data):
    fileData = {}
    i = 1
    for line in data:
        fileData[i] = line
        i += 1
    return fileData
if __name__ == "__main__": main()

python replace backslash

I'm trying to implement a simple helper class to interact with Java properties files. While fiddling with multiline properties I ran into a problem that I can't solve; maybe you can?
The unit test in the class first writes a multiline property spanning two lines to the properties file, then re-reads it and checks for equality. That works. But if I use the class to add a third line to the property, it is re-read with additional backslashes that I can't explain.
Here is my code:
#!/usr/bin/env python3
# -*- coding=UTF-8 -*-
import codecs
import os, re
import fileinput
import unittest
class ConfigParser:
    reProp = re.compile(r'^(?P<key>[\.\w]+)=(?P<value>.*?)(?P<ext>[\\]?)$')
    rePropExt = re.compile(r'(?P<value>.*?)(?P<ext>[\\]?)$')
    files = []
    def __init__(self, pathes=[]):
        for path in pathes:
            if os.path.isfile(path):
                self.files.append(path)
    def getOptions(self):
        result = {}
        key = ''
        val = ''
        with fileinput.input(self.files, inplace=False) as fi:
            for line in fi:
                m = self.reProp.match(line.strip())
                if m:
                    key = m.group('key')
                    val = m.group('value')
                    result[key] = val
                else:
                    m = self.rePropExt.match(line.rstrip())
                    if m:
                        val = '\n'.join((val, m.group('value')))
                        result[key] = val
            fi.close()
        return result
    def setOptions(self, updates={}):
        options = self.getOptions()
        options.update(updates)
        with fileinput.input(self.files, inplace=True) as fi:
            for line in fi:
                m = self.reProp.match(line.strip())
                if m:
                    key = m.group('key')
                    nval = options[key]
                    nval = nval.replace('\n', '\\\n')
                    print('{}={}'.format(key,nval))
            fi.close()
class test(unittest.TestCase):
    files = ['test.properties']
    props = {'test.m.a' : 'Johnson\nTanaka'}
    def setUp(self):
        for file in self.files:
            f = codecs.open(file, encoding='utf-8', mode='w')
            for key in self.props.keys():
                val = self.props[key]
                val = re.sub('\n', '\\\n', val)
                f.write(key + '=' + val)
            f.close()
    def tearDown(self):
        pass
    def test_read(self):
        c = ConfigParser(self.files)
        for file in self.files:
            for key in self.props.keys():
                result = c.getOptions()
                self.assertEqual(result[key],self.props[key])
    def test_write(self):
        c = ConfigParser(self.files)
        changes = {}
        for key in self.props.keys():
            changes[key] = self.change_value(self.props[key])
        c.setOptions(changes)
        result = c.getOptions()
        print('changes: ')
        print(changes)
        print('result: ')
        print(result)
        for key in changes.keys():
            self.assertEqual(result[key],changes[key],msg=key)
    def change_value(self, value):
        return 'Smith\nJohnson\nTanaka'
if __name__ == '__main__':
    unittest.main()
Output of the testrun:
C:\pyt>propertyfileparser.py
changes:
{'test.m.a': 'Smith\nJohnson\nTanaka'}
result:
{'test.m.a': 'Smith\nJohnson\\\nTanaka'}
Any hints welcome...
Since you are adding a backslash in front of new-lines when you are writing, you have to also remove them when you are reading. Uncommenting the line that substitutes '\n' with '\\n' solves the problem, but I expect this also means the file syntax is incorrect.
This happens only with the second line break, because you separate the value into an "oval" and an "nval" where the "oval" is the first line, and the "nval" the rest, and you only do the substitution on the nval.
It's also overkill to use regexp replacing to replace something that isn't a regexp. You can use val.replace('\n', '\\n') instead.
I'd do this parser very differently. Well, first of all, I wouldn't do it at all, I'd use an existing parser, but if I did, I'd read the file, line by line, while handling the line continuation issue, so that I had exactly one value per item in a list. Then I'd parse each item into a key and a value with a regexp, and stick that into a dictionary.
You instead parse each line separately and join continuation lines to the values after parsing, which IMO is completely backwards.
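The order of operations described in this answer, first merge continuation lines into one logical line and only then split each item into key and value, might be sketched like this. The names join_continuations and parse_properties are illustrative, and the sketch follows the question's convention that a trailing backslash stands for a line break inside the value, which is an assumption rather than the full Java properties format.

```python
import re

# DOTALL lets the value span the newlines inserted for continuation lines
PROP_RE = re.compile(r"^(?P<key>[.\w]+)=(?P<value>.*)$", re.DOTALL)

def join_continuations(lines):
    """Yield logical lines, merging physical lines that end with a backslash."""
    parts = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.endswith("\\"):
            parts.append(line[:-1])  # drop the continuation backslash
            continue
        parts.append(line)
        yield "\n".join(parts)
        parts = []
    if parts:  # the input ended on a continuation line
        yield "\n".join(parts)

def parse_properties(lines):
    """Build a dict with exactly one value per key from property-file lines."""
    options = {}
    for item in join_continuations(lines):
        m = PROP_RE.match(item)
        if m:
            options[m.group("key")] = m.group("value")
    return options
```

Because the backslash is removed in one place while reading, the writer can add it back in one place too, and values with any number of lines round-trip the same way.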
