Python: Remove everything before certain chars - python

I have several files to work on. They are XML files, but before <?xml version="1.0"?> there are some debugging and status lines coming from the command line. Since I'd like to parse the files, these lines must be removed. How can I do this? Preferably in place, i.e. the filename stays the same.
Thanks for any help.

An inefficient solution would be to read the whole contents and find where the declaration occurs:
file_name = "yourfile.xml"
marker = '<?xml version="1.0"?>'  # adjust to match the declaration in your files exactly
with open(file_name, 'r+') as f:
    contents = f.read()
    start = contents.find(marker)
    if start != -1:  # leave the file alone if the marker is missing
        f.seek(0)
        f.write(contents[start:])
        f.truncate()
The file will now contain the original file's contents from the XML declaration onwards.

What about trimming the file headers as you read the file?
import xml.etree.ElementTree as et
with open("input.xml", "rb") as inf:
    # find starting point
    offset = 0
    for line in inf:
        # bytes literal, because the file is opened in binary mode
        if line.startswith(b'<?xml version="1.0"'):
            break
        else:
            offset += len(line)
    # read the xml file starting at that point
    inf.seek(offset)
    data = et.parse(inf)
This assumes that the XML header starts on its own line, but it works on my test file:
<!-- This is a line of junk -->
<!-- This is another -->
<?xml version="1.0" ?>
<abc>
<def>xy</def>
<def>hi</def>
</abc>
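If the goal is only to parse the XML rather than to clean the file on disk, the same idea can be sketched in memory. This is a minimal sketch, assuming the declaration begins with the bytes `<?xml`; the helper name `parse_after_junk` is hypothetical and not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def parse_after_junk(raw: bytes, marker: bytes = b"<?xml") -> ET.Element:
    """Parse the XML that starts at the first occurrence of marker in raw."""
    start = raw.find(marker)
    if start == -1:
        raise ValueError("no XML declaration found")
    # ET.fromstring accepts bytes, so any encoding named in the declaration is honoured
    return ET.fromstring(raw[start:])
```

It could be called as `parse_after_junk(open("input.xml", "rb").read())`, which returns the root element and leaves the file untouched.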

Since you say you have several files, using fileinput might be better than open. You can then do something like:
import fileinput
import sys
prolog = '<?xml version="1.0"'  # adjust to match the declaration in your files
reached_prolog = False
files = ['file1.xml', 'file2.xml'] # The paths of all your XML files
for line in fileinput.input(files, inplace=True):
    if fileinput.isfirstline():
        reached_prolog = False # start over for each new file
    # Drop every line until the one that starts with the prolog
    if not reached_prolog and line.startswith(prolog):
        reached_prolog = True
    if reached_prolog:
        sys.stdout.write(line)
Reading the docs for fileinput should make things clearer.
P.S. This is just a quick response; I haven't run or tested the code.

A solution with regexp:
import re
import shutil
with open('myxml.xml') as ifile, open('tempfile.tmp', 'w') as ofile:
    for line in ifile:
        # keep everything from the declaration to the end of the line, newline included
        match = re.search(r'<\?xml version="1\.0"\s*\?>.*', line, re.DOTALL)
        if match:
            ofile.write(match.group(0))
            ofile.writelines(ifile)  # copy the rest of the file unchanged
            break
shutil.move('tempfile.tmp', 'myxml.xml')
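Since the question mentions several files, the read, find and rewrite approach can be wrapped in a small function and applied to each file in turn. This is a minimal sketch, assuming each file fits in memory and that the declaration starts with the bytes `<?xml`; the function name `strip_before_marker` is hypothetical.

```python
from pathlib import Path


def strip_before_marker(path, marker=b"<?xml"):
    """Rewrite path so that it starts at the first occurrence of marker.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(path)
    raw = path.read_bytes()
    start = raw.find(marker)
    if start <= 0:
        # -1: marker missing; 0: nothing precedes it. Either way, leave the file alone.
        return False
    path.write_bytes(raw[start:])
    return True
```

Working in bytes avoids guessing the encoding before the declaration has been read. A loop such as `for name in ["file1.xml", "file2.xml"]: strip_before_marker(name)` would then process each file under its original name.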

Related

Python: removing only the first instance of a tag from a text file?

I've got a file that at the moment reads along these lines:
Text
Text
</tag> <tag> Line
Text
Text
</tag> <tag> Line
Text
</tag>
etc.
I'd like to remove only the first instance of the /tag (as obviously this is wrong and shouldn't be there).
So far I've tried something along the lines of:
with open('document.txt', 'r+') as doc:
    for line in doc:
        line = line.replace("</tag>", " ")
        doc.write(line)
but this doesn't seem to do anything to the file.
I've also tried a different method that involves effectively not inserting the first /tag before I insert the rest (as I'm the one inserting /tag tag into the document), by:
#insert first tag
with open('document.txt', 'r+') as doc:
    for line in doc:
        with open('document.txt', 'r') as doc:
            lines = doc.readlines()
        if line.__contains__("Line"):
            lines.insert(3, "<tag>")
            with open('document.txt', 'w') as doc:
                contents = doc.writelines(lines)
#insert other sets of tags
with open('document.txt', 'r+') as doc:
    for line in doc:
        with open('document.txt', 'r') as doc:
            lines = doc.readlines()
        for index, line in enumerate(lines):
            if line.__contains__("Line") and not line.__contains__("<tag>"):
                lines.insert(index, "</tag> <tag>")
                break
        with open('document.txt', 'w') as doc:
            contents = doc.writelines(lines)
This again however seems to just give me the same result as before - with all of the tags, including the first /tag.
Can anyone point me in the right direction to fix this? Apologies if the above is shoddy coding and there's a simple fix.
Thanks in advance
str.replace(old, new [, count]) takes an optional argument count, which replaces only the first count occurrences:
filename = "file.txt"
with open(filename) as doc:
    data = doc.read()
data = data.replace("</tag>", " ", 1)
with open(filename, "w") as doc:
    doc.write(data)
print(open(filename).read())
Out:
Text
Text
<tag> Line
Text
Text
</tag> <tag> Line
Text
</tag>
etc.
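The effect of the count argument can be checked on a string before touching any file. This is a small sketch; the helper name `remove_first` and the sample text are made up for illustration.

```python
def remove_first(text, token):
    """Return text with only the first occurrence of token removed."""
    return text.replace(token, "", 1)


sample = "Text\n</tag> <tag> Line\nText\n</tag> <tag> Line\nText\n</tag>\n"
cleaned = remove_first(sample, "</tag>")
print(cleaned.count("</tag>"))  # 2: the sample had three, and only the first was removed
```

If the token does not occur at all, `str.replace` returns the text unchanged, so no special case is needed.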
I think you can use the fileinput module for this. From the Python documentation at https://docs.python.org/3/library/fileinput.html:
import fileinput
for line in fileinput.input(encoding="utf-8"):
    process(line)
You can use this to process a file line by line in place. With the replace function you can modify each line. If a line contains several tags and you only want to remove the first one, replace also takes a count argument. To remove only the first </tag> in the whole file, keep a flag:
import fileinput
def main():
    filename = "input.txt"
    removed = False  # becomes True once the first "</tag>" has been dropped
    for line in fileinput.input(filename, inplace=True):
        if not removed and "</tag>" in line:
            line = line.replace("</tag>", "", 1)
            removed = True
        print(line, end="")
if __name__ == "__main__":
    main()
I hope this helps!

Bulk autoreplacing string in the KML file

I have a set of placemarks, which include quite a wide description included in its balloon within the property. Next each single description (former column header) is bounded in . Because of the shapefile naming restriction to 10 characters only.
https://gis.stackexchange.com/questions/15784/bypassing-10-character-limit-of-field-name-in-shapefiles
I have to retype most of these names manually.
Obviously, I use Notepad++, where I can swiftly press Ctrl+F and toggle Replace mode, as you can see below.
The green bounded strings were already replaced, the red ones still remain.
Basically, if I press "Replace All" then it works fine and quickly. Unfortunately, I have to go one by one. As you can see I have around 20 separate strings to "Replace all". Is there a possibility to do it quicker? Because all the .kml files are similar to each other, this is going to be the same everywhere. I need some tool, which will be able to do auto-replace for these headers cut by 10 characters limit. I think, that maybe Python tools might be helpful.
https://pythonhosted.org/pykml/
But in the tool above there is no information about bulk KML editing.
How can I set up something like "Replace All" that handles all of my strings at once?
UPDATE:
I tried the code below:
files = []
with open("YesNF016.kml") as f:
    for line in f.readlines():
        if line[-1] == '\n':
            files.append(line[:-1])
        else:
            files.append(line)
old_expression = 'ab'
new_expression = 'it worked'
for file in files:
    new_file = ""
    with open(file) as f:
        for line in f.readlines():
            new_file += line.replace(old_expression, new_expression)
    with open(file, 'w') as f:
        f.write(new_file)
The debugger shows:
[Errno 22] Invalid argument: ''
File "\test.py", line 13, in
with open(file) as f:
whereas line 13 is:
with open(file) as f:
The solutions here:
https://www.reddit.com/r/learnpython/comments/b9cljd/oserror_while_using_elementtree_to_parse_simple/
and
OSError: [Errno 22] Invalid argument Getting invalid argument while parsing xml in python
weren't helpful enough for me.
So you want to replace all occurrences of X with Y in a bunch of files?
Pretty easy.
Just create a file_list.txt containing the list of files to edit, one path per line. It must list the file names; it is not the KML file itself.
Python code:
files = []
with open("file_list.txt") as f:
    for line in f:
        name = line.strip()
        if name:  # skip blank lines, which would make open('') fail
            files.append(name)
old_expression = 'ab'
new_expression = 'it worked'
for file in files:
    new_file = ""
    with open(file) as f:
        for line in f.readlines():
            new_file += line.replace(old_expression, new_expression)
    with open(file, 'w') as f:
        f.write(new_file)
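With around 20 strings to replace, the old and new values can be kept in one dictionary and applied in a single pass per file. This is a minimal sketch, assuming the files are UTF-8 and fit in memory; the function name `replace_many` and the field names in the usage note are hypothetical, since the real truncated headers are not shown in the question.

```python
from pathlib import Path


def replace_many(path, replacements, encoding="utf-8"):
    """Apply every old -> new pair in replacements to the file at path.

    Returns the number of substitutions made.
    """
    path = Path(path)
    text = path.read_text(encoding=encoding)
    total = 0
    for old, new in replacements.items():
        total += text.count(old)
        text = text.replace(old, new)
    if total:
        # Only rewrite the file when something actually changed
        path.write_text(text, encoding=encoding)
    return total
```

It might be used as `for kml in Path(".").glob("*.kml"): replace_many(kml, {"Descriptio": "Description"})`. The pairs are applied one after another, so a replacement text that contains a later search string would be replaced again; order the dictionary with that in mind.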

How to edit XML file using Python?

I have a list of XML documents with the following structure. I need to delete this line:
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
using Python code, as deleting it manually would be very time-consuming because there are lots of files.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.62.0">
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
</page>
</pdf2xml>
You can read the files line by line and then write them back without the line you don't want. Just be sure what you want to delete: is it exactly the line you wrote? Is it always the second line? Is it every !DOCTYPE line? Is it the first !DOCTYPE line? And so on.
import os
import sys
# Assumes first argument when running the script is a directory containing XML files
directory = sys.argv[1] if len(sys.argv) > 1 else "."
files = os.listdir(directory)
for f in files:
    # Ignore non-XML files
    if not f.endswith(".xml"):
        continue
    # os.listdir returns bare names, so join them with the directory
    path = os.path.join(directory, f)
    # Read file content
    with open(path, 'r') as f_in:
        content = f_in.readlines()
    # Rewrite the original file
    with open(path, 'w') as f_out:
        for line in content:
            # The condition may differ based on what you really want to delete
            if line != "<!DOCTYPE pdf2xml SYSTEM \"pdf2xml.dtd\">\n":
                f_out.write(line)
Things to consider:
If the files are big you may not want to load them into the memory
It is inefficient for example in case you want to always delete just the second line in the file.
Do you really need/want to use Python for that? There are simpler solutions. For example, on a Mac (BSD sed) you can use sed as follows; with GNU sed on Linux, write -i instead of -i '':
for f in *.xml; do sed -i '' -n '/<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">/!p' "$f"; done
First, open the file:
f = open("yourfile.txt","r")
Next, get all your lines from the file:
lines = f.readlines()
Now you can close the file:
f.close()
And reopen it in write mode:
f = open("yourfile.txt","w")
Then, write your lines back, except the line you want to delete:
for line in lines:
    if not line.startswith('<!DOCTYPE'):
        f.write(line)
At the end, close the file again.
f.close()
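For large files, the same filter can be applied while streaming, so that only one line is held in memory at a time. This is a minimal sketch, assuming UTF-8 files and that only the first line starting with `<!DOCTYPE` should go; the function name `drop_first_doctype` is hypothetical.

```python
import os
import tempfile


def drop_first_doctype(path, encoding="utf-8"):
    """Copy path line by line, skipping the first line that starts with <!DOCTYPE.

    Returns True if a line was dropped, False otherwise.
    """
    directory = os.path.dirname(os.path.abspath(path))
    dropped = False
    # Create the temporary file next to the original so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # newline="" keeps the original line endings untouched
        with os.fdopen(fd, "w", encoding=encoding, newline="") as dst, \
                open(path, "r", encoding=encoding, newline="") as src:
            for line in src:
                if not dropped and line.lstrip().startswith("<!DOCTYPE"):
                    dropped = True
                    continue
                dst.write(line)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
    return dropped
```

Writing to a temporary file and swapping it in with `os.replace` means the original is never left half-written if the script is interrupted. Note that the replacement file gets the temporary file's permissions, not the original's.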

Replacing a line in an already opened file python [duplicate]

I want to loop over the contents of a text file and do a search and replace on some lines and write the result back to the file. I could first load the whole file in memory and then write it back, but that probably is not the best way to do it.
What is the best way to do this, within the following code?
f = open(file)
for line in f:
    if 'foo' in line:
        newline = line.replace('foo', 'bar')
        # how to write this newline back to the file
The shortest way would probably be to use the fileinput module. For example, the following adds line numbers to a file, in-place:
import fileinput
for line in fileinput.input("test.txt", inplace=True):
    print('{} {}'.format(fileinput.filelineno(), line), end='') # for Python 3
    # print "%d: %s" % (fileinput.filelineno(), line), # for Python 2
What happens here is:
The original file is moved to a backup file
The standard output is redirected to the original file within the loop
Thus any print statements write back into the original file
fileinput has more bells and whistles. For example, it can be used to automatically operate on all files in sys.argv[1:], without your having to iterate over them explicitly. Starting with Python 3.2 it also provides a convenient context manager for use in a with statement.
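As a sketch of those extras, the context manager can be combined with the `backup` parameter so that the original file survives the edit. The function name `replace_with_backup` and the `.bak` suffix are choices made for this illustration.

```python
import fileinput


def replace_with_backup(path, old, new, backup_suffix=".bak"):
    """Replace old with new in path in place, keeping the original as path + backup_suffix."""
    with fileinput.input(files=(path,), inplace=True, backup=backup_suffix) as lines:
        for line in lines:
            # Inside the loop, print() writes to the file rather than the terminal
            print(line.replace(old, new), end="")
```

After `replace_with_backup("test.txt", "foo", "bar")`, the edited text is in test.txt and the untouched original is in test.txt.bak. Without a `backup` argument, fileinput deletes its backup file once the output file is closed.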
While fileinput is great for throwaway scripts, I would be wary of using it in real code because admittedly it's not very readable or familiar. In real (production) code it's worthwhile to spend just a few more lines of code to make the process explicit and thus make the code readable.
There are two options:
The file is not overly large, and you can just read it wholly to memory. Then close the file, reopen it in writing mode and write the modified contents back.
The file is too large to be stored in memory; you can move it over to a temporary file and open that, reading it line by line, writing back into the original file. Note that this requires twice the storage.
I guess something like this should do it. It basically writes the content to a new file and replaces the old file with the new file:
from tempfile import mkstemp
from shutil import move, copymode
from os import fdopen, remove
def replace(file_path, pattern, subst):
    #Create temp file
    fh, abs_path = mkstemp()
    with fdopen(fh,'w') as new_file:
        with open(file_path) as old_file:
            for line in old_file:
                new_file.write(line.replace(pattern, subst))
    #Copy the file permissions from the old file to the new file
    copymode(file_path, abs_path)
    #Remove original file
    remove(file_path)
    #Move new file
    move(abs_path, file_path)
Here's another example that was tested. Note that it does plain substring replacement, not regex matching:
import fileinput
import sys
def replaceAll(file,searchExp,replaceExp):
    for line in fileinput.input(file, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp,replaceExp)
        sys.stdout.write(line)
Example use:
replaceAll("/fooBar.txt","Hello World!","Goodbye World.")
This should work (in-place editing):
import fileinput
# Takes a list of files (files is assumed to be defined already) and
# redirects STDOUT to the file in question
for line in fileinput.input(files, inplace=True):
    print(line.replace("foo", "bar"), end="")
Based on the answer by Thomas Watnedal.
However, this does not answer the line-to-line part of the original question exactly. The function can still replace on a line-to-line basis
This implementation replaces the file contents without using temporary files, as a consequence file permissions remain unchanged.
Also re.sub instead of replace, allows regex replacement instead of plain text replacement only.
Reading the file as a single string instead of line by line allows for multiline match and replacement.
import re
def replace(file, pattern, subst):
    # Read contents from file as a single string
    file_handle = open(file, 'r')
    file_string = file_handle.read()
    file_handle.close()
    # Use RE package to allow for replacement (also allowing for (multiline) REGEX)
    file_string = (re.sub(pattern, subst, file_string))
    # Write contents to file.
    # Using mode 'w' truncates the file.
    file_handle = open(file, 'w')
    file_handle.write(file_string)
    file_handle.close()
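A match only spans several lines if the pattern itself is allowed to cross newlines. The sketch below shows this on a string rather than a file; the `<item>` element and its contents are made up for the illustration.

```python
import re

text = "<item>\n  old value\n</item>\n<other/>\n"

# re.DOTALL lets "." match newlines, so one pattern can span several lines;
# the non-greedy ".*?" stops at the first closing tag.
pattern = re.compile(r"<item>.*?</item>", re.DOTALL)
result = pattern.sub("<item>new value</item>", text)
print(result)
```

Without `re.DOTALL` the pattern would not match here, because `.` stops at the end of each line. When passing a plain pattern string to `re.sub`, the same flag can be given as the `flags` keyword argument.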
As lassevk suggests, write out the new file as you go, here is some example code:
fin = open("a.txt")
fout = open("b.txt", "wt")
for line in fin:
    fout.write( line.replace('foo', 'bar') )
fin.close()
fout.close()
If you're wanting a generic function that replaces any text with some other text, this is likely the best way to go, particularly if you're a fan of regex's:
import re
def replace( filePath, text, subs, flags=0 ):
    with open( filePath, "r+" ) as file:
        fileContents = file.read()
        textPattern = re.compile( re.escape( text ), flags )
        fileContents = textPattern.sub( subs, fileContents )
        file.seek( 0 )
        file.truncate()
        file.write( fileContents )
A more pythonic way would be to use context managers like the code below:
from tempfile import mkstemp
from shutil import move
from os import fdopen, remove
def replace(source_file_path, pattern, substring):
    fh, target_file_path = mkstemp()
    # wrap the descriptor returned by mkstemp so it is closed together with the file
    with fdopen(fh, 'w') as target_file:
        with open(source_file_path, 'r') as source_file:
            for line in source_file:
                target_file.write(line.replace(pattern, substring))
    remove(source_file_path)
    move(target_file_path, source_file_path)
fileinput is quite straightforward as mentioned on previous answers:
import fileinput
def replace_in_file(file_path, search_text, new_text):
    with fileinput.input(file_path, inplace=True) as file:
        for line in file:
            new_line = line.replace(search_text, new_text)
            print(new_line, end='')
Explanation:
fileinput can accept multiple files, but I prefer to close each file as soon as it has been processed, so I pass a single file_path to the with statement.
print does not write anything to the terminal when inplace=True, because STDOUT is redirected to the original file.
end='' in the print call avoids doubled newlines, since each line already ends with one.
You can use it as follows:
file_path = '/path/to/my/file'
replace_in_file(file_path, 'old-text', 'new-text')
Create a new file, copy lines from the old to the new, and do the replacing before you write the lines to the new file.
Expanding on @Kiran's answer, which I agree is more succinct and Pythonic, this adds codecs to support the reading and writing of UTF-8:
import codecs
from tempfile import mkstemp
from shutil import move
from os import close, remove
def replace(source_file_path, pattern, substring):
    fh, target_file_path = mkstemp()
    close(fh)  # codecs.open reopens the file by path, so release the raw descriptor
    with codecs.open(target_file_path, 'w', 'utf-8') as target_file:
        with codecs.open(source_file_path, 'r', 'utf-8') as source_file:
            for line in source_file:
                target_file.write(line.replace(pattern, substring))
    remove(source_file_path)
    move(target_file_path, source_file_path)
Using hamishmcn's answer as a template, I was able to search for lines in a file that match my regex and replace the matches with an empty string.
import re
p = re.compile('[-][0-9]*[.][0-9]*[,]|[-][0-9]*[,]') # pattern, compiled once outside the loop
fin = open("in.txt", 'r') # in file
fout = open("out.txt", 'w') # out file
for line in fin:
    newline = p.sub('',line) # replace matching strings with empty string
    print(newline, end='')
    fout.write(newline)
fin.close()
fout.close()
If you remove the indent as in the example below, it will search and replace across multiple lines.
from tempfile import mkstemp
from shutil import move
from os import remove, close
def replace(file, pattern, subst):
    #Create temp file
    fh, abs_path = mkstemp()
    print(fh, abs_path)
    new_file = open(abs_path,'w')
    old_file = open(file)
    for line in old_file:
        new_file.write(line.replace(pattern, subst))
    #close temp file
    new_file.close()
    close(fh)
    old_file.close()
    #Remove original file
    remove(file)
    #Move new file
    move(abs_path, file)

Python- need to append characters to the beginning and end of each line in text file

I should preface this by saying that I am a complete Python newbie.
I'm trying to create a script that will loop through a directory and its subdirectories looking for text files. When it encounters a text file, it will parse the file, convert it to NITF XML and upload it to an FTP directory.
At this point I am still working on reading the text file into variables so that they can be inserted into the XML document in the right places. An example of the text file is as follows.
Headline
Subhead
By A person
Paragraph text.
And here is the code I have so far:
import os
with open("path/to/textFile.txt") as f:
    #content = f.readlines()
    head, sub, auth = [f.readline().strip() for i in range(3)]
    data = f.read()
pth = os.getcwd()
print(head, sub, auth, data, pth)
My question is: how do I iterate through the body of the text file (data) and wrap each line in HTML P tags? For example:
<P>line of text in file</P> <P>Next line in text file</P>
Something like
output_format = '<p>{}</p>\n'.format
with open('input') as fin, open('output', 'w') as fout:
    fout.writelines( output_format(line.strip()) for line in fin )
This assumes that you want to write the new content back to the original file:
with open('path/to/textFile.txt') as f:
    content = f.readlines()
with open('path/to/textFile.txt', 'w') as f:
    for line in content:
        f.write('<p>' + line.strip() + '</p>\n')
import shutil
with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fout.write('<P>{0}</P>\n'.format(line.rstrip('\n')))  # strip the trailing newline
#Only do this once you're sure the script works :)
shutil.move('outfile','infile') #Need to replace the input file with the output file
In your case, you should probably replace
data=f.read()
with:
data = '\n'.join("<p>%s</p>" % l.strip() for l in f)
Use data=f.readlines() here,
and then iterate over data and try something like this:
for line in data:
    line="<p>"+line.strip()+"</p>"
    #write line+'\n' to a file or do something else
Append the <p> and </p> to each line,
for example:
data_new=[]
data=f.readlines()
for lines in data:
    data_new.append("<p>%s</p>\n" % lines.strip())
You could use the fileinput module to modify one or more files in place, with optional backup file creation if desired (see its documentation for details). Here it is being used to process one file.
import fileinput
for line in fileinput.input('testinput.txt', inplace=True):
    print('<P>' + line.rstrip('\n') + '</P>')
The 'testinput.txt' argument could also be a sequence of two or more file names instead of just a single one, which could be useful especially if you're using os.walk() to generate the list of files in the directory and its subdirectories to process (as you probably should be doing).
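Combining os.walk() with fileinput might look like the sketch below. It assumes that every text file ends in .txt and that every non-empty line should be wrapped, which ignores the headline, subhead and byline handling from the question; blank lines are dropped. Both function names are hypothetical.

```python
import fileinput
import os


def find_text_files(root):
    """Yield the path of every .txt file under root, including subdirectories."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                yield os.path.join(dirpath, name)


def wrap_lines_in_p(paths):
    """Rewrite each file in place so every non-empty line becomes <p>line</p>."""
    paths = list(paths)
    if not paths:
        return  # an empty list would make fileinput read standard input instead
    with fileinput.input(files=paths, inplace=True) as lines:
        for line in lines:
            stripped = line.strip()
            if stripped:
                print("<p>{}</p>".format(stripped))
```

A call such as `wrap_lines_in_p(find_text_files("path/to/dir"))` would then process the whole tree. Because the files are rewritten in place, it is worth trying this on a copy of the directory first.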
