I have a list of XML documents with the following structure. I need to delete this line:
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
using Python code, since deleting it manually would be very time-consuming as there are lots of files.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.62.0">
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
</page>
</pdf2xml>
You can read the files line by line and then write them back without the line you don't want. Just be sure about what exactly you want to delete: is it exactly the line you wrote? Is it always the second line? Is it every !DOCTYPE line? Only the first !DOCTYPE line? Etc.
import os
import sys

# Assumes the first argument when running the script is a directory containing XML files
directory = sys.argv[1] if len(sys.argv) > 1 else "."
for name in os.listdir(directory):
    # Skip non-XML files
    if not name.endswith(".xml"):
        continue
    # Join with the directory, otherwise files outside the current directory won't be found
    path = os.path.join(directory, name)
    # Read the file content
    with open(path, 'r') as f_in:
        content = f_in.readlines()
    # Rewrite the original file
    with open(path, 'w') as f_out:
        for line in content:
            # The condition may differ based on what you really want to delete
            if line != '<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">\n':
                f_out.write(line)
Things to consider:
- If the files are big you may not want to load them into memory.
- It is inefficient if, for example, you always want to delete just the second line of the file.
Do you really need/want to use Python for that? There are better solutions. For example, on macOS you can use sed:

for f in *.xml; do sed -i '' -n '/<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">/!p' "$f"; done
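Note that `-i ''` is the BSD/macOS form of in-place editing; GNU sed on Linux takes `-i` with no separate argument, and the `d` command is a simpler way to drop matching lines. A sketch of the GNU variant:

```shell
# GNU sed (Linux): delete every line matching the DOCTYPE declaration, in place
for f in *.xml; do
    sed -i '/<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">/d' "$f"
done
```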
First, open the file:
f = open("yourfile.txt","r")
Next, get all your lines from the file:
lines = f.readlines()
Now you can close the file:
f.close()
And reopen it in write mode:
f = open("yourfile.txt","w")
Then, write your lines back, except the line you want to delete. (If you instead compare complete lines, remember to match whatever line ending your file uses.)
for line in lines:
    if not line.startswith('<!DOCTYPE'):
        f.write(line)
At the end, close the file again.
f.close()
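The steps above can be combined into a single snippet using `with` blocks, which close the files automatically. This is a sketch assuming the file fits in memory and that every `<!DOCTYPE` line should be removed:

```python
def remove_doctype_lines(path):
    # Read all lines into memory first, since we rewrite the same file.
    with open(path, "r") as f:
        lines = f.readlines()
    # Reopen in write mode and copy back everything except DOCTYPE lines.
    with open(path, "w") as f:
        for line in lines:
            if not line.startswith("<!DOCTYPE"):
                f.write(line)
```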
I want to print the content of the last saved text file in a folder using Python. I wrote the below code. It is printing out only the path of the file but not the content.
import glob
import os
import tempfile

folder_path = r'C:\Users\Siciid\Desktop\restaurant\bill'
file_type = r'\*txt'
files = glob.glob(folder_path + file_type)
max_file = max(files, key=os.path.getctime)
filename=tempfile.mktemp('.txt')
open(filename,'w').write(max_file)
os.startfile(filename,"print")
Is it possible to do this in Python? Any suggestions? I would appreciate your help. Thank you.
You can do that using the following code. Just replace the line where you open and write a file with these two lines:
with open(max_file, "r") as f, open(filename, 'w') as f2:
f2.write(f.read())
The max_file variable contains a file name, not the contents of the file, so writing it to the temp file and printing that will simply print the file name instead of its contents. To put its contents into the temporary file, you need to open the file and then read it. That is what the above two lines of code do.
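For illustration, the fix can be isolated in a small helper that returns the contents of the newest file. This is a sketch; the helper name and the `pattern` parameter are assumptions, and `folder_path` is the directory from the question:

```python
import glob
import os

def newest_file_contents(folder_path, pattern="*.txt"):
    # Find the most recently created file matching the pattern...
    files = glob.glob(os.path.join(folder_path, pattern))
    max_file = max(files, key=os.path.getctime)
    # ...and return its contents rather than its name.
    with open(max_file, "r") as f:
        return f.read()
```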
Hi, I'm trying to edit the headers of my FASTA files using seqkit, and I have been able to do it, but I'm not able to save the result.
The command I'm using to edit multiple FASTA files with respect to their filename, using refseq:
for i in $(find -name \genomid); do seqkit replace -p "^(.+?) (.+?)$" --replacement '{kv}' -k proid_unique *.faa; done
The directory containing all my FASTA files looks like this:
PATH:
~/PANGENOMICS/DATA1/test
FILES in the directory:
GCF_000016305.1_ASM1630v1_protein.faa
GCF_000220485.1_ASM22048v1_protein.faa
GCF_900635735.1_32875_B01_protein.faa
proid_unique
genomid
I am finding the filenames using a CSV file list (genomid):
GCF_900635735.1_32875_B01_protein.faa:WP_151362402.1:WP_151362402.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362403.1:WP_151362403.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362404.1:WP_151362404.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362405.1:WP_151362405.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362406.1:WP_151362406.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362407.1:WP_151362407.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362408.1:WP_151362408.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362409.1:WP_151362409.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362410.1:WP_151362410.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362411.1:WP_151362411.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362412.1:WP_151362412.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362413.1:WP_151362413.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362414.1:WP_151362414.1#0940
The file (proid_unique) I used as the key-value file to edit the FASTA headers looks like this:
WP_151362399.1 WP_151362399.1#0940
WP_151362400.1 WP_151362400.1#0940
WP_151362401.1 WP_151362401.1#0940
WP_151362402.1 WP_151362402.1#0940
WP_151362409.1 WP_151362409.1#0940
WP_151362410.1 WP_151362410.1#0940
WP_151362411.1 WP_151362411.1#0940
WP_151362412.1 WP_151362412.1#0940
WP_151362413.1 WP_151362413.1#0940
WP_151362414.1 WP_151362414.1#0940
WP_094096600.1 WP_094096600.1#0945
WP_016530940.1 WP_016530940.1#0950
WP_000940121.1 WP_000940121.1#0951
WP_012540940.1 WP_012540940.1#0951
Example of input:
>WP_151362411.1 YoaH family protein [Klebsiella pneumoniae]
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_151362412.1 gamma-glutamylcyclotransferase [Klebsiella pneumoniae]
MLEAIGGEWRPGYVTGTFYARGWGAAADFPGIVLDAHGPRVNGYLFLSDRLARTGPCWTT
LRRGYDRVPVEVTTDDGQQISAWIYQLQPRG
>WP_151362413.1 acid resistance repetitive basic protein Asr [Klebsiella pneumoniae]
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
>WP_151362414.1 ABC transporter permease [Klebsiella pneumoniae]
MKRAPWYLRLATWGGVIFLHFPLLIIAIYAFNTEDAAFSFPPQGLTLRWFSEAAGRSDIL
QAVTLSLKIAALSTAIALVLGTLAAGALWRSAFFGKNAVSLLLLLPIALPGIITGLALLT
AFKAVGLEPGLLTIVVGHATFCVVVVFNNVIARFRRTSWSMVEASMDLGATGWQTFRYVV
LPNLGSALLAGGMLAFALSFDEIIVTTFTAGHERTLPLWLLNQLGRPRDVPVTNVVALLV
MLVTTIPILGAWWLTRDGDSDAGNGK
Example of output (expected and correct with the above command):
>WP_151362411.1#0940
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_151362412.1#0940
MLEAIGGEWRPGYVTGTFYARGWGAAADFPGIVLDAHGPRVNGYLFLSDRLARTGPCWTT
LRRGYDRVPVEVTTDDGQQISAWIYQLQPRG
>WP_151362413.1#0940
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
>WP_151362414.1#0940
MKRAPWYLRLATWGGVIFLHFPLLIIAIYAFNTEDAAFSFPPQGLTLRWFSEAAGRSDIL
QAVTLSLKIAALSTAIALVLGTLAAGALWRSAFFGKNAVSLLLLLPIALPGIITGLALLT
AFKAVGLEPGLLTIVVGHATFCVVVVFNNVIARFRRTSWSMVEASMDLGATGWQTFRYVV
LPNLGSALLAGGMLAFALSFDEIIVTTFTAGHERTLPLWLLNQLGRPRDVPVTNVVALLV
MLVTTIPILGAWWLTRDGDSDAGNGK
I am getting the required/expected result, but the edits are not being saved with this command. Can someone help me figure out how to save those edits in the original files? When I open the files again, they are the same as before, with no edited headers.
A Python alternative to the above command would also be helpful.
Assuming:
The relevant files are proid_unique and *.faa files in current
directory.
We want to replace *.faa files by editing the header lines
according to the key-value pairs described in proid_unique.
We can forget about genomid file so far.
As I'm not familiar with the seqkit command, here is a Python alternative:
#!/usr/bin/python
import glob
import os

# Open the key-value file and create a dictionary of key-value pairs to edit
with open('proid_unique') as f:
    m = {k: v for k, v in (line.split() for line in f)}

for fasta in glob.glob('*.faa'):
    org = fasta + '.O'     # backup filename, appending '.O' suffix
    os.rename(fasta, org)  # rename the original file
    with open(org) as f, open(fasta, 'w') as w:  # open files to read and write
        for line in f:                # process line by line
            line = line.rstrip()      # remove the newline character
            if line.startswith('>'):  # header line
                # extract the substring before the first whitespace
                header = line.split()[0]
                if header[1:] in m:   # if the header is a key in the dictionary
                    line = '>' + m[header[1:]]  # then replace the line
            w.write(line + '\n')      # write to the new fasta file
It backs up the old *.faa files as *.faa.O.
If my assumption is incorrect, please let me know.
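The header-rewriting step itself can be factored out into a small function for testing. This is a sketch; `rewrite_header` and `mapping` are illustrative names, where `mapping` corresponds to the dictionary built from proid_unique:

```python
def rewrite_header(line, mapping):
    # Replace a FASTA header line ">ID description" with ">" + mapping[ID]
    # when the ID is in the mapping; all other lines pass through unchanged.
    if line.startswith(">"):
        key = line.split()[0][1:]
        if key in mapping:
            return ">" + mapping[key]
    return line
```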
I wrote a python script that takes two files as input and then saves the difference between them as output in another file.
I bound it to a batch file .cmd (see below) and added the batch file to context menu of text files, so when I right-click on a text file and select it, a cmd window pops up and I type the address of the file to compare.
Batch file content:
@echo off
cls
python "C:\Users\User\Desktop\Difference of Two Files.py" %1
Python Code:
import sys
import os
f1 = open(sys.argv[1], 'r')
f1_name = str(os.path.basename(f1.name)).rsplit('.')[0]
f2_path = input('Enter the path of file to compare: ')
f2 = open(f2_path, 'r')
f2_name = str(os.path.basename(f2.name)).rsplit('.')[0]
f3 = open(f'{f1_name} - {f2_name} diff.txt', 'w')
file1 = set(f1.read().splitlines())
file2 = set(f2.read().splitlines())
difference = file1.difference(file2)
for i in difference:
    f3.write(i + '\n')
f1.close()
f2.close()
f3.close()
Now, my question is: how can I replace typing the 2nd file path with a drag-and-drop solution that accepts more than one file?
I don't have any problem with the Python code and can extend it myself to include more files. I just don't know how to edit the batch file so that instead of taking just one file by typing the path, it takes several files by drag and drop.
I would appreciate your help.
Finally, I've figured it out myself!
I'll post the final code; maybe it helps somebody.
# This script prints those lines in the 1st file that are not in the other added files
# and saves the results into a 3rd file on Desktop.
import sys
import os
f1 = open(sys.argv[1], 'r')
f1_name = str(os.path.basename(f1.name)).rsplit('.')[0]
reference_set = set(f1.read().splitlines())
compare_files = input('Drag and drop files into this window to compare: ')
compare_files = compare_files.strip('"').rstrip('"')
compare_files_list = compare_files.split('\"\"')
compare_set = set()
for file in compare_files_list:
    with open(os.path.abspath(file), 'r') as f2:
        file_content = set(f2.read().splitlines())
        compare_set.update(file_content)
f3 = open(f'C:\\Users\\User\\Desktop\\{f1_name} diff.txt', 'w')
difference = reference_set.difference(compare_set)
for i in difference:
    f3.write(i + '\n')
f1.close()
f3.close()
The idea came from the fact that dragging and dropping onto cmd copies the file path into it surrounded by double quotes. I used the repeated double quotes between paths to create a list; you can see the rest in the code.
However, there's a downside: you can't drag multiple files together, so you have to do it one by one. But it's better than nothing. ;)
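A more robust way to split the pasted paths is to extract every quoted segment with a regular expression, which also works when the paths are separated by spaces. This is a sketch; `extract_paths` is an illustrative name, and the input is the cmd drag-and-drop string described above:

```python
import re

def extract_paths(pasted):
    # cmd wraps each dragged path in double quotes; pull out every quoted segment.
    return re.findall(r'"([^"]+)"', pasted)
```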
I have several files I need to work on. The files are XML files, but before `<?xml version="1.0"?>` there are some debugging and status lines coming from the command line. Since I'd like to parse the files, these lines must be removed. My question is: how is this possible? Preferably in place, i.e. the filename stays the same.
Thanks for any help.
An inefficient solution would be to read the whole contents and find where the declaration occurs:

fileName = "yourfile.xml"
with open(fileName, 'r+') as f:
    contents = f.read()
    # Keep everything from the XML declaration onwards
    contents = contents[contents.find('<?xml version="1.0"?>'):]
    f.seek(0)
    f.write(contents)
    f.truncate()

The file will now contain the original file's contents from `<?xml version="1.0"?>` onwards.
What about trimming the file headers as you read the file?
import xml.etree.ElementTree as et
with open("input.xml", "rb") as inf:
    # find the starting point of the XML declaration
    offset = 0
    for line in inf:
        # note the b-prefix: the file is opened in binary mode, so lines are bytes
        if line.startswith(b'<?xml version="1.0"'):
            break
        else:
            offset += len(line)
    # read the xml file starting at that point
    inf.seek(offset)
    data = et.parse(inf)
(This assumes that the XML declaration starts on its own line, but it works on my test file below.)
<!-- This is a line of junk -->
<!-- This is another -->
<?xml version="1.0" ?>
<abc>
<def>xy</def>
<def>hi</def>
</abc>
Since you say you have several files, using fileinput might be better than open. You can then do something like:
import fileinput
import sys
prolog = '<?xml version="1.0"?>'
reached_prolog = False
files = ['file1.xml', 'file2.xml']  # The paths of all your XML files
for line in fileinput.input(files, inplace=1):
    # Reset the flag at the start of each new file
    if fileinput.isfirstline():
        reached_prolog = False
    # Skip everything before the XML declaration
    if not reached_prolog:
        if line.startswith(prolog):
            reached_prolog = True
        else:
            continue
    sys.stdout.write(line)
Reading the docs for fileinput should make things clearer.
P.S. This is just a quick response; I haven't run/tested the code.
A solution with a regexp:

import re
import shutil

with open('myxml.xml') as ifile, open('tempfile.tmp', 'w') as ofile:
    for line in ifile:
        # Look for the line containing the XML declaration
        matches = re.findall(r'<\?xml version="1\.0"\?>.*', line)
        if matches:
            ofile.write(matches[0] + '\n')
            ofile.writelines(ifile)  # copy the rest of the file unchanged
            break
shutil.move('tempfile.tmp', 'myxml.xml')
I should preface that I am a complete Python Newbie.
Im trying to create a script that will loop through a directory and its subdirectories looking for text files. When it encounters a text file it will parse the file and convert it to NITF XML and upload to an FTP directory.
At this point I am still working on reading the text file into variables so that they can be inserted into the XML document in the right places. An example to the text file is as follows.
Headline
Subhead
By A person
Paragraph text.
And here is the code I have so far:
import os

with open("path/to/textFile.txt") as f:
    # content = f.readlines()
    head, sub, auth = [f.readline().strip() for i in range(3)]
    data = f.read()
pth = os.getcwd()
print(head, sub, auth, data, pth)
My question is: how do I iterate through the body of the text file (data) and wrap each line in HTML P tags? For example:
<P>line of text in file</P> <P>Next line in text file</P>
Something like
output_format = '<p>{}</p>\n'.format
with open('input') as fin, open('output', 'w') as fout:
    fout.writelines(output_format(line.strip()) for line in fin)
This assumes that you want to write the new content back to the original file:
with open('path/to/textFile.txt') as f:
    content = f.readlines()
with open('path/to/textFile.txt', 'w') as f:
    for line in content:
        f.write('<p>' + line.strip() + '</p>\n')
import shutil

with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        # slice off the newline; same as line.rstrip('\n')
        fout.write('<P>{0}</P>\n'.format(line[:-1]))
# Only do this once you're sure the script works :)
shutil.move('outfile', 'infile')  # replace the input file with the output file
In your case, you should probably replace
data=f.read()
with:
data = '\n'.join("<p>%s</p>" % l.strip() for l in f)
Use data = f.readlines() here, then iterate over data and try something like this:

for line in data:
    line = "<p>" + line.strip() + "</p>"
    # write line + '\n' to a file or do something else
Append the <p> and </p> tags to each line, e.g.:

data_new = []
data = f.readlines()
for line in data:
    data_new.append("<p>%s</p>\n" % line.strip())
You could use the fileinput module to modify one or more files in-place, with optional backup file creation if desired (see its documentation for details). Here's it being used to process one file.
import fileinput

for line in fileinput.input('testinput.txt', inplace=1):
    print('<P>' + line[:-1] + '</P>')
The 'testinput.txt' argument could also be a sequence of two or more file names instead of just a single one, which could be useful especially if you're using os.walk() to generate the list of files in the directory and its subdirectories to process (as you probably should be doing).
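Since the original question walks a directory tree, `os.walk()` pairs naturally with `fileinput` here. A sketch of collecting the file list, where `top_dir` and the helper name are placeholders:

```python
import os

def find_text_files(top_dir):
    # Collect every .txt file under top_dir, including subdirectories.
    return [os.path.join(root, name)
            for root, dirs, names in os.walk(top_dir)
            for name in sorted(names)
            if name.endswith(".txt")]
```

The resulting list can be passed directly as the first argument to `fileinput.input()`.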