Hi, I am trying to edit the headers of my FASTA files using seqkit. The edit itself works, but I am not able to save it.
This is the command I am using to edit multiple RefSeq FASTA files based on their filenames:
for i in $(find -name \genomid); do seqkit replace -p "^(.+?) (.+?)$" --replacement '{kv}' -k proid_unique *.faa; done
The directory having all my fasta files is like this-
PATH:
~/PANGENOMICS/DATA1/test
FILES in the directory:
GCF_000016305.1_ASM1630v1_protein.faa
GCF_000220485.1_ASM22048v1_protein.faa
GCF_900635735.1_32875_B01_protein.faa
proid_unique
genomid
I am finding the filenames using a list file, genomid, which looks like this:
GCF_900635735.1_32875_B01_protein.faa:WP_151362402.1:WP_151362402.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362403.1:WP_151362403.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362404.1:WP_151362404.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362405.1:WP_151362405.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362406.1:WP_151362406.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362407.1:WP_151362407.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362408.1:WP_151362408.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362409.1:WP_151362409.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362410.1:WP_151362410.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362411.1:WP_151362411.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362412.1:WP_151362412.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362413.1:WP_151362413.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362414.1:WP_151362414.1#0940
The file (proid_unique) I used as the key-value file to edit the FASTA headers looks like this:
WP_151362399.1 WP_151362399.1#0940
WP_151362400.1 WP_151362400.1#0940
WP_151362401.1 WP_151362401.1#0940
WP_151362402.1 WP_151362402.1#0940
WP_151362409.1 WP_151362409.1#0940
WP_151362410.1 WP_151362410.1#0940
WP_151362411.1 WP_151362411.1#0940
WP_151362412.1 WP_151362412.1#0940
WP_151362413.1 WP_151362413.1#0940
WP_151362414.1 WP_151362414.1#0940
WP_094096600.1 WP_094096600.1#0945
WP_016530940.1 WP_016530940.1#0950
WP_000940121.1 WP_000940121.1#0951
WP_012540940.1 WP_012540940.1#0951
Example of input:
>WP_151362411.1 YoaH family protein [Klebsiella pneumoniae]
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_151362412.1 gamma-glutamylcyclotransferase [Klebsiella pneumoniae]
MLEAIGGEWRPGYVTGTFYARGWGAAADFPGIVLDAHGPRVNGYLFLSDRLARTGPCWTT
LRRGYDRVPVEVTTDDGQQISAWIYQLQPRG
>WP_151362413.1 acid resistance repetitive basic protein Asr [Klebsiella pneumoniae]
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
>WP_151362414.1 ABC transporter permease [Klebsiella pneumoniae]
MKRAPWYLRLATWGGVIFLHFPLLIIAIYAFNTEDAAFSFPPQGLTLRWFSEAAGRSDIL
QAVTLSLKIAALSTAIALVLGTLAAGALWRSAFFGKNAVSLLLLLPIALPGIITGLALLT
AFKAVGLEPGLLTIVVGHATFCVVVVFNNVIARFRRTSWSMVEASMDLGATGWQTFRYVV
LPNLGSALLAGGMLAFALSFDEIIVTTFTAGHERTLPLWLLNQLGRPRDVPVTNVVALLV
MLVTTIPILGAWWLTRDGDSDAGNGK
Example of output (expected, and correctly produced by the above command):
>WP_151362411.1#0940
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_151362412.1#0940
MLEAIGGEWRPGYVTGTFYARGWGAAADFPGIVLDAHGPRVNGYLFLSDRLARTGPCWTT
LRRGYDRVPVEVTTDDGQQISAWIYQLQPRG
>WP_151362413.1#0940
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
>WP_151362414.1#0940
MKRAPWYLRLATWGGVIFLHFPLLIIAIYAFNTEDAAFSFPPQGLTLRWFSEAAGRSDIL
QAVTLSLKIAALSTAIALVLGTLAAGALWRSAFFGKNAVSLLLLLPIALPGIITGLALLT
AFKAVGLEPGLLTIVVGHATFCVVVVFNNVIARFRRTSWSMVEASMDLGATGWQTFRYVV
LPNLGSALLAGGMLAFALSFDEIIVTTFTAGHERTLPLWLLNQLGRPRDVPVTNVVALLV
MLVTTIPILGAWWLTRDGDSDAGNGK
I am getting the expected result, but the edits are not saved by this command: when I open the files again, they are the same as before, with unedited headers. Can someone help me figure out how to save the edits to the original files?
A Python alternative to the above command would also be helpful.
Assuming:
The relevant files are proid_unique and the *.faa files in the current directory.
We want to replace the *.faa files by editing the header lines according to the key-value pairs described in proid_unique.
We can ignore the genomid file for now.
As I'm not familiar with the seqkit command, here is a Python alternative:
#!/usr/bin/python
import glob
import os

# create a dictionary of key-value pairs to edit
with open('proid_unique') as f:             # open the key-value file
    m = {k: v for k, v in [line.split() for line in f]}

for fasta in glob.glob('*.faa'):
    org = fasta + '.O'                      # backup filename appending '.O' suffix
    os.rename(fasta, org)                   # rename the file
    # open files to read and write
    with open(org) as f, open(fasta, 'w') as w:
        for line in f:                      # process line by line
            line = line.rstrip()            # remove a newline character
            if line.startswith('>'):        # header line
                header = line.split()[0]    # extract the substring before a whitespace
                if header[1:] in m:         # if the header is a key in the dictionary
                    line = '>' + m[header[1:]]  # then replace the line
            w.write(line + '\n')            # overwrite to the fasta file
It backs up the old *.faa files as *.faa.O.
If my assumption is incorrect, please let me know.
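A variant of the script above, as a minimal sketch: it writes the edited records to a temporary file in the same directory and then swaps it in with os.replace, so an interrupted run does not leave a half-written *.faa file. The names load_mapping and rename_headers are made up for illustration, and the key-value file is assumed to hold two whitespace-separated columns, as in proid_unique.

```python
import os
import tempfile


def load_mapping(path):
    """Read 'old_id new_id' pairs (whitespace-separated) into a dict."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:          # skip blank or malformed lines
                mapping[fields[0]] = fields[1]
    return mapping


def rename_headers(fasta_path, mapping):
    """Rewrite FASTA headers in place; return the number of headers changed."""
    changed = 0
    directory = os.path.dirname(os.path.abspath(fasta_path))
    # The temp file lives next to the original so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as out, open(fasta_path) as src:
            for line in src:
                if line.startswith('>'):
                    fields = line[1:].split()
                    seq_id = fields[0] if fields else ''
                    if seq_id in mapping:
                        line = '>' + mapping[seq_id] + '\n'
                        changed += 1
                out.write(line)
        os.replace(tmp_path, fasta_path)  # swap the edited copy over the original
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)           # clean up the partial copy on failure
        raise
    return changed
```

It could be driven with something like `mapping = load_mapping('proid_unique')` followed by a loop over `glob.glob('*.faa')` calling `rename_headers(name, mapping)`. Unlike the script above, this sketch keeps no backup, and the rewritten file gets the temporary file's permissions, so copy the originals first if either matters.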
Related
I want to print the content of the last saved text file in a folder using Python. I wrote the below code. It is printing out only the path of the file but not the content.
folder_path = r'C:\Users\Siciid\Desktop\restaurant\bill'
file_type = r'\*txt'
files = glob.glob(folder_path + file_type)
max_file = max(files, key=os.path.getctime)
filename=tempfile.mktemp('.txt')
open(filename,'w').write(max_file)
os.startfile(filename,"print")
Is it possible to do this in Python? Any suggestions would be appreciated. Thank you.
You can do that using the following code. Just replace the line where you open and write a file with these two lines:
with open(max_file, "r") as f, open(filename, 'w') as f2:
    f2.write(f.read())
The max_file variable contains a file name, not the contents of the file, so writing it to the temp file and printing that will simply print the file name instead of its contents. To put its contents into the temporary file, you need to open the file and then read it. That is what the above two lines of code do.
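To make the difference between the file name and its contents concrete, here is a small sketch. The helper name newest_file_text is hypothetical; it picks the most recently created match with os.path.getctime, as in the question, and returns what is inside that file.

```python
import glob
import os


def newest_file_text(folder, pattern='*.txt'):
    """Return (path, contents) of the newest file matching pattern in folder."""
    files = glob.glob(os.path.join(folder, pattern))
    if not files:
        raise FileNotFoundError('no files matching %r in %r' % (pattern, folder))
    newest = max(files, key=os.path.getctime)   # a path string, not file contents
    with open(newest, 'r') as handle:
        return newest, handle.read()            # reading gives the actual text
```

The returned text is what would then be written to the temporary file before printing it.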
I have a list of XML documents with the following structure. I need to delete this line:
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
using Python code, as deleting it manually would be very time-consuming because there are lots of files.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.62.0">
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
</page>
</pdf2xml>
You can read files line by line and then write them back without the line you don't want in the file. Just be sure what you want to delete: is it exactly the line you wrote? Is it always the second line? Is it every !DOCTYPE line? Is it the first !DOCTYPE line? Etc.
import os
import sys

# Assumes first argument when running the script is a directory containing XML files
directory = sys.argv[1] if len(sys.argv) > 1 else "."
files = os.listdir(directory)

for f in files:
    # Ignore non-XML files
    if not f.endswith(".xml"):
        continue

    # os.listdir() returns bare names, so join them with the directory
    path = os.path.join(directory, f)

    # Read file content
    with open(path, 'r') as f_in:
        content = f_in.readlines()

    # Rewrite the original file
    with open(path, 'w') as f_out:
        for line in content:
            # The condition may differ based on what you really want to delete
            if line != "<!DOCTYPE pdf2xml SYSTEM \"pdf2xml.dtd\">\n":
                f_out.write(line)
Things to consider:
If the files are big you may not want to load them into the memory
It is inefficient for example in case you want to always delete just the second line in the file.
Do you really need or want to use Python for that? A shell one-liner may be simpler. For example, on a Mac (BSD sed) you can use:
for f in *.xml; do sed -i '' -n '/<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">/!p' "$f"; done
With GNU sed on Linux, drop the empty '' after -i.
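For the first two points in the list above, a minimal sketch that handles one line at a time and stops checking once the line has been found. The names without_first_doctype and strip_doctype are made up for illustration; the sketch assumes you only want to drop the first line starting with <!DOCTYPE and that the files are UTF-8, as the sample's XML declaration says. It writes to a separate destination file rather than overwriting the source.

```python
def without_first_doctype(lines):
    """Yield lines unchanged, skipping only the first one that starts with <!DOCTYPE."""
    dropped = False
    for line in lines:
        if not dropped and line.lstrip().startswith('<!DOCTYPE'):
            dropped = True        # later lines are passed through without checking
            continue
        yield line


def strip_doctype(src_path, dst_path):
    """Stream src_path to dst_path one line at a time, minus the DOCTYPE line."""
    with open(src_path, 'r', encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        dst.writelines(without_first_doctype(src))
```

Because the filter is a generator, the whole file is never held in memory, whatever its size.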
First, open the file:
f = open("yourfile.txt","r")
Next, get all your lines from the file:
lines = f.readlines()
Now you can close the file:
f.close()
And reopen it in write mode:
f = open("yourfile.txt","w")
Then, write your lines back, except the line you want to delete. You might want to change the "\n" to whatever line ending your file uses.
for line in lines:
    if not line.startswith('<!DOCTYPE'):
        f.write(line)
At the end, close the file again.
f.close()
I want to replace the text that follows a specific word in a file using Python. I tried re.replace and re.sub but could not get them to work.
File Contents :
hostname: "abc-myvm-lx01"
I want to keep "hostname: " as static and replace rest of values.
After the replacement, the file should look like this:
hostname: "abc-urvm-lx02"
How about parsing the file yourself:
# READ ENTRIES FROM FILE
entries = {}
with open('your_file.txt', 'r') as file:
    for line in file:
        entry = line.split(':')
        # Allow values like abc:myvm:lx01
        entries[entry[0]] = ':'.join(entry[1:]).strip()

# Modify entries['hostname'] in any way you want

# WRITE ENTRIES TO FILE
with open('your_file.txt', 'w') as file:
    file.writelines('{}: {}\n'.format(*entry) for entry in entries.items())
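Since the question mentions re.sub, here is a minimal regex sketch as an alternative to parsing the whole file. The function name set_hostname is made up for illustration; it assumes the value is wrapped in double quotes, as in the sample, and leaves every other line untouched.

```python
import re

# Capture 'hostname: "' and the closing quote; only the text between them is replaced.
HOSTNAME_RE = re.compile(r'^([ \t]*hostname:[ \t]*")[^"\n]*(")', flags=re.MULTILINE)


def set_hostname(text, new_name):
    """Return text with the quoted hostname value replaced by new_name."""
    # A function is used as the replacement so new_name is inserted literally,
    # even if it contains backslashes or group references.
    return HOSTNAME_RE.sub(lambda m: m.group(1) + new_name + m.group(2), text)
```

To apply it, read the file into a string, call `set_hostname(content, 'abc-urvm-lx02')`, and write the result back.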
I should preface that I am a complete Python Newbie.
I'm trying to create a script that will loop through a directory and its subdirectories looking for text files. When it encounters a text file, it will parse the file, convert it to NITF XML, and upload it to an FTP directory.
At this point I am still working on reading the text file into variables so that they can be inserted into the XML document in the right places. An example of the text file is as follows.
Headline
Subhead
By A person
Paragraph text.
And here is the code I have so far:
with open("path/to/textFile.txt") as f:
    #content = f.readlines()
    head,sub,auth = [f.readline().strip() for i in range(3)]
    data=f.read()
pth = os.getcwd()
print head,sub,auth,data,pth
My question is: how do I iterate through the body of the text file (data) and wrap each line in HTML P tags? For example:
<P>line of text in file </P> <P>Next line in text file</p>.
Something like
output_format = '<p>{}</p>\n'.format
with open('input') as fin, open('output', 'w') as fout:
    fout.writelines(output_format(line.strip()) for line in fin)
This assumes that you want to write the new content back to the original file:
with open('path/to/textFile.txt') as f:
    content = f.readlines()
with open('path/to/textFile.txt', 'w') as f:
    for line in content:
        f.write('<p>' + line.strip() + '</p>\n')
import shutil

with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fout.write('<P>{0}</P>\n'.format(line.rstrip('\n')))  # strip the trailing newline
# Only do this once you're sure the script works :)
shutil.move('outfile', 'infile')  # Need to replace the input file with the output file
In your case, you should probably replace
data=f.read()
with:
data = '\n'.join("<p>%s</p>" % l.strip() for l in f)
Use data = f.readlines() here,
and then iterate over data and try something like this:
for line in data:
    line = "<p>" + line.strip() + "</p>"
    # write line+'\n' to a file or do something else
Add <p> and </p> around each line.
For example:
data_new = []
data = f.readlines()
for lines in data:
    data_new.append("<p>%s</p>\n" % lines.strip())
You could use the fileinput module to modify one or more files in-place, with optional backup file creation if desired (see its documentation for details). Here's it being used to process one file.
import fileinput
for line in fileinput.input('testinput.txt', inplace=True):
    print('<P>' + line.rstrip('\n') + '</P>')
The 'testinput.txt' argument could also be a sequence of two or more file names instead of just a single one, which could be useful especially if you're using os.walk() to generate the list of files in the directory and its subdirectories to process (as you probably should be doing).
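A sketch of the os.walk() combination mentioned above, written for Python 3. The function name wrap_paragraphs_in_tree is hypothetical, the '.txt' filter and the '.bak' backup suffix are assumptions, and html.escape is an addition so that characters such as '<' in the text do not break the markup.

```python
import fileinput
import html
import os


def wrap_paragraphs_in_tree(root, suffix='.txt'):
    """Wrap every non-empty line of each matching file under root in <p> tags."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
        if name.endswith(suffix)
    ]
    if not paths:
        return paths              # fileinput would fall back to stdin on an empty list
    with fileinput.input(files=paths, inplace=True, backup='.bak') as stream:
        for line in stream:
            text = line.strip()
            if text:
                # With inplace=True, print() writes into the file being processed.
                print('<p>' + html.escape(text) + '</p>')
            else:
                print()           # keep blank lines as they are
    return paths
```

Each processed file keeps a copy of its original content next to it with the '.bak' suffix, which makes a trial run easy to undo.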
I am trying to create text files in bulk from a list. A text file contains a number of lines (titles), and the aim is to create one text file per title. Below are my non-working code, the contents of titles.txt, and the expected output.
titles = open("C:\\Dropbox\\Python\\titles.txt",'r')
for lines in titles.readlines():
    d_path = 'C:\\titles'
    output = open((d_path.lines.strip())+'.txt','a')
    output.close()
titles.close()
titles.txt
Title-A
Title-B
Title-C
New blank files to be created under the directory c:\\titles\\:
Title-A.txt
Title-B.txt
Title-C.txt
It's a little difficult to tell what you're attempting here, but hopefully this will be helpful:
import os.path
with open('titles.txt') as f:
    for line in f:
        newfile = os.path.join('C:\\titles', line.strip()) + '.txt'
        ff = open(newfile, 'a')
        ff.close()
If you want to replace existing files with blank files, you can open your files with mode 'w' instead of 'a'.
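To illustrate the difference between the two modes, a minimal sketch using pathlib. The function name create_title_files and its truncate flag are made up for illustration: with truncate=False existing files keep their content (mode 'a'), with truncate=True they are emptied (mode 'w').

```python
from pathlib import Path


def create_title_files(titles_path, out_dir, truncate=False):
    """Create one '<title>.txt' file in out_dir per non-empty line of titles_path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    with open(titles_path) as titles:
        for line in titles:
            title = line.strip()
            if not title:
                continue                      # skip blank lines
            target = out_dir / (title + '.txt')
            # 'a' creates the file if missing and keeps existing content;
            # 'w' creates it too, but empties a file that already exists.
            with open(target, 'w' if truncate else 'a'):
                pass
            created.append(target.name)
    return created
```

The sketch also creates the output directory if it does not exist yet, which the snippet above leaves to the caller.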
The following should work.
import os
titles = 'C:/Dropbox/Python/titles.txt'
d_path = 'c:/titles'
with open(titles, 'r') as f:
    for l in f:
        # append '.txt' so the files match the expected names
        with open(os.path.join(d_path, l.strip() + '.txt'), 'w') as _:
            pass