parsing .xml blast output with re - python

I'm trying to parse BLAST output in XML format using re. I have never done this before; my code is below.
However, since some hits contain more than one Hsp_num, I get more results for query_from and query_to than for query_len. How can I print query_len again for each additional Hsp_num? Thank you.
import re
output = open('result.txt','w')
n = 0
with open('file.xml','r') as xml:
    for line in xml:
        if re.search('<Hsp_query-from>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hsp_query-from>')
            line = line.rstrip('</')
            query_from = line
        if re.search('<Hsp_query-to>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hsp_query-to>')
            line = line.rstrip('</')
            query_to = line
        if re.search('<Hsp_num>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hsp_num>')
            line = line.rstrip('</')
            Hsp_num = line
            print >> output, Hsp_num+'\t'+query_from+'\t'+query_to
output.close()
I did query_len in a separate file, since it didn't work:
with open('file.xml','r') as xml:
    for line in xml:
        if re.search('<Iteration_query-len>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Iteration_query-len>')
            line = line.rstrip('</')
            query_len = line

Are you familiar with Biopython? Its Bio.Blast.NCBIXML module may be just what you need. Chapter 7 of the Tutorial and Cookbook is all about BLAST, and section 7.3 deals with parsing. You'll get an idea of how it works, and it will be a lot easier than using regex to parse XML, which will only lead to tears and mental breakdowns.
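If installing Biopython is not an option, the same idea (walk the XML tree instead of matching text line by line) can be sketched with the standard library's xml.etree.ElementTree. This is only a sketch, not the Biopython API: the function name hsp_rows and the sample document are made up for illustration, and it assumes the usual BLAST XML layout in which each Iteration element carries one Iteration_query-len and any number of Hsp elements. Because query_len is read once per Iteration and repeated for every Hsp inside it, a hit with several HSPs gets the query length on each row.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed BLAST XML used only to exercise the function.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-len>120</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>50</Hsp_query-to>
            </Hsp>
            <Hsp>
              <Hsp_num>2</Hsp_num>
              <Hsp_query-from>60</Hsp_query-from>
              <Hsp_query-to>110</Hsp_query-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def hsp_rows(xml_text):
    """Return one (Hsp_num, query_from, query_to, query_len) tuple per HSP."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        # Read the query length once per query...
        query_len = iteration.findtext("Iteration_query-len")
        for hsp in iteration.iter("Hsp"):
            # ...and repeat it for every HSP found under that query.
            rows.append((
                hsp.findtext("Hsp_num"),
                hsp.findtext("Hsp_query-from"),
                hsp.findtext("Hsp_query-to"),
                query_len,
            ))
    return rows


for row in hsp_rows(SAMPLE):
    print("\t".join(row))
```

For a real file, ET.parse('file.xml').getroot() could replace ET.fromstring; very large BLAST outputs may be better served by Biopython's parser as suggested above.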

Related

I need to print the specific part of a line in a txt file

I have a text file that reads "Janitors, 3", "Programers, 4", and "Secretaries, 1", each on its own line. I need to print Janitors separately from the number 3, and this has to work for basically any word and number combination. This is the code I came up with and, of course, it doesn't work. It says "substring not found".
File = open("Jobs.txt", "r")
Beg_line = 1
for lines in File:
    Line = str(File.readline(Beg_line))
    Line = Line.strip('\n')
    print(Line[0: Line.index(',')])
    Beg_line = Beg_line + 1
File.close()
Try running the following code:
with open("Jobs.txt", "r") as file:
    for line in file:
        # Split on the comma so the job name comes out without the number
        print(line.split(',')[0].strip())
This will give the following output:
Janitors
Programers
Secretaries
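To get the number as well as the name, each line can be split once at the comma. The sketch below assumes every line has the form "word, number"; parse_job is a hypothetical helper name, not something from the question.

```python
def parse_job(line):
    """Split a 'name, count' line into (name, count)."""
    # partition splits at the first comma only and never raises,
    # unlike str.index, which is what produced "substring not found".
    name, _, count = line.partition(",")
    return name.strip(), int(count)


for text in ["Janitors, 3", "Programers, 4", "Secretaries, 1"]:
    name, count = parse_job(text)
    print(name, count)
```

A line without a number after the comma would make int() raise ValueError, so real input may need a check for that case.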

How to fetch the second line data from text file in Python

I have a text file that contains some data, one item per line:
Dog
Cat
Cow
How can I fetch the second line, which is “Cat”, and store it in a variable in Python?
var = # “Cat”
You should place the text file in the same directory as your Python code, which could be the following:
with open("animals.txt", "r") as f:
    animals = [line.strip() for line in f]
second_line = animals[1]
Now the variable second_line contains the data you want.
You can open a file, then read line by line while counting the line number as follows:
if __name__ == '__main__':
    input_path = "data/animals.txt"
    var = None
    with open(input_path, "r") as fin:
        n_lines = 0
        for line in fin:
            n_lines += 1
            if 2 == n_lines:
                var = line.strip()
                break
    print(var)
Result:
Cat
If the file is big, you can avoid reading the whole file and call readline twice instead:
with open('file.txt') as file:
    line = file.readline()
    line = file.readline()
print(line)
...or check the seek method to start reading at a specific offset in the file.
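The same "stop as soon as you have the line" idea generalizes to any line number with itertools.islice from the standard library. This is a sketch only; nth_line is a hypothetical helper name, and it counts lines from 1.

```python
from itertools import islice


def nth_line(path, n):
    """Return line n (1-based) without its newline, or None if the file is shorter."""
    with open(path, "r") as f:
        # islice skips the first n - 1 lines lazily and stops after line n,
        # so the rest of the file is never read.
        line = next(islice(f, n - 1, n), None)
    return None if line is None else line.rstrip("\n")


# Usage, assuming animals.txt holds Dog, Cat, Cow on separate lines:
# var = nth_line("animals.txt", 2)
```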

How can we write a text file from variable using python?

I am working on an NLP project and have extracted the text from a PDF using PyPDF2. I then removed the blank lines. My output is shown on the console, but I want to populate the text file with the same data, which is stored in my variable (file).
Below is the code that removes the blank lines from the text file.
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file=line
        print(file)
Output on Console:
Eclipse,
Visual Studio 2012,
Arduino IDE,
Java
,
HTML,
CSS
2013
Excel
.
Now I want the same data in my text file (resume1.txt). I have tried three methods, but each of them writes a single dot to resume1.txt. If I look at the end of the text file, there is a dot, and that is what is being printed.
Method 1:
with open("resume1.txt", "w") as out_file:
    out_file.write(file)
Method 2:
print(file, file=open("resume1.txt", 'w'))
Method 3:
pathlib.Path('resume1.txt').write_text(file)
Could you please help me populate the text file? Thank you for your cooperation.
First of all, note that you are writing to the same file you read from, so the old data is lost; I don't know if that is what you want. Beyond that, each of those methods overwrites whatever was previously written to the output file, so if you want to use them you must write only once, with all the data.
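A small demonstration of why only the dot survives, assuming the write happens with the variable from the question's loop: a plain variable is replaced on every pass, so after the loop it holds just the final non-blank line, while a list keeps all of them. The io.StringIO object and the names last and kept are stand-ins invented for this sketch so it runs without touching resume1.txt.

```python
import io

# Stand-in for resume1.txt so the example does not touch the disk.
source = io.StringIO("Eclipse,\n\nJava\n\n.\n")

kept = []
for line in source:
    line = line.rstrip()
    if line != '':
        last = line        # replaced on every pass, like `file` in the question
        kept.append(line)  # accumulates every non-blank line

print(repr(last))  # only the final non-blank line remains
print(kept)        # all non-blank lines, ready to be written in one go
```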
SOLUTIONS
Using method 1:
to_file = []
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
        to_file.append(file)
to_save = '\n'.join(to_file)
with open("resume1.txt", "w") as out_file:
    out_file.write(to_save)
Using method 2:
to_file = []
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
        to_file.append(file)
to_save = '\n'.join(to_file)
print(to_save, file=open("resume1.txt", 'w'))
Using method 3:
import pathlib
to_file = []
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
        to_file.append(file)
to_save = '\n'.join(to_file)
pathlib.Path('resume1.txt').write_text(to_save)
In these 3 methods I used to_save = '\n'.join(to_file) because I'm assuming you want to separate the lines with an EOL. If I'm wrong, you can use ''.join(to_file) for no separator at all, or ' '.join(to_file) to put all the lines on a single one.
Other method
You can do this by using another file, let's say 'output.txt'.
out_file = open('output.txt', 'w')
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
        out_file.write(file)
        out_file.write('\n') # EOL
out_file.close()
Also, you can do this (I prefer this):
with open('output.txt', 'w') as out_file:
    for line in open('resume1.txt'):
        line = line.rstrip()
        if line != '':
            file = line
            print(file)
            out_file.write(file)
            out_file.write('\n') # EOL
First post on stack, so excuse the format
new_line = ""
for line in open('resume1.txt', "r"):
    for char in line:
        if char != " ":
            new_line += char
print(new_line)
with open('resume1.txt', "w") as f:
    f.write(new_line)

Replace string in line without adding new line?

I want to replace a string in the line that contains patternB, something like this:
from:
some lines
line contain patternA
some lines
line contain patternB
more lines
to:
some lines
line contain patternA
some lines
line contain patternB xx oo
more lines
I have code like this:
inputfile = open("d:\myfile.abc", "r")
outputfile = open("d:\myfile_renew.abc", "w")
obj = "yaya"
dummy = ""
item = []
for line in inputfile:
    dummy += line
    if line.find("patternA") != -1:
        for line in inputfile:
            dummy += line
            if line.find("patternB") != -1:
                item = line.split()
                dummy += item[0] + " xx " + item[-1] + "\n"
                break
outputfile.write(dummy)
It does not replace the line containing "patternB" as expected, but adds a new line below it, like this:
some lines
line contain patternA
some lines
line contain patternB
line contain patternB xx oo
more lines
What can I do with my code?
Of course it does: you append line to dummy at the beginning of the for loop and then append the modified version again inside the if statement. Also, why check for patternA if you treat it the same as everything else?
inputfile = open("d:\myfile.abc", "r")
outputfile = open("d:\myfile_renew.abc", "w")
obj = "yaya"
dummy = ""
item = []
for line in inputfile:
    if line.find("patternB") != -1:
        item = line.split()
        dummy += item[0] + " xx " + item[-1] + "\n"
    else:
        dummy += line
outputfile.write(dummy)
The simplest will be:
1. Read all File into string
2. Call string.replace
3. Dump string to file
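The three steps above can be sketched with pathlib as follows. The helper name replace_in_file and its parameters are hypothetical, and the whole file is held in memory, so this suits small files only.

```python
from pathlib import Path


def replace_in_file(src, dst, old, new):
    """Copy src to dst with every occurrence of old replaced by new."""
    text = Path(src).read_text()      # 1. read the whole file into a string
    text = text.replace(old, new)     # 2. call str.replace
    Path(dst).write_text(text)        # 3. dump the string to a file


# Usage with the paths from the question (raw strings keep the backslashes literal):
# replace_in_file(r"d:\myfile.abc", r"d:\myfile_renew.abc",
#                 "patternB", "patternB xx oo")
```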
If you want to keep the line-by-line iterator (for a big file):
for line in inputfile:
    if line.find("patternB") != -1:
        dummy = line.replace('patternB', 'patternB xx oo')
        outputfile.write(dummy)
    else:
        outputfile.write(line)
This is slower than the other answers, but it can handle big files because only one line is held in memory at a time.
This should work:
import os
def replace():
    f1 = open("d:\myfile.abc","r")
    f2 = open("d:\myfile_renew.abc","w")
    # raw_input is Python 2; use input in Python 3
    ow = raw_input("Enter word you wish to replace:")
    nw = raw_input("Enter new word:")
    for line in f1:
        templ = line.split()
        # Swap matching words, then rejoin with spaces so words do not run together
        templ = [nw if i == ow else i for i in templ]
        f2.write(' '.join(templ))
        f2.write('\n')
    f1.close()
    f2.close()
    os.remove("d:\myfile.abc")
    os.rename("d:\myfile_renew.abc","d:\myfile.abc")
replace()
You can use str.replace:
s = '''some lines
line contain patternA
some lines
line contain patternB
more lines'''
print(s.replace('patternB', 'patternB xx oo'))

Why is this writing part of the text to a new line? (Python)

I'm adding some new bits to one of the lines in a text file and then writing it along with the rest of the lines in the file to a new file. Referring to the 2nd if statement in the while loop, I want that to be all on the same line:
path = raw_input("Enter the name of the destination folder: ")
source_file = open("parameters")
lnum=1
for line in source_file:
    nums = line.split()
    if (lnum==10):
        mTot = float(nums[0])
    if (lnum==11):
        qinit = float(nums[0])
    if (lnum==12):
        qfinal = float(nums[0])
    if (lnum==13):
        qgrowth = float(nums[0])
    if (lnum==14):
        K = float(nums[0])
    lnum = lnum+1
q = qinit
m1 = mTot/(1+qinit)
m2 = (mTot*qinit)/(1+qinit)
taua = (1/3.7)*(mTot**(-4.0/3.0))
taue = taua/K
i = 1
infname = 'parameters'
while (q <= qfinal):
    outfname = path+'/'+str(i)
    oldfile = open(infname)
    lnum=1
    for line in oldfile:
        if (lnum==17):
            line = "{0:.2e}".format(m1)+' '+line
        if (lnum==18):
            line = "{0:.2e}".format(m2)+' '+line+' '+"{0:.2e}".format(taua)+' '+" {0:.2e}".format(taue)
        newfile = open(outfname,'a')
        newfile.write(line)
        lnum=lnum+1
    oldfile.close()
    newfile.close()
    i=i+1
    q = q + q*(qgrowth)
    m1 = mTot/(1+q)
    m2 = (mTot*q)/(1+q)
but taua and taue are being written on the line below the rest of it. What am I missing here?
That is because line still contains its trailing newline, so when you concatenate onto it the newline ends up in the middle.
Insert a
line = line.strip()
right after the if (lnum==18): but before you put the longer line together, to get rid of the newline.
Note that write will not add a newline automatically, so you'll want to add a trailing newline of your own.
UPDATE:
This is untested, but I think you could just use this instead of your longer line:
line = line.strip()
line = "{0:.2e} {1} {2:.2e} {3:.2e}\n".format(m2, line, taua, taue)
If you call line = line.rstrip() before you build the new line, it will trim the newline (as well as any other trailing whitespace).
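Put together, the fix might look like the sketch below: strip the newline that came with the line read from the file, then add exactly one back at the end. build_line is a hypothetical helper; m2, taua and taue are the names from the question.

```python
def build_line(line, m2, taua, taue):
    """Return the rewritten line 18 as a single line ending in one newline."""
    body = line.rstrip("\n")  # drop the newline read from the file
    # Explicit field numbers are required here: str.format cannot mix
    # numbered fields such as {0} with automatic fields such as {}.
    return "{0:.2e} {1} {2:.2e} {3:.2e}\n".format(m2, body, taua, taue)


print(build_line("abc\n", 1.0, 2.0, 3.0), end="")
```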
