Extract specific entries from blastx output file, write to new file

I have created a script that successfully searches for user-specified keywords within a Blastx output file in XML format. Now I need to write the records (query, hit, score, e-value, etc.) that contain the keyword in the alignment title to a new file.
I have created separate lists for the query titles, hit titles, e-values and alignment lengths, but cannot seem to write them to a new file.
Problem #1: what if Python raises an error and one of the lists is missing a value? Then all the other lists will give wrong information in reference to the query ("line slippage", if you will).
Problem #2: even if Python doesn't raise an error and all the lists are the same length, how can I write them to a file so that the first item in each list is associated with the first item in the others (and thus item #10 from each list is also associated)? Should I create a dictionary instead?
Problem #3: dictionaries have only a single value for a key. What if my query has several different hits? I am not sure whether the value will be overwritten or skipped, or whether it will just raise an error. Any suggestions? My current script:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import re
#obtain full path to blast output file (*.xml)
outfile = input("Full path to Blast output file (XML format only): ")
#obtain string to search for
search_string = input("String to search for: ")
#open the output file
result_handle = open(outfile)
#parse the blast record
blast_records = NCBIXML.parse(result_handle)
#initialize lists
query_list=[]
hit_list=[]
expect_list=[]
length_list=[]
#create 'for loop' that loops through each HIGH SCORING PAIR in each ALIGNMENT from each RECORD
for record in blast_records:
    for alignment in record.alignments: #for description in record.descriptions???
        for hsp in alignment.hsps: #for title in description.title???
            #search for designated string
            search = re.search(search_string, alignment.title)
            #if search comes up with nothing, end
            if search is None:
                print ("Search string not found.")
                break
            #if search comes up with something, add it to a list of entries that match search string
            else:
                #option to include an 'exception' (if it finds keyword then DOES NOT add that entry to list)
                if search is "trichomonas" or "entamoeba" or "arabidopsis":
                    print ("found exception.")
                    break
                else:
                    query_list.append(record.query)
                    hit_list.append(alignment.title)
                    expect_list.append(expect_val)
                    length_list.append(length)
                    #explicitly convert 'variables' ['int' object or 'float'] to strings
                    length = str(alignment.length)
                    expect_val = str(hsp.expect)
                    #print ("\nquery name: " + record.query)
                    #print ("alignment title: " + alignment.title)
                    #print ("alignment length: " + length)
                    #print ("expect value: " + expect_val)
                    #print ("\n***Alignment***\n")
                    #print (hsp.query)
                    #print (hsp.match)
                    #print (hsp.sbjct + "\n\n")
                    if query_len is not hit_len is not expect_len is not length_len:
                        print ("list lengths don't match!")
                        break
                    else:
                        qrylen = len(query_list)
                        query_len = str(qrylen)
                        hitlen = len(hit_list)
                        hit_len = str(hitlen)
                        expectlen = len(expect_list)
                        expect_len = str(expectlen)
                        lengthlen = len(length_list)
                        length_len = str(lengthlen)
outpath = str(outfile)
#create new file
outfile = open("__Blast_Parse_Search.txt", "w")
outfile.write("File contains entries from [" + outpath + "] that contain [" + search_string + "]")
outfile.close
#write list to file
i = 0
list_len = int(query_len)
for i in range(0, list_len):
    #append new file
    outfile = open("__Blast_Parse_Search.txt", "a")
    outfile.writelines(query_list + hit_list + expect_list + length_list)
    i = i + 1
#write to disk, close file
outfile.flush()
outfile.close
print ("query list length " + query_len)
print ("hit list length " + hit_len)
print ("expect list length " + expect_len)
print ("length list length " + length_len + "\n\n")
print ("first record: " + query_list[0] + " " + hit_list[0] + " " + expect_list[0] + " " + length_list[0])
print ("last record: " + query_list[-1] + " " + hit_list[-1] + " " + expect_list[-1] + " " + length_list[-1])
print ("\nFinished.\n")

If I understand your problem correctly, you could guard against the line slippage by appending a default value whenever a step fails, for example:
try:
    x(list)
except Exception:
    append_default_value(list)
http://docs.python.org/tutorial/errors.html#handling-exceptions
Alternatively, use tuples as dictionary keys, such as (0,1,1), and use the dictionary's get method to supply the default value.
http://docs.python.org/py3k/library/stdtypes.html#mapping-types-dict
If you need to preserve data structures in your output files, you might try the shelve module.
Or you could give each record a unique id and append references to other records after it, for example '#32{somekey:value}#21#22#44#'.
Again, you can have multiple keys by using a tuple.
I don't know if that helps. It would be easier to answer if you clarified exactly which parts of your code give you trouble, in the form "x() gives me output y but I expect z".
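One way to sidestep the "line slippage" from Problems #1 and #2 is to keep each hit as a single record instead of four parallel lists, and to group hits under their query so that several hits per query (Problem #3) never overwrite each other. A minimal Python 3 sketch, assuming the values have already been pulled out of the parsed BLAST records; the sample data, the group_by_query and write_records names, and the tab-separated layout are illustrative assumptions, not part of the original script:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: one tuple per matching HSP
# (query, hit title, e-value, alignment length).
records = [
    ("query1", "hit A", 1e-30, 210),
    ("query1", "hit B", 2e-12, 180),
    ("query2", "hit C", 5e-08, 95),
]

def group_by_query(rows):
    """Map each query to a list of its hits, so no hit is overwritten."""
    grouped = defaultdict(list)
    for query, hit, evalue, length in rows:
        grouped[query].append((hit, evalue, length))
    return dict(grouped)

def write_records(rows, handle):
    """Write a header row, then one tab-separated line per record."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["query", "hit", "evalue", "length"])
    writer.writerows(rows)

grouped = group_by_query(records)
buffer = io.StringIO()  # stands in for open("out.txt", "w", newline="")
write_records(records, buffer)
output_text = buffer.getvalue()
```

Because every value for a hit is appended in one step, a failure while reading one hit cannot leave the columns out of step with each other.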

Related

Counting instances of words from a file in Python, only works for single letters

The below code is supposed to count instances of a particular word in a text file, though it seems to only work for individual letters. Using a string of two letters or more always returns a count of 0. I have checked, and the input I have been using should definitely not return a count of 0 for the given files.
Any ideas?
def count_of_word(filename, word_to_count):
    """Counts instances of a particular word in a file"""
    try:
        with open(filename) as file_object:
            contents = file_object.read()
    except FileNotFoundError:
        print("File " + filename + " not found")
    else:
        word_count = contents.lower().count(word_to_count)
        print("The count of the word '" + word_to_count + "' in " + filename + " is " + str(word_count))

You lower-case only the file contents, not the word you are searching for. Try:
word_count = contents.lower().count(word_to_count.lower())
That works for me: I get 1026 for the count of 'and' in the file you refer to.
EDIT: I suspected an encoding issue and suggested specifying the encoding, which worked:
open(filename, encoding='utf_8')
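Note that str.count counts substrings, so 'and' is also counted inside words such as 'band'. If only whole words should be counted, a regular expression with word boundaries is one option. A minimal Python 3 sketch; the count_word name and the sample sentence are assumptions made for illustration:

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of word in text."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "The cat and the hat. And then the band played."
substring_count = sample.lower().count("and")  # also matches inside "band"
whole_word_count = count_word(sample, "and")   # standalone word only
```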

I did not change a single line of your code, and it works. I wonder whether this has to do with how you are passing 'the' or 'and' into the function; the call should look like count_of_word('alice.txt', 'the').
def count_of_word(filename, word_to_count):
    """Counts instances of a particular word in a file"""
    try:
        with open(filename) as file_object:
            contents = file_object.read()
    except FileNotFoundError:
        print("File " + filename + " not found")
    else:
        word_count = contents.lower().count(word_to_count)
        print("The count of the word '" + word_to_count + "' in " + filename + " is " + str(word_count))
count_of_word('alice.txt', 'the')
count_of_word('alice.txt', 'a')
~/python/stack/sept/twenty_2$ python3.7 alice.py
The count of the word 'and' in alice.txt is 2505
The count of the word 'a' in alice.txt is 9804

How do I assign values in text file to an array inside python function and use it as global?

I am using Windows 10 and Python 2.7.14, running the scripts from the command prompt. I want to read lines from a text file and compare them with some text; if a line matches, part of it should be stored in an array. I also want the array to be global. But in my script I am not able to store the contents in the array. How do I achieve this?
#This method is to reading logfile and saving the different datas in different lists
def Readlogs(Filename):
    datafile = file(Filename)
    for line in datafile:
        if "login = " in line:
            print(line)
            trial=line
            s2 = "= "
            ArrayLogin = trial[trial.index(s2) + len(s2):]
            print(ArrayLogin)
        print(ArrayLogin)
        if "Overlay = " in line:
            print(line)
            trial2=line
            s2 = "= "
            arrayOverlay = trial2[trial2.index(s2) + len(s2):]
            print(arrayOverlay)
Readlogs(WriteFileName)

You can declare empty lists, append items to them, and return them from the function.
#This method reads the logfile and saves the different values in different lists
def Readlogs(Filename):
    #empty lists
    ArrayLogin, arrayOverlay = [], []
    datafile = file(Filename)
    for line in datafile:
        if "login = " in line:
            print(line)
            trial=line
            s2 = "= "
            ArrayLogin.append(trial[trial.index(s2) + len(s2):])
            print(ArrayLogin)
        if "Overlay = " in line:
            print(line)
            trial2=line
            s2 = "= "
            arrayOverlay.append(trial2[trial2.index(s2) + len(s2):])
            print(arrayOverlay)
    return ArrayLogin, arrayOverlay
arr1, arr2 = Readlogs(WriteFileName)
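The same idea can be tried without a log file by passing the lines in directly. A minimal Python 3 sketch, assuming a hypothetical read_logs helper and made-up sample lines; unlike the code above it also strips the trailing newline from each value. Names assigned at module level, such as login_values here, are global and can be read from any function in the module:

```python
def read_logs(lines):
    """Collect the text after 'login = ' and 'Overlay = ' into two lists."""
    logins, overlays = [], []
    for line in lines:
        if "login = " in line:
            # split once on the first "= " and keep what follows it
            logins.append(line.split("= ", 1)[1].strip())
        if "Overlay = " in line:
            overlays.append(line.split("= ", 1)[1].strip())
    return logins, overlays

# Hypothetical log content standing in for the real file
sample_lines = ["login = alice\n", "noise\n", "Overlay = on\n", "login = bob\n"]

# Module-level names: these two lists are global
login_values, overlay_values = read_logs(sample_lines)
```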

Python export to file via ofile without bracket characters

I successfully simplified a python module that imports data from a spectrometer
(I'm a total beginner, somebody else wrote the model of the code for me...)
I only have one problem: half of the output data (in a .csv file) is surrounded by brackets: []
I would like the file to contain a structure like this:
name, wavelength, measurement
i.e.
a,400,0.34
a,410,0.65
...
but what I get is:
a,400,[0.34]
a,410,[0.65]
...
Is there any simple fix for this?
Is it because measurement is a string?
Thank you
import serial # requires pyserial library
ser = serial.Serial(0)
ofile = file( 'spectral_data.csv', 'ab')
while True:
    name = raw_input("Pigment name [Q to finish]: ")
    if name == "Q":
        print "bye bye!"
        ofile.close()
        break
    first = True
    while True:
        line = ser.readline()
        if first:
            print " Data incoming..."
            first = False
        split = line.split()
        if 10 <= len(split):
            try:
                wavelength = int(split[0])
                measurement = [float(split[i]) for i in [6]]
                ofile.write(str(name) + "," + str(wavelength) + "," + str(measurement) + '\n')
            except ValueError:
                pass # handles the table heading
        if line[:3] == "110":
            break
    print " Data gathered."
    ofile.write('\n')

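The brackets appear because measurement is a list, and str() of a list includes them. Join the items instead, converting each float to a string first:
measurement = [float(split[i]) for i in [6]]
ofile.write(str(name) + "," + str(wavelength) + "," + ",".join(str(m) for m in measurement) + '\n')
Or, since only one field is used, write it directly:
ofile.write(str(name) + "," + str(wavelength) + "," + split[6] + '\n')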
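The difference is easy to check away from the instrument. A minimal Python 3 sketch, assuming a hypothetical format_row helper and a made-up ten-field instrument line; it converts field 6 to a single float rather than a one-item list, so no brackets are written:

```python
def format_row(name, wavelength, measurement):
    """Build one CSV line from a name, a wavelength and a single float."""
    return "{},{},{}\n".format(name, wavelength, measurement)

# Hypothetical line as it might arrive from the serial port
fields = "400 0 0 0 0 0 0.34 0 0 0".split()

row = format_row("a", int(fields[0]), float(fields[6]))
bracketed = str([float(fields[6])])  # what the original code wrote
```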

Cannot write to text file python 2.7 invalid syntax?

My code works perfectly, but I want it to write the values to a text file. When I try to do it, I get 'invalid syntax'. When I use a python shell, it works. So I don't understand why it isn't working in my script.
I bet it's something silly, but why won't it output the data to a text file?
#!/usr/bin/env python
#standard module, needed as we deal with command line args
import sys
from fractions import Fraction
import pyexiv2
#checking whether we got enough args, if not, tell how to use, and exits
#if len(sys.argv) != 2 :
#    print "incorrect argument, usage: " + sys.argv[0] + ' <filename>'
#    sys.exit(1)
#so the argument seems to be ok, we use it as an imagefile
imagefilename = sys.argv[1]
#trying to catch the exceptions in case of problem with the file reading
try:
    metadata = pyexiv2.metadata.ImageMetadata(imagefilename)
    metadata.read();
    #trying to catch the exceptions in case of problem with the GPS data reading
    try:
        latitude = metadata.__getitem__("Exif.GPSInfo.GPSLatitude")
        latitudeRef = metadata.__getitem__("Exif.GPSInfo.GPSLatitudeRef")
        longitude = metadata.__getitem__("Exif.GPSInfo.GPSLongitude")
        longitudeRef = metadata.__getitem__("Exif.GPSInfo.GPSLongitudeRef")
        # get the value of the tag, and make it float number
        alt = float(metadata.__getitem__("Exif.GPSInfo.GPSAltitude").value)
        # get human readable values
        latitude = str(latitude).split("=")[1][1:-1].split(" ");
        latitude = map(lambda f: str(float(Fraction(f))), latitude)
        latitude = latitude[0] + u"\u00b0" + latitude[1] + "'" + latitude[2] + '"' + " " + str(latitudeRef).split("=")[1][1:-1]
        longitude = str(longitude).split("=")[1][1:-1].split(" ");
        longitude = map(lambda f: str(float(Fraction(f))), longitude)
        longitude = longitude[0] + u"\u00b0" + longitude[1] + "'" + longitude[2] + '"' + " " + str(longitudeRef).split("=")[1][1:-1]
        ## Printing out, might need to be modified if other format needed
        ## i just simple put tabs here to make nice columns
        print " \n A text file has been created with the following information \n"
        print "GPS EXIF data for " + imagefilename
        print "Latitude:\t" + latitude
        print "Longitude:\t" + longitude
        print "Altitude:\t" + str(alt) + " m"
    except Exception, e: # complain if the GPS reading went wrong, and print the exception
        print "Missing GPS info for " + imagefilename
        print e
# Create a new file or **overwrite an existing file**
text_file = open('textfile.txt', 'w')
text_file.write("Latitude" + latitude)
# Close the output file
text_file.close()
except Exception, e: # complain if the GPS reading went wrong, and print the exception
    print "Error processing image " + imagefilename
    print e;
The error I see says:
text_file = open('textfile.txt','w')
^
SyntaxError: invalid syntax

The file open comes after the inner try/except block but before the except clause of the outer try block, and it is not indented to match. Either move it after the outer try/except block, or increase its indent so that it sits inside the outer try block. It should work fine there.
Also increase the indent of the two print statements so that they sit inside the try block as well.
This will work for you:
#!/usr/bin/env python
#standard module, needed as we deal with command line args
import sys
from fractions import Fraction
import pyexiv2
#checking whether we got enough args, if not, tell how to use, and exits
#if len(sys.argv) != 2 :
#    print "incorrect argument, usage: " + sys.argv[0] + ' <filename>'
#    sys.exit(1)
#so the argument seems to be ok, we use it as an imagefile
imagefilename = sys.argv[1]
#trying to catch the exceptions in case of problem with the file reading
try:
    metadata = pyexiv2.metadata.ImageMetadata(imagefilename)
    metadata.read();
    #trying to catch the exceptions in case of problem with the GPS data reading
    try:
        latitude = metadata.__getitem__("Exif.GPSInfo.GPSLatitude")
        latitudeRef = metadata.__getitem__("Exif.GPSInfo.GPSLatitudeRef")
        longitude = metadata.__getitem__("Exif.GPSInfo.GPSLongitude")
        longitudeRef = metadata.__getitem__("Exif.GPSInfo.GPSLongitudeRef")
        # get the value of the tag, and make it float number
        alt = float(metadata.__getitem__("Exif.GPSInfo.GPSAltitude").value)
        # get human readable values
        latitude = str(latitude).split("=")[1][1:-1].split(" ");
        latitude = map(lambda f: str(float(Fraction(f))), latitude)
        latitude = latitude[0] + u"\u00b0" + latitude[1] + "'" + latitude[2] + '"' + " " + str(latitudeRef).split("=")[1][1:-1]
        longitude = str(longitude).split("=")[1][1:-1].split(" ");
        longitude = map(lambda f: str(float(Fraction(f))), longitude)
        longitude = longitude[0] + u"\u00b0" + longitude[1] + "'" + longitude[2] + '"' + " " + str(longitudeRef).split("=")[1][1:-1]
        ## Printing out, might need to be modified if other format needed
        ## i just simple put tabs here to make nice columns
        print " \n A text file has been created with the following information \n"
        print "GPS EXIF data for " + imagefilename
        print "Latitude:\t" + latitude
        print "Longitude:\t" + longitude
        print "Altitude:\t" + str(alt) + " m"
    except Exception, e: # complain if the GPS reading went wrong, and print the exception
        print "Missing GPS info for " + imagefilename
        print e
    # Create a new file or **overwrite an existing file**
    text_file = open('textfile.txt', 'w')
    text_file.write("Latitude" + latitude)
    # Close the output file
    text_file.close()
except Exception, e: # complain if the GPS reading went wrong, and print the exception
    print "Error processing image " + imagefilename
    print e;

Could your indentation be wrong? The lines:
print " \n A text file has been created with the following information \n"
print "GPS EXIF data for " + imagefilename
appear to be wrongly indented.
EDIT: The code you posted (the one in the traceback) is wrongly indented, too.
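The indentation rule behind these answers can be reduced to a few lines: every statement between try and its except must be indented one level deeper than the try itself. A minimal Python 3 sketch, assuming a hypothetical save_latitude helper and a temporary file in place of the EXIF script; it is an illustration of the structure, not a rewrite of the original program:

```python
import os
import tempfile

def save_latitude(latitude, path):
    """Write the latitude from inside the outer try block."""
    try:
        try:
            value = "Latitude: " + latitude
        except TypeError:  # e.g. latitude is None
            value = "Latitude: unknown"
        # Still inside the outer try: indented one level, not at column 0.
        with open(path, "w") as text_file:
            text_file.write(value)
    except OSError:
        return False
    return True

target = os.path.join(tempfile.mkdtemp(), "textfile.txt")
ok = save_latitude("47.5", target)
with open(target) as handle:
    written = handle.read()
```

Moving the with statement back to column 0, before the final except, reproduces the SyntaxError from the question.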

Append to JSON in Python (Optimally due to RAM constraint)

I'm trying to find the optimal way to append data to a JSON file using Python. I have about 100 threads open, each storing data in an array. When a thread is done, it sends its array to a JSON file using json.dump. However, since it can take a few hours for the array to build up, I eventually run out of RAM. So I'm trying to find the approach that uses the least RAM in this process. The following is what I have, which consumes too much RAM.
i = 0
twitter_data = {}
for null in range(0,1):
    while True:
        try:
            for friends in Cursor(api.followers_ids,screen_name=self.ip).items():
                twitter_data[i] = {}
                twitter_data[i]['fu'] = self.ip
                twitter_data[i]['su'] = friends
                i = i + 1
        except tweepy.TweepError, e:
            print "ERROR on " + str(self.ip) + " Reason: ", e
            with open('C:/Twitter/errors.txt', mode='a') as a_file:
                new_ii = "ERROR on " + str(self.ip) + " Reason: " + str(e) + "\n"
                a_file.write(new_ii)
        break
## Save data
with open('C:/Twitter/user_' + str(self.id) + '.json', mode='w') as f:
    json.dump(twitter_data, f, indent=2, encoding='utf-8')
Thanks

Output the individual items as an array as they're created, creating the JSON formatting for the array around it manually. JSON is a simple format, so this is trivial to do.
Here's a simple example that prints out a JSON array, without having to hold the entire contents in memory; only a single element in the array needs to be stored at once.
import json
import sys
def get_item():
    return { "a": 5, "b": 10 }
def get_array():
    # yield the array piece by piece instead of building it in memory
    yield "["
    for x in xrange(5):
        if x > 0:
            yield ","
        yield json.dumps(get_item())
    yield "]"
if __name__ == "__main__":
    for s in get_array():
        sys.stdout.write(s)
    sys.stdout.write("\n")

My take, building on the idea from Glenn's answer, but serializing a big dict as requested by the OP and using the more Pythonic enumerate instead of manually incrementing i. Errors can be taken into account by keeping a separate count for them and subtracting it from i before writing to f. The key is written as str(i) because JSON object keys must be strings:
with open('C:/Twitter/user_' + str(self.id) + '.json', mode='w') as f:
    f.write('{')
    for i, friends in enumerate(Cursor(api.followers_ids,screen_name=self.ip).items()):
        if i>0:
            f.write(", ")
        f.write("%s:%s" % (json.dumps(str(i)), json.dumps(dict(fu=self.ip, su=friends))))
    f.write("}")
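The streaming approach can be checked without the Twitter API by feeding it a generator and parsing the result back. A minimal Python 3 sketch, assuming hypothetical stream_json_object and follower_records helpers and made-up follower ids; an in-memory buffer stands in for the output file:

```python
import io
import json

def stream_json_object(items, handle):
    """Write items as one JSON object keyed by index, one entry at a time."""
    handle.write("{")
    for index, item in enumerate(items):
        if index > 0:
            handle.write(", ")
        # JSON object keys must be strings, so the index is converted first.
        handle.write("%s: %s" % (json.dumps(str(index)), json.dumps(item)))
    handle.write("}")

def follower_records(user, follower_ids):
    """Yield one small dict per follower instead of building a big dict."""
    for follower in follower_ids:
        yield {"fu": user, "su": follower}

buffer = io.StringIO()  # stands in for open(path, "w")
stream_json_object(follower_records("alice", iter([11, 22, 33])), buffer)
parsed = json.loads(buffer.getvalue())
```

Only one record exists in memory at a time, and the output still loads as a single JSON object.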
