Modify location of a genbank feature - python

Edit: I know feature.type will give gene/CDS, and feature.qualifiers will then give "db_xref"/"locus_tag"/"inference" etc. Is there a feature attribute which will allow me to access the location (e.g. [5240:7267](+)) directly?
This URL gives a bit more info, though I can't figure out how to use it for my purpose: http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html#location_operator
Original Post:
I am trying to modify the location of features within a GenBank file. Essentially, I want to modify the following bit of a GenBank file:
     gene            5240..7267
                     /db_xref="GeneID:887081"
                     /locus_tag="Rv0005"
                     /gene="gyrB"
     CDS             5240..7267
                     /locus_tag="Rv0005"
                     /inference="protein motif:PROSITE:PS00177"
                     ...........................
to
     gene            5357..7267
                     /db_xref="GeneID:887081"
                     /locus_tag="Rv0005"
                     /gene="gyrB"
     CDS             5357..7267
                     /locus_tag="Rv0005"
                     /inference="protein motif:PROSITE:PS00177"
                     .............................
Note the change from 5240 to 5357.
So far, from scouring the internet and Stackoverflow, I have:
from Bio import SeqIO

gb_file = "mtbtomod.gb"
gb_record = SeqIO.parse(open(gb_file, "r+"), "genbank")
rvnumber = 'Rv0005'
newstart = 5357
final_features = []
for record in gb_record:
    for feature in record.features:
        if feature.type == "gene":
            if feature.qualifiers["locus_tag"][0] == rvnumber:
                if feature.location.strand == 1:
                    feature.qualifiers["amend_position"] = "%s:%s" % (newstart, feature.location.end + 1)
                else:
                    # do the reverse for the complementary strand
                    pass
        final_features.append(feature)
    record.features = final_features
with open("testest.gb", "w") as testest:
    SeqIO.write(record, testest, "genbank")
This basically creates a new qualifier called "amend_position". However, what I would like to do is modify the location directly (with or without creating a new file).
Rv0005 is just an example of a locus_tag I need to update. I have about 600 more locations to update, which explains the need for a script. Help!

OK, I now have something which fully works. I'll post the code in case anyone ever needs something similar:
__author__ = 'Kavin'

from Bio import SeqIO
from Bio import SeqFeature
import xlrd
import sys
import re

workbook = xlrd.open_workbook(sys.argv[2])
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

# Create dicts to store TSS data
TSS = {}
row = {}

# For each entry (row), store the start codon and strand information
for i in range(2, sheet.nrows - 1):
    if data[i][5] < -0.7:  # Ensures BASS score is within significant range
        Gene = data[i][0]
        row['Direction'] = str(data[i][3])
        row['StartCodon'] = int(data[i][4])
        TSS[str(Gene)] = row
        row = {}

# Create an output filename based on input filename
outfile_init = re.search(r'(.*)\.(\w*)', sys.argv[1])
outfile = str(outfile_init.group(1)) + '_modified.' + str(outfile_init.group(2))

final_features = []
for record in SeqIO.parse(open(sys.argv[1], "r"), "genbank"):
    for feature in record.features:
        if feature.type == "gene" or feature.type == "CDS":
            if feature.qualifiers["locus_tag"][0] in TSS:
                newstart = TSS[feature.qualifiers["locus_tag"][0]]['StartCodon']
                if feature.location.strand == 1:
                    feature.location = SeqFeature.FeatureLocation(
                        SeqFeature.ExactPosition(newstart - 1),
                        SeqFeature.ExactPosition(feature.location.end.position),
                        feature.location.strand)
                else:
                    feature.location = SeqFeature.FeatureLocation(
                        SeqFeature.ExactPosition(feature.location.start.position),
                        SeqFeature.ExactPosition(newstart),
                        feature.location.strand)
        final_features.append(feature)  # Append final features
    record.features = final_features
with open(outfile, "w") as new_gb:
    SeqIO.write(record, new_gb, "genbank")
This assumes usage such as python program.py <genbankfile> <excel spreadsheet>
This also assumes a spreadsheet of the following format:
Gene      Synonym    Tuberculist_annotated_start    Orientation    Re-annotated_start    BASS_score
Rv0005    gyrB       5240                           +              5357                  -1.782
Rv0012    Rv0012     14089                          +              14134                 -1.553
Rv0018c   pstP       23181                          -              23172                 -2.077
Rv0032    bioF2      34295                          +              34307                 -0.842
Rv0037c   Rv0037c    41202                          -              41163                 -0.554
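Note the off-by-one handling in the code above: GenBank coordinates are 1-based with an inclusive end, while Biopython's FeatureLocation is 0-based with an exclusive end, hence ExactPosition(newstart - 1) on the forward strand and the unshifted newstart as the new end on the reverse strand. A quick sketch to convince yourself (my addition, not part of the original answer):

from Bio.SeqFeature import SeqFeature, FeatureLocation

# GenBank "gene 5357..7267" corresponds to the 0-based slice [5356:7267]
f = SeqFeature(FeatureLocation(5356, 7267, strand=+1), type="gene")
print f.location  # [5356:7267](+)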

So, you can try something like below. The number of changes will be equal to the number of CDS/gene lines found in the file. You can read the locations/positions from a csv/text file and build a list like the change_values I made manually.
import re

f = open("test.txt")
change_values = ["1111", "2222"]
flag = True
next = 0
for i in f.readlines():
    if i.startswith('     CDS') or i.startswith('     gene'):
        # replace only the first number on the line (the start coordinate)
        out = re.sub(r"\d+", str(change_values[next]), i, count=1)
        # Instead of print, write
        print out
        flag = not flag
        if flag:
            next += 1
    else:
        # Instead of print, write
        print i
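As a hedged sketch (my addition), if the new positions lived in a two-column csv of locus_tag,new_start pairs (a format I'm assuming here), change_values could be built like this instead of by hand:

import csv

change_values = []
with open('new_starts.csv') as fh:
    for locus_tag, new_start in csv.reader(fh):
        change_values.append(new_start)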
My sample test.txt file looks like this:
     gene            5240..7267
                     /db_xref="GeneID:887081"
                     /locus_tag="Rv0005"
                     /gene="gyrB"
     CDS             5240..7267
                     /locus_tag="Rv0005"
                     /inference="protein motif:PROSITE:PS00177"
                     ...........................
     gene            5240..7267
                     /db_xref="GeneID:887081"
                     /locus_tag="Rv0005"
                     /gene="gyrB"
     CDS             5240..7267
                     /locus_tag="Rv0005"
                     /inference="protein motif:PROSITE:PS00177"
                     ...........................
Hope, this will solve your issue. Cheers!

I think this can be done with native Biopython syntax, no regex needed. Minimal working example here:
from Bio import SeqIO
from Bio import SeqFeature
import copy

gbk = SeqIO.read('./test_gbk', 'gb')
index = 1
feature_to_change = copy.deepcopy(gbk.features[index])  # depends what feature you want to change,
# can also be done with a loop if you want to change them all, or by some function...
new_start = 0
new_end = 100
new_feature_location = SeqFeature.FeatureLocation(new_start, new_end, feature_to_change.location.strand)  # create a new FeatureLocation object
feature_to_change.location = new_feature_location  # change old feature location
del gbk.features[index]  # remove changed feature
gbk.features.append(feature_to_change)  # add identical feature with new location
gbk.features = sorted(gbk.features, key=lambda feature: feature.location.start)  # if you want them sorted by the start of the location, which is the usual case
SeqIO.write(gbk, './test_gbk_with_new_feature', 'gb')
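If you do want the loop version hinted at in the comment above, here is a minimal untested sketch (the new_starts mapping is a hypothetical stand-in for your own locus-to-position data):

from Bio import SeqIO
from Bio.SeqFeature import FeatureLocation

new_starts = {'Rv0005': 5356}  # hypothetical: locus_tag -> new 0-based start

gbk = SeqIO.read('./test_gbk', 'gb')
for feature in gbk.features:
    locus = feature.qualifiers.get('locus_tag', [None])[0]
    if locus in new_starts:
        # keep the old end and strand, swap in the new start
        feature.location = FeatureLocation(new_starts[locus],
                                           feature.location.end,
                                           feature.location.strand)
SeqIO.write(gbk, './test_gbk_with_new_starts', 'gb')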

Related

Cannot save modifications made in xlsx file

I read a .xlsx file and update it, but I'm not able to save it.
from xml.dom import minidom as md

[... some code ....]

sheet = workDir + '/xl/worksheets/sheet'
sheet1 = sheet + '1.xml'
importSheet1 = open(sheet1, 'r')
whole_file = importSheet1.read()
data_Sheet = md.parseString(whole_file)

[... some code ....]

self.array_mem_name = []
y = 1
x = 5  # first useful row
day = int(day)
found = 0
while x <= len_array_shared:
    readrow = data_Sheet.getElementsByTagName('row')[x]
    c_data = readrow.getElementsByTagName('c')[0]
    c_attrib = c_data.getAttribute('t')
    if c_attrib == 's':
        vName = c_data.getElementsByTagName('v')[0].firstChild.nodeValue
        # if int(vName) != broken:
        mem_name = self.array_shared[int(vName)]
        if mem_name != '-----':
            if mem_name == old:
                c_data = readrow.getElementsByTagName('c')[day]
                c_attrib = c_data.getAttribute('t')
                if (c_attrib == 's'):
                    v_Attrib = c_data.getElementsByTagName('v')[0].firstChild.nodeValue
                    if v_Attrib != '':
                        # loc = self.array_shared[int(v_Attrib)]
                        index = self.array_shared.index('--')
                        c_data.getElementsByTagName('v')[0].firstChild.nodeValue = index
with open(sheet1, 'w') as f:
    f.write(whole_file)
As you can see, I use f.write(whole_file), but whole_file does not contain the changes made via index.
Checking in the debugger, I see that the new value has been added to the node, but I can't save sheet1 with the modified value.
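The root cause is that whole_file is the unmodified string read from disk; editing the parsed DOM never changes that string. A minimal sketch of the likely fix (my suggestion, assuming the data_Sheet object from the code above) is to serialize the modified DOM instead:

with open(sheet1, 'w') as f:
    f.write(data_Sheet.toxml())  # serialize the modified DOM, not the original string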
I switched to using openpyxl instead, as Lei Yang suggested in a comment. I found that this tool worked better for my jobs; with openpyxl, reading cell values is much easier than with xml.dom.minidom.
My only concern is that openpyxl seems much slower than the DOM at loading the workbook. Maybe the memory was overloaded. But I was more interested in something simple than in this minor performance issue.
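For anyone landing here, a minimal openpyxl sketch of the same read-update-save cycle (the file name and cell reference are placeholders):

from openpyxl import load_workbook

wb = load_workbook('workbook.xlsx')
ws = wb.active            # or wb['Sheet1']
ws['B5'] = 'new value'    # update a cell directly
wb.save('workbook.xlsx')  # changes only reach disk on save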

Extract GPS coordinates from .docx file with python

I have a hectic task to do for which I need some help from Python. Please see this word document.
I am to extract texts and GPS coordinates from each row. There are currently over 100 coordinates in 10 docx files. My "hefty" Python knowledge got me this far:
from docx import Document
import re

main_file = Document("D:/DOCUMENTS/Google_Link/1 Category I/1 Category I.docx")
table = main_file.tables[1]  # this is the same for every document
data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        keys = tuple(text)
        continue
    row_data = tuple(text)
    data.append(row_data)
regexReference = re.compile("(C.-)\w+")
colReference = [item[1] for item in data]
listReference = filter(regexReference.match, colReference)
for i in listReference:
    print i.encode('UTF-8')
I can print 16 reference ids from column 2. Please guide me to print something like this:
C1-20701-17-1
some site, some region
The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires
some repair/maintenance works including electrical wiring and electrical
lights and appliances like ceiling fans supplies. Detail specification of
the works are attached
x = 91°38'28.2"E
y = 22°40'34.3"N
These XY locations and descriptions will be used to create KML files afterwards and attached to each document. I'd prefer a variable for each part of the above section (ref id, location, description, x and y) so that I can automate that as well.
demo docx
I don't know if this works for files with different patterns (P.S. I'm using Python 2.7.11):
# -*- coding: utf-8 -*-
from docx import Document
import sys
import os
import re

reload(sys)
sys.setdefaultencoding('utf8')

for root, dirs, files in os.walk("."):
    for name in files:
        doc_file = os.path.join(root, name)
        if doc_file.endswith('docx'):
            main_file = Document(doc_file)
            table = main_file.tables[1]  # this is the same for every document
            data = []
            keys = None
            for i, row in enumerate(table.rows):
                text = (cell.text for cell in row.cells)
                if i == 0:
                    keys = tuple(text)
                    continue
                row_data = tuple(text)
                data.append(row_data)
            regexReference = re.compile("(C.-[0-9-]+)")
            regexCoordinate = re.compile(r'(N-(.{,12})([0-9]|\')|[0-9].{,12}N)[;, ]+(E-(.{,12})([0-9]|\')|[0-9].{,12}E)')
            result = []
            for item in data:
                tmp = dict()
                matchReference = regexReference.search(item[1])
                matchCoordinate = regexCoordinate.search(unicode(item[2]))
                if matchReference:
                    tmp['reference'] = matchReference.group()
                if matchCoordinate:
                    tmp['x'] = matchCoordinate.group(1)
                    tmp['y'] = matchCoordinate.group(4)
                tmp['description'] = unicode(item[2])
                tmp['location'] = unicode(item[3])
                result.append(tmp)
            for rs in result:
                if 'reference' in rs:
                    for k, v in rs.iteritems():
                        print('{} = {}'.format(k, v))
                    print

# Output:
# --------------------------------
# y = 91°38'28.2"E
# x = 22°40'34.3"N
# description = The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached.
# reference = C1-20701-17-1
# location = xxxxx Site, c Region
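Since the question mentions building KML files from these coordinates afterwards, here is a hedged follow-up sketch (my addition, not part of the original answer) converting the extracted DMS strings to decimal degrees; the helper name dms_to_decimal is made up:

# -*- coding: utf-8 -*-
import re

# e.g. u'22°40\'34.3"N' -> 22.676194...
DMS_RE = re.compile(u'(\\d+)°(\\d+)\'([\\d.]+)"([NSEW])')

def dms_to_decimal(dms):
    deg, minutes, seconds, hemi = DMS_RE.match(dms).groups()
    value = int(deg) + int(minutes) / 60.0 + float(seconds) / 3600.0
    return -value if hemi in 'SW' else value

print dms_to_decimal(u'22°40\'34.3"N')  # -> 22.6762 (approx)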

Python Pandas - Read csv file containing multiple tables

I have a single .csv file containing multiple tables.
Using Pandas, what would be the best strategy to get two DataFrames, inventory and HPBladeSystemRack, from this one file?
The input .csv looks like this:
Inventory
System Name         IP Address    System Status
dg-enc05                          Normal
dg-enc05_vc_domain                Unknown
dg-enc05-oa1        172.20.0.213  Normal

HP BladeSystem Rack
System Name    Rack Name   Enclosure Name
dg-enc05       BU40
dg-enc05-oa1   BU40        dg-enc05
dg-enc05-oa2   BU40        dg-enc05
The best I've come up with so far is to convert this .csv file into an Excel workbook (xlsx), split the tables into sheets, and use:
inventory = read_excel('path_to_file.csv', 'sheet1', skiprows=1)
HPBladeSystemRack = read_excel('path_to_file.csv', 'sheet2', skiprows=2)
However:
This approach requires the xlrd module.
Those log files have to be analyzed in real time, so it would be far better to find a way to analyze them as they come in from the logs.
The real logs have far more tables than those two.
If you know the table names beforehand, then something like this:
import pandas as pd

df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}
should work to produce a dictionary with keys as the table names and values as the subtables.
>>> list(tables)
['HP BladeSystem Rack', 'Inventory']
>>> for k, v in tables.items():
...     print("table:", k)
...     print(v)
...     print()
...
table: HP BladeSystem Rack
              0          1               2
6   System Name  Rack Name  Enclosure Name
7      dg-enc05       BU40             NaN
8  dg-enc05-oa1       BU40        dg-enc05
9  dg-enc05-oa2       BU40        dg-enc05

table: Inventory
                    0             1              2
1         System Name    IP Address  System Status
2            dg-enc05           NaN         Normal
3  dg-enc05_vc_domain           NaN        Unknown
4        dg-enc05-oa1  172.20.0.213         Normal
Once you've got that, you can set the column names from the first rows, etc., as sketched below.
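For instance, a minimal sketch of that cleanup step (my addition), reusing the tables dict built above:

for name, table in tables.items():
    table = table.reset_index(drop=True)
    table.columns = table.iloc[0]  # promote the first row to the header
    tables[name] = table.drop(0)   # and drop it from the data rows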
I assume you know the names of the tables you want to parse out of the csv file. If so, you could retrieve the index positions of each and select the relevant slices accordingly. As a sketch, this could look like:

df = pd.read_csv('path_to_file')
index_positions = []
for table in table_names:
    index_positions.append(df[df['col_with_table_names'] == table].index.tolist()[0])

# Include end of table for last slice, omit for iteration below
index_positions.append(df.index.tolist()[-1])

tables = {}
for position in index_positions[:-1]:
    table_no = index_positions.index(position)
    tables[table_names[table_no]] = df.loc[position:index_positions[table_no + 1]]
There are certainly more elegant solutions but this should give you a dictionary with the table names as keys and the corresponding tables as values.
Pandas doesn't seem to be ready to do this easily, so I ended up writing my own split_csv function. It only requires table names and will output .csv files named after each table.
import csv
from os.path import dirname  # gets parent folder in a path
from os.path import join  # concatenate paths

table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]

def split_csv(csv_path, table_names):
    tables_infos = detect_tables_from_csv(csv_path, table_names)
    for table_info in tables_infos:
        split_csv_by_indexes(csv_path, table_info)

def split_csv_by_indexes(csv_path, table_info):
    title, start_index, end_index = table_info
    print title, start_index, end_index
    dir_ = dirname(csv_path)
    output_path = join(dir_, title) + ".csv"
    with open(output_path, 'w') as output_file, open(csv_path, 'rb') as input_file:
        writer = csv.writer(output_file)
        reader = csv.reader(input_file)
        for i, line in enumerate(reader):
            if i < start_index:
                continue
            if i > end_index:
                break
            writer.writerow(line)

def detect_tables_from_csv(csv_path, table_names):
    output = []
    with open(csv_path, 'rb') as csv_file:
        reader = csv.reader(csv_file)
        for idx, row in enumerate(reader):
            for col in row:
                match = [title for title in table_names if title in col]
                if match:
                    match = match[0]  # get the first matching element
                    try:
                        end_index = idx - 1
                        start_index
                    except NameError:
                        start_index = 0
                    else:
                        output.append((previous_match, start_index, end_index))
                    print "Found new table", col
                    start_index = idx
                    previous_match = match
                    match = False
        end_index = idx  # last 'end_index' set to EOF
        output.append((previous_match, start_index, end_index))
    return output

if __name__ == '__main__':
    csv_path = 'switch_records.csv'
    try:
        split_csv(csv_path, table_names)
    except IOError as e:
        print "This file doesn't exist. Aborting."
        print e
        exit(1)

Append Function Nested Inside IF Statement Body Not Working

I am fairly new to Python (just started learning in the last two weeks) and am trying to write a script to parse a csv file to extract some of the fields into a List:
from string import Template
import csv
import string

site1 = 'D1'
site2 = 'D2'
site3 = 'D5'
site4 = 'K0'
site5 = 'K1'
site6 = 'K2'
site7 = '0'
site8 = '0'
site9 = '0'
lbl = 1
portField = 'y'
sw = 5
swpt = 6
cd = 0
pt = 0
natList = []

with open(name=r'C:\Users\dtruman\Documents\PROJECTS\SCRIPTING - NATAERO DEPLOYER\NATAERO DEPLOYER V1\nataero_deploy.csv') as rcvr:
    for line in rcvr:
        fields = line.split(',')
        Site = fields[0]
        siteList = [site1, site2, site3, site4, site5, site6, site7, site8, site9]
        while Site in siteList == True:
            Label = fields[lbl]
            Switch = fields[sw]
            if portField == 'y':
                Switchport = fields[swpt]
                natList.append([Switch, Switchport, Label])
            else:
                Card = fields[cd]
                Port = fields[pt]
                natList.append([Switch, Card, Port, Label])
print natList
Even if I strip the else statement away and break into my code right after the if clause, I can verify that Switchport (the first statement in the if clause) is successfully being populated with a str from my csv file, as well as Switch and Label. However, natList is not being appended with the fields parsed from each line of my csv for some reason. Python returns no errors; it just does not append natList at all.
This is actually going to be a function (once I get the code itself to work), but for now I am simply setting the function parameters as global variables so I can run it in an IPython console without having to call the function.
The lbl, sw, swpt, cd, and pt refer to column numbers in my csv (the finished function will allow the user to enter values for these variables).
I assume I am running into some issue with natList scope, but I have tried moving the natList = [] statement to various places in my code to no avail.
I can run the above in a console, and then run natList.append([Switch, Switchport, Label]) separately, and it works for some reason....?
Thanks for any assistance!
It seems the while condition needs an extra pair of parentheses. Either add them like this, while (Site in siteList) == True:, or use the much cleaner form suggested by Padraic, while Site in siteList:.
The original line is evaluated as a chained comparison: Site in siteList == True means (Site in siteList) and (siteList == True), and the list never compares equal to True, so the loop body never runs.
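A quick interpreter session makes the chaining visible:

>>> 'D1' in ['D1', 'D2'] == True    # chained: ('D1' in [...]) and ([...] == True)
False
>>> ('D1' in ['D1', 'D2']) == True
True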
Change
while Site in siteList == True:
to
if Site in siteList:
You might want to look into the csv module as this module attempts to make reading and writing csv files simpler, e.g.:
import csv

with open('<file>') as fp:
    ...
    reader = csv.reader(fp)
    if portfield == 'y':
        natlist = [[row[i] for i in [sw, swpt, lbl]] for row in reader if row[0] in sitelist]
    else:
        natlist = [[row[i] for i in [sw, cd, pt, lbl]] for row in reader if row[0] in sitelist]
    print natlist
Or alternatively using a csv.DictReader which takes the first row as the fieldnames and then returns dictionaries:
import csv

with open('<file>') as fp:
    ...
    reader = csv.DictReader(fp)
    if portfield == 'y':
        fields = ['Switch', 'card/port', 'Label']
    else:
        fields = ['Switch', '??', '??', 'Label']
    natlist = [[row[f] for f in fields] for row in reader if row['Building/Site'] in sitelist]
    print natlist

Generate output file name in accordance to input file names

Below is a script to read velocity values from molecular dynamics trajectory data. I have many trajectory files with the name pattern as below:
waters1445-MD001-run0100.traj
waters1445-MD001-run0200.traj
waters1445-MD001-run0300.traj
waters1445-MD001-run0400.traj
waters1445-MD001-run0500.traj
waters1445-MD001-run0600.traj
waters1445-MD001-run0700.traj
waters1445-MD001-run0800.traj
waters1445-MD001-run0900.traj
waters1445-MD001-run1000.traj
waters1445-MD002-run0100.traj
waters1445-MD002-run0200.traj
waters1445-MD002-run0300.traj
waters1445-MD002-run0400.traj
waters1445-MD002-run0500.traj
waters1445-MD002-run0600.traj
waters1445-MD002-run0700.traj
waters1445-MD002-run0800.traj
waters1445-MD002-run0900.traj
waters1445-MD002-run1000.traj
Each file has 200 frames of data to analyse. I planned it so that this code reads each traj file (shown above) one after another, extracts the velocity values, and writes them into a specific file (text_file = open("Output.traj.dat", "a")) corresponding to the respective input trajectory file.
So I defined a function called loops(mmm), where mmm is the trajectory file name passed to the function loops.
#!/usr/bin/env python
'''
always put #!/usr/bin/env python at the shebang
'''
# from __future__ import print_function
from Scientific.IO.NetCDF import NetCDFFile as Dataset
import itertools as itx
import sys

#####################

def loops(mmm):
    inputfile = mmm
    for FRAMES in range(0, 200):
        frame = FRAMES
        text_file = open("Output.mmm.dat", "a")

        def grouper(n, iterable, fillvalue=None):
            args = [iter(iterable)] * n
            return itx.izip_longest(fillvalue=fillvalue, *args)

        formatxyz = "%12.7f%12.7f%12.7f%12.7f%12.7f%12.7f"
        formatxyz_size = 6
        formatxyzshort = "%12.7f%12.7f%12.7f"
        formatxyzshort_size = 3
        # ncfile = Dataset(inputfile, 'r')
        ncfile = Dataset(ppp, 'r')
        variableNames = ncfile.variables.keys()
        # print variableNames
        shape = ncfile.variables['coordinates'].shape
        '''
        do the header
        '''
        print 'title ' + str(frame)
        text_file.write('title ' + str(frame) + '\n')
        print "%5i%15.7e" % (shape[1], ncfile.variables['time'][frame])
        text_file.write("%5i%15.7e" % (shape[1], ncfile.variables['time'][frame]) + '\n')
        '''
        do the velocities
        '''
        try:
            xyz = ncfile.variables['velocities'][frame]
            temp = grouper(2, xyz, "")
            for i in temp:
                z = tuple(itx.chain(*i))
                if (len(z) == formatxyz_size):
                    print formatxyz % z
                    text_file.write(formatxyz % z + '\n')
                elif (len(z) == formatxyzshort_size):
                    print formatxyzshort % z
                    text_file.write(formatxyzshort % z + '\n')
        except KeyError:
            xyz = [0] * shape[2]
            xyz = [xyz] * shape[1]
            temp = grouper(2, xyz, "")
            for i in temp:
                z = tuple(itx.chain(*i))
                if (len(z) == formatxyz_size):
                    print formatxyz % z
                elif (len(z) == formatxyzshort_size):
                    print formatxyzshort % z
        x = ncfile.variables['cell_angles'][frame]
        y = ncfile.variables['cell_lengths'][frame]
        # text_file.close()

# program starts - generation of file name
for md in range(1, 3):
    if md < 10:
        for pico in range(100, 1100, 100):
            if pico >= 1000:
                kkk = "waters1445-MD00{0}-run{1}.traj".format(md, pico)
                loops(kkk)
            elif pico < 1000:
                kkk = "waters1445-MD00{0}-run0{1}.traj".format(md, pico)
                loops(kkk)
                # print kkk
At the # program starts - generation of file name line, the code is supposed to generate the file names, call the function accordingly, extract the velocities, and dump the values via text_file = open("Output.mmm.dat", "a").
When I execute this code, the program runs, but unfortunately it does not produce output files named according to the input trajectory file names.
I want the output file names to be:
velo-waters1445-MD001-run0100.dat
velo-waters1445-MD001-run0200.dat
velo-waters1445-MD001-run0300.dat
velo-waters1445-MD001-run0400.dat
velo-waters1445-MD001-run0500.dat
.
.
.
I could not trace where I need to make changes.
Your code's indentation is broken: the first assignment to formatxyz and the following code are aligned neither with the def grouper nor with the for FRAMES.
The main problem may be (as Johannes already commented) the point at which you open the file(s) for writing versus when you actually write data into them.
Check:

for FRAMES in range(0, 200):
    frame = FRAMES
    text_file = open("Output.mmm.dat", "a")
The output file is named (hardcoded) Output.mmm.dat. Change that to "Output.{0}.dat".format(mmm). But then the variable mmm never changes inside the loop; this may be OK if all frames are supposed to be written to the same file. A sketch of the full naming fix follows below.
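If each trajectory should get its own output file, here is a minimal hedged sketch of the naming fix, assuming the velo-<input>.dat pattern requested in the question:

import os

def loops(traj_name):
    base = os.path.splitext(traj_name)[0]   # e.g. waters1445-MD001-run0100
    text_file = open('velo-{0}.dat'.format(base), 'a')
    for frame in range(0, 200):
        pass  # extract and write the velocities for this frame, as before
    text_file.close()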
Generally, please work on the names you choose for variables and functions. loops is very generic, and so are kkk and mmm. Be more specific; it helps debugging. If you don't know what's happening and where your program goes wrong, insert print("dbg> do (a)") statements with some descriptive text, and/or use the Python debugger to step through your program. Interactive debugging in particular is essential when learning a new language and new concepts, imho.
