Python: replace only one occurrence in a string

I have some sample data which looks like:
ATOM 973 CG ARG A 61 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 61 -21.610 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 61 -21.047 7.452 67.937 1.00 12.13 N
I want to replace the 6th column, and only the 6th column, by adding an offset value; in the case above the offset is 308.
So 61 + 308 = 369, and the 61 in the 6th column should be replaced by 369.
I can't str.split() the line as the line spacing is very important.
I have tried using str.replace(), but the values in column 2 can overlap with the value in column 6.
I also tried reversing the line and using str.replace(), but the values in columns 7, 8, 9, 10 and 11 can overlap with the string to be replaced.
The ugly code I have so far (which partially works, except when the values overlap in columns 7, 8, 9, 10 and/or 11) is:
with open('2kqx.pdb', 'r') as inf, open('2kqx_renumbered.pdb', 'w') as outf:
    for line in inf:
        if line.startswith('ATOM'):
            segs = line.split()
            if segs[4] == 'A':
                offset = 308
                number = segs[5][::-1]
                replacement = str((int(segs[5]) + offset))[::-1]
                print number[::-1], replacement
                line_rev = line[::-1]
                replaced_line = line_rev.replace(number, replacement, 1)
                print line
                print replaced_line[::-1]
                outf.write(replaced_line[::-1])
The code above produced the output below. As you can see, in the second line the 6th column is not changed, but column 7 is. I thought that by reversing the string I could bypass the potential overlap with column 2, but I forgot about the other columns and I don't really know how to get around it.
ATOM 973 CG ARG A 369 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 61 -21.3690 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 369 -21.047 7.452 67.937 1.00 12.13 N

data = """\
ATOM 973 CG ARG A 61 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 61 -21.610 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 61 -21.047 7.452 67.937 1.00 12.13 N"""
offset = 308
for line in data.split('\n'):
    line = line[:22] + " {:<5d} ".format(int(line[22:31]) + offset) + line[31:]
    print line
I haven't done the exact counting of whitespace; that's just a rough estimate.
If you want more flexibility than just having the numbers 22 and 31 scattered in your code, you'll need a way to determine your start and end index (but that contradicts my assumption that the data is in a fixed-column format).
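If you do want to compute those indices, here is a minimal sketch (the shift_field helper is hypothetical, not part of the answer above; it assumes the number to shift is the 6th whitespace-separated field and that the new number fits into the field plus its leading spaces):
import re

def shift_field(line, index, offset):
    # Find the index-th run of "optional spaces + non-spaces"; the match
    # span gives the start and end indices of that field in the line.
    m = list(re.finditer(r' *\S+', line))[index]
    # Right-justify the new number over the old field plus its padding,
    # so the columns after it keep their positions.
    new = str(int(m.group()) + offset).rjust(len(m.group()))
    return line[:m.start()] + new + line[m.end():]
Calling shift_field(line, 5, 308) on your sample lines would rewrite only the residue column, leaving columns 2 and 7 through 11 untouched even when their digits overlap.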

You'd better not try to parse PDB files on your own.
Use a PDB parser. There are many freely available inside different bio/computational chemistry packages, for instance
biopython
Here's how to do it with biopython, assuming your input is raw.pdb:
from Bio.PDB import PDBParser, PDBIO

parser = PDBParser()
structure = parser.get_structure('some_id', 'raw.pdb')
for r in structure.get_residues():
    r.id = (r.id[0], r.id[1] + 308, r.id[2])

io = PDBIO()
io.set_structure(structure)
io.save('shifted.pdb')
I googled a bit and found a quick solution to your specific problem here (without third-party dependencies):
http://code.google.com/p/pdb-tools/
Among many other useful pdb Python script tools, it contains the script pdb_offset.py.
It is a standalone script; I just copied its pdb_offset method to show it working. Your three-line example is in raw.pdb:
def pdbOffset(pdb_file, offset):
    """
    Adds an offset to the residue column of a pdb file without touching
    anything else.
    """
    # Read in the pdb file
    f = open(pdb_file, 'r')
    pdb = f.readlines()
    f.close()

    out = []
    for line in pdb:
        # For an ATOM or TER record, update the residue number
        # (the record name is padded to six characters)
        if line[0:6] == "ATOM  " or line[0:6] == "TER   ":
            num = offset + int(line[22:26])
            out.append("%s%4i%s" % (line[0:22], num, line[26:]))
        else:
            out.append(line)

    return "".join(out)

print pdbOffset('raw.pdb', 308)
which prints
ATOM 973 CG ARG A 369 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 369 -21.610 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 369 -21.047 7.452 67.937 1.00 12.13 N

Related

Writing to a file in python

I have been receiving indexing errors in Python. I got my code to work correctly when reading in a file and simply printing the desired output, but now I am trying to write the output to a file, and I seem to be having a problem with indexing when I write it. I've tried a couple of different things; I left one attempt commented out. Either way I keep getting an indexing error.
EDIT: The original error may have been caused by a problem in Eclipse. When running on the server, I'm having a new issue:
I can now get it to run and produce output to a .txt file, but it only prints a single line of output.
with open("blast.txt") as blast_output:
for line in blast_output:
subFields = [item.split('|') for item in line.split()]
#transId = str(subFields[0][0])
#iso = str(subFields[0][1])
#sp = str(subFields[1][3])
#identity = str(subFields[2][0])
out = open("parsed_blast.txt", "w")
#out.write(transId + "\t" + iso + "\t" + sp + "\t" + identity)
out.write((str(subFields[0][0]) + "\t" + str(subFields[0][1]) + "\t" + str(subFields[1][3]) + "\t" + str(subFields[2][0])))
out.close()
IndexError: list index out of range
Input file looks like:
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
c1005_g1_i1|m.805 gi|9910811|sp|O42832.2|SPB1_SCHPO 100.00 802 0 0 1 802 1 802 0.0 1644
c1006_g1_i1|m.806 gi|74627042|sp|O94631.1|MRM1_SCHPO 100.00 255 0 0 1 255 47 301 0.0 525
Expected output
c0_g1_i1 m.1 Q9HGP0.1 100.00
c1002_g1_i1 m.801 Q10302.1 100.00
c1003_g1_i1 m.803 Q6BDR8.1 100.00
c1004_g1_i1 m.804 O94325.1 100.00
c1005_g1_i1 m.805 O42832.2 100.00
c1006_g1_i1 m.806 O94631.1 100.00
Instead, my output is only one of the lines rather than all of them.
You are overwriting the same file again and again. Open the file outside the for loop, or open it in append mode 'a'.
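For example, a minimal rework of the code above along those lines (a sketch; it assumes every non-empty line has the fields shown in your sample input):
with open("blast.txt") as blast_output, open("parsed_blast.txt", "w") as out:
    for line in blast_output:
        subFields = [item.split('|') for item in line.split()]
        if not subFields:
            continue  # skip blank lines, which would otherwise raise IndexError
        out.write(subFields[0][0] + "\t" + subFields[0][1] + "\t" +
                  subFields[1][3] + "\t" + subFields[2][0] + "\n")
The output file is opened exactly once, so every parsed line ends up in it.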
I suggest you read the whole file into a string first:
with open("blast.txt", 'r') as fileIn:
data = fileIn.read()
then process the data.
data = func(data)
Then write to file out.
with open('bast_out.txt','w') as fileOut:
fileOut.write()
As @H Doucet said, read the whole thing into a string, then work with it. Leave the open() call outside the loop so the output file is only opened and closed once, and make sure to open it in append mode. I've also cleaned up your out.write() call: there's no need to cast those list items to str (they already are strings), and I've added a newline ("\n") to the end of each line.
with open("blast.txt") as f:
blast_output = f.read()
out = open("parsed_blast.txt", "a")
for line in blast_output.split("\n"):
subFields = [item.split('|') for item in line.split()]
out.write("{}\t{}\t{}\t{}\n".format(subFields[0][0], subFields[0][1],
subFields[1][3], subFields[2][0]))
out.close()

Counting no. of characters between headers in python

I have the following dataset. My code below identifies each line containing 'Query_', searches it for an '*', and prints the letters under it until the next 'Query_' line:
Query_10 206 IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK 385
010718494 193 LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT 255
001291831 173 LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT 235
012359817 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
009246541 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
Query_13 31 MEEQKEKGLSNPEVV*KYRQCSEIVNQVLSTVVSSCVPGADVASICTNGDFLIEDGLRNI 210
002947167 7 IQGEQEPNLSVPEVVTKYKAAADICNRALQAVIDGCKDGSKIVDLCRTGDNFITKECGNI 66
004993505 1 MELDRQSKVVDADALSKYRAAAAIANDCVQQLVANCIAGADVYTLAVEADTYIEQKLKEL 60
006961234 1 MSETKEYSLNNPDTLTKYKTAAQISEKVLAAVSDLCVPGAKIVDICQQGDKLIEEELAKV 62
008089018 1 MSEETDYTLNNPDTLTKYKTAAQISEKVLAAVAELVVPGEKIVTICEKGDKLIEEELAKV 60
Query_13 211 EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI 390
I am looking to print only if there are 50 or more letters under an '*' between the 'Query_' lines. Any help would be great!
lines = [line.rstrip() for line in open('infile.txt')]

for line in lines:
    data = line.split()
    sequence = data[2]
    if data[0].startswith("Query_"):
        star_indicies = [i for i, c in enumerate(sequence) if c == '*']
    else:
        print(list(sequence[star_index] for star_index in star_indicies))
Break it down into steps
First find all the lines with headers, and mark whether they contain asterisks:
headers = [[i, "*" in l.split()[2]] for i, l in enumerate(lines)
           if l.startswith("Query_")]
So now you have a list of lists, each containing two values:
Index into lines of the header
Whether that header contains an asterisk
Now you can iterate over it
for i, header in enumerate(headers[:-1]):  # All but last
    if not header[1]:
        continue  # No asterisk
    this_header = header[0]
    next_header = headers[i+1][0]
    if (next_header - this_header - 1) < 50:
        continue  # Not enough rows
    ...
The ... above is where you put the code to figure out which columns of lines[this_header] contain asterisks, and then extract those columns from lines[this_header+1] through lines[next_header-1].
I've left that bit for you, as your question is underspecified:
Does the file end with a "Query_" header line?
If not, how do you deal with the case where the final header line has asterisks and is followed by 100 more lines?
What do you mean by "print the letters under it"?
But this should get you started
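As a rough sketch of the elided step (assuming, as your own code does, that the asterisk positions are indices into the sequence field rather than raw character columns):
stars = [i for i, c in enumerate(lines[this_header].split()[2]) if c == '*']
for row in lines[this_header + 1:next_header]:
    seq = row.split()[2]
    # the letters "under" each asterisk in this row
    print("".join(seq[i] for i in stars if i < len(seq)))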

Text edition: from Python to Matlab

I have a .txt file looking like:
rawdata/d-0197.bmp 1 329 210 50 51
rawdata/c-0044.bmp 1 215 287 59 48
rawdata/e-0114.bmp 1 298 244 46 45
rawdata/102.bmp 1 243 126 163 143
I need to transform it in the following way:
-Before "rawdata", add the whole path, which is "/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/".
-Add a comma after ".bmp"
-Remove the first number (so the 1).
-Put the other four numbers into square brackets [].
It would look like:
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/d-0197.bmp, [329 210 50 51]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/c-0044.bmp, [215 287 59 48]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/e-0114.bmp, [298 244 46 45]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/102.bmp, [243 126 163 143]
I have done it, first by replacing "rawdata/" with nothing in a simple text editor, and then with Python:
file = open('data.txt')
fout = open('data2.txt', 'w')
for line in file:
    line = line.rstrip()
    pieces = line.split('.bmp')
    pieces2 = pieces[1].split()
    fout.write('/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/' + pieces[0] + '.bmp, ' + '[' + pieces2[1] + ' ' + pieces2[2] + ' ' + pieces2[3] + ' ' + pieces2[4] + ']' + '\n')
fout.close()
But this file is going to be used in Matlab, so it would be much better to have an automatic process. How can I do the same in Matlab?
Thank you
Here you go:
infid = fopen('data.txt', 'r');
outfid = fopen('data2.txt', 'w');
dirStr = '/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/';
while ~feof(infid)
    inline = fgetl(infid);
    outline = [dirStr, regexprep(inline, ' 1 (\d* \d* \d* \d*)', ', [$1]')];
    fprintf(outfid, '%s\n', outline);
end
fclose(infid);
fclose(outfid);
What we've done there is to read each line from the input file, use a regular expression to make the changes to the line, and write it out to the output file. There are probably better ways of applying the regular expression, but this was pretty quick.
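For reference, the same transformation in Python using re.sub, which mirrors the regexprep call above (a sketch, not tested beyond your four sample lines):
import re

dirStr = '/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/'
with open('data.txt') as fin, open('data2.txt', 'w') as fout:
    for line in fin:
        # drop the " 1 " and wrap the remaining four numbers in brackets
        fout.write(dirStr + re.sub(r' 1 (\d+ \d+ \d+ \d+)', r', [\1]', line.rstrip()) + '\n')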

python print particular lines from file

The background:
Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 2208 40 0.755 0.00803 0.739 0.771
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771
Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.
What I want seems simple. I want to turn the above file into an output that looks like this:
Gene1 0.755
Gene2 0.744
i.e. each gene, and the last number in the survival column from each section.
I have tried multiple ways: using regular expressions, reading the file in as a list and calling .next(). One example of code that I have tried:
import sys

fileopen = open(sys.argv[1]).readlines()  # Read in the file as a list.
for index, line in enumerate(fileopen):   # Enumerate items in the list
    if "Table" in line:                   # Find the items with "Table" (this will have my gene name)
        line2 = line.split("=")[1]        # Parse the line to get my gene name
        if "\n" in fileopen[index+1]:     # This is the problem section.
            print fileopen[index]
        else:
            fileopen[index+1]
As you can see in the problem section, what I was trying to say in this attempt is: if the next item in the list is a newline, print the current item; otherwise treat the next line as the current line (so I can then split the line to pull out the particular number I want).
If anyone could correct the code so I can see what I did wrong, I'd appreciate it.
A bit of overkill, but instead of manually writing a parser for each data item, use an existing package like pandas to read in the file. You just need to write a bit of code to specify the relevant lines. Un-optimized code (it reads the file twice):
import pandas as pd

def genetable(gene):
    l = open('gene.txt').readlines()
    l += "\n"  # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene" + str(gene) in line:
            skiprows = i + 1
        if skiprows >= 0 and line == "\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python',
                             skiprows=skiprows, skipfooter=skipfooter)
            # assuming tab-separated data given your inputs; change as needed
            # assert df.columns.....
            return df
    return "Not Found"
This will read in a DataFrame with all the relevant data in that file. You can then do:
genetable(2).survival           # series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival
The advantage of this is that you have access to all the items, and any malformed rows in the file are more likely to be picked up, preventing incorrect values from being used. If it were my own code, I would add assertions on the column names before returning the pandas DataFrame; you want to pick up any parsing errors early so they do not propagate.
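For instance, such a guard might be a single line before the return (hypothetical, using the survival column named in the question):
assert 'survival' in df.columns, "unexpected columns: %s" % list(df.columns)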
This worked when I tried it:
# filelines: the list of lines read from the input file, as in the question
gene = 1
for i in range(len(filelines)):
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1
You could try something like this (I copied your data into foo.dat):
In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:
Using with makes sure the file is closed after reading.
In [3]: lines = [ln.strip() for ln in lines]
This gets rid of extra whitespace.
In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
In [6]: startgenes
Out[6]: [0, 10]
In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]
In [8]: emptylines
Out[8]: [9, 17]
Using emptylines relies on the fact that the records are separated by lines containing only whitespace.
In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]
In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]
   ....:     print gene, num
   ....:
Gene1 0.755
Gene2 0.744
Here is my solution:
>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744
Instead of checking for a new line, simply print when you are done reading the file:
lines = open("testgenes.txt").readlines()

table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "":  # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:
            finalsurvival = line.split('\t')[4]
        except IndexError:
            continue
print table, finalsurvival

Correct use of split()

I'm trying to split lines of text and store key information in a dictionary.
For example I have lines that look like:
Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
For the first line, my key will be "Lasal_00010", and the value I'm storing is "H293".
My current code works fine for this case, but when I encounter a line like:
Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
my code will not store the string "SSCG".
Here is my current code:
dataHash = {}
with open(fasta, 'r') as f:
    for ln in f:
        query = ln.split('\t')[0]
        query.strip()
        tempValue = ln.split('\t')[1]
        value = tempValue.split('|')[0]
        value.strip()
        if not dataHash.has_key(query):
            dataHash[query] = ''
        else:
            dataHash[query] = value

for x in dataHash:
    print x + " " + str(dataHash[x])
I believe I am splitting the line incorrectly in the case with two vertical bars. But I'm confused as to where my problem is. Shouldn't "SSCG" be the value I get when I write value = tempValue.split('|')[0]? Can someone explain to me how split works or what I'm missing?
Split on the first pipe, then on the space:
with open(fasta, 'r') as f:
    for ln in f:
        query, value = ln.partition('|')[0].split()
I used str.partition() here as you only need to split once.
Your code makes assumptions on where tabs are being used; by splitting on the first pipe first we get to ignore the rest of the line altogether, making it a lot simpler to split the first from the second column.
Demo:
>>> lines = '''\
... Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
... Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
... Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
... '''
>>> for ln in lines.splitlines(True):
...     query, value = ln.partition('|')[0].split()
...     print query, value
...
Lasal_00010 H293
Lasal_00010 SSEG
Lasal_00030 SSCG
However, your code works too, up to a point, albeit less efficiently. Your real problem is with:
if not dataHash.has_key(query):
    dataHash[query] = ''
else:
    dataHash[query] = value
This really means: the first time I see query, store an empty string; otherwise store value. I am not sure why you do this. If there are no other lines starting with Lasal_00030, all you have is an empty value in the dictionary. If that wasn't the intention, just store the value:
dataHash[query] = value
No if statement.
Note that dict.has_key() has been deprecated; it is better to use in to test for a key:
if query not in dataHash:
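Putting it together, a minimal version of the whole loop might look like this (a sketch; it keeps the last value seen for each key, which is what your else branch stores):
dataHash = {}
with open(fasta, 'r') as f:
    for ln in f:
        query, value = ln.partition('|')[0].split()
        dataHash[query] = value  # last value seen for this key wins

for x in dataHash:
    print x, dataHash[x]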
