I have a .txt file looking like:
rawdata/d-0197.bmp 1 329 210 50 51
rawdata/c-0044.bmp 1 215 287 59 48
rawdata/e-0114.bmp 1 298 244 46 45
rawdata/102.bmp 1 243 126 163 143
I need to transform it in the following way:
- Before "rawdata", add the whole path, which is "/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/".
- Add a comma after ".bmp".
- Remove the first number (the 1).
- Put the other four numbers into square brackets [].
It would look like:
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/d-0197.bmp, [329 210 50 51]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/c-0044.bmp, [215 287 59 48]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/e-0114.bmp, [298 244 46 45]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/102.bmp, [243 126 163 143]
I have done it, first by replacing "rawdata/" with nothing in a simple text editor, and then with python:
file=open('data.txt')
fout=open('data2.txt','w')
for line in file:
    line=line.rstrip()
    pieces=line.split('.bmp')
    pieces2=pieces[1].split()
    fout.write('/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/'+pieces[0]+'.bmp, '+'['+pieces2[1]+' '+pieces2[2]+' '+pieces2[3]+' '+pieces2[4]+']'+'\n')
fout.close()
But this file is going to be used in Matlab, so it would be much better to have an automatic process. How can I do the same in Matlab?
Thank you
Here you go:
infid = fopen('data.txt', 'r');
outfid = fopen('data2.txt', 'w');
dirStr = '/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/';
while ~feof(infid)
    inline = fgetl(infid);
    outline = [dirStr, regexprep(inline,' 1 (\d* \d* \d* \d*)',', [$1]')];
    fprintf(outfid, '%s\n', outline);
end
fclose(infid);
fclose(outfid);
What we've done there is read the input file line by line, use a regular expression to rewrite each line (drop the leading " 1 " and wrap the four remaining numbers in ", [...]"), prepend the directory, and write the result to the output file. There are probably better ways of applying the regular expression, but that was pretty quick.
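For comparison with the Python version in the question, here is a minimal sketch of the same substitution using Python's `re.sub`. The names `PREFIX`, `PATTERN` and `convert_line` are made up for illustration, and the pattern assumes every line ends with " 1 " followed by exactly four integers, as in the sample data.

```python
import re

# Directory prefix taken from the question.
PREFIX = "/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/"

# " 1 " followed by four integers at the end of the line (assumed layout).
PATTERN = re.compile(r" 1 (\d+ \d+ \d+ \d+)$")


def convert_line(line, prefix=PREFIX):
    """Prepend the prefix and turn ' 1 a b c d' into ', [a b c d]'."""
    return prefix + PATTERN.sub(r", [\1]", line.rstrip("\n"))


if __name__ == "__main__":
    print(convert_line("rawdata/d-0197.bmp 1 329 210 50 51"))
```

Lines that do not match the pattern are passed through with only the prefix added, which mirrors what `regexprep` does when nothing matches.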
I have a file that contains student IDs and their scores. I'm trying to create a file named usoscorelist.txt and copy the contents of scorelist.txt into it, changing the score of student 151133 from 40 to 100. I think the space between the ID and the score is causing the problem. I'm not getting any errors, but I'm not seeing any changes in the file either.
with open('scorelist.txt','r') as firstfile, open('usoscorelist.txt','r+') as secondfile:
    for line in firstfile:
        secondfile.write(line)
    for line in secondfile:
        print(line.replace(151133 + " " + 40, 151133 + " " + 100))
    secondfile.close()
The inside of scorelist.txt is:
121787 74
121367 71
121817 88
121619 85
131445 80
131244 96
131872 98
131963 75
131172 78
131965 72
131112 90
131956 87
141105 61
141703 61
141407 78
141569 82
141585 89
141455 82
141370 80
141837 67
141857 86
141497 94
141853 67
141245 80
151452 83
151238 62
151827 58
151409 40
151789 95
151742 71
151133 40
151095 49
151186 75
151586 51
151926 73
151975 96
151079 49
151091 100
151588 49
151630 61
Edit the line before writing it. If you replace the text after writing the line, you are only changing the string in your program, not the file.
For the space between the ID and the score, you can either keep the ID as a string or build the text with an f-string (format string).
Tip: you don't need to close the file when using with; it handles that for you.
with open('scorelist.txt','r') as firstfile, open('usoscorelist.txt','w+') as secondfile:
    student_id = 151133
    original_score = 40
    new_score = 100
    original_str = f"{student_id} {original_score}"
    new_str = f"{student_id} {new_score}"
    for line in firstfile:
        secondfile.write(line.replace(original_str, new_str))

#checking
with open('usoscorelist.txt', 'r') as secondfile:
    for line in secondfile:
        print(line,end='')
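One caveat with a plain `str.replace` is that it matches substrings, so "151133 40" would also match the start of a line like "151133 400". As a rough sketch of an alternative, assuming each line is exactly "ID score", you can compare the ID field instead. The function name `update_score` is just for illustration.

```python
def update_score(lines, student_id, new_score):
    """Yield lines unchanged, except that the matching student's score is replaced."""
    target = str(student_id)
    for line in lines:
        fields = line.split()
        # Only touch well-formed "ID score" lines whose ID matches exactly.
        if len(fields) == 2 and fields[0] == target:
            yield f"{target} {new_score}\n"
        else:
            yield line


if __name__ == "__main__":
    sample = ["151409 40\n", "151133 40\n", "151095 49\n"]
    for out in update_score(sample, 151133, 100):
        print(out, end="")
```

With real files you would pass the open input file as `lines` and hand the generator to `secondfile.writelines(...)`.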
Since I have a huge file (several GB), I would rather not load the whole thing into memory and would instead like to use generators to read it line by line. My file looks something like this:
# millions of lines
..................
..................
keyw 28899
2233 121 ee 0o90 jjsl
2321 232 qq 0kj9 jksl
keyw 28900
3433 124 rr 8hu9 jkas
4532 343 ww 3ko9 aslk
1098 115 uy oiw8 rekl
keyw 29891
..................
..................
# millions more
So far I have found a similar answer here, but I am lost as to how to implement it, because that answer relies on specific Start and Stop identifiers, whereas my file has the same keyword each time followed by an incrementing number. I would like some help with this.
Edit: Generators not iterators
If you want to adapt that answer this may help:
bucket = []
for line in infile:
    if line.split()[0] == 'keyw':
        for strings in bucket:
            outfile.write( strings + '\n')
        bucket = []
        continue
    bucket.append(line.strip())
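Since the question asks for generators, the same bucket idea can be sketched as a generator that yields one block at a time, so only the current block is held in memory. This is only a sketch: `iter_blocks` is a made-up name, it assumes every block starts with a line whose first token is "keyw", and it skips blank lines and anything before the first keyword line.

```python
def iter_blocks(lines, keyword="keyw"):
    """Yield (header, block_lines) pairs, one per 'keyw N' section."""
    header = None  # current "keyw N" line; None until the first one is seen
    bucket = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if stripped.split()[0] == keyword:
            if header is not None:
                yield header, bucket
            header = stripped
            bucket = []
        elif header is not None:
            bucket.append(stripped)
    # Flush the last block, which has no following keyword line.
    if header is not None:
        yield header, bucket


if __name__ == "__main__":
    sample = [
        "keyw 28899\n",
        "2233 121 ee 0o90 jjsl\n",
        "2321 232 qq 0kj9 jksl\n",
        "keyw 28900\n",
        "3433 124 rr 8hu9 jkas\n",
    ]
    for header, block in iter_blocks(sample):
        print(header, len(block))
```

Because a file object is itself a lazy iterator over lines, passing an open file as `lines` keeps the whole thing streaming.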
I'm trying to split lines of text and store key information in a dictionary.
For example I have lines that look like:
Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
For the first line, my key will be "Lasal_00010", and the value I'm storing is "H293".
My current code works fine for this case, but when I encounter a line like:
Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
my code will not store the string "SSCG".
Here is my current code:
dataHash = {}
with open(fasta,'r') as f:
    for ln in f:
        query = ln.split('\t')[0]
        query.strip()
        tempValue = ln.split('\t')[1]
        value = tempValue.split('|')[0]
        value.strip()
        if not dataHash.has_key(query):
            dataHash[query] = ''
        else:
            dataHash[query] = value
for x in dataHash:
    print x + " " + str(dataHash[x])
I believe I am splitting the line incorrectly in the case with two vertical bars. But I'm confused as to where my problem is. Shouldn't "SSCG" be the value I get when I write value = tempValue.split('|')[0]? Can someone explain to me how split works or what I'm missing?
Split on the first pipe, then on the space:
with open(fasta,'r') as f:
    for ln in f:
        query, value = ln.partition('|')[0].split()
I used str.partition() here as you only need to split once.
Your code makes assumptions on where tabs are being used; by splitting on the first pipe first we get to ignore the rest of the line altogether, making it a lot simpler to split the first from the second column.
Demo:
>>> lines = '''\
... Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
... Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
... Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
... '''
>>> for ln in lines.splitlines(True):
...     query, value = ln.partition('|')[0].split()
...     print query, value
...
Lasal_00010 H293
Lasal_00010 SSEG
Lasal_00030 SSCG
However, your code works too, up to a point, albeit less efficiently. Your real problem is with:
if not dataHash.has_key(query):
    dataHash[query] = ''
else:
    dataHash[query] = value
This really means: First time I see query, store an empty string, otherwise store value. I am not sure why you do this; if there are no other lines starting with Lasal_00030, all you have is an empty value in the dictionary. If that wasn't the intention, just store the value:
dataHash[query] = value
No if statement.
Note that dict.has_key() has been deprecated; it is better to use in to test for a key:
if query not in dataHash:
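Putting the pieces together, here is a rough Python 3 sketch. It goes one step beyond the answer above by assuming you might want to keep every value seen for a query rather than only the last one (Lasal_00010 appears twice in the sample). The name `collect_hits` is made up for illustration.

```python
from collections import defaultdict


def collect_hits(lines):
    """Map each query ID to the list of IDs found before the first '|'."""
    hits = defaultdict(list)
    for ln in lines:
        head, sep, _rest = ln.partition("|")
        if not sep:
            continue  # no pipe on this line, nothing to extract
        parts = head.split()
        if len(parts) != 2:
            continue  # unexpected layout, skip rather than guess
        query, value = parts
        hits[query].append(value)
    return dict(hits)


if __name__ == "__main__":
    sample = [
        "Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236",
        "Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240",
        "Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3",
    ]
    print(collect_hits(sample))
```

If you only want one value per query, replace the list append with a plain assignment as shown above.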
I have some sample data which looks like:
ATOM 973 CG ARG A 61 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 61 -21.610 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 61 -21.047 7.452 67.937 1.00 12.13 N
I want to replace the 6th column and only the 6th column by the addition of the offset value, in the case above it is 308.
So 61+308 = 369, so 61 in the 6th column should be replaced by 369
I can't str.split() the line as the line spacing is very important.
I have tried using str.replace(), but the values in column 2 can also overlap with column 6.
I did try reversing the line and using str.replace(), but the values in columns 7, 8, 9, 10 and 11 can overlap with the string to be replaced.
The ugly code I have so far is (which partially works apart from if the values overlap in columns 7,8,9,10 and/or 11):
with open('2kqx.pdb', 'r') as inf, open('2kqx_renumbered.pdb', 'w') as outf:
    for line in inf:
        if line.startswith('ATOM'):
            segs = line.split()
            if segs[4] == 'A':
                offset = 308
                number = segs[5][::-1]
                replacement = str((int(segs[5])+offset))[::-1]
                print number[::-1],replacement
                line_rev = line[::-1]
                replaced_line = line_rev.replace(number,replacement,1)
                print line
                print replaced_line[::-1]
                outf.write(replaced_line[::-1])
The code above produced the output below. As you can see, in the second line the 6th column is not changed; instead the replacement happened in column 7. I thought that by reversing the string I could bypass the potential overlap with column 2, but I forgot about the other columns and I don't really know how to get around it.
ATOM 973 CG ARG A 369 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 61 -21.3690 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 369 -21.047 7.452 67.937 1.00 12.13 N
data = """\
ATOM 973 CG ARG A 61 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 61 -21.610 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 61 -21.047 7.452 67.937 1.00 12.13 N"""
offset = 308
for line in data.split('\n'):
    line = line[:22] + " {:<5d} ".format(int(line[22:31]) + offset) + line[31:]
    print line
I haven't done the exact counting of whitespace, that's just a rough estimate.
If you want more flexibility than just having the numbers 22 and 31 scattered in your code, you'll need a way to determine your start and end index (but that contradicts my assumption that the data is in fixed column format).
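As a sketch of what "determining the start and end index" could look like, you can locate the N-th whitespace-delimited field with `re.finditer` and overwrite it in place, right-aligned to the field's original end so the rest of the line does not move. This is an illustration only: `shift_field` is a made-up name, and it assumes the fields are actually separated by whitespace, which is not guaranteed in fixed-column formats where neighbouring columns can touch.

```python
import re


def shift_field(line, field_index, offset):
    """Add offset to the integer in the given whitespace-delimited field,
    keeping every other character of the line in its original position."""
    matches = list(re.finditer(r"\S+", line))
    m = matches[field_index]
    # Pad to at least the old width so a shorter result does not leave debris.
    new = str(int(m.group()) + offset).rjust(len(m.group()))
    start = m.end() - len(new)
    # Any characters we overwrite in front of the field must be padding.
    if start < 0 or line[start:m.start()].strip():
        raise ValueError("not enough padding before field %d" % field_index)
    return line[:start] + new + line[m.end():]


if __name__ == "__main__":
    sample = "ATOM    973  CG  ARG A  61     -21.593   8.884  69.770  1.00 25.13           C"
    print(shift_field(sample, 5, 308))
```

The function raises instead of silently shifting columns when the new number does not fit into the available padding.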
You had better not try to parse PDB files on your own.
Use a PDB parser. There are many freely available in different bio/computational chemistry packages, for instance biopython.
Here's how to do it with biopython, assuming your input is raw.pdb:
from Bio.PDB import PDBParser, PDBIO
parser=PDBParser()
structure = parser.get_structure('some_id', 'raw.pdb')
for r in structure.get_residues():
    r.id = (r.id[0], r.id[1] + 308, r.id[2])
io = PDBIO()
io.set_structure(structure)
io.save('shifted.pdb')
I googled a bit and found a quick solution to your specific problem here (without third-party dependencies):
http://code.google.com/p/pdb-tools/
There is -- among many other useful pdb-python-script-tools -- this script pdb_offset.py
It is a standalone script and I just copied its pdb_offset method to show it working, your three-line example code is in raw.pdb:
def pdbOffset(pdb_file, offset):
    """
    Adds an offset to the residue column of a pdb file without touching anything
    else.
    """
    # Read in the pdb file
    f = open(pdb_file,'r')
    pdb = f.readlines()
    f.close()
    out = []
    for line in pdb:
        # For an ATOM or TER record, update the residue number
        # (record names are padded to six characters)
        if line[0:6] == "ATOM  " or line[0:6] == "TER   ":
            num = offset + int(line[22:26])
            out.append("%s%4i%s" % (line[0:22],num,line[26:]))
        else:
            out.append(line)
    return "".join(out)

print pdbOffset('raw.pdb', 308)
which prints
ATOM 973 CG ARG A 369 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 369 -21.610 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 369 -21.047 7.452 67.937 1.00 12.13 N
I have a DTML document which only contains:
<dtml-var public_blast_results>
and when I view it, it displays as:
YP_001336283 100.00 345 0 0 23 367 23 367 0.0 688
When I edit the DTML page for example just adding a header like:
<h3>Header</h3>
<dtml-var public_blast_results>
The "public_blast_results" output loses its formatting and displays as:
Header
YP_001336283 100.00 345 0 0 23 367 23 367 0.0 688
Is there a way for maintaining the formatting? public_blast_results is a python function which just simply reads the contents of a file and returns it.
This has nothing to do with DTML. It's a basic property of HTML: runs of whitespace (including line breaks) are collapsed when the page is rendered. If you want to preserve the whitespace, you need to wrap the content in <pre>.
<pre><dtml-var public_blast_results></pre>
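Since public_blast_results is a Python function, another option is to do the wrapping on the Python side. The following is only a sketch under that assumption: `as_preformatted` is a hypothetical helper, not part of Zope or DTML, and it uses the standard library's `html.escape` so that characters like `<` in the file contents are not interpreted as markup.

```python
import html


def as_preformatted(text):
    """Escape text for HTML and wrap it in <pre> so whitespace is preserved."""
    return "<pre>" + html.escape(text) + "</pre>"


if __name__ == "__main__":
    row = "YP_001336283\t100.00\t345\t0\t0\t23\t367\t23\t367\t0.0\t688\n"
    print(as_preformatted(row))
```

If you go this way, don't also wrap the dtml-var in <pre> in the template, or the block will be nested twice.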