I have a DTML document which only contains:
<dtml-var public_blast_results>
and when I view it, it displays as:
YP_001336283 100.00 345 0 0 23 367 23 367 0.0 688
When I edit the DTML page for example just adding a header like:
<h3>Header</h3>
<dtml-var public_blast_results>
The "public_blast_results" loses its formatting and displays as:
Header
YP_001336283 100.00 345 0 0 23 367 23 367 0.0 688
Is there a way to maintain the formatting? public_blast_results is a Python function that simply reads the contents of a file and returns them.
This has nothing to do with DTML; it's a basic property of HTML, which collapses whitespace. If you want to preserve it, you need to wrap the content in <pre>.
<pre><dtml-var public_blast_results></pre>
Related
I am very new to python, and am having some problems I can't seem to find answers to.
I have a large file I am trying to read in, and then split and write out specific information. I am having trouble with the read and split: it only prints the same thing over and over again.
blast_output = open("blast.txt").read()
for line in blast_output:
    subFields = [item.split('|') for item in blast_output.split()]
    print(str(subFields[0][0]) + "\t" + str(subFields[0][1]) + "\t" + str(subFields[1][3]) + "\t" + str(subFields[2][0]))
My input file has many rows that look like this:
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
The output I am receiving is this:
c0_g1_i1 m.1 Q9HGP0.1 100.00
c0_g1_i1 m.1 Q9HGP0.1 100.00
c0_g1_i1 m.1 Q9HGP0.1 100.00
c0_g1_i1 m.1 Q9HGP0.1 100.00
But what I am wanting is
c0_g1_i1 m.1 Q9HGP0.1 100.0
c1002_g1_i1 m.801 Q10302.1 100.0
c1003_g1_i1 m.803 Q6BDR8.1 100.0
c1004_g1_i1 m.804 O94325.1 100.0
You don't need to call the read method of the file object; just iterate over it, line by line. Then use line instead of blast_output inside the for loop, so that each iteration works on the current line instead of repeating the same action:
with open("blast.txt") as blast_output:
    for line in blast_output:
        subFields = [item.split('|') for item in line.split()]
        print("{:15}{:10}{:10}{:10}".format(subFields[0][0], subFields[0][1],
                                            subFields[1][3], subFields[2][0]))
I have opened the file in a context manager using with, so closing is handled automatically by Python. I have also used string formatting to build the final string.
c0_g1_i1       m.1       Q9HGP0.1  100.00
c1002_g1_i1    m.801     Q10302.1  100.00
c1003_g1_i1    m.803     Q6BDR8.1  100.00
c1004_g1_i1    m.804     O94325.1  100.00
Great question. You are processing the same input over and over again with this line:
subFields = [item.split('|') for item in blast_output.split()]
The Python 2.x version looks like this (note that the file object is iterated directly, without calling read(), so each iteration yields one line):
blast_output = open("blast.txt")
for line in blast_output:
    subFields = [item.split('|') for item in line.split()]
    print(str(subFields[0][0]) + "\t" + str(subFields[0][1]) + "\t" + str(subFields[1][3]) + "\t" + str(subFields[2][0]))
blast_output.close()
See Moses Koledoye's answer for the nicely formatted Python 3.x version.
I have been receiving indexing errors in Python. My code worked correctly when reading in a file and simply printing the desired output, but now that I am trying to write the output to a file, I seem to have a problem with indexing. I've tried a couple of different things (I left one attempt commented out), but either way I keep getting an indexing error.
EDIT: The original error may have been caused by a problem in Eclipse; when running on the server, I am having a new issue.
I can now get it to run and produce output to a .txt file; however, it only prints a single line of output.
with open("blast.txt") as blast_output:
    for line in blast_output:
        subFields = [item.split('|') for item in line.split()]
        #transId = str(subFields[0][0])
        #iso = str(subFields[0][1])
        #sp = str(subFields[1][3])
        #identity = str(subFields[2][0])
        out = open("parsed_blast.txt", "w")
        #out.write(transId + "\t" + iso + "\t" + sp + "\t" + identity)
        out.write((str(subFields[0][0]) + "\t" + str(subFields[0][1]) + "\t" + str(subFields[1][3]) + "\t" + str(subFields[2][0])))
        out.close()
IndexError: list index out of range
Input file looks like:
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
c1005_g1_i1|m.805 gi|9910811|sp|O42832.2|SPB1_SCHPO 100.00 802 0 0 1 802 1 802 0.0 1644
c1006_g1_i1|m.806 gi|74627042|sp|O94631.1|MRM1_SCHPO 100.00 255 0 0 1 255 47 301 0.0 525
Expected output
c0_g1_i1 m.1 Q9HGP0.1 100.00
c1002_g1_i1 m.801 Q10302.1 100.00
c1003_g1_i1 m.803 Q6BDR8.1 100.00
c1004_g1_i1 m.804 O94325.1 100.00
c1005_g1_i1 m.805 O42832.2 100.00
c1006_g1_i1 m.806 O94631.1 100.00
Instead, my output is only one of the lines rather than all of them.
You are overwriting the same file again and again. Open the file outside the for loop, or open it in append mode 'a'.
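A minimal sketch of that fix, opening the output file once before the loop (file names follow the question; the two sample input lines are written here just to make the demo self-contained):

```python
# Create a tiny sample input, standing in for the real blast.txt.
sample = (
    "c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754\n"
    "c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310\n"
)
with open("blast.txt", "w") as f:
    f.write(sample)

# Open the output file ONCE, outside the loop, so every parsed line
# is written to the same file handle instead of overwriting the file.
with open("blast.txt") as blast_input, open("parsed_blast.txt", "w") as out:
    for line in blast_input:
        subFields = [item.split('|') for item in line.split()]
        out.write("\t".join([subFields[0][0], subFields[0][1],
                             subFields[1][3], subFields[2][0]]) + "\n")
```

With the sample above, parsed_blast.txt ends up containing one tab-separated row per input line.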
I suggest you write the whole file to a string.
with open("blast.txt", 'r') as fileIn:
    data = fileIn.read()
then process the data.
data = func(data)
Then write to file out.
with open('bast_out.txt','w') as fileOut:
    fileOut.write(data)
As @H Doucet said, write the whole thing to a string, then work with it. Leave the open() call outside the loop so the output file is only opened and closed once, and make sure to open it in append mode. I've also cleaned up your out.write() call: there's no need to wrap those list items in str(), they are already strings. I also added a newline ("\n") to the end of each line.
with open("blast.txt") as f:
    blast_output = f.read()

out = open("parsed_blast.txt", "a")
for line in blast_output.split("\n"):
    if not line.strip():  # skip blank lines, e.g. the trailing newline
        continue
    subFields = [item.split('|') for item in line.split()]
    out.write("{}\t{}\t{}\t{}\n".format(subFields[0][0], subFields[0][1],
                                        subFields[1][3], subFields[2][0]))
out.close()
I have a .txt file looking like:
rawdata/d-0197.bmp 1 329 210 50 51
rawdata/c-0044.bmp 1 215 287 59 48
rawdata/e-0114.bmp 1 298 244 46 45
rawdata/102.bmp 1 243 126 163 143
I need to transform it in the following way:
- Before "rawdata", add the whole path, which is "/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/".
- Add a comma after ".bmp".
- Remove the first number (the 1).
- Put the other four numbers into square brackets [].
It would look like:
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/d-0197.bmp, [329 210 50 51]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/c-0044.bmp, [215 287 59 48]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/e-0114.bmp, [298 244 46 45]
/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/102.bmp, [243 126 163 143]
I have done it, first by replacing "rawdata/" with nothing in a simple text editor, and then with Python:
file=open('data.txt')
fout=open('data2.txt','w')
for line in file:
    line=line.rstrip()
    pieces=line.split('.bmp')
    pieces2=pieces[1].split()
    fout.write('/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/rawdata/'+pieces[0]+'.bmp, '+'['+pieces2[1]+' '+pieces2[2]+' '+pieces2[3]+' '+pieces2[4]+']'+'\n')
fout.close()
But this file is going to be used in Matlab, so it would be much better to have an automatic process. How can I do the same in Matlab?
Thank you
Here you go:
infid = fopen('data.txt', 'r');
outfid = fopen('data2.txt', 'w');
dirStr = '/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/';
while ~feof(infid)
    inline = fgetl(infid);
    outline = [dirStr, regexprep(inline,' 1 (\d* \d* \d* \d*)',', [$1]')];
    fprintf(outfid, '%s\n', outline);
end
fclose(infid);
fclose(outfid);
What we've done there is to read the input file line by line, use a regular expression to make the changes to each line, then write it out to the output file. There are probably better ways of applying the regular expression, but that was pretty quick.
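For comparison with the asker's original string-splitting code, the same regular-expression idea can be sketched in Python; the sample line below is taken from the question:

```python
import re

# Prefix to prepend to every line (from the question).
dir_str = '/home/camroom/Dropbox/Internship/MyCascades/Cascade1/training/positive/'

# One sample input line; a real script would loop over the file.
line = 'rawdata/d-0197.bmp 1 329 210 50 51'

# Drop the leading " 1 " and wrap the remaining four numbers
# in square brackets, then prepend the directory path.
out = dir_str + re.sub(r' 1 (\d+ \d+ \d+ \d+)', r', [\1]', line)
print(out)
```

This produces the same transformed line as the Matlab regexprep call above.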
I'm trying to split lines of text and store key information in a dictionary.
For example I have lines that look like:
Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
For the first line, my key will be "Lasal_00010", and the value I'm storing is "H293".
My current code works fine for this case, but when I encounter a line like:
Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
my code will not store the string "SSCG".
Here is my current code:
dataHash = {}
with open(fasta,'r') as f:
    for ln in f:
        query = ln.split('\t')[0]
        query.strip()
        tempValue = ln.split('\t')[1]
        value = tempValue.split('|')[0]
        value.strip()
        if not dataHash.has_key(query):
            dataHash[query] = ''
        else:
            dataHash[query] = value

for x in dataHash:
    print x + " " + str(dataHash[x])
I believe I am splitting the line incorrectly in the case with two vertical bars. But I'm confused as to where my problem is. Shouldn't "SSCG" be the value I get when I write value = tempValue.split('|')[0]? Can someone explain to me how split works or what I'm missing?
Split on the first pipe, then on the space:
with open(fasta,'r') as f:
    for ln in f:
        query, value = ln.partition('|')[0].split()
I used str.partition() here as you only need to split once.
Your code makes assumptions on where tabs are being used; by splitting on the first pipe first we get to ignore the rest of the line altogether, making it a lot simpler to split the first from the second column.
Demo:
>>> lines = '''\
... Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
... Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
... Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
... '''
>>> for ln in lines.splitlines(True):
... query, value = ln.partition('|')[0].split()
... print query, value
...
Lasal_00010 H293
Lasal_00010 SSEG
Lasal_00030 SSCG
However, your code works too, up to a point, albeit less efficiently. Your real problem is with:
if not dataHash.has_key(query):
    dataHash[query] = ''
else:
    dataHash[query] = value
This really means: the first time I see query, store an empty string; otherwise store value. I am not sure why you do this; if there are no other lines starting with Lasal_00030, all you have is an empty value in the dictionary. If that wasn't the intention, just store the value:
dataHash[query] = value
No if statement.
Note that dict.has_key() has been deprecated; it is better to use in to test for a key:
if query not in dataHash:
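Putting both fixes together, here is a minimal sketch of the corrected loop, run on in-memory sample lines from the question instead of the fasta file:

```python
# Sample lines from the question, tab-separated as in the real file.
lines = [
    "Lasal_00010\tH293|H293_08936\t42.37\t321",
    "Lasal_00030\tSSCG|pSCL4|SSCG_06461\t27.06\t218",
]

dataHash = {}
for ln in lines:
    # Everything before the first pipe is "<query> <value>";
    # split once on the pipe, then on whitespace.
    query, value = ln.partition('|')[0].split()
    dataHash[query] = value  # no if-statement needed

print(dataHash["Lasal_00030"])  # SSCG
```

The extra pipes in lines like SSCG|pSCL4|SSCG_06461 are ignored, because only the text before the first pipe is ever split.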
I am trying to setup a standard work flow to efficiently import data from the Dutch National Bureau of Statistics (http://statline.cbs.nl) in SPSS syntax into R and /or Python so I can do analyses, load it into our database etc.
The good news is that they have standardized a lot of different output formats, amongst others an .sps syntax file. In essence, this is a space-delimited data file with extra information contained in the header and in the footer. The file looks like shown below. I prefer to use this format than plain .csv because it contains more data and should make it easier to import large amounts of data in a consistent manner.
The bad news is that I can't find a working library in Python and/or R that can deal with .sps SPSS syntax files. Most libraries work with the binary .sav or .por formats.
I am not looking for a full working SPSS clone, but something that will parse the data correctly using the metadata under the keywords 'DATA LIST' (the length of each column), 'VAR LABELS' (the column headers) and 'VALUE LABELS' (extra data that should be joined/replaced during the import).
I'm sure a Python/R library could be written to parse and process all this info efficiently, but I am not that fluent/experienced in either language to do it myself.
Any suggestions or hints would be helpful.
SET DECIMAL = DOT.
TITLE "Gezondheidsmonitor; regio, 2012, bevolking van 19 jaar of ouder".
DATA LIST RECORDS = 1
/1 Key0 1 - 5 (A)
Key1 7 - 7 (A)
Key2 9 - 14 (A)
Key3 16 - 23 (A)
Key4 25 - 28 (A)
Key5 30 - 33 (A)
Key6 35 - 38 (A)
Key7 40 - 43 (A).
BEGIN DATA
80200 1 GM1680 2012JJ00 . . . .
80200 1 GM0738 2012JJ00 13.2 . . 21.2
80200 1 GM0358 2012JJ00 . . . .
80200 1 GM0197 2012JJ00 13.7 . . 10.8
80200 1 GM0059 2012JJ00 12.4 . . 16.5
80200 1 GM0482 2012JJ00 13.3 . . 14.1
80200 1 GM0613 2012JJ00 11.6 . . 16.2
80200 1 GM0361 2012JJ00 17.0 9.6 17.1 14.9
80200 1 GM0141 2012JJ00 . . . .
80200 1 GM0034 2012JJ00 14.3 18.7 22.5 18.3
80200 1 GM0484 2012JJ00 9.7 . . 15.5
(...)
80200 3 GM0642 2012JJ00 15.6 . . 19.6
80200 3 GM0193 2012JJ00 . . . .
END DATA.
VAR LABELS
Key0 "Leeftijd"/
Key1 "Cijfersoort"/
Key2 "Regio's"/
Key3 "Perioden"/
Key4 "Mantelzorger"/
Key5 "Zwaar belaste mantelzorgers"/
Key6 "Uren mantelzorg per week"/
Key7 "Ernstig overgewicht".
VALUE LABELS
Key0 "80200" "65 jaar of ouder"/
Key1 "1" "Percentages"
"2" "Ondergrens"
"3" "Bovengrens"/
Key2 "GM1680" "Aa en Hunze"
"GM0738" "Aalburg"
"GM0358" "Aalsmeer"
"GM0197" "Aalten"
(...)
"GM1896" "Zwartewaterland"
"GM0642" "Zwijndrecht"
"GM0193" "Zwolle"/
Key3 "2012JJ00" "2012".
LIST /CASES TO 10.
SAVE /OUTFILE "Gezondheidsmonitor__regio,_2012,_bevolking_van_19_jaar_of_ouder.SAV".
Some sample code to get you started. Sorry, not the best Python programmer here, so any improvements are welcome.
A step still to add here is a method to load the labels and create a dict for the VALUE LABELS.
f = open('Bevolking_per_maand__100214211711.sps','r')
spss_keys = list()
data = list()
num_records = None
begin_data_step = False
end_data_step = False
for l in f:
    # first look for TITLE
    if l.find('TITLE') != -1:
        start_pos = l.find('"') + 1
        end_pos = l.find('"', start_pos + 1)
        title = l[start_pos:end_pos]
        print "title:", title
    if l.find('DATA LIST') != -1:
        start_pos = l.find('=') + 1
        num_records = l[start_pos:].strip()
        print "number of records =", num_records
    if num_records == '1':
        if l.find("Key") != -1 and not begin_data_step and not end_data_step:
            spss_keys.append([l[15:22].strip(), int(l[23:29].strip()),
                              int(l[32:36].strip()), l[37:].strip()])
    if l.find('END DATA.') != -1:
        end_data_step = True
    if begin_data_step and not end_data_step:
        values = list()
        for key in spss_keys:
            values.append(l[key[1]-1:key[2]])
        data.append(values)
        if l[-1] == ".":
            begin_data_step = False
    if l.find('BEGIN DATA') != -1:
        begin_data_step = True
    if end_data_step:
        pass  # more to follow: parse VAR LABELS and VALUE LABELS here
f.close()
print data
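The missing labels step could be sketched as follows; this is a hypothetical parser for a VALUE LABELS block, run here on a fragment copied from the sample file (a real version would read from the .sps file until the terminating period):

```python
import re

# Fragment of a VALUE LABELS block, copied from the sample file.
fragment = '''VALUE LABELS
Key0 "80200" "65 jaar of ouder"/
Key1 "1" "Percentages"
"2" "Ondergrens"
"3" "Bovengrens"/
Key3 "2012JJ00" "2012".'''

value_labels = {}
current = None
for line in fragment.splitlines():
    line = line.strip()
    if line == "VALUE LABELS":
        continue
    # A line may start a new variable ("KeyN ..."); otherwise it
    # continues the label list of the current variable.
    m = re.match(r'^(Key\d+)\s+', line)
    if m:
        current = m.group(1)
        value_labels[current] = {}
        line = line[m.end():]
    # Each "code" "label" pair maps a data value to its label.
    for code, label in re.findall(r'"([^"]*)"\s+"([^"]*)"', line):
        value_labels[current][code] = label

print(value_labels["Key1"]["2"])  # Ondergrens
```

With such a dict-of-dicts, the codes in the parsed data rows can be replaced by their labels during import.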
From my point of view, I would not bother with the SPSS file option, but choose the HTML version and scrape it. It looks like the tables are nicely formatted with classes, which would make scraping/parsing the HTML much easier.
Another question to answer: are you going to download the files manually, or would you also like that to happen automatically?
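To illustrate the scraping idea with only the standard library, here is a hedged sketch; the inline HTML stands in for a downloaded Statline page, and its class names and structure are made up, the real page will differ:

```python
from html.parser import HTMLParser

# Stand-in for a downloaded Statline page (structure is hypothetical).
html_doc = """
<table class="data">
  <tr><th>Regio</th><th>Percentage</th></tr>
  <tr><td>GM0738</td><td>13.2</td></tr>
  <tr><td>GM0197</td><td>13.7</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect table rows as lists of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

parser = TableParser()
parser.feed(html_doc)
print(parser.rows)
```

For real pages, a dedicated library such as BeautifulSoup or pandas.read_html would be less work, but the principle is the same: locate the table by its class and walk its rows and cells.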