Read in a file, splitting and then writing out desired output - python

I am very new to python, and am having some problems I can't seem to find answers to.
I have a large file I am trying to read in and then split and write out specific information. I am having trouble with the read in and split, where it is only printing the same thing over and over again.
blast_output = open("blast.txt").read()
for line in blast_output:
    subFields = [item.split('|') for item in blast_output.split()]
    print(str(subFields[0][0]) + "\t" + str(subFields[0][1]) + "\t" + str(subFields[1][3]) + "\t" + str(subFields[2][0]))
My input file has many rows that look like this:
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
The output I am receiving is this:
c0_g1_i1 m.1 Q9HGP0.1 100.00
c0_g1_i1 m.1 Q9HGP0.1 100.00
c0_g1_i1 m.1 Q9HGP0.1 100.00
c0_g1_i1 m.1 Q9HGP0.1 100.00
But what I am wanting is
c0_g1_i1 m.1 Q9HGP0.1 100.0
c1002_g1_i1 m.801 Q10302.1 100.0
c1003_g1_i1 m.803 Q6BDR8.1 100.0
c1004_g1_i1 m.804 O94325.1 100.0

You don't need to call the read method of the file object, just iterate over it, line by line. Then replace blast_output with line in the for loop to avoid repeating the same action across all the iterations:
with open("blast.txt") as blast_output:
    for line in blast_output:
        subFields = [item.split('|') for item in line.split()]
        print("{:15}{:10}{:10}{:10}".format(subFields[0][0], subFields[0][1],
                                            subFields[1][3], subFields[2][0]))
I have opened the file in a context using with, so closing is automatically done by Python. I have also used string formatting to build the final string.
c0_g1_i1       m.1       Q9HGP0.1  100.00
c1002_g1_i1    m.801     Q10302.1  100.00
c1003_g1_i1    m.803     Q6BDR8.1  100.00
c1004_g1_i1    m.804     O94325.1  100.00
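The same line-by-line idea can be wrapped in a small helper, which makes the field positions easier to see. This is only a sketch: parse_blast_line is a made-up name, and it assumes every row has the layout shown in the question (query|isoform, then the gi|...|sp|accession|name block, then the percent identity).

```python
def parse_blast_line(line):
    """Return (transcript, isoform, accession, identity) from one row."""
    columns = line.split()                       # whitespace-separated columns
    transcript, isoform = columns[0].split('|')[:2]
    accession = columns[1].split('|')[3]         # 4th piece of the gi|...|sp|... block
    identity = columns[2]                        # percent identity, kept as text
    return transcript, isoform, accession, identity


sample = "c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754"
print("\t".join(parse_blast_line(sample)))
```

Inside the loop you would then call parse_blast_line(line) for each line of the open file.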

Great question. You are taking the same input over and over again with this line
subFields = [item.split('|') for item in blast_output.split()]
The python 2.x version looks like this:
# iterate over the file object itself; iterating over .read() would yield single characters
blast_output = open("blast.txt")
for line in blast_output:
    subFields = [item.split('|') for item in line.split()]
    print(str(subFields[0][0]) + "\t" + str(subFields[0][1]) + "\t" + str(subFields[1][3]) + "\t" + str(subFields[2][0]))
see Moses Koledoye's version for the Python 3.x formatted niceness

Related

How to multiply different numbers from different lines in a text

I have a text file like this:
month /name/ number/ price
1 John 100 120.00
1 Sean 90 125.00
1 Laura 150 100.00
1 Joseph 95 140.00
1 Pam 91 105.00
2 John 110 120.00
2 Sean 98 100.00
2 Laura 100 100.00
2 Joseph 89 150.00
2 Pam 100 100.00
3 John 100 121.00
3 Sean 90 120.00
3 Laura 97 100.00
3 Joseph 120 110.00
3 Pam 101 100.00
I need to get a specific person's (such as Pam) revenue per month and total revenue in 1,2 and 3 months (number*price). I have the code below and the output below. But I have no idea how to get the total revenue, can anyone give to me some advice or idea?
#This is the code I use
f = input('Enter The File Name:')
sales_data = open("sales.txt",'r')
lines = sales_data.readlines()
m = input('Enter the Manager Name:')
print('Monthly Sales Report for' +' ' + m)
for line in lines:
    line = line.split()
    tr = (float(line[2]) * float(line[3]))
    if m in line:
        print(line[0] +' ' + line[2] + ' ' + line[3] +' ' + str(tr))
#This is the output I got
Enter the Manager Name: Pam
Monthly Sales Report for Pam
1 91 105.00 9555.0
2 100 100.00 10000.0
3 101 100.00 10100.0
One possible way to solve your issue is to store all monthly values in a dictionary for a particular manager:
file_name = input('Enter The File Name: ')
manager_summary = {'1':0.0, '2':0.0, '3':0.0}
with open (file_name, 'r') as fin:
    lines = fin.readlines()
manager = input('Enter the Manager Name: ')
print('Monthly Sales Report for' +' ' + manager)
for line in lines:
    line = line.split()
    if manager in line:
        manager_summary[line[0]] += float(line[-2])*float(line[-1])
manager_total = 0.0
for key, value in manager_summary.items():
    manager_total += value
print(manager_total)
The code reads the input file at once, loops through all the lines in search of the target manager and stores cumulative monthly sales for that manager in a dictionary. The total revenue for 3 month period is then computed by adding cumulative values for each month stored in the dictionary.
There were a couple of changes with respect to your original code worth noting:
In your code you ask the user for a file name but then hardcode it in the next line; here the file name entered by the user is actually used.
Instead of a bare open, this code uses with open. with open is a context manager that automatically closes the file for you when it is no longer needed, something that was missing in your program.
Cumulative data is stored in a dictionary with keys being month numbers. This allows for having more than one monthly entry per manager.
Variable names are more meaningful. It is generally not recommended to use variables like f or m; it makes the program more bug prone and way less readable. The ones used here are longish, you can always come up with something in between.
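To see the per-month dictionary idea without needing a file or keyboard input, here is a minimal sketch that works on a list of lines. The function name monthly_revenue and the sample rows are only for illustration; the rows are copied from the question's data.

```python
def monthly_revenue(lines, manager):
    """Map month -> revenue (number * price) for one manager."""
    summary = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or fields[1] != manager:
            continue  # skip the header, blank lines and other managers
        month, _, number, price = fields
        summary[month] = summary.get(month, 0.0) + float(number) * float(price)
    return summary


sample_lines = [
    "month /name/ number/ price",
    "1 Pam 91 105.00",
    "1 John 100 120.00",
    "2 Pam 100 100.00",
    "3 Pam 101 100.00",
]
pam = monthly_revenue(sample_lines, "Pam")
print(pam)                 # revenue per month
print(sum(pam.values()))   # total over all months
```

Comparing fields[1] with the manager name, rather than testing manager in line, avoids accidental matches against other columns.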
You can solve this using a dictionary - specifically a defaultdict. You can keep track of a dictionary of people's names to revenue.
First import defaultdict and define a dictionary:
from collections import defaultdict
revenue_dictionary = defaultdict(float)
Then just after you've calculated tr, add this to the dictionary:
revenue_dictionary[line[1]] += tr
At the end of the script, you'll have a dictionary which looks like:
{
'John': 37300.0,
'Sean': 31850.0,
'Laura': 34700.0,
'Joseph': 39850.0,
'Pam': 29655.0
}
And you can access any of these using revenue_dictionary['Pam'], or m instead of 'Pam'.
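Put together, the defaultdict approach might look like the sketch below. It uses a few rows from the question as an in-memory list instead of reading sales.txt, so the variable sample_lines is an assumption made for the example.

```python
from collections import defaultdict

# A subset of the rows from the question (month, name, number, price).
sample_lines = [
    "1 John 100 120.00", "1 Pam 91 105.00",
    "2 John 110 120.00", "2 Pam 100 100.00",
    "3 John 100 121.00", "3 Pam 101 100.00",
]

revenue_dictionary = defaultdict(float)  # missing names start at 0.0
for line in sample_lines:
    month, name, number, price = line.split()
    revenue_dictionary[name] += float(number) * float(price)

print(revenue_dictionary["Pam"])
print(revenue_dictionary["John"])
```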
f = input('Enter The File Name: ')
sales_data = open(f,'r')
lines = sales_data.readlines()
m = input('Enter the Manager Name: ')
print('Monthly Sales Report for ' + m)
TOTAL_REVENUE=0
for line in lines:
    line = line.split()
    if m in line:
        tr = (float(line[2]) * float(line[3]))
        TOTAL_REVENUE=TOTAL_REVENUE+tr
        print(line[0] +' ' + line[2] + ' ' + line[3] +' ' + str(tr))
print("GRAND TOTAL REVENUE: " + str(TOTAL_REVENUE))

Writing to a file in python

I have been receiving indexing errors in python. I got my code to work correctly through reading in a file and simply printing the desired output, but now I am trying to write the output to a file. I seem to be having a problem with indexing when trying to write it. I've tried a couple different things, I left an attempt commented out. Either way I keep getting an indexing error.
EDIT Original error may be caused by an error in eclipse, but when running on server, having a new issue*
I can now get it to run and produce output to a .txt file, however it only prints a single output
with open("blast.txt") as blast_output:
    for line in blast_output:
        subFields = [item.split('|') for item in line.split()]
        #transId = str(subFields[0][0])
        #iso = str(subFields[0][1])
        #sp = str(subFields[1][3])
        #identity = str(subFields[2][0])
        out = open("parsed_blast.txt", "w")
        #out.write(transId + "\t" + iso + "\t" + sp + "\t" + identity)
        out.write((str(subFields[0][0]) + "\t" + str(subFields[0][1]) + "\t" + str(subFields[1][3]) + "\t" + str(subFields[2][0])))
        out.close()
IndexError: list index out of range
Input file looks like:
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
c1005_g1_i1|m.805 gi|9910811|sp|O42832.2|SPB1_SCHPO 100.00 802 0 0 1 802 1 802 0.0 1644
c1006_g1_i1|m.806 gi|74627042|sp|O94631.1|MRM1_SCHPO 100.00 255 0 0 1 255 47 301 0.0 525
Expected output
c0_g1_i1 m.1 Q9HGP0.1 100.00
c1002_g1_i1 m.801 Q10302.1 100.00
c1003_g1_i1 m.803 Q6BDR8.1 100.00
c1004_g1_i1 m.804 O94325.1 100.00
c1005_g1_i1 m.805 O42832.2 100.00
c1006_g1_i1 m.806 O94631.1 100.00
My output is instead only one of the lines instead of all of the lines
You are overwriting the same file again and again. Open the file outside the for loop or open it in append mode 'a'
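A minimal sketch of the "open it once, outside the loop" fix follows. The helper name write_parsed is made up, and io.StringIO objects stand in for the real files so the example runs on its own; blank lines are skipped because they would otherwise raise an IndexError.

```python
import io


def write_parsed(blast_lines, out):
    """Write one tab-separated row per non-blank input line to `out`."""
    for line in blast_lines:
        if not line.strip():
            continue  # nothing to parse on a blank line
        sub_fields = [item.split('|') for item in line.split()]
        out.write("\t".join([sub_fields[0][0], sub_fields[0][1],
                             sub_fields[1][3], sub_fields[2][0]]) + "\n")


source = io.StringIO(
    "c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754\n"
    "c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310\n"
)
target = io.StringIO()
write_parsed(source, target)   # the output is opened once, before the loop runs
print(target.getvalue(), end="")
```

With real files, both can be opened in a single with statement, for example with open("blast.txt") as src, open("parsed_blast.txt", "w") as out, followed by write_parsed(src, out).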
I suggest you read the whole file into a string.
with open("blast.txt", 'r') as fileIn:
    data = fileIn.read()
Then process the data (func stands for whatever processing you need).
data = func(data)
Then write it to the output file.
with open('blast_out.txt','w') as fileOut:
    fileOut.write(data)
As @H Doucet said, read the whole thing into a string, then work with it. Leave the open() function out of the loop so it only opens & closes the file once, and make sure to open as "append." I've also cleaned up your out.write() call. No need to convert those list items to strings, they already are. I've also added a newline ("\n") to the end of each line and skipped blank lines, which would otherwise raise an IndexError.
with open("blast.txt") as f:
    blast_output = f.read()
out = open("parsed_blast.txt", "a")
for line in blast_output.splitlines():
    if not line.strip():
        continue  # skip blank lines
    subFields = [item.split('|') for item in line.split()]
    out.write("{}\t{}\t{}\t{}\n".format(subFields[0][0], subFields[0][1],
                                        subFields[1][3], subFields[2][0]))
out.close()

python print particular lines from file

The background:
Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 2208 40 0.755 0.00803 0.739 0.771
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771
Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.
What I want seems simple. I want to turn the above file into an output that looks like this:
Gene1 0.755
Gene2 0.744
i.e. each gene, and the last number in the survival column from each section.
I have tried multiple ways, using regular expression, reading the file in as a list and saying ".next()". One example of code that I have tried:
fileopen = open(sys.argv[1]).readlines() # Read in the file as a list.
for index,line in enumerate(fileopen): # Enumerate items in list
    if "Table" in line: # Find the items with "Table" (This will have my gene name)
        line2 = line.split("=")[1] # Parse line to get my gene name
        if "\n" in fileopen[index+1]: # This is the problem section.
            print fileopen[index]
        else:
            fileopen[index+1]
So as you can see in the problem section, I was trying to say in this attempt:
if the next item in the list is a new line, print the item, else, the next line is the current line (and then I can split the line to pull out the particular number I want).
If anyone could correct the code so I can see what I did wrong I'd appreciate it.
Bit of overkill, but instead of manually writing parser for each data item use existing package like pandas to read in the csv file. Just need to write a bit of code to specify the relevant lines in the file. Un-optimized code (reading file twice):
import pandas as pd
def genetable(gene):
    l = open('gene.txt').readlines()
    l += "\n" # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene"+str(gene) in line:
            skiprows = i+1
        if skiprows>=0 and line=="\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            # assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"
this will read in a DataFrame with all the relevant data in that file
can then do:
genetable(2).survival # series with all survival rates
genetable(2).survival.iloc[-1] # last item in survival
The advantage of this is that you have access to all the items, and any malformatting of the file will probably be picked up sooner, preventing incorrect values from being used. In my own code I would add assertions on column names before returning the pandas DataFrame, to pick up any parsing errors early so that they do not propagate.
This worked when I tried it:
# filelines is assumed to hold the lines of the file, e.g. from readlines()
gene = 1
for i in range(len(filelines)):
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1
You could try something like this (I copied your data into foo.dat);
In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:
Using with makes sure the file is closed after reading.
In [3]: lines = [ln.strip() for ln in lines]
This gets rid of extra whitespace.
In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
In [6]: startgenes
Out[6]: [0, 10]
In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]
In [8]: emptylines
Out[8]: [9, 17]
Using emptylines relies on the fact that the records are separated by lines containing only whitespace.
In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]
In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # index 3 is the survival column
   ....:     print gene, num
   ....:
Gene1 0.755
Gene2 0.744
here is my solution:
>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744
Instead of checking for new line, simply print when you are done reading the file
lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "": # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:
            finalsurvival = line.split()[3] # survival is the 4th column
        except IndexError:
            continue
print table, finalsurvival
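For comparison, here is a Python 3 sketch of the same "remember the last data row of each block" idea, written as a function that returns the results instead of printing them. The name last_survival_per_gene is made up, the sample is a shortened copy of the question's data, and it assumes data rows start with a digit and that survival is the fourth whitespace-separated column.

```python
def last_survival_per_gene(lines):
    """Return [(gene, last survival value)] for each Table$Gene=... block."""
    results = []
    gene = None
    survival = None
    for line in lines:
        line = line.strip()
        if line.startswith("Table"):
            if gene is not None:
                results.append((gene, survival))   # close the previous block
            gene, survival = line.split("=")[1], None
        elif line and line[0].isdigit():
            survival = line.split()[3]             # keep overwriting; the last row wins
    if gene is not None:
        results.append((gene, survival))           # close the final block
    return results


sample = """Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
6 2208 40 0.755 0.00803 0.739 0.771

Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
4 1000 40 0.744 0.00803 0.739 0.774
"""
for gene, survival in last_survival_per_gene(sample.splitlines()):
    print(gene, survival)
```

Because the block is closed when the next Table line (or the end of the input) is reached, this does not depend on a blank line separating the sections.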

Python: replace only one occurrence in a string

I have some sample data which looks like:
ATOM 973 CG ARG A 61 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 61 -21.610 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 61 -21.047 7.452 67.937 1.00 12.13 N
I want to replace the 6th column and only the 6th column by the addition of the offset value, in the case above it is 308.
So 61+308 = 369, so 61 in the 6th column should be replaced by 369
I can't str.split() the line as the line spacing is very important.
I have tried using str.replace() but the values in column 2 can also overlap with column 6
I did try reversing the line and using str.replace(), but the values in columns 7, 8, 9, 10 and 11 can overlap with the string to be replaced.
The ugly code I have so far is (which partially works apart from if the values overlap in columns 7,8,9,10 and/or 11):
with open('2kqx.pdb', 'r') as inf, open('2kqx_renumbered.pdb', 'w') as outf:
    for line in inf:
        if line.startswith('ATOM'):
            segs = line.split()
            if segs[4] == 'A':
                offset = 308
                number = segs[5][::-1]
                replacement = str((int(segs[5])+offset))[::-1]
                print number[::-1],replacement
                line_rev = line[::-1]
                replaced_line = line_rev.replace(number,replacement,1)
                print line
                print replaced_line[::-1]
                outf.write(replaced_line[::-1])
The code above produced the output below. As you can see, in the second line the 6th column is not changed, but column 7 is. I thought that by reversing the string I could bypass the potential overlap with column 2, but I forgot about the other columns and I don't really know how to get around it.
ATOM 973 CG ARG A 369 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 61 -21.3690 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 369 -21.047 7.452 67.937 1.00 12.13 N
data = """\
ATOM    973  CG  ARG A  61     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A  61     -21.610   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A  61     -21.047   7.452  67.937  1.00 12.13           N"""
offset = 308
for line in data.split('\n'):
    # the residue number sits in columns 23-26 (slice 22:26) of a fixed-column PDB record
    line = line[:22] + "{:>4d}".format(int(line[22:26]) + offset) + line[26:]
    print line
The slice assumes the standard fixed-column PDB layout, where the residue number occupies columns 23-26; check it against your own file before relying on it.
If you want more flexibility than just having the numbers 22 and 26 scattered in your code, you'll need a way to determine your start and end index (but that contrasts with my assumption that the data is in fixed column format).
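One way to keep the indices out of the middle of the code is to name them once and pass them as parameters. The sketch below is Python 3 and makes two assumptions: the residue number occupies columns 23-26 of a standard PDB ATOM record (slice 22:26), and the sample record's spacing has been reconstructed to that layout, since the spacing is not preserved in the post. The names RESSEQ_START, RESSEQ_END and shift_residue_number are made up for the example.

```python
RESSEQ_START = 22  # 0-based start of the residue number field (PDB columns 23-26)
RESSEQ_END = 26    # 0-based end (exclusive)


def shift_residue_number(line, offset, start=RESSEQ_START, end=RESSEQ_END):
    """Add `offset` to the integer in line[start:end], keeping the field width."""
    width = end - start
    number = int(line[start:end]) + offset
    # right-align the new number in the same width so later columns do not move
    return line[:start] + "{:>{w}d}".format(number, w=width) + line[end:]


record = "ATOM    973  CG  ARG A  61     -21.593   8.884  69.770  1.00 25.13           C"
print(shift_residue_number(record, 308))
```

Everything outside the slice is copied through untouched, so coordinates that happen to contain the same digits are never affected.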
You better not try to parse PDB-files on your own.
Use a PDB-Parser. There are many freely available inside different bio/computational chemistry packages, for instance
biopython
Here's how to do it with biopython, assuming your input is raw.pdb:
from Bio.PDB import PDBParser, PDBIO
parser=PDBParser()
structure = parser.get_structure('some_id', 'raw.pdb')
for r in structure.get_residues():
    r.id = (r.id[0], r.id[1] + 308, r.id[2])
io = PDBIO()
io.set_structure(structure)
io.save('shifted.pdb')
I googled a bit and found a quick solution to your specific problem here (without third-party dependencies):
http://code.google.com/p/pdb-tools/
There is -- among many other useful pdb-python-script-tools -- this script pdb_offset.py
It is a standalone script and I just copied its pdb_offset method to show it working, your three-line example code is in raw.pdb:
def pdbOffset(pdb_file, offset):
    """
    Adds an offset to the residue column of a pdb file without touching anything
    else.
    """
    # Read in the pdb file
    f = open(pdb_file,'r')
    pdb = f.readlines()
    f.close()
    out = []
    for line in pdb:
        # For an ATOM record, update residue number
        # (record names are padded to 6 characters)
        if line[0:6] == "ATOM  " or line[0:6] == "TER   ":
            num = offset + int(line[22:26])
            out.append("%s%4i%s" % (line[0:22],num,line[26:]))
        else:
            out.append(line)
    return "".join(out)
print pdbOffset('raw.pdb', 308)
which prints
ATOM 973 CG ARG A 369 -21.593 8.884 69.770 1.00 25.13 C
ATOM 974 CD ARG A 369 -21.610 7.433 69.314 1.00 23.44 C
ATOM 975 NE ARG A 369 -21.047 7.452 67.937 1.00 12.13 N

DTML: How to prevent formatting loss

I have a DTML document which only contains:
<dtml-var public_blast_results>
and displays, when I view it, as:
YP_001336283 100.00 345 0 0 23 367 23 367 0.0 688
When I edit the DTML page for example just adding a header like:
<h3>Header</h3>
<dtml-var public_blast_results>
The "public_blast_results" loses its formatting and displays as:
Header
YP_001336283 100.00 345 0 0 23 367 23 367 0.0 688
Is there a way for maintaining the formatting? public_blast_results is a python function which just simply reads the contents of a file and returns it.
This is nothing to do with DTML - it's a basic issue with HTML, which is that it ignores whitespace. If you want to preserve it, you need to wrap the content with <pre>.
<pre><dtml-var public_blast_results></pre>
