I am quite new to Python and I need your help.
I have a file like this:
>chr14_Gap_2
ACCGCGATGAAAGAGTCGGTGGTGGGCTCGTTCCGACGCGCATCCCCTGGAAGTCCTGCTCAATCAGGTGCCGGATGAAGGTGGT
GCTCCTCCAGGGGGCAGCAGCTTCTGCGCGTACAGCTGCCACAGCCCCTAGGACACCGTCTGGAAGAGCTCCGGCTCCTTCTTG
acacccaggactgatctcctttaggatggactggctggatcttcttgcagtccaaggggctctcaagagt
………..
>chr14_Gap_3
ACCGCGATGAAAGAGTCGGTGGTGGGCTCGTTCCGACGCGCATCCCCTGGAAGTCCTGCTCAATCAGGTGCCGGATGAAGGTGGT
GCTCCTCCAGGGGGCAGCAGCTTCTGCGCGTACAGCTGCCACAGCCCCTAGGACACCGTCTGGAAGAGCTCCGGCTCCTTCTTG
acacccaggactgatctcctttaggatggactggctggatcttcttgcagtccaaggggctctcaagagt
………..
Each record consists of one string as a tag and one string as the DNA sequence.
I want to count the N letters and the lowercase letters and express them as a percentage of the sequence length.
I wrote the following script, which works, but I have a problem with the printing.
#!/usr/bin/python
import sys
if len(sys.argv) != 2:
    print "Usage: If you want to run this python script you have to put the fasta file that includes the desert area's sequences as argument"
    sys.exit(1)
#This script reads the sequences of the desert areas (fasta files) and calculates the persentage of the Ns and the repeats.
fasta_file = sys.argv[1]
f = open(fasta_file, 'r')
content = f.readlines()
x = len(content)
#print x
for i in range(0,len(content)):
    if (i%2 == 0):
        content[i].strip()
        name = content[i].split(">")[1]
        print name, #the "," stops the print command from printing a new line
    else:
        content[i].strip()
        numberOfN = content[i].count('N')
        #print numberOfN
        allChar = len(content[i])
        lowerChars = sum(1 for c in content[i] if c.islower())
        Ns_persentage = 100 * (numberOfN/float(allChar))
        lower_persentage = 100 * (lowerChars/float(allChar))
        waste = Ns_persentage + lower_persentage
        print ("The waste persentage is: %s" % (round(waste)))
        #print ("The persentage of Ns is: %s and the persentage of repeats is: %s" % (Ns_persentage,lower_persentage))
        #print (name + waste)
The problem is that it prints the tag on the first line and the waste variable on the second one, like this:
chr10_Gap_18759
The waste persentage is: 52.0
How can I print them on the same line, tab-separated?
e.g.
chr10_Gap_18759 52.0
chr10_Gap_19000 78.0
…….
Thank you very much.
If you are using Python 2.x, you can print it with:
print name, "\t", round(waste)
I would also make some modifications to your code. Python's argparse module manages command-line arguments. I would do something like this:
#!/usr/bin/python
import argparse
# To use the arguments
parser = argparse.ArgumentParser()
parser.add_argument("fasta_file", help="The fasta file to be processed", type=str)
args = parser.parse_args()
f = open(args.fasta_file, "r")
content = f.readlines()
f.close()
x = len(content)
for i in range(x):
    line = content[i].strip()
    if (i%2 == 0):
        # The first time there is nothing to print yet; on later occasions the previous record is printed as you wish
        try:
            print name, "\t", round(waste)
        except NameError:
            pass
        name = line.split(">")[1]
    else:
        numberOfN = line.count('N')
        allChar = len(line)
        lowerChars = sum(1 for c in line if c.islower())
        Ns_persentage = 100 * (numberOfN/float(allChar))
        lower_persentage = 100 * (lowerChars/float(allChar))
        waste = Ns_persentage + lower_persentage
# To print the last case you need to do it outside the loop
print name, "\t", round(waste)
You can also print it as in the other answer, with print("{}\t{}".format(name, round(waste))).
I am not sure about the use of i%2. Note that if a sequence spans more than one line, the even/odd test no longer tells header lines and sequence lines apart. I would instead check whether the line begins with ">", store the name if it does, and otherwise add the characters of that line to the counts for the current sequence.
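The idea of keying on the leading ">" instead of the line index could be sketched as follows. This is a minimal Python 3 sketch, not the script from the thread (which is Python 2); the function name waste_percentages and the sample records are made up for illustration, and it assumes that "waste" means the N letters plus the lowercase letters, as in the question.

```python
def waste_percentages(lines):
    """Yield (name, waste percentage) for each FASTA record in an iterable of lines.

    A record starts at a line beginning with ">" and may span several
    sequence lines. Records without any sequence characters are skipped.
    """
    name = None
    total = n_count = lower_count = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # New header: report the previous record (if any), then reset the counters.
            if name is not None and total:
                yield name, 100.0 * (n_count + lower_count) / total
            name = line[1:]
            total = n_count = lower_count = 0
        else:
            total += len(line)
            n_count += line.count("N")
            lower_count += sum(1 for c in line if c.islower())
    # The last record has no following header, so report it after the loop.
    if name is not None and total:
        yield name, 100.0 * (n_count + lower_count) / total


# Hypothetical sample: record "a" spans two sequence lines.
sample = [">a\n", "ACGT\n", "acNN\n", ">b\n", "AAAA\n"]
for name, waste in waste_percentages(sample):
    print("{0}\t{1}".format(name, round(waste)))
```

With a real file, the same function can be fed the open file object directly, since iterating over a file yields its lines.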
Don't print the name when (i%2 == 0); just save it in a variable and print it in the next iteration together with the percentage:
print("{0}\t{1}".format(name, round(waste)))
This method of string formatting (new in version 2.6) is the new standard in Python 3, and should be preferred to the % formatting described in String Formatting Operations in new code.
I've fixed the indentation and redundancy:
#!/usr/bin/python
"""
This script reads the sequences of the desert areas (fasta files) and calculates the percentage of the Ns and the repeats.
2014-10-05 v1.0 by Vasilis
2014-10-05 v1.1 by Llopis
2015-02-27 v1.2 by Cees Timmerman
"""
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("fasta_file", help="The fasta file to be processed.", type=str)
args = parser.parse_args()
with open(args.fasta_file, "r") as f:
    for line in f.readlines():
        line = line.strip()
        if line[0] == '>':
            name = line.split(">")[1]
            print name,
        else:
            numberOfN = line.count('N')
            allChar = len(line)
            lowerChars = sum(1 for c in line if c.islower())
            Ns_percentage = 100 * (numberOfN/float(allChar))
            lower_percentage = 100 * (lowerChars/float(allChar))
            waste = Ns_percentage + lower_percentage
            print "\t", round(waste) # Note: https://docs.python.org/2/library/functions.html#round
Fed:
>chr14_Gap_2
ACCGCGATGAAAGAGTCGGTGGTGGGCTCGTTCCGACGCGCATCCCCTGGAAGTCCTGCTCAATCAGGTGCCGGATGAAGGTGGTGCTCCTCCAGGGGGCAGCAGCTTCTGCGCGTACAGCTGCCACAGCCCCTAGGACACCGTCTGGAAGAGCTCCGGCTCCTTCTTGacacccaggactgatctcctttaggatggactggctggatcttcttgcagtccaaggggctctcaagagt
>chr14_Gap_3
ACCGCGATGAAAGAGTCGGTGGTGGGCTCGTTCCGACGCGCATCCCCTGGAAGTCCTGCTCAATCAGGTGCCGGATGAAGGTGGTGCTCCTCCAGGGGGCAGCAGCTTCTGCGCGTACAGCTGCCACAGCCCCTAGGACACCGTCTGGAAGAGCTCCGGCTCCTTCTTGacacccaggactgatctcctttaggatggactggctggatcttcttgcagtccaaggggctctcaagagt
Gives:
C:\Python27\python.exe -u "dna.py" fasta.txt
Process started >>>
chr14_Gap_2 29.0
chr14_Gap_3 29.0
<<< Process finished. (Exit code 0)
Using my favorite Python IDE: Notepad++ with NppExec plugin.
Related
I'm a beginner in coding and am trying to build a script that takes a txt file as input, hashes each line, and writes "string:hashedstring" lines to another txt file. The code is working properly. The problem I am facing now is that if the input file is big, it consumes all the RAM and the process gets killed. I tried to use chunks, but couldn't figure out how to use them with multiline input and output.
Suggestions regarding parts of the code other than the main subject here are very welcome, since I am just starting out. Thanks.
import argparse
import hashlib
import os
import sys
def sofia_hash(msg):
    h = ""
    m = hashlib.md5()
    m.update(msg.encode('utf-8'))
    msg_md5 = m.digest()
    for i in range(8):
        n = (msg_md5[2*i] + msg_md5[2*i+1]) % 0x3e
        if n > 9:
            if n > 35:
                n += 61
            else:
                n += 55
        else:
            n += 0x30
        h += chr(n)
    return h
top_parser = argparse.ArgumentParser(description='Sofiamass')
top_parser.add_argument('input', action="store", type=argparse.FileType('r', encoding='utf8'), help="Set input file")
top_parser.add_argument('output', action="store", help="Set output file")
args = top_parser.parse_args()
sofiainput = args.input.read().splitlines()
a = 0
try:
    while a < len(sofiainput):
        target_sofiainput = sofiainput[a]
        etarget_sofiainput = (target_sofiainput).encode('utf-8')
        try:
            sofia_pass = sofia_hash(target_sofiainput)
            x = True
        except KeyboardInterrupt:
            print ("\n[---]exiting now[---]")
        if x == True:
            with open(args.output, 'a') as sofiaoutput:
                sofiaoutput.write(str(target_sofiainput) + ":" + str(sofia_pass) + "\n")
        elif x == False:
            print('error')
        a += 1
except KeyboardInterrupt:
    print ("\n[---]exiting now[---]")
except AttributeError:
    pass
When you open a file with open(), it returns a file object (often called a file handle). So, when you do:
with open('filepath.txt', 'r') as f:
    for line in f:
        print(line)
only the current line is kept in RAM, which achieves your objective of using as little RAM as possible.
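Applied to the hashing script, line-by-line iteration might look like the sketch below (Python 3). sofia_hash follows the mapping in the question's code; hash_stream and the in-memory demo files are hypothetical names introduced here only so the example runs on its own.

```python
import hashlib
import io


def sofia_hash(msg):
    """Same mapping as the question's sofia_hash: 8 characters from an MD5 digest."""
    digest = hashlib.md5(msg.encode("utf-8")).digest()
    chars = []
    for i in range(8):
        n = (digest[2 * i] + digest[2 * i + 1]) % 62
        if n > 35:
            n += 61  # 36..61 -> 'a'..'z'
        elif n > 9:
            n += 55  # 10..35 -> 'A'..'Z'
        else:
            n += 48  # 0..9 -> '0'..'9'
        chars.append(chr(n))
    return "".join(chars)


def hash_stream(src, dst):
    """Read src one line at a time and write 'string:hash' lines to dst."""
    for line in src:
        word = line.rstrip("\r\n")
        dst.write(word + ":" + sofia_hash(word) + "\n")


# In-memory stand-ins for real files (assumption: one entry per line).
demo_in = io.StringIO("alpha\nbeta\n")
demo_out = io.StringIO()
hash_stream(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

With real files, both would be opened once, for example with open(in_path, encoding="utf8") as src, open(out_path, "w", encoding="utf8") as dst: hash_stream(src, dst). This also avoids reopening the output file for every line, as the question's loop does.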
I'm trying to write code that takes data from a file and writes it out differently. I have most of the code, but when I run it, everything ends up on one line.
import csv
#Step 4
def read_data(filename):
    try:
        data = open("dna.txt", "r")
    except IOError:
        print( "File not found")
    return data
#Step 5
def get_dna_stats(dna_string):
    a_letters = ""
    t_letters = ""
    if "A" in dna_string:
        a_letters.append("A")
    if "T" in dna_string:
        t_letters.append("T")
    nucleotide_content = ((len(a_letters) + len(t_letters))/len(dna_string))
#Step 6
def get_dna_complement(dna_string):
    dna_complement = ""
    for i in dna_string:
        if i == "A":
            dna_complement.append("T")
        elif i == "T":
            dna_complement.append("A")
        elif i == "G":
            dna_complement.append("C")
        elif i == "C":
            dna_complement.append("G")
        else:
            break
    return dna_complement
#Step 7
def print_dna(dna_strand):
    dna_complement = get_dna_complement(dna_strand)
    for i in dna_strand:
        for j in dna_complement:
            print( i + "=" + j)
#Step 8
def get_rna_sequence(dna_string):
    rna_complement = ""
    for i in dna_string:
        if i == "A":
            rna_complement.append("U")
        elif i == "T":
            rna_complement.append("A")
        elif i == "G":
            rna_complement.append("C")
        elif i == "C":
            rna_complement.append("G")
        else:
            break
    return rna_complement
#Step 9
def extract_exon(dna_strand, start, end):
    return (f"{dna_strand} between {start} and {end}")
#Step 10
def calculate_exon_pctg(dna_strand, exons):
    exons_length = 0
    for i in exons:
        exons_length += 1
    return exons_length/ len(dna_strand)
#Step 11
def format_data(dna_string):
    x = "dna_strand"[0:62].upper()
    y = "dna_strand"[63:90].lower()
    z = "dna_strand"[91:-1].upper()
    return x+y+z
#Step 12
def write_results(output, filename):
    try:
        with open("output.csv","w") as csvFile:
            writer = csv.writer(csvFile)
            for i in output:
                csvFile.write(i)
    except IOError:
        print("Error writing file")
#Step 13
def main():
    read_data("dna.txt")
    output = []
    output.append("The AT content is" + get_dna_stats() + "% of the DNA sequence.")
    get_dna_stats("dna_sequence")
    output.append("The DNA complement is " + get_dna_complement())
    get_dna_complement("dna_sequence")
    output.append("The RNA sequence is" + get_rna_sequence())
    get_rna_sequence("dna_sequence")
    exon1 = extract_exon("dna_sequence", 0, 62)
    exon2 = extract_exon("dna_sequence", 91, len("dna_sequence"))
    output.append(f"The exon regions are {exon1} and {exon2}")
    output.append("The DNA sequence, which exons in uppercase and introns in lowercase, is" + format_dna())
    format_data("dna_sequence")
    output.append("Exons comprise " + calculate_exon_pctg())
    calculate_exon_pctg("dna_sequence",[exon1, exon2])
    write_results(output, "results.txt")
    print("DNA processing complete")
#Step 14
if __name__ == "__main__":
    main()
When I run it, it's supposed to output a file with each result on its own line, but my code ends up putting every word on the top line.
I have a feeling it has to do with the write_results function, but that's all I know about how to write to a file.
The second mistake I'm making is that I'm not calling the functions correctly in the append statements. I've tried concatenating and I've tried formatting the string, but now I've hit a roadblock on what I need to do.
When you write to the file, you need to concatenate a '\n' to the end of the string every time you want something to start on a new line in the written file.
For example:
output.append("The AT content is" + get_dna_stats() + "% of the DNA sequence." + '\n')
To solve your second problem I would change your code to something like this:
temp = "The AT content is" + get_dna_stats() + "% of the DNA sequence." + '\n'
output.append(temp)
Python evaluates a function call before the surrounding string is concatenated, whether or not you use a temporary variable, so the temp string mainly makes the steps easier to read. What matters is that the function is called with its argument and that its result is a string (wrap a numeric result in str()). Then you are able to append the string to the list.
read_data() doesn't actually read anything (it just opens the file). It should read the file and return its contents:
def read_data(filename):
    with open(filename, "r") as f:
        return f.read()
get_dna_stats() won't get DNA stats: it doesn't return anything, it doesn't count "A"s or "T"s (it only checks whether they're present), and nucleotide_content is computed but never used or returned. It should probably count and return the result:
def get_dna_stats(dna_string):
    num_a = dna_string.count("A")
    num_t = dna_string.count("T")
    nucleotide_content = (num_a + num_t) / float(len(dna_string))
    return nucleotide_content
get_dna_complement() and get_rna_sequence(): you can't append to a string. Instead use
dna_complement += "T"
... and rather than break, either append a "?" to denote a failed transcription, or raise ValueError("invalid letter in DNA: " + i)
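One way to put both suggestions together is a lookup table plus a ValueError for unknown letters. This is a minimal sketch in Python 3 under that assumption; the table name DNA_COMPLEMENT is introduced here and is not from the question.

```python
# Pairing used by the question's if/elif chain.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def get_dna_complement(dna_string):
    """Return the complement strand, or raise ValueError on an unknown letter."""
    complement = []
    for letter in dna_string:
        if letter not in DNA_COMPLEMENT:
            raise ValueError("invalid letter in DNA: " + letter)
        complement.append(DNA_COMPLEMENT[letter])
    # Joining a list once is cheaper than repeated += on a long string.
    return "".join(complement)


print(get_dna_complement("ATGC"))  # prints TACG
```

If the string comes straight from a file, strip the trailing newline first, otherwise the "\n" counts as an invalid letter.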
print_dna() is a bit more interesting. I'm guessing you want to "zip" each letter of the DNA and its complement. Coincidentally, you can use the zip function to achieve just that:
def print_dna(dna_strand):
    dna_complement = get_dna_complement(dna_strand)
    for dna_letter, complement in zip(dna_strand, dna_complement):
        print(dna_letter + "=" + complement)
As for extract_exon(), I don't know what that is, but presumably you just want the substring from start to end, which is achieved by:
def extract_exon(dna_strand, start, end):
    return dna_strand[start:end] # possibly end+1, I don't know exons
I am guessing that in calculate_exon_pctg(), you want exons_length += len(i) to sum the lengths of the exons. You can achieve this with the built-in function sum:
exons_length = sum(len(exon) for exon in exons)
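A short Python 3 sketch of how the exon helpers could fit together, assuming an exon is simply the slice from start to end. The 16-letter strand is made up for illustration, and the function returns a fraction, as the question's version does.

```python
def extract_exon(dna_strand, start, end):
    """Return the slice of the strand from start up to (not including) end."""
    return dna_strand[start:end]


def calculate_exon_pctg(dna_strand, exons):
    """Return the fraction of the strand covered by the given exon strings."""
    exons_length = sum(len(exon) for exon in exons)
    return exons_length / len(dna_strand)


dna = "AAAACCCCGGGGTTTT"  # hypothetical strand, 16 letters
exons = [extract_exon(dna, 0, 4), extract_exon(dna, 12, 16)]
print(calculate_exon_pctg(dna, exons))  # prints 0.5
```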
In function format_data(), lose the double quotes. You want the variable (the parameter is named dna_string), not a string literal.
main() doesn't pass any data around. It should pass the results of read_data() to all the other functions:
def main():
    data = read_data("dna.txt")
    output = []
    # get_dna_stats() returns a number, so convert it before concatenating
    output.append("The AT content is " + str(get_dna_stats(data)) + "% of the DNA sequence.")
    output.append("The DNA complement is " + get_dna_complement(data))
    output.append("The RNA sequence is " + get_rna_sequence(data))
    ...
    write_results(output, "results.txt")
    print("DNA processing complete")
The key for you at this stage is to understand how function calls work: they take data as input parameters, and they return some results. You need to a) provide the input data, and b) catch the results.
write_results() - from your screenshot, you seem to want to write a plain old text file, yet you use csv.writer() (which writes CSV, i.e. tabular data). To write plain text:
def write_results(output, filename):
    with open(filename, "w") as f:
        f.write("\n".join(output)) # join output lines with newline
        f.write("\n") # extra newline at file's end
If you really do want a CSV file, you'll need to define the columns first, and make all the output you collect fit that column format.
You never told your program to make a new line. You could either append or prepend the special "\n" character to each of your strings, or write it after each string. You do not need os.linesep for this: a file opened in text mode translates "\n" to the platform's line ending on its own. You could write your write_results function like this:
def write_results(output, filename):
    try:
        with open("output.csv","w") as csvFile:
            writer = csv.writer(csvFile)
            for i in output:
                csvFile.write(i)
                csvFile.write("\n") # Add this line! Text mode turns it into the system's newline
    except IOError:
        print("Error writing file")
I'm writing a script to search through multiple text files containing MAC addresses to find which port each address is associated with. I need to do this for several hundred MAC addresses. The function runs fine the first time through. After that, though, the new MAC address doesn't seem to get passed to the function: it stays the same as the one already used, and the function's for loop only seems to run once.
import re
import csv
f = open('all_switches.csv','U')
source_file = csv.reader(f)
m = open('macaddress.csv','wb')
macaddress = csv.writer(m)
s = open('test.txt','r')
source_mac = s.read().splitlines()
count = 0
countMac = 0
countFor = 0
def find_mac(sneaky):
    global count
    global countFor
    count = count +1
    for switches in source_file:
        countFor = countFor + 1
        # print sneaky only goes through the loop once
        switch = switches[4]
        source_switch = open(switch + '.txt', 'r')
        switch_read = source_switch.readlines()
        for mac in switch_read:
            # print mac does search through all the switches
            found_mac = re.search(sneaky, mac)
            if found_mac is not None:
                interface = re.search("(Gi|Eth|Te)(\S+)", mac)
                if interface is not None:
                    port = interface.group()
                    macaddress.writerow([sneaky, switch, port])
                    print sneaky + ' ' + switch + ' ' + port
        source_switch.close()
for macs in source_mac:
    match = re.search(r'[a-fA-F0-9]{4}[.][a-fA-F0-9]{4}[.][a-fA-F0-9]{4}', macs)
    if match is not None:
        sneaky = match.group()
        find_mac(sneaky)
        countMac = countMac + 1
print count
print countMac
print countFor
I've added the counters count, countFor and countMac to see how many times the loops and the function run. Here is the output.
549f.3507.7674 the name of the switch Eth100/1/11
677
677
353
Any insight would be appreciated.
source_file is opened globally only once, so the first time you call find_mac(), the for switches in source_file: loop exhausts the file. Since the file isn't closed and reopened, the next time find_mac() is called the file pointer is at the end of the file and the loop reads nothing.
Moving the following to the beginning of find_mac should fix it:
f = open('all_switches.csv','U')
source_file = csv.reader(f)
Consider using with statements to ensure your files are closed as well.
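The exhaustion effect, and an alternative to reopening the file on every call (reading the switch names once into a list), can be sketched like this in Python 3. The in-memory CSV and the helper name load_switch_names are assumptions for illustration; column 4 is where the question's code reads the switch name.

```python
import csv
import io


def load_switch_names(csv_file, column=4):
    """Read the whole CSV once and return the switch names as a reusable list."""
    return [row[column] for row in csv.reader(csv_file) if len(row) > column]


# Stand-in for all_switches.csv: the switch name sits in the fifth column.
fake_csv = io.StringIO("a,b,c,d,switch1\na,b,c,d,switch2\n")

reader = csv.reader(fake_csv)
first_pass = [row[4] for row in reader]
second_pass = [row[4] for row in reader]  # the reader is already exhausted

fake_csv.seek(0)  # rewind before reading again
switches = load_switch_names(fake_csv)

print(first_pass)   # ['switch1', 'switch2']
print(second_pass)  # []
print(switches)     # ['switch1', 'switch2']
```

A list can be looped over as many times as needed, so find_mac() could iterate over switches for every MAC address without touching the CSV file again.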
I started learning Python this week, so I thought I would use it rather than Excel to parse some fields out of file paths.
I have about 3000 files that all fit the naming convention.
/Household/LastName.FirstName.Account.Doctype.Date.extension
For example one of these files might be named: Cosby.Bill..Profile.2006.doc
and the fullpath is /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc
In this case:
Cosby, Bill would be the Household
Where the household (Cosby, Bill) is the enclosing folder for the actual file
Bill would be the first name
Cosby would be the last name
The Account field is omitted
Profile is the doctype
2006 is the date
doc is the extension
All of these files are located at this directory /Volumes/HD/Organized Files/ I used terminal and ls to get the list of all the files into a .txt file on my desktop and I am trying to parse the information from the filepaths into categories like in the sample above. Ideally I would like to output to a csv, with a column for each category. Here is my ugly code:
def main():
    file = open('~/Desktop/client_docs.csv', "rb")
    output = open('~/Desktop/client_docs_parsed.txt', "wb")
    for line in file:
        i = line.find(find_nth(line, '/', 2))
        beghouse = line[i + len(find_nth(line, '/', 2)):]
        endhouse = beghouse.find('/')
        household = beghouse[:endhouse]
        lastn = (line[line.find(household):])[(line[line.find(household):]).find('/') + 1:(line[line.find(household):]).find('.')]
        firstn = line[line.find('.') + 1: line.find('.', line.find('.') + 1)]
        acct = line[line.find('{}.{}.'.format(lastn,firstn)) + len('{}.{}.'.format(lastn,firstn)):line.find('.',line.find('{}.{}.'.format(lastn,firstn)) + len('{}.{}.'.format(lastn,firstn)))]
        doctype_beg = line[line.find('{}.{}.{}.'.format(lastn, firstn, acct)) + len('{}.{}.{}.'.format(lastn, firstn, acct)):]
        doctype = doctype_beg[:doctype_beg.find('.')]
        date_beg = line[line.find('{}/{}.{}.{}.{}.'.format(household,lastn,firstn,acct,doctype)) + len('{}/{}.{}.{}.{}.'.format(household,lastn,firstn,acct,doctype)):]
        date = date_beg[:date_beg.find('.')]
        print '"',household, '"','"',lastn, '"','"',firstn, '"','"',acct, '"','"',doctype, '"','"',date,'"'
def find_nth(body, s_term, n):
    start = body[::-1].find(s_term)
    while start >= 0 and n > 1:
        start = body[::-1].find(s_term, start+len(s_term))
        n -= 1
    return ((body[::-1])[start:])[::-1]
if __name__ == "__main__": main()
It seems to work OK, but I run into problems when there is another enclosing folder: it then shifts all my fields about. For example, when rather than the file residing at
/Volumes/HD/Organized Files/Cosby, Bill/
it's at /Volumes/HD/Organized Files/Resigned/Cosby, Bill/
I know there has got to be a less clunky way to go about this.
Here's a tool more practical than your function find_nth():
rsplit()
def find_nth(body, s_term, n):
    start = body[::-1].find(s_term)
    print '------------------------------------------------'
    print 'body[::-1]\n',body[::-1]
    print '\nstart == %s' % start
    while start >= 0 and n > 1:
        start = body[::-1].find(s_term, start+len(s_term))
        print 'n == %s start == %s' % (n,start)
        n -= 1
    print '\n (body[::-1])[start:]\n',(body[::-1])[start:]
    print '\n((body[::-1])[start:])[::-1]\n',((body[::-1])[start:])[::-1]
    print '---------------\n'
    return ((body[::-1])[start:])[::-1]
def cool_find_nth(body, s_term, n):
    assert(len(s_term)==1)
    return body.rsplit(s_term,n)[0] + s_term
ss = 'One / Two / Three / Four / Five / Six / End'
print 'the string\n%s\n' % ss
print ('================================\n'
       "find_nth(ss, '/', 3)\n%s" % find_nth(ss, '/', 3) )
print '================================='
print "cool_find_nth(ss, '/', 3)\n%s" % cool_find_nth(ss, '/', 3)
result
the string
One / Two / Three / Four / Five / Six / End
------------------------------------------------
body[::-1]
dnE / xiS / eviF / ruoF / eerhT / owT / enO
start == 4
n == 3 start == 10
n == 2 start == 17
(body[::-1])[start:]
/ ruoF / eerhT / owT / enO
((body[::-1])[start:])[::-1]
One / Two / Three / Four /
---------------
================================
find_nth(ss, '/', 3)
One / Two / Three / Four /
=================================
cool_find_nth(ss, '/', 3)
One / Two / Three / Four /
EDIT 1
Here's another very practical tool : regex
import re
reg = re.compile('/'
                 '([^/.]*?)/'
                 '([^/.]*?)\.'
                 '([^/.]*?)\.'
                 '([^/.]*?)\.'
                 '([^/.]*?)\.'
                 '([^/.]*?)\.'
                 '[^/.]+\Z')
def main():
    #file = open('~/Desktop/client_docs.csv', "rb")
    #output = open('~/Desktop/client_docs_parsed.txt', "wb")
    li = ['/Household/LastName.FirstName.Account.Doctype.Date.extension',
          '- /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc']
    for line in li:
        print "line == %r" % line
        household,lastn,firstn,acct,doctype,date = reg.search(line).groups('')
        print ('household == %r\n'
               'lastn == %r\n'
               'firstn == %r\n'
               'acct == %r\n'
               'doctype == %r\n'
               'date == %r\n'
               % (household,lastn,firstn,acct,doctype,date))
if __name__ == "__main__": main()
result
line == '/Household/LastName.FirstName.Account.Doctype.Date.extension'
household == 'Household'
lastn == 'LastName'
firstn == 'FirstName'
acct == 'Account'
doctype == 'Doctype'
date == 'Date'
line == '- /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc'
household == 'Cosby, Bill'
lastn == 'Cosby'
firstn == 'Bill'
acct == ''
doctype == 'Profile'
date == '2006'
EDIT 2
I wonder where my brain was when I posted my last edit. The following does the job as well:
rig = re.compile('[/.]')
rig.split(line)[-7:-1]
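For reference, here is a self-contained Python 3 sketch of that split-and-slice idea. The function name parse_path and the returned dictionary are illustrative choices, not part of the answer. It assumes no folder or field contains an extra "." or "/", since either would shift the fields.

```python
import re

FIELD_SEPARATORS = re.compile(r"[/.]")


def parse_path(path):
    """Split on '/' and '.', then take the six fields before the extension."""
    parts = FIELD_SEPARATORS.split(path.strip())
    household, lastn, firstn, acct, doctype, date = parts[-7:-1]
    return {
        "household": household,
        "lastn": lastn,
        "firstn": firstn,
        "acct": acct,
        "doctype": doctype,
        "date": date,
    }


plain = "/Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc"
nested = "/Volumes/HD/Organized Files/Resigned/Cosby, Bill/Cosby.Bill..Profile.2006.doc"
print(parse_path(plain))
print(parse_path(nested) == parse_path(plain))  # prints True
```

Because the slice counts from the end of the path, an extra enclosing folder such as Resigned does not shift the fields.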
From what I can gather, I believe this will work as a solution, which won't rely on a previously compiled list of files
import csv
import os, os.path
# Replace this with the directory where the household directories are stored.
directory = "home"
output = open("Output.csv", "wb")
csvf = csv.writer(output)
headerRow = ["Household", "Lastname", "Firstname", "Account", "Doctype",
             "Date", "Extension"]
csvf.writerow(headerRow)
for root, households, files in os.walk(directory):
    for household in households:
        # Join with root (not directory) so nested folders resolve correctly
        for filename in os.listdir(os.path.join(root, household)):
            # This will create a record for each filename within the "household"
            # Then will split the filename out, using the "." as a delimiter
            # to get the detail
            csvf.writerow([household] + filename.split("."))
output.flush()
output.close()
This uses the os library to "walk" the list of households. Then, for each "household", it gathers a file listing. It then takes this list and generates records in a CSV file, breaking apart the name of each file using the period as a delimiter.
It makes use of the csv library to generate the output, which will look somewhat like this:
"Household,LastName,Firstname,Account,Doctype,Date,Extension"
If the extension is not needed, it can be omitted by changing the line:
csvf.writerow([household] + filename.split("."))
to
csvf.writerow([household] + filename.split(".")[:-1])
which tells it to use everything up to, but not including, the last part of the filename. Then remove the "Extension" string from headerRow.
Hopefully this helps
It's a bit unclear what the question is but meanwhile, here is something to get you started:
#!/usr/bin/env python
import os
import csv
with open("f1", "rb") as fin:
    reader = csv.reader(fin, delimiter='.')
    for row in reader:
        # split path
        row = list(os.path.split(row[0])) + row[1:]
        print ','.join(row)
Output:
/Household,LastName,FirstName,Account,Doctype,Date,extension
Another interpretation is that you would like to store each field in a parameter
and that an additional path screws things up...
This is what row looks like in the for-loop:
['/Household/LastName', 'FirstName', 'Account', 'Doctype', 'Date', 'extension']
The solution then might be to work backwards.
Assign row[-1] to extension, row[-2] to date and so on.
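Working backwards could look like the following Python 3 sketch, which also writes the result with the csv module. The function name parse_backwards, the column order, and the in-memory output are assumptions for illustration; it expects the six dot-separated fields of the naming convention in the question.

```python
import csv
import io
import os.path


def parse_backwards(path):
    """Take the fields from the end of the path, so extra leading folders do not matter."""
    folder, filename = os.path.split(path.strip())
    household = os.path.basename(folder)
    fields = filename.split(".")
    extension = fields[-1]
    date = fields[-2]
    doctype = fields[-3]
    acct = fields[-4]
    firstn = fields[-5]
    lastn = fields[-6]
    return [household, lastn, firstn, acct, doctype, date, extension]


path = "/Volumes/HD/Organized Files/Resigned/Cosby, Bill/Cosby.Bill..Profile.2006.doc"
row = parse_backwards(path)

# csv.writer quotes the household because it contains a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue(), end="")
```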
Below is a script to read velocity values from molecular dynamics trajectory data. I have many trajectory files with the name pattern as below:
waters1445-MD001-run0100.traj
waters1445-MD001-run0200.traj
waters1445-MD001-run0300.traj
waters1445-MD001-run0400.traj
waters1445-MD001-run0500.traj
waters1445-MD001-run0600.traj
waters1445-MD001-run0700.traj
waters1445-MD001-run0800.traj
waters1445-MD001-run0900.traj
waters1445-MD001-run1000.traj
waters1445-MD002-run0100.traj
waters1445-MD002-run0200.traj
waters1445-MD002-run0300.traj
waters1445-MD002-run0400.traj
waters1445-MD002-run0500.traj
waters1445-MD002-run0600.traj
waters1445-MD002-run0700.traj
waters1445-MD002-run0800.traj
waters1445-MD002-run0900.traj
waters1445-MD002-run1000.traj
Each file has 200 frames of data to analyse. So I planned for this code to read in each traj file (shown above) one after another, extract the velocity values, and write them to a specific file (text_file = open("Output.traj.dat", "a")) corresponding to the respective input trajectory file.
So I defined a function called 'loops(mmm)', where 'mmm' is the trajectory file name passed to the function 'loops'.
#!/usr/bin/env python
'''
always put #!/usr/bin/env python at the shebang
'''
#from __future__ import print_function
from Scientific.IO.NetCDF import NetCDFFile as Dataset
import itertools as itx
import sys
#####################
def loops(mmm):
inputfile = mmm
for FRAMES in range(0,200):
frame = FRAMES
text_file = open("Output.mmm.dat", "a")
def grouper(n, iterable, fillvalue=None):
args = [iter(iterable)] * n
return itx.izip_longest(fillvalue=fillvalue, *args)
formatxyz = "%12.7f%12.7f%12.7f%12.7f%12.7f%12.7f"
formatxyz_size = 6
formatxyzshort = "%12.7f%12.7f%12.7f"
formatxyzshort_size = 3
#ncfile = Dataset(inputfile, 'r')
ncfile = Dataset(ppp, 'r')
variableNames = ncfile.variables.keys()
#print variableNames
shape = ncfile.variables['coordinates'].shape
'''
do the header
'''
print 'title ' + str(frame)
text_file.write('title ' + str(frame) + '\n')
print "%5i%15.7e" % (shape[1],ncfile.variables['time'][frame])
text_file.write("%5i%15.7e" % (shape[1],ncfile.variables['time']\
[frame]) + '\n')
'''
do the velocities
'''
try:
xyz = ncfile.variables['velocities'][frame]
temp = grouper(2, xyz, "")
for i in temp:
z = tuple(itx.chain(*i))
if (len(z) == formatxyz_size):
print formatxyz % z
text_file.write(formatxyz % z + '\n')
elif (len(z) == formatxyzshort_size):
print formatxyzshort % z
text_file.write(formatxyzshort % z + '\n' )
except(KeyError):
xyz = [0] * shape[2]
xyz = [xyz] * shape[1]
temp = grouper(2, xyz, "")
for i in temp:
z = tuple(itx.chain(*i))
if (len(z) == formatxyz_size):
print formatxyz % z
elif (len(z) == formatxyzshort_size):
print formatxyzshort % z
x = ncfile.variables['cell_angles'][frame]
y = ncfile.variables['cell_lengths'][frame]
#text_file.close()
# program starts - generation of file name
for md in range(1,3):
if md < 10:
for pico in range(100,1100, 100):
if pico >= 1000:
kkk = "waters1445-MD00{0}-run{1}.traj".format(md,pico)
loops(kkk)
elif pico < 1000:
kkk = "waters1445-MD00{0}-run0{1}.traj".format(md,pico)
loops(kkk)
#print kkk
At the (# program starts - generation of file name) line, the code is supposed to generate the file name, call the function accordingly, extract the velocities, and dump the values into the file opened with (text_file = open("Output.mmm.dat", "a")).
When I execute this code, the program runs, but unfortunately it does not produce output files named after the input trajectory files.
I want the output file names to be:
velo-waters1445-MD001-run0100.dat
velo-waters1445-MD001-run0200.dat
velo-waters1445-MD001-run0300.dat
velo-waters1445-MD001-run0400.dat
velo-waters1445-MD001-run0500.dat
.
.
.
I could not trace where I need to do changes.
Your code's indentation is broken: the first assignment to formatxyz and the code following it are aligned with neither the def grouper nor the for FRAMES.
The main problem may be (as Johannes already commented) the point at which you open the file(s) for writing and the point at which you actually write data into them.
Check:
for FRAMES in range(0,200):
    frame = FRAMES
    text_file = open("Output.mmm.dat", "a")
The output file is named (hardcoded) Output.mmm.dat. Change to "Output.{0}.dat".format(mmm). But then, the variable mmm never changes inside the loop. This may be ok, if all frames are supposed to be written to the same file.
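To get the velo-...dat names listed in the question, the output name can be derived from the input name. The sketch below (Python 3) is one hedged way to do it; the helper names output_name and trajectory_names are made up here, and zero-padded format fields replace the question's two-branch if/elif.

```python
import os.path


def output_name(traj_name):
    """Map 'waters1445-MD001-run0100.traj' to 'velo-waters1445-MD001-run0100.dat'."""
    stem, _extension = os.path.splitext(os.path.basename(traj_name))
    return "velo-{0}.dat".format(stem)


def trajectory_names(md_runs=range(1, 3), picos=range(100, 1100, 100)):
    """Generate the input file names; 03d and 04d pad the numbers with zeros."""
    for md in md_runs:
        for pico in picos:
            yield "waters1445-MD{0:03d}-run{1:04d}.traj".format(md, pico)


names = list(trajectory_names())
for name in names[:3]:
    print(name, "->", output_name(name))
```

Inside loops(mmm), the file could then be opened once, before the frame loop, with open(output_name(mmm), "a"), so that all 200 frames of one trajectory land in the same output file.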
Generally, please work on the names you choose for variables and functions. loops is very generic, and so are kkk and mmm. Be more specific, it helps debugging. If you don't know what's happening and where your programs go wrong, insert print("dbg> do (a)") statements with some descriptive text and/or use the Python debugger to step through your program. Especially interactive debugging is essential in learning a new language and new concepts, imho.