Python: parse filenames by custom field pattern

Started learning Python this week, so I thought I would use it rather than Excel to parse some fields out of file paths.
I have about 3000 files that all fit the naming convention.
/Household/LastName.FirstName.Account.Doctype.Date.extension
For example, one of these files might be named: Cosby.Bill..Profile.2006.doc
and the full path is /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc
In this case:
Cosby, Bill would be the Household
Where the household (Cosby, Bill) is the enclosing folder for the actual file
Bill would be the first name
Cosby would be the last name
The Account field is omitted
Profile is the doctype
2006 is the date
doc is the extension
All of these files are located in the directory /Volumes/HD/Organized Files/. I used Terminal and ls to get the list of all the files into a .txt file on my desktop, and I am trying to parse the information from the file paths into categories like in the sample above. Ideally I would like to output to a CSV, with a column for each category. Here is my ugly code:
def main():
    file = open('~/Desktop/client_docs.csv', "rb")
    output = open('~/Desktop/client_docs_parsed.txt', "wb")
    for line in file:
        i = line.find(find_nth(line, '/', 2))
        beghouse = line[i + len(find_nth(line, '/', 2)):]
        endhouse = beghouse.find('/')
        household = beghouse[:endhouse]
        lastn = (line[line.find(household):])[(line[line.find(household):]).find('/') + 1:(line[line.find(household):]).find('.')]
        firstn = line[line.find('.') + 1: line.find('.', line.find('.') + 1)]
        acct = line[line.find('{}.{}.'.format(lastn,firstn)) + len('{}.{}.'.format(lastn,firstn)):line.find('.',line.find('{}.{}.'.format(lastn,firstn)) + len('{}.{}.'.format(lastn,firstn)))]
        doctype_beg = line[line.find('{}.{}.{}.'.format(lastn, firstn, acct)) + len('{}.{}.{}.'.format(lastn, firstn, acct)):]
        doctype = doctype_beg[:doctype_beg.find('.')]
        date_beg = line[line.find('{}/{}.{}.{}.{}.'.format(household,lastn,firstn,acct,doctype)) + len('{}/{}.{}.{}.{}.'.format(household,lastn,firstn,acct,doctype)):]
        date = date_beg[:date_beg.find('.')]
        print '"',household, '"','"',lastn, '"','"',firstn, '"','"',acct, '"','"',doctype, '"','"',date,'"'

def find_nth(body, s_term, n):
    start = body[::-1].find(s_term)
    while start >= 0 and n > 1:
        start = body[::-1].find(s_term, start+len(s_term))
        n -= 1
    return ((body[::-1])[start:])[::-1]

if __name__ == "__main__": main()
It seems to work OK, but I run into problems when there is another enclosing folder; it then shifts all my fields about. For example, rather than the file residing at
/Volumes/HD/Organized Files/Cosby, Bill/
it's at /Volumes/HD/Organized Files/Resigned/Cosby, Bill/
I know there has got to be a less clunky way to go about this.

Here's a tool more practical than your function find_nth(): rsplit()
def find_nth(body, s_term, n):
    start = body[::-1].find(s_term)
    print '------------------------------------------------'
    print 'body[::-1]\n', body[::-1]
    print '\nstart == %s' % start
    while start >= 0 and n > 1:
        start = body[::-1].find(s_term, start+len(s_term))
        print 'n == %s start == %s' % (n, start)
        n -= 1
    print '\n(body[::-1])[start:]\n', (body[::-1])[start:]
    print '\n((body[::-1])[start:])[::-1]\n', ((body[::-1])[start:])[::-1]
    print '---------------\n'
    return ((body[::-1])[start:])[::-1]

def cool_find_nth(body, s_term, n):
    assert(len(s_term) == 1)
    return body.rsplit(s_term, n)[0] + s_term

ss = 'One / Two / Three / Four / Five / Six / End'
print 'the string\n%s\n' % ss
print ('================================\n'
       "find_nth(ss, '/', 3)\n%s" % find_nth(ss, '/', 3))
print '================================='
print "cool_find_nth(ss, '/', 3)\n%s" % cool_find_nth(ss, '/', 3)
result
the string
One / Two / Three / Four / Five / Six / End
------------------------------------------------
body[::-1]
dnE / xiS / eviF / ruoF / eerhT / owT / enO
start == 4
n == 3 start == 10
n == 2 start == 17
(body[::-1])[start:]
/ ruoF / eerhT / owT / enO
((body[::-1])[start:])[::-1]
One / Two / Three / Four /
---------------
================================
find_nth(ss, '/', 3)
One / Two / Three / Four /
=================================
cool_find_nth(ss, '/', 3)
One / Two / Three / Four /
EDIT 1
Here's another very practical tool: regex
import re

reg = re.compile('/'
                 '([^/.]*?)/'
                 '([^/.]*?)\.'
                 '([^/.]*?)\.'
                 '([^/.]*?)\.'
                 '([^/.]*?)\.'
                 '([^/.]*?)\.'
                 '[^/.]+\Z')
def main():
    #file = open('~/Desktop/client_docs.csv', "rb")
    #output = open('~/Desktop/client_docs_parsed.txt', "wb")
    li = ['/Household/LastName.FirstName.Account.Doctype.Date.extension',
          '- /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc']
    for line in li:
        print "line == %r" % line
        household, lastn, firstn, acct, doctype, date = reg.search(line).groups('')
        print ('household == %r\n'
               'lastn == %r\n'
               'firstn == %r\n'
               'acct == %r\n'
               'doctype == %r\n'
               'date == %r\n'
               % (household, lastn, firstn, acct, doctype, date))

if __name__ == "__main__": main()
result
line == '/Household/LastName.FirstName.Account.Doctype.Date.extension'
household == 'Household'
lastn == 'LastName'
firstn == 'FirstName'
acct == 'Account'
doctype == 'Doctype'
date == 'Date'
line == '- /Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc'
household == 'Cosby, Bill'
lastn == 'Cosby'
firstn == 'Bill'
acct == ''
doctype == 'Profile'
date == '2006'
EDIT 2
I wonder where my brain was when I posted my last edit. The following does the job as well:
rig = re.compile('[/.]')
rig.split(line)[-7:-1]
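A quick check against the full path from the question (it stays correct even with an extra enclosing folder such as Resigned, because the fields are taken from the end of the split):

import re

rig = re.compile('[/.]')
line = '/Volumes/HD/Organized Files/Cosby, Bill/Cosby.Bill..Profile.2006.doc'
print rig.split(line)[-7:-1]
# ['Cosby, Bill', 'Cosby', 'Bill', '', 'Profile', '2006']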

From what I can gather, I believe this will work as a solution, and it won't rely on a previously compiled list of files.
import csv
import os, os.path

# Replace this with the directory where the household directories are stored.
directory = "home"

output = open("Output.csv", "wb")
csvf = csv.writer(output)
headerRow = ["Household", "Lastname", "Firstname", "Account", "Doctype",
             "Date", "Extension"]
csvf.writerow(headerRow)
for root, households, files in os.walk(directory):
    for household in households:
        # Join with root (not directory) so a nested enclosing folder
        # such as "Resigned" still resolves to a real path.
        for filename in os.listdir(os.path.join(root, household)):
            # This creates a record for each filename within the "household",
            # splitting the filename on "." as a delimiter to get the detail.
            csvf.writerow([household] + filename.split("."))
output.flush()
output.close()
This uses the os library to "walk" the list of households. Then, for each "household", it gathers a file listing. It then takes this list and generates records in a CSV file, breaking apart the name of the file using the period as a delimiter.
It makes use of the csv library to generate the output, which will look somewhat like:
"Household,LastName,Firstname,Account,Doctype,Date,Extension"
If the extension is not needed, it can be omitted by changing the line:
csvf.writerow([household] + filename.split("."))
to
csvf.writerow([household] + filename.split(".")[:-1])
which tells it to use everything up to, but not including, the last part of the filename; then remove the "Extension" string from headerRow.
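For example, with the sample filename from the question:

filename = 'Cosby.Bill..Profile.2006.doc'
print ['Cosby, Bill'] + filename.split(".")[:-1]
# ['Cosby, Bill', 'Cosby', 'Bill', '', 'Profile', '2006']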
Hopefully this helps

It's a bit unclear what the question is but meanwhile, here is something to get you started:
#!/usr/bin/env python
import os
import csv

with open("f1", "rb") as fin:
    reader = csv.reader(fin, delimiter='.')
    for row in reader:
        # split path
        row = list(os.path.split(row[0])) + row[1:]
        print ','.join(row)
Output:
/Household,LastName,FirstName,Account,Doctype,Date,extension
Another interpretation is that you would like to store each field in a parameter
and that an additional path screws things up...
This is what row looks like in the for-loop:
['/Household/LastName', 'FirstName', 'Account', 'Doctype', 'Date', 'extension']
The solution then might be to work backwards.
Assign row[-1] to extension, row[-2] to date and so on.
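A minimal sketch of that backwards approach (the names are illustrative; note that the extra Resigned folder no longer matters, since everything is indexed from the end):

import os

line = '/Volumes/HD/Organized Files/Resigned/Cosby, Bill/Cosby.Bill..Profile.2006.doc'
dirname, filename = os.path.split(line)
household = os.path.basename(dirname)  # innermost enclosing folder
fields = filename.split('.')
extension, date, doctype = fields[-1], fields[-2], fields[-3]
acct, firstn, lastn = fields[-4], fields[-5], fields[-6]
print ','.join([household, lastn, firstn, acct, doctype, date, extension])
# Cosby, Bill,Cosby,Bill,,Profile,2006,doc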

Related

Error in wikipedia subcategory crawling using python3

Hello Community Members,
I am getting the error NameError: name 'f' is not defined. Please help; any sort of help is appreciated. I have been stuck on this for 3 days. The code extracts all the subcategory names of a Wikipedia category, in Python 3.
I have tried both the relative and absolute paths.
The code is as follows:
import httplib2
from bs4 import BeautifulSoup
import subprocess
import time, wget
import os, os.path

#declarations
catRoot = "http://en.wikipedia.org/wiki/Category:"
MAX_DEPTH = 100
done = []
ignore = []
path = 'trivial'

#Removes all newline characters and replaces with spaces
def removeNewLines(in_text):
    return in_text.replace('\n', ' ')

# Downloads a link into the destination
def download(link, dest):
    # print link
    if not os.path.exists(dest) or os.path.getsize(dest) == 0:
        subprocess.getoutput('wget "' + link + '" -O "' + dest + '"')
        print ("Downloading")

def ensureDir(f):
    if not os.path.exists(f):
        os.mkdir(f)

# Cleans a text by removing tags
def clean(in_text):
    s_list = list(in_text)
    i, j = 0, 0
    while i < len(s_list):
        #iterate until a left-angle bracket is found
        if s_list[i] == '<':
            if s_list[i+1] == 'b' and s_list[i+2] == 'r' and s_list[i+3] == '>':
                i = i + 1
                print ("hello")
                continue
            while s_list[i] != '>':
                #pop everything from the left-angle bracket until the right-angle bracket
                s_list.pop(i)
            #pops the right-angle bracket, too
            s_list.pop(i)
        elif s_list[i] == '\n':
            s_list.pop(i)
        else:
            i = i + 1
    #convert the list back into text
    join_char = ''
    return (join_char.join(s_list))  #.replace("<br>","\n")

def getBullets(content):
    mainSoup = BeautifulSoup(contents, "html.parser")

# Gets empty bullets
def getAllBullets(content):
    mainSoup = BeautifulSoup(str(content), "html.parser")
    subcategories = mainSoup.findAll('div', attrs={"class": "CategoryTreeItem"})
    empty = []
    full = []
    for x in subcategories:
        subSoup = BeautifulSoup(str(x))
        link = str(subSoup.findAll('a')[0])
        if (str(x)).count("CategoryTreeEmptyBullet") > 0:
            empty.append(clean(link).replace(" ", "_"))
        elif (str(x)).count("CategoryTreeBullet") > 0:
            full.append(clean(link).replace(" ", "_"))
    return ((empty, full))

def printTree(catName, count):
    catName = catName.replace("\\'", "'")
    if count == MAX_DEPTH: return
    download(catRoot + catName, path)
    filepath = "categories/Category:" + catName + ".html"
    print(filepath)
    content = open('filepath', 'w+')
    content.readlines()
    (emptyBullets, fullBullets) = getAllBullets(content)
    f.close()
    for x in emptyBullets:
        for i in range(count):
            print (" "),
        download(catRoot + x, "categories/Category:" + x + ".html")
        print (x)
    for x in fullBullets:
        for i in range(count):
            print (" "),
        print (x)
        if x in done:
            print ("Done... " + x)
            continue
        done.append(x)
        try: printTree(x, count + 1)
        except:
            print ("ERROR: " + x)

name = "Cricket"
printTree(name, 0)
The error encountered is the NameError mentioned above.
I think f.close() should be content.close().
It's common to use a context manager for such cases, though, like this:
with open(filepath, 'w+') as content:
    (emptyBullets, fullBullets) = getAllBullets(content)
Then Python will close the file for you, even in case of an exception.
(I also changed 'filepath' to filepath, which I assume is the intent here.)

how to merge same object in json file using python

The file 1.json contains many sniffed WiFi packets. I want to get the MAC addresses of the receiver and transmitter, which can be found in the first "wlan" object as "wlan.ra" and "wlan.sa". data[0] is the first WiFi packet.
Q1:
But when I try to print the elements of wlan after json.load, it only shows the elements of the second "wlan" object, so there is no "wlan.ra" and "wlan.sa" in the data.
with open('1.json','r') as json_data:
    data = json.load(json_data)
    a = data[0]
    print a
Q2:
There are two 'wlan' objects in my json file. How can I merge the elements in these two 'wlan' objects into just one 'wlan' object?
The following is my code, but it doesn't work:
with open('1.json','r') as f:
    data = json.load(f)
    for i in data:
        i['_source']['layers']['wlan'].update()
(Screenshot of the JSON file omitted.)
'''
Created on 2017/10/3
@author: DD
'''
import os

def modify_jsonfile(jsonfile):
    '''
    replace wlan with wlan1/wlan2
    '''
    FILESUFFIX = '_new'     # filename suffix
    LBRACKET = '{'          # json object delimiter
    RBRACKET = '}'
    INTERESTED = '"wlan"'   # string to be replaced
    nBrackets = 0           # counter to record object nesting
    nextIndex = 1           # next index of wlan
    with open(jsonfile, 'r') as fromJsonFile:
        fields = os.path.splitext(jsonfile)  # generate new filename
        with open(fields[0] + FILESUFFIX + fields[1], 'w') as toJsonFile:
            for line in fromJsonFile.readlines():
                for ch in line:  # record brackets
                    if ch == LBRACKET:
                        nBrackets += 1
                    elif ch == RBRACKET:
                        nBrackets -= 1
                if nBrackets == 0:
                    nextIndex = 1
                if (nextIndex == 1 or nextIndex == 2) and line.strip().find(INTERESTED) == 0:  # replace string
                    line = line.replace(INTERESTED, INTERESTED[:-1] + str(nextIndex) + INTERESTED[-1])
                    nextIndex += 1
                toJsonFile.write(line)
    print 'done.'

if __name__ == '__main__':
    jsonfile = r'C:\Users\DD\Desktop\1.json'
    modify_jsonfile(jsonfile)
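As an alternative to rewriting the file, the standard json module can intercept duplicate keys while parsing via object_pairs_hook (by default the second "wlan" silently overwrites the first, which is why only the second object shows up in Q1). A minimal sketch, assuming the two "wlan" objects should simply be merged key by key:

import json

def merge_duplicate_keys(pairs):
    # json calls this with the (key, value) pairs of every object it parses;
    # merge dict values of repeated keys (such as the two "wlan" objects).
    merged = {}
    for key, value in pairs:
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key].update(value)
        else:
            merged[key] = value
    return merged

with open('1.json', 'r') as f:
    data = json.load(f, object_pairs_hook=merge_duplicate_keys)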

Python - how to optimize iterator in file parsing

I get files that have NTFS audit permissions and I'm using Python to parse them. The raw CSV files list the path and then which groups have which access, such as this type of pattern:
E:\DIR A, CREATOR OWNER FullControl
E:\DIR A, Sales FullControl
E:\DIR A, HR Full Control
E:\DIR A\SUBDIR, Sales FullControl
E:\DIR A\SUBDIR, HR FullControl
My code parses the file to output this:
File Access for: E:\DIR A
CREATOR OWNER,FullControl
Sales,FullControl
HR,FullControl
File Access For: E:\DIR A\SUBDIR
Sales,FullControl
HR,FullControl
I'm new to generators but I'd like to use them to optimize my code. Nothing I've tried seems to work, so here is the original code (I know it's ugly). It works, but it's very slow. The only way I could do this was to parse out the paths first, put them in a list, make a set so that they're unique, then iterate over that list and match each path against the paths in the second list, listing all of the items it finds. Like I said, it's ugly, but it works.
import os, codecs, sys
reload(sys)
sys.setdefaultencoding('utf8')  # to prevent cp-932 errors on screen

file = "aud.csv"
outfile = "access-2.csv"
filelist = []
accesslist = []
with codecs.open(file, "r", 'utf-8-sig') as infile:
    for line in infile:
        newline = line.split(',')
        folder = newline[0].replace("\"", "")
        user = newline[1].replace("\"", "")
        filelist.append(folder)
        accesslist.append(folder + "," + user)
newfl = sorted(set(filelist))

def makeFile():
    print "Starting, please wait"
    for i in range(1, len(newfl)):
        searchItem = str(newfl[i])
        with codecs.open(outfile, "a", 'utf-8-sig') as output:
            outtext = ("\r\nFile access for: " + searchItem + "\r\n")
            output.write(outtext)
            for item in accesslist:
                searchBreak = item.split(",")
                searchTarg = searchBreak[0]
                if searchItem == searchTarg:
                    searchBreaknew = searchBreak[1].replace("FSA-INC01S\\", "")
                    searchBreaknew = str(searchBreaknew)
                    # print(searchBreaknew)
                    searchBreaknew = searchBreaknew.replace(" ", ",")
                    searchBreaknew = searchBreaknew.replace("CREATOR,OWNER", "CREATOR OWNER")
                    output.write(searchBreaknew)
How should I optimize this?
EDIT:
Here is an edited version. It works MUCH faster, though I'm sure it can still be fixed:
import os, codecs, sys, csv
reload(sys)
sys.setdefaultencoding('utf8')

file = "aud.csv"
outfile = "access-3.csv"
filelist = []
accesslist = []
with codecs.open(file, "r", 'utf-8-sig') as csvinfile:
    auditfile = csv.reader(csvinfile, delimiter=",")
    for line in auditfile:
        folder = line[0]
        user = line[1].replace("FSA-INC01S\\", "")
        filelist.append(folder)
        accesslist.append(folder + "," + user)
newfl = sorted(set(filelist))

def makeFile():
    print "Starting, please wait"
    for i in xrange(1, len(newfl)):
        searchItem = str(newfl[i])
        outtext = ("\r\nFile access for: " + searchItem + "\r\n")
        accessUserlist = ""
        for item in accesslist:
            searchBreak = item.split(",")
            if searchItem == searchBreak[0]:
                searchBreaknew = str(searchBreak[1]).replace(" ", ",")
                searchBreaknew = searchBreaknew.replace("R,O", "R O")
                accessUserlist += searchBreaknew + "\r\n"
        with codecs.open(outfile, "a", 'utf-8-sig') as output:
            output.write(outtext)
            output.write(accessUserlist)
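A dict keyed by folder would remove the repeated scans over accesslist entirely; a minimal single-pass sketch (the access-4.csv name is just an illustration):

import codecs
from collections import defaultdict

access = defaultdict(list)
with codecs.open("aud.csv", "r", 'utf-8-sig') as infile:
    for line in infile:
        # one pass: group each user entry under its folder
        folder, user = line.rstrip("\r\n").split(",", 1)
        access[folder].append(user.replace("FSA-INC01S\\", ""))

with codecs.open("access-4.csv", "w", 'utf-8-sig') as output:
    for folder in sorted(access):
        output.write("\r\nFile access for: %s\r\n" % folder)
        for user in access[folder]:
            output.write(user + "\r\n")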
I was misled by your use of the .csv file extension.
Your expected output isn't really CSV, since a record can't contain a newline.
Here is a proposal using a generator that returns record by record:
class Audit(object):
    def __init__(self, fieldnames):
        self.fieldnames = fieldnames
        self.__access = {}

    def append(self, row):
        folder = row[self.fieldnames[0]]
        access = row[self.fieldnames[1]].strip(' ')
        access = access.replace("FSA-INC01S\\", "")
        access = access.split(' ')
        if len(access) == 3:
            if access[0] == 'CREATOR':
                access[0] += ' ' + access[1]
                del access[1]
            elif access[1] == 'Full':
                access[1] += ' ' + access[2]
                del access[2]
        if folder not in self.__access:
            self.__access[folder] = []
        self.__access[folder].append(access)

    # Generator for class Audit
    def __iter__(self):
        record = ''
        for folder in sorted(self.__access):
            record = folder + '\n'
            for access in self.__access[folder]:
                record += '%s\n' % (','.join(access))
            yield record + '\n'
How to use it:
def main():
    import io, csv
    audit = Audit(['Folder', 'Accesslist'])
    with io.open(file, "r", encoding='utf-8') as csc_in:
        for row in csv.DictReader(csc_in, delimiter=","):
            audit.append(row)
    with io.open(outfile, 'w', newline='', encoding='utf-8') as txt_out:
        for record in audit:
            txt_out.write(record)
Tested with Python:3.4.2 - csv:1.0

Printing in the same line in python

I am quite new to Python and I need your help.
I have a file like this:
>chr14_Gap_2
ACCGCGATGAAAGAGTCGGTGGTGGGCTCGTTCCGACGCGCATCCCCTGGAAGTCCTGCTCAATCAGGTGCCGGATGAAGGTGGT
GCTCCTCCAGGGGGCAGCAGCTTCTGCGCGTACAGCTGCCACAGCCCCTAGGACACCGTCTGGAAGAGCTCCGGCTCCTTCTTG
acacccaggactgatctcctttaggatggactggctggatcttcttgcagtccaaggggctctcaagagt
………..
>chr14_Gap_3
ACCGCGATGAAAGAGTCGGTGGTGGGCTCGTTCCGACGCGCATCCCCTGGAAGTCCTGCTCAATCAGGTGCCGGATGAAGGTGGT
GCTCCTCCAGGGGGCAGCAGCTTCTGCGCGTACAGCTGCCACAGCCCCTAGGACACCGTCTGGAAGAGCTCCGGCTCCTTCTTG
acacccaggactgatctcctttaggatggactggctggatcttcttgcagtccaaggggctctcaagagt
………..
One line is a tag and the next line is the DNA sequence.
I want to count the number of N letters and the number of lower-case letters and compute their percentage.
I wrote the following script, which works, but I have a problem with the printing.
#!/usr/bin/python
import sys

if len(sys.argv) != 2:
    print "Usage: If you want to run this python script you have to put the fasta file that includes the desert areas' sequences as argument"
    sys.exit(1)

fasta_file = sys.argv[1]
# This script reads the sequences of the desert areas (fasta files) and calculates the percentage of the Ns and the repeats.
fasta_file = sys.argv[1]
f = open(fasta_file, 'r')
content = f.readlines()
x = len(content)
# print x
for i in range(0, len(content)):
    if (i % 2 == 0):
        content[i].strip()
        name = content[i].split(">")[1]
        print name,  # the "," makes the print command avoid printing a new line
    else:
        content[i].strip()
        numberOfN = content[i].count('N')
        # print numberOfN
        allChar = len(content[i])
        lowerChars = sum(1 for c in content[i] if c.islower())
        Ns_percentage = 100 * (numberOfN / float(allChar))
        lower_percentage = 100 * (lowerChars / float(allChar))
        waste = Ns_percentage + lower_percentage
        print ("The waste percentage is: %s" % (round(waste)))
        # print ("The percentage of Ns is: %s and the percentage of repeats is: %s" % (Ns_percentage, lower_percentage))
        # print (name + waste)
The thing is that it prints the tag on the first line and the waste variable on the second, like this:
chr10_Gap_18759
The waste percentage is: 52.0
How can I print them on the same line, tab separated?
e.g.
chr10_Gap_18759 52.0
chr10_Gap_19000 78.0
…….
Thank you very much.
If you are using Python 2.X, you can print it with:
print name, "\t", round(waste)
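In Python 3, where print is a function, the same effect comes from the sep argument:

print(name, round(waste), sep='\t')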
I would make some modifications to your code. Python's argparse module manages arguments from the command line. I would do something like this:
#!/usr/bin/python
import argparse

# To use the arguments
parser = argparse.ArgumentParser()
parser.add_argument("fasta_file", help="The fasta file to be processed", type=str)
args = parser.parse_args()

f = open(args.fasta_file, "r")
content = f.readlines()
f.close()
x = len(content)
for i in range(x):
    line = content[i].strip()
    if (i % 2 == 0):
        # The first time this will fail; on later iterations it prints as you wish
        try:
            print name, "\t", round(waste)
        except NameError:
            pass
        name = line.split(">")[1]
    else:
        numberOfN = line.count('N')
        allChar = len(line)
        lowerChars = sum(1 for c in line if c.islower())
        Ns_percentage = 100 * (numberOfN / float(allChar))
        lower_percentage = 100 * (lowerChars / float(allChar))
        waste = Ns_percentage + lower_percentage

# To print the last case you need to do it outside the loop
print name, "\t", round(waste)
You can also print it like the other answer with print("{}\t{}".format(name, round(waste)))
I am not sure about the use of i%2. Note that if a sequence uses an odd number of lines, you will not get the name of the next sequence until the same event occurs. I would check whether the line begins with ">", then store the name and sum the characters of the next line.
Don't print the name when (i%2 == 0); just save it in a variable and print it in the next iteration together with the percentage:
print("{0}\t{1}".format(name, round(waste)))
This method of string formatting (new in version 2.6) is the new standard in Python 3, and should be preferred to the % formatting described in String Formatting Operations in new code.
I've fixed the indentation and redundancy:
#!/usr/bin/python
"""
This script reads the sequences of the desert areas (fasta files) and calculates the percentage of the Ns and the repeats.

2014-10-05 v1.0 by Vasilis
2014-10-05 v1.1 by Llopis
2015-02-27 v1.2 by Cees Timmerman
"""
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("fasta_file", help="The fasta file to be processed.", type=str)
args = parser.parse_args()

with open(args.fasta_file, "r") as f:
    for line in f.readlines():
        line = line.strip()
        if line[0] == '>':
            name = line.split(">")[1]
            print name,
        else:
            numberOfN = line.count('N')
            allChar = len(line)
            lowerChars = sum(1 for c in line if c.islower())
            Ns_percentage = 100 * (numberOfN / float(allChar))
            lower_percentage = 100 * (lowerChars / float(allChar))
            waste = Ns_percentage + lower_percentage
            print "\t", round(waste)  # Note: https://docs.python.org/2/library/functions.html#round
Fed:
>chr14_Gap_2
ACCGCGATGAAAGAGTCGGTGGTGGGCTCGTTCCGACGCGCATCCCCTGGAAGTCCTGCTCAATCAGGTGCCGGATGAAGGTGGTGCTCCTCCAGGGGGCAGCAGCTTCTGCGCGTACAGCTGCCACAGCCCCTAGGACACCGTCTGGAAGAGCTCCGGCTCCTTCTTGacacccaggactgatctcctttaggatggactggctggatcttcttgcagtccaaggggctctcaagagt
>chr14_Gap_3
ACCGCGATGAAAGAGTCGGTGGTGGGCTCGTTCCGACGCGCATCCCCTGGAAGTCCTGCTCAATCAGGTGCCGGATGAAGGTGGTGCTCCTCCAGGGGGCAGCAGCTTCTGCGCGTACAGCTGCCACAGCCCCTAGGACACCGTCTGGAAGAGCTCCGGCTCCTTCTTGacacccaggactgatctcctttaggatggactggctggatcttcttgcagtccaaggggctctcaagagt
Gives:
C:\Python27\python.exe -u "dna.py" fasta.txt
Process started >>>
chr14_Gap_2 29.0
chr14_Gap_3 29.0
<<< Process finished. (Exit code 0)
Using my favorite Python IDE: Notepad++ with NppExec plugin.

Generate output file name in accordance to input file names

Below is a script to read velocity values from molecular dynamics trajectory data. I have many trajectory files with the name pattern as below:
waters1445-MD001-run0100.traj
waters1445-MD001-run0200.traj
waters1445-MD001-run0300.traj
waters1445-MD001-run0400.traj
waters1445-MD001-run0500.traj
waters1445-MD001-run0600.traj
waters1445-MD001-run0700.traj
waters1445-MD001-run0800.traj
waters1445-MD001-run0900.traj
waters1445-MD001-run1000.traj
waters1445-MD002-run0100.traj
waters1445-MD002-run0200.traj
waters1445-MD002-run0300.traj
waters1445-MD002-run0400.traj
waters1445-MD002-run0500.traj
waters1445-MD002-run0600.traj
waters1445-MD002-run0700.traj
waters1445-MD002-run0800.traj
waters1445-MD002-run0900.traj
waters1445-MD002-run1000.traj
Each file has 200 frames of data to analyse. I planned it so that this code reads each traj file (shown above) one after another, extracts the velocity values, and writes them to a specific output file (text_file = open("Output.traj.dat", "a")) corresponding to the respective input trajectory file.
So I defined a function called 'loops(mmm)', where 'mmm' is the trajectory file name passed to the function 'loops'.
#!/usr/bin/env python
'''
always put #!/usr/bin/env python at the shebang
'''
#from __future__ import print_function
from Scientific.IO.NetCDF import NetCDFFile as Dataset
import itertools as itx
import sys

#####################
def loops(mmm):
    inputfile = mmm
    for FRAMES in range(0, 200):
        frame = FRAMES
        text_file = open("Output.mmm.dat", "a")

        def grouper(n, iterable, fillvalue=None):
            args = [iter(iterable)] * n
            return itx.izip_longest(fillvalue=fillvalue, *args)

        formatxyz = "%12.7f%12.7f%12.7f%12.7f%12.7f%12.7f"
        formatxyz_size = 6
        formatxyzshort = "%12.7f%12.7f%12.7f"
        formatxyzshort_size = 3
        #ncfile = Dataset(inputfile, 'r')
        ncfile = Dataset(ppp, 'r')
        variableNames = ncfile.variables.keys()
        #print variableNames
        shape = ncfile.variables['coordinates'].shape
        '''
        do the header
        '''
        print 'title ' + str(frame)
        text_file.write('title ' + str(frame) + '\n')
        print "%5i%15.7e" % (shape[1], ncfile.variables['time'][frame])
        text_file.write("%5i%15.7e" % (shape[1], ncfile.variables['time'][frame]) + '\n')
        '''
        do the velocities
        '''
        try:
            xyz = ncfile.variables['velocities'][frame]
            temp = grouper(2, xyz, "")
            for i in temp:
                z = tuple(itx.chain(*i))
                if (len(z) == formatxyz_size):
                    print formatxyz % z
                    text_file.write(formatxyz % z + '\n')
                elif (len(z) == formatxyzshort_size):
                    print formatxyzshort % z
                    text_file.write(formatxyzshort % z + '\n')
        except KeyError:
            xyz = [0] * shape[2]
            xyz = [xyz] * shape[1]
            temp = grouper(2, xyz, "")
            for i in temp:
                z = tuple(itx.chain(*i))
                if (len(z) == formatxyz_size):
                    print formatxyz % z
                elif (len(z) == formatxyzshort_size):
                    print formatxyzshort % z
        x = ncfile.variables['cell_angles'][frame]
        y = ncfile.variables['cell_lengths'][frame]
        #text_file.close()

# program starts - generation of file name
for md in range(1, 3):
    if md < 10:
        for pico in range(100, 1100, 100):
            if pico >= 1000:
                kkk = "waters1445-MD00{0}-run{1}.traj".format(md, pico)
                loops(kkk)
            elif pico < 1000:
                kkk = "waters1445-MD00{0}-run0{1}.traj".format(md, pico)
                loops(kkk)
            #print kkk
At the line marked (# program starts - generation of file name), the code is supposed to generate each file name, call the function accordingly, extract the velocities, and dump the values into the file opened as text_file = open("Output.mmm.dat", "a").
When I execute this code, the program runs, but unfortunately it does not produce output files named according to the input trajectory file names.
I want the output file names to be:
velo-waters1445-MD001-run0100.dat
velo-waters1445-MD001-run0200.dat
velo-waters1445-MD001-run0300.dat
velo-waters1445-MD001-run0400.dat
velo-waters1445-MD001-run0500.dat
.
.
.
I could not trace where I need to make changes.
Your code's indentation was broken as posted: the first assignment to formatxyz and the following code were not aligned to either the def grouper or the for FRAMES.
The main problem may be (as Johannes already commented) the gap between when you open the file(s) for writing and when you actually write data into the file.
Check:
for FRAMES in range(0, 200):
    frame = FRAMES
    text_file = open("Output.mmm.dat", "a")
The output file is named (hardcoded) Output.mmm.dat. Change it to "Output.{0}.dat".format(mmm). But then, the variable mmm never changes inside the loop. This may be OK if all frames are supposed to be written to the same file.
Generally, please work on the names you choose for variables and functions. loops is very generic, and so are kkk and mmm. Be more specific; it helps debugging. If you don't know what's happening and where your program goes wrong, insert print("dbg> do (a)") statements with descriptive text and/or use the Python debugger to step through your program. Interactive debugging is essential when learning a new language and new concepts, imho.
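For instance, the output name can be derived from the input name (a sketch; output_name is a hypothetical helper to illustrate the idea, and zero-padded format specifiers also remove the need for the if/elif on pico):

import os

def output_name(trajfile):
    # e.g. waters1445-MD001-run0100.traj -> velo-waters1445-MD001-run0100.dat
    base = os.path.splitext(os.path.basename(trajfile))[0]
    return "velo-{0}.dat".format(base)

for md in range(1, 3):
    for pico in range(100, 1100, 100):
        kkk = "waters1445-MD{0:03d}-run{1:04d}.traj".format(md, pico)
        print kkk, '->', output_name(kkk)

Opening that output file once per trajectory (before the for FRAMES loop), rather than once per frame, would also avoid reopening it 200 times.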
