How to use a for loop in the write method? (Python)

The program is supposed to read a given file, count the occurrences of each word with a dictionary, then create a file called report.txt and output the list of words and their frequencies:
infile = open('text file.txt','r')
dictionary = {}
# count words' frequency
for i in range(1,14):
    temp = infile.readline().strip().split()
    for item in temp:
        if dictionary.has_key(item) == False:
            dictionary[item] = 1
        elif dictionary.has_key:
            temp2 = dictionary.get(item)
            dictionary[item] = temp2 + 1
infile.close()
outfile = open('report.txt','w')
outfile.write( for words in dictionary:
    print '%15s :' %words, dictionary[words])
Everything works at the counting part, but right at the last part, writing the output, I realize I can't put a for loop inside the write method.

You need to put the write inside the for loop:
for words in dictionary:
    outfile.write('%15s : %s\n' % (words, dictionary[words]))
Alternatively you can use a comprehension, but they're a bit ninja and can be harder to read:
outfile.write('\n'.join(['%15s : %s' % key_value for key_value in dictionary.items()]))

As has been said already in the accepted answer, you need the write inside the for loop. However, when using files it is also good practice to perform your actions within a with context as this will automatically handle the closing of the file. e.g.
with open('report.txt','w') as outfile:
    for words in dictionary:
        outfile.write('%15s : %s\n' % (words, dictionary[words]))

Your code contains several deficiencies:
Don't use has_key, and don't compare to True/False directly; it is redundant and bad style (in any language).
if dictionary.has_key(item) == False:
should be
`if item not in dictionary:`
It is worth mentioning that using the positive test first will be more efficient, because you'll probably have more than one occurrence of most words in the file.
dictionary.has_key (without parentheses) is a reference to the has_key method, which as a boolean is always True (your code accidentally works because, regardless of the first condition, the second is always True). A simple else would be enough.
The last two statements in that branch can be rewritten as just:
dictionary[item] += 1
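Putting those points together, a minimal sketch of the counting loop (using the question's variable names):
for item in temp:
    if item in dictionary:  # positive test first: the common case after the first occurrence
        dictionary[item] += 1
    else:
        dictionary[item] = 1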
That said, you may use collections.Counter to count the words:
from collections import Counter

dictionary = Counter()
for line in source_file:
    dictionary.update(line.split())
(BTW, strip before split is redundant - split() without arguments already ignores surrounding whitespace)
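For completeness, a sketch of the whole program along these lines, combining Counter with the with blocks from the earlier answer (file names taken from the question):
from collections import Counter

with open('text file.txt') as infile:
    dictionary = Counter()
    for line in infile:
        dictionary.update(line.split())

with open('report.txt', 'w') as outfile:
    for word, count in dictionary.items():
        outfile.write('%15s : %s\n' % (word, count))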

What's wrong with populating a dictionary

I want to populate a dictionary newDict in the following code:
def sessions():
    newDict = {}
    output = exe(['loginctl','list-sessions']) # uses subprocess.check_output(); returns the shell command's multiline output
    i = 0
    for line in output.split('\n'):
        words = line.split()
        newDict[i] = {'session':words[0], 'uid':words[1], 'user':words[2], 'seat':words[4]}
        i += 1
    stdout(newDict) # prints using pprint.pprint(newDict)
But it keeps giving me this error:
newDict[i] = {'session':words[0], 'uid':words[1], 'user':words[2], 'seat':words[4]}
IndexError: list index out of range
If I do print words in the loop, here's what I get:
['c3', '1002', 'john', 'seat0']
['c4', '1003', 'jeff', 'seat0']
What am I doing wrong?
I think it is a typo:
you use words[4] instead of words[3].
BTW:
Here is a slightly improved version of your code. It uses splitlines() instead of split('\n') and skips empty lines. And it uses enumerate(), which is a pretty neat function when it comes to counting entries while iterating over collections.
def sessions():
    newDict = {}
    output = exe(['loginctl','list-sessions']) # returns the shell command's multiline output
    for i, line in enumerate(output.splitlines()):
        if len(line.strip()) == 0:
            continue
        words = line.split()
        print words
        newDict[i] = {'session':words[0], 'uid':words[1], 'user':words[2], 'seat':words[3]}
    stdout(newDict) # prints using pprint.pprint(newDict)
IMO you should check that words isn't too short. It's most likely a problem with the list length after splitting some line (it doesn't have enough elements).
My best guess is that words does not always hold five items;
please try to print len(words) before assigning to the dict.
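For example, a quick debugging sketch along those lines, reusing the loop from the question:
for line in output.split('\n'):
    words = line.split()
    print len(words), words  # shows which lines come up short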
As far as I can tell, this has nothing to do with the dictionary itself, but with parsing the output. Here is an example of the output I obtain:
   SESSION   UID   USER     SEAT
        c2  1000   willem   seat0

1 sessions listed.
Or the string version:
' SESSION UID USER SEAT \n c2 1000 willem seat0 \n\n1 sessions listed.\n'
This all appears on the stdout. The problem, as you can see, is that not every line contains four words (there is the empty line at the bottom). Or, more pythonically:
>>> lines[2].split()
[]
You thus have to implement a check whether the line has four columns:
def sessions():
    newDict = {}
    output = exe(['loginctl','list-sessions']) # uses subprocess.check_output(); returns the shell command's multiline output
    i = 0
    for line in output.split('\n'):
        words = line.split()
        if len(words) >= 4:
            newDict[i] = {'session':words[0], 'uid':words[1], 'user':words[2], 'seat':words[3]}
            i += 1
    stdout(newDict)
In the code I've also rewritten words[4] to words[3].

Matching strings and replacing (Python)

I have a file in which I am trying to replace parts of a line with another word.
A line looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
I need to delete everything but bob123#bobscarshop.com, but I need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe from a different text file and place poniacvibe in front of bob123#bobscarshop.com,
so it would look like this: bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to work out how to do this, but I think I would have to split bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":"), but some of the lines have a (:) in a spot that I don't want the line to be split at, if that makes any sense...
If anyone could help I would really appreciate it.
OK, it looks to me like you are using a colon (:) to separate your strings.
In this case you can use .split(":") to break your strings into their component substrings.
e.g.:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
Assuming your substrings will always be in the same order, and the main string always has the same number of substrings, you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")
secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")
if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1,seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
what this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
the reason your strings need to be the same length is that the index is used to find the relevant strings, rather than some form of string matching (which gets much more complex)
if you can keep your data in this form it gets much simpler.
to protect against malformed data (lists that are too short) you can explicitly test for it before you start, using len(list) to see how many elements there are.
or you could let it run and catch the exception; however, in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
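As an aside, the manual stitching loop can be replaced with str.join, and the length check made explicit; a minimal sketch, reusing firstdata and seconddata from above:
# guard against malformed (too short) lines before indexing
if len(firstdata) >= 4 and len(seconddata) >= 2:
    if firstdata[3] == seconddata[0]:
        firstdata.insert(1, seconddata[1])
        outputstring = ":".join(firstdata)  # same result as the manual loop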
hope this helps
James
EDIT:
OK, so if you are trying to match up a long list of strings from files, you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode = "r")
secondfile = open("secondfile.txt", mode = "r")
first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()

first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n","").split(":"))
second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n","").split(":"))

output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1,entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break

for entry in output_strings:
    print(entry)
This should achieve what you're after, and as a proof of concept it will print the resulting list of strings for you.
if you have any questions feel free to ask.
James
Second edit:
to make this output the results into a file, change the last two lines to:
outputfile = open("outputfile.txt", mode = "w")
for entry in output_strings:
    outputfile.write(entry + "\n")
outputfile.close()

Counting words in a dictionary (Python)

I have this code, which I want to open a specified file and count how many while loops are in it, finally outputting the total number of while loops in the file. I decided to convert the input file to a dictionary, and then create a for loop so that every time the word while followed by a space was seen, it would add +1 to WHILE_, before finally printing WHILE_ at the end.
However this did not seem to work, and I am at a loss as to why. Any help fixing this would be much appreciated.
This is the code I have at the moment:
WHILE_ = 0
INPUT_ = input("Enter file or directory: ")
OPEN_ = open(INPUT_)
READLINES_ = OPEN_.readlines()
STRING_ = (str(READLINES_))
STRIP_ = STRING_.strip()
input_str1 = STRIP_.lower()
dic = dict()
for w in input_str1.split():
    if w in dic.keys():
        dic[w] = dic[w]+1
    else:
        dic[w] = 1
DICT_ = (dic)
for LINE_ in DICT_:
    if ("while\\n',") in LINE_:
        WHILE_ += 1
    elif ('while\\n",') in LINE_:
        WHILE_ += 1
    elif ('while ') in LINE_:
        WHILE_ += 1
print ("while_loops {0:>12}".format((WHILE_)))
This is the input file I was working from:
'''A trivial test of metrics
Author: Angus McGurkinshaw
Date: May 7 2013
'''

def silly_function(blah):
    '''A silly docstring for a silly function'''
    def nested():
        pass
    print('Hello world', blah + 36 * 14)
    tot = 0  # This isn't a for statement
    for i in range(10):
        tot = tot + i
    if_im_done = false  # Nor is this an if
    print(tot)

blah = 3
while blah > 0:
    silly_function(blah)
    blah -= 1
while True:
    if blah < 1000:
        break
The output should be 2, but my code at the moment prints 0
This is an incredibly bizarre design. You're calling readlines to get a list of strings, then calling str on that list, which will join the whole thing up into one big string with the quoted repr of each line joined by commas and surrounded by square brackets, then splitting the result on spaces. I have no idea why you'd ever do such a thing.
Your bizarre variable names, extra useless lines of code like DICT_ = (dic), etc. only serve to obfuscate things further.
But I can explain why it doesn't work. Try printing out DICT_ after you do all that silliness, and you'll see that the only keys that include while are while and 'while. Since neither of these match any of the patterns you're looking for, your count ends up as 0.
It's also worth noting that you only add 1 to WHILE_ even if there are multiple instances of the pattern, so your whole dict of counts is useless.
This will be a lot easier if you don't obfuscate your strings, try to recover them, and then try to match the incorrectly-recovered versions. Just do it directly.
While I'm at it, I'm also going to fix some other problems so that your code is readable, and simpler, and doesn't leak files, and so on. Here's a complete implementation of the logic you were trying to hack up by hand:
import collections

filename = input("Enter file: ")
counts = collections.Counter()
with open(filename) as f:
    for line in f:
        counts.update(line.strip().lower().split())
print('while_loops {0:>12}'.format(counts['while']))
When you run this on your sample input, you correctly get 2. And extending it to handle if and for is trivial and obvious.
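For instance, that extension is just two more lookups into the same Counter (field widths chosen to line up with the example output):
print('for_loops {0:>14}'.format(counts['for']))
print('if_statements {0:>10}'.format(counts['if']))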
However, note that there's a serious problem in your logic: Anything that looks like a keyword but is in the middle of a comment or string will still get picked up. Without writing some kind of code to strip out comments and strings, there's no way around that. Which means you're going to overcount if and for by 1. The obvious way of stripping—line.partition('#')[0] and similarly for quotes—won't work. First, it's perfectly valid to have a string before an if keyword, as in "foo" if x else "bar". Second, you can't handle multiline strings this way.
These problems, and others like them, are why you almost certainly want a real parser. If you're just trying to parse Python code, the ast module in the standard library is the obvious way to do this. If you want to be write quick&dirty parsers for a variety of different languages, try pyparsing, which is very nice, and comes with some great examples.
Here's a simple example:
import ast

filename = input("Enter file: ")
with open(filename) as f:
    tree = ast.parse(f.read())
while_loops = sum(1 for node in ast.walk(tree) if isinstance(node, ast.While))
print('while_loops {0:>12}'.format(while_loops))
Or, more flexibly:
import ast
import collections

filename = input("Enter file: ")
with open(filename) as f:
    tree = ast.parse(f.read())
counts = collections.Counter(type(node).__name__ for node in ast.walk(tree))
print('while_loops {0:>12}'.format(counts['While']))
print('for_loops {0:>14}'.format(counts['For']))
print('if_statements {0:>10}'.format(counts['If']))

Python search one million strings in a file and count occurrences of each string

This is more about finding the fastest way to do it.
I have a file1 which contains about one million strings (length 6-40), one per line. I want to search for each of them in another file2, which contains about 80,000 strings, and count occurrences (if a small string is found in one string multiple times, the occurrence of this string is still 1). If anyone is interested in comparing performance, there is a link to download file1 and file2:
dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m
What I am doing now is constructing a dictionary for file2, using the string IDs as keys and the strings as values (because strings in file2 have duplicate values, only the string ID is unique).
My code is:
for line in file1:
    substring=line[:-1].split("\t")
    for ID in dictionary.keys():
        bigstring=dictionary[ID]
        IDlist=[]
        if bigstring.find(substring)!=-1:
            IDlist.append(ID)
    output.write("%s\t%s\n" % (substring,str(len(IDlist))))
My code will take hours to finish. Can anyone suggest a faster way to do it?
Both file1 and file2 are just around 50M; my PC has 8G of memory, and you can use as much memory as you need to make it faster. Any method that can finish in one hour is acceptable :)
Here, after I have tried some suggestions from the comments below, is a performance comparison: first comes the code, then the run time.
Some improvements were suggested by Mark Amery and other people:
import sys
from Bio import SeqIO

# First I load the strings in file2 into a dictionary called var_seq.
var_seq = {}
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    var_seq[record.id] = str(record.seq)
print len(var_seq) # prints 76827, which is the right number; loading file2 into var_seq takes about 1 second, so don't focus here to improve performance

output = open(outputfilename, 'w')
icount = 0
input1 = open(file1, 'r')
for line in input1:
    icount += 1
    row = line[:-1].split("\t")
    ensp = row[0]    # ensp is just the peptide ID
    peptide = row[1] # peptide is the substring I want to search for in file2
    num = 0
    for ID, bigstring in var_seq.iteritems():
        if peptide in bigstring:
            num += 1
    newline = "%s\t%s\t%s\n" % (ensp, peptide, str(num))
    output.write(newline)
    if icount % 1000 == 0:
        break
input1.close()
handle.close()
output.close()
It takes 1m4s to finish, a 20s improvement compared to my old one.
####### NEXT METHOD, suggested by entropy
from collections import defaultdict

var_seq = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    var_seq[str(record.seq)] += 1
print len(var_seq) # here prints 59502; duplicates are removed, but occurrences of duplicates are stored as values
handle.close()

output = open(outputfilename, 'w')
icount = 0
with open(file1) as fd:
    for line in fd:
        icount += 1
        row = line[:-1].split("\t")
        ensp = row[0]
        peptide = row[1]
        num = 0
        for varseq, num_occurrences in var_seq.items():
            if peptide in varseq:
                num += num_occurrences
        newline = "%s\t%s\t%s\n" % (ensp, peptide, str(num))
        output.write(newline)
        if icount % 1000 == 0:
            break
output.close()
This one takes 1m10s, not faster as expected even though it avoids searching duplicates; I don't understand why.
The haystack-and-needle method suggested by Mark Amery turned out to be the fastest. The problem with this method is that the counting result for all substrings is 0, which I don't understand yet.
Here is the code in which I implemented his method:
class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

input1 = open(file1, 'r')
needles = {}
for line in input1:
    row = line[:-1].split("\t")
    needles[row[1]] = row[0]
print len(needles)

handle = SeqIO.parse(file2, 'fasta')
haystacks = {}
for record in handle:
    haystacks[record.id] = str(record.seq)
print len(haystacks)

for haystack_id, haystack in haystacks.iteritems(): # should be the same as enumerate(list)
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount = 0
output = open(outputfilename, 'w')
for needle in needles:
    icount += 1
    count = search_haystack_tree(needle)
    newline = "%s\t%s\t%s\n" % (needles[needle], needle, str(count))
    output.write(newline)
    if icount % 1000 == 0:
        break
input1.close()
handle.close()
output.close()
It takes only 0m11s to finish, which is much faster than the other methods. However, I don't know whether it is my mistake that makes all the counting results 0, or whether there is a flaw in Mark's method.

Your code doesn't seem like it works (are you sure you didn't just quote it from memory instead of pasting the actual code?).
For example, this line:
substring=line[:-1].split("\t")
will cause substring to be a list. But later you do:
if bigstring.find(substring)!=-1:
That would cause an error when you call str.find(list).
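You can see this quickly in the interpreter (the exact wording of the error varies by Python version):
>>> 'abc'.find(['a', 'b'])
TypeError: expected a character buffer object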
In any case, you are building lists uselessly in your innermost loop. This:
IDlist=[]
if bigstring.find(substring)!=-1:
    IDlist.append(ID)
#later
len(IDlist)
That will uselessly allocate and free lists, which causes memory thrashing as well as bogging everything down.
This is code that should work and uses more efficient means to do the counting:
from collections import defaultdict

dictionary = defaultdict(int)
with open(file2) as fd:
    for line in fd:
        for s in line.split("\t"):
            dictionary[s.strip()] += 1

with open(file1) as fd:
    for line in fd:
        for substring in line.split('\t'):
            count = 0
            for bigstring, num_occurrences in dictionary.items():
                if substring in bigstring:
                    count += num_occurrences
            print substring, count
PS: I am assuming that you have multiple words per line that are tab-split, because you do line.split("\t") at some point. If that is wrong, it should be easy to revise the code.
PPS: If this ends up being too slow for your use (you'd have to try it, but my guess is it should run in ~10 min given the number of strings you said you have), you'll have to use suffix trees, as one of the comments suggested.
Edit: Amended the code so that it handles multiple occurrences of the same string in file2 without negatively affecting performance
Edit 2: Trading maximum space for time.
Below is code that will consume quite a bit of memory and take a while to build the dictionary. However, once that's done, each search out of the million strings should complete in the time it takes for a single hashtable lookup, that is O(1).
Note, I have added some statements to log the time each step of the process takes. You should keep those so you know which part of the time is spent searching. Since you are testing with only 1000 strings this matters a lot: if 90% of the cost is the build time, not the search time, then when you test with 1M strings you will still only be doing the build once, so it won't matter.
Also note that I have amended my code to parse file1 and file2 as you do, so you should be able to just plug this in and test it:
from Bio import SeqIO
from collections import defaultdict
from datetime import datetime

def all_substrings(s):
    result = set()
    for length in range(1, len(s)+1):
        for offset in range(len(s)-length+1):
            result.add(s[offset:offset+length])
    return result

print "Building dictionary...."
build_start = datetime.now()

dictionary = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    for sub in all_substrings(str(record.seq).strip()):
        dictionary[sub] += 1

build_end = datetime.now()
print "Dictionary built in: %gs" % (build_end-build_start).total_seconds()

print "Searching...\n"
search_start = datetime.now()

with open(file1) as fd:
    for line in fd:
        substring = line.strip().split("\t")[1]
        count = dictionary[substring]
        print substring, count

search_end = datetime.now()
print "Search done in: %gs" % (search_end-search_start).total_seconds()
I'm not an algorithms whiz, but I reckon this should give you a healthy performance boost. You need to set haystacks to be a list of the big words you want to look in, and needles to be a list of the substrings you're looking for (either can contain duplicates), which I'll let you implement on your end. It would be great if you could post your list of needles and list of haystacks so that we can easily compare the performance of proposed solutions.
haystacks = <some list of words>
needles = <some list of words>

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

for haystack_id, haystack in enumerate(haystacks):
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

for needle in needles:
    count = search_haystack_tree(needle)
    print "Count for %s: %i" % (needle, count)
You can probably figure out what's going on by looking at the code, but just to put it in words: I construct a huge tree of substrings of the haystack words, such that given any needle, you can navigate the tree character by character and end up at a node which has attached to it the set of all haystack ids of haystacks containing that substring. Then for each needle we just go through the tree and count the size of the set at the end.
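As a tiny sanity check of that logic, a hypothetical run with made-up data:
haystacks = ['abcd', 'bcde']
needles = ['bcd']
# 'bcd' is a substring of both haystacks, so the loop above prints:
# Count for bcd: 2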

How do I alphabetize a file in Python?

I am trying to get a list of presidents alphabetized by last name, even though the file it is drawn from currently lists first name, last name, date in office, and date out of office.
Here is what I have; any help on what I need to do with this would be appreciated. I have searched around for some answers, and most of them are beyond my level of understanding. I feel like I am missing something small. I tried to break them all out into a list and then sort them, but I could not get it to work, so this is where I started from.
INPUT_FILE = 'presidents.txt'
OUTPUT_FILE = 'president_NEW.txt'
OUTPUT_FILE2 = 'president_NEW2.txt'

def main():
    infile = open(INPUT_FILE)
    outfile = open(OUTPUT_FILE, 'w')
    outfile2 = open(OUTPUT_FILE2, 'w')
    stuff = infile.readline()
    while stuff:
        stuff = stuff.rstrip()
        data = stuff.split('\t')
        president_First = data[1]
        president_Last = data[0]
        start_date = data[2]
        end_date = data[3]
        sentence = '%s %s was president from %s to %s' % \
            (president_First, president_Last, start_date, end_date)
        sentence2 = '%s %s was president from %s to %s' % \
            (president_Last, president_First, start_date, end_date)
        outfile2.write(sentence2 + '\n')
        outfile.write(sentence + '\n')
        stuff = infile.readline()
    infile.close()
    outfile.close()

main()
What you should do is put the presidents in a list, sort that list, and then print out the resulting list.
Before your loop add:
presidents = []
Inside the loop, after you pull out the names/dates, have:
president = (last_name, first_name, start_date, end_date)
presidents.append(president)
After the loop:
presidents.sort() # because we put last_name first above, it will sort by last_name
Then print it out:
for president in presidents:
    last_name, first_name, start_date, end_date = president
    string1 = "..."
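Putting those pieces together, a minimal sketch of the whole thing (following the question's column order, where the last name is in column 0, and assuming infile and outfile are opened as in the question):
presidents = []
for line in infile:
    last_name, first_name, start_date, end_date = line.rstrip().split('\t')
    presidents.append((last_name, first_name, start_date, end_date))

presidents.sort()  # tuples compare item-wise, so this sorts by last_name

for last_name, first_name, start_date, end_date in presidents:
    outfile.write('%s %s was president from %s to %s\n' %
                  (first_name, last_name, start_date, end_date))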
It sounds like you tried to break them out into a list. If you had trouble with that, show us the code that resulted from that attempt. It was the right way to approach the problem.
Other comments:
Just a couple of points where your code could be simpler. Feel free to ignore or use this as you want:
president_First = data[1]
president_Last = data[0]
start_date = data[2]
end_date = data[3]
can be written as:
president_Last, president_First, start_date, end_date = data
And
stuff = infile.readline()
while stuff:
    stuff = stuff.rstrip()
    data = stuff.split('\t')
    ...
    stuff = infile.readline()
can be written as:
for stuff in infile:
    ...
#!/usr/bin/env python
# this sounds like a homework problem, but ...
from __future__ import with_statement # not necessary on newer versions

def main():
    # input
    with open('presidents.txt', 'r') as fi:
        # read and parse
        presidents = [[x.strip() for x in line.split(',')] for line in fi]
    # sort
    presidents = sorted(presidents, cmp=lambda x, y: cmp(x[1], y[1]))
    # output
    with open('presidents_out.txt', 'w') as fo:
        for pres in presidents:
            print >> fo, "president %s %s was president %s %s" % tuple(pres)

if __name__ == '__main__':
    main()
I tried to break them all out into a list, and then sort them
What do you mean by "them"?
Breaking up the line into a list of items is a good start: that means you treat the data as a set of values (one of which is the last name) rather than just a string. However, just sorting that list is no use; Python will take the 4 strings from the line (the first name, last name etc.) and put them in order.
What you want to do is have a list of those lists, and sort it by last name.
Python's lists provide a sort method that sorts them. When you apply it to the list of president-info-lists, it will sort those. But the default sorting for lists will compare them item-wise (first item first, then second item if the first items were equal, etc.). You want to compare by last name, which is the second element in your sublists. (That is, element 1; remember, we start counting list elements from 0.)
Fortunately, it is easy to give Python more specific instructions for sorting. We can pass the sort function a key argument, which is a function that "translates" the items into the value we want to sort them by. Yes, in Python everything is an object - including functions - so there is no problem passing a function as a parameter. So, we want to sort "by last name", so we would pass a function that accepts a president-info-list and returns the last name (i.e., element [1]).
Fortunately, this is Python, and "batteries are included"; we don't even have to write that function ourself. We are given a magical tool that creates functions that return the nth element of a sequence (which is what we want here). It's called itemgetter (because it makes a function that gets the nth item of a sequence - "item" is more usual Python terminology; "element" is a more general CS term), and it lives in the operator module.
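A quick interactive sketch of what itemgetter gives you (the data here is made up purely for illustration):
>>> import operator
>>> get_second = operator.itemgetter(1)
>>> get_second(['Washington', 'George', '1789', '1797'])
'George'
>>> sorted([['b', 2], ['a', 1], ['c', 3]], key=operator.itemgetter(1))
[['a', 1], ['b', 2], ['c', 3]]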
By the way, there are also much neater ways to handle the file opening/closing, and we don't need to write an explicit loop to handle reading the file - we can iterate directly over the file (for line in file: gives us the lines of the file in turn, one each time through the loop), and that means we can just use a list comprehension (look them up).
import operator

def main():
    # We'll set up 'infile' to refer to the opened input file, making sure it is automatically
    # closed once we're done with it. We do that with a 'with' block; we're "done with the file"
    # at the end of the block.
    with open(INPUT_FILE) as infile:
        # We want the splitted, rstripped line for each line in the infile, which is spelled:
        data = [line.rstrip().split('\t') for line in infile]
    # Now we re-arrange that data. We want to sort the data, using an item-getter for
    # item 1 (the last name) as the sort-key. That is spelled:
    data.sort(key=operator.itemgetter(1))
    with open(OUTPUT_FILE) as outfile:
        # Let's say we want to write the formatted string for each line in the data.
        # Now we're taking action instead of calculating a result, so we don't want
        # a list comprehension any more - so we iterate over the items of the sorted data:
        for item in data:
            # The item already contains all the values we want to interpolate into the string,
            # in the right order; so we can pass it directly as our set of values to interpolate:
            outfile.write('%s %s was president from %s to %s' % item)
I did get this working with Karl's help above, although I did have to edit the code to get it to work for me, due to some errors I was getting. I eliminated those and ended up with this:
import operator

INPUT_FILE = 'presidents.txt'
OUTPUT_FILE2 = 'president_NEW2.txt'

def main():
    with open(INPUT_FILE) as infile:
        data = [line.rstrip().split('\t') for line in infile]
    data.sort(key=operator.itemgetter(0))
    outfile = open(OUTPUT_FILE2, 'w')
    for item in data:
        last = item[0]
        first = item[1]
        start = item[2]
        end = item[3]
        outfile.write('%s %s was president from %s to %s\n' % (last, first, start, end))

main()
