How do I alphabetize a file in Python?

I am trying to get a list of presidents alphabetized by last name, even though the file it is being drawn from currently lists first name, last name, date in office, and date out of office.
Here is what I have; any help on what I need to do with this would be appreciated. I have searched around for some answers, and most of them are beyond my level of understanding. I feel like I am missing something small. I tried to break them all out into a list and then sort them, but I could not get it to work, so this is where I started from.
INPUT_FILE = 'presidents.txt'
OUTPUT_FILE = 'president_NEW.txt'
OUTPUT_FILE2 = 'president_NEW2.txt'

def main():
    infile = open(INPUT_FILE)
    outfile = open(OUTPUT_FILE, 'w')
    outfile2 = open(OUTPUT_FILE2, 'w')
    stuff = infile.readline()
    while stuff:
        stuff = stuff.rstrip()
        data = stuff.split('\t')
        president_First = data[1]
        president_Last = data[0]
        start_date = data[2]
        end_date = data[3]
        sentence = '%s %s was president from %s to %s' % \
            (president_First, president_Last, start_date, end_date)
        sentence2 = '%s %s was president from %s to %s' % \
            (president_Last, president_First, start_date, end_date)
        outfile2.write(sentence2 + '\n')
        outfile.write(sentence + '\n')
        stuff = infile.readline()
    infile.close()
    outfile.close()
    outfile2.close()

main()

What you should do is put the presidents in a list, sort that list, and then print out the resulting list.
Before your loop, add:
presidents = []
Inside the loop, after you pull out the names/dates, build a tuple and collect it:
president = (last_name, first_name, start_date, end_date)
presidents.append(president)
After the loop:
presidents.sort()  # because we put last_name first above,
                   # it will sort by last_name
Then print it out:
for president in presidents:
    last_name, first_name, start_date, end_date = president
    string1 = "..."
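Putting those pieces together, a minimal sketch of the whole approach might look like this (assuming, as in your code, that the last name is column 0 of each tab-separated line):

presidents = []
with open(INPUT_FILE) as infile:
    for stuff in infile:
        data = stuff.rstrip().split('\t')
        # last name is column 0, as in the question's code
        presidents.append((data[0], data[1], data[2], data[3]))

presidents.sort()  # tuples compare item-wise, so this sorts by last name

with open(OUTPUT_FILE, 'w') as outfile:
    for last_name, first_name, start_date, end_date in presidents:
        outfile.write('%s %s was president from %s to %s\n'
                      % (first_name, last_name, start_date, end_date))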
It sounds like you tried to break them out into a list. If you had trouble with that, show us the code that resulted from that attempt. It was the right way to approach the problem.
Other comments:
Just a couple of points where your code could be simpler. Feel free to ignore or use this as you want:
president_First = data[1]
president_Last = data[0]
start_date = data[2]
end_date = data[3]
can be written as:
president_Last, president_First, start_date, end_date = data
And
stuff = infile.readline()
while stuff:
    stuff = stuff.rstrip()
    data = stuff.split('\t')
    ...
    stuff = infile.readline()
can be written as:
for stuff in infile:
    ...

#!/usr/bin/env python
# this sounds like a homework problem, but ...
from __future__ import with_statement  # not necessary on newer versions

def main():
    # input
    with open('presidents.txt', 'r') as fi:
        # read and parse
        presidents = [[x.strip() for x in line.split(',')] for line in fi]
    # sort on the second field (the last name)
    presidents.sort(key=lambda p: p[1])
    # output
    with open('presidents_out.txt', 'w') as fo:
        for pres in presidents:
            print >> fo, "president %s %s was president %s %s" % tuple(pres)

if __name__ == '__main__':
    main()

I tried to break them all out into a list, and then sort them
What do you mean by "them"?
Breaking up the line into a list of items is a good start: that means you treat the data as a set of values (one of which is the last name) rather than just a string. However, just sorting that list is no use; Python will take the 4 strings from the line (the first name, last name etc.) and put them in order.
What you want to do is have a list of those lists, and sort it by last name.
Python's lists provide a sort method that sorts them. When you apply it to the list of president-info-lists, it will sort those. But the default sorting for lists will compare them item-wise (first item first, then second item if the first items were equal, etc.). You want to compare by last name, which is the second element in your sublists. (That is, element 1; remember, we start counting list elements from 0.)
Fortunately, it is easy to give Python more specific instructions for sorting. We can pass the sort function a key argument, which is a function that "translates" the items into the value we want to sort them by. Yes, in Python everything is an object - including functions - so there is no problem passing a function as a parameter. Here, we want to sort "by last name", so we would pass a function that accepts a president-info-list and returns the last name (i.e., element [1]).
Fortunately, this is Python, and "batteries are included"; we don't even have to write that function ourselves. We are given a magical tool that creates functions that return the nth element of a sequence (which is what we want here). It's called itemgetter (because it makes a function that gets the nth item of a sequence - "item" is more usual Python terminology; "element" is a more general CS term), and it lives in the operator module.
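For instance (a quick illustration; the sample list here is invented):

from operator import itemgetter

get_item_1 = itemgetter(1)   # builds a function that returns seq[1]
get_item_1(['George', 'Washington', '1789', '1797'])   # -> 'Washington'
# roughly equivalent to: lambda seq: seq[1]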
By the way, there are also much neater ways to handle the file opening/closing, and we don't need to write an explicit loop to handle reading the file - we can iterate directly over the file (for line in file: gives us the lines of the file in turn, one each time through the loop), and that means we can just use a list comprehension (look them up).
import operator

def main():
    # We'll set up 'infile' to refer to the opened input file, making sure it is automatically
    # closed once we're done with it. We do that with a 'with' block; we're "done with the file"
    # at the end of the block.
    with open(INPUT_FILE) as infile:
        # We want the split, rstripped line for each line in the infile, which is spelled:
        data = [line.rstrip().split('\t') for line in infile]
    # Now we re-arrange that data. We want to sort the data, using an item-getter for
    # item 1 (the last name) as the sort-key. That is spelled:
    data.sort(key=operator.itemgetter(1))
    with open(OUTPUT_FILE, 'w') as outfile:
        # Let's say we want to write the formatted string for each line in the data.
        # Now we're taking action instead of calculating a result, so we don't want
        # a list comprehension any more - so we iterate over the items of the sorted data:
        for item in data:
            # The item already contains all the values we want to interpolate into the string,
            # in the right order; converting it to a tuple lets us pass it directly as the
            # set of values to interpolate:
            outfile.write('%s %s was president from %s to %s\n' % tuple(item))

I did get this working with Karl's help above, although I did have to edit the code to get it to work for me, due to some errors I was getting. I eliminated those and ended up with this:
import operator

INPUT_FILE = 'presidents.txt'
OUTPUT_FILE2 = 'president_NEW2.txt'

def main():
    with open(INPUT_FILE) as infile:
        data = [line.rstrip().split('\t') for line in infile]
    data.sort(key=operator.itemgetter(0))
    outfile = open(OUTPUT_FILE2, 'w')
    for item in data:
        last = item[0]
        first = item[1]
        start = item[2]
        end = item[3]
        outfile.write('%s %s was president from %s to %s\n' % (last, first, start, end))
    outfile.close()

main()


Problem skipping line whilst iterating using previous line and current line comparison

I have a list of sorted data arranged so that each item in the list is a CSV line to be written to file.
The final step of the script checks the contents of each field, and if all but the last field match, it copies the current line's last field onto the previous line's last field.
Once I've found and processed one of these matches, I would like to skip the current line (the one the field was copied from), thus leaving only one of the lines.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']

# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
        if part_duplicate_flag is False:
            f.write(previous_line)
        previous_line = line
At the moment the script adds the joined line but doesn't skip the next one; I've tried various renditions of continue statements after part_duplicate_line is written to file, but to no avail.
Looks like you want one entry for each combination of the first 4 fields. You can use a dict to aggregate the data:
# First we extract the keys and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))

# Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)

# Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)
# Output is ['field1,field2,field3,field4,something else']
# Each entry of this list can be output to the csv file
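For instance, a minimal way to do that last step:

with open('output_table.csv', 'w') as f:
    for entry in for_printing:
        f.write(entry + '\n')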
I propose to encapsulate what you want to do in a function where the important part obeys this logic:
- either join the new info to the old record,
- or output the old record and forget it;
- of course, at the end of the loop, we have in any case a dangling old record to output.
def join(inp_fname, out_fname):
    '''Input file contains sorted records; when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important Part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' ' + new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed, you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter, because we only use -1 to index our records).
The first one has a "regular" last record:
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the second has a last record that must be joined to the previous one:
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run the snippets, as I have done before posting, in the environment where you have defined join you should get what you want.
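As an aside, since the records arrive pre-sorted, the same join-or-flush logic can also be expressed with itertools.groupby; a sketch under the same assumptions (the function name is invented):

from csv import reader, writer
from itertools import groupby

def join_grouped(inp_fname, out_fname):
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        w = writer(fout)
        # consecutive records sharing all but the last field form one group
        for head, rows in groupby(reader(finp), key=lambda row: row[:-1]):
            w.writerow(head + [' '.join(row[-1] for row in rows)])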

String Cutting with multiple lines

So I'm new to Python, besides some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So, say, a Stefan, living on Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with columns
"Name", "Adress", "Company", ...
Now I tried to cut and slice everything. For debugging I use "print" (IDE = Kate/KDE + terminal... :-D).
The problem is that the data contains multiple lines after the keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain.. Ill split right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow({'ID': idea, ...})  # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose + "\t" + pmt
            sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing. In the file it reads:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT 2:
@Yaniv
Thank you, I am still trying to understand every step, but I just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The number of keywords in the emails is ~20. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
search text for keyword A,
if true:
    search text from keyword A until keyword B
    if true:
        copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there.
Should I replace ["=", " = "] with ""?
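(Side note: the stray "=" and "=20" sequences look like quoted-printable encoding, which mbox payloads commonly use; "=20" encodes a space, and a trailing "=" is a soft line break. Decoding the payload first, e.g. with the standard quopri module, may remove them. A minimal sketch with invented sample bytes:)

import quopri

raw = b"Name_OF_=\nPerson: Stefan=20"
print(quopri.decodestring(raw))   # -> b'Name_OF_Person: Stefan '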
I would go for a "routine" parsing loop over the input lines, maintaining current_key and current_value variables, as the value for a certain key in your data might be "annoying" and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process string data across multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done:
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1]  # take second element of the list
person = content.split("Adress_HOME:")[0]      # take content before "Adress_HOME"
content = content.split("Adress_HOME:")[1]     # take second element of the list
address = content.split("Company_NAME:")[0]    # take content before "Company_NAME"
company = content.split("Company_NAME:")[1]    # take second element of the list (the remainder), which is company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make a regex pattern match across multiple lines, you would use the re.DOTALL option, which lets . match newlines (re.MULTILINE only changes how ^ and $ behave). So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)
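For instance, a quick sketch of that idea (sample text taken from the question; the non-greedy .*? keeps the match from running past the first Adress_HOME):

import re

text = "Name_OF_Person: Stefan\nAdress_HOME: London, Maple\nStreet\n45\nCompany_NAME: MultiVendor"
m = re.search(r'Name_OF_Person:(.*?)Adress_HOME:', text, re.DOTALL)
if m:
    print(' '.join(m.group(1).split()))   # -> Stefan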

Iterate through two files to create a new file that has fields from second file appended to fields of first file

I am new to Python. I attempted to use logic from answers by @mgilson, @endolith, and @zackbloom (Zack's example).
I am getting a bunch of blank columns placed in front of the first field of the primary record, and my out_file is empty (more than likely because the columns from the two files cannot match up).
How can I fix this?
The end result should look like the following:
('PUDO_id','Load_id','carrier_id','PUDO_from_company','PUDOItem_id','PUDO_id','PUDOItem_make')
('1','1','14','FMH MATERIAL HANDLING SOLUTIONS','1','1','CROWN','TR3520 / TWR3520','TUGGERS')
('2','2','7','WIESE USA','2','2','CAT','NDC100','3','2','CAT','NDC100','4','2',' 2 BATTERIES')
Note: In the output of the 3rd row, it appended 3 rows from the sub file to the array, while the first 2 rows only appended 1 row each from the sub file. This is determined by the values in pri[0] and sub[1] comparing as equal.
Here is my code, based on @Zack Bloom's answer:
def build_set(filename):
    # A set stores a collection of unique items. Both adding items and searching for them
    # are quick, so it's perfect for this application.
    found = set()
    with open(filename) as f:
        for line in f:
            # Tuples, unlike lists, cannot be changed, which is a requirement for anything
            # being stored in a set.
            line = line.replace('"', '')
            line = line.replace("'", "")
            line = line.replace('\n', '')
            found.add(tuple(sorted(line.split(';'))))
    return found

set_primary_records = build_set('C:\\temp\\oz\\loads_pudo.csv')
set_sub_records = build_set('C:\\temp\\oz\\pudo_items.csv')

record = []
with open('C:\\temp\\oz\\loads_pudo_out.csv', 'w') as out_file:
    # Using with to open files ensures that they are properly closed, even if the code
    # raises an exception.
    for pri in set_primary_records:
        for sub in set_sub_records:
            #out_file.write(" ".join(res) + "\n")
            if sub[1] == pri[0]:
                record = pri.extend(sub)
                out_file.write(record + '\n')
Sample source data (primary records):
PUDO_id;"Load_id";"carrier_id";"PUDO_from_company"
1;"1";"14";"FMH MATERIAL HANDLING SOLUTIONS"
2;"2";"7";"WIESE USA"
Sample source data (sub records):
PUDOItem_id;"PUDO_id";"PUDOItem_make"
1;"1";"CROWN";"TR3520 / TWR3520";"TUGGERS"
2;"2";" CAT";"NDC100"
3;"2";"CAT";"NDC100"
4;"2";" 2 BATTERIES"
5;"11";"MIDLAND"
The extend attribute is not available for tuples, which is what build_set is creating. Tuples are immutable, but they can be concatenated or sliced with normal Python sequence operations.
For example:
with open('C:\\temp\\oz\\loads_pudo_out.csv', 'w') as out_file:
    for pri in set_primary_records:
        for sub in set_sub_records:
            if sub[1] == pri[0]:
                record = pri + sub
                out_file.write(str(record)[1:-1] + '\n')
This is the same code as above, just modified to allow for tuple concatenation. In the write line we convert record to a string and strip the start and end brackets before appending '\n'. Maybe there are better / prettier ways to do this, but I'm new to Python too.
Edit:
To get the output you are expecting, a few changes are required:
# On this line, remove the sorted() as we do not wish to change tuple item order:
found.add(tuple(line.split(';')))
...
with open('C:\\temp\\loads_out.csv', 'w') as out_file:
    for pri in set_primary_records:
        record = pri  # record tuple is set in main loop
        for sub in set_sub_records:
            if sub[1] == pri[0]:
                record += sub  # for each match, sub is appended to record
        out_file.write(str(record) + '\n')  # removed stripping of brackets
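As a side note, the csv module can parse these semicolon-delimited, partially quoted files directly, which would avoid the manual quote stripping in build_set; a sketch under the question's file layout:

import csv

def load_rows(filename):
    # csv.reader handles the ';' delimiter and the '"' quoting in one step
    with open(filename) as f:
        return [tuple(row) for row in csv.reader(f, delimiter=';')]

primary = load_rows('C:\\temp\\oz\\loads_pudo.csv')
subs = load_rows('C:\\temp\\oz\\pudo_items.csv')

with open('C:\\temp\\oz\\loads_pudo_out.csv', 'w') as out_file:
    for pri in primary:
        record = pri
        for sub in subs:
            if sub[1] == pri[0]:
                record += sub
        out_file.write(str(record) + '\n')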

In Python how to make each line in a file to be a tuple and make them in a list?

We have a homework assignment that I have a serious problem with.
The key is to turn each line into a tuple, and to put these tuples into a list,
like list = [tuple(line1), tuple(line2), tuple(line3), ...].
Besides, the strings in each line are separated by commas, like "aei","1433","lincoln",...
Here is the question:
A book can be represented as a tuple of the author's lastName, the author's firstName, the title, date, and ISBN.
Write a function, readBook(), that, given a comma-separated string containing this information, returns a tuple representing the book.
Write a function, readBooks(), that, given the name of a text file containing one comma-separated line per book, uses readBook() to return a list of tuples, each of which describes one book.
Write a function, buildIndex(), that, given a list of books as returned by readBooks(), builds a map from key word to book title. A key word is any word in a book's title, except "a", "an", or "the".
Here is my code:
RC = ("Chann", "Robbbin", "Pride and Prejudice", "2013", "19960418")
RB = ("Benjamin", "Franklin", "The Death of a Robin Thickle", "1725", "4637284")

def readBook(lastName, firstName, booktitle, date, isbn):
    booktuple = (lastName, firstName, booktitle, date, isbn)
    return booktuple

# print readBook("Chen", "Robert", "Pride and Prejudice", "2013", "19960418")

def readBooks(file1):
    inputFile = open(file1, "r")
    lines = inputFile.readlines()
    book = (lines)
    inputFile.close()
    return book

print readBooks("book.txt")

BooklistR = [RC, RB]

def buildIndex(file2):
    inputFile = open("book.txt", "r")
    Blist = inputFile.readlines()
    dictbooks = {}
    for bookinfo in Blist:
        title = bookinfo[2].split()
        for infos in title:
            if infos.upper() == "A":
                title.remove(infos)
            elif infos.upper() == "THE":
                title.remove(infos)
            elif infos.upper() == "AN":
                title.remove(infos)
            else:
                pass
        dictbooks[tuple(title)] = bookinfo[2]
    return dictbooks

print buildIndex("book.txt")

#Queries#
def lookupKeyword(keywords):
    dictbooks = buildIndex(BooklistR)
    keys = dictbooks.viewkeys()
    values = dictbooks.viewvalues()
    for keybook in list(keys):
        for keyw in keywords:
            for keyk in keybook:
                if keyw == keyk:
                    printoo = dictbooks[keybook]
                else:
                    pass
    return printoo

print lookupKeyword("Robin")
What's wrong with something like this?:
with open(someFile) as inputFile:
    myListofTuples = [tuple(line.split(',')) for line in inputFile.readlines()]
[Explanation added based on Robert's comment]
The first line opens the file in a with statement. Python with statements are a fairly new feature and rather advanced. They set up a context in which code executes with certain guarantees about how clean-up and finalization code will be executed as the Python engine exits that context (whether by completing the work or encountering an un-handled exception).
You can read about the ugly details at Python Docs: Context Managers, but the gist of it all is that we're opening someFile with a guarantee that it'll be closed properly after the execution of the code leaves that context (the suite of statements after the with statement). That'll be done even if we encounter some error or if our code inside that suite raises some exception that we fail to catch.
In this case we use the as clause to give us a local name by which we can refer to the opened file object. (The filename is just a string, passed as an argument to the open() built-in function; the object returned by that function needs a name by which we can refer to it. This is similar to how a for i in whatever statement binds each item in whatever to the name i for each iteration through the loop.)
The suite of our with statement (that's the set of indented statements which is run within the context of the context manager) consists of a single statement ... a list comprehension which is bound to the name myListofTuples.
A list comprehension is another fairly advanced programming concept. There are a number of very high level languages which implement them in various ways. In the case of Python they date back to much earlier versions than the with statement --- I think they were introduced in the 2.2 or so timeframe.
Consequently, list comprehensions are fairly common in Python code while with statements are only slowly being adopted.
A list literal in Python looks like: [something, another_thing, etc, ...] a list comprehension is similar but replaces the list of item literals with an expression, a line of code, which evaluates into a list. For example: [x*x for x in range(100) if x % 2] is a list comprehension which evaluates into a list of integers which are the squares of odd integers between 1 and 99. (Notice the absence of commas in the list comprehension. An expression takes the place of the comma delimited sequence which would have been used in a list literal).
In my example I'm using for line in inputFile.readlines() as the core of the expression, splitting each of those lines on the comma (line.split(',')), and then converting the resulting list into a tuple().
This is just a very concise way of saying:
myListofTuples = list()
for line in inputFile.readlines():
    myListofTuples.append(tuple(line.split(',')))
One possible program:
import fileinput

def readBook(s):
    l = s.split(',')
    t = tuple(l[0:5])
    return t

#b = readBook("First,Last,Title,2013,ISBN")
#print b

def readBooks(file):
    l = []
    for line in fileinput.input(file):
        t = readBook(line)
        # print t
        l.append(t)
    return l

books = readBooks("data")
#for t in books:
#    for f in t:
#        print f

def buildIndex(books):
    i = {}
    for b in books:
        for w in b[2].split():
            if w.lower() not in ('a', 'an', 'the'):
                if w not in i:
                    i[w] = []
                i[w].append(b[2])
    return i

index = buildIndex(books)
for w in sorted(index):
    print "Word: ", w
    for t in index[w]:
        print "Title: ", t
Sample data file (called "data" in the code):
Austen,Jane,Pride and Prejudice,1811,123456789012X
Austen,Jane,Sense and Sensibility,1813,21234567892
Rice-Burroughs,Edgar,Tarzan and the Apes,1911,302912341234X
Sample output:
Word: Apes
Title: Tarzan and the Apes
Word: Prejudice
Title: Pride and Prejudice
Word: Pride
Title: Pride and Prejudice
Word: Sense
Title: Sense and Sensibility
Word: Sensibility
Title: Sense and Sensibility
Word: Tarzan
Title: Tarzan and the Apes
Word: and
Title: Pride and Prejudice
Title: Sense and Sensibility
Title: Tarzan and the Apes
Note that the data format can't support book titles such as "The Lion, The Witch, and the Wardrobe" because of the embedded commas. If the file were in CSV format with quotes around the strings, then it could manage that.
I'm not sure that's perfectly minimally Pythonic code (not at all sure), but it does seem to match the requirements.
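For what it's worth, a sketch of that CSV variant (the quoted sample line is invented):

import csv

# A quoted title keeps its embedded commas intact, e.g.:
# Lewis,C.S.,"The Lion, The Witch, and the Wardrobe",1950,123456789
with open("data") as f:
    books = [tuple(row) for row in csv.reader(f)]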

Python - Import file into NamedTuple

Recently I had a question regarding data types.
Since then, I've been trying to use namedtuples (with more or less success).
My problem currently:
- how to import the lines from a file into new tuples;
- how to import the values separated by space/tab(/whatever) into a given part of the tuple?
Like:
Monday 8:00 10:00 ETR_28135 lh1n1522 Computer science 1
Tuesday 12:00 14:00 ETR_28134 lh1n1544 Geography EA 1
The first line should go into tuple[0]. First datum: tuple[0].day; second: tuple[0].start; ...and so on.
And when a new line starts (that's two TABs (\t)), start a new tuple, like tuple[1].
I use this to separate the data:
with open(Filename) as f:
    for line in f:
        rawData = line.strip().split('\t')
And the rest of the logic is still missing (the filling up of the tuples).
(I know, this question and the recent one are really low-level ones. However, I hope these will help others too. If you feel like it's not a real question, too simple to be a question, etc., just vote to close. Thank you for your understanding.)
Such data files are called comma-separated values (CSV) even when they are not really separated by commas. Python has a handy library called csv that lets you easily read such files.
Here is a slightly modified example from the docs:
import csv

csv.register_dialect('mycsv', delimiter='\t', quoting=csv.QUOTE_NONE)
with open(filename, 'rb') as f:
    reader = csv.reader(f, 'mycsv')
Usually you work one line at a time. If you need the whole file in a tuple, then (still inside the with block):
    t = tuple(reader)
EDIT
If you need to access fields by name, you could use csv.DictReader, but I don't know exactly how that works and I could not test it here.
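A rough sketch of that idea (the field names are invented here, since the sample file has no header row):

import csv

fieldnames = ['day', 'start', 'end', 'code1', 'code2', 'title', 'count']
with open('mycsv.csv') as f:
    for row in csv.DictReader(f, fieldnames=fieldnames, delimiter='\t'):
        print(row['day'], row['title'])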
EDIT 2
Looking at what namedtuples are, I'm a bit outdated. There is a nice example of how namedtuple can work with the csv module:
from collections import namedtuple
import csv

EmployeeRecord = namedtuple('EmployeeRecord', 'name, age, title, department, paygrade')

for line in csv.reader(open("employees.csv", "rb")):
    emp = EmployeeRecord._make(line)
    print emp.name, emp.title
If you want to use a namedtuple, you can use a slightly modified version of the example given in the Python documentation:
from collections import namedtuple
import csv

MyRecord = namedtuple('MyRecord', 'weekday, start, end, code1, code2, title, whatever')

for rec in map(MyRecord._make, csv.reader(open("mycsv.csv", "rb"), delimiter='\t')):
    print rec.weekday
    print rec.title
    # etc...
Here's a compact way of doing such things.
First declare the class of line item:
from collections import namedtuple

fields = "dow", "open_time", "close_time", "code", "foo", "subject", "bar"
Item = namedtuple('Item', " ".join(fields))
The next part goes inside your loop:
# this is what your raw data looks like after the split:
# raw_data = ['Monday', '8:00', '10:00', 'ETR_28135', 'lh1n1522', 'Computer science', '1']
data_tuple = Item(**dict(zip(fields, raw_data)))
Now slowly:
zip(fields, raw_data) creates a list of pairs, like [("dow", "Monday"), ("open_time", "8:00"),..]
then dict() turns it into a dictionary, like {"dow": "Monday", "open_time": "8:00", ..}
then ** interprets this dictionary as a bunch of keyword parameters to Item constructor, an equivalent of Item(dow="Monday", open_time="8:00",..).
So your items are named tuples, with all values being strings.
Edit:
If order of fields is not going to change, you can do it far easier:
data_tuple = Item(*raw_data)
This uses the fact that order of fields in the file and order of parameters in Item definition match.
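Putting the pieces together, an end-to-end sketch for the sample file shown in the question (the filename and field names are assumptions):

from collections import namedtuple
import csv

Item = namedtuple('Item', 'day start end code1 code2 title count')

with open('schedule.txt') as f:
    items = [Item(*row) for row in csv.reader(f, delimiter='\t')]

print(items[0].day, items[0].start)   # e.g. Monday 8:00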
