Python - Import file into NamedTuple - python

Recently I had a question regarding data types.
Since then, I've been trying to use NamedTuples (with more or less success).
My problem currently:
- How to import the lines from a file to new tuples,
- How to import the values separated with space/tab(/whatever) into a given part of the tuple?
Like:
Monday 8:00 10:00 ETR_28135 lh1n1522 Computer science 1
Tuesday 12:00 14:00 ETR_28134 lh1n1544 Geography EA 1
First line should go into tuple[0]. First data: tuple[0].day; second: tuple[0].start; ..and so on.
And when a new line starts (marked by two TABs, \t), start a new tuple, such as tuple[1].
I use this to separate the data:
with open(Filename) as f:
    for line in f:
        rawData = line.strip().split('\t')
And the rest of the logic is still missing (the filling up of the tuples).
(I know. This question, and the recent one are really low-level ones. However, hope these will help others too. If you feel like it's not a real question, too simple to be a question, etc etc, just vote to close. Thank you for your understanding.)

Such database files are called comma-separated values (CSV) files, even when the separator is not actually a comma. Python has a handy library called csv that lets you read such files easily.
Here is a slightly modified example from the docs:
import csv
csv.register_dialect('mycsv', delimiter='\t', quoting=csv.QUOTE_NONE)
with open(filename, 'rb') as f:
    reader = csv.reader(f, 'mycsv')
Usually you work one line at a time. If you need the whole file in a tuple then:
t = tuple(reader)
EDIT
If you need to access fields by name you could use csv.DictReader, but I don't know exactly how that works and I could not test it here.
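For what it's worth, here is a minimal sketch of how csv.DictReader could handle the tab-separated schedule lines from the question. It assumes Python 3, the field names are made up for illustration, and io.StringIO stands in for the real file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical column names; the question only names "day" and "start".
FIELDNAMES = ["day", "start", "end", "code", "room", "subject", "group"]

# Stand-in for open(filename, newline="") so the example is self-contained.
sample = io.StringIO(
    "Monday\t8:00\t10:00\tETR_28135\tlh1n1522\tComputer science\t1\n"
    "Tuesday\t12:00\t14:00\tETR_28134\tlh1n1544\tGeography EA\t1\n"
)

# Passing fieldnames explicitly means the first line is treated as data,
# not as a header row.
reader = csv.DictReader(sample, fieldnames=FIELDNAMES, delimiter="\t")
rows = list(reader)  # one dict per line, keyed by field name

print(rows[0]["day"], rows[0]["start"])
print(rows[1]["subject"])
```

Each row is then accessed as rows[i]["day"] rather than rows[i].day, which is the main difference from the namedtuple approach.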
EDIT 2
Looking at what namedtuples are, I'm a bit outdated. There is a nice example on how namedtuple could work with the csv module:
from collections import namedtuple
EmployeeRecord = namedtuple('EmployeeRecord', 'name, age, title, department, paygrade')
import csv
for line in csv.reader(open("employees.csv", "rb")):
    emp = EmployeeRecord._make(line)
    print emp.name, emp.title

If you want to use a NamedTuple, you can use a slightly modified version of the example given in the Python documentation:
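from collections import namedtuple
# field names are case-sensitive, so 'weekday' must match the attribute used below
MyRecord = namedtuple('MyRecord', 'weekday, start, end, code1, code2, title, whatever')
import csv
for rec in map(MyRecord._make, csv.reader(open("mycsv.csv", "rb"), delimiter='\t')):
    print rec.weekday
    print rec.title
    # etc...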

Here's a compact way of doing such things.
First declare the class of line item:
from collections import namedtuple
fields = "dow", "open_time", "close_time", "code", "foo", "subject", "bar"
Item = namedtuple('Item', " ".join(fields))
The next part is inside your loop.
# this is what your raw data looks like after the split:
#raw_data = ['Monday', '8:00', '10:00', 'ETR_28135', 'lh1n1522', 'Computer science', '1']
data_tuple = Item(**dict(zip(fields, raw_data)))
Now slowly:
zip(fields, raw_data) creates a list of pairs, like [("dow", "Monday"), ("open_time", "8:00"),..]
then dict() turns it into a dictionary, like {"dow": "Monday", "open_time": "8:00", ..}
then ** interprets this dictionary as a bunch of keyword parameters to Item constructor, an equivalent of Item(dow="Monday", open_time="8:00",..).
So your items are named tuples, with all values being strings.
Edit:
If order of fields is not going to change, you can do it far easier:
data_tuple = Item(*raw_data)
This uses the fact that order of fields in the file and order of parameters in Item definition match.
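Putting the pieces together, the whole loop from the question might look roughly like the sketch below. It assumes Python 3 and single-tab separators; the Lesson name and its field names are only placeholders, and io.StringIO stands in for the real file.

```python
import io
from collections import namedtuple

# Placeholder field names; only "day" and "start" appear in the question.
Lesson = namedtuple("Lesson", "day start end code room subject group")

# Stand-in for the real file; fields are separated by single tabs here.
sample = io.StringIO(
    "Monday\t8:00\t10:00\tETR_28135\tlh1n1522\tComputer science\t1\n"
    "Tuesday\t12:00\t14:00\tETR_28134\tlh1n1544\tGeography EA\t1\n"
)

lessons = []
for line in sample:
    raw_data = line.rstrip("\n").split("\t")
    if len(raw_data) != len(Lesson._fields):
        continue  # skip blank or malformed lines instead of raising TypeError
    lessons.append(Lesson(*raw_data))  # positional order matches the file

print(lessons[0].day, lessons[0].start)
print(lessons[1].subject)
```

The length check is a design choice: Lesson(*raw_data) raises TypeError when the number of values does not match the number of fields, so you may prefer to let it fail loudly instead of skipping the line.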

Related

Combining files in python using

I am attempting to combine a collection of 600 text files, each line looks like
Measurement title Measurement #1
ebv-miR-BART1-3p 4.60618701
....
evb-miR-BART1-200 12.8327289
with 250 or so rows in each file. Each file is formatted that way, with the same data headers. What I would like to do is combine the files such that it looks like this
Measurement title Measurement #1 Measurement #2
ebv-miR-BART1-3p 4.60618701 4.110878867
....
evb-miR-BART1-200 12.8327289 6.813287556
I was wondering if there is an easy way in python to strip out the second column of each file, then append it to a master file? I was planning on pulling each line out, then using regular expressions to look for the second column, and appending it to the corresponding line in the master file. Is there something more efficient?
This is a small amount of data for today's desktop computers (around 150,000 measurements), so keeping everything in memory and dumping it to a single file will be easier than any other strategy. If it did not fit in RAM, SQL might be a nice approach.
As it is, you can create a single default dictionary in which each element is a list,
read all your files, collect the measurements into this dictionary, and dump it to disk:
# create default list dictionary:
>>> from collections import defaultdict
>>> data = defaultdict(list)
# Read your data into it:
>>> from glob import glob
>>> import csv
>>> for filename in glob("my_directory/*csv"):
...     reader = csv.reader(open(filename))
...     # throw away header row:
...     next(reader)
...     for name, value in reader:
...         data[name].append(value)
...
>>> # and record everything down in another file:
...
>>> mydata = open("mydata.csv", "wt")
>>> writer = csv.writer(mydata)
>>> for name, values in sorted(data.items()):
...     writer.writerow([name] + values)
...
>>> mydata.close()
>>>
Use the csv module to read the files in, create a dictionary of the measurement names, and make the values in the dictionary a list of the values from the file.
I don't have comment privileges yet, therefore a separate answer.
jsbueno's answer works really well as long as you're sure that the same measurement IDs occur in every file (order is not important, but the sets should be equal!).
In the following situation:
file1:
measID,meas1
a,1
b,2
file2:
measID,meas1
a,3
b,4
c,5
you would get:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,5
instead of the desired:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,,5 # measurement c was missing in file1!
I'm using commas instead of spaces as delimiters for better visibility.
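One possible way around this, sketched under the assumption that the number of files is known up front, is to reserve one slot per file for every measurement ID and fill in only the slots that actually occur. The two io.StringIO objects stand in for file1 and file2 from the example above, and the code targets Python 3.

```python
import csv
import io

# Stand-ins for the real files; commas are used as in the example above.
files = [
    io.StringIO("measID,meas1\na,1\nb,2\n"),
    io.StringIO("measID,meas1\na,3\nb,4\nc,5\n"),
]

data = {}  # measurement ID -> list with one slot per file
for index, handle in enumerate(files):
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for name, value in reader:
        # Pre-fill every slot with "" so a missing measurement stays empty.
        row = data.setdefault(name, [""] * len(files))
        row[index] = value

out = io.StringIO()  # stand-in for the output file
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["measID"] + ["meas%d" % (i + 1) for i in range(len(files))])
for name in sorted(data):
    writer.writerow([name] + data[name])

print(out.getvalue())
```

With the two sample files this writes the row for c as c,,5, so the value stays in the column of the file it came from.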

How do I handle closing double quotes in CSV column with python?

This is the python script:
f = open('csvdata.csv','rb')
fo = open('out6.csv','wb')
for line in f:
    bits = line.split(',')
    bits[1] = '"input"'
    fo.write( ','.join(bits) )
f.close()
fo.close()
I have a CSV file and I'm replacing the content of the 2nd column with the string "input". However, I need to grab some information from that column content first.
The content might look like this:
failurelog_wl","inputfile/source/XXXXXXXX"; "**X_CORD2**"; "Invoice_2M";
"**Y_CORD42**"; "SIZE_ID37""
As you can see, the data is oddly formatted; in particular, the line ends with two double quotes instead of the single one you would expect.
I need to extract the XCORD and YCORD information, like XCORD = 2 and YCORD = 42, before replacing the column value. I then want to insert an extra column, named X_Y, which represents (2_42).
How can I modify my script to do that?
If I understand your question correctly, you can use a simple regular expression to pull out the numbers you want:
import re
f = open('csvdata.csv','rb')
fo = open('out6.csv','wb')
for line in f:
    # strip the line ending so the new column lands before it, not after it
    bits = line.rstrip('\r\n').split(',')
    x_y_matches = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
    assert x_y_matches is not None, 'Line had unexpected format: {0}'.format(bits[1])
    x_y = '({0}_{1})'.format(x_y_matches.group(1), x_y_matches.group(2))
    bits[1] = '"input"'
    bits.append(x_y)
    fo.write( ','.join(bits) + '\n' )
f.close()
fo.close()
Note that this will only work if column 2 always says 'X_CORD' and 'Y_CORD' immediately before the numbers. If it is sometimes a slightly different format, you'll need to adjust the regular expression to allow for that. I added the assert to give a more useful error message if that happens.
You mentioned wanting the column to be named X_Y. Your script appears to assume that there is no header, and my modified version definitely makes this assumption. Again, you'd need to adjust for that if there is a header line.
And, yes, I agree with the other commenters that using the csv module would be cleaner, in general, for reading and writing csv files.
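If you would rather not abort on an unexpected line, the extraction step can be isolated in a small helper that returns None when the coordinates are missing. This is only a sketch for Python 3: the helper name extract_x_y is made up, and the sample string is the column content quoted in the question.

```python
import re

# Column content copied from the question (trailing double quote included).
column = ('"inputfile/source/XXXXXXXX"; "**X_CORD2**"; "Invoice_2M"; '
          '"**Y_CORD42**"; "SIZE_ID37""')

# re.search scans the whole string, so no leading ".*" is needed.
CORD_PATTERN = re.compile(r'X_CORD(\d+).*?Y_CORD(\d+)')

def extract_x_y(text):
    """Return 'X_Y' (for example '2_42') or None if the coordinates are missing."""
    match = CORD_PATTERN.search(text)
    if match is None:
        return None
    return "{0}_{1}".format(match.group(1), match.group(2))

print(extract_x_y(column))
print(extract_x_y("no coordinates here"))
```

The caller can then decide whether a None result should skip the line, write an empty X_Y column, or raise an error.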

Get Attribute From Text File in Python

So I'm making a Yu-Gi-Oh database program. I have all the information stored in a large text file. Each monster is categorized in the following way:
|Name|NUM 1|DESC 1|TYPE|LOCATION|STARS|ATK|DEF|DESCRIPTION
Here's an actual example:
|A Feather of the Phoenix|37;29;18|FET;YSDS;CP03|Spell Card}Spell||||Discard 1 card. Select from your Graveyard and return it to the top of your Deck.|
So I made a program that searches this large text file by name and it returns the information from the text file without the '|'. Here it is:
with open('TEXT.txt') as fd:
    input=[x.strip('|').split('|') for x in fd.readlines()]
    to_search={x[0]:x for x in input}
    print('\n'.join(to_search[name]))
Now I'm trying to edit my program so I can search for the name of the monster and choose which attribute I want to display. So it'd appear like
A Feather of the Phoenix
Description:
Discard 1 card. Select from your Graveyard and return it to the top of your Deck.
Any clues as to how I can do this?
First, this is a variant dialect of CSV, and can be parsed with the csv module instead of trying to do it manually. For example:
import csv
with open('TEXT.txt') as fd:
    rows = csv.reader(fd, delimiter='|')
    # row[0] is empty because every line starts with '|', so the name is row[1]
    to_search = {row[1]:row for row in rows}
    print('\n'.join(to_search[name]))
You might also prefer to use DictReader, so each row is a dict (keyed off the names in the header row, or manually-specified column names if you don't have one):
with open('TEXT.txt') as fd:
    rows = csv.DictReader(fd, delimiter='|')
    to_search = {row['Name']:row for row in rows}
    print('\n'.join(to_search[name]))
Then, to select a specific attribute:
with open('TEXT.txt') as fd:
    rows = csv.DictReader(fd, delimiter='|')
    to_search = {row['Name']:row for row in rows}
    print(to_search[name][attribute])
However… I'm not sure this is a good design in the first place. Do you really want to re-read the entire file for each lookup? I think it makes more sense to read it into memory once, into a general-purpose structure that you can use repeatedly. And in fact, you've almost got such a structure:
with open('TEXT.txt') as fd:
    monsters = list(csv.DictReader(fd, delimiter='|'))
monsters_by_name = {monster['Name']: monster for monster in monsters}
Then you can build additional indexes, like a multi-map of monsters by location, etc., if you need them.
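As a rough illustration of such an extra index, a multi-map can be built with collections.defaultdict. The sketch assumes Python 3 and a simplified file with a header row and no leading '|'; the three cards are made-up placeholders, and io.StringIO stands in for TEXT.txt.

```python
import csv
import io
from collections import defaultdict

# Stand-in for TEXT.txt; header and rows are shortened, made-up examples.
sample = io.StringIO(
    "Name|TYPE|LOCATION\n"
    "Card A|Spell|Deck\n"
    "Card B|Monster|Graveyard\n"
    "Card C|Monster|Deck\n"
)

monsters = list(csv.DictReader(sample, delimiter="|"))
monsters_by_name = {monster["Name"]: monster for monster in monsters}

# Multi-map: each location maps to a list of all cards found there.
monsters_by_location = defaultdict(list)
for monster in monsters:
    monsters_by_location[monster["LOCATION"]].append(monster)

print([m["Name"] for m in monsters_by_location["Deck"]])
print(monsters_by_name["Card B"]["TYPE"])
```

Both indexes point at the same row dictionaries, so the file is read only once no matter how many lookups follow.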
All this being said, your original code can almost handle what you want already. to_search[name] is a list. If you just build a map from attribute names to indices, you can do this:
attributes = ['Name', 'NUM 1', 'DESC 1', 'TYPE', 'LOCATION', 'STARS', 'ATK', 'DEF', 'DESCRIPTION']
attributes_by_name = {value: idx for idx, value in enumerate(attributes)}
# ...
with open('TEXT.txt') as fd:
    input=[x.strip('|').split('|') for x in fd.readlines()]
    to_search={x[0]:x for x in input}
    # 'attribute' is the attribute name the user asked for, e.g. 'DESCRIPTION'
    attribute_index = attributes_by_name[attribute]
    print(to_search[name][attribute_index])
You could look at the namedtuple class in collections. You will want to make each entry a namedtuple with your fields as attributes. The namedtuple might look like:
Card = namedtuple('Card', 'name, number, description, whatever_else')
As shown in the collections documentation, namedtuple and csv work well together:
import csv
for card in map(Card._make, csv.reader(open("cards", "rb"))):
    print card.name, card.description # format however you want here
The mechanics around search can be very complicated. For example, if you want a really fast search built around an exact match, you could build a dictionary for each attribute you're interested in:
name_map = {card.name: card for card in all_cards}
search_result = name_map[name_you_searched_for]
You could also do a startswith search:
possibles = [card for card in all_cards if card.name.startswith(search_string)]
# Decide what to do with these possibles. This example just takes the first one
# and does not handle the case where nothing matches; you should.
search_result = possibles[0]
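One way to cover the "nothing found" case is to ask for the first match with a default value. This is just a sketch for Python 3; the two-field Card and the find_first helper are shortened, hypothetical stand-ins for your real data.

```python
from collections import namedtuple

# Shortened, hypothetical field list for illustration.
Card = namedtuple("Card", "name description")

all_cards = [
    Card("A Feather of the Phoenix", "Discard 1 card."),
    Card("Another Card", "Made-up entry for illustration."),
]

def find_first(cards, prefix):
    """Return the first card whose name starts with prefix, or None."""
    return next((card for card in cards if card.name.startswith(prefix)), None)

hit = find_first(all_cards, "A Feather")
miss = find_first(all_cards, "Zzz")
print(hit.name if hit else "not found")
print(miss.name if miss else "not found")
```

Because next() stops at the first match, this also avoids building the whole list of possibles when only one result is needed.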
I recommend against trying to search the file itself. That kind of search is complex to implement and is typically left to database systems. If you need it, consider switching the application to SQLite or another lightweight database.
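To give an idea of what the database route could look like, here is a small sketch using the sqlite3 module from the standard library. The table layout, the column names, and the lookup helper are assumptions made for this example, and the second card is a made-up row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist the data
conn.execute(
    "CREATE TABLE cards (name TEXT PRIMARY KEY, type TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO cards VALUES (?, ?, ?)",
    [
        ("A Feather of the Phoenix", "Spell", "Discard 1 card."),
        ("Another Card", "Monster", "Made-up entry for illustration."),
    ],
)

def lookup(name, attribute):
    """Return one attribute of one card, or None if the card is unknown."""
    # Column names cannot be bound as parameters, so check them against
    # a fixed list before putting them into the SQL text.
    if attribute not in ("name", "type", "description"):
        raise ValueError("unknown attribute: %r" % attribute)
    row = conn.execute(
        "SELECT %s FROM cards WHERE name = ?" % attribute, (name,)
    ).fetchone()
    return row[0] if row else None

print(lookup("A Feather of the Phoenix", "description"))
```

The card name is always passed as a bound parameter (the ? placeholder), which keeps user input out of the SQL text.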

Search for a variable in a file and get its value with python

I want to store some variables in a file (a text file or a YAML file).
For example, suppose these variables are stored in the file:
employee = ['Tom', 'Bob','Anny']
salary = 200
managers = ['Saly','Alice']
I want the user to enter a list name or a variable name. For example,
if the user enters employee and wants to operate on the list values, they should be able to access employee[0], employee[1], and so on.
How can I write a Python script that searches the file for the correct variable and gives the user access to its value?
Thanks
As @Levon said, there are several ways to do that, and the best one depends on your problem context. For example, you could:
- read the file yourself, parsing it by a delimiter such as "=",
- use a database to store your data,
- use pickle or shelve to serialize your variables and get them back later,
- put the variables in a Python module and import it.
This approach might be one way, assuming your file contents are somewhat consistent:
Updated: I added the code necessary to parse the lists which previously wasn't provided.
The code takes all of the data in your file and assigns it to the variables as appropriate types (i.e., float and lists). The list parsing isn't particularly pretty, but it is functional.
import re
with open('data.txt') as inf:
    salary = 0
    for line in inf:
        line = line.split('=')
        line[0] = line[0].strip()
        if line[0] == 'employee':
            employee = re.sub(r'[]\[\' ]','', line[1].strip()).split(',')
        elif line[0] == 'salary':
            salary = float(line[1])
        elif line[0] == 'managers':
            managers = re.sub(r'[]\[\' ]','', line[1].strip()).split(',')
print employee
print salary
print managers
yields:
['Tom', 'Bob', 'Anny']
200.0
['Saly', 'Alice']
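If the values in the file are always plain Python literals, as in the question, another option is to let ast.literal_eval do the parsing and to keep everything in a dictionary keyed by variable name. This is a sketch for Python 3; the dictionary name variables and the hard-coded requested name are assumptions, and io.StringIO stands in for data.txt.

```python
import ast
import io

# Stand-in for open('data.txt'); contents copied from the question.
sample = io.StringIO(
    "employee = ['Tom', 'Bob','Anny']\n"
    "salary = 200\n"
    "managers = ['Saly','Alice']\n"
)

variables = {}
for line in sample:
    if "=" not in line:
        continue  # ignore blank lines and anything that is not an assignment
    name, _, value = line.partition("=")
    # literal_eval only accepts Python literals, so it cannot run arbitrary code.
    variables[name.strip()] = ast.literal_eval(value.strip())

requested = "employee"  # would come from user input in a real script
print(variables[requested][0])
print(variables["salary"])
```

Looking the name up in a dictionary means the user can ask for any variable in the file without a separate if/elif branch per name.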

How do I alphabetize a file in Python?

I am trying to get a list of presidents alphabetized by last name, even though the file it is drawn from currently lists first name, last name, date in office, and date out of office.
Here is what I have; any help on what I need to do with it would be appreciated. I have searched around for some answers, and most of them are beyond my level of understanding. I feel like I am missing something small. I tried to break them all out into a list, and then sort them, but I could not get it to work, so this is where I started from.
INPUT_FILE = 'presidents.txt'
OUTPUT_FILE = 'president_NEW.txt'
OUTPUT_FILE2 = 'president_NEW2.txt'
def main():
    infile = open(INPUT_FILE)
    outfile = open(OUTPUT_FILE, 'w')
    outfile2 = open(OUTPUT_FILE2,'w')
    stuff = infile.readline()
    while stuff:
        stuff = stuff.rstrip()
        data = stuff.split('\t')
        president_First = data[1]
        president_Last = data[0]
        start_date = data[2]
        end_date = data[3]
        sentence = '%s %s was president from %s to %s' % \
                   (president_First,president_Last,start_date,end_date)
        sentence2 = '%s %s was president from %s to %s' % \
                    (president_Last,president_First,start_date, end_date)
        outfile2.write(sentence2+ '\n')
        outfile.write(sentence + '\n')
        stuff = infile.readline()
    infile.close()
    outfile.close()
main()
What you should do is put the presidents in a list, sort that list, and then print out the resulting list.
Before your loop, add:
presidents = []
Put this code inside the loop, after you pull out the names and dates:
president = (last_name, first_name, start_date, end_date)
presidents.append(president)
After the loop:
presidents.sort() # because we put last_name first above
                  # it will sort by last_name
Then print it out:
for president in presidents:
    last_name, first_name, start_date, end_date = president
    string1 = "..."
It sounds like you tried to break them out into a list. If you had trouble with that, show us the code that resulted from that attempt. It was the right way to approach the problem.
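To see why putting the last name first is enough, here is a small self-contained sketch (Python 3). The three rows are sample data typed in by hand instead of being read from presidents.txt.

```python
# Sample rows in (last_name, first_name, start_date, end_date) order.
presidents = [
    ("Washington", "George", "1789", "1797"),
    ("Adams", "John", "1797", "1801"),
    ("Jefferson", "Thomas", "1801", "1809"),
]

# Tuples compare element by element, so sorting orders by last_name first.
presidents.sort()

lines = []
for last_name, first_name, start_date, end_date in presidents:
    lines.append("%s %s was president from %s to %s"
                 % (first_name, last_name, start_date, end_date))

print("\n".join(lines))
```

If two presidents share a last name, the comparison falls through to the first name and then to the dates, which is usually a reasonable tie-breaker.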
Other comments:
Just a couple of points where your code could be simpler. Feel free to ignore or use this as you want:
president_First=data[1]
president_Last= data[0]
start_date=data[2]
end_date=data[3]
can be written as:
president_Last, president_First, start_date, end_date = data
And
stuff=infile.readline()
while stuff:
    stuff=stuff.rstrip()
    data=stuff.split('\t')
    ...
    stuff = infile.readline()
can be written as:
for stuff in infile:
    ...
#!/usr/bin/env python
# this sounds like a homework problem, but ...
from __future__ import with_statement # not necessary on newer versions
def main():
    # input
    with open('presidents.txt', 'r') as fi:
        # read and parse
        presidents = [[x.strip() for x in line.split(',')] for line in fi]
    # sort
    presidents = sorted(presidents, cmp=lambda x, y: cmp(x[1], y[1]))
    # output
    with open('presidents_out.txt', 'w') as fo:
        for pres in presidents:
            print >> fo, "president %s %s was president %s %s" % tuple(pres)
if __name__ == '__main__':
    main()
"I tried to break them all out into a list, and then sort them"
What do you mean by "them"?
Breaking up the line into a list of items is a good start: that means you treat the data as a set of values (one of which is the last name) rather than just a string. However, just sorting that list is no use; Python will take the 4 strings from the line (the first name, last name etc.) and put them in order.
What you want to do is have a list of those lists, and sort it by last name.
Python's lists provide a sort method that sorts them. When you apply it to the list of president-info-lists, it will sort those. But the default sorting for lists will compare them item-wise (first item first, then second item if the first items were equal, etc.). You want to compare by last name, which is the second element in your sublists. (That is, element 1; remember, we start counting list elements from 0.)
Fortunately, it is easy to give Python more specific instructions for sorting. We can pass the sort function a key argument, which is a function that "translates" the items into the value we want to sort them by. Yes, in Python everything is an object - including functions - so there is no problem passing a function as a parameter. So, we want to sort "by last name", so we would pass a function that accepts a president-info-list and returns the last name (i.e., element [1]).
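Written out by hand, such a key function could look like the sketch below (Python 3). The function name last_name and the three sample rows are made up for illustration; the rows follow the layout described in this answer, with the last name at index 1.

```python
def last_name(president):
    """Key function: return the element to sort by (index 1 in this layout)."""
    return president[1]

# Sample rows in [first name, last name, start, end] order.
data = [
    ["George", "Washington", "1789", "1797"],
    ["John", "Adams", "1797", "1801"],
    ["Thomas", "Jefferson", "1801", "1809"],
]

data.sort(key=last_name)  # the function itself is passed, not its result
print([row[1] for row in data])
```

Note that last_name is passed without parentheses: sort calls it once per row and orders the rows by the returned values.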
Fortunately, this is Python, and "batteries are included"; we don't even have to write that function ourselves. We are given a tool that creates functions that return the nth element of a sequence (which is what we want here). It's called itemgetter (because it makes a function that gets the nth item of a sequence - "item" is more usual Python terminology; "element" is a more general CS term), and it lives in the operator module.
By the way, there are also much neater ways to handle the file opening/closing, and we don't need to write an explicit loop to handle reading the file - we can iterate directly over the file (for line in file: gives us the lines of the file in turn, one each time through the loop), and that means we can just use a list comprehension (look them up).
import operator
def main():
    # We'll set up 'infile' to refer to the opened input file, making sure it is automatically
    # closed once we're done with it. We do that with a 'with' block; we're "done with the file"
    # at the end of the block.
    with open(INPUT_FILE) as infile:
        # We want the split, rstripped line for each line in the infile, which is spelled:
        data = [line.rstrip().split('\t') for line in infile]
    # Now we re-arrange that data. We want to sort the data, using an item-getter for
    # item 1 (the last name) as the sort-key. That is spelled:
    data.sort(key=operator.itemgetter(1))
    # The output file has to be opened for writing, hence the 'w' mode:
    with open(OUTPUT_FILE, 'w') as outfile:
        # Let's say we want to write the formatted string for each line in the data.
        # Now we're taking action instead of calculating a result, so we don't want
        # a list comprehension any more - so we iterate over the items of the sorted data:
        for item in data:
            # The item already contains all the values we want to interpolate into the string,
            # in the right order; the % operator needs a tuple, so we convert the list first:
            outfile.write('%s %s was president from %s to %s\n' % tuple(item))
I did get this working with Karl's help above, although I had to edit the code to get it to work for me because of some errors I was getting. I eliminated those and ended up with this:
import operator
INPUT_FILE = 'presidents.txt'
OUTPUT_FILE2= 'president_NEW2.txt'
def main():
    with open(INPUT_FILE) as infile:
        data = [line.rstrip().split('\t') for line in infile]
    data.sort(key=operator.itemgetter(0))
    outfile=open(OUTPUT_FILE2,'w')
    for item in data:
        last=item[0]
        first=item[1]
        start=item[2]
        end=item[3]
        outfile.write('%s %s was president from %s to %s\n' % (last,first,start,end))
    outfile.close()  # make sure everything is flushed to disk
main()
