reading file into a dictionary - python

I was wondering if there is a way to read delimited text into a dictionary. I have been able to get it into lists with no problem; here is the code:
def _demo_fileopenbox():
    msg = "Pick A File!"
    msg2 = "Select a country to learn more about!"
    title = "Open files"
    default = "*.py"
    f = fileopenbox(msg, title, default=default)
    writeln("You chose to open file: %s" % f)
    c = []
    a = []
    p = []
    with open(f, 'r') as handle:
        reader = csv.reader(handle, delimiter='\t')
        for row in reader:
            c = c + [row[0]]
            a = a + [row[1]]
            p = p + [row[2]]
    while 1:
        reply = choicebox(msg=msg2, choices=c)
        writeln(reply + ";\tArea: " + a[(c.index(reply))] + " square miles \tPopulation: " + p[(c.index(reply))])
That code makes three lists because each line of text holds a country name, its area, and its population. I had it that way so that if I choose a country it gives me the corresponding information on population and area. Some people say a dictionary is a better approach, but first of all I don't think I can put three things into one spot in the dictionary. I need the country name to be the key and the population and area to be the info for that key. Two dictionaries could probably work, but I just don't know how to get from file to dictionary. Any help, please?

You could use two dictionaries, but you could also use a 2-tuple like this:
countries = {}
# ... other code as before
for row in reader:
    countries[row[0]] = (row[1], row[2])
Then you can iterate through it all like this:
for country, (area, population) in countries.iteritems():  # Python 2; use countries.items() in Python 3
    # ... Do stuff with country, area and population
... or you can access data on a specific country like this:
area, population = countries["USA"]
Finally, if you're planning to add more information in the future you might instead want to use a class as a more elegant way to hold the information - this makes it easier to write code which doesn't break when you add new stuff. You'd have a class something like this:
class Country(object):
    def __init__(self, name, area, population):
        self.name = name
        self.area = area
        self.population = population
And then your reading code would look something like this:
for row in reader:
    countries[row[0]] = Country(row[0], row[1], row[2])
Or if you have the constructor take the entire row rather than individual items you might find it easier to extend the format later, but you're also coupling the class more closely to the representation in the file. It just depends on how you think you might extend things later.
Then you can look things up like this:
country = countries["USA"]
print "Area is: %s" % (country.area,)
This has the advantage that you can add new methods to do more clever stuff in the future. For example, a method which returns the population density:
class Country(object):
    # ...
    def get_density(self):
        # assumes area and population were converted to numbers when the file was read
        return self.population / self.area
In general I would recommend classes over something like nested dictionaries once you're storing more than a couple of items per entry. They make your code easier to read and easier to extend later.
As with most programming issues, however, other approaches will work - it's a case of choosing the method that works best for you.
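For what it's worth, here is a minimal sketch of the class approach that converts the numeric fields when each object is built, so that get_density() divides numbers rather than the strings csv.reader hands back. It assumes the area and population columns hold plain numerals; the sample row is only there for illustration.

```python
class Country:
    """Holds one row of the file; numeric fields are converted on construction."""

    def __init__(self, name, area, population):
        self.name = name
        self.area = float(area)            # csv.reader yields strings
        self.population = int(population)

    def get_density(self):
        # People per unit of area
        return self.population / self.area


# Illustrative row, shaped like what csv.reader would produce
rows = [["Mordor", "175000", "3000000"]]
countries = {row[0]: Country(*row) for row in rows}

density = countries["Mordor"].get_density()
print(round(density, 2))
```

Converting in the constructor keeps the reading loop unchanged, at the cost of raising ValueError on a malformed number.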

Something like this should work:
myDict = {}
for row in reader:
    country, area, population = row
    myDict[country] = {'area': area, 'population': population}
Note that you'll have to add some error checking so that your code doesn't break if a row has more or fewer than three delimited items.
You can access the values as follows:
>>> myDict['Mordor']['area']
'175000'
>>> myDict['Mordor']['population']
'3000000'
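The error checking mentioned above might look something like the sketch below. It assumes tab-delimited rows of exactly three fields and simply records the line numbers of rows that don't fit; read_countries and the sample lines are hypothetical, not part of the original code.

```python
import csv
import io


def read_countries(handle, delimiter="\t"):
    """Return ({country: {'area': ..., 'population': ...}}, skipped_line_numbers)."""
    countries = {}
    skipped = []
    reader = csv.reader(handle, delimiter=delimiter)
    for line_number, row in enumerate(reader, start=1):
        if len(row) != 3:
            # Too many or too few fields: remember the line and move on
            skipped.append(line_number)
            continue
        country, area, population = row
        countries[country] = {"area": area, "population": population}
    return countries, skipped


# Made-up sample data; the second line is deliberately malformed
sample = io.StringIO("Mordor\t175000\t3000000\nbroken line\nGondor\t250000\t5000000\n")
countries, skipped = read_countries(sample)
print(countries["Mordor"]["area"])
print(skipped)
```

Whether to skip bad rows, as here, or to raise an error is a judgment call that depends on how much you trust the file.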

data = []
with open(f, 'r') as handle:
    reader = csv.reader(handle, delimiter='\t')
    for row in reader:
        (country, area, population) = row
        data.append({'country': country, 'area': area, 'population': population})
Data would then be a list of dictionaries.
But I'm not sure that's really a better approach, because it would use more memory. Another option is just a list of lists:
data = list(csv.reader(open(f), delimiter='\t'))
print data
# [['USA', 'big', '300 million'], ...]

The value in the dictionary can be a tuple of the population and area info. So when you read in the file you can do something such as:
countries_dict = {}
for row in reader:
    countries_dict[row[0]] = (row[1], row[2])

Related

Read csv to dict of lists of dicts

I have a data set with tons of (intentional) duplication. I'd like to collapse(?) that to make it better suited for my needs. The data reads like this:
Header1, Header2, Header3
Example1, Content1, Stuff1
Example1, Content2, Stuff2
Example1, Content3, Stuff3
Example2, Content1, Stuff1
Example2, Content5, Stuff5
etc...
And I want that to end up as a dict with column one's values as keys and lists of dicts as values to those keys like so:
{Example1 : [{Header2:Content1, Header3:Stuff1}, {Header2:Content2, Header3:Stuff2}, {Header2:Content3, Header3:Stuff3}],
Example2 : [{Header2:Content1, Header3:Stuff1}, {Header2:Content5, Header3:Stuff5}]}
I'm brand new to Python and a novice programmer over all so feel free to get clarification if this question is confusing. 😅 Thanks!
Update I was rightfully called out for not posting my example code (thanks for keeping me honest!) so here it is. The code below works but since I'm new to Python I don't know if it's well written or not. Also the dict ends up with the keys (Example1 and Example2) in reverse order. That doesn't really matter but I do not understand why.
def gather_csv_info():
    all_csv_data = []
    flattened_data = {}
    reading_csv = csv.DictReader(open(sys.argv[1], 'rb'))
    for row in reading_csv:
        all_csv_data.append(row)
    for row in all_csv_data:
        if row["course_sis_ids"] in flattened_data:
            flattened_data[row["course_sis_ids"]].append({"user_sis_ids":row["user_sis_ids"], "file_ids":row["file_ids"]})
        else:
            flattened_data[row["course_sis_ids"]] = [{"user_sis_ids":row["user_sis_ids"], "file_ids":row["file_ids"]}]
    return flattened_data
This code works but I don't know how pythonic it is and I don't understand why the flattened_data dict has the keys in reverse order that they appear in the original CSV. It doesn't strictly matter that they're not in order, but it is curious.
I completely changed the answer as you changed your question, so instead I just tidied the code in your own answer so it's more "Pythonic":
import csv
from collections import defaultdict
def gather_csv_info(filename):
    flattened_data = defaultdict(list)
    # 'rb' is for Python 2; in Python 3 use open(filename, newline='')
    with open(filename, 'rb') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            key = row["Header1"]
            flattened_data[key].append({"Header2":row["Header2"], "Header3":row["Header3"]})
    return flattened_data
print(gather_csv_info('data.csv'))
Not sure why you want the data in this format, but that's up to you.
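As a rough Python 3 sketch of the same grouping, using the sample data from the question: group_rows is a made-up name, and skipinitialspace=True is there on the assumption that the file really has a space after each comma. On the ordering question, dicts in Python 2 make no promise about key order, while from Python 3.7 on they keep insertion order, so the keys below come out in file order.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """Header1, Header2, Header3
Example1, Content1, Stuff1
Example1, Content2, Stuff2
Example1, Content3, Stuff3
Example2, Content1, Stuff1
Example2, Content5, Stuff5
"""


def group_rows(handle, key_field):
    """Group CSV rows by key_field; each value is a list of the remaining columns."""
    grouped = defaultdict(list)
    # skipinitialspace drops the blank that follows each comma in the sample
    reader = csv.DictReader(handle, skipinitialspace=True)
    for row in reader:
        key = row.pop(key_field)       # remove the key column from the row
        grouped[key].append(dict(row))  # keep whatever columns are left
    return dict(grouped)


grouped = group_rows(io.StringIO(SAMPLE), "Header1")
print(list(grouped))
print(grouped["Example2"])
```

Popping the key column means the function keeps working if more columns are added later, since nothing but the key name is hard-coded.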

Automate reading users from config file and apply a math function to the read users

how would I:
Automate reading in all users in row[0] of a config file csv and store those users
Then, run them all through a math function that will output different values for each of them
The format of the CSVs is the same: (DEREK,23, 12.344444444, 5)
Output expected: 143.34
Right now, I have user interaction with finding all the users but this needs to be changed to have a faster program.
with open("main.csv") as f_input:
    for row in csv.reader(f_input):
        data.append((row[0], int(row[1]), int(row[2]))) #row[0] being strings aka the users the rest are their values
with open("user_dat.csv") as usr_in:
    for rows in csv.reader(usr_in):
        usr_dat.append((rows[0], int(rows[1]), int(rows[2]), int(rows[3])))
with open("all_user_values.csv", 'wb') as out:
    writer = csv.writer(out)
    for usrs, val, val1 in usr_dat: # problem lies here because i have no idea how to go about doing this
        for usr, chng, cst in data:
            if act_name in usrs:
                if name in usr:
                    do(stuff)
Let's make things clear.
You have two separate lists.
Each list holds one tuple per user: a name followed by two or three int values.
You can iterate over each list (or, even better, over one list that combines the two) and call a function on each item, where an item is the whole tuple of values, like so:
validated_names = ['name1', 'name2']
def do_stuff(user_tuple):
    # the first item is the name; the remaining items are the int values
    name, int_values_tuple = user_tuple[0], user_tuple[1:]
    if name in validated_names:
        do_other_stuff(int_values_tuple)
def do_other_stuff(packed_values):
    if len(packed_values) == 2:
        calc_two_ints(packed_values)
    elif len(packed_values) == 3:
        calc_three_ints(packed_values)
    else:
        pass # No such cases for now.
def calc_two_ints(packed_values):
    val1, val2 = packed_values
    # Do something
def calc_three_ints(packed_values):
    val1, val2, val3 = packed_values
    # Do something
final_list = data + usr_dat # Both lists combined into one
for user_tuple in final_list:
    do_stuff(user_tuple)

Objects/classes/lists Python

I am confused about classes in python. I don't want anyone to write down raw code but suggest methods of doing it. Right now I have the following code...
def main():
    lst = []
    filename = 'yob' + input('Enter year: ') + '.txt'
    for line in open(filename):
        line = line.strip()
        lst.append(line.split(','))
What this code does is take an input for a file based on a year. The program is placed in a folder with a bunch of text files that have different years in their names. Then, I made a class...
class Names():
    __slots__ = ('Name', 'Gender', 'Occurences')
This class just defines what objects I should make. The goal of the project is to build objects and create lists based off these objects. My main function returns a list containing several elements that look like the following:
[[jon, M, 190203], ...]
These elements have a name in [0], a gender (M or F) in [1] and an occurrence count in [2]. I'm trying to find the top 20 Male and Female candidates and print them out.
Goal: There should be a function which creates a name entry, i.e. mkEntry. It should be passed the appropriate information, build a new object, populate the fields, and return it.
If all you want is a handy container class to hold your data in, I suggest using the namedtuple type factory from the collections module, which is designed for exactly this. You should probably also use the csv module to handle reading your file. Python comes with "batteries included", so learn to use the standard library!
from collections import namedtuple
import csv
Person = namedtuple('Person', ('name', 'gender', 'occurences')) # create our type
def main():
    filename = 'yob' + input('Enter year: ') + '.txt'
    with open(filename, newline="") as f: # parameters differ a bit in Python 2
        reader = csv.reader(f) # the reader handles splitting the lines for you
        lst = [Person(*row) for row in reader]
Note: If you're using Python 2, the csv module needs you to open the file in binary mode (with a second argument of 'rb') rather than using the newline parameter.
If your file had just the single person you used in your example output, you'd get a list with one Person object (the count stays a string because csv.reader does not convert types):
>>> print(lst)
[Person(name='jon', gender='M', occurences='190203')]
You can access the various values either by index (like a list or tuple) or by attribute name (like a custom object):
>>> jon = lst[0]
>>> print(jon[0])
jon
>>> print(jon.gender)
M
In your class, add an __init__ method, like this:
def __init__(self, name, gender, occurrences):
    self.Name = name
    # etc.
Now you don't need a separate "make" method; just call the class itself as a constructor:
myname = Names(lst[0], etc.)
And that's all there is to it.
If you really want an mkEntry function anyway, it'll just be a one-liner: return Names(etc.)
I know you said not to write out the code but it's just easier to explain it this way. You don't need to use slots - they're for a specialised optimisation purpose (and if you don't know what it is, you don't need it).
class Person(object):
    def __init__(self, name, gender, occurrences):
        self.name = name
        self.gender = gender
        self.occurrences = occurrences
def main():
    # read in the csv to create a list of Person objects
    people = []
    filename = 'yob' + input('Enter year: ') + '.txt'
    for line in open(filename):
        line = line.strip()
        fields = line.split(',')
        p = Person(fields[0], fields[1], int(fields[2]))
        people.append(p)
    # split into genders
    p_m = [p for p in people if p.gender == 'M']
    p_f = [p for p in people if p.gender == 'F']
    # sort each by occurrences descending
    p_m = sorted(p_m, key=lambda x: -x.occurrences)
    p_f = sorted(p_f, key=lambda x: -x.occurrences)
    # print out the first 20 of each
    for p in p_m[:20]:
        print(p.name, p.gender, p.occurrences)
    for p in p_f[:20]:
        print(p.name, p.gender, p.occurrences)
if __name__ == '__main__':
    main()
I've used a couple of features here that might look a little scary, but they're easy enough once you get used to them (and you'll see them all over python code). List comprehensions give us an easy way of filtering our list of people into genders. lambda gives you an anonymous function. The [:20] syntax says, give me the first 20 elements of this list - refer to list slicing.
Your case is quite simple and you probably don't even really need the class / objects but it should give you an idea of how you use them. There's also a csv reading library in python that will help you out if the csvs are more complex (quoted fields etc).
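To tie the pieces together, here is a small sketch of an mkEntry-style factory built on namedtuple, plus a top-N helper. mk_entry, top_names, and the sample rows other than jon are made up for illustration, and it assumes each line is name,gender,count with no header row.

```python
import csv
import io
from collections import namedtuple

Entry = namedtuple("Entry", ("name", "gender", "occurrences"))


def mk_entry(name, gender, occurrences):
    """Build one entry, converting the count from text to an int."""
    return Entry(name, gender, int(occurrences))


def top_names(entries, gender, n=20):
    """Return the n entries of the given gender with the highest counts."""
    matching = [e for e in entries if e.gender == gender]
    return sorted(matching, key=lambda e: e.occurrences, reverse=True)[:n]


# Stand-in for an open yob file
sample = io.StringIO("jon,M,190203\nann,F,1500\nbob,M,200\namy,F,3200\n")
entries = [mk_entry(*row) for row in csv.reader(sample)]

top_male = top_names(entries, "M", n=1)
top_female = top_names(entries, "F", n=2)
print(top_male)
print([e.name for e in top_female])
```

Sorting with reverse=True avoids the negation trick and would also work for keys that are not numbers.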

Maintaining order in large list of movies/ratings

I have a text file with hundreds of thousands of students, and their ratings for certain films organized with the first word being the student number, the second being the name of the movie (with no spaces), and the third being the rating they gave the movie:
student1000 Thor 1
student1001 Superbad -3
student1002 Prince_of_Persia:_The_Sands_of_Time 5
student1003 Old_School 3
student1004 Inception 5
student1005 Finding_Nemo 3
student1006 Tangled 5
I would like to arrange them in a dictionary so that I have each student mapped to a list of their movie ratings, where the ratings are in the same order for each student. In other words, I would like to have it like this:
{student1000 : [1, 3, -5, 0, 0, 3, 0,...]}
{student1001 : [0, 1, 0, 0, -3, 0, 1,...]}
Such that the first, second, third, etc. ratings for each student correspond to the same movies. The order is completely random for movies AND student numbers, and I'm having quite a bit of trouble doing this effectively. Any help in coming up with something that will minimize the big-O complexity of this problem would be awesome.
I ended up figuring it out. Here's the code I used for anyone wondering:
def get_movie_data(fileLoc):
    movieDic = {}
    movieList = set()
    f = open(fileLoc)
    setHold = set()
    for line in f:
        setHold.add(line.split()[1])
    f.close()
    movieList = sorted(setHold)
    f = open(fileLoc)
    for line in f:
        hold = line.strip().split()
        student = hold[0]
        movie = hold[1]
        rating = int(hold[2])
        if student not in movieDic:
            lst = [0]*len(movieList)
            movieDic[student] = lst
        hold2 = movieList.index(movie)
        rate = movieDic[student]
        rate[hold2] = rating
    f.close()
    return movieList, movieDic
Thanks for the help!
You can first build a dictionary of dictionaries:
{
'student1000' : {'Thor': 1, 'Superbad': 3, ...},
'student1001' : {'Thor': 0, 'Superbad': 1, ...},
...
}
Then you can go through that to get a master list of all the movies, establish an order for them (corresponding to the order within each student's rating list), and finally go through each student in the dictionary, converting the dictionary to the list you want. Or, like another answer said, just keep it as a dictionary.
defaultdict will probably come in handy. It lets you say that the default value for each student is an empty list (or dictionary) so you don't have to initialize it before you start appending values (or setting key-value pairs).
from collections import defaultdict
students = defaultdict(dict)
with open(filename, 'r') as f:
    for line in f.readlines():
        elts = line.split()
        student = elts[0]
        movie = elts[1]
        rating = int(elts[2])
        students[student][movie] = rating
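Going from that dictionary of dictionaries to the aligned lists could be sketched as below. build_rating_lists is a hypothetical helper, the movie order is simply alphabetical, unrated movies get 0 as in the question, and the third sample line is invented so that one student has two ratings.

```python
from collections import defaultdict


def build_rating_lists(lines):
    """Return (movies, {student: [rating per movie, in the order of movies]})."""
    students = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        student, movie, rating = line.split()
        students[student][movie] = int(rating)

    # One master list of movies fixes the column order for every student
    movies = sorted({m for ratings in students.values() for m in ratings})

    # dict.get is a constant-time lookup, so no list.index() scan is needed
    table = {
        student: [ratings.get(movie, 0) for movie in movies]
        for student, ratings in students.items()
    }
    return movies, table


sample = [
    "student1000 Thor 1",
    "student1001 Superbad -3",
    "student1000 Inception 5",
]
movies, table = build_rating_lists(sample)
print(movies)
print(table["student1000"])
```

The final step still touches every student and movie pair once, which is unavoidable if the output really has to be a dense list per student.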
So, the answers here are functionally the same as what you seem to be looking for, but as far as directly constructing the lists you're looking for, they seem to be answering slightly different questions. Personally I would prefer to do this in a more dynamic way. Since it doesn't seem to me like you actually know the movies that are going to be rated ahead of time, you've gotta keep some kind of running tally of that.
ratings = {}
allMovies = []
for line in file:  # file is an already-open file object
    info = line.split(" ")
    movie = info[1].strip().lower()
    student = info[0].strip().lower()
    rating = float(info[2].strip())
    if movie not in allMovies:
        allMovies.append(movie)
    movieIndex = allMovies.index(movie)
    if student not in ratings:
        # list.append() returns None, so build the zero-filled list directly
        ratings[student] = [0] * len(allMovies)
    elif len(allMovies) > len(ratings[student]):
        # list.extend() also returns None, so extend in place
        ratings[student].extend([0] * (len(allMovies) - len(ratings[student])))
    ratings[student][movieIndex] = rating
It's not the way I would approach this problem, but I think this solution is closest to the original intent of the question and you can use a buffer to feed in the lines if there's a memory issue, but unless your file is multiple gigabytes there shouldn't be an issue with that.
Just put the scores into a dictionary rather than a list. After you've read all the data, you can then extract the movie names and put them in any order you want. Assuming students can rate different movies, maintaining some kind of consistent order while reading the file, without knowing the order of the movies to begin with, seems like a lot of work.
If you're worried about the keys taking up a lot of memory, use intern() (sys.intern() in Python 3) on the keys to make sure you're only storing one copy of each string.

Help with Excel, Python and XLRD

Relatively new to programming hence why I've chosen to use python to learn.
At the moment I'm attempting to read a list of Usernames, passwords from an Excel Spreadsheet with XLRD and use them to login to something. Then back out and go to the next line. Log in etc and keep going.
Here is a snippet of the code:
import xlrd
wb = xlrd.open_workbook('test_spreadsheet.xls')
# Load XLRD Excel Reader
sheetname = wb.sheet_names() #Read for XCL Sheet names
sh1 = wb.sheet_by_index(0) #Login
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID
print readRows()
I've gotten the variables out and it reads all of them in one shot; here is where my lack of programming skills comes into play. I know I need to iterate through these and do something with them, but I'm kind of lost on what the best practice is. Any insight would be great.
Thank you again
A couple of pointers:
I'd suggest you not print a function that has no return value; instead just call it, or return something to print.
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID
readRows()
Or, looking at the docs, you can take a slice from row_values:
row_values(rowx, start_colx=0, end_colx=None)
Returns a slice of the values of the cells in the given row.
Because you just want the columns with index 4 to 6 (the end index of a slice is exclusive, so pass 7):
def readRows():
    # using list comprehension
    return [ sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows) ]
print readRows()
Using the second method you get a list back from your function, so you can assign all of the data read from the Excel file to a variable. The list is actually a list of lists containing your row values.
L1 = readRows()
for row in L1:
    print row[0], row[1], row[2]
After you have your data, you are able to manipulate it by iterating through the list, much like for the print example above.
def login(name, password, id):
    # do stuff with name, password and id passed into the function
    ...
for row in L1:
    login(*row)  # unpack the three values in each row
You may also want to look into different data structures for storing your data. If you need to find a user by name, a dictionary is probably your best bet:
def readRows():
    # using list comprehension
    rows = [ sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows) ]
    # each sliced row is [name, password, id], so index from 0
    return dict([ [row[0], (row[1], row[2])] for row in rows ])
D1 = readRows()
print D1['Bob']
('sdafdfadf',23)
import pprint
pprint.pprint(D1)
{'Bob': ('sdafdfadf',23),
'Cat': ('asdfa',24),
'Dog': ('fadfasdf',24)}
One thing to note is that dictionary entries come back in arbitrary order in Python versions before 3.7.
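A stripped-down sketch of the dictionary idea, without xlrd so it can run on its own: index_by_name and this login body are placeholders, and rows stands in for whatever readRows() returns after slicing out the three columns.

```python
def index_by_name(rows):
    """Map each user name to its (password, id) pair."""
    return {name: (password, sup_id) for name, password, sup_id in rows}


def login(name, password, sup_id):
    # Placeholder: a real version would perform the login here
    return "logging in %s (id %s)" % (name, sup_id)


# Stand-in for the list of [name, password, id] rows read from the sheet
rows = [["Bob", "sdafdfadf", 23], ["Cat", "asdfa", 24], ["Dog", "fadfasdf", 24]]
users = index_by_name(rows)

# Look up one user directly by name
password, sup_id = users["Bob"]

# Or walk through every user in a predictable order
messages = [login(name, *users[name]) for name in sorted(users)]
print(messages)
```

If two rows share a name, the later one silently replaces the earlier one, so this only fits when user names are unique.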
I'm not sure if you are intent on using xlrd, but you may want to check out PyWorkbooks (note, I am the writer of PyWorkbooks :D)
from PyWorkbooks.ExWorkbook import ExWorkbook
B = ExWorkbook()
B.change_sheet(0)
# Note: it might be B[:1000, 3:6]. I can't remember if xlrd uses pythonic addressing (0 is first row)
data = B[:1000,4:7] # gets a generator, the '1000' is arbitrarily large.
def readRows():
    while True:
        try:
            userNm, Password, supID = data.next() # you could also do data[0]
            print userNm, Password, supID
            if userNm == None: break # when there is no more data it stops
        except IndexError:
            print 'list too long'
readRows()
You will find that this is significantly faster (and easier I hope) than anything you would have done. Your method will get an entire row, which could be a thousand elements long. I have written this to retrieve data as fast as possible (and included support for such things as numpy).
In your case, speed probably isn't as important. But in the future, it might be :D
Check it out. Documentation is available with the program for newbie users.
http://sourceforge.net/projects/pyworkbooks/
Seems to be good. With one remark: you should replace "rows" by "cells" because you actually read values from cells in every single row
