Read csv to dict of lists of dicts - python

I have a data set with tons of (intentional) duplication. I'd like to collapse(?) that to make it better suited for my needs. The data reads like this:
Header1, Header2, Header3
Example1, Content1, Stuff1
Example1, Content2, Stuff2
Example1, Content3, Stuff3
Example2, Content1, Stuff1
Example2, Content5, Stuff5
etc...
And I want that to end up as a dict with column one's values as keys and lists of dicts as values to those keys like so:
{Example1 : [{Header2:Content1, Header3:Stuff1}, {Header2:Content2, Header3:Stuff2}, {Header2:Content3, Header3:Stuff3}],
Example2 : [{Header2:Content1, Header3:Stuff1}, {Header2:Content5, Header3:Stuff5}]}
I'm brand new to Python and a novice programmer over all so feel free to get clarification if this question is confusing. 😅 Thanks!
Update: I was rightfully called out for not posting my example code (thanks for keeping me honest!), so here it is. The code below works, but since I'm new to Python I don't know if it's well written or not. Also, the dict ends up with the keys (Example1 and Example2) in reverse order. That doesn't really matter, but I don't understand why.
import csv
import sys

def gather_csv_info():
    all_csv_data = []
    flattened_data = {}
    reading_csv = csv.DictReader(open(sys.argv[1], 'rb'))
    for row in reading_csv:
        all_csv_data.append(row)
    for row in all_csv_data:
        if row["course_sis_ids"] in flattened_data:
            flattened_data[row["course_sis_ids"]].append({"user_sis_ids": row["user_sis_ids"], "file_ids": row["file_ids"]})
        else:
            flattened_data[row["course_sis_ids"]] = [{"user_sis_ids": row["user_sis_ids"], "file_ids": row["file_ids"]}]
    return flattened_data


I completely rewrote this answer after you changed your question, so instead I have just tidied the code from your own update so it's more "Pythonic":
import csv
from collections import defaultdict

def gather_csv_info(filename):
    flattened_data = defaultdict(list)
    with open(filename, 'rb') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            key = row["Header1"]
            flattened_data[key].append({"Header2": row["Header2"], "Header3": row["Header3"]})
    return flattened_data

print(gather_csv_info('data.csv'))
Not sure why you want the data in this format, but that's up to you.
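For the record, the same collapse can be sketched end to end in modern Python 3, using an in-memory CSV string in place of the real file (the data below mirrors the example from the question):

```python
import csv
import io
from collections import defaultdict

# In-memory CSV standing in for the real file on disk
csv_text = """Header1,Header2,Header3
Example1,Content1,Stuff1
Example1,Content2,Stuff2
Example2,Content1,Stuff1
"""

flattened = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    key = row.pop("Header1")          # first column becomes the outer key
    flattened[key].append(dict(row))  # remaining columns form the inner dict

print(dict(flattened))
```

Popping the grouping column means the rest of the row becomes the inner dict with no hard-coded header names.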

Related

Short way to add dictionary item to a set only when its string value isn't empty?

I currently have this piece of code:
name_set = set()
reader = [{'name': 'value1'}, {'name': ''}, {'name': 'value2'}]
for row in reader:
    name = row.get('name', None)
    if name:
        name_set.add(name)
print(name_set)
In the real code the reader is a DictReader, but I use a list with dicts to represent this.
Note that the if name: check filters out:
an empty string present in the dictionary (thus "")
but it is not needed for a missing key, since that case is already covered by row.get('name', None).
Although I think this code is easily readable, I'm wondering if there is a shorter way, as it takes six lines simply to extract values from the dicts and save them in a set.
Your existing code is fine.
But since you asked for a "short" way, you could just use set comprehensions/arithmetic:
>>> reader = [{'name':'value1'}, {'name':''}, {'name':'value2'}]
>>> {d['name'] for d in reader} - {''}
{'value1', 'value2'}
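If the key might be missing entirely (not just empty), a small variation on the same idea still fits on one line; this is just a sketch with made-up data:

```python
reader = [{'name': 'value1'}, {'name': ''}, {}, {'name': 'value2'}]

# d.get('name') yields None for missing keys; subtract both sentinels at once
names = {d.get('name') for d in reader} - {'', None}
print(names)
```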

Python: Maintaining original order when zipping two lists into a dictionary

I am reading a CSV file and combining rows into dictionaries, with the first row containing the keys and the subsequent rows containing the values.
I want my dictionary keys to be in the same order as the original csv file, but the dict(zip()) function seems to order them randomly. I tried OrderedDict and that didn't work.
If there is a better way to produce my dictionaries I'm open to suggestions, but I would really like to know how I can do this while keeping my existing code, just because I am very new to Python (and programming in general) and I would like to be able to understand my own code at this point.
import csv  # imports the csv module

with open("csvfile.csv", "r") as file_var:
    reader = csv.reader(file_var)
    my_list = []
    for row in reader:
        if len(row) != 0:
            my_list = my_list + [row]

for i in range(1, len(my_list)):
    user = dict(zip(my_list[0], my_list[i]))
    print "----------------------"
    print user['first_name'], user['last_name']
    for key in user:
        print key, user[key]
Dictionaries have an arbitrary order (before Python 3.7 they do not preserve insertion order). You should use an OrderedDict instead.
from collections import OrderedDict
user = OrderedDict(zip(my_list[0], my_list[i]))
etc.
I note you say it didn't work, but I see no reason why it wouldn't. In what way did it fail?
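To illustrate, here is a minimal sketch with made-up header and value rows (note that in Python 3.7+ a plain dict also preserves insertion order, so this matters mainly on older versions):

```python
from collections import OrderedDict

header = ['first_name', 'last_name', 'email']    # hypothetical first CSV row
values = ['Ada', 'Lovelace', 'ada@example.com']  # hypothetical data row

user = OrderedDict(zip(header, values))
print(list(user.keys()))  # keys come back in the original column order
```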

How do I write the contents of nested dictionaries to a file in a certain format?

I have a dictionary of dictionaries, and I'm trying to output the information within them in a certain way so that it will be usable for downstream analysis. Note: all the keys in dict are also in list.
for item in list:
    for key, value in dict[item].items():
        print item, key, value
This is the closest I've gotten to what I want, but it's still a long way off. Ideally what I want is:
item1 item2 item3 item4
key1 value value value value
key2 value value value value
key2 value value value value
Is this even possible?
First, if I understand your structure, the list is just a way of ordering the keys for the outer dictionary, and a lot of your complexity is trying to use these two together to simulate an ordered dictionary. If so, there's a much easier way to do that: use collections.OrderedDict. I'll come back to that at the end.
To start, you need to get all of the keys of your sub-dictionaries, because those are the rows of your output.
From comments, it sounds like all of the sub-dictionaries in dct have the same keys, so you can just pull the keys out of any arbitrary one of them:
keys = dct.values()[0].keys()
If each sub-dictionary can have a different subset of keys, you'll need to instead do a first pass over dct to get all the keys:
keys = reduce(set.union, map(set, dct.values()))
Some people find reduce hard to understand, even when you're really just using it as "sum with a different operator". For them, here's how to do the same thing explicitly:
keys = set()
for subdct in dct.values():
    keys |= set(subdct)
Now, for each key's row, we need to get a column for each sub-dictionary (that is, each value in the outer dictionary), in the order specified by using the elements of the list as keys into the outer dictionary.
So, for each column item, we want to get the outer-dictionary value corresponding to the key in item, and then in the resulting sub-dictionary, get the value corresponding to the row's key. That's hard to say in English, but in Python, it's just:
dct[item][key]
If you don't actually have all the same keys in all of the sub-dictionaries, it's only slightly more complicated:
dct[item].get(key, '')
So, if you didn't want any headers, it would look like this:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    for key in keys:
        w.writerow(dct[item].get(key, '') for item in lst)
To add a header column, just prepend the header (in this case, key) to each of those rows:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    for key in keys:
        w.writerow([key] + [dct[item].get(key, '') for item in lst])
Notice that I turned the genexp into a list comprehension so I could use list concatenation to prepend the key. It's conceptually cleaner to leave it as an iterator, and prepend with itertools.chain, but in trivial cases like this with tiny iterables, I think that's just making the code harder to read:
from itertools import chain

with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    for key in keys:
        w.writerow(chain([key], (dct[item].get(key, '') for item in lst)))
You also want a header row. That's even easier; it's just the items in the list, with a blank column prepended for the header column:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    w.writerow([''] + lst)
    for key in keys:
        w.writerow([key] + [dct[item].get(key, '') for item in lst])
However, there are two ways to make things even simpler.
First, you can use an OrderedDict, so you don't need the separate key list. If you're stuck with the separate list and dict, you can still build an OrderedDict on the fly to make your code easier to read. For example:
od = collections.OrderedDict((item, dct[item]) for item in lst)
And now:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    w.writerow([''] + od.keys())
    for key in keys:
        w.writerow([key] + [subdct.get(key, '') for subdct in od.values()])
Second, you could just build the transposed structure:
transposed = {key_b: {key_a: dct[key_a].get(key_b, '') for key_a in dct}
              for key_b in keys}
And then iterate over it in the obvious order (or use a DictWriter to handle the ordering of the columns for you, and use its writerows method to deal with the rows, so the whole thing becomes a one-liner).
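That DictWriter variant can be sketched as follows, written self-contained in Python 3 with io.StringIO standing in for the output file (the dct, lst, and keys values below are made up):

```python
import csv
import io

# Hypothetical stand-ins for the question's data
dct = {'item1': {'key1': 'a', 'key2': 'b'},
       'item2': {'key1': 'c', 'key2': 'd'}}
lst = ['item1', 'item2']
keys = ['key1', 'key2']

transposed = {key_b: {key_a: dct[key_a].get(key_b, '') for key_a in lst}
              for key_b in keys}

buf = io.StringIO()
# the '' fieldname produces the blank header cell above the key column
w = csv.DictWriter(buf, fieldnames=[''] + lst, delimiter='\t')
w.writeheader()
for key in keys:
    w.writerow({'': key, **transposed[key]})
print(buf.getvalue())
```

DictWriter handles the column ordering from fieldnames, so each row only needs the key label merged in.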
To store objects in Python so that you can re-use them later, you can use the shelve module. This module lets you write objects to a shelf file and re-open it and retrieve the objects later. It is operating-system-dependent, though, so a shelf made on a Mac may not open later on a Windows machine.
import shelve

shelf = shelve.open("filename", flag='c')
# with flag='c', you have to delete the old shelf if you want to overwrite it
dict1 = {}  # something
dict2 = {}  # something
shelf['key1'] = dict1
shelf['key2'] = dict2
shelf.close()
To read objects from a shelf:
shelf_reader = shelve.open("filename", flag='r')
for k in shelf_reader.keys():
    retrieved = shelf_reader[k]
    print(retrieved)  # prints the retrieved dictionary
shelf_reader.close()
It may be a matter of opinion, but I think one of the best (and by far easiest) ways to serialize a (nested) dictionary is the JSON format:
{ "key1" : { "subkey1" : "value1",
             "subkey2" : "value2" },
  "key2" : { "subkey3" : "value3" } }
The best part is that encoding your values (and decoding them) can each be done in a single line using the builtin json module!
Let's say your dictionary is the dico variable:
import json

with open('save_file', 'w') as save_file:
    save_file.write(json.dumps(dico))
Et voilà :-) !
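And reading it back is just as short. A round-trip sketch (the filename and data are arbitrary):

```python
import json

dico = {"key1": {"subkey1": "value1", "subkey2": "value2"},
        "key2": {"subkey3": "value3"}}

with open('save_file.json', 'w') as f:
    json.dump(dico, f)       # encode straight to the file

with open('save_file.json') as f:
    restored = json.load(f)  # decode straight from the file

print(restored == dico)
```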
If the data is guaranteed to be loaded back into Python, I'd suggest simply using pickle instead of worrying about the format. If it's going to be loaded into another standard language, then consider using json instead - there are libraries for most languages to parse JSON format data.
That said if you really need to invent your own format, you could do something like this to store all keys from all sub-dictionaries in CSV format:
import csv

dict_keys = sorted(dict.keys())
with open("output.csv", "wb") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Key"] + dict_keys)
    all_keys = reduce(set.union, (set(d) for d in dict.values()))
    for key in sorted(all_keys):
        writer.writerow([key] + [dict[k].get(key, "") for k in dict_keys])

reading file into a dictionary

I was wondering if there is a way that I can read delimited text into a dictionary. I have been able to get it into lists no problem; here is the code:
def _demo_fileopenbox():
    msg = "Pick A File!"
    msg2 = "Select a country to learn more about!"
    title = "Open files"
    default = "*.py"
    f = fileopenbox(msg, title, default=default)
    writeln("You chose to open file: %s" % f)
    c = []
    a = []
    p = []
    with open(f, 'r') as handle:
        reader = csv.reader(handle, delimiter='\t')
        for row in reader:
            c = c + [row[0]]
            a = a + [row[1]]
            p = p + [row[2]]
    while 1:
        reply = choicebox(msg=msg2, choices=c)
        writeln(reply + ";\tArea: " + a[(c.index(reply))] + " square miles \tPopulation: " + p[(c.index(reply))])
That code makes three lists because each line of text is a country name, its area, and its population. I had it that way so that if I choose a country it will give me the corresponding information on population and area. Some people say a dictionary is a better approach, but I don't think I can put three things into one spot in a dictionary. I need the country name to be the key, and then the population and area the info for that key. Two dictionaries could probably work, but I just don't know how to get from file to dictionary. Any help, please?
You could use two dictionaries, but you could also use a 2-tuple like this:
countries = {}
# ... other code as before
for row in reader:
    countries[row[0]] = (row[1], row[2])
Then you can iterate through it all like this:
for country, (area, population) in countries.iteritems():
    # ... Do stuff with country, area and population
... or you can access data on a specific country like this:
area, population = countries["USA"]
Finally, if you're planning to add more information in the future you might instead want to use a class as a more elegant way to hold the information - this makes it easier to write code which doesn't break when you add new stuff. You'd have a class something like this:
class Country(object):
    def __init__(self, name, area, population):
        self.name = name
        self.area = area
        self.population = population
And then your reading code would look something like this:
for row in reader:
    countries[row[0]] = Country(row[0], row[1], row[2])
Or if you have the constructor take the entire row rather than individual items you might find it easier to extend the format later, but you're also coupling the class more closely to the representation in the file. It just depends on how you think you might extend things later.
Then you can look things up like this:
country = countries["USA"]
print "Area is: %s" % (country.area,)
This has the advantage that you can add new methods to do more clever stuff in the future. For example, a method which returns the population density:
class Country(object):
    # ...
    def get_density(self):
        return self.population / self.area
In general I would recommend classes over something like nested dictionaries once you get beyond something where you're storing more than a couple of items. They make your code easier to read and easier to extend later.
As with most programming issues, however, other approaches will work - it's a case of choosing the method that works best for you.
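As a runnable sketch of how the class pays off, here is get_density with the one wrinkle the CSV introduces: the fields arrive as strings, so they need converting to numbers first (the figures below are made up):

```python
class Country(object):
    def __init__(self, name, area, population):
        self.name = name
        self.area = float(area)              # CSV fields arrive as strings
        self.population = float(population)

    def get_density(self):
        return self.population / self.area   # people per square mile

usa = Country('USA', '3797000', '331000000')  # hypothetical row values
print(usa.get_density())
```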
Something like this should work:
myDict = {}
for row in reader:
    country, area, population = row
    myDict[country] = {'area': area, 'population': population}
Note that you'll have to add some error checking so that your code doesn't break if there are greater or less than three delimited items in each row.
You can access the values as follows:
>>> myDict['Mordor']['area']
175000
>>> myDict['Mordor']['population']
3000000
data = []
with open(f, 'r') as handle:
    reader = csv.reader(handle, delimiter='\t')
    for row in reader:
        (country, area, population) = row
        data.append({'country': country, 'area': area, 'population': population})
Data would then be a list of dictionaries.
But I'm not sure that's really a better approach, because it would use more memory. Another option is just a list of lists:
data = list(csv.reader(open(f), delimiter='\t'))
print data
# [['USA', 'big', '300 million'], ...]
The value of the dictionary can be a tuple of the population and area info. So when you read in the file you can do something such as:
countries_dict = {}
for row in reader:
    countries_dict[row[0]] = (row[1], row[2])

Help with Excel, Python and XLRD

Relatively new to programming, hence why I've chosen Python to learn.
At the moment I'm attempting to read a list of usernames and passwords from an Excel spreadsheet with xlrd and use them to log in to something, then back out, go to the next line, log in, and so on.
Here is a snippet of the code:
import xlrd

wb = xlrd.open_workbook('test_spreadsheet.xls')  # Load XLRD Excel Reader
sheetname = wb.sheet_names()  # Read for XCL Sheet names
sh1 = wb.sheet_by_index(0)  # Login

def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID

print readRows()
I've gotten the variables out and it reads all of them in one shot. Here is where my lack of programming skills comes into play: I know I need to iterate through these and do something with them, but I'm kind of lost on what the best practice is. Any insight would be great.
Thank you again
A couple of pointers:
I'd suggest you not print a function with no return value; instead just call it, or return something to print.
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID

readRows()
Or, looking at the docs, you can take a slice from the row_values:
row_values(rowx, start_colx=0, end_colx=None)
Returns a slice of the values of the cells in the given row.
Because you just want the columns with index 4 - 6 (note that end_colx is exclusive, like a Python slice):
def readRows():
    # using a list comprehension
    return [sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows)]

print readRows()
Using the second method you get a list return value from your function; you can use this function to set a variable with all of the data you read from the Excel file. The list is actually a list of lists containing your row values.
L1 = readRows()
for row in L1:
    print row[0], row[1], row[2]
After you have your data, you are able to manipulate it by iterating through the list, much like for the print example above.
def login(name, password, id):
    # do stuff with name, password and id passed into the method
    ...

for row in L1:
    login(row[0], row[1], row[2])
You may also want to look into different data structures for storing your data. If you need to find a user by name, a dictionary is probably your best bet:
def readRows():
    # using a list comprehension
    rows = [sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows)]
    return dict([[row[0], (row[1], row[2])] for row in rows])

D1 = readRows()
print D1['Bob']
('sdfadfadf', 23)
import pprint
pprint.pprint(D1)
{'Bob': ('sdafdfadf', 23),
 'Cat': ('asdfa', 24),
 'Dog': ('fadfasdf', 24)}
One thing to note is that dictionary values are returned in arbitrary order in Python (before 3.7).
I'm not sure if you are intent on using xlrd, but you may want to check out PyWorkbooks (note: I am the writer of PyWorkbooks :D)
from PyWorkbooks.ExWorkbook import ExWorkbook

B = ExWorkbook()
B.change_sheet(0)

# Note: it might be B[:1000, 3:6]. I can't remember if xlrd uses
# pythonic addressing (0 is first row)
data = B[:1000, 4:7]  # gets a generator; the '1000' is arbitrarily large

def readRows():
    while True:
        try:
            userNm, Password, supID = data.next()  # you could also do data[0]
            if userNm is None:
                break  # when there is no more data it stops
            print userNm, Password, supID
        except IndexError:
            print 'list too long'

readRows()
You will find that this is significantly faster (and easier I hope) than anything you would have done. Your method will get an entire row, which could be a thousand elements long. I have written this to retrieve data as fast as possible (and included support for such things as numpy).
In your case, speed probably isn't as important. But in the future, it might be :D
Check it out. Documentation is available with the program for newbie users.
http://sourceforge.net/projects/pyworkbooks/
Seems good, with one remark: you should replace "rows" with "cells", because you actually read values from cells in every single row.
