Writing an Array of Dictionaries to CSV - python

I'm trying to get the dictionary (which the first part of the program generates) to write to a csv so that I can perform further operations on the data in excel. I realize the code isn't efficient but at this point I'd just like it to work. I can deal with speeding it up later.
import csv
import pprint
raw_data = csv.DictReader(open("/Users/David/Desktop/crimestats/crimeincidentdata.csv", "r"))
neighborhood = []
place_count = {}
stats = []
for row in raw_data:
    neighborhood.append(row["Neighborhood"])
for place in set(neighborhood):
    place_count.update({place: 0})
for key, value in place_count.items():
    for place in neighborhood:
        if key == place:
            place_count[key] = place_count[key] + 1
for key in place_count:
    stats.append([{"Location": str(key)}, {"Volume": str(place_count[key])}])
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(stats)
The program is still running fine to this point, as is evident from the pprint output:
[ [{'Location': 'LINNTON'}, {'Volume': '109'}],
[{'Location': 'SUNDERLAND'}, {'Volume': '118'}],
[{'Location': 'KENTON'}, {'Volume': '715'}]
This is where the error is definitely happening. The program writes the headers to the csv just fine then throws the ValueError.
fieldnames = ['Location', 'Volume']
with open('/Users/David/Desktop/crimestats/localdata.csv', 'w', newline='') as output_file:
csvwriter = csv.DictWriter(output_file, delimiter=',', fieldnames=fieldnames, dialect='excel')
csvwriter.writeheader()
for row in stats:
csvwriter.writerow(row)
output_file.close()
I've spent quite a bit of time searching for this problem, but none of the suggestions I have attempted to use have worked. I figure I must be missing something, so I'd really appreciate any and all help.
Traceback (most recent call last):
  File "/Users/David/Desktop/crimestats/statsreader.py", line 34, in <module>
    csvwriter.writerow(row)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 153, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 149, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: {'Location': 'SABIN'}, {'Volume': '247'}

I believe your problem is here:
for key in place_count:
    stats.append([{"Location": str(key)}, {"Volume": str(place_count[key])}])
This is creating a list of two dictionaries. The first has only a "Location" key, and the second has only a "Volume" key. However, the csv.DictWriter objects are expecting a single dictionary per row, with all the keys in the dictionary. Change that code snippet to the following and it should work:
for key in place_count:
    stats.append({"Location": str(key), "Volume": str(place_count[key])})
That should take care of the errors you're seeing.
Now, as for why the error message is complaining about fields not in fieldnames, which completely misled you away from the real problem you're having: the writerow() function expects to get a dictionary as its row parameter, but you're passing it a list. The result is confusion: it iterates over the dict in a for loop expecting to get the dict's keys (because that's what you get when you iterate over a dict in Python), and it compares those keys to the values in the fieldnames list. What it's expecting to see is:
"Location"
"Volume"
in either order (because a Python dict makes no guarantees about which order it will return its keys). The reason why they want you to pass in a fieldnames list is so that the fields can be written to the CSV in the correct order. However, because you're passing in a list of two dictionaries, when it iterates over the row parameter, it gets the following:
{'Location': 'SABIN'}
{'Volume': '247'}
Now, the dictionary {'Location': 'SABIN'} does not equal the string "Location", and the dictionary {'Volume': '247'} does not equal the string "Volume", so the writerow() function thinks it's found dict keys that aren't in the fieldnames list you supplied, and it throws that exception. What was really happening was "you passed me a list of two dicts-of-one-key, when I expected a single dict-with-two-keys", but the function wasn't written to check for that particular mistake.
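To see the contrast concretely, here's a minimal sketch (using io.StringIO so nothing touches disk; the sample values come from the traceback above):

```python
import csv
import io

# A DictWriter row must be a single dict whose keys match fieldnames.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Location", "Volume"])
writer.writeheader()
writer.writerow({"Location": "SABIN", "Volume": "247"})  # one dict, two keys

print(buf.getvalue())  # two lines: the header, then the row
```

Passing `[{"Location": "SABIN"}, {"Volume": "247"}]` to that same writerow() call reproduces the ValueError from the question.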
Now I'll mention a couple things you could do to speed up your code. One thing that will help quite a bit is to reduce those three for loops at the start of your code down to just one. What you're trying to do is to go through the raw data, and count the number of times each neighborhood shows up. First I'll show you a better way to do that, then I'll show you an even better way that improves on my first solution.
The better way to do that is to make use of the wonderful defaultdict class that Python provides in the collections module. defaultdict is a subclass of Python's dictionary type, which will automatically create dict entries when they're accessed for the first time. Its constructor takes a single parameter, a function which will be called with no parameters and should return the desired default value for any new item. If you had used defaultdict for your place_count dict, this code:
place_count = {}
for place in set(neighborhood):
    place_count.update({place: 0})
could simply become:
place_count = defaultdict(int)
What's going on here? Well, the int function (which really isn't a function, it's the constructor for the int class, but that's a bit beyond the scope of this explanation) just happens to return 0 if it's called with no parameters. So instead of writing your own function def returnzero(): return 0, you can just use the existing int function (okay, constructor). Now every time you do place_count["NEW PLACE"], the key NEW PLACE will automatically appear in your place_count dictionary, with the value 0.
Now, your counting loop needs to be modified too: it used to go over the keys of place_count, but now that place_count automatically creates its keys the first time they're accessed, you need a different source. But you still have that source in the raw data: the row["Neighborhood"] value for each row. So your for key,value in place_count.items(): loop could become:
for row in raw_data:
    place = row["Neighborhood"]
    place_count[place] = place_count[place] + 1
And now that you're using a defaultdict, you don't even need that first loop (the one that created the neighborhood list) at all! So we've just turned three loops into one. The final version of what I'm suggesting looks like this:
from collections import defaultdict

place_count = defaultdict(int)
for row in raw_data:
    place = row["Neighborhood"]
    place_count[place] = place_count[place] + 1
    # Or: place_count[place] += 1
However, there's a way to improve that even more. The Counter object from the collections module is designed for just this case, and has some handy extra functionality, like the ability to retrieve the N most common items. So the final final version :-) of what I'm suggesting is:
from collections import Counter

place_count = Counter()
for row in raw_data:
    place = row["Neighborhood"]
    place_count[place] = place_count[place] + 1
    # Or: place_count[place] += 1
That way if you need to retrieve the 5 most crime-ridden neighborhoods, you can just call place_count.most_common(5).
You can read more about Counter and defaultdict in the documentation for the collections module.
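Putting the whole pipeline together, a sketch might look like this (the function name and file paths are placeholders; note that Counter can also consume an iterable directly, which removes the explicit counting loop entirely):

```python
import csv
from collections import Counter


def neighborhood_volumes(in_path, out_path):
    """Count crime incidents per neighborhood and write a Location/Volume CSV."""
    with open(in_path, newline='') as f:
        # Counter accepts any iterable, so the whole tally is one expression.
        place_count = Counter(row["Neighborhood"] for row in csv.DictReader(f))
    with open(out_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=["Location", "Volume"])
        writer.writeheader()
        # most_common() with no argument yields all places, busiest first.
        for place, count in place_count.most_common():
            writer.writerow({"Location": place, "Volume": count})
```

This keeps one dict per row, so DictWriter is happy, and the output arrives pre-sorted by volume.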

Related

Iterate through the dict, store the key/value, then iterate again to look for a similar word in the dict and delete it from the dict, e.g. (Light1on, Light1off), in Python

[I had a problem iterating through a dict to find pairs of similar keys, outputting them, and then deleting them from the dict.]
My intention is to generate random output labels and store them in a dictionary, then iterate through the dictionary and store the first key, then iterate through the dictionary again to search for a similar key; e.g. Light1on and Light1off both contain Light1, so I want to get the values of both keys and store them in a table in their respective columns,
such as
Dict = {Light1on, Light2on, Light1off...}
store the value for Light1on, then iterate through the dictionary to get e.g. Light1off, then store Light1on:value1 and Light1off:value2 into a table or DF with column names On:value1, Off:value2.
As I don't know how to insert the code as code, I can only provide an image. Sorry for the trouble; it's my first time asking a question here. Thanks.
from collections import defaultdict
import difflib, random

olist = []
input = 10
olist1 = ['Light1on','Light2on','Fan1on','Kettle1on','Heater1on']
olist2 = ['Light2off','Kettle1off','Light1off','Fan1off','Heater1off']
events = list(range(input + 1))
for i in range(len(olist1)):
    output1 = random.choice(olist1)
    print(output1,'1')
    olist1.remove(output1)
    output2 = random.choice(olist2)
    print(output2,'2')
    olist2.remove(output2)
    olist.append(output1)
    olist.append(output2)
print(olist,'3')
outputList = {olist[i]:events[i] for i in range(10)}
print(str(outputList),'4')
# Iterating through the keys finding a pair match
for s in range(5):
    for i in outputList:
        if i == list(outputList)[0]:
            skeys = difflib.get_close_matches(i, outputList, n=2, cutoff=0.75)
            print(skeys,'5')
            del outputList[skeys]
# Modified Dictionary
difflib.get_close_matches('anlmal', ['car', 'animal', 'house', 'animaltion'])
['animal']
Update: I was unable to delete the pair of similar keys from the dictionary after finding the pair in the dictionary.
You're probably getting an error about a dictionary changing size during iteration. That's because you're deleting keys from a dictionary you're iterating over, and Python doesn't like that:
d = {1:2, 3:4}
for i in d:
    del d[i]
That will throw:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
To work around that, one solution is to store a list of the keys you want to delete, then delete all those keys after you've finished iterating:
keys_to_delete = []
d = {1:2, 3:4}
for i in d:
    if i % 2 == 1:
        keys_to_delete.append(i)
for i in keys_to_delete:
    del d[i]
Ta-da! Same effect, but this way avoids the error.
Also, your code above doesn't call the difflib.get_close_matches function properly. You can use help(difflib.get_close_matches) to see how you are meant to call that function. You need to provide a second argument that indicates the items to which you wish to compare your first argument for possible matches.
All of that said, I have a feeling that you can accomplish your fundamental goals much more simply. If you spend a few minutes describing what you're really trying to do (this shouldn't involve any references to data types, it should just involve a description of your data and your goals), then I bet someone on this site can help you solve that problem much more simply!
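For example, if the underlying goal is just to pair each device's "on" and "off" entries, one simple sketch (assuming every key ends in either "on" or "off"; the function name is made up) is to group on the key with the suffix stripped, with no difflib needed at all:

```python
def pair_events(events):
    """Group {'Light1on': v1, 'Light1off': v2, ...} into
    {'Light1': {'on': v1, 'off': v2}, ...}."""
    paired = {}
    for key, value in events.items():
        # Check 'off' first purely for clarity; the two suffixes don't overlap.
        if key.endswith("off"):
            name, state = key[:-3], "off"
        elif key.endswith("on"):
            name, state = key[:-2], "on"
        else:
            continue  # skip keys that fit neither pattern
        paired.setdefault(name, {})[state] = value
    return paired
```

Building a new structure instead of deleting from the one you're iterating over also sidesteps the RuntimeError entirely.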

How to create a nested python dictionary with keys as strings?

Summary of issue: I'm trying to create a nested Python dictionary, with keys defined by pre-defined variables and strings. And I'm populating the dictionary from regular expressions outputs. This mostly works. But I'm getting an error because the nested dictionary - not the main one - doesn't like having the key set to a string, it wants an integer. This is confusing me. So I'd like to ask you guys how I can get a nested python dictionary with string keys.
Below I'll walk you through the steps of what I've done. What is working, and what isn't. Starting from the top:
# Regular expressions module
import re
# Read text data from a file
file = open("dt.cc", "r")
dtcc = file.read()
# Create a list of stations from regular expression matches
stations = sorted(set(re.findall(r"\n(\w+)\s", dtcc)))
The result is good, and is as something like this:
stations = ['AAAA','BBBB','CCCC','DDDD']
# Initialize a new dictionary
rows = {}
# Loop over each station in the station list, and start populating
for station in stations:
    rows[station] = re.findall("%s\s(.+)" %station, dtcc)
The result is good, and is something like this:
rows['AAAA'] = ['AAAA 0.1132 0.32 P',...]
However, when I try to create a sub-dictionary with a string key:
for station in stations:
    rows[station] = re.findall("%s\s(.+)" %station, dtcc)
    rows[station]["dt"] = re.findall("%s\s(\S+)" %station, dtcc)
I get the following error.
"TypeError: list indices must be integers, not str"
It doesn't seem to like that I'm specifying the second dictionary key as "dt". If I give it a number instead, it works just fine. But then my dictionary key name is a number, which isn't very descriptive.
Any thoughts on how to get this working?
The issue is that by doing
rows[station] = re.findall(...)
You are creating a dictionary with the station names as keys and the return values of the re.findall method as values, which happen to be lists. So when you then call
rows[station]["dt"] = re.findall(...)
on the LHS, rows[station] is a list that is indexed by integers, which is what the TypeError is complaining about. You could do rows[station][0], for example, to get the first match from the regex. But you said you want a nested dictionary. You could do
rows[station] = dict()
rows[station]["dt"] = re.findall(...)
To make it a bit nicer, a data structure that you could use instead is a defaultdict from the collections module.
The defaultdict is a dictionary that accepts a default factory for its values: you pass a type's constructor as its argument. For example, dictlist = defaultdict(list) defines a dictionary whose values are lists! Then immediately doing dictlist[key].append(item1) is legal, as the empty list is created automatically the first time the key is accessed.
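For instance, a quick self-contained illustration of that behavior (toy keys, unrelated to the station data):

```python
from collections import defaultdict

# Values default to a fresh empty list the first time a key is touched.
dictlist = defaultdict(list)
dictlist["a"].append(1)  # no KeyError: the list is created automatically
dictlist["a"].append(2)
dictlist["b"].append(3)

print(dict(dictlist))  # -> {'a': [1, 2], 'b': [3]}
```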
In your case you could do
from collections import defaultdict

rows = defaultdict(dict)
for station in stations:
    rows[station]["bulk"] = re.findall("%s\s(.+)" %station, dtcc)
    rows[station]["dt"] = re.findall("%s\s(\S+)" %station, dtcc)
Where you have to assign the first regex result to a new key, "bulk" here but you can call it whatever you like. Hope this helps.

How to read 2 different text files line by line and make another file containing a dictionary using python?

I have two text files named weburl.txt and imageurl.txt; weburl.txt contains website URLs and imageurl.txt contains image URLs. I want to create a dictionary that reads a line of weburl.txt to make a dictionary key, with the corresponding line of imageurl.txt as its value.
weburl.txt
url1
url2
url3
url4
url5......
imageurl.txt
imgurl1
imgurl2
imgurl3
imgurl4
imgurl5
required output is
{'url1': imgurl1, 'url2': imgurl2, 'url3': imgurl3......}
I am using this code
with open('weburl.txt') as f:
    key = f.readlines()
with open('imageurl.txt') as g:
    value = g.readlines()
dict[key] = [value]
print dict
I am not getting the required results
you can write something like
with open('weburl.txt') as f, \
     open('imageurl.txt') as g:
    # we use the `str.strip` method
    # to remove newline characters
    keys = (line.strip() for line in f)
    values = (line.strip() for line in g)
    result = dict(zip(keys, values))
    print(result)
More info about zip is in the docs.
There are problems with the statement dict[key] = [value] on so many levels that I get a kind of vertigo as we drill down through them:
The apparent intention to use a variable called dict (a bad idea because it would overshadow Python's builtin reference to the dict class). Let's call it d instead.
Not initializing the dictionary instance first. If you had called it something like d, this oversight would earn you an easy-to-understand NameError. However, since you're calling it dict, Python will actually be attempting to set items in the dict class itself (which doesn't support __setitem__) instead of inside a dict instance, so you'll get a different, more-confusing error.
Attempting to make a dict entry assignment where the key is a non-hashable type (key is a list). You could convert the list to the hashable type tuple easily enough, but that's not what you want because you'd still be...
Attempting to assign a bunch of values to their respective keys all at once. This can't be done with d[key] = value syntax. It could be done all in one relatively simple statement, i.e. d = dict(zip(key, value)), but unfortunately that doesn't get around the fact that you're...
Not stripping the newline character off the end of each key and value.
Instead, this line:
d = dict((k.strip(), v.strip()) for k, v in zip(key, value))
will do what you appear to want.
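Folding those fixes back into the original structure, a corrected sketch (keeping the question's file names, wrapped in a made-up function so the paths are parameters) might be:

```python
def urls_to_dict(web_path, image_path):
    """Pair line i of web_path with line i of image_path in a dict."""
    with open(web_path) as f:
        key = f.readlines()
    with open(image_path) as g:
        value = g.readlines()
    # strip() removes the trailing newline from each key and value
    return dict((k.strip(), v.strip()) for k, v in zip(key, value))
```

Given the question's sample files, `urls_to_dict('weburl.txt', 'imageurl.txt')` would produce `{'url1': 'imgurl1', 'url2': 'imgurl2', ...}`.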

Python: Maintaining original order when zipping two lists into a dictionary

I am reading a CSV file and combining rows into dictionaries, with the first row containing the keys and the subsequent rows containing the values.
I want my dictionary keys to be in the same order as the original csv file, but the dict(zip()) approach seems to order them randomly. I tried OrderedDict and that didn't work.
If there is a better way to produce my dictionaries I'm open to suggestions, but I would really like to know how i can do this while keeping my existing code, just because I am very new to Python (and programming in general) and I would like to be able to understand my own code at this point.
import csv  # imports the csv module

with open("csvfile.csv", "r") as file_var:
    reader = csv.reader(file_var)
    my_list = []
    for row in reader:
        if (len(row) != 0):
            my_list = my_list + [row]

for i in range(1, len(my_list)):
    user = dict(zip(my_list[0], my_list[i]))
    print "----------------------"
    print user['first_name'], user['last_name']
    for key in user:
        print key, user[key]
Dictionaries have an arbitrary order. You should use an OrderedDict instead.
from collections import OrderedDict
user = OrderedDict(zip(my_list[0], my_list[i]))
etc.
I note you say it didn't work, but I see no reason why it wouldn't. In what way did it fail?
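As a minimal sketch of that suggestion with made-up header and row data (the names are illustrative, not from the question's CSV):

```python
from collections import OrderedDict

# The first CSV row supplies the keys, a later row supplies the values.
header = ["first_name", "last_name", "email"]
row = ["Ada", "Lovelace", "ada@example.com"]

user = OrderedDict(zip(header, row))
print(list(user.keys()))  # -> ['first_name', 'last_name', 'email']
```

On Python 2 (which the question's print statements suggest), plain dicts really were unordered; since CPython 3.7, plain dicts preserve insertion order too, so `dict(zip(header, row))` would also keep the CSV's column order there.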

How do I write the contents of nested dictionaries to a file in a certain format?

I have a dictionary of dictionaries, and I'm trying to output the information within them in a certain way so that it will be usable for downstream analysis. Note: All the keys in dict are also in list.
for item in list:
    for key, value in dict[item].items():
        print item, key, value
This is the closest I've gotten to what I want, but it's still a long way off. Ideally what I want is:
      item1  item2  item3  item4
key1  value  value  value  value
key2  value  value  value  value
key3  value  value  value  value
Is this even possible?
First, if I understand your structure, the list is just a way of ordering the keys for the outer dictionary, and a lot of your complexity is trying to use these two together to simulate an ordered dictionary. If so, there's a much easier way to do that: use collections.OrderedDict. I'll come back to that at the end.
First, you need to get all of the keys of your sub-dictionaries, because those are the rows of your output.
From comments, it sounds like all of the sub-dictionaries in dct have the same keys, so you can just pull the keys out of any arbitrary one of them:
keys = dct.values()[0].keys()
If each sub-dictionary can have a different subset of keys, you'll need to instead do a first pass over dct to get all the keys:
keys = reduce(set.union, map(set, dct.values()))
Some people find reduce hard to understand, even when you're really just using it as "sum with a different operator". For them, here's how to do the same thing explicitly:
keys = set()
for subdct in dct.values():
    keys |= set(subdct)
Now, for each key's row, we need to get a column for each sub-dictionary (that is, each value in the outer dictionary), in the order specified by using the elements of the list as keys into the outer dictionary.
So, for each column item, we want to get the outer-dictionary value corresponding to the key in item, and then in the resulting sub-dictionary, get the value corresponding to the row's key. That's hard to say in English, but in Python, it's just:
dct[item][key]
If you don't actually have all the same keys in all of the sub-dictionaries, it's only slightly more complicated:
dct[item].get(key, '')
So, if you didn't want any headers, it would look like this:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    for key in keys:
        w.writerow(dct[item].get(key, '') for item in lst)
To add a header column, just prepend the header (in this case, key) to each of those rows:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    for key in keys:
        w.writerow([key] + [dct[item].get(key, '') for item in lst])
Notice that I turned the genexp into a list comprehension so I could use list concatenation to prepend the key. It's conceptually cleaner to leave it as an iterator, and prepend with itertools.chain, but in trivial cases like this with tiny iterables, I think that's just making the code harder to read:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    for key in keys:
        w.writerow(chain([key], (dct[item].get(key, '') for item in lst)))
You also want a header row. That's even easier; it's just the items in the list, with a blank column prepended for the header column:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    w.writerow([''] + lst)
    for key in keys:
        w.writerow([key] + [dct[item].get(key, '') for item in lst])
However, there are two ways to make things even simpler.
First, you can use an OrderedDict, so you don't need the separate key list. If you're stuck with the separate list and dict, you can still build an OrderedDict on the fly to make your code easier to read. For example:
od = collections.OrderedDict((item, dct[item]) for item in lst)
And now:
with open('output.csv', 'wb') as f:
    w = csv.writer(f, delimiter='\t')
    w.writerow([''] + od.keys())
    for key in keys:
        w.writerow([key] + [subdct.get(key, '') for subdct in od.values()])
Second, you could just build the transposed structure:
transposed = {key_b: {key_a: dct[key_a].get(key_b, '') for key_a in dct}
              for key_b in keys}
And then iterate over it in the obvious order (or use a DictWriter to handle the ordering of the columns for you, and use its writerows method to deal with the rows, so the whole thing becomes a one-liner).
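Here's a sketch of that DictWriter idea in modern Python 3 (the function name is made up, and an empty-string fieldname serves as the header column):

```python
import csv


def write_transposed(dct, lst, path):
    """Write a dict-of-dicts as TSV: sub-dict keys become rows,
    outer keys (in lst's order) become columns."""
    # Collect every sub-dictionary key; these are the output rows.
    keys = set()
    for sub in dct.values():
        keys |= set(sub)
    with open(path, 'w', newline='') as f:
        # '' is the fieldname for the row-label column.
        w = csv.DictWriter(f, fieldnames=[''] + lst, delimiter='\t')
        w.writeheader()
        for key in sorted(keys):
            row = {'': key}
            row.update((item, dct[item].get(key, '')) for item in lst)
            w.writerow(row)
```

DictWriter handles the column ordering from fieldnames, so each row is just a dict lookup away.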
To store objects in Python so that you can re-use them later, you can use the shelve module. This is a module that lets you write objects to a shelf file and re-open it and retrieve the objects later, but it's operating system-dependent, so it won't work if, say, you made it on a Mac and later you want to open it on a Windows machine.
import shelve

shelf = shelve.open("filename", flag='c')
# with flag='c', you have to delete the old shelf if you want to overwrite it
dict1 = {}  # something
dict2 = {}  # something
shelf['key1'] = dict1
shelf['key2'] = dict2
shelf.close()
To read objects from a shelf:
shelf_reader = shelve.open("filename", flag='r')
for k in shelf_reader.keys():
    retrieved = shelf_reader[k]
    print(retrieved)  # prints the retrieved dictionary
shelf_reader.close()
It may be a matter of opinion, but I think one of the best (and by far the easiest) ways to serialize a (nested) dictionary is the JSON format:
{ "key1" : { "subkey1" : "value1",
             "subkey2" : "value2" },
  "key2" : { "subkey3" : "value3" } }
The best part is that this can be done (either for encoding your values or decoding them) in a single line using the builtin json module!
Let's say your dictionary is the dico variable:
import json

with open('save_file', 'w') as save_file:
    save_file.write(json.dumps(dico))
Et voilà :-)!
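Reading it back is just as short; a round-trip sketch entirely in memory (sample data taken from the JSON snippet above):

```python
import json

dico = {"key1": {"subkey1": "value1", "subkey2": "value2"},
        "key2": {"subkey3": "value3"}}

text = json.dumps(dico, indent=4)  # encode to a human-readable string
restored = json.loads(text)        # decode back to nested dicts

print(restored == dico)  # -> True
```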
If the data is guaranteed to be loaded back into Python, I'd suggest simply using pickle instead of worrying about the format. If it's going to be loaded into another standard language, then consider using json instead - there are libraries for most languages to parse JSON format data.
That said if you really need to invent your own format, you could do something like this to store all keys from all sub-dictionaries in CSV format:
import csv

dict_keys = sorted(dict.keys())
with open("output.csv", "wb") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Key"] + dict_keys)
    all_keys = reduce(set.union, (set(d) for d in dict.values()))
    for key in sorted(all_keys):
        writer.writerow([key] + [dict[k].get(key, "") for k in dict_keys])
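For completeness, the pickle route suggested above is equally brief (a sketch with made-up data; pickle files are Python-specific binary, written here to a temp directory for tidiness):

```python
import os
import pickle
import tempfile

data = {"AAAA": {"dt": [0.1132]}, "BBBB": {"dt": [0.32]}}

# Round-trip the nested dictionary through a pickle file.
path = os.path.join(tempfile.mkdtemp(), "data.pkl")
with open(path, "wb") as f:
    pickle.dump(data, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # -> True
```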
