Finding modes for multiple dictionary keys - python

I currently have a Python dictionary with keys assigned to multiple values (which have come from a CSV), in a format similar to:
{
    'hours': ['4', '2.4', '5.8', '2.4', '7'],
    'name': ['Adam', 'Bob', 'Adam', 'John', 'Harry'],
    'salary': ['55000', '30000', '55000', '30000', '80000']
}
(The actual dictionary is significantly larger in both keys and values.)
I am looking to find the mode* for each set of values, with the stipulation that sets where all values occur only once do not need a mode. However, I'm not sure how to go about this (and I can't find any other examples similar to this). I am also concerned about the different (implied) data types in each set of values (e.g. 'hours' values are floats, 'name' values are strings, 'salary' values are integers), though I have included a rudimentary conversion function that is not used yet.
import csv

f = 'blah.csv'

# Conducts type conversion
def conversion(value):
    try:
        value = float(value)
    except ValueError:
        pass
    return value

reader = csv.DictReader(open(f))

# Places csv into a dictionary
csv_dict = {}
for row in reader:
    for column, value in row.iteritems():
        csv_dict.setdefault(column, []).append(value.strip())
*I want to attempt other types of calculations as well, such as averages and quartiles (which is why I'm concerned about data types), but I'd mostly like assistance with modes for now.
EDIT: the input CSV file can change; I'm unsure if this has any effect on potential solutions.

Ignoring all the CSV file handling, which seems tangential to your question, let's say you have a list salary. You can use the Counter class from the collections module to count the unique list elements.
From there you have a number of different options for how to get from a Counter to your mode.
For example:
from collections import Counter
salary = ['55000', '30000', '55000', '30000', '80000']
counter = Counter(salary)
# This returns all unique list elements and their count, sorted by count, descending
mc = counter.most_common()
print(mc)
# This returns the unique list elements and their count, where their count equals
# the count of the most common list element.
gmc = [(k,c) for (k,c) in mc if c == mc[0][1]]
print(gmc)
# If you just want an arbitrary (list element, count) pair that has the most occurrences
amc = counter.most_common()[0]
print(amc)
For the salary list in the code, this outputs:
[('55000', 2), ('30000', 2), ('80000', 1)] # mc
[('55000', 2), ('30000', 2)] # gmc
('55000', 2) # amc
Of course, for your case you'd probably use Counter(csv_dict["salary"]) instead of Counter(salary).
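Tying this back to the original question, here is a minimal sketch (assuming csv_dict is populated as in the question) that collects the mode(s) for every key and skips columns where every value occurs only once:

from collections import Counter

modes = {}
for column, values in csv_dict.items():
    mc = Counter(values).most_common()
    # If even the most common value occurs only once, every value is unique: no mode needed
    if mc[0][1] == 1:
        continue
    # Keep every value tied for the highest count
    modes[column] = [k for (k, c) in mc if c == mc[0][1]]
print(modes)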

I'm not sure I understand the question, but you could manually create a dictionary matching each key to the desired calculation, or you could inspect the values with type(): if a value turns out to be a string, ask other questions about it (parameters such as the length of the item).
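A rough sketch of that idea, using the conversion helper from the question to decide whether a column is numeric or textual (the particular statistics computed per type are purely illustrative):

from collections import Counter

for column, values in csv_dict.items():
    converted = [conversion(v) for v in values]    # conversion() is the helper from the question
    if all(isinstance(v, float) for v in converted):
        # numeric column: means, quartiles, etc. could go here
        print('{0}: mean = {1}'.format(column, sum(converted) / len(converted)))
    else:
        # text column: fall back to counts, lengths, etc.
        print('{0}: most common = {1}'.format(column, Counter(values).most_common(1)))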


Impose conditional IF statement while updating dictionary values

I'm working on determining the maximum value (the third item in each tuple) shared between the first two items of the tuple.
I created a defaultdict that uses the sorted, concatenated first two items of the tuple as the dict key and assigns the third item of the tuple as the dict value.
How can I impose a condition so that when I come across the same pairing I replace the dict value with the larger value? I only want to read through my list once, to be efficient.
users = [
    ('2', '1', 0.7),
    ('1', '2', 0.5),
    ('3', '2', 0.99),
    ('1', '3', 0.78),
    ('2', '1', 0.5),
    ('2', '3', 0.99),
    ('3', '1', 0.78),
    ('3', '2', 0.96)]

# The above list is much longer (~10 mill+), thus the need to only read through it once.

# Current code
from collections import defaultdict

user_pairings = defaultdict()
for us1, us2, maxval in users:
    user_pairings[''.join(sorted(us1 + us2))] = maxval  ## -> How to impose the condition here?
print(user_pairings)
EDIT
Just realized a major flaw in my approach: if the values used for the keys are not single digits, then my output will not be the correct result, due to using sorted.
You can use the dictionary get method to check whether a key already exists in the dictionary (returning 0 if it doesn't), and then assign the max of that value and the current value to the key:
user_pairings = {}
for us1, us2, maxval in users:
    key = '-'.join(sorted([us1, us2]))
    user_pairings[key] = max(maxval, user_pairings.get(key, 0))
print(user_pairings)
Output for your sample data:
{'1-3': 0.78, '2-3': 0.99, '1-2': 0.7}
Note that I don't see much point in concatenating us1 and us2 into a string just so that sorted can split it back out into a list; you may as well use a list [us1, us2] to begin with.
By using a list and joining with a character (I've used - but any would do), we can avoid the issue that can arise when the us1 and us2 values have multiple digits (e.g. if us1, us2 = 1, 23 and us1, us2 = 12, 3).
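If string keys aren't actually required, a tuple of the sorted pair also works as a dictionary key and sidesteps the joining question entirely; a small sketch of that variation:

user_pairings = {}
for us1, us2, maxval in users:
    key = tuple(sorted((us1, us2)))    # e.g. ('1', '2'); unambiguous even for multi-digit ids
    user_pairings[key] = max(maxval, user_pairings.get(key, 0))
print(user_pairings)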
One way of doing it would be to replace:
user_pairings[''.join(sorted(us1+us2))] = maxval
With:
key = ''.join(sorted(us1 + us2))
user_pairings[key] = max(maxval, user_pairings[key] if key in user_pairings else 0)
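Since the question already imports defaultdict, the same condition can also be expressed with a float default; a sketch (0.0 is a safe floor here because the sample values are all positive):

from collections import defaultdict

user_pairings = defaultdict(float)    # missing keys start at 0.0
for us1, us2, maxval in users:
    key = '-'.join(sorted([us1, us2]))
    user_pairings[key] = max(user_pairings[key], maxval)
print(dict(user_pairings))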

Getting data from a dictionary Python

I'm using Python dictionaries for counting the repeated objects in an array.
I use a function obtained from this forum for counting the objects, and the result comes back in the following format: {object: number of elements, ...}.
My issue is that the function doesn't return the dictionary objects differentiated by keys, and I don't know how to get at the objects.
count_assistence_issues = {x:list(assistances_issues).count(x) for x in list(assistances_issues)}
count_positive_issues = {x:list(positive_issues).count(x) for x in list(positive_issues)}
count_negative_issues = {x:list(negative_issues).count(x) for x in list(negative_issues)}
print(count_assistence_issues)
print(count_positive_issues)
print(count_negative_issues)
This is the output obtained:
{school.issue(10,): 1, school.issue(13,): 1}
{school.issue(12,): 1}
{school.issue(14,): 2}
And this is the output I need to obtain:
{{issue: school.issue(10,), count: 1},
{issue: school.issue(13,), count: 1}}
{{issue: school.issue(12,), count: 1}}
{{issue: school.issue(14,), count: 2}}
Does anyone know how to differentiate the elements of the array by keys using the function?
Or any other function for counting the repeated objects that produces a dictionary with the format {'issue': issue, 'count': count}?
Thanks for reading!
Given your input and output, I would consider the following.
1) Merge all your counts into a single dictionary
# assuming that what differentiates your issues is a unique ID/key/value etc.,
# meaning that no issues are subsets of the other. If they are, this will need some tweaking
issue_count = {}
issue_count.update(count_assistence_issues)
issue_count.update(count_positive_issues)
issue_count.update(count_negative_issues)
Getting the counts is then simply:
issue_count[school.issue(n,)]
The key is your array element. If you want an alternative, you could make a list or dict of your keys; you can make this as verbose as you want.
key_issues = {"issue1":school.issue(1,),"issue2":school.issue(2,)....}
This then allows you to call your counts by:
issue_count[key_issues["issue1"]]
If you want to use the "count" field, you would need to change your counter to give you a dict with an issue field and a count field, but that's another question.
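For what it's worth, a minimal sketch of that last point, using collections.Counter to build a list of dicts in the format the question asks for (assuming assistances_issues is the same iterable used in the comprehensions above):

from collections import Counter

count_assistence_issues = [
    {'issue': issue, 'count': count}
    for issue, count in Counter(assistances_issues).items()
]
print(count_assistence_issues)
# e.g. [{'issue': school.issue(10,), 'count': 1}, {'issue': school.issue(13,), 'count': 1}]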

Writing an Array of Dictionaries to CSV

I'm trying to get the dictionary (which the first part of the program generates) to write to a csv so that I can perform further operations on the data in excel. I realize the code isn't efficient but at this point I'd just like it to work. I can deal with speeding it up later.
import csv
import pprint

raw_data = csv.DictReader(open("/Users/David/Desktop/crimestats/crimeincidentdata.csv", "r"))

neighborhood = []
place_count = {}
stats = []

for row in raw_data:
    neighborhood.append(row["Neighborhood"])

for place in set(neighborhood):
    place_count.update({place: 0})

for key, value in place_count.items():
    for place in neighborhood:
        if key == place:
            place_count[key] = place_count[key] + 1

for key in place_count:
    stats.append([{"Location": str(key)}, {"Volume": str(place_count[key])}])

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(stats)
The program is still running fine at this point, as is evident from the pprint output:
[ [{'Location': 'LINNTON'}, {'Volume': '109'}],
[{'Location': 'SUNDERLAND'}, {'Volume': '118'}],
[{'Location': 'KENTON'}, {'Volume': '715'}]
This is where the error is definitely happening. The program writes the headers to the csv just fine then throws the ValueError.
fieldnames = ['Location', 'Volume']

with open('/Users/David/Desktop/crimestats/localdata.csv', 'w', newline='') as output_file:
    csvwriter = csv.DictWriter(output_file, delimiter=',', fieldnames=fieldnames, dialect='excel')
    csvwriter.writeheader()
    for row in stats:
        csvwriter.writerow(row)
output_file.close()
I've spent quite a bit of time searching for this problem, but none of the suggestions I have attempted to use have worked. I figure I must be missing something, so I'd really appreciate any and all help.
Traceback (most recent call last):
  File "/Users/David/Desktop/crimestats/statsreader.py", line 34, in <module>
    csvwriter.writerow(row)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 153, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 149, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: {'Location': 'SABIN'}, {'Volume': '247'}
I believe your problem is here:
for key in place_count:
    stats.append([{"Location": str(key)}, {"Volume": str(place_count[key])}])
This is creating a list of two dictionaries. The first has only a "Location" key, and the second has only a "Volume" key. However, the csv.DictWriter objects are expecting a single dictionary per row, with all the keys in the dictionary. Change that code snippet to the following and it should work:
for key in place_count:
    stats.append({"Location": str(key), "Volume": str(place_count[key])})
That should take care of the errors you're seeing.
Now, as for why the error message is complaining about fields not in fieldnames, which completely misled you away from the real problem you're having: the writerow() function expects to get a dictionary as its row parameter, but you're passing it a list. The result is confusion: it iterates over the row parameter in a for loop expecting to get dictionary keys (because that's what you get when you iterate over a dict in Python), and it compares those items to the values in the fieldnames list. What it's expecting to see is:
"Location"
"Volume"
in either order (because a Python dict makes no guarantees about which order it will return its keys). The reason why they want you to pass in a fieldnames list is so that the fields can be written to the CSV in the correct order. However, because you're passing in a list of two dictionaries, when it iterates over the row parameter, it gets the following:
{'Location': 'SABIN'}
{'Volume': '247'}
Now, the dictionary {'Location': 'SABIN'} does not equal the string "Location", and the dictionary {'Volume': '247'} does not equal the string "Volume", so the writerow() function thinks it's found dict keys that aren't in the fieldnames list you supplied, and it throws that exception. What was really happening was "you passed me a list of two dicts-of-one-key, when I expected a single dict-with-two-keys", but the function wasn't written to check for that particular mistake.
Now I'll mention a couple things you could do to speed up your code. One thing that will help quite a bit is to reduce those three for loops at the start of your code down to just one. What you're trying to do is to go through the raw data, and count the number of times each neighborhood shows up. First I'll show you a better way to do that, then I'll show you an even better way that improves on my first solution.
The better way to do that is to make use of the wonderful defaultdict class that Python provides in the collections module. defaultdict is a subclass of Python's dictionary type, which will automatically create dict entries when they're accessed for the first time. Its constructor takes a single parameter, a function which will be called with no parameters and should return the desired default value for any new item. If you had used defaultdict for your place_count dict, this code:
place_count = {}
for place in set(neighborhood):
    place_count.update({place: 0})
could simply become:
place_count = defaultdict(int)
What's going on here? Well, the int function (which really isn't a function, it's the constructor for the int class, but that's a bit beyond the scope of this explanation) just happens to return 0 if it's called with no parameters. So instead of writing your own function def returnzero(): return 0, you can just use the existing int function (okay, constructor). Now every time you do place_count["NEW PLACE"], the key NEW PLACE will automatically appear in your place_count dictionary, with the value 0.
Now, your counting loop needs to be modified too: it used to go over the keys of place_count, but now that place_count automatically creates its keys the first time they're accessed, you need a different source. But you still have that source in the raw data: the row["Neighborhood"] value for each row. So your for key,value in place_count.items(): loop could become:
for row in raw_data:
    place = row["Neighborhood"]
    place_count[place] = place_count[place] + 1
And now that you're using a defaultdict, you don't even need that first loop (the one that created the neighborhood list) at all! So we've just turned three loops into one. The final version of what I'm suggesting looks like this:
from collections import defaultdict

place_count = defaultdict(int)
for row in raw_data:
    place = row["Neighborhood"]
    place_count[place] = place_count[place] + 1
    # Or: place_count[place] += 1
However, there's a way to improve that even more. The Counter object from the collections module is designed for just this case, and has some handy extra functionality, like the ability to retrieve the N most common items. So the final final version :-) of what I'm suggesting is:
from collections import Counter

place_count = Counter()
for row in raw_data:
    place = row["Neighborhood"]
    place_count[place] = place_count[place] + 1
    # Or: place_count[place] += 1
That way if you need to retrieve the 5 most crime-ridden neighborhoods, you can just call place_count.most_common(5).
You can read more about Counter and defaultdict in the documentation for the collections module.
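Putting the counting and the writing together, a compact end-to-end sketch (file paths as in the question; Counter can consume a generator directly, which removes the explicit counting loop, and each row handed to writerow() is now a single dict containing both fieldnames):

import csv
from collections import Counter

with open("/Users/David/Desktop/crimestats/crimeincidentdata.csv") as input_file:
    place_count = Counter(row["Neighborhood"] for row in csv.DictReader(input_file))

fieldnames = ['Location', 'Volume']
with open('/Users/David/Desktop/crimestats/localdata.csv', 'w', newline='') as output_file:
    csvwriter = csv.DictWriter(output_file, fieldnames=fieldnames, dialect='excel')
    csvwriter.writeheader()
    for place, count in place_count.most_common():
        csvwriter.writerow({"Location": place, "Volume": str(count)})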

Adding Multiple Values to a Single Key in Python Dictionary

Python dictionaries really have me stumped today. I've been poring over Stack Overflow, trying to find a way to do a simple append of a new value to an existing key in a Python dictionary, and I'm failing at every attempt even when using the same syntax I see on here.
This is what I am trying to do:
# cursor search of an xls file
definitionQuery_Dict = {}
for row in arcpy.SearchCursor(xls):
    # set some source paths from strings in the xls file
    dataSourcePath = str(row.getValue("workspace_path")) + "\\" + str(row.getValue("dataSource"))
    dataSource = row.getValue("dataSource")
    # Add items to the dictionary. The keys are the datasource tables and the values will be
    # definition (SQL) queries. First test whether a definition query exists in the row;
    # if it does, we want to add the key, value pair to the dictionary.
    if row.getValue("Definition_Query") <> None:
        # if the key already exists, then append a new value to the value list
        if row.getValue("dataSource") in definitionQuery_Dict:
            definitionQuery_Dict[row.getValue("dataSource")].append(row.getValue("Definition_Query"))
        else:
            # otherwise, add a new key, value pair
            definitionQuery_Dict[row.getValue("dataSource")] = row.getValue("Definition_Query")
I get an attribute error:
AttributeError: 'unicode' object has no attribute 'append'
But I believe I am doing the same as the answer provided here
I've tried various other methods with no luck and with various other error messages. I know this is probably simple, and maybe I just couldn't find the right source on the web, but I'm stuck. Anyone care to help?
Thanks,
Mike
The issue is that you're originally setting the value to be a string (ie the result of row.getValue) but then trying to append it if it already exists. You need to set the original value to a list containing a single string. Change the last line to this:
definitionQuery_Dict[row.getValue("dataSource")] = [row.getValue("Definition_Query")]
(notice the brackets round the value).
ndpu has a good point with the use of defaultdict: but if you're using that, you should always do the append, i.e. replace the whole if/else statement with the append you're currently doing in the if clause.
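As a sketch, here is the corrected block from the question with only that one change applied (Python 2 / arcpy, everything around it assumed unchanged):

if row.getValue("Definition_Query") <> None:
    # if the key already exists, append to the existing list
    if row.getValue("dataSource") in definitionQuery_Dict:
        definitionQuery_Dict[row.getValue("dataSource")].append(row.getValue("Definition_Query"))
    else:
        # otherwise, start a list containing the first query
        definitionQuery_Dict[row.getValue("dataSource")] = [row.getValue("Definition_Query")]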
Your dictionary has keys and values. If you want to add to the values as you go, then each value has to be a type that can be extended/expanded, like a list or another dictionary. Currently each value in your dictionary is a string, where what you want instead is a list containing strings. If you use lists, you can do something like:
mydict = {}
records = [('a', 2), ('b', 3), ('a', 4)]

for key, data in records:
    # If this is a new key, create a list to store the values
    if not key in mydict:
        mydict[key] = []
    mydict[key].append(data)
Output:
mydict
Out[4]: {'a': [2, 4], 'b': [3]}
Note that even though 'b' only has one value, that single value still has to be put in a list, so that it can be added to later on.
Use collections.defaultdict:
from collections import defaultdict
definitionQuery_Dict = defaultdict(list)
# ...
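A sketch of how the loop from the question would then look (Python 2 / arcpy, same cursor and field names as above; the if/else disappears because missing keys start out as empty lists):

for row in arcpy.SearchCursor(xls):
    if row.getValue("Definition_Query") <> None:
        definitionQuery_Dict[row.getValue("dataSource")].append(row.getValue("Definition_Query"))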

CSV module sorted output unexpected

In the code below (for printing salaries in descending order, ordered by profession),
import csv

reader = csv.DictReader(open('salaries.csv', 'rb'))
rows = sorted(reader)
a = {}

for i in xrange(len(rows)):
    if rows[i].values()[2] == 'Plumbers':
        a[rows[i].values()[1]] = rows[i].values()[0]

t = [i for i in sorted(a, key=lambda key: a[key], reverse=True)]
p = a.values()
p.sort()
p.reverse()

for i in xrange(len(a)):
    print t[i] + "," + p[i]
When I put 'Plumbers' in the conditional statement, the output among the salaries of plumbers comes out to be:
Tokyo,400
Delhi,300
London,100
And when I put 'Lawyers' in the same 'if' condition, the output is:
Tokyo,800
London,700
Delhi,400
The content of the CSV goes like:
City,Job,Salary
Delhi,Lawyers,400
Delhi,Plumbers,300
London,Lawyers,700
London,Plumbers,100
Tokyo,Lawyers,800
Tokyo,Plumbers,400
And when I remove --> if rows[i].values()[2]=='Plumbers': <-- from the program, it was supposed to print all the outputs, but it prints only these 3:
Tokyo,400
Delhi,300
London,100
Though the output should look something like:
Tokyo,800
London,700
Delhi,400
Tokyo,400
Delhi,300
London,100
Where is the problem exactly?
First of all, your code works as described... it outputs in descending salary order. So it works as designed?
In passing, your sorting code seems overly complex. You don't need to split the location/salary pairs into two lists and sort them independently. For example:
# Plumbers
>>> import operator
>>> a
{'Delhi': '300', 'London': '100', 'Tokyo': '400'}
>>> [item for item in reversed(sorted(a.iteritems(), key=operator.itemgetter(1)))]
[('Tokyo', '400'), ('Delhi', '300'), ('London', '100')]
# Lawyers
>>> a
{'Delhi': '400', 'London': '700', 'Tokyo': '800'}
>>> [item for item in reversed(sorted(a.iteritems(), key=operator.itemgetter(1)))]
[('Tokyo', '800'), ('London', '700'), ('Delhi', '400')]
And to answer your last question, when you remove the 'if' statement: you are storing location vs. salary in a dictionary, and a dictionary can't have duplicate keys. It will contain the last update for each location, which, based on your input CSV, is the salary for Plumbers.
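As a hedged sketch of the "print everything" case (assuming the City/Job/Salary column names from the sample CSV), keeping the whole rows instead of a city-keyed dict avoids the overwriting entirely:

import csv

with open('salaries.csv') as f:
    rows = list(csv.DictReader(f))

# Group by profession, then print each group's cities with salaries in descending (numeric) order
for job in sorted(set(row['Job'] for row in rows)):
    for row in sorted((r for r in rows if r['Job'] == job),
                      key=lambda r: int(r['Salary']), reverse=True):
        print(row['City'] + ',' + row['Salary'])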
First of all, reset all indices to index - 1 as currently rows[i].values()[2] cannot equal Plumbers unless the DictReader is a 1-based index system.
Secondly, what is unique about the Tokyo in the first row of your desired output and the Tokyo of the third row? When you create a dict, using the same value as a key will result in overwriting whatever was previously associated with that key. You need some kind of unique identifier, such as Location.Profession, for the key. You could simply do the following to get a key that will preserve all of your information:
key = "".join([rows[i].values()[0], rows[i].values()[1]], sep=",")
