Sorry if this sounds like a silly question but this problem has gotten me really confused. I'm fairly new to python, so maybe I'm missing something. I did some research but haven't gotten too far. Here goes:
I'm going to use a simple example to make the question clearer. My data is different, but the format and the required action are the same. We have a database of people and the pizzas they eat (and some other data). However, our database has multiple entries for the same people with different pizzas (because we combined data from different pizzerias).
example dataset:
allData = [['joe','32', 'pepperoni,cheese'],['marc','24','cheese'],['jill','27','veggie supreme, cheese'],['joe','32','pepperoni,veggie supreme'],['marc','25','cheese,chicken supreme']]
Few things we notice and rules I want to follow:
Names can appear multiple times, though in this specific case we KNOW that any entries with the same name are the same person.
The age can be different for the same person in different entries, so we just pick the first age we encounter for that person and use it. For example, marc's age is 24 and we ignore the 25 from the second entry.
I want to edit the data so that a person's name only appears ONCE, and the pizzas they eat form a unique set from all entries with the same name. As mentioned before, the age is just the first one encountered. Therefore, I'd want the final data to look like this:
fixedData = [['joe','32','pepperoni,cheese,veggie supreme'],['marc','24','cheese,chicken supreme'],['jill','27','veggie supreme, cheese']]
I'm thinking something on the lines of:
fixedData = []
for i in allData:
    if i[0] not in fixedData[0]:
        fixedData.append[i]
    else:
        fixedData[i[-1]]=set(fixedData[i[-1]],i[-1])
I know I'm making several mistakes. Could you please point me in the right direction?
Thanks heaps.
Since names are unique, it makes sense to use them as keys in a dict. This will be much more appropriate in your case:
>>> d = {}
>>> for i in allData:
...     if i[0] in d:
...         # merge the new pizzas into the stored list and drop duplicates
...         d[i[0]][-1] = sorted(set(d[i[0]][-1] + i[-1].split(',')))
...     else:
...         d[i[0]] = [i[1], i[2].split(',')]
...
>>> d
{'joe': ['32', ['cheese', 'pepperoni', 'veggie supreme']], 'marc': ['24', ['cheese', 'chicken supreme']], 'jill': ['27', ['veggie supreme', ' cheese']]}
In cases like yours I like to use defaultdict. I really hate the guesswork that comes with list indexes.
from collections import defaultdict
allData = [['joe', '32', 'pepperoni,cheese'],
           ['marc', '24', 'cheese'],
           ['jill', '27', 'veggie supreme, cheese'],
           ['joe', '32', 'pepperoni,veggie supreme'],
           ['marc', '25', 'cheese,chicken supreme']]
d = defaultdict(dict)
for name, age, pizzas in allData:
    d[name].setdefault('age', age)
    d[name].setdefault('pizzas', set())
    d[name]['pizzas'] |= set(pizzas.split(','))
Notice the usage of setdefault to set the first age value we encounter. It also enables the use of set union to get the unique pizzas.
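If you need the result back in the original list-of-lists shape, here is a minimal sketch of one way to do it. The function name merge_people is made up for illustration, and stripping the whitespace around each pizza name is my own assumption (it makes ' cheese' and 'cheese' count as the same pizza). It relies on dicts keeping insertion order, which is guaranteed from Python 3.7.

```python
def merge_people(rows):
    """Merge rows by name: keep the first age seen and the unique pizzas in first-seen order."""
    merged = {}  # name -> [age, list of pizzas]
    for name, age, pizzas in rows:
        # setdefault only stores [age, []] the first time a name is seen,
        # so later (possibly different) ages are ignored
        entry = merged.setdefault(name, [age, []])
        for pizza in pizzas.split(','):
            pizza = pizza.strip()  # assumption: surrounding spaces are not significant
            if pizza not in entry[1]:
                entry[1].append(pizza)
    return [[name, age, ','.join(pizza_list)]
            for name, (age, pizza_list) in merged.items()]


allData = [['joe', '32', 'pepperoni,cheese'],
           ['marc', '24', 'cheese'],
           ['jill', '27', 'veggie supreme, cheese'],
           ['joe', '32', 'pepperoni,veggie supreme'],
           ['marc', '25', 'cheese,chicken supreme']]

fixedData = merge_people(allData)
print(fixedData)
```

A list is used for the pizzas instead of a set so that the order in which they were first seen is kept. For a handful of pizzas per person the linear membership test is not a concern.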
Let's say I have this Person class that consists of a last_name string field.
I would like to display links of all first letters of names existing in db.
So for example:
A B D E... when there's Adams, Brown, Douglas, Evans and no one whose last_name starts with C.
Of course the view is not a problem here, as I want to prepare all of this on the backend. So the question is how to write a good model or view function that will provide this.
I would like it to be DB-independent; however, tricks for any particular DB would be a bonus.
Also, I would like this to be quite well optimized in terms of speed, because there could be a lot of names. It should be less naive than this simplest algorithm:
Get all people
Create set (because of the uniqueness of elements) of the first letters
Sort and return
So for example (in views.py):
names = Person.objects.values_list('last_name', flat=True)
letters = {name[0] for name in names}
letters_sorted = sorted(letters)
Of course I could order_by first or assign a new attribute to each object (containing the first letter), but I don't think it will speed up the process.
I also think that checking each letter of the alphabet to see whether at least one name starts with it is a bad approach, since it assumes that all letters are in use ;-)
Which approach would be most effective for databases and Django here?
You could also use the ORM methods to generate the list of initials (including count):
from django.db.models import Count
from django.db.models.functions import Left
initials = (Person
            .objects
            .annotate(initial=Left('last_name', 1))
            .values('initial')
            .annotate(count=Count('initial'))
            .order_by('initial'))
This will result in something like
<QuerySet [{'initial': 'A', 'count': 1},
{'initial': 'B', 'count': 2},
...
{'initial': 'Y', 'count': 1}]>
I would sort the names first, then unpack the names as arguments for zip, then consume just the first tuple that zip yields:
names = sorted(Person.objects.values_list('last_name', flat=True))
first_letters = next(zip(*names))
This doesn't use sets or remove duplicates or anything like that. Is that critical? If it is, you could do this:
names = Person.objects.values_list('last_name', flat=True)
first_letters = sorted(set(next(zip(*names))))
Though this would hardly be more performant than what you've already written.
import string
alphabet_list = list(string.ascii_lowercase) + list(string.ascii_uppercase)
# one EXISTS query per letter; maps each letter to True/False
result = dict(map(lambda x: (x, Person.objects.filter(last_name__startswith=x).exists()), alphabet_list))
I am working on an online course exercise (practice problem before the final test).
The test involves working with a big csv file (not downloadable) and answering questions about the dataset. You're expected to write code to get the answers.
The data set is a list of all documented baby names each year, along with how often each name was used for boys and for girls.
A sample list of the first 10 lines is also given:
Isabella,42567,Girl
Sophia,42261,Girl
Jacob,42164,Boy
and so on.
Questions you're asked include things like 'how many names in the data set', 'how many boys' names beginning with z' etc.
I can get all the data into a list of lists:
[['Isabella', '42567', 'Girl'], ['Sophia', '42261', 'Girl'], ['Jacob', '42164', 'Boy']]
My plan was to convert into a dictionary, as that would probably be easier for answering some of the other questions. The list of lists is saved to the variable 'data':
names = {}
for d in data:
    names[d[0]] = d[1:]
print(names)
{'Isabella': ['42567', 'Girl'], 'Sophia': ['42261', 'Girl'], 'Jacob': ['42164', 'Boy']}
Works perfectly.
Here's where it gets weird. If instead of opening the sample file with 10 lines I open the real csv file, with around 16,000 lines, everything works perfectly right up to the very last bit.
I get the complete list of lists, but when I go to create the dictionary, it breaks (here I'm just showing the first three items, but the full 16,000 lines are all wrong in a similar way):
names = {}
for d in data:
    names[d[0]] = d[1:]
print(names)
{'Isabella': ['56', 'Boy'], 'Sophia': ['48', 'Boy'], 'Jacob': ['49', 'Girl']
I know the data is there and correct, since I can read it directly:
for d in data:
    print(d[0], d[1], d[2])
Isabella 42567 Girl
Sophia 42261 Girl
Jacob 42164 Boy
Why would this dictionary work fine with the csv file with 10 lines, but completely break with the full file? I can't find any
Follow the comments to create two dicts, or a single dictionary with tuple keys. Using tuples as keys is fine if you keep your variables inside python, but you might get into trouble when exporting to json for example.
Try a dictionary comprehension with list unpacking
names = {(name, sex): freq for name, freq, sex in data}
Or a for loop as you started
names = dict()
for name, freq, sex in data:
    names[(name, sex)] = freq
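To see why the name-only dictionary goes wrong on the full file, here is a small sketch. The fourth row is my own assumption about what the full file contains: the same name listed again for the other sex, using the 'Isabella': ['56', 'Boy'] value from the broken output in the question.

```python
data = [
    ['Isabella', '42567', 'Girl'],
    ['Sophia', '42261', 'Girl'],
    ['Jacob', '42164', 'Boy'],
    ['Isabella', '56', 'Boy'],  # assumed: same name recorded again for boys
]

# Keyed by name only: the later 'Isabella' row silently replaces the earlier one.
by_name = {}
for name, freq, sex in data:
    by_name[name] = [freq, sex]

# Keyed by (name, sex): both rows survive. Converting freq to int is optional,
# but it makes later sums and comparisons numeric instead of string-based.
by_pair = {(name, sex): int(freq) for name, freq, sex in data}

print(by_name['Isabella'])
print(by_pair[('Isabella', 'Girl')], by_pair[('Isabella', 'Boy')])
```

A dictionary keeps only the last value assigned to a key, so any name that appears for both boys and girls ends up with whichever row came later in the file.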
I'd go with something like
results = {}
for d in data:
    name, amount, gender = d
    results[name] = results.get(name, {})
    results[name].update({ gender: amount })
This way you'll get results that look something like
{
'Isabella': {'Girl': '42567', 'Boy': '67'},
'Sophia': {'Girl': '42261'},
'Jacob': {'Boy': '42164'}
}
However, duplicated values will override previous ones, so you need to take that into account if there are any. It also assumes that the whole file matches the format you've provided.
I currently have a Python dictionary with keys assigned to multiple values (which have come from a CSV), in a format similar to:
{
'hours': ['4', '2.4', '5.8', '2.4', '7'],
'name': ['Adam', 'Bob', 'Adam', 'John', 'Harry'],
'salary': ['55000', '30000', '55000', '30000', '80000']
}
(The actual dictionary is significantly larger in both keys and values.)
I am looking to find the mode* for each set of values, with the stipulation that sets where all values occur only once do not need a mode. However, I'm not sure how to go about this (and I can't find any other examples similar to this). I am also concerned about the different (implied) data types for each set of values (e.g. 'hours' values are floats, 'name' values are strings, 'salary' values are integers), though I have a rudimentary conversion function included but not used yet.
import csv
f = 'blah.csv'
# Conducts type conversion
def conversion(value):
    try:
        value = float(value)
    except ValueError:
        pass
    return value
reader = csv.DictReader(open(f))
# Places csv into a dictionary
csv_dict = {}
for row in reader:
    for column, value in row.items():
        csv_dict.setdefault(column, []).append(value.strip())
*I'm wanting to attempt other types of calculations as well, such as averages and quartiles- which is why I'm concerned about data types- but I'd mostly like assistance with modes for now.
EDIT: the input CSV file can change; I'm unsure if this has any effect on potential solutions.
Ignoring all the csv file stuff, which seems tangential to your question, let's say you have a list called salary. You can use the Counter class from collections to count the unique list elements.
From that you have a number of different options about how to get from a Counter to your mode.
For example:
from collections import Counter
salary = ['55000', '30000', '55000', '30000', '80000']
counter = Counter(salary)
# This returns all unique list elements and their count, sorted by count, descending
mc = counter.most_common()
print(mc)
# This returns the unique list elements and their count, where their count equals
# the count of the most common list element.
gmc = [(k,c) for (k,c) in mc if c == mc[0][1]]
print(gmc)
# If you just want an arbitrary (list element, count) pair that has the most occurrences
amc = counter.most_common()[0]
print(amc)
For the salary list in the code, this outputs:
[('55000', 2), ('30000', 2), ('80000', 1)] # mc
[('55000', 2), ('30000', 2)] # gmc
('55000', 2) # amc
Of course, for your case you'd probably use Counter(csv_dict["salary"]) instead of Counter(salary).
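Applied to the whole dictionary, the same idea might look like the sketch below. The helper name column_modes is made up for illustration, and I'm assuming that "does not need a mode" means the column is simply left out of the result. Ties are all returned, in first-seen order.

```python
from collections import Counter


def column_modes(columns):
    """Map each column name to its mode(s), skipping columns where no value repeats."""
    modes = {}
    for column, values in columns.items():
        counts = Counter(values)
        if not counts:
            continue  # empty column: nothing to report
        highest = max(counts.values())
        if highest > 1:  # every value occurring once means there is no mode
            modes[column] = [value for value, count in counts.items()
                             if count == highest]
    return modes


csv_dict = {
    'hours': ['4', '2.4', '5.8', '2.4', '7'],
    'name': ['Adam', 'Bob', 'Adam', 'John', 'Harry'],
    'salary': ['55000', '30000', '55000', '30000', '80000'],
}

modes = column_modes(csv_dict)
print(modes)
```

Because Counter only needs hashable values, this works the same whether the values are still strings or have already been passed through a conversion function.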
I'm not sure I understand the question, but you could manually create a dictionary that maps each key to the kind of mode you want for it, or you could call type() on the values and, when it returns a string type, check further properties such as the length of the item.
I have a dictionary with the last and first names of the authors being the key, and the book, quantity, and price being the values. I want to print them out sorted in alphabetical order by the author name, and then by the book name.
The author is: Dickens, Charles
The title is: Hard Times
The qty is: 7
The price is: 27.00
----
The author is: Shakespeare, William
The title is: Macbeth
The qty is: 3
The price is: 7.99
----
The title is: Romeo And Juliet
The qty is: 5
The price is: 5.99
I'm very new to dictionaries and can't understand how you can sort a dictionary. My code so far is this:
def displayInventory(theInventory):
    theInventory = readDatabase(theInventory)
    for key in theInventory:
        for num in theInventory[key]:
            print("The author is", ' '.join(str(n) for n in key))
            print(' '.join(str(n) for n in num), '\n')
The dictionary, when printed, from which I read this looks like this:
defaultdict(<class 'list'>, {('Shakespeare', 'William'): [['Rome And Juliet', '5', '5.99'], ['Macbeth', '3', '7.99']], ('Dickens', 'Charles'): [['Hard Times', '7', '27.00']]})
fwiw, camelCase is very uncommon in Python; almost everything is written in snake_case. :)
I would do this:
for names, books in sorted(inventory.items()):
    for title, qty, price in sorted(books):
        print("The author is {0}".format(", ".join(names)))
        print(
            "The book is {0}, and I've got {1} of them for {2} each"
            .format(title, qty, price))
        print()
Ignoring for the moment that not everyone has a first and last name...
There are some minor tricks involved here.
First, inventory.items() produces (key, value) tuples. I can sort those directly, because tuples sort element-wise; that is, (1, "z") sorts before (2, "a"). So Python will compare the keys first, and the keys are tuples themselves, so it'll compare last names and then first names. Exactly what you want.
I can likewise sort books directly because I actually want to sort by title, and the title is the first thing in each structure.
I can .join the names tuple directly, because I already know everything in it should be a string, and something is wrong if that's not the case.
Then I use .format() everywhere because str() is a bit ugly.
The key is to use sorted() to sort the dictionary by its keys, but then use sort() on the dictionary's values. This is necessary because your values are actually a list of lists and it seems you only want to sort them by the first value in each sub-list.
theInventory = {('Shakespeare', 'William'): [['Rome And Juliet', '5', '5.99'], ['Macbeth', '3', '7.99']], ('Dickens', 'Charles'): [['Hard Times', '7', '27.00']]}
for Author in sorted(theInventory.keys()):
    Author_Last_First = Author[0]+", "+Author[1]
    Titles = theInventory[Author]
    Titles.sort(key=lambda x: x[0])
    for Title in Titles:
        print("Author: "+str(Author_Last_First))
        print("Title: "+str(Title[0]))
        print("Qty: "+str(Title[1]))
        print("Price: "+str(Title[2]))
        print("\n")
Is that what you had in mind? You can of course always put this in a function to make calling it easier.
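If you want the layout from the question, with each author printed once and "----" between books, something along these lines could work. This is only a sketch: format_inventory is a made-up name, and building a string instead of printing directly is my own choice so the result is easy to check.

```python
def format_inventory(inventory):
    """Build the report: authors sorted by (last, first), each author's books sorted by title."""
    entries = []
    for (last, first), books in sorted(inventory.items()):
        for index, (title, qty, price) in enumerate(sorted(books)):
            lines = []
            if index == 0:
                # only the first book of an author carries the author line
                lines.append("The author is: {0}, {1}".format(last, first))
            lines.append("The title is: {0}".format(title))
            lines.append("The qty is: {0}".format(qty))
            lines.append("The price is: {0}".format(price))
            entries.append("\n".join(lines))
    return "\n----\n".join(entries)


inventory = {
    ('Shakespeare', 'William'): [['Rome And Juliet', '5', '5.99'],
                                 ['Macbeth', '3', '7.99']],
    ('Dickens', 'Charles'): [['Hard Times', '7', '27.00']],
}

report = format_inventory(inventory)
print(report)
```

sorted(books) returns a new list, so unlike Titles.sort() it leaves the lists stored in the dictionary untouched.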
In the code below (for printing salaries in descending order, ordered by profession),
reader = csv.DictReader(open('salaries.csv','rb'))
rows = sorted(reader)
a={}
for i in xrange(len(rows)):
    if rows[i].values()[2]=='Plumbers':
        a[rows[i].values()[1]]=rows[i].values()[0]
t = [i for i in sorted(a, key=lambda key:a[key], reverse=True)]
p=a.values()
p.sort()
p.reverse()
for i in xrange(len(a)):
    print t[i]+","+p[i]
When I put 'Plumbers' in the conditional statement, the output for the plumbers' salaries is:
Tokyo,400
Delhi,300
London,100
and when I put 'Lawyers' in the same 'if' condition, the output is:
Tokyo,800
London,700
Delhi,400
The contents of the CSV look like this:
City,Job,Salary
Delhi,Lawyers,400
Delhi,Plumbers,300
London,Lawyers,700
London,Plumbers,100
Tokyo,Lawyers,800
Tokyo,Plumbers,400
When I remove the line "if rows[i].values()[2]=='Plumbers':" from the program,
it is supposed to print all the rows, but it prints only these 3:
Tokyo,400
Delhi,300
London,100
Though output should look something like:
Tokyo,800
London,700
Delhi,400
Tokyo,400
Delhi,300
London,100
Where is the problem exactly?
First of all, your code works as described... outputs in descending salary order. So works as designed?
In passing, your sorting code seems overly complex. You don't need to split the location/salary pairs into two lists and sort them independently. For example:
>>> import operator
# Plumbers
>>> a
{'Delhi': '300', 'London': '100', 'Tokyo': '400'}
>>> [item for item in reversed(sorted(a.iteritems(),key=operator.itemgetter(1)))]
[('Tokyo', '400'), ('Delhi', '300'), ('London', '100')]
# Lawyers
>>> a
{'Delhi': '400', 'London': '700', 'Tokyo': '800'}
>>> [item for item in reversed(sorted(a.iteritems(),key=operator.itemgetter(1)))]
[('Tokyo', '800'), ('London', '700'), ('Delhi', '400')]
And to answer your last question, when you remove the 'if' statement: you are storing location vs. salary in a dictionary and a dictionary can't have duplicate keys. It will contain the last update for each location, which based on your input csv, is the salary for Plumbers.
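One way around the duplicate-key problem is not to build a dictionary at all: keep every row and sort the rows themselves. The sketch below is written for Python 3 (the question's code is Python 2) and reads the CSV from an in-memory string so that it is self-contained; the sort key (job, then salary descending) is my reading of "ordered by profession".

```python
import csv
import io

CSV_TEXT = """City,Job,Salary
Delhi,Lawyers,400
Delhi,Plumbers,300
London,Lawyers,700
London,Plumbers,100
Tokyo,Lawyers,800
Tokyo,Plumbers,400
"""

# Each row is a dict keyed by column name, so no positional indexing is needed.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Group by job, and within a job put the highest salary first.
# int() makes the comparison numeric rather than string-based.
rows.sort(key=lambda row: (row['Job'], -int(row['Salary'])))

result = [(row['City'], int(row['Salary'])) for row in rows]
for city, salary in result:
    print("{0},{1}".format(city, salary))
```

With a real file you would pass the open file object to csv.DictReader instead of the StringIO wrapper.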
First of all, look the columns up by name instead of by position. The order of rows[i].values() is arbitrary for a Python 2 dict, so rows[i].values()[2] only matches the Job column by accident; rows[i]['Job'] is unambiguous.
Secondly, what is unique about the Tokyo in the first row of your desired output and the Tokyo in the fourth row? When you create a dict, using the same value as a key will result in overwriting whatever was previously associated with that key. You need some kind of unique identifier, such as Location.Profession, for the key. You could simply do the following to get a key that will preserve all of your information:
key = ",".join([rows[i]['City'], rows[i]['Job']])