I am working on an online course exercise (practice problem before the final test).
The test involves working with a big csv file (not downloadable) and answering questions about the dataset. You're expected to write code to get the answers.
The data set is a list of all documented baby names each year, along with
how often each name was used for boys and for girls.
A sample list of the first 10 lines is also given:
Isabella,42567,Girl
Sophia,42261,Girl
Jacob,42164,Boy
and so on.
Questions you're asked include things like 'how many names in the data set', 'how many boys' names beginning with z' etc.
I can get all the data into a list of lists:
[['Isabella', '42567', 'Girl'], ['Sophia', '42261', 'Girl'], ['Jacob', '42164', 'Boy']]
My plan was to convert into a dictionary, as that would probably be easier for answering some of the other questions. The list of lists is saved to the variable 'data':
names = {}
for d in data:
    names[d[0]] = d[1:]
print(names)
{'Isabella': ['42567', 'Girl'], 'Sophia': ['42261', 'Girl'], 'Jacob': ['42164', 'Boy']}
Works perfectly.
Here's where it gets weird. If instead of opening the sample file with 10 lines, I open the real csv file, with around 16,000 lines, everything works perfectly right up to the very last bit.
I get the complete list of lists, but when I go to create the dictionary, it breaks (here I'm just showing the first three items, but the full 16,000 lines are all wrong in a similar way):
names = {}
for d in data:
    names[d[0]] = d[1:]
print(names)
{'Isabella': ['56', 'Boy'], 'Sophia': ['48', 'Boy'], 'Jacob': ['49', 'Girl'], ...}
I know the data is there and correct, since I can read it directly:
for d in data:
    print(d[0], d[1], d[2])
Isabella 42567 Girl
Sophia 42261 Girl
Jacob 42164 Boy
Why would this dictionary work fine with the csv file with 10 lines, but completely break with the full file? I can't find any
Follow the comments to create two dicts, or a single dictionary with tuple keys. Using tuples as keys is fine if you keep your variables inside Python, but you might get into trouble when exporting to JSON, for example.
Try a dictionary comprehension with list unpacking:
names = {(name, sex): freq for name, freq, sex in data}
Or a for loop as you started
names = dict()
for name, freq, sex in data:
    names[(name, sex)] = freq
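To make the JSON caveat concrete, here is a small sketch. It uses the three sample rows plus the 'Isabella'/'Boy' row shown in the question's broken output; the `|` separator used to flatten the keys is an arbitrary choice for illustration, not something the original answer prescribes.

```python
import json

# Sample rows from the question, plus the second 'Isabella' row seen in the full file.
data = [['Isabella', '42567', 'Girl'], ['Sophia', '42261', 'Girl'],
        ['Jacob', '42164', 'Boy'], ['Isabella', '56', 'Boy']]

# Tuple keys keep the Girl and Boy counts for the same name apart.
names = {(name, sex): int(freq) for name, freq, sex in data}

try:
    json.dumps(names)
    tuple_keys_ok = True
except TypeError:
    # json only accepts str, int, float, bool or None as dictionary keys
    tuple_keys_ok = False

# One possible workaround (an assumption): flatten each key into a string.
exportable = {'{}|{}'.format(name, sex): freq for (name, sex), freq in names.items()}
as_json = json.dumps(exportable, sort_keys=True)
```

Inside Python the tuple-keyed dictionary is the more convenient form; the flattened copy is only needed at the point of export.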
I'd go with something like
results = {}
for d in data:
    name, amount, gender = d  # each d is already a [name, count, gender] list
    results[name] = results.get(name, {})
    results[name].update({gender: amount})
This way you'll get results in something like
{
'Isabella': {'Girl': '42567', 'Boy': '67'},
'Sophia': {'Girl': '42261'},
'Jacob': {'Boy': '42164'}
}
However, duplicated values will overwrite previous ones, so you need to take that into account if there are any. It also assumes that the whole file matches the format you've provided.
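This overwriting is also the likely reason the original flat dictionary looked wrong: when a name appears once as 'Girl' and again as 'Boy', the later row replaces the earlier one under the same key. A minimal sketch, assuming `data` is the list of lists from the question. The duplicate rows reuse the counts from the question's broken output, and the 'Zane' row is made up purely for illustration.

```python
data = [['Isabella', '42567', 'Girl'], ['Sophia', '42261', 'Girl'],
        ['Jacob', '42164', 'Boy'], ['Zane', '1500', 'Boy'],  # 'Zane' row is hypothetical
        ['Isabella', '56', 'Boy'], ['Jacob', '49', 'Girl']]

# Flat dict: the later row for a name silently replaces the earlier one.
flat = {}
for name, freq, sex in data:
    flat[name] = [freq, sex]

# Nested dict: one inner dict per name, keyed by sex, so nothing is lost.
names = {}
for name, freq, sex in data:
    names.setdefault(name, {})[sex] = int(freq)

# Two of the example questions from the exercise.
distinct_names = len(names)
boys_with_z = sum(1 for name, counts in names.items()
                  if 'Boy' in counts and name.lower().startswith('z'))
```

With the ten-line sample no name repeats, which would explain why the flat dictionary only appears to break on the full file.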
I have three lists: (1) treatments, (2) medicine names and (3) medicine code symbols. I am trying to identify the respective medicine code symbol for each of 14,700 treatments. My current approach is to check whether any name in (2) is "in" (1), and then return the corresponding (3). However, what I get back is an arbitrary list (of the correct length) of medicine code symbols for the 14,700 treatments. Code for the method I've written is below:
codes = pandas.read_csv('Codes.csv', dtype=str)
codes_list = codes.values.tolist()
names = pandas.read_csv('Names.csv', dtype=str)
names_list = names.values.tolist()
treatments = pandas.read_csv('Treatments.csv', dtype=str)
treatments_list = treatments.values.tolist()
matched_codes_list = range(len(treatments_list))
for i in range(len(treatments_list)):
    for j in range(len(names_list)):
        if names_list[j] in treatments_list[i]:
            matched_codes_list[i] = codes_list[j]
print matched_codes_list
Any suggestions for where I am going wrong would be much appreciated!
I can't tell what you are expecting. You should replace the xxx_list code with examples instead, since you don't seem to have any problems with the csv reading.
Let's suppose you did that, and your result looks like this.
codes_list = ['shark', 'panda', 'horse']
names_list = ['fin', 'paw', 'hoof']
assert len(codes_list) == len(names_list)
treatments_list = ['tape up fin', 'reverse paw', 'stand on one hoof', 'pawn affinity maneuver', 'alert wing patrol']
It sounds like you are trying to determine the 'code' for each 'treatment', assuming that the number of codes and names are the same (and indicate some mapping). You plan to use the presence of the name to determine the code.
We can zip together the names and codes lists to avoid using indexes there, and we can iterate over the treatment list directly instead of using indexes, for Pythonic readability:
matched_codes_list = []
for treatment in treatments_list:
    matched_codes = []
    for name, code in zip(names_list, codes_list):
        if name in treatment:
            matched_codes.append(code)
    matched_codes_list.append(matched_codes)
This would give something like
assert matched_codes_list == [
    ['shark'],  # 'tape up fin'
    ['panda'],  # 'reverse paw'
    ['horse'],  # 'stand on one hoof'
    ['shark', 'panda'],  # 'pawn affinity maneuver'
    [],  # 'alert wing patrol'
]
Note that the method used to do this is quite slow (and will probably give false positives, see the 4th entry). You will traverse the text of all treatment descriptions once for each name/code pair.
You can use a dictionary like lookup = {name: code for name, code in zip(names_list, codes_list)}, or itertools.izip (Python 2 only) for minor gains. Otherwise something more clever might be needed, perhaps splitting treatments into a set containing words, or mapping words into multiple codes.
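The word-splitting idea can be sketched as follows. This is only one possible reading of it, and it assumes that each medicine name is a single word and that treatments can be split on whitespace; real treatment text may need punctuation handling as well.

```python
codes_list = ['shark', 'panda', 'horse']
names_list = ['fin', 'paw', 'hoof']
treatments_list = ['tape up fin', 'reverse paw', 'stand on one hoof',
                   'pawn affinity maneuver', 'alert wing patrol']

# Map each medicine name to its code once, up front.
lookup = dict(zip(names_list, codes_list))

matched_codes_list = []
for treatment in treatments_list:
    words = treatment.lower().split()
    # Keep the code for every word that is exactly a known name.
    matched_codes_list.append([lookup[word] for word in words if word in lookup])
```

Because whole words are compared, 'pawn affinity maneuver' no longer matches 'paw' or 'fin', and each treatment is scanned once rather than once per name.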
Let's say I have a file called 'Food' listing the names of some foods and their prices. Some of these items are raw ingredients, and others are made from different amounts of these. For example, I might manually list the price of eggs as 1 and find that the omelette has a default price of 10, but then find that an omelette only needs 5 eggs, so I would need the program to read the price of eggs, find the line containing the omelette, and replace it with "omelette: " + str(5*eggs). I may also need to add extra ingredients or items of food, e.g. a pile of omelettes which is made from 5 omelettes. The basic goal is to make it possible to just edit the value of eggs and have the values of omelette and pileofomelettes update. I've started the code simply by creating a list of the lines contained within the file.
with open("Food.txt") as g:
    foodlist = g.readlines()
The file 'Food.txt' would be in the following format:
eggs: 5
omelette: 20
pileofomelettes: 120
etc...
and after the code runs it should look like
eggs: 5
omelette: 25
pileofomelettes: 125
I would code the relations manually since they would be so unlikely to ever change (and even if they did it would be fairly easy for me to go in and change the coefficients)
The file would be read by Python into a list, something like
['egg 2\n', 'flour 1\n', 'butter 1\n', 'sugar 3\n', 'almond 5\n', 'cherry 8\n']
I have searched for search/replace algorithms that can search for a specific phrase and replace it with another specific phrase, but I don't know how I'd apply them if the line was subject to change (the user could change the raw ingredient values if he wanted to update all of the values related to them). One solution I can think of involves converting the lines into a dictionary, with each item stored as a string-integer pair, so that I could just replace the integer part of a pair based on the integer values stored in other pairs. Being inexperienced, though, I don't know how I'd convert the list (or, even better, the raw file itself) into a dictionary.
Any advice on how to carry out steps of this program would be greatly appreciated :)
EDIT: in the actual application of the program, it doesn't matter what order the items are listed in in the final file. So if I listed all the raw ingredients in one place and all of the composite items in another (with a large space in between them in case more raw items need to be added), then I could just rewrite the entire second half of the file in an arbitrary order with no problem, so long as the line position of the raw ingredients remains the same.
Okay, I would suggest making a relations text file which you can parse, in case you think the relations may change later, or just so that your code is easier to read and modify. It can then be parsed to find the required relations between raw ingredients and complexes. Call it "relations.txt", with lines of the form:
omelette: 5 x eggs + 1 x onions
pileofomelettes: 6 x omelette
Here, you can put arbitrary number of ingredients of the type:
complex: number1 x ingredient1 + number2 x ingredient2 + ...
and so on.
And your food.txt contains prices of all ingredients and complexes:
eggs: 2
onions: 1
omelette: 11.0
pileofomelettes: 60
Note that the value for pileofomelettes is intentionally wrong here (it should be 6 x 11 = 66). So we will run the code below; you can also change the numbers and see the results.
#!/usr/bin/python
''' This program takes a relations file and a food text file as inputs
and can be used to update the food text file based on changes in either of these'''
relations = {}
foodDict = {}
# Mapping ingredients to each other in the relations dictionary
with open("relations.txt") as r:
    relationlist = r.read().splitlines()
for relation in relationlist:
    item, equatedTo = relation.split(': ')
    ingredientsAndCoefficients = equatedTo.split(' + ')
    listIngredients = []
    for ingredient in ingredientsAndCoefficients:
        coefficient, item2 = ingredient.split(' x ')
        # A list of (amount, ingredient) tuples
        listIngredients.append((float(coefficient), item2))
    relations.update({item: listIngredients})
# Creating a food dictionary with values from food.txt and mapping to the relations dictionary
with open("food.txt") as g:
    foodlist = g.read().splitlines()
for item in foodlist:
    food, value = item.split(': ')
    foodDict.update({food: value})
for food in relations.keys():
    # (Raw) ingredients with no left-hand side in relations.txt will not change here.
    value = 0.
    for item2 in range(len(relations[food])):
        # Calculating the new value for the complex here.
        value += relations[food][item2][0] * float(foodDict[relations[food][item2][1]])
    foodDict.update({food: value})
# Altering food.txt with the new dictionary values
with open("food.txt", 'w') as g:
    for key in sorted(foodDict.keys()):
        g.write(key + ': ' + str(foodDict[key]) + '\n')
        print(key + ': ' + str(foodDict[key]))
And it comes out to be (keys are written in sorted order):
eggs: 2
omelette: 11.0
onions: 1
pileofomelettes: 66.0
You can change the price of eggs to 5 in the food.txt file and rerun the program to get:
eggs: 5
omelette: 26.0
onions: 1
pileofomelettes: 156.0
How does your program know the components of each item? I suggest that you keep two files: one with the cost of atomic items (eggs) and another with recipes (omelette <= 5 eggs).
Read both files. Store the atomic costs, remembering how many of these items you have, atomic_count. Extend this table from the recipes file, one line at a time. If the recipe you're reading consists entirely of items with known costs, then compute the cost and add that item to the "known" list. Otherwise, append the recipe to a "later" list and continue.
When you reach the end of both input files, you will have a list of known costs, and a few other recipes that depended on items farther down the recipe file. Now cycle through this "later" list of still-unknown recipes until either (a) it's empty, or (b) a full pass resolves nothing because no remaining recipe has all of its ingredients known. In case (b), you have something wrong with your input: either an ingredient with no definition, or a circular dependency. Print the remaining recipes list and debug your input files.
In case (a), you are now ready to print your Food.txt list. Go through your "known" list and write out one item or recipe at a time. When you get to item [atomic_count], write out a second file, a new recipe list. This is your old recipe list, but in a useful top-down order. In the future, you won't have any "unknown" recipes after the first pass.
For future changes ... don't bother. You have only 173 items, and the list sounds unlikely to grow past 500. When you change or add an item, just hand-edit the file and rerun the program. That will be faster than the string-replacement algorithm you're trying to write.
In summary, I suggest that you do just the initial computation problem, which is quite a bit simpler than adding the string update. Don't do incremental updates; redo the whole list from scratch. For such a small list, the computer will do this faster than you can write and debug the extra coding.
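The multi-pass approach described above can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the function name `resolve_costs` and the in-memory dictionaries are hypothetical stand-ins for the two input files, and the recipe data mirrors the question's eggs/omelette example.

```python
def resolve_costs(atomic, recipes):
    """Return all costs, computing each recipe once its ingredients are known."""
    known = dict(atomic)
    pending = dict(recipes)
    while pending:
        # Recipes whose ingredients all have known costs can be computed now.
        ready = [item for item, parts in pending.items()
                 if all(ingredient in known for ingredient, _ in parts)]
        if not ready:
            # Case (b): an undefined ingredient or a circular dependency.
            raise ValueError('cannot resolve: {}'.format(sorted(pending)))
        for item in ready:
            known[item] = sum(qty * known[ingredient]
                              for ingredient, qty in pending.pop(item))
    return known

atomic = {'eggs': 5}
# Deliberately listed out of order: the pile depends on the omelette.
recipes = {'pileofomelettes': [('omelette', 5)], 'omelette': [('eggs', 5)]}
costs = resolve_costs(atomic, recipes)
```

Everything is recomputed from scratch on each run, so changing the price of eggs only requires editing the atomic data and rerunning.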
I'm still not really sure what you are asking but this is what I came up with...
from collections import OrderedDict
food_map = {'omelette': {'eggs': 5, 'price': None}, 'pileofomelettes': {'eggs': 25, 'price': None}, 'eggs': {'price': 5}}
with open('food.txt', 'r') as f:  # text mode, so that split(': ') also works on Python 3
    data = f.read().splitlines()
data = OrderedDict([(x[0], int(x[1])) for x in [x.split(': ') for x in data]])
for key, val in data.items():
    if key == 'eggs':
        continue
    food_rel = food_map.get(key, {})
    val = food_rel.get('eggs', 1) * food_map.get('eggs', {}).get('price', 1)
    data[key] = val
with open('out.txt', 'w') as f:
    data = '\n'.join(['{0}: {1}'.format(key, val) for key, val in data.items()])
    f.write(data)
Sorry if this sounds like a silly question but this problem has gotten me really confused. I'm fairly new to python, so maybe I'm missing something. I did some research but haven't gotten too far. Here goes:
I'm going to use a simple example that makes the question clearer, my data is different but the format and required action is the same. We have a database of people and the pizzas they eat (and some other data). Our database however has multiple entries of the same people with different pizzas (because we combined data gotten from different pizzerias).
example dataset:
allData = [['joe','32', 'pepperoni,cheese'],['marc','24','cheese'],['jill','27','veggie supreme, cheese'],['joe','32','pepperoni,veggie supreme'],['marc','25','cheese,chicken supreme']]
Few things we notice and rules I want to follow:
Names can appear multiple times, though in this specific case we KNOW that any entries with the same name are the same person.
The age can be different for the same person in different entries, so we just pick the first age we encountered for the person and use it. For example, marc's age is 24 and we ignore the 25 from the second entry.
I want to edit the data so that a person's name only appears ONCE, and the pizzas he eats are a unique set from all entries with the same name. As mentioned before, the age is just the first one encountered. Therefore, I'd want the final data to look like this:
fixedData = [['joe','32','pepperoni,cheese,veggie supreme'],['marc','24','cheese,chicken supreme'],['jill','27','veggie supreme, cheese']]
I'm thinking something on the lines of:
fixedData = []
for i in allData:
    if i[0] not in fixedData[0]:
        fixedData.append[i]
    else:
        fixedData[i[-1]]=set(fixedData[i[-1]],i[-1])
I know I'm making several mistakes. could you please please point me towards the right direction?
Thanks heaps.
Since names are unique, it makes sense to use them as the keys of a dict. This will be much more appropriate in your case:
>>> d = {}
>>> for i in allData:
...     if i[0] in d:
...         d[i[0]][-1] = list(set(d[i[0]][-1] + (i[-1].split(','))))
...     else:
...         d[i[0]] = [i[1],i[2].split(',')]
...
>>> d
{'joe': ['32', ['pepperoni', 'cheese', 'veggie supreme']], 'marc': ['24', ['cheese', 'chicken supreme']], 'jill': ['27', ['veggie supreme', ' cheese']]}
(The order of the pizzas for joe and marc may vary, since those lists pass through a set.)
In cases like yours I like to use defaultdict. I really hate the guesswork that comes with list indexes.
from collections import defaultdict
allData = [['joe', '32', 'pepperoni,cheese'],
           ['marc', '24', 'cheese'],
           ['jill', '27', 'veggie supreme, cheese'],
           ['joe', '32', 'pepperoni,veggie supreme'],
           ['marc', '25', 'cheese,chicken supreme']]
d = defaultdict(dict)
for name, age, pizzas in allData:
    d[name].setdefault('age', age)
    d[name].setdefault('pizzas', set())
    d[name]['pizzas'] |= set(pizzas.split(','))
Notice the usage of setdefault to set the first age value we encounter. It also enables the use of set union to get the unique pizzas.
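If the result has to go back into the original list-of-lists shape, something along these lines might work. It is a sketch with two assumptions that are not in the answer above: whitespace around pizza names is stripped (so ' cheese' and 'cheese' count as the same pizza), and the pizzas are sorted so the output is deterministic.

```python
from collections import OrderedDict

allData = [['joe', '32', 'pepperoni,cheese'],
           ['marc', '24', 'cheese'],
           ['jill', '27', 'veggie supreme, cheese'],
           ['joe', '32', 'pepperoni,veggie supreme'],
           ['marc', '25', 'cheese,chicken supreme']]

people = OrderedDict()  # keeps names in first-seen order
for name, age, pizzas in allData:
    # setdefault only stores the entry (and its age) the first time a name is seen.
    entry = people.setdefault(name, {'age': age, 'pizzas': set()})
    entry['pizzas'] |= {p.strip() for p in pizzas.split(',')}

fixedData = [[name, entry['age'], ','.join(sorted(entry['pizzas']))]
             for name, entry in people.items()]
```

Sorting changes the pizza order relative to the question's desired output, which is a trade-off for getting repeatable results out of a set.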
I have a list of names and addresses organized in the following format:
Mr and Mrs Jane Doe
Candycane Lane
Magic Meadows, SC
I have several blocks of data written like this, and I want to be able to alphabetize the blocks by last name (Doe, in this case). After doing some digging, the best I can reckon is that I need to make a "list of lists" and then use the last name as a key by which to alphabetize the blocks. However, given my freshness to Python and lack of Google skills, the closest I could find was this. I'm confused as to converting each block to a list and then slicing it; I can't seem to find a way to do this and still be able to alphabetize properly. Any and all guidance is greatly appreciated.
If I understood correctly, what you want basically is to sort values by "some computation done on the value", in this case the extracted last name.
For that, use the key keyword argument to .sort() or sorted():
def my_key_function(original_name):
    ## do something to extract the last name, for example:
    try:
        return original_name.split(',')[1].strip()
    except IndexError:
        return original_name
my_sorted_values = sorted(my_original_values, key=my_key_function)
The only requirement is that your "key" function is deterministic, i.e. it always returns the same output for a given input.
You might also want to sort by last name and then first name: in this case, just return a tuple (last, first). If last is the same for two given items, first will be used to further sort the two.
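A small sketch of the tuple-key idea, assuming plain 'First Last' strings with single-word names. The function name `last_first_key` and the sample names are made up for illustration.

```python
def last_first_key(full_name):
    """Sort key: (last name, first name), assuming 'First Last' strings."""
    parts = full_name.split()
    return (parts[-1], parts[0])

people = ['Jane Smith', 'Adam Smith', 'Jane Doe']
ordered = sorted(people, key=last_first_key)
```

Tuples compare element by element, so the first name only matters when two last names are equal.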
Update
For your specific case, this function should do the trick:
def my_key_function(original_name):
    return original_name.splitlines()[0].split()[-1]
Assuming you already have the data in a list
l = ['Mr and Mrs Jane Smith\nCandycane Lane\nMagic Meadows, SC',
'Mr and Mrs Jane Doe\nCandycane Lane\nMagic Meadows, SC',
'Mr and Mrs Jane Atkins\nCandycane Lane\nMagic Meadows, SC']
You can specify the key to sort on.
l.sort(key=lambda x: x.split('\n')[0].split(' ')[-1])
In this case, get the last word (.split(' ')[-1]) on the first line (.split('\n')[0])
You want to make a new list where each entry is a tuple containing the sort key you want and the whole thing. Sort that list and then get the second component of each entry in the sorted result:
def get_sort_name(address):
    name = address.split('\n')[0]
    return (name.split(' ')[-1], address)  # last word of the first line & the whole thing, as a tuple
keyed_list = list(map(get_sort_name, addresses))
keyed_list.sort()
sorted_addresses = [item[1] for item in keyed_list]
This could be more compact using lambdas of course, but it's better to be readable :)
My stock program's input is as follows.
'Sqin.txt' is the data read in; it is a CSV file:
AAC,D,20111207,9.83,9.83,9.83,9.83,100
AACC,D,20111207,3.46,3.47,3.4,3.4,13400
AACOW,D,20111207,0.3,0.3,0.3,0.3,500
AAME,D,20111207,1.99,1.99,1.95,1.99,8600
AAON,D,20111207,21.62,21.9,21.32,21.49,93200
AAPL,D,20111207,389.93,390.94,386.76,389.09,10892800
AATI,D,20111207,5.75,5.75,5.73,5.75,797900
The output is
dat1[]
['AAC', ['9.83', '9.83', '9.83', '9.83', '100'], ['9.83', '9.83', '9.83', '9.83', '100']]
dat1[0] is the stock symbol 'AAC', used for lookup and data updates
dat1[1....?] is the EOD (end of day) data
At the close of the stock markets, the EOD data will be inserted with dat1.insert(1, M) each update cycle.
Guys, you can probably code this out in one line. Mine so far is over 30 lines, so seeing my code isn't relevant. Above is an example of some simple input and the desired output.
If you decide to take on some real-world programming, please keep it verbose. Declare your variables, then populate them, and finally use them, e.g.
M = []
M = q[0][3:]  ## had to do it this way because 'AAC' made the variable M [] begin as a string (immutable). So I could not add M to the data -dat1[]- because -dat1[]- also became a string (immutable strings, how stupid). Had to force 'AAC' to be a list so I can create a list of lists -dat1-
dat1.insert(1, M)  ## -M- is used to add another list to the master.dat record
Maybe it would be OK to be somewhat Pythonic and a little less verbose.
You should use a dictionary with the names as keys:
import csv
import collections
filename = 'csv.txt'
with open(filename) as file_:
    reader = csv.reader(file_)
    data = collections.defaultdict(list)
    for line in reader:
        # line[1] contains "D" and line[2] is the date
        key, value = line[0], line[3:]
        data[key].append(value)
To add data you do data[name].insert(0, new_data), where name could be AAC and new_data is a list of the data. This places the new data at the beginning of the list, like you said in your post.
I would recommend append instead of insert, as it is faster. If you really want the data added to the beginning of the list, use collections.deque instead of list.
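As a rough sketch of the deque suggestion: the two CSV rows below are copied from the question and inlined with `io.StringIO` so the example runs without a file, while the "next day" row for AAC is invented for illustration.

```python
import collections
import csv
import io

# Two sample rows from the question, inlined in place of the real file.
sample = io.StringIO("AAC,D,20111207,9.83,9.83,9.83,9.83,100\n"
                     "AAPL,D,20111207,389.93,390.94,386.76,389.09,10892800\n")

# One deque of end-of-day rows per stock symbol.
data = collections.defaultdict(collections.deque)
for line in csv.reader(sample):
    data[line[0]].append(line[3:])

# Hypothetical next-day row for AAC; appendleft puts it at the front in O(1).
new_row = ['9.90', '9.95', '9.80', '9.85', '200']
data['AAC'].appendleft(new_row)
```

With a plain list, insert(0, ...) has to shift every existing element, which is why deque is the better fit when new rows always go to the front.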