Processing dictionaries in Python with large amount of data

I am trying to process IMDb data with a Python dictionary. After some basic data cleaning, I have a dictionary people_dict, which looks like
people_dict = {...,936: ['And White Was the Night (2015)', 'Lipton Cockton in the Shadows of Sodoma (1995)', 'Maraton (1997)', 'Rundi (1990)', 'Sounds Like Suomi (2008)'],...}
where the key is the id of an actor/actress and the list holds the movies he/she has acted in.
Now I am trying to build another dictionary movie_dict based on people_dict, which looks like
movie_dict = {...,'Beats, Rhymes & Life: The Travels of a Tribe Called Quest (2011)': [3],...}
where the key is the name of a movie and the value is a list of actor/actress ids.
However, my implementation (see below) uses nested loops, and almost 100,000 movies and actors/actresses are involved. Optimistically, it would give me what I want in a week.
for value in movie_dict.keys():
    for people_id, movie_list in people_dict.items():
        if value in movie_list:
            movie_dict[value].append(people_id)
So is there anything I could do to significantly reduce the runtime? I have checked out this thread, where map seems to be a good option.
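One way to avoid the nested loops is to invert people_dict in a single pass: for each person, append that person's id to the list of every movie they appear in. The following is only a minimal sketch. It assumes people_dict has the shape shown above, the sample entries are shortened and partly made up (id 937 is hypothetical), and using collections.defaultdict is a suggestion, not something from the original post.

```python
from collections import defaultdict

# Tiny sample in the same shape as people_dict from the question.
# The entry for id 937 is made up for illustration.
people_dict = {
    936: ['Maraton (1997)', 'Rundi (1990)'],
    937: ['Rundi (1990)'],
}

# Invert the mapping in one pass: movie title -> list of person ids.
movie_dict = defaultdict(list)
for people_id, movie_list in people_dict.items():
    for movie in movie_list:
        movie_dict[movie].append(people_id)

print(dict(movie_dict))
```

This visits each (person, movie) pair exactly once, instead of scanning every movie list for every movie title.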

Related

Too many print results - for key in dict

Good morning everyone,
There are a number of posts on here somewhat related to this however I am newer and a hands on learner so it's difficult to grasp the solutions offered on other peoples coding when I don't necessarily know their end goal. This is the first time I have tried to apply my coding skills (or lack thereof :D) but I have been working through m1mo and reading/watching an assortment of guides/tutorials the last couple of months. So yes my code may look goofy to a lot of you but got to start somewhere!
Goal: I want to pull the dictionary value from product_dict of the key that is in the c_item_number_one, which this code does successfully, but if the key does not appear in the dictionary then I want to print("Not in dictionary")
Issue: While the code does provide the dictionary value based on the key there will be times when the c_item_number_one does not include a valid key. When this happens, I want to know by print("Not in dictionary"). Currently this code will print "Not in dictionary" for every single dictionary entry that does not appear in my product_dict. I only want it to tell me once in the event it does not appear a single time.
There will also be times, like in this example, where multiple keys are found within the dictionary, this is okay. I want it to print all of these instances as I will be adding further validation when this occurs in later code.
Note that the below is a small sample of the actual dictionary and that I have roughly 1200 entries in reality with more to be added as time goes on.
"Product" is only one of a dozen categories I need to pull from descriptions so any help here will greatly help me towards the end game and be very much appreciated!
product_dict = {
    'BEND': 'BEND',
    'FABRICATE SPOOL': 'PIPE SPOOL',
    'STUB-END': 'STUB END',
    'GSK': 'GASKET',
    'SA-106': 'PIPE',
    'PIPE': 'PIPE',
}
c_item_number_one = '12",PIPE , GR. B, SCH 40, WALL SMLS'
#Product for Item One
def item_one_product():
    found = True
    for key in product_dict:
        if key in c_item_number_one:
            item_number_one_product = product_dict[key]
            print(item_number_one_product)
        else:
            found = False
            print("Not in dictionary")
item_one_product()
It prints:
Not in dictionary
Not in dictionary
Not in dictionary
Not in dictionary
PIPE
PIPE

You were close, check this:
#Product for Item One
def item_one_product():
    not_found = True
    for key in product_dict:
        if key in c_item_number_one:
            item_number_one_product = product_dict[key]
            print(item_number_one_product)
            not_found = False
    if not_found:
        print("Not in dictionary")
item_one_product()
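Since the same lookup is needed for a dozen categories, it may help to collect the matches in a list and decide what to print afterwards. This is only a sketch: find_products is a hypothetical helper name, and the dictionary is a shortened copy of the one in the question.

```python
# Shortened copy of product_dict from the question.
product_dict = {
    'BEND': 'BEND',
    'GSK': 'GASKET',
    'PIPE': 'PIPE',
}

def find_products(description, products):
    """Return the value of every key that occurs in the description."""
    return [value for key, value in products.items() if key in description]

matches = find_products('12",PIPE , GR. B, SCH 40, WALL SMLS', product_dict)
if matches:
    for product in matches:
        print(product)
else:
    # Printed once, and only when no key matched at all.
    print("Not in dictionary")
```

Returning the list also leaves room for the later validation step when several keys match.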

Creating dictionary from list of lists

I am working on an online course exercise (practice problem before the final test).
The test involves working with a big csv file (not downloadable) and answering questions about the dataset. You're expected to write code to get the answers.
The data set is a list of all documented baby names each year, along with how often each name was used for boys and for girls.
A sample list of the first 10 lines is also given:
Isabella,42567,Girl
Sophia,42261,Girl
Jacob,42164,Boy
and so on.
Questions you're asked include things like 'how many names in the data set', 'how many boys' names beginning with z' etc.
I can get all the data into a list of lists:
[['Isabella', '42567', 'Girl'], ['Sophia', '42261', 'Girl'], ['Jacob', '42164', 'Boy']]
My plan was to convert into a dictionary, as that would probably be easier for answering some of the other questions. The list of lists is saved to the variable 'data':
names = {}
for d in data:
    names[d[0]] = d[1:]
print(names)
{'Isabella': ['42567', 'Girl'], 'Sophia': ['42261', 'Girl'], 'Jacob': ['42164', 'Boy']}
Works perfectly.
Here's where it gets weird. If instead of opening the sample file with 10 lines, I open the real csv file, with around 16,000 lines, everything works perfectly right up to the very last bit.
I get the complete list of lists, but when I go to create the dictionary, it breaks (here I'm just showing the first three items, but the full 16,000 lines are all wrong in a similar way):
names = {}
for d in data:
    names[d[0]] = d[1:]
print(names)
{'Isabella': ['56', 'Boy'], 'Sophia': ['48', 'Boy'], 'Jacob': ['49', 'Girl']
I know the data is there and correct, since I can read it directly:
for d in data:
    print(d[0], d[1], d[2])
Isabella 42567 Girl
Sophia 42261 Girl
Jacob 42164 Boy
Why would this dictionary work fine with the csv file with 10 lines, but completely break with the full file? I can't find any

The full file presumably lists the same name more than once (for example once for girls and once for boys), so keying on the name alone lets later rows overwrite earlier ones. Follow the comments and create two dicts, or a single dictionary with tuple keys. Using tuples as keys is fine if you keep your variables inside Python, but you might get into trouble when exporting to JSON, for example.
Try a dictionary comprehension with list unpacking
names = {(name, sex): freq for name, freq, sex in data}
Or a for loop as you started
names = dict()
for name, freq, sex in data:
    names[(name, sex)] = freq
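To make the difference concrete, here is a small sketch comparing the two keying strategies. The rows are taken from the values shown in the question, but their order, the variable names by_name and by_name_and_sex, and the sample query are assumptions for illustration.

```python
# Rows in the same [name, count, sex] shape as the question's list of lists.
data = [
    ['Isabella', '42567', 'Girl'],
    ['Sophia', '42261', 'Girl'],
    ['Jacob', '42164', 'Boy'],
    ['Isabella', '56', 'Boy'],  # same name again, for the other sex
]

# Keying on the name alone keeps only the last row seen for each name.
by_name = {name: [freq, sex] for name, freq, sex in data}

# Keying on (name, sex) keeps both rows; counts are converted to int.
by_name_and_sex = {(name, sex): int(freq) for name, freq, sex in data}

print(by_name['Isabella'])
print(by_name_and_sex[('Isabella', 'Girl')])

# Sample query: how many boys' names start with "J"?
boys_with_j = sum(
    1 for (name, sex) in by_name_and_sex
    if sex == 'Boy' and name.startswith('J')
)
print(boys_with_j)
```

With the name-only key, the girl's count for Isabella is silently replaced by the boy's count, which matches the symptom in the question.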

I'd go with something like
results = {}
for d in data:
    name, amount, gender = d  # each row is already a [name, amount, gender] list
    results[name] = results.get(name, {})
    results[name].update({gender: amount})
This way you'll get results in something like
{
    'Isabella': {'Girl': '42567', 'Boy': '67'},
    'Sophia': {'Girl': '42261'},
    'Jacob': {'Boy': '42164'}
}
However, duplicated values will override previous ones, so you need to take that into account if there are any. It also assumes that the whole file matches the format you've provided.

How to reference a specific part of a dictionary in Python

Alright, so I am trying to create a simple thing that will tell me the showtimes of the movies at the theatre, the names of the movies, and the Rotten Tomatoes score in Python, but I am having a hard time figuring out how to get the meterScore.
actorCount = 0
actors = []
criticCount = 0
critics = []
franchiseCount = 0
franchises = []
movieCount = 3
movies = [{'name': 'Frozen II', 'year': 2019, 'url': '/m/frozen_ii', 'image': 'https://resizing.flixster.com/QZg2MuPQoRlWcWYAwufbQBlv-I0=/fit-in/80x80/v1.bTsxMzIwMzIxODtqOzE4Mjg3OzEyMDA7NTQwOzgxMA', 'meterClass': 'certified_fresh', 'meterScore': 76, 'castItems': [{'name': 'Kristen Bell', 'url': '/celebrity/kristin_bell'}, {'name': 'Idina Menzel', 'url': '/celebrity/idina_menzel'}, {'name': 'Josh Gad', 'url': '/celebrity/josh_gad'}], 'subline': 'Kristen Bell, Idina Menzel, Josh Gad, '}]
tvCount = 0
tvSeries = []
What I am trying to get from that list of data is the meterScore; if you scroll far enough to the right you can see it. All this data is part of a bigger dictionary, which I named resultOne, but I don't think that matters. I just need some help figuring out how to reference and get the meterScore from the dictionary so I can print it out; then when I want to see what movies are showing and what ratings they got, I can just run this program and it will do it for me. I don't really use dictionaries that much, but the library I am using to get the Rotten Tomatoes data returns it as this very hard to reference dictionary, so any help is appreciated! What I don't get is that if I try to print(resultOne.movies) it says that is not an attribute or something to that effect, even though when I put it into something that prints out the keys and values, as I did to get the listing above, it clearly shows it is a key. I also tried print(resultOne.movies[meterScore]), but that didn't work either.

Dictionary values are looked up by their keys using [], not with a dot (.).
Now, the trick is that the movies key points to a list. So you need to mix two kinds of indexing that both use []: dictionary indexing, which is by key, and list indexing, which is by position in the list (starting at 0).
Ultimately, you want to do this:
score = resultOne['movies'][0]['meterScore']
                 ^         ^  ^
                 |         |  |
lookup in outer dict       |  |
           first item in list |
                    lookup in inner dict

Try this:
movies[0]['meterScore']
# 76

Why don't you try something like this to extract every meterScore from all the movies in the dictionary:
listOfAllMeterScores = [ movie['meterScore'] for movie in movies ]

In that snippet, movies is a list containing a dict. So index the list then index the dict:
movies[0]['meterScore']
If movies might contain more than one item (or zero for that matter), iterate over it instead to get a list of the meterScores:
meter_scores = [movie['meterScore'] for movie in movies]
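If some results might lack a score or contain no movies at all, dict.get can supply defaults instead of raising KeyError. A minimal sketch, assuming the structure shown in the question; result_one is a trimmed, hypothetical stand-in for resultOne, and 'no score' is an arbitrary placeholder.

```python
# Trimmed stand-in for the resultOne dictionary from the question.
result_one = {
    'movieCount': 1,
    'movies': [{'name': 'Frozen II', 'year': 2019, 'meterScore': 76}],
}

# .get() returns the given default when the key is missing.
for movie in result_one.get('movies', []):
    print(movie['name'], movie.get('meterScore', 'no score'))

# Map each movie name to its score (None if the score is absent).
scores = {m['name']: m.get('meterScore') for m in result_one.get('movies', [])}
```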

List Matching in Python using nested for loops

I have three lists: (1) treatments, (2) medicine names and (3) medicine code symbols. I am trying to identify the respective medicine code symbol for each of 14,700 treatments. My current approach is to check whether any name in (2) is "in" (1), and then return the corresponding (3). However, I get back an arbitrary list (of the correct length) of medicine code symbols for the 14,700 treatments. The code for the method I've written is below:
codes = pandas.read_csv('Codes.csv', dtype=str)
codes_list = _codes.values.tolist()
names = pandas.read_csv('Names.csv', dtype=str)
names_list = names.values.tolist()
treatments = pandas.read_csv('Treatments.csv', dtype=str)
treatments_list = treatments.values.tolist()
matched_codes_list = range(len(treatments_list))
for i in range(len(treatments_list)):
    for j in range(len(names_list)):
        if names_list[j] in treatments_list[i]:
            matched_codes_list[i]=codes_list_text[j]
print matched_codes_list
Any suggestions for where I am going wrong would be much appreciated!

I can't tell what you are expecting. You should replace the xxx_list code with examples instead, since you don't seem to have any problems with the csv reading.
Let's suppose you did that, and your result looks like this.
codes_list = ['shark', 'panda', 'horse']
names_list = ['fin', 'paw', 'hoof']
assert len(codes_list) == len(names_list)
treatments_list = ['tape up fin', 'reverse paw', 'stand on one hoof', 'pawn affinity maneuver', 'alert wing patrol']
It sounds like you are trying to determine the 'code' for each 'treatment', assuming that the numbers of codes and names are the same (and that position indicates the mapping). You plan to use the presence of the name to determine the code.
We can zip together the names and codes lists to avoid using indexes there, and we can iterate over the treatment list directly instead of using indexes, for Pythonic readability:
matched_codes_list = []
for treatment in treatments_list:
    matched_codes = []
    for name, code in zip(names_list, codes_list):
        if name in treatment:
            matched_codes.append(code)
    matched_codes_list.append(matched_codes)
This would give something like
assert matched_codes_list == [
    ['shark'], # 'tape up fin'
    ['panda'], # 'reverse paw'
    ['horse'], # 'stand on one hoof'
    ['shark', 'panda'], # 'pawn affinity maneuver'
    [], # 'alert wing patrol'
]
Note that this method is quite slow (and will probably give false positives; see the 4th entry, where 'fin' matches inside 'affinity' and 'paw' inside 'pawn'). You will traverse the text of all treatment descriptions once for each name/code pair.
You can use a dictionary like lookup = dict(zip(names_list, codes_list)), or itertools.izip on Python 2 for minor gains. Otherwise something more clever might be needed, perhaps splitting treatments into a set of words, or mapping words to multiple codes.
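The word-splitting idea could look roughly like the sketch below. It reuses the example lists from this answer and assumes that names are single words separated by whitespace; punctuation and multi-word names would need extra handling.

```python
codes_list = ['shark', 'panda', 'horse']
names_list = ['fin', 'paw', 'hoof']
treatments_list = ['tape up fin', 'reverse paw', 'stand on one hoof',
                   'pawn affinity maneuver', 'alert wing patrol']

# Map each name to its code once, so every lookup is a dictionary access.
lookup = dict(zip(names_list, codes_list))

matched_codes_list = []
for treatment in treatments_list:
    words = treatment.split()
    # Only whole words count, so 'pawn' no longer matches 'paw'.
    matched_codes_list.append([lookup[word] for word in words if word in lookup])

print(matched_codes_list)
```

Each treatment is now split once and each word costs one dictionary lookup, rather than one substring search per name.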

How do I replace lines in a file using data contained elsewhere in the same file?

Let's say I have a file called 'Food' listing the names of some foods and their prices. Some of these items are raw ingredients, and others are made from different amounts of these. For example, I might manually list the price of eggs as 1 and find that the omelette has a default price of 10, but then find that an omelette only needs 5 eggs, so I would need the program to read the price of eggs, find the line containing the omelette, and replace it with "omelette: " + str(5*eggs). I may also need to add extra ingredients or items of food, e.g. a pile of omelettes, which is made from 5 omelettes. The basic goal is to make it possible to just edit the value of eggs and have the values of omelette and pileofomelettes update. I've started the code simply by creating a list of the lines contained within the file.
with open("Food.txt") as g:
    foodlist=g.readlines()
The file 'Food.txt' would be in the following format:
eggs: 5
omelette: 20
pileofomelettes: 120
etc...
and after the code runs it should look like
eggs: 5
omelette: 25
pileofomelettes: 125
I would code the relations manually since they would be so unlikely to ever change (and even if they did it would be fairly easy for me to go in and change the coefficients)
and would be read by python in its list format as something like
'['egg 2\n', 'flour 1\n', 'butter 1\n', 'sugar 3\n', 'almond 5\n', 'cherry 8\n']'
I have searched for search/replace algorithms that can find a specific phrase and replace it with another specific phrase, but I don't know how I'd apply that if the line is subject to change (the user could change the raw ingredient values if he wanted to update all of the values related to them). One solution I can think of involves converting the lines into a dictionary of string-integer pairs, so that I could just replace the integer part of a pair based on the integer values stored in other pairs, but, being inexperienced, I don't know how I'd convert the list (or, even better, the raw file itself) into a dictionary.
Any advice on how to carry out the steps of this program would be greatly appreciated :)
EDIT: In the actual application of the program, it doesn't matter what order the items are listed in the final file, so if I listed all the raw ingredients in one place and all of the composite items in another (with a large space in between them in case more raw items need to be added), then I could just rewrite the entire second half of the file in an arbitrary order with no problem, so long as the line position of the raw ingredients remains the same.

Okay, I would suggest making a relations text file that you can parse, in case you think the relations may change later, or just so that your code is easier to read and modify. It can then be parsed to find the required relations between raw ingredients and complexes. Call it "relations.txt", with lines of the form:
omelette: 5 x eggs + 1 x onions
pileofomelettes: 6 x omelette
Here, you can put an arbitrary number of ingredients of the type:
complex: number1 x ingredient1 + number2 x ingredient2 + ...
and so on.
And your food.txt contains prices of all ingredients and complexes:
eggs: 2
onions: 1
omelette: 11.0
pileofomelettes: 60
Now we can see that the value for pileofomelettes is intentionally wrong here. So we will run the code below; you can also change the numbers and see the results.
#!/usr/bin/python
''' This program takes in a relations file and a food text file as inputs
and can be used to update the food text file based on changes in either of these'''
relations = {}
foodDict = {}
# Mapping ingredients to each other in the relations dictionary
with open("relations.txt") as r:
    relationlist = r.read().splitlines()
for relation in relationlist:
    item, equatedTo = relation.split(': ')
    ingredientsAndCoefficients = equatedTo.split(' + ')
    listIngredients = []
    for ingredient in ingredientsAndCoefficients:
        coefficient, item2 = ingredient.split(' x ')
        # A list of (amount, ingredient) tuples
        listIngredients.append((float(coefficient), item2))
    relations.update({item: listIngredients})
# Creating a food dictionary with values from food.txt and mapping to the relations dictionary
with open("food.txt") as g:
    foodlist = g.read().splitlines()
for item in foodlist:
    food, value = item.split(': ')
    foodDict.update({food: value})
for food in relations.keys():
    # (Raw) ingredients with no left hand side value in relations.txt will not change here.
    value = 0.
    for item2 in range(len(relations[food])):
        # Calculating new value for complex here.
        value += relations[food][item2][0] * float(foodDict[relations[food][item2][1]])
    foodDict.update({food: value})
# Altering the food.txt with the new dictionary values
with open("food.txt", 'w') as g:
    for key in sorted(foodDict.keys()):
        g.write(key + ': ' + str(foodDict[key]) + '\n')
        print(key + ': ' + str(foodDict[key]))
And it comes out to be (keys are written in sorted order):
eggs: 2
omelette: 11.0
onions: 1
pileofomelettes: 66.0
You can change the price of eggs to 5 in the food.txt file, and the result becomes:
eggs: 5
omelette: 26.0
onions: 1
pileofomelettes: 156.0

How does your program know the components of each item? I suggest that you keep two files: one with the cost of atomic items (eggs) and another with recipes (omelette <= 5 eggs).
Read both files. Store the atomic costs, remembering how many of these items you have, atomic_count. Extend this table from the recipes file, one line at a time. If the recipe you're reading consists entirely of items with known costs, then compute the cost and add that item to the "known" list. Otherwise, append the recipe to a "later" list and continue.
When you reach the end of both input files, you will have a list of known costs, and a few other recipes that depended on items farther down the recipe file. Now cycle through this "unknown" list until (a) it's empty; (b) you don't have anything with all the ingredients known. If case (b), you have something wrong with your input: either an ingredient with no definition, or a circular dependency. Print the remaining recipes list and debug your input files.
In case (a), you are now ready to print your Food.txt list. Go through your "known" list and write out one item or recipe at a time. When you get to item [atomic_count], write out a second file, a new recipe list. This is your old recipe list, but in a useful top-down order. In the future, you won't have any "unknown" recipes after the first pass.
For future changes ... don't bother. You have only 173 items, and the list sounds unlikely to grow past 500. When you change or add an item, just hand-edit the file and rerun the program. That will be faster than the string-replacement algorithm you're trying to write.
In summary, I suggest that you do just the initial computation problem, which is quite a bit simpler than adding the string update. Don't do incremental updates; redo the whole list from scratch. For such a small list, the computer will do this faster than you can write and debug the extra coding.
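The multi-pass idea described above might be sketched as follows. The function name resolve_costs and the in-memory dictionaries are assumptions for illustration (the answer reads the data from two files), and the numbers come from the example in the question.

```python
def resolve_costs(atomic_costs, recipes):
    """Compute the cost of every recipe from the atomic costs.

    atomic_costs: {item: cost}
    recipes: {item: [(quantity, ingredient), ...]}
    """
    known = dict(atomic_costs)
    pending = dict(recipes)
    while pending:
        # Recipes whose ingredients all have known costs can be priced now.
        ready = [item for item, parts in pending.items()
                 if all(ingredient in known for _, ingredient in parts)]
        if not ready:
            # Nothing can be resolved: an undefined ingredient or a cycle.
            raise ValueError('unresolvable recipes: %s' % sorted(pending))
        for item in ready:
            known[item] = sum(qty * known[ingredient]
                              for qty, ingredient in pending.pop(item))
    return known

# The recipe order is deliberately "wrong" to show the later pass at work.
costs = resolve_costs(
    {'eggs': 5},
    {'pileofomelettes': [(5, 'omelette')], 'omelette': [(5, 'eggs')]},
)
print(costs)
```

Recomputing everything from scratch this way keeps the program simple, which is the point of the advice above.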

I'm still not really sure what you are asking but this is what I came up with...
from collections import OrderedDict
food_map = {'omelette': {'eggs': 5, 'price': None}, 'pileofomelettes': {'eggs': 25, 'price': None}, 'eggs': {'price': 5}}
# Text mode, so that the lines can be split on ': ' as strings
with open('food.txt', 'r') as f:
    data = f.read().splitlines()
data = OrderedDict([(x[0], int(x[1])) for x in [x.split(': ') for x in data]])
for key, val in data.items():
    if key == 'eggs':
        continue
    food_rel = food_map.get(key, {})
    val = food_rel.get('eggs', 1) * food_map.get('eggs', {}).get('price', 1)
    data[key] = val
with open('out.txt', 'w') as f:
    data = '\n'.join(['{0}: {1}'.format(key, val) for key, val in data.items()])
    f.write(data)
