Maintaining order in large list of movies/ratings

I have a text file with hundreds of thousands of students, and their ratings for certain films organized with the first word being the student number, the second being the name of the movie (with no spaces), and the third being the rating they gave the movie:
student1000 Thor 1
student1001 Superbad -3
student1002 Prince_of_Persia:_The_Sands_of_Time 5
student1003 Old_School 3
student1004 Inception 5
student1005 Finding_Nemo 3
student1006 Tangled 5
I would like to arrange them in a dictionary so that I have each student mapped to a list of their movie ratings, where the ratings are in the same order for each student. In other words, I would like to have it like this:
{student1000 : [1, 3, -5, 0, 0, 3, 0,...]}
{student1001 : [0, 1, 0, 0, -3, 0, 1,...]}
Such that the first, second, third, etc. ratings for each student correspond to the same movies. The order is completely random for movies AND student numbers, and I'm having quite a bit of trouble doing this effectively. Any help in coming up with something that will minimize the big-O complexity of this problem would be awesome.
I ended up figuring it out. Here's the code I used for anyone wondering:
def get_movie_data(fileLoc):
    movieDic = {}
    movieList = set()
    f = open(fileLoc)
    setHold = set()
    for line in f:
        setHold.add(line.split()[1])
    f.close()
    movieList = sorted(setHold)
    f = open(fileLoc)
    for line in f:
        hold = line.strip().split()
        student = hold[0]
        movie = hold[1]
        rating = int(hold[2])
        if student not in movieDic:
            lst = [0]*len(movieList)
            movieDic[student] = lst
        hold2 = movieList.index(movie)
        rate = movieDic[student]
        rate[hold2] = rating
    f.close()
    return movieList, movieDic
Thanks for the help!

You can first build a dictionary of dictionaries:
{
'student1000' : {'Thor': 1, 'Superbad': 3, ...},
'student1001' : {'Thor': 0, 'Superbad': 1, ...},
...
}
Then you can go through that to get a master list of all the movies, establish an order for them (corresponding to the order within each student's rating list), and finally go through each student in the dictionary, converting the dictionary to the list you want. Or, like another answer said, just keep it as a dictionary.
defaultdict will probably come in handy. It lets you say that the default value for each student is an empty list (or dictionary) so you don't have to initialize it before you start appending values (or setting key-value pairs).
from collections import defaultdict
students = defaultdict(dict)
with open(filename, 'r') as f:
    for line in f.readlines():
        elts = line.split()
        student = elts[0]
        movie = elts[1]
        rating = int(elts[2])
        students[student][movie] = rating
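The conversion step described above could look something like the sketch below. It assumes the dictionary of dictionaries has already been built, that an unrated movie should show up as 0, and that alphabetical order is an acceptable shared order. The function name to_rating_lists is made up for illustration.

```python
def to_rating_lists(students):
    """Turn {student: {movie: rating}} into (movies, {student: [ratings]}).

    Hypothetical helper: every list follows the same alphabetical movie
    order, and a movie the student did not rate is stored as 0.
    """
    # Master list of every movie seen for any student, in a fixed order.
    movies = sorted({movie for rated in students.values() for movie in rated})
    # One pass per student; dict.get supplies 0 for unrated movies.
    lists = {
        student: [rated.get(movie, 0) for movie in movies]
        for student, rated in students.items()
    }
    return movies, lists


students = {
    'student1000': {'Thor': 1, 'Superbad': 3},
    'student1001': {'Superbad': -3, 'Inception': 5},
}
movies, lists = to_rating_lists(students)
print(movies)                 # ['Inception', 'Superbad', 'Thor']
print(lists['student1000'])   # [0, 3, 1]
print(lists['student1001'])   # [5, -3, 0]
```

Looking up each rating with dict.get keeps the work proportional to the number of students times the number of movies, which is also the size of the output.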

So, the answers here are functionally the same as what you seem to be looking for, but as far as directly constructing the lists you're looking for, they seem to be answering slightly different questions. Personally I would prefer to do this in a more dynamic way. Since it doesn't seem to me like you actually know the movies that are going to be rated ahead of time, you've gotta keep some kind of running tally of that.
ratings = {}
allMovies = []
for line in file:
    info = line.split()
    student = info[0].lower()
    movie = info[1].lower()
    rating = float(info[2])
    if movie not in allMovies:
        allMovies.append(movie)
    movieIndex = allMovies.index(movie)
    if student not in ratings:
        ratings[student] = [0] * len(allMovies)
    elif len(allMovies) > len(ratings[student]):
        # extend() pads the list in place and returns None, so don't reassign
        ratings[student].extend([0] * (len(allMovies) - len(ratings[student])))
    ratings[student][movieIndex] = rating
It's not the way I would approach this problem, but I think this solution is closest to the original intent of the question and you can use a buffer to feed in the lines if there's a memory issue, but unless your file is multiple gigabytes there shouldn't be an issue with that.
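If the repeated allMovies.index(movie) lookups become a bottleneck, one option is to keep a dictionary from movie name to column position, so each lookup is constant time on average. The sketch below is a variant of the loop above under a few assumptions: build_ratings and movie_index are hypothetical names, lines is any iterable of text lines, ratings are integers, and shorter lists are padded once at the end.

```python
def build_ratings(lines):
    """Single pass over 'student movie rating' lines (hypothetical helper)."""
    movie_index = {}   # movie name -> column position, in first-seen order
    ratings = {}       # student -> list of ratings
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        student, movie, rating = parts[0], parts[1], int(parts[2])
        # setdefault assigns the next free column the first time a movie appears.
        col = movie_index.setdefault(movie, len(movie_index))
        row = ratings.setdefault(student, [])
        if len(row) <= col:
            row.extend([0] * (col + 1 - len(row)))  # grow in place
        row[col] = rating
    # Pad every list to the final number of movies so all rows line up.
    width = len(movie_index)
    for row in ratings.values():
        row.extend([0] * (width - len(row)))
    return list(movie_index), ratings


sample = [
    "student1000 Thor 1",
    "student1001 Superbad -3",
    "student1000 Superbad 3",
    "student1001 Inception 5",
]
movies, ratings = build_ratings(sample)
print(movies)                   # ['Thor', 'Superbad', 'Inception']
print(ratings['student1000'])   # [1, 3, 0]
print(ratings['student1001'])   # [0, -3, 5]
```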

Just put the scores into a dictionary rather than a list. After you've read all the data, you can then extract the movie names and put them in any order you want. Assuming students can rate different movies, maintaining some kind of consistent order while reading the file, without knowing the order of the movies to begin with, seems like a lot of work.
If you're worried about the keys taking up a lot of memory, use intern() (sys.intern() in Python 3) on the keys to make sure you're only storing one copy of each string.
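A small sketch of the interning idea, using sys.intern() as it is spelled in Python 3. The sample lines and variable names are made up:

```python
import sys

lines = ["student1000 Thor 1", "student1001 Thor 5"]
scores = {}
for line in lines:
    student, movie, rating = line.split()
    # Interning makes equal movie names share one string object.
    movie = sys.intern(movie)
    scores.setdefault(student, {})[movie] = int(rating)

keys_a = list(scores['student1000'])
keys_b = list(scores['student1001'])
print(keys_a[0] is keys_b[0])   # True: both dicts reference the same 'Thor' object
```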

Related

Python append terms that had same id but different values to a list?

I have a csv file where I have general concepts and corresponding medical terms or phrases. How can I write a loop so that I can group all the phrases to their corresponding concept? I'm not very experienced with Python, so I'm not really sure how to write the loop.
id concept phrase
--------------------------------
1 general_history H&P
1 general_history history and physical
1 general_history history physical
2 clinic_history clinic history physical
2 clinic_history outpatient h p
3 discharge discharge summary
3 discharge DCS
For the same concept term (or same ID) how can I append the phrases to a list to get something like this:
var = [['general_history', ['history and physical', 'history physical']],
['clinic_history', ['clinic history physical', 'outpatient h p']],
['discharge', ['discharge summary', 'DCS']]]

Use a for loop and a defaultdict to accumulate the terms.
import csv
from collections import defaultdict
var = defaultdict(list)
records = ... # read csv with csv.DictReader
for row in records:
    concept = row.get('concept', None)
    if concept is None: continue
    phrase = row.get('phrase', None)
    if phrase is None: continue
    var[concept].append(phrase)
print(var)
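To make the elided reading step concrete, here is a minimal runnable sketch. It assumes the file is comma-separated with a header row of id, concept and phrase, and it uses io.StringIO as a stand-in for the real file. The name grouped is only for illustration.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the real file; assumes comma-separated columns with a header row.
sample = io.StringIO(
    "id,concept,phrase\n"
    "1,general_history,H&P\n"
    "1,general_history,history and physical\n"
    "2,clinic_history,outpatient h p\n"
    "3,discharge,DCS\n"
)

grouped = defaultdict(list)
for row in csv.DictReader(sample):
    grouped[row['concept']].append(row['phrase'])

# Convert to the list-of-lists shape asked for in the question.
var = [[concept, phrases] for concept, phrases in grouped.items()]
print(var)
```

With a real file, the StringIO object would be replaced by something like open(path, newline='') inside a with block.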

Assuming you can parse the csv already, here's how you can group the phrases by concept:
from collections import defaultdict
concepts = defaultdict(list)
""" parse csv """
for row in csv:
    id, concept, phrase = row
    concepts[concept].append(phrase)
var = [[k, concepts[k]] for k in concepts.keys()]
var will hold something like this:
[['general_history', ['history and physical', 'history physical']...]
It might be even more useful to keep the dictionary itself, since concepts looks something like this:
{
"general_history": [
"history and physical",
"history physical",
],
...
}

If you're using pandas, try filtering. It should look something like this:
new_dataframe = dataframe[dataframe['id'] == id]
Then concatenate the filtered dataframes:
final_df = pd.concat([new_dataframe1, new_dataframe2], axis = 0)
You can try to do the same thing for concept as well.

Hopefully, this will solve your question:
# a quick way to transfer the data into python
csv_string = """id, concept, phrase
1, general_history, H&P
1, general_history, history and physical
1, general_history, history physical
2, clinic_history, clinic history physical
2, clinic_history, outpatient h p
3, discharge, discharge summary
3, discharge, DCS"""
# formats the data as shown in the original question
csv=[[x.strip() for x in line.split(", ")] for line in csv_string.split("\n")]
# makes a dictionary with an empty list that will hold all data points
id_dict = {line[0]:[] for line in csv[1:]}
# iterates and adds each phrase to the list for its id
for line in csv[1:]:
    current_id = line[0]
    phrases = line[2]
    id_dict[current_id].append(phrases)
# makes the data into a list of lists containing only unique phrases
result = [[current_id, list(set(phrases))] for current_id, phrases in id_dict.items()]
print(result)

Append number to list if string matches list name

I am trying to write a Python program as following:
list_a = []
list_b = []
list_c = []
listName = str(input('insert list name: '))
listValue = int(input('insert value: '))
listName.append(listValue)
But unfortunately "listName.append()" won't work.
Using only if statements, as in:
if listName == 'list_a':
    list_a.append(listValue)
is impractical because I am actually working with 600+ lists...
Is there any other way to make something like this work properly?
Help is much appreciated!
Thank you very much in advance!!

When you're tempted to use variable names to hold data — like the names of stocks — you should almost certainly be using a dictionary with the data as keys. You can still pre-populate the dictionary if you want to (although you don't have to).
You can change your existing code to something like:
# all the valid names
names = ['list_a', 'list_b', 'list_c']
# pre-populate a dictionary with them
names = {name:[] for name in names}
# later you can add data to the arrays:
listName = str(input('insert list name: '))
listValue = int(input('insert value: '))
# append value
names[listName].append(listValue)
With this, all your data is in one structure. It will be easy to loop through it and compute summary statistics like sums, averages, etc. Having a dictionary with 600 keys is no big deal, but having 600 individual variables is a recipe for a mess.
This will raise a KeyError if you try to add to a name that doesn't exist, which you can catch in a try block if that's a possibility.
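Catching the error could look roughly like this. The helper add_value is a made-up name, and printing a message is just one possible way to react to an unknown list name:

```python
names = {name: [] for name in ['list_a', 'list_b', 'list_c']}

def add_value(names, list_name, value):
    """Append value to the named list; report unknown names instead of crashing."""
    try:
        names[list_name].append(value)
    except KeyError:
        print('unknown list name:', list_name)
        return False
    return True

add_value(names, 'list_a', 5)    # appended
add_value(names, 'list_z', 7)    # prints a message, nothing is added
print(names)   # {'list_a': [5], 'list_b': [], 'list_c': []}
```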

Keep your lists in a dict. You could initialize your dict from a file, db, etc.
lists = {"GOOG": []}
listName = str(input('insert list name: '))
listValue = int(input('insert value: '))
lists.setdefault(listName,[]).append(listValue)
# Just a little output to see what you've got in there...
for list_name, list_value in lists.items():
    print(list_name, list_value)

So, following Mark Meyer's and MSlimmer's suggestions above, I am using a dictionary of lists to simplify data input, which has made this section of my program work flawlessly (thanks again, guys!).
However, I am experiencing problems with the next section of my code (which was working before when I had it all as lists haha). I have a dictionary as below:
names = {'list_a':[5, 7, -3], 'list_b':[10, 12, -10, -10]}
Now, I have to add up all positives and all negatives to each list. I want it to have the following result:
names_positives = {'list_a': 12, 'list_b': 22}
names_negatives = {'list_a': -3, 'list_b': -20}
I have tried three different ways, but none of them worked:
## first try:
names_positives = sum(a for a in names.values() if a > 0)
## second try:
names_positives = []
names_positives.append(a for a in names.values() if compras > 0)
## third try:
names_positives = dict(zip(names.keys(), [[sum(a)] for a in names.values() if compra > 0]))
To be honest, I have no idea how to proceed. I am getting errors due to mixing strings and integers in those lines, and I have not been able to find a way around this problem. Any help is much appreciated. The result could be a dictionary, a list, or even only the total sum (as in the first try, if nothing better is possible).
Cheers!
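One possible way to get those two dictionaries is a pair of dictionary comprehensions. This is a sketch that assumes every value in names is a list of numbers:

```python
names = {'list_a': [5, 7, -3], 'list_b': [10, 12, -10, -10]}

# names.values() yields whole lists, so the filtering has to happen
# on the numbers inside each list, not on the list itself.
names_positives = {key: sum(v for v in values if v > 0) for key, values in names.items()}
names_negatives = {key: sum(v for v in values if v < 0) for key, values in names.items()}

print(names_positives)   # {'list_a': 12, 'list_b': 22}
print(names_negatives)   # {'list_a': -3, 'list_b': -20}
```

The three attempts in the post compare a whole list (or an undefined name) against 0, which is where the type errors come from.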

You can try this one:
I just added one line for your code to work.
list_a = ['a']
list_b = []
list_c = []
listName = str(input('insert list name: '))
listValue = int(input('insert value: '))
eval(listName).append(listValue)
The eval function evaluates the string as a Python expression. Note, however, that using eval on user input is a security risk, because it will execute whatever expression is typed in. But for the exact question you asked, this is the easiest way that doesn't require much refactoring of the existing code.

How do I replace lines in a file using data contained elsewhere in the same file?

Let's say I have a file called 'Food' listing the names of some foods and their prices. Some of these items are raw ingredients, and others are made from different amounts of these. For example, I might manually list the price of eggs as 1 and find that the omelette has a default price of 10, but then find that an omelette only needs 5 eggs, so I would need the program to read the price of eggs, find the line containing the omelette, and replace it with "omelette: " + str(5*eggs). I may also need to add extra ingredients or items of food, e.g. a pile of omelettes, which is made from 5 omelettes. The basic goal is to make it possible to just edit the value of eggs and have the values of omelette and pileofomelettes update. I've started the code simply by creating a list of the lines contained within the file.
with open("Food.txt") as g:
    foodlist=g.readlines()
The file 'Food.txt' would be in the following format:
eggs: 5
omelette: 20
pileofomelettes: 120
etc...
and after the code runs it should look like
eggs: 5
omelette: 25
pileofomelettes: 125
I would code the relations manually since they are so unlikely to ever change (and even if they did, it would be fairly easy for me to go in and change the coefficients).
The file would be read by Python in its list format as something like
'['egg 2\n', 'flour 1\n', 'butter 1\n', 'sugar 3\n', 'almond 5\n', 'cherry 8\n']'
I have searched for search/replace algorithms that can search for a specific phrase and replace it with another specific phrase, but I don't know how I'd apply them if the line was subject to change (the user could change the raw ingredient values if he wanted to update all of the values related to it). One solution I can think of involves converting the lines into a dictionary, with each item stored as a string-integer pair, so that I could just replace the integer part of a pair based on the integer values stored in other pairs. But, being inexperienced, I don't know how I'd convert the list (or, even better, the raw file itself) into a dictionary.
Any advice on how to carry out steps of this program would be greatly appreciated :)
EDIT: In the actual application of the program, it doesn't matter what order the items are listed in in the final file. So if I listed all the raw ingredients in one place and all of the composite items in another (with a large space in between them in case more raw items need to be added), then I could just rewrite the entire second half of the file in an arbitrary order with no problem, so long as the line positions of the raw ingredients remain the same.

Okay, I would suggest making a relations text file that you can parse, in case you think the relations may change later, or just so that your code is easier to read and modify. It can then be parsed to find the required relations between raw ingredients and complexes. Call it "relations.txt", with lines of the form:
omelette: 5 x eggs + 1 x onions
pileofomelettes: 6 x omelette
Here, you can put an arbitrary number of ingredients of the type:
complex: number1 x ingredient1 + number2 x ingredient2 + ...
and so on.
And your food.txt contains prices of all ingredients and complexes:
eggs: 2
onions: 1
omelette: 11.0
pileofomelettes: 60
Note that the value for pileofomelettes is intentionally wrong here. Running the code below corrects it, and you can change the numbers to see the results.
#!/usr/bin/python
''' This program takes in a relations file and a food text file as inputs
and can be used to update the food text file based on changes in either of these'''
relations = {}
foodDict = {}
# Mapping ingredients to each other in the relations dictionary
with open("relations.txt") as r:
    relationlist=r.read().splitlines()
    for relation in relationlist:
        item, equatedTo = relation.split(': ')
        ingredientsAndCoefficients = equatedTo.split(' + ')
        listIngredients = []
        for ingredient in ingredientsAndCoefficients:
            coefficient, item2 = ingredient.split(' x ')
            # A list of (amount, ingredient) tuples
            listIngredients.append((float(coefficient),item2))
        relations.update({item:listIngredients})
# Creating a food dictionary with values from food.txt and mapping to the relations dictionary
with open("food.txt") as g:
    foodlist=g.read().splitlines()
    for item in foodlist:
        food,value = item.split(': ')
        foodDict.update({food:value})
for food in relations.keys():
    # (Raw) Ingredients with no left hand side value in relations.txt will not change here.
    value = 0.
    for item2 in range(len(relations[food])):
        # Calculating new value for complex here.
        value += relations[food][item2][0]* float(foodDict[relations[food][item2][1]])
    foodDict.update({food: value })
# Altering the food.txt with the new dictionary values
with open("food.txt",'w') as g:
    for key in sorted(foodDict.keys()):
        g.write(key + ': ' + str (foodDict[key])+ '\n')
        print(key + ': ' + str(foodDict[key]))
And it comes out to be:
eggs: 2
omelette: 11.0
onions: 1
pileofomelettes: 66.0
You can change the price of eggs to 5 in the food.txt file, and rerunning gives:
eggs: 5
omelette: 26.0
onions: 1
pileofomelettes: 156.0

How does your program know the components of each item? I suggest that you keep two files: one with the cost of atomic items (eggs) and another with recipes (omelette <= 5 eggs).
Read both files. Store the atomic costs, remembering how many of these items you have, atomic_count. Extend this table from the recipes file, one line at a time. If the recipe you're reading consists entirely of items with known costs, then compute the cost and add that item to the "known" list. Otherwise, append the recipe to a "later" list and continue.
When you reach the end of both input files, you will have a list of known costs, and a few other recipes that depended on items farther down the recipe file. Now cycle through this "unknown" list until (a) it's empty; (b) you don't have anything with all the ingredients known. If case (b), you have something wrong with your input: either an ingredient with no definition, or a circular dependency. Print the remaining recipes list and debug your input files.
In case (a), you are now ready to print your Food.txt list. Go through your "known" list and write out one item or recipe at a time. When you get to item [atomic_count], write out a second file, a new recipe list. This is your old recipe list, but in a useful top-down order. In the future, you won't have any "unknown" recipes after the first pass.
For future changes ... don't bother. You have only 173 items, and the list sounds unlikely to grow past 500. When you change or add an item, just hand-edit the file and rerun the program. That will be faster than the string-replacement algorithm you're trying to write.
In summary, I suggest that you do just the initial computation problem, which is quite a bit simpler than adding the string update. Don't do incremental updates; redo the whole list from scratch. For such a small list, the computer will do this faster than you can write and debug the extra coding.
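A rough sketch of that "known" and "later" bookkeeping, under some assumptions: the two files have already been parsed into dictionaries, and the function name resolve_costs and the data shapes are invented for illustration.

```python
def resolve_costs(atomic, recipes):
    """Return a cost for every item, computing recipes as their inputs become known.

    atomic:  {item: cost} for raw ingredients
    recipes: {item: [(quantity, ingredient), ...]}
    Raises ValueError when the remaining recipes can never be resolved
    (an undefined ingredient or a circular dependency).
    """
    known = dict(atomic)
    later = dict(recipes)
    while later:
        # Recipes whose ingredients all have known costs can be computed now.
        ready = [item for item, parts in later.items()
                 if all(ing in known for _, ing in parts)]
        if not ready:
            raise ValueError('unresolvable recipes: %s' % sorted(later))
        for item in ready:
            known[item] = sum(qty * known[ing] for qty, ing in later.pop(item))
    return known


atomic = {'eggs': 2, 'onions': 1}
recipes = {
    'pileofomelettes': [(6, 'omelette')],   # listed before its ingredient on purpose
    'omelette': [(5, 'eggs'), (1, 'onions')],
}
costs = resolve_costs(atomic, recipes)
print(costs['omelette'])          # 11
print(costs['pileofomelettes'])   # 66
```

Each pass either resolves at least one recipe or stops with an error, so the loop cannot run forever on bad input.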

I'm still not really sure what you are asking but this is what I came up with...
from collections import OrderedDict
food_map = {'omelette': {'eggs': 5, 'price': None}, 'pileofomelettes': {'eggs': 25, 'price': None}, 'eggs': {'price': 5}}
# text mode, so the lines are str and can be split on ': '
with open('food.txt', 'r') as f:
    data = f.read().splitlines()
data = OrderedDict([(x[0], int(x[1])) for x in [x.split(': ') for x in data]])
for key, val in data.items():
    if key == 'eggs':
        continue
    food_rel = food_map.get(key, {})
    val = food_rel.get('eggs', 1) * food_map.get('eggs', {}).get('price', 1)
    data[key] = val
with open('out.txt', 'w') as f:
    data = '\n'.join(['{0}: {1}'.format(key, val) for key, val in data.items()])
    f.write(data)

Sorting Average Score in a file. Columns and descending

I have a file with 3 scores for each person. I want to use these scores and get the average of all 3 of them. Their scores are separated by tabs, and the averages should be listed in descending order. For example:
jack 10 6 11
claire 3 7 3
conrad 5 4 6
these people would come out with an average of:
jack 9
conrad 5
claire 4
I want these to be able to print to the python shell, however not be saved to the file.
with open("file.txt") as file1:
    d = {}
    for line in file1:
        column = line.split("/t")
        names = column[0]
        scores = int(column[1].strip())
I have made a start, but I have found myself stuck and I do not know how to continue with the coding. Does anyone have a solution?

Use the concept of a set, or a dictionary. Both are similar: they can only contain a certain "key" once.
So if you created a dictionary that stored the contents of each score, using the 'name' as a key, you could create a data structure that looked something like this:
student_scores = {'dave': [9, 8, 11], 'john': [12, 7, 10], ... }
That data structure would be something you could analyze in a second step, easily enough, to calculate averages.
Then write a loop that goes through your student_scores, and averages each one. You can loop through a dictionary like this:
for name, list_of_scores in student_scores.items():
# do something with name & list...
Hope that gives you some ideas, without just coding it for you!
Also: Remember of course that "average" is defined as the sum of a list of numbers, divided by the len of a list of numbers. Lucky for you, both of these built-in functions exist in Python, and operate on lists!
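For readers who do want to see it spelled out, here is one way the two steps might fit together. It is only a sketch: the list lines stands in for the file's contents, and rounding the average to a whole number is an assumption based on the expected output in the question.

```python
lines = ["jack\t10\t6\t11", "claire\t3\t7\t3", "conrad\t5\t4\t6"]  # stand-in for the file

student_scores = {}
for line in lines:
    name, *scores = line.strip().split("\t")   # the tab escape is "\t"
    student_scores[name] = [int(s) for s in scores]

# Average = sum of the scores divided by how many there are.
averages = {name: sum(s) / len(s) for name, s in student_scores.items()}

# Highest average first; nothing is written back to the file.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for name, avg in ranked:
    print(name, round(avg))
```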

You could use a dictionary.
scores = {}
with open("class6Afile.txt") as file1:
    for line in file1:
        column = line.strip().split("\t")  # the tab escape is "\t", not "/t"
        scores[column[0]] = column[1:]
After this, you can iterate over the key, value pairs (for example with scores.items()) and compute the average of each name's values.
Also, you can't store the names the way you have done it, because names = column[0] just overwrites a single variable on every line. Assuming names is meant to be a list, the correct way to do it is:
names.append(column[0])

class Person:
    def __init__(self, name, average):
        self.name = name
        self.average = average

d = []  # a list, so that append() and sort() work
with open("class6Afile.txt") as file1:
    for line in file1:
        column = line.split("\t")
        name = column[0]
        average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip()))/3
        d.append(Person(name,average))
d.sort(key=lambda x: x.average, reverse=True)
The person with the highest average is now at the top of the list!

Sorting on list values read into a list from a file

I am trying to write a routine to read values from a text file (names and scores) and then sort them, e.g. A-Z by name, highest to lowest score, etc. I am able to sort the data, but only by character position in the string, which is no good when names are different lengths. This is the code I have written so far:
ClassChoice = input("Please choose a class to analyse Class 1 = 1, Class 2 = 2")
if ClassChoice == "1":
    Classfile = open("Class1.txt",'r')
else:
    Classfile = open("Class2.txt",'r')
ClassList = [line.strip() for line in Classfile]
ClassList.sort(key=lambda s: s[x])
print(ClassList)
This is an example of one of the data files (Each piece of data is on a separate line):
Bob,8,7,5
Fred,10,9,9
Jane,7,8,9
Anne,6,4,8
Maddy,8,5,5
Jim, 4,6,5
Mike,3,6,5
Jess,8,8,6
Dave,4,3,8
Ed,3,3,4
I can sort on the name, but not on score 1, 2 or 3. Something obvious probably but I have not been able to find an example that works in the same way.
Thanks

How about something like this?
indexToSortOn = 0 # will sort on the first integer value of each line
classChoice = ""
while not classChoice.isdigit():
    classChoice = input("Please choose a class to analyse (Class 1 = 1, Class 2 = 2) ")
classFile = "Class%s.txt" % classChoice
with open(classFile, 'r') as fileObj:
    classList = [line.strip() for line in fileObj]
classList.sort(key=lambda s: int(s.split(",")[indexToSortOn+1]))
print(classList)
The key is to specify in the key function that you pass in what part of each string (the line) you want to be sorting on:
classList.sort(key=lambda s: int(s.split(",")[indexToSortOn+1]))
The cast to an integer is important as it ensures the sort is numeric instead of alphanumeric (e.g. 100 > 2, but "100" < "2")
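If you need to sort on several different columns, it may be simpler to split each line once and sort the parsed rows. This is a sketch using a few of the sample lines from the question. The names rows, by_name and by_score1 are illustrative only.

```python
lines = ["Bob,8,7,5", "Fred,10,9,9", "Jane,7,8,9", "Jim, 4,6,5"]

# Split each line once into [name, score1, score2, score3] with integer scores.
rows = []
for line in lines:
    name, *scores = [field.strip() for field in line.split(",")]
    rows.append([name] + [int(s) for s in scores])

by_name = sorted(rows, key=lambda row: row[0])                    # A to Z
by_score1 = sorted(rows, key=lambda row: row[1], reverse=True)    # highest first

print([row[0] for row in by_name])     # ['Bob', 'Fred', 'Jane', 'Jim']
print([row[0] for row in by_score1])   # ['Fred', 'Bob', 'Jane', 'Jim']
```

Stripping each field also takes care of stray spaces such as the one in the "Jim, 4,6,5" line.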

I think I understand what you are asking. I am not a sort expert, but here goes:
Assuming you would like the ability to sort the lines by either the name, the first int, second int or third int, you have to realize that when you are creating the list, you aren't creating a two dimensional list, but a list of strings. Due to this, you may wish to consider changing your lambda to something more like this:
ClassList.sort(key=lambda s: str(s).split(',')[x])
This assumes that the x is defined as one of the fields in the line with possible values 0-3.
The one issue I see with this is that the fields are still strings, so list.sort() compares them character by character: Fred's score of 10 would sort as less than 2 but greater than 0. Converting the score fields with int() in the key function avoids that.
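A quick way to see the difference between sorting the score fields as strings and as integers (the sample values are made up):

```python
scores = ["10", "2", "0"]
as_text = sorted(scores)              # compared character by character
as_numbers = sorted(scores, key=int)  # compared as numbers
print(as_text)      # ['0', '10', '2']
print(as_numbers)   # ['0', '2', '10']
```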
