How to sort arrays in a dictionary? - Python

I'm currently writing a program in Python to track statistics on video games. An example of the dictionary I'm using to track the scores:
ten = 1
sec = 9
fir = 10
thi5 = 6
sec5 = 8
games = {
    'adom': [ten+fir+sec+sec5, "Ancient Domain of Mysteries"],
    'nethack': [fir+fir+fir+sec+thi5, "Nethack"]
}
Right now, I'm going about this the hard way, making a big long list of nested ifs, but I don't think that's the proper way to go about it. I was trying to figure out a way to sort the dictionary via the arrays, and then find a way to display the first ten that pop up, instead of having to work deep in the if statements.
So... basically, my question is: do you have any ideas I could use to make this easier, instead of wayyyy, way harder?
===== EDIT ====
The ten+fir produces numbers. I want to find a way to sort the lists (I lack the knowledge of proper terminology) by the number (basically, whichever ones have the highest number in the first part of the array go first).
Here's an example of my current way of going about it (though it's incomplete, due to it being very tiresome): Example Nests (paste2) (let's try this one?)
==== SECOND EDIT ====
In case someone doesn't see my comment below:
ten, fir, et cetera - these are just variables for scores. Basically, it goes from a top ten list into a variable number.
ten = 1, nin = 2, fir = 10, fir5 = 10, sec5 = 8, sec = 9...
so: 'adom': [ten+fir+sec+sec5, "Ancient Domain of Mysteries"] actually registers as: 'adom': [1+10+9+8, "Ancient Domain of Mysteries"], which ends up looking like:
'adom': [28, "Ancient Domain of Mysteries"]
So, basically, if I ended up doing the "top two" out of my example, it'd be :
((1)) Nethack (48)
((2)) ADOM (28)
I'd write an actual number, but I'm thinking of changing a few things up, so the numbers might be a touch different, and I wouldn't want to rewrite it.
== THIRD (AND HOPEFULLY THE FINAL) EDIT ==
Fixed my original code example.

How about something like this:
scores = sorted(games.items(), key=lambda kv: kv[1][0], reverse=True)
return scores[:10]
This will return the top 10 items, sorted by the first item in each value list, highest score first. (Note the key function receives each (key, value) pair as a single tuple, so it has to take one argument.)
I'm not sure if this is what you want though, please update the question (and fix the example link) if you need something else...

import heapq
return heapq.nlargest(10, games.items(), key=lambda kv: kv[1][0])
is the most direct way to get the top ten key/value pairs, sorted by the first item of each "value" list. If you can define more precisely what output you want (just the names, the name/value pairs, or something else?) and the sorting criterion, this is easy to adjust, of course.
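For example, to produce the ranked listing shown in the question, a minimal sketch (assuming the games dict defined above):
import heapq

# Top ten entries, highest total score first.
top = heapq.nlargest(10, games.items(), key=lambda kv: kv[1][0])

# Print in the "((rank)) Title (score)" format from the question.
for rank, (key, (score, title)) in enumerate(top, start=1):
    print("((%d)) %s (%d)" % (rank, title, score))
With the question's example data, this prints the two games in descending score order.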

Wim's solution is good, but I'd say that you should probably go the extra mile and push this work off onto a database, rather than relying on Python. Python interfaces well with most types of databases, where much of what you're exploring is already a solved problem.
For example, instead of worrying about shifting your dictionaries to various other data types in order to properly sort them, you can simply get all the data for each pertinent entry pre-sorted based on the criteria of your query. There goes the need for convoluted sorting and resorting right there.
While dictionaries are tempting to use, because they give the illusion of database-like abilities to access data based on its attributes, I still think they stumble quite a bit with respect to implementation. I don't really have any numbers to throw at you, but just from personal experience, anything you do in Python when it comes to manipulating large amounts of data, you can usually do faster and more efficiently, both in code and in computation, with something like MySQL.
I'm not sure what you have planned as far as the structure of your data goes, but along with adding data, changing its structure is a lot easier using a database, too.
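As a minimal sketch of that idea using Python's built-in sqlite3 module (the table name and columns here are invented for illustration):
import sqlite3

# In-memory database for illustration; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (key TEXT, title TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [("adom", "Ancient Domain of Mysteries", 28),
     ("nethack", "Nethack", 48)],
)

# The query returns the rows pre-sorted; no Python-side sorting needed.
top_ten = conn.execute(
    "SELECT title, score FROM games ORDER BY score DESC LIMIT 10"
).fetchall()
The ORDER BY ... LIMIT clause replaces the sorting and slicing entirely.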

Related

Remove A Specific Instance of a Partially Duplicated Entry In a List In Python 3

I am relatively new to Python. However, my needs generally only involve simple string manipulation of rigidly formatted data files. I have a specific situation that I have scoured the web trying to solve and have come up blank.
This is the situation. I have a simple list of two-part entries, formatted like this:
name = ['PAUL;25', 'MARY;60', 'PAUL;40', 'NEIL;50', 'MARY;55', 'HELEN;25', ...]
And, I need to keep only one instance of any repeated name (ignoring the number to the right of the ' ; '), keeping only the entry with the highest number, along with that highest value still attached. So the answer would look like this:
ans = ['MARY;60', 'PAUL;40', 'HELEN;25', 'NEIL;50', ...]
The order of the elements in the list is irrelevant, but the format of the ans list entries must remain the same.
I can probably figure out a way to brute force it. I have looked at 2D lists, sets, tuples, etc. But, I can't seem to find the answer. The name list has about a million entries, so I need something that is efficient. I am sure it will be painfully easy for some of you.
Thanks for any input you can provide.
Cheers.
alkemyst
Probably the best data structure for this would be a dictionary, with the entries split up (and converted to integer) and later re-joined.
Something like this:
max_score = {}
for n in name:
    person, score_str = n.split(';')
    score = int(score_str)
    if person not in max_score or max_score[person] < score:
        max_score[person] = score

ans = [
    '%s;%s' % (person, score)
    for person, score in max_score.items()
]
This is a fairly common structure for many functions and programs: first convert the input to an internal representation (in this case, split and convert to integer), then do the logic or calculation (in this case, uniqueness and maximum), then convert to the required output representation (in this case, string separated with ;).
In terms of efficiency, this code looks at each input item once, then at each output item once; there's unlikely to be any approach that can do better than that (certainly not formally, and likely not in practice). All of the per-item operations are constant-time and fast. It accumulates the intermediate answer in memory (in max_score), but again that is unavoidable; if memory is an issue, the input and output could be changed to iterators/generators, but the whole intermediate answer has to be accumulated in max_score before any items can be output.
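If the input really is too large to hold as a list, a sketch of the same logic reading lazily from an open file (the filename here is hypothetical) might look like this; note that max_score itself still has to fit in memory:
def best_scores(lines):
    # lines can be any iterable of 'NAME;NUMBER' strings, e.g. an open file.
    max_score = {}
    for n in lines:
        person, score_str = n.strip().split(';')
        score = int(score_str)
        if person not in max_score or max_score[person] < score:
            max_score[person] = score
    # Return a generator so output items are produced one at a time.
    return ('%s;%s' % (person, score) for person, score in max_score.items())

with open('names.txt') as f:
    ans = list(best_scores(f))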

How to implement a binary search on a list created from a file

This is my first post, please be gentle. I'm attempting to sort some files into ascending and descending order. Once I have sorted a file, I am storing it in a list which is assigned to a variable. The user is then to choose a file and search for an item. I get an error message...
TypeError: unorderable types; int() < list()
...whenever I try to search for an item using the variable of my sorted list. The error occurs on line 27 of my code. From research, I know that an int and a list cannot be compared, but I can't for the life of me think how else to search a large (600 item) list for an item.
At the moment I'm just playing around with binary search to get used to it.
Any suggestions would be appreciated.
year = []
with open("Year_1.txt") as file:
    for line in file:
        line = line.strip()
        year.append(line)

def selectionSort(alist):
    for fillslot in range(len(alist)-1, 0, -1):
        positionOfMax = 0
        for location in range(1, fillslot+1):
            if alist[location] > alist[positionOfMax]:
                positionOfMax = location
        temp = alist[fillslot]
        alist[fillslot] = alist[positionOfMax]
        alist[positionOfMax] = temp

def binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False
    while first <= last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1
    return found

selectionSort(year)
testlist = []
testlist.append(year)
print(binarySearch(testlist, 2014))
Year_1.txt file consists of 600 items, all years in the format of 2016.
They are listed in descending order and start at 2017, down to 2013. Hope that makes sense.
Is there some reason you're not using the Python bisect module?
Something like:
import bisect

sorted_year = list()
for each in year:
    bisect.insort(sorted_year, each)
... is sufficient to create the sorted list. Then you can search it using functions such as those in the documentation.
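For the search itself, a membership test built on bisect.bisect_left might look like this (a sketch; note it compares strings, since the file was read as strings):
import bisect

def contains(sorted_list, item):
    # bisect_left returns the insertion point; the item is present
    # only if that position holds an equal element.
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

print(contains(sorted_year, "2014"))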
(Actually you could just use year.sort() to sort the list in place ... bisect.insort() might be marginally more efficient for building the list from the input stream in lieu of your call to year.append() ... but my point about using the bisect module remains).
Also note that 600 items is trivial for modern computing platforms. Even 6,000 won't take but a few milliseconds. On my laptop, sorting 600,000 random integers takes about 180 ms, and similarly sized strings still take under 200 ms.
So you're probably not gaining anything by sorting this list in this application at that data scale.
On the other hand, Python also includes a number of modules in its standard libraries for managing structured data and data files. For example, you could use Python's sqlite3 module.
Using this, you'd write standard SQL DDL (data definition language) to describe your data structure and schema, SQL DML (data manipulation language: INSERT, UPDATE, and DELETE statements) to manage the contents of the data, and SQL queries to fetch data from it. Your data can be returned sorted on any column, with any mixture of ascending and descending order on any number of columns, using standard SQL ORDER BY clauses, and you can add indexes to your schema to ensure that the data is stored in a manner that enables efficient querying and traversal (table scans) in any order on any key(s) you choose.
Because Python includes SQLite in its standard libraries, and because SQLite provides SQL client/server semantics over simple local files, there's almost no downside to using it for structured data. You don't have to install and maintain additional software or servers, handle network connections to a remote database server, or any of that.
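A minimal sketch of that approach for this data (the table and column names are invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE years (year INTEGER)")
with open("Year_1.txt") as f:
    conn.executemany("INSERT INTO years VALUES (?)",
                     ((int(line),) for line in f if line.strip()))
conn.execute("CREATE INDEX idx_year ON years (year)")

# Sorted either way by the query itself, and membership tests become queries.
ascending = conn.execute("SELECT year FROM years ORDER BY year ASC").fetchall()
found = conn.execute("SELECT 1 FROM years WHERE year = ?", (2014,)).fetchone() is not None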
I'm going to walk through some steps before getting to the answer.
You need to post a minimal, complete, verifiable example ([mcve]). Instead of telling us to read from "Year_1.txt", which we don't have, you need to put the list itself in the code. Do you NEED 600 entries to get the error in your code? No. This is sufficient:
year = ["2001", "2002", "2003"]
If you really need 600 entries, then provide them. Either post the actual data, or
year = [str(x) for x in range(2017-600, 2017)]
The code you post needs to be Cut, Paste, Boom - reproduces the error on my computer just like that.
selectionSort is completely irrelevant to the question, so delete it from the question entirely. In fact, since you say the input was already sorted, I'm not sure what selectionSort is actually supposed to do in your code, either. :)
Next you do testlist = [] followed by testlist.append(year). USE YOUR DEBUGGER before you ask here. Simply looking at the value in your variable would have made the problem obvious.
How to append list to second list (concatenate lists)
Fixing that means you now have a list of things to search. Before you were searching a list to see if 2014 matched the one thing in there, which was a complete list of all the years.
Now we get into binarySearch. If you look at the variables, you see you are comparing the integer 2014 with some string, maybe "1716", and the answer to that is useless, if it even lets you do that (I have Python 2.7 so I am not sure exactly what you get there). But the point is you can't find the integer 2014 in a list of strings, so it will always return False.
If you don't have a debugger, then you can place strategic print statements like
print("debug info: binarySearch comparing ", item, alist[midpoint])
Now here, what VBB said in comments worked for me, after I fixed the other problems. If you are searching for something that isn't even in the list, and expecting True, that's wrong. Searching for "2014" returns True, if you provide the correct list to search. Alternatively, you could force it to a string and then search for it, or force all the years to int during the input phase (as sketched below). But the int 2014 is not the same as the string "2014".
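For example, converting during the input phase (a sketch reusing the question's own functions) keeps everything as integers, so the comparisons in binarySearch are well defined:
year = []
with open("Year_1.txt") as file:
    for line in file:
        # int() makes the list elements comparable with the int search key.
        year.append(int(line.strip()))

selectionSort(year)
print(binarySearch(year, 2014))  # search the list itself, not a list wrapping it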

Converting a list of strings to a single pattern

I have a list of strings that follow a specific pattern. Here's an example
['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
I'm trying to end up with a blob pattern that will represent this list like the following
'ratelimiter:foobar:201401011*'
I know the first two fields ahead of time. The third field is a timestamp, and I want to find the character position at which the values start to differ from one another.
In the example given, the timestamp ranges from 2014-01-01-11:57 to 2014-01-01-12:00, and the position that differs is the third-to-last character, where 1 changes to 2. If I can find that, then I can slice the string to [:-3] + '*' (for this example).
Every time I try and tackle this problem I end up with loops everywhere. I just feel like there's a better way of doing this.
Or maybe someone knows a better way of doing this with redis. I'm doing this because I'm trying to get keys from redis and I don't want to make a request for every key but rather make a batch request using the pattern parameter. Maybe there's a better way of doing this but haven't found anything yet.
Thanks
Staying with the pattern approach (converting to a timestamp is probably best, though), I would do this to find the longest common prefix:
items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

print(items[0][:[len(set(x)) == 1 for x in zip(*items)].index(False)] + '*')
# ratelimiter:foobar:201401011*
Which reads as: cut the first element of items at the point where the nth characters of all items are no longer equal.
[len(set(x)) == 1 for x in zip(*items)] returns a list of booleans that is True at position i if all the characters at position i are equal across items.
This is what I would do (a related sketch follows this list):
1) Convert the timestamps to numbers.
2) Find the max and min (if your list is not ordered).
3) Take the difference between max and min and convert it back to a pattern.
For example, in your case, the difference between max and min is 43. And since the min already ends with 57, you can quickly deduce that if the min ends with ***157, the max should be ***200. And then you know the pattern.
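A related shortcut, as a sketch: because the strings sort lexicographically, the common prefix of the whole list equals the common prefix of just its min and max, so only two strings need comparing:
import os.path

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

# os.path.commonprefix does a plain character-wise comparison, despite its name.
pattern = os.path.commonprefix([min(items), max(items)]) + '*'
print(pattern)  # ratelimiter:foobar:201401011*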
You almost never want to use the '*' parameter in Redis in production because it is very slow, much slower than making a request for each key individually in the vast majority of cases. Unless you're requesting so many keys that your bottleneck becomes the sheer amount of data you're transferring over the network (in which case you should really convert things to Lua and run the logic server-side), a pipeline is really what you want.
The reason you want a pipeline is that you're probably getting hit by the cost of transferring data back and forth to your Redis server in separate round trips right now. A pipeline, in contrast, queues up a bunch of commands to run against Redis, and then executes them all at once, when you're ready. Assuming you're using redis-py (if you're not, you really should be), and r is your connection to your Redis server, you can do this like so:
import redis

r = redis.Redis(...)
pipe = r.pipeline()

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

for item in items:
    pipe.get(item)

# All the values for each item you're getting from Redis will be here.
item_values = pipe.execute()
Note: this will only make one call to Redis and will be much faster than either getting each value individually or running a pattern selection.
All of the other answers so far are good Python answers, but you're dealing with a Redis problem. You need a Redis answer.

Python - How to find the height of a standard binary search tree?

I have a JSON file of the form [{"score": 68},{"score": 78}].
I need to find the height of the standard binary search tree that is made using the scores of all the objects. How can I do it?
This is what I'm doing. I'm first reading all the scores from the JSON file and then applying the formula.
import ijson
import math

f = open('data_large')
content = ijson.items(f, 'item')
n = len(list(i['score'] for i in content))
height = math.ceil(math.log(n + 1, 2) - 1)
print(height)
Well, this does give me the correct answer, but I wanted to know two things:
1) Will this formula also be valid when there are duplicates in the list, since I need to develop a BST which can have duplicates as well?
2) I think n = len(list(i['score'] for i in content)) is wasteful, because I don't need the node values to calculate the height of the BST, only the length of the list. Is there any way I can calculate the number of entries so that I can omit this line and still get the total number of entries in the JSON file, which will serve the purpose of n?
The other thing I also wanted to calculate is the unique scores in the file. I'm doing that with print set(i['score'] for i in content), but it takes 201 seconds to execute since the file is so large (256 MB, hence I used ijson for faster processing). Can I make this statement more time-efficient? If yes, how?
1) Yes/no. If you've added a property to each node which counts the number of times that node has been inserted into the tree then you still have a BST and the answer is yes.
If you actually want duplicate nodes, then you'd need to modify the BST property. The unmodified property says that items smaller than X go left and items greater than X go right. If you instead say items greater than or equal to X go right, then it's easy to see that you can make the tree arbitrarily high by adding many duplicate items, and the answer is no.
2) Have you tried list(content)? Of course, since you cannot build a BST without somehow removing duplicate nodes, the raw count alone cannot be used to calculate the height of the BST. You need to remove duplicate items first. This leads to your third question.
3) As regards the print set(i['score']...: you shouldn't bundle separate questions together like this, as it will lead you down the dark path of having different answers addressing different parts of your question. However, the code you've written certainly does have something Pythonic about it. So you've got to ask yourself if it's really worth your time (which is often the only time that really matters) to try to find a more convoluted, but quicker, solution.
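For what it's worth, a single pass over the stream can count the entries and collect the unique scores at the same time; a sketch, assuming the same ijson layout as in the question:
import ijson

n = 0
unique_scores = set()
with open('data_large') as f:
    for item in ijson.items(f, 'item'):
        n += 1
        unique_scores.add(item['score'])
# n is the total number of entries; len(unique_scores) the distinct ones.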

A better way to assign list into a var

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if, after splitting the line into a list, there is any better way to assign each list element to a variable. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python, that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print(data['done'])
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, and initialize an instance thereof, then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
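A minimal sketch of that namedtuple approach, using just the five fields shown in the question's format comment (the rest is invented for illustration):
import collections

Stats = collections.namedtuple('Stats',
                               ['done', 'rema', 'succ', 'fails', 'size'])

with open(STATS_FILE) as f:
    stats = Stats(*(int(x) for x in f.read().strip().split('|')))

print(stats.done, stats.fails)  # qualified names instead of 30 barenames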
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too (see the sketch below)! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
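With a dict comprehension, the zip version above becomes (a sketch):
keys = ("done", "rema", "succ", "fails", "size")
somedict = {k: int(v) for k, v in zip(keys, starf)}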
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
Leoluk's approach was short & sweet, taking advantage of Python's immensely powerful dict comprehensions.
Alex's answer was more design-oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring would need to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implications are; I have no idea if list/dict comprehensions take a hit on speed of execution. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by "Premature optimization is the root of all evil"...
