Python: Best way to store the top ten numbers


I have the following problem: I run parameter tests and, for every parameter combination, create a new object, which is then replaced by the next object created with other parameters. The object has an attribute for its Jaccard coefficient and an attribute ID. At every step I want to store the object's Jaccard coefficient. At the end I want the top ten Jaccard coefficients and their corresponding IDs.
r=["%.2f" % r for r in np.arange(3,5,1)]
fs=["%.2f" % fs for fs in np.arange(2,5,1)]
co=["%.2f" % co for co in np.arange(1,5,1)]
frc_networks=[]
bestJC = []
bestPercent = []
best10Candidates = []
count = 0
for parameters in itertools.product(r,fs,co):
    args = parser.parse_args(["path1.csv","path2.csv","--r",parameters[0],"--fs",parameters[1],"--co",parameters[2]])
    if not os.path.isfile('FCR_Network_Coordinates_ID_{}_r_{}_x_{}_y_{}_z_{}_fcr_{}_co_{}_1.csv'.format(count, args.r, args.x, args.y, args.z, args.fs,args.co)):
        FRC_Network(count,args.p[0],args.p[1],args.x,args.y,args.z,args.r,args.fs,args.co)
The attributes can be called by FRC_Network.ID and FRC_Network.JC

I think I'd use heapq.heappushpop() for this. That way, no matter how large your input set is, your data requirement is limited to a list of 10 tuples.
Note the use of tuples to keep the JC and ID parameters. Since the comparisons are lexicographic, this will always sort by JC.
Also, note that the final call to .sort() is optional. If you just want the ten best, skip the call. If you want the ten best in order, keep the call.
import heapq
#UNTESTED
best = []
for parameters in itertools.product(r,fs,co):
    # ...
    if len(best) < 10:
        heapq.heappush(best, (FRC_Network.JC, FRC_Network.ID))
    else:
        heapq.heappushpop(best, (FRC_Network.JC, FRC_Network.ID))
best.sort(reverse=True)
Here is a tested version that demonstrates the concept:
import heapq
import random
from pprint import pprint
best = []
for ID in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ':
    JC = random.randint(0, 100)
    if len(best) < 10:
        heapq.heappush(best, (JC, ID))
    else:
        heapq.heappushpop(best, (JC, ID))
pprint(best)
Result:
[(81, 'E'),
(82, 'd'),
(83, 'G'),
(92, 'i'),
(95, 'Z'),
(100, 'p'),
(89, 'q'),
(98, 'a'),
(96, 'z'),
(97, 'O')]
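
The bounded-heap pattern above could be wrapped in a small helper. This is only a minimal sketch, not the asker's code: `top_n` and the sample `scores` list are hypothetical names made up for illustration. Because `best[0]` is always the smallest retained tuple, `heappushpop` discards whichever of the new item and the current minimum is lower.

```python
import heapq


def top_n(pairs, n=10):
    """Return the n largest (JC, ID) tuples, best first.

    Hypothetical helper: keeps at most n items in memory at any time.
    """
    best = []  # min-heap; best[0] is the smallest tuple kept so far
    for item in pairs:
        if len(best) < n:
            heapq.heappush(best, item)
        else:
            # Push the new item, then pop and discard the smallest one.
            heapq.heappushpop(best, item)
    return sorted(best, reverse=True)


# Made-up sample data: (jaccard_coefficient, object_id)
scores = [(0.41, 'a'), (0.87, 'b'), (0.13, 'c'), (0.66, 'd'), (0.92, 'e')]
top_three = top_n(scores, n=3)
print(top_three)
```

If all the tuples already fit in memory, the standard library's `heapq.nlargest(n, iterable)` gives the same result in one call.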

Related

iterating over list containing duplicate values

I am looking to iterate over a list with duplicate values. The 101 has 101.A and 101.B which is right but the 102 starts from 102.C instead of 102.A
import string
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
door_numbers = []
num_count = 0
for el in room_numbers:
    if room_numbers.count(el) == 1:
        door_numbers.append("%s.%s" % (el, string.ascii_uppercase[0]))
    elif room_numbers.count(el) > 1:
        door_numbers.append("%s.%s" % (el, string.ascii_uppercase[num_count]))
        num_count += 1
door_numbers = ['101.A','103.A','101.B','102.C','104.A',
'105.A','106.A','107.A','102.D','108.A']

Given
import string
import itertools as it
import collections as ct
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
letters = string.ascii_uppercase
Code
Simple, Two-Line Solution
dd = ct.defaultdict(it.count)
print([".".join([room, letters[next(dd[room])]]) for room in room_numbers])
or
dd = ct.defaultdict(lambda: iter(letters))
print([".".join([room, next(dd[room])]) for room in room_numbers])
Output
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
Details
In the first example we are using itertools.count as a default factory. This means that a new count() iterator is made whenever a new room number is added to the defaultdict dd. Iterators are useful because they are lazily evaluated and memory efficient.
In the list comprehension, these iterators get initialized per room number. The next number of the counter is yielded, the number is used as an index to get a letter, and the result is simply joined as a suffix to each room number.
In the second example (recommended), we use an iterator over the letters as the default factory. The callable requirement is satisfied by returning the iterator from a lambda function. An iterator over the letters lets us simply call next() and get the next letter directly. Consequently, the comprehension is simpler, since indexing into letters is no longer required.
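
To make the default-factory behaviour concrete, here is a small sketch (the room keys are arbitrary examples, and the variable names are made up) showing that each new key gets its own independent `itertools.count()` iterator:

```python
import collections as ct
import itertools as it

dd = ct.defaultdict(it.count)  # a fresh count() is created on first access to a key

first_101 = next(dd['101'])   # new counter for '101' starts at 0
second_101 = next(dd['101'])  # the same counter advances to 1
first_102 = next(dd['102'])   # a separate counter for '102' starts at 0 again

print(first_101, second_101, first_102)
```

Both variants assume that no room has more than 26 doors, since there are only 26 uppercase letters to hand out.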

The problem in your implementation is that num_count is incremented for every duplicated item in the list, rather than being tracked separately for each room number. What you have to do instead is count how many times each specific item has occurred in the list so far.
Pseudocode would be
1. For each room in room numbers
2. Add the room to a list of visited rooms
3. Count the number of times the room number is available in visited room
4. Add the count to 64 and convert it to an ascii uppercase character where 65=A
5. Join the required strings in the way you want to and then append it to the door_numbers list.
Here's an implementation
import string
room_numbers = ['101','103','101','102','104','105','106','107','102','108']
door_numbers = []
visited_rooms = []
for room in room_numbers:
    visited_rooms.append(room)
    room_count = visited_rooms.count(room)
    door_value = chr(64+room_count) # Since 65 = A when 1st item is present
    door_numbers.append("%s.%s"%(room, door_value))
door_numbers now contains the final list you're expecting which is
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
for the given input room_numbers

The naive way, simply count the number of times the element is contained in the list up until that index:
>>> door_numbers = []
>>> for i in range(len(room_numbers)):
...     el = room_numbers[i]
...     n = 0
...     for j in range(0, i):
...         n += el == room_numbers[j]
...     c = string.ascii_uppercase[n]
...     door_numbers.append("{}.{}".format(el, c))
...
>>> door_numbers
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
These two explicit for-loops make the quadratic complexity obvious: (1/2) * (N * (N-1)) comparisons are made. In most cases you would be better off keeping a dict of counts instead of recounting each time.
>>> door_numbers = []
>>> counts = {}
>>> for el in room_numbers:
...     count = counts.get(el, 0)
...     c = string.ascii_uppercase[count]
...     counts[el] = count + 1
...     door_numbers.append("{}.{}".format(el, c))
...
>>> door_numbers
['101.A', '103.A', '101.B', '102.A', '104.A', '105.A', '106.A', '107.A', '102.B', '108.A']
That way, there's no messing around with indices, and it's more time efficient (at the expense of auxiliary space).

Using iterators and comprehensions:
1. Enumerate the rooms to preserve the original order
2. Group rooms by room number, sorting first as required by groupby()
3. For each room in a group, append .A, .B, etc.
4. Sort by the enumeration values from step 1 to restore the original order
5. Extract the door numbers, e.g. '101.A'
#!/usr/bin/env python3
import operator
from itertools import groupby
import string
room_numbers = ['101', '103', '101', '102', '104',
                '105', '106', '107', '102', '108']
get_room_number = operator.itemgetter(1)
enumerated_and_sorted = sorted(list(enumerate(room_numbers)),
                               key=get_room_number)
# [(0, '101'), (2, '101'), (3, '102'), (8, '102'), (1, '103'),
# (4, '104'), (5, '105'), (6, '106'), (7, '107'), (9, '108')]
grouped_by_room = groupby(enumerated_and_sorted, key=get_room_number)
# [('101', [(0, '101'), (2, '101')]),
# ('102', [(3, '102'), (8, '102')]),
# ('103', [(1, '103')]),
# ('104', [(4, '104')]),
# ('105', [(5, '105')]),
# ('106', [(6, '106')]),
# ('107', [(7, '107')]),
# ('108', [(9, '108')])]
door_numbers = ((order, '{}.{}'.format(room, char))
                for _, room_list in grouped_by_room
                for (order, room), char in zip(room_list,
                                               string.ascii_uppercase))
# [(0, '101.A'), (2, '101.B'), (3, '102.A'), (8, '102.B'),
# (1, '103.A'), (4, '104.A'), (5, '105.A'), (6, '106.A'),
# (7, '107.A'), (9, '108.A')]
door_numbers = [room for _, room in sorted(door_numbers)]
# ['101.A', '103.A', '101.B', '102.A', '104.A',
# '105.A', '106.A', '107.A', '102.B', '108.A']

Counting the number of times a letter occurs at a certain position using python

I'm a python beginner and I've come across this problem and I'm not sure how I'd go about tackling it.
If I have the following sequence/strings:
GATCCG
GTACGC
How do I count how often each letter occurs at each position? For example, G occurs at position one twice across the two sequences, A occurs at position one zero times, etc.
Any help would be appreciated, thank you!

You can use a combination of defaultdict and enumerate like so:
from collections import defaultdict
sequences = ['GATCCG', 'GTACGC']
d = defaultdict(lambda: defaultdict(int)) # d[char][position] = count
for seq in sequences:
    for i, char in enumerate(seq): # enum('abc'): [(0,'a'),(1,'b'),(2,'c')]
        d[char][i] += 1
d['C'][3] # 2
d['C'][4] # 1
d['C'][5] # 1
This builds a nested defaultdict that takes the character as first and the position as second key and provides the count of occurrences of said character in said position.
If you want lists of position-counts:
max_len = max(map(len, sequences))
d = defaultdict(lambda: [0]*max_len) # d[char] = [pos0, pos1, ...]
for seq in sequences:
    for i, char in enumerate(seq):
        d[char][i] += 1
d['G'] # [2, 0, 0, 0, 1, 1]

Not sure this is the best way, but you can use zip to do a sort of transpose on the strings, producing tuples of the letters in each position, e.g.:
x = 'GATCCG'
y = 'GTACGC'
zipped = list(zip(x,y)) # list() is needed in Python 3, where zip is lazy
print(zipped)
will produce as output:
[('G', 'G'), ('A', 'T'), ('T', 'A'), ('C', 'C'), ('C', 'G'), ('G', 'C')]
You can see from the tuples that the first positions of the two strings contain two Gs, the second positions contain an A and a T, etc. Then you could use Counter (or some other method) to get at what you want.
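
One way the Counter suggestion might look is sketched below. This is an illustration rather than the answerer's code; `position_counts` is a made-up name, and `zip(*sequences)` generalizes the two-string transpose to any number of sequences.

```python
from collections import Counter

sequences = ['GATCCG', 'GTACGC']

# zip(*sequences) yields one tuple of letters per position (a transpose).
position_counts = [Counter(column) for column in zip(*sequences)]

print(position_counts[0]['G'])  # letters at position 0 are ('G', 'G')
print(position_counts[0]['A'])  # letters that never occur count as 0
```

Note that zip stops at the shortest sequence, so this sketch assumes all sequences have the same length.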

Python - Build a tuple list according to the character frequencies in the input

I have a function buildFrequencyList that should work like this:
>>> L = []
>>> buildFrequencyList(L, 'bbaabtttaabtctce')
>>> L
[(4, 'b'), (4, 'a'), (5, 't'), (2, 'c'), (1, 'e')]
Here is the code:
def buildFrequencyList(outputList, dataIN):
    for c in dataIN:
        a = 1
        bo = True
        if outputList == []:
            outputList.append((a,c))
        for i in outputList:
            (a,b) = i
            if b==c:
                bo= False
                a +=1
        if(bo):
            outputList.append((1,c))
    return outputList
But the output actually is:
[(1, 'b'), (1, 'a'), (1, 't'), (1, 'c'), (1, 'e')]
I don't know why. Can somebody explain to me what the problem is?
Edit:
I modified the code and I have really strange output:
def buildFrequencyList(outputList, dataIN):
    for c in range(len(dataIN)):
        if outputList == []:
            outputList.append((1,dataIN[c]))
        for i in range(len(outputList)):
            (a,b) = outputList[i]
            if b==dataIN[c]:
                outputList[i] = (a+1,b)
            else:
                outputList.append((1,dataIN[c]))
    return outputList
the output:
[(5, 'b'), (4, 'a'), (3, 'a'), (2, 'b'), (2, 'b'), (5, 't'), (5, 't'), (5, 't'), (5, 't'), (5, 't'), (4, 't')...] # is infinite

You are updating your local variables instead of your list. You are also doing a lot of unnecessary computation. (See @jonrsharpe's comment.)
A clearer way to achieve the desired output would be:
def build_frequency_list(s):
    return [(s.count(c), c) for c in sorted(set(s))]
Result:
>>> [(s.count(c), c) for c in sorted(set(s))]
[(4, 'a'), (4, 'b'), (2, 'c'), (1, 'e'), (5, 't')]

Let's look at your code:
def buildFrequencyList(outputList, dataIN):
    for c in dataIN:
        a = 1
        bo = True
        if outputList == []:
            outputList.append((a,c))
        for i in outputList:
            (a,b) = i
            if b==c:
                bo= False
                a +=1
        if(bo):
            outputList.append((1,c))
    return outputList
Now let's consider some cases. First, the case where the outputList is empty:
a = 1
bo = True
if outputList == []:
    outputList.append((a,c))
Now, notice that there is never a value in a other than 1. This is one of those cases where it's okay to use a "magic number", because it should be obvious that you're counting things (based on the function name, since you provide no docs).
if outputList == []:
    outputList.append((1,c))
But wait! The for loop will execute zero times on an empty list, so the code at the bottom:
if (bo):
    outputList.append((1,c))
would do the same job as the empty-list check. That check is totally unnecessary. Just delete it.
Now, what if the outputList is not empty?
for i in outputList:
    (a,b) = i
    if b == c:
        bo = False
        a += 1
What does this do? It increments a, which is fine - the count is one higher. It sets bo to False, to indicate something. I guess maybe that you found an entry in the list so a new object is not needed.
Then what happens to a?
    if (bo):
        outputList.append((1,c))
return outputList
NOTHING! You never use a again.
So there's your problem: the times when you already have an entry in the list, you never update it.
How can you fix it?
The short answer is that you can't. Because tuples are immutable. This means you can't mutate (or 'change') the values stored in a tuple. You have to throw the tuple away, and build a new one with the correct values in it.
One solution might be to .remove() the tuple from the list, and then append a new tuple (a,c) after incrementing a.
for tpl in outputList:
    freq,val = tpl
    if val == c:
        outputList.remove(tpl) # INVALIDATES FOR LOOP! MUST BREAK!
        outputList.append((freq+1,val))
        bo = False
        break
Another solution might be to use enumerate(outputList) to get an index and a value, then overwrite the tuple like this:
for i, freq in enumerate(outputList):
    if freq[1] == c:
        outputList[i] = (freq[0]+1, freq[1])
        bo = False # an entry already exists, so don't append a new one
        break
Another choice would be to hold the frequency info in a separate container, like a dictionary, until you have a "final" count, then go through and append all the counts to the list at one time.
import collections
counts = collections.defaultdict(int) # the factory must be callable; int() returns 0
for c in dataIN:
    counts[c] += 1
for char,count in counts.items():
    outputList.append((count,char))
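
Putting the dictionary idea together, a complete version might look like the sketch below. It uses a plain dict instead of defaultdict and relies on dicts preserving insertion order (guaranteed since Python 3.7), so the tuples come out in order of first appearance, as in the question's expected output. The snake_case names are illustrative, not the asker's.

```python
def build_frequency_list(output_list, data_in):
    """Append (count, char) tuples to output_list, in order of first appearance."""
    counts = {}  # plain dicts preserve insertion order in Python 3.7+
    for c in data_in:
        counts[c] = counts.get(c, 0) + 1
    for char, count in counts.items():
        output_list.append((count, char))


L = []
build_frequency_list(L, 'bbaabtttaabtctce')
print(L)
```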

What about this?
from collections import defaultdict
s = 'bbaabtttaabtctce'
d = defaultdict(int)
for c in s:
    d[c] += 1
d.items()
The obvious answer is
from collections import Counter
Counter(s).items()
but you said you cannot use it.

Pyspark: operation on values based on type

I've got such an RDD:
[('a', ('H', 1)), ('b', (('H', 41), ('S', 1)))]
so that keys can have either a single tuple or a tuple of tuples as values. This comes from a reduceByKey.
I need to perform a simple operation: divide the count of S by the combined count of (H + S).
When S is not there, like in the case of the first item, I will have to return 0.
The problem is to isolate the first case (single tuple) from the second (tuple of two tuples) so that I know how to operate in a map.
How would I proceed?

Generally speaking it would make more sense to fix this upstream but you can try for example something like this:
from operator import truediv
def f(vs):
    try:
        d = dict(vs)
    except ValueError:
        d = dict([vs])
    s = sum(d.values())
    return truediv(d.get("S", 0), s) if s else float('nan')
rdd = sc.parallelize([('a', ('H', 1)), ('b', (('H', 41), ('S', 1)))])
rdd.mapValues(f).collect()
## [('a', 0.0), ('b', 0.023809523809523808)]
Alternatively, if you don't mind external dependencies, you can try to use multipledispatch:
from multipledispatch import dispatch
@dispatch(tuple, tuple)
def f(h, s):
    try:
        return truediv(s[1], h[1] + s[1])
    except ZeroDivisionError:
        return float('nan')
@dispatch(str, int)
def f(x, y):
    return 0.0
rdd.mapValues(lambda args: f(*args)).collect()
## [('a', 0.0), ('b', 0.023809523809523808)]
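
The two value shapes can also be told apart with an explicit type check, which can be tried without a Spark context. The sketch below is a hypothetical alternative, not part of the answer above: `s_fraction` is a made-up name, and it assumes Python 3 (true division) and that the labels are always strings.

```python
def s_fraction(value):
    """Return S / (H + S) for a single (label, count) pair or a tuple of pairs."""
    # A single pair looks like ('H', 1): its first element is a string label.
    # A tuple of pairs looks like (('H', 41), ('S', 1)): its first element is a tuple.
    pairs = [value] if isinstance(value[0], str) else list(value)
    counts = dict(pairs)
    total = sum(counts.values())
    return counts.get('S', 0) / total if total else float('nan')


single = s_fraction(('H', 1))
double = s_fraction((('H', 41), ('S', 1)))
print(single, double)
```

A function like this could presumably be passed to `rdd.mapValues` in place of `f`.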

Order a data set with serial marks

I have the following data set:
import random
def get_data():
    data = []
    for a in range(10):
        serial_id = random.randint(0, 100)
        node_data = 'data-%d' % (a)
        data.append((serial_id, node_data))
    return data
Which gives (well, it is random, so ymmv):
[(58, 'data-0'), (37, 'data-1'), (68, 'data-2'), (80, 'data-3'), (89, 'data-4'), (42, 'data-5'), (2, 'data-6'), (90, 'data-7'), (53, 'data-8'), (7, 'data-9')]
I would like to order this data set by serial_id, implementing:
def order_data(data):
    ...
    return ordered
Where ordered would be:
[(2, 'data-6'), ... , (90, 'data-7')]
What would be the most pythonic/efficient way to do this?

Use sorted:
return sorted(data)
or, if you don't mind modifying data in place, you can just use .sort to do a (slightly more efficient) in-place sort:
data.sort()
return data
The comparison function for tuples orders them by their first element, then their second element, and so on.
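
The effect of that tuple ordering shows up when two serial_ids tie. The sketch below uses made-up data with a duplicated serial_id (the question's data has none) to contrast the default ordering with an explicit key on the first element only:

```python
from operator import itemgetter

# Illustrative data: serial_id 58 appears twice on purpose.
data = [(58, 'data-b'), (2, 'data-6'), (58, 'data-a'), (7, 'data-9')]

# Default tuple ordering: ties on serial_id fall back to comparing node_data.
by_tuple = sorted(data)

# Key on serial_id only: sorted() is stable, so ties keep their input order.
by_serial = sorted(data, key=itemgetter(0))

print(by_tuple)
print(by_serial)
```

If the second element is not comparable, or ties should keep their original order, the key-based form is the safer choice.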
