import random
monets = []
for i in range(20):
    choices = ['Tails', 'Eagle']
    monets.append(random.choice(choices))
cnt = 0
prev = 0
for i, e in enumerate(monets):
    if e == 'Eagle':
        cnt += 1
    if e == 'Eagle' and i == len(monets) - 1 and cnt > prev:
        prev = cnt
    elif e != 'Eagle':
        if prev < cnt:
            prev = cnt
        cnt = 0
print(monets)
print(prev)
My code calculates the length of the longest run of 'Eagle' in a randomly generated list, but I am stuck on how to get the first and last index of that run. I figured that enumerate might help, but I got mixed up. Example: ['Tails', 'Eagle','Eagle','Tails','Eagle'] => output: 1,2
This should work. It is a simple algorithm, and you don't need any sophisticated libraries:
(revision 2)
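m = 0   # length of the longest run seen so far
c = 0   # length of the current run
p = -1  # index where the longest run ends
for i, s in enumerate(monets):
    if s == 'Eagle':
        c += 1
    else:
        c = 0
    if c > m:
        m = c
        p = i
print('max Eagle:', m, 'from:', p + 1 - m, 'to:', p)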
You could also use itertools.groupby to get groups of consecutive "Eagles". Combine that with enumerate, as in your approach, to pair them with the indices, and use max to find the longest sequence. Finally, get the indices from the first and last elements of that list.
>>> from itertools import groupby
>>> monets = ['Tails', 'Eagle','Eagle','Tails','Eagle']
>>> max((list(g) for k, g in groupby(enumerate(monets), key=lambda x: x[1]) if k == "Eagle"), key=len)
[(1, 'Eagle'), (2, 'Eagle')]
>>> _[0][0], _[-1][0]
(1, 2)
Just reading your code, it looks like you've got the following computation working (i.e. generally correct, but I didn't actually run it and test for bugs):
['Tails', 'Eagle', 'Eagle', 'Tails', 'Eagle']  # monets list
[      0,       1,       2,       0,       1]  # 'Eagle' sequence lengths
There are a few different ways to do what you want, but continuing on your existing methodology, you can indeed use enumerate to generate the following:
[ (0, 0), (1, 1), (2, 2), (3, 0), (4, 1)] # seq lengths from before, enumerated
Where each pair represents: (index, length)
From that, find the pair with the largest length, and you'll have the end index of the sequence, in this case: (2, 2).
The first instance of length == 1, searching backwards from the end index, will give you the start index.
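The steps above can be sketched roughly as follows. The function name `longest_run_bounds` and its `target` parameter are made up for illustration. Instead of searching backwards for the entry with length == 1, the sketch uses the equivalent shortcut start = end - length + 1.

```python
def longest_run_bounds(items, target="Eagle"):
    """Return (start, end) of the first longest run of `target`, or None."""
    # Running length of the current run at every position,
    # e.g. [0, 1, 2, 0, 1] for the example list.
    lengths = []
    run = 0
    for item in items:
        run = run + 1 if item == target else 0
        lengths.append(run)

    if not lengths or max(lengths) == 0:
        return None  # no occurrence of `target` at all

    # Pair each length with its index and pick the pair with the largest
    # length; max() keeps the first one when there are ties.
    end, length = max(enumerate(lengths), key=lambda pair: pair[1])
    start = end - length + 1
    return start, end


print(longest_run_bounds(['Tails', 'Eagle', 'Eagle', 'Tails', 'Eagle']))
```

For the example list this prints (1, 2), matching the output asked for in the question.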
Sidenote: @tobias_k's answer is written in a more functional style (which I also personally prefer). It's a different methodology than the one you started with, but I highly recommend learning it. Here is that method written more (IMO) readably:
import itertools as it
monets = ['Tails', 'Eagle','Eagle','Tails','Eagle']
grouped = it.groupby(enumerate(monets), key=lambda pair: pair[1])
eagle_seqs = [list(seq) for v, seq in grouped if v == 'Eagle']
longest_seq = max(eagle_seqs, key=len)
seq_idxs = [i for i, _ in longest_seq]
start_idx, end_idx = seq_idxs[0], seq_idxs[-1]
Here is a compact solution using NumPy and pandas:
import random
import numpy as np
import pandas as pd
monets = []
for i in range(20):
    choices = ['Tails', 'Eagle']
    monets.append(random.choice(choices))
Here the only additional thing to do is to encode the seq into num values and find the longest contiguous sequence of indices:
encode_ = {'Tails': 0, 'Eagle': 1}
df = pd.DataFrame(monets).replace(encode_)
A = np.where(df == 1)[0]
result = max(np.split(A, np.where(np.diff(A) != 1)[0] + 1), key=len).tolist()
start_idx, end_idx = result[0],result[-1]
Using a down-to-earth approach (it returns the position of the first maximal sequence of consecutive terms):
lst = ['Tails', 'Eagle', 'Eagle','Tails', 'Eagle', 'Eagle','Eagle', 'Eagle', 'Tails', 'Eagle', 'Eagle','Tails']
index, counter = -1, 0
tmp_i, tmp_c = -1, 0
for i, v in enumerate(lst):
    if v == 'Eagle':
        # tmp-update
        tmp_c += 1
        if tmp_i == -1:
            tmp_i = i
    else:
        if tmp_c > counter:
            # global update
            counter = tmp_c
            index = tmp_i
        # reset
        tmp_i, tmp_c = -1, 0
# final check for occurrence of max sequence at the end of the list
if tmp_c > counter:
    # global update
    counter = tmp_c
    index = tmp_i
boundaries_max_seq = (index, index + counter - 1)
print(boundaries_max_seq)
# (4, 7)
I'm practicing some exam questions and I've run into a time limit issue that I can't figure out. I think it has to do with how I'm iterating through the inputs.
It's the famous Titanic dataset, so I won't bother printing a sample of the df as I'm sure everyone is familiar with it.
The function computes the similarity between two passengers, whose IDs are provided as input. I also map the Sex column to integers so that it can be compared between passengers, as you'll see below.
I was also thinking it could be how I'm indexing and locating the values for each passenger, but again I'm not sure.
The function is as follows. The time limit is 1 second, but when no_of_queries == 100 the function takes 1.091s.
import pandas as pd
df = pd.read_csv("titanic.csv")
mappings = {'male': 0, 'female':1}
df['Sex'] = df['Sex'].map(mappings)
def function_similarity(no_of_queries):
    for num in range(int(no_of_queries)):
        x = input()
        passenger_a, passenger_b = x.split()
        passenger_a, passenger_b = int(passenger_a), int(passenger_b)
        result = 0
        if int(df[df['PassengerId'] == passenger_a]['Pclass']) == int(df[df['PassengerId'] == passenger_b]['Pclass']):
            result += 1
        if int(df[df['PassengerId'] == passenger_a]['Sex']) == int(df[df['PassengerId'] == passenger_b]['Sex']):
            result += 3
        if int(df[df['PassengerId'] == passenger_a]['SibSp']) == int(df[df['PassengerId'] == passenger_b]['SibSp']):
            result += 1
        if int(df[df['PassengerId'] == passenger_a]['Parch']) == int(df[df['PassengerId'] == passenger_b]['Parch']):
            result += 1
        result += max(0, 2 - abs(float(df[df['PassengerId'] == passenger_a]['Age']) - float(df[df['PassengerId'] == passenger_b]['Age'])) / 10.0)
        result += max(0, 2 - abs(float(df[df['PassengerId'] == passenger_a]['Fare']) - float(df[df['PassengerId'] == passenger_b]['Fare'])) / 5.0)
        print(result / 10.0)
function_similarity(input())
Look up each passenger's row by ID once per query, instead of filtering the whole DataFrame again for every column you compare.
df = pd.read_csv("titanic.csv")
mappings = {'male': 0, 'female':1}
df['Sex'] = df['Sex'].map(mappings)
def function_similarity(no_of_queries):
    for num in range(int(no_of_queries)):
        x = input()
        passenger_a, passenger_b = x.split()
        # filter the DataFrame once per passenger and reuse the row below
        passenger_a, passenger_b = df[df['PassengerId'] == int(passenger_a)], df[df['PassengerId'] == int(passenger_b)]
        result = 0
        if int(passenger_a['Pclass']) == int(passenger_b['Pclass']):
            result += 1
        if int(passenger_a['Sex']) == int(passenger_b['Sex']):
            result += 3
        if int(passenger_a['SibSp']) == int(passenger_b['SibSp']):
            result += 1
        if int(passenger_a['Parch']) == int(passenger_b['Parch']):
            result += 1
        result += max(0, 2 - abs(float(passenger_a['Age']) - float(passenger_b['Age'])) / 10.0)
        result += max(0, 2 - abs(float(passenger_a['Fare']) - float(passenger_b['Fare'])) / 5.0)
        print(result / 10.0)
function_similarity(input())
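If that is still too slow, one further option is to take the lookup out of pandas entirely: build a plain dict keyed by PassengerId once, so that each query is two dictionary lookups. The following is only a sketch under assumptions. The names `similarity` and `answer_queries` are made up, and the two records are illustrative stand-ins for the real data, which could be produced with something like `df.set_index('PassengerId').to_dict('index')`. Whether this fits the time limit on the grader is untested.

```python
def similarity(a, b):
    """Score two passenger records (plain dicts) with the weights used above."""
    result = 0.0
    if a["Pclass"] == b["Pclass"]:
        result += 1
    if a["Sex"] == b["Sex"]:
        result += 3
    if a["SibSp"] == b["SibSp"]:
        result += 1
    if a["Parch"] == b["Parch"]:
        result += 1
    result += max(0, 2 - abs(a["Age"] - b["Age"]) / 10.0)
    result += max(0, 2 - abs(a["Fare"] - b["Fare"]) / 5.0)
    return result / 10.0


def answer_queries(lines, passengers):
    """Each line holds two passenger ids; return one score per line."""
    scores = []
    for line in lines:
        id_a, id_b = map(int, line.split())
        # Two O(1) dict lookups instead of filtering the whole table.
        scores.append(similarity(passengers[id_a], passengers[id_b]))
    return scores


# Illustrative records only (an assumption, not read from titanic.csv).
passengers = {
    1: {"Pclass": 3, "Sex": 0, "SibSp": 1, "Parch": 0, "Age": 22.0, "Fare": 7.25},
    2: {"Pclass": 3, "Sex": 0, "SibSp": 0, "Parch": 0, "Age": 26.0, "Fare": 8.05},
}

for score in answer_queries(["1 2", "1 1"], passengers):
    print(score)
```

The dict is built once before reading any query, so its cost does not grow with the number of queries.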
I have a file containing genes of different genomes. A gene is denoted like NZ_CP019047.1_2993 and its genome like NZ_CP019047.
They look like this:
NZ_CP019047.1_2993
NZ_CP019047.1_2994
NZ_CP019047.1_2995
NZ_CP019047.1_2999
NZ_CP019047.1_3000
NZ_CP019047.1_3001
NZ_CP019047.1_3003
KE699235.1_379
KE699235.1_1000
KE699235.1_1001
What I want to do is group the genes of a genome (if the genome has more than one gene) by distance: genes that are fewer than 4 positions apart should go into the same group. The position is the number after '_'. I want something like this:
[NZ_CP019047.1_2993,NZ_CP019047.1_2994,NZ_CP019047.1_2995]
[NZ_CP019047.1_2999,NZ_CP019047.1_3000,NZ_CP019047.1_3001,NZ_CP019047.1_3003]
[KE699235.1_1000,KE699235.1_1001]
What I have tried so far is building a dictionary that holds, for each genome (in my case NZ_CP019047 and KE699235), all the numbers after '_'. Then I calculate the differences between neighbouring numbers, and if a difference is less than 4 I try to group the genes. The problem is that I get duplicates, and I have trouble when one genome has more than one group of genes, as in this case:
[NZ_CP019047.1_2993,NZ_CP019047.1_2994,NZ_CP019047.1_2995]
[NZ_CP019047.1_2999,NZ_CP019047.1_3000,NZ_CP019047.1_3001,NZ_CP019047.1_3003]
This is my code:
for key in sortedDict1:
    cassette = ''
    differences = []
    numbers = sortedDict1[key]
    differences = [x - numbers[i - 1] for i, x in enumerate(numbers)][1:]
    print(differences)
    for i in range(0,len(differences)):
        if differences[i] <= 3:
            pos = i
            el1 = key + str(numbers[i])
            el2 = key + str(numbers[i+1])
            cas = el1 + ' '
            cassette += cas
            cas = el2 + ' '
            cassette += cas
        else:
            cassette + '/n'
        i+=1
The variable cassette is meant to hold the groups.
Can someone please help?
Please see below. You can modify the labels and distances to your requirements.
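def get_genome_groups(genome_info):
    # sort by genome name, then by the numeric position after the last '_'
    genome_info.sort(key = lambda x: (x.split('.')[0], int(x.split('_')[-1])))
    #print(genome_info)
    genome_groups = []
    close_genome_group = []
    last_genome = ''
    position = 0
    last_position = 0
    #'NZ_CP019047.1_2995',
    for genomes in genome_info:
        genome, position = genomes.split('.')
        position = int(position.split('_')[1])
        if last_genome and (genome != last_genome):
            # new genome: close the current group
            genome_groups.append(close_genome_group)
            close_genome_group = []
        elif close_genome_group and position and (position > last_position+3):
            # same genome, but more than 3 positions away: close the group
            genome_groups.append(close_genome_group)
            close_genome_group = []
        if genomes:
            close_genome_group.append(genomes)
        last_position = position
        last_genome = genome
    if close_genome_group:
        genome_groups.append(close_genome_group)
    return genome_groups
if __name__ == '__main__':
    # the sample genes from the question
    genome_info = ['NZ_CP019047.1_2993', 'NZ_CP019047.1_2994', 'NZ_CP019047.1_2995',
                   'NZ_CP019047.1_2999', 'NZ_CP019047.1_3000', 'NZ_CP019047.1_3001',
                   'NZ_CP019047.1_3003', 'KE699235.1_379', 'KE699235.1_1000', 'KE699235.1_1001']
    genome_group = get_genome_groups(genome_info)
    print(genome_group)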
user@Inspiron:~/code/general$ python group_genes.py
[['KE699235.1_379'], ['KE699235.1_1000', 'KE699235.1_1001'], ['NZ_CP019047.1_2993', 'NZ_CP019047.1_2994', 'NZ_CP019047.1_2995'], ['NZ_CP019047.1_2999', 'NZ_CP019047.1_3000', 'NZ_CP019047.1_3001', 'NZ_CP019047.1_3003']]
user@Inspiron:~/code/general$
Input:
NZ_CP019047.1_2993
NZ_CP019047.1_2994
NZ_CP019047.1_2995
NZ_CP019047.1_2999
NZ_CP019047.1_3000
NZ_CP019047.1_3001
NZ_CP019047.1_3003
KE699235.1_379
KE699235.1_1000
KE699235.1_1001
KE6992351.2_379
KE6992352.2_1000
KE6992353.2_1001
Code:
from operator import itemgetter, attrgetter
with open("genes.dat", "r") as msg:
    data = msg.read().splitlines()
for i, gene in enumerate(data):
    gene_name = gene.split(".")[0]
    chr_pos = gene.split(".")[1]
    data[i] = (gene_name,int(chr_pos.split("_")[0]),int(chr_pos.split("_")[1]))
data = sorted(data, key=itemgetter(1,2))
output = []
j = 0
for i in range(1,len(data)):
    if i == 1:
        output.append([data[i]])
    elif data[i][1] == output[j][0][1]:
        if data[i][2] - output[j][0][2] < 5:
            output[j].append(data[i])
        else:
            output.append([data[i]])
            j += 1
    else:
        output.append([data[i]])
        j += 1
print (output)
Output:
[[('KE699235', 1, 1000), ('KE699235', 1, 1001)], [('NZ_CP019047', 1, 2993), ('NZ_CP019047', 1, 2994), ('NZ_CP019047', 1, 2995)], [('NZ_CP019047', 1, 2999), ('NZ_CP019047', 1, 3000), ('NZ_CP019047', 1, 3001), ('NZ_CP019047', 1, 3003)], [('KE6992351', 2, 379)], [('KE6992352', 2, 1000), ('KE6992353', 2, 1001)]]
This builds groups in which the positions of the first and the last element differ by less than 5.
It should also work on a list of mixed genes, because the records are sorted by location before grouping.
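For comparison, here is a minimal sketch of a variant that keys on the full genome name and compares each gene with the previous one in its group. It assumes "nearer than 4 positions" means a gap of at most 3 between neighbours, and that groups with a single gene should be dropped, as in the desired output of the question. The names `parse_gene`, `group_close_genes`, `max_gap` and `min_size` are made up for illustration.

```python
from itertools import groupby


def parse_gene(gene_id):
    """Split 'NZ_CP019047.1_2993' into ('NZ_CP019047.1', 2993)."""
    genome, _, position = gene_id.rpartition("_")
    return genome, int(position)


def group_close_genes(gene_ids, max_gap=3, min_size=2):
    """Group genes of the same genome whose neighbours are at most max_gap apart."""
    # Sorting the (genome, position) tuples orders by genome, then position.
    parsed = sorted(parse_gene(g) for g in gene_ids)
    groups = []
    for genome, members in groupby(parsed, key=lambda pair: pair[0]):
        current = []
        for _, position in members:
            # Start a new group when the gap to the previous gene is too big.
            if current and position - current[-1] > max_gap:
                groups.append((genome, current))
                current = []
            current.append(position)
        groups.append((genome, current))
    # Rebuild the original identifiers and drop groups that are too small.
    return [
        [f"{genome}_{position}" for position in positions]
        for genome, positions in groups
        if len(positions) >= min_size
    ]


genes = [
    "NZ_CP019047.1_2993", "NZ_CP019047.1_2994", "NZ_CP019047.1_2995",
    "NZ_CP019047.1_2999", "NZ_CP019047.1_3000", "NZ_CP019047.1_3001",
    "NZ_CP019047.1_3003", "KE699235.1_379", "KE699235.1_1000", "KE699235.1_1001",
]
for group in group_close_genes(genes):
    print(group)
```

Because genes are grouped by genome name first, two different genomes never end up in the same group even if their positions are close.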
I am trying to identify the lengths of consecutive sequences of values > 100 within an array. I have found the longest sequence using the following code, but I need to alter it to also find the average length.
def getLongestSeq(a, n):
    maxIdx = 0
    maxLen = 0
    currLen = 0
    currIdx = 0
    for k in range(n):
        if a[k] > 100:
            currLen += 1
            # New sequence, store
            # beginning index.
            if currLen == 1:
                currIdx = k
        else:
            if currLen > maxLen:
                maxLen = currLen
                maxIdx = currIdx
            currLen = 0
    if maxLen > 0:
        print('Index : ', maxIdx, ',Length : ', maxLen)
    else:
        print("No positive sequence detected.")
# Driver code
arrQ160=resultsQ1['60s']
n=len(arrQ160)
getLongestSeq(arrQ160, n)
arrQ260=resultsQ2['60s']
n=len(arrQ260)
getLongestSeq(arrQ260, n)
arrQ360=resultsQ3['60s']
n=len(arrQ360)
getLongestSeq(arrQ360, n)
arrQ460=resultsQ4['60s']
n=len(arrQ460)
getLongestSeq(arrQ460, n)
output
Index : 12837 ,Length : 1879
Index : 6179 ,Length : 3474
Index : 1164 ,Length : 1236
Index : 2862 ,Length : 617
This should work:
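def get_100_lengths( arr ) :
    # '1' marks a value > 100, '0' anything else (including exactly 100)
    s = ''.join( ['0' if i <= 100 else '1' for i in arr] )
    # splitting on '0' leaves the runs of '1'; empty pieces are dropped
    parts = s.split('0')
    return [len(p) for p in parts if len(p) > 0]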
After that you may calculate an average or do whatever you like.
The result:
>>> get_100_lengths( [120,120,120,90,90,120,90,120,120] )
[3, 1, 2]
That might be a little tricky. You want one variable to keep track of the total length of all sequences and one variable to count how many sequences occurred.
A sequence has terminated when the current number is not greater than 100 and the previous number was greater than 100.
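def getLongestSeq(array):
    total_length = total_ct = 0
    last_is_greater = False
    for number in array:
        if number > 100:
            total_length += 1
            last_is_greater = True
        elif last_is_greater:
            # a sequence just ended (this also covers number == 100)
            total_ct += 1
            last_is_greater = False
    if last_is_greater:
        # count a sequence that runs to the end of the array
        total_ct += 1
    # rounded average length; 0 when there is no sequence at all
    return round(total_length / total_ct) if total_ct else 0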
Did not test this code, please comment if there is any issue
You want to find all the sequences, take their lengths, and get the average. Each of those steps is relatively straightforward.
items = [1, 101, 1, 101, 101, 1, 101, 101, 101, 1]
Finding sequences: use groupby.
from itertools import groupby
groups = groupby(items, lambda x: x > 100) # (False, [1]), (True, [101]), ...
Find lengths (careful: each group is an iterator, not a list, so convert it before calling len):
lens = [len(list(g)) for k, g in groups if k] # [1, 2, 3]
Find average (assumes at least one):
avg = float(sum(lens)) / len(lens) # 2.0
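Put together, the three steps might look like the sketch below. The function name `average_run_length` and the `threshold` parameter are made up for illustration, and returning None for "no sequence found" is an assumption about what the caller wants.

```python
from itertools import groupby


def average_run_length(items, threshold=100):
    """Average length of the runs of consecutive values above `threshold`."""
    # groupby yields (key, group) pairs; key is True for runs above the threshold.
    runs = groupby(items, key=lambda x: x > threshold)
    # Each group is an iterator, so count its elements instead of calling len().
    lengths = [sum(1 for _ in group) for is_above, group in runs if is_above]
    if not lengths:
        return None  # no run above the threshold
    return sum(lengths) / len(lengths)


items = [1, 101, 1, 101, 101, 1, 101, 101, 101, 1]
print(average_run_length(items))
```

For the sample list the run lengths are 1, 2 and 3, so this prints 2.0.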
(Python) Given two numbers A and B, I need to find all nested "groups" of numbers:
range(2169800, 2171194)
leading numbers: 21698XX, 21699XX, 2170XXX, 21710XX, 217110X, 217111X,
217112X, 217113X, 217114X, 217115X, 217116X, 217117X, 217118X, 2171190,
2171191, 2171192, 2171193, 2171194
or like this:
range(1000, 1452)
leading numbers: 10XX, 11XX, 12XX, 13XX, 140X, 141X, 142X, 143X,
144X, 1450, 1451, 1452
Harder than it first looked - pretty sure this is solid and will handle most boundary conditions. :) (There are few!!)
from functools import reduce
def leading(a, b):
    # generate digit pairs a=123, b=456 -> [(1, 4), (2, 5), (3, 6)]
    zip_digits = [(int(x), int(y)) for x, y in zip(str(a), str(b))]
    # this ignores problems where the last matching digits are 0 and 9
    # leading (12000, 12999) is same as leading(12, 12)
    while zip_digits[-1] == (0, 9):
        zip_digits.pop()
    # start recursion
    return compute_leading(zip_digits)
def compute_leading(zip_digits):
    if len(zip_digits) == 1: # 1 digit case is simple!! :)
        (a, b) = zip_digits.pop()
        return list(range(a, b + 1))
    #now we partition the problem
    # given leading(123,456) we decompose this into 3 problems
    # lows -> leading(123,129)
    # middle -> leading(130,449) which we can recurse to leading(13,44)
    # highs -> leading(450,456)
    last_digits = zip_digits.pop()
    low_prefix = reduce(lambda x, y : 10 * x + y, [tup[0] for tup in zip_digits]) * 10 # base for lows e.g. 120
    high_prefix = reduce(lambda x, y : 10 * x + y, [tup[1] for tup in zip_digits]) * 10 # base for highs e.g. 450
    lows = list(range(low_prefix + last_digits[0], low_prefix + 10))
    highs = list(range(high_prefix + 0, high_prefix + last_digits[1] + 1))
    #check for boundary cases where lows or highs have all ten digits
    (a, b) = zip_digits.pop() # pop last digits of middle so they can be adjusted
    if len(lows) == 10:
        lows = []
    else:
        a = a + 1
    if len(highs) == 10:
        highs = []
    else:
        b = b - 1
    zip_digits.append((a, b)) # push back last digits of middle after adjustments
    return lows + compute_leading(zip_digits) + highs # and recurse - woohoo!!
print(leading(199, 411))
print(leading(2169800, 2171194))
print(leading(1000, 1452))
def foo(start, end):
    index = 0
    is_lower = False
    while index < len(start):
        if is_lower and start[index] == '0':
            break
        if not is_lower and start[index] < end[index]:
            first_lower = index
            is_lower = True
        index += 1
    return index-1, first_lower
start = '2169800'
end = '2171194'
result = []
while int(start) < int(end):
    index, first_lower = foo(start, end)
    range_end = index > first_lower and 10 or int(end[first_lower])
    for x in range(int(start[index]), range_end):
        result.append(start[:index] + str(x) + 'X'*(len(start)-index-1))
    if range_end == 10:
        start = str(int(start[:index])+1)+'0'+start[index+1:]
    else:
        start = start[:index] + str(range_end) + start[index+1:]
result.append(end)
print("Leading numbers:")
print(result)
I tested it with the examples you gave and the results are correct. I hope this helps.
This should give you a good starting point :
def leading(start, end):
    leading = []
    hundreds = start // 100
    while (end - hundreds * 100) > 100:
        i = hundreds * 100
        leading.append(range(i,i+100))
        hundreds += 1
    c = hundreds * 100
    tens = 1
    while (end - c - tens * 10) > 10:
        i = c + tens * 10
        leading.append(range(i, i + 10))
        tens += 1
    c += tens * 10
    ones = 1
    while (end - c - ones) > 0:
        i = c + ones
        leading.append(i)
        ones += 1
    leading.append(end)
    return leading
OK, the whole thing could be wrapped in one more loop level, but I thought it might be clearer this way. I hope this helps.
Update :
Now I see what you want. Furthermore, maria's code doesn't seem to be working for me. (Sorry...)
So please consider the following code :
def leading(start, end):
    depth = 2
    while 10 ** depth > end : depth -=1
    leading = []
    const = 0
    coeff = start // 10 ** depth
    while depth >= 0:
        while (end - const - coeff * 10 ** depth) >= 10 ** depth:
            leading.append(str(const // 10 ** depth + coeff) + "X" * depth)
            coeff += 1
        const += coeff * 10 ** depth
        coeff = 0
        depth -= 1
    leading.append(end)
    return leading
print(leading(199,411))
print(leading(2169800, 2171194))
print(leading(1000, 1453))
print(leading(1,12))
Now, let me try to explain the approach here.
The algorithm will try to find "end" starting from value "start" and check whether "end" is in the next 10^2 (which is 100 in this case). If it fails, it will make a leap of 10^2 until it succeeds. When it succeeds it will go one depth level lower. That is, it will make leaps one order of magnitude smaller. And loop that way until the depth is equal to zero (= leaps of 10^0 = 1). The algorithm stops when it reaches the "end" value.
You may also notice that I have implemented the wrapping loop I mentioned, so it is now possible to define the starting depth (or leap size) in a variable.
The first while loop makes sure the first leap does not overshoot the "end" value.
If you have any questions, just feel free to ask.
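For reference, the "leap" idea can also be written as a greedy loop that always takes the largest power-of-ten block that starts at the current number and still fits before the end. This is only a sketch and not any of the answers' exact code. It assumes positive integers and an inclusive end, which is how the examples in the question read, and the function name `leading_groups` is made up.

```python
def leading_groups(start, end):
    """Cover start..end (inclusive) with prefixes such as '10XX' or '144X'."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    groups = []
    current = start
    while current <= end:
        # Grow the block while it stays aligned to a power of ten
        # and its last number does not pass `end`.
        size = 1
        while current % (size * 10) == 0 and current + size * 10 - 1 <= end:
            size *= 10
        wildcards = len(str(size)) - 1  # size 100 -> two trailing X
        digits = str(current)
        groups.append(digits[:len(digits) - wildcards] + "X" * wildcards)
        current += size  # leap over the whole block
    return groups


print(leading_groups(1000, 1452))
```

For 1000 and 1452 this yields 10XX, 11XX, 12XX, 13XX, 140X to 144X, then 1450, 1451 and 1452, the same grouping as the second example in the question.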