Python Greedy Algorithm

I am writing a greedy algorithm (Python 3.x.x) for a 'jewel heist'. Given a series of jewels and values, the program grabs the most valuable jewel that it can fit in its bag without going over the bag weight limit. I've got three test cases here, and it works perfectly for two of them.
Each test case is written in the same way: the first line is the bag weight limit, and each following line is a (weight, value) pair.
Sample Case 1 (works):
10
3 4
2 3
1 1
Sample Case 2 (doesn't work):
575
125 3000
50 100
500 6000
25 30
Code:
def take_input(infile):
    f_open = open(infile, 'r')
    lines = []
    for line in f_open:
        lines.append(line.strip())
    f_open.close()
    return lines
def set_weight(weight):
    bag_weight = weight
    return bag_weight
def jewel_list(lines):
    jewels = []
    for item in lines:
        jewels.append(item.split())
    jewels = sorted(jewels, reverse= True)
    jewel_dict = {}
    for item in jewels:
        jewel_dict[item[1]] = item[0]
    return jewel_dict
def greedy_grab(weight_max, jewels):
    #first, we get a list of values
    values = []
    weights = []
    for keys in jewels:
        weights.append(jewels[keys])
    for item in jewels.keys():
        values.append(item)
    values = sorted(values, reverse= True)
    #then, we start working
    max = int(weight_max)
    running = 0
    i = 0
    grabbed_list = []
    string = ''
    total_haul = 0
    # pick the most valuable item first. Pick as many of them as you can.
    # Then, the next, all the way through.
    while running < max:
        next_add = int(jewels[values[i]])
        if (running + next_add) > max:
            i += 1
        else:
            running += next_add
            grabbed_list.append(values[i])
    for item in grabbed_list:
        total_haul += int(item)
    string = "The greedy approach would steal $" + str(total_haul) + " of jewels."
    return string
infile = "JT_test2.txt"
lines = take_input(infile)
#set the bag weight with the first line from the input
bag_max = set_weight(lines[0])
#once we set bag weight, we don't need it anymore
lines.pop(0)
#generate a list of jewels in a dictionary by weight, value
value_list = jewel_list(lines)
#run the greedy approach
print(greedy_grab(bag_max, value_list))
Does anyone have any clues why it wouldn't work for case 2? Your help is greatly appreciated.
EDIT: The expected outcome for case 2 is $6130. I seem to get $6090.

Your dictionary keys are strings, not integers, so when you sort them they are compared as strings (character by character). You get:
['6000', '3000', '30', '100']
instead of the order you wanted:
['6000', '3000', '100', '30']
Change the function so that it uses integer keys:
def jewel_list(lines):
    jewels = []
    for item in lines:
        jewels.append(item.split())
    jewels = sorted(jewels, reverse= True)
    jewel_dict = {}
    for item in jewels:
        jewel_dict[int(item[1])] = item[0] # changed line
    return jewel_dict
When you change this it will give you:
The greedy approach would steal $6130 of jewels.
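To see the difference in isolation, here is a minimal sketch that sorts the four values from sample case 2 both ways. The variable names are only illustrative and do not come from the question's code.

```python
# Values from sample case 2, as they come out of str.split() (strings).
values = ["3000", "100", "6000", "30"]

# Plain string comparison goes character by character, so "30" > "100".
as_strings = sorted(values, reverse=True)

# key=int compares the numeric value while keeping the items as strings.
as_integers = sorted(values, key=int, reverse=True)

print(as_strings)   # ['6000', '3000', '30', '100']
print(as_integers)  # ['6000', '3000', '100', '30']
```

Passing key=int is an alternative to converting the keys up front; either way the comparison has to happen on numbers.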

In [237]: %paste
import operator
def greedy(infilepath):
    with open(infilepath) as infile:
        capacity = int(infile.readline().strip())
        # list(...) so that each item can be indexed under Python 3
        items = [list(map(int, line.split())) for line in infile if line.strip()]
    bag = []
    # sort by value, so the most valuable jewel sits at the end of the list
    items.sort(key=operator.itemgetter(1))
    while capacity and items:
        if items[-1][0] <= capacity:
            bag.append(items[-1])
            capacity -= items[-1][0]
        items.pop()
    return bag
## -- End pasted text --
In [238]: sum(map(operator.itemgetter(1), greedy("JT_test1.txt")))
Out[238]: 8
In [239]: sum(map(operator.itemgetter(1), greedy("JT_test2.txt")))
Out[239]: 6130

I think that in this piece of code i has to be incremented in the else branch too:
while running < max:
    next_add = int(jewels[values[i]])
    if (running + next_add) > max:
        i += 1
    else:
        running += next_add
        grabbed_list.append(values[i])
        i += 1 #here
This, together with @iblazevic's answer, explains why it behaves this way.
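Putting both observations together, the following is a minimal sketch rather than the asker's program. It assumes each jewel may be taken at most once (the question's own comment says "as many of them as you can", so this is an assumption), and it keeps (weight, value) tuples instead of a dictionary keyed by value, because such a dictionary silently drops jewels that share a value. The name greedy_haul is hypothetical.

```python
def greedy_haul(capacity, jewels):
    """Take each jewel at most once, most valuable first, while it still fits.

    jewels is a list of (weight, value) tuples of integers.
    """
    haul = 0
    for weight, value in sorted(jewels, key=lambda j: j[1], reverse=True):
        if weight <= capacity:
            capacity -= weight
            haul += value
    return haul

case_1 = [(3, 4), (2, 3), (1, 1)]
case_2 = [(125, 3000), (50, 100), (500, 6000), (25, 30)]
print(greedy_haul(10, case_1))   # 8
print(greedy_haul(575, case_2))  # 6130
```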

Related

Python create lists conditionally from txt file

I have a txt file with this structure of data:
3
100 name1
200 name2
50 name3
2
1000 name1
2000 name2
0
The input contains several sets. Each set starts with a row containing one natural number N, the number of bids, 1 ≤ N ≤ 100. Next, there are N rows containing the player's price and his name separated by a space. The player's price is an integer and ranges from 1 to 2*10^9.
Expected output is:
Name2
Name2
How can I find the highest price and name for each set of data?
I tried this (to find the highest price):
offer = []
name = []
with open("futbal_zoznam_hracov.txt", "r") as f:
    for line in f:
        maximum = []
        while not line.isdigit():
            price = line.strip().split()[0]
            offer.append(int(price))
            break
    maximum.append(max(offer[1:]))
print(offer)
print(maximum)
This creates a list of all sets but not one by one. Thank you for your advice.

You'll want to loop over each set manually, using the counts, rather than running a single for loop over the whole file.
For example:
with open("futbal_zoznam_hracov.txt") as f:
    while True:
        try: # until end of file
            bids = int(next(f).strip())
            if bids == 0:
                continue # or break if this is guaranteed to be end of the file
            max_price = float("-inf")
            max_player = None
            for _ in range(bids):
                player = next(f).strip().split()
                price = int(player[0])
                if price > max_price:
                    max_price = price
                    max_player = player[1]
            print(max_player)
        except (StopIteration, ValueError): # end of file, or a blank/malformed line
            break

EDITED:
The lines in the input file containing a single token are irrelevant so this can be greatly simplified
with open('futbal_zoznam_hracov.txt') as f:
    _set = []
    for line in f:
        p, *n = line.split()
        if n:
            _set.append((float(p), n[0]))
        else:
            if _set:
                print(max(_set)[1])
                _set = []
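A variant that uses the count line to read exactly N bids per set can be sketched as follows. This is only a sketch: the function name best_bidders is hypothetical, the input is assumed to end with a line containing 0, and the sample data is the one from the question, fed through io.StringIO instead of a real file.

```python
import io
from itertools import islice

def best_bidders(lines):
    """Return the name of the highest bidder for each set of bids."""
    winners = []
    lines = iter(lines)
    for header in lines:
        header = header.strip()
        if not header:          # tolerate blank lines between sets
            continue
        count = int(header)
        if count == 0:          # assumed end-of-input marker
            break
        bids = []
        # islice consumes exactly `count` lines from the same iterator
        for row in islice(lines, count):
            price, name = row.split(maxsplit=1)
            bids.append((int(price), name.strip()))
        winners.append(max(bids, key=lambda bid: bid[0])[1])
    return winners

sample = io.StringIO(
    "3\n100 name1\n200 name2\n50 name3\n2\n1000 name1\n2000 name2\n0\n"
)
result = best_bidders(sample)
print(result)  # ['name2', 'name2']
```

Comparing on the integer price only (key=lambda bid: bid[0]) avoids falling back to comparing names when two prices are equal.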

Finding the substring with the most repeats in a dictionary with dna sequences

The substring has to be 6 characters long. The number I'm getting is smaller than it should be.
First I wrote code to get the sequences from a file and put them in a dictionary, then I wrote 3 nested for loops. The first iterates over the dictionary and gets one sequence in each iteration. The second takes each sequence and gets a 6-character substring from it; in each iteration it increases the start index within the long sequence by 1. The third loop takes each substring from the second loop and counts how many times it appears in each long sequence.
I tried rewriting the code many times. I think I got very close. I checked if the loops actually do their iterations, and they do. I even checked manually to see if the counts for a substring in random sequences are the same as the program gives, and they are. Any idea? maybe a different approach? what debugger do you use for Python?
I added a file with 3 shortened sequences for testing. Maybe try smaller substring: say with 3 characters instead of 6: rep_len = 3
The code
matches = []
count = 0
final_count = 0
rep_len = 6
repeat = ''
pos = 0
seq_count = 0
seqs = {}
f = open(r"file.fasta")
# inserting each sequences from the file into a dictionary
for line in f:
    line = line.rstrip()
    if line[0] == '>':
        seq_count += 1
        name = seq_count
        seqs[name] = ''
    else:
        seqs[name] += line
for key, seq in seqs.items(): # getting one sequence in each iteration
    for pos in range(len(seq)): # setting an index and increasing it by 1 in each iteration
        if pos <= len(seq) - rep_len: # checking no substring from the end of the sequence are selected
            repeat = seq[pos:pos + rep_len] # setting a substring
            if repeat not in matches: # checking if the substring was already scanned
                matches.append(repeat) # adding the substring to previously checked substrings' list
                for key1, seq2 in seqs.items(): # iterating over each sequence
                    count += seq2.count(repeat) # counting the substring's repetitions
                if count > final_count: # if the count is greater than the previously saved greatest number
                    final_count = count # the new value is saved
                count = 0
print('repetitions: ', final_count) # printing
sequences.fasta

The code is not very clear, so it is a bit difficult to debug. I suggest rewriting it.
Anyway, so far I have noticed just one small mistake:
if pos < len(seq) - rep_len:
should be
if pos <= len(seq) - rep_len:
Currently, the last substring (and so the last character) of each sequence is ignored.
EDIT:
Here is a rewrite of your code that is clearer and might help you investigate the errors:
rep_len = 6
seq_count = 0
seqs = {}
filename = "dna2.txt"
# Extract the data into a dictionary
with open(filename, "r") as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            seq_count += 1
            name = seq_count
            seqs[name] = ''
        else:
            seqs[name] += line
# Store all the information, so that you can reuse it later
counter = {}
for key, seq in seqs.items():
    # + 1 so that the last window of each sequence is included
    for pos in range(len(seq) - rep_len + 1):
        repeat = seq[pos:pos + rep_len]
        if repeat in counter:
            counter[repeat] += 1
        else:
            counter[repeat] = 1
# Sort the counter to have max occurrences first
sorted_counter = sorted(counter.items(), key=lambda item: item[1], reverse=True)
# Display the 5 max occurrences (or fewer, if there are fewer distinct substrings)
for key, rep in sorted_counter[:5]:
    print("{} -> {}".format(key, rep))
# Output on my test file (exact counts depend on the input):
# GCGCGC -> 11
# CCGCCG -> 11
# CGCCGA -> 10
# CGCGCG -> 9
# CGTCGA -> 9
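One possible reason for a count that is "smaller than it should be" is that str.count only counts non-overlapping occurrences, whereas a sliding window also counts overlapping ones. The sketch below contrasts the two on a tiny made-up string; count_kmers is a hypothetical helper name, and collections.Counter replaces the hand-written dictionary.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every window of length k, overlapping ones included."""
    counts = Counter()
    for seq in sequences:
        for pos in range(len(seq) - k + 1):
            counts[seq[pos:pos + k]] += 1
    return counts

# str.count skips ahead after each match, so overlapping hits are missed.
print("AAAA".count("AA"))              # 2
print(count_kmers(["AAAA"], 2)["AA"])  # 3

counts = count_kmers(["ACGACG", "CGA"], 3)
print(counts["ACG"], counts["CGA"], counts["GAC"])  # 2 2 1
```

Whether overlapping repeats should count depends on how the exercise defines a repeat, so this is something to check against the expected answer rather than a definitive diagnosis.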

It might be easier to use Counter from the collections module in Python. Also check out the NLTK library.
An example:
from collections import Counter
from nltk.util import ngrams
sequence = "cggttgcaatgagcgtcttgcacggaccgtcatgtaagaccgctacgcttcgatcaacgctattacgcaagccaccgaatgcccggctcgtcccaacctg"
def reps(substr):
    "Counts repeats in a substring"
    return sum([i for i in Counter(substr).values() if i>1])
def make_grams(sent, n=6):
    "splits a sentence into n-grams"
    return ["".join(seq) for seq in (ngrams(sent,n))]
grams = make_grams(sequence) # splits string into substrings
max_length = max(list(map(reps, grams))) # gets maximum repeat count
result = [dna for dna in grams if reps(dna) == max_length]
print(result)
Output: ['gcgtct', 'cacgga', 'acggac', 'tgtaag', 'agaccg', 'gcttcg', 'cgcaag', 'gcaagc', 'gcccgg', 'cccggc', 'gctcgt', 'cccaac', 'ccaacc']
And if the question is to look for the string with the most repeated character:
repeat_count = [max(Counter(a).values()) for a in result] # highest character repeat count
result_dict = {dna:ct for (dna,ct) in zip(result, repeat_count)}
another_result = [dna for dna in result_dict.keys() if result_dict[dna] == max(repeat_count)]
print(another_result)
Output: ['cccggc', 'cccaac', 'ccaacc']

Concatenate several seq within a file according to their percentage of similarities

Hello, I need your help with a complicated task.
Here is a file1.txt :
>Name1.1_1-40_-__Sp1
AAAAAACC-------------
>Name1.1_67-90_-__Sp1
------CCCCCCCCC------
>Name1.1_90-32_-__Sp1
--------------CCDDDDD
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
and the idea is to create a new file called file1.txt_Hsp such as:
>Name1.1-3HSPs-__Sp1
AAAAAACCCCCCCCCCDDDDD
>Name3.1_-__Sp2
AAAAAACCCCCCCCCCC----
>Name4.1_-__Sp2
-------CCCCCCCCCCCCCC
So basically the idea is to:
Compare each sequence from the same SpN <-- (here it is very important only with the same SpN name) with each other in file1.txt.
For instance I will have to compare :
Name1.1_1-40_-__Sp1 vs Name1.1_67-90_-__Sp1
Name1.1_1-40_-__Sp1 vs Name1.1_90-32_-__Sp1
Name1.1_67-90_-__Sp1 vs Name1.1_90-32_-__Sp1
Name2.1_20-89_-__Sp2 vs Name2.1_78-200_-__Sp2
So for example when I compare:
Name1.1_1-40_-__Sp1 vs Name1.1_67-90_-__Sp1 I get :
>Name1.1_1-40_-__Sp1
AAAAAACC-------------
>Name1.1_67-90_-__Sp1
------CCCCCCCCC------
Here I want to concatenate the two sequences if the ratio (number of letters aligned with another letter) / (number of letters aligned with a '-') is < 0.20.
Here, for example, there are 21 characters, and the number of letters aligned with another letter is 2 (C and C).
The number of letters aligned with a '-' is 13 (AAAAAA + CCCCCCC),
so
ratio = 2/13 = 0.1538462
and if this ratio is < 0.20, I want to concatenate these 2 sequences, such as:
>Name1.1-2HSPs_-__Sp1
AAAAAACCCCCCCCC------
(As you can see, the name of the new sequence is now Name1.1-2HSPs_-__Sp1, with the 2 meaning that 2 sequences were concatenated.) So we replace the number-number part with XHSPs, X being the number of sequences concatenated.
and get the file1.txt_Hsp :
>Name1.1-2HSPs_-__Sp1
AAAAAACCCCCCCCC------
>Name1.1_90-32_-__Sp1
--------------CCDDDDD
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
Then I do it again with Name1.1-2HSPs_-__Sp1 vs Name1.1_90-32_-__Sp1
>Name1.1-2HSPs_-__Sp1
AAAAAACCCCCCCCC------
>Name1.1_90-32_-__Sp1
--------------CCDDDDD
Where ratio = 1/20 = 0.05
Then, because the ratio is < 0.20, I want to concatenate these 2 sequences, such as:
>Name1.1-3HSPs_-__Sp1
AAAAAACCCCCCCCCCDDDDD
(As you can see, the name of the new sequence is now Name1.1-3HSPs_-__Sp1, with the 3 meaning that 3 sequences were concatenated.)
file1.txt_Hsp:
>Name1.1-3HSPs_-__Sp1
AAAAAACCCCCCCCCCDDDDD
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
Then I do it again with Name2.1_20-89_-__Sp2 vs Name2.1_78-200_-__Sp2
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
Where ratio = 10/11 = 0.9090909
Then because the ratio is > 0.20 I do nothing and get the final file1.txt_Hsp:
>Name1.1-3HSPs_-__Sp1
AAAAAACCCCCCCCCCDDDDD
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
Which is the final result I needed.
A simpler example would be:
>Name1.1_10-60_-__Seq1
AAA------
>Name1.1_70-120_-__Seq1
--AAAAAAA
>Name2.1_12-78_-__Seq2
--AAAAAAA
The ratio is 1/8 = 0.125, because only 1 letter is aligned with another letter and 8 letters are aligned with a (-).
Because the ratio is < 0.20, I concatenate the two Seq1 sequences into:
>Name1.1_2HSPs_-__Seq1
AAAAAAAAA
and the new file should be :
>Name1.1_2HSPs_-__Seq1
AAAAAAAAA
>Name2.1_-__Seq2
--AAAAAAA
** Here is an example from my real data **
>YP_009186705
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLDDEIFYKSLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
>XO009980.1_26784332-20639090_-__Agapornis_vilveti
------------------------------------------------------LNQSNL
ALPFDQSVNPVSFSMISSHDLIA
>CM009917.1_20634332-20634508_-__Neodiprion_lecontei
---CDSWMIKFFARISQMC---IKIHSKYEEVSFFLFQSK--KKKIADSHFFRSLNQDTA
-------LNTVSY----------
>XO009980.1_20634508-20634890_-__Agapornis_vilveti
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKL--------------
-----------------------
>YUUBBOX12
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLDDEIFYKSLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
and I should get :
>YP_009186705
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLDDEIFYKSLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
>XO009980.1_2HSPs_-__Agapornis_vilveti
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
>CM009917.1_20634332-20634508_-__Neodiprion_lecontei
---CDSWMIKFFARISQMC---IKIHSKYEEVSFFLFQSK--KKKIADSHFFRSLNQDTA
-------LNTVSY----------
>YUUBBOX12
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLDDEIFYKSLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
The ratio between XO009980.1_26784332-20639090_-__Agapornis_vilveti and XO009980.1_20634508-20634890_-__Agapornis_vilveti was 0/75 = 0.
As you can see, some sequences, such as YP_009186705 or YUUBBOX12, do not have the [\d]+[-]+[\d] pattern. These do not have to be concatenated; they just have to be added to the output file.
Thanks a lot for your help.

First, let's read the text files into tuples of (name, seq):
with open('seq.txt', 'r') as f:  # read-only is enough here
    lines = f.readlines()
seq_map = []
for i in range(0, len(lines), 2):
    seq_map.append((lines[i].strip('\n'), lines[i+1].strip('\n')))
#[('>Name1.1_10-60_-__Seq1', 'AAA------'),
# ('>Name1.1_70-120_-__Seq1', '--AAAAAAA'),
# ('>Name2.1_12-78_-__Seq2', '--AAAAAAA')]
#
# or
#
# [('>Name1.1_1-40_-__Sp1', 'AAAAAACC-------------'),
# ('>Name1.1_67-90_-__Sp1', '------CCCCCCCCC------'),
# ('>Name1.1_90-32_-__Sp1', '--------------CCDDDDD'),
# ('>Name2.1_20-89_-__Sp2', 'AAAAAACCCCCCCCCCC----'),
# ('>Name2.1_78-200_-__Sp2', '-------CCCCCCCCCCDDDD')]
Then we define helper functions: check_concat decides whether two sequences should be concatenated, concat_seq merges the sequences, and concat_name merges the names (with count_num as a helper for reading the HSPs count):
import re
def count_num(x):
    num = re.findall(r'[\d]+?(?=HSPs)', x)
    count = int(num[0]) if num and 'HSPs' in x else 1
    return count
def concat_name(nx, ny):
    count, new_name = 0, []
    count += count_num(nx)
    count += count_num(ny)
    for ind, x in enumerate(nx.split('_')):
        if ind == 1:
            new_name.append('{}HSPs'.format(count))
        else:
            new_name.append(x)
    new_name = '_'.join([x for x in new_name])
    return new_name
def concat_seq(x, y):
    mash, new_seq = zip(x, y), ''
    for i in mash:
        if i.count('-') > 1:
            new_seq += '-'
        else:
            new_seq += i[0] if i[1] == '-' else i[1]
    return new_seq
def check_concat(x, y):
    mash, sim, dissim = zip(x, y), 0 ,0
    for i in mash:
        if i[0] == i[1] and '-' not in i:
            sim += 1
        if '-' in i and i.count('-') == 1:
            dissim += 1
    return False if not dissim or float(sim)/float(dissim) >= 0.2 else True
Then we will write a script to run over the tuples in sequence, checking for spn matches, then concat_checks, and taking forward the new pairing for the next comparison, adding to the final list where necessary:
tmp_seq_map = seq_map[:]
final_seq = []
for ind in range(1, len(seq_map)):
    end = True if ind == len(seq_map)-1 else False
    pair_a = tmp_seq_map[ind-1]
    pair_b = tmp_seq_map[ind]
    name_a = pair_a[0][:]
    name_b = pair_b[0][:]
    if name_a.split('__')[1] == name_b.split('__')[1]:
        if check_concat(pair_a[1], pair_b[1]):
            new_name = concat_name(pair_a[0], pair_b[0])
            new_seq = concat_seq(pair_a[1], pair_b[1])
            tmp_seq_map[ind] = (((new_name, new_seq)))
            if end:
                final_seq.append(tmp_seq_map[ind])
                end = False
        else:
            final_seq.append(pair_a)
    else:
        final_seq.append(pair_a)
    if end:
        final_seq.append(pair_b)
print(final_seq)
#[('>Name1.1_2HSPs_-__Seq1', 'AAAAAAAAA'),
# ('>Name2.1_12-78_-__Seq2', '--AAAAAAA')]
#
# or
#
#[('>Name1.1_3HSPs_-__Sp1', 'AAAAAACCCCCCCCCCDDDDD'),
# ('>Name2.1_20-89_-__Sp2', 'AAAAAACCCCCCCCCCC----'),
# ('>Name2.1_78-200_-__Sp2', '-------CCCCCCCCCCDDDD')]
Please note that I have checked for concatenation of only consecutive sequences from the text files, and that you would have to re-use the methods I've written in a different script for accounting for combinations. I leave that to your discretion.
Hope this helps. :)
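For the non-consecutive case left open above, one possible approach is to keep merging any qualifying pair within a species group until no pair qualifies. The sketch below is an illustration under stated assumptions, not a tested solution for real alignments: the function names are hypothetical, all sequences in a group are assumed to have equal length, "a letter aligned with another letter" is taken to mean any column where neither sequence has a '-', and at such columns the first sequence's letter is kept.

```python
def overlap_ratio(a, b):
    """Letters aligned with a letter, divided by letters aligned with '-'."""
    both = sum(1 for x, y in zip(a, b) if x != "-" and y != "-")
    one = sum(1 for x, y in zip(a, b) if (x == "-") != (y == "-"))
    # No gap-aligned letters at all: treat as "do not merge".
    return both / one if one else float("inf")

def merge_pair(a, b):
    """Fill the gaps of a with the characters of b."""
    return "".join(y if x == "-" else x for x, y in zip(a, b))

def merge_group(seqs, threshold=0.2):
    """Repeatedly merge any pair below the threshold.

    Returns a list of (sequence, number_of_merged_sequences) tuples.
    """
    merged = [(s, 1) for s in seqs]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                (a, na), (b, nb) = merged[i], merged[j]
                if overlap_ratio(a, b) < threshold:
                    merged[i] = (merge_pair(a, b), na + nb)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan with the updated list
    return merged

sp1 = ["AAAAAACC-------------", "------CCCCCCCCC------", "--------------CCDDDDD"]
sp2 = ["AAAAAACCCCCCCCCCC----", "-------CCCCCCCCCCDDDD"]
print(merge_group(sp1))  # [('AAAAAACCCCCCCCCCDDDDD', 3)]
print(merge_group(sp2))  # both sequences kept, each with a count of 1
```

The count returned with each sequence is what would go into the XHSPs part of the new name; grouping the records by species and rebuilding the names is left out to keep the sketch short.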

You can do this as follows.
from collections import defaultdict
with open('lines.txt','r') as fp:
    lines=fp.readlines()
dnalist = defaultdict(list)
for i,line in enumerate(lines):
    line = line.replace('\n','')
    if i%2: #'Name' in line:
        dnalist[n].append(line)
    else:
        n = line.split('-')[-1]
That gives you a dictionary with keys being the file numbers and values being the dna sequences in a list.
def calc_ratio(str1,str2):
    n_skipped,n_matched,n_notmatched=0,0,0
    print(len(str1),len(str2))
    for i,ch in enumerate(str1):
        if ch=='-' or str2[i]=='-':
            n_skipped += 1  # was "n_skipped +1", which never updated the counter
        elif ch == str2[i]:
            n_matched += 1
        else:
            n_notmatched+=1
    retval = float(n_matched)/float(n_matched+n_notmatched+n_skipped)
    print(n_matched,n_notmatched,n_skipped)
    return retval
That gets you the ratio; you might want to consider the case where characters in the sequences dont match (and neither is '-'), here I assumed that's not a different case than one being '-'.
A helper function to concatenate the strings: here I took the case of non-matching chars and put in an 'X' to mark it (if it ever happens) .
def dna_concat(str1,str2):
    outstr=[]
    for i,ch in enumerate(str1):
        if ch!=str2[i]:
            if ch == '-':
                outchar = str2[i]
            elif str2[i] == '-':
                outchar = ch
            else:
                outchar = 'X'
        else:
            outchar = ch
        outstr.append(outchar)
    outstr = ''.join(outstr)
    return outstr
And finally a loop through the dictionary lists to get the concatenated answers, stored in another dictionary with the same keys and lists of concatenations as values.
answers = defaultdict(list)  # created once, so results from every group are kept
for filenum, seqs in dnalist.items():  # "seqs" avoids shadowing dnalist
    print(seqs)
    for i,seq in enumerate(seqs):
        for seq2 in seqs[i+1:]:
            ratio = calc_ratio(seq,seq2)
            print('i {} {} ratio {}'.format(seq,seq2,ratio))
            if ratio<0.2:
                answers[filenum].append(dna_concat(seq,seq2))
                print(dna_concat(seq,seq2))

Print data between positions within a loop

I have one file.
File1 which has 3 columns. Data are tab separated
File1:
2 4 Apple
6 7 Samsung
Let's say if I run a loop of 10 iteration. If the iteration has value between column 1 and column 2 of File1, then print the corresponding 3rd column from File1, else print "0".
The columns may or may not be sorted, but 2nd column is always greater than 1st. Range of values in the two columns do not overlap between lines.
The output Result should look like this.
Result:
0
Apple
Apple
Apple
0
Samsung
Samsung
0
0
0
My program in python is here:
chr5_1 = [[]]
for line in file:
    line = line.rstrip()
    line = line.split("\t")
    chr5_1.append([line[0],line[1],line[2]])
# Here I store all position information in chr5_1 list in list
chr5_1.pop(0)
for i in range (1,10):
    for listo in chr5_1:
        L1 = " ".join(str(x) for x in listo[:1])
        L2 = " ".join(str(x) for x in listo[1:2])
        L3 = " ".join(str(x) for x in listo[2:3])
        if int(L1) <= i and int(L2) >= i:
            print(L3)
            break
        else:
            print ("0")
            break
I am confused about the loop iteration and its break point.

Try this:
chr5_1 = dict()
for line in file:
    line = line.rstrip()
    _from, _to, value = line.split("\t")
    for i in range(int(_from), int(_to) + 1):
        chr5_1[i] = value
for i in range(1, 11):  # 10 iterations: 1 through 10
    print(chr5_1.get(i, "0"))

I think this is a job for else:
position_information = []
with open('file1', 'r') as f:  # text mode, so split('\t') works on str
    for line in f:
        position_information.append(line.strip().split('\t'))
for i in range(1, 11):
    for start, through, value in position_information:
        if i >= int(start) and i <= int(through):
            print(value)
            # No need to continue searching for something to print on this line
            break
    else:
        # We never found anything to print on this line, so print 0 instead
        print(0)
This gives the result you're looking for:
0
Apple
Apple
Apple
0
Samsung
Samsung
0
0
0
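The for/else behaviour can be checked without a file. The sketch below wraps the same logic in a function that returns the labels instead of printing them; the name labels_for_positions is hypothetical, and the ranges are passed in as already-parsed (start, end, label) tuples.

```python
def labels_for_positions(ranges, limit):
    """Return the label covering each position 1..limit, or "0" if none does."""
    labels = []
    for i in range(1, limit + 1):
        for start, end, value in ranges:
            if start <= i <= end:
                labels.append(value)
                break           # found a range; this skips the else clause
        else:
            # Runs only when the inner loop finished without hitting break.
            labels.append("0")
    return labels

ranges = [(2, 4, "Apple"), (6, 7, "Samsung")]
result = labels_for_positions(ranges, 10)
print("\n".join(result))
```

The else clause belongs to the inner for loop, not to the if: it runs once per position, and only when no range matched.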

Setup:
import io
s = '''2 4 Apple
6 7 Samsung'''
# Python 3.x
f = io.StringIO(s)
# Python 2.x
#f = io.BytesIO(s)
If the lines of the file are not sorted by the first column:
import csv
reader = csv.reader(f, delimiter = ' ', skipinitialspace = True)
f = list(reader)
# compare the first column as a number, not as text ("10" would sort before "2")
f.sort(key = lambda row: int(row[0]))
Read each line, work out what to print and how many times to print it, print it, then move on to the next line:
def print_stuff(thing, n):
    while n > 0:
        print(thing)
        n -= 1
limit = 10
prev_end = 0  # last position that has already been printed
for line in f:
    # if iterating over a file, separate the columns
    begin, end, text = line.strip().split()
    # if iterating over the sorted list of lines
    #begin, end, text = line
    begin, end = map(int, (begin, end))
    # ranges that start beyond the limit are never printed
    if begin > limit:
        break
    # how many zeros between the previous range and this one?
    gap = begin - prev_end - 1
    print_stuff('0', gap)
    # don't exceed the limit
    end = end if end < limit else limit
    # how many words?
    span = (end - begin) + 1
    print_stuff(text, span)
    prev_end = end
# any more zeros?
gap = limit - prev_end
print_stuff('0', gap)

Python: keep top Nth results for csv.reader

I am filtering a csv file in which every title has many duplicate IDs with different prediction values, so the column 2 (pythoniac) is different. I would like to keep only the 30 lowest values, each with a unique ID. I came up with this code, but I don't know how to keep the lowest 30 entries.
Can you please suggest how to obtain 30 entries that are unique by ID?
# title1 id1 100 7.78E-25 # example of the line
with open("test.txt") as fi:
    cmp = {}
    for R in csv.reader(fi, delimiter='\t'):
        for L in ligands:
            newR = R[0], R[1]
            if R[0] == L:
                if (int(R[2]) <= int(1000) and int(R[2]) != int(0) and float(R[3]) < float("1.0e-10")):
                    if newR in cmp:
                        if float(cmp[newR][3]) > float(R[3]):
                            cmp[newR] = R[:-2]
                    else:
                        cmp[newR] = R[:-2]

Maybe try something along this line...
from bisect import insort
nth_lowest = [very_high_value] * 30
for x in my_loop:
    do_stuff()
    ...
    if x < nth_lowest[-1]:
        insort(nth_lowest, x)
        nth_lowest.pop() # remove the highest element
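The suggestion above does not yet handle the "unique by ID" requirement. A two-step alternative is sketched below: first keep the lowest value seen for each ID in a dictionary, then pick the N smallest with heapq.nsmallest. The function name lowest_unique and the sample rows are hypothetical, and the rows are assumed to be already reduced to (id, value) pairs with numeric values.

```python
import heapq

def lowest_unique(rows, n=30):
    """Return up to n (id, value) pairs with the lowest values, one per id."""
    best = {}
    for row_id, score in rows:
        # Keep only the lowest value seen so far for each id.
        if row_id not in best or score < best[row_id]:
            best[row_id] = score
    return heapq.nsmallest(n, best.items(), key=lambda item: item[1])

rows = [("id1", 7.78e-25), ("id2", 3.1e-12), ("id1", 1.0e-30), ("id3", 5.0e-20)]
top = lowest_unique(rows, n=2)
print(top)  # [('id1', 1e-30), ('id3', 5e-20)]
```

If the limit of 30 applies per title rather than to the whole file, the same function could be called once per title on that title's rows.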
