Problem with reading data from file in Python

EDIT:
Thanks for fixing it! Unfortunately, it messed up the logic. I'll explain what this program does. It's a solution to a task about a playing-card trick. There are N cards on the table. First and second are the numbers on the front and back of each card. The trick can only be done if the visible numbers are in non-decreasing order. Someone from the audience can come and swap the places of two cards. M represents how many swaps will be made; A and B represent which cards will be swapped. The magician can flip any number of cards to see the other side. The program must tell whether the magician can do the trick.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
    n = data.readline()
    n = int(n)
    for _ in range(n):
        first, second = (int(x) for x in data.readline().split(':'))
        first, second = sorted((first, second))
        pairs.append(Pair(first, second))  # add to the list by appending
    m = data.readline()
    m = int(m)
    for _ in range(m):
        a, b = (int(x) for x in data.readline().split('-'))
        a -= 1
        b -= 1
        temp = pairs[a]
        pairs[a] = pairs[b]
        pairs[b] = temp
    p = -1e-9
    ok = True
    for k in range(0, n):
        if pairs[k].first >= p:
            p = pairs[k].first
        elif pairs[k].second >= p:
            p = pairs[k].second
        else:
            ok = False
            break
    if ok:
        results.write("YES\n")
    else:
        results.write("NO\n")
data:
4
2:5
3:4
6:3
2:7
2
3-4
1-3
results:
YES
YES
YES
YES
YES
YES
YES
What should be in results:
NO
YES

The code is full of bugs: you should write and test it incrementally instead of all at once. It seems that you started using readlines (which is a good way of managing this kind of work), but you kept the rest of the code in a read-one-line-at-a-time style. If you use readlines, the line for i, line in enumerate(data): should be changed to for i, line in enumerate(lines):.
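For illustration only (not the question's data), the two styles look like this; mixing them is what breaks the reading:

# style 1: read one line at a time; each readline() consumes the next line
with open('data.txt') as data:
    n = int(data.readline())
    first_pair = data.readline()

# style 2: read everything up front, then index into the list of lines
with open('data.txt') as data:
    lines = data.readlines()
    n = int(lines[0])
    first_pair = lines[1]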
Anyway, here is a corrected version with some explanation. I hope I did not mess with the logic.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
# The following line created a huge list of "Pair" types, not instances
# pairs = [Pair] * (2*200*1000+1)
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
    n = data.readline()
    n = int(n)
    # removing the reading of all data...
    # lines = data.readlines()
    # m = lines[n]
    # removed bad for: for i, line in enumerate(data):
    for _ in range(n):  # you don't need the index
        first, second = (int(x) for x in data.readline().split(':'))
        # removed unnecessary recasting to int
        # first = int(first)
        # second = int(second)
        # changed the swapping to a more elegant way
        first, second = sorted((first, second))
        pairs.append(Pair(first, second))  # we add to the list by appending
    # removed unnecessary for: once you read all the firsts and seconds,
    # you have reached M
    m = data.readline()
    m = int(m)
    # you don't need the index... indeed you don't need to count (you can read
    # to the end of file, unless it is malformed)
    for _ in range(m):
        a, b = (int(x) for x in data.readline().split('-'))
        # removed unnecessary recasting to int
        # a = int(a)
        # b = int(b)
        a -= 1
        b -= 1
        temp = pairs[a]
        pairs[a] = pairs[b]
        pairs[b] = temp
    p = -1e-9
    ok = True
    for k in range(0, n):
        if pairs[k].first >= p:
            p = pairs[k].first
        elif pairs[k].second >= p:
            p = pairs[k].second
        else:
            ok = False
            break
    if ok:
        results.write("YES\n")
    else:
        results.write("NO\n")
Response prior to the edit
range(1, 1) is empty, so this part of the code:
for i in range (1, 1):
    n = data.readline()
    n = int(n)
does not define n, so when execution gets to line 12 you get an error.
You can remove the for statement, changing those three lines to:
n = data.readline()
n = int(n)
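A quick interpreter check confirms that range(1, 1) yields nothing:

>>> list(range(1, 1))
[]
>>> for i in range(1, 1):
...     print("never runs")
...
>>>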

Related

Speed up re.sub() on large strings representing large files in python?

Hi, I am running this Python code to reduce multi-line patterns to singletons; however, I am doing this on extremely large files of 200,000+ lines.
Here is my current code:
import sys
import re
with open('largefile.txt', 'r+') as file:
    string = file.read()
    string = re.sub(r"((?:^.*\n)+)(?=\1)", "", string, flags=re.MULTILINE)
    file.seek(0)
    file.write(string)
    file.truncate()
The problem is that the re.sub() is taking ages (10+ minutes) on my large files. Is it possible to speed this up in any way?
Example input file:
hello
mister
hello
mister
goomba
bananas
goomba
bananas
chocolate
hello
mister
Example output:
hello
mister
goomba
bananas
chocolate
hello
mister
These patterns can be bigger than 2 lines as well.
Regexps are compact here, but will never be speedy. For one reason, you have an inherently line-based problem, but regexps are inherently character-based. The regexp engine has to deduce, over & over & over again, where "lines" are by searching for newline characters, one at a time. For a more fundamental reason, everything here is brute-force character-at-a-time search, remembering nothing from one phase to the next.
So here's an alternative. Split the giant string into a list of lines, just once at the start. Then that work never needs to be done again. And then build a dict, mapping a line to a list of the indices at which that line appears. That takes linear time. Then, given a line, we don't have to search for it at all: the list of indices tells us at once every place it appears.
Worst-case time can still be poor, but I expect it will be at least a hundred times faster on "typical" inputs.
def dedup(s):
    from collections import defaultdict
    lines = s.splitlines(keepends=True)
    line2ix = defaultdict(list)
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    out = []
    n = len(lines)
    i = 0
    while i < n:
        line = lines[i]
        # Look for longest adjacent match between i:j and j:j+(j-i).
        # j must be > i, and j+(j-i) <= n so that j <= (n+i)/2.
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            # Lines at i and j match.
            if all(lines[i + k] == lines[j + k]
                   for k in range(1, j - i)):
                searching = False
                break
        if searching:
            out.append(line)
            i += 1
        else:  # skip the repeated block at i:j
            i = j
    return "".join(out)
EDIT
This incorporates Kelly's idea of incrementally updating line2ix using a deque so that the candidates looked at are always in range(i+1, maxj+1). Then the innermost loop doesn't need to check for those conditions.
It's a mixed bag, losing a little when there are very few duplicates, because in such cases the line2ix sequences are very short (or even singletons for unique lines).
Here's timing for a case where it really pays off: a file containing about 30,000 lines of Python code. Many lines are unique, but a few kinds of lines are very common (for example, the empty "\n" line). Cutting the work in the innermost loop can pay for those common lines. dedup_nuts was picked for the name because this level of micro-optimization is, well, nuts ;-)
71.67997950001154 dedup_original
48.948923900024965 dedup_blhsing
2.204853900009766 dedup_Tim
9.623824400012381 dedup_Kelly
1.0341253000078723 dedup_blhsingTimKelly
0.8434303000103682 dedup_nuts
And the code:
def dedup_nuts(s):
    from array import array
    from collections import deque
    encode = {}
    decode = []
    lines = array('L')
    for line in s.splitlines(keepends=True):
        if (code := encode.get(line)) is None:
            code = encode[line] = len(encode)
            decode.append(line)
        lines.append(code)
    del encode
    line2ix = [deque() for line in lines]
    view = memoryview(lines)
    out = []
    n = len(lines)
    i = 0
    last_maxj = -1
    while i < n:
        maxj = (n + i) // 2
        for j in range(last_maxj + 1, maxj + 1):
            line2ix[lines[j]].appendleft(j)
        last_maxj = maxj
        line = lines[i]
        js = line2ix[line]
        assert js[-1] == i, (i, n, js)
        js.pop()
        for j in js:
            #assert i < j <= maxj
            if view[i : j] == view[j : j + j - i]:
                for k in range(i + 1, j):
                    js = line2ix[lines[k]]
                    assert js[-1] == k, (i, k, js)
                    js.pop()
                i = j
                break
        else:
            out.append(line)
            i += 1
    #assert all(not d for d in line2ix)
    return "".join(map(decode.__getitem__, out))
Some key invariants are checked by asserts there, but the expensive ones are commented out for speed. Season to taste.
@TimPeters' line-based comparison approach is good but wastes time in repeated comparisons of the same lines. @KellyBundy's encoding idea is good but wastes time in the overhead of a regex engine and text encoding.
A more efficient approach would be to adopt @KellyBundy's encoding idea in @TimPeters' algorithm, but instead of encoding lines into characters, encode them into an array.array of 32-bit integers to avoid the overhead of text encoding, and then use a memoryview of the array for quick slice-based comparisons:
from array import array
def dedup_blhsingTimKelly2(s):
    encode = {}
    decode = []
    lines = s.splitlines(keepends=True)
    n = len(lines)
    for line in lines:
        if line not in encode:
            encode[line] = len(decode)
            decode.append(line)
    lines = array('L', map(encode.get, lines))
    del encode
    line2ix = [[] for _ in range(n)]
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    view = memoryview(lines)
    out = []
    i = 0
    while i < n:
        line = lines[i]
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            if view[i: j] == view[j: j + j - i]:
                searching = False
                break
        if searching:
            out.append(decode[line])
            i += 1
        else:
            i = j
    return "".join(out)
A run of @KellyBundy's benchmark code with this approach added, originally named dedup_blhsingTimKelly, now amended with Tim and Kelly's comments and named dedup_blhsingTimKelly2:
2.6650364249944687 dedup_original
1.3109814710041974 dedup_blhsing
0.5598453340062406 dedup_Tim
0.9783012029947713 dedup_Kelly
0.24442325498966966 dedup_blhsingTimKelly
0.21991234300367068 dedup_blhsingTimKelly2
Try it online!
Nesting a quantifier within a quantifier is expensive and in this case unnecessary.
You can use the following regex without nesting instead:
string = re.sub(r"(^.*\n)(?=\1)", "", string, flags=re.M | re.S)
In the following test it more than cuts the time in half compared to your approach:
https://replit.com/#blhsing/HugeTrivialExperiment
Another idea: You're talking about "200,000+ lines", so we can encode each unique line as one of the 1,114,112 possible characters and simplify the regex to r"(.+)(?=\1)". And after the deduplication, decode the characters back to lines.
def dedup(s):
    encode = {}
    decode = {}
    lines = s.split('\n')
    for line in lines:
        if line not in encode:
            c = chr(len(encode))
            encode[line] = c
            decode[c] = line
    s = ''.join(map(encode.get, lines))
    s = re.sub(r"(.+)(?=\1)", "", s, flags=re.S)
    return '\n'.join(map(decode.get, s))
A little benchmark based on blhsing's but with some repeating lines (times in seconds):
2.5934535119995417 dedup_original
1.2498892020012136 dedup_blhsing
0.5043159520009795 dedup_Tim
0.9235864399997809 dedup_Kelly
I built a pool of 50 lines of 10 random letters, then joined 5000 random lines from that pool.
The two fastest with 10,000 lines instead:
2.0905018440007552 dedup_Tim
3.220036650000111 dedup_Kelly
Code (Try it online!):
import re
import random
import string
from timeit import timeit

strings = [''.join((*random.choices(string.ascii_letters, k=10), '\n')) for _ in range(50)]
s = ''.join(random.choices(strings, k=5000))

def dedup_original(s):
    return re.sub(r"((?:^.*\n)+)(?=\1)", "", s, flags=re.MULTILINE)

def dedup_blhsing(s):
    return re.sub(r"(^.*\n)(?=\1)", "", s, flags=re.M | re.S)

def dedup_Tim(s):
    from collections import defaultdict
    lines = s.splitlines(keepends=True)
    line2ix = defaultdict(list)
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    out = []
    n = len(lines)
    i = 0
    while i < n:
        line = lines[i]
        # Look for longest adjacent match between i:j and j:j+(j-i).
        # j must be > i, and j+(j-i) <= n so that j <= (n+i)/2.
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            # Lines at i and j match.
            if all(lines[i + k] == lines[j + k]
                   for k in range(1, j - i)):
                searching = False
                break
        if searching:
            out.append(line)
            i += 1
        else:  # skip the repeated block at i:j
            i = j
    return "".join(out)

def dedup_Kelly(s):
    encode = {}
    decode = {}
    lines = s.split('\n')
    for line in lines:
        if line not in encode:
            c = chr(len(encode))
            encode[line] = c
            decode[c] = line
    s = ''.join(map(encode.get, lines))
    s = re.sub(r"(.+)(?=\1)", "", s, flags=re.S)
    return '\n'.join(map(decode.get, s))

funcs = dedup_original, dedup_blhsing, dedup_Tim, dedup_Kelly
expect = funcs[0](s)
for f in funcs[1:]:
    print(f(s) == expect)
for _ in range(3):
    for f in funcs:
        t = timeit(lambda: f(s), number=1)
        print(t, f.__name__)
    print()

how to create a list and then print it in ascending order

def list():
    list_name = []
    list_name_second = []
    with open('CoinCount.txt', 'r', encoding='utf-8') as csvfile:
        num_lines = 0
        for line in csvfile:
            num_lines = num_lines + 1
        i = 0
        while i < num_lines:
            for x in volunteers[i].name:
                if x not in list_name:
                    f = 0
                    while f < num_lines:
                        addition = []
                        if volunteers[f].true_count == "Y":
                            addition.append(1)
                        else:
                            addition.append(0)
                        f = f + 1
                        if f == num_lines:
                            decimal = sum(addition) / len(addition)
                            d = decimal * 100
                            percentage = float("{0:.2f}".format(d))
                            list_name_second.append({'Name': x, 'percentage': str(percentage)})
                            list_name.append(x)
            i = i + 1
            if i == num_lines:
                def sort_percentages(list_name_second):
                    return list_name_second.get('percentage')
                print(list_name_second, end='\n\n')
Above is a segment of my code. It essentially means: if the name on the nth line hasn't been listed already, find the percentage of accurately counted coins and add that to a list, then print that list.
The issue is that when I run this, the program gets stuck in a while loop, continuously executing addition.append(1). I'm not sure why, so please can you (using the code displayed) let me know how to update the code to make it run as intended. Also, if it helps, the first two lines of the txt file read:
Abena,5p,325.00,Y
Malcolm,1p,3356.00,N
This doesn't matter much, but just in case you need it: I suspect that the reason it is stuck looping on addition.append(1) is that the first line has a "Y" as its true_count.
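For what it's worth, here is a minimal sketch of what the description seems to aim for, assuming volunteers is a list of records with .name and .true_count attributes (as the code above uses) and that the percentage wanted is per name rather than over the whole file (both assumptions on my part):

from collections import defaultdict

def accuracy_percentages(volunteers):
    # name -> [accurate_count, total_count]
    counts = defaultdict(lambda: [0, 0])
    for v in volunteers:
        counts[v.name][0] += (v.true_count == "Y")
        counts[v.name][1] += 1
    result = [{'Name': name, 'percentage': round(100 * acc / total, 2)}
              for name, (acc, total) in counts.items()]
    result.sort(key=lambda entry: entry['percentage'])  # ascending order
    return result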

Replace for loop? This function works but it takes too long. I'm looking for ways to improve it

It works, but it takes 40 seconds to compute one simple moving average for one stock. I'm a beginner. Is there any way to replace those for loops, or a more efficient way to run this? I'm reading about numpy but I don't understand how it could replace a loop.
I'm trying to build a csv that stores all the indicator values from the current period back to the start of my dataframe.
I currently only have one moving average, but with this speed it's pointless to add anything else :)
def runcheck(df, adress):
    row_count = int(0)
    row_count = len(df)
    print(row_count)
    lastp = row_count - 1
    row_count2 = int(0)
    mabuild = int(0)
    ma445_count = int(0)
    ma_count2 = int(0)
    row_count5 = int(0)
    row_count3 = int(0)
    row_count4 = int(0)
    resultat = int(0)
    timside_count = int(0)
    slott_count = int(0)
    sick_count = int(0)
    rad_data = []
    startT = time.time()
    ## this one checks all the way back, e.g. today, then yesterday, then the day before
    for row in df.index:
        row_count2 += 1
        timside_count = row_count - row_count2
        if timside_count >= 445:
            for row in df.index:
                row_count5 = row_count - row_count2
                slott_count = row_count5 - row_count3
                mabuild = mabuild + df.iloc[slott_count, 5]
                row_count3 += 1
                row_count4 += 1
                if row_count4 == 445:
                    resultat = mabuild / row_count4
                    rad_data.append(resultat)
                    row_count3 = int(0)
                    row_count4 = int(0)
                    mabuild = int(0)
                    resultat = 0
                    break
            ## saves to csv before the loop starts over
            with open(adress, "a") as fp:
                wr = csv.writer(fp,)
                wr.writerow(rad_data)
                rad_data.clear()
    print('Time was :', time.time() - startT)
    stop = input('')
Try this:
import time
import csv
import numpy as np
from functools import reduce

def runcheck(df, adress):
    startT = time.time()
    rad_data = map(lambda i: reduce(lambda x, y: x + y, map(lambda z: df.iloc[z, 5], np.arange(i-445, i)))/445, np.arange(445, len(df.index)))
    '''
    Explanation
    list_1 = np.arange(445, len(df.index)) -> Create a list of integers from 445 to len(df.index)
    rad_data = map(lambda i: function, list_1) -> Apply function (see below) to each value (i) in the generated list_1
    function = reduce(lambda x, y: x + y, list_2)/445 -> Take 2 consecutive values (x, y) in list_2 (see below) and sum them, repeat until one value is left (i.e. the sum of list_2), then divide by 445
    list_2 = map(lambda z: df.iloc[z, 5], list_3) -> Map each value (z) in list_3 (see below) to df.iloc[z, 5]
    list_3 = np.arange(i-445, i) -> Create a list of integers from i-445 to i (value i from list_1)
    '''
    # writing to your csv file outside the loop once you have all the values is better,
    # as you remove the overhead of re-opening the file each time
    with open(adress, "a") as fp:
        wr = csv.writer(fp,)
        for data in rad_data:
            wr.writerow([data])
    print('Time was :', time.time() - startT)
    stop = input('')
Not sure it works, as I don't have sample data. Let me know if there are errors and I'll try to debug!
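As a further note, assuming df is a pandas DataFrame (which the df.iloc calls suggest) and that column 5 holds the price to average, pandas has a built-in rolling mean, so the whole 445-period simple moving average can be computed without any explicit loop. A sketch, not a drop-in replacement for the function above:

import pandas as pd

def runcheck_rolling(df, adress):
    # simple moving average over 445 rows of column 5; the first 444 rows
    # have no complete window, so the resulting NaNs are dropped
    ma445 = df.iloc[:, 5].rolling(window=445).mean().dropna()
    ma445.to_csv(adress, mode='a', header=False, index=False)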

Levenshtein distance in a file

The statement says:
Modify the above program so that, given the pattern GGCCTTGCCATTGG, for each of the first 10 lines of the previous file it reports:
· the minimum edit distance to the most similar substring of that line;
· the substrings of that line found at that minimum edit distance.
The above program is this:
import time

def levenshtein_distance(first, second):
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length): distance_matrix[i][0] = i
    for j in range(second_length): distance_matrix[0][j] = j
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 2
            substitution = distance_matrix[i-1][j-1] + 1
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]
def dna(patro):
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
    f = open("HUMAN-DNA.txt")
    text = f.readlines()
    f.close()
    distanciaMin = 100000000
    distanciaPosicion = 0
    distanciaLinea = 0
    distanciaSubstring = ""
    numeroLinea = 0
    for line in text:
        numeroLinea = numeroLinea + 1
        for i in range(len(line) - len(patro)):
            cadena = line[i:i + len(patro)]
            distancia = levenshtein_distance(cadena, patro)
            if distancia < distanciaMin:
                distanciaMin = distancia
                distanciaPosicion = 1
                distanciaLinea = numeroLinea
                distanciaSubstring = cadena
    t2 = time.perf_counter()
Now I call it with the new pattern:
dna("GGCCTTGCCATTGG")
I have the edit distance, which is distanciaMin, and I'm not sure about the result of distanciaSubstring, which should hold the substrings of that line (second point of the statement). My question is: how can I count the first ten lines in the text?
A part of the file is:
CCCATCTCTTTCTCATTCCTTGGTTGAGAACACGAACTTCAGGACTTGCCTCACACTAGGGCCCATTCTT
TGTTTCCCAGAAAGAAGAGGCTCTCCACACAGAGTCCCATGTACACCAGGCTGTCAACAAACATGAATTG
AATGAAGGAGTGGATGGTTGGGTGGAAGTGATTTAAGAAATCCTAACTGGGGAATTTCACTGGAAACTTA
GGAAATTCAATTTATATAAAGTCTATGAATCGTCCATTTTTGTGTCCGCACATTCAAATGCTGTAGCTAA
TTTCCTGCTAAACAGTAGAAATTCAGTAAGTGTTCATGTTGAAAGGATGAAATTTGAGTGCTCTTGCATC
CTCAAAGAACTCTAGTAAAATAGAAATAAAGCTTTATTTGGAAGATTAAGTCATGAGCATAATTATGAGA
AGGCGGTCATTCTAATAATAGTGTCTTCACAAGTAGATGCTACATGCTGTGTAATATTTTGACTAAAAAA
AGTTCCTCTCAACATTTCTGAAGTGAGATAATGTACAACGATCCATGTTTTTAGCTACCTTGATAAGTTT
AGTGCATCCAGGGCTCCTTTCTTACCTGCTAACCGCCGAGTTTCAAATGCTAAGAAATTCTTCATTTCCT
AACACAAATATTCAATATAATTGCTGGTTGTTTGGGAGAAGAAAAATTTAGAATTCAGAAAGAAATACAG
AATGAAATGTTCTAATCAATCGAAAAAGGATTCTATAGACTTCGACGTTGTCTGGTTTACAAAGCAGTCT
I couldn't understand your full question, but I'll try to address "How can I count the first ten lines in the text?". You can use filehandler.readlines(). It loads the file into memory as a list with one element per line (split on the newline character).
Then you can read 10 lines from the list. You can try something like this:
>>> a = [0,1,2,3,4,5,6,7,8,9]  # read file as a list of lines (a)
>>> def line(a, jump=2):  # keep jump=10 for your requirement
...     lines = len(a)
...     i = 0
...     while i < lines + 1:
...         yield a[i:i+jump]
...         i += jump
...
>>> foo = line(a)
>>> next(foo)
[0, 1]
>>> next(foo)
[2, 3]
>>> next(foo)
[4, 5]
For your code it will be:
foo = line(text, 10)
next(foo)  # should return 10 lines on each call
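If the chunking generator is more than you need, a plain slice of the readlines() result also limits the work to the first ten lines. A sketch against the code in the question:

f = open("HUMAN-DNA.txt")
text = f.readlines()[:10]   # only the first ten lines
f.close()
for numeroLinea, line in enumerate(text, start=1):
    ...  # run the inner substring/levenshtein loop from dna() here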

Efficient algorithm for counting unique elements in "suffixes" of an array

I was doing 368B on CodeForces with Python 3, which basically asks you to print the number of unique elements in a series of "suffixes" of a given array. Here's my solution (with some additional redirection code for testing):
import sys
if __name__ == "__main__":
f_in = open('b.in', 'r')
original_stdin = sys.stdin
sys.stdin = f_in
n, m = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
a = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
l = [None] * m
for i in range(m):
l[i] = int(sys.stdin.readline().rstrip())
l_sorted = sorted(l)
l_order = sorted(range(m), key=lambda k: l[k])
# the ranks of elements in l
l_rank = sorted(range(m), key=lambda k: l_order[k])
# unique_elem[i] = non-duplicated elements between l_sorted[i] and l_sorted[i+1]
unique_elem = [None] * m
for i in range(m):
unique_elem[i] = set(a[(l_sorted[i] - 1): (l_sorted[i + 1] - 1)]) if i < m - 1 else set(a[(l_sorted[i] - 1): n])
# unique_elem_cumulative[i] = non-duplicated elements between l_sorted[i] and a's end
unique_elem_cumulative = unique_elem[-1]
# unique_elem_cumulative_count[i] = #unique_elem_cumulative[i]
unique_elem_cumulative_count = [None] * m
unique_elem_cumulative_count[-1] = len(unique_elem[-1])
for i in range(m - 1):
i_rev = m - i - 2
unique_elem_cumulative = unique_elem[i_rev] | unique_elem_cumulative
unique_elem_cumulative_count[i_rev] = len(unique_elem_cumulative)
with open('b.out', 'w') as f_out:
for i in range(m):
idx = l_rank[i]
f_out.write('%d\n' % unique_elem_cumulative_count[idx])
sys.stdin = original_stdin
f_in.close()
The code produces correct results except possibly for the last big test, with n = 81220 and m = 48576 (a simulated input file is here, and an expected output created by a naive solution is here). The time limit is 1 second, and I can't get my solution under it. So is it possible to solve it within 1 second with Python 3? Thank you.
UPDATE: an "expected" output file is added, which is created by the following code:
import sys
if __name__ == "__main__":
f_in = open('b.in', 'r')
original_stdin = sys.stdin
sys.stdin = f_in
n, m = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
a = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
with open('b_naive.out', 'w') as f_out:
for i in range(m):
l_i = int(sys.stdin.readline().rstrip())
f_out.write('%d\n' % len(set(a[l_i - 1:])))
sys.stdin = original_stdin
f_in.close()
You'll be cutting it close, I think. On my admittedly rather old machine, the I/O alone takes 0.9 seconds per run.
An efficient algorithm, I think, will be to iterate backwards through the array, keeping track of which distinct elements you've found. When you find a new element, add its index to a list. This will therefore be a descending sorted list.
Then for each li, the index of li in this list will be the answer.
For the small sample dataset
10 10
1 2 3 4 1 2 3 4 100000 99999
1
2
3
4
5
6
7
8
9
10
The list would contain [10, 9, 8, 7, 6, 5] since when reading from the right, the first distinct value occurs at index 10, the second at index 9, and so on.
So then if li = 5, it has index 6 in the generated list, so 6 distinct values are found at indices >= li. Answer is 6
If li = 8, it has index 3 in the generated list, so 3 distinct values are found at indices >= li. Answer is 3
It's a little fiddly that the exercise numbers from 1 while Python counts from 0.
And to find this index quickly using existing library functions, I've reversed the list and then use bisect.
import timeit
from bisect import bisect_left

def doit():
    f_in = open('b.in', 'r')
    n, m = [int(i) for i in f_in.readline().rstrip().split(' ')]
    a = [int(i) for i in f_in.readline().rstrip().split(' ')]

    found = {}
    indices = []
    for i in range(n - 1, 0, -1):
        if not a[i] in found:
            indices.append(i + 1)
            found[a[i]] = True
    indices.reverse()
    length = len(indices)

    for i in range(m):
        l = int(f_in.readline().rstrip())
        index = bisect_left(indices, l)
        print(length - index)

if __name__ == "__main__":
    print(timeit.timeit('doit()', setup="from bisect import bisect_left;from __main__ import doit", number=10))
On my machine this outputs 12 seconds for 10 runs. Still too slow.
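Since the per-line readline() and print calls account for much of that time, a variant that reads the whole file in one call and writes a single joined string should trim the I/O overhead. A sketch only (untested against the judge's time limit), using the same backwards scan and bisect idea:

import sys
from bisect import bisect_left

def doit_fast(path='b.in'):
    data = open(path).read().split()
    n, m = int(data[0]), int(data[1])
    a = data[2:2 + n]
    queries = map(int, data[2 + n:2 + n + m])
    seen = set()
    indices = []                 # 1-based positions where a new value first appears, scanning from the right
    for i in range(n - 1, -1, -1):
        if a[i] not in seen:
            seen.add(a[i])
            indices.append(i + 1)
    indices.reverse()            # ascending, so bisect can be used
    length = len(indices)
    out = [str(length - bisect_left(indices, l)) for l in queries]
    sys.stdout.write('\n'.join(out) + '\n')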
