I am trying to find the positions of every match (N or -) in a large dataset.
Each string is about 3 million letters long and contains around 300,000 matches. I have 110 strings to search in the same file, so I made a loop that uses re.finditer to find and report the position of each match, but it is taking a very long time. Each string (a DNA sequence) is composed of only six characters (ATGCN-). Only 17 strings were processed in 11 hours. What can I do to speed up the process?
The part of the code I am talking about is:
for found in re.finditer(r"[-N]", DNA_sequence):
    position = found.start() + 1
    positions_list.append(position)
    positions_set = set(positions_list)
    all_positions_set = all_positions_set.union(positions_set)
count += 1
print(str(count) + '\t' + record.id + '\t' + 'processed')
output_file.write(record.id + '\t' + str(positions_list) + '\n')
I also tried re.compile, since I had read that it could improve performance, but nothing changed (match = re.compile('[-N]')).
If you have roughly 300k matches, you are re-creating increasingly larger sets that contain exactly the same elements as the list you are already appending to:
for found in re.finditer(r"[-N]", DNA_sequence):
    position = found.start() + 1
    positions_list.append(position)
    positions_set = set(positions_list)  # 300k times ... why? why at all?
You can instead simply reuse the list you already have and add it to all_positions_set after you have found all of the matches:
all_positions_set = all_positions_set.union(positions_list) # union takes any iterable
That should reduce memory usage by more than 50% (sets are more expensive than lists) and also cut the runtime significantly.
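Putting that together, a minimal sketch of what the restructured loop might look like, using the names from the question (set.update, like union, accepts any iterable):
positions_list = [m.start() + 1 for m in re.finditer(r"[-N]", DNA_sequence)]
all_positions_set.update(positions_list)  # one set operation per sequence instead of one per match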
I am unsure what is faster, but you could even skip using regex:
t = "ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-"
pos = []
for idx,c in enumerate(t):
if c in "N-":
pos.append(idx)
print(pos) # [4, 5, 10, 11, 16, 17, 22, 23, 28, 29, 34, 35, 40, 41, 46, 47]
and instead use enumerate() on your string to find the positions; you would need to test whether that is faster.
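The same scan can also be written as a list comprehension, which is often a bit quicker than an explicit append loop; this is a small variation on the snippet above:
pos = [idx for idx, c in enumerate(t) if c in "N-"]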
Regarding not using a regex: I did exactly that, and my modified script now runs in less than 45 seconds using the following helper function:
def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start + 1
        start += len(sub)
So the new part of the code is:
N_list = list(find_all(DNA_sequence, 'N'))
dash_list = list(find_all(DNA_sequence, '-'))
positions_list = N_list + dash_list
all_positions_set = all_positions_set.union(positions_list)
count += 1
print(str(count) + '\t' + record.id + '\t' + 'processed')
output_file.write(record.id + '\t' + str(sorted(positions_list)) + '\n')
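If the sequences really are millions of characters long, a vectorised scan may help further. This is only a sketch and assumes NumPy is available; it is not part of the original solution:
import numpy as np

arr = np.frombuffer(DNA_sequence.encode("ascii"), dtype=np.uint8)
# 1-based positions of every 'N' or '-', matching the convention used above
positions = (np.where((arr == ord("N")) | (arr == ord("-")))[0] + 1).tolist()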
Is there a more Pythonic way to execute this code?
sim_inits = [1, 100, 12, 3520, 1250]
prod_inits = [2, 101, 13, 14, 3521, 1500]
for t in range(len(sim_inits) - 1):
    sim_loop_done = False
    for s in sim_inits[:]:
        if sim_loop_done == True:
            continue
        prod_loop_done = False
        for p in prod_inits[:]:
            if prod_loop_done == True:
                continue
            if abs(s - p) < 3:
                sim_inits.remove(s)
                prod_inits.remove(p)
                sim_loop_done = True
                prod_loop_done = True
print sim_inits
print prod_inits
Output:
[1250]
[14, 1500]
I'm trying to loop over both lists and the moment I find a match (defined by a difference less than 3), I want to move to the next item. I do NOT want 14 removed from prod_inits because the 12 from sim_inits was removed against the 13 in prod_inits.
The above code works, I was just wondering if it can be done more efficiently.
You can skip one of the loops, and you can use break instead of continue to get out of the other one early without using the cumbersome flags you're currently using.
List slicing is pretty expensive, especially in the case of prod_inits, where you're duplicating the entire list just to remove one element from it. It's cheaper to iterate by index and use pop() instead of remove() to remove that index. Similarly, we can use a while loop with an index s over sim_inits (instead of a for loop), because it lets us account for the elements we're removing (we do s -= 1 for that reason).
sim_inits = [1, 100, 12, 3520, 1250]
prod_inits = [2, 101, 13, 14, 3521, 1500]
s = 0
while s < len(sim_inits):
    for p in range(len(prod_inits)):
        if abs(sim_inits[s] - prod_inits[p]) < 3:
            sim_inits.pop(s)
            prod_inits.pop(p)
            s -= 1
            break
    s += 1
print(sim_inits)
print(prod_inits)
After running this code locally:
>>> print(sim_inits)
[1250]
>>> print(prod_inits)
[14, 1500]
In ArcGIS I have intersected a large number of zonal polygons with another set and recorded the original zone IDs and the data they are connected with. However, the strings that are created are one long run of digits, with lengths ranging from 11 to 77 characters (each ID is 11 characters long). I am looking to add a "," between each ID, making it easier to read and export later as a .csv file. To do this I wrote this code:
def StringSplit(StrO, X):
    StrN = StrO            # Recording original string
    StrLen = len(StrN)
    BStr = StrLen / X      # How many segments are inside of one string
    StrC = BStr - 1        # How many times it should loop
    if StrC > 0:
        while StrC > 1:
            StrN = StrN[:((X * StrC) + 1)] + "," + StrN[(X * StrC):]
            StrC = StrC - 1
        while StrC == 1:
            StrN = StrN[:X + 1] + "," + StrN[(X * StrC):]
            StrC = 0
        while StrC == 0:
            return StrN
    else:
        return StrN
The main issue is that it has to step through multiple rows (76 of them) with various lengths (11 to 77 characters). I got the last parts to work, just not the internal loop, which returns an error or incorrect output for strings longer than 22 characters.
Thus right now:
1. 01234567890 returns 01234567890
2. 0123456789001234567890 returns 01234567890,01234567890
3. 012345678900123456789001234567890 returns either: Error or ,, or even ,,01234567890
I know it is probably something pretty simple I am missing, but I can't seem to remember what it is...
This can be done easily with a regex.
Those ........... are 11 dots, which capture one chunk for every 11 characters.
You can use pandas to create a CSV from the resulting list.
Code:
import re
x = re.findall('...........', '01234567890012345678900123456789001234567890')
print(x)
myString = ",".join(x)
print(myString)
output:
['01234567890', '01234567890', '01234567890', '01234567890']
01234567890,01234567890,01234567890,01234567890
For the sake of simplicity, you can also do it in one line.
Code:
x = ",".join(re.findall('...........', '01234567890012345678900123456789001234567890'))
print(x)
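As a small aside, the eleven literal dots can also be written with a quantifier, which matches the same thing and is easier to count:
x = ",".join(re.findall('.{11}', '01234567890012345678900123456789001234567890'))
print(x)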
Don't write the loops yourself; use Python libraries or builtins, it will be easier. For example:
def StringSplit(StrO, X):
    substring_starts = range(0, len(StrO), X)
    substrings = (StrO[start:start + X] for start in substring_starts)
    return ','.join(substrings)
string = '1234567890ABCDE'
print(StringSplit(string, 5))
# '12345,67890,ABCDE'
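Since these ID strings contain no whitespace, the standard library's textwrap module happens to do the same fixed-width chunking (a sketch relying on its default break_long_words=True behaviour):
import textwrap

print(','.join(textwrap.wrap('1234567890ABCDE', 5)))
# 12345,67890,ABCDE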
I am trying to isolate the distinct items in a list (e.g. [0, 1, 1] should return [0, 1]). I managed to get this working, but I noticed something strange.
When I appended to a list, it ran about 7 times slower than when I concatenated strings and then split the result.
This is my code:
import time
start = time.time()
first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]
values = first + second
distinct_string = ""
for i in values:
    if not str(i) in distinct_string:
        distinct_string += str(i) + " "
print(distinct_string.split())
print(" --- %s sec --- " % (start - time.time()))
This version finishes in about 5 seconds... Now for the list version:
import time
start = time.time()
first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]
values = first + second
distinct_list = []
for i in values:
    if not i in distinct_list:
        distinct_list.append(i)
print(distinct_list)
print(" --- %s sec --- " % (start - time.time()))
Runs at around 40 seconds.
What makes string faster even though I am converting a lot of values to strings?
Note that it's generally better to use timeit to compare functions, which runs the same thing multiple times to get average performance, and to factor out repeated code to focus on the performance that matters. Here's my test script:
first = [x for x in range(999) if x % 2 == 0]
second = [x for x in range(999) if x % 4 == 0]
values = first + second

def str_method(values):
    distinct_string = ""
    for i in values:
        if not str(i) in distinct_string:
            distinct_string += str(i) + " "
    return [int(s) for s in distinct_string.split()]

def list_method(values):
    distinct_list = []
    for i in values:
        if not i in distinct_list:
            distinct_list.append(i)
    return distinct_list

def set_method(values):
    seen = set()
    return [val for val in values if val not in seen and seen.add(val) is None]

if __name__ == '__main__':
    assert str_method(values) == list_method(values) == set_method(values)
    import timeit
    funcs = [func.__name__ for func in (str_method, list_method, set_method)]
    setup = 'from __main__ import {}, values'.format(', '.join(funcs))
    for func in funcs:
        print(func)
        print(timeit.timeit(
            '{}(values)'.format(func),
            setup=setup,
            number=1000
        ))
I've added int conversion to make sure that the functions return the same thing, and get the following results:
str_method
1.1685157899992191
list_method
2.6124089090008056
set_method
0.09523714500392089
Note that it is not true that searching in a list is faster than searching in a string if you have to convert the input:
>>> timeit.timeit('1 in l', setup='l = [9, 8, 7, 6, 5, 4, 3, 2, 1]')
0.15300405000016326
>>> timeit.timeit('str(1) in s', setup='s = "9 8 7 6 5 4 3 2 1"')
0.23205067300295923
Repeated appending to a list is not very efficient, as it means frequent resizing of the underlying object - the list comprehension, as shown in the set version, is more efficient.
Searching in strings:
if not str(i) in distinct_string:
is much faster than searching in lists:
if not i in distinct_list:
Here are line_profiler results for the string search in the OP's code:
Line # Hits Time Per Hit % Time Line Contents
17 75000 80366013 1071.5 92.7 if not str(i) in distinct_string:
18 50000 2473212 49.5 2.9 distinct_string += str(i) + " "
and here are the results for the list search in the OP's code:
39 75000 769795432 10263.9 99.1 if not i in distinct_list:
40 50000 2813804 56.3 0.4 distinct_list.append(i)
I think there is a flaw in the logic that makes the string method only seemingly much faster.
When matching substrings in a long string, the in operator returns True as soon as it finds any substring containing the search item. To prove this, I let the loop run backwards from the highest values down to the smallest, and it returned only 50% of the values of the original loop (I checked only the length of the result). If the matching were exact, there should be no difference whether you check the sequence from the start or from the end. I conclude that the string method short-cuts a lot of comparisons by matching near the start of the long string. The particular choice of duplicates unfortunately masks this.
In a second test, I let the string method search for " " + str(i) + " " to eliminate substring matches. Now it runs only about 2x faster than the list method (but still faster).
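To make the substring effect concrete, here is a small illustration of my own (not taken from the timing code above):
distinct_string = "0 112 36 "
print("12" in distinct_string)  # True: '12' is found inside '112', so 12 would wrongly be treated as already seen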
@jonrsharpe: Regarding the set_method, I cannot see why one would touch all the elements one by one rather than doing it in a single set call like this:
def set_method(values):
    return list(set(values))
This produces exactly the same output and runs about 2.5x faster on my PC.
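One caveat worth adding: list(set(values)) does not guarantee the original order of first occurrences. If order matters, a dict-based variant is a simple alternative (insertion order is guaranteed from Python 3.7 on); this is my addition, not part of the original comparison:
def dict_method(values):
    # dict.fromkeys keeps the first occurrence of each value, in order
    return list(dict.fromkeys(values))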
I have a target file called TARGFILE of the form:
10001000020002002001100100200000111
10201001020000120210101100110010011
02010010200000011100012021001012021
00102000012001202100101202100111010
My idea here was to leave each line as a string and use slicing in Python to remove the indices.
The removal will occur based on a list of integers called INDICES like so:
[1, 115654, 115655, 115656, 2, 4, 134765, 134766, 18, 20, 21, 23, 24, 17659, 92573, 30, 32, 88932, 33, 35, 37, 110463, 38, 18282, 46, 18458, 48, 51, 54]
I want to remove every position in every line of TARGFILE that matches an entry in INDICES. For instance, the first entry in INDICES is 1, so the first column of TARGFILE, containing 1,1,0,0, would be removed. However, I am wary of doing this incorrectly due to off-by-one errors and shifting index positions if everything is not removed at the same time.
Thus, a solution that removed every column from each row at the same time would likely be both much faster and safer than using a nested loop, but I am unsure of how to code this.
My code so far is here:
#!/usr/bin/env python
import fileinput

SRC_FILES = open('YCP.txt', 'r')
for line in SRC_FILES:
    EUR_YRI_ADM = line.strip('\n')
    EUR, YRI, ADM = EUR_YRI_ADM.split(' ')
    ADMFO = open(ADM, 'r')
    lines = ADMFO.readlines()
    INDICES = [int(val) for val in lines[0].split()]
    TARGFILE = open(EUR, 'r')
It seems to me that a solution using enumerate might be possible, but I have not found it, and that might be suboptimal in the first place...
EDIT: in response to concerns about memory: the longest lines are ~180,000 items, but I should be able to fit this into memory without a problem; I have access to a cluster.
I like the simplicity of Peter's answer, even though it's currently off-by-one. My thought is that you can get rid of the index-shifting problem by sorting INDICES and working from the back to the front. That led to remove_indices1, which is really inefficient. I think 2 is better, but the simplest is 3, which is Peter's answer.
I may do timing in a bit for some large numbers, but my intuition says that my remove_indices2 will be faster than Peter's remove_indices3 if INDICES is very sparse. (Because you don't have to iterate over each character, but only over the indices that are being deleted.)
BTW - If you can sort INDICES once, then you don't need to make the local copy to sort/reverse, but I didn't know if you could do that.
rows = [
    '0000000001111111111222222222233333333334444444444555555555566666666667',
    '1234567890123456789012345678901234567890123456789012345678901234567890',
]

def remove_nth_character(row, n):
    return row[:n-1] + row[n:]

def remove_indices1(row, indices):
    local_indices = indices[:]
    retval = row
    local_indices.sort()
    local_indices.reverse()
    for i in local_indices:
        retval = remove_nth_character(retval, i)
    return retval

def remove_indices2(row, indices):
    local_indices = indices[:]
    local_indices.sort()
    local_indices.reverse()
    front = row
    chunks = []
    for i in local_indices:
        chunks.insert(0, front[i:])
        front = front[:i-1]
    chunks.insert(0, front)
    return "".join(chunks)

def remove_indices3(row, indices):
    return ''.join(c for i, c in enumerate(row) if i+1 not in indices)

indices = [1, 11, 4, 54, 33, 20, 7]

for row in rows:
    print remove_indices1(row, indices)
print ""

for row in rows:
    print remove_indices2(row, indices)
print ""

for row in rows:
    print remove_indices3(row, indices)
EDIT: Adding timing info, plus a new winner!
As I suspected, my algorithm (remove_indices2) wins when there aren't many indices to remove. It turns out that the enumerate-based one, though, gets worse even faster as there are more indices to remove. Here's the timing code (bigrows rows have 210000 characters):
import time

bigrows = []
for row in rows:
    bigrows.append(row * 30000)

for indices_len in [10, 100, 1000, 10000, 100000]:
    print "indices len: %s" % indices_len
    indices = range(indices_len)
    # for func in [remove_indices1, remove_indices2, remove_indices3, remove_indices4]:
    for func in [remove_indices2, remove_indices4]:
        start = time.time()
        for row in bigrows:
            func(row, indices)
        print "%s: %s" % (func.__name__, (time.time() - start))
And here are the results:
indices len: 10
remove_indices1: 0.0187089443207
remove_indices2: 0.00184297561646
remove_indices3: 1.40601491928
remove_indices4: 0.692481040955
indices len: 100
remove_indices1: 0.0974130630493
remove_indices2: 0.00125503540039
remove_indices3: 7.92742991447
remove_indices4: 0.679095029831
indices len: 1000
remove_indices1: 0.841033935547
remove_indices2: 0.00370812416077
remove_indices3: 73.0718669891
remove_indices4: 0.680690050125
So, why does 3 do so much worse? Well, it turns out that the in operator isn't efficient on a list: it has to iterate through all of the list items to check. remove_indices4 is just 3, but it converts indices to a set first, so the inner loop can do a fast hash lookup instead of iterating through the list:
def remove_indices4(row, indices):
    indices_set = set(indices)
    return ''.join(c for i, c in enumerate(row) if i+1 not in indices_set)
And, as I originally expected, this does better than my algorithm for high densities:
indices len: 10
remove_indices2: 0.00230097770691
remove_indices4: 0.686790943146
indices len: 100
remove_indices2: 0.00113391876221
remove_indices4: 0.665997982025
indices len: 1000
remove_indices2: 0.00296902656555
remove_indices4: 0.700706005096
indices len: 10000
remove_indices2: 0.074893951416
remove_indices4: 0.679219007492
indices len: 100000
remove_indices2: 6.65899395943
remove_indices4: 0.701599836349
If you've got fewer than 10000 indices to remove, 2 is fastest (even faster if you do the indices sort/reverse once outside the function). But, if you want something that is pretty stable in time, no matter how many indices, use 4.
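For completeness, a rough sketch of hoisting that one-time preprocessing out of the per-row loop (assuming the functions above are adjusted to take the precomputed structures):
sorted_desc = sorted(indices, reverse=True)  # reuse across rows with the remove_indices2 approach
indices_set = set(indices)                   # reuse across rows with the remove_indices4 approach
for row in bigrows:
    trimmed = ''.join(c for i, c in enumerate(row) if i + 1 not in indices_set)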
The simplest way I can see would be something like:
>>> for line in TARGFILE:
... print ''.join(c for i,c in enumerate(line) if (i+1) not in INDICES)
...
100000200020020100200001
100010200001202010110001
010102000000111021001021
000000120012021012100110
(Substitute writing to your output file for the print, etc.)
This relies on being able to load each line into memory which may or may not be reasonable given your data.
Edit: explanation:
The first line is straightforward:
>>> for line in TARGFILE:
Just iterates through each line in TARGFILE. The second line is a bit more complex:
''.join(...) concatenates a list of strings together with an empty joiner (''). join is often used with a comma like: ','.join(['a', 'b', 'c']) == 'a,b,c', but here we just want to join each item to the next.
enumerate(...) takes an iterable and yields pairs of (index, item) for each item in the iterable. For example, enumerate('abc') yields (0, 'a'), (1, 'b'), (2, 'c').
So the line says,
Join together each character of line whose index is not found in INDICES.
However, as John pointed out, Python indexes are zero-based, so we add 1 to the value from enumerate.
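As a side note (my addition), enumerate accepts a start argument, so the +1 can be folded into the call:
''.join(c for i, c in enumerate(line, start=1) if i not in INDICES)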
The script I ended up using is the following:
#!/usr/bin/env python

def remove_indices(row, indices):
    indices_set = set(indices)
    return ''.join(c for i, c in enumerate(row) if (i+1) in indices_set)

SRC_FILES = open('YCP2.txt', 'r')
CEUDIR = '/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/CEU/PARSED/'
YRIDIR = '/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/YRI/PARSED/'
i = 0
for line in SRC_FILES:
    i += 1
    EUR_YRI_ADM = line.strip('\n')
    EUR, YRI, ADM = EUR_YRI_ADM.split('\t')
    ADMFO = open(ADM, 'r')
    lines = ADMFO.readlines()
    INDICES = [int(val) for val in lines[0].split()]
    INDEXSORT = sorted(INDICES, key=int)
    EURF = open(EUR, 'r')
    EURFOUT = open(CEUDIR + 'chr' + str(i) + 'anc.hap.txt', 'a')
    for haplotype in EURF:
        TRIMLINE = remove_indices(haplotype, INDEXSORT)
        EURFOUT.write(TRIMLINE + '\n')
    EURFOUT.close()
    AFRF = open(YRI, 'r')
    AFRFOUT = open(YRIDIR + 'chr' + str(i) + 'anc.hap.txt', 'a')
    for haplotype2 in AFRF:
        TRIMLINE = remove_indices(haplotype2, INDEXSORT)
        AFRFOUT.write(TRIMLINE + '\n')
    AFRFOUT.close()
I need help incorporating a find_duplicates function into a program. It needs to traverse the list, checking every element to see if it matches the target, keep track of how many matches are found, and print out a one-sentence summary.
Below is the code that I have written thus far:
myList = [69, 1, 99, 82, 17]

# Selection Sort Function
def selection_sort_rev(aList):
    totalcomparisions = 0
    totalexchanges = 0
    p = 0
    print("Original List:", aList)
    print()
    n = len(aList)
    for end in range(n, 1, -1):
        comparisions = 0
        exchanges = 0
        p = p + 1
        # Determine Largest Value
        max_position = 0
        for i in range(1, end):
            comparisions = comparisions + 1
            if aList[i] < aList[max_position]:
                max_position = i
        # Passes and Exchanges
        exchanges = exchanges + 1
        temp = aList[end - 1]
        aList[end - 1] = aList[max_position]
        aList[max_position] = temp
        print("Pass", p, ":", "Comparisons:", comparisions, "\tExchanges:", exchanges)
        print("\t", aList, "\n")
        totalcomparisions = totalcomparisions + comparisions
        totalexchanges = totalexchanges + exchanges
    print("\tTotal Comparisons:", totalcomparisions, "\tTotal Exchanges:", totalexchanges)
    return
The find_duplicates function has to have:
Input parameters: List, target (value to find)
Search for the target in the list and count how many times it occurs
Return value: None
I'm sure there are more efficient ways of doing this, but I am a beginner at programming and I would like to know how to do this in the simplest way possible. PLEASE HELP!!!
import collections

def find_duplicates(L, target):
    i_got_this_from_stackoverflow = collections.Counter(L)
    if target not in i_got_this_from_stackoverflow:
        print "target not found"
    else:
        print i_got_this_from_stackoverflow[target], "occurrences of the target were found"
If you are looking to improve your programming skills this is a great discussion. If you just want the answer to your problem, use the list.count() method, which was created to answer precisely that question:
>>> myList = [69, 1 , 99, 82, 17, 1, 82]
>>> myList.count(82)
2
Also, your test code does not contain any duplicates in the list. This is a necessary test case, but you also need to try your code on at least one list with duplicates to check that it works when they are present.
When you are learning and starting to achieve mastery you can save yourself an awful lot of time by learning how the various components work. If this is an assignment, however, keep on truckin'!
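For reference, a minimal loop-based version that matches the assignment's stated requirements (traverse the list, count the matches, print a sentence summary, return None); the exact message wording is my own:
def find_duplicates(aList, target):
    count = 0
    for item in aList:          # traverse every element
        if item == target:      # compare against the target value
            count += 1
    print("The target", target, "was found", count, "time(s) in the list.")
    # returns None implicitly, as required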