Hi, I am running this Python code to reduce repeated multi-line patterns to a single occurrence; however, I am doing this on extremely large files of 200,000+ lines.
Here is my current code:
import sys
import re
with open('largefile.txt', 'r+') as file:
string = file.read()
string = re.sub(r"((?:^.*\n)+)(?=\1)", "", string, flags=re.MULTILINE)
file.seek(0)
file.write(string)
file.truncate()
The problem is that the re.sub() is taking ages (10+ minutes) on my large files. Is it possible to speed this up in any way?
Example input file:
hello
mister
hello
mister
goomba
bananas
goomba
bananas
chocolate
hello
mister
Example output:
hello
mister
goomba
bananas
chocolate
hello
mister
These patterns can be bigger than 2 lines as well.
Regexps are compact here, but will never be speedy. For one reason, you have an inherently line-based problem, but regexps are inherently character-based. The regexp engine has to deduce, over & over & over again, where "lines" are by searching for newline characters, one at a time. For a more fundamental reason, everything here is brute-force character-at-a-time search, remembering nothing from one phase to the next.
So here's an alternative. Split the giant string into a list of lines, just once at the start. Then that work never needs to be done again. And then build a dict, mapping a line to a list of the indices at which that line appears. That takes linear time. Then, given a line, we don't have to search for it at all: the list of indices tells us at once every place it appears.
Worst-case time can still be poor, but I expect it will be at least a hundred times faster on "typical" inputs.
def dedup(s):
from collections import defaultdict
lines = s.splitlines(keepends=True)
line2ix = defaultdict(list)
for i, line in enumerate(lines):
line2ix[line].append(i)
out = []
n = len(lines)
i = 0
while i < n:
line = lines[i]
# Look for longest adjacent match between i:j and j:j+(j-i).
# j must be > i, and j+(j-i) <= n so that j <= (n+i)/2.
maxj = (n + i) // 2
searching = True
for j in reversed(line2ix[line]):
if j > maxj:
continue
if j <= i:
break
# Lines at i and j match.
if all(lines[i + k] == lines[j + k]
for k in range(1, j - i)):
searching = False
break
if searching:
out.append(line)
i += 1
else: # skip the repeated block at i:j
i = j
return "".join(out)
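A quick sanity check on the question's sample input (my own check, assuming every line ends with a newline, since splitlines(keepends=True) relies on the line terminators):
sample = (
    "hello\nmister\n"
    "hello\nmister\n"
    "goomba\nbananas\n"
    "goomba\nbananas\n"
    "chocolate\n"
    "hello\nmister\n"
)
print(dedup(sample), end="")
# hello
# mister
# goomba
# bananas
# chocolate
# hello
# mister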
EDIT
This incorporates Kelly's idea of incrementally updating line2ix using a deque so that the candidates looked at are always in range(i+1, maxj+1). Then the innermost loop doesn't need to check for those conditions.
It's a mixed bag, losing a little when there are very few duplicates, because in such cases the line2ix sequences are very short (or even singletons for unique lines).
Here's timing for a case where it really pays off: a file containing about 30,000 lines of Python code. Many lines are unique, but a few kinds of lines are very common (for example, the empty "\n" line). Cutting the work in the innermost loop can pay for those common lines. dedup_nuts was picked for the name because this level of micro-optimization is, well, nuts ;-)
71.67997950001154 dedup_original
48.948923900024965 dedup_blhsing
2.204853900009766 dedup_Tim
9.623824400012381 dedup_Kelly
1.0341253000078723 dedup_blhsingTimKelly
0.8434303000103682 dedup_nuts
And the code:
def dedup_nuts(s):
from array import array
from collections import deque
encode = {}
decode = []
lines = array('L')
for line in s.splitlines(keepends=True):
if (code := encode.get(line)) is None:
code = encode[line] = len(encode)
decode.append(line)
lines.append(code)
del encode
line2ix = [deque() for line in lines]
view = memoryview(lines)
out = []
n = len(lines)
i = 0
last_maxj = -1
while i < n:
maxj = (n + i) // 2
for j in range(last_maxj + 1, maxj + 1):
line2ix[lines[j]].appendleft(j)
last_maxj = maxj
line = lines[i]
js = line2ix[line]
assert js[-1] == i, (i, n, js)
js.pop()
for j in js:
#assert i < j <= maxj
if view[i : j] == view[j : j + j - i]:
for k in range(i + 1, j):
js = line2ix[lines[k]]
assert js[-1] == k, (i, k, js)
js.pop()
i = j
break
else:
out.append(line)
i += 1
#assert all(not d for d in line2ix)
return "".join(map(decode.__getitem__, out))
Some key invariants are checked by asserts there, but the expensive ones are commented out for speed. Season to taste.
@TimPeters' line-based comparison approach is good but wastes time in repeated comparisons of the same lines. @KellyBundy's encoding idea is good but wastes time in the overhead of a regex engine and text encoding.
A more efficient approach would be to adopt @KellyBundy's encoding idea in @TimPeters' algorithm, but instead of encoding lines into characters, encode them into an array.array of 32-bit integers to avoid the overhead of text encoding, and then use a memoryview of the array for quick slice-based comparisons:
from array import array
def dedup_blhsingTimKelly2(s):
encode = {}
decode = []
lines = s.splitlines(keepends=True)
n = len(lines)
for line in lines:
if line not in encode:
encode[line] = len(decode)
decode.append(line)
lines = array('L', map(encode.get, lines))
del encode
line2ix = [[] for _ in range(n)]
for i, line in enumerate(lines):
line2ix[line].append(i)
view = memoryview(lines)
out = []
i = 0
while i < n:
line = lines[i]
maxj = (n + i) // 2
searching = True
for j in reversed(line2ix[line]):
if j > maxj:
continue
if j <= i:
break
if view[i: j] == view[j: j + j - i]:
searching = False
break
if searching:
out.append(decode[line])
i += 1
else:
i = j
return "".join(out)
A run of @KellyBundy's benchmark code with this approach added, originally named dedup_blhsingTimKelly, now amended with Tim and Kelly's comments and named dedup_blhsingTimKelly2:
2.6650364249944687 dedup_original
1.3109814710041974 dedup_blhsing
0.5598453340062406 dedup_Tim
0.9783012029947713 dedup_Kelly
0.24442325498966966 dedup_blhsingTimKelly
0.21991234300367068 dedup_blhsingTimKelly2
Try it online!
Nesting a quantifier within a quantifier is expensive and in this case unnecessary.
You can use the following regex without nesting instead:
string = re.sub(r"(^.*\n)(?=\1)", "", string, flags=re.M | re.S)
In the following test it more than cuts the time in half compared to your approach:
https://replit.com/#blhsing/HugeTrivialExperiment
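If it helps to see it in action, here is roughly what that looks like on the sample input from the question (my own quick check):
import re
s = (
    "hello\nmister\n"
    "hello\nmister\n"
    "goomba\nbananas\n"
    "goomba\nbananas\n"
    "chocolate\n"
    "hello\nmister\n"
)
# With re.S the dot also matches newlines, so the single group can still
# greedily capture a multi-line block that is immediately repeated.
print(re.sub(r"(^.*\n)(?=\1)", "", s, flags=re.M | re.S), end="")
# hello
# mister
# goomba
# bananas
# chocolate
# hello
# mister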
Another idea: You're talking about "200,000+ lines", so we can encode each unique line as one of the 1,114,112 possible characters and simplify the regex to r"(.+)(?=\1)". And after the deduplication, decode the characters back to lines.
def dedup(s):
encode = {}
decode = {}
lines = s.split('\n')
for line in lines:
if line not in encode:
c = chr(len(encode))
encode[line] = c
decode[c] = line
s = ''.join(map(encode.get, lines))
s = re.sub(r"(.+)(?=\1)", "", s, flags=re.S)
return '\n'.join(map(decode.get, s))
A little benchmark based on blhsing's but with some repeating lines (times in seconds):
2.5934535119995417 dedup_original
1.2498892020012136 dedup_blhsing
0.5043159520009795 dedup_Tim
0.9235864399997809 dedup_Kelly
I built a pool of 50 lines of 10 random letters, then joined 5000 random lines from that pool.
The two fastest with 10,000 lines instead:
2.0905018440007552 dedup_Tim
3.220036650000111 dedup_Kelly
Code (Try it online!):
import re
import random
import string
from timeit import timeit
strings = [''.join((*random.choices(string.ascii_letters, k=10), '\n')) for _ in range(50)]
s = ''.join(random.choices(strings, k=5000))
def dedup_original(s):
return re.sub(r"((?:^.*\n)+)(?=\1)", "", s, flags=re.MULTILINE)
def dedup_blhsing(s):
return re.sub(r"(^.*\n)(?=\1)", "", s, flags=re.M | re.S)
def dedup_Tim(s):
from collections import defaultdict
lines = s.splitlines(keepends=True)
line2ix = defaultdict(list)
for i, line in enumerate(lines):
line2ix[line].append(i)
out = []
n = len(lines)
i = 0
while i < n:
line = lines[i]
# Look for longest adjacent match between i:j and j:j+(j-i).
# j must be > i, and j+(j-i) <= n so that j <= (n+i)/2.
maxj = (n + i) // 2
searching = True
for j in reversed(line2ix[line]):
if j > maxj:
continue
if j <= i:
break
# Lines at i and j match.
if all(lines[i + k] == lines[j + k]
for k in range(1, j - i)):
searching = False
break
if searching:
out.append(line)
i += 1
else: # skip the repeated block at i:j
i = j
return "".join(out)
def dedup_Kelly(s):
encode = {}
decode = {}
lines = s.split('\n')
for line in lines:
if line not in encode:
c = chr(len(encode))
encode[line] = c
decode[c] = line
s = ''.join(map(encode.get, lines))
s = re.sub(r"(.+)(?=\1)", "", s, flags=re.S)
return '\n'.join(map(decode.get, s))
funcs = dedup_original, dedup_blhsing, dedup_Tim, dedup_Kelly
expect = funcs[0](s)
for f in funcs[1:]:
print(f(s) == expect)
for _ in range(3):
for f in funcs:
t = timeit(lambda: f(s), number=1)
print(t, f.__name__)
print()
I have a dataset like below:
data="""vJrwpWtwJgWrhcsFMMfFFhFp
jqHRNqRjqzjGDLGLrsFMfFZSrLrFZsSL
PmmdzqPrVvPwwTWBwg
wMqvLMZHhHMvwLHjbvcjnnSBnvTQFn
ttgJtRGJQctTZtZT
CrZsJsPPZsGzwwsLwLmpwMDw"""
These are separate lines. Now, I want to group the data into sets of 3 rows and find the character common to the lines in each group. For example, r is the common character in the first group and Z is the common character in the second group. So, below is my code:
lines = []
for i in range(len(data.splitlines())):
lines.append(data[i])
for j in lines:
new_line = [k for k in j[i] if k in j[i + 1]]
print(new_line)
It gives me a string index out-of-range error.
new_line = [k for k in j[i] if k in j[i + 1]]
IndexError: string index out of range
For the record: this was the Advent of Code 2022 Day 3 Part 2 challenge. I kept my data in a file called input.txt and just read line by line, but this solution can be applied to a string too.
I converted every line into a set and used the & intersection operator. From there, I converted the result to a list and removed the newline character. s[0] is therefore the single common character. Like this:
with open('input.txt') as f:
lines = f.readlines()
for i in range(0, len(lines), 3):
s = list(set(lines[i]) & set(lines[i + 1]) & set(lines[i + 2]))
s.remove('\n')
print(s[0])
Here's an example using your data string. In this case, I'd split by the new line character and no longer need to remove it from the list. I'd also extract the element from the set without converting to a list:
data = """vJrwpWtwJgWrhcsFMMfFFhFp
jqHRNqRjqzjGDLGLrsFMfFZSrLrFZsSL
PmmdzqPrVvPwwTWBwg
wMqvLMZHhHMvwLHjbvcjnnSBnvTQFn
ttgJtRGJQctTZtZT
CrZsJsPPZsGzwwsLwLmpwMDw"""
lines = data.split('\n')
for i in range(0, len(lines), 3):
(ch,) = set(lines[i]) & set(lines[i + 1]) & set(lines[i + 2])
print(ch)
If I understand your question correctly:
Just solved it this morning coincidentally. ;-)
# ordering = ascii_lowercase + ascii_uppercase
# with open('day03.in') as fin:
# data = fin.read().strip()
# b = 0
lines = data.split('\n') # assuming some data read-in already
# go through chunks of 3 lines:
for i in range(0, len(lines), 3):
chunk = lines[i: i+3]
print(chunk)
#for i, c in enumerate(ordering):
# if all(c in ll for ll in chunk):
#b += ordering.index(c) + 1 # answer.
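If it helps, here is a rough sketch (my own completion, not part of the original snippet) of how that chunking can be finished off to print the common character of each group of 3, using set intersection:
data = """vJrwpWtwJgWrhcsFMMfFFhFp
jqHRNqRjqzjGDLGLrsFMfFZSrLrFZsSL
PmmdzqPrVvPwwTWBwg
wMqvLMZHhHMvwLHjbvcjnnSBnvTQFn
ttgJtRGJQctTZtZT
CrZsJsPPZsGzwwsLwLmpwMDw"""
lines = data.split('\n')
for i in range(0, len(lines), 3):
    chunk = lines[i: i+3]
    # the single character present in all three lines of the chunk
    (common,) = set.intersection(*map(set, chunk))
    print(common)
# prints: r, then Z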
I need to insert a string (character by character) into another string at every 3rd position
For example: string_1: wwwaabkccgkll
string_2: toadhp
Now I need to insert string2 char by char into string1 at every third position
So the output must be wwtaaobkaccdgkhllp
I need this in Python, but even Java is OK.
So I tried this:
Test_str="hiimdumbiknow"
challenge="toadh"
new_st=challenge [k]
Last=list(test_str)
K=0
For i in range(Len(test_str)):
if(i%3==0):
last.insert(i,new_st)
K+=1
and the output I get is:
thitimtdutmbtiknow
You can split test_str into sub-strings of length 2, and then iterate over them, merging them with challenge:
def concat3(test_str, challenge):
chunks = [test_str[i:i+2] for i in range(0,len(test_str),2)]
result = []
i = j = 0
while i<len(chunks) or j<len(challenge):
if i<len(chunks):
result.append(chunks[i])
i += 1
if j<len(challenge):
result.append(challenge[j])
j += 1
return ''.join(result)
test_str = "hiimdumbiknow"
challenge = "toadh"
print(concat3(test_str, challenge))
# hitimoduambdikhnow
This method works even if the lengths of test_str and challenge are mismatching. (The remaining characters in the longest string will be appended at the end.)
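A couple of quick illustrative calls (my own examples, not from the original answer) showing that mismatched-length behaviour:
print(concat3("abcd", "wxyz"))    # abwcdxyz  (leftover "yz" appended)
print(concat3("abcdefgh", "xy"))  # abxcdyefgh  (leftover chunks appended)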
You can split Test_str into groups of two letters and then re-join with each letter from challenge in between, as follows:
import itertools
print(''.join(f'{two}{letter}' for two, letter in itertools.zip_longest([Test_str[i:i+2] for i in range(0,len(Test_str),2)], challenge, fillvalue='')))
Output:
hitimoduambdikhnow
*edited to split into groups of two rather than three as originally posted
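Spelled out a little more readably (same logic as the one-liner above, just unrolled; my own rewrite):
import itertools

Test_str = "hiimdumbiknow"
challenge = "toadh"

pairs = [Test_str[i:i+2] for i in range(0, len(Test_str), 2)]
merged = itertools.zip_longest(pairs, challenge, fillvalue='')
print(''.join(f'{two}{letter}' for two, letter in merged))
# hitimoduambdikhnow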
You can try this: make an iterator over the second string, iterate over the first one, and after every second character of the first string pull the next character from the iterator:
def add3(s1, s2):
    def n():
        k = iter(s2)
        for i, j in enumerate(s1, 1):
            yield j
            if i % 2 == 0:
                # every third position of the output comes from s2
                c = next(k, None)
                if c is not None:
                    yield c
        # append whatever is left of s2
        yield from k
    return ''.join(n())
def insertstring(test_str,challenge):
result = ''
x = [x for x in test_str]
y = [y for y in challenge]
j = 0
for i in range(len(x)):
if i % 2 != 0 or i == 0:
result += x[i]
else:
if j < 5:
result += y[j]
result += x[i]
j += 1
get_last_element = x[-1]
return result + get_last_element
print(insertstring(test_str,challenge))
#output: hitimoduambdikhnow
EDIT:
Thanks for fixing it! Unfortunately, it messed up the logic. I'll explain what this program does. It's a solution to a task about a playing-card trick. There are N cards on the table. First and Second are the numbers on the front and back of each card. The trick can only be done if the visible numbers are in non-decreasing order. Someone from the audience can come and swap the positions of two cards. M represents how many swaps will be made, and A and B represent which cards will be swapped. The magician can flip any number of cards to see the other side. After each swap, the program must tell whether the magician can still do the trick.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
n = data.readline()
n = int(n)
for _ in range(n):
first, second = (int(x) for x in data.readline().split(':'))
first, second = sorted((first, second))
pairs.append(Pair(first, second)) # add to the list by appending
m = data.readline()
m = int(m)
for _ in range(m):
a, b = (int(x) for x in data.readline().split('-'))
a -= 1
b -= 1
temp = pairs[a]
pairs[a] = pairs[b]
pairs[b] = temp
p = -1e-9
ok = True
for k in range(0, n):
if pairs[k].first >= p:
p = pairs[k].first
elif pairs[k].second >= p:
p = pairs[k].second
else:
ok = False
break
if ok:
results.write("YES\n")
else:
results.write("NO\n")
data:
4
2:5
3:4
6:3
2:7
2
3-4
1-3
results:
YES
YES
YES
YES
YES
YES
YES
What should be in results:
NO
YES
The code is full of bugs: you should write and test it incrementally instead of all at once. It seems that you started using readlines (which is a good way of managing this kind of work) but you kept the rest of the code in a read-one-line-at-a-time style. If you used readlines, the line for i, line in enumerate(data): should be changed to for i, line in enumerate(lines):.
Anyway, here is a corrected version with some explanation. I hope I did not mess with the logic.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
# The following line created a huge list of "Pairs" types, not instances
# pairs = [Pair] * (2*200*1000+1)
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
n = data.readline()
n = int(n)
# removing the reading of all data...
# lines = data.readlines()
# m = lines[n]
# removed bad for: for i, line in enumerate(data):
for _ in range(n): # you don't need the index
first, second = (int(x) for x in data.readline().split(':'))
# removed unnecessary recasting to int
# first = int(first)
# second = int(second)
# changed the swapping to a more elegant way
first, second = sorted((first, second))
pairs.append(Pair(first, second)) # we add to the list by appending
# removed unnecessary for: once you read all the first and seconds,
# you reached M
m = data.readline()
m = int(m)
# you don't need the index... indeed you don't need to count (you can read
# to the end of file, unless it is malformed)
for _ in range(m):
a, b = (int(x) for x in data.readline().split('-'))
# removed unnecessary recasting to int
# a = int(a)
# b = int(b)
a -= 1
b -= 1
temp = pairs[a]
pairs[a] = pairs[b]
pairs[b] = temp
p = -1e-9
ok = True
for k in range(0, n):
if pairs[k].first >= p:
p = pairs[k].first
elif pairs[k].second >= p:
p = pairs[k].second
else:
ok = False
break
if ok:
results.write("YES\n")
else:
results.write("NO\n")
Original response (before the edit)
range(1, 1) is empty, so this part of the code:
for i in range (1, 1):
n = data.readline()
n = int(n)
does not define n, so when execution gets to line 12 you get an error.
You can remove the for statement, changing those three lines to:
n = data.readline()
n = int(n)
I'm not exactly the kind of guy you'd call "good" at coding. In this particular scenario, on line 13, I'm trying to pop the first word in the list until I'm done, but it keeps giving me a "'str' object cannot be interpreted as an integer" error.
What am I doing wrong here?
n = n.split(" ")
N = n[0]
K = n[1]
f1 = input()
f1 = f1.split(" ")
f1 = list(f1)
current = 0
for x in f1:
while current <= 7:
print(x)
f1 = list(f1.pop()[0])
current = current + len(x)
if current > 7:
print("\n")
current = 0
According to your comments, this program will split the text into lines containing at most K characters:
K = 7
s = "hello my name is Bessie and this is my essay"
out, cnt = [], 0
for word in s.split():
l = len(word)
if cnt + l <= K:
cnt += l
if not out:
out.append([word])
else:
out[-1].append(word)
else:
cnt = l
out.append([word])
print("\n".join(" ".join(line) for line in out))
Prints:
hello my
name is
Bessie
and this
is my
essay
You could try splitting the string on the index and inserting a newline there. Each time you do this your string gets one character longer, so we can use enumerate (which starts counting at zero) to offset our slice indexes by the number of newlines already inserted.
s = 'Thanks for helping me'
new_line_index = [7,11, 19]
for i, x in enumerate(new_line_index):
s = s[:x+i] + '\n' + s[x+i:]
print(s)
Output
Thanks
for
helping
me
This is a pretty straightforward attempt. I haven't been using Python for too long. It seems to work, but I am sure I have much to learn, so someone let me know if I am way off here. It needs to find patterns, write the first line that matches, add a summary message for the remaining consecutive lines that match the pattern, and return the modified string.
Just to be clear...regex .*Dog.* would take
Cat
Dog
My Dog
Her Dog
Mouse
and return
Cat
Dog
::::: Pattern .*Dog.* repeats 2 more times.
Mouse
#!/usr/bin/env python
#
import re
import types
def remove_repeats (l_string, l_regex):
"""Take a string, remove similar lines and replace with a summary message.
l_regex accepts strings and tuples.
"""
# Convert string to tuple.
if type(l_regex) == types.StringType:
l_regex = l_regex,
for t in l_regex:
r = ''
p = ''
for l in l_string.splitlines(True):
if l.startswith('::::: Pattern'):
r = r + l
else:
if re.search(t, l): # If line matches regex.
m += 1
if m == 1: # If this is first match in a set of lines add line to file.
r = r + l
elif m > 1: # Else update the message string.
p = "::::: Pattern '" + t + "' repeats " + str(m-1) + ' more times.\n'
else:
if p: # Write the message string if it has value.
r = r + p
p = ''
m = 0
r = r + l
if p: # Write the message if loop ended in a pattern.
r = r + p
p = ''
l_string = r # Reset string to modified string.
return l_string
The rematcher function seems to do what you want:
def rematcher(re_str, iterable):
matcher= re.compile(re_str)
in_match= 0
for item in iterable:
if matcher.match(item):
if in_match == 0:
yield item
in_match+= 1
else:
if in_match > 1:
yield "%s repeats %d more times\n" % (re_str, in_match-1)
in_match= 0
yield item
if in_match > 1:
yield "%s repeats %d more times\n" % (re_str, in_match-1)
import sys, re
for line in rematcher(".*Dog.*", sys.stdin):
sys.stdout.write(line)
EDIT
In your case, the final string should be:
final_string= '\n'.join(rematcher(".*Dog.*", your_initial_string.split("\n")))
Updated your code to be a bit more effective
#!/usr/bin/env python
#
import re
import types
def remove_repeats (l_string, l_regex):
"""Take a string, remove similar lines and replace with a summary message.
l_regex accepts strings/patterns or tuples of strings/patterns.
"""
# Convert string/pattern to tuple.
if not hasattr(l_regex, '__iter__'):
l_regex = l_regex,
ret = []
last_regex = None
count = 0
for line in l_string.splitlines(True):
if last_regex:
            # Previous line matched one of the regexes
if re.match(last_regex, line):
# This one does too
count += 1
continue # skip to next line
elif count > 1:
ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))
count = 0
last_regex = None
ret.append(line)
# Look for other patterns that could match
for regex in l_regex:
if re.match(regex, line):
# Found one
last_regex = regex
count = 1
break # exit inner loop
return ''.join(ret)
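For what it's worth, here is how I'd exercise it on the sample from the question (my own check). Note that in Python 3 a plain str also defines __iter__, so the hasattr check above will not wrap a bare string; passing the pattern(s) as a tuple sidesteps that:
sample = "Cat\nDog\nMy Dog\nHer Dog\nMouse\n"
# pass the patterns as a tuple (in Python 3 a bare string also has __iter__)
print(remove_repeats(sample, ('.*Dog.*',)), end='')
# Cat
# Dog
# ::::: Pattern '.*Dog.*' repeats 2 more times.
# Mouse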
First, your regular expression will match more slowly than if you had left off the greedy match.
.*Dog.*
is equivalent (when used with re.search) to
Dog
but the latter matches more quickly because no backtracking is involved. The longer the strings, the more likely "Dog" appears multiple times and thus the more backtracking work the regex engine has to do. As it is, ".*D" virtually guarantees backtracking.
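A rough way to see the difference (my own quick, non-rigorous timing sketch; exact numbers will vary):
import re
import timeit

line = "x" * 200 + "Cat"            # a long line that does not contain "Dog"
greedy = re.compile(".*Dog.*")      # backtracks before giving up
plain = re.compile("Dog")           # fails after a simple scan

print(timeit.timeit(lambda: greedy.search(line), number=100000))
print(timeit.timeit(lambda: plain.search(line), number=100000))
# both searches return None, but the greedy pattern takes noticeably longer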
That said, how about:
#! /usr/bin/env python
import re # regular expressions
import fileinput # read from STDIN or file
my_regex = '.*Dog.*'
my_matches = 0
for line in fileinput.input():
line = line.strip()
if re.search(my_regex, line):
if my_matches == 0:
print(line)
my_matches = my_matches + 1
else:
if my_matches != 0:
print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))
print(line)
my_matches = 0
It's not clear what should happen with non-neighboring matches.
It's also not clear what should happen with single-line matches surrounded by non-matching lines. Append "Doggy" and "Hula" to the input file and you'll get a message saying the pattern repeats "0" more times.