Highlighting occurrences of a string in a Tkinter textField - python

I have a regex pattern that returns a list of all the start and stop indices of an occurring string, and I want to be able to highlight each occurrence. It's extremely slow with my current setup: on a 133,000-line file it takes about 8 minutes to highlight all occurrences.
Here's my current solution:
if IPv == 4:
    v4FoundUnique = v4FoundUnique + 1
    # highlight all regions found
    for j in range(qty):
        v4Found = v4Found + 1
        # don't highlight if they set the checkbox not to
        if highlightText:
            # get row.column coordinates of start and end of match
            # very slow
            startIndex = textField.index('1.0 + {} chars'.format(starts[j]))
            # compute end based on start, using the assumption that IP addresses
            # won't span lines - drastically faster than computing from the raw index
            endIndex = "{}.{}".format(startIndex.split(".")[0],
                                      int(startIndex.split(".")[1]) + stops[j] - starts[j])
            # apply tag
            textField.tag_add("{}v4".format("public" if isPublic else "private"),
                              startIndex, endIndex)

So, Tkinter has a pretty slow implementation of converting an "absolute location" to its row.column format:
startIndex = textField.index('1.0 + {} chars'.format(starts[j]))
it's actually faster to do it like this:
for address in v4check.finditer(filetxt):
    # address.group() returns the matching text
    # address.span() returns the indices (start, stop)
    start, stop = address.span()
    ip = address.group()
    srow = filetxt.count("\n", 0, start) + 1
    scol = start - filetxt.rfind("\n", 0, start) - 1
    start = "{}.{}".format(srow, scol)
    stop = "{}.{}".format(srow, scol + len(ip))
which takes the regex results and the input file to get the data we need (row.column).
There may be a faster way of doing this, but this is the solution I found that works!
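For what it's worth, a further possible speedup (just a sketch, reusing filetxt, v4check and textField from above; the tag name "found" is only a placeholder) is to precompute every line-start offset once and then map each match offset to a row.column index with bisect:
import bisect

# Precompute the absolute offset at which each line starts (line 1 starts at offset 0).
line_starts = [0]
pos = filetxt.find("\n")
while pos != -1:
    line_starts.append(pos + 1)
    pos = filetxt.find("\n", pos + 1)

def to_tk_index(offset):
    # Rows are 1-based in Tkinter, columns are 0-based.
    row = bisect.bisect_right(line_starts, offset)
    col = offset - line_starts[row - 1]
    return "{}.{}".format(row, col)

for address in v4check.finditer(filetxt):
    start, stop = address.span()
    textField.tag_add("found", to_tk_index(start), to_tk_index(stop))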

Related

Find a pattern in a stream of bytes read in blocks

I have a stream of gigabytes of data that I read in blocks of 1 MB.
I'd like to find if (and where) one of the patterns PATTERNS = [b"foo", b"bar", ...] is present in the data (case insensitive).
Here is what I'm doing. It works but it is sub-optimal:
oldblock = b''
while True:
    block = source_data.get_bytes(1024*1024)
    if block == b'':
        break
    testblock = (oldblock + block).lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            for l in testblock.split(b'\n'):  # display only the line where the
                if PATTERN in l:              # pattern is found, not the whole 1MB block!
                    print(l)                  # note: this line can be incomplete if
    oldblock = block                          # it continues in the next block (**)
Why do we need to search in oldblock + block? This is because the pattern foo could be split exactly across two consecutive 1 MB blocks:
[.......fo] [o........]
  block n     block n+1
Drawback: it's not optimal to have to concatenate oldblock + block and to perform nearly twice as much searching as necessary.
We could use testblock = oldblock[-max_len_of_patterns:] + block, but there is surely a more canonical way to address this problem, as well as the side-remark (**).
How to do a more efficient pattern search in data read by blocks?
Note: the input data is not a file that I can iterate on or memory map, I only receive a stream of 1MB blocks from an external source.
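For illustration, the tail-overlap idea from the paragraph above might look roughly like this (a sketch only; it assumes every pattern is at least 2 bytes long):
max_len = max(len(p) for p in PATTERNS)
tail = b''
while True:
    block = source_data.get_bytes(1024 * 1024)
    if block == b'':
        break
    # Prepend only the last max_len - 1 bytes of the previous (already lowered)
    # block, so a pattern split across a boundary is still found without
    # rescanning a whole block.
    testblock = tail + block.lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            print('found', PATTERN)
    tail = testblock[-(max_len - 1):]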
I'd separate the block-getting from the pattern-searching and do it like this (all but the first two lines are from your original):
for block in nice_blocks():
    testblock = block.lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            for l in testblock.split(b'\n'):  # display only the line where the
                if PATTERN in l:              # pattern is found, not the whole 1MB block!
                    print(l)
Where nice_blocks() is an iterator of "nice" blocks, meaning they don't break lines apart and they don't overlap. And they're ~1 MB large as well.
To support that, I start with a helper just providing an iterator of the raw blocks:
def raw_blocks():
    while block := source_data.get_bytes(1024*1024):
        yield block
(The := assumes you're not years behind; it was added in Python 3.8. For older versions, do it with your while-True-if-break.)
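For reference, on older Python versions that helper could be written as:
def raw_blocks():
    while True:
        block = source_data.get_bytes(1024*1024)
        if block == b'':
            break
        yield block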
And to get nice blocks:
def nice_blocks():
    carry = b''
    for block in raw_blocks():
        i = block.rfind(b'\n')
        if i >= 0:
            yield carry + block[:i]
            carry = block[i+1:]
        else:
            carry += block
    if carry:
        yield carry
The carry carries over remaining bytes from the previous block (or previous blocks, if none of them had newlines, but that's not happening with your "blocks of 1 MB" and your "line_length < 1 KB").
With these two helper functions in place, you can write your code as at the top of my answer.
From the use of testblock.split(b'\n') in your code, as well as the comment about displaying the line where a pattern is found, it is quite apparent that your expected input is not a true binary file but a text file, where each line, separated by b'\n', is of a reasonable enough size to be readable by the end user when displayed on a screen. It is therefore most convenient and efficient to simply iterate through the file by lines instead of in chunks of a fixed size, since the iterator of a file-like object already handles buffering and splitting by lines optimally.
However, since it is now clear from your comment that data is not really a file-like object in your real-world scenario, but an API that presumably has just a method that returns a chunk of data per call, we have to wrap that API into a file-like object.
For demonstration purposes, let's simulate the API you're dealing with by creating an API class that returns up to 10 bytes of data at a time from its get_next_chunk method:
class API:
    def __init__(self, data):
        self.data = data
        self.position = 0

    def get_next_chunk(self):
        chunk = self.data[self.position:self.position + 10]
        self.position += 10
        return chunk
We can then create a subclass of io.RawIOBase that wraps the API into a file-like object with a readinto method that is necessary for a file iterator to work:
import io

class APIFileWrapper(io.RawIOBase):
    def __init__(self, api):
        self.api = api
        self.leftover = None

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self.leftover or self.api.get_next_chunk()
        size = len(buffer)
        output = chunk[:size]
        self.leftover = chunk[size:]
        output_size = len(output)
        buffer[:output_size] = output
        return output_size
With this raw file-like object, we can then wrap it in an io.BufferedReader with a buffer size that matches the size of data returned by your API call, iterate through the file object by lines, and use the built-in in operator to test whether a line contains one of the patterns in the list:
api = API(b'foo bar\nHola World\npython\nstackoverflow\n')
PATTERNS = [b't', b'ho']

for line in io.BufferedReader(APIFileWrapper(api), 10):  # or 1024 * 1024 in your case
    lowered_line = line.lower()
    for pattern in PATTERNS:
        if pattern in lowered_line:
            print(line)
            break
This outputs:
b'Hola World\n'
b'python\n'
b'stackoverflow\n'
Demo: https://replit.com/#blhsing/CelebratedCadetblueWifi
I didn't do any benchmarks, but this solution has the definite advantage of being straightforward, not searching everything twice, printing the lines as they actually appear in the stream (and not in all lower case), and printing complete lines even if they cross a block boundary:
import re

regex_patterns = list(re.compile('^.*' + re.escape(pattern) + '.*$', re.I | re.M) for pattern in PATTERNS)

testblock = ""
block = data.read(1024*1024)  # **see remark below**
while len(block) > 0:
    lastLineStart = testblock.rfind('\n') + 1
    testblock = testblock[lastLineStart:] + block.decode('UTF-8')  # **see edit below**
    for pattern in regex_patterns:
        for line in pattern.findall(testblock):
            print(line)
    block = data.read(1024*1024)  # **see remark below**
Remark: Since you are processing text data here (otherwise the notion of "lines" wouldn't make any sense), you shouldn't be using b'...' anywhere. Your text in the stream has some encoding and you should read it in a way that honours that encoding (instead of data.read(1024*1024)) so that the loops are operating on real (Python internal unicode) strings and not some byte data. Not getting that straight is one of the most frustratingly difficult bugs to find in each and every Python script.
Edit: If your data is coming from someplace you don't have control over, then using block.decode('UTF-8') (where 'UTF-8' should be replaced by your data's actual encoding!) would allow the patterns to be Python unicode strings as well, meaning you could drop the b'..' around those too. Naturally... if your data is all strictly 7-bit anyway, those points are moot.
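As a side note, a multi-byte character could itself be split across two blocks, in which case decoding each block independently can fail. A sketch of one way to guard against that, assuming UTF-8 and the same data object as above, is Python's incremental decoder:
import codecs

# The incremental decoder buffers partial multi-byte sequences between calls,
# so a character split across two 1 MB blocks still decodes correctly.
decoder = codecs.getincrementaldecoder('utf-8')()

while True:
    block = data.read(1024 * 1024)
    text = decoder.decode(block, final=(block == b''))
    # ... search `text` with the compiled regex patterns here ...
    if block == b'':
        break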
How about:
Only concatenate the end of the current block and the start of the next block, using the length of the pattern you are currently looking for. Then use a variable (carry) to indicate whether you found the pattern there, so that when you move to the next block you automatically print its first line, because you already know the pattern runs into that line.
E.g.
block_0 = "abcd"
block_1 = "efgh"
pattern = "def"
length = 3
if pattern in block_0[-length + 1:] + block_1[:length - 1]:
This if statement will check "cdef" for the pattern "def". There is no need to check any more characters than that, because if the pattern isn't in that selection of characters then it isn't split between blocks in the first place. Now that you know the pattern is across blocks, you just need to print the first line of the next block, which is done by checking the value of carry, as seen below.
This should stop you needing to go through the block twice, as you said.
oldblock = b''
carry = False
while True:
    block = data.read(1024*1024)
    if block == b'':
        break
    block = block.lower()
    lines = block.split(b'\n')
    for PATTERN in PATTERNS:
        length = len(PATTERN)
        # check the boundary between the previous block and this one
        if PATTERN in (oldblock[-length + 1:] + block[:length - 1]):
            carry = True  # Found the PATTERN between blocks, indicate that the first line of this block needs to be printed
    if carry:
        print(lines[0])
        carry = False
    for PATTERN in PATTERNS:
        if PATTERN in block:
            for l in lines:
                if PATTERN in l:
                    print(l)
    oldblock = block
Updated Answer
Given what we now know about the nature of the data, we only need to retain from a previous call to get_bytes the last N characters, where N is the maximum pattern length - 1. Since a portion of the previously retrieved block must be concatenated with the newly read block in order to match patterns that are split across block boundaries, it becomes possible to match the same pattern twice. It therefore makes sense that once a pattern has been matched we do not try to match it again. And, of course, when there are no more patterns to match we can quit.
The pattern strings, if not ASCII, should be encoded with the same encoding used in the stream.
PATTERNS = [b'foo', b'bar', 'coûteux'.encode('utf-8')]
BLOCKSIZE = 1024 * 1024
FILE_PATH = 'test.txt'

# Compute maximum pattern length - 1
pad_length = max(map(lambda pattern: len(pattern), PATTERNS)) - 1

with open(FILE_PATH, 'rb') as f:
    patterns = PATTERNS
    # Initialize with any byte string we are not trying to match.
    data = b'\x00' * pad_length
    offset = 0
    # Any unmatched patterns left?:
    while patterns:
        # Emulate a call to get_bytes(BLOCKSIZE) using a binary file:
        block = f.read(BLOCKSIZE)
        if block == b'':
            break
        # You only need to keep the last pad_length bytes from the previous read:
        data = data[-pad_length:] + block.lower()
        # Once a pattern is matched we do not want to try matching it again:
        new_patterns = []
        for pattern in patterns:
            idx = data.find(pattern)
            if idx != -1:
                print('Found: ', pattern, 'at offset', offset + idx - pad_length)
            else:
                new_patterns.append(pattern)
        offset += BLOCKSIZE
        patterns = new_patterns
If a pattern is matched, consider using break inside the for loop body to stop executing code that is no longer needed, e.g.:
for PATTERN in PATTERNS:
    if PATTERN in testblock:
        # handle the match
        break

Generating multiple strings by replacing wildcards

So I have the following strings:
"xxxxxxx#FUS#xxxxxxxx#ACS#xxxxx"
"xxxxx#3#xxxxxx#FUS#xxxxx"
And I want to generate the following strings from this pattern (I'll use the second example):
Considering #FUS# will represent 2.
"xxxxx0xxxxxx0xxxxx"
"xxxxx0xxxxxx1xxxxx"
"xxxxx0xxxxxx2xxxxx"
"xxxxx1xxxxxx0xxxxx"
"xxxxx1xxxxxx1xxxxx"
"xxxxx1xxxxxx2xxxxx"
"xxxxx2xxxxxx0xxxxx"
"xxxxx2xxxxxx1xxxxx"
"xxxxx2xxxxxx2xxxxx"
"xxxxx3xxxxxx0xxxxx"
"xxxxx3xxxxxx1xxxxx"
"xxxxx3xxxxxx2xxxxx"
Basically, if I'm given a string like the ones above, I want to generate multiple strings by replacing the wildcards, which can be #FUS#, #WHATEVER#, or a number such as #20#, producing multiple strings covering the ranges that those wildcards represent.
I've managed to get a regex to find the wildcards.
wildcardRegex = f"(#FUS#|#WHATEVER#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
This correctly finds the target wildcards.
For one wildcard, it's easy enough with re.sub().
For more it gets complicated. Or maybe it was a long day...
But I think my algorithm logic is failing hard, because I can't manage to write code that will actually generate the signals. I think I need some kind of recursive function that is called for each wildcard present (up to maybe 4 can be present, e.g. xxxxx#2#xxx#2#xx#FUS#xx#2#x).
I need a list of resulting signals.
Is there any easy way to do this that I'm completely missing?
Thanks.
import re

stringV1 = "xxx#FUS#xxxxi#3#xxx#5#xx"
stringV2 = "XXXXXXXXXX#FUS#XXXXXXXXXX#3#xxxxxx#5#xxxx"
regex = "(#FUS#|#DSP#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
WILDCARD_FUS = "#FUS#"
RANGE_FUS = 3

def getSignalsFromWildcards(app, can):
    sigList = list()
    if WILDCARD_FUS in app:
        for i in range(RANGE_FUS):
            outAppSig = app.replace(WILDCARD_FUS, str(i), 1)
            outCanSig = can.replace(WILDCARD_FUS, str(i), 1)
            if "#" in outAppSig:
                newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
                sigList += newSigList
            else:
                sigList.append((outAppSig, outCanSig))
    elif len(re.findall("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app)) > 0:
        wildcard = re.search("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app).group()
        tarRange = int(wildcard.strip("#"))
        for i in range(tarRange):
            outAppSig = app.replace(wildcard, str(i), 1)
            outCanSig = can.replace(wildcard, str(i), 1)
            if "#" in outAppSig:
                newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
                sigList += newSigList
            else:
                sigList.append((outAppSig, outCanSig))
    return sigList

if "#" in stringV1:
    resultList = getSignalsFromWildcards(stringV1, stringV2)
    for item in resultList:
        print(item)
results in
('xxx0xxxxi0xxxxx', 'XXXXXXXXXX0XXXXXXXXXX0xxxxxxxxxx')
('xxx0xxxxi1xxxxx', 'XXXXXXXXXX0XXXXXXXXXX1xxxxxxxxxx')
('xxx0xxxxi2xxxxx', 'XXXXXXXXXX0XXXXXXXXXX2xxxxxxxxxx')
('xxx1xxxxi0xxxxx', 'XXXXXXXXXX1XXXXXXXXXX0xxxxxxxxxx')
('xxx1xxxxi1xxxxx', 'XXXXXXXXXX1XXXXXXXXXX1xxxxxxxxxx')
('xxx1xxxxi2xxxxx', 'XXXXXXXXXX1XXXXXXXXXX2xxxxxxxxxx')
('xxx2xxxxi0xxxxx', 'XXXXXXXXXX2XXXXXXXXXX0xxxxxxxxxx')
('xxx2xxxxi1xxxxx', 'XXXXXXXXXX2XXXXXXXXXX1xxxxxxxxxx')
('xxx2xxxxi2xxxxx', 'XXXXXXXXXX2XXXXXXXXXX2xxxxxxxxxx')
long day after-all...
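For what it's worth, here is an alternative sketch (not part of the answer above) that expands the wildcards without recursion, by collecting one range per wildcard and taking their Cartesian product with itertools.product. It follows the same conventions as the code above: #FUS# expands to range(3) and #N# to range(N).
import re
from itertools import product

WILDCARD_RE = re.compile(r"#(FUS|\d{1,3})#")
RANGE_FUS = 3  # same convention as RANGE_FUS above

def expand(signal):
    # One range per wildcard, in order of appearance.
    ranges = [range(RANGE_FUS) if m.group(1) == "FUS" else range(int(m.group(1)))
              for m in WILDCARD_RE.finditer(signal)]
    # Replace every wildcard with a format placeholder, then fill in each combination.
    template = WILDCARD_RE.sub("{}", signal)
    return [template.format(*combo) for combo in product(*ranges)]

for signal in expand("xxx#FUS#xxxxi#3#xxx#5#xx"):
    print(signal)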

Consolidate similar patterns into single consensus pattern

In the previous post I did not clarify the question properly, so I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (ranging from 3 characters, "FFK", to 152 characters long);
some long protein sequences, aka my reference.
I am going to match these patterns against my reference and find the locations where the matches are found. (My friend helped write a script for that.)
import sys
import re
from itertools import chain, izip

# Read input
with open(sys.argv[1], 'r') as f:
    sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
    patterns = g.read().splitlines()

# Write output
with open(sys.argv[3], 'w') as outputFile:
    data_iter = iter(sequences)
    order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
    header = '\t'.join([k for k in order])
    outputFile.write(header + '\n')
    for seq_name, seq in izip(data_iter, data_iter):
        locations = [[{'antibody name': seq_name,
                       'epitope sequence': pattern,
                       'start': match.start() + 1,
                       'end': match.end(),
                       'length': len(pattern)}
                      for match in re.finditer(pattern, seq)]
                     for pattern in patterns]
        for loc in chain.from_iterable(locations):
            output = '\t'.join([str(loc[k]) for k in order])
            outputFile.write(output + '\n')
The problem is that within these 59,000 patterns, after sorting, I found that parts of some patterns match parts of other patterns, and I would like to consolidate these into one big "consensus" pattern and just keep the consensus (see the examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS: I am aligning them here so it's easier to visualize. The 59,000 patterns are initially not sorted, so it's hard to see the consensus in the actual file.
In my particular problem I am not just picking the longest pattern; instead, I need to take each pattern into account to find the consensus. I hope I have explained my specific problem clearly enough.
Thanks!
Here's my solution, with randomized input order to improve confidence in the test.
import re
import random

data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""

test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]

def aggregate_str(data_li):
    copy_data_li = data_li[:]
    while len(copy_data_li) > 0:
        remove_li = []
        len_remove_li = len(remove_li)
        longest_str = max(copy_data_li, key=len)
        copy_data_li.remove(longest_str)
        remove_li.append(longest_str)
        while len_remove_li != len(remove_li):
            len_remove_li = len(remove_li)
            for value in copy_data_li:
                value_pattern = "".join([x + "?" for x in value])
                longest_match = max(re.findall(value_pattern, longest_str), key=len)
                if longest_match in value:
                    longest_str_index = longest_str.index(longest_match)
                    value_index = value.index(longest_match)
                    if value_index > longest_str_index and longest_str_index > 0:
                        longest_str = value[:value_index] + longest_str
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
                        longest_str += value[len(longest_str) - longest_str_index:]
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value in longest_str:
                        copy_data_li.remove(value)
                        remove_li.append(value)
        print(longest_str)
        print(remove_li)

random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
    #patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
    patterns = ["APRLLIYGASS","APRLLIYGASSR","APRLLIYGASSRA","APRLLIYGASSRAT","APRLLIYGASSRATG","APRLLIYGASSRATGIP","APRLLIYGASSRATGIPD","GQAPRLLIY","KPGQAPRLLIYGASSR","KPGQAPRLLIYGASSRAT","KPGQAPRLLIYGASSRATG","KPGQAPRLLIYGASSRATGIPD","LLIYGASSRATG","LLIYGASSRATGIPD","QAPRLLIYGASSR"]
    test = find_core(patterns)
    test = find_pre_and_post(test, patterns)
    #final = "YLQMNSLRAED"
    final = "KPGQAPRLLIYGASSRATGIPD"
    if test == final:
        print("worked:" + test)
    else:
        print("fail:" + test)

def find_pre_and_post(core, patterns):
    pre = ""
    post = ""
    for pattern in patterns:
        start_index = pattern.find(core)
        if len(pattern[0:start_index]) > len(pre):
            pre = pattern[0:start_index]
        if len(pattern[start_index+len(core):len(pattern)]) > len(post):
            post = pattern[start_index+len(core):len(pattern)]
    return pre + core + post

def find_core(patterns):
    test = ""
    for i in range(len(patterns)):
        for j in range(2, len(patterns[i])):
            patterncount = 0
            for pattern in patterns:
                if patterns[i][0:j] in pattern:
                    patterncount += 1
            if patterncount == len(patterns):
                test = patterns[i][0:j]
    return test

main()
So what I do first is find the main core in the find_core function, starting with a substring of length two of the first string, as one character is not sufficient information. I then check whether that substring is in ALL of the strings, which is my definition of a "core".
I then find the index of the core in each string to get the pre and post substrings to add to the core. I keep track of their lengths and update them whenever a longer one is found. I didn't have time to explore edge cases, so here is my first shot.

Python - Splitting a large string by number of delimiter occurrences

I'm still learning Python, and I have a question I haven't been able to solve. I have a very long string (millions of lines long) which I would like to split into smaller strings based on a specified number of occurrences of a delimiter.
For instance:
ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//
In this case I would want to split based on "//" and return a string of all lines before the nth occurrence of the delimiter.
So an input of splitting the string by // by 1 would return:
ABCDEF
an input of splitting the string by // by 2 would return:
ABCDEF
//
GHIJKLMN
an input of splitting the string by // by 3 would return:
ABCDEF
//
GHIJKLMN
//
OPQ
And so on... However, the length of the original 2-million-line string appeared to be a problem when I simply tried to split the entire string by "//" and work with the individual indexes (I was getting a memory error). Perhaps Python can't handle so many lines in one split? So I can't do that.
I'm looking for a way to avoid splitting the entire string into hundreds of thousands of pieces when I may only need 100: instead, just scan from the beginning until a certain point, stop, and return everything before it, which I assume may also be faster. I hope my question is as clear as possible.
Is there a simple or elegant way to achieve this? Thanks!
If you want to work with files instead of strings in memory, here is another answer.
This version is written as a function that reads lines and immediately prints them out until the specified number of delimiters have been found (no extra memory needed to store the entire string).
def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()  # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)
file_split('data.txt', '//', 3)
You can use this to write the output to a new file like this:
python split.py > newfile.txt
With a little extra work, you can use argparse to pass parameters to the program.
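For example, a minimal sketch of that wiring (the option names and defaults here are just assumptions):
import argparse

parser = argparse.ArgumentParser(description='Print everything before the nth delimiter line.')
parser.add_argument('file_name')
parser.add_argument('-d', '--delimiter', default='//')
parser.add_argument('-n', type=int, default=1)
args = parser.parse_args()

file_split(args.file_name, args.delimiter, args.n)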
As a more efficient way, you can read just the lines up to the Nth delimiter. If you are sure that all of your sections are separated by the delimiter line, you can use itertools.islice to do the job:
from itertools import islice

with open('filename') as f:
    # N is the number of delimiter-separated sections you want;
    # they occupy the first 2*N - 1 lines (section, delimiter, section, ...).
    lines = islice(f, 0, 2*N - 1)
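Note that islice returns a lazy iterator over the file, so it has to be consumed while the file is still open; for example (N = 3 and data.txt are just placeholders):
from itertools import islice

N = 3
with open('data.txt') as f:
    text = ''.join(islice(f, 2 * N - 1))
print(text)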
The method that comes to my mind when I read your question uses a for loop, where you cut the string up into several pieces (for example the 100 you mentioned) and iterate through each substring.
thestring = "" #your string
steps = 100 #length of the strings you are going to use for iteration
log = 0
substring = thestring[:log+steps] #this is the string you will split and iterate through
thelist = substring.split("//")
for element in thelist:
if(element you want):
#do your thing with the line
else:
log = log+steps
# and go again from the start only with this offset
Now you can work through all the elements of the whole 2 million(!) line string.
The best thing to do here is actually to make a recursive function out of this (if that is what you want):
thestring = "" #your string
steps = 100 #length of the strings you are going to use for iteration
def iterateThroughHugeString(beginning):
substring = thestring[:beginning+steps] #this is the string you will split and iterate through
thelist = substring.split("//")
for element in thelist:
if(element you want):
#do your thing with the line
else:
iterateThroughHugeString(beginning+steps)
# and go again from the start only with this offset
For instance:
i = 0
s = ""
fd = open("...")
for l in fd:
    if l[:-1] == delimiter:  # skip last '\n'
        i += 1
        if i >= max_split:
            break
    s += l
fd.close()
Since you are learning Python it would be a challenge to model a complete dynamic solution. Here's a notion of how you can model one.
Note: The following code snippet only works for file(s) which is/are in the given format (see the 'For Instance' in the question). Hence, it is a static solution.
num = int(input("Enter the number of delimiters: ")) * 2
with open("./data.txt") as myfile:
    print([next(myfile) for x in range(num - 1)])
Now that you have the idea, you can use pattern matching and so on.

Why does the highlightBlock(text) method of QT class QSyntaxHighlighter process one line of text each time?

Recently I've been working on a PyQt regex tester, and I need to highlight the matched results.
Here is my code:
def highlightBlock(self, text):
    index = 0
    length = 0
    for item in self.highlight_data:
        index = text.indexOf(item, index + length)
        length = len(item)
        self.setFormat(index, length, self.matched_format)
self.highlight_data is a list which stores the matched data, and the method iterates over the text to find each item and highlight it. But when the matched data includes '\n' (multiple lines), the result won't be highlighted correctly.
When I debugged the code, I found that the highlightBlock(text) method is called several times if the text includes multiple lines. Each time, the parameter text is one line of the data.
Then I changed my code to:
def highlightBlock(self, text):
    index = 0
    length = 0
    for item in self.highlight_data:
        if item.count('\n') != 0:
            itemList = item.split('\n')
            for part in itemList:
                index = text.indexOf(part, index + length)
                if index == -1:
                    index = 0
                else:
                    length = len(part)
                    self.setFormat(index, length, self.matched_format)
        else:
            index = text.indexOf(item, index + length)
            length = len(item)
            self.setFormat(index, length, self.matched_format)
This solves the problem.
Here is my question: why does the highlightBlock(text) method process one line at a time? Why not just pass the whole text (including '\n') in one call, instead of one line over several calls?
I suppose the clue is in the name: "highlightBlock". It is called whenever blocks of text change within the document.
To quote from the Qt docs for QTextEdit:
QTextEdit works on paragraphs and characters. A paragraph is a
formatted string which is word-wrapped to fit into the width of the
widget. By default when reading plain text, one newline signifies a
paragraph. A document consists of zero or more paragraphs. The words
in the paragraph are aligned in accordance with the paragraph's
alignment. Paragraphs are separated by hard line breaks.
So, since QTextEdit works on paragraphs/blocks, it is only natural that QSyntaxHighlighter should do likewise.
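For what it's worth, the usual Qt idiom for highlighting something that spans several blocks is the block-state mechanism (setCurrentBlockState()/previousBlockState()) rather than splitting the matched text yourself. A rough sketch of such a QSyntaxHighlighter method, assuming PyQt5 (where text is a plain str) and hypothetical self.start_marker/self.end_marker delimiters for the region to highlight:
IN_REGION = 1  # any non-default block state

def highlightBlock(self, text):
    self.setCurrentBlockState(0)
    start = 0
    if self.previousBlockState() != IN_REGION:
        start = text.find(self.start_marker)
    while start >= 0:
        end = text.find(self.end_marker, start)
        if end == -1:
            # The region continues into the next block; remember that via the state.
            self.setCurrentBlockState(IN_REGION)
            length = len(text) - start
        else:
            length = end - start + len(self.end_marker)
        self.setFormat(start, length, self.matched_format)
        start = text.find(self.start_marker, start + length)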
