I have a stream of gigabytes of data that I read in blocks of 1 MB.
I'd like to find if (and where) one of the patterns PATTERNS = [b"foo", b"bar", ...] is present in the data (case insensitive).
Here is what I'm doing. It works but it is sub-optimal:
oldblock = b''
while True:
    block = source_data.get_bytes(1024*1024)
    if block == b'':
        break
    testblock = (oldblock + block).lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            for l in testblock.split(b'\n'):  # display only the line where the
                if PATTERN in l:              # pattern is found, not the whole 1MB block!
                    print(l)                  # note: this line can be incomplete if
    oldblock = block                          # it continues in the next block (**)
Why do we need to search in oldblock + block? This is because the pattern foo could be precisely split in two consecutive 1 MB blocks:
[.......fo] [o........]
block n block n+1
Drawback: it's not optimal to have to concatenate oldblock + block and to perform almost twice as much searching as necessary.
We could use testblock = oldblock[-max_len_of_patterns:] + block, but there is surely a more canonical way to address this problem, as well as the side-remark (**).
How to do a more efficient pattern search in data read by blocks?
Note: the input data is not a file that I can iterate on or memory map, I only receive a stream of 1MB blocks from an external source.
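For reference, a rough sketch of that overlap variant (keeping only the last max_len - 1 bytes of the previous block, with PATTERNS and source_data as above):

max_len = max(len(p) for p in PATTERNS)
tail = b''                           # last max_len - 1 bytes of the previous block
while True:
    block = source_data.get_bytes(1024*1024)
    if block == b'':
        break
    testblock = (tail + block).lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            ...                      # report the match / the line containing it
    tail = block[-(max_len - 1):]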
I'd separate the block-getting from the pattern-searching and do it like this (all but the first two lines are from your original):
for block in nice_blocks():
    testblock = block.lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            for l in testblock.split(b'\n'):  # display only the line where the
                if PATTERN in l:              # pattern is found, not the whole 1MB block!
                    print(l)
Where nice_blocks() is an iterator of "nice" blocks, meaning they don't break lines apart and they don't overlap. And they're ~1 MB large as well.
To support that, I start with a helper just providing an iterator of the raw blocks:
def raw_blocks():
    while block := source_data.get_bytes(1024*1024):
        yield block
(The := assumes you're not years behind: it was added in Python 3.8. For older versions, do it with your while-True-if-break.)
And to get nice blocks:
def nice_blocks():
    carry = b''
    for block in raw_blocks():
        i = block.rfind(b'\n')
        if i >= 0:
            yield carry + block[:i]
            carry = block[i+1:]
        else:
            carry += block
    if carry:
        yield carry
The carry carries over remaining bytes from the previous block (or previous blocks, if none of them had newlines, but that's not happening with your "blocks of 1 MB" and your "line_length < 1 KB").
With these two helper functions in place, you can write your code as at the top of my answer.
From the use of testblock.split(b'\n') in your code, as well as the comment about displaying the line where a pattern is found, it is apparent that your expected input is not a true binary file but a text file, where each line, separated by b'\n', is of a reasonable size to be readable by the end user when displayed on a screen. It is therefore most convenient and efficient to simply iterate through the file by lines instead of in chunks of a fixed size, since the iterator of a file-like object already handles buffering and splitting by lines optimally.
However, since it is now clear from your comment that data is not really a file-like object in your real-world scenario, but an API that presumably has just a method that returns a chunk of data per call, we have to wrap that API into a file-like object.
For demonstration purposes, let's simulate the API you're dealing with by creating an API class that returns up to 10 bytes of data at a time with its get_next_chunk method:
class API:
    def __init__(self, data):
        self.data = data
        self.position = 0

    def get_next_chunk(self):
        chunk = self.data[self.position:self.position + 10]
        self.position += 10
        return chunk
We can then create a subclass of io.RawIOBase that wraps the API into a file-like object with a readinto method that is necessary for a file iterator to work:
import io

class APIFileWrapper(io.RawIOBase):
    def __init__(self, api):
        self.api = api
        self.leftover = None

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self.leftover or self.api.get_next_chunk()
        size = len(buffer)
        output = chunk[:size]
        self.leftover = chunk[size:]
        output_size = len(output)
        buffer[:output_size] = output
        return output_size
With this raw file-like object, we can then wrap it in an io.BufferedReader with a buffer size that matches the size of data returned by your API call, iterate through the file object by lines, and use the built-in in operator to test whether a line contains one of the patterns in the list:
api = API(b'foo bar\nHola World\npython\nstackoverflow\n')
PATTERNS = [b't', b'ho']

for line in io.BufferedReader(APIFileWrapper(api), 10):  # or 1024 * 1024 in your case
    lowered_line = line.lower()
    for pattern in PATTERNS:
        if pattern in lowered_line:
            print(line)
            break
This outputs:
b'Hola World\n'
b'python\n'
b'stackoverflow\n'
Demo: https://replit.com/#blhsing/CelebratedCadetblueWifi
I didn't do any benchmarks, but this solution has the definite advantage of being straightforward: it doesn't search everything twice, it prints the lines as they actually appear in the stream (and not in all lower case), and it prints complete lines even if they cross a block boundary:
import re

regex_patterns = list(re.compile('^.*' + re.escape(pattern) + '.*$', re.I | re.M) for pattern in PATTERNS)

testblock = ""
block = data.read(1024*1024)  # **see remark below**
while len(block) > 0:
    lastLineStart = testblock.rfind('\n') + 1
    testblock = testblock[lastLineStart:] + block.decode('UTF-8')  # **see edit below**
    for pattern in regex_patterns:
        for line in pattern.findall(testblock):
            print(line)
    block = data.read(1024*1024)  # **see remark below**
Remark: Since you are processing text data here (otherwise the notion of "lines" wouldn't make any sense), you shouldn't be using b'...' anywhere. Your text in the stream has some encoding and you should read it in a way that honours that encoding (instead of data.read(1024*1024)) so that the loops are operating on real (Python internal unicode) strings and not some byte data. Not getting that straight is one of the most frustratingly difficult bugs to find in each and every Python script.
Edit: If your data is coming from someplace you don't have control over, then using block.decode('UTF-8') (where 'UTF-8' should be replaced by your data's actual encoding!) would allow the patterns to be Python unicode strings as well, meaning you could drop the b'..' around those too. Naturally, if your data is all strictly 7-bit anyway, those points are moot.
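To illustrate that, a minimal sketch (assuming UTF-8 and the source_data API from the question) of decoding the stream incrementally, so that a multi-byte character split across two blocks still decodes correctly:

import codecs

def decoded_blocks(source_data, encoding='utf-8'):
    # decode block by block; the incremental decoder holds back incomplete
    # multi-byte sequences until the next block arrives
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        block = source_data.get_bytes(1024 * 1024)
        if block == b'':
            tail = decoder.decode(b'', final=True)  # flush any bytes still buffered
            if tail:
                yield tail
            return
        yield decoder.decode(block)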
How about:
Only concatenate the end of the first block and the start of the next block, using the length of the pattern you are currently looking for. Then use a variable (carry) to indicate whether you found the pattern, so that when you move to the next block you automatically print the first line, because you already know the pattern runs into it.
E.g.
block_0 = "abcd"
block_1 = "efgh"
pattern = "def"
length = 3
if pattern in block_0[-length + 1:] + block_1[:length - 1]:
This if statement will check "cdef" for the pattern "def". There is no need to check any more characters than that, because if the pattern isn't in that selection of characters then it isn't split between blocks in the first place. Once you know the pattern is across blocks, you just need to print the first line of the next block, which is done by checking the value of carry as seen below.
This should stop you needing to go through the block twice like you said.
oldblock = b''
carry = False
while True:
    block = data.read(1024*1024)
    if block == b'':
        break
    block = block.lower()
    lines = block.split(b'\n')
    if carry:
        print(lines[0])
        carry = False
    for PATTERN in PATTERNS:
        if PATTERN in block:
            for l in lines:
                if PATTERN in l:
                    print(l)
        length = len(PATTERN)
        if PATTERN in (block[-length + 1:] + oldblock[:length - 1]):
            carry = True  # Found the PATTERN between blocks, indicate that the first line of the next block needs to be printed
    oldblock = block
Updated Answer
Given what we now know about the nature of the data, then we only need to retain from a previous call to get_bytes the last N characters where N is the maximum pattern length - 1. And since a portion of the previous block retrieved must be concatenated with the newly read block in order to match patterns that are split across block boundaries, it then becomes possible to match the same pattern twice. Therefore, it only makes sense that once a pattern has been matched we do not try to match it again. And, of course, when there are no more patterns to match we can quit.
The pattern strings, if not ASCII, should be encoded with the same encoding being used in the stream.
PATTERNS = [b'foo', b'bar', 'coûteux'.encode('utf-8')]
BLOCKSIZE = 1024 * 1024
FILE_PATH = 'test.txt'

# Compute maximum pattern length - 1
pad_length = max(map(lambda pattern: len(pattern), PATTERNS)) - 1

with open(FILE_PATH, 'rb') as f:
    patterns = PATTERNS
    # Initialize with any byte string we are not trying to match.
    data = b'\x00' * pad_length
    offset = 0
    # Any unmatched patterns left?:
    while patterns:
        # Emulate a call to get_bytes(BLOCKSIZE) using a binary file:
        block = f.read(BLOCKSIZE)
        if block == b'':
            break
        # You only need to keep the last pad_length bytes from previous read:
        data = data[-pad_length:] + block.lower()
        # Once a pattern is matched we do not want to try matching it again:
        new_patterns = []
        for pattern in patterns:
            idx = data.find(pattern)
            if idx != -1:
                print('Found: ', pattern, 'at offset', offset + idx - pad_length)
            else:
                new_patterns.append(pattern)
        offset += BLOCKSIZE
        patterns = new_patterns
If a pattern is matched, use break inside the for loop body to skip the rest of the now-useless work, e.g.:

for PATTERN in PATTERNS:
    if PATTERN in testblock:
        ...  # handle the match
        break
Related
Using Python 3.x, I need to extract JSON objects from a large file (>5GB), read as a stream. The file is stored on S3 and I don't want to load the entire file into memory for processing. Therefore I read chunks of data with amt=10000 (or some other chunk size).
The data is in this format
{
object-content
}{
object-content
}{
object-content
}
...and so on.
To manage this, I have tried a few things, but the only working solution I have is to read the chunks piece by piece and look for "}". For every "}" I try to convert the moving window of text (between my start and stop indexes) to JSON with json.loads(). If it fails, I pass and move to the next "}". If it succeeds, I yield the object and update the indexes.
import json
import re

def streamS3File(s3objGet):
    chunk = ""
    indexStart = 0  # used to find starting point of a moving window of text where JSON-object starts
    indexStop = 0   # used to find stopping point of a moving window of text where JSON-object stops
    while True:
        # Get a new chunk of data
        newChunk = s3objGet["Body"].read(amt=100000).decode("utf-8")
        # If newChunk is empty, we are at the end of the file
        if len(newChunk) == 0:
            return  # (raising StopIteration inside a generator is an error since Python 3.7)
        # Add to the leftover from last chunk
        chunk = chunk + newChunk
        # Look for "}". For every "}", try to convert the part of the chunk
        # to JSON. If it fails, pass and look for the next "}".
        for m in re.finditer(r'[{}]', chunk):
            if m.group(0) == "}":
                try:
                    indexStop = m.end()
                    yield json.loads(chunk[indexStart:indexStop])
                    indexStart = indexStop
                except ValueError:
                    pass
        # Remove the part of the chunk already processed and returned as objects
        chunk = chunk[indexStart:]
        # Reset indexes
        indexStart = 0
        indexStop = 0

for t in streamS3File(s3ReadObj):
    # t is the json-object found
    # do something with it here
    ...
I would like input on other ways to accomplish this: Finding json-objects in a stream of text and extracting the json-objects as they pass by.
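For example, one alternative worth sketching: keep a text buffer and let json.JSONDecoder.raw_decode parse one complete object at a time, telling you exactly where it ended (the read interface below mirrors the code above; the rest is illustrative):

import json

def streamS3FileRawDecode(s3objGet, chunk_size=100000):
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        newChunk = s3objGet["Body"].read(amt=chunk_size).decode("utf-8")
        if not newChunk:
            break
        buffer += newChunk
        while True:
            buffer = buffer.lstrip()   # raw_decode rejects leading whitespace
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except ValueError:
                break                  # incomplete object, wait for more data
            yield obj
            buffer = buffer[end:]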
So right now I'm looking for something in a file. I get a value variable, which is a rather long string, with newlines and so on. Then I use re.findall(regex, value) to find matches of the regex. The regex is rather simple - something like "abc de.*".
Now, I want to capture not only what the regex matches, but also its context (exactly like the -C flag for grep).
So, assuming that I dumped value to file and ran grep on it, what I'd do is grep -C N 'abc de .*' valueinfile
How can I achieve the same thing in Python? I need the answer to work with Unicode regex/text.
My approach is to split the text block into a list of lines. Next, iterate through each line and see if there is a match. In case of a match, gather the context lines (lines that happen before and after the current line) and return them. Here is my code:
import re

def grep(pattern, block, context_lines=0):
    lines = block.splitlines()
    for line_number, line in enumerate(lines):
        if re.match(pattern, line):
            lines_with_context = lines[line_number - context_lines:line_number + context_lines + 1]
            yield '\n'.join(lines_with_context)
# Try it out
text_block = """One
Two
Three
abc defg
four
five
six
abc defoobar
seven
eight
abc de"""
pattern = 'abc de.*'
for line in grep(pattern, text_block, context_lines=2):
    print(line)
    print('---')
Output:
Two
Three
abc defg
four
five
---
five
six
abc defoobar
seven
eight
---
seven
eight
abc de
---
As recommended by Ignacio Vazquez-Abrams, use a deque to store the last n lines. Once that many lines are present, popleft for each new line added. When your regular expression finds a match, return the previous n lines in the stack then iterate n more lines and return those also.
This keeps you from having to iterate on any line twice (DRY) and stores only minimal data in memory. You also mentioned the need for Unicode, so handling file encoding and adding the Unicode flag to RegEx searches is important. Also, the other answer uses re.match() instead of re.search() and as such may have unintended consequences.
Below is an example. This example only iterates over every line ONCE in the file, which means context lines that also contain hits don't get looked at again. This may or may not be desirable behavior but can easily be tweaked to highlight or otherwise flag lines with additional hits within context for a previous hit.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import re
from collections import deque


def grep(pattern, input_file, context=0, case_sensitivity=True, file_encoding='utf-8'):
    stack = deque()
    hits = []
    lines_remaining = None
    with codecs.open(input_file, mode='rb', encoding=file_encoding) as f:
        for line in f:
            # append next line to stack
            stack.append(line)
            # keep adding context after hit found (without popping off previous lines of context)
            if lines_remaining is not None:
                # count down the trailing-context lines still owed to the last hit
                lines_remaining -= 1
                if lines_remaining <= 0:
                    hits.append(stack)
                    lines_remaining = None
                    stack = deque()
                continue  # go to next line in file
            # if stack exceeds needed context, pop leftmost line off stack
            # (but include current line with possible search hit if applicable)
            if len(stack) > context + 1:
                last_line_removed = stack.popleft()
            # search line for pattern
            if case_sensitivity:
                search_object = re.search(pattern, line, re.UNICODE)
            else:
                search_object = re.search(pattern, line, re.IGNORECASE | re.UNICODE)
            if search_object:
                lines_remaining = context
                if lines_remaining == 0:
                    # no trailing context requested: record the hit right away
                    hits.append(stack)
                    lines_remaining = None
                    stack = deque()
    # in case there is not enough lines left in the file to provide trailing context
    if lines_remaining and len(stack) > 0:
        hits.append(stack)
    # return list of deques containing hits with context
    return hits  # you'll probably want to format the output, this is just an example
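A possible way to call it (the file name is made up):

hits = grep(u'abc de.*', 'valueinfile', context=2, case_sensitivity=False)
for hit in hits:
    print(''.join(hit))
    print('---')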
I'm still learning Python, and I have a question I haven't been able to solve. I have a very long string (millions of lines long) which I would like to split into smaller strings based on a specified number of occurrences of a delimiter.
For instance:
ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//
In this case I would want to split based on "//" and return a string of all lines before the nth occurrence of the delimiter.
So an input of splitting the string by // by 1 would return:
ABCDEF
an input of splitting the string by // by 2 would return:
ABCDEF
//
GHIJKLMN
an input of splitting the string by // by 3 would return:
ABCDEF
//
GHIJKLMN
//
OPQ
And so on... However, the length of the original 2-million-line string appeared to be a problem when I simply tried to split the entire string by "//" and work with the individual indexes. (I was getting a memory error.) Perhaps Python can't handle so many pieces in one split? So I can't do that.
I'm looking for a way that doesn't require splitting the entire string into hundreds of thousands of pieces when I may only need 100, but instead starts from the beginning, goes until a certain point, stops, and returns everything before it, which I assume may also be faster. I hope my question is as clear as possible.
Is there a simple or elegant way to achieve this? Thanks!
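For example, a minimal sketch of that idea: scan with str.find for the nth occurrence instead of splitting the whole string (assuming the delimiter always sits on its own line):

def before_nth_delimiter(text, n, delimiter='\n//\n'):
    start = 0
    pos = -1
    for _ in range(n):
        pos = text.find(delimiter, start)
        if pos == -1:
            return text              # fewer than n delimiters: return everything
        start = pos + len(delimiter)
    return text[:pos]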
If you want to work with files instead of strings in memory, here is another answer.
This version is written as a function that reads lines and immediately prints them out until the specified number of delimiters have been found (no extra memory needed to store the entire string).
def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()  # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)

file_split('data.txt', '//', 3)
You can use this to write the output to a new file like this:
python split.py > newfile.txt
With a little extra work, you can use argparse to pass parameters to the program.
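For example, a minimal argparse wrapper around file_split (the argument names here are illustrative):

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Print everything up to the nth delimiter line.')
    parser.add_argument('file_name')
    parser.add_argument('delimiter')
    parser.add_argument('-n', type=int, default=1, help='stop at the nth delimiter')
    args = parser.parse_args()
    file_split(args.file_name, args.delimiter, args.n)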
As a more efficient way, you can read the first N lines separated by your delimiter. If you are sure that all of your lines are separated by the delimiter, you can use itertools.islice to do the job:
from itertools import islice

with open('filename') as f:
    lines = islice(f, 0, 2*N - 1)
The method that comes to my mind when I read your question uses a for loop, where you cut the string up into several pieces (for example the 100 you mentioned) and iterate through each substring.
thestring = ""  # your string
steps = 100     # length of the strings you are going to use for iteration
log = 0
substring = thestring[:log+steps]  # this is the string you will split and iterate through
thelist = substring.split("//")
for element in thelist:
    if (element you want):
        # do your thing with the line
    else:
        log = log + steps
        # and go again from the start only with this offset
This way you can work through the whole 2-million(!) line string without handling all of it at once.
The best thing to do here is actually to make a recursive function from this (if that is what you want):
thestring = ""  # your string
steps = 100     # length of the strings you are going to use for iteration

def iterateThroughHugeString(beginning):
    substring = thestring[:beginning+steps]  # this is the string you will split and iterate through
    thelist = substring.split("//")
    for element in thelist:
        if (element you want):
            # do your thing with the line
        else:
            iterateThroughHugeString(beginning+steps)
            # and go again from the start only with this offset
For instance:
i = 0
s = ""
fd = open("...")
for l in fd:
    if l[:-1] == delimiter:  # skip last '\n'
        i += 1
        if i >= max_split:
            break
    s += l
fd.close()
Since you are learning Python it would be a challenge to model a complete dynamic solution. Here's a notion of how you can model one.
Note: The following code snippet only works for file(s) which is/are in the given format (see the 'For Instance' in the question). Hence, it is a static solution.
num = (int(input("Enter delimiter: ")) * 2)

with open("./data.txt") as myfile:
    print([next(myfile) for x in range(num-1)])
Now that you have the idea, you can use pattern matching and so on.
For now I have tried to define and document my own function to do it, but I am encountering issues with testing the code and I actually have no idea if it is correct. I found some solutions with BioPython, re, or other modules, but I really want to make this work with yield.
# generator for GenBank to FASTA
def parse_GB_to_FASTA(lines):
    # set Default label
    curr_label = None
    # set Default sequence
    curr_seq = ""
    for line in lines:
        # if the line starts with ACCESSION this should be saved as the beginning of the label
        if line.startswith('ACCESSION'):
            # if the label has already been changed
            if curr_label is not None:
                # output the label and sequence
                yield curr_label, curr_seq
            ''' if the label starts with ACCESSION, immediately replace the current label with
            the next ACCESSION number and continue with the next check'''
            # strip the first column and leave the number
            curr_label = '>' + line.strip()[12:]
        # check for the organism column
        elif line.startswith(' ORGANISM'):
            # add the organism name to the label line
            curr_label = curr_label + " " + line.strip()[12:]
        # check if the region of the sequence starts
        elif line.startswith('ORIGIN'):
            # until the end of the sequence is reached
            while line.startswith('//') is False:
                # get a line without spaces and numbers
                curr_seq += line.upper().strip()[12:].translate(None, '1234567890 ')
    # if no more lines, then give the last label and sequence
    yield curr_label, curr_seq
I often work with very large GenBank files and found (years ago) that the BioPython parsers were too brittle to make it through hundreds of thousands of records (at the time) without crashing on an unusual record.
I wrote a pure python(2) function to return the next whole record from an open file, reading in 1k chunks, and leaving the file pointer ready to get the next record. I tied this in with a simple iterator that uses this function, and a GenBank Record class which has a fasta(self) method to get a fasta version.
YMMV, but the function that gets the next record is here and should be pluggable into any iterator scheme you want to use. As far as converting to FASTA goes, you can use logic similar to your ACCESSION and ORIGIN grabbing above, or you can get the text of sections (like ORIGIN) using:
sectionTitle = 'ORIGIN'
searchRslt = re.search(r'^(%s.+?)^\S' % sectionTitle,
                       gbrText, re.MULTILINE | re.DOTALL)
sectionText = searchRslt.groups()[0]
Subsections like ORGANISM require a left-side pad of 5 spaces.
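For example, a sketch of the same trick for a padded subsection, reusing re and gbrText from the snippet above (the exact pad and the terminating condition are assumptions):

subTitle = 'ORGANISM'
# assumption: the subsection header is padded with 5 spaces and the subsection ends
# at the next line indented no deeper than the header itself
searchRslt = re.search(r'^( {5}%s.+?)(?=^ {0,5}\S)' % subTitle,
                       gbrText, re.MULTILINE | re.DOTALL)
subsectionText = searchRslt.group(1) if searchRslt else None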
Here's my solution to the main issue:
def getNextRecordFromOpenFile(fHandle):
    """Look in file for the next GenBank record
    return text of the record
    """
    cSize = 1024
    recFound = False
    recChunks = []
    try:
        fHandle.seek(-1, 1)
    except IOError:
        pass
    sPos = fHandle.tell()
    gbr = None
    while True:
        cPos = fHandle.tell()
        c = fHandle.read(cSize)
        if c == '':
            return None
        if not recFound:
            locusPos = c.find('\nLOCUS')
            if sPos == 0 and c.startswith('LOCUS'):
                locusPos = 0
            elif locusPos == -1:
                continue
            if locusPos > 0:
                locusPos += 1
                c = c[locusPos:]
            recFound = True
        else:
            locusPos = 0
        if (len(recChunks) > 0 and
                ((c.startswith('//\n') and recChunks[-1].endswith('\n'))
                 or (c.startswith('\n') and recChunks[-1].endswith('\n//'))
                 or (c.startswith('/\n') and recChunks[-1].endswith('\n/'))
                 )):
            eorPos = 0
        else:
            eorPos = c.find('\n//\n', locusPos)
        if eorPos == -1:
            recChunks.append(c)
        else:
            recChunks.append(c[:(eorPos + 4)])
            gbrText = ''.join(recChunks)
            fHandle.seek(cPos - locusPos + eorPos)
            return gbrText
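The simple iterator mentioned above could then be as small as this sketch (my wrapper around the function; the file name in the usage comment is made up):

def iterGenBankRecords(fHandle):
    # keep pulling records until getNextRecordFromOpenFile signals end-of-file with None
    while True:
        gbrText = getNextRecordFromOpenFile(fHandle)
        if gbrText is None:
            return
        yield gbrText

# usage:
# with open('records.gb') as fh:
#     for gbrText in iterGenBankRecords(fh):
#         ...  # e.g. pull out sections with the regex shown above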
I'm trying to find the best way to parse through a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks something like this:
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]
UI: etc...
While several attributes are shared (e.g. UI), some are not (e.g. STY). However, I could hardcode an exhaustive list of the necessary attributes if needed.
Since each grouping is separated by an empty line, I used split so I can process each chunk of data individually:
input = file.read().split("\n\n")
for chunk in input:
    process(chunk)
I've seen some approaches use string find/splice, itertools.groupby, and even regexes. I was thinking of doing a regex of '[A-Z]*:' to find where the headers are, but I'm not sure how to approach pulling out multiple lines afterwards until another header is reached (such as the multilined data following DEF in the first example entity).
I appreciate any suggestions.
I assumed that if a string spans multiple lines, you want the newlines replaced with spaces (and any additional spaces removed).
import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s')  # Matches line header
    tmp = ''     # Stored/cached data for multiline string
    key = None   # Current key
    data = {}
    with open(filename, 'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)
            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''
            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}
                continue
            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue
            # No header, just append string -> here goes assumption that you want to
            # remove newlines, trailing spaces and replace them with one single space
            tmp += ' ' + row
    # Missed row?
    if key:
        data[key] = tmp
    # Missed group?
    if data:
        yield data
This generator returns a dict with pairs like UI: T020 on each iteration (and always at least one item).
Since it uses a generator and continuous reading, it should be efficient even on large files, and it won't read the whole file into memory at once.
Here's a little demo:
for data in process_file('data.txt'):
    print('-'*20)
    for i in data:
        print('%s:' % (i), data[i])
    print()
And actual output:
--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab
--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
source = """
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
"""
inpt = source.split("\n")  # just emulating file

import re

reg = re.compile(r"^([A-Z]{2,3}):(.*)$")

output = dict()
current_key = None
current = ""

for line in inpt:
    line_match = reg.match(line)  # check if we hit the CODE: Content line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current  # if so - update the current_key with contents
        current_key = line_match.group(1)
        current = line_match.group(2)
    else:
        current = current + line  # if it's not - it should be the continuation of previous key line

output[current_key] = current  # don't forget the last guy
print(output)
import re
from collections import namedtuple

def process(chunk):
    # re.split with a capture group returns ['', key1, value1, key2, value2, ...]
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    for i in range(1, len(split_chunk), 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i + 1]
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)
That should do it. I think I'd just use the dict, though -- why are you so attached to a namedtuple?
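For comparison, a minimal sketch of that plain-dict variant (the function name is made up):

import re

def process_to_dict(chunk):
    # re.split with a capture group yields ['', key1, value1, key2, value2, ...]
    parts = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}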