Using Python 3.x, I need to extract JSON objects from a large file (>5 GB), read as a stream. The file is stored on S3 and I don't want to load the entire file into memory for processing, so I read chunks of data with amt=10000 (or some other chunk size).
The data is in this format:
{
object-content
}{
object-content
}{
object-content
}
...and so on.
To manage this, I have tried a few things, but the only working solution I have is to read the chunks piece by piece and look for "}". For every "}" I try to convert the moving window of text between the indexes to JSON with json.loads(). If it fails, pass and move to the next "}". If it succeeds, yield the object and update the indexes.
import json
import re

def streamS3File(s3objGet):
    chunk = ""
    indexStart = 0  # start of the moving window of text where a JSON object starts
    indexStop = 0   # end of the moving window of text where a JSON object stops
    while True:
        # Get a new chunk of data
        newChunk = s3objGet["Body"].read(amt=100000).decode("utf-8")
        # If newChunk is empty, we are at the end of the file
        if len(newChunk) == 0:
            return
        # Add to the leftover from the last chunk
        chunk = chunk + newChunk
        # Look for "}". For every "}", try to convert the part of the chunk
        # to JSON. If it fails, pass and look for the next "}".
        for m in re.finditer(r'[{}]', chunk):
            if m.group(0) == "}":
                try:
                    indexStop = m.end()
                    yield json.loads(chunk[indexStart:indexStop])
                    indexStart = indexStop
                except json.JSONDecodeError:
                    pass
        # Remove the part of the chunk already processed and returned as objects
        chunk = chunk[indexStart:]
        # Reset indexes
        indexStart = 0
        indexStop = 0
for t in streamS3File(s3ReadObj):
    # t is the JSON object found
    # do something with it here
    pass
I would like input on other ways to accomplish this: finding JSON objects in a stream of text and extracting them as they pass by.
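One alternative worth sketching (untested against a real S3 response; the s3objGet["Body"].read(amt=...) call is taken from the code above) is to let json.JSONDecoder.raw_decode from the standard library do the scanning: it parses a single object from the front of a string and reports where it stopped, so there is no need to retry on every "}".

import json

def streamS3FileRawDecode(s3objGet, chunkSize=100000):
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        newChunk = s3objGet["Body"].read(amt=chunkSize).decode("utf-8")
        buffer += newChunk
        # Pull as many complete objects as possible out of the buffer.
        while True:
            stripped = buffer.lstrip()
            try:
                obj, end = decoder.raw_decode(stripped)
            except json.JSONDecodeError:
                # The next object is still incomplete; read more data.
                break
            yield obj
            buffer = stripped[end:]
        if not newChunk:
            # End of file; anything left over is an incomplete trailing object.
            return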
Related
I have a stream of gigabytes of data that I read in blocks of 1 MB.
I'd like to find if (and where) one of the patterns PATTERNS = [b"foo", b"bar", ...] is present in the data (case insensitive).
Here is what I'm doing. It works but it is sub-optimal:
oldblock = b''
while True:
    block = source_data.get_bytes(1024*1024)
    if block == b'':
        break
    testblock = (oldblock + block).lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            for l in testblock.split(b'\n'):  # display only the line where the
                if PATTERN in l:              # pattern is found, not the whole 1MB block!
                    print(l)                  # note: this line can be incomplete if
    oldblock = block                          # it continues in the next block (**)
Why do we need to search in oldblock + block? This is because the pattern foo could be precisely split in two consecutive 1 MB blocks:
[.......fo] [o........]
  block n     block n+1
Drawback: it's not optimal to have to concatenate oldblock + block and to perform the search almost twice as much as necessary.
We could use testblock = oldblock[-max_len_of_patterns:] + block, but there is surely a more canonical way to address this problem, as well as the side-remark (**).
How to do a more efficient pattern search in data read by blocks?
Note: the input data is not a file that I can iterate on or memory map, I only receive a stream of 1MB blocks from an external source.
I'd separate the block-getting from the pattern-searching and do it like this (all but the first two lines are from your original):
for block in nice_blocks():
    testblock = block.lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            for l in testblock.split(b'\n'):  # display only the line where the
                if PATTERN in l:              # pattern is found, not the whole 1MB block!
                    print(l)
Where nice_blocks() is an iterator of "nice" blocks, meaning they don't break lines apart and they don't overlap. And they're ~1 MB large as well.
To support that, I start with a helper just providing an iterator of the raw blocks:
def raw_blocks():
    while block := source_data.get_bytes(1024*1024):
        yield block
(The := assumes you're not years behind; it was added in Python 3.8. For older versions, do it with your while-True-if-break.)
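For completeness, the pre-3.8 equivalent of raw_blocks() would be the familiar while-True-if-break version:

def raw_blocks():
    while True:
        block = source_data.get_bytes(1024*1024)
        if block == b'':
            break
        yield block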
And to get nice blocks:
def nice_blocks():
    carry = b''
    for block in raw_blocks():
        i = block.rfind(b'\n')
        if i >= 0:
            yield carry + block[:i]
            carry = block[i+1:]
        else:
            carry += block
    if carry:
        yield carry
The carry carries over remaining bytes from the previous block (or previous blocks, if none of them had newlines, but that's not happening with your "blocks of 1 MB" and your "line_length < 1 KB").
With these two helper functions in place, you can write your code as at the top of my answer.
From the use of testblock.split(b'\n') in your code, and from the comment about displaying the line where a pattern is found, it is apparent that your expected input is not a true binary file but a text file, where each line (separated by b'\n') is short enough to be readable when displayed on a screen. It is therefore most convenient and efficient to simply iterate through the file by lines instead of in chunks of a fixed size, since the iterator of a file-like object already handles buffering and splitting by lines optimally.
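If the data really were a file on disk, that would be as simple as the following sketch (the file name is hypothetical; the byte patterns echo the question):

PATTERNS = [b"foo", b"bar"]

with open('data.bin', 'rb') as f:   # hypothetical file name
    for line in f:                  # the file iterator buffers and splits on b'\n' for us
        lowered = line.lower()
        if any(p in lowered for p in PATTERNS):
            print(line)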
However, since it is now clear from your comment that the data is not really a file-like object in your real-world scenario, but an API that presumably just has a method returning a chunk of data per call, we have to wrap that API in a file-like object.
For demonstration purposes, let's simulate the API you're dealing with by creating an API class that returns up to 10 bytes of data at a time with the get_next_chunk method:
class API:
    def __init__(self, data):
        self.data = data
        self.position = 0

    def get_next_chunk(self):
        chunk = self.data[self.position:self.position + 10]
        self.position += 10
        return chunk
We can then create a subclass of io.RawIOBase that wraps the API into a file-like object with a readinto method that is necessary for a file iterator to work:
import io

class APIFileWrapper(io.RawIOBase):
    def __init__(self, api):
        self.api = api
        self.leftover = None

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self.leftover or self.api.get_next_chunk()
        size = len(buffer)
        output = chunk[:size]
        self.leftover = chunk[size:]
        output_size = len(output)
        buffer[:output_size] = output
        return output_size
With a raw file-like object, we can then wrap it in an io.BufferedReader with a buffer size that matches the size of data returned by your API call, iterate through the file object by lines, and use the built-in in operator to test whether a line contains one of the patterns in the list:
api = API(b'foo bar\nHola World\npython\nstackoverflow\n')
PATTERNS = [b't', b'ho']

for line in io.BufferedReader(APIFileWrapper(api), 10):  # or 1024 * 1024 in your case
    lowered_line = line.lower()
    for pattern in PATTERNS:
        if pattern in lowered_line:
            print(line)
            break
This outputs:
b'Hola World\n'
b'python\n'
b'stackoverflow\n'
Demo: https://replit.com/#blhsing/CelebratedCadetblueWifi
I didn't do any benchmarks, but this solution has the definite advantages of being straightforward, not searching anything twice, printing the lines as they actually appear in the stream (not in all lower case), and printing complete lines even if they cross a block boundary:
import re

regex_patterns = list(re.compile('^.*' + re.escape(pattern) + '.*$', re.I | re.M) for pattern in PATTERNS)

testblock = ""
block = data.read(1024*1024)  # **see remark below**
while len(block) > 0:
    lastLineStart = testblock.rfind('\n') + 1
    testblock = testblock[lastLineStart:] + block.decode('UTF-8')  # **see edit below**
    for pattern in regex_patterns:
        for line in pattern.findall(testblock):
            print(line)
    block = data.read(1024*1024)  # **see remark below**
Remark: Since you are processing text data here (otherwise the notion of "lines" wouldn't make any sense), you shouldn't be using b'...' anywhere. The text in the stream has some encoding, and you should read it in a way that honours that encoding (instead of data.read(1024*1024)), so that the loops operate on real (Python internal Unicode) strings and not on raw byte data. Not getting that straight is one of the most frustratingly difficult bugs to find in a Python script.
Edit: If your data is coming from someplace you don't have control over, then using block.decode('UTF-8') (where 'UTF-8' should be replaced by your data's actual encoding!) allows the patterns to be Python unicode strings as well, meaning you could drop the b'..' around those too. Naturally, if your data is all strictly 7-bit anyway, those points are moot.
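One extra wrinkle: a multi-byte character can itself be split across two blocks, so calling .decode() on each block in isolation may fail at a block boundary. A hedged refinement is to feed the blocks through an incremental decoder (a sketch, assuming UTF-8 and the same get_bytes-style source as the question):

import codecs

decoder = codecs.getincrementaldecoder('utf-8')()  # substitute the stream's real encoding
while True:
    block = source_data.get_bytes(1024*1024)
    if block == b'':
        text = decoder.decode(b'', final=True)  # flush any buffered partial character
    else:
        text = decoder.decode(block)            # holds back incomplete trailing bytes
    # ... search `text` here with str (not bytes) patterns ...
    if block == b'':
        break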
How about:
Only concatenate the end of the first block and the start of the next block, using the length of the pattern that you are currently looking for. Then use a variable (carry) to indicate whether you found the pattern, so that when you move to the next block you automatically print its first line, because you already know that line starts with the rest of the pattern.
E.g.
block_0 = "abcd"
block_1 = "efgh"
pattern = "def"
length = 3
if pattern in block_0[-length + 1:] + block_1[:length - 1]
This if statement will check "cdef" for the pattern "def". No need to check any more characters than that because if it isn't in that selection of characters then it isn't between blocks in the first place. Now you know the pattern is across blocks, you just need to print the first line of the next block which will be done by checking the value of carry as seen below.
This should stop you needing to go through the block twice like you said.
oldblock = b''
carry = False
while True:
    block = data.read(1024*1024)
    if block == b'':
        break
    block = block.lower()
    lines = block.split(b'\n')
    if carry:
        print(lines[0])
        carry = False
    for PATTERN in PATTERNS:
        if PATTERN in block:
            for l in lines:
                if PATTERN in l:
                    print(l)
        length = len(PATTERN)
        if PATTERN in (block[-length + 1:] + oldblock[:length - 1]):
            carry = True  # Found the PATTERN between blocks, indicate that the first line of the next block needs to be printed
    oldblock = block
Updated Answer
Given what we now know about the nature of the data, we only need to retain from a previous call to get_bytes the last N characters, where N is the maximum pattern length - 1. And since a portion of the previously retrieved block must be concatenated with the newly read block in order to match patterns that are split across block boundaries, it becomes possible to match the same pattern twice. Therefore, it only makes sense that once a pattern has been matched we do not try to match it again. And, of course, when there are no more patterns to match we can quit.
The pattern strings, if not ASCII, should be encoded with the same encoding being used in the stream.
PATTERNS = [b'foo', b'bar', 'coûteux'.encode('utf-8')]
BLOCKSIZE = 1024 * 1024
FILE_PATH = 'test.txt'

# Compute maximum pattern length - 1
pad_length = max(map(lambda pattern: len(pattern), PATTERNS)) - 1

with open(FILE_PATH, 'rb') as f:
    patterns = PATTERNS
    # Initialize with any byte string we are not trying to match.
    data = b'\x00' * pad_length
    offset = 0
    # Any unmatched patterns left?
    while patterns:
        # Emulate a call to get_bytes(BLOCKSIZE) using a binary file:
        block = f.read(BLOCKSIZE)
        if block == b'':
            break
        # You only need to keep the last pad_length bytes from the previous read:
        data = data[-pad_length:] + block.lower()
        # Once a pattern is matched we do not want to try matching it again:
        new_patterns = []
        for pattern in patterns:
            idx = data.find(pattern)
            if idx != -1:
                print('Found: ', pattern, 'at offset', offset + idx - pad_length)
            else:
                new_patterns.append(pattern)
        offset += BLOCKSIZE
        patterns = new_patterns
If a pattern has been matched, try using break inside the for loop body to stop executing code that is no longer useful, e.g.:
for PATTERN in PATTERNS:
    if PATTERN in testblock:
        # handle the match here
        break  # no need to keep scanning once a match is found
I am working on a project that requires me to parse massive XML files to JSON. I have written code, however it is too slow. I have looked at using lxml and BeautifulSoup but am unsure how to proceed.
I have included my code. It works exactly how it is supposed to, except it is too slow. It took around 24 hours to go through a sub-100 MB file to parse 100,000 records.
product_data = open('productdata_29.xml', 'r')
read_product_data = product_data.read()

def record_string_to_dict(record_string):
    '''This function takes a single record in string form and iterates through
    it, and sorts it as a dictionary. Only the nodes present in the parent_rss dict
    are appended to the new dict (single_record_dict). After each record,
    single_record_dict is flushed to final_list and is then emptied.'''
    #Iterating through the string to find keys and values to put in to
    #single_record_dict.
    while record_string != record_string[::-1]:
        try:
            k = record_string.index('<')
            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l+1:]
            m = record_string.index('<')
            temp_value = record_string[:m]
            #Cleaning the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]
            n = record_string.index('\n')
            record_string = record_string[n+2:]
            #Checking parent_rss dict to see if the key from the record is present. If it is,
            #the key is replaced with keys and added to single_record_dictionary.
            if temp_key in mapped_nodes.keys():
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value
        except Exception:
            break

while len(read_product_data) > 10:
    #Goes through read_product_data to create blocks, each of which is a single
    #record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]
    #Runs previous function with the input being the single string found previously.
    record_string_to_dict(single_record_string)
    #Flushes single_record_dict to final_list, and empties the dict for the next
    #record.
    final_list.append(single_record_dict)
    single_record_dict = {}
    #Removes the record that was previously processed.
    read_product_data = read_product_data[j:]
    #For keeping track/ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')
    #Keeps track of the number of records. Once the set value is reached
    #in the if loop, it is flushed to a new file.
    break_counter += 1
    flush_counter += 1
    if break_counter == 100 or flush_counter == break_counter:
        record_list = open('record_list_' + str(file_counter) + '.txt', 'w')
        record_list.write(str(final_list))
        #file_counter keeps track of how many files have been created, so the next
        #file has a different int at the end.
        file_counter += 1
        record_list.close()
        #resets break counter
        break_counter = 0
        final_list = []
        #For testing purposes. Causes execution to stop once the number of files written
        #matches the integer.
        if file_counter == 2:
            break

print('All records have been appended.')
Is there any reason why you are not considering packages such as xml2json and xmltodict? See this post for working examples:
How can i convert an xml file into JSON using python?
Relevant code reproduced from above post:
xml2json
import xml2json

s = '''<?xml version="1.0"?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>'''
print(xml2json.xml2json(s))
xmltodict
import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'
See this post if working in Python 3:
https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/
import json
import xmltodict

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:  # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)
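For files in the question's size range, another standard-library option is xml.etree.ElementTree.iterparse, which streams the document instead of holding the whole tree in memory. This is only a sketch, assuming the <record> elements from the question's own code and that each child element maps to one key; the output file name is arbitrary:

import json
import xml.etree.ElementTree as ET

def records_to_json(xml_path, json_path):
    records = []
    # iterparse yields each element when its closing tag is seen, so memory
    # use stays bounded as long as finished elements are cleared.
    for event, elem in ET.iterparse(xml_path, events=('end',)):
        if elem.tag == 'record':
            records.append({child.tag: child.text for child in elem})
            elem.clear()
    with open(json_path, 'w') as out:
        json.dump(records, out, indent=4)

records_to_json('productdata_29.xml', 'records.json')  # input name taken from the question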
You definitely don't want to be hand-parsing the XML. As well as the libraries others have mentioned, you could use an XSLT 3.0 processor. To go above 100 MB you would benefit from a streaming processor such as Saxon-EE, but up to that kind of level the open source Saxon-HE should be able to hack it. You haven't shown the source XML or target JSON, so I can't give you specific code - the assumption in XSLT 3.0 is that you probably want a customized transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how different parts of your input XML should be handled.
This time I tried to use Python's xlsxwriter module to write data from a .srt file into an Excel file.
The subtitle file looks like this in sublime text:
but I want to write the data into an excel, so it looks like this:
It's my first time coding Python for something like this, so I'm still in the trial-and-error stage... I tried to write some code like the below,
but I don't think it makes sense...
I'll continue trying out, but if you know how to do it, please let me know. I'll read your code and try to understand them! Thank you! :)
The following breaks the problem into a few pieces:
Parsing the input file. parse_subtitles is a generator that takes a source of lines and yields up a sequence of records of the form {'index': 'N', 'timestamp': 'NN:NN:NN,NNN --> NN:NN:NN,NNN', 'subtitles': [TEXT, ...]}. The approach I took was to track which of three distinct states we're in:
seeking to next entry for when we're looking for the next index number, which should match the regular expression ^\d*$ (nothing but a bunch of numbers)
looking for timestamp when an index is found and we expect a timestamp to come in the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm --> HH:MM:SS,mmm), and
reading subtitles while consuming actual subtitle text, with blank lines and EOF interpreted as subtitle termination points.
Writing the above records to a row in a worksheet. write_dict_to_worksheet accepts a row and worksheet, plus a record and a dictionary defining the Excel 0-indexed column numbers for each of the record's keys, and then it writes the data appropriately.
Organizing the overall conversion. convert accepts an input filename (e.g. 'Wildlife.srt') that'll be opened and passed to the parse_subtitles function, and an output filename (e.g. 'Subtitle.xlsx') that will be created using xlsxwriter. It then writes a header and, for each record parsed from the input file, writes that record to the XLSX file.
Logging statements left in for self-commenting purposes, and because when reproducing your input file I fat-fingered a : to a ; in a timestamp, making it unrecognized, and having the error pop up was handy for debugging!
I've put a text version of your source file, along with the below code, in this Gist
import xlsxwriter
import re
import logging

def parse_subtitles(lines):
    line_index = re.compile(r'^\d*$')
    line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_seperator = re.compile(r'^\s*$')

    current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
    state = 'seeking to next entry'

    for line in lines:
        line = line.strip('\n')
        if state == 'seeking to next entry':
            if line_index.match(line):
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))
        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))
        elif state == 'reading subtitles':
            if line_seperator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)
        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))
    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record

def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet.
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row
    # First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles':
            worksheet.write(current_row, colindex, keyed_data[colname])
    # Next, vertically write the subtitle data
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row += 1
    return current_row

def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index':0, 'timestamp':1, 'subtitles':2}

    next_available_row = 0
    records_processed = 0
    headings = {'index':"Entries", 'timestamp':"Timecodes", 'subtitles':["Subtitles"]}
    next_available_row = write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1

    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()

convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')
Edit: Updated to split multiline subtitles across multiple rows in output
For now I have tried to define and document my own function to do it, but I am encountering issues with testing the code and I actually have no idea if it is correct. I found some solutions with BioPython, re, or others, but I really want to make this work with yield.
#generator for GenBank to FASTA
def parse_GB_to_FASTA(lines):
    #set Default label
    curr_label = None
    #set Default sequence
    curr_seq = ""
    for line in lines:
        #if the line starts with ACCESSION this should be saved as the beginning of the label
        if line.startswith('ACCESSION'):
            #if the label has already been changed
            if curr_label is not None:
                #output the label and sequence
                yield curr_label, curr_seq
            ''' if the label starts with ACCESSION, immediately replace the current label with
            the next ACCESSION number and continue with the next check'''
            #strip the first column and leave the number
            curr_label = '>' + line.strip()[12:]
        #check for the organism column
        elif line.startswith(' ORGANISM'):
            #add the organism name to the label line
            curr_label = curr_label + " " + line.strip()[12:]
        #check if the region of the sequence starts
        elif line.startswith('ORIGIN'):
            #until the end of the sequence is reached
            while line.startswith('//') is False:
                #get a line without spaces and numbers
                curr_seq += line.upper().strip()[12:].translate(None, '1234567890 ')
    #if no more lines, then give the last label and sequence
    yield curr_label, curr_seq
I often work with very large GenBank files and found (years ago) that the BioPython parsers were too brittle to make it through 100's of thousands of records (at the time), without crashing on an unusual record.
I wrote a pure python(2) function to return the next whole record from an open file, reading in 1k chunks, and leaving the file pointer ready to get the next record. I tied this in with a simple iterator that uses this function, and a GenBank Record class which has a fasta(self) method to get a fasta version.
YMMV, but the function that gets the next record is here and should be pluggable into any iterator scheme you want to use. As far as converting to fasta goes, you can use logic similar to your ACCESSION and ORIGIN grabbing above, or you can get the text of sections (like ORIGIN) using:
sectionTitle = 'ORIGIN'
searchRslt = re.search(r'^(%s.+?)^\S' % sectionTitle,
                       gbrText, re.MULTILINE | re.DOTALL)
sectionText = searchRslt.groups()[0]
Subsections like ORGANISM require a left-side pad of 5 spaces.
Here's my solution to the main issue:
def getNextRecordFromOpenFile(fHandle):
    """Look in file for the next GenBank record
    return text of the record
    """
    cSize = 1024
    recFound = False
    recChunks = []
    try:
        fHandle.seek(-1, 1)
    except IOError:
        pass
    sPos = fHandle.tell()
    gbr = None
    while True:
        cPos = fHandle.tell()
        c = fHandle.read(cSize)
        if c == '':
            return None
        if not recFound:
            locusPos = c.find('\nLOCUS')
            if sPos == 0 and c.startswith('LOCUS'):
                locusPos = 0
            elif locusPos == -1:
                continue
            if locusPos > 0:
                locusPos += 1
            c = c[locusPos:]
            recFound = True
        else:
            locusPos = 0
        if (len(recChunks) > 0 and
            ((c.startswith('//\n') and recChunks[-1].endswith('\n'))
             or (c.startswith('\n') and recChunks[-1].endswith('\n//'))
             or (c.startswith('/\n') and recChunks[-1].endswith('\n/'))
             )):
            eorPos = 0
        else:
            eorPos = c.find('\n//\n', locusPos)
        if eorPos == -1:
            recChunks.append(c)
        else:
            recChunks.append(c[:(eorPos + 4)])
            gbrText = ''.join(recChunks)
            fHandle.seek(cPos - locusPos + eorPos)
            return gbrText
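The "simple iterator" mentioned above isn't shown in the answer; a minimal sketch of how it might wrap the function (the file name is hypothetical) could be:

def iterRecords(fHandle):
    # Keep pulling whole records until getNextRecordFromOpenFile signals EOF with None.
    while True:
        gbrText = getNextRecordFromOpenFile(fHandle)
        if gbrText is None:
            return
        yield gbrText

with open('sequences.gb') as gbFile:   # hypothetical file name
    for gbrText in iterRecords(gbFile):
        pass  # extract ACCESSION / ORGANISM / ORIGIN from gbrText here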
I'm trying to find the best way to parse through a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks something like this:
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]
UI: etc...
While several attributes are shared (e.g. UI), some are not (e.g. STY). However, I could hardcode an exhaustive list of the necessary ones.
Since each grouping is separated by an empty line, I used split so I can process each chunk of data individually:
input = file.read().split("\n\n")
for chunk in input:
    process(chunk)
I've seen some approaches use string find/splice, itertools.groupby, and even regexes. I was thinking of doing a regex on '[A-Z]*:' to find where the headers are, but I'm not sure how to approach pulling out multiple lines afterwards until another header is reached (such as the multiline data following DEF in the first example entity).
I appreciate any suggestions.
I assumed that if a string spans multiple lines, you want the newlines replaced with spaces (and any additional spaces removed).
import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s')  # Matches line header
    tmp = ''    # Stored/cached data for multiline string
    key = None  # Current key
    data = {}

    with open(filename, 'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)

            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''

            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}
                continue

            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue

            # No header, just append string -> here goes assumption that you want to
            # remove newlines, trailing spaces and replace them with one single space
            tmp += ' ' + row

        # Missed row?
        if key:
            data[key] = tmp
        # Missed group?
        if data:
            yield data
This generator returns a dict with pairs like UI: T020 on each iteration (and always at least one item).
Since it uses a generator and continuous reading, it should be efficient even on large files, and it won't read the whole file into memory at once.
Here's a little demo:
for data in process_file('data.txt'):
    print('-'*20)
    for i in data:
        print('%s:' % (i), data[i])
    print()
And actual output:
--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab
--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
source = """
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
"""
inpt = source.split("\n") #just emulating file
import re

reg = re.compile(r"^([A-Z]{2,3}):(.*)$")

output = dict()
current_key = None
current = ""

for line in inpt:
    line_match = reg.match(line)  # check if we hit the CODE: Content line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current  # if so - update the current_key with contents
        current_key = line_match.group(1)
        current = line_match.group(2)
    else:
        current = current + line  # if it's not - it should be the continuation of previous key line

output[current_key] = current  # don't forget the last guy
print(output)
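If a namedtuple is still preferred over the plain dict, the result can be converted at the end (a small sketch; the field names come straight from the parsed keys, which are all valid identifiers in the sample data):

from collections import namedtuple

Record = namedtuple('Record', output.keys())
record = Record(**output)
print(record.UI, record.STY)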
import re
from collections import namedtuple

def process(chunk):
    # The capturing group keeps the header names in the result, so headers and
    # their contents alternate; index 0 holds whatever precedes the first header.
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    for i in range(1, len(split_chunk), 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i + 1]
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)
should do. I think I'd just do the dict though -- why are you so attached to a namedtuple?
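A usage sketch against the question's chunking loop (the blank-line split is taken from the question; the file name is hypothetical):

with open('data.txt') as f:
    for chunk in f.read().split('\n\n'):
        if chunk.strip():
            print(process(chunk))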