I am working on a project that requires me to parse massive XML files to JSON. I have written code, but it is too slow. I have looked at using lxml and BeautifulSoup but am unsure how to proceed.
I have included my code below. It works exactly as it is supposed to, except that it is too slow: it took around 24 hours to go through a sub-100 MB file and parse 100,000 records.
product_data = open('productdata_29.xml', 'r')
read_product_data = product_data.read()

def record_string_to_dict(record_string):
    '''This function takes a single record in string form and iterates through
    it, and sorts it as a dictionary. Only the nodes present in the parent_rss dict
    are appended to the new dict (single_record_dict). After each record,
    single_record_dict is flushed to final_list and is then emptied.'''
    # Iterating through the string to find keys and values to put into
    # single_record_dict.
    while record_string != record_string[::-1]:
        try:
            k = record_string.index('<')
            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l + 1:]
            m = record_string.index('<')
            temp_value = record_string[:m]
            # Cleaning the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]
            n = record_string.index('\n')
            record_string = record_string[n + 2:]
            # Checking parent_rss dict to see if the key from the record is present.
            # If it is, the key is replaced with the mapped key and added to single_record_dict.
            if temp_key in mapped_nodes.keys():
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value
        except Exception:
            break
while len(read_product_data) > 10:
    # Goes through read_product_data to create blocks, each of which is a single
    # record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]
    # Runs previous function with the input being the single string found previously.
    record_string_to_dict(single_record_string)
    # Flushes single_record_dict to final_list, and empties the dict for the next
    # record.
    final_list.append(single_record_dict)
    single_record_dict = {}
    # Removes the record that was previously processed.
    read_product_data = read_product_data[j:]
    # For keeping track/ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')
    # Keeps track of the number of records. Once the set value is reached
    # in the if block, it is flushed to a new file.
    break_counter += 1
    flush_counter += 1
    if break_counter == 100 or flush_counter == break_counter:
        record_list = open('record_list_' + str(file_counter) + '.txt', 'w')
        record_list.write(str(final_list))
        # file_counter keeps track of how many files have been created, so the next
        # file has a different int at the end.
        file_counter += 1
        record_list.close()
        # Resets break counter.
        break_counter = 0
        final_list = []
    # For testing purposes. Causes execution to stop once the number of files written
    # matches the integer.
    if file_counter == 2:
        break
print('All records have been appended.')
Is there any reason why you are not considering packages such as xml2json and xmltodict? See this post for working examples:
How can i convert an xml file into JSON using python?
Relevant code reproduced from above post:
xml2json
import xml2json
s = '''<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>'''
print xml2json.xml2json(s)
xmltodict
import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'
See this post if working in Python 3:
https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/
import json
import xmltodict
def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:  # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)
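If memory use is a concern with very large files, a streaming parse is another option. Here is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse (not xmltodict), assuming the input is a flat sequence of <record> elements as in your code; the output filename is made up:
import json
import xml.etree.ElementTree as ET

def records_to_json(xml_path, json_path):
    records = []
    # iterparse fires an 'end' event once each element has been fully read,
    # so the whole document never has to sit in memory at once.
    for event, elem in ET.iterparse(xml_path, events=('end',)):
        if elem.tag == 'record':
            # Flatten each <record>'s children into one dict of tag -> text.
            records.append(dict((child.tag, child.text) for child in elem))
            elem.clear()  # discard the processed element to keep memory flat
    with open(json_path, 'w') as out:
        json.dump(records, out, indent=2)

records_to_json('productdata_29.xml', 'records.json')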
You definitely don't want to be hand-parsing the XML. As well as the libraries others have mentioned, you could use an XSLT 3.0 processor. To go above 100 MB you would benefit from a streaming processor such as Saxon-EE, but up to that kind of level the open-source Saxon-HE should be able to hack it. You haven't shown the source XML or target JSON, so I can't give you specific code - the assumption in XSLT 3.0 is that you probably want a customized transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how different parts of your input XML should be handled.
I would like to extract text from docx files into simple txt file.
I know this problem might seem easy or trivial (I hope it will be), but I've looked over dozens of forum topics, spent hours trying to solve it by myself, and found no solution...
I have borrowed the following code from Etienne's blog.
It works perfectly if I need the content with no formatting. But...
Since my documents contain simple tables, I need them to keep their format, using just tabs.
So instead of this:
Name
Age
Wage
John
30
2000
This should appear:
Name Age Wage
John 30 2000
To keep longer lines from sliding into each other, I prefer double tabs.
I have examined the XML structure a little bit and found out that new rows in tables are indicated by tr, and columns by tc.
So I've tried to modify this a thousand ways, but with no success...
Though it's not really working, here is my idea of how to approach the solution:
from lxml.html.defs import form_tags
try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
ROW = WORD_NAMESPACE + 'tr'
COL = WORD_NAMESPACE + 'tc'

def get_docx_text(path):
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    for item in tree.iter(ROW or COL or PARA):
        texts = []
        print(item)
        if item is ROW:
            texts.append('\n')
        elif item is COL:
            texts.append('\t\t')
        elif item is PARA:
            for node in item.iter(TEXT):
                if node.text:
                    texts.append(node.text)
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)

text_file = open("output.txt", "w")
text_file.write(get_docx_text('input.docx'))
text_file.close()
I'm not very sure what the syntax should look like. The output gives nothing, and in a few trials it produced something, but it was even worse than nothing.
I put print(item) in just for checking. But instead of every ROW, COL and PARA item, it lists only ROWs. So it seems that in the condition of the for loop the program ignores the or connection of terms: if it cannot find ROW, it won't try the two remaining options but skips straight to the next item. I tried giving it a list of the terms as well.
Inside the if/elif blocks, I think e.g. if item is ROW should examine whether 'item' and 'ROW' are identical (and they actually are).
X or Y or Z evaluates to the first of the three values that is truthy. Non-empty strings are always truthy, so for item in tree.iter(ROW or COL or PARA) evaluates to for item in tree.iter(ROW); this is why you are getting only row elements inside your loop.
The iter() method of an Element can only accept one tag name, so you should perhaps just iterate over the whole tree (this won't be a problem if the document is not big).
is is not going to work here. It is an identity operator and only returns True if the objects compared are identical (i.e. the variables compared refer to the same Python object). In your if... elif... you're comparing a constant str (ROW, COL, PARA) with an Element object, which is created anew on each iteration, so obviously these two are never the same object and each comparison returns False.
Instead you should use something like if item.tag == ROW.
All of the above taken into account, you should rewrite your loop section like this:
for item in tree.iter():
    texts = []
    print(item)
    if item.tag == ROW:
        texts.append('\n')
    elif item.tag == COL:
        texts.append('\t\t')
    elif item.tag == PARA:
        for node in item.iter(TEXT):
            if node.text:
                texts.append(node.text)
    if texts:
        paragraphs.append(''.join(texts))
The answer above won't work like you asked. This should work for documents containing only tables; some additional parsing with findall should help you isolate non-table data and make this work for a document with tables and other text:
TABLE = WORD_NAMESPACE + 'tbl'

for item in tree.iter():  # use this for loop instead
    # print(item.tag)
    if item.tag == TABLE:
        for row in item.iter(ROW):
            texts.append('\n')
            for col in row.iter(COL):
                texts.append('\t')
                for ent in col.iter(TEXT):
                    if ent.text:
                        texts.append(ent.text)
return ''.join(texts)
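If the document mixes tables with ordinary text, one possible way to combine the two (a rough sketch, assuming the direct children of w:body are only w:p paragraphs and w:tbl tables) is to walk the body's children and branch on the tag:
try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
BODY = WORD_NAMESPACE + 'body'
TABLE = WORD_NAMESPACE + 'tbl'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
ROW = WORD_NAMESPACE + 'tr'
COL = WORD_NAMESPACE + 'tc'

def get_docx_text_with_tables(path):
    document = zipfile.ZipFile(path)
    tree = XML(document.read('word/document.xml'))
    document.close()
    lines = []
    for child in tree.find(BODY):        # direct children: paragraphs or tables
        if child.tag == TABLE:
            for row in child.iter(ROW):  # one output line per table row
                cells = [''.join(t.text or '' for t in col.iter(TEXT))
                         for col in row.iter(COL)]
                lines.append('\t'.join(cells))
        elif child.tag == PARA:          # ordinary paragraph
            lines.append(''.join(t.text or '' for t in child.iter(TEXT)))
    return '\n'.join(lines)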
I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract id (which is the file number that I extracted from my document), sentence id (automatically generated), and each sentence of this abstract on a row.
I would want a table that looks like this:
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by (1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on a row) to a table and assign the SentenceID as shown above?
This is my code:
import glob;
import re;
import json
org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);
for name in files:
    fileA = open(name,'r');
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name,'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n','')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments
    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
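Something along these lines might work (a sketch only; the 'File' and 'NSF Org' labels and the colon layout are guesses, since we can't see the file):
import re

# filename is the loop variable from the code above
header = open(filename, 'r').read()
file_match = re.search(r'^\s*File\s*:\s*(\S+)', header, re.MULTILINE)
org_match = re.search(r'^\s*NSF Org\s*:\s*(.+)', header, re.MULTILINE)
if file_match:
    absID = file_match.group(1)
if org_match:
    nsfOrg = org_match.group(1).split()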
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and you collect a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for the reasons cited is best avoided. In the smaller version, you can see how I substituted string literals in the places where they are used only once, and called print statements immediately instead of storing the results for later. The result is usually more concise and easier to understand.
I'm trying to find the best way to parse through a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks something like this:
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]
UI: etc...
While several attributes are shared (e.g. UI), some are not (e.g. STY). However, I could hardcode an exhaustive list of the necessary ones.
Since each grouping is separated by an empty line, I used split so I can process each chunk of data individually:
input = file.read().split("\n\n")
for chunk in input:
    process(chunk)
I've seen some approaches use string find/slice, itertools.groupby, and even regexes. I was thinking of using a regex like '[A-Z]*:' to find where the headers are, but I'm not sure how to approach pulling out multiple lines afterwards until another header is reached (such as the multiline data following DEF in the first example entity).
I appreciate any suggestions.
I assumed that if a string spans multiple lines, you want the newlines replaced with spaces (and any additional spaces removed).
import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s')  # Matches line header
    tmp = ''    # Stored/cached data for multiline string
    key = None  # Current key
    data = {}
    with open(filename, 'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)
            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''
            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}
                continue
            # We do have a header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue
            # No header, just append the string -> here goes the assumption that you want
            # to remove newlines and trailing spaces and replace them with one single space
            tmp += ' ' + row
    # Missed row?
    if key:
        data[key] = tmp
    # Missed group?
    if data:
        yield data
This generator yields a dict with pairs like UI: T020 on each iteration (always with at least one item).
Since it uses a generator and reads continuously, it should be efficient even on large files, and it won't read the whole file into memory at once.
Here's a little demo:
for data in process_file('data.txt'):
    print('-' * 20)
    for i in data:
        print('%s:' % (i), data[i])
    print()
And the actual output:
--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab
--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
source = """
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
"""
inpt = source.split("\n")  # just emulating a file

import re
reg = re.compile(r"^([A-Z]{2,3}):(.*)$")

output = dict()
current_key = None
current = ""
for line in inpt:
    line_match = reg.match(line)  # check if we hit a CODE: Content line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current  # if so - store the contents under the previous key
        current_key = line_match.group(1)
        current = line_match.group(2)
    else:
        current = current + ' ' + line  # if not - it is the continuation of the previous key's line
output[current_key] = current  # don't forget the last guy
print(output)
import re
from collections import namedtuple

def process(chunk):
    # With the capturing group, re.split returns ['', key1, value1, key2, value2, ...],
    # so walk the list in key/value pairs starting at index 1.
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    for i in xrange(1, len(split_chunk) - 1, 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i + 1]
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)
should do. I think I'd just do the dict though -- why are you so attached to a namedtuple?
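For example, run against a shortened version of the second entry from your sample data (just a quick, hypothetical check):
chunk = """UI: T145
RL: exhibits
ABR: EX"""

t = process(chunk)
print t.RL  # ' exhibits' -- the values keep their leading space and trailing newline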
Below is the snippet: I'm parsing a job log, and the output is the formatted result.
def job_history(f):
    def get_value(j, n):
        return j[n].split('=')[1]
    lines = read_file(f)
    for line in lines:
        if line.find('Exit_status=') != -1:
            nLine = line.split(';')
            jobID = '.'.join(nLine[2].split('.', 2)[:-1])
            jData = nLine[3].split(' ')
            jUsr = get_value(jData, 0)
            jHst = get_value(jData, 9)
            jQue = get_value(jData, 3)
            eDate = get_value(jData, 14)
            global LJ, LU, LH, LQ, LE
            LJ = max(LJ, len(jobID))
            LU = max(LU, len(jUsr))
            LH = max(LH, len(jHst))
            LQ = max(LQ, len(jQue))
            LE = max(LE, len(eDate))
            print "%-14s%-12s%-14s%-12s%-10s" % (jobID, jUsr, eDate, jHst, jQue)
    return LJ, LU, LE, LH, LQ
In principle, I should have another function like this:
def fmt_print(a, b, c, d, e):
    print "%-14s%-12s%-14s%-12s%-10s\n" % (a, b, c, d, e)
to print the header and call the functions like this to print the complete result:
fmt_print('JOB ID','OWNER','E_DATE','R_HOST','QUEUE')
job_history(inFile)
My question is: how can I make fmt_print() print both the header and the result, using the values LJ, LU, LE, LH, LQ for the format spacing? job_history() will parse a number of log files from the log directory. The length of a given field will differ from file to file, and I don't want to go static with the spacing (assuming a max length per field), as there are going to be a lot more columns to print than in the example. Thanks in advance for your help. Cheers!!
PS. For those who know my posts: I don't have to use Python v2.3 anymore. I can even use v2.6, but I want my code to be v2.4 compatible to go with the RHEL5 default.
Update: 1
I had a fundamental problem in my original script. As I mentioned above, job_history() reads multiple files in a directory in a loop, so the max lengths were being calculated per file and not for the entire result. After modifying
unutbu's script a little bit and following xtofl's suggestion (if this is what it meant), I came up with this, which seems to be working.
def job_history(f):
    result = []
    for line in lines:
        if line.find('Exit_status=') != -1:
            ....
            ....
            global LJ, LU, LH, LQ, LE
            LJ = max(LJ, len(jobID))
            LU = max(LU, len(jUsr))
            LH = max(LH, len(jHst))
            LQ = max(LQ, len(jQue))
            LE = max(LE, len(eDate))
            result.append((jobID, jUsr, eDate, jHst, jQue))
    return LJ, LU, LH, LQ, LE, result

# list of log files
inFiles = [m for m in os.listdir(logDir)]
saved_ary = []
for inFile in sorted(inFiles):
    LJ, LU, LE, LH, LQ, result = job_history(inFile)
    saved_ary += result

# format printing
fmt_print = "%%-%ds %%-%ds %%-%ds %%-%ds %%-%ds" % (LJ, LU, LE, LH, LQ)
print_head = fmt_print % ('Job Id', 'User', 'End Date', 'Exec Host', 'Queue')
print '%s\n%s' % (print_head, len(print_head) * '-')
for lines in saved_ary:
    print fmt_print % lines
I'm sure there are a lot of other, better ways of doing this, so suggestions are welcome. Cheers!!
Update: 2
Sorry for bringing up this "solved" post again. I later discovered I was even wrong with my updated script, so I thought I'd post another update for future reference. Even though it appeared to be working, the length values were actually being overwritten for every file in the loop. This works correctly now.
def job_history(f):
    def get_value(j, n):
        return j[n].split('=')[1]
    lines = read_file(f)
    for line in lines:
        if "Exit_status=" in line:
            nLine = line.split(';')
            jobID = '.'.join(nLine[2].split('.', 2)[:-1])
            jData = nLine[3].split(' ')
            jUsr = get_value(jData, 0)
            ....
            result.append((jobID, jUsr, ..., ...., ...))
    return result

# list of log files
inFiles = [m for m in os.listdir(logDir)]
saved_ary = []
LJ = 0; LU = 0; LE = 0; LH = 0; LQ = 0
for inFile in sorted(inFiles):
    j_data = job_history(inFile)
    saved_ary += j_data

for ix in range(len(saved_ary)):
    LJ = max(LJ, len(saved_ary[ix][0]))
    LU = max(LU, len(saved_ary[ix][1]))
    ....

# format printing
fmt_print = "%%-%ds %%-%ds %%-%ds %%-%ds %%-%ds" % (LJ, LU, LE, LH, LQ)
print_head = fmt_print % ('Job Id', 'User', 'End Date', 'Exec Host', 'Queue')
print '%s\n%s' % (print_head, len(print_head) * '-')
for lines in saved_ary:
    print fmt_print % lines
The only problem is that it takes a bit of time to start printing the info on the screen, I think because it's putting everything in the array first and then printing. Is there any way it can be improved? Cheers!!
Since you don't know LJ, LU, LH, LQ, LE until the for-loop ends, you have to complete this for-loop before you print.
result = []
for line in lines:
    if line.find('Exit_status=') != -1:
        ...
        LJ = max(LJ, len(jobID))
        LU = max(LU, len(jUsr))
        LH = max(LH, len(jHst))
        LQ = max(LQ, len(jQue))
        LE = max(LE, len(eDate))
        result.append((jobID, jUsr, eDate, jHst, jQue))

fmt = "%%-%ss%%-%ss%%-%ss%%-%ss%%-%ss" % (LJ, LU, LE, LH, LQ)
for jobID, jUsr, eDate, jHst, jQue in result:
    print fmt % (jobID, jUsr, eDate, jHst, jQue)
The fmt line is a bit tricky. When you use string interpolation, each %s gets replaced by a number, and %% gets replaced by a single %. This prepares the correct format for the subsequent print statements.
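For example, with made-up widths of 10 and 6, the first interpolation builds the format string and the second fills it in:
fmt = "%%-%ss%%-%ss" % (10, 6)
print fmt                       # %-10s%-6s
print fmt % ('job42', 'alice')  # job42     alice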
Since the column header and column content are so closely related, why not couple them into one structure, and return an array of 'columns' from your job_history function? The task of that function would be to:
output the header for each column
create the output for each line, into the corresponding column
remember the maximum width for each column, and store it in the column struct
Then the fmt_print function can 'just':
iterate over the column headers, and print them using the respective width
iterate over the 'rest' of the output, printing each cell with 'the respective width'
This design will separate output definition from actual formatting.
This is the general idea. My Python is not that good, but I may think up some example code later...
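Something like this, perhaps (a rough, untested sketch of that idea; the names are made up and it is not wired up to your log parsing):
class Column(object):
    def __init__(self, header):
        self.header = header
        self.cells = []
        self.width = len(header)

    def add(self, value):
        self.cells.append(value)
        self.width = max(self.width, len(value))

def print_table(columns):
    # Headers first, then every row, each cell padded to its column's width.
    print ' '.join(c.header.ljust(c.width) for c in columns)
    for row in zip(*[c.cells for c in columns]):
        print ' '.join(cell.ljust(col.width) for cell, col in zip(row, columns))

# job_history() would fill the columns while parsing, e.g.:
cols = [Column('Job Id'), Column('User')]
cols[0].add('12345.server')
cols[1].add('someuser')
print_table(cols)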
Depending on how many lines there are, you could:
read everything first to figure out the maximum field lengths, then go through the lines again to actually print out the results (if you have only a handful of lines)
read one page of results at a time and figure out the maximum length for the next 30 or so results (if you can handle the delay and have many lines)
not care about the format and output to a csv or some database format instead - let the final person / actual report generator worry about importing it
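For the csv route, a minimal sketch (assuming saved_ary is the list of tuples from the updated script above):
import csv

out = open('jobs.csv', 'wb')  # 'wb' for the csv module on Python 2
writer = csv.writer(out)
writer.writerow(['Job Id', 'User', 'End Date', 'Exec Host', 'Queue'])
for row in saved_ary:
    writer.writerow(row)
out.close()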