Parsing massive XML files to JSON - python

I am working on a project that requires me to parse massive XML files to JSON. I have written code, but it is too slow. I have looked at using lxml and BeautifulSoup but am unsure how to proceed.
I have included my code below. It works exactly how it is supposed to, except it is too slow: it took around 24 hours to go through a sub-100 MB file and parse 100,000 records.
product_data = open('productdata_29.xml', 'r')
read_product_data = product_data.read()

def record_string_to_dict(record_string):
    '''This function takes a single record in string form and iterates through
    it, and sorts it as a dictionary. Only the nodes present in the parent_rss dict
    are appended to the new dict (single_record_dict). After each record,
    single_record_dict is flushed to final_list and is then emptied.'''
    # Iterating through the string to find keys and values to put into
    # single_record_dict.
    while record_string != record_string[::-1]:
        try:
            k = record_string.index('<')
            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l + 1:]
            m = record_string.index('<')
            temp_value = record_string[:m]
            # Cleaning the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]
            n = record_string.index('\n')
            record_string = record_string[n + 2:]
            # Checking parent_rss dict to see if the key from the record is present.
            # If it is, the key is replaced with keys and added to single_record_dict.
            if temp_key in mapped_nodes.keys():
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value
        except Exception:
            break

while len(read_product_data) > 10:
    # Goes through read_product_data to create blocks, each of which is a single
    # record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]
    # Runs previous function with the input being the single string found previously.
    record_string_to_dict(single_record_string)
    # Flushes single_record_dict to final_list, and empties the dict for the next
    # record.
    final_list.append(single_record_dict)
    single_record_dict = {}
    # Removes the record that was previously processed.
    read_product_data = read_product_data[j:]
    # For keeping track/ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')
    # Keeps track of the number of records. Once the set value is reached,
    # final_list is flushed to a new file.
    break_counter += 1
    flush_counter += 1
    if break_counter == 100 or flush_counter == break_counter:
        record_list = open('record_list_' + str(file_counter) + '.txt', 'w')
        record_list.write(str(final_list))
        # file_counter keeps track of how many files have been created, so the next
        # file has a different int at the end.
        file_counter += 1
        record_list.close()
        # Resets break counter.
        break_counter = 0
        final_list = []
    # For testing purposes. Causes execution to stop once the number of files
    # written matches the integer.
    if file_counter == 2:
        break

print('All records have been appended.')

Any reason why you are not considering packages such as xml2json and xmltodict? See this post for working examples:
How can I convert an XML file into JSON using Python?
Relevant code reproduced from above post:
xml2json
import xml2json
s = '''<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>'''
print xml2json.xml2json(s)
xmltodict
import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'
See this post if working in Python 3:
https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/
import json
import xmltodict

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:  # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)

You definitely don't want to be hand-parsing the XML. As well as the libraries others have mentioned, you could use an XSLT 3.0 processor. To go above 100Mb you would benefit from a streaming processor such as Saxon-EE, but up to that kind of level the open source Saxon-HE should be able to hack it. You haven't shown the source XML or target JSON, so I can't give you specific code - the assumption in XSLT 3.0 is that you probably want a customized transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how different parts of your input XML should be handled.
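If a dedicated XSLT processor is not an option, the same streaming idea is available in the Python standard library. Here is a minimal sketch with xml.etree.ElementTree.iterparse; the two-record input and its tag names are made up to match the question's <record> layout, so adapt them to the real schema:

```python
import io
import json
import xml.etree.ElementTree as ET

# Made-up input standing in for a multi-gigabyte product file.
xml_bytes = b"""<catalog>
  <record><name>A</name><price>1</price></record>
  <record><name>B</name><price>2</price></record>
</catalog>"""

records = []
# iterparse streams the document, firing an event as each element closes;
# clearing each finished <record> keeps memory flat regardless of file size.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "record":
        records.append({child.tag: child.text for child in elem})
        elem.clear()

print(json.dumps(records))
```

With a real file you would pass the filename (or an open binary file) to iterparse instead of the BytesIO buffer, and write each record out as you go rather than accumulating them in a list.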

Related

How to parse very big files in python?

I have a very big TSV file: 1.5 GB. I want to parse this file. I'm using the following function:
def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
It has been running for more than 10 hours and it is still working. I don't know how to accelerate this step, or whether there is another method to parse such a file.
several issues here:
testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() is a list. .keys() is rarely useful anyway. A better way already is if ids[0] not in evalIDs, but...
why not use a collections.defaultdict instead?
why not use the csv module?
shadowing the eval built-in (well, not really an issue, seeing how dangerous it is)
my proposal:
import csv, collections

def readEvalFileAsDictInverse(evalFile):
    with open(evalFile, "r") as handle:
        evalIDs = collections.defaultdict(list)
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
        return evalIDs
The magic evalIDs[ids[0]].append(ids[1]) creates a list if one doesn't already exist. It's also portable and very fast whatever the Python version, and saves an if.
I don't think it could be faster with default libraries, but a pandas solution probably would.
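For what that pandas version might look like, here is a sketch on a tiny in-memory sample standing in for the real file; the column names are made up:

```python
import io
import pandas as pd

# Tiny in-memory sample standing in for the 1.5 GB TSV.
tsv = io.StringIO("a\t1\nb\t2\na\t3\n")

# read_csv parses the file in C; a single groupby then builds the per-key lists.
df = pd.read_csv(tsv, sep="\t", header=None, names=["id", "value"], dtype=str)
evalIDs = df.groupby("id")["value"].apply(list).to_dict()
print(evalIDs)
```

On a real 1.5 GB file you would pass the filename to read_csv; the bulk of the time then goes to the compiled CSV parser rather than a Python-level loop.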
Some suggestions:
Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault().
dict.setdefault() will create the default value every time - that's a time burner. defaultdict(list) does not - it is optimized:
from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
If your keys are valid file names you might want to investigate awk for much more performance than doing this in Python.
Something along the lines of
awk -F $'\t' '{print > $1}' file1
will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here) - You would need to grab your created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.
If your keys are not filenames in their own right, consider storing your different lines in different files and only keeping a dictionary of key -> filename around.
After splitting the data, load the files as lists again:
Create testfile:
with open("file.txt", "w") as w:
    w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti
""")
Code:
# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename-characters, make it a valid name"""
    return k  # assuming k is a valid file name, else modify it

evalFile = "file.txt"
files = {}
with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit *rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))
        # this will open and close files _a lot_; you might want to keep file
        # handles in your dict instead - but that depends on the key/data/lines
        # ratio in your data - if you have few keys, file handles ought to be
        # better; if you have many it does not matter
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}
Change evalIDs to a collections.defaultdict(list). You can avoid the if that checks whether a key is there.
Consider splitting the file externally using split(1), or even inside Python using a read offset. Then use a multiprocessing.Pool to parallelise the loading.
Maybe you can make it somewhat faster; change this:
if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])
to
evalIDs.setdefault(ids[0], []).append(ids[1])
The first solution searches the "evalIDs" dictionary 3 times.
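To see the two variants side by side on made-up rows: both produce the same mapping, but setdefault constructs a fresh empty list on every call just in case, while defaultdict only builds one when the key is actually missing.

```python
from collections import defaultdict

# Made-up rows standing in for the parsed TSV lines.
rows = [("k%d" % (i % 3), str(i)) for i in range(9)]

def with_setdefault():
    d = {}
    for k, v in rows:
        d.setdefault(k, []).append(v)  # builds a throwaway [] on every call
    return d

def with_defaultdict():
    d = defaultdict(list)
    for k, v in rows:
        d[k].append(v)  # list() only runs when the key is missing
    return d

print(with_defaultdict())
```

Both functions return the same key-to-list mapping, so either is correct; the difference is purely how much throwaway work happens per row.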

Difference between two XML parsers in pandas

I have two lightweight XML parsers (see below). One (1) is more efficient but the results interact oddly with pandas. The other is slow but doesn't cause unusual errors. The problem is, I can't find where the two would actually produce anything different.
1)
keep = []
for en, a in enumerate(open(inputfile, 'r')):
    if not a.startswith("<Record"): continue
    line = a.strip().split("'")
    line[0] = line[0].replace("<Record ", '')
    line.pop()  # deletes the final "/>"
    store = {}
    for x in range(0, len(line), 2):
        if not line[x]: continue
        line[x] = line[x].strip().rstrip("=")
        print(line[x])
        print(line[x+1])
        store[line[x]] = line[x+1]
        if line[x] == "id17":
            id17s.append(line[x+1])
    keep.append(store)
doc = pandas.DataFrame(keep, index=id17s)  # contains all XML records.
2)
keep = []
for en, a in enumerate(open(inputfile, 'r')):
    if not a.startswith("<Record"): continue
    line = a.strip().split("' ")
    line[0] = line[0].replace("<Record ", '')
    line.pop()
    store = {}
    for x in range(len(line)):
        if not line[x]: continue
        try:
            trait, evalx = line[x].replace("'", '').strip().split('=', 1)
        except:
            print line
            print line[x].replace("'", '').strip().split('=')
            sys.exit()
        store[trait] = evalx
        if trait == "id17":
            id17s.append(evalx)
    keep.append(store)
doc = pandas.DataFrame(keep, index=id17s)  # contains all XML records.
Both parsers are followed with
doc_df = doc.apply(pandas.to_numeric, args=('ignore',)) # set missing values to nan
header=list(doc_df.columns.values) #Get columns names (header)
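One place the two splits can diverge is an attribute value that itself contains an apostrophe. Feeding both a hypothetical record line makes the difference visible:

```python
# A hypothetical record line with an apostrophe inside an attribute value.
line = "<Record id17='A1' name='O'Brien' />"

parts1 = line.strip().split("'")   # what parser 1 works with
parts2 = line.strip().split("' ")  # what parser 2 works with
print(parts1)
print(parts2)
```

Parser 1's even/odd key-value pairing drifts after the "O" fragment, while parser 2's split('=', 1) still recovers a name/OBrien pair, so a line like this is one spot where the two can silently produce different DataFrames.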

If statement and writing to file

I am using the KEGG API to download genomic data and writing it to a file. There are 26 files total, and some of them contain the dictionary key 'COMPOUND'. I would like to assign these to compData and write them to the output file. I tried writing it as an if True statement, but this does not work.
# Read in hsa links
hsa = []
with open('/users/skylake/desktop/pathway-HSAs.txt', 'r') as file:
    for line in file:
        line = line.strip()
        hsa.append(line)

# Import KEGG API Bioservices | Create KEGG Variable
from bioservices.kegg import KEGG
k = KEGG()

# Data Parsing | Writing to File
# for i in range(len(hsa)):
data = k.get(hsa[2])
dict_data = k.parse(data)
if dict_data['COMPOUND'] == True:
    compData = str(dict_data['COMPOUND'])
    nameData = str(dict_data['NAME'])
    geneData = str(dict_data['GENE'])
    f = open('/Users/Skylake/Desktop/pathway-info/' + nameData + '.txt', 'w')
    f.write("Genes\n")
    f.write(geneData)
    f.write("\nCompounds\n")
    f.write(compData)
    f.close()
I guess that by
if dict_data['COMPOUND'] == True:
you test (wrongly) for the existence of the string key 'COMPOUND' in dict_data. In this case what you want is
if 'COMPOUND' in dict_data:
Furthermore, note that the definition of the variable compData won't occur if the key is not present, which will raise an error when trying to write its value. This means that you should always define it whatever happens, e.g. via
compData = str(dict_data.get('COMPOUND', 'undefined'))
This line means that if the key exists, its value is used; if it does not exist, 'undefined' is used instead. Note that you can choose whichever alternative value you want, or even not give one, which results in None by default.
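A quick illustration with made-up parse results (the pathway names and compound IDs here are invented, not real KEGG output):

```python
# Made-up parse results: one pathway with a COMPOUND entry, one without.
with_compound = {'NAME': 'Glycolysis', 'COMPOUND': {'C00031': 'D-Glucose'}}
without_compound = {'NAME': 'Some pathway'}

print(str(with_compound.get('COMPOUND', 'undefined')))     # the real value
print(str(without_compound.get('COMPOUND', 'undefined')))  # 'undefined'
```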

Using BeautifulSoup to find a tag and evaluate whether it fits some criteria

I am writing a program to extract text from a website and write it into a text file. Each entry in the text file should have 3 values separated by a tab. The first value is hard-coded to XXXX, the 2nd value should initialize to the first item on the website with a "style4" p class, and the third value is the next item on the website with a "style5" p class. The logic I'm trying to introduce is: look for the first "style4" and write the associated string into the text file, then find the next "style5" and write the associated string into the text file. Then, look for the next p class. If it's "style4", start a new line; if it's another "style5", write it into the text file with the first "style5" entry but separated with a comma (alternatively, the program could just skip the next "style5").
I'm stuck on the part of the program in bold, that is, getting the program to look for the next p class and evaluate it against "style4" and "style5". Since I was having problems with finding and evaluating the p class tag, I chose to pull my code out of the loop and just try to accomplish the first iteration of the task for starters. Here's my code so far:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
next_vendor = soup.find('p', {'class': 'style4'})
print next_vendor
next_commodity = next_vendor.find_next('p', {'class': 'style5'})
print next_commodity
next = next_commodity.find_next('p')
print next
I'd appreciate any help anybody can provide! Thanks in advance!
I am not entirely sure how you are expecting your output to be. I am assuming that you are trying to get the data in the webpage in the format:
Alphabet \t Vendor \t Category
You can do this:
# The basic things
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
Get the td of interest:
table = soup.find('table')
data = table.find_all('tr')[-1]
data = data.find_all('td')[1:]
Now, we will create a nested output dictionary with alphabets as the keys and an inner dict as the value. The inner dict has the vendor name as key and category information as its value.
output_dict = {}
current_alphabet = ""
current_vendor = ""
for td in data:
    for p in td.find_all('p'):
        print p.text.strip()
        if p.get('class')[0] == 'style6':
            current_alphabet = p.text.strip()
            vendors = {}
            output_dict[current_alphabet] = vendors
            continue
        if p.get('class')[0] == 'style4':
            print "Here"
            current_vendor = p.text.strip()
            category = []
            output_dict[current_alphabet][current_vendor] = category
            continue
        output_dict[current_alphabet][current_vendor].append(p.text.strip())
This gets the output_dict in the format:
{ ...
u'W': { u'WTI - Weatherproofing Technologies': [u'Roofing'],
u'Wenger Corporation': [u'Musical Instruments and Equipment'],
u'Williams Scotsman, Inc': [u'Modular/Portable Buildings'],
u'Witt Company': [u'Interactive Technology']
},
u'X': { u'Xerox': [u"Copiers & MFD's", u'Printers']
}
}
Skipping the earlier parts for brevity. Now it is just a matter of accessing this dictionary and writing out to a tab separated file.
Hope this helps.
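The write-out step left to the reader could look like the sketch below; the dictionary literal is a made-up stand-in for the output_dict built above, and the output filename is an assumption:

```python
# Hypothetical output_dict shaped like the one built above.
output_dict = {
    "W": {"Wenger Corporation": ["Musical Instruments and Equipment"]},
    "X": {"Xerox": ["Copiers & MFD's", "Printers"]},
}

# One Alphabet \t Vendor \t Category line per vendor, categories joined by ", ".
with open("contracts.tsv", "w") as out:
    for alphabet in sorted(output_dict):
        for vendor, categories in sorted(output_dict[alphabet].items()):
            out.write("%s\t%s\t%s\n" % (alphabet, vendor, ", ".join(categories)))
```

Sorting both loops gives a deterministic file order even on Python versions where plain dicts are unordered.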
Agree with @shaktimaan. Using a dictionary or list is a good approach here. My attempt is slightly different.
import requests as rq
from bs4 import BeautifulSoup as bsoup
import csv

url = "http://www.kcda.org/KCDA_Awarded_Contracts.htm"
r = rq.get(url)
soup = bsoup(r.content)

primary_line = soup.find_all("p", {"class": ["style4", "style5"]})
final_list = {}
for line in primary_line:
    txt = line.get_text().strip().encode("utf-8")
    if txt != "\xc2\xa0":
        if line["class"][0] == "style4":
            key = txt
            final_list[key] = []
        else:
            final_list[key].append(txt)

with open("products.csv", "wb") as ofile:
    f = csv.writer(ofile)
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])
For the scrape, we isolate style4 and style5 tags right away. I did not bother going for the style6 or the alphabet headers. We then get the text inside each tag. If the text is not a whitespace of sorts (this is all over the tables, probably obfuscation or bad mark-up), we then check if it's style4 or style5. If it's the former, we assign it as a key to a blank list. If it's the latter, we append it to the blank list of the most recent key. Obviously the key only changes every time we hit a new style4, so it's a relatively safe approach.
The last part is easy: we just use ", ".join on the value part of the key-value pair to concatenate the list as one string. We then write it to a CSV file.
Due to the dictionary being unsorted, the resulting CSV file will not be sorted alphabetically.
Changing it to a tab-delimited file is up to you. That's simple enough. Hope this helps!
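For a self-contained illustration of the same grouping logic without any network access, here is the style4/style5 walk run over a made-up HTML snippet (vendor and category names invented):

```python
from bs4 import BeautifulSoup  # assumes bs4 is installed

# Made-up markup mimicking the vendor (style4) / category (style5) layout.
html = """
<p class="style4">Acme Corp</p>
<p class="style5">Roofing</p>
<p class="style5">Paint</p>
<p class="style4">Widgets Inc</p>
<p class="style5">Toys</p>
"""

soup = BeautifulSoup(html, "html.parser")
final_list = {}
for p in soup.find_all("p", {"class": ["style4", "style5"]}):
    txt = p.get_text().strip()
    if p["class"][0] == "style4":
        key = txt              # a style4 tag starts a new vendor entry
        final_list[key] = []
    else:
        final_list[key].append(txt)  # style5 tags attach to the latest vendor
print(final_list)
```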

Python - reading data from file with variable attributes and line lengths

I'm trying to find the best way to parse through a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks something like this:
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]
UI: etc...
While several attributes are shared (e.g. UI), some are not (e.g. STY). However, I could hardcode an exhaustive list if necessary.
Since each grouping is separated by an empty line, I used split so I can process each chunk of data individually:
input = file.read().split("\n\n")
for chunk in input:
    process(chunk)
I've seen some approaches use string find/splice, itertools.groupby, and even regexes. I was thinking of doing a regex of '[A-Z]*:' to find where the headers are, but I'm not sure how to approach pulling out multiple lines afterwards until another header is reached (such as the multilined data following DEF in the first example entity).
I appreciate any suggestions.
I made the assumption that if you have a string spanning multiple lines, you want the newlines replaced with spaces (and any additional spaces removed).
import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s')  # Matches line header
    tmp = ''    # Stored/cached data for multiline string
    key = None  # Current key
    data = {}
    with open(filename, 'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)
            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''
            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}
                continue
            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue
            # No header, just append string -> here goes the assumption that you
            # want to remove newlines and trailing spaces and replace them with
            # one single space
            tmp += ' ' + row
    # Missed row?
    if key:
        data[key] = tmp
    # Missed group?
    if data:
        yield data
This generator returns a dict with pairs like UI: T020 in each iteration (and always with at least one item).
Since it uses a generator and continuous reading, it should be efficient even on large files, and it won't read the whole file into memory at once.
Here's a little demo:
for data in process_file('data.txt'):
    print('-' * 20)
    for i in data:
        print('%s:' % (i), data[i])
    print()
And actual output:
--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab
--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
source = """
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
"""
inpt = source.split("\n")  # just emulating file
import re
reg = re.compile(r"^([A-Z]{2,3}):(.*)$")
output = dict()
current_key = None
current = ""
for line in inpt:
    line_match = reg.match(line)  # check if we hit the CODE: Content line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current  # if so - update the current_key with contents
        current_key = line_match.group(1)
        current = line_match.group(2)
    else:
        current = current + line  # if it's not - it should be the continuation of the previous key's line
output[current_key] = current  # don't forget the last guy
print(output)
import re
from collections import namedtuple

def process(chunk):
    # re.split with a capturing group yields [prefix, key1, value1, key2, value2, ...]
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    for i in range(1, len(split_chunk), 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i + 1]
    my_tuple = namedtuple(split_chunk[1], fields)  # type named after the first key
    return my_tuple(**d)
should do. I think I'd just do the dict though -- why are you so attached to a namedtuple?
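The shape re.split produces with a capturing group is worth seeing once; here it is on a made-up chunk with one value spanning two lines:

```python
import re

# A made-up chunk with one value spanning two lines.
chunk = "UI: T020\nSTY: Acquired Abnormality\nDEF: An abnormal structure,\nspanning lines."
parts = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
print(parts)
```

The first element is the (possibly empty) prefix before the first header; keys sit at the odd indices with their raw values right after them, and the multiline DEF value arrives intact, which is why the key/value pairs should be read off in strides of two starting at index 1.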
