Links:
Sample XML
Tools in use:
Power Automate Desktop (PAD)
Command Prompt (CMD) Session operated by PAD
Python 3.10 (importing: Numpy, Pandas & using Element Tree)
SQL to read data and insert into primary database
Method:
XML to CSV conversion
I'm using Power Automate Desktop (PAD) to automate all of this because it is what I know.
The conversion method uses Python inside of CMD, importing numpy & pandas and using ElementTree.
Goal(s):
I would like to avoid the namespace being written to headers so SQL can interact with the csv data
I would like to append a master csv file with the new data coming in from every call
This last Goal Item is more of a wishlist and I will likely take it over to a PAD forum. Just including it in the rare event someone has a solution:
(I'd like to avoid using CMD altogether.) Given the limitations of PAD, which uses IronPython, I have to use a CMD session to run regular Python.
I am using Power Automate Desktop (PAD), which has a Python module. The only issue is that it uses IronPython 2.7, so I cannot access the libraries I need & I don't know how to write the Python code below in IronPython. If I could get this sorted out, the entire process would be much more efficient.
IronPython module operators (again, this would be a nice-to-have; the other goals are the priority).
Issues:
xmlns is being added to some of the headers when writing converted data to csv
The python code is generating a new .csv file each loop, where I would like for it to simply append the data into a common .csv file, to build the dataset.
Details (Probably Unnecessary):
First, I am not a Python expert, and I am pretty novice with SQL as well.
Currently, the Python code (see below) converts a webservice call body, formatted as XML, to csv format, then exports that data to an actual csv file. This works great, but it overwrites the file each time, which means I need to program scripts to read this data before the file is deleted/replaced with the new file. I then use SQL to INSERT INTO a main dataset (append the data). It also prints the xmlns in some of the headers, which is something I need to avoid.
If you are wondering why I am performing the conversion & not simply parsing the xml data, my client requires csv format for their datasets. Otherwise, I'd just parse out the XML data as needed. The other reason is that this data is being updated incrementally at set intervals, building an audit trail, so a one-time bulk conversion is not possible.
What I would like to do is have the Python code perform the conversion & then append a main dataset inside of a csv file.
The other issue here is that the xmlns from the XML is being pulled into some (not all) of the headers of the CSV table, which has made using SQL to read and insert into the main table an issue. I cannot figure out a way around this (again, novice).
Also, if anyone knows how one would write this in IronPython 2.7, that would be great too because I'd be able to get around using the CMD Session.
So, if I could use Python to append a primary table with the converted data, while escaping the namespace, this would solve all of my (current) issues & would have the added benefit of being 100% more efficient in the movement of this data.
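To make the target concrete, this is roughly the shape of what I'm after (a rough, untested sketch on my part; the file names match the ones in my code below, and the header handling is just my guess at how the append should work):

import os
import pandas as pd
from xml.etree import ElementTree

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()

rows = []
for child in parentroot:
    row = {}
    for elem in child.iter():
        # strip the "{namespace}" prefix so SQL sees clean headers
        row[elem.tag.split('}', 1)[-1]] = elem.text
        for k, v in elem.attrib.items():
            row[k.split('}', 1)[-1]] = v
    rows.append(row)

df = pd.DataFrame(rows)
master = 'FILE_TABLE_CMD.csv'
# append to the master file; only write headers the first time
df.to_csv(master, mode='a', index=False, header=not os.path.exists(master))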
Also, due to the limited toolset I have, I am scripting this with Power Automate driving a CMD session.
Python Code (within CMD environment):
cd\WORK\XML
python
import numpy as np
import pandas as pd
from xml.etree import ElementTree
maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()
all_tags = list(set([elem.tag for elem in parentroot.iter()]))
rows = []
for child in parentroot:
    temp_dict = {}
    for i in all_tags:
        tag_values = {}
        for inners in child.iter(i):
            temp_tag_value = {}
            temp_dict.update(inners.attrib)
            temp_tag_value[i.rsplit("}", 1)[1]] = inners.text
            tag_values.update(temp_tag_value)
        temp_dict.update(tag_values)
    rows.append(temp_dict)
dataframe = pd.DataFrame.from_dict(rows, orient='columns')
dataframe = dataframe.replace({np.nan: None})
dataframe.to_csv('FILE_TABLE_CMD.csv', index=False)
Given that no analysis is needed, avoid pandas and consider building the CSV with csv.DictWriter, where you can open the file in append mode. The code below parses all underlying descendants of <ClinicalData> and migrates each set into a CSV row.
import os
from csv import DictWriter
from xml.etree import ElementTree

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()
nmsp = {"doc": "http://www.cdisc.org/ns/odm/v1.3"}

# RETRIEVE ALL ELEMENT TAGS
all_tags = list(set([elem.tag for elem in parentroot.iter()]))
# RETRIEVE ALL ATTRIB KEYS
all_keys = [list(elem.attrib.keys()) for elem in maintree.iter()]
# UNNEST AND DE-DUPE
all_keys = set(sum([key for key in all_keys], []))
# COMBINE ELEM AND ATTRIB NAMES
all_tags = all_tags + list(all_keys)
all_tags = [(tag.split('}')[1] if '}' in tag else tag) for tag in all_tags]

# ONLY WRITE HEADERS WHEN THE FILE DOES NOT EXIST YET
write_header = not os.path.exists('FILE_TABLE_CMD.csv')

# APPEND TO EXISTING DATA WITH 'a'
with open('FILE_TABLE_CMD.csv', 'a', newline='') as f:
    writer = DictWriter(f, fieldnames=all_tags)
    if write_header:
        writer.writeheader()
    # ITERATE THROUGH ALL ClinicalData ELEMENTS
    for cd in parentroot.findall('doc:ClinicalData', namespaces=nmsp):
        temp_dict = {}
        # ITERATE THROUGH ALL DESCENDANTS
        for elem in cd.iter():
            # UPDATE DICT FOR ELEMENT TAG/TEXT
            temp_dict[elem.tag.split("}", 1)[1]] = elem.text
            # MERGE ELEM DICT WITH ATTRIB DICT
            temp_dict = {**temp_dict, **elem.attrib}
        # REMOVE NAMESPACES IN KEYS
        temp_dict = {
            (k.split('}')[1] if '}' in k else k): v
            for k, v in temp_dict.items()
        }
        # WRITE ROW TO CSV
        writer.writerow(temp_dict)
Actually, you can use the new iterparse feature of pandas.read_xml in the latest v1.5. Though intended for very large XML, this feature allows parsing any underlying element or attribute without the restriction of relationships required by XPath.
You will still need to find all element and attribute names. The CSV output will differ from the above, as this method removes any all-empty columns and retains the order of elements/attributes as presented in the XML. Also, pandas.DataFrame.to_csv does support append mode, but conditional logic may be needed for writing headers.
import os
from xml.etree import ElementTree
import pandas as pd # VERSION 1.5+
maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()
# RETRIEVE ALL ELEMENT TAGS
all_tags = list(set([elem.tag for elem in parentroot.iter()]))
# RETRIEVE ALL ATTRIB KEYS
all_keys = [list(elem.attrib.keys()) for elem in maintree.iter()]
# UNNEST AND DE-DEDUPE
all_keys = set(sum([key for key in all_keys], []))
# COMBINE ELEM AND ATTRIB NAMES
all_tags = all_tags + list(all_keys)
all_tags = [(tag.split('}')[1] if '}' in tag else tag) for tag in all_tags]
clinical_data_df = pd.read_xml(
    "FILE_XML.xml", iterparse={"ClinicalData": all_tags}, parser="etree"
)

if not os.path.exists("FILE_TABLE_CMD.csv"):
    # CREATE CSV WITH HEADERS
    clinical_data_df.to_csv("FILE_TABLE_CMD.csv", index=False)
else:
    # APPEND TO CSV WITHOUT HEADERS
    clinical_data_df.to_csv("FILE_TABLE_CMD.csv", index=False, mode="a", header=False)
I recommend splitting the large XML into branches and parsing those parts separately. This can be done in a class. The object can also hold the data until it is written to the csv or to a database like sqlite3, MySQL, etc. The object can also be called from a different thread.
I have not defined the csv write, because I don't know which data you would like to capture, but I think you will finish that easily.
Here is my recommended concept:
import xml.etree.ElementTree as ET
import pandas as pd
#import re

class ClinicData:

    def __init__(self, branch):
        self.clinical_data = []
        self.tag_list = []
        for elem in branch.iter():
            self.tag_list.append(elem.tag)
            #print(elem.tag)

    def parse_cd(self, branch):
        for elem in branch.iter():
            if elem.tag in self.tag_list:
                print(f"{elem.tag}--->{elem.attrib}")
            if elem.tag == "{http://www.cdisc.org/ns/odm/v1.3}AuditRecord":
                AuditRec_val = pd.read_xml(ET.tostring(elem))
                print(AuditRec_val)
        branch.clear()

def main():
    """Parse each clinical data branch into the class."""
    xml_file = 'Doug_s.xml'
    #tree = ET.parse(xml_file)
    #root = tree.getroot()
    #ns = re.match(r'{.*}', root.tag).group(0)
    #print("Namespace:",ns)
    parser = ET.XMLPullParser(['end'])
    with open(xml_file, 'r', encoding='utf-8') as et_xml:
        for line in et_xml:
            parser.feed(line)
            for event, elem in parser.read_events():
                if elem.tag == "{http://www.cdisc.org/ns/odm/v1.3}ClinicalData" and event == 'end':
                    #print(event, elem.tag)
                    elem_part = ClinicData(elem)
                    elem_part.parse_cd(elem)

if __name__ == "__main__":
    main()
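As a hedged example of the csv write I left open: if the AuditRecord DataFrame is the data you want to keep, you could append it to a running csv file like this (audit_trail.csv is a made-up name; adjust the path and columns to your needs):

import os
import pandas as pd

def write_to_csv(df, csv_path='audit_trail.csv'):
    # Append the parsed branch to a running csv file;
    # write the header only when the file does not exist yet.
    df.to_csv(csv_path, mode='a', index=False,
              header=not os.path.exists(csv_path))

You would call it right after AuditRec_val = pd.read_xml(...), e.g. write_to_csv(AuditRec_val).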
I have huge XML datasets (2-40GB). Some of the data is confidential, so I am trying to edit the dataset to mask all of the confidential information. I have a long list of each value that needs to be masked, so for example if I have ID 'GYT-1064' I need to find and replace every instance of it. These values can be in different fields/levels/subclasses, so in one object it might have 'Order-ID = GYT-1064' whereas another might say 'PO-Name = GYT-1064'. I have looked into iterparse but cannot figure out how to in-place edit the xml file instead of building the entire new tree in memory, because I have to loop through it multiple times to find each instance of each ID.
Ideal functionality:
For each element, if a given string is in element, replace the text and change the line in the XML file.
I have a solution that works if the dataset is small enough to load into memory, but I can't figure out how to correctly leverage iterparse. I've also looked into every answer that talks about lxml iterparse, but since I need to iterate through the entire file multiple times, I need to be able to edit it in place.
Simple version that works, but has to load the whole xml into memory (and isn't in-place)
values_to_mask = ['val1', 'GMX-103', 'etc-555']  # imported list of vals to mask

with open(dataset_name, encoding='utf8') as f:
    tree = ET.parse(f)
    root = tree.getroot()

for old in values_to_mask:
    new = mu.generateNew(old, randomnumber)  # utility to generate new amt
    for elem in root.iter():
        try:
            elem.text = elem.text.replace(old, new)
        except AttributeError:
            pass

tree.write(output_name, encoding='utf8')
What I attempted with iterparse:
with open(output_name, mode='rb+') as f:
    context = etree.iterparse(f)
    for old in values_to_mask:
        new = mu.generateNew(old, randomnumber)
        mu.fast_iter(context, mu.replace_if_exists, old, new, f)

def replace_if_exists(elem, old, new, xf):
    try:
        if old in elem.text:
            elem.text = elem.text.replace(old, new)
            xf.write(elem)
    except AttributeError:
        pass
It runs but doesn't replace any text, and I get print(context.root) = 'Null'. Additionally, it doesn't seem like it would correctly write back to the file in place.
Basically how the XML data looks (hierarchical objects with subclasses)
It looks generally like this:
<Master_Data_Object>
  <Package>
    <PackageNr>1000</PackageNr>
    <Quantity>900</Quantity>
    <ID>FAKE_CONFIDENTIALGYO421</ID>
    <Item_subclass>
      <ItemType>C</ItemType>
      <MasterPackageID>FAKE_CONFIDENTIALGYO421</MasterPackageID>
      <Package>
        <Other_Types>
Since there is no sample dataset, I would suggest that you:
1) use readlines() in a loop to read a substantial amount of data at a time
2) use a regular expression to identify the confidential information (if possible) and then replace it
A rough sketch of that idea is below. Let me know if it works.
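This is only a sketch: it assumes a value never spans a line break, and the file names and the mapping are placeholders. It also writes to a new file rather than editing in place, which is usually safer anyway.

import re

# hypothetical mapping of confidential values to their replacements
values_to_mask = {'GYT-1064': 'MASKED-0001', 'GMX-103': 'MASKED-0002'}
pattern = re.compile('|'.join(re.escape(v) for v in values_to_mask))

# stream line by line so only a small part of the file is in memory
with open('input.xml', encoding='utf8') as src, \
        open('masked.xml', 'w', encoding='utf8') as dst:
    for line in src:
        dst.write(pattern.sub(lambda m: values_to_mask[m.group(0)], line))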
You can pretty much use a SAX parser for big xml files.
Here is your answer:
Editing big xml files using sax parser
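For reference, here is a minimal sketch of that SAX approach using only the standard library. It streams the input to a new file and only masks element text, not attributes; the file names and the mapping are placeholders.

import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

# hypothetical mapping of confidential values to replacements
values_to_mask = {'GYT-1064': 'MASKED-0001'}

class MaskFilter(XMLFilterBase):
    # Replace confidential values in text nodes as they stream through
    def characters(self, content):
        for old, new in values_to_mask.items():
            content = content.replace(old, new)
        super().characters(content)

with open('masked.xml', 'w', encoding='utf-8') as out:
    filt = MaskFilter(xml.sax.make_parser())
    filt.setContentHandler(XMLGenerator(out, encoding='utf-8'))
    filt.parse('input.xml')

One caveat: characters() can be called in chunks, so a value split across two chunks would be missed; for short IDs like the ones shown this is unlikely, but it is worth knowing.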
I am new to Python and am attempting to process some XML into CSV files for later diff validation against the output from a database. The code I have below does a good job of taking the 'tct-id' attributes from the XML and outputting them in a nice column under the heading 'DocumentID', as I need for my validation.
However, the outputs from the database just come as numbers, whereas the outputs from this code include the version number of the XML ID; for example
tct-id="D-TW-0010054;3;"
where I need the ;3; removed so I can validate properly.
This is the code I have; is there any way I can go about rewriting this so it will pre-process the XML snippets to remove that - like only take the first 12 characters from each attribute and write those to the CSV, for example?
from lxml import etree
import csv
xml_fname = 'example.xml'
csv_fname = 'output.csv'
fields = ['tct-id']
xml = etree.parse(xml_fname)
with open(xml_fname) as infile, open(csv_fname, 'w', newline='') as outfile:
    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, fields, delimiter=';', extrasaction="ignore")
    wtr = csv.writer(outfile)
    wtr.writerow(["DocumentID"])
    for node in xml.xpath("//*[self::StringVariables or self::ElementVariables or self::PubInfo or self::Preface or self::DocHistory or self::Glossary or self::StatusInfo or self::Chapter]"):
        atts = node.attrib
        atts["elm_name"] = node.tag
        w.writerow(node.attrib)
All help is very much appreciated.
Assuming you'll only have one ;3; type string to remove from the tct-id, you can use regular expressions
import re
tct_id="D-TW-0010054;3;"
to_rem=re.findall(r'(;.*;)',tct_id)[0]
tct_id=tct_id.replace(to_rem,'')
Note: I'm using tct_id instead of tct-id, as Python doesn't allow hyphens in variable names.
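If you just want the version suffix gone, a simpler (hypothetical) variant is to split on ';' and keep the first piece:

# Same idea without a regex: split on ';' and keep the first piece
tct_id = "D-TW-0010054;3;"
doc_id = tct_id.split(";")[0]
print(doc_id)   # D-TW-0010054

In your loop that would look something like wtr.writerow([node.attrib.get('tct-id', '').split(';')[0]]) (a guess on my part, since I don't know your full schema).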
I have Python code to implement in Spark, but I am unable to get the logic right for the RDD to work in Spark version 1.1. This code works perfectly in plain Python, and I would like to implement it in Spark.
import lxml.etree
import csv

sc = SparkContext
data = sc.textFile("pain001.xml")
rdd = sc.parallelize(data)

# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm')  # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]

# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open(rdd) as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip():  # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception, e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}".format(index, str(e)))
                raise
I am getting an error that the RDD is not iterable.
I want to implement this code as a function and run it as a standalone program in Spark.
I would like the input file to be processed from HDFS as well as in local mode in Spark with the Python module.
I would appreciate responses to this problem.
The error you are getting is very informative: when you do with open(rdd) as painxml:, you are trying to iterate over the RDD as if it were a normal list or tuple in Python, but an RDD is not iterable. Furthermore, if you read the textFile documentation, you will notice that it returns an RDD.
I think the problem you have is that you are trying to achieve this in a classic way; you must approach it within the MapReduce paradigm. If you are really new to Apache Spark, you can audit the course Scalable Machine Learning with Apache Spark, and I would also recommend updating your Spark version to 1.5 or 1.6 (which will come out soon).
Just as a small example (but not using xmls):
Import the required files
import re
import csv
Read the input file
content = sc.textFile("../test")
content.collect()
# Out[8]: [u'1st record-1', u'2nd record-2', u'3rd record-3', u'4th record-4']
Map the RDD to manipulate each row
# Map it and convert it to tuples
rdd = content.map(lambda s: tuple(re.split("-+",s)))
rdd.collect()
# Out[9]: [(u'1st record', u'1'),
# (u'2nd record', u'2'),
# (u'3rd record', u'3'),
# (u'4th record', u'4')]
Write your data
with open("../test.csv", "w") as fw:
writer = csv.writer(fw)
for r1 in rdd.toLocalIterator():
writer.writerow(r1)
Take a look...
$ cat test.csv
1st record,1
2nd record,2
3rd record,3
4th record,4
Note: If you want to read XML with Apache Spark, there are some libraries on GitHub like spark-xml; you may also find this question interesting: xml processing in spark.
I'm trying to write a simple JSON to CSV converter in Python for Kiva. The JSON file I am working with looks like this:
{"header":{"total":412045,"page":1,"date":"2012-04-11T06:16:43Z","page_size":500},"loans":[{"id":84,"name":"Justine","description":{"languages":["en"], REST OF DATA
The problem is, when I use json.load, I only get the strings "header" and "loans" in data, but not the actual information such as id, name, description, etc. How can I skip over everything until the [? I have a lot of files to process, so I can't manually delete the beginning in each one. My current code is:
import csv
import json
fp = csv.writer(open("test.csv","wb+"))
f = open("loans/1.json")
data = json.load(f)
f.close()
for item in data:
    fp.writerow([item["name"]] + [item["posted_date"]] + OTHER STUFF)
Instead of
for item in data:
use
for item in data['loans']:
The header is stored in data['header'] and data itself is a dictionary, so you'll have to key into it in order to access the data.
data is a dictionary, so for item in data iterates the keys.
You probably want for loan in data['loans']:
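Putting it together, something like this (a sketch; the field names are just the ones from your snippet):

import csv
import json

with open("loans/1.json") as f:
    data = json.load(f)

with open("test.csv", "w", newline="") as out:
    fp = csv.writer(out)
    for loan in data["loans"]:
        # pick whichever fields you need; these two come from your example
        fp.writerow([loan["name"], loan.get("posted_date")])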
I have some json files of 500 MB each.
If I use the "trivial" json.load() to load their content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
print prefix, the_type, value
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen (a sketch of such a driver is below). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
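A rough sketch of such a driver (parse_one.py and the file names are hypothetical; the point is that each file is handled by a child process whose memory is fully released when it exits):

import subprocess

list_of_files = ["1.json", "2.json", "3.json"]   # placeholder names

for json_file in list_of_files:
    # run the single-file parser as a separate process per file
    proc = subprocess.Popen(["python", "parse_one.py", json_file])
    proc.wait()   # handle one file at a time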
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrary-sized chunks; you can get it here and check out the README for examples. It's fast because it uses the 'C' yajl library.
It can be done by using ijson. The workings of ijson have been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the default prefix given by ijson.
If you want to access only specific JSON objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those json whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
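As a very rough sketch of that string-parsing idea (it assumes the file is one top-level array of JSON objects; reading a character at a time keeps the example simple but is slow):

import json

def iter_array_objects(path):
    # Yield one top-level object at a time from a file that contains a
    # single JSON array of objects, tracking strings and brace depth.
    depth = 0
    in_string = False
    escape = False
    buf = []
    with open(path, encoding='utf8') as f:
        while True:
            ch = f.read(1)
            if not ch:
                break
            if depth > 0:
                buf.append(ch)
            if in_string:
                if escape:
                    escape = False
                elif ch == '\\':
                    escape = True
                elif ch == '"':
                    in_string = False
                continue
            if ch == '"':
                in_string = True
            elif ch == '{':
                depth += 1
                if depth == 1:
                    buf = ['{']
            elif ch == '}':
                depth -= 1
                if depth == 0:
                    yield json.loads(''.join(buf))
                    buf = []

for obj in iter_array_objects('big.json'):   # hypothetical file name
    print(obj)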
I don't know what generates your json content. If possible, I would consider generating a number of manageable files, instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via its client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
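A minimal sketch of that idea with pymongo (the database/collection and file names are made up, and it assumes each file parses to a JSON array of records):

import json
from pymongo import MongoClient

client = MongoClient()                     # local mongod on the default port
coll = client["mydb"]["records"]           # hypothetical db/collection names

# load and insert one file at a time so only one file is in memory at once
for path in ["file1.json", "file2.json"]:  # placeholder file names
    with open(path) as f:
        coll.insert_many(json.load(f))     # assumes each file is a JSON array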
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape:
I would try writing a custom json parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests - break the file up into smaller chunks, etc.
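For example, a quick (hedged) way to print just the key structure, using ijson as mentioned in other answers (the file name is a placeholder):

import ijson

# Print each distinct (path, key) pair once to get a feel for the
# structure of the blob before deciding how to chunk it.
seen = set()
with open("big.json", "rb") as f:
    for prefix, event, value in ijson.parse(f):
        if event == "map_key" and (prefix, value) not in seen:
            seen.add((prefix, value))
            print(prefix, value)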
You can convert the JSON file to a CSV file, parsing it incrementally as a stream of events:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, you can load the JSON data line by line, reading each line's key/value pairs into a dictionary, appending that dictionary to an overall dictionary, and converting it to a pandas DataFrame, which will help with further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        #print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data holds the dictionary data as a DataFrame (table format), but it will
# be in transposed form, so finally transpose the DataFrame:
Data_1 = Data.T