Transform the CSV to XML in python - python

I have a scenario where the data is extracted from oracle in the form of CSV and then it should be transformed to desired XML format.
Input CSV File:
Id,SubID,Rank,Size
1,123,1,0.1
1,234,2,0.2
2,456,1,0.1
2,123,2,0.2
Expected XML output:
<AA_ITEMS>
<Id ID="1">
<SubId ID="123">
<Rank>1</Rank>
<Size>0.1</Size>
</SubId>
<SubId ID="234">
<Rank>2</Rank>
<Size>0.2</Size>
</SubId>
</Id>
<Id ID="2">
<SubId ID="456">
<Rank>1</Rank>
<Size>0.1</Size>
</SubId>
<SubId ID="123">
<Rank>2</Rank>
<Size>0.2</Size>
</SubId>
</Id>
Note: The CSV file is a daily load and contains around 150K to 200K records
Please assist. Thanks in advance

There are a couple of ways to approach it and though some people dislike building xml from a template, I believe it works best:
from itertools import groupby
from lxml import etree
csv_string = """[your csv ab0ve]
"""
#first deal with the csv
#split it into lines and discard the headers
lines = csv_string.splitlines()[1:]
#group the lines by the first character
grpfunc = lambda x: x[0]
grps = [list(group) for key, group in groupby(lines, grpfunc)]
#now convert the whole thing into xml:
xml_string = """
<AA_ITEMS>
"""
for grp in grps:
elem = f' <Id ID="{grp[0][0]}">'
for g in grp:
entry = g.split(',')
#create an entry template:
id_tmpl = f"""
<SubId ID="{entry[1]}">
<Rank>{entry[2]}</Rank>
<Size>{entry[3]}</Size>
</SubId>
"""
elem+=id_tmpl
#close elem
elem+="""</Id>
"""
xml_string+=elem
#close the xml string
xml_string += """</AA_ITEMS>"""
#finally, show that the output is well formed xml:
print(etree.tostring(etree.fromstring(xml_string)).decode())
The output should be your expected xml.

Related

How can we convert a nested XML to CSV in Python Dynamically, Nested XML may contain array of values as well?

Sharing Sample XML file. Need to convert this fie to CSV, even if extra tags are added in this file. {without using tag names}. And XML file tag names should be used as column names while converting it to CSV}
Example Data:
<?xml version="1.0" encoding="UTF-8"?>
<Food>
<Info>
<Msg>Food Store items.</Msg>
</Info>
<store slNo="1">
<foodItem>meat</foodItem>
<price>200</price>
<quantity>1kg</quantity>
<discount>7%</discount>
</store>
<store slNo="2">
<foodItem>fish</foodItem>
<price>150</price>
<quantity>1kg</quantity>
<discount>5%</discount>
</store>
<store slNo="3">
<foodItem>egg</foodItem>
<price>100</price>
<quantity>50 pieces</quantity>
<discount>5%</discount>
</store>
<store slNo="4">
<foodItem>milk</foodItem>
<price>50</price>
<quantity>1 litre</quantity>
<discount>3%</discount>
</store>
</Food>
Tried Below code but getting error with same.
import xml.etree.ElementTree as ET
import pandas as pd
ifilepath = r'C:\DATA_DIR\feeds\test\sample.xml'
ofilepath = r'C:\DATA_DIR\feeds\test\sample.csv'
root = ET.parse(ifilepath).getroot()
print(root)
with open(ofilepath, "w") as file:
for child in root:
print(child.tag, child.attrib)
# naive example how you could save to csv line wise
file.write(child.tag+";"+child.attrib)
Above code is able to find root node, but unable to concatenate its attributes though
Tried one more code, but this works for 1 level nested XML, who about getting 3-4 nested tags in same XML file. And currently able to print values of all tags and their text. need to convert these into relational model { CSV file}
import xml.etree.ElementTree as ET
tree = ET.parse(ifilepath)
root = tree.getroot()
for member in root.findall('*'):
print(member.tag,member.attrib)
for i in (member.findall('*')):
print(i.tag,i.text)
Above example works well with pandas read_xml { using lxml parser}
But when we try to use the similar way out for below XML data, it doesn't produce indicator ID value and Country ID value as output in CSV file
Example Data ::
<?xml version="1.0" encoding="UTF-8"?>
<du:data xmlns:du="http://www.dummytest.org" page="1" pages="200" per_page="20" total="1400" sourceid="5" sourcename="Dummy ID Test" lastupdated="2022-01-01">
<du:data>
<du:indicator id="AA.BB">various, tests</du:indicator>
<du:country id="MM">test again</du:country>
<du:date>2021</du:date>
<du:value>1234567</du:value>
<du:unit />
<du:obs_status />
<du:decimal>0</du:decimal>
</du:data>
<du:data>
<du:indicator id="XX.YY">testing, cases</du:indicator>
<du:country id="DD">coverage test</du:country>
<du:date>2020</du:date>
<du:value>3456223</du:value>
<du:unit />
<du:obs_status />
<du:decimal>0</du:decimal>
</du:data>
</du:data>
Solution Tried ::
import pandas as pd
pd.read_xml(ifilepath, xpath='.//du:data', namespaces= {"du": "http://www.dummytest.org"}).to_csv(ofilepath, sep=',', index=None, header=True)
Output Got ::
indicator,country,date,value,unit,obs_status,decimal
"various, tests",test again,2021,1234567,,,0
"testing, cases",coverage test,2020,3456223,,,0
Expected output ::
indicator id,indicator,country id,country,date,value,unit,obs_status,decimal
AA.BB,"various, tests",MM,test again,2021,1234567,,,0
XX.YY,"testing, cases",DD,coverage test,2020,3456223,,,0
Adding Example data , having usage of 2 or more xpath's.
Looking for ways to convert the same using pandas to_csv()
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl'?>
<CATALOG>
<PLANT>
<COMMON>rose</COMMON>
<BOTANICAL>canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Shady</LIGHT>
<PRICE>202</PRICE>
<AVAILABILITY>446</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>mango</COMMON>
<BOTANICAL>sunny</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>shady</LIGHT>
<PRICE>301</PRICE>
<AVAILABILITY>569</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Marigold</COMMON>
<BOTANICAL>palustris</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Sunny</LIGHT>
<PRICE>500</PRICE>
<AVAILABILITY>799</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>carrot</COMMON>
<BOTANICAL>Caltha</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>sunny</LIGHT>
<PRICE>205</PRICE>
<AVAILABILITY>679</AVAILABILITY>
</PLANT>
<FOOD>
<NAME>daal fry</NAME>
<PRICE>300</PRICE>
<DESCRIPTION>
Famous daal tadka from surat
</DESCRIPTION>
<CALORIES>60</CALORIES>
</FOOD>
<FOOD>
<NAME>Dhosa</NAME>
<PRICE>350</PRICE>
<DESCRIPTION>
The famous south indian dish
</DESCRIPTION>
<CALORIES>80</CALORIES>
</FOOD>
<FOOD>
<NAME>Khichdi</NAME>
<PRICE>150</PRICE>
<DESCRIPTION>
The famous gujrati dish
</DESCRIPTION>
<CALORIES>40</CALORIES>
</FOOD>
<BOOK>
<AUTHOR>Santosh Bihari</AUTHOR>
<TITLE>PHP Core</TITLE>
<GENER>programming</GENER>
<PRICE>44.95</PRICE>
<DATE>2000-10-01</DATE>
</BOOK>
<BOOK>
<AUTHOR>Shyam N Chawla</AUTHOR>
<TITLE>.NET Begin</TITLE>
<GENER>Computer</GENER>
<PRICE>250</PRICE>
<DATE>2002-17-05</DATE>
</BOOK>
<BOOK>
<AUTHOR>Anci C</AUTHOR>
<TITLE>Dr. Ruby</TITLE>
<GENER>Computer</GENER>
<PRICE>350</PRICE>
<DATE>2001-04-11</DATE>
</BOOK>
</CATALOG>
ElementTree is not really the best tool for what I believe you're trying to do. Since you have well-formed, relatively simple xml, try using pandas:
import pandas as pd
#from here, it's just a one liner
pd.read_xml('input.xml',xpath='.//store').to_csv('output.csv',sep=',', index = None, header=True)
and that should get you your csv file.
Given parsing element values and their corresponding attributes involves a second layer of iteration, consider a nested list/dict comphrehension with dictionary merge. Also, use csv.DictWriter to build CSV via dictionaries:
from csv import DictWriter
import xml.etree.ElementTree as ET
ifilepath = "Input.xml"
tree = ET.parse(ifilepath)
nmsp = {"du": "http://www.dummytest.org"}
data = [
{
**{el.tag.split('}')[-1]: (el.text.strip() if el.text is not None else None) for el in d.findall("*")},
**{f"{el.tag.split('}')[-1]} {k}":v for el in d.findall("*") for k,v in el.attrib.items()},
**d.attrib
}
for d in tree.findall(".//du:data", namespaces=nmsp)
]
dkeys = list(data[0].keys())
with open("DummyXMLtoCSV.csv", "w", newline="") as f:
dw = DictWriter(f, fieldnames=dkeys)
dw.writeheader()
dw.writerows(data)
Output
indicator,country,date,value,unit,obs_status,decimal,indicator id,country id
"various, tests",test again,2021,1234567,,,0,AA.BB,MM
"testing, cases",coverage test,2020,3456223,,,0,XX.YY,DD
While above will add attributes to last columns of CSV. For specific ordering, re-order the dictionaries:
data = [ ... ]
cols = ["indicator id", "indicator", "country id", "country", "date", "value", "unit", "obs_status", "decimal"]
data = [
{k: d[k] for k in cols} for d in data
]
with open("DummyXMLtoCSV.csv", "w", newline="") as f:
dw = DictWriter(f, fieldnames=cols)
dw.writeheader()
dw.writerows(data)
Output
indicator id,indicator,country id,country,date,value,unit,obs_status,decimal
AA.BB,"various, tests",MM,test again,2021,1234567,,,0
XX.YY,"testing, cases",DD,coverage test,2020,3456223,,,0

I am converting xml to csv for a client project & cannot get the conversion to work. I am using Python in the CMD

Any ideas why this one is not working??
The XML that is being converted (much longer than this)
<XML>
<ClinicalData StudyOID="XXXXXXXXX" MetaDataVersionOID="53" mdsol_AuditSubCategoryName="QueryAnswer">
<SubjectData SubjectKey="XXXXXXXX-b7cd-4f97-8d25-594219de192f" mdsol_SubjectKeyType="SubjectUUID" mdsol_SubjectName="XX-002">
<SiteRef LocationOID="15" XXXX_StudyEnvSiteNumber="15" />
<StudyEventData StudyEventOID="DAY1" StudyEventRepeatKey="DAY1[1]" mdsol_InstanceId="47077">
<FormData FormOID="SS_DISP" FormRepeatKey="1" mdsol_DataPageId="320656">
<ItemGroupData ItemGroupOID="SS_DISP" mdsol_RecordId="797737">
<ItemData ItemOID="SS_DISP.DISPDAT" TransactionType="Upsert">
<AuditRecord>
<UserRef UserOID="XXXX#XXXXX.com1" />
<LocationRef LocationOID="15" mdsol_StudyEnvSiteNumber="15" />
<DateTimeStamp>2022-01-28T05:27:54</DateTimeStamp>
<ReasonForChange>
</ReasonForChange>
<SourceID>12345678</SourceID>
</AuditRecord>
<mdsol_Query QueryRepeatKey="123456" Value="Date of XXXX does not equal the XXXY Date. Please review and correct else clarify." Status="Answered" Response="Issues with XXXXX IWRS XXXXXX" />
</ItemData>
</ItemGroupData>
</FormData>
</StudyEventData>
</SubjectData>
</ClinicalData>
</XML>
I am using this python script to do the conversion, or I am trying to. I am pretty new to this.
from xml.etree import ElementTree
tree = ElementTree.parse('xml.xml')
root = tree.getroot()
data = []
for ClinicalData in root:
StudyOID = getattr(child.find('StudyOID'), 'text', None)
MetaDataVersionOID = getattr(child.find('MetaDataVersionOID'), 'text', None)
mdsol_AuditSubCategoryName = getattr(child.find('mdsol_AuditSubCategoryName'), 'text', None)
SubjectKey = getattr(child.find('SubjectKey'), 'text', None)
#print('{}, {}, {}, {}'.format(StudyOID, MetaDataVersionOID, mdsol_AuditSubCategoryName, SubjectKey))
data.append('{}, {}, {}, {}'.format(StudyOID, MetaDataVersionOID, mdsol_AuditSubCategoryName, SubjectKey))
#print (data)
with open('output.csv', 'w') as f: f.write('\n'.join([row for row in data[1:]]))
The error message I get is as follows:
File "<stdin>", line 9
with open('output.csv', 'w') as f: f.write('\n'.join([row for row in data[1:]]))
^^^^
SyntaxError: invalid syntax
In the above,
you have the data list ("data") convert that into a pandas dataframe as below and write to csv
cols = [StudyOID, MetaDataVersionOID, mdsol_AuditSubCategoryName, SubjectKey]
df = pd.DataFrame(data, columns=cols)
# Writing dataframe to csv
df.to_csv('output.csv')

How to extract text from xml file using python

I'm trying to extract text data from this xml file but I don't know why my code not working. How do I get this phone number? Please have a look at this XML file and my code format as well.I'm trying to extract data from this tag Thank you in advance :)
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:voc="urn:hl7-org:v3/voc" xmlns:sdtc="urn:hl7-org:sdtc" xsi:schemaLocation="CDA.xsd">
<realmCode code="US"/>
<languageCode code="en-US"/>
<recordTarget>
<patientRole>
<addr use="HP">
<streetAddressLine>3345 Elm Street</streetAddressLine>
<city>Aurora</city>
<state>CO</state>
<postalCode>80011</postalCode>
<country>US</country>
</addr>
<telecom value="tel:+1(303)-554-8889" use="HP"/>
<patient>
<name use="L">
<given>Janson</given>
<given>J</given>
<family>Example</family>
</name>
</patient>
</patientRole>
</recordTarget>
</ClinicalDocument>
Here is my python code
import xml.etree.ElementTree as ET
tree = ET.parse('country.xml')
root = tree.getroot()
print(root)
for country in root.findall('patientRole'):
number = country.get('telecom')
print(number)
Your XML document has namespace specified, so it becomes something like:
for country in tree.findall('.//{urn:hl7-org:v3}patientRole'):
number = country.find('{urn:hl7-org:v3}telecom').attrib['value']
print(number)
Output:
tel:+1(303)-554-8889

XML not returning correct child tags/data in Python

Hello I am making a requests call to return order data from a online store. My issue is that once I have passed my data to a root variable the method iter is not returning the correct results. e.g. Display multiple tags of the same name rather than one and not showing the data within the tag.
I thought this was due to the XML not being correctly formatted so I formatted it by saving it to a file using pretty_print but that hasn't fixed the error.
How do I fix this? - Thanks in advance
Code:
import requests, xml.etree.ElementTree as ET, lxml.etree as etree
url="http://publicapi.ekmpowershop24.com/v1.1/publicapi.asmx"
headers = {'content-type': 'application/soap+xml'}
body = """<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
<soap12:Body>
<GetOrders xmlns="http://publicapi.ekmpowershop.com/">
<GetOrdersRequest>
<APIKey>my_api_key</APIKey>
<FromDate>01/07/2018</FromDate>
<ToDate>04/07/2018</ToDate>
</GetOrdersRequest>
</GetOrders>
</soap12:Body>
</soap12:Envelope>"""
#send request to ekm
r = requests.post(url,data=body,headers=headers)
#save output to file
file = open("C:/Users/Mark/Desktop/test.xml", "w")
file.write(r.text)
file.close()
#take the file and format the xml
x = etree.parse("C:/Users/Mark/Desktop/test.xml")
newString = etree.tostring(x, pretty_print=True)
file = open("C:/Users/Mark/Desktop/test.xml", "w")
file.write(newString.decode('utf-8'))
file.close()
#parse the file to get the roots
tree = ET.parse("C:/Users/Mark/Desktop/test.xml")
root = tree.getroot()
#access elements names in the data
for child in root.iter('*'):
print(child.tag)
#show orders elements attributes
tree = ET.parse("C:/Users/Mark/Desktop/test.xml")
root = tree.getroot()
for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
out = {}
for child in order:
if child.tag in ('OrderID'):
out[child.tag] = child.text
print(out)
Elements output:
{http://publicapi.ekmpowershop.com/}Orders
{http://publicapi.ekmpowershop.com/}Order
{http://publicapi.ekmpowershop.com/}OrderID
{http://publicapi.ekmpowershop.com/}OrderNumber
{http://publicapi.ekmpowershop.com/}CustomerID
{http://publicapi.ekmpowershop.com/}CustomerUserID
{http://publicapi.ekmpowershop.com/}Order
{http://publicapi.ekmpowershop.com/}OrderID
{http://publicapi.ekmpowershop.com/}OrderNumber
{http://publicapi.ekmpowershop.com/}CustomerID
{http://publicapi.ekmpowershop.com/}CustomerUserID
Orders Output:
{http://publicapi.ekmpowershop.com/}Order {}
{http://publicapi.ekmpowershop.com/}Order {}
XML Structure after formating:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetOrdersResponse xmlns="http://publicapi.ekmpowershop.com/">
<GetOrdersResult>
<Status>Success</Status>
<Errors/>
<Date>2018-07-10T13:47:00.1682029+01:00</Date>
<TotalOrders>10</TotalOrders>
<TotalCost>100</TotalCost>
<Orders>
<Order>
<OrderID>100</OrderID>
<OrderNumber>102/040718/67</OrderNumber>
<CustomerID>6910</CustomerID>
<CustomerUserID>204</CustomerUserID>
<FirstName>TestFirst</FirstName>
<LastName>TestLast</LastName>
<CompanyName>Test Company</CompanyName>
<EmailAddress>test#Test.com</EmailAddress>
<OrderStatus>Dispatched</OrderStatus>
<OrderStatusColour>#00CC00</OrderStatusColour>
<TotalCost>85.8</TotalCost>
<OrderDate>10/07/2018 14:30:43</OrderDate>
<OrderDateISO>2018-07-10T14:30:43</OrderDateISO>
<AbandonedOrder>false</AbandonedOrder>
<EkmStatus>SUCCESS</EkmStatus>
</Order>
</Orders>
<Currency>GBP</Currency>
</GetOrdersResult>
</GetOrdersResponse>
</soap:Body>
</soap:Envelope>
You need to consider the namespace when checking for tags.
>>> # Include the namespace part of the tag in the tag values that we check.
>>> tags = ('{http://publicapi.ekmpowershop.com/}OrderID', '{http://publicapi.ekmpowershop.com/}OrderNumber')
>>> for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
... out = {}
... for child in order:
... if child.tag in tags:
... out[child.tag] = child.text
... print(out)
...
{'{http://publicapi.ekmpowershop.com/}OrderID': '100', '{http://publicapi.ekmpowershop.com/}OrderNumber': '102/040718/67'}
If you don't want the namespace prefixes in the output, you can strip them by only including that part of the tag after the } character.
>>> for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
... out = {}
... for child in order:
... if child.tag in tags:
... out[child.tag[child.tag.index('}')+1:]] = child.text
... print(out)
...
{'OrderID': '100', 'OrderNumber': '102/040718/67'}

How to extract data from xml file that is deep down the tag

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
I am only interested in 'accelx' and 'accely' value in this data and need to create a csv out of it.
Update: The code given below breaks when I change the second row with the following. Nothing is displayed because of this;
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas">
The following code works:
import xml.etree.ElementTree as etree
tree = etree.parse(r"C:\Users\data.xml")
root = tree.getroot()
val_of_interest = root.findall("./Values/SensorValue")
for sensor_val in val_of_interest:
print sensor_val.find('accelx').text
print sensor_val.find('accely').text

Categories

Resources