Converting pandas dataframe to XML - python

I know this question has been asked before, and my last one was put on hold, so this time I am describing it in detail.
I have a CSV file of population information. I read it into pandas and now have to transform it to XML, for example like this:
<?xml version="1.0" encoding="utf-8"?>
<populationdata>
<municipality>
<name>
Akaa
</name>
<year>
2014
</year>
<total>
17052
......
This is the reading part of my code:
import pandas as pd
pop = pd.read_csv(r'''directory\population.csv''', delimiter=";")
I tried doing it with a function and a loop, as described in the linked question How do convert a pandas/dataframe to XML?, but haven't succeeded. Are there any other recommendations?
This is an example of my dataframe:
Alahärmä 2014 0 0.1 0.2
0 Alajärvi 2014 10171 5102 5069
1 Alastaro 2014 0 0 0
2 Alavieska 2014 2687 1400 1287
3 Alavus 2014 12103 6102 6001
4 Anjalankoski 2014 0 0 0
I'm fairly new to Python, so any help is appreciated.

The question you linked to actually has a great answer, but I guess you're having difficulty adapting your data to that solution, so I've done it below for you.
Your question is a bit short on detail, so if your situation differs slightly you'll need to tweak my answer, but here's something that works for me.
First, assuming you have a text file as follows:
0 Alahärmä 2014 0 0.1 0.2
1 Alajärvi 2014 10171 5102 5069
2 Alastaro 2014 0 0 0
3 Alavieska 2014 2687 1400 1287
4 Alavus 2014 12103 6102 6001
5 Anjalankoski 2014 0 0 0
Moving on to creating the python script, we first import that text file using the following line:
pop = pd.read_csv(r'directory\population.csv', delimiter=r"\s+", names=['cityname', 'year', 'total', 'male', 'females'])
This brings in the text file as a dataframe and gives the new dataframe the correct column headers.
Then taking the data from the question you linked to, we add the following to our python script:
def func(row):
    xml = ['<item>']
    for field in row.index:
        xml.append(' <field name="{0}">{1}</field>'.format(field, row[field]))
    xml.append('</item>')
    return '\n'.join(xml)
print('\n'.join(pop.apply(func, axis=1)))
Now we put it all together and we get the below:
import pandas as pd
pop = pd.read_csv(r'directory\population.csv', delimiter=r"\s+", names=['cityname', 'year', 'total', 'male', 'females'])
def func(row):
    xml = ['<item>']
    for field in row.index:
        xml.append(' <field name="{0}">{1}</field>'.format(field, row[field]))
    xml.append('</item>')
    return '\n'.join(xml)
print('\n'.join(pop.apply(func, axis=1)))
When we run the above file we get the following output:
<item>
<field name="cityname">Alahärmä</field>
<field name="year">2014</field>
<field name="total">0</field>
<field name="male">0.1</field>
<field name="females">0.2</field>
</item>
<item>
<field name="cityname">Alajärvi</field>
<field name="year">2014</field>
<field name="total">10171</field>
<field name="male">5102.0</field>
<field name="females">5069.0</field>
</item>
<item>
<field name="cityname">Alastaro</field>
<field name="year">2014</field>
<field name="total">0</field>
<field name="male">0.0</field>
<field name="females">0.0</field>
</item>
<item>
<field name="cityname">Alavieska</field>
<field name="year">2014</field>
<field name="total">2687</field>
<field name="male">1400.0</field>
<field name="females">1287.0</field>
</item>
<item>
<field name="cityname">Alavus</field>
<field name="year">2014</field>
<field name="total">12103</field>
<field name="male">6102.0</field>
<field name="females">6001.0</field>
</item>
<item>
<field name="cityname">Anjalankoski</field>
<field name="year">2014</field>
<field name="total">0</field>
<field name="male">0.0</field>
<field name="females">0.0</field>
</item>
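The output above uses generic item/field elements, while the question asked for nested tags such as populationdata, municipality and name. A minimal sketch of that shape follows, using only the standard library. The sample rows and the helper name rows_to_xml are assumptions for illustration; with pandas, a list of row dicts like this could come from pop.to_dict("records"). Column names must be valid XML tag names for this to work.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample rows; with pandas these could come from pop.to_dict("records").
rows = [
    {"name": "Akaa", "year": 2014, "total": 17052},
    {"name": "Alajärvi", "year": 2014, "total": 10171},
]

def rows_to_xml(rows, root_tag="populationdata", row_tag="municipality"):
    """Build one child element per column for every row."""
    root = ET.Element(root_tag)
    for row in rows:
        parent = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # ElementTree escapes characters such as & and < when serializing.
            ET.SubElement(parent, column).text = str(value)
    return root

root = rows_to_xml(rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Building the tree with ElementTree instead of string formatting means special characters in city names are escaped automatically.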

Related

How can I extract elementary values with ElementTree in Python?

I am trying to extract the text of fields with a given attribute value (e.g. 'Filename') from this XML file in Python.
Can you help me?
Here is the MC 'Librarytest.xml' file:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<MPL Version="2.0" Title="Library">
<Item>
<Field Name="Filename">Y:\Styx\08 - Styx - Snowblind8. Snowblind.flac</Field>
<Field Name="Name">Snowblind</Field>
<Field Name="Artist">Styx</Field>
<Field Name="Album">Paradise Theater</Field>
<Field Name="Genre">Rock</Field>
</Item>
<Item>
<Field Name="Filename">Y:\David Gilmour\04 A Boat Lies Waiting.flac</Field>
<Field Name="Name">A Boat Lies Waiting</Field>
<Field Name="Artist">David Gilmour</Field>
<Field Name="Album">Rattle That Lock (Deluxe)</Field>
<Field Name="Genre">Progressive</Field>
</Item>
</MPL>
I tried this:
import xml.etree.ElementTree as ET
xml_file = 'C:/Users/ClientMD/Downloads/MC Librarytest.xml'
tree = ET.parse(xml_file)
root = tree.getroot()
for each in root.findall('.//Field'):
    rating = each.find('.//Filename')
    print('Nothing' if rating is None else rating.text)
and I get:
Nothing
...
Nothing
Like this:
import xml.etree.ElementTree as ET
xml_file = 'C:/Users/ClientMD/Downloads/MC Librarytest.xml'
tree = ET.parse(xml_file)
root = tree.getroot()
for each in root.findall('.//Field[@Name="Filename"]'):
    rating = each.text
    print('Nothing' if rating is None else rating)
Output
Y:\Styx\08 - Styx - Snowblind8. Snowblind.flac
Y:\David Gilmour\04 A Boat Lies Waiting.flac
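The original attempt printed Nothing because 'Filename' is the value of the Name attribute, not a tag name, so a tag search cannot match it. The sketch below contrasts the two lookups on a shortened, hypothetical sample (the file path in it is made up for illustration).

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the same shape as the library file.
xml = r"""<MPL Version="2.0" Title="Library">
<Item>
<Field Name="Filename">Y:\Styx\Snowblind.flac</Field>
<Field Name="Artist">Styx</Field>
</Item>
</MPL>"""
root = ET.fromstring(xml)

# 'Filename' is an attribute value, so a search by tag name finds nothing.
by_tag = root.findall(".//Filename")

# Filter on the attribute instead, here with Element.get.
by_attribute = [f.text for f in root.iter("Field") if f.get("Name") == "Filename"]
print(by_tag, by_attribute)
```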
If you want to grab more elements and keep them grouped per item, you can use the code below:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<MPL Version="2.0" Title="Library">
<Item>
<Field Name="Filename">Y:\Styx\\08 - Styx - Snowblind8. Snowblind.flac</Field>
<Field Name="Name">Snowblind</Field>
<Field Name="Artist">Styx</Field>
<Field Name="Album">Paradise Theater</Field>
<Field Name="Genre">Rock</Field>
</Item>
<Item>
<Field Name="Filename">Y:\David Gilmour\\04 A Boat Lies Waiting.flac</Field>
<Field Name="Name">A Boat Lies Waiting</Field>
<Field Name="Artist">David Gilmour</Field>
<Field Name="Album">Rattle That Lock (Deluxe)</Field>
<Field Name="Genre">Progressive</Field>
</Item>
</MPL>'''
INTERESTING_NAMES = ['Filename','Artist']
data = []
root = ET.fromstring(xml)
for item in root.findall('.//Item'):
    temp = {}
    for name in INTERESTING_NAMES:
        temp[name] = item.find(f'Field[@Name="{name}"]').text
    data.append(temp)
print(data)
output
[{'Filename': 'Y:\\Styx\\08 - Styx - Snowblind8. Snowblind.flac', 'Artist': 'Styx'}, {'Filename': 'Y:\\David Gilmour\\04 A Boat Lies Waiting.flac', 'Artist': 'David Gilmour'}]
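One caveat with calling .text on the result of find: if an item lacks one of the requested fields, find returns None and the attribute access raises AttributeError. A possible variation, sketched here on a made-up sample where the second item has no Artist field, uses findtext with a default value instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second item has no Artist field.
xml = """<MPL Version="2.0" Title="Library">
<Item>
<Field Name="Name">Snowblind</Field>
<Field Name="Artist">Styx</Field>
</Item>
<Item>
<Field Name="Name">A Boat Lies Waiting</Field>
</Item>
</MPL>"""

WANTED = ["Name", "Artist"]
root = ET.fromstring(xml)
data = []
for item in root.findall("Item"):
    # findtext returns the default instead of raising when no element matches.
    record = {
        name: item.findtext(f'Field[@Name="{name}"]', default=None)
        for name in WANTED
    }
    data.append(record)
print(data)
```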

xml.etree.ElementTree access subelement without creating

I have this code:
ffdata = ET.Element("FFData")
fForm = ET.SubElement(ffdata, "Form")
fForm.set("FormDefId","{DD0F88DD-A858-4595-AF2F-3643D0069A39}")
fPages = ET.SubElement(fForm, "Pages")
for xml_file in xml_files:
    xml_file = os.path.join(*[CurrentFolderPath,xml_file])
    tree = ET.parse(xml_file)
    xml_data = tree.getroot()
    for xPage in xml_data.iter('Page'):
        # --- Ignore first element
        if int(xPage.attrib['PageNumber']) >1:
            #---- Change Paginators index
            xPage.set('PageNumber',str(sPageNumber))
            # -- Set page number to fields
            fFields = ET.SubElement(xPage, "Fields")
            fxField = ET.SubElement(fFields, "Field")
            fxField.set('PageNumber',str(sPageNumber-1))
            fPages.append(xPage) # Add element to root
            sPageNumber= sPageNumber +1
        else:
            if sImoneExists == 0:
                fPages.append(xPage) # Add element to root
                sImoneExists = 1
fPages.set("Count",str(sPageNumber-1))
indent(ffdata)
tree = ET.ElementTree(ffdata)
xml_file_save = os.path.join(*[CurrentFolderPath,"Merged.ffdata"])
tree.write(xml_file_save)
I am trying to change a sub-element inside the loop:
fFields = ET.SubElement(xPage, "Fields")
fxField = ET.SubElement(fFields, "Field")
fxField.set('PageNumber',str(sPageNumber-1))
But it creates a new element instead of changing the existing one,
so I get:
<FFData>
<Form FormDefId="{DD0F88DD-A858-4595-AF2F-3643D0069A39}">
<Pages Count="41">
<Page PageDefName="1" PageNumber="2">
<Fields Count="135">
<Field Name="L1-1"></Field>
<Field Name="PageNumber">1</Field>
</Fields>
<Fields>
<Field PageNumber="2" />
</Fields>
</Page>
</Pages>
</Form>
</FFData>
Expected:
<FFData>
<Form FormDefId="{DD0F88DD-A858-4595-AF2F-3643D0069A39}">
<Pages Count="41">
<Page PageDefName="1" PageNumber="2">
<Fields Count="135">
<Field Name="L1-1"></Field>
<Field Name="PageNumber">2</Field>
</Fields>
</Page>
</Pages>
</Form>
</FFData>
So how can I change the existing sub-element of each page while iterating?
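One possible approach, offered as a sketch rather than a confirmed fix for the full merge script: ET.SubElement always creates a new child, whereas find returns an element that already exists, which can then be modified in place. The single page below is a made-up sample shaped like the output in the question, and new_number is a stand-in for the running page counter.

```python
import xml.etree.ElementTree as ET

# Hypothetical single page, shaped like the output in the question.
page = ET.fromstring(
    '<Page PageDefName="1" PageNumber="2">'
    '<Fields Count="135">'
    '<Field Name="L1-1"></Field>'
    '<Field Name="PageNumber">1</Field>'
    '</Fields>'
    '</Page>'
)

new_number = 2  # stand-in for the running page counter

# find() returns the existing element (or None); SubElement() always creates one.
field = page.find("Fields/Field[@Name='PageNumber']")
if field is not None:
    field.text = str(new_number)

print(ET.tostring(page, encoding="unicode"))
```

Inside the loop from the question, the same find call on xPage would replace the two SubElement calls.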

complex xml to csv using python [closed]

<app>
<doc>
<field name="id">013</field>
<field name="groupid">013</field>
<field name="img_url">8b4</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest</field>
</doc>
<doc>
<field name="id">0131</field>
<field name="groupid">013</field>
<field name="img_url">8b</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest1</field>
<field name="topic">biggest2</field>
<field name="topic">biggest3</field>
</doc>
</app>
I have XML similar to this and need to convert it to CSV in Python. Does anyone know how to do it? The number of topic fields differs from doc to doc. The CSV headers should match the field names, and all topics of a doc should go into a single cell, separated by commas.
You could use an XML parser that emits element data as it parses to build the csv. On every end tag, you could either add a value to the row or write the row itself. One advantage of iterparse is that you don't need to load the entire document into memory before processing.
import xml.etree.ElementTree as ET
import io
import csv
field_names = ["id", "groupid", "img_url", "filetype", "url", "topic"]
field_names_set = set(field_names)
with open("test.csv", "w", newline="") as out_file:
    writer = csv.DictWriter(out_file, field_names)
    writer.writeheader()
    row = {}
    topic = []
    for event, elem in ET.iterparse("test.xml"): # iterate tag end events
        if elem.tag == "doc":
            # doc elem end, write row to csv and setup for next
            row["topic"] = ",".join(topic)
            writer.writerow(row)
            row = {}
            topic = []
        elif elem.tag == "field":
            # field elem end, add to current row
            if elem.attrib["name"] == "topic":
                topic.append(elem.text)
            else:
                row[elem.attrib["name"]] = elem.text
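A note on the memory point: iterparse still attaches every parsed element to the tree it builds, so for very large files it can help to clear each doc once its row has been written. The sketch below shows that variation on a small in-memory stand-in for test.xml, reduced to two columns; the sample data is made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Small in-memory stand-in for test.xml (hypothetical, shortened data).
source = io.BytesIO(
    b"<app>"
    b'<doc><field name="id">013</field>'
    b'<field name="topic">accurate</field><field name="topic">area</field></doc>'
    b'<doc><field name="id">0131</field><field name="topic">biggest1</field></doc>'
    b"</app>"
)
out = io.StringIO()
writer = csv.DictWriter(out, ["id", "topic"])
writer.writeheader()

row, topics = {}, []
for event, elem in ET.iterparse(source, events=("end",)):
    if elem.tag == "field":
        if elem.get("name") == "topic":
            topics.append(elem.text)
        else:
            row[elem.get("name")] = elem.text
    elif elem.tag == "doc":
        row["topic"] = ",".join(topics)
        writer.writerow(row)
        row, topics = {}, []
        elem.clear()  # drop the finished doc's children to release memory

print(out.getvalue())
```

The csv module quotes the joined topics automatically, so they stay in a single cell even though they contain commas.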
The code below creates CSV-like output. Is that what you are looking for?
Note that in the output you can't tell which values are topics and which are not.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<app>
<doc>
<field name="id">013</field>
<field name="groupid">013</field>
<field name="img_url">8b4</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest</field>
</doc>
<doc>
<field name="id">0131</field>
<field name="groupid">013</field>
<field name="img_url">8b</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest1</field>
<field name="topic">biggest2</field>
<field name="topic">biggest3</field>
</doc>
</app>'''
root = ET.fromstring(xml)
first_time = True
headers = set()
for doc in root.findall('.//doc'):
    data = []
    for field in doc.findall('field'):
        if first_time:
            headers.add(field.attrib['name'])
        data.append((field.attrib['name'], field.text))
    if first_time:
        print(','.join(sorted(list(headers))))
        first_time = False
    print(','.join(y[1] for y in sorted(data, key=lambda x: x[0])))
output
filetype,groupid,id,img_url,topic,url
HTML,013,013,8b4,accurate,additional,agriculture,area,biggest,https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/
HTML,013,0131,8b,accurate,additional,agriculture,area,biggest1,biggest2,biggest3,https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward
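In the output above the topics spill into separate columns, so the rows no longer line up with the header. One way to keep them in a single cell while still discovering the headers from the data is to group values by field name and let the csv module handle the quoting. This is a sketch on shortened, made-up sample data, not the answer author's method.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, shortened sample; the second doc has an extra field.
xml = """<app>
<doc><field name="id">013</field><field name="topic">accurate</field><field name="topic">area</field></doc>
<doc><field name="id">0131</field><field name="filetype">HTML</field><field name="topic">biggest1</field></doc>
</app>"""

rows, headers = [], []
for doc in ET.fromstring(xml).findall("doc"):
    grouped = defaultdict(list)
    for field in doc.findall("field"):
        name = field.get("name")
        grouped[name].append(field.text or "")
        if name not in headers:
            headers.append(name)  # keep first-seen column order
    # Repeated names (such as topic) are joined into one comma-separated cell.
    rows.append({name: ",".join(values) for name, values in grouped.items()})

out = io.StringIO()
writer = csv.DictWriter(out, headers, restval="")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```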

Customizing dictionary conversion with dicttoxml?

I want to convert a dictionary to XML in Python, and I am using dicttoxml for that. My code looks like this:
>>> import os
>>> filepath=os.path.normpath('C:\\Users\\\\Desktop\\abc\\bc.txt')
>>> with open(filepath,'r') as f:
...     for line in f:
...         x=line.split(':')
...         x[-1]=x[-1].strip()
...         a=x[0]
...         b=x[1]
...         d[a]=b
...     xml = dicttoxml(d,custom_root='doc',attr_type=False)
...     xml2 = parseString(xml)
...     xml3 = xml2.toprettyxml()
...     print( xml3 )
The output is like this:
<?xml version="1.0" ?>
<doc>
<key name="product/productId">B000GKXY34</key>
<key name="product/title">Nun Chuck, Novelty Nun Toss Toy</key>
<key name="product/price">17.99</key>
<key name="review/userId">ADX8VLDUOL7BG</key>
<key name="review/profileName">M. Gingras</key>
<key name="review/helpfulness">0/0</key>
<key name="review/score">5.0</key>
<key name="review/time">1262304000</key>
<key name="review/summary">Great fun!</key>
<key name="review/text">Got these last Christmas as a gag gift. They are great fun, but obviously this is not a toy that lasts!</key>
</doc>
But I want the elements to be named field instead of key:
<field name="product/productId">B000GKXY34</field>
This is the dictionary generated by the code:
{'product/productId': 'B000GKXY34', 'product/title': 'Nun Chuck, Novelty Nun Toss Toy', 'product/price': '17.99', 'review/userId': 'ADX8VLDUOL7BG', 'review/profileName': 'M. Gingras', 'review/helpfulness': '0/0', 'review/score': '5.0', 'review/time': '1262304000', 'review/summary': 'Great fun!', 'review/text': 'Got these last Christmas as a gag gift. They are great fun, but obviously this is not a toy that lasts!'}
I also want to write that XML to a new file. I am trying with the write function, but it's not working:
with open ('filename','w') as output:
    output.write(xml3)
According to dicttoxml documentation:
Define Custom Item Names
Starting in version 1.7, if you don’t want item elements in a list to be called ‘item’, you can specify the element name using a function that takes the parent element name (i.e. the list name) as an argument.
>>> import dicttoxml
>>> obj = {u'mylist': [u'foo', u'bar', u'baz'], u'mydict': {u'foo': u'bar', u'baz': 1}, u'ok': True}
>>> my_item_func = lambda x: 'list_item'
>>> xml = dicttoxml.dicttoxml(obj, item_func=my_item_func)
>>> print(xml)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<mydict type="dict">
<foo type="str">bar</foo>
<baz type="int">1</baz>
</mydict>
<mylist type="list">
<list_item type="str">foo</list_item>
<list_item type="str">bar</list_item>
<list_item type="str">baz</list_item>
</mylist>
<ok type="bool">True</ok>
</root>
According to the documentation, we have to pass a function that returns the custom element name for the XML.
Reference - dicttoxml github
>>> import os
>>> filepath=os.path.normpath('C:\\Users\\\\Desktop\\abc\\bc.txt')
>>> with open(filepath,'r') as f:
...     for line in f:
...         x=line.split(':')
...         x[-1]=x[-1].strip()
...         a=x[0]
...         b=x[1]
...         d[a]=b
...     returnfield = lambda x: 'field'
...     xml = dicttoxml(d,custom_root='doc',attr_type=False, item_func=returnfield)
...     xml2 = parseString(xml)
...     xml3 = xml2.toprettyxml()
...     print( xml3 )
Output is :
<?xml version="1.0" ?>
<doc>
<field name="product/productId">B000GKXY34</field>
<field name="product/title">Nun Chuck, Novelty Nun Toss Toy</field>
<field name="product/price">17.99</field>
<field name="review/userId">ADX8VLDUOL7BG</field>
<field name="review/profileName">M. Gingras</field>
<field name="review/helpfulness">0/0</field>
<field name="review/score">5.0</field>
<field name="review/time">1262304000</field>
<field name="review/summary">Great fun!</field>
<field name="review/text">Got these last Christmas as a gag gift. They are great fun, but obviously this is not a toy that lasts!</field>
</doc>
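If item_func does not rename these elements in your dicttoxml version, an alternative is to build the document directly with the standard library. The sketch below does not use dicttoxml at all; it creates one field element per dictionary entry with ElementTree and also writes the result to a file, which the question asked about. The shortened dictionary and the output file name are assumptions for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.dom.minidom import parseString

# Shortened version of the dictionary from the question.
d = {"product/productId": "B000GKXY34", "product/price": "17.99"}

root = ET.Element("doc")
for key, value in d.items():
    # The dictionary key goes into the name attribute, the value into the text.
    ET.SubElement(root, "field", name=key).text = value

pretty = parseString(ET.tostring(root)).toprettyxml()

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "output.xml")
with open(out_path, "w", encoding="utf-8") as output:
    output.write(pretty)
print(pretty)
```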

Parsing name/value pairs from XML

I am trying to pull account details from XML files supplied by vendors.
I have one vendor that supplied XML files like:
<Accounts>
<Account>
<AccountNumber>1234567</AccountNumber>
<Balance>$200.00</Balance>
</Account>
<Account>
...
</Account>
</Accounts>
And I can parse this fairly easily using python:
mytree = et.parse(xml_path)
myroot = mytree.getroot()
for acc in myroot.findall('Account'):
    acctnum = acc.find('AccountNumber').text
    balance = acc.find('Balance').text
    print(acctnum, balance)
Which outputs like this:
1234567 $200.00
However another vendor supplies the XML files in something more like name/value pairs, and I am unsure how to easily access that data. It doesn't work the same way as above:
<Accounts>
<Account>
<field name='AccountNumber' value='1234567' />
<field name='Balance' value='$200.00' />
</Account>
<Account>
...
</Account>
</Accounts>
So far I've got this, but would like to be able to access the values separately and easily:
mytree = et.parse(xml_path)
myroot = mytree.getroot()
for field in myroot.findall('Account'):
    for line in field:
        print(line.attrib)
Which outputs something like:
{'name': 'AccountNumber', 'value': '1234567'}
{'name': 'Balance', 'value': '$200.00'}
So my question is this - How can I access the values and assign them to variables (based on the name) so that I can make use of them elsewhere in the script, like I have with acctnum and balance in the first example?
Populate a new data structure (like a dict) from the fields as you iterate, instead of just printing and discarding them:
account_d = {}
for field in myroot.findall('Account'):
    for line in field:
        account_d[line.attrib['name']] = line.attrib['value']
# account_d should now be:
# { 'AccountNumber': '1234567', 'Balance': '$200.00' }
You can use a list of tuples too:
account_a = []
for field in myroot.findall('Account'):
    for line in field:
        account_a.append((line.attrib['name'], line.attrib['value']))
# account_a should now be:
# [('AccountNumber', '1234567'), ('Balance', '$200.00')]
ElementTree 1.3 has the ability to locate nodes with particular attributes:
from xml.etree import ElementTree as et
data = '''\
<Accounts>
<Account>
<field name='AccountNumber' value='1234567' />
<field name='Balance' value='$200.00' />
</Account>
<Account>
<field name='AccountNumber' value='9999999' />
<field name='Balance' value='$300.00' />
</Account>
</Accounts>'''
tree = et.fromstring(data)
for acc in tree.iterfind('Account'):
    acctnum = acc.find("field[@name='AccountNumber']").attrib['value']
    balance = acc.find("field[@name='Balance']").attrib['value']
    print(acctnum,balance)
1234567 $200.00
9999999 $300.00
You can do it by collecting all the Account element's field attributes into a dictionary and then using the information in it as needed:
accounts.xml sample input file:
<?xml version="1.0"?>
<Accounts>
<Account>
<field name='AccountNumber' value='1234567' />
<field name='Balance' value='$200.00' />
</Account>
<Account>
<field name='AccountNumber' value='8901234' />
<field name='Balance' value='$100.00' />
</Account>
</Accounts>
Code:
import xml.etree.ElementTree as et
xml_path = 'accounts.xml'
mytree = et.parse(xml_path)
myroot = mytree.getroot()
for acct in myroot.findall('Account'):
    info = {field.attrib['name']: field.attrib['value']
            for field in acct.findall('field')}
    acctnum, balance = info['AccountNumber'], info['Balance']
    print(acctnum, balance)
Result:
1234567 $200.00
8901234 $100.00
Question: How can I access the values and assign them to variables (based on the name)
Convert all Accounts to a Dict[AccountNumber] of Dict[field].
The Attribute name becomes the dict Key:
Accounts = {}
for account in root.findall('Account'):
    fields = {}
    for field in account.findall('field'):
        fields[field.attrib['name']] = field.attrib['value']
    print('{a[AccountNumber]} {a[Balance]}'.format(a=fields))
    Accounts[fields['AccountNumber']] = fields
print(Accounts)
Output:
1234567 $200.00
9999999 $300.00
{'9999999': {'AccountNumber': '9999999', 'Balance': '$300.00'}, '1234567': {'AccountNumber': '1234567', 'Balance': '$200.00'}}
Tested with Python: 3.4.2
