complex xml to csv using python [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
<app>
<doc>
<field name="id">013</field>
<field name="groupid">013</field>
<field name="img_url">8b4</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest</field>
</doc>
<doc>
<field name="id">0131</field>
<field name="groupid">013</field>
<field name="img_url">8b</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest1</field>
<field name="topic">biggest2</field>
<field name="topic">biggest3</field>
</doc>
</app>
I have XML similar to this and need to convert it to CSV in Python. Does anyone know how to do it? The number of topic fields differs between doc elements. The CSV headers should match the field names, and all topics for a doc should go into a single cell, comma-separated.

You could use an XML parser that emits element data as it parses to build the csv. On every end tag, you could either add a value to the row or write the row itself. One advantage of iterparse is that you don't need to load the entire document into memory before processing.
import xml.etree.ElementTree as ET
import csv

field_names = ["id", "groupid", "img_url", "filetype", "url", "topic"]

with open("test.csv", "w", newline="") as out_file:
    writer = csv.DictWriter(out_file, field_names)
    writer.writeheader()
    row = {}
    topic = []
    for event, elem in ET.iterparse("test.xml"):  # iterate tag end events
        if elem.tag == "doc":
            # doc elem end, write row to csv and setup for next
            row["topic"] = ",".join(topic)
            writer.writerow(row)
            row = {}
            topic = []
        elif elem.tag == "field":
            # field elem end, add to current row
            if elem.attrib["name"] == "topic":
                topic.append(elem.text)
            else:
                row[elem.attrib["name"]] = elem.text

The code below creates CSV-like output. Is that what you are looking for?
Note that in this output you can't tell which values are topics and which are not.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<app>
<doc>
<field name="id">013</field>
<field name="groupid">013</field>
<field name="img_url">8b4</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest</field>
</doc>
<doc>
<field name="id">0131</field>
<field name="groupid">013</field>
<field name="img_url">8b</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest1</field>
<field name="topic">biggest2</field>
<field name="topic">biggest3</field>
</doc>
</app>'''
root = ET.fromstring(xml)
first_time = True
headers = set()
for doc in root.findall('.//doc'):
    data = []
    for field in doc.findall('field'):
        if first_time:
            headers.add(field.attrib['name'])
        data.append((field.attrib['name'], field.text))
    if first_time:
        print(','.join(sorted(list(headers))))
        first_time = False
    print(','.join(y[1] for y in sorted(data, key=lambda x: x[0])))
output
filetype,groupid,id,img_url,topic,url
HTML,013,013,8b4,accurate,additional,agriculture,area,biggest,https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/
HTML,013,0131,8b,accurate,additional,agriculture,area,biggest1,biggest2,biggest3,https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward
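One way around that limitation, sketched here under the assumption that any repeated field name should be joined into a single cell: collect the values per name and let the csv module quote the joined cell. The helper name docs_to_csv and the shortened sample string are illustrative and not part of the answers above.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened sample in the same shape as the question's XML (assumption).
SAMPLE = """<app>
<doc>
<field name="id">013</field>
<field name="filetype">HTML</field>
<field name="topic">accurate</field>
<field name="topic">area</field>
</doc>
<doc>
<field name="id">0131</field>
<field name="filetype">HTML</field>
<field name="topic">biggest1</field>
</doc>
</app>"""


def docs_to_csv(xml_text, columns):
    """Return CSV text with one row per <doc>; repeated names share a cell."""
    out = io.StringIO()
    writer = csv.DictWriter(out, columns, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for doc in ET.fromstring(xml_text).iter("doc"):
        values = defaultdict(list)
        for field in doc.findall("field"):
            values[field.get("name")].append(field.text or "")
        # Join repeats with commas; csv quotes any cell that contains a comma.
        writer.writerow({name: ",".join(vals) for name, vals in values.items()})
    return out.getvalue()


print(docs_to_csv(SAMPLE, ["id", "filetype", "topic"]))
```

Because the topic cell is quoted whenever it contains a comma, a CSV reader sees it as one column rather than several.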

Related

How can I extract elementary values with ElementTree in Python?

I am trying to extract the values of Field elements by their Name attribute (e.g. 'Filename') from this XML file in Python.
Can you help me?
Here is the MC 'Librarytest.xml' file:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<MPL Version="2.0" Title="Library">
<Item>
<Field Name="Filename">Y:\Styx\08 - Styx - Snowblind8. Snowblind.flac</Field>
<Field Name="Name">Snowblind</Field>
<Field Name="Artist">Styx</Field>
<Field Name="Album">Paradise Theater</Field>
<Field Name="Genre">Rock</Field>
</Item>
<Item>
<Field Name="Filename">Y:\David Gilmour\04 A Boat Lies Waiting.flac</Field>
<Field Name="Name">A Boat Lies Waiting</Field>
<Field Name="Artist">David Gilmour</Field>
<Field Name="Album">Rattle That Lock (Deluxe)</Field>
<Field Name="Genre">Progressive</Field>
</Item>
</MPL>
I tried this:
import xml.etree.ElementTree as ET
xml_file = 'C:/Users/ClientMD/Downloads/MC Librarytest.xml'
tree = ET.parse(xml_file)
root = tree.getroot()
for each in root.findall('.//Field'):
    rating = each.find('.//Filename')
    print('Nothing' if rating is None else rating.text)
and I get:
Nothing
...
Nothing
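Filename is the value of the Name attribute, not a tag name, so find('.//Filename') matches nothing. Select on the attribute with a predicate, like this:
import xml.etree.ElementTree as ET
xml_file = 'C:/Users/ClientMD/Downloads/MC Librarytest.xml'
tree = ET.parse(xml_file)
root = tree.getroot()
for each in root.findall('.//Field[@Name="Filename"]'):
    rating = each.text
    print('Nothing' if rating is None else rating)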
Output
Y:\Styx\08 - Styx - Snowblind8. Snowblind.flac
Y:\David Gilmour\04 A Boat Lies Waiting.flac
If you want to grab more fields and keep them grouped per Item, you can use the code below:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<MPL Version="2.0" Title="Library">
<Item>
<Field Name="Filename">Y:\Styx\\08 - Styx - Snowblind8. Snowblind.flac</Field>
<Field Name="Name">Snowblind</Field>
<Field Name="Artist">Styx</Field>
<Field Name="Album">Paradise Theater</Field>
<Field Name="Genre">Rock</Field>
</Item>
<Item>
<Field Name="Filename">Y:\David Gilmour\\04 A Boat Lies Waiting.flac</Field>
<Field Name="Name">A Boat Lies Waiting</Field>
<Field Name="Artist">David Gilmour</Field>
<Field Name="Album">Rattle That Lock (Deluxe)</Field>
<Field Name="Genre">Progressive</Field>
</Item>
</MPL>'''
INTERESTING_NAMES = ['Filename', 'Artist']
data = []
root = ET.fromstring(xml)
for item in root.findall('.//Item'):
    temp = {}
    for name in INTERESTING_NAMES:
        temp[name] = item.find(f'Field[@Name="{name}"]').text
    data.append(temp)
print(data)
output
[{'Filename': 'Y:\\Styx\\08 - Styx - Snowblind8. Snowblind.flac', 'Artist': 'Styx'}, {'Filename': 'Y:\\David Gilmour\\04 A Boat Lies Waiting.flac', 'Artist': 'David Gilmour'}]
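If the set of names is not known in advance, a variation (a sketch, not taken from the answer above) is to collect every Field of an Item into a dict keyed by its Name attribute. The trimmed sample and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's data; a raw string keeps the backslashes literal.
SAMPLE = r'''<MPL Version="2.0" Title="Library">
<Item>
<Field Name="Filename">Y:\Styx\08 - Styx - Snowblind8. Snowblind.flac</Field>
<Field Name="Artist">Styx</Field>
</Item>
<Item>
<Field Name="Filename">Y:\David Gilmour\04 A Boat Lies Waiting.flac</Field>
<Field Name="Artist">David Gilmour</Field>
</Item>
</MPL>'''

root = ET.fromstring(SAMPLE)
# One dict per Item, mapping each Field's Name attribute to its text.
items = [
    {field.get("Name"): field.text for field in item.findall("Field")}
    for item in root.findall("Item")
]
for entry in items:
    # dict.get() returns None instead of raising if an Item lacks that field
    print(entry.get("Artist"), "|", entry.get("Filename"))
```

Using dict.get() here avoids the AttributeError that item.find(...).text raises when an Item has no matching Field.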

xml.etree.ElementTree access subelement without creating

I have this code:
ffdata = ET.Element("FFData")
fForm = ET.SubElement(ffdata, "Form")
fForm.set("FormDefId","{DD0F88DD-A858-4595-AF2F-3643D0069A39}")
fPages = ET.SubElement(fForm, "Pages")
for xml_file in xml_files:
    xml_file = os.path.join(*[CurrentFolderPath,xml_file])
    tree = ET.parse(xml_file)
    xml_data = tree.getroot()
    for xPage in xml_data.iter('Page'):
        # --- Ignore first element
        if int(xPage.attrib['PageNumber']) >1:
            #---- Change Paginators index
            xPage.set('PageNumber',str(sPageNumber))
            # -- Set page number to fields
            fFields = ET.SubElement(xPage, "Fields")
            fxField = ET.SubElement(fFields, "Field")
            fxField.set('PageNumber',str(sPageNumber-1))
            fPages.append(xPage) # Add element to root
            sPageNumber= sPageNumber +1
        else:
            if sImoneExists == 0:
                fPages.append(xPage) # Add element to root
                sImoneExists = 1
fPages.set("Count",str(sPageNumber-1))
indent(ffdata)
tree = ET.ElementTree(ffdata)
xml_file_save = os.path.join(*[CurrentFolderPath,"Merged.ffdata"])
tree.write(xml_file_save)
I am trying to change an existing sub-element inside the loop:
fFields = ET.SubElement(xPage, "Fields")
fxField = ET.SubElement(fFields, "Field")
fxField.set('PageNumber',str(sPageNumber-1))
But this creates a new element instead of changing the existing one,
so I get:
<FFData>
<Form FormDefId="{DD0F88DD-A858-4595-AF2F-3643D0069A39}">
<Pages Count="41">
<Page PageDefName="1" PageNumber="2">
<Fields Count="135">
<Field Name="L1-1"></Field>
<Field Name="PageNumber">1</Field>
</Fields>
<Fields>
<Field PageNumber="2" />
</Fields>
</Page>
</Pages>
</Form>
</FFData>
Expected:
<FFData>
<Form FormDefId="{DD0F88DD-A858-4595-AF2F-3643D0069A39}">
<Pages Count="41">
<Page PageDefName="1" PageNumber="2">
<Fields Count="135">
<Field Name="L1-1"></Field>
<Field Name="PageNumber">2</Field>
</Fields>
</Page>
</Pages>
</Form>
</FFData>
So how can I change the existing sub-element of each page as I iterate?
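A minimal sketch of one approach, not taken from an answer in this thread: ET.SubElement always creates a new child, whereas find() returns the element that is already there, so its text can be changed in place. The helper name set_page_number_field and the inline sample page are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for one <Page> from the question (assumption).
page = ET.fromstring(
    '<Page PageDefName="1" PageNumber="2">'
    '<Fields Count="135">'
    '<Field Name="L1-1"></Field>'
    '<Field Name="PageNumber">1</Field>'
    '</Fields>'
    '</Page>'
)


def set_page_number_field(page, number):
    """Update the existing Field named "PageNumber"; never create a new one."""
    field = page.find('Fields/Field[@Name="PageNumber"]')
    if field is not None:
        field.text = str(number)
    return field


set_page_number_field(page, 2)
print(ET.tostring(page, encoding="unicode"))
```

In the loop from the question, a call like this would take the place of the two ET.SubElement lines, assuming every page already contains such a Field.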

Converting pandas dataframe to XML

I know this question has been asked before, and my last one was put on hold, so now I'm describing it in detail.
I have a CSV file of population information. I read it into pandas and now have to transform it to XML, for example like this:
<?xml version="1.0" encoding="utf-8"?>
<populationdata>
<municipality>
<name>
Akaa
</name>
<year>
2014
</year>
<total>
17052
......
This is the reading part of my code:
import pandas as pd
pop = pd.read_csv(r'''directory\population.csv''', delimiter=";")
I tried doing it with a function and a loop, as in the linked question "How do convert a pandas/dataframe to XML?", but haven't succeeded. Any other recommendations?
This is an example of my dataframe:
Alahärmä 2014 0 0.1 0.2
0 Alajärvi 2014 10171 5102 5069
1 Alastaro 2014 0 0 0
2 Alavieska 2014 2687 1400 1287
3 Alavus 2014 12103 6102 6001
4 Anjalankoski 2014 0 0 0
Fairly new to Python, so any help is appreciated.
The question you linked to actually has a great answer to your question, but I guess you're having difficulty adapting your data to that solution, so I've done it below for you.
Your level of detail is a bit sketchy. If your specific situation differs slightly you'll need to tweak my answer, but here's something that works for me:
First, assuming you have a text file as follows:
0 Alahärmä 2014 0 0.1 0.2
1 Alajärvi 2014 10171 5102 5069
2 Alastaro 2014 0 0 0
3 Alavieska 2014 2687 1400 1287
4 Alavus 2014 12103 6102 6001
5 Anjalankoski 2014 0 0 0
Moving on to creating the python script, we first import that text file using the following line:
pop = pd.read_csv(r'directory\population.csv', delimiter=r"\s+", names=['cityname', 'year', 'total', 'male', 'females'])
This brings in the text file as a dataframe and gives the new dataframe the correct column headers.
Then taking the data from the question you linked to, we add the following to our python script:
def func(row):
    xml = ['<item>']
    for field in row.index:
        xml.append(' <field name="{0}">{1}</field>'.format(field, row[field]))
    xml.append('</item>')
    return '\n'.join(xml)
print('\n'.join(pop.apply(func, axis=1)))
Now we put it all together and we get the below:
import pandas as pd
pop = pd.read_csv(r'directory\population.csv', delimiter=r"\s+", names=['cityname', 'year', 'total', 'male', 'females'])
def func(row):
    xml = ['<item>']
    for field in row.index:
        xml.append(' <field name="{0}">{1}</field>'.format(field, row[field]))
    xml.append('</item>')
    return '\n'.join(xml)
print('\n'.join(pop.apply(func, axis=1)))
When we run the above file we get the following output:
<item>
<field name="cityname">Alahärmä</field>
<field name="year">2014</field>
<field name="total">0</field>
<field name="male">0.1</field>
<field name="females">0.2</field>
</item>
<item>
<field name="cityname">Alajärvi</field>
<field name="year">2014</field>
<field name="total">10171</field>
<field name="male">5102.0</field>
<field name="females">5069.0</field>
</item>
<item>
<field name="cityname">Alastaro</field>
<field name="year">2014</field>
<field name="total">0</field>
<field name="male">0.0</field>
<field name="females">0.0</field>
</item>
<item>
<field name="cityname">Alavieska</field>
<field name="year">2014</field>
<field name="total">2687</field>
<field name="male">1400.0</field>
<field name="females">1287.0</field>
</item>
<item>
<field name="cityname">Alavus</field>
<field name="year">2014</field>
<field name="total">12103</field>
<field name="male">6102.0</field>
<field name="females">6001.0</field>
</item>
<item>
<field name="cityname">Anjalankoski</field>
<field name="year">2014</field>
<field name="total">0</field>
<field name="male">0.0</field>
<field name="females">0.0</field>
</item>
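The string formatting above does not escape characters such as & or <, and it produces generic field elements rather than the populationdata/municipality layout shown in the question. A minimal sketch of that layout with xml.etree.ElementTree follows. The function name rows_to_xml and the column names are assumptions, and the two rows reuse values quoted in the question. With pandas, rows in this shape could come from pop.to_dict('records').

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, root_tag="populationdata", row_tag="municipality"):
    """Build one row_tag element per dict, with one child element per key."""
    root = ET.Element(root_tag)
    for row in rows:
        item = ET.SubElement(root, row_tag)
        for key, value in row.items():
            # Keys must be valid XML tag names; text is escaped on serialization.
            ET.SubElement(item, key).text = str(value)
    return root


rows = [
    {"name": "Akaa", "year": 2014, "total": 17052},
    {"name": "Alajärvi", "year": 2014, "total": 10171},
]
root = rows_to_xml(rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Building the tree first and serializing once also makes it easy to write the result to a file with ET.ElementTree(root).write(...).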

Parsing name/value pairs from XML

I am trying to pull account details from XML files supplied by vendors.
I have one vendor that supplied XML files like:
<Accounts>
<Account>
<AccountNumber>1234567</AccountNumber>
<Balance>$200.00</Balance>
</Account>
<Account>
...
</Account>
</Accounts>
And I can parse this fairly easily using python:
mytree = et.parse(xml_path)
myroot = mytree.getroot()
for acc in myroot.findall('Account'):
    acctnum = acc.find('AccountNumber').text
    balance = acc.find('Balance').text
    print(acctnum, balance)
Which outputs like this:
1234567 $200.00
However another vendor supplies the XML files in something more like name/value pairs, and I am unsure how to easily access that data. It doesn't work the same way as above:
<Accounts>
<Account>
<field name='AccountNumber' value='1234567' />
<field name='Balance' value='$200.00' />
</Account>
<Account>
...
</Account>
</Accounts>
So far I've got this, but would like to be able to access the values separately and easily:
mytree = et.parse(xml_path)
myroot = mytree.getroot()
for field in myroot.findall('Account'):
    for line in field:
        print(line.attrib)
Which outputs something like:
{'name': 'AccountNumber', 'value': '1234567'}
{'name': 'Balance', 'value': '$200.00'}
So my question is this - How can I access the values and assign them to variables (based on the name) so that I can make use of them elsewhere in the script, like I have with acctnum and balance in the first example?
Populate a new data structure (like a dict) from the fields as you iterate, instead of just discarding them:
account_d = {}
for field in myroot.findall('Account'):
    for line in field:
        account_d[line.attrib['name']] = line.attrib['value']
# account_d should now be:
# { 'AccountNumber': '1234567', 'Balance': '$200.00' }
You can use a list of lists/tuples too:
account_a = []
for field in myroot.findall('Account'):
    for line in field:
        # append takes a single argument, so pass the pair as a tuple
        account_a.append((line.attrib['name'], line.attrib['value']))
# account_a should now be:
# [('AccountNumber', '1234567'), ('Balance', '$200.00')]
ElementTree 1.3 has the ability to locate nodes with particular attributes:
from xml.etree import ElementTree as et
data = '''\
<Accounts>
<Account>
<field name='AccountNumber' value='1234567' />
<field name='Balance' value='$200.00' />
</Account>
<Account>
<field name='AccountNumber' value='9999999' />
<field name='Balance' value='$300.00' />
</Account>
</Accounts>'''
tree = et.fromstring(data)
for acc in tree.iterfind('Account'):
    acctnum = acc.find("field[@name='AccountNumber']").attrib['value']
    balance = acc.find("field[@name='Balance']").attrib['value']
    print(acctnum,balance)
1234567 $200.00
9999999 $300.00
You can do it by collecting all the Account element's field attributes into a dictionary and then using the information in it as needed:
accounts.xml sample input file:
<?xml version="1.0"?>
<Accounts>
<Account>
<field name='AccountNumber' value='1234567' />
<field name='Balance' value='$200.00' />
</Account>
<Account>
<field name='AccountNumber' value='8901234' />
<field name='Balance' value='$100.00' />
</Account>
</Accounts>
Code:
import xml.etree.ElementTree as et
xml_path = 'accounts.xml'
mytree = et.parse(xml_path)
myroot = mytree.getroot()
for acct in myroot.findall('Account'):
    info = {field.attrib['name']: field.attrib['value']
            for field in acct.findall('field')}
    acctnum, balance = info['AccountNumber'], info['Balance']
    print(acctnum, balance)
Result:
1234567 $200.00
8901234 $100.00
Question: How can I access the values and assign them to variables (based on the name)
Convert all Accounts to a Dict[AccountNumber] of Dict[field].
The Attribute name becomes the dict Key:
Accounts = {}
for account in root.findall('Account'):
    fields = {}
    for field in account.findall('field'):
        fields[field.attrib['name']] = field.attrib['value']
    print('{a[AccountNumber]} {a[Balance]}'.format(a=fields))
    Accounts[fields['AccountNumber']] = fields
print(Accounts)
Output:
1234567 $200.00
9999999 $300.00
{'9999999': {'AccountNumber': '9999999', 'Balance': '$300.00'}, '1234567': {'AccountNumber': '1234567', 'Balance': '$200.00'}}
Tested with Python: 3.4.2
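Since the question involves two vendor layouts, here is a hedged sketch of a single helper that accepts either one. The function name account_to_dict and the one-line sample strings are assumptions; the approach simply combines the dict-building idea from the answers above with a check on the child tag.

```python
import xml.etree.ElementTree as ET

# Compact stand-ins for the two vendor layouts from the question (assumption).
VENDOR_A = (
    "<Accounts><Account><AccountNumber>1234567</AccountNumber>"
    "<Balance>$200.00</Balance></Account></Accounts>"
)
VENDOR_B = (
    "<Accounts><Account><field name='AccountNumber' value='1234567' />"
    "<field name='Balance' value='$200.00' /></Account></Accounts>"
)


def account_to_dict(account):
    """Map one <Account> element to {name: value} for either layout."""
    info = {}
    for child in account:
        if child.tag == "field":
            # name/value pair layout: <field name='...' value='...' />
            info[child.get("name")] = child.get("value")
        else:
            # plain child-element layout: <Balance>$200.00</Balance>
            info[child.tag] = child.text
    return info


for source in (VENDOR_A, VENDOR_B):
    for account in ET.fromstring(source).findall("Account"):
        info = account_to_dict(account)
        print(info["AccountNumber"], info["Balance"])
```

The rest of the script can then read info['AccountNumber'] and info['Balance'] without caring which vendor supplied the file.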

How can I display the child element of a node from an xml file, in python?

Here is my xml file
<root>
<Module name="ac4" offset="32" width="12">
<register name="xga_control" offset="0x000" width="32" access="R/W">
<field name="reserved" offset="0" bit_span="5"/>
<field name="force_all_fault_clear" bit_span="1" default="0">
<description>Rising edge forces all fault registers to clear</description>
</field>
<field name="force_warning" default="0" bit_span="1">
<description>Forces AC2 to report a Master Warning</description>
</field>
<field name="force_error" default="0" bit_span="1">
<description>Forces AC2 to report a Master Error</description>
</field>
</register>
</Module>
</root>
Right now I can access the names of my registers and display them. However I also want to display the names and attributes of my field elements. How can I do that? Here is my code so far.
input_file = etree.parse('file1.xml')
output=open("ac4.vhd","w+")
output.write("Registers \n")
for node in input_file.iter():
    if node.tag=="register":
        name=node.attrib.get("name")
        print(name)
        output.write(name)
        output.write("\n")
    if node.tag=="field":
        name=node.attrib.get("name")
        output.write(name)
Right now the output looks like
Registers
xga_control
i_cmd_reg
I want it to look like
Registers
xga_control
reserved
force_all_fault_clear
force_warning
force_error
i_cmd_reg
field name
field name
Any ideas on how to do this?
Instead of iterating over input_file.iter() you can do input_file.getroot() and iterate systematically over that.
This is how you would write your code:
import xml.etree.ElementTree as ET
tree = ET.parse('file1.xml')
root = tree.getroot()
with open('ac4.vhd', 'w+') as fd:
    fd.write('Registers\n')
    for node in root:
        if node.tag == 'Module':
            for sub_node in node:
                fd.write('{0}\n'.format(sub_node.get('name')))
                for child in sub_node:
                    fd.write('\t{0}\n'.format(child.get('name')))
Your output becomes:
Registers
xga_control
reserved
force_all_fault_clear
force_warning
force_error
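The question also asks for the attributes of each field, which the output above omits. A small sketch of one way to include them, assuming the remaining attributes should be listed after the field name; the helper name describe and the shortened sample are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the question's XML (assumption).
SAMPLE = """<root>
<Module name="ac4" offset="32" width="12">
<register name="xga_control" offset="0x000" width="32" access="R/W">
<field name="reserved" offset="0" bit_span="5"/>
<field name="force_warning" default="0" bit_span="1">
<description>Forces AC2 to report a Master Warning</description>
</field>
</register>
</Module>
</root>"""


def describe(register):
    """Return lines naming a register, then each field with its other attributes."""
    lines = [register.get("name")]
    for field in register.findall("field"):
        extras = ", ".join(
            f"{key}={value}" for key, value in field.attrib.items() if key != "name"
        )
        lines.append(f"\t{field.get('name')} ({extras})")
    return lines


root = ET.fromstring(SAMPLE)
for register in root.iter("register"):
    print("\n".join(describe(register)))
```

root.iter("register") finds registers at any depth, so the loop does not depend on them sitting directly under a Module.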
