Convert XML to CSV with python - python

I have the following XML files that need to be converted to CSV for Magento import. Under
<values> <Value AttributeID="attributes"></Value> there are a few hundred possibilities that are different for each product.
I've tried using xml2csv and xmlutils.xml2csv via command line with no luck. Any help would be appreciated.
<?xml version="1.0" encoding="UTF-8" ?>
<STEP-ProductInformation>
<Products>
<Product UserTypeID="Item" ParentID="12345678" AnalyzerResult="included" ID="123456">
<Name>X8MM</Name>
<ClassificationReference InheritedFrom="" ClassificationID="" AnalyzerResult=""/>
<Values>
<Value AttributeID="Retail Price">46.44</Value>
<Value AttributeID="Item Class">03017</Value>
<Value AttributeID="Item Group">03</Value>
<Value AttributeID="Consumer Description">Super-X 8mm Mauser (8x57) 170 Grain Power-Point</Value>
<Value AttributeID="Quantity Case">10</Value>
<Value AttributeID="Bullet Weight">170 gr.</Value>
<Value AttributeID="Made In The USA">Made In The USA</Value>
<Value AttributeID="Item Code">X8MM</Value>
<Value AttributeID="Caliber">8x57 Mauser</Value>
<Value AttributeID="Catalog Vendor Name">WINCHESTER</Value>
<Value AttributeID="Quantity per Box">20</Value>
<Value AttributeID="Item Status">OPEN</Value>
<Value AttributeID="Wildcat Eligible">Y</Value>
<Value AttributeID="Item Description">WIN SUPX 8MAU 170 PP 20</Value>
<Value AttributeID="Primary Vendor">307AM</Value>
<Value AttributeID="Caliber-Gauge">8X57 MAUSER</Value>
<Value AttributeID="InventoryTyp">REG</Value>
<Value AttributeID="Bullet Style">Power-Point</Value>
<Value AttributeID="ProductPageNumber"/>
<Value AttributeID="Model Header">8mm Mauser (8x57)</Value>
<Value AttributeID="Master Model Body Copy">Power Point assures quick and massive knock-down. Strategic notching provides consistent and reliable expansion. Contoured jacket maximum expansion performance. Alloyed lead core increases retained weight for deeper penetration.</Value>
<Value AttributeID="Master Model Header">Super-X Power-Point</Value>
<Value AttributeID="Vendor Group">WIN</Value>
</Values>
<AssetCrossReference Type="Primary Image" AssetID="WIN_X8MM" AnalyzerResult="included"/>
</Product>
</Products>
</STEP-ProductInformation>

I'm not familiar with "Magento", but this program converts your XML file to a CSV file. The resulting CSV file has one column for Name and one column for each Value.
from xml.etree import ElementTree as ET
import csv

tree = ET.parse('x.xml')

# Collect every AttributeID once, in order of first appearance
columns = ['Name']
for value in tree.findall('.//Product//Value'):
    attribute_id = value.get('AttributeID')
    if attribute_id not in columns:
        columns.append(attribute_id)

# newline='' lets the csv module control line endings (Python 3)
with open('x.csv', 'w', newline='', encoding='utf-8') as ofile:
    writer = csv.DictWriter(ofile, fieldnames=columns)
    writer.writeheader()
    for product in tree.findall('.//Product'):
        # One row per product; empty <Value/> elements become empty strings
        row = {value.get('AttributeID'): value.text or ''
               for value in product.findall('.//Values/Value')}
        row['Name'] = product.findtext('Name')
        writer.writerow(row)
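A self-contained variation of the same idea, offered as a sketch rather than a definitive implementation: it parses the XML from a string and returns the CSV as text, so it can be tried without any files. The function name `products_to_csv`, the second product, and the shortened sample data are assumptions made for illustration only.

```python
import csv
import io
from xml.etree import ElementTree as ET

# Shortened, hypothetical sample in the same shape as the question's XML.
SAMPLE_XML = """<STEP-ProductInformation>
  <Products>
    <Product ID="123456">
      <Name>X8MM</Name>
      <Values>
        <Value AttributeID="Retail Price">46.44</Value>
        <Value AttributeID="Caliber">8x57 Mauser</Value>
        <Value AttributeID="ProductPageNumber"/>
      </Values>
    </Product>
    <Product ID="654321">
      <Name>X3030</Name>
      <Values>
        <Value AttributeID="Retail Price">29.99</Value>
        <Value AttributeID="Bullet Style">Power-Point</Value>
      </Values>
    </Product>
  </Products>
</STEP-ProductInformation>"""


def products_to_csv(xml_text):
    """Return CSV text with one row per Product and one column per AttributeID."""
    root = ET.fromstring(xml_text)
    rows = []
    columns = ["Name"]
    for product in root.iter("Product"):
        row = {"Name": product.findtext("Name", default="")}
        for value in product.findall("./Values/Value"):
            attribute_id = value.get("AttributeID")
            row[attribute_id] = value.text or ""
            # Keep columns in order of first appearance across all products.
            if attribute_id not in columns:
                columns.append(attribute_id)
        rows.append(row)

    buffer = io.StringIO()
    # restval fills in attributes that a given product does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = products_to_csv(SAMPLE_XML)
print(csv_text)
```

Because the column list is the union of all AttributeID values, products with different attribute sets still end up in one rectangular table, with blanks where a product lacks an attribute.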

Related

Python - using element tree to get data from specific nodes in xml

I have been looking around and there are a lot of similar questions, but none that solved my issue sadly.
My XML file looks like this
<?xml version="1.0" encoding="utf-8"?>
<Nodes>
<Node ComponentID="1">
<Settings>
<Value name="Text Box (1)"> SettingA </Value>
<Value name="Text Box (2)"> SettingB </Value>
<Value name="Text Box (3)"> SettingC </Value>
<Value name="Text Box (4)"> SettingD </Value>
<AdvSettings State="On"/>
</Settings>
</Node>
<Node ComponentID="2">
<Settings>
<Value name="Text Box (1)"> SettingA </Value>
<Value name="Text Box (2)"> SettingB </Value>
<Value name="Text Box (3)"> SettingC </Value>
<Value name="Text Box (4)"> SettingD </Value>
<AdvSettings State="Off"/>
</Settings>
</Node>
<Node ComponentID="3">
<Settings>
<Value name="Text Box (1)"> SettingG </Value>
<Value name="Text Box (2)"> SettingH </Value>
<Value name="Text Box (3)"> SettingI </Value>
<Value name="Text Box (4)"> SettingJ </Value>
<AdvSettings State="Yes"/>
</Settings>
</Node>
</Nodes>
With Python I'm trying to get the values of Text Box (1) and Text Box (2) for each Node that has "AdvSettings" set to On.
So in this case I would like a result like
ComponentID State Textbox1 Textbox2
1 On SettingA SettingB
3 On SettingG SettingH
I have made some attempts but didn't get far. With this I managed to get the AdvSettings tag, but that's as far as I got:
import xml.etree.ElementTree as ET
tree = ET.parse('XMLSearch.xml')
root = tree.getroot()
for AdvSettings in root.iter('AdvSettings'):
    print(AdvSettings.tag, AdvSettings.attrib)
You can use an XPath expression to find the relevant nodes and then extract the data you need from them. For example (the comments explain each step):
from lxml import etree
xml = etree.fromstring('''
<Nodes>...
</Nodes>
''')
# Use XPath to select the nodes whose AdvSettings State is 'Yes' or 'On'
on_nodes = xml.xpath("//Node[Settings[AdvSettings[@State='Yes' or @State='On']]]")
# Collect the ComponentID and every non-empty named setting of each node
data_collected = [
    dict(
        [("ComponentID", node.attrib['ComponentID'])] +
        [(c.get("name"), c.text) for c in node.find("Settings") if c.text]
    )
    for node in on_nodes
]
# data_collected is now a list of dicts with all relevant information.
# Printing it with pandas is optional; it is only used for formatting.
import pandas
print(pandas.DataFrame.from_records(data_collected).to_markdown(index=False))
Would give you an output like
| ComponentID | Text Box (1) | Text Box (2) | Text Box (3) | Text Box (4) |
|--------------:|:---------------|:---------------|:---------------|:---------------|
| 1 | SettingA | SettingB | SettingC | SettingD |
| 3 | SettingG | SettingH | SettingI | SettingJ |
Below is a version that uses Python's built-in xml library:
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="utf-8"?>
<Nodes>
<Node ComponentID="1">
<Settings>
<Value name="Text Box (1)"> SettingA </Value>
<Value name="Text Box (2)"> SettingB </Value>
<Value name="Text Box (3)"> SettingC </Value>
<Value name="Text Box (4)"> SettingD </Value>
<AdvSettings State="On"/>
</Settings>
</Node>
<Node ComponentID="2">
<Settings>
<Value name="Text Box (1)"> SettingA </Value>
<Value name="Text Box (2)"> SettingB </Value>
<Value name="Text Box (3)"> SettingC </Value>
<Value name="Text Box (4)"> SettingD </Value>
<AdvSettings State="Off"/>
</Settings>
</Node>
<Node ComponentID="3">
<Settings>
<Value name="Text Box (1)"> SettingG </Value>
<Value name="Text Box (2)"> SettingH </Value>
<Value name="Text Box (3)"> SettingI </Value>
<Value name="Text Box (4)"> SettingJ </Value>
<AdvSettings State="Yes"/>
</Settings>
</Node>
</Nodes>'''
data = []
root = ET.fromstring(xml)
nodes = root.findall('.//Node')
for node in nodes:
    adv = node.find('.//AdvSettings')
    if adv is None:
        continue
    # Treat a missing State attribute as 'Off'
    flag = adv.attrib.get('State', 'Off')
    if flag == 'On' or flag == 'Yes':
        data.append({
            'id': node.attrib.get('ComponentID'),
            'txt_box_1': node.find('.//Value[@name="Text Box (1)"]').text.strip(),
            'txt_box_2': node.find('.//Value[@name="Text Box (2)"]').text.strip(),
        })
df = pd.DataFrame(data)
print(df)
output
id txt_box_1 txt_box_2
0 1 SettingA SettingB
1 3 SettingG SettingH
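For the exact columns asked for in the question (including State), a minimal standard-library sketch could look like the following. It is only an illustration: the helper names `text_box` and `enabled_rows` are hypothetical, the sample XML is shortened to two text boxes, and treating both "On" and "Yes" as enabled is an assumption carried over from the answers above.

```python
import xml.etree.ElementTree as ET

XML = """<Nodes>
  <Node ComponentID="1">
    <Settings>
      <Value name="Text Box (1)"> SettingA </Value>
      <Value name="Text Box (2)"> SettingB </Value>
      <AdvSettings State="On"/>
    </Settings>
  </Node>
  <Node ComponentID="2">
    <Settings>
      <Value name="Text Box (1)"> SettingA </Value>
      <Value name="Text Box (2)"> SettingB </Value>
      <AdvSettings State="Off"/>
    </Settings>
  </Node>
  <Node ComponentID="3">
    <Settings>
      <Value name="Text Box (1)"> SettingG </Value>
      <Value name="Text Box (2)"> SettingH </Value>
      <AdvSettings State="Yes"/>
    </Settings>
  </Node>
</Nodes>"""

ENABLED_STATES = {"On", "Yes"}  # assumption: both spellings mean enabled


def text_box(node, number):
    """Return the stripped text of 'Text Box (number)', or '' if it is missing."""
    path = "./Settings/Value[@name='Text Box (%d)']" % number
    return (node.findtext(path) or "").strip()


def enabled_rows(xml_text):
    """Return (ComponentID, State, Textbox1, Textbox2) for each enabled Node."""
    rows = []
    for node in ET.fromstring(xml_text).iter("Node"):
        adv = node.find("./Settings/AdvSettings")
        if adv is None or adv.get("State") not in ENABLED_STATES:
            continue
        rows.append((node.get("ComponentID"), adv.get("State"),
                     text_box(node, 1), text_box(node, 2)))
    return rows


rows = enabled_rows(XML)
print("ComponentID State Textbox1 Textbox2")
for row in rows:
    print(" ".join(row))
```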

etree data extraction from xml with odd tree structure

Here is a piece of the XML data before I go any further:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<sequence id="episode1">
<media>
<video>
<track>
<generatoritem id="Gen Subtitle1">
<effect>
<name>Gen Subtitle</name>
<effectid>Gen Subtitle</effectid>
<effectcategory>Text</effectcategory>
<effecttype>generator</effecttype>
<mediatype>video</mediatype>
<parameter>
<parameterid>part1</parameterid>
<name>Text Settings</name>
<value/>
</parameter>
<parameter>
<parameterid>str</parameterid>
<name>Text</name>
<value>You're a coward for picking on people
who are weaker than you.</value>
</parameter>
<parameter>
<parameterid>font</parameterid>
<name>Font</name>
<value>Arial</value>
</parameter>
</effect>
</media>
</sequence>
</xmeml>
Now, as you can see, the tree starts with <effect>, and inside it there are multiple <parameter> elements, but I'm only after the <value> from the <parameter> that also contains
<parameterid>str</parameterid>
<name>Text</name>
so I can get an output of "That child is so cute.
And he's smart."
Here is my code
lst = tree.findall('xmeml/sequence/media/video/track/generatoritem/effect/parameter/value')
counts = tree.findall('.//value')
for each in counts:
    print(each.text)
And this is what I get:
And he's smart.
Arial
See below
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<sequence id="episode1">
<effect>
<name>Gen Subtitle</name>
<effectid>Gen Subtitle</effectid>
<effectcategory>Text</effectcategory>
<effecttype>generator</effecttype>
<mediatype>video</mediatype>
<parameter>
<parameterid>part1</parameterid>
<name>Text Settings</name>
<value/>
</parameter>
<parameter>
<parameterid>str</parameterid>
<name>Text</name>
<value>That child is so cute. And he's smart</value>
</parameter>
<parameter>
<parameterid>font</parameterid>
<name>Font</name>
<value>Arial</value>
</parameter>
</effect>
</sequence>
</xmeml>'''
root = ET.fromstring(xml)
# Select only the <parameter> elements whose <parameterid> child is "str"
str_params = root.findall('.//parameter[parameterid="str"]')
for param in str_params:
    if param.find('./name').text == 'Text':
        print('The text: {}'.format(param.find('./value').text))
        break
output
The text: That child is so cute. And he's smart
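If the file contains several subtitle effects, the same predicate can be used to collect all of them. The following is a sketch under stated assumptions: the function name `subtitle_texts` is hypothetical, the second <effect> is invented sample data, and joining a multi-line subtitle into one line is a design choice that may not suit every use.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the second <effect> is made up for illustration.
XML = """<xmeml version="5">
  <sequence id="episode1">
    <effect>
      <parameter><parameterid>part1</parameterid><name>Text Settings</name><value/></parameter>
      <parameter><parameterid>str</parameterid><name>Text</name><value>That child is so cute.
And he's smart.</value></parameter>
      <parameter><parameterid>font</parameterid><name>Font</name><value>Arial</value></parameter>
    </effect>
    <effect>
      <parameter><parameterid>str</parameterid><name>Text</name><value>Second line.</value></parameter>
    </effect>
  </sequence>
</xmeml>"""


def subtitle_texts(xml_text):
    """Return the text of every <parameter> with parameterid 'str' and name 'Text'."""
    root = ET.fromstring(xml_text)
    texts = []
    for param in root.iterfind(".//parameter[parameterid='str']"):
        if param.findtext("name") == "Text":
            raw = param.findtext("value") or ""
            # Collapse line breaks and repeated spaces into single spaces.
            texts.append(" ".join(raw.split()))
    return texts


texts = subtitle_texts(XML)
for text in texts:
    print(text)
```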

How to retain XML version and comments while writing in XML file using python?

I have to add 1 element at runtime on the XML file using Python.
My original XML file has content like this below
<?xml version='1.0' encoding='utf-8'?>
<!--
Some comments.
-->
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true"
maxThreads="800"
URIEncoding="UTF-8"
clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel"
userClassNames="User"
roleClassNames="Role"/>
<Host name="localhost"
createDirs="false">
<Value className="Remote"
httpsServerPort="223" />
</Host>
</Track>
</childTag>
</rootTag>
Below is the code I wrote to add a Value element at runtime:
import xml.etree.ElementTree as ET
myTree = ET.parse("new2.xml")
myRoot = myTree.getroot()
# Only add the element if it is not already present
x = myTree.findall('.//Value[@className="Error"]')
print(len(x))
if len(x) == 0:
    for a in myRoot.findall('childTag'):
        for b in a.findall('Track'):
            for c in b.findall('Host'):
                ele = ET.Element('Value')
                ele.set("className", "Error")
                ele.set("showReport", "false")
                ele.set("showServerInfo", "false")
                c.append(ele)
myTree.write("new2.xml")
The output I got is this:
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true" maxThreads="800" URIEncoding="UTF-8" clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel" userClassNames="User" roleClassNames="Role" />
<Host name="localhost" autoDeploy="false" createDirs="false">
<Value className="Remote" httpsServerPort="223" />
<Value className="Error" showReport="false" showServerInfo="false" /></Host>
</Track>
</childTag>
</rootTag>
The problem is that it removes the XML declaration and the comments from the file, and it also changes the file's indentation.
How can I add only the sub-element, with the correct indentation, without changing anything else in the file?
The output should look like this:
<?xml version='1.0' encoding='utf-8'?>
<!--
Some comments.
-->
<rootTag>
<childTag className="org.Tiger" SSLEngine="on" />
<childTag name="serv1">
<Connector port="8001" SSLEnabled="true"
maxThreads="800"
URIEncoding="UTF-8"
clientAuth="false" />
<Track name="Pacific" defaultHost="localhost">
<Realm className="Realm" appName="kernel"
userClassNames="User"
roleClassNames="Role"/>
<Host name="localhost"
createDirs="false">
<Value className="Remote"
httpsServerPort="223" />
<Value className="Error"
showReport="false" showServerInfo="false" />
</Host>
</Track>
</childTag>
</rootTag>
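One possible way to leave everything else byte-for-byte untouched is to skip the XML serializer and insert the new element as plain text. The sketch below is not an ElementTree feature but a simple line-based edit, and it rests on clearly stated assumptions: the closing tag sits on its own line, the first occurrence is the one to modify, and the helper name `add_child_as_text`, the shortened sample, and the two-space indentation are made up for illustration.

```python
def add_child_as_text(xml_text, closing_tag, element_text, child_indent="    "):
    """Insert element_text on its own line before the first line containing closing_tag."""
    lines = xml_text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if closing_tag in line:
            # Reuse the closing tag's indentation and add one level for the child.
            parent_indent = line[: len(line) - len(line.lstrip())]
            lines.insert(index, parent_indent + child_indent + element_text + "\n")
            return "".join(lines)
    raise ValueError("closing tag not found: " + closing_tag)


# Shortened, hypothetical version of the file from the question.
ORIGINAL = """<?xml version='1.0' encoding='utf-8'?>
<!--
  Some comments.
-->
<rootTag>
  <Host name="localhost"
        createDirs="false">
    <Value className="Remote"
           httpsServerPort="223" />
  </Host>
</rootTag>
"""

NEW_ELEMENT = '<Value className="Error" showReport="false" showServerInfo="false" />'

# Only insert when the element is not there yet, mirroring the check in the question.
if 'className="Error"' in ORIGINAL:
    updated = ORIGINAL
else:
    updated = add_child_as_text(ORIGINAL, "</Host>", NEW_ELEMENT, child_indent="  ")
print(updated)
```

The trade-off is that this treats the file as text, so it does not validate the XML; parsing the result afterwards is a cheap way to confirm it is still well-formed.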

issue: python xml append element inside a for loop

Thanks for taking the time with this one.
I have an XML file with an element called selectionset. The idea is to take that element and modify some of the sub-elements' attributes and tails; that part I have done.
What I don't understand is why, when I try to add the new sub-elements to the original parent (called selectionsets), only the last item of the list inplist shows up.
import xml.etree.ElementTree as etree
from xml.etree.ElementTree import *
from xml.etree.ElementTree import ElementTree
tree=ElementTree()
tree.parse('STRUCTURAL.xml')
root = tree.getroot()
col=tree.find('selectionsets/selectionset')
#find the value needed
val=tree.findtext('selectionsets/selectionset/findspec/conditions/condition/value/data')
setname=col.attrib['name']
listnames=val + " 6"
inplist=["D","E","F","G","H"]
entry=3
catcher=[]
ss=root.find('selectionsets')
outxml=ss
for i in range(len(inplist)):
    str(val)
    col.set('name',(setname +" "+ inplist[i]))
    col.find('findspec/conditions/condition/value/data').text=str(inplist[i]+val[1:3])
    #print (etree.tostring(col)) #everything working well til this point
    timper=col.find('selectionset')
    root[0].append(col)
    # new=etree.SubElement(outxml,timper)
#you need to create a tree with element tree before creating the xml file
itree=etree.ElementTree(outxml)
itree.write('Selection Sets.xml')
print (etree.tostring(outxml))
# print (Test_file.selectionset())
#Initial xml
<?xml version="1.0" encoding="UTF-8" ?>
<exchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://download.autodesk.com/us/navisworks/schemas/nw-exchange-12.0.xsd" units="ft" filename="STRUCTURAL.nwc" filepath="C:\Users\Ricardo\Desktop\Comun\Taller 3">
<selectionsets>
<selectionset name="Column Location" guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8">
<findspec mode="all" disjoint="0">
<conditions>
<condition test="contains" flags="10">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">C-A </data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
</selectionsets>
</exchange>
#----Current Output
<selectionsets>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
<selectionset guid="565f5345-de06-4f5b-aa0f-1ae751c98ea8" name="Column Location H">
<findspec disjoint="0" mode="all">
<conditions>
<condition flags="10" test="contains">
<category>
<name internal="LcRevitData_Element">Element</name>
</category>
<property>
<name internal="lcldrevit_parameter_-1002563">Column Location Mark</name>
</property>
<value>
<data type="wstring">H-A</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
</selectionsets>
Here's what I've been able to put together and it looks like it'll do what you're looking for. Here are the main differences: (1) This will iterate over multiple selectionset items (if you end up with more than one), (2) It creates a deepcopy of the element before modifying the values (I think you were always modifying the original "col"), (3) It appends the new selectionset to the selectionsets tag rather than the root.
See the Python documentation for copy.deepcopy for details.
import xml.etree.ElementTree as etree
import copy
tree=etree.ElementTree()
tree.parse('test.xml')
root = tree.getroot()
inplist=["D","E","F","G","H"]
for selectionset in tree.findall('selectionsets/selectionset'):
    for i in inplist:
        # Work on an independent copy so the original element stays unchanged
        col = copy.deepcopy(selectionset)
        col.set('name', '%s %s' % (col.attrib['name'], i))
        data = col.find('findspec/conditions/condition/value/data')
        data.text = '%s%s' % (i, data.text[1:3])
        root.find('selectionsets').append(col)
itree = etree.ElementTree(root)
itree.write('Selection Sets.xml')
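The difference between appending the same element repeatedly and appending copies can be seen in a small stand-alone sketch. The tiny `<sets>` document and the variable names are made up for illustration; only `copy.deepcopy` and the standard ElementTree calls come from the answer above.

```python
import copy
import xml.etree.ElementTree as ET


def build_parent():
    """Return a fresh parent with a single child to start from."""
    return ET.fromstring("<sets><set name='Column'/></sets>")


# Appending the same Element object repeatedly: every entry in the parent is
# the same object, so all of them show the last modification.
parent = build_parent()
original = parent.find("set")
for letter in ["D", "E"]:
    original.set("name", "Column " + letter)
    parent.append(original)
aliased_names = [s.get("name") for s in parent.findall("set")]

# Copying first: each appended element is independent of the template.
parent = build_parent()
template = parent.find("set")
for letter in ["D", "E"]:
    clone = copy.deepcopy(template)
    clone.set("name", template.get("name") + " " + letter)
    parent.append(clone)
copied_names = [s.get("name") for s in parent.findall("set")]

print(aliased_names)  # ['Column E', 'Column E', 'Column E']
print(copied_names)   # ['Column', 'Column D', 'Column E']
```

The first list mirrors the output in the question, where every selectionset ended up with the values from the last loop iteration.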

how to get the index of a child node under a parent node using python?

My XML file looks like this:
<?xml version="1.0"?>
<BCPFORMAT
xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="12"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="20" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="30" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="age" xsi:type="SQLINT"/>
<COLUMN SOURCE="2" NAME="firstname" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="3" NAME="lastname" xsi:type="SQLVARYCHAR"/>
</ROW>
</BCPFORMAT>
I need to know the index of the child node with ID="1" within its parent node RECORD (i.e., the index is 0 in this case).
Please help me solve this.
Thanks.
Using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
root = ET.fromstring('''<?xml version="1.0"?>
<BCPFORMAT
...
</BCPFORMAT>''')
# Accessing the parent node: http://effbot.org/zone/element.htm#accessing-parents
# (root.iter() replaces getiterator(), which was removed in Python 3.9)
parent_map = {c: p for p in root.iter() for c in p}
child = root.find('.//*[@ID="1"]')
print(list(parent_map[child]).index(child)) # => 0
Using lxml:
import lxml.etree as ET
root = ET.fromstring('''<?xml version="1.0"?>
<BCPFORMAT
...
</BCPFORMAT>''')
child = root.find('.//*[@ID="1"]')
print(child.getparent().index(child)) # => 0
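A runnable standard-library variant that avoids building a parent map is sketched below. It walks each parent's children with `enumerate`, so the index falls out directly. The helper name `child_position` is hypothetical, and the sample is a shortened copy of the question's XML with the TERMINATOR and COLLATION attributes left out.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's file (some attributes omitted).
XML = """<BCPFORMAT
    xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" MAX_LENGTH="12"/>
    <FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="20"/>
    <FIELD ID="3" xsi:type="CharTerm" MAX_LENGTH="30"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="age" xsi:type="SQLINT"/>
  </ROW>
</BCPFORMAT>"""


def child_position(root, attribute, value):
    """Return (parent tag without namespace, index) of the first matching child."""
    for parent in root.iter():
        for index, child in enumerate(parent):
            if child.get(attribute) == value:
                # Tags look like '{namespace}RECORD'; keep only the local name.
                return parent.tag.rsplit("}", 1)[-1], index
    return None


root = ET.fromstring(XML)
print(child_position(root, "ID", "1"))
print(child_position(root, "ID", "3"))
```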
