I'm learning how to parse KML files in Python using the pyKML module. The specific file I'm using can be found here and I've also added it at the bottom of this post. I have saved the file on my computer and name it test.kml.
After some research, I managed to extract a specific portion of the test.kml file and save the result to a DataFrame. Here's my code:
from pykml import parser
import pandas as pd
filename = 'test.kml'
with open(filename) as fobj:
folder = parser.parse(fobj).getroot().Document
plnm = []
for pm in folder.Placemark:
plnm1 = pm.name
plnm.append(plnm1.text)
df = pd.DataFrame()
df['name'] = plnm
print(df)
name
0 Club house
1 By the lake
I would like to add a new column to my DataFrame corresponding to the value of the "holeNumber". I have tried to add the following lines in my for loop but without success.
for pm in folder.Placemark:
plnm1 = pm.name
val1 = pm.ExtendedData.holeNumber.value
plnm.append(plnm1.text)
val.append(val1.text)
I'm not sure how to access the value from that specific node. The resulting DataFrame I'm looking for is the following:
| name | holeNumber |
|-------------|------------|
| Club house | 1 |
| By the lake | 5 |
Any help would be appreciated.
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>My Golf Course Example</name>
<Placemark>
<name>Club house</name>
<ExtendedData>
<Data name="holeNumber">
<value>1</value>
</Data>
<Data name="holeYardage">
<value>234</value>
</Data>
<Data name="holePar">
<value>4</value>
</Data>
</ExtendedData>
<Point>
<coordinates>-111.956,33.5043</coordinates>
</Point>
</Placemark>
<Placemark>
<name>By the lake</name>
<ExtendedData>
<Data name="holeNumber">
<value>5</value>
</Data>
<Data name="holeYardage">
<value>523</value>
</Data>
<Data name="holePar">
<value>5</value>
</Data>
</ExtendedData>
<Point>
<coordinates>-111.95,33.5024</coordinates>
</Point>
</Placemark>
</Document>
</kml>
Here's a quick way to parse the KML.
plnm = []
holeNumber = []
for pm in folder.Placemark:
plnm1 = pm.name
val1 = pm.ExtendedData.Data[0].value
plnm.append(plnm1.text)
holeNumber.append(val1.text)
df = pd.DataFrame()
df['name'] = plnm
df['holeNumber'] = holeNumber
print(df)
Or
df = pd.DataFrame(columns=('name', 'holeNumber'))
for pm in folder.Placemark:
name = pm.name.text
value = pm.ExtendedData.Data[0].value.text
df = df.append({ 'name' : name, 'holeNumber' : value }, ignore_index=True)
print(df)
Output:
name holeNumber
0 Club house 1
1 By the lake 5
Related
I am working on a simple python script to extract certain data from an xml file. The xml contains windows events and eventid. Below I am showing the code. It is failing when it needs to extract the data, but it is creating the file but is empty.
from xml.etree import ElementTree as ET
import csv
tree = ET.parse("SecurityLog-rev2.xml")
root = tree.getroot()
url = root[0].tag[:-len("Event")]
fieldnames = ['EventID']
with open ('event_log.csv', 'w') as csvfile:
writecsv = csv.DictWriter(csvfile, fieldnames = fieldnames)
writecsv.writeheader()
for event in root:
system = event.find(url + "System")
output = {}
fields = ['EventID']
# for tag,att in fields:
# output[tag] = system.find(url + tag).attrib[att]
if event.find(url + "EventData") != None:
for data in event.find(url + "EventData"):
name = data.attrib['Name']
output[name] = data.text
writecsv.writerow(output)
<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'><System><Provider Name='Microsoft-Windows-Security-Auditing' Guid='{54849625-5478-4994-A5BA-3E3B0328C30D}'/>
<EventID>4634</EventID>
<Version>0</Version><Level>0</Level><Task>12545</Task><Opcode>0</Opcode><Keywords>0x8020000000000000</Keywords><TimeCreated SystemTime='2011-04-16T15:07:53.890625000Z'/>
<EventRecordID>1410962</EventRecordID><Correlation/><Execution ProcessID='452' ThreadID='3900'/><Channel>Security</Channel><Computer>DC01.AFC.com</Computer><Security/></System>
<EventData><Data Name='TargetUserSid'>S-1-5-21-2795111079-3225111112-3329435632-1610</Data>
<Data Name='TargetUserName'>grant.larson</Data>
<Data Name='TargetDomainName'>AFC</Data><Data Name='TargetLogonId'>0x3642df8</Data><Data Name='LogonType'>3</Data></EventData></Event>
I am not sure what exactly you would parse. Here is a solution for the Id and the events:
Your XML File provided above as Input:
<?xml version="1.0" encoding="utf-8"?>
<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
<System>
<Provider Name='Microsoft-Windows-Security-Auditing' Guid='{54849625-5478-4994-A5BA-3E3B0328C30D}' />
<EventID>4634</EventID>
<Version>0</Version>
<Level>0</Level>
<Task>12545</Task>
<Opcode>0</Opcode>
<Keywords>0x8020000000000000</Keywords>
<TimeCreated SystemTime='2011-04-16T15:07:53.890625000Z' />
<EventRecordID>1410962</EventRecordID>
<Correlation />
<Execution ProcessID='452' ThreadID='3900' />
<Channel>Security</Channel>
<Computer>DC01.AFC.com</Computer>
<Security />
</System>
<EventData>
<Data Name='TargetUserSid'>S-1-5-21-2795111079-3225111112-3329435632-1610</Data>
<Data Name='TargetUserName'>grant.larson</Data>
<Data Name='TargetDomainName'>AFC</Data>
<Data Name='TargetLogonId'>0x3642df8</Data>
<Data Name='LogonType'>3</Data>
</EventData>
</Event>
The program code without regex for catching the namespace:
from xml.etree import ElementTree as ET
import pandas as pd
import csv
tree = ET.parse("SecurityLog-rev2.xml")
root = tree.getroot()
ns = "{http://schemas.microsoft.com/win/2004/08/events/event}"
data = []
for eventID in root.findall(".//"):
if eventID.tag == f"{ns}System":
for e_id in eventID.iter():
if e_id.tag == f'{ns}EventID':
row = "EventID", e_id.text
data.append(row)
if eventID.tag == f"{ns}EventData":
for attr in eventID.iter():
if attr.tag == f'{ns}Data':
#print(attr.attrib)
row = attr.get('Name'), attr.text
data.append(row)
df = pd.DataFrame.from_dict(data, orient='columns')
df.to_csv('event_log.csv', index=False, header=False)
print(df)
Output:
0 1
0 EventID 4634
1 TargetUserSid S-1-5-21-2795111079-3225111112-3329435632-1610
2 TargetUserName grant.larson
3 TargetDomainName AFC
4 TargetLogonId 0x3642df8
5 LogonType 3
The CSV File doesn't contain the index and header:
EventID,4634
TargetUserSid,S-1-5-21-2795111079-3225111112-3329435632-1610
TargetUserName,grant.larson
TargetDomainName,AFC
TargetLogonId,0x3642df8
LogonType,3
You can tanspose() the output:
df.T.to_csv('event_log.csv', index=False, header=False)
I have three xml files with a certain tag and different schema as shown below.
file:1
<LETADA-LOOK_TYP>
<COLUMNS> Long Short Value </COLUMNS>
<DATA> XC XC 10670003039 </DATA>
<DATA> GH GH 10450003040 </DATA>
<DATA> HJ HJ 10220002989 </DATA>
<DATA> FF FF 10990002988 </DATA>
<DATA> DD DD 10660003041 </DATA>
<DATA> FE FE 10660002991 </DATA>
<DATA> SS SS 10090003042 </DATA>
<DATA> LL LL 10100002990 </DATA>
</LETADA-LOOK_TYP>
file:2
<LETADA-LOOK_TYP>
<COLUMNS> Long Name Value </COLUMNS>
<DATA> LD ER 10670045039 </DATA>
<DATA> FR RT 10450065040 </DATA>
<DATA> YT VG 10220090989 </DATA>
<DATA> QW TY 10990023988 </DATA>
<DATA> WE ER 10660034041 </DATA>
<DATA> ER FG 10660045991 </DATA>
<DATA> ER ER 10090067042 </DATA>
<DATA> PO PO 10100044990 </DATA>
</LETADA-LOOK_TYP>
file:3
<LETADA-LOOK_TYP>
<COLUMNS> Punt GrubName Value </COLUMNS>
<DATA> GF ER 10689045039 </DATA>
<DATA> TY RT 10434065040 </DATA>
<DATA> JJ VG 10212090989 </DATA>
<DATA> QW TY 10989023988 </DATA>
<DATA> TY ER 10676034041 </DATA>
<DATA> II FG 10609045991 </DATA>
<DATA> OI ER 10023067042 </DATA>
<DATA> OW PO 10145044990 </DATA>
</LETADA-LOOK_TYP>
so I have written a python script to parse these files but because these files have different schema I had to manual write a script for each file and get dataframe out of it and then concat all the dataframes achieved to get the desired output.
parsing files:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as et
data1 =[]
tree = et.parse(file1)
root = tree.getroot()
for mt in root.findall("LETADA-LOOK_TYP"):
column = mt.find('COLUMNS').text
column1 = column.split('\t')
for ms in root.findall("LETADA-LOOK_TYP"):
datatab = ms.findall("DATA")
for dat in datatab:
data = dat.text.split('\t')
data1.append(data)
dataframe1 = pd.DataFrame(data1, columns = column1)
Parsing file 2 in a different cell:
data2 =[]
tree = et.parse(file2)
root = tree.getroot()
for mt in root.findall("LETADA-LOOK_TYP"):
column = mt.find('COLUMNS').text
column2 = column.split('\t')
for ms in root.findall("LETADA-LOOK_TYP"):
datatab = ms.findall("DATA")
for dat in datatab:
data = dat.text.split('\t')
data2.append(data)
dataframe2 = pd.DataFrame(data2, columns = column2)
Parsing file 3 in a different cell:
data3 =[]
tree = et.parse(file3)
root = tree.getroot()
for mt in root.findall("LETADA-LOOK_TYP"):
column = mt.find('COLUMNS').text
column3 = column.split('\t')
for ms in root.findall("LETADA-LOOK_TYP"):
datatab = ms.findall("DATA")
for dat in datatab:
data = dat.text.split('\t')
data3.append(data)
dataframe3 = pd.DataFrame(data3, columns = column3)
list_df =[dataframe1,dataframe2,dataframe3]
final_df = pd.concat(list_df).reset_index(drop = True)
using the above multiple lines of code I can get the desired output but is there a way to parse multiple files with different schema and return multiple dataframes and then concat them to get a final output
I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export to a file the list of person names along with the city names
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
person_name = node.attrib.get("name")
rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
ElementTree.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
data.append({'parson': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
parson city
0 John New York
1 Mary Los Angeles
I have a basic xml file called meals.xml which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>
I want to extract both 'id' and 'name' attributes in to a dataframe. I can extract one when specifying one column and one attribute (eg, name only), but can't seem to figure out the syntax for getting multiple attributes in the for loop. This what I've tried, adding id to the 'df_cols' and 'attrib.get' function:
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.parse('meals.xml').getroot()
df_cols = ["id", "name"]
rows = []
for node in root:
value = node.attrib.get('id', 'name')
rows.append(value)
df = pd.DataFrame(rows, columns = df_cols)
df
Can someone advise how to do this?
The below may work for you
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>'''
root = ET.fromstring(xml)
data = [{'id': m.attrib['id'], 'name': m.attrib['name']} for m in root.findall('.//meal')]
df = pd.DataFrame(data)
print(df)
output
id name
0 1 Poached Eggs
1 2 Club Sandwich
2 3 Steak
3 4 Steak
I want to create a function to modify XML content without changing the format. I managed to change the text but I can't do it without changing the format in XML.
So now, what I wanted to do is to add space before and after CDATA in a XML file.
Default XML file:
<?xml version="1.0" encoding="utf-8"?>
<Mapsxmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice/>
<Bin>
<Bin Bin="001"/>
</Bin>
<Data>
<Row> <![CDATA[001 001 001]]> </Row>
</Data>
</Device>
</Map>
</Maps>
And I am getting this result:
<?xml version="1.0" encoding="utf-8"?>
<Mapsxmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice/>
<Bin>
<Bin Bin="001"/>
</Bin>
<Data>
<Row><![CDATA[001 001 099]]></Row>
</Data>
</Device>
</Map>
</Maps>
However, I want the new xml to be like this:
<?xml version="1.0" encoding="utf-8"?>
<Mapsxmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice/>
<Bin>
<Bin Bin="001"/>
</Bin>
<Data>
<Row> <![CDATA[001 001 099]]> </Row>
</Data>
</Device>
</Map>
</Maps>
Here is my code:
from lxml import etree as ET
def xml_new(f,fpath,newtext,xmlrow):
xmlrow = 19
parser = ET.XMLParser(strip_cdata=False)
tree = ET.parse(f, parser)
root = tree.getroot()
for child in root:
value = child[0][2][xmlrow].text
text = ET.CDATA("001 001 099")
child[0][2][xmlrow] = ET.Element('Row')
child[0][2][xmlrow].text = text
child[0][2][xmlrow].tail = "\n"
ET.register_namespace('A', "http://www.semi.org")
tree.write(fpath,encoding='utf-8',xml_declaration=True)
return value
Anyone can help me on this? thanks in advance!
I don't quite understand what you want to do. Here's an example for you. I don't know if it can meet your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
html ='''<?xml version="1.0" encoding="utf-8"?>
<Mapsxmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice/>
<Bin>
<Bin Bin="001"/>
</Bin>
<Data>
<Row> <![CDATA[001 001 001]]> </Row>
</Data>
</Device>
</Map>
</Maps>'''
doc = SimplifiedDoc(html)
row = doc.Data.Row # Get the node you want to modify.
row.setContent(" "+row.html+" ") # Modify the node content.
print (doc.html)
Result:
<?xml version="1.0" encoding="utf-8"?>
<Mapsxmlns="http://www.semi.org">
<Map>
<Device>
<ReferenceDevice />
<Bin>
<Bin Bin="001" />
</Bin>
<Data>
<Row> <![CDATA[001 001 001]]> </Row>
</Data>
</Device>
</Map>
</Maps>
thanks for all your help. I have found another way to achieve the result I want
This is the code:
# what you want to change
replaceby = '020]]> </Row>\n'
# row you want to change
row = 1
# col you want to change based on list
col = 3
file = open(file,'r')
line = file.readlines()
i = 0
editedXML=[]
for l in line:
if 'cdata' in l.lower():
i=i+1
if i == row:
oldVal = l.split(' ')
newVal = []
for index, old in enumerate(oldVal):
if index == col:
newVal.append(replaceby)
else:
newVal.append(old)
editedXML.append(' '.join(newVal))
else:
editedXML.append(l)
else:
editedXML.append(l)
file2 = open(newfile,'w')
file2.write(''.join(editedXML))
file2.close()