I am working on a simple Python script to extract certain data from an XML file. The XML contains Windows events, each with an EventID. The code is shown below. It fails when it needs to extract the data: the CSV file is created, but it is empty.
from xml.etree import ElementTree as ET
import csv
tree = ET.parse("SecurityLog-rev2.xml")
root = tree.getroot()
url = root[0].tag[:-len("Event")]
fieldnames = ['EventID']
with open('event_log.csv', 'w') as csvfile:
    writecsv = csv.DictWriter(csvfile, fieldnames = fieldnames)
    writecsv.writeheader()
    for event in root:
        system = event.find(url + "System")
        output = {}
        fields = ['EventID']
        # for tag,att in fields:
        #     output[tag] = system.find(url + tag).attrib[att]
        if event.find(url + "EventData") != None:
            for data in event.find(url + "EventData"):
                name = data.attrib['Name']
                output[name] = data.text
        writecsv.writerow(output)
<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'><System><Provider Name='Microsoft-Windows-Security-Auditing' Guid='{54849625-5478-4994-A5BA-3E3B0328C30D}'/>
<EventID>4634</EventID>
<Version>0</Version><Level>0</Level><Task>12545</Task><Opcode>0</Opcode><Keywords>0x8020000000000000</Keywords><TimeCreated SystemTime='2011-04-16T15:07:53.890625000Z'/>
<EventRecordID>1410962</EventRecordID><Correlation/><Execution ProcessID='452' ThreadID='3900'/><Channel>Security</Channel><Computer>DC01.AFC.com</Computer><Security/></System>
<EventData><Data Name='TargetUserSid'>S-1-5-21-2795111079-3225111112-3329435632-1610</Data>
<Data Name='TargetUserName'>grant.larson</Data>
<Data Name='TargetDomainName'>AFC</Data><Data Name='TargetLogonId'>0x3642df8</Data><Data Name='LogonType'>3</Data></EventData></Event>
I am not sure exactly what you want to parse. Here is a solution that extracts the EventID and the event data:
Your XML file from above, used as input:
<?xml version="1.0" encoding="utf-8"?>
<Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
<System>
<Provider Name='Microsoft-Windows-Security-Auditing' Guid='{54849625-5478-4994-A5BA-3E3B0328C30D}' />
<EventID>4634</EventID>
<Version>0</Version>
<Level>0</Level>
<Task>12545</Task>
<Opcode>0</Opcode>
<Keywords>0x8020000000000000</Keywords>
<TimeCreated SystemTime='2011-04-16T15:07:53.890625000Z' />
<EventRecordID>1410962</EventRecordID>
<Correlation />
<Execution ProcessID='452' ThreadID='3900' />
<Channel>Security</Channel>
<Computer>DC01.AFC.com</Computer>
<Security />
</System>
<EventData>
<Data Name='TargetUserSid'>S-1-5-21-2795111079-3225111112-3329435632-1610</Data>
<Data Name='TargetUserName'>grant.larson</Data>
<Data Name='TargetDomainName'>AFC</Data>
<Data Name='TargetLogonId'>0x3642df8</Data>
<Data Name='LogonType'>3</Data>
</EventData>
</Event>
The program code without regex for catching the namespace:
from xml.etree import ElementTree as ET
import pandas as pd
import csv
tree = ET.parse("SecurityLog-rev2.xml")
root = tree.getroot()
ns = "{http://schemas.microsoft.com/win/2004/08/events/event}"
data = []
for eventID in root.findall(".//"):
    if eventID.tag == f"{ns}System":
        for e_id in eventID.iter():
            if e_id.tag == f'{ns}EventID':
                row = "EventID", e_id.text
                data.append(row)
    if eventID.tag == f"{ns}EventData":
        for attr in eventID.iter():
            if attr.tag == f'{ns}Data':
                #print(attr.attrib)
                row = attr.get('Name'), attr.text
                data.append(row)
df = pd.DataFrame.from_dict(data, orient='columns')
df.to_csv('event_log.csv', index=False, header=False)
print(df)
Output:
0 1
0 EventID 4634
1 TargetUserSid S-1-5-21-2795111079-3225111112-3329435632-1610
2 TargetUserName grant.larson
3 TargetDomainName AFC
4 TargetLogonId 0x3642df8
5 LogonType 3
The CSV File doesn't contain the index and header:
EventID,4634
TargetUserSid,S-1-5-21-2795111079-3225111112-3329435632-1610
TargetUserName,grant.larson
TargetDomainName,AFC
TargetLogonId,0x3642df8
LogonType,3
You can transpose the output so that each name becomes a column:
df.T.to_csv('event_log.csv', index=False, header=False)
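If pandas is not available, the same extraction can be sketched with only the standard library. This is a sketch under assumptions: unlike the single-event sample above, it allows for a wrapper element containing several Event elements, and it writes one CSV row per event. The helper names events_to_rows and write_rows are illustrative, not part of the original answer. It also collects all column names before writing, since csv.DictWriter raises an error by default when a row contains a key that is not listed in fieldnames.

```python
import csv
import io
from xml.etree import ElementTree as ET

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

def events_to_rows(root):
    """Return one dict per Event: the EventID plus every named Data value."""
    # root may be a single Event or a wrapper that contains several of them
    if root.tag.endswith("}Event"):
        events = [root]
    else:
        events = root.findall(".//e:Event", NS)
    rows = []
    for event in events:
        row = {"EventID": event.findtext("e:System/e:EventID", namespaces=NS)}
        for data in event.findall("e:EventData/e:Data", NS):
            row[data.get("Name")] = data.text
        rows.append(row)
    return rows

def write_rows(rows, csvfile):
    # Collect every column name first, because DictWriter rejects unknown keys
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical sample with a wrapper element around one event
sample = """<Events><Event xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
<System><EventID>4634</EventID></System>
<EventData><Data Name='TargetUserName'>grant.larson</Data><Data Name='LogonType'>3</Data></EventData>
</Event></Events>"""

rows = events_to_rows(ET.fromstring(sample))
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with open('event_log.csv', 'w', newline='') to write_rows.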
I have a script which almost works to convert text to XML in Python, but I have one issue and need help.
Each line item has 14 entries (for some records the 12th, 13th and 14th entries are in the 2nd row). The entries in my text file look like this:
5372|,EMF|2023011309094800|000|ABCSYS||RANDOMTXT||1|25727,00078,B4||43AE5E5C169904E0E0063BBEAE2CDF2F|42010A2A25FA1EDBB2D97D4DE84DB471|
5373|,EME|2023011309094800|000|ABCSYS||RANDOMTXT||1|25727,00078,B4|USER001|43AE5E5C169904E0E0063BBEAE2CDF2F|42010A2A25FA1EDBB2D97D4DE84DB471|
5374|,EME|2023011309094800|000|ABCSYS||RANDOMTXT||1|25727,00078,B4|Job:ABC_WORKFLOW_SYSTEM09084801|43AE5E5C169904E0E0063BBEAE2CDF2F|42010A2A25FA1EDBB2D97D4DE84DB471|
5375|p,E0A|2023011309094800|000|ABCSYS||RANDOMTXT||1|25727,00078,B4|&aUSER&b001|43AE5E5C169904E0E0063BBEAE2CDF2F|42010A2A25FA1EDBB2D97D4DE84DB471|
5376|n,D01|2023011309094800|000|ABCSYS||RANDOMTXT||1|25727,00078,B4|00560|43AE5E5C169904E0E0063BBEAE2CDF2F|42010A2A25FA1EDBB2D97D4DE84DB471|
5363|x,A14|2023013117274500|500|SXG6JYN|TXN|YDMMR_TRNS_COMSALES_TO_FNR_DTH|C7000EBA|5|13538,00057,D3|YDMMR_TRNS_COMSALES_TO_FNR_F010243GET_DATA|43AE5E5C16990390E0063D769C6864E8|42
010A2A25FA1EEDA8B68FD75FA13EBE|
5364|l,A19|2023013117274500|500|SXG6JYN|TXN|YDMMR_TRNS_COMSALES_TO_FNR_DTH|C7000EBA|5|13538,00057,D3|GT_MATDOC[1]-BKTXT->APIabtpdasy6185|43AE5E5C16990390E0063D769C6864E8|42010A2A2
5FA1EEDA8B68FD75FA13EBE|
In the expected XML output, all 14 elements of a record come under one parent tag, and a new parent tag is created for each subsequent set of 14 elements. The issue with the code below is that as soon as entries spill onto a second row, it creates a new parent tag. I have attached a screenshot of the output I currently get, which is incorrect.
import csv
import sys
from lxml import etree as et
csv.field_size_limit(sys.maxsize)
root = et.Element("Processes")
row_names = [
    'Time',
    'Client',
    'User',
    'number',
    'processid',
    'program',
    'randomnumber',
    'processidandwp',
    'userclient',
    'transactionid',
    'additional1',
    'additional2',
    'additional3',
    'additional4'
]
with open("test.txt") as file:
    for row in csv.reader(file, delimiter="|"):
        name = et.SubElement(root, "name")
        for i in range(len(row)):
            node = et.SubElement(name, row_names[i])
            node.text = row[i]
xml_datas = et.tostring(root, pretty_print=True,
                        xml_declaration=True, encoding="utf-8")
print(xml_datas.decode())
Current output (screenshot: https://i.stack.imgur.com/3sNmH.png)
<?xml version='1.0' encoding='utf-8'?>
<Processes>
<name>
<Time>6354</Time>
<Client>,EGZ</Client>
<User>2023012711283700</User>
<number>900</number>
<processid>DDIC</processid>
<program>S000</program>
<randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
<processidandwp></processidandwp>
<userclient>1</userclient>
<transactionid>25737,00088,B5</transactionid>
<additional1>text</additional1>
<additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
<additional3>42010A2A25FA1EDDA7CN</additional3>
<additional4>BDA81EE66224C</additional4>
<additional5>000000000000000000/00000000000</additional5>
</name>
<name>
<Time>6355</Time>
<Client>,EGZ</Client>
<User>2023012711283700</User>
<number>900</number>
<processid>DDIC</processid>
<program>S000</program>
<randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
<processidandwp></processidandwp>
<userclient>1</userclient>
<transactionid>25737,00088,B5</transactionid>
<additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
<additional3>42010A2A25FA1EDDA7CN</additional3>
<additional4>BDA81EE66224C</additional4>
<additional5>000000000000000000/00000000000</additional5>
</name>
</Processes>
The below should work
import csv
import xml.etree.ElementTree as ET
row_names = [
    'Time',
    'Client',
    'User',
    'number',
    'processid',
    'program',
    'randomnumber',
    'processidandwp',
    'userclient',
    'transactionid',
    'additional1',
    'additional2',
    'additional3',
    'additional4'
]
root = ET.Element("Processes")
counter = 0
with open("data.csv", 'r') as file:
    csv_reader = csv.reader(file, delimiter="|")
    sub_root = ET.SubElement(root, 'name')
    for row in csv_reader:
        for name in row:
            if counter < len(row_names) and name:
                ele = ET.SubElement(sub_root, row_names[counter])
                ele.text = name
                counter += 1
ET.dump(root)
output
<Processes>
<name>
<Time>A</Time>
<Client>B</Client>
<User>C</User>
<number>D</number>
<processid>E</processid>
<program>F</program>
<randomnumber>G</randomnumber>
<processidandwp>H</processidandwp>
<userclient>I</userclient>
<transactionid>J</transactionid>
<additional1>K</additional1>
<additional2>L</additional2>
<additional3>M</additional3>
<additional4>N</additional4>
</name>
</Processes>
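For input with several records, some of which wrap onto a second physical line, one possible approach is to join lines until a complete record has been read and only then split it into fields. The sketch below assumes that every logical record contains exactly 13 '|' characters (as the samples in the question appear to) and that no field contains a '|' itself. The names iter_records and build_tree are hypothetical helpers, and the sample lines are shortened stand-ins for the real data.

```python
import xml.etree.ElementTree as ET

ROW_NAMES = [
    'Time', 'Client', 'User', 'number', 'processid', 'program',
    'randomnumber', 'processidandwp', 'userclient', 'transactionid',
    'additional1', 'additional2', 'additional3', 'additional4',
]

def iter_records(lines, n_delims=13):
    """Join wrapped physical lines until a record holds n_delims '|' characters."""
    buffer = ""
    for line in lines:
        buffer += line.rstrip("\r\n")
        if buffer.count("|") >= n_delims:
            yield buffer.split("|")
            buffer = ""

def build_tree(lines):
    root = ET.Element("Processes")
    for fields in iter_records(lines):
        record = ET.SubElement(root, "name")
        # zip stops at the shorter sequence, so any extra fields are ignored
        for tag, value in zip(ROW_NAMES, fields):
            ET.SubElement(record, tag).text = value
    return root

# Shortened sample: the second record wraps in the middle of its 13th field
sample_lines = [
    "5372|,EMF|2023011309094800|000|ABCSYS||RANDOMTXT||1|25727,00078,B4||43AE|4201|\n",
    "5363|x,A14|2023013117274500|500|SXG6JYN|TXN|YDMMR|C7000EBA|5|13538,00057,D3|GET_DATA|43AE|42\n",
    "010A|\n",
]
root = build_tree(sample_lines)
print(ET.tostring(root, encoding="unicode"))
```

Because the wrap can fall in the middle of a field, the lines are concatenated before splitting rather than split first and merged afterwards. Empty fields are kept so that each value stays aligned with its tag name.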
Any ideas why this one is not working??
The XML that is being converted (much longer than this)
<XML>
<ClinicalData StudyOID="XXXXXXXXX" MetaDataVersionOID="53" mdsol_AuditSubCategoryName="QueryAnswer">
<SubjectData SubjectKey="XXXXXXXX-b7cd-4f97-8d25-594219de192f" mdsol_SubjectKeyType="SubjectUUID" mdsol_SubjectName="XX-002">
<SiteRef LocationOID="15" XXXX_StudyEnvSiteNumber="15" />
<StudyEventData StudyEventOID="DAY1" StudyEventRepeatKey="DAY1[1]" mdsol_InstanceId="47077">
<FormData FormOID="SS_DISP" FormRepeatKey="1" mdsol_DataPageId="320656">
<ItemGroupData ItemGroupOID="SS_DISP" mdsol_RecordId="797737">
<ItemData ItemOID="SS_DISP.DISPDAT" TransactionType="Upsert">
<AuditRecord>
<UserRef UserOID="XXXX#XXXXX.com1" />
<LocationRef LocationOID="15" mdsol_StudyEnvSiteNumber="15" />
<DateTimeStamp>2022-01-28T05:27:54</DateTimeStamp>
<ReasonForChange>
</ReasonForChange>
<SourceID>12345678</SourceID>
</AuditRecord>
<mdsol_Query QueryRepeatKey="123456" Value="Date of XXXX does not equal the XXXY Date. Please review and correct else clarify." Status="Answered" Response="Issues with XXXXX IWRS XXXXXX" />
</ItemData>
</ItemGroupData>
</FormData>
</StudyEventData>
</SubjectData>
</ClinicalData>
</XML>
I am using this python script to do the conversion, or I am trying to. I am pretty new to this.
from xml.etree import ElementTree
tree = ElementTree.parse('xml.xml')
root = tree.getroot()
data = []
for ClinicalData in root:
    StudyOID = getattr(child.find('StudyOID'), 'text', None)
    MetaDataVersionOID = getattr(child.find('MetaDataVersionOID'), 'text', None)
    mdsol_AuditSubCategoryName = getattr(child.find('mdsol_AuditSubCategoryName'), 'text', None)
    SubjectKey = getattr(child.find('SubjectKey'), 'text', None)
    #print('{}, {}, {}, {}'.format(StudyOID, MetaDataVersionOID, mdsol_AuditSubCategoryName, SubjectKey))
    data.append('{}, {}, {}, {}'.format(StudyOID, MetaDataVersionOID, mdsol_AuditSubCategoryName, SubjectKey))
#print (data)
with open('output.csv', 'w') as f: f.write('\n'.join([row for row in data[1:]]))
The error message I get is as follows:
File "<stdin>", line 9
with open('output.csv', 'w') as f: f.write('\n'.join([row for row in data[1:]]))
^^^^
SyntaxError: invalid syntax
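In the code above you already have the list ("data"). Append each record to it as a list of four values rather than as a pre-formatted string, then convert it into a pandas DataFrame as below and write it to CSV:
import pandas as pd
# Column names must be strings, not the loop variables
cols = ['StudyOID', 'MetaDataVersionOID', 'mdsol_AuditSubCategoryName', 'SubjectKey']
df = pd.DataFrame(data, columns=cols)
# Writing dataframe to csv
df.to_csv('output.csv', index=False)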
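Two further points may be worth checking. The traceback mentions "<stdin>", which suggests the code was pasted into the interactive interpreter; there, a compound statement such as a for loop must be closed with a blank line before the next top-level statement, otherwise a SyntaxError is reported on that statement. Also, StudyOID, MetaDataVersionOID and the others are XML attributes, so they are read with .get() rather than find(...).text. The following is a minimal sketch using only the standard library; the helper name clinical_rows and the shortened sample are assumptions for illustration.

```python
import csv
import io
from xml.etree import ElementTree

# Shortened version of the XML from the question
sample = """<XML>
  <ClinicalData StudyOID="XXXXXXXXX" MetaDataVersionOID="53" mdsol_AuditSubCategoryName="QueryAnswer">
    <SubjectData SubjectKey="XXXXXXXX-b7cd-4f97-8d25-594219de192f" mdsol_SubjectName="XX-002" />
  </ClinicalData>
</XML>"""

COLUMNS = ["StudyOID", "MetaDataVersionOID", "mdsol_AuditSubCategoryName", "SubjectKey"]

def clinical_rows(root):
    """Return one row per ClinicalData element, reading attributes with .get()."""
    rows = []
    for clinical in root.iter("ClinicalData"):
        subject = clinical.find("SubjectData")
        rows.append([
            clinical.get("StudyOID"),
            clinical.get("MetaDataVersionOID"),
            clinical.get("mdsol_AuditSubCategoryName"),
            # SubjectKey lives on the nested SubjectData element
            subject.get("SubjectKey") if subject is not None else None,
        ])
    return rows

rows = clinical_rows(ElementTree.fromstring(sample))
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(COLUMNS)
writer.writerows(rows)
print(out.getvalue())
```

For a real file, ElementTree.parse('xml.xml').getroot() would replace the fromstring call, and open('output.csv', 'w', newline='') would replace the string buffer.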
I'm learning how to parse KML files in Python using the pyKML module. The specific file I'm using can be found here, and I've also added it at the bottom of this post. I have saved the file on my computer and named it test.kml.
After some research, I managed to extract a specific portion of the test.kml file and save the result to a DataFrame. Here's my code:
from pykml import parser
import pandas as pd
filename = 'test.kml'
with open(filename) as fobj:
    folder = parser.parse(fobj).getroot().Document
plnm = []
for pm in folder.Placemark:
    plnm1 = pm.name
    plnm.append(plnm1.text)
df = pd.DataFrame()
df['name'] = plnm
print(df)
name
0 Club house
1 By the lake
I would like to add a new column to my DataFrame corresponding to the value of the "holeNumber". I have tried to add the following lines in my for loop but without success.
for pm in folder.Placemark:
    plnm1 = pm.name
    val1 = pm.ExtendedData.holeNumber.value
    plnm.append(plnm1.text)
    val.append(val1.text)
I'm not sure how to access the value from that specific node. The resulting DataFrame I'm looking for is the following:
| name | holeNumber |
|-------------|------------|
| Club house | 1 |
| By the lake | 5 |
Any help would be appreciated.
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>My Golf Course Example</name>
<Placemark>
<name>Club house</name>
<ExtendedData>
<Data name="holeNumber">
<value>1</value>
</Data>
<Data name="holeYardage">
<value>234</value>
</Data>
<Data name="holePar">
<value>4</value>
</Data>
</ExtendedData>
<Point>
<coordinates>-111.956,33.5043</coordinates>
</Point>
</Placemark>
<Placemark>
<name>By the lake</name>
<ExtendedData>
<Data name="holeNumber">
<value>5</value>
</Data>
<Data name="holeYardage">
<value>523</value>
</Data>
<Data name="holePar">
<value>5</value>
</Data>
</ExtendedData>
<Point>
<coordinates>-111.95,33.5024</coordinates>
</Point>
</Placemark>
</Document>
</kml>
Here's a quick way to parse the KML.
plnm = []
holeNumber = []
for pm in folder.Placemark:
    plnm1 = pm.name
    val1 = pm.ExtendedData.Data[0].value
    plnm.append(plnm1.text)
    holeNumber.append(val1.text)
df = pd.DataFrame()
df['name'] = plnm
df['holeNumber'] = holeNumber
print(df)
Or
rows = []
for pm in folder.Placemark:
    name = pm.name.text
    value = pm.ExtendedData.Data[0].value.text
    # DataFrame.append was removed in pandas 2.0, so collect the rows first
    rows.append({ 'name' : name, 'holeNumber' : value })
df = pd.DataFrame(rows, columns=['name', 'holeNumber'])
print(df)
Output:
name holeNumber
0 Club house 1
1 By the lake 5
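Both versions above rely on holeNumber being the first Data element. If that order is not guaranteed, the value can be looked up by its name attribute instead. The sketch below swaps pyKML for the standard library's ElementTree so that it runs without extra packages; the helper name placemark_values is hypothetical, and the sample is a shortened copy of the KML from the question with the Data elements deliberately reordered.

```python
from xml.etree import ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

def placemark_values(root, data_name):
    """Return one dict per Placemark with its name and the named Data value."""
    rows = []
    for pm in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name = pm.findtext("kml:name", namespaces=KML_NS)
        # Select the Data element by its name attribute, not by position
        value = pm.findtext(
            f"kml:ExtendedData/kml:Data[@name='{data_name}']/kml:value",
            namespaces=KML_NS,
        )
        rows.append({"name": name, data_name: value})
    return rows

sample = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
<Placemark><name>Club house</name><ExtendedData>
<Data name="holeYardage"><value>234</value></Data>
<Data name="holeNumber"><value>1</value></Data>
</ExtendedData></Placemark>
<Placemark><name>By the lake</name><ExtendedData>
<Data name="holeNumber"><value>5</value></Data>
<Data name="holePar"><value>5</value></Data>
</ExtendedData></Placemark>
</Document></kml>"""

rows = placemark_values(ET.fromstring(sample), "holeNumber")
print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) to get the two-column table asked for in the question.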
The code below goes through the xml files and parses them into a single csv file
from xml.etree import ElementTree as ET
from collections import defaultdict
import csv
from pathlib import Path
directory = 'path to a folder with xml files'
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']
    writer.writerow(headers)
    xml_files_list = list(map(str, Path(directory).glob('**/*.xml')))
    print(xml_files_list)
    for xml_file in xml_files_list:
        tree = ET.parse(xml_file)
        root = tree.getroot()
        start_nodes = root.findall('.//START')
        for sn in start_nodes:
            row = defaultdict(str)
            repeated_values = dict()
            for k,v in sn.attrib.items():
                repeated_values[k] = v
            for rn in sn.findall('.//Rational'):
                repeated_values['rational'] = rn.text
            for qu in sn.findall('.//Qualify'):
                repeated_values['qualify'] = qu.text
            for ds in sn.findall('.//Description'):
                repeated_values['description_txt'] = ds.text
                repeated_values['description_num'] = ds.attrib['num']
            for st in sn.findall('.//SetData'):
                for k,v in st.attrib.items():
                    row['set_data_'+ str(k)] = v
                for key in repeated_values.keys():
                    row[key] = repeated_values[key]
                row_data = [row[i] for i in headers]
                writer.writerow(row_data)
                row = defaultdict(str)
This is the xml file.
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<Phones>
<Date />
<Prog />
<Box />
<Feature />
<IN>MAFWDS</IN>
<Set>234234</Set>
<Pr>23423</Pr>
<Number>afasfhrtv</Number>
<Simple>dfasd</Simple>
<Nr />
<Get>6070106091</Get>
<Reno>1233</Reno>
</Phones>
<FINAL>
<START id="B001" service_code="0x5196">
<Docs Docs_type="START">
<Rational>225196</Rational>
<Qualify>6251960000A0DE</Qualify>
</Docs>
<Description num="1213f2312">The parameter</Description>
<DataFile dg="12" dg_id="let">
<SetData value="32" />
</DataFile>
</START>
<START id="C003" service_code="0x517B">
<Docs Docs_type="START">
<Rational>23423</Rational>
<Qualify>342342</Qualify>
</Docs>
<Description num="3423423f3423">The third</Description>
<DataFile dg="55" dg_id="big">
<SetData x="E1" value="21259" />
<SetData x="E2" value="02" />
</DataFile>
</START>
<START id="Z048" service_code="0x5198">
<RawData rawdata_type="ASDS">
<Rational>225198</Rational>
<Qualify>343243324234234</Qualify>
</RawData>
<Description num="434234234">The forth</Description>
<DataFile unit="21" unit_id="FEDS">
<FileX unit="eg" discrete="false" axis_pts="19" name="Vsome" text_id="bx5" unit_id="GDFSD" />
<SetData xin="5" xax="233" value="323" />
<SetData xin="123" xax="77" value="555" />
<SetData xin="17" xax="65" value="23" />
</DataFile>
</START>
</FINAL>
</ProjectData>
This is what the output looks like.
I am currently struggling to modify the code so that it goes to Phones (another child of ProjectData), takes the text of the Set and Get elements, joins them with an underscore, and writes the result into a new first column with the header Identify.
The picture below shows how it should look.
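Modify your headers line to:
headers = ['identify', 'id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']
Then, after tree = ET.parse(xml_file), read the two values from Phones:
p_get = tree.find('.//Phones/Get').text
p_set = tree.find('.//Phones/Set').text
and add this info to row_data just before the line writer.writerow(row_data), like this:
row_data.insert(0, p_get + '_' + p_set)
Update: because 'identify' is now the first entry in headers, row_data already has an empty first slot, so assign to it instead of inserting (inserting would shift every other column by one):
row_data[0] = p_get + '_' + p_set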
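As a small self-contained illustration of the same idea, the sketch below reads Get and Set once per document and prefixes every output row with the joined value. It uses a shortened copy of the XML from the question and only a few of the columns; the helper name identify_rows is hypothetical, and findtext with a default is used here as an assumption so that a missing Phones element yields an empty string rather than an error.

```python
from xml.etree import ElementTree as ET

# Shortened version of the XML from the question
sample = """<ProjectData>
  <Phones><Set>234234</Set><Get>6070106091</Get></Phones>
  <FINAL>
    <START id="C003" service_code="0x517B">
      <DataFile><SetData x="E1" value="21259" /><SetData x="E2" value="02" /></DataFile>
    </START>
  </FINAL>
</ProjectData>"""

def identify_rows(root):
    """Return [identify, id, value] rows, one per SetData element."""
    # findtext returns the default when the element is missing
    p_get = root.findtext("./Phones/Get", default="")
    p_set = root.findtext("./Phones/Set", default="")
    identify = p_get + "_" + p_set
    rows = []
    for start in root.iter("START"):
        for set_data in start.iter("SetData"):
            rows.append([identify, start.get("id"), set_data.get("value")])
    return rows

rows = identify_rows(ET.fromstring(sample))
for row in rows:
    print(row)
```

The identify value is the same for every row that comes from one file, so it only needs to be computed once per parsed tree, outside the loop over START nodes.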
I want to read a CSV file and replace the tags within the xml file with the second column of the CSV file. The tag 'name' values are in the first column.
A | B
Value1 | ValueX
Value2 | ValueX
Value3 | ValueY
XML structure looks like.
<products>
<product>
<name>Value1</name>
</product>
<product>
<name>Values2</name>
</product>
<product>
<name>Values3</name>
</product>
</products>
Python code
import csv
import collections
import xml.etree.ElementTree
tree = xml.etree.ElementTree.parse("jolly.xml").getroot()
with open('file.csv', 'r') as f:
    reader = csv.DictReader(f)# read rows into a dictionary format
    reader = csv.reader(f, dialect=csv.excel_tab)
    list = list(reader)
    columns = collections.defaultdict(list)# each value in each column is appended to a list
    for (k, v) in row.items(): #go over each column name and value
        columns[k].append(v)# append the value into the appropriate list
    print columns['A']
    print columns['B']
for elem in tree.findall('.//name'):
    if elem.attrib['name'] == columns['A']:
        elem.attrib['name'] == columns['B']
How can I handle it?
The output should look like this: Value1 should be replaced with ValueX.
Ok here is my solution:
import lxml.etree as ET
arr = ["Value1", "Value2", "Value3"]
arr2 = ["ValueX", "ValueX", "ValueY"]
with open('file.xml', 'rb+') as f:
    tree = ET.parse(f)
    root = tree.getroot()
    for i, item in enumerate(arr):
        # search the <name> elements, not elements tagged with the value itself
        for elem in root.findall('.//name'):
            if elem.text:
                print(item)
                print(arr2[i])
                elem.text = elem.text.replace(item, arr2[i])
    # overwrite the file in place with the modified tree
    f.seek(0)
    f.write(ET.tostring(tree, encoding='UTF-8', xml_declaration=True))
    f.truncate()
I am using hard-coded arrays here; you can copy the values from the file into them. For huge files it needs better code.
Consider using XSLT, the special-purpose, declarative language designed to restructure XML files. Like most general-purpose languages (including ASP, C#, Java, PHP, Perl and VB), Python can run an XSLT 1.0 processor, specifically through the third-party lxml module.
For your purposes, you can dynamically create an XSLT string to use for the transformation. The only loop needed is the one over the CSV data:
import csv
import lxml.etree as ET
# READ IN CSV DATA AND APPEND TO LIST
csvdata = []
with open('file.csv', 'r') as csvfile:
    readCSV = csv.reader(csvfile)
    for line in readCSV:
        csvdata.append(line)
# DYNAMICALLY CREATE XSLT STRING
xsltstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>
'''
for i in range(len(csvdata)):
    xsltstr = xsltstr + \
    '''<xsl:template match="name[.='{0}']">
  <xsl:element name="{1}">
    <xsl:apply-templates />
  </xsl:element>
</xsl:template>
'''.format(*csvdata[i])
xsltstr = xsltstr + '</xsl:transform>'
# PARSE ORIGINAL FILE AND XSLT STRING
dom = ET.parse('jolly.xml')
xslt = ET.fromstring(xsltstr)
# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML (PRETTY PRINT)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open('final.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()
OUTPUT
<?xml version='1.0' encoding='UTF-8'?>
<products>
<product>
<ValueX>Value1</ValueX>
</product>
<product>
<ValueY>Value2</ValueY>
</product>
<product>
<ValueZ>Value3</ValueZ>
</product>
</products>
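If the goal is to replace the text inside each name element (Value1 becomes ValueX) rather than to rename the element as the XSLT above does, a plain dictionary lookup may be enough. This is a sketch under assumptions: the CSV is comma-separated with a header row (A, B), and the helper names load_mapping and replace_names are hypothetical. It uses only the standard library.

```python
import csv
import io
import xml.etree.ElementTree as ET

def load_mapping(csvfile):
    """Map column A (old text) to column B (new text)."""
    reader = csv.reader(csvfile)
    next(reader, None)  # skip the header row (assumed to be present)
    return {row[0]: row[1] for row in reader if len(row) >= 2}

def replace_names(root, mapping):
    """Replace the text of every <name> element that has an entry in mapping."""
    for elem in root.iter("name"):
        if elem.text in mapping:
            elem.text = mapping[elem.text]

csv_text = "A,B\nValue1,ValueX\nValue2,ValueX\nValue3,ValueY\n"
xml_text = (
    "<products>"
    "<product><name>Value1</name></product>"
    "<product><name>Value3</name></product>"
    "</products>"
)

mapping = load_mapping(io.StringIO(csv_text))
root = ET.fromstring(xml_text)
replace_names(root, mapping)
print(ET.tostring(root, encoding="unicode"))
```

A dictionary gives one lookup per element, so the XML is traversed only once regardless of how many rows the CSV has. Names that do not appear in the CSV are left unchanged.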