I try to make a table (or csv, I'm using pandas dataframe) from the information of an XML file.
The file is here (.zip is 14 MB, XML is ~370MB), https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information of different languages - node.js, python, java etc. aka, CPE 2.3 list by the US government org NVD.
this is how it looks like in the first 30 rows:
<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
<generator>
<product_name>National Vulnerability Database (NVD)</product_name>
<product_version>4.9</product_version>
<schema_version>2.3</schema_version>
<timestamp>2022-03-17T03:51:01.909Z</timestamp>
</generator>
<cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
<title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
<references>
<reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
<reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
</references>
<cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
</cpe-item>
The tree structure of the XML file is quite simple, the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.
From 'title', I want the text in the element;
From 'cpe23-item', I want the attribute 'name';
From 'references', I want the attributes 'href' from its great-grandchildren, 'reference'.
The dataframe should look like this:
| cpe23_name | title_text | ref1 | ref2 | ref3 | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN | NaN
1 | 'cpe23name 2'| 'this is a java pkg' | 'url1'| 'url2'| NaN | NaN
...
my code is here,finished in ~100sec:
import xml.etree.ElementTree as et
xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()
import time
start_time = time.time()
df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']
title = '{http://cpe.mitre.org/dictionary/2.0}title'
ref = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []
i = 0
while i < len(xroot):
for elm in xroot[i]:
if elm.tag == title:
p_text = elm.text
#assign p_text
elif elm.tag == ref:
for nn in elm:
s = nn.text.lower()
#check the lower text in refs
if 'version' in s:
p_vers = nn.attrib.get('href')
#assign p_vers
elif 'advisor' in s:
p_advi = nn.attrib.get('href')
#assign p_advi
elif 'product' in s:
p_prod = nn.attrib.get('href')
#assign p_prod
elif 'vendor' in s:
p_vend = nn.attrib.get('href')
#assign p_vend
elif 'change' in s:
p_chan = nn.attrib.get('href')
#assign p_vend
else:
p_othe = nn.attrib.get('href')
elif elm.tag == cpe_item:
p_cpe = elm.attrib.get("name")
#assign p_cpe
else:
print(elm.tag)
row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
rows.append(row)
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
print(len(rows)) #this shows how far I got during the running time
i+=1
out_df1 = pd.DataFrame(rows, columns = df_cols)# move this part outside the loop by removing the indent
print("---853k rows take %s seconds ---" % (time.time() - start_time))
updated: the faster way is to move the 2nd last row out side the loop. Since 'rows' already get info in each loop, there is no need to make a new dataframe every time.
the running time now is 136.0491042137146 seconds. yay!
Since your XML is fairly flat, consider the recently added IO module, pandas.read_xml introduced in v1.3. Given XML uses a default namespace, to reference elements in xpath use namespaces argument:
url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)
If you do not have the default parser, lxml, installed, use the etree parser:
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)
I am trying to parse a file as follows:
testp.txt
title = Test Suite A;
timeout = 10000
exp_delay = 500;
log = TRUE;
sect
{
type = typeA;
name = "HelloWorld";
output_log = "c:\test\out.log";
};
sect
{
name = "GoodbyeAll";
type = typeB;
comm1_req = 0xDEADBEEF;
comm1_resp = (int, 1234366);
};
The file first contains a section with parameters and then some sects. I can parse a file containing just parameters and I can parse a file just containing sects but I can't parse both.
from pyparsing import *
from pathlib import Path
command_req = Word(alphanums)
command_resp = "(" + delimitedList(Word(alphanums)) + ")"
kW = Word(alphas+'_', alphanums+'_') | command_req | command_resp
keyName = ~Literal("sect") + Word(alphas+'_', alphanums+'_') + FollowedBy("=")
keyValue = dblQuotedString.setParseAction( removeQuotes ) | OneOrMore(kW,stopOn=LineEnd())
param = dictOf(keyName, Suppress("=")+keyValue+Optional(Suppress(";")))
node = Group(Literal("sect") + Literal("{") + OneOrMore(param) + Literal("};"))
final = OneOrMore(node) | OneOrMore(param)
param.setDebug()
p = Path(__file__).with_name("testp.txt")
with open(p) as f:
try:
x = final.parseFile(f, parseAll=True)
print(x)
print("...")
dx = x.asDict()
print(dx)
except ParseException as pe:
print(pe)
The issue I have is that param matches against sect so it expects a =. So I tried putting in ~Literal("sect") in keyName but that just leads to another error:
Exception raised:Found unwanted token, "sect", found '\n' (at char 188), (line:4, col:56)
Expected end of text, found 's' (at char 190), (line:6, col:1)
How do I get it use one parse method for sect and another (param) if not sect?
My final goal would be to have the whole lot in a Dict with the global params and sects included.
EDIT
Think I've figured it out:
This line...
final = OneOrMore(node) | OneOrMore(param)
...should be:
final = ZeroOrMore(param) + ZeroOrMore(node)
But I wonder if there is a more structured way (as I'd ultimately like a dict)?
I have a problem with my Python parsing. I have this kind of xml file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="maria" audio_filename="agora_2007_11_05_a" version="11" version_date="080826" xml:lang="catalan">
<Topics>
<Topic id="to1" desc="music"/>
<Topic id="to2" desc="bgnoise"/>
<Topic id="to4" desc="silence"/>
<Topic id="to5" desc="speech"/>
<Topic id="to6" desc="speech+music"/>
</Topics>
<Speakers>
<Speaker id="spk1" name="Xavi Coral" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk2" name="Ferran Martínez" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk3" name="Jordi Barbeta" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
</Speakers>
<Section type="report" topic="to6" startTime="111.286" endTime="119.308">
<Turn speaker="spk1" startTime="111.286" endTime="119.308" mode="planned" channel="studio">
<Sync time="111.286"/>
ha estat director del diari La Vanguàrdia,
<Sync time="113.56"/>
ha estat director general de Barcelona Televisió i director del Centre Territorial de Televisió Espanyola a Catalunya,
<Sync time="119.308"/>
actualment col·labora en el diari
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>
</Section>
And this is my Python code:
import xml.etree.ElementTree as etree
import os
import sys
xmlD = etree.parse(sys.stdin)
root = xmlD.getroot()
sections = root.getchildren()[2].getchildren()
for section in sections:
turns = section.getchildren()
for turn in turns:
speaker = turn.get('speaker')
mode = turn.get('mode')
childs = turn.getchildren()
for child in childs:
time = child.get('time')
opt = child.get('desc')
extent = child.get('extent')
if opt == 'es' and extent == 'begin':
opt = "ESP:"
elif opt == "la" extent == 'begin':
opt = "LAT:"
elif opt == "en" extent == 'begin':
opt = "ENG:"
else:
opt = ""
if time:
time = time
else:
time = ""
print time, opt+child.tail.encode('latin-1')
I need to mark the words pronounced in other language with this tag LANG: For example:
spanish words ENG:hello, spanish words, but when I have 2 consecutive words pronounced in other language I don't know how to do this: spanish words ENG:hello ENG:man, spanish words . The change of language is in the Event xml tag.
Now, at the Output I have:
actualment col·labora en el diari ESP:El Periódico de Catalunya. and I want: actualment col·labora en el diari ESP:El ESP:Periódico de Catalunya.
Anyone could help me?
Thank you!
You can do something like -
print time, opt+(" " + opt).join([c.encode('latin-1').decode('latin-1') for c in child.tail.split(' ')])
instead of your print statement
I have written python script for parsing a file.
python script :
from xml.dom.minidom import parse
import xml.dom.minidom
DOMTree = xml.dom.minidom.parse("details.xml")
CallDetailRecord = DOMTree.documentElement
def getText(data):
detail = str(data)
#match = re.search(r'(.*\s)(false).*|(.*\s)(true).*',detail,re.IGNORECASE)
match_false = re.search(r'(.*\s)(false).*',detail,re.IGNORECASE)
if (match_false):
return match_false.group(2)
match_true = re.search(r'(.*\s)(true).*',detail,re.IGNORECASE)
if (match_true):
return match_true.group(2)
org_addr = CallDetailRecord.getElementsByTagName("origAddress")
for record in org_addr:
ton_1 = record.getElementsByTagName("ton")[0]
npi_1 = record.getElementsByTagName("npi")[0]
pid_1 = record.getElementsByTagName("pid")[0]
msdn_1 = record.getElementsByTagName("msisdn")[0]
org_ton = ton_1.childNodes[0].data
org_npi = npi_1.childNodes[0].data
org_pid = pid_1.childNodes[0].data
org_msdn = msdn_1.childNodes[0].data
recp_addr = CallDetailRecord.getElementsByTagName("recipAddress")
for record in recp_addr:
ton_1 = record.getElementsByTagName("ton")[0]
npi_1 = record.getElementsByTagName("npi")[0]
pid_1 = record.getElementsByTagName("pid")[0]
msdn_1 = record.getElementsByTagName("msisdn")[0]
rec_ton = ton_1.childNodes[0].data
rec_npi = npi_1.childNodes[0].data
rec_pid = pid_1.childNodes[0].data
rec_msdn = msdn_1.childNodes[0].data
dgti_addr = CallDetailRecord.getElementsByTagName("dgtiAddress")
for record in dgti_addr:
ton_1 = record.getElementsByTagName("ton")[0]
npi_1 = record.getElementsByTagName("npi")[0]
pid_1 = record.getElementsByTagName("pid")[0]
msdn_1 = record.getElementsByTagName("msisdn")[0]
dgti_ton = ton_1.childNodes[0].data
dgti_npi = npi_1.childNodes[0].data
dgti_pid = pid_1.childNodes[0].data
dgti_msdn = msdn_1.childNodes[0].data
calling_line_id = CallDetailRecord.getElementsByTagName("callingLineId")
for record in calling_line_id:
ton_1 = record.getElementsByTagName("ton")[0]
npi_1 = record.getElementsByTagName("npi")[0]
pid_1 = record.getElementsByTagName("pid")[0]
msdn_1 = record.getElementsByTagName("msisdn")[0]
clid_ton = ton_1.childNodes[0].data
clid_npi = npi_1.childNodes[0].data
clid_pid = pid_1.childNodes[0].data
clid_msdn = msdn_1.childNodes[0].data
untransl_OrigAddress = CallDetailRecord.getElementsByTagName("untranslOrigAddress")
sub_time = CallDetailRecord.getElementsByTagName("submitTime")[0]
if(sub_time):
sub_time_value = sub_time.childNodes[0].data
print " \n SUBMIT TIME: %s \n" %sub_time_value
sub_date = CallDetailRecord.getElementsByTagName("submitDate")[0]
if(sub_date):
sub_date_value = sub_date.childNodes[0].data
print " \n SUBMIT DATE: %s\n" %sub_time_value
termin_time = CallDetailRecord.getElementsByTagName("terminTime")[0]
if(termin_time):
termin_time_value = termin_time.childNodes[0].data
print " \n TERMIN TIME: %s \n" %termin_time_value
termin_date = CallDetailRecord.getElementsByTagName("terminDate")[0]
if(termin_date):
termin_date_value = termin_date.childNodes[0].data
print " \n TERMIN DATE: %s\n" %termin_time_value
status = CallDetailRecord.getElementsByTagName("status")[0]
if(status):
status_value = status.childNodes[0].data
print " \n STATUS: %s\n" %status_value
msglength = CallDetailRecord.getElementsByTagName("lengthOfMessage")[0]
if(msglength):
msglength_value = msglength.childNodes[0].data
print " \n MESSAGE LENGTH: %s\n" %msglength_value
prioIndicator = CallDetailRecord.getElementsByTagName("prioIndicator")[0]
if (prioIndicator):
#print prioIndicator.childNodes[0].data
prioIndicator_value = getText(prioIndicator.childNodes[0])
print " \n PRIO INDICATOR: %s\n" %prioIndicator_value
To reduce the Size, I'm not posting my entire script.
INPUT FILE:
<CallDetailRecord>
<origAddress>
<ton>international</ton>
<npi>telephone</npi>
<pid>plmn</pid>
<msisdn>32410000</msisdn>
</origAddress>
<recipAddress>
<ton>international</ton>
<npi>telephone</npi>
<pid>plmn</pid>
<msisdn>918337807718</msisdn>
</recipAddress>
<submitDate>14-08-20</submitDate>
<submitTime>19:36:29</submitTime>
<status>deleted</status>
<terminDate>14-08-23</terminDate>
<terminTime>19:51:52</terminTime>
<lengthOfMessage>38</lengthOfMessage>
<prioIndicator><false/></prioIndicator>
<deferIndicator><true/></deferIndicator>
<notifIndicator><false/></notifIndicator>
<recipIntlMobileSubId>26204487</recipIntlMobileSubId>
<callingLineId>
<ton>international</ton>
<npi>telephone</npi>
<pid>plmn</pid>
<msisdn>32410000</msisdn>
</callingLineId>
<smsContentDcs>0</smsContentDcs>
<messageReference>13</messageReference>
<deliveryAttempts>151</deliveryAttempts>
<untranslOrigAddress>
<ton>international</ton>
<npi>telephone</npi>
<pid>plmn</pid>
<msisdn>32410000</msisdn>
</untranslOrigAddress>
<tpDCS>0</tpDCS>
<genericUrgencyLevel>bulk</genericUrgencyLevel>
<teleserviceId>4098</teleserviceId>
<recipNetworkType>gsm</recipNetworkType>
<rbdlFlags1>
10000000000000000000000000000000
</rbdlFlags1>
</CallDetailRecord>
Script works fine for this file. But suppose consider I have more than one
CallDetailRecord>, then how to parse that file.
EXAMPLE:
<CallDetailRecord>
.
.
.
</CallDetailRecord>
<CallDetailRecord>
.
.
.
</CallDetailRecord>
<CallDetailRecord>
.
.
.
</CallDetailRecord>
Hopefully waiting for some good results :)!!!
Use wrapper class to parse this file. Wrap your file which contains multiple top elements into a wrapper like this
< wrapper>
#your file
< /wrapper>
and then start parsing the file with the root element . The parser will construct a document with a root element wrapper containing all the elements that were in the file.
I want to parse srt subtitles:
1
00:00:12,815 --> 00:00:14,509
Chlapi, jak to jde s
těma pracovníma světlama?.
2
00:00:14,815 --> 00:00:16,498
Trochu je zesilujeme.
3
00:00:16,934 --> 00:00:17,814
Jo, sleduj.
Every item into structure. With this regexs:
A:
RE_ITEM = re.compile(r'(?P<index>\d+).'
r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
r'(?P<text>.*?)', re.DOTALL)
B:
RE_ITEM = re.compile(r'(?P<index>\d+).'
r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
r'(?P<text>.*)', re.DOTALL)
And this code:
for i in Subtitles.RE_ITEM.finditer(text):
result.append((i.group('index'), i.group('start'),
i.group('end'), i.group('text')))
With code B I have only one item in array (because of greedy .*) and with code A I have empty 'text' because of no-greedy .*?
How to cure this?
Thanks
Why not use pysrt?
I became quite frustrated with srt libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can get it at https://github.com/cdown/srt.
I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.
Here's a usage example with your sample input:
>>> import srt, pprint
>>> gen = srt.parse('''\
... 1
... 00:00:12,815 --> 00:00:14,509
... Chlapi, jak to jde s
... těma pracovníma světlama?.
...
... 2
... 00:00:14,815 --> 00:00:16,498
... Trochu je zesilujeme.
...
... 3
... 00:00:16,934 --> 00:00:17,814
... Jo, sleduj.
...
... ''')
>>> pprint.pprint(list(gen))
[Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'),
Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]
The text is followed by an empty line, or the end of file. So you can use:
r' .... (?P<text>.*?)(\n\n|$)'
Here's some code I had lying around to parse SRT files:
from __future__ import division
import datetime
class Srt_entry(object):
def __init__(self, lines):
def parsetime(string):
hours, minutes, seconds = string.split(u':')
hours = int(hours)
minutes = int(minutes)
seconds = float(u'.'.join(seconds.split(u',')))
return datetime.timedelta(0, seconds, 0, 0, minutes, hours)
self.index = int(lines[0])
start, arrow, end = lines[1].split()
self.start = parsetime(start)
if arrow != u"-->":
raise ValueError
self.end = parsetime(end)
self.lines = lines[2:]
if not self.lines[-1]:
del self.lines[-1]
def __unicode__(self):
def delta_to_string(d):
hours = (d.days * 24) \
+ (d.seconds // (60 * 60))
minutes = (d.seconds // 60) % 60
seconds = d.seconds % 60 + d.microseconds / 1000000
return u','.join((u"%02d:%02d:%06.3f"
% (hours, minutes, seconds)).split(u'.'))
return (unicode(self.index) + u'\n'
+ delta_to_string(self.start)
+ ' --> '
+ delta_to_string(self.end) + u'\n'
+ u''.join(self.lines))
srt_file = open("foo.srt")
entries = []
entry = []
for line in srt_file:
if options.decode:
line = line.decode(options.decode)
if line == u'\n':
entries.append(Srt_entry(entry))
entry = []
else:
entry.append(line)
srt_file.close()
splits = [s.strip() for s in re.split(r'\n\s*\n', text) if s.strip()]
regex = re.compile(r'''(?P<index>\d+).*?(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*.*?\s*(?P<text>.*)''', re.DOTALL)
for s in splits:
r = regex.search(s)
print r.groups()
Here's a snippet I wrote which converts SRT files into dictionaries:
import re
def srt_time_to_seconds(time):
split_time=time.split(',')
major, minor = (split_time[0].split(':'), split_time[1])
return int(major[0])*1440 + int(major[1])*60 + int(major[2]) + float(minor)/1000
def srt_to_dict(srtText):
subs=[]
for s in re.sub('\r\n', '\n', srtText).split('\n\n'):
st = s.split('\n')
if len(st)>=3:
split = st[1].split(' --> ')
subs.append({'start': srt_time_to_seconds(split[0].strip()),
'end': srt_time_to_seconds(split[1].strip()),
'text': '<br />'.join(j for j in st[2:len(st)])
})
return subs
Usage:
import srt_to_dict
with open('test.srt', "r") as f:
srtText = f.read()
print srt_to_dict(srtText)