Converting XML file to Data Frame in Python

I have to convert an XML file from a URL link to a dataframe.
I have written code that gives a dictionary from the XML file, but I am not able to convert it into a dataframe. Please suggest if any other approach is suitable for this XML file.
import requests
import xmltodict
import xml.etree.ElementTree as ET
import pandas as pd

xml_data = requests.get('http://wbes.nrldc.in/xml/FullSchedule-(130)-19-01-2021.xml')
root = ET.fromstring(xml_data.text)
root = ET.tostring(root, encoding='utf8', method='xml')
data_dict = dict(xmltodict.parse(root))

Consider parsing the data with DOM tools like etree (or the feature-rich, third-party lxml) and then building a list of dictionaries at the repeating <FullSchedule> element to pass into the DataFrame constructor:
import urllib.request as rq
import xml.etree.ElementTree as et
import pandas as pd
url = "https://wbes.nrldc.in/xml/FullSchedule-(184)-30-01-2021.xml"
doc = rq.urlopen(url)
tree = et.fromstring(doc.read()) # NOTE: TAKES SEVERAL MINUTES DUE TO SIZE
data = [{t.tag: t.text.strip() if t.text is not None else None
         for t in fs.findall("*")}
        for fs in tree.findall(".//FullSchedule")]
df = pd.DataFrame(data)
df.shape
# (1152, 21)
df.columns
# Index(['Buyer', 'Seller', 'ScheduleName', 'ScheduleSubTypeName', 'ScheduleDate',
# 'ScheduleAmount', 'BuyerAmount', 'SellerAmount', 'PocInjectionLoss',
# 'PocDrawalLoss', 'StateInjectionLoss', 'StateDrawalLoss',
# 'DiscomInjectionLoss', 'DiscomDrawalLoss', 'Trader', 'LinkName',
# 'OffBarTotal', 'OffBarAllocatedFromPool', 'Open', 'Combined',
# 'ApprovalNo'], dtype='object')
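Since every value parsed from .text comes back as a string, a quick follow-up step (a sketch; the column names are taken from the df.columns output above) is to convert the numeric and date columns:
# Sketch: cast text columns to proper dtypes (names from df.columns above).
num_cols = ["ScheduleAmount", "BuyerAmount", "SellerAmount"]
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")
df["ScheduleDate"] = pd.to_datetime(df["ScheduleDate"], errors="coerce")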
Because <Buyer> and <Seller> contain nested elements, they come back blank above. Hence, consider additional parsing and compilation; the only difference from the code above is the findall XPath.
data = [{t.tag: t.text.strip() if t.text is not None else None
         for t in fs.findall("*")}
        for fs in tree.findall(".//FullSchedule/Buyer")]
df = pd.DataFrame(data)
print(df)
# Acronym ParentState WBESParentStateAcronym
# 0 HARYANA HARYANA HARYANA_STATE
# 1 JK&LADAKH JAMMU AND KASHMIR JK&LADAKH_UT
# 2 UPPCL UTTAR PRADESH UTTARPRADESH_STATE
# 3 JK&LADAKH JAMMU AND KASHMIR JK&LADAKH_UT
# 4 UPPCL UTTAR PRADESH UTTARPRADESH_STATE
# ... ... ...
# 1147 CHANDIGARH CHANDIGARH CHANDIGARH_UT
# 1148 PUNJAB PUNJAB PUNJAB_STATE
# 1149 DELHI DELHI DELHI_UT
# 1150 HARYANA HARYANA HARYANA_STATE
# 1151 CHANDIGARH CHANDIGARH CHANDIGARH_UT
data = [{t.tag: t.text.strip() if t.text is not None else None
         for t in fs.findall("*")}
        for fs in tree.findall(".//FullSchedule/Seller")]
df = pd.DataFrame(data)
print(df)
# Acronym ParentState WBESParentStateAcronym
# 0 KAMENG None None
# 1 KAPS None None
# 2 VSTPS V None None
# 3 SOLAPUR None None
# 4 LARA-I None None
# ... ... ...
# 1147 NAPP None None
# 1148 BHAKRA None None
# 1149 CHAMERA3 None None
# 1150 RAPPC None None
# 1151 BHAKRA None None
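To stitch the Buyer and Seller attributes back onto the schedule rows, one option is a positional concat (a sketch; it assumes every <FullSchedule> has exactly one <Buyer> and one <Seller>, so the three frames stay aligned by row order, and the *_df names are illustrative):
# Build the three frames with the comprehensions shown above (names are illustrative).
schedule_df = pd.DataFrame([{t.tag: t.text.strip() if t.text is not None else None
                             for t in fs.findall("*")}
                            for fs in tree.findall(".//FullSchedule")])
buyer_df = pd.DataFrame([{t.tag: t.text.strip() if t.text is not None else None
                          for t in b.findall("*")}
                         for b in tree.findall(".//FullSchedule/Buyer")])
seller_df = pd.DataFrame([{t.tag: t.text.strip() if t.text is not None else None
                           for t in s.findall("*")}
                          for s in tree.findall(".//FullSchedule/Seller")])

# Positional concat relies on one Buyer and one Seller per FullSchedule.
combined = pd.concat(
    [schedule_df.drop(columns=["Buyer", "Seller"]),
     buyer_df.add_prefix("Buyer_"),
     seller_df.add_prefix("Seller_")],
    axis=1
)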
By the way, pandas.read_xml() is in the works (authored by me) and uses the above algorithm, so the above may soon be handled with the code below. See the GitHub issues post.
url = "https://wbes.nrldc.in/xml/FullSchedule-(184)-30-01-2021.xml"
fs_df = pd.read_xml(url, xpath=".//FullSchedule", parser="lxml")
fs_df = pd.read_xml(url, xpath=".//FullSchedule", parser="etree")
buyer_df = pd.read_xml(url, xpath=".//FullSchedule/Buyer")
seller_df = pd.read_xml(url, xpath=".//FullSchedule/Seller")
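If you would rather keep the xmltodict approach from your snippet, the parsed dictionary can usually be handed to pandas.json_normalize once you drill down to the list of repeating records (a sketch; the key path below is an assumption, so inspect data_dict.keys() first):
import requests
import xmltodict
import pandas as pd

url = 'http://wbes.nrldc.in/xml/FullSchedule-(130)-19-01-2021.xml'
data_dict = xmltodict.parse(requests.get(url).text)

# The nesting below is hypothetical; print(data_dict.keys()) and drill down
# until you reach the list of repeating <FullSchedule> records.
records = data_dict['FullScheduleList']['FullSchedule']
df = pd.json_normalize(records)
print(df.shape)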

Related

What is the best way to parse large XML and generate a dataframe with the data in the XML (with Python or otherwise)?

I am trying to make a table (or CSV; I'm using a pandas dataframe) from the information in an XML file.
The file is here (.zip is 14 MB, XML is ~370 MB): https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information for different languages (node.js, python, java etc.), aka the CPE 2.3 list from the US government org NVD.
This is how the first 30 rows look:
<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
<generator>
<product_name>National Vulnerability Database (NVD)</product_name>
<product_version>4.9</product_version>
<schema_version>2.3</schema_version>
<timestamp>2022-03-17T03:51:01.909Z</timestamp>
</generator>
<cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
<title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
<references>
<reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
<reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
</references>
<cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
</cpe-item>
The tree structure of the XML file is quite simple: the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.
From 'title', I want the text in the element;
from 'cpe23-item', I want the attribute 'name';
from 'references', I want the 'href' attributes of its 'reference' children.
The dataframe should look like this:
| cpe23_name | title_text | ref1 | ref2 | ref3 | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN | NaN
1 | 'cpe23name 2'| 'this is a java pkg' | 'url1'| 'url2'| NaN | NaN
...
My code is here; it finished in ~100 sec:
import xml.etree.ElementTree as et
import pandas as pd

xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()

import time
start_time = time.time()

df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']
title = '{http://cpe.mitre.org/dictionary/2.0}title'
ref = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'

p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None

rows = []
i = 0
while i < len(xroot):
    for elm in xroot[i]:
        if elm.tag == title:
            p_text = elm.text                        # assign p_text
        elif elm.tag == ref:
            for nn in elm:
                s = nn.text.lower()                  # check the lower-cased text in refs
                if 'version' in s:
                    p_vers = nn.attrib.get('href')   # assign p_vers
                elif 'advisor' in s:
                    p_advi = nn.attrib.get('href')   # assign p_advi
                elif 'product' in s:
                    p_prod = nn.attrib.get('href')   # assign p_prod
                elif 'vendor' in s:
                    p_vend = nn.attrib.get('href')   # assign p_vend
                elif 'change' in s:
                    p_chan = nn.attrib.get('href')   # assign p_chan
                else:
                    p_othe = nn.attrib.get('href')
        elif elm.tag == cpe_item:
            p_cpe = elm.attrib.get("name")           # assign p_cpe
        else:
            print(elm.tag)
    row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
    rows.append(row)
    p_cpe = None
    p_text = None
    p_vend = None
    p_prod = None
    p_vers = None
    p_chan = None
    p_advi = None
    p_othe = None
    print(len(rows))  # shows how far the loop got during the run
    i += 1
out_df1 = pd.DataFrame(rows, columns=df_cols)  # moved outside the loop by removing the indent (see update below)
print("---853k rows take %s seconds ---" % (time.time() - start_time))
Updated: the faster way is to move the second-to-last line outside the loop. Since rows already collects the info on each iteration, there is no need to build a new dataframe every time.
The running time is now 136.0491042137146 seconds. Yay!
Since your XML is fairly flat, consider the recently added IO method, pandas.read_xml, introduced in v1.3. Because the XML uses a default namespace, reference elements in the xpath via the namespaces argument:
url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"
df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)
If you do not have the default parser, lxml, installed, use the etree parser:
df = pd.read_xml(
    url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)
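Note that read_xml on cpe-item returns the child elements and attributes of each item, so the reference hrefs nested one level deeper still need separate handling. A streaming alternative with ElementTree's iterparse (a sketch; the ref1/ref2/... column names simply mirror the layout you described rather than any categorization of the links) could look like:
import xml.etree.ElementTree as ET
import pandas as pd

NS_DICT = "{http://cpe.mitre.org/dictionary/2.0}"
NS_23 = "{http://scap.nist.gov/schema/cpe-extension/2.3}"

rows = []
# iterparse streams the ~370 MB file instead of holding the whole tree in memory.
for event, elem in ET.iterparse("official-cpe-dictionary_v2.3.xml", events=("end",)):
    if elem.tag == NS_DICT + "cpe-item":
        title = elem.find(NS_DICT + "title")
        cpe23 = elem.find(NS_23 + "cpe23-item")
        hrefs = [r.attrib.get("href")
                 for r in elem.findall(NS_DICT + "references/" + NS_DICT + "reference")]
        row = {"cpe23_name": cpe23.attrib.get("name") if cpe23 is not None else None,
               "title_text": title.text if title is not None else None}
        row.update({f"ref{i + 1}": href for i, href in enumerate(hrefs)})
        rows.append(row)
        elem.clear()   # release the processed item to keep memory flat

df = pd.DataFrame(rows)
print(df.shape)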

How to create reusable functions in Python to get the required columns and apply conditions in a query without hardcoding values directly?

Sample dataframe
ticket_start_time ticket_end_time status customer_type ticket_type customer_type
0 None None None None None None
1 None None None None None None
2 None None None None None None
3 None None None None None None
8 2021-10-22 16:26:50 2021-10-22 19:16:28 Por Acción R INSTALLATION R
9 2021-10-22 16:26:50 2021-10-22 16:38:23 Por Acción R INSTALLATION R
10 2021-10-22 16:26:50 2021-10-22 19:16:28 Por Acción R INSTALLATION R
I'm using the below code, but it is hardcoded. I want to create reusable functions for the above dataframe.
import pyarrow
import pandas as pd

df = read_df()
columns_list = [req_cols]
filter_conditions = ["status = 'closed'" and "customer_type = 'R'"]
df.query()

def select_filter_df(df, columns_list, filter_conditions):
    # apply the filters, and query
    return df
Use function parameters?
import os
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def select_filter_df(filename, columns, querystring):
    try:
        df = pd.read_parquet(filename, columns=columns, engine='pyarrow')
        df = df.query(querystring)
    except Exception as error:
        logger.error(error)
    return df

# How to use it:
file_path = r"D:\Project_centriam"
filename = os.path.join(file_path, "merged_result.parquet")
cols = ["ticket_start_time", "ticket_end_time", "status", "customer_type", "ticket_type", "customer_type"]
qs = 'status == "Rechazado" and ticket_type =="INSTALLATION" and customer_type =="R"'
df = select_filter_df(filename, cols, qs)
required_cols = ["install_ticket_start_time", "install_ticket_end_time", "install_status", "install_customer_type", "install_ticket_type"]
filter_condition = 'install_status == "Rechazado" and install_ticket_type =="INSTALLATION" and install_customer_type =="R"'

def filter_df(df, column_list, filtered_df):
    try:
        df = pd.read_parquet(filename, engine='pyarrow')
        column_list = df.filter(required_cols)
        filtered_df = column_list.query(filter_condition)
        print(filtered_df)
    except Exception as error:
        logger.error(error)

filter_df(df, column_list, filtered_df)
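If the intent is for the function to actually use its parameters rather than the outer-scope names (note that filter_df above reads filename, required_cols and filter_condition from the enclosing scope), a minimal sketch might be (the select_and_filter name is illustrative):
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def select_and_filter(filename, column_list, filter_condition):
    """Read a parquet file, keep only column_list, and apply filter_condition."""
    try:
        df = pd.read_parquet(filename, columns=column_list, engine="pyarrow")
        return df.query(filter_condition)
    except Exception as error:
        logger.error(error)
        return pd.DataFrame(columns=column_list)   # empty frame on failure

# Usage with the values defined above:
result = select_and_filter(filename, required_cols, filter_condition)
print(result)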

Python - XML file to Pandas Dataframe [duplicate]

This question already has answers here:
How to convert an XML file to nice pandas dataframe?
(5 answers)
Closed 1 year ago.
I'm fairly new to Python and am hoping to get some help transforming an XML file into a pandas DataFrame. I have searched other resources but am still stuck. I'm looking to get all the fields between the <RECORDING> tags into a table. Any help is greatly appreciated! Thank you.
Below is the code I tried but it not working properly.
import xml.etree.ElementTree as ET
import pandas as pd

xml_data = open('5249009-08-34-59-126029.xml', 'r').read()
root = ET.XML(xml_data)

data = []
cols = []
for i, child in enumerate(root):
    data.append([subchild.text for subchild in child])
    cols.append(child.tag)

df = pd.DataFrame(data).T
df.columns = cols
print(df)
Below is sample input data:
<?xml version="1.0"?>
<RECORDING>
<IDENT>0</IDENT>
<DEVICEID>133242232</DEVICEID>
<DEVICEALIAS>52232009</DEVICEALIAS>
<GROUP>1823481655</GROUP>
<GATE>1011655</GATE>
<ANI>7777777777</ANI>
<DNIS>777777777</DNIS>
<USER1>00:07:53.2322691,00:03:21.34232761</USER1>
<USER2>text</USER2>
<USER3/>
<USER4/>
<USER5>34fc0a8d-d5632c9b1</USER5>
<USER6>000dfsdf98701596638094</USER6>
<USER7>97</USER7>
<USER8>00701596638094</USER8>
<USER9>10155</USER9>
<USER10/>
<USER11/>
<USER12/>
<USER13>Text</USER13>
<USER14>4</USER14>
<USER15>10</USER15>
<CALLSEGMENTID/>
<CALLID>9870</CALLID>
<FILENAME>\\folderpath\folderpath\folderpath\folderpath\2020\Aug\05\5249009\52343109-234234-34-59-1234234029</FILENAME>
<DURATION>201</DURATION>
<STARTYEAR>2020</STARTYEAR>
<STARTMONTH>08</STARTMONTH>
<STARTMONTHNAME>August</STARTMONTHNAME>
<STARTDAY>05</STARTDAY>
<STARTDAYNAME>Wednesday</STARTDAYNAME>
<STARTHOUR>08</STARTHOUR>
<STARTMINUTE>34</STARTMINUTE>
<STARTSECOND>59</STARTSECOND>
<PRIORITY>50</PRIORITY>
<RECORDINGTYPE>S</RECORDINGTYPE>
<CALLDIRECTION>I</CALLDIRECTION>
<SCREENCAPTURE>7</SCREENCAPTURE>
<KEEPCALLFORDAYS>90</KEEPCALLFORDAYS>
<BLACKOUTREMOTEAUDIO>false</BLACKOUTREMOTEAUDIO>
<BLACKOUTS/>
</RECORDING>
One possible solution for how to parse the file:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("your_file.xml", "r"), "xml")
d = {}
for tag in soup.RECORDING.find_all(recursive=False):
    d[tag.name] = tag.get_text(strip=True)
df = pd.DataFrame([d])
print(df)
Prints:
IDENT DEVICEID DEVICEALIAS GROUP GATE ANI DNIS USER1 USER2 USER3 USER4 USER5 USER6 USER7 USER8 USER9 USER10 USER11 USER12 USER13 USER14 USER15 CALLSEGMENTID CALLID FILENAME DURATION STARTYEAR STARTMONTH STARTMONTHNAME STARTDAY STARTDAYNAME STARTHOUR STARTMINUTE STARTSECOND PRIORITY RECORDINGTYPE CALLDIRECTION SCREENCAPTURE KEEPCALLFORDAYS BLACKOUTREMOTEAUDIO BLACKOUTS
0 0 133242232 52232009 1823481655 1011655 7777777777 777777777 00:07:53.2322691,00:03:21.34232761 text 34fc0a8d-d5632c9b1 000dfsdf98701596638094 97 00701596638094 10155 Text 4 10 9870 \\folderpath\folderpath\folderpath\folderpath\... 201 2020 08 August 05 Wednesday 08 34 59 50 S I 7 90 false
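If you prefer to avoid the BeautifulSoup dependency, an equivalent sketch with the standard-library ElementTree (assuming a single <RECORDING> root with flat children, as in the sample) is:
import xml.etree.ElementTree as ET
import pandas as pd

root = ET.parse("5249009-08-34-59-126029.xml").getroot()

# One row: each direct child of <RECORDING> becomes a column.
record = {child.tag: (child.text.strip() if child.text else None) for child in root}
df = pd.DataFrame([record])
print(df)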

How to reshape data in Python?

I have a data set as given below-
Timestamp = 22-05-2019 08:40 :Light = 64.00 :Temp_Soil = 20.5625 :Temp_Air = 23.1875 :Soil_Moisture_1 = 756 :Soil_Moisture_2 = 780 :Soil_Moisture_3 = 1002
Timestamp = 22-05-2019 08:42 :Light = 64.00 :Temp_Soil = 20.5625 :Temp_Air = 23.125 :Soil_Moisture_1 = 755 :Soil_Moisture_2 = 782 :Soil_Moisture_3 = 1002
And I want to reshape (rearrange) the dataset so the header columns are [Timestamp, Light, Temp_Soil, Temp_Air, Soil_Moisture_1, Soil_Moisture_2, Soil_Moisture_3] and their values are the row entries, in Python.
One possible solution:
Instead of a "true" input file, I used a string:
inp="""Timestamp = 22-05-2019 08:40 :Light = 64.00 :TempSoil = 20.5625 :TempAir = 23.1875 :SoilMoist1 = 756 :SoilMoist2 = 780 :SoilMoist3 = 1002
Timestamp = 22-05-2019 08:42 :Light = 64.00 :TempSoil = 20.5625 :TempAir = 23.125 :SoilMoist1 = 755 :SoilMoist2 = 782 :SoilMoist3 = 1002"""
import io
import re
import pandas as pd

buf = io.StringIO(inp)  # pd.compat.StringIO was removed in newer pandas; io.StringIO behaves the same here
To avoid "folding" of output lines, I shortened field names.
Then let's create the result DataFrame and a list of "rows" to append to it.
For now - both of them are empty.
df = pd.DataFrame(columns=['Timestamp', 'Light', 'TempSoil', 'TempAir',
                           'SoilMoist1', 'SoilMoist2', 'SoilMoist3'])
src = []
Below is a loop processing input rows:
while True:
    line = buf.readline()
    if not line:                             # EOF
        break
    lst = re.split(r' :', line.rstrip())     # Field list
    if len(lst) < 2:                         # Skip empty source lines
        continue
    dct = {}                                 # Source "row" (dictionary)
    for elem in lst:                         # Process fields
        k, v = re.split(r' = ', elem)
        dct[k] = v                           # Add field : value to "row"
    src.append(dct)
And the last step is to append rows from src to df :
df = df.append(src, ignore_index =True, sort=False)
When you print(df), for my test data, you will get:
Timestamp Light TempSoil TempAir SoilMoist1 SoilMoist2 SoilMoist3
0 22-05-2019 08:40 64.00 20.5625 23.1875 756 780 1002
1 22-05-2019 08:42 64.00 20.5625 23.125 755 782 1002
For now all columns are of string type, so you can change the required
columns to either float or int:
df.Light = pd.to_numeric(df.Light)
df.TempSoil = pd.to_numeric(df.TempSoil)
df.TempAir = pd.to_numeric(df.TempAir)
df.SoilMoist1 = pd.to_numeric(df.SoilMoist1)
df.SoilMoist2 = pd.to_numeric(df.SoilMoist2)
df.SoilMoist3 = pd.to_numeric(df.SoilMoist3)
Note that the to_numeric() function is clever enough to recognize the type to convert to, so the first 3 columns changed their type to float64 and the next 3 to int64.
You can check this by executing df.info().
One more possible conversion is to change the Timestamp column to DateTime type:
df.Timestamp = pd.to_datetime(df.Timestamp)
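A more compact variant of the same idea (a sketch; it assumes the same ' :' and ' = ' separators and reuses the inp string from above, and the parse_readings name is illustrative) builds one dictionary per line and hands the list straight to the DataFrame constructor:
import io
import re
import pandas as pd

def parse_readings(text):
    rows = []
    for line in io.StringIO(text):
        line = line.rstrip()
        if not line:
            continue
        # Each ' :'-separated field is 'key = value'; dict() accepts the pairs directly.
        rows.append(dict(re.split(r' = ', field, maxsplit=1)
                         for field in re.split(r' :', line)))
    return pd.DataFrame(rows)

df = parse_readings(inp)   # 'inp' is the sample string defined earlier
The same to_numeric / to_datetime conversions shown above still apply afterwards.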

Handling exceptions with df.apply

I am using the tld Python library to grab the first-level domain from the proxy request logs with an apply function.
When I run into a strange request that tld doesn't know how to handle, like 'http:1 CON' or 'http:/login.cgi%00', I get an error message like this:
TldBadUrl: Is not a valid URL http:1 con!
TldBadUrlTraceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)
/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()
/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url,
fail_silently, fix_protocol, search_public, search_private, **kwargs)
385 fix_protocol=fix_protocol,
386 search_public=search_public,
--> 387 search_private=search_private
388 )
389
/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
289 return None, None, parsed_url
290 else:
--> 291 raise TldBadUrl(url=url)
292
293 domain_parts = domain_name.split('.')
In the meantime I have been weeding these out using many lines like the following code, but there are hundreds or thousands of them in this dataset:
request_2 = request_1[request_1['request'] != 'http:1 CON']
request_2 = request_1[request_1['request'] != 'http:/login.cgi%00']
Dataframe:
request
request_url count
0 https://login.microsoftonline.com 24521
1 https://dt.adsafeprotected.com 11521
2 https://googleads.g.doubleclick.net 6252
3 https://fls-na.amazon.com 65225
4 https://v10.vortex-win.data.microsoft.com 7852222
5 https://ib.adnxs.com 12
The code:
from tld import get_tld
from tld import get_fld
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd
import numpy as np
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')
#Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]
#Reset index
request.reset_index(drop=True)
#Find the urls that contain IP addresses and exclude them from the new dataframe
request_1 = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
#Reset index
request_1 = request_1.reset_index(drop=True)
#Apply the get_fld lib on the request column
new_fld_column = request_2['request'].apply(get_fld)
Is there any way to keep this error from firing and instead add the rows that would error to a separate dataframe?
If you wrap your function in a try-except clause, you can determine which rows error out by querying the rows that come back as NaN:
import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
print(df)
request_url count
0 https://login.microsoftonline.com 24521
1 https://dt.adsafeprotected.com 11521
2 https://googleads.g.doubleclick.net 6252
3 https://fls-na.amazon.com 65225
4 https://v10.vortex-win.data.microsoft.com 7852222
5 https://ib.adnxs.com 12
6 http:1 CON 10
7 http:/login.cgi%00 200
df['flds'] = df['request_url'].apply(try_get_fld)
print(df['flds'])
0 microsoftonline.com
1 adsafeprotected.com
2 doubleclick.net
3 amazon.com
4 microsoft.com
5 adnxs.com
6 NaN
7 NaN
Name: flds, dtype: object
faulty_url_df = df[df['flds'].isna()]
print(faulty_url_df)
request_url count flds
6 http:1 CON 10 NaN
7 http:/login.cgi%00 200 NaN
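Alternatively, since the signature in the traceback shows that get_fld accepts a fail_silently argument, you can let the library return None for bad URLs instead of raising, and then split the frame on the missing values (a sketch):
from tld import get_fld

# fail_silently=True makes get_fld return None instead of raising TldBadUrl.
df['flds'] = df['request_url'].apply(lambda u: get_fld(u, fail_silently=True))

good_df = df[df['flds'].notna()]
faulty_url_df = df[df['flds'].isna()]
print(faulty_url_df)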
