Extract an XML file (TED Europe) to a pandas dataframe - python

I have an XML file from TED Europe (like this: TED Europa XML Files (login required)). The XML files contain public procurement contracts.
My question is: how can I parse the XML file into a pandas dataframe?
So far I have tried to achieve this using the ElementTree package.
However, since I am still a beginner, I have trouble extracting the information, because the relevant text is marked with only "p" tags.
How can I extract this information for the English translation, so that for example "TI_MARK" becomes the column header and the information within "TXT_MARK" and the "p" tags becomes the rows? The other rows will later be filled with information from other public procurement XML files.
<FORM_SECTION>
  <OTH_NOT LG="DA" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
  <OTH_NOT LG="DE" VERSION="R2.0.8.S03.E01" CATEGORY="TRANSLATION">
  <OTH_NOT LG="EN" VERSION="R2.0.8.S03.E01" CATEGORY="ORIGINAL">
    <FD_OTH_NOT>
      <TI_DOC>
        <P>BE-Brussels: IPA - Improved implementation of animal health, food safety and phytosanitary legislation and corresponding information systems</P>
      </TI_DOC>
      <STI_DOC>
        <P>Location — The former Yugoslav Republic of Macedonia</P>
      </STI_DOC>
      <STI_DOC>
        <P>SERVICE CONTRACT NOTICE</P>
      </STI_DOC>
      <CONTENTS>
        <GR_SEQ>
          <TI_GRSEQ>
            <BLK_BTX/>
          </TI_GRSEQ>
          <BLK_BTX_SEQ>
            <MARK_LIST>
              <MLI_OCCUR NO_SEQ="001">
                <NO_MARK>1.</NO_MARK>
                <TI_MARK>Publication reference</TI_MARK>
                <TXT_MARK>
                  <P>EuropeAid/139253/DH/SER/MK</P>
                </TXT_MARK>
              </MLI_OCCUR>
              <MLI_OCCUR NO_SEQ="002">
                <NO_MARK>2.</NO_MARK>
                <TI_MARK>Procedure</TI_MARK>
                <TXT_MARK>
                  <P>Restricted</P>
                </TXT_MARK>
              </MLI_OCCUR>
So far my code is:
import xml.etree.cElementTree as ET

tree = ET.parse('196658_2018.xml')
print(tree)  # print tree
#tree = ET.ElementTree(file='196658_2018.xml')
root = tree.getroot()
print(root)  # print root

for element in root.findall('{ted/R2.0.8.S03/publication}FORM_SECTION/{ted/R2.0.8.S03/publication}OTH_NOT/{ted/R2.0.8.S03/publication}FD_OTH_NOT/{ted/R2.0.8.S03/publication}TI_DOC/{ted/R2.0.8.S03/publication}P'):
    print(element.text)
Strangely the extraction only works if I add {ted/R2.0.8.S03/publication} to each path element.
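That is expected behaviour: the {ted/R2.0.8.S03/publication} prefix is ElementTree's {namespace-uri}tag notation, and the file evidently declares xmlns="ted/R2.0.8.S03/publication" as its default namespace, so every element lives in that namespace. Passing a namespaces mapping to findall avoids repeating the URI in every path step; a minimal sketch, assuming that URI:

import xml.etree.ElementTree as ET

ns = {"ted": "ted/R2.0.8.S03/publication"}  # local prefix for the document's default namespace
tree = ET.parse('196658_2018.xml')
for element in tree.getroot().findall('ted:FORM_SECTION/ted:OTH_NOT/ted:FD_OTH_NOT/ted:TI_DOC/ted:P', ns):
    print(element.text)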
Moving on from that, I have problems writing a function which collects all paths carrying information and appends them to a pandas dataframe. Ideally only the English translation should be extracted.
For another part of the XML File I used a function like this:
from lxml import etree
import pandas as pd

def parse_xml_fields(file, base_tag, tag_list, final_list):
    root = etree.parse(file)
    nodes = root.findall("//{}".format(base_tag))
    for node in nodes:
        item = {}
        for tag in tag_list:
            if node.find(".//{}".format(tag)) is not None:
                item[tag] = node.find(".//{}".format(tag)).text.strip()
        final_list.append(item)

# My variables
field_list = ["{ted/R2.0.8.S03/publication}TI_CY", "{ted/R2.0.8.S03/publication}TI_TOWN", "{ted/R2.0.8.S03/publication}TI_TEXT"]
entities_list = []

parse_xml_fields("196658_2018.xml", "{ted/R2.0.8.S03/publication}ML_TI_DOC", field_list, entities_list)

df = pd.DataFrame(entities_list, columns=field_list)
print(df)

# better column names
df.columns = ['Country', 'Town', 'Text']
df.to_csv("TED_Europa_List.csv", sep=',', encoding='utf-8')
The paths and tags, however, are much easier to work with for that section, because the tags are already named after their content.
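For what it's worth, here is a hedged sketch of how the requested table could be built, untested against a full TED file: it restricts itself to the OTH_NOT element with LG="EN" (the DA/DE siblings are marked CATEGORY="TRANSLATION"), pairs each TI_MARK with the P text inside its sibling TXT_MARK, and makes one dataframe row per file. The namespace URI is the one that appears in the paths above.

import xml.etree.ElementTree as ET
import pandas as pd

ns = {"ted": "ted/R2.0.8.S03/publication"}
root = ET.parse('196658_2018.xml').getroot()

row = {}
# Keep only the English notice; the DA/DE siblings are CATEGORY="TRANSLATION"
for oth_not in root.iterfind('.//ted:OTH_NOT[@LG="EN"]', ns):
    for occur in oth_not.iterfind('.//ted:MLI_OCCUR', ns):
        title = occur.find('ted:TI_MARK', ns)
        paragraphs = occur.findall('ted:TXT_MARK/ted:P', ns)
        if title is not None:
            # Join all P paragraphs of TXT_MARK into one cell under the TI_MARK header
            row[title.text] = " ".join(p.text or "" for p in paragraphs)

df = pd.DataFrame([row])  # one row per notice; rows from other files can be appended later
print(df)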

Related

Parsing subfields in XML and merging with matching columns

This is a follow-up question from here. It got lost among the many other topics on this forum; maybe I presented the question in too complicated a way. Since then I have improved and simplified the approach.
To sum up: I'd like to extract data from subfields in multiple XML files and attach those to a new df at matching positions.
This is a sample XML-1:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
  <Qfl>1808</Qfl>
  <fOVE>13.7</fOVE>
  <NetoVolumen>613</NetoVolumen>
  <Hv>104.2</Hv>
  <energenti>
    <energent>
      <sifra>energy_e</sifra>
      <naziv>EE [kWh]</naziv>
      <vrednost>238981</vrednost>
    </energent>
    <energent>
      <sifra>energy_to</sifra>
      <naziv>Do</naziv>
      <vrednost>16359</vrednost>
    </energent>
  <rei>
    <zavetrovanost>2</zavetrovanost>
    <cone>
      <cona>
        <cona_id>1</cona_id>
        <cc_si_cona>1110000</cc_si_cona>
        <visina_cone>2.7</visina_cone>
        <dolzina_cone>14</dolzina_cone>
      </cona>
      <cona>
        <cona_id>2</cona_id>
        <cc_si_cona>120000</cc_si_cona>
      </cona>
  </rei>
</reiXmlPrenos>
This is a sample XML-2:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
  <Qfl>1808</Qfl>
  <fOVE>13.7</fOVE>
  <NetoVolumen>613</NetoVolumen>
  <Hv>104.2</Hv>
  <energenti>
    <energent>
      <sifra>energy_e</sifra>
      <naziv>EE [kWh]</naziv>
      <vrednost>424242</vrednost>
    </energent>
    <energent>
      <sifra>energy_en</sifra>
      <naziv>Do</naziv>
      <vrednost>29</vrednost>
    </energent>
  <rei>
    <zavetrovanost>2</zavetrovanost>
    <cone>
      <cona>
        <cona_id>1</cona_id>
        <cc_si_cona>1110000</cc_si_cona>
        <visina_cone>2.7</visina_cone>
        <dolzina_cone>14</dolzina_cone>
      </cona>
      <cona>
        <cona_id>2</cona_id>
        <cc_si_cona>120000</cc_si_cona>
      </cona>
  </rei>
</reiXmlPrenos>
My code:
import xml.etree.ElementTree as ETree
import pandas as pd

xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()
# print(root)

store_items = []
all_items = []

for storeno in root.iter('energent'):
    cona_sifra = storeno.find('sifra').text
    cona_vrednost = storeno.find('vrednost').text
    store_items = [cona_sifra, cona_vrednost]
    all_items.append(store_items)

xmlToDf = pd.DataFrame(all_items, columns=['sifra', 'vrednost'])
print(xmlToDf.to_string(index=False))
This results in:
sifra vrednost
energy_e 238981
energy_to 16359
Which is fine for one example. But I have 1,000s of XML files, and the wish is to 1) have all results in one row for each XML and 2) differentiate between the different 'sifra' codes.
There can be, e.g., energy_e, energy_en and energy_to.
So ideally the final df would look like this
xml     energy_e   energy_en   energy_to
xml-1   238981     0           16359
xml-2   424242     29          0
Can it be done?
Simply use pandas.read_xml since the part of the XML you need is a flat part of the document:
energy_df = pd.read_xml("Input.xml", xpath=".//energent") # IF lxml INSTALLED
energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree") # IF lxml NOT INSTALLED
And to bind across many XML files, simply build a list of data frames from a list of XML file paths, adding a column for source file, and then run pandas.concat to row bind all into a single data frame:
xml_files = [...]

energy_dfs = [
    pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f) for f in xml_files
]
energy_long_df = pd.concat(energy_dfs, ignore_index=True)
And from your desired output, you can then pivot values from sifra columns with pivot_table:
energy_wide_df = energy_long_df.pivot_table(
    values="vrednost", index="source", columns="sifra", aggfunc="sum"
)
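To reproduce the desired output exactly, with zeros rather than NaN where a sifra code is missing from a file, you would presumably also fill the gaps:

energy_wide_df = energy_wide_df.fillna(0).astype(int)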
If I understand the situation correctly, this can be done - but because of the complexity, I would use lxml here instead of ElementTree.
I'll try to annotate the code a bit, but you'll really have to read up on this.
By the way, the two xml files you posted are not well formed (closing tags for <energenti> and <cone> are missing), but assuming that is fixed - try this:
from lxml import etree
import pandas as pd

xmls = [XML-1, XML-2]
# note: for simplicity, I'm using the well formed versions of the xml strings in your question;
# you'll have to use actual file names and paths

energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
# I just made up some names - you'll have to use actual names, of course;
# the first one is for the file identifier - see below

rows = []
for xml in xmls:
    row = []
    id = "xml-" + str(xmls.index(xml) + 1)
    # this creates the file identifier
    row.append(id)
    root = etree.XML(xml.encode())
    # in real life, you'll have to use the parse() method
    for energy in energies[1:]:
        # the '[1:]' is used to skip the first entry; it's only used as the file identifier
        target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
        # note the use of f-strings
        row.extend(target if len(target) > 0 else "0")
    rows.append(row)

print(pd.DataFrame(rows, columns=energies))
Output:
     xml  energy_e  energy_en  energy_to  whatever
0  xml-1    238981          0      16359         0
1  xml-2    424242         29          0         0

Python web scrape url to dataframe

I want to web scrape a website (https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp) and create a dataframe.
This is the dataframe that I want:
name                  text
M. le président       La séance est...
M. le président       L'ordre du jour...
M. Jean-Marc Ayrault  Je demande la ...
Initially I thought that I should use BeautifulSoup, and I started to write the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
r=requests.get(url)
soup_data=BeautifulSoup(r.text, 'html.parser')
first=soup_data.find_all('div')
name=first.b.text
But I obtained the error:
AttributeError: ResultSet object has no attribute 'b'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Because I could not get any further, I then thought that the best idea was to get the html and work in a similar way as if I had an xml file:
import urllib
import xml.etree.ElementTree as ET
import pandas as pd
import lxml
from lxml import etree

urllib.request.urlretrieve("https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp", "file.txt")

d = {'head': ['title'],
     'body': ['b', 'p']}

tree = ET.parse("file.txt")
root = tree.getroot()

# initialize two lists: `cols` and `data`
cols, data = list(), list()

# loop through d.items
for k, v in d.items():
    # find child
    child = root.find(f'{{*}}{k}')
    # use iter to check each descendant (`elem`)
    for elem in child.iter():
        # get `tag_end` for each descendant, e.g. `texte` in "{http://schemas.assemblee-nationale.fr/referentiel}texte"
        tag_end = elem.tag.split('}')[-1]
        # check if `tag_end` in `v(alue)`
        if tag_end in v:
            # add `tag_end` and `elem.text` to appropriate list
            cols.append(tag_end)
            data.append(elem.text)

df = pd.DataFrame(data).T
But I obtain the error: "not well-formed (invalid token)".
Here is a summary of the html:
<html>
  <head>
    <title> Assemblée Nationale - Séance du mercredi ... </title>
  </head>
  <body>
    <div id="englobe">
      <p>
        <orateur>
          <b> M. le président </b>
        </orateur>
        La séance est...
      </p>
      <p>
        <orateur>
          <b> M. le président </b>
        </orateur>
        L'ordre du jour...
      </p>
    </div>
  </body>
</html>
How should I web scrape the website? I will want to do this for several similar websites.
So, your approach with beautifulsoup is definitely the way to go. The error message already points you toward the problem: what you call first is really of type bs4.element.ResultSet, which -- as the name suggests -- is not a single element. The easiest way to access the actual results is to loop through it with a for loop.
I'm not sure you really need to go for the div's as, really, you're looking for the p's that include an orateur element (long story short: the first for-loop is unnecessary and you could simplify this further), but anyway, here's how you can access the elements you want:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
r = requests.get(url)
soup_data = BeautifulSoup(r.text, 'html.parser')
list_soup_div = soup_data.find_all('div')

# Edit: list of dicts to store the orateur's name and the text he/she spoke
list_dict_transcript = []

for item_soup_div in list_soup_div:
    list_soup_sub_orateur = item_soup_div.find_all('orateur')
    # Check whether the <div> contains an <orateur> element
    if len(list_soup_sub_orateur):
        for item_soup_p in item_soup_div.find_all('p'):
            list_orateur = item_soup_p.find_all('orateur')
            if len(list_orateur):
                # Edit: recording
                # print(item_soup_p)
                for item_b in item_soup_p.find_all('b'):
                    text_orateur = item_b.get_text()
                    text_speech = item_soup_p.find('orateur').next_sibling
                    list_dict_transcript.append({'orateur': text_orateur, 'speech': text_speech})

# Edit: conversion of list into dataframe
df_transcript = pd.DataFrame(data=list_dict_transcript)
After that, you only need to filter out the lines with links, append to a dictionary, convert to a dataframe and voilà, there's your desired dataframe. Hope that helps! If not, do let me know.
Edit:
I have added a couple of lines to a) initialize an empty list of dicts, b) fill these dicts with the respective texts (using the .next_sibling attribute, as per Extracting text outside of a tag BeautifulSoup, to get hold of the text the orateur was saying) and c) get this into a dataframe.
The find_all method is used to find all elements matching the filters you want, like the div tag in your example, and you can't extract the text of a whole ResultSet at once.
You just have to write a for loop, extract the text of each element and store it in a list, then add that list as a column in your dataframe, like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

df = pd.DataFrame()
names_list = []

url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
r = requests.get(url)
soup_data = BeautifulSoup(r.text, 'html.parser')
names = soup_data.find_all('div')

for name in names:
    names_list.append(name.text)

df['name'] = names_list

Extract variables from XML to Pandas

I am working on parsing XML variables into a pandas dataframe. The XML file looks like this (simplified for the demo):
<Instrm>
  <Rcrd>
    <FinPpt>
      <Id>BT0007YSAWK</Id>
      <FullNm>Turbo Car</FullNm>
      <Ccy>EUR</Ccy>
      <Cmmdty>false</Cmmdty>
    </FinPpt>
    <Issr>529900M2F7D5795H1A49</Issr>
    <Attrbts>
      <Authrty>US</Authrty>
      <Prd>
        <Dt>2002-03-20</Dt>
      </Prd>
      <Ven>NYSE</Ven>
    </Attrbts>
  </Rcrd>
</Instrm>
<Instrm>
  <Rcrd>
    <FinPpt>
      <Id>BX0009YNOYK</Id>
      <FullNm>Turbo truk</FullNm>
      <Ccy>EUR</Ccy>
      <Cmmdty>false</Cmmdty>
    </FinPpt>
    <Issr>58888M2F7D579536J4</Issr>
    <Attrbts>
      <Authrty>UK</Authrty>
      <Prd>
        <Dt>2002-04-21</Dt>
      </Prd>
      <Ven>BOX</Ven>
    </Attrbts>
  </Rcrd>
</Instrm>
...
I attempted to parse this XML file into a dataframe, with the field names as the column names, like this:
Id           FullNm      Ccy  Cmmdty  Issr                  Authrty  Dt          Ven
BT0007YSAWK  Turbo Car   EUR  false   529900M2F7D5795H1A49  US       2002-03-20  NYSE
BX0009YNOYK  Turbo truk  EUR  false   58888M2F7D579536J4    UK       2002-04-21  BOX
.....        ......
but I still don't know how, even after reviewing some posts. All I can do is extract the IDs into a list, like:
import xml.etree.ElementTree as ET
import pandas as pd
import sys

tree = ET.parse('sample.xml')
root = tree.getroot()

report = root[1][0][0]
records = report.findall('Instrm')

ids = []
for r in records:
    ids.append(r[0][0][0].text)
print(ids[0:100])
out:
['BT0007YSAWK', 'BX0009YNOYK', ...]
I don't quite understand how to utilize 'nodes' here. Can someone help? Thank you.
Assuming a <root> node in the posted XML without namespaces, consider building a dictionary per record via a list/dict comprehension, merging sub-dictionaries (the {**d1, **d2} syntax, available in Python 3.5+) that parse the needed nodes. Then call the DataFrame() constructor on the returned list of dictionaries.
import xml.etree.ElementTree as ET
import pandas as pd

root = ET.parse('sample.xml').getroot()   # the posted XML wrapped in a single root element

data = [{**{el.tag: el.text.strip() for el in r.findall('FinPpt/*')},
         **{el.tag: el.text.strip() for el in r.findall('Issr')},
         **{el.tag: el.text.strip() for el in r.findall('Attrbts/*')},
         **{el.tag: el.text.strip() for el in r.findall('Attrbts/Prd/*')}
         } for r in root.findall('Instrm/Rcrd')]

df = pd.DataFrame(data)
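With the two sample records above (wrapped in a single root element), this should produce a two-row frame with columns Id, FullNm, Ccy, Cmmdty, Issr, Authrty, Dt and Ven, matching the target table. Note that the Attrbts/* pass also picks up the Prd container itself as an empty column, which can be dropped:

df = df.drop(columns='Prd', errors='ignore')  # Prd is just a wrapper; its Dt child is captured separately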
To get your target data without converting, use an xml parser (like lxml) and xpath.
Something along these lines (note that you have to wrap your xml with a root element):
string = """
<doc>
[your xml above]
</doc>
"""

from lxml import etree

doc = etree.XML(string)
insts = doc.xpath('//Instrm')
for inst in insts:
    # use relative paths (.//) so each query stays inside the current Instrm
    f_nams = inst.xpath('.//FullNm')
    ccys = inst.xpath('.//Ccy')
    cmds = inst.xpath('.//Cmmdty')
    issuers = inst.xpath('.//Issr')
    for a, b, c, d in zip(f_nams, ccys, cmds, issuers):
        print(a.text, b.text, c.text, d.text)
Output:
Turbo Car EUR false 529900M2F7D5795H1A49
Turbo truk EUR false 58888M2F7D579536J4

Extract data from ORCID XML files using Python

I am trying to (offline) parse names from ORCID XML files using Python, which are downloaded from ORCID:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<record:record xmlns:internal="http://www.orcid.org/ns/internal" xmlns:address="http://www.orcid.org/ns/address" xmlns:email="http://www.orcid.org/ns/email" xmlns:history="http://www.orcid.org/ns/history" xmlns:employment="http://www.orcid.org/ns/employment" xmlns:person="http://www.orcid.org/ns/person" xmlns:education="http://www.orcid.org/ns/education" xmlns:other-name="http://www.orcid.org/ns/other-name" xmlns:personal-details="http://www.orcid.org/ns/personal-details" xmlns:bulk="http://www.orcid.org/ns/bulk" xmlns:common="http://www.orcid.org/ns/common" xmlns:record="http://www.orcid.org/ns/record" xmlns:keyword="http://www.orcid.org/ns/keyword" xmlns:activities="http://www.orcid.org/ns/activities" xmlns:deprecated="http://www.orcid.org/ns/deprecated" xmlns:external-identifier="http://www.orcid.org/ns/external-identifier" xmlns:funding="http://www.orcid.org/ns/funding" xmlns:error="http://www.orcid.org/ns/error" xmlns:preferences="http://www.orcid.org/ns/preferences" xmlns:work="http://www.orcid.org/ns/work" xmlns:researcher-url="http://www.orcid.org/ns/researcher-url" xmlns:peer-review="http://www.orcid.org/ns/peer-review" path="/0000-0001-5006-8001">
  <common:orcid-identifier>
    <common:uri>http://orcid.org/0000-0001-5006-8001</common:uri>
    <common:path>0000-0001-5006-8001</common:path>
    <common:host>orcid.org</common:host>
  </common:orcid-identifier>
  <preferences:preferences>
    <preferences:locale>en</preferences:locale>
  </preferences:preferences>
  <person:person path="/0000-0001-5006-8001/person">
    <common:last-modified-date>2016-06-06T15:29:36.952Z</common:last-modified-date>
    <person:name visibility="public" path="0000-0001-5006-8001">
      <common:created-date>2016-04-15T20:45:16.141Z</common:created-date>
      <common:last-modified-date>2016-04-15T20:45:16.141Z</common:last-modified-date>
      <personal-details:given-names>Marjorie</personal-details:given-names>
      <personal-details:family-name>Biffi</personal-details:family-name>
    </person:name>
What I want is to extract given-names and family-name: Marjorie Biffi. I am trying to use this code:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('f.xml').getroot()
>>> p=root.findall('{http://www.orcid.org/ns/personal-details}personal-details')
>>> p
[]
I can't figure out how to extract the name/surname from this XML file. I also tried to use XPath/selectors, but with no success.
This will get you the results you want, by climbing down through each level:
p1 = root.find('{http://www.orcid.org/ns/person}person')
name = p1.find('{http://www.orcid.org/ns/person}name')
given_names = name.find('{http://www.orcid.org/ns/personal-details}given-names')
family_name = name.find('{http://www.orcid.org/ns/personal-details}family-name')
print(given_names.text, '', family_name.text)
You could also just go directly to that sublevel with .//
family_name = root.find('.//{http://www.orcid.org/ns/personal-details}family-name')
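Equivalently, a namespace prefix map keeps those long paths readable; a small sketch, assuming the same parsed root:

ns = {'person': 'http://www.orcid.org/ns/person',
      'details': 'http://www.orcid.org/ns/personal-details'}
given_names = root.find('.//details:given-names', ns)
family_name = root.find('.//details:family-name', ns)
print(given_names.text, family_name.text)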
Also, I just posted here about simpler ways to parse through xml if you're doing more basic operations. These include xmltodict (which converts the document to an OrderedDict) or untangle, which is a little inefficient but very quick and easy to learn.

Parsing XML files with repeated tags that have differing data using BeautifulSoup in Python

I have been stuck on this problem for a while now and have found no solution. Here is a snippet of my Python script:
pub_ref = soup.findAll("publication-reference")

with open('./output.csv', 'ab+') as f:
    writer = csv.writer(f, dialect='excel')
    for info in pub_ref:
        pat_cite = soup.findAll("patcit")
        for item in pat_cite:
            if item.find("name"):
                name = item.find("name").text
                writer.writerow([name])
With this part of the script I want to parse the children of the citation element "patcit", under the parent "publication-reference", which crops up multiple times in the XML file and looks like this:
.
.
.
<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1589850</doc-number>
        <kind>A</kind>
        <name>Haskell</name>
        <date>19260600</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D134414</doc-number>
        <kind>S</kind>
        <name>Orme, Jr.</name>
        <date>19421100</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
<us-citation>
.
.
.
The dots indicate that the file is larger than this and isn't showing the parent element "publication-reference". The problem is that, as you can tell, my script only parses one of the many children of "patcit", the "name" element, and only once through. This works fine for elements that have only one entry per invention, but not for multiples.
I also want to store these in a CSV file, as you can see with the writer, where the output shows the multiple patcit citations down a column like so:
invention name   country   city   ....   patcit name1   patcit date1 ....
(white space)                            patcit name2   patcit date2 ....
(white space)                            patcit name3   patcit date3 ....
The XML files I'm using can be found here at https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/
Any help would be appreciated as I've tried multiple ways and I feel this is a beginner's problem.
First of all I downloaded one of the zip files "ipg170103.zip" and found it contained multiple xml documents. So I ran (on Linux)
csplit ipg170103.xml '/xml version/' '{*}'
to split the file into multiple single documents. Working with one of these files, "xx995", I managed to see what you are working with. Using "grep" on the file for "country", I discovered many instances of the word, so I guessed you wanted the "country" under "publication-reference" (if not, you will have to change the script), and likewise "invention" from "invention-title". I also discovered multiple instances of "date" under "patcit"; not all of them had a name with them, so my script omits those. I found too many "city" elements to know which one you wanted.
But in any case I could not determine exactly what you wanted so you may well have to tweak it a bit for your exact needs.
from bs4 import BeautifulSoup
import csv

xml = open("xx995", 'r').read()
soup = BeautifulSoup(xml, 'lxml')

pat = soup.find("us-patent-grant")
country = pat.find("publication-reference").find("country").text
invention = pat.find("invention-title").text

data = []
pat_cite = pat.findAll("patcit")
for item in pat_cite:
    name = None
    date = None
    if item.find("name"):
        name = item.find("name").text
        # Only get date if name
        if item.find("date"):
            date = item.find("date").text
        data.append((name, date))

with open('./output.csv', 'wt') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(('invention', 'country', 'patcit name', 'patcit date'))
    for d in data:
        writer.writerow((invention, country, d[0], d[1]))
        invention = None
        country = None
Outputs:
