I have the following code to parse an XML file and produce a pandas dataframe. The XML file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Entries>
<EntrySynopsisDetail_1_0>
<EntryID>262148</EntryID>
<EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
<CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
<EntrySynopsisDetail_1_0>
<EntryID>2667654</EntryID>
<EntryTitle>Call for Mobility Program</EntryTitle>
<CategoryOfEntry>ENG</CategoryOfEntry>
<CategoryOfEntry>MAT</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
</Entries>
And my code is below:
from bs4 import BeautifulSoup
import pandas as pd
fd = open("file_120123.xml",'r')
data = fd.read()
Bs_data = BeautifulSoup(data,'xml')
ID = Bs_data.find_all('EntryID')
Title = Bs_data.find_all('EntryTitle')
try:
    Cat = Bs_data.find_all('CategoryOfEntry')
except IndexError:
    Cat = ''
CatDict = {
    "ENG": "English",
    "MAT": "Mathematics"
}
dataDf = []
for i in range(0,len(ID)):
    if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
    rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
    dataDf.append(rows)
df = pd.DataFrame(dataDf, columns =['ID', 'Title', 'Category'], dtype=float)
df.to_csv('120123.csv')
As you can see, the code reads an XML file called 'file_120123.xml' using the BeautifulSoup library and collects each of the elements present in the file. One of the elements is a key, and I have created a dictionary listing all possible keys. Not all parents have that element. I want to compare each extracted key with the ones in the dictionary and replace it with the value corresponding to that key.
With this code, I get IndexError: list index out of range for Cat[i] on the if (Cat[i] == CatDict): line. Any insights on how to resolve this?
If you just want to avoid raising the error, add a conditional break:
for i in range(0,len(ID)):
    if not i < len(Cat): break  # <-- break loop if length of Cat is exceeded
    if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
    rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
    dataDf.append(rows)
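If you want more than just silencing the error, here is a sketch that keeps your BeautifulSoup approach but iterates over each parent element instead of three parallel lists, so a missing <CategoryOfEntry> cannot push anything out of sync, and the CatDict from your question is applied where a category exists:
dataDf = []
for entry in Bs_data.find_all('EntrySynopsisDetail_1_0'):
    entry_id = entry.find('EntryID').get_text()
    title = entry.find('EntryTitle').get_text()
    cat = entry.find('CategoryOfEntry')  # None when the entry has no category
    category = CatDict.get(cat.get_text(), cat.get_text()) if cat else ''
    dataDf.append([entry_id, title, category])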
First, as to why lxml is better than BeautifulSoup for xml, the answer is simple: the best way to query xml is with xpath. lxml supports xpath (though only version 1.0; for more complex xml and queries you will need xpath 2.0 to 3.1 and a library like elementpath). BS doesn't support xpath, though it does have excellent support for css selectors, which work better with html.
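For illustration, a minimal lxml sketch of the same extraction might look like the following (it assumes the file name from the question; the XPath string() function returns an empty string when an element is missing):
from lxml import etree

tree = etree.parse("file_120123.xml")
rows = []
for entry in tree.xpath("//EntrySynopsisDetail_1_0"):
    rows.append([
        entry.xpath("string(EntryID)"),
        entry.xpath("string(EntryTitle)"),
        entry.xpath("string(CategoryOfEntry)"),
    ])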
Having said all that - in your particular case, you probably don't need lxml either - only pandas and a one-liner! Though you haven't shown your expected output, my guess is you expect the output below. Note that in your sample xml there is probably an error: the 2nd <EntrySynopsisDetail_1_0> has <CategoryOfEntry> twice, so I removed one:
entries = """<Entries>
<EntrySynopsisDetail_1_0>
<EntryID>262148</EntryID>
<EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
<CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
<EntrySynopsisDetail_1_0>
<EntryID>2667654</EntryID>
<EntryTitle>Call for Mobility Program</EntryTitle>
<CategoryOfEntry>MAT</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
</Entries>"""
pd.read_xml(entries,xpath="//EntrySynopsisDetail_1_0")
Output:
EntryID EntryTitle CategoryOfEntry
0 262148 Establishment of the Graduate Internship Program ENG
1 2667654 Call for Mobility Program MAT
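If you still want the full category names rather than the codes, one possible follow-up (just a sketch reusing the CatDict from the question) is to map the column afterwards; any unmapped codes are kept as they are:
CatDict = {"ENG": "English", "MAT": "Mathematics"}
df = pd.read_xml(entries, xpath="//EntrySynopsisDetail_1_0")
df["CategoryOfEntry"] = df["CategoryOfEntry"].map(CatDict).fillna(df["CategoryOfEntry"])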
I am currently in a data science Bootcamp and I am ahead of the curriculum for the moment, so I wanted to take the chance to apply some of the skills that I have learned in service of my first project. I am scraping movie information from Box Office Mojo and would like to eventually compile all of this information into a pandas dataframe. So far I have a pagination function that collects all of the links for the individual films:
def pagination_func(req_url):
    soup = bs(req_url.content, 'lxml')
    table = soup.find('table')
    links = [a['href'] for a in table.find_all('a', href=True)]
    pagination_list = []
    substring = '/release'
    for link in links:
        if substring in link:
            pagination_list.append(link)
    return pagination_list
I have sort of lazily implemented a hardwired URL to pass through this function to retrieve the requested data:
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
link_list_by_year = []
for count, year in tqdm(enumerate(years)):
    pagination_url = 'https://www.boxofficemojo.com/year/{}/?grossesOption=calendarGrosses'.format(year)
    pagination = requests.get(pagination_url)
    link_list_by_year.append(pagination_func(pagination))
This will give me incomplete URLs that I then convert into complete URLs with this for loop:
complete_links = []
for link in link_list_by_year:
    for url in link:
        complete_links.append('https://www.boxofficemojo.com{}'.format(url))
I have then used the lxml library to retrieve the elements from the page that I wanted with this function:
def scrape_page(req_page):
    tree = html.fromstring(req_page.content)
    title.append(tree.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')[0])
    domestic.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[1]/span[2]/span/text()')[0].replace('$','').replace(',',''))
    international.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[2]/span[2]/a/span/text()')[0].replace('$','').replace(',',''))
    worldwide.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[3]/span[2]/a/span/text()')[0].replace('$','').replace(',',''))
    opening.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')[0].replace('$','').replace(',',''))
    opening_theatres.append(tree.xpath(
        '/html/body/div[1]/main/div/div[3]/div[4]/div[2]/span[2]/text()')[0].replace('\n', '').split()[0])
    MPAA.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[4]/span[2]/text()')[0])
    run_time.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[5]/span[2]/text()')[0])
    genres.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[6]/span[2]/text()')[0].replace('\n','').split())
I go on to initialize all of these lists, which I'll spare you for the sake of avoiding a wall of text; they are all just standard var = [].
Finally, I have a for loop that will iterate over my list of completed links:
for link in tqdm(complete_links[:200]):
    movie = requests.get(link)
    scrape_page(movie)
So it is all pretty basic and not very optimized, but it has helped me understand a lot about the basic nature of Python. Unfortunately, when I run the loop to scrape the pages, after about a minute it throws an IndexError: list index out of range and gives the following traceback (or one of a similar nature concerning an operation within the scrape_page function):
IndexError Traceback (most recent call last)
<ipython-input-381-739b3dc267d8> in <module>
4 for link in tqdm(test_links[:200]):
5 movie = requests.get(link)
----> 6 scrape_page(movie)
7
8
<ipython-input-378-7c13bea848f6> in scrape_page(req_page)
14
15 opening.append(tree.xpath(
---> 16 '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')[0].replace('$','').replace(',',''))
17
18 opening_theatres.append(tree.xpath(
IndexError: list index out of range
What I think is going wrong is that the particular page it hangs up on either lacks that element, tags it differently, or has some other oddity. I have searched for a way to handle the error, but couldn't find one relevant to what I was looking for. I have honestly been banging my head against this for the better part of two hours and have done everything (within my limited knowledge) short of searching every page by hand for the issue.
Check if xpath() returned anything before trying to append the result to the list.
openings = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')
if openings:
    opening.append(openings[0].replace('$','').replace(',',''))
Since you should probably do this for all the lists, you may want to extract the pattern into a function:
def append_xpath(tree, target_list, path):
    matches = tree.xpath(path)
    if matches:
        target_list.append(matches[0].replace('$','').replace(',',''))
Then you would use it like this:
append_xpath(tree, opening, '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')
append_xpath(tree, domestic, '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[1]/span[2]/span/text()')
...
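One caveat: if a value is simply missing on some pages, skipping the append leaves your lists with different lengths, which will bite you when you later combine them into a DataFrame. A small variation of the same helper (just a sketch) appends a placeholder instead:
def append_xpath(tree, target_list, path):
    # Append the first match, or None so every list stays the same length
    matches = tree.xpath(path)
    target_list.append(matches[0].replace('$', '').replace(',', '') if matches else None)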
As an example, I have a generic script that outputs the default table styles using python-docx (this code runs fine):
import docx
d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)
for stylenum, style in enumerate(styles, start=1):
    label = d.add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = d.add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell
d.save('tablestyles.docx')
Next, I opened the document, highlighted a split table and under paragraph format, selected "Keep with next," which successfully prevented the table from being split across a page:
Here is the XML code of the non-broken table:
You can see the highlighted line shows the paragraph property that should be keeping the table together. So I wrote this function and stuck it in the code above the d.save('tablestyles.docx') line:
def no_table_break(document):
    tags = document.element.xpath('//w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True
no_table_break(d)
When I inspect the XML code the paragraph property tag is set properly and when I open the Word document, the "Keep with next" box is checked for all tables, yet the table is still split across pages. Am I missing an XML tag or something that's preventing this from working properly?
Ok, I also needed this. I think we were all making the incorrect assumption that the setting in Word's table properties (or the equivalent ways to achieve this in python-docx) was about keeping the table from being split across pages. It's not -- instead, it's simply about whether or not a table's rows can be split across pages.
Given that we know how to successfully do this in python-docx, we can prevent tables from being split across pages by putting each table within the row of a larger master table. The code below successfully does this. I'm using Python 3.6 and python-docx 0.8.6.
import docx
from docx.oxml.shared import OxmlElement
import os
import sys
def prevent_document_break(document):
    """https://github.com/python-openxml/python-docx/issues/245#event-621236139
    Globally prevent table cells from splitting across pages.
    """
    tags = document.element.xpath('//w:tr')
    rows = len(tags)
    for row in range(0, rows):
        tag = tags[row]                     # Specify which <w:tr> tag you want
        child = OxmlElement('w:cantSplit')  # Create the <w:cantSplit> tag
        tag.append(child)                   # Append in the new tag
d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)
big_table = d.add_table(1, 1)
big_table.autofit = True
for stylenum, style in enumerate(styles, start=1):
    cells = big_table.add_row().cells
    label = cells[0].add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = cells[0].add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell
prevent_document_break(d)
d.save('tablestyles.docx')
# because I'm lazy...
openers = {'linux': 'libreoffice tablestyles.docx',
'linux2': 'libreoffice tablestyles.docx',
'darwin': 'open tablestyles.docx',
'win32': 'start tablestyles.docx'}
os.system(openers[sys.platform])
I had been struggling with the problem for some hours and finally found a solution that worked fine for me. I just changed the XPath in the topic starter's code, so now it looks like this:
def keep_table_on_one_page(doc):
    tags = doc.element.xpath('//w:tr[position() < last()]/w:tc/w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True
The key part is this selector:
[position() < last()]
We want every row except the last one in each table to keep with the next one.
I would have left this as a comment under @DeadAd's answer, but I had low rep.
In case anyone is looking to stop a specific table from breaking, rather than all tables in a doc, change the xpath to the following:
tags = table._element.xpath('./w:tr[position() < last()]/w:tc/w:p')
where table refers to the instance of <class 'docx.table.Table'> which you want to keep together.
"//" will select all nodes that match the xpath (regardless of relative location), "./" will start selection from current node
Here is the website I am trying to scrape: http://livingwage.mit.edu/
The specific URLs are from
http://livingwage.mit.edu/states/01
http://livingwage.mit.edu/states/02
http://livingwage.mit.edu/states/04 (For some reason they skipped 03)
...all the way to...
http://livingwage.mit.edu/states/56
And on each one of these URLs, I need the last row of the second table:
Example for http://livingwage.mit.edu/states/01
Required annual income before taxes $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997
Desired output:
Alabama $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997
Alaska $24,070 $49,295 $60,933 $79,871 $38,561 $47,136 $52,233 $61,531 $38,561 $54,433 $66,316 $82,403
...
...
Wyoming $20,867 $42,689 $52,007 $65,892 $34,988 $41,887 $46,983 $53,549 $34,988 $47,826 $57,391 $68,424
After 2 hours of messing around, this is what I have so far (I am a beginner):
import requests, bs4
res = requests.get('http://livingwage.mit.edu/states/01')
res.raise_for_status()
states = bs4.BeautifulSoup(res.text)
state_name=states.select('h1')
table = states.find_all('table')[1]
rows = table.find_all('tr', 'odd')[4:]
result=[]
result.append(state_name)
result.append(rows)
When I view state_name and rows in the Python console, they give me the HTML elements
[<h1>Living Wag...Alabama</h1>]
and
[<tr class = "odd... </td> </tr>]
Problem 1: These are the things that I want in the desired output, but how can I get python to give it to me in a string format rather than HTML like above?
Problem 2: How do I loop through the request.get(url01 to url56)?
Thank you for your help.
And if you can offer a more efficient way of getting to the rows variable in my code, I would greatly appreciate it, because the way I get there is not very Pythonic.
Just get all the states from the initial page; then you can select the second table and use the css classes "odd results" to get the tr you need. There is no need to slice, as the class names are unique:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin # python2 -> from urlparse import urljoin
base = "http://livingwage.mit.edu"
res = requests.get(base)
res.raise_for_status()
states = []
# Get all state urls and state names from the anchor tags on the base page:
# grab every anchor inside each li that is a child of the
# ul with the css classes "states list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations, so we split on / from the right -> /states/51
    # and join to the base url. The anchor text also holds the state name,
    # so we store the full url and the state, i.e. ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))
def parse(soup):
    # Get the second table; indexing in css starts at 1, so "table:nth-of-type(2)" gets the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # The row we want has the css classes "odd results". To get the text, we find its tds and call
    # .text on each; "td + td" starts from the second td, skipping the first one
    # ("Required annual income before taxes").
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]
# Unpack the url and state from each tuple in our states list.
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))
If you run the code you will see output like:
Alabama ['$21,144', '$43,213', '$53,468', '$67,788', '$34,783', '$41,847', '$46,876', '$52,531', '$34,783', '$48,108', '$58,748', '$70,014']
Alaska ['$24,070', '$49,295', '$60,933', '$79,871', '$38,561', '$47,136', '$52,233', '$61,531', '$38,561', '$54,433', '$66,316', '$82,403']
Arizona ['$21,587', '$47,153', '$59,462', '$78,112', '$36,332', '$44,913', '$50,200', '$58,615', '$36,332', '$52,483', '$65,047', '$80,739']
Arkansas ['$19,765', '$41,000', '$50,887', '$65,091', '$33,351', '$40,337', '$45,445', '$51,377', '$33,351', '$45,976', '$56,257', '$67,354']
California ['$26,249', '$55,810', '$64,262', '$81,451', '$42,433', '$52,529', '$57,986', '$68,826', '$42,433', '$61,328', '$70,088', '$84,192']
Colorado ['$23,573', '$51,936', '$61,989', '$79,343', '$38,805', '$47,627', '$52,932', '$62,313', '$38,805', '$57,283', '$67,593', '$81,978']
Connecticut ['$25,215', '$54,932', '$64,882', '$80,020', '$39,636', '$48,787', '$53,857', '$61,074', '$39,636', '$60,074', '$70,267', '$82,606']
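If you want each row laid out exactly as in your desired output (the state name followed by the dollar amounts on one line), a small tweak to the final loop would do it, for example:
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, " ".join(parse(soup)))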
You could loop over a range from 1-53, but extracting the anchors from the base page also gives us the state name in a single step. Using the h1 from each page would instead give you output like Living Wage Calculation for Alabama, which you would then have to parse to get just the name, and that would not be trivial considering some states have more than one word in their names.
Problem 1: These are the things that I want in the desired output, but how can I get python to give it to me in a string format rather than HTML like above?
You can get the text simply by doing something along the lines of:
state_name=states.find('h1').text
The same can be applied for each of the rows too.
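For the rows, something along these lines pulls the plain strings out of the last matching row (a sketch; rows is the list from your snippet):
last_row = rows[-1]
values = [td.text.strip() for td in last_row.find_all('td')]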
Problem 2: How do I loop through the request.get(url01 to url56)?
The same code block can be put inside a loop from 1 to 56 like so:
for i in range(1,57):
    res = requests.get('http://livingwage.mit.edu/states/'+str(i).zfill(2))
    ...rest of the code...
zfill will add the leading zeroes. Also, it would be better if requests.get were enclosed in a try-except block so that the loop continues gracefully even when the URL is wrong.
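For example, a rough sketch of that loop with the try-except folded in (using the same requests and bs4 imports as your code) might look like:
for i in range(1, 57):
    url = 'http://livingwage.mit.edu/states/' + str(i).zfill(2)
    try:
        res = requests.get(url)
        res.raise_for_status()
    except requests.exceptions.RequestException:
        continue  # e.g. state number 03 does not exist, so just skip it
    states = bs4.BeautifulSoup(res.text, 'html.parser')
    # ...rest of the parsing code...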