scraping table data from a webpage - python

I am trying to learn Python and Portuguese, so I thought I could kill two birds with one stone.
Here is an example of one of the pages. I want to download the data in the blue tables: the first such table is called Presente, the next is called Pretérito Perfeito, and so on.
Below is my code, but I'm struggling. My results variable does contain the data I need, but pulling out the exact bit is beyond me because the div tags don't have ids.
Is there a better way to do this?
import requests
from bs4 import BeautifulSoup
URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='ch_divSimple')
mychk = results.prettify()
tbl_elems = results.find_all('section', class_='wrap-verbs-listing')

They don't have ids but they have classes. You can do:
results.find_all("div", "blue-box-wrap")
Where blue-box-wrap is a class.
It will return a ResultSet object of length 22, as there are 22 blue tables. You can select the one you want with indexing, like this for the first one:
blue_tables = results.find_all("div", "blue-box-wrap")
blue_tables[0]
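If you want the actual text rather than tag objects, you can loop over those boxes. A minimal sketch, assuming the tense name sits in the first <p> inside each box and the conjugated forms sit in <li> elements (inspect the live markup to confirm those inner tags):
for box in results.find_all("div", "blue-box-wrap"):
    heading = box.find("p")  # assumed to hold the tense name, e.g. "Presente"
    tense = heading.get_text(strip=True) if heading else "unknown"
    forms = [li.get_text(" ", strip=True) for li in box.find_all("li")]
    print(tense, forms)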

Replace:
results = soup.find(id='ch_divSimple')
mychk = results.prettify()
tbl_elems = results.find_all('section', class_='wrap-verbs-listing')
With:
results = soup.find("div", attrs={"class": 'blue-box-wrap'})
tbl_elems = results.find_all('ul', class_='wrap-verbs-listing')
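From there, a rough sketch of pulling the text out of those lists (assuming each form sits in an <li> element; check the page source to confirm):
for ul in tbl_elems:
    print([li.get_text(" ", strip=True) for li in ul.find_all("li")])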

Related

Out of Index Error while Scraping Web Pages and storing information in a list

I am currently in a data science Bootcamp and I am ahead of the curriculum for the moment, so I wanted to take the chance to apply some of the skills that I have learned in service of my first project. I am scraping movie information from Box Office Mojo and would like to eventually compile all of this information into a pandas dataframe. So far I have a pagination function that collects all of the links for the individual films:
def pagination_func(req_url):
    soup = bs(req_url.content, 'lxml')
    table = soup.find('table')
    links = [a['href'] for a in table.find_all('a', href=True)]
    pagination_list = []
    substring = '/release'
    for link in links:
        if substring in link:
            pagination_list.append(link)
    return pagination_list
I have somewhat lazily hard-coded the URLs that I pass through this function to retrieve the requested data:
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
link_list_by_year = []
for count, year in tqdm(enumerate(years)):
    pagination_url = 'https://www.boxofficemojo.com/year/{}/?grossesOption=calendarGrosses'.format(year)
    pagination = requests.get(pagination_url)
    link_list_by_year.append(pagination_func(pagination))
This will give me incomplete URLs that I then convert into complete URLs with this for loop:
complete_links = []
for link in link_list_by_year:
    for url in link:
        complete_links.append('https://www.boxofficemojo.com{}'.format(url))
I have then used the lxml library to retrieve the elements from the page that I wanted with this function:
def scrape_page(req_page):
    tree = html.fromstring(req_page.content)
    title.append(tree.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')[0])
    domestic.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[1]/span[2]/span/text()')[0].replace('$','').replace(',',''))
    international.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[2]/span[2]/a/span/text()')[0].replace('$','').replace(',',''))
    worldwide.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[3]/span[2]/a/span/text()')[0].replace('$','').replace(',',''))
    opening.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')[0].replace('$','').replace(',',''))
    opening_theatres.append(tree.xpath(
        '/html/body/div[1]/main/div/div[3]/div[4]/div[2]/span[2]/text()')[0].replace('\n', '').split()[0])
    MPAA.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[4]/span[2]/text()')[0])
    run_time.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[5]/span[2]/text()')[0])
    genres.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[6]/span[2]/text()')[0].replace('\n','').split())
    run_time.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[5]/span[2]/text()')[0])
I go on to initialize these lists, which I will spare posting for the sake of avoiding a wall of text; they are all just standard var = [].
Finally, I have a for loop that will iterate over my list of completed links:
for link in tqdm(complete_links[:200]):
    movie = requests.get(link)
    scrape_page(movie)
So it is all pretty basic and not very optimized, but it has helped me understand a lot about the basics of Python. Unfortunately, when I run the loop to scrape the pages, after about a minute it throws an IndexError: list index out of range and gives the following debug traceback (or one of a similar nature concerning an operation within the scrape_page function):
IndexError Traceback (most recent call last)
<ipython-input-381-739b3dc267d8> in <module>
4 for link in tqdm(test_links[:200]):
5 movie = requests.get(link)
----> 6 scrape_page(movie)
7
8
<ipython-input-378-7c13bea848f6> in scrape_page(req_page)
14
15 opening.append(tree.xpath(
---> 16 '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')[0].replace('$','').replace(',',''))
17
18 opening_theatres.append(tree.xpath(
IndexError: list index out of range
What I think is going wrong is that the particular page it hangs up on either lacks that particular element, tags it differently, or has some other oddity. I have searched for a way to handle this error but couldn't find anything relevant to what I was looking for. I have honestly been banging my head against this for the better part of 2 hours and have done everything (in my limited knowledge) short of searching every page by hand for some sort of issue.
Check if xpath() returned anything before trying to append the result to the list.
openings = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')
if openings:
    opening.append(openings[0].replace('$','').replace(',',''))
Since you should probably do this for all the lists, you may want to extract the pattern into a function:
def append_xpath(tree, list, path):
    matches = tree.xpath(path)
    if matches:
        list.append(matches[0].replace('$','').replace(',',''))
Then you would use it like this:
append_xpath(tree, opening, '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')
append_xpath(tree, domestic, '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[1]/span[2]/span/text()')
...

How can I get the text from a table in HTML?

I am trying to scrape data from https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE. Specifically, I am trying to get the placement and number of points earned by a specific player. I went to the website and found the instance where the specific player ("Nickmercs") was located in the HTML which looked like this:
[screenshot of the HTML around the player's entry]
You can see the "rank" is shown above his name as 56, and the points are shown a few lines below his name which is also 56. I then wrote the following Python 3 program to scrape the data from the website:
import requests
class tracker:
url = "https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE"
def getReq(website):
req = requests.get(website)
if req:
return req
req = getReq(url)
text = req.text
index = text.find("nickmercs")
split = text[index:index+1000]
print (split)
Running the program resulted in a large portion of the HTML code, but the instance of "Nickmercs" that it found was not the one I was looking for. The one shown in the picture of the HTML code above was the actual first instance of the "Nickmercs" string on the page, but for some reason it was not in req.text / the response to my request. As a result I went back and modified my code to print out where the first instance actually was, and found that the line was different from what was shown in the HTML code picture. The line that was supposed to list the names "Nate Hill, Nickmercs, SypherPK" actually looked like this:
<span :style="{ 'color': '#' + metadata.primary_color }">{{ getPlayerNameList(entry.teamAccountIds, 4) }}</span>
I have little knowledge of how HTML works, so I am wondering if it is possible to fix this problem. It seems to be calling some (what I imagine is a) method called getPlayerNameList() which places the names in the correct spot, but makes it so I can't easily search the names / scrape the data. Is there a way to get around this? Any help is much appreciated!
The site is dynamic, thus, you need some way to access the data populated after the page originally loads. One such way is to use selenium:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE')
h, *r = [[i.text for i in b.find_all('th' if b.td is None else 'td')] for b in soup(d.page_source, 'html.parser').find('div', {'id':'leaderboard'}).table.find_all('tr')]
new_data = {tuple(b.split(', ')):dict(zip([h[0], *h[2:]], [a[1:-1], *c])) for a, b, *c in r}
Now, to look up a player by name:
data = [b for a, b in new_data.items() if 'Nickmercs' in a][0]
Output:
{'Rank': '56', 'Points': '56 Top 0.373%', 'Matches': '10', 'Wins': '0', 'K/D': '3.50', 'Avg Place': '16.10'}
For your specific target value (Rank):
rank = [b for a, b in new_data.items() if 'Nickmercs' in a][0]['Rank']
Output:
56
The data is loaded dynamically from script tags, so the content is present in the response. You can regex out the leaderboard/session info and the accounts info and connect the two via account_id. You find the right account_id based on the player name of interest:
import requests, re, json
def get_json(pattern):
    p = re.compile(pattern, re.DOTALL)
    return p.findall(r.text)[0]
r = requests.get('https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE')
player = 'Nickmercs'
session_info = json.loads(get_json('imp_leaderboard = (.*?);'))
player_info = json.loads(get_json('imp_accounts = (.*?);'))
account_id = [i['accountId'] for i in player_info if i['playerName'] == player][0]
team_info = [i for i in session_info['entries'] if account_id in i['teamId']]
print(team_info)
This gives you all the relevant info for that team. For specific items:
print(team_info[0]['pointsEarned'])
print(team_info[0]['rank'])
You are scraping the raw HTML along with the JavaScript code, and it is not rendered.
For this task you could use computer vision to extract the table from the page.
Otherwise you can use PhantomJS (https://phantomjs.org/) to scrape the table without using images as it gives you the rendered page.
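For completeness, a rough sketch of driving PhantomJS through selenium (this assumes the phantomjs binary is on your PATH and a selenium version that still ships the PhantomJS driver; PhantomJS is no longer maintained, so a headless Chrome/Firefox driver as in the first answer is usually the better choice today):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE')
soup = BeautifulSoup(driver.page_source, 'html.parser')  # rendered page, tables included
driver.quit()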

How to find textual differences between revisions on Wikipedia pages with mwclient?

I'm trying to find the textual differences between two revisions of a given Wikipedia page using mwclient. I have the following code:
import mwclient
import difflib
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Bowdoin College']
texts = [rev for rev in page.revisions(prop='content')]
if not (texts[-1][u'*'] == texts[0][u'*']):
    ##show me the differences between the pages
Thank you!
It's not clear whether you want a difflib-generated diff or a mediawiki-generated diff using mwclient.
In the first case, you have two strings (the text of two revisions) and you want to get the diff using difflib:
...
t1 = texts[-1][u'*']
t2 = texts[0][u'*']
print('\n'.join(difflib.unified_diff(t1.splitlines(), t2.splitlines())))
(difflib can also generate an HTML diff, refer to the documentation for more info.)
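For example, a side-by-side HTML diff written to a file (a quick sketch using difflib.HtmlDiff):
html_report = difflib.HtmlDiff().make_file(t1.splitlines(), t2.splitlines(),
                                            fromdesc='oldest', todesc='newest')
with open('diff.html', 'w', encoding='utf-8') as f:
    f.write(html_report)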
But if you want the MediaWiki-generated HTML diff using mwclient you'll need revision ids:
# TODO: Loading all revisions is slow,
# try to load only as many as required.
revisions = list(page.revisions(prop='ids'))
last_revision_id = revisions[-1]['revid']
first_revision_id = revisions[0]['revid']
Then use the compare action to compare the revision ids:
compare_result = site.get('compare', fromrev=last_revision_id, torev=first_revision_id)
html_diff = compare_result['compare']['*']
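The compare result contains the rows of an HTML diff table, so to eyeball it you can wrap it in a bare table yourself (a rough sketch; the colouring relies on MediaWiki's stylesheet, so without it you only get the plain structure):
with open('diff.html', 'w', encoding='utf-8') as f:
    f.write('<table>' + html_diff + '</table>')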

How do you keep table rows together in python-docx?

As an example, I have a generic script that outputs the default table styles using python-docx (this code runs fine):
import docx
d=docx.Document()
type_of_table=docx.enum.style.WD_STYLE_TYPE.TABLE
list_table=[['header1','header2'],['cell1','cell2'],['cell3','cell4']]
numcols=max(map(len,list_table))
numrows=len(list_table)
styles=(s for s in d.styles if s.type==type_of_table)
for stylenum,style in enumerate(styles,start=1):
    label=d.add_paragraph('{}) {}'.format(stylenum,style.name))
    label.paragraph_format.keep_with_next=True
    label.paragraph_format.space_before=docx.shared.Pt(18)
    label.paragraph_format.space_after=docx.shared.Pt(0)
    table=d.add_table(numrows,numcols)
    table.style=style
    for r,row in enumerate(list_table):
        for c,cell in enumerate(row):
            table.row_cells(r)[c].text=cell
d.save('tablestyles.docx')
Next, I opened the document, highlighted a split table and, under paragraph format, selected "Keep with next," which successfully prevented the table from being split across a page.
Inspecting the XML of the non-broken table shows the paragraph property that should be keeping the table together. So I wrote this function and stuck it in the code above the d.save('tablestyles.docx') line:
def no_table_break(document):
    tags=document.element.xpath('//w:p')
    for tag in tags:
        ppr=tag.get_or_add_pPr()
        ppr.keepNext_val=True
no_table_break(d)
When I inspect the XML code the paragraph property tag is set properly and when I open the Word document, the "Keep with next" box is checked for all tables, yet the table is still split across pages. Am I missing an XML tag or something that's preventing this from working properly?
Ok, I also needed this. I think we were all making the incorrect assumption that the setting in Word's table properties (or the equivalent ways to achieve this in python-docx) was about keeping the table from being split across pages. It's not: it is simply about whether or not an individual table row can be split across pages.
Given that we know how to do this successfully in python-docx, we can prevent tables from being split across pages by putting each table within a row of a larger master table. The code below successfully does this. I'm using Python 3.6 and python-docx 0.8.6.
import docx
from docx.oxml.shared import OxmlElement
import os
import sys
def prevent_document_break(document):
"""https://github.com/python-openxml/python-docx/issues/245#event-621236139
Globally prevent table cells from splitting across pages.
"""
tags = document.element.xpath('//w:tr')
rows = len(tags)
for row in range(0, rows):
tag = tags[row] # Specify which <w:r> tag you want
child = OxmlElement('w:cantSplit') # Create arbitrary tag
tag.append(child) # Append in the new tag
d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)
big_table = d.add_table(1, 1)
big_table.autofit = True
for stylenum, style in enumerate(styles, start=1):
    cells = big_table.add_row().cells
    label = cells[0].add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = cells[0].add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell
prevent_document_break(d)
d.save('tablestyles.docx')
# because I'm lazy...
openers = {'linux': 'libreoffice tablestyles.docx',
           'linux2': 'libreoffice tablestyles.docx',
           'darwin': 'open tablestyles.docx',
           'win32': 'start tablestyles.docx'}
os.system(openers[sys.platform])
I had been struggling with the problem for some hours and finally found a solution that worked fine for me. I just changed the XPath in the topic starter's code, so now it looks like this:
def keep_table_on_one_page(doc):
    tags = doc.element.xpath('//w:tr[position() < last()]/w:tc/w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True
The key part is this selector:
[position() < last()]
We want all but the last row in each table to keep with the next one.
I would have left this as a comment under @DeadAd's answer, but I have low rep.
In case anyone is looking to stop a specific table from breaking, rather than all tables in a doc, change the xpath to the following:
tags = table._element.xpath('./w:tr[position() < last()]/w:tc/w:p')
where table refers to the instance of <class 'docx.table.Table'> which you want to keep together.
"//" will select all nodes that match the xpath (regardless of relative location), "./" will start selection from current node

Python->Beautifulsoup->Webscraping->Looping over URL (1 to 53) and saving Results

Here is the Website I am trying to scrape http://livingwage.mit.edu/
The specific URLs are from
http://livingwage.mit.edu/states/01
http://livingwage.mit.edu/states/02
http://livingwage.mit.edu/states/04 (For some reason they skipped 03)
...all the way to...
http://livingwage.mit.edu/states/56
And on each one of these URLs, I need the last row of the second table:
Example for http://livingwage.mit.edu/states/01
Required annual income before taxes  $20,260  $42,786  $51,642  $64,767  $34,325  $42,305  $47,345  $53,206  $34,325  $47,691  $56,934  $66,997
Desired output:
Alabama $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997
Alaska $24,070 $49,295 $60,933 $79,871 $38,561 $47,136 $52,233 $61,531 $38,561 $54,433 $66,316 $82,403
...
...
Wyoming $20,867 $42,689 $52,007 $65,892 $34,988 $41,887 $46,983 $53,549 $34,988 $47,826 $57,391 $68,424
After 2 hours of messing around, this is what I have so far (I am a beginner):
import requests, bs4
res = requests.get('http://livingwage.mit.edu/states/01')
res.raise_for_status()
states = bs4.BeautifulSoup(res.text)
state_name=states.select('h1')
table = states.find_all('table')[1]
rows = table.find_all('tr', 'odd')[4:]
result=[]
result.append(state_name)
result.append(rows)
When I viewed state_name and rows in the Python console, they gave me the HTML elements
[<h1>Living Wag...Alabama</h1>]
and
[<tr class = "odd... </td> </tr>]
Problem 1: These are the things that I want in the desired output, but how can I get python to give it to me in a string format rather than HTML like above?
Problem 2: How do I loop through the request.get(url01 to url56)?
Thank you for your help.
And if you can offer a more efficient way of getting to the rows variable in my code, I would greatly appreciate it, because the way I get there is not very Pythonic.
Just get all the states from the initial page; then you can select the second table and use the css classes odd results to get the tr you need. There is no need to slice, as the class names are unique:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin # python2 -> from urlparse import urljoin
base = "http://livingwage.mit.edu"
res = requests.get(base)
res.raise_for_status()
states = []
# Get all state urls and state name from the anchor tags on the base page.
# td + td skips the first td which is *Required annual income before taxes*
# get all the anchors inside each li that are children of the
# ul with the css class "states list".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations so we split on / from the right -> /states/51/
    # and join to the base url. The anchor text also holds the state name,
    # so we return the full url and the state, i.e "http://livingwage.mit.edu/states/01 "Alabama".
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))
def parse(soup):
    # Get the second table; nth-of-type indexing in css starts at 1, so "table:nth-of-type(2)" gets the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # To get the text, we just need to find all the tds and call .text on each.
    # The tr we want has the css classes "odd results"; "td + td" starts from the second td as we don't want the first.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]
# Unpack the url and state from each tuple in our states list.
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))
If you run the code you will see output like:
Alabama ['$21,144', '$43,213', '$53,468', '$67,788', '$34,783', '$41,847', '$46,876', '$52,531', '$34,783', '$48,108', '$58,748', '$70,014']
Alaska ['$24,070', '$49,295', '$60,933', '$79,871', '$38,561', '$47,136', '$52,233', '$61,531', '$38,561', '$54,433', '$66,316', '$82,403']
Arizona ['$21,587', '$47,153', '$59,462', '$78,112', '$36,332', '$44,913', '$50,200', '$58,615', '$36,332', '$52,483', '$65,047', '$80,739']
Arkansas ['$19,765', '$41,000', '$50,887', '$65,091', '$33,351', '$40,337', '$45,445', '$51,377', '$33,351', '$45,976', '$56,257', '$67,354']
California ['$26,249', '$55,810', '$64,262', '$81,451', '$42,433', '$52,529', '$57,986', '$68,826', '$42,433', '$61,328', '$70,088', '$84,192']
Colorado ['$23,573', '$51,936', '$61,989', '$79,343', '$38,805', '$47,627', '$52,932', '$62,313', '$38,805', '$57,283', '$67,593', '$81,978']
Connecticut ['$25,215', '$54,932', '$64,882', '$80,020', '$39,636', '$48,787', '$53,857', '$61,074', '$39,636', '$60,074', '$70,267', '$82,606']
You could loop over a range from 1 to 53, but extracting the anchors from the base page also gives us the state name in a single step. Using the h1 from each state page would instead give you output like Living Wage Calculation for Alabama, which you would then have to parse down to just the name; that would not be trivial, considering some states have more than one word in their names.
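For comparison, getting the state name from the h1 would need string surgery along these lines (a sketch), which is why reading it off the anchor text is simpler:
h1_text = soup.select_one("h1").text  # e.g. "Living Wage Calculation for Alabama"
state = h1_text.replace("Living Wage Calculation for", "").strip()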
Problem 1: These are the things that I want in the desired output, but how can I get python to give it to me in a string format rather than HTML like above?
You can get the text simply by doing something along the lines of:
state_name=states.find('h1').text
The same can be applied for each of the rows too.
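For example, with the rows variable from your code (a sketch):
row_text = [td.text.strip() for td in rows[0].find_all('td')]
print(state_name, row_text)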
Problem 2: How do I loop through the request.get(url01 to url56)?
The same code block can be put inside a loop from 1 to 56 like so:
for i in range(1,57):
    res = requests.get('http://livingwage.mit.edu/states/'+str(i).zfill(2))
    ...rest of the code...
zfill will add the leading zeroes. Also, it would be better if requests.get were enclosed in a try-except block so that the loop continues gracefully even when a URL is wrong.
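Putting both suggestions together, a sketch of the loop with basic error handling (the parsing part stays whatever you already do with states, rows, etc.):
for i in range(1, 57):
    url = 'http://livingwage.mit.edu/states/' + str(i).zfill(2)
    try:
        res = requests.get(url)
        res.raise_for_status()
    except requests.exceptions.RequestException as e:
        print('Skipping {}: {}'.format(url, e))
        continue
    states = bs4.BeautifulSoup(res.text, 'html.parser')
    # ...rest of the scraping code...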
