Extract links after th in beautifulsoup - python

I'm trying to extract links from this page:
http://www.tadpoletunes.com/tunes/celtic1/
but I only want the reels, which in the page are delineated by:
start:
<th align="left"><b><a name="reels">REELS</a></b></th>
end (the lines above the following):
<th align="left"><b><a name="slides">SLIDES</a></b></th>
The question is how to do this. I have the following code which gets the links for everything with a .mid extension:
import urllib.request
import bs4 as bs

def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    tables = soup.find_all('table')
    listoflists = []
    for table in tables:
        listofmidis = []
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                listofmidis.append(archive_url + link['href'])
        if listofmidis:
            listoflists.append(listofmidis)
    # flatten the per-table lists into one list of links
    midi_list = [item for sublist in listoflists for item in sublist]
    return midi_list
I cannot figure this out from the beautifulsoup docs. I need the code because I will be repeating the activity on other sites in order to scrape data for training a model.

To get all the "REELS" links, you need to do the following:
Get the links in between "REELS" and "SLIDES" as you mentioned. To do that, first you'll need to find the <tr> tag containing <a name="reels">REELS</a>. This can be done using the .find_parent() method.
reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
Now, you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". We can break the loop when we find the <tr> tag with <a name="slides">SLIDES</a> (or .find('a').text == 'SLIDES').
Complete code:
import requests
from bs4 import BeautifulSoup

def import_midifiles():
    BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    midi_list = []
    reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
    for tr in reels_tr.find_next_siblings('tr'):
        if tr.find('a').text == 'SLIDES':
            break
        midi_list.append(BASE_URL + tr.find('a')['href'])
    return midi_list

print(import_midifiles())
Partial output:
['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid',
'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid',
'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid',
'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid',
'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid',
'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid',
'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
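Note that tr.find('a') assumes every row between the two headers contains a link; on a table with spacer rows it would raise an AttributeError. A slightly more defensive loop body (my addition, not part of the original answer) skips such rows and keeps only .mid links:

for tr in reels_tr.find_next_siblings('tr'):
    a = tr.find('a')
    if a is None:  # skip rows without any link
        continue
    if a.text == 'SLIDES':
        break
    if a.get('href', '').endswith('.mid'):
        midi_list.append(BASE_URL + a['href'])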

Related

Fetching <td> text next to <th> tag with specific text

I'd like to retrieve information for a couple of players from transfermarkt.de, e.g. Manuel Neuer's birthday.
Here is what the relevant HTML looks like:
<tr>
<th>Geburtsdatum:</th>
<td>
27.03.1986
</td>
</tr>
I know I could get the date by using the following code:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(source_code, "html.parser")
player_attributes = soup.find("table", class_='auflistung')
rows = player_attributes.find_all('tr')
date_of_birth = re.search(r'([0-9]+\.[0-9]+\.[0-9]+)', rows[1].get_text(), re.M)[0]
but that is quite fragile. E.g. for Robert Lewandowski the date of birth is in a different position in the table, since which attributes appear on a player's profile differs. Is there a way to logically do the following:
find the tag with 'Geburtsdatum:' in it
get the text of the tag right after it
The more robust the better :)
BeautifulSoup lets you retrieve the next matching element with the find_next() method:
import re
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.transfermarkt.de/manuel-neuer/profil/spieler/17259',
                    headers={'User-Agent': 'Custom'})
soup = BeautifulSoup(html.text, "html.parser")
player_attributes = soup.find("table", class_='auflistung')
rows = player_attributes.find_all('tr')

def get_table_value(rows, table_header):
    for row in rows:
        # find_all returns a (possibly empty) list, so test its truthiness
        helpers = row.find_all(text=re.compile(table_header))
        if helpers:
            for helper in helpers:
                return helper.find_next('td').get_text()
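For example (a usage sketch I've added; the surrounding whitespace comes from the HTML shown above):

date_of_birth = get_table_value(rows, 'Geburtsdatum')
print(date_of_birth.strip())  # '27.03.1986'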

How to store variations for html elements in BeautifulSoup calls through many sites?

I am scraping site data; the data itself is very simple and clearly displayed on each page. But since each site is different, the actual HTML structure varies.
e.g.
site 1
site = 'https://www.site1.com/items'
page = requests.get(site)
soup = BeautifulSoup(page.content, 'html.parser')

# items are stored in divs with class "item-container"
item_soup = soup.find_all(class_='item-container')

items_to_store = []
for item in item_soup:
    # within the div, the "true" name is stored as the alt-text for the item's image
    items_to_store.append(item.img['alt'])
site 2
site = 'https://www.site2.com/section/items'
page = requests.get(site)
soup = BeautifulSoup(page.content, 'html.parser')

# items are stored in spans with class "item-title"
item_soup = soup.find_all(class_="item-title")

items_to_store = []
for item in item_soup:
    # within the span, the "true" name is stored as text
    items_to_store.append(item.text)
These two snippets are identical except for
url
the tag details for finding the items
the way the item name is extracted
Rather than just copy/paste the code for each site, and replacing the relevant bits, I'd prefer to write a function. Ideally this function would take in the site, reference a dictionary where I store the relevant tag details, and pull the relevant data to re-scrape.
Is there a way to fill the parameters in soup.find_all() and items_to_store.append() dynamically?
Rather than .find_all(), you can use CSS selectors. You can then define one function that takes the URL and returns a CSS selector string plus a function that takes a tag as a parameter and extracts the information from it.
For example:
from bs4 import BeautifulSoup

txt_site1 = '''
<div class="item-container"><img alt="this is the information from site 1" /></div>
'''
txt_site2 = '''
<span class="item-title">this is the information from site 2</span>
'''

def sites(site):
    if 'www.site1.com' in site:
        return 'div.item-container img[alt]', lambda tag: tag['alt']
    elif 'www.site2.com' in site:
        return 'span.item-title', lambda tag: tag.text

def universal_scraper(soup, url):
    css_selector, item_getter = sites(url)
    item_soup = soup.select(css_selector)
    for item in item_soup:
        yield item_getter(item)

all_items = []
soup = BeautifulSoup(txt_site1, 'html.parser')  # load site1
all_items.extend(universal_scraper(soup, 'https://www.site1.com/items'))
soup = BeautifulSoup(txt_site2, 'html.parser')  # load site2
all_items.extend(universal_scraper(soup, 'https://www.site2.com/section/items'))
print(all_items)
Prints:
['this is the information from site 1', 'this is the information from site 2']
If you want a solution with .find_all(), you can try this:
def sites(site):
    if 'www.site1.com' in site:
        return lambda soup: soup.find_all(class_='item-container'), lambda tag: tag.img['alt']
    elif 'www.site2.com' in site:
        return lambda soup: soup.find_all(class_="item-title"), lambda tag: tag.text

def universal_scraper(soup, url):
    item_finder, item_getter = sites(url)
    item_soup = item_finder(soup)
    for item in item_soup:
        yield item_getter(item)

# the rest is the same...
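Since the question mentions keeping the tag details in a dictionary, the if/elif chain can also be replaced by a config dict keyed by domain (my sketch, not part of the original answer; the domain keys come from the example URLs):

SITE_CONFIG = {
    'www.site1.com': ('div.item-container img[alt]', lambda tag: tag['alt']),
    'www.site2.com': ('span.item-title', lambda tag: tag.text),
}

def sites(site):
    # return the (selector, getter) pair for the first domain found in the URL
    for domain, config in SITE_CONFIG.items():
        if domain in site:
            return config
    raise ValueError('no scraping config for ' + site)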

How to scrape a table from any site and store it to data frame?

I need to scrape a table from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
and store this data in a Python DataFrame.
I have pulled the table but am unable to pick out the columns (Postcode, Borough, Neighbourhood).
My table looks like this:
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A</td>
<td>North York</td>
<td>Victoria Village
</td></tr>
...
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    columns = row.find_all('td')
    Postcode = row.columns[1].get_text()
    Borough = row.columns[2].get_text()
    Neighbourhood = row.column[3].get_text()
    df.append([Postcode, Borough, Neighbourhood])
With the above code I am getting
TypeError: 'NoneType' object is not subscriptable
I googled it and learned that I cannot do
Postcode = row.columns[1].get_text()
because of the inline property of the function.
I tried something else too but got an IndexError.
It should be simple: I need to traverse the rows, picking the three columns of each row and storing them in a list. But I am not able to write it in code.
Expected output is
Postcode  Borough       Neighbourhood
M1A       Not assigned  Not assigned
M2A       Not assigned  Not assigned
M3A       North York    Parkwoods
The scraping code is wrong in the parts flagged in the comments below.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    # the first row contains <th> tags, so querying <td> returns an empty list
    columns = row.find_all('td')
    # skip the header row (and, in general, any empty row)
    if len(columns) > 0:
        # use the indices properly to get the different values
        Postcode = columns[0].get_text()
        Borough = columns[1].get_text()
        Neighbourhood = columns[2].get_text()
        df.append([Postcode, Borough, Neighbourhood])
Then again, be careful: get_text will also include the text of any links and anchor tags inside the cell. You might want to change the code to account for that.
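One common cleanup (my addition, not in the original answer) is to strip the whitespace and trailing newlines per cell:

Neighbourhood = columns[2].get_text(strip=True)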
Happy web scraping :)
I don't know pandas, but I use this script to scrape a table. Hope it is helpful.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
tbl = soup.find('table', {'class': 'wikitable sortable'})
table_dict = {
    # header cells
    "head": [th.text.strip() for th in tbl.find_all('th')],
    # one list of cell texts per data row, skipping the header row
    "rows": [
        [td.text.strip() for td in tr.find_all("td")]
        for tr in tbl.find_all("tr")
        if not tr.find("th")
    ]
}
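If you do want that dict as a DataFrame after all (a bridge to the question's goal that I've added; not part of the original answer):

import pandas as pd

df = pd.DataFrame(table_dict["rows"], columns=table_dict["head"])
print(df.head())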
If you want to scrape a table from the web, you can use the pandas library.
import pandas as pd
url = 'valid_url'
df = pd.read_html(url)
print(df[0].head())
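Note that pd.read_html returns a list of DataFrames, one per table found on the page; the match parameter narrows this down (a hedged refinement I've added, using the question's URL and its 'Postcode' header):

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# keep only tables whose text matches 'Postcode'
df = pd.read_html(url, match='Postcode')[0]
print(df.head())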

Python scraper (Scraperwiki) only getting half of the table

I'm learning how to write scrapers using Python in ScraperWiki. So far so good, but I have spent a couple of days scratching my head now over a problem I can't figure out. I am trying to take all the links from a table. It works, but of the links, which run from 001 to 486, it only ever starts grabbing them at 045. The url/source is just a list of cities on a website; the source can be seen here:
http://www.tripadvisor.co.uk/pages/by_city.html and the specific html starts here:
</td></tr>
<tr><td class=dt1>'s-Gravenzande, South Holland Province - Aberystwyth, Ceredigion, Wales</td>
<td class=dt1>Los Corrales de Buelna, Cantabria - Lousada, Porto District, Northern Portugal</td>
</tr>
<tr><td class=dt1>Abetone, Province of Pistoia, Tuscany - Adamstown, Lancaster County, Pennsylvania /td>
<td class=dt1>Louth, Lincolnshire, England - Lucciana, Haute-Corse, Corsica</td>
</tr>
<tr><td class=dt1>Adamswiller, Bas-Rhin, Alsace - Aghir, Djerba Island, Medenine Governorate </td>
<td class=dt1>Luccianna, Haute-Corse, Corsica - Lumellogno, Novara, Province of Novara, Piedmont</td>
</tr>
What I am after is the links from "by_city_001.html" through to "by_city_486.html". Here is my code:
def scrapeCityList(pageUrl):
    html = scraperwiki.scrape(pageUrl)
    root = lxml.html.fromstring(html)
    print html
    links = root.cssselect('td.dt1 a')
    for link in links:
        url = 'http://www.tripadvisor.co.uk' + link.attrib['href']
        print url
Called in the code as follows:
scrapeCityList('http://www.tripadvisor.co.uk/pages/by_city.html')
Now when I run it, it only ever returns the links starting at 045!
The output (045~486)
http://www.tripadvisor.co.ukby_city_045.html
http://www.tripadvisor.co.ukby_city_288.html
http://www.tripadvisor.co.ukby_city_046.html
http://www.tripadvisor.co.ukby_city_289.html
http://www.tripadvisor.co.ukby_city_047.html
http://www.tripadvisor.co.ukby_city_290.html and so on...
I've tried changing the selector to:
links = root.cssselect('td.dt1')
And it grabs 487 'elements' like this:
<Element td at 0x13d75f0>
<Element td at 0x13d7650>
<Element td at 0x13d76b0>
But I'm not able to get the 'href' value from this. I can't figure out why it loses the first 44 links when I select 'a' in the cssselect line. I've looked at the code but I have no clue.
Thanks in advance for any help!
Claire
Your code works fine. You can see it in action here: https://scraperwiki.com/scrapers/tripadvisor_cities/
I've added in saving to the datastore so you can see that it actually processes all the links.
import scraperwiki
import lxml.html

def scrapeCityList(pageUrl):
    html = scraperwiki.scrape(pageUrl)
    root = lxml.html.fromstring(html)
    links = root.cssselect('td.dt1 a')
    print len(links)
    batch = []
    for link in links[1:]:  # skip the first link since it's only a link to tripadvisor and not a subpage
        record = {}
        url = 'http://www.tripadvisor.co.uk/' + link.attrib['href']
        record['url'] = url
        batch.append(record)
    scraperwiki.sqlite.save(["url"], data=batch)

scrapeCityList('http://www.tripadvisor.co.uk/pages/by_city.html')
If you use the second css selector:
links = root.cssselect('td.dt1')
then you are selecting the td element and not the a element (which is a sub-element of the td). You could select the a by doing this:
url = 'http://www.tripadvisor.co.uk/' + link[0].attrib['href']
where you are selecting the first sub-element of the td (that's the [0]).
If you want to see all attributes of an element in lxml.html then use:
print element.attrib
which for the td gives:
{'class': 'dt1'}
{'class': 'dt1'}
{'class': 'dt1'}
...
and for the a:
{'href': 'by_city_001.html'}
{'href': 'by_city_244.html'}
{'href': 'by_city_002.html'}
...
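Putting that together, the td-based selector looks like this (my sketch, not part of the original answer; it uses requests + lxml directly instead of the ScraperWiki helper):

import requests
import lxml.html

root = lxml.html.fromstring(requests.get('http://www.tripadvisor.co.uk/pages/by_city.html').text)
for td in root.cssselect('td.dt1'):
    anchors = td.cssselect('a')
    if anchors:  # some cells may contain no link
        print('http://www.tripadvisor.co.uk/' + anchors[0].attrib['href'])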

How to find tags with only certain attributes - BeautifulSoup

How would I, using BeautifulSoup, search for tags containing ONLY the attributes I search for?
For example, I want to find all <td valign="top"> tags.
The following code:
raw_card_data = soup.fetch('td', {'valign':re.compile('top')})
gets all of the data I want, but also grabs any <td> tag that has the attribute valign:top
I also tried:
raw_card_data = soup.findAll(re.compile('<td valign="top">'))
and this returns nothing (probably because of bad regex)
I was wondering if there was a way in BeautifulSoup to say "Find <td> tags whose only attribute is valign:top"
UPDATE
For example, if an HTML document contained the following <td> tags:
<td valign="top">.....</td><br />
<td width="580" valign="top">.......</td><br />
<td>.....</td><br />
I would want only the first <td> tag (<td valign="top">) to return
As explained in the BeautifulSoup documentation, you may use this:
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})
EDIT:
To return tags that have only the valign="top" attribute, you can check the length of the tag's attrs property:
from BeautifulSoup import BeautifulSoup

html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign": "top"})
for result in results:
    if len(result.attrs) == 1:
        print result
That returns:
<td valign="top">.....</td>
You can use lambda functions in findAll, as explained in the documentation. In your case, to search for a td tag with only valign="top", use the following:

td_tag_list = soup.findAll(
    lambda tag: tag.name == "td" and
    len(tag.attrs) == 1 and
    tag["valign"] == "top")
If you want to search only by attribute name, with any value:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html.text, 'lxml')
results = soup.findAll("td", {"valign": re.compile(r".*")})

As per Steve Lorimer, it's better to pass True instead of a regex:
results = soup.findAll("td", {"valign" : True})
The easiest way to do this is with the new CSS style select method:
soup = BeautifulSoup(html)
results = soup.select('td[valign="top"]')
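Note that this selector matches any td carrying valign="top", not only tags where it is the sole attribute; to honour the question's "only attribute" requirement you would still filter the results (my addition):

results = [td for td in soup.select('td[valign="top"]')
           if len(td.attrs) == 1]  # keep tags whose only attribute is valign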
Just pass it as an argument of findAll:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""
... <html>
... <head><title>My Title!</title></head>
... <body><table>
... <tr><td>First!</td>
... <td valign="top">Second!</td></tr>
... </table></body><html>
... """)
>>>
>>> soup.findAll('td')
[<td>First!</td>, <td valign="top">Second!</td>]
>>>
>>> soup.findAll('td', valign='top')
[<td valign="top">Second!</td>]
Combining Chris Redford's and Amr's answers, you can also search for an attribute name with any value using the select command:
from bs4 import BeautifulSoup as Soup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = Soup(html, 'lxml')
results = soup.select('td[valign]')
If you are looking to pull all tags where a particular attribute is present at all, you can use the same code as the accepted answer, but instead of specifying a value for the tag, just put True.
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : True})
This will return all td tags that have a valign attribute. This is useful if your project involves pulling info from a tag like div that is used all over, but which can carry the very specific attributes you might be looking for.
To find tags by a given attribute, whatever the tag name:
<th class="team" data-sort="team">Team</th>
soup.find_all(attrs={"class": "team"})
<th data-sort="team">Team</th>
soup.find_all(attrs={"data-sort": "team"})
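A minimal runnable version of those two lookups (my sketch; the HTML is the snippet shown above):

from bs4 import BeautifulSoup

html = '<th class="team" data-sort="team">Team</th>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all(attrs={"class": "team"}))      # matches via the class attribute
print(soup.find_all(attrs={"data-sort": "team"}))  # matches via the data-sort attribute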
