I'm learning how to write scrapers using Python in ScraperWiki. So far so good, but I have now spent a couple of days scratching my head over a problem I can't get past. I am trying to take all the links from a table. It works, but from the list of links, which runs from 001 to 486, it only ever starts grabbing them at 045. The URL/source is just a list of cities on a website; the source can be seen here:
http://www.tripadvisor.co.uk/pages/by_city.html and the specific html starts here:
</td></tr>
<tr><td class=dt1>'s-Gravenzande, South Holland Province - Aberystwyth, Ceredigion, Wales</td>
<td class=dt1>Los Corrales de Buelna, Cantabria - Lousada, Porto District, Northern Portugal</td>
</tr>
<tr><td class=dt1>Abetone, Province of Pistoia, Tuscany - Adamstown, Lancaster County, Pennsylvania </td>
<td class=dt1>Louth, Lincolnshire, England - Lucciana, Haute-Corse, Corsica</td>
</tr>
<tr><td class=dt1>Adamswiller, Bas-Rhin, Alsace - Aghir, Djerba Island, Medenine Governorate </td>
<td class=dt1>Luccianna, Haute-Corse, Corsica - Lumellogno, Novara, Province of Novara, Piedmont</td>
</tr>
What I am after is the links from "by_city_001.html" through to "by_city_486.html". Here is my code:
def scrapeCityList(pageUrl):
    html = scraperwiki.scrape(pageUrl)
    root = lxml.html.fromstring(html)
    print html
    links = root.cssselect('td.dt1 a')
    for link in links:
        url = 'http://www.tripadvisor.co.uk' + link.attrib['href']
        print url
Called in the code as follows:
scrapeCityList('http://www.tripadvisor.co.uk/pages/by_city.html')
Now when I run it, it only ever returns the links starting at 045!
The output (045 to 486):
http://www.tripadvisor.co.ukby_city_045.html
http://www.tripadvisor.co.ukby_city_288.html
http://www.tripadvisor.co.ukby_city_046.html
http://www.tripadvisor.co.ukby_city_289.html
http://www.tripadvisor.co.ukby_city_047.html
http://www.tripadvisor.co.ukby_city_290.html
and so on...
I've tried changing the selector to:
links = root.cssselect('td.dt1')
And it grabs 487 'elements' like this:
<Element td at 0x13d75f0>
<Element td at 0x13d7650>
<Element td at 0x13d76b0>
But I'm not able to get the 'href' value from this. I can't figure out why it loses the first 44 links when I select 'a' in the cssselect line. I've looked at the code but I have no clue.
Thanks in advance for any help!
Claire
Your code works fine. You can see it in action here: https://scraperwiki.com/scrapers/tripadvisor_cities/
I've added in saving to the datastore so you can see that it actually processes all the links.
import scraperwiki
import lxml.html

def scrapeCityList(pageUrl):
    html = scraperwiki.scrape(pageUrl)
    root = lxml.html.fromstring(html)
    links = root.cssselect('td.dt1 a')
    print len(links)
    batch = []
    for link in links[1:]:  # skip the first link since it's only a link to tripadvisor and not a subpage
        record = {}
        url = 'http://www.tripadvisor.co.uk/' + link.attrib['href']
        record['url'] = url
        batch.append(record)
    scraperwiki.sqlite.save(["url"], data=batch)

scrapeCityList('http://www.tripadvisor.co.uk/pages/by_city.html')
If you use the second css selector:
links = root.cssselect('td.dt1')
then you are selecting the td element and not the a element (which is a sub-element of the td). You could select the a by doing this:
url = 'http://www.tripadvisor.co.uk/' + link[0].attrib['href']
where you are selecting the first sub-element of the td (that's the [0]).
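Putting that together, here is a minimal sketch (assuming the same root as in the code above) that pulls the hrefs while iterating the td elements:
for td in root.cssselect('td.dt1'):
    if len(td) > 0:  # the td has at least one child element
        link = td[0]  # the first sub-element, i.e. the <a>
        print 'http://www.tripadvisor.co.uk/' + link.attrib['href']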
If you want to see all attributes of an element in lxml.html then use:
print element.attrib
which for the td gives:
{'class': 'dt1'}
{'class': 'dt1'}
{'class': 'dt1'}
...
and for the a:
{'href': 'by_city_001.html'}
{'href': 'by_city_244.html'}
{'href': 'by_city_002.html'}
...
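Those listings can be reproduced with a simple loop, sketched here against the same root as above:
for td in root.cssselect('td.dt1'):
    print td.attrib  # {'class': 'dt1'}
for a in root.cssselect('td.dt1 a'):
    print a.attrib  # {'href': 'by_city_001.html'} and so on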
Related
I'd like to retrieve information for a couple of players from transfermarkt.de, e.g. Manuel Neuer's birthday.
Here is what the relevant html looks like:
<tr>
<th>Geburtsdatum:</th>
<td>
27.03.1986
</td>
</tr>
I know I could get the date by using the following code:
import re
from bs4 import BeautifulSoup

# source_code holds the HTML of the player's profile page
soup = BeautifulSoup(source_code, "html.parser")
player_attributes = soup.find("table", class_='auflistung')
rows = player_attributes.find_all('tr')
date_of_birth = re.search(r'([0-9]+\.[0-9]+\.[0-9]+)', rows[1].get_text(), re.M)[0]
but that is quite fragile. E.g. for Robert Lewandowski the date of birth is in a different position in the table; which attributes appear on a player's profile varies. Is there a way to logically
find the tag with 'Geburtsdatum:' in it,
get the text of the tag right after it?
The more robust the better :)
BeautifulSoup allows you to retrieve the next matching element with the find_next() method:
import re
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.transfermarkt.de/manuel-neuer/profil/spieler/17259', headers={'User-Agent': 'Custom'})
soup = BeautifulSoup(html.text, "html.parser")
player_attributes = soup.find("table", class_='auflistung')
rows = player_attributes.find_all('tr')

def get_table_value(rows, table_header):
    for row in rows:
        helpers = row.find_all(text=re.compile(table_header))
        if helpers:  # find_all returns a list, so check for emptiness rather than None
            for helper in helpers:
                return helper.find_next('td').get_text()
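For example, calling it for the birthday (a sketch; the exact string depends on the live page):
print(get_table_value(rows, 'Geburtsdatum'))  # e.g. '27.03.1986'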
I need to scrape a table from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
and store this data in a python dataframe.
I have pulled the table but am unable to pick out the columns (Postcode, Borough, Neighbourhood).
My table looks like this:
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A</td>
<td>North York</td>
<td>Victoria Village
</td></tr>
...
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    columns = row.find_all('td')
    Postcode = row.columns[1].get_text()
    Borough = row.columns[2].get_text()
    Neighbourhood = row.column[3].get_text()
    df.append([Postcode, Borough, Neighbourhood])
With the above code I am getting
TypeError: 'NoneType' object is not subscriptable
I googled it and learned that I cannot do
Postcode = row.columns[1].get_text()
because of the inline property of the function.
I tried something else too but got an "Index error" message.
It should be simple: I need to traverse the rows, pick the three columns in each row, and store them in a list. But I am not able to write it as code.
Expected output is
Postcode  Borough       Neighbourhood
M1A       Not assigned  Not assigned
M2A       Not assigned  Not assigned
M3A       North York    Parkwoods
The scraping part of your code is wrong in the places flagged below.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    columns = row.find_all('td')  # the header row contains <th> tags, so querying <td> returns an empty list for it
    if len(columns) > 0:  # skip the header row and, in general, empty rows
        # use the indices properly to get the different values
        Postcode = columns[0].get_text()
        Borough = columns[1].get_text()
        Neighbourhood = columns[2].get_text()
        df.append([Postcode, Borough, Neighbourhood])
Then again, be careful: get_text will also pick up the text of any links and other tags inside a cell, along with stray whitespace. You might want to change the code to avoid that, for example as below.
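A small tweak (a sketch) that at least trims the stray newlines Wikipedia leaves at the end of the last cell in each row:
# strip=True trims leading/trailing whitespace from the extracted text
Postcode = columns[0].get_text(strip=True)
Borough = columns[1].get_text(strip=True)
Neighbourhood = columns[2].get_text(strip=True)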
Happy web scraping :)
I don't know pandas, but I use this script to scrape tables. Hope it is helpful.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
tbl = soup.find('table', {'class': 'wikitable sortable'})
table_dict = {
    "head": [th.text.strip() for th in tbl.find_all('th')],
    "rows": [
        [td.text.strip() for td in tr.find_all("td")]
        for tr in tbl.find_all("tr")
        if not tr.find("th")
    ]
}
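If you do want a DataFrame after all, the dict above maps onto one directly (a sketch, assuming pandas is available):
import pandas as pd

df = pd.DataFrame(table_dict["rows"], columns=table_dict["head"])
print(df.head())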
If you want to scrape a table from the web, you can use the pandas library.
import pandas as pd
url = 'valid_url'
df = pd.read_html(url)
print(df[0].head())
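Note that read_html returns a list of all tables it finds on the page. If the page has several, the match argument narrows the list to tables containing a given string; a sketch for the Wikipedia page from the question:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, match='Postcode')[0]  # only tables whose text contains 'Postcode'
print(df.head())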
I'm trying to extract links from this page:
http://www.tadpoletunes.com/tunes/celtic1/
but I only want the reels, which in the page are delineated by the following.
start:
<th align="left"><b><a name="reels">REELS</a></b></th>
end (the lines above the following):
<th align="left"><b><a name="slides">SLIDES</a></b></th>
The question is how to do this. I have the following code which gets the links for everything with a .mid extension:
import urllib.request
import bs4 as bs

def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    listofmidis = []
    listoflists = []
    tables = soup.find_all('table')
    for table in tables:
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                listofmidis.append(archive_url + link['href'])
    if listofmidis:
        listoflists.append(listofmidis)
    midi_list = [item for sublist in listoflists for item in sublist]
    return midi_list
I cannot figure this out from the beautifulsoup docs. I need the code because I will be repeating the activity on other sites in order to scrape data for training a model.
To get all the "REELS" links, you need to do the following:
Get the links in between "REELS" and "SLIDES" as you mentioned. To do that, first you'll need to find the <tr> tag containing <a name="reels">REELS</a>. This can be done using the .find_parent() method.
reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
Now, you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". We can break the loop when we find the <tr> tag with <a name="slides">SLIDES</a> (or .find('a').text == 'SLIDES').
Complete code:
import requests
from bs4 import BeautifulSoup

def import_midifiles():
    BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    midi_list = []
    reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
    for tr in reels_tr.find_next_siblings('tr'):
        if tr.find('a').text == 'SLIDES':
            break
        midi_list.append(BASE_URL + tr.find('a')['href'])
    return midi_list

print(import_midifiles())
Partial output:
['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid',
'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid',
'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid',
'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid',
'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid',
'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid',
'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
Hello all... I am trying to use BeautifulSoup to pick up the content of "Date of Employment:" on a webpage. The webpage contains 5 tables. The 5 tables are similar and look like the below.
<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Design Team</th><th class="top">Top</th></tr></thead><tbody><tr><td style="width:20px;">Designer:</td><td>Michael Linnen</td></tr>
<tr><td style="width:20px;">Date of Employment:</td><td>07 Jan 2012</td></tr>
<tr><td style="width:20px;">No of Works:</td><td>6</td></tr>
<tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>
<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Operation Team</th><th class="top">Top</th></tr></thead><tbody><tr><td style="width:20px;">Manager:</td><td>Nich Sharmen</td></tr>
<tr><td style="width:20px;">Date of Employment:</td><td>02 Nov 2005</td></tr>
<tr><td style="width:20px;">Zones:</td><td>6</td></tr>
<tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>
The text I want is in the 3rd table, whose table header is "Design Team".
I am using the below:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
aa = soup.find_all(text=re.compile("Date of Employment:"))
bb = aa[2].findNext('td')
print bb.text
The problem is that "Date of Employment:" is sometimes not present in this table. When it's not there, the code picks up the "Date of Employment:" in the next table.
How do I restrict my code to pick up only the wanted ones in the table named "Design Team"? Thanks.
Rather than finding all the "Date of Employment:" cells and taking the next td, you can directly find the table whose th is "Design Team" and search only within it:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
header = soup.find(text="Design Team")               # the text node inside the <th>
table = header.find_parent("table")                  # the table headed "Design Team"
cell = table.find("td", text="Date of Employment:")  # the label cell, if present
if cell is not None:
    print cell.find_next_sibling("td").text
else:
    print "No Date of Employment:"
header.find_parent("table") climbs from the matched text node up to the enclosing table, so everything after it is confined to the "Design Team" table.
table.find("td", text="Date of Employment:") returns the label cell if that table contains one, and None otherwise, which is what the if check guards against.
cell.find_next_sibling("td") is the td immediately following the label, i.e. the cell holding the date, and its text gets printed.
How would I, using BeautifulSoup, search for tags containing ONLY the attributes I search for?
For example, I want to find all <td valign="top"> tags.
The following code:
raw_card_data = soup.fetch('td', {'valign':re.compile('top')})
gets all of the data I want, but also grabs any <td> tag that has valign="top" alongside other attributes
I also tried:
raw_card_data = soup.findAll(re.compile('<td valign="top">'))
and this returns nothing (probably because of bad regex)
I was wondering if there was a way in BeautifulSoup to say "Find <td> tags whose only attribute is valign:top"
UPDATE
For example, if an HTML document contained the following <td> tags:
<td valign="top">.....</td><br />
<td width="580" valign="top">.......</td><br />
<td>.....</td><br />
I would want only the first <td> tag (<td valign="top">) to be returned
As explained in the BeautifulSoup documentation, you may use this:
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})
EDIT:
To return tags that have only the valign="top" attribute, you can check the length of the tag's attrs property:
from BeautifulSoup import BeautifulSoup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})
for result in results:
    if len(result.attrs) == 1:
        print result
That returns:
<td valign="top">.....</td>
You can use lambda functions in findAll, as explained in the documentation. In your case, to search for td tags with only valign="top", use the following:
td_tag_list = soup.findAll(
    lambda tag: tag.name == "td" and
                len(tag.attrs) == 1 and
                tag["valign"] == "top")
If you want to search only by attribute name, with any value:
from bs4 import BeautifulSoup
import re

# html is the response object for the page being scraped
soup = BeautifulSoup(html.text, 'lxml')
results = soup.findAll("td", {"valign": re.compile(r".*")})
As per Steve Lorimer, it's better to pass True instead of a regex:
results = soup.findAll("td", {"valign" : True})
The easiest way to do this is with the new CSS style select method:
soup = BeautifulSoup(html)
results = soup.select('td[valign="top"]')
Just pass it as an argument of findAll:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""
... <html>
... <head><title>My Title!</title></head>
... <body><table>
... <tr><td>First!</td>
... <td valign="top">Second!</td></tr>
... </table></body></html>
... """)
>>>
>>> soup.findAll('td')
[<td>First!</td>, <td valign="top">Second!</td>]
>>>
>>> soup.findAll('td', valign='top')
[<td valign="top">Second!</td>]
Adding a combination of Chris Redford's and Amr's answer, you can also search for an attribute name with any value with the select command:
from bs4 import BeautifulSoup as Soup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = Soup(html, 'lxml')
results = soup.select('td[valign]')
If you are looking to pull all tags where a particular attribute is present at all, you can use the same code as the accepted answer, but instead of specifying a value for the tag, just put True.
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : True})
This will return all td tags that have a valign attribute. This is useful if your project involves pulling info from a tag like div that is used all over but carries the very specific attributes you might be looking for.
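For instance, a quick sketch of that pattern against div tags (the data-product-id attribute here is hypothetical, purely for illustration):
soup = BeautifulSoup(html)
# every <div> carrying a data-product-id attribute, whatever its value
results = soup.findAll("div", {"data-product-id": True})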
Find using an attribute in any tag:

<th class="team" data-sort="team">Team</th>
soup.find_all(attrs={"class": "team"})

<th data-sort="team">Team</th>
soup.find_all(attrs={"data-sort": "team"})