Wiki scraping using python - python

I am trying to scrape the data stored in the table on this Wikipedia page: https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India).
However, I am unable to scrape the full data.
Here's what I wrote so far:
from bs4 import BeautifulSoup
import urllib2
wiki = "https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page,"html.parser")
name = ""
pic = ""
strt = ""
end = ""
pri = ""
x=""
table = soup.find("table", { "class" : "wikitable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 8:
        name = cells[0].find(text=True)
        print name
The output obtained is:
Jairamdas Daulatram, Surjit Singh Barnala, Rao Birendra Singh
Whereas the output should be: Jairamdas Daulatram followed by Panjabrao Deshmukh

Have you read the raw html?
Because some of the cells span several rows (e.g. Political Party), most rows do not have 8 cells in them.
You cannot therefore do if len(cells) == 8 and expect it to work. Think about what this line was meant to achieve. If it was to ignore the header row then you could replace it with if len(cells) > 0 because all the header cells are <th> tags (and therefore will not appear in your list).
Page source (showing your problem):
<tr>
<td>Jairamdas Daulatram</td>
<td></td>
<td>1948</td>
<td>1952</td>
<td rowspan="6">Indian National Congress</td>
<td rowspan="6" bgcolor="#00BFFF" width="4px"></td>
<td rowspan="3">Jawaharlal Nehru</td>
<td><sup id="cite_ref-1" class="reference"><span>[</span>1<span>]</span></sup></td>
</tr>
<tr>
<td>Panjabrao Deshmukh</td>
<td></td>
<td>1952</td>
<td>1962</td>
<td><sup id="cite_ref-2" class="reference"><span>[</span>2<span>]</span></sup></td>
</tr>
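To make the effect of the rowspan attributes concrete, here is a minimal sketch that parses a trimmed copy of the two rows above and counts the <td> cells in each. The header row and its column names are assumptions added for illustration, and the citation cells are shortened to plain text.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two rows from the page source above.
# The <th> header row is a hypothetical stand-in for the real one.
html = """
<table class="wikitable">
<tr><th>Name</th><th>Portrait</th><th>From</th><th>To</th></tr>
<tr>
<td>Jairamdas Daulatram</td><td></td><td>1948</td><td>1952</td>
<td rowspan="6">Indian National Congress</td>
<td rowspan="6" bgcolor="#00BFFF" width="4px"></td>
<td rowspan="3">Jawaharlal Nehru</td>
<td>[1]</td>
</tr>
<tr>
<td>Panjabrao Deshmukh</td><td></td><td>1952</td><td>1962</td>
<td>[2]</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cell_counts = []
names = []
for row in soup.find("table", {"class": "wikitable"}).find_all("tr"):
    cells = row.find_all("td")
    if cells:  # the header row has only <th> cells, so it is skipped
        cell_counts.append(len(cells))
        names.append(cells[0].get_text(strip=True))

print(cell_counts)  # [8, 5]
print(names)        # ['Jairamdas Daulatram', 'Panjabrao Deshmukh']
```

The second row has only 5 cells because the party, colour, and prime-minister cells are carried over from the first row, which is why a check for exactly 8 cells drops it.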

As already stated in a previous answer, it does not make sense to check for a fixed number of cells. Just check whether the row contains any <td> cells. The code below is written in Python 3, but should work in Python 2.7 as well with some small adjustments.
from bs4 import BeautifulSoup
from urllib.request import urlopen
wiki = urlopen("https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)")
soup = BeautifulSoup(wiki, "html.parser")
table = soup.find("table", { "class" : "wikitable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if cells:
        name = cells[0].find(text=True)
        print(name)

Related

Fetching <td> text next to <th> tag with specific text

I'd like to retrieve information about a couple of players from transfermarkt.de, e.g. Manuel Neuer's birthday.
Here is what the relevant HTML looks like:
<tr>
<th>Geburtsdatum:</th>
<td>
27.03.1986
</td>
</tr>
I know I could get the date by using the following code:
soup = BeautifulSoup(source_code, "html.parser")
player_attributes = soup.find("table", class_ = 'auflistung')
rows = player_attributes.find_all('tr')
date_of_birth = re.search(r'([0-9]+\.[0-9]+\.[0-9]+)', rows[1].get_text(), re.M)[0]
but that is quite fragile. For example, for Robert Lewandowski the date of birth is in a different position in the table, because the attributes shown on a player's profile differ from player to player. Is there a way to logically do the following?
find the tag with 'Geburtsdatum:' in it
get the text of the tag right after it
The more robust the better :)
BeautifulSoup lets you retrieve the next matching element with the find_next() method:
import re
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.transfermarkt.de/manuel-neuer/profil/spieler/17259', headers = {'User-Agent': 'Custom'})
soup = BeautifulSoup(html.text, "html.parser")
player_attributes = soup.find("table", class_ = 'auflistung')
rows = player_attributes.find_all('tr')
def get_table_value(rows, table_header):
    # Return the text of the first <td> that follows a string matching table_header.
    for row in rows:
        # find_all returns a (possibly empty) list, never None
        helpers = row.find_all(text=re.compile(table_header))
        for helper in helpers:
            return helper.find_next('td').get_text()
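The same idea can be tried offline on the HTML fragment from the question. This is only a sketch: the helper name get_value_after_header and the extra placeholder row are assumptions made for the example, not part of the site or of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup

# The 'Geburtsdatum:' row is copied from the question; the first row is a
# hypothetical placeholder to show that the position in the table does not matter.
html = """
<table class="auflistung">
<tr><th>Other:</th><td>placeholder</td></tr>
<tr>
<th>Geburtsdatum:</th>
<td>
27.03.1986
</td>
</tr>
</table>
"""

def get_value_after_header(soup, label):
    """Return the stripped text of the first <td> after a <th> whose text contains label."""
    th = soup.find("th", string=re.compile(re.escape(label)))
    if th is None:
        return None
    td = th.find_next("td")
    return td.get_text(strip=True) if td is not None else None

soup = BeautifulSoup(html, "html.parser")
print(get_value_after_header(soup, "Geburtsdatum"))  # 27.03.1986
print(get_value_after_header(soup, "Nationalität"))  # None
```

Returning None for a missing label lets the caller decide how to handle profiles that lack a given attribute.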

Web scrape and pull an attribute value instead of the text value out of td for the entire table beautiful soup

I am trying to scrape some data from a table, but they have the content that I actually would like in an attribute.
Example xml:
'''
<tr data-row="0">
<th scope ="row" class="left" data_append-csv="AlleRi00" data-stat="player" csk="Allen, Ricardo">
Ricardo Allen
</th>
<td class="center poptip out dnp" data-stat="week_4" data-tip="Out: Concussion" csk="4">
<strong>O</strong>
</td>
'''
When scraping the table I use the following code:
'''
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')
final_data = []
for tr in table_rows:
    td = tr.find_all(['th','td'])
    row = [tr.text for tr in td]
    final_data.append(row)
df = pd.DataFrame(final_data[1:],final_data[0])
'''
With my current code, I get a good looking dataframe with headers and all the info that is visible when looking at the table. However, I would like to get "Out: Concussion" instead of "O" within the table. I've been trying numerous ways and cannot figure it out. Please let me know if this is possible with the current process or if I am approaching it all wrong.
This should help you:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')
final_data = []
for tr in table_rows:
    td = tr.find_all(['th','td'])
    # Use the data-tip attribute when a cell has one, otherwise the visible text
    row = [cell['data-tip'] if cell.has_attr("data-tip") else cell.text for cell in td]
    final_data.append(row)
m = final_data[1:]
# Transpose the data rows so that each header gets one list of values
final_dataa = [[m[j][i] for j in range(len(m))] for i in range(len(m[0]))]
df = pd.DataFrame(final_dataa,final_data[0]).T
df.to_csv("D:\\injuries.csv", index = False)
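The attribute lookup can be checked without hitting the site. The sketch below uses the row from the question (with the missing = after data-tip restored) plus one hypothetical empty cell, and uses Tag.get() with a default as an alternative to has_attr().

```python
from bs4 import BeautifulSoup

# Row adapted from the question; the week_5 cell is a hypothetical addition
# to show the fallback to the visible text.
html = """
<table id="team_injuries">
<tr data-row="0">
<th scope="row" data-stat="player" csk="Allen, Ricardo">Ricardo Allen</th>
<td class="center poptip out dnp" data-stat="week_4" data-tip="Out: Concussion" csk="4"><strong>O</strong></td>
<td class="center" data-stat="week_5"></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table", id="team_injuries").find_all("tr"):
    cells = tr.find_all(["th", "td"])
    # Prefer the data-tip attribute; fall back to the visible text.
    rows.append([cell.get("data-tip", cell.get_text(strip=True)) for cell in cells])

print(rows)  # [['Ricardo Allen', 'Out: Concussion', '']]
```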

How to scrape a table from any site and store it to data frame?

I need to scrape a table from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
and store the data in a Python dataframe.
I have pulled the table but am unable to pick out the columns (Postcode, Borough, Neighbourhood).
My table looks like this:
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A</td>
<td>North York</td>
<td>Victoria Village
</td></tr>
...
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    columns = row.find_all('td')
    Postcode = row.columns[1].get_text()
    Borough = row.columns[2].get_text()
    Neighbourhood = row.column[3].get_text()
    df.append([Postcode,Borough,Neighbourhood])
With the above code I am getting
TypeError: 'NoneType' object is not subscriptable
I googled it and learned that I cannot do
Postcode = row.columns[1].get_text()
because of the inline property of the function.
I tried something else too but got an "IndexError" message.
It's simple: I need to traverse the rows, pick the three columns from each row, and store them in a list. But I am not able to write the code for it.
Expected output is
Postcode  Borough       Neighbourhood
M1A       Not assigned  Not assigned
M2A       Not assigned  Not assigned
M3A       North York    Parkwoods
The scraping code is wrong in the places marked below.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    # The first row contains only <th> tags, so querying <td> tags returns an empty list.
    columns = row.find_all('td')
    # Skip the header row (and, in general, any row without <td> cells).
    if len(columns) > 0:
        # Index the list of cells directly; indices start at 0.
        Postcode = columns[0].get_text()
        Borough = columns[1].get_text()
        Neighbourhood = columns[2].get_text()
        df.append([Postcode,Borough,Neighbourhood])
Be careful, though: get_text() returns all the text inside a cell, including the text of any nested links, together with surrounding whitespace such as the trailing newline in the last column. You might want to strip that, for example with get_text(strip=True).
Happy web scraping :)
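As a self-contained check, here is a sketch that runs the same loop on a shortened copy of the table from the question, using get_text(strip=True) to drop the trailing newlines. The names header and records are illustrative choices.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table markup shown in the question.
html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

header = [th.get_text(strip=True) for th in table.find_all("th")]
records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # skips the header row, which has only <th> cells
        records.append([cell.get_text(strip=True) for cell in cells])

print(header)   # ['Postcode', 'Borough', 'Neighbourhood']
print(records)  # [['M1A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods']]
```

If pandas is available, pd.DataFrame(records, columns=header) should turn these two lists into the dataframe the question asks for.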
I don't know pandas, but I use this script to scrape tables. I hope it is helpful.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
tbl= soup.find('table', {'class': 'wikitable sortable'})
table_dict = {
    "head": [th.text.strip() for th in tbl.find_all('th')],
    "rows": [
        [td.text.strip() for td in tr.find_all("td")]
        for tr in tbl.find_all("tr")
        if not tr.find("th")
    ]
}
If you want to scrape a table from the web, you can use the pandas library.
import pandas as pd
url = 'valid_url'
# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html(url)
print(tables[0].head())

Extract text in proper format (with spaces in between) from <td> tags using beautiful soup

I am trying to extract column headings from one of the tables in the ABBV 10-K SEC filing (the 'Issuer Purchases of Equity Securities' table on page 25, below the graph).
Inside the <td> tag in the column-heading <tr> tag, the text is in separate <div> tags, as in the example below:
<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>
When extracting all text from a tag, there is no space between the pieces of text (e.g. for the above HTML the output is string1string2string3, whereas the expected output is string1 string2 string3).
I am using the code below to extract the column headings from the table:
url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
    row_data=[]
    cells = tr.find_all(['td', 'th'], recursive=False)
    for cell in cells[1:4]:
        row_data.append(cell.text.encode('utf-8'))
    table_data.append([x.decode('utf-8').strip() for x in row_data])
print(table_data)
Output: [['(a) TotalNumberof Shares(or Units)Purchased', '', '(b) AveragePricePaid per Share(or Unit)']]
Expected output: [['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']] (each word separated by a space)
Use the separator parameter with .get_text():
html = '''<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>'''
import bs4
soup = bs4.BeautifulSoup(html, 'html.parser')
td = soup.find('td')
td.get_text(separator=' ')
Here's how it looks with your code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
    row_data=[]
    cells = tr.find_all(['td', 'th'], recursive=False)
    for cell in cells[1:4]:
        row_data.append(cell.get_text(separator=' ').encode('utf-8'))
    table_data.append([x.decode('utf-8').strip() for x in row_data])
print(table_data)
Output:
print(table_data)
[['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']]
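When the markup has line breaks between the <div> tags, the separator alone can leave stray newlines and doubled spaces in the result. A small sketch of two ways to tidy that up, using the example fragment from the question (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = """<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# strip=True trims each text fragment and skips the whitespace-only ones
# before they are joined with the separator.
spaced = td.get_text(separator=" ", strip=True)

# Alternative: join with a separator first, then collapse all whitespace runs.
normalized = " ".join(td.get_text(separator=" ").split())

print(spaced)      # string1 string2 string3
print(normalized)  # string1 string2 string3
```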

Trying to select rows in a table, always getting NavigableString error

I'm trying unsuccessfully to scrape a list of countries and altitudes from a wiki page:
Here's the relevant HTML from this page:
<table class="wikitable sortable jquery-tablesorter">
<thead>
<tbody>
<tr>
<td>
And here's my code
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
soup = BeautifulSoup(read_url(url), 'html.parser')
table = soup.find("table", {"class":"wikitable"})
tbody = table.find("tbody")
rows = tbody.find("tr")  # <--- this gives the error, saying tbody is None
countries = []
altitudes = []
for row in rows:
    cols = row.findAll('td')
    for td in cols:
        if td.a:
            countries.append(td.a.text)
        elif "m (" in td.text:
            altitudes.append(float(td.text.split("m")[0].replace(",", "")))
Here's the error:
Traceback (most recent call last):
  File "wiki.py", line 18, in <module>
    rows = tbody.find("tr")
AttributeError: 'NoneType' object has no attribute 'find'
So then I tried just selecting the rows straight up with soup.find('tr').
This results in a NavigableString error. What else can I try to retrieve the info in a paired fashion?
If you go to the page source and search for tbody, you will get 0 results, so that could be the cause of the first problem. It seems like Wikipedia uses a custom <table class="wikitable sortable"> and does not specify tbody.
For your second problem, you need to be using find_all and not find because find just returns the first tr. So instead you want
rows = soup.find_all("tr")
Hope this helps :)
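Both points can be reproduced on a small stand-in table. In the sketch below the country names and elevations are placeholder values, not data from the Wikipedia page, and the table deliberately has no <tbody>.

```python
from bs4 import BeautifulSoup

# Placeholder table without <tbody>; the rows are made-up illustration values.
html = """
<table class="wikitable sortable">
<tr><th>Country</th><th>Elevation</th></tr>
<tr><td><a href="#">Country A</a></td><td>1,234 m (4,049 ft)</td></tr>
<tr><td><a href="#">Country B</a></td><td>56 m (184 ft)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# html.parser does not add a <tbody> that is missing from the source.
print(table.find("tbody"))  # None

# find() returns one Tag; iterating over it yields its children, some of
# which are plain strings. find_all() returns a list of row Tags.
all_rows = table.find_all("tr")

pairs = []
for row in all_rows:
    cols = row.find_all("td")
    if len(cols) >= 2:  # the header row has no <td> cells
        country = cols[0].get_text(strip=True)
        metres = float(cols[1].get_text().split("m")[0].replace(",", "").strip())
        pairs.append((country, metres))

print(pairs)  # [('Country A', 1234.0), ('Country B', 56.0)]
```

Collecting each country and its elevation from the same row keeps the two values paired, which the original two-list approach does not guarantee.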
The code below worked for me:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')
countries = []
altitudes = []
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    country = col[0].text.strip()
    # Take the part before "m", then drop whitespace and thousands separators
    elevation = float(''.join(col[1].text.split("m")[0].split()).replace(',', ''))
    countries.append(country)
    altitudes.append(elevation)
print(countries, '\n', altitudes)
