I am trying to scrape a table from Wikipedia
<tr>
<td>1</td>
<td><span class="nowrap"><span class="datasortkey" data-sort-value="Etats unis"><span class="flagicon"><a class="image" href="/wiki/Fichier:Flag_of_the_United_States.svg" title="Drapeau des États-Unis"><img alt="Drapeau des États-Unis" class="noviewer thumbborder" data-file-height="650" data-file-width="1235" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/20px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/30px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/40px-Flag_of_the_United_States.svg.png 2x" width="20" /></a> </span>États-Unis</span></span></td>
<td>19 390,60 </td>
</tr>
As you can see, there are 3 columns. Here is the code I'm using:
A = []
B = []
C = []
for row in DataFondMonetaireInt.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
It works well for A and C, but not for B: I can't get the country name (in the example, États-Unis).
Why doesn't it work?
Thank you in advance.
Use .text instead of .find(text=True). find(text=True) returns only the first text node inside the cell, which in your example is the whitespace next to the flag image, not the country name; .text concatenates all text in the cell.
DataFondMonetaireInt = BeautifulSoup(html_text, "html.parser")
A = []
B = []
C = []
for row in DataFondMonetaireInt.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        A.append(cells[0].text)
        B.append(cells[1].text.strip())
        C.append(cells[2].text)
You could do the following to get each table:
import pandas as pd
tables = pd.read_html("https://fr.wikipedia.org/wiki/Liste_des_pays_par_PIB_nominal")
[tables[i] for i in range(3)]
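One caveat with this page: French Wikipedia formats numbers with a space as the thousands separator and a comma as the decimal mark (e.g. "19 390,60"), so it helps to tell read_html how to parse them. A minimal self-contained sketch on a cut-down table (the HTML string here is a made-up stand-in for the page):

```python
import pandas as pd
from io import StringIO

# Tiny stand-in for one row of the French GDP table.
html = """
<table>
  <tr><th>Rang</th><th>Pays</th><th>PIB</th></tr>
  <tr><td>1</td><td>États-Unis</td><td>19 390,60</td></tr>
</table>
"""

# thousands=" " and decimal="," make "19 390,60" parse as the float 19390.60.
tables = pd.read_html(StringIO(html), thousands=" ", decimal=",")
df = tables[0]
```

The same two keyword arguments work when passing the Wikipedia URL directly.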
You can also use the Wikipedia API to get the wikitext data:
import requests
import wikitextparser as wtp
import re
r = requests.get(
    'https://fr.wikipedia.org/w/api.php',
    params={
        'action': 'parse',
        'page': 'Liste_des_pays_par_PIB_nominal',
        'contentmodel': 'wikitext',
        'prop': 'wikitext',
        'format': 'json'
    }
)
data = wtp.parse(r.json()['parse']['wikitext']['*'])
f = re.compile(r'[0-9]+(?:\.[0-9]+)?')  # matches an integer with an optional decimal part
for i in range(1, 4):
    print([
        (t[0], wtp.parse(t[1]).templates[0].name, float(f.findall(t[2])[0]))
        for t in data.tables[i].data()
        if len(wtp.parse(t[1]).templates) > 0
    ])
The above will give you the data from the three tables using the wikitextparser library.
Related
I am trying to scrape some data from a table, but they have the content that I actually would like in an attribute.
Example HTML:
'''
<tr data-row="0">
<th scope="row" class="left" data-append-csv="AlleRi00" data-stat="player" csk="Allen, Ricardo">
Ricardo Allen
</th>
<td class="center poptip out dnp" data-stat="week_4" data-tip="Out: Concussion" csk="4">
<strong>O</strong>
</td>
'''
When scraping the table I use the following code:
'''
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')
final_data = []
for tr in table_rows:
    td = tr.find_all(['th','td'])
    row = [tr.text for tr in td]
    final_data.append(row)
df = pd.DataFrame(final_data[1:],final_data[0])
'''
With my current code, I get a good looking dataframe with headers and all the info that is visible when looking at the table. However, I would like to get "Out: Concussion" instead of "O" within the table. I've been trying numerous ways and cannot figure it out. Please let me know if this is possible with the current process or if I am approaching it all wrong.
This should help you:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')
final_data = []
for tr in table_rows:
    cells = tr.find_all(['th', 'td'])
    # Prefer the data-tip attribute (e.g. "Out: Concussion") when present,
    # otherwise fall back to the visible cell text.
    row = [cell['data-tip'] if cell.has_attr("data-tip") else cell.text for cell in cells]
    final_data.append(row)
df = pd.DataFrame(final_data[1:], columns=final_data[0])
df.to_csv("D:\\injuries.csv", index=False)
Screenshot of csv file (I've done some formatting so that it looks neat):
I need to scrape a table from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
and store this data in python dataframe.
I have pulled the table but unable to pick the columns (Postcode, Borough, Neighbourhood)
My table looks like this:
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A</td>
<td>North York</td>
<td>Victoria Village
</td></tr>
...
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
columns = row.find_all('td')
Postcode = row.columns[1].get_text()
Borough = row.columns[2].get_text()
Neighbourhood = row.column[3].get_text()
df.append([Postcode,Borough,Neighbourhood])
With the above code I am getting
TypeError: 'NoneType' object is not subscriptable
I googled it and got to know that I cannot do
Postcode = row.columns[1].get_text()
because of inline propery of the function.
I tried something else too but got some "Index error message".
It's simple. I need to traverse the row and goes on picking the three columns for each row and store it in a list. But I am not able to write it in a code.
Expected output is
Postcode Borough Neighbourhood
M1A Not assigned Not assigned
M2A Not assigned Not assigned
M3A North York Parkwoods
The scraping code is wrong in the parts noted below.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    columns = row.find_all('td')  # the first row contains <th> tags, so querying <td> returns an empty list
    if len(columns) > 0:  # skip the header row and, in general, empty rows
        # use the indices properly to get the different values
        Postcode = columns[0].get_text()
        Borough = columns[1].get_text()
        Neighbourhood = columns[2].get_text()
        df.append([Postcode, Borough, Neighbourhood])
Be careful, though: get_text also returns the text of nested links and anchor tags. You might want to adjust the code to avoid that.
Happy web scraping :)
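A sketch of that cleanup: Wikipedia cells often carry <sup> footnote markers like "[1]" whose text get_text would keep, so you can decompose them first (the cell HTML below is a made-up example):

```python
from bs4 import BeautifulSoup

# Hypothetical Wikipedia-style cell with a citation marker.
html = '<td>North York<sup class="reference">[7]</sup></td>'
cell = BeautifulSoup(html, "html.parser").td

# Remove the footnote tags before reading the text.
for sup in cell.find_all("sup"):
    sup.decompose()

print(cell.get_text(strip=True))  # North York
```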
I don't know pandas, but I use this script to scrape tables. Hope it is helpful.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
tbl = soup.find('table', {'class': 'wikitable sortable'})
table_dict = {
    "head": [th.text.strip() for th in tbl.find_all('th')],
    "rows": [
        [td.text.strip() for td in tr.find_all("td")]
        for tr in tbl.find_all("tr")
        if not tr.find("th")
    ]
}
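If a DataFrame is still the end goal, a dict in that head/rows shape drops straight into one. A small self-contained sketch (table_dict here is a hand-made stand-in for the scraped result):

```python
import pandas as pd

# Hand-made stand-in for the scraped head/rows dict.
table_dict = {
    "head": ["Postcode", "Borough", "Neighbourhood"],
    "rows": [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
    ],
}

# The header list becomes the columns, each row list becomes a row.
df = pd.DataFrame(table_dict["rows"], columns=table_dict["head"])
```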
If you want to scrape a table from the web, you can use the pandas library.
import pandas as pd
url = 'valid_url'
df = pd.read_html(url)
print(df[0].head())
HTML code:
<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>
I need to get the values 4.5 kn and 7.1 kn as separate list items so I can append them separately. I wanted to split the text string using re.sub, but it does not work. I also tried using replace to remove the br, but that did not work either. Can anybody provide any insight?
Python code:
def NameSearch(shipLink, mmsi, shipName):
    from bs4 import BeautifulSoup
    import urllib2
    import csv
    import re
    values = []
    values.append(mmsi)
    values.append(shipName)
    regex = re.compile(r'[\n\r\t]')
    i = 0
    with open('Ship_indexname.csv', 'wb') as f:
        writer = csv.writer(f)
        while True:
            try:
                shipPage = urllib2.urlopen(shipLink, timeout=5)
            except urllib2.URLError:
                continue
            except:
                continue
            break
        soup = BeautifulSoup(shipPage, "html.parser")  # Read the web page HTML
        #soup.find('br').replaceWith(' ')
        #for br in soup('br'):
        #    br.extract()
        table = soup.find_all("table", {"id": "vessel-related"})  # Finds the table with id "vessel-related"
        for mytable in table:  # Loops over the matched tables
            table_body = mytable.find_all('tbody')  # Finds tbody section in table
            for body in table_body:
                rows = body.find_all('tr')  # Finds all rows
                for tr in rows:  # Loops rows
                    cols = tr.find_all('td')  # Finds the columns
                    for td in cols:  # Loops the columns
                        checker = td.text.encode('ascii', 'ignore')
                        check = regex.sub('', checker)
                        if check == ' Speed (avg./max): ':
                            i = 1
                        elif i == 1:
                            print td.text
                            pat = re.compile('<br\s*/>')
                            print pat.sub(" ", td.text)
                            values.append(td.text.strip("\n").encode('utf-8'))  # Takes the second column's value and appends it to the values list
                            i = 0
    #print values
    return values
NameSearch('https://www.fleetmon.com/vessels/kind-of-magic_0_3478642/','230034570','KIND OF MAGIC')
Locate the "Speed (avg./max)" label first and then go to the value via .find_next():
from bs4 import BeautifulSoup
data = '<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>'
soup = BeautifulSoup(data, "html.parser")
label = soup.find("label", class_="identifier", text="Speed (avg./max):")
value = label.find_next("td", class_="value").get_text(strip=True)
print(value) # prints 4.5 kn7.1 kn
Now, you can extract the actual numbers from the string:
import re
speed_values = re.findall(r"([0-9.]+) kn", value)
print(speed_values)
Prints ['4.5', '7.1'].
You can then further convert the values to floats and unpack into separate variables:
avg_speed, max_speed = map(float, speed_values)
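Alternatively, get_text() accepts a separator, so the <br> boundary can be kept as a character you can split on instead of disappearing. A small self-contained sketch on the same snippet:

```python
from bs4 import BeautifulSoup

# The value span from the question, with two speeds separated by <br>.
data = '<span class="block">4.5 kn<br>7.1 kn</span>'
span = BeautifulSoup(data, "html.parser").span

# Join the text nodes with "|" so the <br> boundary survives, then split.
parts = span.get_text("|", strip=True).split("|")
print(parts)  # ['4.5 kn', '7.1 kn']
```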
I am trying to scrape the data stored in the table of this wikipedia page https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India).
However, I am unable to scrape the full data.
Here's what I wrote so far:
from bs4 import BeautifulSoup
import urllib2
wiki = "https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page,"html.parser")
name = ""
pic = ""
strt = ""
end = ""
pri = ""
x=""
table = soup.find("table", { "class" : "wikitable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 8:
        name = cells[0].find(text=True)
        print name
The output obtained is:
Jairamdas Daulatram, Surjit Singh Barnala, Rao Birendra Singh
Whereas the output should be: Jairamdas Daulatram followed by Panjabrao Deshmukh
Have you read the raw html?
Because some of the cells span several rows (e.g. Political Party), most rows do not have 8 cells in them.
You cannot therefore do if len(cells) == 8 and expect it to work. Think about what this line was meant to achieve. If it was to ignore the header row then you could replace it with if len(cells) > 0 because all the header cells are <th> tags (and therefore will not appear in your list).
Page source (showing your problem):
<tr>
<td>Jairamdas Daulatram</td>
<td></td>
<td>1948</td>
<td>1952</td>
<td rowspan="6">Indian National Congress</td>
<td rowspan="6" bgcolor="#00BFFF" width="4px"></td>
<td rowspan="3">Jawaharlal Nehru</td>
<td><sup id="cite_ref-1" class="reference"><span>[</span>1<span>]</span></sup></td>
</tr>
<tr>
<td>Panjabrao Deshmukh</td>
<td></td>
<td>1952</td>
<td>1962</td>
<td><sup id="cite_ref-2" class="reference"><span>[</span>2<span>]</span></sup></td>
</tr>
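As a side note, if the goal is just the filled-out table, pandas.read_html expands rowspan cells automatically, so every row gets its party value. A minimal sketch against a cut-down version of the page source above:

```python
import pandas as pd
from io import StringIO

# Cut-down version of the ministers table, with a rowspan on the party cell.
html = """
<table>
<tr><th>Name</th><th>Party</th></tr>
<tr><td>Jairamdas Daulatram</td><td rowspan="2">Indian National Congress</td></tr>
<tr><td>Panjabrao Deshmukh</td></tr>
</table>
"""

# read_html copies the rowspan value down into the second row.
df = pd.read_html(StringIO(html))[0]
```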
As already stated in a previous answer, it does not make sense to test for a fixed length. Just check whether any <td> cells exist. The code below is written in Python 3, but should work in Python 2.7 as well with some small adjustments.
from bs4 import BeautifulSoup
from urllib.request import urlopen
wiki = urlopen("https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)")
soup = BeautifulSoup(wiki, "html.parser")
table = soup.find("table", { "class" : "wikitable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if cells:
        name = cells[0].find(text=True)
        print(name)
Suppose I have:
<tr>
<td class="prodSpecAtribute">word</td>
<td colspan="5">
another_word
</td>
</tr>
I want to extract the text in the two td cells (word and another_word).
So I used BeautifulSoup.
This is the code Martijn Pieters was asking for:
Basically, it grabs info from an HTML page (from a table) and stores the values in left-column and right-column lists. Then I create a dictionary from these details, using the left-column list for the keys and the right-column list for the values.
def get_data(page):
    soup = BeautifulSoup(page)
    left = []
    right = []
    # Obtain data from table and store into left and right columns
    # Iterate through each row
    for tr in soup.findAll('tr'):
        # Find all table data (cols) in that row
        tds = tr.findAll('td')
        # Make sure there are at least 2 cells: a label and a value
        if len(tds) >= 2:
            # Find each entry in a row -> convert to text
            right_col = []
            inp = []
            once = 0
            no_class = 0
            for td in tds:
                if once == 0:
                    # Check if of class 'prodSpecAtribute'
                    if check(td) == True:
                        left_col = td.findAll(text=True)
                        left_col_x = re.sub('&\w+;', '', str(left_col[0]))
                        once = 1
                    else:
                        no_class = 1
                        break
                else:
                    right_col = td.findAll(text=True)
                    right_col_x = ' '.join(text for text in right_col if text.strip())
                    right_col_x = re.sub('&\w+;', '', right_col_x)
                    inp.append(right_col_x)
            if no_class == 0:
                inps = '. '.join(inp)
                left.append(left_col_x)
                right.append(inps)
    # Create a dictionary from the left and right cols
    item = dict(zip(left, right))
    return item
You may use HTQL (http://htql.net).
Here it is for your example:
import htql
page="""
<tr>
<td class="prodSpecAtribute">word</td>
<td colspan="5">
another_word
</td>
</tr>
"""
query = """
<tr>{
    c1 = <td (class='prodSpecAtribute')>1 &tx;
    c2 = <td>2 &tx &trim;
}
"""
a=htql.query(page, query)
print(dict(a))
It prints:
{'word': 'another_word'}
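For comparison, here is a rough BeautifulSoup version of the same query, sticking with the library the question started from: find each label cell by class, then take the next td sibling as its value.

```python
from bs4 import BeautifulSoup

# The fragment from the question.
page = """
<tr>
<td class="prodSpecAtribute">word</td>
<td colspan="5">
another_word
</td>
</tr>
"""

soup = BeautifulSoup(page, "html.parser")
result = {}
for td in soup.find_all("td", class_="prodSpecAtribute"):
    # The value lives in the next <td> sibling of the label cell.
    value_td = td.find_next_sibling("td")
    if value_td is not None:
        result[td.get_text(strip=True)] = value_td.get_text(strip=True)

print(result)  # {'word': 'another_word'}
```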