How to extract table information using BeautifulSoup? - python

I am trying to scrape information from these kinds of pages.
I need the information contained under Internship, Residency, and Fellowship. I can extract values from tables, but here I cannot tell which table to use, because the heading (such as Internship) sits in a div tag outside the table as plain text, and the table whose values I need comes right after it. I have many pages of this kind, and not every page has all of these values; on some pages Residency may be missing entirely, which reduces the total number of tables on the page. One example of such a page is this one; on that page, Internship is not present at all.
The main problem I am facing is that all the tables have the same attribute values, so I cannot decide which table to use on a given page. If a value of interest is not present on a page, I have to return an empty string for it.
I am using BeautifulSoup in Python. Can someone point out how I could proceed in extracting these values?

It looks like the ids for the headings and data each have a unique value and standard suffixes. You can use that to search for the appropriate values. Here's my solution:
from bs4 import BeautifulSoup
# Insert whatever networking stuff you're doing here. I'm going to assume
# that you've already downloaded the page and assigned it to a variable
# named 'html'
soup = BeautifulSoup(html, 'html.parser')
headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    x = soup.find('span', string=heading)
    if x:
        # The heading span's id and its data cell's id differ only in a
        # standard suffix, so derive the cell's id from the span's.
        span_id = x.parent['id']
        table_id = span_id.replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        values.append(soup.find('td', attrs={'id': table_id}).text)
    else:
        values.append('')
print(list(zip(headings, values)))

Related

How to select all table elements inside a div parent node with BeautifulSoup?

I am trying to select all table elements from a div parent node by using a customized function.
This is what I've got so far:
from bs4 import BeautifulSoup
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510'

def getTables(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    div_component = soup.find('div', attrs={'class': 'td-post-content'})
    tables = div_component.find_all('table', attrs={'class': 'listas'})
    return tables
However, when called as getTables(url), the output is an empty list [].
I expect this function to return all HTML table elements inside the div node, given its specific attributes.
How could I adjust this function?
Is there any other library I could use to accomplish this task?
Taking what the other commenters have said and expanding on it:
Your div_component is a single element and doesn't contain any tables, but using find_all() yields 8 elements:
len(soup.find_all('div', attrs={'class':'td-post-content'}))
So you can't rely on find(), which returns only the first match; you'll need to iterate through the find_all() list to locate a div that actually contains tables.
Another way is to go straight after the tables you want:
tables = soup.find_all('table', attrs={'class':'listas'})
where tables is a list with 6 elements. If you know which table you want, you can iterate through the list until you find it.
The first problem is that find() returns only the first match, and the first td-post-content <div> does not contain any tables; I think you want find_all(). Second, you can use CSS selectors with BeautifulSoup via select(), so you can search with soup.select('div.td-post-content') without using the attrs parameter.
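A minimal sketch combining both points, using the URL and class names from the question (verify the classes against the live page):
from bs4 import BeautifulSoup
import requests

url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# select() takes a CSS selector, so this matches every table of class
# "listas" nested anywhere under any div of class "td-post-content",
# across all such divs at once.
tables = soup.select('div.td-post-content table.listas')
print(len(tables))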

Beautiful Soup to scrape data

I'm trying to scrape the EPS Estimates and EPS Earnings History (the 1st and 3rd tables) from Yahoo Finance into an existing CSV file using BeautifulSoup. https://uk.finance.yahoo.com/quote/MSFT/analysis?p=MSFT
I have made a start but am struggling to pull the exact data that I need; I am guessing I will need a for loop across the rows and td tags.
from requests import get
from bs4 import BeautifulSoup

# index is the ticker symbol, e.g. 'MSFT'
url = 'https://uk.finance.yahoo.com/quote/' + index + '/analysis?p=' + index
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
EP = soup.find('table', attrs={'class':"W(100%)"})
print(EP)
This appears to be getting only the first table, and I am not sure how to write the loop to get the appropriate data. Looking at the HTML, both the first and third tables have the same class name, so I can't use that alone to reach the right table.
Another idea I had is to search for all the tables on the page and put them into a list; I could then select the correct index, but I'm not sure how I would do that in code.
Replace soup.find with soup.find_all(). It returns a list of all the tables, which you can then iterate.
EPs = soup.find_all('table', attrs={'class':"W(100%)"})
for EP in EPs:
    ...
Your first and third tables would be EPs[0] and EPs[2] if that is what you are looking for.
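As a minimal sketch of the row-and-cell loop (the 0 and 2 indices are taken from the layout described above, so verify them against the live page):
EPs = soup.find_all('table', attrs={'class':"W(100%)"})
for table in (EPs[0], EPs[2]):
    for row in table.find_all('tr'):
        # Collect header and data cells as stripped strings.
        cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        print(cells)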

Scraping Link Location

I have been following FC Python's tutorial on web scraping, and I do not understand how they identified range(1,41,2) as the link locations for this page. Is this something I should be able to see in the page source?
import requests
from bs4 import BeautifulSoup

#Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
# headers is defined earlier in the tutorial (a browser-like User-Agent)
tree = requests.get(page, headers=headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that the link is pointing to, so for each link, take the link location.
#Additionally, we only need the links in locations 1,3,5,etc. of our list, so loop through those only
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))

#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
range(1,41,2) is used to avoid duplicated links: in the table, each row has multiple cells that contain the same link.
We can obtain the same result by getting all the links and removing duplicates with a set:
teamLinks = list({x.get("href") for x in links})
On the website, each row of the table has three a.vereinprofil_tooltip links, all with the same href. To avoid duplicates, the tutorial takes links 1, 3, 5, etc. And yes, you can see this in the page source, and also in Chrome Dev Tools.
You can also collect the links in other ways:
Use a different selector, like #yw1 .zentriert a.vereinprofil_tooltip for the CLUBS - PREMIER LEAGUE 19/20 table
Use Python code to remove duplicates:
team_links = list(dict.fromkeys([f"https://www.transfermarkt.co.uk{x['href']}" for x in soup.select("a.vereinprofil_tooltip")]))
Unlike a set, dict.fromkeys also preserves the order in which the links appear on the page.
It is an attempt to remove duplicate entries.
A more robust way to achieve the same is this:
# iterate over all links in list
for i in range(len(links)):
    teamLinks.append(links[i].get("href"))
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
# make a set to remove duplicates and then make a list of it again
teamLinks = list(set(teamLinks))
teamLinks then prints out to something like this:
['https://www.transfermarkt.co.uk/crystal-palace/spielplan/verein/873/saison_id/2019',
'https://www.transfermarkt.co.uk/afc-bournemouth/spielplan/verein/989/saison_id/2019',
'https://www.transfermarkt.co.uk/sheffield-united/spielplan/verein/350/saison_id/2019',
...

Python - capture ALL tables from an HTML page

I have emails with embedded HTML tables, and I have code that uses BeautifulSoup to extract the tables and the data within them. My problem is that sometimes it only succeeds in capturing one table when there are more.
The code I normally run on these table is:
import email
import bs4

with open(file_path) as in_f:
    msg = email.message_from_file(in_f)
html_msg = msg.get_payload(1)
body = html_msg.get_payload(decode=True)
html = body.decode()
table = bs4.BeautifulSoup(html).find("table")
data = [[cell.text.strip() for cell in row.find_all("td")] for row in table.find_all("tr")]
But for this email, and some others like it, I only successfully extract the first Package. I've tried changing one line to table = bs4.BeautifulSoup(html).find_all("table") but find_all doesn't work there.
I'm a novice when it comes to BeautifulSoup so any help would be appreciated, thanks.
I think I see what you are doing wrong;
if you do
table = bs4.BeautifulSoup(html).find("table")
it returns a Tag (ie one element). If instead you do
tables = bs4.BeautifulSoup(html).find_all("table")
it returns a ResultSet (basically a list of tables). So far so good! The problem comes in the next line, when you try to treat the ResultSet as if it were a single Tag:
... for row in tables.find_all("tr") # Can't do this!
tables is not a single element (which has a .find_all method), it is a list of elements (which doesn't) - hence the AttributeError. Instead, you have to iterate over each table, like so:
tables = bs4.BeautifulSoup(html).find_all("table")
data = []
for table in tables: # <-- extra level of iteration!
    for row in table.find_all("tr"):
        data.append([cell.text.strip() for cell in row.find_all("td")])
Hope that helps!

How does table parsing work in python? Is there an easy way other than Beautiful Soup?

I am trying to understand how one can use Beautiful Soup to extract the href links for the contents under a particular column in a table on a webpage. For example, consider the link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.
On this page, the table with class wikitable has a column Title. I need to extract the href links behind each of the values under that column and put them in an Excel sheet. What would be the best way to do this? I am having a little difficulty understanding the Beautiful Soup table-parsing documentation.
You don't really have to navigate the tree literally; you can simply look for what identifies those rows.
In this example, the URLs you are looking for reside in a table with class="wikitable"; within that table, they sit in td tags with align=center. That gives us a reasonably unique identification for our links, so we can start extracting them.
However, you should take into consideration that multiple tables with class="wikitable" and td tags with align=center may exist; if you want only the first or second table, you will have to add extra filters.
The code should look something like this for extracting all links from those tables:
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
# Only parse the wikitable tables, skipping the rest of the document.
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, "html.parser", parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
There's one more thing to note here: the use of SoupStrainer. It specifies a filter for the content you want to process, so only that content is parsed, which speeds things up. Try dropping the parse_only argument on this line:
soup = BeautifulSoup(content, "html.parser", parse_only=filter_tag)
and notice the difference. (I noticed it because my PC is not that powerful.)
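A rough sketch of how you could time that difference yourself (the exact numbers will depend on your machine and the page size):
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})

start = time.perf_counter()
BeautifulSoup(content, "html.parser", parse_only=filter_tag)  # strained parse
strained = time.perf_counter() - start

start = time.perf_counter()
BeautifulSoup(content, "html.parser")  # full-document parse
full = time.perf_counter() - start

print(f"with parse_only: {strained:.3f}s, without: {full:.3f}s")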
