Extracting data from a wikipedia page - python

This question might be really specific. I am trying to extract the number of employees from the Wikipedia pages of companies such as https://en.wikipedia.org/wiki/3M.
I tried using the Wikipedia Python API and some regex queries, but I couldn't find anything solid that I could generalize to any company (ignoring exceptions).
Also, because the table row does not have an id or a class, I cannot directly access the value. This is the source:
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference">[1]</sup></td>
</tr>
So even though I can identify the table (class infobox vcard), I couldn't figure out a way to scrape this information using BeautifulSoup.
Is there a way to extract this information? It is present in the summary table on the right at the beginning of the page.

Using lxml.etree instead of BeautifulSoup, you can get what you want with an XPath expression:
>>> from lxml import etree
>>> import requests
>>> r = requests.get('https://en.wikipedia.org/wiki/3M')
>>> doc = etree.HTML(r.text)
>>> e = doc.xpath('//table[@class="infobox vcard"]//tr[th/div/text()="Number of employees"]/td')
>>> e[0].text
'89,800 (2015)'
Let's take a closer look at that expression:
//table[@class="infobox vcard"]//tr[th/div/text()="Number of employees"]/td
That says: find all table elements whose class attribute is infobox vcard; inside those, look for descendant tr elements that have a child th containing a div whose text is "Number of employees"; and from each such tr, take its td child. The expression returns a list of matches, so e[0] is the first one.
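If you would rather stay with BeautifulSoup, the same row can be located by matching the header text and then taking the sibling cell. A minimal sketch against the infobox markup quoted in the question (inlined here so the snippet is self-contained):

```python
from bs4 import BeautifulSoup

# The infobox row from the question, inlined so the snippet runs on its own.
html = '''
<table class="infobox vcard">
<tr>
<th scope="row"><div>Number of employees</div></th>
<td>89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference">[1]</sup></td>
</tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='infobox')

# Find the <th> whose text is the label, then take the <td> next to it.
th = table.find(lambda tag: tag.name == 'th' and
                tag.get_text(strip=True) == 'Number of employees')
td = th.find_next_sibling('td')

print(td.contents[0])  # first text node only, without the citation marker
```

On the live page you would fetch the HTML with requests first; the lookup itself does not depend on any id or class on the row.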

Why reinvent the wheel? DBpedia has this information in RDF triples.
See e.g. http://dbpedia.org/page/3M

Related

Beautifulsoup4, BS4, Python Parsing Question

I am parsing a webpage using bs4. There is more than one element I would like to select, and they share the same class name.
My parsing code:
rows_ranking = soup_ranking.select('#current-poll tbody tr .left')
The page I want to parse has two different ".left" identifiers in the table rows. How can I choose which one I would like? Here is an example of two of these table rows (one I would like my program to parse, the other I would like to ignore):
1 - <td class="left " data-stat="school_name" csk="Baylor.015">Baylor</td>
2 - <td class="left " data-stat="conf_abbr" csk="Big 12 Conference.015.001">Big 12</td>
As you can see they have the same class identifier. Is there a way I can have bs4 look only for the first of the two?
I hope my question makes sense, thanks in advance!
Haven't used BS4 or Python for a while, but if I remember correctly something like this should work to get all elements whose data-stat attribute is school_name (note the hyphen in the attribute name, as in the HTML):
results = soup.findAll("td", {"data-stat": "school_name"})
Or, if you want all elements that have a data-stat attribute and the value doesn't matter, use:
results = soup.findAll("td", {"data-stat": True})
You have a couple of options:
You can use soup.find_all and loop through the results.
Use a CSS selector that targets only the first match.
Inspect the element in your browser and copy its selector.
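To make those options concrete, here is a sketch on the two cells from the question (HTML inlined and wrapped in a minimal table so the snippet runs on its own):

```python
from bs4 import BeautifulSoup

# The two cells from the question, wrapped in a minimal table.
html = '''
<table id="current-poll"><tbody><tr>
<td class="left " data-stat="school_name" csk="Baylor.015">Baylor</td>
<td class="left " data-stat="conf_abbr" csk="Big 12 Conference.015.001">Big 12</td>
</tr></tbody></table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Option 1: filter on the data-stat attribute instead of the shared class.
schools = [td.get_text() for td in soup.find_all('td', {'data-stat': 'school_name'})]

# Option 2: a CSS selector that keeps only the first .left cell in each row.
firsts = [td.get_text() for td in soup.select('#current-poll tbody tr td.left:first-of-type')]

print(schools, firsts)  # ['Baylor'] ['Baylor']
```

Filtering on data-stat is the more robust choice here, since it names the data you want rather than its position.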

Retrieving a 'td' tag by searching for a 'th' tag under the same 'tr' row

I need a way to retrieve a specific 'td' tag with its text content under a specific 'th' tag belonging to the same 'tr' row. This is what the structure looks like:
<tr>...Not interested in this row...</tr>
<tr>...Not interested in this row...</tr>
<tr>
<th>Titletext</th>
<td class="rightalign right">64663438434</td>
</tr>
<tr>...Not interested in this row...</tr>
<tr>...Not interested in this row...</tr>
I want to search by the 'th' tag, and retrieve the number inside the 'td' tag under it. Any ideas?
Is this what you're looking for?
num = soup.find('td', class_='rightalign right')
num.text
output:
'64663438434'
You can probably use the re module.
import re
cells = re.findall(u"<th>Titletext</th>[^>]*>([^<]*)</td>", page)
print(cells)
BeautifulSoup is kind enough to search the required elements for you:
value = soup.find('th', text='Titletext').findNextSibling('td').text
You will get a string so consider to convert it to int...
If the line contains more than one TD tags and you do not want the first one, but the first one with a specific class, you can add that to the request:
value = soup.find('th', text='Titletext').findNextSibling(
    'td', {'class': "rightalign right"}).text
(Thanks to ArranDuff for noticing it)
Using BeautifulSoup you can iterate through all of the tr's and search for th's.
Then, for each th, you can use the find_next_sibling method to find the next tag element.
If this is the required td, extract the number.
For example
import bs4
html = '<tr>...Not interested in this row...</tr> \n <tr>...Not interested in this row...</tr>\n <tr> \n <th>Titletext</th> \n <td class="rightalign right">64663438434</td> \n </tr> \n <tr>...Not interested in this row...</tr> \n <tr>...Not interested in this row...</tr>'
bs = bs4.BeautifulSoup(html, 'html.parser')
for tr in bs.find_all('tr'):
    for th in tr.find_all('th'):
        td = th.find_next_sibling()
        if 'class="rightalign right' in str(td):
            print(td.text)
Output
64663438434
Personally, I would stick with BeautifulSoup rather than writing your own regexes as much as possible. The structure of HTML can be inconsistent, and BeautifulSoup hides a lot of the complexity and heavy lifting.

Select a html a tag with specified display content

I'm new to scrapy and have been struggling for this problem for hours.
I need to scrape a page whose source looks something like this:
<tr class="odd">
<td class="pfama_PF02816">Pfam</td>
<td>Alpha_kinase</td>
<td>1389</td>
<td>1590</td>
<td class="sh" style="display: none">21.30</td>
</tr>
I need to get the information in the tr.odd tag, if and only if the a tag's text is "Alpha_kinase".
I can get all of that content (including "Alpha_kinase", 1389, 1590 and many other values) and then post-process the output to keep "Alpha_kinase" only, but that approach would be fragile and ugly. Currently I have to do it this way:
positions = response.css('tr.odd td:not([class^="sh"]) td a::text').extract()
then do a for-loop to check.
Is there any conditional expression (like the :not above) that I can put in response.css to solve my problem?
Thanks in advance. Any advice will be highly appreciated!
You can use another selector, response.xpath, to select elements from the HTML and filter the text with the XPath contains() function.
>>> response.xpath("//tr[@class='odd']/td/a[contains(text(),'Alpha_kinase')]")
[<Selector xpath="//tr[@class='odd']/td/a[contains(text(),'Alpha_kinase')]" data='<a href="http://pfam.xfam.org/family/Alp'>]
I assume there are multiple such tr elements on the page. If so, I would probably do something like:
# get only rows containing 'Alpha_kinase' in link text
for row in response.xpath('//tr[#class="odd" and contains(./td/a/text(), "Alpha_kinase")]'):
# extract all the information
item['link'] = row.xpath('./td[2]/a/#href').extract_first()
...
yield item
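Outside of a scrapy response, the same XPath logic can be checked with plain lxml (which scrapy's selectors wrap). A sketch on the row from the question; note the <a> element and its href are assumptions, since the selectors above treat the family name as a link:

```python
from lxml import etree

# The row from the question, with an assumed <a> around the family name.
html = '''
<table><tr class="odd">
<td class="pfama_PF02816">Pfam</td>
<td><a href="http://pfam.xfam.org/family/Alpha_kinase">Alpha_kinase</a></td>
<td>1389</td>
<td>1590</td>
<td class="sh" style="display: none">21.30</td>
</tr></table>
'''

doc = etree.HTML(html)
rows = doc.xpath('//tr[@class="odd" and contains(./td/a/text(), "Alpha_kinase")]')

# Pull the start/end positions out of each matching row.
for row in rows:
    start = row.xpath('./td[3]/text()')[0]
    end = row.xpath('./td[4]/text()')[0]
    print(start, end)  # 1389 1590
```

The row-level predicate keeps the filtering inside the selector, so no Python-side for-loop check is needed.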

Iteratively reading a specific element from a <table> with Selenium for Python

I am trying to read in information from this table that changes periodically. The HTML looks like this:
<table class="the_table_im_reading">
<thead>...</thead>
<tbody>
<tr id="uc_6042339">
<td class="expansion">...</td>
<td>
<div id="card_6042339_68587" class="cb">
TEXT I NEED TO READ
</div>
</td>
<td>...</td>
more td's
</tr>
<tr id="uc_6194934">...</tr>
<td class="expansion">...</td>
similar as the first <tr id="uc...">
I was able to get to the table using:
table_xpath = '//*[@id="content-wrapper"]/div[5]/table'
table_element = driver.find_element_by_xpath(table_xpath)
And I am trying to read the TEXT I NEED TO READ part for each unique <tr id="uc_unique number">. The id=uc_unique number changes periodically, so I cannot use find element by id.
Is there a way to reach that element and read that specific text?
Looks like you can search via the anchor-element link (href-attribute), since I guess this will not change.
via xpath:
yourText = table_element.find_element_by_xpath(".//a[@href='/blahsomelink']").text
UPDATE
OP mentioned that his link is also changing (with each call?), which means that the first approach is not for him.
if you want the text of the first row-element you can try this:
yourText = table_element.find_element_by_xpath(".//tr[1]//a[@class='cl']").text
if you know for example that the link element is always in the second data-element of the first row and there is only one link-element, then you can do this:
yourText = table_element.find_element_by_xpath(".//tr[1]/td[2]//a").text
Unless you provide more detailed requirements as to what you are really searching for, this will have to suffice so far...
Another UPDATE
OP gave more info regarding his requirement:
I am trying to get the text in each row.
Given there is only one anchor-element with class cl in each tr element you can do the following:
elements = table_element.find_elements_by_xpath(".//tr//a[@class='cl']")
for element in elements:
    row_text = element.text
Now you can do whatever you need with all these texts...
It looks like you have a few options.
If all you want is the first A, it might be as simple as
table_element.find_element_by_css_selector("a.cl").text
or the little more specific
table_element.find_element_by_css_selector("div.cb > a.cl").text
If you want all the As, try the find_elements_* versions of the above.
I managed to find the elements I needed using .get_attribute("textContent") instead of .text, a tip from Get Text from Span returns empty string

Find all html elements whose contains a specific class

I want BeautifulSoup to find all elements in an HTML page that have a certain class. But they can also have extra classes. For example:
soup.findAll('tr', {'class': 'super_class1'})
This code only finds tr elements that have exactly super_class1. But I want it to find all tr elements containing this class, such as
<tr class='super_class1'>aaa</tr>
and
<tr class='super_class1 super_class2'>bbb</tr>
and
<tr class='super_class1 super_class15 super_class16'>ccc</tr>
This is a bug that has been fixed (https://bugs.launchpad.net/beautifulsoup/+bug/410304); the problem is basically that the soup doesn't recognize spaces in class names.
But if you have to use a version without the fix, the above link also provides a workaround:
soup.findAll(True, {'class': re.compile(r'\bsuper_class1\b')})
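For what it's worth, in current BeautifulSoup 4 the class_ argument already matches against each class token individually, so the regex workaround is only needed on old versions. A quick check on the rows from the question (wrapped in a table and given td cells so the snippet parses cleanly):

```python
from bs4 import BeautifulSoup

html = '''
<table>
<tr class="super_class1"><td>aaa</td></tr>
<tr class="super_class1 super_class2"><td>bbb</td></tr>
<tr class="super_class1 super_class15 super_class16"><td>ccc</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ matches if the element's class list *contains* the token,
# so rows with extra classes are found too.
rows = [tr.get_text() for tr in soup.find_all('tr', class_='super_class1')]
print(rows)  # ['aaa', 'bbb', 'ccc']
```

Note that token matching is exact, so super_class15 does not accidentally match a search for super_class1 on its own.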
