Beautiful Soup to scrape data - python

I'm trying to scrape the EPS Estimates and EPS Earnings History tables (the 1st and 3rd tables on the page) from Yahoo Finance into an existing CSV file, using BeautifulSoup. https://uk.finance.yahoo.com/quote/MSFT/analysis?p=MSFT
I have made a start but am struggling to pull the exact data I need; I am guessing I will need a for loop across the rows and td tags.
from bs4 import BeautifulSoup
from requests import get

index = 'MSFT'  # ticker symbol
url = 'https://uk.finance.yahoo.com/quote/' + index + '/analysis?p=' + index
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
EP = soup.find('table', attrs={'class': "W(100%)"})
print(EP)
This appears to be getting only the first table, and I am not sure how to write the loop to get the appropriate data. Looking at the HTML, both the first and third tables have the same class name, so I can't use that alone to reach the right table.
Another idea I had, is searching for all tables on the page and putting them into a list. I could then select the correct index, but I'm not sure how I would do that in code.

Replace soup.find with soup.find_all(). It returns a list of all the matching tables, which you can then iterate over.
EPs = soup.find_all('table', attrs={'class':"W(100%)"})
for EP in EPs:
    ...
Your first and third tables would be EPs[0] and EPs[2] if that is what you are looking for.
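As a minimal sketch (assuming each row keeps its figures in th/td cells and that the page still serves these tables as static HTML; check the live markup), pulling the text out of the first and third tables could look like this:
for table in (EPs[0], EPs[2]):  # 1st and 3rd tables on the page
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        if cells:
            print(cells)  # each row as a list of cell strings
From there, each cells list can be appended to your existing CSV with the csv module.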

Related

Scraping Webpage with Javascript Elements

To preface: the website I've been trying to scrape seems to use JavaScript (I'm unsure about the jargon relating to web development), and I've been having varying success trying to scrape different tables on different pages.
For instance, on this page: http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic I was easily able to 'inspect element', go to the Network tab, find the correct 'Name' of the script, and then find the Request URL I needed to get the table I wanted. The code I used for this was:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://www.minorleaguesplits.com/tennisabstract/cgi-bin/frags/NovakDjokovic.js'
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find('table', id='tour-years', attrs={'class': 'tablesorter'})
dfs = pd.read_html(str(table))
df = pd.concat(dfs)
However, now I'm looking at a different page on the same site, http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html, and I'm unable to find the Request URL that will let me get the table I want. I repeat the same process as above, but there's no .js script under the Network tab that contains the table. I do see the table when I inspect the HTML elements, but of course I can't get it without the correct URL.
So my question would be, how can I get the table from this page http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html ?
TIA!
Looking at the source code of the HTML page, you can see that all the data is already loaded in a script tag. The only thing you need to do is extract the variable values and load them into BeautifulSoup.
The following code pulls all the variable declarations out of the script tag:
import re
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html")
soup = BeautifulSoup(res.text, "lxml")
script = soup.find("script", attrs={"language": "JavaScript"}).text
# Everything before the $(document) handler is the block of var declarations
var_only = script[:script.index("$(document)")].strip()
Next, you can use a regex to get the variable values: https://regex101.com/r/7cE85A/1
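As a rough sketch of that step (assuming each declaration has the form var name = value; and ends at a line boundary, which is what the linked regex101 pattern targets):
pairs = re.findall(r"var\s+(\w+)\s*=\s*(.+?);\s*$", var_only, flags=re.M)
variables = dict(pairs)
print(variables.keys())  # names of the JavaScript variables found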

Python (Selenium/BeautifulSoup) Search Result Dynamic URL

Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # or whichever driver you use
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration, this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search-results-related scripts referenced in the page_source; is there a way to determine which is relevant?
I am able to inspect the information I need in the browser's element inspector (a screenshot accompanied the original answer): the data sits in a <search-result> element.
Once you have your soup variable with the HTML, follow the code below:
import json
data = soup.find('search-result')['data']
print(data)
Output:
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Now treat each value like a dict. Next:
info = json.loads(data)
print(info['first_name'], info['last_name'])
# This prints the first and last name, but you can get other fields the same
# way: just use the key, like 'date_of_birth' or 'siteId'. You can also
# assign them to variables.
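Since the final page is assembled by scripts after the redirects, a hedged approach is to have Selenium wait for the element before reading page_source. The <search-result> tag name comes from the answer above, but the 10-second timeout is an arbitrary choice:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the <search-result> element has been rendered, then parse
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'search-result')))
soup = BeautifulSoup(driver.page_source, 'html.parser')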

How to select a specific table on a webpage with Python

I am a newbie in programming and Python, but I want to parse HTML in my Python script.
Here is the webpage:
http://stock.finance.sina.com.cn/hkstock/finance/00759.html
Question 1:
This page shows the financial information for a particular share. The four tables are:
Financial Summary,
Balance Sheet,
Cash Flow, and
Income Statement.
I want to extract the information in tables 3 and 4. Here is my code:
import urllib
from bs4 import BeautifulSoup
url = 'http://stock.finance.sina.com.cn/hkstock/finance/00759.html'
html = urllib.urlopen(url).read()  # .read() reads the whole response into a string
soup = BeautifulSoup(html, "lxml")
table = soup.find("table", { "class" : "tab05" })
for row in table.findAll("tr"):
    print row.findAll("td")
But this code can only get the first table's information. How can I change it to get the third and fourth tables? I found that the four tables do not have unique ids or class names, so I don't know how to locate them.
Question 2:
Also, this is a Simplified Chinese webpage; how do I keep the original text in the output?
Question 3:
On the upper right corner of each table, there is a drop-down menu for selecting the appropriate period, namely: "All", "Whole Year", "Half Year", "First Quarter" and "Third Quarter". Is urllib able to change this drop-down menu?
Thank you very much.
According to the website, all four tables have the class name "tab05".
Therefore, all you have to do is change the .find call to .findAll, and then all four tables can be accessed.
import urllib
from bs4 import BeautifulSoup
url = 'http://stock.finance.sina.com.cn/hkstock/finance/00759.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
tables = soup.findAll("table", { "class" : "tab05" })
print len(tables)  # 4
for table in tables:
    for row in table.findAll("tr"):
        for col in row.findAll("td"):
            print col.getText()
As for the Simplified Chinese encoding, print col.getText() will show the right characters on the terminal. If you want to write them to a file, you have to encode the string as gb2312:
f.write(col.getText().encode('gb2312'))
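Alternatively, here is a small sketch of the file-writing step using io.open, which encodes on write and works on Python 2 as well ('output.txt' is a placeholder filename):
import io

# io.open encodes the unicode text on write; 'output.txt' is a placeholder
with io.open('output.txt', 'w', encoding='gb2312') as f:
    for table in tables:
        for row in table.findAll("tr"):
            for col in row.findAll("td"):
                f.write(col.getText() + u'\n')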
For the third question, since the data are rendered by a JavaScript function written in datatable.js, I don't think it is possible to get all of them with urllib alone. Better to check out another library, such as a browser-automation tool.
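For example, Selenium can drive a real browser and change the drop-down before the page source is read. This is only a sketch under assumptions: that the period picker is a standard <select> element (inspect the real page for its actual locator), and Chrome is just one driver choice:
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('http://stock.finance.sina.com.cn/hkstock/finance/00759.html')
# Assumed: the period picker is a plain <select>; choose the 2nd option
period = Select(driver.find_element_by_tag_name('select'))
period.select_by_index(1)
html = driver.page_source  # now reflects the chosen period
driver.quit()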
Thanks for your reply.
I may have misunderstood your meaning. I rewrote the code as follows:
tables = soup.findAll("table", { "class" : "tab05" })
print len(tables)
for row in tables[0].findAll("tr"):
    for col in row.findAll("td"):
        print col.getText()
The result of "len(tables)" is 1. Only the first table can be accessed.
Also, I found that with

for row in tables[0].findAll("tr"):
    for col in row.findAll("td"):

I cannot get all the information in that table. The last figure this code prints is "-45.7852", which is only about halfway through the table.

How does table parsing work in Python? Is there an easy way other than Beautiful Soup?

I am trying to understand how one can use Beautiful Soup to extract the href links for the contents under a particular column in a table on a webpage. For example, consider the link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.
On this page, the table with class wikitable has a "Title" column. I need to extract the href links behind each of the values in that column and put them in an Excel sheet. What would be the best way to do this? I am having a little difficulty understanding the Beautiful Soup table-parsing documentation.
You don't really have to literally navigate the tree; you can simply try to see what identifies those lines.
In this example, the URLs you are looking for reside in a table with class="wikitable"; within that table, they sit in td tags with align=center. Now we have a reasonably unique identification for our links, and we can start extracting them.
However, you should take into consideration that multiple tables with class="wikitable" and td tags with align=center may exist; if you want only the first or second table, you will have to add extra filters.
The code should look something like this for extracting all links from those tables:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer
content = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class":"wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
One more thing to note here is the use of SoupStrainer: it specifies a filter for the content you want to process, which helps speed up parsing. Try removing the parse_only argument from this line:
soup = BeautifulSoup(content, parse_only=filter_tag)
and notice the difference. (I noticed it because my PC is not that powerful.)
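To cover the "put them in an Excel sheet" part of the question, a simple sketch is to write the collected hrefs to a CSV file, which Excel opens directly ('links.csv' is a placeholder name):
import csv

with open('links.csv', 'wb') as f:  # 'wb' because this is Python 2's csv module
    writer = csv.writer(f)
    writer.writerow(['href'])
    for href in links:
        writer.writerow([href])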

How to extract table information using BeautifulSoup?

I am trying to scrape information from these kinds of pages.
I need the information contained under Internship, Residency, and Fellowship. I can extract values from tables, but in this case I could not decide which table to use, because the heading (like Internship) is plain text inside a div tag outside the table, and the table whose values I need comes right after it. I have many pages of this kind, and not every page has all of these values; on some pages Residency may not be present at all, which reduces the total number of tables on the page. One example is this page, where Internship is not present at all.
The main problem I am facing is that all the tables have the same attribute values, so I cannot decide which table to use on different pages. If any value of interest is not present on a page, I have to return an empty string for that value.
I am using BeautifulSoup in Python. Can someone point out how I could proceed in extracting those values?
It looks like the ids for the headings and data each have a unique value and standard suffixes. You can use that to search for the appropriate values. Here's my solution:
from BeautifulSoup import BeautifulSoup
# Insert whatever networking stuff you're doing here. I'm going to assume
# that you've already downloaded the page and assigned it to a variable
# named 'html'
soup = BeautifulSoup(html)
headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    x = soup.find('span', text=heading)
    if x:
        span_id = x.parent['id']
        table_id = span_id.replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        values.append(soup.find('td', attrs={'id': table_id}).text)
    else:
        values.append('')
print zip(headings, values)
