I have seen some webcasts and need help trying to do this:
I have been using lxml.html. Yahoo recently changed its web structure.
Target page:
http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true
In Chrome, using the inspector, I see the data at:
//*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table
followed by some more markup.
How do I get this data out into a list?
How do I change the stock from "LLY" to "MSFT"?
How do I switch between dates, and get all the expiration months?
I know you said you can't use lxml.html, but here is how to do it with that library anyway, because it is a very good library. I provide the code for completeness, since I don't use BeautifulSoup anymore -- it's unmaintained, slow, and has an ugly API.
The code below parses the page and writes the results to a CSV file.
import lxml.html
import csv

doc = lxml.html.parse('http://finance.yahoo.com/q/os?s=lly&m=2011-04-15')

# find the first table containing any tr with a td of class yfnc_tabledata1
table = doc.xpath("//table[tr/td[@class='yfnc_tabledata1']]")[0]

with open('results.csv', 'wb') as f:
    cf = csv.writer(f)
    # find all trs inside that table:
    for tr in table.xpath('./tr'):
        # add the text of all tds inside each tr to a list
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        # write the list to the csv file:
        cf.writerow(row)
That's it! lxml.html is so simple and nice!! Too bad you can't use it.
Here are some lines from the results.csv file that was generated:
LLY110416C00017500,N/A,0.00,17.05,18.45,0,0,17.50,LLY110416P00017500,0.01,0.00,N/A,0.03,0,182
LLY110416C00020000,15.70,0.00,14.55,15.85,0,0,20.00,LLY110416P00020000,0.06,0.00,N/A,0.03,0,439
LLY110416C00022500,N/A,0.00,12.15,12.80,0,0,22.50,LLY110416P00022500,0.01,0.00,N/A,0.03,2,50
Here is a simple example to extract all data from the stock tables:
import urllib
import lxml.html

html = urllib.urlopen('http://finance.yahoo.com/q/op?s=lly&m=2014-11-15').read()
doc = lxml.html.fromstring(html)

# scrape figures from each stock table
for table in doc.xpath('//table[@class="details-table quote-table Fz-m"]'):
    rows = []
    for tr in table.xpath('./tbody/tr'):
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        rows.append(row)
    print rows
Then, to extract data for different stocks and dates, you only need to change the URL. Here is MSFT for the previous day:
http://finance.yahoo.com/q/op?s=msft&m=2014-11-14
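If it helps, the same scrape can be wrapped in a small function so the symbol and date become parameters. This is only a sketch: the URL parameters and table class are taken from the code above, and whether Yahoo still serves this markup is an assumption.
import lxml.html

def scrape_options(symbol, date):
    # "s" and "m" follow the URL pattern used in the examples above (assumption)
    url = 'http://finance.yahoo.com/q/op?s=%s&m=%s' % (symbol.lower(), date)
    doc = lxml.html.parse(url)
    rows = []
    for tr in doc.xpath('//table[@class="details-table quote-table Fz-m"]/tbody/tr'):
        rows.append([td.text_content().strip() for td in tr.xpath('./td')])
    return rows

# e.g. the MSFT chain for the previous day
print(scrape_options('msft', '2014-11-14'))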
If you'd like raw JSON, try MSN:
http://www.msn.com/en-us/finance/stocks/optionsajax/126.1.UNH.NYS/
You can also specify an expiration date with ?date=11/14/2014:
http://www.msn.com/en-us/finance/stocks/optionsajax/126.1.UNH.NYS/?date=11/14/2014
If you prefer Yahoo's JSON:
http://finance.yahoo.com/q/op?s=LLY
But you have to extract it from the HTML:
import json
import re
import requests

resp = requests.get('http://finance.yahoo.com/q/op?s=LLY')
m = re.search(r'<script>.+({"applet_type":"td-applet-options-table".+);</script>', resp.text)
data = json.loads(m.group(1))
as_dicts = data['models']['applet_model']['data']['optionData']['_options'][0]['straddles']
The expiration dates are here:
data['models']['applet_model']['data']['optionData']['expirationDates']
Convert each ISO date to a Unix timestamp, then re-request the other expirations with the Unix timestamp:
http://finance.yahoo.com/q/op?s=LLY&date=1414713600
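Putting the last two steps together, a rough sketch of the loop (it assumes the expiration entries come back as ISO strings such as "2014-10-31"; the exact shape of Yahoo's JSON may differ):
import calendar
import json
import re
from datetime import datetime
import requests

resp = requests.get('http://finance.yahoo.com/q/op?s=LLY')
m = re.search(r'<script>.+({"applet_type":"td-applet-options-table".+);</script>', resp.text)
data = json.loads(m.group(1))

# assuming the expiration entries are ISO strings like "2014-10-31"
for iso in data['models']['applet_model']['data']['optionData']['expirationDates']:
    ts = calendar.timegm(datetime.strptime(iso, '%Y-%m-%d').utctimetuple())
    chain = requests.get('http://finance.yahoo.com/q/op?s=LLY&date=%d' % ts)
    # ...parse chain.text the same way as above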
Basing this answer on @hoju's:
import lxml.html
import calendar
from datetime import datetime

exDate = "2014-11-22"
symbol = "LLY"

dt = datetime.strptime(exDate, '%Y-%m-%d')
ym = calendar.timegm(dt.utctimetuple())

url = 'http://finance.yahoo.com/q/op?s=%s&date=%s' % (symbol, ym)
doc = lxml.html.parse(url)
table = doc.xpath('//table[@class="details-table quote-table Fz-m"]/tbody/tr')

rows = []
for tr in table:
    d = [td.text_content().strip().replace(',', '') for td in tr.xpath('./td')]
    rows.append(d)
print rows
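To get all the months from the original question, the same code can be looped over several expiration dates. This is a sketch; the dates below are placeholders and would normally come from the expirationDates list described in the previous answer.
import calendar
from datetime import datetime
import lxml.html

symbol = 'LLY'
# placeholder expirations; in practice pull these from the
# expirationDates list in the embedded JSON described above
expirations = ['2014-11-22', '2014-12-20', '2015-01-17']

all_rows = {}
for ex_date in expirations:
    ts = calendar.timegm(datetime.strptime(ex_date, '%Y-%m-%d').utctimetuple())
    doc = lxml.html.parse('http://finance.yahoo.com/q/op?s=%s&date=%s' % (symbol, ts))
    all_rows[ex_date] = [
        [td.text_content().strip() for td in tr.xpath('./td')]
        for tr in doc.xpath('//table[@class="details-table quote-table Fz-m"]/tbody/tr')
    ]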
Related
I am learning Python and practicing it by extracting data from a public site, but I ran into a problem and would appreciate your kind help.
Thanks in advance! I will check this thread daily and wait for your comments :)
Purpose:
Extract the columns and rows of all 65 pages, with their contents, into one CSV file using a single script.
The 65 page URLs follow this loop rule:
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65
Question 1:
When I run the single-page script below to extract one page of data into a CSV, I have to run it twice with different filenames before any data appears in the first file. For example, if I run it with test.csv, that file stays at 0 KB; after I change the filename to test2.csv and run the script again, the data shows up in test.csv, while test2.csv stays empty at 0 KB. Any idea why?
Here is the single-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs

url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')

filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))

divs = soup.find_all("div", class_="iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
Question 2:
I tried to iterate over the 65 page URLs to extract all the data into a CSV, but it doesn't work. Any idea how to fix it?
Here is the code for extracting all 65 pages:
import requests
import csv
from bs4 import BeautifulSoup as bs

url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"

def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1,65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_="iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                if data:
                    print("Inserting data: {}".format(','.join(data)))
                    writer.writerow(data)

if __name__ == '__main__':
    with open("test.csv","w",newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)
Just an alternative approach
Try to keep it simple and use pandas, because it will do all of these things for you under the hood:
Define a list (data) to keep your results.
Iterate over the URLs with pd.read_html.
Concat the data frames in data and write them with to_csv or to_excel.
read_html
Find the table that matches a string (match='预售信息查询:') and select it with [0], because read_html() always returns a list of tables.
Use a specific row as the header with header=2.
Drop the last row (navigation) and the last column (caused by the stray colspan) with .iloc[:-1, :-1].
Example
import pandas as pd

data = []

for pageNo in range(1,5):
    data.append(pd.read_html(f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1,:-1])

pd.concat(data).to_csv('test.csv', index=False)
Example (based on your code with function)
import pandas as pd

url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="

def get_data(url):
    data = []
    for pageNo in range(1,2):
        data.append(pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1,:-1])
    pd.concat(data).to_csv('test.csv', index=False)

if __name__ == '__main__':
    get_data(url)
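Regarding Question 1: one plausible cause is that the file handle opened inline via csv.writer(open(filename, 'w', newline='')) is never closed, so the buffered rows may only reach disk when the interpreter exits. A minimal sketch of the single-page scrape with an explicit with-block:
import csv
import requests
from bs4 import BeautifulSoup as bs

resp = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(resp.content, 'html.parser')

# leaving the with-block closes the file, which flushes the buffered rows to disk
with open("test.csv", "w", newline="") as f:
    csv_writer = csv.writer(f)
    for div in soup.find_all("div", class_="iiright"):
        for tr in div.find_all("tr")[1:]:
            data = [td.text.strip() for td in tr.find_all("td")]
            if data:
                csv_writer.writerow(data)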
I want to scrape the IRS prior forms site to gather data for practicing data mining. The site presents one big table spread across 101 pages.
Here's the link:
https://apps.irs.gov/app/picklist/list/priorFormPublication.html
[picture of the site]
My task:
Taking a list of tax form names (ex: "Form W-2", "Form 1095-C"), search the website and return some informational results. Specifically, you must return the "Product Number", the "Title", and the maximum and minimum years the form is available for download. The forms returned should be an exact match for the input (ex: "Form W-2" should not return "Form W-2 P", etc.). The results should be returned as JSON.
MY CODE SO FAR:
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup

url = "https://apps.irs.gov/app/picklist/list/priorFormPublication.html"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify())

forms_table = soup.find("table", class_="picklist-dataTable")
forms_table_data = forms_table.find_all("tr")  # contains 2 rows

headings = []
for tr in forms_table_data[0].find_all("th"):
    headings.append(tr.b.text.replace('\n', ' ').strip())
print(headings)
THIS IS WHERE I AM GETTING HORRIBLY STUCK:
data = {}
for table, heading in zip(forms_table_data, headings):
    t_headers = []
    for th in table.find_all("th"):
        t_headers.append(th.text.replace('\n', ' ').strip())
    table_data = []
    for tr in table.tbody.find_all("tr"):  # find all trs from the table's tbody
        t_row = {}
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)
    data[heading] = table_data
print(data)
I also seem to be missing how to incorporate the rest of the pages on the site.
Thanks for your patience!
As mentioned, the easiest way to get the table into a data frame is read_html(). Be aware that pandas reads all the tables from the site and puts them into a list of data frames; in your case you have to select it with [3].
Because your question is not that clear and is hard to read with all those images, you should improve it.
Example (Form W-2)
import pandas as pd

pd.read_html('https://apps.irs.gov/app/picklist/list/priorFormPublication.html?resultsPerPage=200&sortColumn=sortOrder&indexOfFirstRow=0&criteria=formNumber&value=Form+W-2&isDescending=false')[3]
Then you can filter and sort the data frame and export it as JSON, as sketched below.
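For the exact-match and min/max-year requirement, a rough sketch of that filtering step (the column names 'Product Number', 'Title' and 'Revision Date' are assumptions based on the page's visible headers, so check them against the actual frame):
import json
import pandas as pd

# column names below are assumptions based on the visible table headers
url = ('https://apps.irs.gov/app/picklist/list/priorFormPublication.html'
       '?resultsPerPage=200&sortColumn=sortOrder&indexOfFirstRow=0'
       '&criteria=formNumber&value=Form+W-2&isDescending=false')
df = pd.read_html(url)[3]

# keep exact matches only, so "Form W-2 P" and friends are dropped
w2 = df[df['Product Number'] == 'Form W-2']

result = {
    'form_number': 'Form W-2',
    'form_title': w2['Title'].iloc[0],
    'min_year': int(w2['Revision Date'].min()),
    'max_year': int(w2['Revision Date'].max()),
}
print(json.dumps(result))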
Thanks for taking a look at my question. I have created a script using BeautifulSoup and Pandas which scrapes data on projections from the Federal Reserve's website. The projections come out once a quarter (~ 3 mos.). I'd like to write a script which creates a daily time series and checks the Federal Reserve website once a day, and if there has been a new projection posted, the script would add that to the time series. If there has not been an update, then the script would just append the time series with the last valid, updated projection.
From my initial digging, it seems there are outside sources I can use to "trigger" the script daily, but I would prefer to keep everything purely in Python.
The code I have written to accomplish the scraping is as follows:
from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)

# Create the tuple of links for projection pages
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# Create a tuple to store the projections
decfcasts = []
for i in projections:
    url = "https://www.federalreserve.gov{}".format(i)
    file = wget.download(url)
    df_list = pd.read_html(file)
    fcast = df_list[-1].iloc[:, 0:2]
    fcast.columns = ['Target', 'Votes']
    fcast.fillna(0, inplace=True)
    decfcasts.append(fcast)
So far, the code I have written places everything in a tuple, but there is no time/date index for the data. I've been thinking of pseudo-code to write, and my guess is it will look something like
Create daily time series object

for each day in time series:
    if day in time series == day in link:
        run webscraper
    otherwise, append time series with last available observation
At least, this is what I have in mind. The final time series should probably end up looking fairly "clumpy" in the sense that there will be a lot of days with the same observation, and then when a new projection comes out, there will be a "jump" and then a lot more repetition until the next projection comes out.
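To make the "clumpy" idea concrete, here is a rough sketch of the kind of forward-filled daily series I have in mind (the quarterly dates and values are made up):
import pandas as pd

# made-up quarterly projection dates and values
projections = pd.Series(
    [2.0, 2.25, 2.5],
    index=pd.to_datetime(['2017-03-15', '2017-06-14', '2017-09-20']),
)

# reindex to a daily calendar and carry the last projection forward
daily = projections.reindex(
    pd.date_range(projections.index.min(), '2017-12-13', freq='D')
).ffill()
print(daily.tail())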
Obviously, any help is tremendously appreciated. Thanks ahead of time, either way!
I've edited the code for you. It now gets the date from the URL and stores it as a Period in the dataframe. A projection is processed and appended only when its date is not already present in the dataframe restored from the pickle.
from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)

# Create the tuple of links for projection pages
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# past results from pickle, when no pickle init empty dataframe
try:
    decfcasts = pd.read_pickle('decfcasts.pkl')
except FileNotFoundError:
    decfcasts = pd.DataFrame(columns=['target', 'votes', 'date'])

for i in projections:
    # parse date from url
    date = pd.Period(''.join(re.findall(r'\d+', i)), 'D')

    # process projection if it wasn't included in data from pickle
    if date not in decfcasts['date'].values:
        url = "https://www.federalreserve.gov{}".format(i)
        file = wget.download(url)
        df_list = pd.read_html(file)
        fcast = df_list[-1].iloc[:, 0:2]
        fcast.columns = ['target', 'votes']
        fcast.fillna(0, inplace=True)

        # set date time
        fcast.insert(2, 'date', date)

        decfcasts = decfcasts.append(fcast)

# save to pickle
pd.to_pickle(decfcasts, 'decfcasts.pkl')
Currently, my code parses the page at the link and prints all of the information from the website. I only want to print one specific line from the page. How can I go about doing that?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen("Link goes here").read()
soup = BeautifulSoup(r, "html.parser")
# This is what I want to change. I currently have it printing everything.
# I just want a specific line from the website
print (soup.prettify())
li = soup.prettify().split('\n')
print(str(li[line_number - 1]))
Don't use prettify output to try to parse tds; select the tag specifically. If an attribute is unique, use that; if the class name is unique, just use that:
td = soup.select_one("td.content")
td = soup.select_one("td[colspan=3]")
If it was the fourth td:
td = soup.select_one("td:nth-of-type(4)")
If it is in a specific table, then select the table and find the td inside it. Trying to split the HTML into lines and indexing into them is actually worse than using a regex to parse HTML.
You can get the specific td using the text of the bold tag preceding it, i.e. Department of Finance Building Classification:
In [19]: from bs4 import BeautifulSoup
In [20]: import urllib.request
In [21]: url = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=1&houseno=1&street=park+ave&go2=+GO+&requestid=0"
In [22]: r = urllib.request.urlopen(url).read()
In [23]: soup = BeautifulSoup(r, "html.parser")
In [24]: print(soup.find("b",text="Department of Finance Building Classification:").find_next("td").text)
O6-OFFICE BUILDINGS
Pick the nth table and row:
In [25]: print(soup.select_one("table:nth-of-type(8) tr:nth-of-type(5) td[colspan=3]").text)
O6-OFFICE BUILDINGS
I am trying to read in html websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  # build url

text_soup = BeautifulSoup(urlopen(full_url).read())  # read in

text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
It's not good practice to use a regex for parsing HTML. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and "EPS (Basic)" text in it, then iterate over the following siblings with the valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read())  # read in

titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
Hope that helps.
I would take a very different approach. We use lxml for scraping HTML pages; one of the reasons we switched was that BeautifulSoup was not being maintained for a while -- or I should say updated.
In my test I ran the following:
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically but I will leave that for you.
tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
Now I noticed that the first row has the column headings, so I want to separate out all of the rows:
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
Now let's get the column headings:
column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally, we can map the column headings to the row labels and cell values:
my_results = []
for row in table_rows[1:]:
    cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
Now, to access the results:
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do; for example, I did not test for squareness (that the number of cells in each row is equal).
Finally, I am a novice and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this does work and it gets you everything from the table in a nicely structured manner.
I should add that every row from the table is available, in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row, and so on.
When I need a new build of lxml I visit a page maintained by a really nice guy at UC-IRVINE.
I hope this helps.
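For reference, here is roughly what a more direct XPath version might look like; it is untested against the current page, so treat it as a sketch.
import requests
from lxml import html

tree = html.fromstring(requests.get(
    'http://www.marketwatch.com/investing/stock/goog/financials').content)

# find the row in the last table whose label cell contains "EPS (Basic)"
row = tree.xpath('(//table)[last()]//tr[td[contains(., "EPS (Basic)")]]')[0]
eps = [td.text_content().strip() for td in row.xpath('./td')[1:]]
print(eps)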
from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd

url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())

table = soup.find('table', {'data-ajax-content': 'true'})

data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])

df = pd.DataFrame(data)
print df

dictframe = df.to_dict()
print dictframe
The above code gives you a DataFrame from the webpage and then uses it to create a Python dictionary.
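If pandas is already in the mix, read_html can probably pull the same table directly; the attrs filter below mirrors the data-ajax-content attribute used above. Treat this as a sketch, since the page markup may have changed.
import pandas as pd

# read_html can select the same table via its data-ajax-content attribute
dfs = pd.read_html(
    'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet',
    attrs={'data-ajax-content': 'true'},
)
df = dfs[0]
print(df)
print(df.to_dict())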