I want to scrape the IRS prior-year forms site to gather data for a data mining study. The site presents one large table spread over 101 pages.
Here's the link:
https://apps.irs.gov/app/picklist/list/priorFormPublication.html
My task:
Taking a list of tax form names (ex: "Form W-2", "Form 1095-C"), search the website
and return some informational results. Specifically, you must return the "Product
Number", the "Title", and the maximum and minimum years the form is available for
download. The forms returned should be an exact match for the input (ex: "Form W-2"
should not return "Form W-2 P", etc.) The results should be returned as json.
MY CODE SO FAR:
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup
url="https://apps.irs.gov/app/picklist/list/priorFormPublication.html"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify())
forms_table = soup.find("table", class_= "picklist-dataTable")
forms_table_data = forms_table.find_all("tr") # contains 2 rows
headings = []
for tr in forms_table_data[0].find_all("th"):
    headings.append(tr.b.text.replace('\n', ' ').strip())
print(headings)
THIS IS WHERE I AM GETTING HORRIBLY STUCK:
data = {}
for table, heading in zip(forms_table_data, headings):
    t_headers = []
    for th in table.find_all("th"):
        t_headers.append(th.text.replace('\n', ' ').strip())
    table_data = []
    for tr in table.tbody.find_all("tr"): # find all tr's from table's tbody
        t_row = {}
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)
    data[heading] = table_data
print(data)
I also seem to be missing how to incorporate the rest of the pages on the site.
Thanks for your patience!
As mentioned, the easiest way to get the table into a data frame is read_html(). Be aware that pandas reads all the tables on the page and puts them in a list of data frames, so in your case you have to pick the right one with [3].
Your question is not that clear and is hard to read with all those images, so you should improve it.
Example (Form W-2)
import pandas as pd
pd.read_html('https://apps.irs.gov/app/picklist/list/priorFormPublication.html?resultsPerPage=200&sortColumn=sortOrder&indexOfFirstRow=0&criteria=formNumber&value=Form+W-2&isDescending=false')[3]
Then you can filter and sort the data frame and export it as JSON.
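The filter-and-export step could look roughly like the sketch below. It runs on a small hand-made data frame instead of the live site. The column names "Product Number", "Title" and "Revision Date", the sample rows, the output keys and the summarize helper are all assumptions for illustration, so check them against what read_html() actually returns.

```python
import json
import pandas as pd

# Hypothetical sample rows standing in for the table returned by read_html();
# the column names are assumed, not confirmed against the site.
df = pd.DataFrame(
    {
        "Product Number": ["Form W-2", "Form W-2", "Form W-2 P", "Form W-2"],
        "Title": [
            "Wage and Tax Statement",
            "Wage and Tax Statement",
            "Some other form",
            "Wage and Tax Statement",
        ],
        "Revision Date": [2021, 1954, 1990, 2003],
    }
)

def summarize(frame, form_name):
    """Return number, title and min/max year for an exact form-name match."""
    # Exact comparison, so "Form W-2 P" is not returned for "Form W-2"
    exact = frame[frame["Product Number"].str.strip() == form_name]
    if exact.empty:
        return None
    return {
        "form_number": form_name,
        "form_title": exact["Title"].iloc[0],
        "min_year": int(exact["Revision Date"].min()),
        "max_year": int(exact["Revision Date"].max()),
    }

result = summarize(df, "Form W-2")
print(json.dumps(result, indent=2))
```

Comparing with == rather than str.contains() is what keeps near matches such as "Form W-2 P" out of the result.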
I webscraped this webpage.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({ soup.select('main .section p:not([class])')})
print(data)
df = pd.DataFrame(data)
# results (it may not be the same text)
[... <p><strong>Duisenberg:</strong> My answer is, well, in the first place when something is before the courts you do not comment. I don't comment and particularly not when it concerns such an esteemed colleague of mine. So, on the hypothetical question whether other people would be eligible for the job, I think it is wise not to go into that either. </p>]
The problem is that when I turn data into a dataframe, it remains in a list format, which is difficult to handle. I would like it to be saved as a single object without losing its markup (</p>, </strong>).
If I do the following, it loses the division into paragraphs and the bold text that I will need for later manipulation.
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({
    'text':' '.join([p.text for p in soup.select('main .section p:not([class])')])
})
df = pd.DataFrame(data)
# with this however I lose the breakdown in paragraphs, bold characters etc. I'd like to keep them in the text.
Can anyone help me with this?
Thanks!
Not sure if I understand it correctly, but if you would like to convert the result set to text while keeping the tags, you can do it like this:
''.join([str(e) for e in soup.select('main .section p:not([class])')])
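To see the difference between str(e) and e.text without hitting the network, here is a minimal sketch on a made-up HTML snippet. The snippet and the variable names are hypothetical; only the str() versus .text behaviour is the point.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the paragraphs of the press conference page
html = (
    "<div><p><strong>Question:</strong> First paragraph.</p>"
    "<p>Second paragraph.</p></div>"
)
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# str(tag) serializes the tag including its markup
with_markup = "".join(str(p) for p in paragraphs)
# tag.text returns only the text content, so <p> and <strong> are lost
text_only = " ".join(p.text for p in paragraphs)

print(with_markup)
print(text_only)
```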
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)
data = []
u = soup.select('div.title > a')
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{u[0]['href']}").content)
data.append({'text':''.join([str(e) for e in soup.select('main .section p:not([class])')])})
pd.DataFrame(data)
Output
text
<p>Good afternoon, the Vice-President and I welcome you to our press conference. </p><p id="_Hlk93669934">The euro area economy is continuing to r...
I am learning Python and practicing by extracting data from a public site,
but I ran into a problem and would appreciate your help.
Thanks in advance! I will check this thread daily for your comments :)
Purpose:
Extract the columns and rows of all 65 pages, with their contents, into one CSV file in a single script.
URL pattern for the 65 pages:
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65
Question1:
When I run the one-page script below to extract one page of data into a CSV file, I have to run it twice with different filenames before the data shows up in the file from the first run.
For example, if I run it with test.csv, the file stays at 0 KB. After I change the filename to test2.csv and run the script again, the data appears in test.csv, but test2.csv stays empty at 0 KB. Any idea why?
Here is the one-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')
filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))
divs = soup.find_all("div", class_ = "iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
Question2:
I have a problem iterating over the 65 page URLs to extract the data into a CSV file.
It doesn't work. Any idea how to fix it?
Here is the extraction code for the 65 pages:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"
def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1,65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_ = "iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                if data:
                    print("Inserting data: {}".format(','.join(data)))
                    writer.writerow(data)
if __name__ == '__main__':
    with open("test.csv","w",newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)
Just an alternative approach
Try to keep it simple and perhaps use pandas, because it will do all these things for you under the hood:
define a list (data) to keep your results
iterate over the URLs with pd.read_html
concat the data frames in data and write them out with to_csv or to_excel
read_html
find the table that matches a string with match='预售信息查询:' and select it with [0], because read_html() always gives you a list of tables
take a specific row as the header with header=2
get rid of the last row (the navigation) and the last column (caused by a wrong colspan) with .iloc[:-1,:-1]
Example
import pandas as pd
data = []
for pageNo in range(1,5):
    data.append(pd.read_html(f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}', header =2, match='预售信息查询:')[0].iloc[:-1,:-1])
pd.concat(data).to_csv('test.csv', index=False)
Example (based on your code with function)
import pandas as pd
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="
def get_data(url):
    data = []
    for pageNo in range(1,2):
        data.append(pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1,:-1])
    pd.concat(data).to_csv('test.csv', index=False)
if __name__ == '__main__':
    get_data(url)
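On Question 1, which the pandas examples do not address: a likely cause (an assumption, since it depends on how the script is run) is that csv.writer(open(...)) never closes the file. Buffered rows are then only written once the file object is released, for example when a second run in the same interactive session rebinds the variable. A minimal sketch with a with-block follows. The sample rows and the temporary path are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the cells scraped from one page
rows = [["A-001", "Project One"], ["A-002", "Project Two"]]

# Temporary location so the sketch does not overwrite a real test.csv
path = os.path.join(tempfile.mkdtemp(), "test.csv")

# Leaving the with-block closes the file, which flushes buffered rows to disk
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

# Read the file back to confirm the rows are really on disk
with open(path, newline="", encoding="utf-8") as f:
    written = list(csv.reader(f))
print(written)

# range() excludes its end value, so pages 1 to 65 need range(1, 66)
page_numbers = list(range(1, 66))
print(page_numbers[0], page_numbers[-1])
```

The same off-by-one applies to the loop in Question 2, where range(1,65) stops at page 64.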
I am relatively new to programming and completely new to Stack Overflow. I thought a good way to learn would be a Python and Excel based project, but I am stuck. My plan was to scrape a website of addresses using Beautiful Soup, look up the Zillow value estimates for those addresses, and populate them into tabular form in Excel. I am unable to figure out how to get the addresses (the HTML on the site I am trying to scrape seems pretty messy), but I was able to pull Google address links from the site. Sorry if this is a very basic question; any advice would help:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import pandas as pd
req = Request("http://www.tjsc.com/Sales/TodaySales")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
count = 0
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
    count = count +1
print(links)
print("count is", count)
po = links
pd.DataFrame(po).to_excel('todaysale.xlsx', header=False, index=False)
You are on the right track. Instead of 'a', you need to use the HTML tag 'td' for the rows and 'th' for the column names. Here is one way to implement it. The list_slice function groups every 14 elements into one row, since the original table has 14 columns.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = "http://www.tjsc.com/Sales/TodaySales"
r = requests.get(url, verify=False)
text = r.text
soup = bs(text, 'lxml')
# Get column headers from the html file
header = []
for c_name in soup.findAll('th'):
    header.append(c_name)
# clean up the extracted header content
header = [h.contents[0].strip() for h in header]
# get each row of the table
row = []
for link in soup.find_all('td'):
    row.append(link.get_text().strip())
def list_slice(my_list, step):
    """This function takes any list and divides it into chunks of size "step"
    """
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]
# creating the final dataframe
df = pd.DataFrame(list_slice(row, 14), columns=header[:14])
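To see what list_slice does with the flat list of cell texts, here is a small sketch on made-up data with 3 columns instead of 14. The sample values are hypothetical. Note that a trailing incomplete chunk is kept as a short row, so it may be worth checking that every chunk has the expected length before building the data frame.

```python
def list_slice(my_list, step):
    """Split my_list into consecutive chunks of length step."""
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]

# Hypothetical flat list of cell texts, as collected from the td tags
cells = ["a1", "b1", "c1", "a2", "b2", "c2", "a3"]

chunks = list_slice(cells, 3)
# Keep only complete rows; the last chunk here holds a single element
complete_rows = [chunk for chunk in chunks if len(chunk) == 3]

print(chunks)
print(complete_rows)
```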
I am trying to read in html websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup
ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials' #build url
text_soup = BeautifulSoup(urlopen(full_url).read()) #read in
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
It's not a good practice to use regex for parsing html. Use BeautifulSoup parser: find the cell with rowTitle class and EPS (Basic) text in it, then iterate over next siblings with valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
Hope that helps.
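The code above targets Python 2 and BeautifulSoup 3. A rough equivalent with bs4 on Python 3 might look like the sketch below. It runs on a hand-written HTML fragment rather than the live page: the class names rowTitle and valueCell come from the answer, while the fragment itself (including the empty miniGraphCell cell) is a hypothetical stand-in.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the row described in the answer
html = """
<table>
  <tr>
    <td class="rowTitle">EPS (Basic)</td>
    <td class="valueCell">13.46</td>
    <td class="valueCell">20.62</td>
    <td class="valueCell">26.69</td>
    <td class="valueCell">30.17</td>
    <td class="valueCell">32.81</td>
    <td class="miniGraphCell"></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

eps = []
for title in soup.find_all("td", class_="rowTitle"):
    if "EPS (Basic)" in title.get_text():
        # find_next_siblings is the bs4 name for findNextSiblings
        siblings = title.find_next_siblings("td", class_="valueCell")
        eps = [float(td.get_text()) for td in siblings if td.get_text().strip()]

print(eps)
```

Converting to float here gives the list of numbers the question asked for instead of a list of strings.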
I would take a very different approach. We use lxml for scraping HTML pages.
One of the reasons we switched was that BeautifulSoup was not being maintained (or, I should say, updated) for a while.
In my test I ran the following
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically but I will leave that for you.
tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
Now I noticed that the first row has the column headings, so I want to separate all of the rows:
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
Now let's get the column headings:
column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally we can map the column headings to the row labels and cell values
my_results = []
for row in table_rows[1:]:
    cell_content = [ e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
Now to access the results:
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do, for example I did not test for squareness (number of cells in each row is equal).
Finally, I am a novice, and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this does work and it gets you everything from the table in a nicely structured manner.
I should add that every row from the table is available, they are in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row etc.
When I need a new build of lxml I visit a page maintained by a really nice guy at UC-IRVINE
I hope this helps
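The more direct XPath route mentioned above could look roughly like this. It is only a sketch on a hand-written table: the layout (a heading row of th cells, then rows whose first td is the label) is assumed from the description in this answer, and the second row is invented filler.

```python
from lxml import html

# Hypothetical table; only the "EPS (Basic)" values come from the thread
page = """
<table>
  <tr><th>Item</th><th>2011</th><th>2012</th></tr>
  <tr><td>EPS (Basic)</td><td>30.17</td><td>32.81</td></tr>
  <tr><td>Other row</td><td>1.00</td><td>2.00</td></tr>
</table>
"""
tree = html.fromstring(page)

# Column headings, skipping the first one (the label column)
headings = [th.text_content().strip() for th in tree.xpath("//tr[1]/th")][1:]

# Select the value cells of the row whose first cell is exactly "EPS (Basic)"
cells = tree.xpath('//tr[td[1][normalize-space()="EPS (Basic)"]]/td[position() > 1]')
values = [td.text_content().strip() for td in cells]

eps_by_year = dict(zip(headings, values))
print(eps_by_year)
```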
from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd
url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())
table = soup.find('table', {'data-ajax-content' : 'true'})
data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])
df = pd.DataFrame(data)
print df
dictframe = df.to_dict()
print dictframe
The above code will give you a DataFrame from the webpage and then uses that to create a python dictionary.
I have seen some webcasts and need help in trying to do this:
I have been using lxml.html. Yahoo recently changed the web structure.
target page;
http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true
In Chrome, using the inspector, I see the data in
//*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table
then some more code
How do I get this data out into a list?
How do I change to another stock, for example from "LLY" to "MSFT"?
How do I switch between dates and get all months?
I know you said you can't use lxml.html, but here is how to do it with that library, because it is a very good library. I provide the code for completeness, since I don't use BeautifulSoup anymore: it's unmaintained, slow and has an ugly API.
The code below parses the page and writes the results in a csv file.
import lxml.html
import csv
doc = lxml.html.parse('http://finance.yahoo.com/q/os?s=lly&m=2011-04-15')
# find the first table containing any tr with a td with class yfnc_tabledata1
table = doc.xpath("//table[tr/td[@class='yfnc_tabledata1']]")[0]
# 'wb' is the Python 2 idiom for csv files; on Python 3 use open('results.csv', 'w', newline='')
with open('results.csv', 'wb') as f:
    cf = csv.writer(f)
    # find all trs inside that table:
    for tr in table.xpath('./tr'):
        # add the text of all tds inside each tr to a list
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        # write the list to the csv file:
        cf.writerow(row)
That's it! lxml.html is so simple and nice!! Too bad you can't use it.
Here's some lines from the results.csv file that was generated:
LLY110416C00017500,N/A,0.00,17.05,18.45,0,0,17.50,LLY110416P00017500,0.01,0.00,N/A,0.03,0,182
LLY110416C00020000,15.70,0.00,14.55,15.85,0,0,20.00,LLY110416P00020000,0.06,0.00,N/A,0.03,0,439
LLY110416C00022500,N/A,0.00,12.15,12.80,0,0,22.50,LLY110416P00022500,0.01,0.00,N/A,0.03,2,50
Here is a simple example to extract all data from the stock tables:
import urllib
import lxml.html
html = urllib.urlopen('http://finance.yahoo.com/q/op?s=lly&m=2014-11-15').read()
doc = lxml.html.fromstring(html)
# scrape figures from each stock table
for table in doc.xpath('//table[@class="details-table quote-table Fz-m"]'):
    rows = []
    for tr in table.xpath('./tbody/tr'):
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        rows.append(row)
    print rows
Then to extract for different stocks and dates you need to change the URL. Here is Msft for the previous day:
http://finance.yahoo.com/q/op?s=msft&m=2014-11-14
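Building those URLs for several symbols could be sketched as follows. The base URL and the s and m parameters are taken from the examples in this answer and may no longer be served by Yahoo. The options_url helper is a hypothetical name, and nothing is fetched here.

```python
from urllib.parse import urlencode

# Endpoint as used in this answer; treat it as historical
BASE = "http://finance.yahoo.com/q/op"

def options_url(symbol, expiry):
    """Build the options page URL for a ticker symbol and a YYYY-MM-DD date."""
    return BASE + "?" + urlencode({"s": symbol.lower(), "m": expiry})

urls = [options_url(symbol, "2014-11-14") for symbol in ("LLY", "MSFT")]
for url in urls:
    print(url)
```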
If you'd like raw json try MSN
http://www.msn.com/en-us/finance/stocks/optionsajax/126.1.UNH.NYS/
You can also specify an expiration date ?date=11/14/2014
http://www.msn.com/en-us/finance/stocks/optionsajax/126.1.UNH.NYS/?date=11/14/2014
If you prefer Yahoo json
http://finance.yahoo.com/q/op?s=LLY
But you have to extract it from the html
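import json
import re
# resp is assumed to be the response object for the Yahoo URL above (e.g. from requests.get)
m = re.search('<script>.+({"applet_type":"td-applet-options-table".+);</script>', resp.content)
data = json.loads(m.group(1))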
as_dicts = data['models']['applet_model']['data']['optionData']['_options'][0]['straddles']
Expirations are here
data['models']['applet_model']['data']['optionData']['expirationDates']
Convert iso to unix timestamp as here
Then re-request the other expirations with the unix timestamp
http://finance.yahoo.com/q/op?s=LLY&date=1414713600
Basing the answer on @hoju:
import lxml.html
import calendar
from datetime import datetime
exDate = "2014-11-22"
symbol = "LLY"
dt = datetime.strptime(exDate, '%Y-%m-%d')
ym = calendar.timegm(dt.utctimetuple())
url = 'http://finance.yahoo.com/q/op?s=%s&date=%s' % (symbol, ym,)
doc = lxml.html.parse(url)
table = doc.xpath('//table[@class="details-table quote-table Fz-m"]/tbody/tr')
rows = []
for tr in table:
    d = [td.text_content().strip().replace(',','') for td in tr.xpath('./td')]
    rows.append(d)
print rows