I am relatively new to programming and completely new to Stack Overflow. I thought a good way to learn would be a Python and Excel based project, but I am stuck. My plan was to scrape a website of addresses using Beautiful Soup, look up the Zillow value estimates for those addresses, and populate them into tabular form in Excel. I am unable to figure out how to get the addresses (the HTML on the site I am trying to scrape seems pretty messy), but I was able to pull the Google address links from the site. Sorry if this is a very basic question; any advice would help though:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import pandas as pd
req = Request("http://www.tjsc.com/Sales/TodaySales")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
count = 0
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
    count = count + 1
print(links)
print("count is", count)
po = links
pd.DataFrame(po).to_excel('todaysale.xlsx', header=False, index=False)
You are on the right track. Instead of 'a', you need to use the 'td' tag for the rows, and 'th' for the column names. Here is one way to implement it. The list_slice function converts every 14 elements into one row, since the original table has 14 columns.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = "http://www.tjsc.com/Sales/TodaySales"
r = requests.get(url, verify=False)
text = r.text
soup = bs(text, 'lxml')
# Get column headers from the html file
header = []
for c_name in soup.findAll('th'):
    header.append(c_name)
# clean up the extracted header content
header = [h.contents[0].strip() for h in header]
# get each row of the table
row = []
for link in soup.find_all('td'):
    row.append(link.get_text().strip())
def list_slice(my_list, step):
    """This function takes any list and divides it into chunks of size "step"."""
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]
# creating the final dataframe
df = pd.DataFrame(list_slice(row, 14), columns=header[:14])
I am learning Python and practicing by extracting data from a public site, but I ran into a problem and would appreciate your help. Thanks in advance! I will check this thread daily for your comments. :)
Purpose:
extract all 65 pages' columns and rows, with their contents, into one CSV in a single script
The 65 page URLs follow this pattern:
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65
Question 1:
When running the one-page script below to extract one page of data into a CSV, I had to run it twice with different filenames before any data appeared. For example, if I run it with test.csv, that file stays at 0 KB; after I change the filename to test2.csv and run the script again, the data shows up in test.csv, while test2.csv stays empty at 0 KB. Any idea?
Here is the one-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')
filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))
divs = soup.find_all("div", class_ = "iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
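(One common cause of the 0 KB symptom in code like the above is that the file handle returned by open() is never closed, so buffered rows may not be flushed to disk promptly. A minimal sketch using a context manager, with stand-in rows instead of the scraped cells, which guarantees the flush:)

```python
import csv

# Stand-in rows; in the real script these would come from the scraped <td> cells
rows = [["name", "date"], ["Project A", "2021-01-01"]]

with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for data in rows:
        writer.writerow(data)
# Leaving the with-block closes the file, flushing everything to disk
```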
Question 2:
I tried to iterate over all 65 page URLs to extract the data into a CSV, but it doesn't work. Any idea how to fix it?
Here is the 65-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"
def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1, 65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_="iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                if data:
                    print("Inserting data: {}".format(','.join(data)))
                    writer.writerow(data)

if __name__ == '__main__':
    with open("test.csv", "w", newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)
Just an alternative approach. Try to keep it simple and use pandas, since it will do all these things for you under the hood.
define a list (data) to keep your results
iterate over the urls with pd.read_html
concat the data frames in data and write them to_csv or to_excel
read_html
find the table that matches a string -> match='预售信息查询:' and select it with [0], because read_html() will always give you a list of tables
take a specific row as the header: header=2
get rid of the last row (navigation) and the last column (caused by the wrong colspan) with .iloc[:-1,:-1]
Example
import pandas as pd
data = []
for pageNo in range(1, 5):
    data.append(pd.read_html(f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1, :-1])
pd.concat(data).to_csv('test.csv', index=False)
Example (based on your code with function)
import pandas as pd
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="
def get_data(url):
    data = []
    for pageNo in range(1, 2):
        data.append(pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1, :-1])
    pd.concat(data).to_csv('test.csv', index=False)

if __name__ == '__main__':
    get_data(url)
I want to scrape the IRS prior forms site to gather data for studying data mining. The site presents a big table spread over 101 pages.
Here's the link:
https://apps.irs.gov/app/picklist/list/priorFormPublication.html
My task:
Taking a list of tax form names (ex: "Form W-2", "Form 1095-C"), search the website
and return some informational results. Specifically, you must return the "Product
Number", the "Title", and the maximum and minimum years the form is available for
download. The forms returned should be an exact match for the input (ex: "Form W-2"
should not return "Form W-2 P", etc.) The results should be returned as json.
MY CODE SO FAR:
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup
url="https://apps.irs.gov/app/picklist/list/priorFormPublication.html"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify())
forms_table = soup.find("table", class_= "picklist-dataTable")
forms_table_data = forms_table.find_all("tr") # contains 2 rows
headings = []
for tr in forms_table_data[0].find_all("th"):
    headings.append(tr.b.text.replace('\n', ' ').strip())
print(headings)
THIS IS WHERE I AM GETTING HORRIBLY STUCK:
data = {}
for table, heading in zip(forms_table_data, headings):
    t_headers = []
    for th in table.find_all("th"):
        t_headers.append(th.text.replace('\n', ' ').strip())
    table_data = []
    for tr in table.tbody.find_all("tr"):  # find all tr's from the table's tbody
        t_row = {}
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)
    data[heading] = table_data
print(data)
I also seem to be missing how to incorporate the rest of the pages on the site.
Thanks for your patience!
The easiest way, as mentioned, to get the table into a data frame is read_html(). Be aware that pandas reads all the tables on the page and puts them in a list of data frames; in your case you have to select the right one by slicing with [3].
Since your question is not that clear and is hard to read with all those images, you should improve it.
Example (Form W-2)
import pandas as pd
pd.read_html('https://apps.irs.gov/app/picklist/list/priorFormPublication.html?resultsPerPage=200&sortColumn=sortOrder&indexOfFirstRow=0&criteria=formNumber&value=Form+W-2&isDescending=false')[3]
Then you can filter and sort the data frame and export it as JSON.
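A sketch of that filtering step (the column names and the tiny frame below are assumptions standing in for the real read_html result):

```python
import json
import pandas as pd

# Hypothetical slice of the picklist table
df = pd.DataFrame({
    "Product Number": ["Form W-2", "Form W-2 P", "Form W-2"],
    "Title": ["Wage and Tax Statement", "Withholding Statement", "Wage and Tax Statement"],
    "Revision Date": [2020, 1990, 1995],
})

# Exact match keeps "Form W-2" but excludes "Form W-2 P"
exact = df[df["Product Number"] == "Form W-2"]
result = {
    "form_number": "Form W-2",
    "form_title": exact["Title"].iloc[0],
    "min_year": int(exact["Revision Date"].min()),
    "max_year": int(exact["Revision Date"].max()),
}
print(json.dumps(result))
```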
I have a CSV called 'df.csv' with one column: a header and 10 URLs.
Col
"http://www.cnn.com"
"http://www.fark.com"
etc
etc
This is my ERROR code
import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
for link in df_link:
    x = urllib2.urlopen(link[0])
    new = x.read()
    # Code does not even get past here as far as I checked
    soup = bs.BeautifulSoup(new, "lxml")
    for text in soup.find_all('a', href=True):
        text.append((text.get('href')))
I am getting an error which says
ValueError: unknown url type: C
I also get other variations of this error. The issue is that it does not even get past
x = urllib2.urlopen(link[0])
On the other hand; This is the WORKING CODE...
url = "http://www.cnn.com"
x = urllib2.urlopen(url)
new = x.read()
soup = bs.BeautifulSoup(new, "lxml")
links = []
for link in soup.find_all('a', href=True):
    links.append((link.get('href')))
Fixed answer
I didn't realize you were using pandas, so my earlier advice wasn't very helpful. The way to do this with pandas is to iterate over the rows and extract the info from them. The following should work without having to get rid of the header:
import bs4 as bs
import pandas as pd
import urllib2
df_link = pd.read_csv('df.csv')
for link in df_link.iterrows():
    url = link[1]['Col']
    x = urllib2.urlopen(url)
    new = x.read()
    soup = bs.BeautifulSoup(new, "lxml")
    for text in soup.find_all('a', href=True):
        text.append((text.get('href')))
Original misleading answer below
It looks like the header of your CSV file is not being treated separately: iterating over df_link yields the column names, so in the first iteration link is "Col" and link[0] is "C", which isn't a valid URL.
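For what it's worth, a quick demonstration of why the error message ends in "C" (iterating a DataFrame yields its column names, not its rows):

```python
import pandas as pd

# Stand-in for the CSV with a "Col" header and URL rows
df_link = pd.DataFrame({"Col": ["http://www.cnn.com", "http://www.fark.com"]})

items = [link for link in df_link]  # iteration yields column names
print(items)        # ['Col']
print(items[0][0])  # 'C' -- the "unknown url type: C"
```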
I'm attempting to extract some links from a chunk of beautiful soup html and append them to rows of a new pandas dataframe.
So far, I have this code:
url = "http://www.reed.co.uk/jobs?datecreatedoffset=Today&isnewjobssearch=True&pagesize=100"
r = ur.urlopen(url).read()
soup = BShtml(r, "html.parser")
adcount = soup.find_all("div", class_="pages")
print(adcount)
From my output I then want to take every link, identified by href="" and store each one in a new row of a pandas dataframe.
Using the above snippet I would end up with 6 rows in my new dataset.
Any help would be appreciated!
Your link gives a 404, but the logic should be the same as below. You just need to extract the anchor tags with the page class and join them to the base url:
import pandas as pd
from urlparse import urljoin
import requests
from bs4 import BeautifulSoup
base = "http://www.reed.co.uk/jobs"
url = "http://www.reed.co.uk/jobs?keywords=&location=&jobtitleonly=false"
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
df = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div.pages a.page")])
print(df)
Which gives you:
links
0 http://www.reed.co.uk/jobs?cached=True&pageno=2
1 http://www.reed.co.uk/jobs?cached=True&pageno=3
2 http://www.reed.co.uk/jobs?cached=True&pageno=4
3 http://www.reed.co.uk/jobs?cached=True&pageno=5
4 http://www.reed.co.uk/jobs?cached=True&pageno=...
5 http://www.reed.co.uk/jobs?cached=True&pageno=2
I am trying to read in html websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup
ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials' #build url
text_soup = BeautifulSoup(urlopen(full_url).read()) #read in
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
It's not good practice to use regex for parsing HTML. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and "EPS (Basic)" text in it, then iterate over the next siblings with the valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
Hope that helps.
I would take a very different approach. We use lxml for scraping HTML pages. One of the reasons we switched was that BeautifulSoup was not being maintained for a while, or I should say, not updated.
In my test I ran the following
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically but I will leave that for you.
tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
now I noticed that the first row has the column headings, so I want to separate all of the rows
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
now lets get the column headings:
column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally we can map the column headings to the row labels and cell values
my_results = []
for row in table_rows[1:]:
    cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
now to access the results
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do, for example I did not test for squareness (number of cells in each row is equal).
Finally I am a novice and I suspect others will advise more direct methods of getting at these elements (xPath or cssselect) but this does work and it gets you everything from the table in a nice structured manner.
I should add that every row from the table is available, they are in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row etc.
When I need a new build of lxml I visit a page maintained by a really nice guy at UC-IRVINE
I hope this helps
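As a sketch of the more direct XPath route mentioned above (the snippet below is a minimal stand-in for the real table markup, using the rowTitle/valueCell classes described in the other answer):

```python
from lxml import html

# Minimal stand-in for the financials table markup
snippet = """
<table>
  <tr><td class="rowTitle">EPS (Basic)</td>
      <td class="valueCell">13.46</td><td class="valueCell">20.62</td></tr>
</table>
"""
tree = html.fromstring(snippet)
# All valueCell cells in the row whose title cell mentions "EPS (Basic)"
values = tree.xpath(
    '//tr[td[@class="rowTitle" and contains(., "EPS (Basic)")]]'
    '/td[@class="valueCell"]/text()'
)
print(values)  # ['13.46', '20.62']
```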
from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd
url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())
table = soup.find('table', {'data-ajax-content' : 'true'})
data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])
df = pd.DataFrame(data)
print df
dictframe = df.to_dict()
print dictframe
The above code will give you a DataFrame from the webpage and then uses that to create a python dictionary.
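Note that to_dict() defaults to a column-oriented mapping (column -> {row index: value}); if you want one dict per row, orient='records' is usually closer to what you need. A small stand-in frame to illustrate:

```python
import pandas as pd

df = pd.DataFrame([["Cash", "100"], ["Debt", "50"]], columns=["Item", "Value"])

print(df.to_dict())
# {'Item': {0: 'Cash', 1: 'Debt'}, 'Value': {0: '100', 1: '50'}}
print(df.to_dict(orient="records"))
# [{'Item': 'Cash', 'Value': '100'}, {'Item': 'Debt', 'Value': '50'}]
```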