Extract multiple page web table into Excel - python

I have a table that spans across many pages. I'm able to pull the info from a designated page and pull it into a CSV table. My goal now is to have this iterate through all the pages and add it to the bottom of the previous page's info. Here is the code so far that works on a single page:
import requests
import pandas as pd
url = 'https://www.mineralanswers.com/oklahoma/producers?page=1'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
The page URLs follow the pattern "...producers?page=1", "...producers?page=2", "...producers?page=3", so a loop seems like the obvious approach. My problem is appending each page's data instead of overwriting the previous page's.

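Here is corrected example code that fetches 3 pages and combines them into one DataFrame. Each page's table is collected in a list and concatenated once at the end, because DataFrame.append was removed in pandas 2.0.
import requests
import pandas as pd
frames = []
for page in range(1, 4):
    url = 'https://www.mineralanswers.com/oklahoma/producers?page=' + str(page)
    html = requests.get(url).content
    df_list = pd.read_html(html)
    # the table of interest is the last one on each page
    frames.append(df_list[-1])
# ignore_index=True renumbers the rows across all pages
df = pd.concat(frames, ignore_index=True)
df.to_csv('my data.csv')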
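To see what ignore_index changes without touching the network, here is a minimal sketch. It assumes a hypothetical fetch_page helper that returns a small hand-made frame per page; the column names and values are invented for illustration.

```python
import pandas as pd

def fetch_page(page):
    """Hypothetical stand-in for requests + pd.read_html: returns a 2-row frame."""
    return pd.DataFrame({
        "producer": [f"producer {page}a", f"producer {page}b"],
        "page": [page, page],
    })

# Collect one frame per page, then concatenate once at the end.
frames = [fetch_page(page) for page in range(1, 4)]

# Without ignore_index each page keeps its own 0, 1 row labels.
stacked = pd.concat(frames)
# With ignore_index=True the rows are renumbered 0..5.
combined = pd.concat(frames, ignore_index=True)

print(list(stacked.index))   # [0, 1, 0, 1, 0, 1]
print(list(combined.index))  # [0, 1, 2, 3, 4, 5]
```

Concatenating once at the end also avoids copying the growing frame on every iteration.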

Related

How to extract specific table data (div\tr\td) from multiple URLs in a website in an iterative way into CSV (with sample)

I am learning Python and practicing by extracting data from a public site, but I have run into a problem and would appreciate your help.
Thanks in advance! I will check this thread daily for comments :)
Purpose:
Extract the rows and columns of all 65 pages into one CSV file with a single script.
The 65 page URLs follow this pattern:
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65
Question 1:
When I run the single-page script below, I have to run it twice with different filenames before the data shows up in the file from the first run.
For example, if I run it with test.csv, the file stays at 0 KB. If I then change the filename to test2.csv and run the script again, the data appears in test.csv, but test2.csv stays empty at 0 KB. Any idea why?
Here is the single-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')
filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))
divs = soup.find_all("div", class_ = "iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
Question 2:
I am having trouble iterating over the 65 page URLs to extract the data into a CSV file.
It doesn't work. Any idea how to fix it?
Here is the code for extracting all 65 pages:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"
def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1,65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_ = "iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                if data:
                    print("Inserting data: {}".format(','.join(data)))
                    writer.writerow(data)
if __name__ == '__main__':
    with open("test.csv","w",newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)
Just an alternative approach
Try to keep it simple and consider using pandas, because it does all of these things for you under the hood:
Define a list (data) to hold your results.
Iterate over the URLs with pd.read_html.
Concatenate the data frames in data and write them out with to_csv or to_excel.
read_html
Find the table that matches a string with match='预售信息查询:' and select it with [0], because read_html() always returns a list of tables.
Use a specific row as the header with header=2.
Drop the last row (navigation) and the last column (caused by a wrong colspan) with .iloc[:-1,:-1].
Example
import pandas as pd
data = []
for pageNo in range(1,5):
    data.append(pd.read_html(f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}', header =2, match='预售信息查询:')[0].iloc[:-1,:-1])
pd.concat(data).to_csv('test.csv', index=False)
Example (based on your code with function)
import pandas as pd
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="
def get_data(url):
    data = []
    for pageNo in range(1,2):
        data.append(pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1,:-1])
    pd.concat(data).to_csv('test.csv', index=False)
if __name__ == '__main__':
    get_data(url)
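Three details in the question's own scripts are worth illustrating separately. The empty 0 KB file in Question 1 is most likely caused by the file passed to csv.writer never being closed, so buffered rows may not reach the disk until the file object is released. In Question 2, range(1,65) stops at page 64, and .content is called on the URL string rather than on a requests.get(...) response. The following is a minimal sketch of the first two points using only the standard library; the write_rows helper and the sample rows are invented for illustration, and no request is made.

```python
import csv
import os
import tempfile

# URL pattern copied from the question; only the page number varies.
BASE_URL = ("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml"
            "?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}")

# range() excludes its stop value, so 65 pages need range(1, 66).
page_urls = [BASE_URL.format(page_no) for page_no in range(1, 66)]

def write_rows(path, rows):
    """Hypothetical helper: write rows to a CSV file and close it afterwards."""
    # The with-block closes the file on exit, which flushes buffered rows to disk.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerows(rows)

# Invented sample rows standing in for the scraped table cells.
sample_rows = [["project A", "2020-01-01"], ["project B", "2020-02-01"]]
out_path = os.path.join(tempfile.mkdtemp(), "test.csv")
write_rows(out_path, sample_rows)

print(len(page_urls), page_urls[-1][-9:])
print(os.path.getsize(out_path) > 0)
```

Passing the writer into a function, instead of relying on a global name as the Question 2 script does, also makes the data flow easier to follow.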

Adding href to pandas .read_html DF

I want to create a table with the information available on this website. I want the table to have 3 columns: 0 series/date, 1 title and 2 links. I already managed to get the first two columns but I don't know how to get the link for each entry.
import pandas as pd
import requests
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
Will it be possible to get what I want by only using pandas?
As far as I know, it's not possible with pandas only. It can be done with BeautifulSoup, though:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
html_table = BeautifulSoup(r.text, 'html.parser').find('table')
r.close()
df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]
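Note that the list comprehension above only lines up with the DataFrame if every data row contains exactly one link. As a hedged alternative, the sketch below pairs each row with its first link, or None when there is none, using only the standard library's html.parser. The RowLinkParser class and the sample HTML are invented for illustration and do not reflect the real page's markup.

```python
from html.parser import HTMLParser

class RowLinkParser(HTMLParser):
    """Collect, for every <tr>, its cell texts and the first href found in it."""

    def __init__(self):
        super().__init__()
        self.rows = []        # list of (cells, href) tuples
        self._cells = None    # cells of the current row, None outside a row
        self._href = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag in ("td", "th") and self._cells is not None:
            self._in_cell = True
            self._cells.append("")
        elif tag == "a" and self._cells is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append(([c.strip() for c in self._cells], self._href))
            self._cells = None

# Invented sample table: one row with a link, one without.
sample_html = """
<table>
<tr><th>Series</th><th>Title</th></tr>
<tr><td>R-001</td><td><a href="docs/r001.pdf">First resolution</a></td></tr>
<tr><td>R-002</td><td>No document yet</td></tr>
</table>
"""

parser = RowLinkParser()
parser.feed(sample_html)
for cells, href in parser.rows:
    print(cells, href)
```

Recent pandas versions (1.5 and later, as far as I know) also accept an extract_links argument in read_html, which returns (text, href) tuples for the selected cells and may make the extra parsing step unnecessary.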

Pandas df read data source stored in ".xml?" page which is not in tables format?

I need to download the data table from "http://www.dicj.gov.mo/web/en/information/DadosEstat_mensal/2019/index.html" and export it to Excel. Inspecting the page with Chrome's inspect function shows that the data is in "http://www.dicj.gov.mo/web/en/information/DadosEstat_mensal/2019/report_en.xml?id=2". However, it is no longer in table format.
url = "http://www.dicj.gov.mo/web/en/information/DadosEstat_mensal/2019/index.html"
table= pd.read_html(url)[2]
table.info()
print(table)
table.to_excel("GGR.xlsx")
I see that now your source web site returns the content in XML format.
To process it, you can apply BeautifulSoup. Assuming that you have installed it,
proceed as follows:
Import the necessary modules (pandas is needed for the DataFrame steps below):
from bs4 import BeautifulSoup
import pandas as pd
import requests
Read the source page:
page = requests.get('http://www.dicj.gov.mo/web/en/information/DadosEstat_mensal/2019/report_en.xml?id=2')
soup = BeautifulSoup(page.text, 'lxml')
Read column names and create the MultiIndex for columns in the target DataFrame:
col = soup.find_all('column')
h1 = [ col[i].contents[0] for i in range(1,3) ]
h2 = [ col[i].contents[0] for i in range(3,6) ]
cols = pd.MultiIndex.from_product([h1, h2])
Process source records, creating ind (index) and rows:
recs = soup.find_all('record')
ind = []
rows = []
for rec in recs:
    cells = rec.find_all('data')
    ind.append(cells[0].contents[0])
    rows.append([ cells[i].contents[0] for i in range(1,7) ])
And the last step - create the target DataFrame:
df = pd.DataFrame(rows, index=ind, columns=cols)
I tried to read this table from the first page given by you, using read_html,
but I failed.
Probably the final content is loaded by some JavaScript in this page,
which can not be "seen" by read_html.
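To make the MultiIndex step in this answer more concrete, here is a minimal offline sketch. The header labels, index labels, and values are invented stand-ins for the texts that the answer reads from the column and data elements; only the shapes (2 top-level labels, 3 second-level labels, 6 values per row) mirror the answer.

```python
import pandas as pd

# Invented header labels standing in for the <column> texts in the XML.
h1 = ["Group A", "Group B"]
h2 = ["x", "y", "z"]

# from_product pairs every top-level label with every second-level label,
# giving 2 * 3 = 6 columns.
cols = pd.MultiIndex.from_product([h1, h2])

# Invented records: one index label plus six values per row.
ind = ["Jan", "Feb"]
rows = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]

df = pd.DataFrame(rows, index=ind, columns=cols)
print(df)

# A single cell is addressed by row label and a (level 1, level 2) column tuple.
print(df.loc["Feb", ("Group B", "y")])  # 11
```

The number of values per row has to equal the number of generated column pairs, which is why the answer reads exactly six data cells per record.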

python web scraping and excel population

I am relatively new to programming and completely new to Stack Overflow. I thought a good way to learn would be a Python and Excel based project, but I am stuck. My plan was to scrape a website of addresses using Beautiful Soup, look up the Zillow value estimates for those addresses, and populate them into a table in Excel. I cannot figure out how to get the addresses (the HTML on the site I am trying to scrape seems pretty messy), but I was able to pull the Google address links from the site. Sorry if this is a very basic question; any advice would help:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import pandas as pd
req = Request("http://www.tjsc.com/Sales/TodaySales")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
count = 0
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
    count = count +1
print(links)
print("count is", count)
po = links
pd.DataFrame(po).to_excel('todaysale.xlsx', header=False, index=False)
You are on the right track. Instead of 'a', you need to use a different HTML tag, 'td', for the rows, and 'th' for the column names. Here is one way to implement it. The list_slice function groups every 14 elements into one row, since the original table has 14 columns.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = "http://www.tjsc.com/Sales/TodaySales"
r = requests.get(url, verify=False)
text = r.text
soup = bs(text, 'lxml')
# Get column headers from the html file
header = []
for c_name in soup.findAll('th'):
    header.append(c_name)
# clean up the extracted header content
header = [h.contents[0].strip() for h in header]
# get each row of the table
row = []
for link in soup.find_all('td'):
    row.append(link.get_text().strip())
def list_slice(my_list, step):
    """This function takes any list, and divides it to chunks of size of "step"
    """
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]
# creating the final dataframe
df = pd.DataFrame(list_slice(row, 14), columns=header[:14])
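The chunking done by list_slice is easy to check on a small list. This is a minimal sketch with invented cell values and a chunk size of 3 instead of 14. It also shows that a trailing incomplete chunk is kept, so it may be worth filtering such chunks out before building a table.

```python
def list_slice(my_list, step):
    """Split my_list into consecutive chunks of length step (the last may be shorter)."""
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]

# Invented flat cell list: two rows of a three-column table plus one stray cell.
cells = ["a1", "b1", "c1", "a2", "b2", "c2", "a3"]
chunks = list_slice(cells, 3)
print(chunks)  # [['a1', 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3']]

# Keep only complete rows before building a table from the chunks.
complete_rows = [chunk for chunk in chunks if len(chunk) == 3]
print(complete_rows)  # [['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]
```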

Get xml from webservice?

I'm trying to get data from this site and then use some of it. Sorry for not pasting it here, but it's a long XML document. So far I have fetched the data this way:
from urllib.request import urlopen
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
s = urlopen(url)
content = s.read()
The output of print(content) looks good. Now I would like to extract data from it. Each record looks like this:
<tabela_rozklad data-aktualizacji="1480583567">
<DZIEN>2</DZIEN>
<GODZ>3</GODZ>
<ILOSC>2</ILOSC>
<TYG>0</TYG>
<ID_NAUCZ>66</ID_NAUCZ>
<ID_SALA>79</ID_SALA>
<ID_PRZ>104</ID_PRZ>
<RODZ>W</RODZ>
<GRUPA>1</GRUPA>
<ID_ST>13</ID_ST>
<SEM>1</SEM>
<ID_SPEC>0</ID_SPEC>
</tabela_rozklad>
How can I handle this data so that it is easy to use?
You can use BeautifulSoup and capture the tags you want. The code below should get you started!
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
# fetch the URL content
response = requests.get(url).content
# html.parser lowercases tag names, which is why the tags below are lowercase
soup = BeautifulSoup(response, 'html.parser')
# find each tabela_rozklad
tables = soup.find_all('tabela_rozklad')
# each tabela_rozklad appears to contain these 12 nested tags
tags = ['dzien', 'godz', 'ilosc', 'tyg', 'id_naucz', 'id_sala',
        'id_prz', 'rodz', 'grupa', 'id_st', 'sem', 'id_spec']
# extract the text of every tag from each tabela_rozklad, one row per record
# (DataFrame.append was removed in pandas 2.0, so rows are collected in a list first)
rows = [[table.find(tag).text for tag in tags] for table in tables]
# build the dataframe with the tags as column names
df = pd.DataFrame(rows, columns=tags)
# display first 5 rows of table
df.head()
# and the shape of the data
df.shape # 665 rows, 12 columns
# and now you can get to the information using traditional pandas functionality
# for instance, count observations by rodz
df.groupby('rodz').count()
# or subset only observations where rodz = J
J = df[df.rodz == 'J']
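As a hedged alternative that needs no third-party packages, the same records can be read with the standard library's xml.etree.ElementTree. The sketch below parses the single record shown in the question. The enclosing root element is an assumption, since the question does not show how the real service wraps its records.

```python
import xml.etree.ElementTree as ET

# One record copied from the question, wrapped in a hypothetical <root> element;
# the real service's enclosing element is not shown in the question.
sample = """<root>
<tabela_rozklad data-aktualizacji="1480583567">
<DZIEN>2</DZIEN>
<GODZ>3</GODZ>
<ILOSC>2</ILOSC>
<TYG>0</TYG>
<ID_NAUCZ>66</ID_NAUCZ>
<ID_SALA>79</ID_SALA>
<ID_PRZ>104</ID_PRZ>
<RODZ>W</RODZ>
<GRUPA>1</GRUPA>
<ID_ST>13</ID_ST>
<SEM>1</SEM>
<ID_SPEC>0</ID_SPEC>
</tabela_rozklad>
</root>"""

root = ET.fromstring(sample)

# An XML parser keeps the original upper-case tag names.
records = [
    {child.tag: child.text for child in record}
    for record in root.iter("tabela_rozklad")
]
print(records[0]["RODZ"], records[0]["ID_SALA"])
```

A list of dicts like records can be passed directly to pd.DataFrame(records) if a DataFrame is needed afterwards.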
