python parsing beautiful soup data to csv - python

I have written code in Python 3 to parse an HTML table. I have a few issues with it:
1. The headers of my CSV output file are not generated from the HTML (tag: td, class: t1) by my code (on the first run, when the output file is created).
2. If the incoming HTML table has a few additional fields (tag: td, class: t1), my code cannot currently capture them and create additional headers in the CSV output file.
3. The data is not written to the output CSV file until ALL the IDs (A001, A002, A003...) from my input file are processed. I want to write to the output CSV file as soon as the processing of each ID is completed (i.e. A001 should be written to the CSV before A002 is processed).
4. Whenever I rerun the code, the data does not begin on the next line of the output CSV.
Being new to this, I am sure my code is very rudimentary and there is a better way to do this. I would like to learn to write it better and fix the above as well.
I need advice and guidance, please help. Thank you.
My Code:
import csv
import requests
from bs4 import BeautifulSoup
## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
SIDFile = open('SIDs.csv')
SIDReader = csv.reader(SIDFile)
SID = list(SIDReader)
SqID_data = []
#create and open output file
with open('output.csv','a', newline='') as csv_h:
    fields = \
    [
        "ID",
        "Financial Year",
        "Total Income",
        "Total Expenses",
        "Tax Expense",
        "Net Profit"
    ]
    for row in SID:
        col1,col2 = row
        SID ="%s" % (col2)
        url = requests.get("http://.......")
        soup = BeautifulSoup(url.text, "lxml")
        fy = soup.findAll('td',{'class':'tablehead'})
        titles = soup.findAll('td',{'class':'t1'})
        values = soup.findAll('td',{'class':'t0'})
        if titles:
            data = {}
            for title in titles:
                name = title.find("td", class_ = "t1")
            data["ID"] = SID
            data["Financial Year"] = fy[0].string.strip()
            data["Total Income"] = values[0].string.strip()
            data["Total Expenses"] = values[1].string.strip()
            data["Tax Expense"] = values[2].string.strip()
            data["Net Profit"] = values[3].string.strip()
            SqID_data.append(data)
    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting = csv.QUOTE_ALL,
        extrasaction = "ignore",
        dialect = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(SqID_data)
    print("write rows complete")
Excerpt of HTML being processed:
<p>
<TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
<TR>
<TD class=tablehead>Financial Year</TD>
<TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
</TR>
</TABLE>
</p>
<p>
<br>
<table cellpadding=3 cellspacing=1 class=vTable>
<TR>
<TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
<TD class=t0 nowrap>675529.00</td>
</tr>
<TR>
<TD class=t1><b>Total expenses</b></td>
<TD class=t0 nowrap>446577.00</td>
</tr>
<TR>
<TD class=t1>Tax expense</td>
<TD class=t0 nowrap>71708.00</td>
</tr>
<TR>
<TD class=t1><b>Net Profit / (Loss)</b></td>
<TD class=t0 nowrap>157621</td>
</tr>
</table>
</p>
SIDs.csv (no header row)
1,A0001
2,A0002
3,A0003
Expected Output: output.csv (create header row)
ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
A002,....
A003,....

I would recommend looking at pandas.read_html for parsing your web data. On your sample data (with the HTML held in a string s), this gives you:
import pandas as pd
tables=pd.read_html(s, index_col=0)
tables[0]
Out[11]:
                                         1
0
Financial Year  01-Apr-2015 To 31-Mar-2016
tables[1]
                                                  1
0
Total income from operations (net) ( a + b)  675529
Total expenses                               446577
Tax expense                                   71708
Net Profit / (Loss)                          157621
You can then do whatever data manipulation you need (adding IDs etc.) using pandas functions, and then export with DataFrame.to_csv.
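The question also asks for each ID's row to be written as soon as it is processed, and for reruns to continue on a new line without repeating the header. Here is a minimal sketch of one way to do that with the standard csv module. It assumes each ID has already been parsed into a dict; the helper name append_row, the temporary output path and the sample record are hypothetical stand-ins, not part of the original code.

```python
import csv
import os
import tempfile

FIELDS = ["ID", "Financial Year", "Total Income",
          "Total Expenses", "Tax Expense", "Net Profit"]

def append_row(path, row, fields=FIELDS):
    """Append one record; write the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # Opening and closing the file per record flushes each row to disk
    # before the next ID is processed.
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
        if need_header:
            writer.writeheader()
        writer.writerow(row)

# Hypothetical usage: the values are copied from the HTML excerpt,
# the output path is a throwaway temporary file.
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
for sid in ["A0001", "A0002"]:
    record = {"ID": sid,
              "Financial Year": "01-Apr-2015 To 31-Mar-2016",
              "Total Income": "675529.00",
              "Total Expenses": "446577.00",
              "Tax Expense": "71708.00",
              "Net Profit": "157621"}
    append_row(out_path, record)

# Read the file back to check what was written.
with open(out_path, newline="") as fh:
    written = list(csv.reader(fh))
print(written[0])
```

Checking the file size before writing is what keeps the header from being repeated on later runs. Handling extra fields that only appear for some IDs would need more work, because a header that has already been written cannot be extended in append mode.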

Related

How can I make a proper loop for this case

For a bootcamp project, I'm trying to scrape data from the following website: https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30. My aim is to get separate lists for the following columns: Date, Market Cap, Volume, Open, and Close. My problem is that the columns Market Cap, Volume, Open and Close all share the same class name (text-center):
<td class="text-center">
$161,716,193,676
</td>
<td class="text-center">
$16,571,161,476
</td>
<td class="text-center">
$1,340.02
</td>
<td class="text-center">
N/A
</td>
I've tried to solve it with this:
import requests
url_get = requests.get('https://www.coingecko.com/en/coins/ethereum/historical_data#panel')
from bs4 import BeautifulSoup
soup = BeautifulSoup(url_get.content,"html.parser")
table = soup.find('div', attrs={'class':'card-block'})
row = table.find_all('th', attrs={'class':'font-semibold text-center'})
row_length = len(row)
row_length
temp = []
for i in range(0, row_length):
    Date = table.find_all('th', attrs={'class':'font-semibold text-center'})[i].text
    Market_Cap = table.find_all('td', attrs={'class':'text-center'})[i].text
    Market_Cap = Market_Cap.strip()
    Volume = table.find_all('td', attrs={'class':'text-center'})[i].text
    Volume = Market_Cap.strip()
    Open = table.find_all('td', attrs={'class':'text-center'})[i].text
    Open = Open.strip()
    Close = table.find_all('td', attrs={'class':'text-center'})[i].text
    Close = Close.strip()
    temp.append((Date,Market_Cap,Volume,Open,Close))
temp
and the frustrating output looks like this:
[('2022-09-29',
  '$161,716,193,676',
  '$161,716,193,676',
  '$161,716,193,676',
  '$161,716,193,676'),
 ('2022-09-28',
  '$16,571,161,476',
  '$16,571,161,476',
  '$16,571,161,476',
  '$16,571,161,476'),
(and so on)
I need to do it with the row length method (which is 31 according to my code), but since the class names are not distinct, I suppose I can't get the output I want. It would be much appreciated if anyone could help me figure it out. Cheers.
values = [td.text.strip() for td in soup.find_all('td', class_='text-center')]
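Since find_all returns the text-center cells as one flat list in document order, row i owns the four cells from index 4*i to 4*i+3, and indexing with i alone picks the wrong cells. Below is a minimal sketch of that indexing with hard-coded strings in place of the scraped cells. It assumes exactly four text-center cells per table row, as in the excerpt; the first row's values come from the question, while the second row's values are obvious placeholders rather than real data.

```python
# Dates as they would come from the <th> cells, one per table row.
dates = ["2022-09-29", "2022-09-28"]

# Flat list as find_all('td', class_='text-center') would yield it,
# already stripped. The second row is placeholder text.
cells = ["$161,716,193,676", "$16,571,161,476", "$1,340.02", "N/A",
         "MC2", "VOL2", "OPEN2", "CLOSE2"]

COLS = 4  # Market Cap, Volume, Open, Close

rows = []
for i, date in enumerate(dates):
    # Slice out the four cells that belong to row i.
    market_cap, volume, open_, close = cells[COLS * i: COLS * (i + 1)]
    rows.append((date, market_cap, volume, open_, close))

print(rows[0])
```

An alternative that avoids index arithmetic is to loop over the tr elements and call find_all('td') on each row, which also copes with rows that have a different number of cells.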

Converting a HTML table to a CSV in Python

I am trying to convert an HTML table to a CSV file in Python. The table I am trying to extract is this one:
<table class="tblperiode">
<caption>Dades de període</caption>
<tr>
<th class="sortable"><span class="tooltip" title="Període (Temps Universal)">Període</span><br/>TU</th>
<th><span class="tooltip" title="Temperatura mitjana (°C)">TM</span><br/>°C</th>
<th><span class="tooltip" title="Temperatura màxima (°C)">TX</span><br/>°C</th>
<th><span class="tooltip" title="Temperatura mínima (°C)">TN</span><br/>°C</th>
<th><span class="tooltip" title="Humitat relativa mitjana (%)">HRM</span><br/>%</th>
<th><span class="tooltip" title="Precipitació (mm)">PPT</span><br/>mm</th>
<th><span class="tooltip" title="Velocitat mitjana del vent (km/h)">VVM (10 m)</span><br/>km/h</th>
<th><span class="tooltip" title="Direcció mitjana del vent (graus)">DVM (10 m)</span><br/>graus</th>
<th><span class="tooltip" title="Ratxa màxima del vent (km/h)">VVX (10 m)</span><br/>km/h</th>
<th><span class="tooltip" title="Irradiància solar global mitjana (W/m2)">RS</span><br/>W/m<sup>2</sup></th>
</tr>
<tr>
<th>
00:00 - 00:30
</th>
<td>16.2</td>
<td>16.5</td>
<td>15.4</td>
<td>93</td>
<td>0.0</td>
<td>6.5</td>
<td>293</td>
<td>10.4</td>
<td>0</td>
</tr>
<tr>
<th>
00:30 - 01:00
</th>
<td>16.4</td>
<td>16.5</td>
<td>16.1</td>
<td>90</td>
<td>0.0</td>
<td>5.8</td>
<td>288</td>
<td>8.6</td>
<td>0</td>
</tr>
And I want it to look something like this:
To achieve this, I parsed the HTML and managed to build a DataFrame with the data as follows:
from bs4 import BeautifulSoup
import csv
import pandas as pd
html = open("table.html").read()
soup = BeautifulSoup(html)
table = soup.select_one("table.tblperiode")
output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
df = pd.DataFrame(output_rows)
print(df)
However, I would also like to have the column names and a column indicating the time interval. In the HTML example above only two intervals appear, 00:00 - 00:30 and 00:30 - 01:00, so my table should have two rows: one with the observations for 00:00 - 00:30 and another with the observations for 00:30 - 01:00.
How could I get this information from my HTML?
Here's a way of doing it. It's probably not the nicest way, but it works! You can read through the comments to see what the code is doing.
from bs4 import BeautifulSoup
import csv
#read the html
html = open("table.html").read()
soup = BeautifulSoup(html, 'html.parser')
# get the table from html
table = soup.select_one("table.tblperiode")
# find all rows
rows = table.findAll('tr')
# strip the header from rows
headers = rows[0]
header_text = []
# add the header text to array
for th in headers.findAll('th'):
    header_text.append(th.text)
# init row text array
row_text_array = []
# loop through rows and add row text to array
for row in rows[1:]:
    row_text = []
    # loop through the elements
    for row_element in row.findAll(['th', 'td']):
        # append the array with the elements inner text
        row_text.append(row_element.text.replace('\n', '').strip())
    # append the text array to the row text array
    row_text_array.append(row_text)
# output csv (newline="" avoids blank lines between rows on Windows)
with open("out.csv", "w", newline="") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)
    # loop through each row array
    for row_text_single in row_text_array:
        wr.writerow(row_text_single)
With this script:
import csv
from bs4 import BeautifulSoup
html = open('table.html').read()
soup = BeautifulSoup(html, features='lxml')
table = soup.select_one('table.tblperiode')
rows = []
for i, table_row in enumerate(table.findAll('tr')):
    # skip the header row
    if i > 0:
        # collapse the whitespace around the interval in the row's <th>
        periode = [' '.join(table_row.findAll('th')[0].text.split())]
        data = [x.text for x in table_row.findAll('td')]
        rows.append(periode + data)
# one name per column of the table shown in the question
header = ['Periode', 'TM', 'TX', 'TN', 'HRM', 'PPT', 'VVM', 'DVM', 'VVX', 'RS']
with open('result.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(header)
    w.writerows(rows)
I've managed to generate the following CSV file as output:
Periode,TM,TX,TN,HRM,PPT,VVM,DVM,VVX,RS
00:00 - 00:30,16.2,16.5,15.4,93,0.0,6.5,293,10.4,0
00:30 - 01:00,16.4,16.5,16.1,90,0.0,5.8,288,8.6,0
import csv
from bs4 import BeautifulSoup
import pandas as pd
html = open('test.html').read()
soup = BeautifulSoup(html, features='lxml')
#Specify table name which you want to read.
#Example: <table class="queryResults" border="0" cellspacing="1">
table = soup.select_one('table.queryResults')
def get_all_tables(soup):
    return soup.find_all("table")
tbls = get_all_tables(soup)
for i, tablen in enumerate(tbls, start=1):
    print(i)
    print(tablen)
def get_table_headers(table):
    headers = []
    for th in table.find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers
head = get_table_headers(table)
#print(head)
def get_table_rows(table):
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, search for th tags
            # can be found especially in wikipedia tables below the table
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows
table_rows = get_table_rows(table)
#print(table_rows)
def save_as_csv(table_name, headers, rows):
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv")
save_as_csv("Test_table", head, table_rows)

Wiki scraping using python

I am trying to scrape the data stored in the table on this Wikipedia page: https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India).
However, I am unable to scrape the full data.
Here's what I have written so far:
from bs4 import BeautifulSoup
import urllib2
wiki = "https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page,"html.parser")
name = ""
pic = ""
strt = ""
end = ""
pri = ""
x=""
table = soup.find("table", { "class" : "wikitable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 8:
        name = cells[0].find(text=True)
        print name
The output obtained is:
Jairamdas Daulatram, Surjit Singh Barnala, Rao Birendra Singh
Whereas the output should be: Jairamdas Daulatram followed by Panjabrao Deshmukh
Have you read the raw html?
Because some of the cells span several rows (e.g. Political Party), most rows do not have 8 cells in them.
You cannot therefore do if len(cells) == 8 and expect it to work. Think about what this line was meant to achieve. If it was to ignore the header row then you could replace it with if len(cells) > 0 because all the header cells are <th> tags (and therefore will not appear in your list).
Page source (showing your problem):
<tr>
<td>Jairamdas Daulatram</td>
<td></td>
<td>1948</td>
<td>1952</td>
<td rowspan="6">Indian National Congress</td>
<td rowspan="6" bgcolor="#00BFFF" width="4px"></td>
<td rowspan="3">Jawaharlal Nehru</td>
<td><sup id="cite_ref-1" class="reference"><span>[</span>1<span>]</span></sup></td>
</tr>
<tr>
<td>Panjabrao Deshmukh</td>
<td></td>
<td>1952</td>
<td>1962</td>
<td><sup id="cite_ref-2" class="reference"><span>[</span>2<span>]</span></sup></td>
</tr>
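To see the effect of rowspan on the cell count, here is a minimal sketch in Python 3 that parses a shortened copy of the two rows above and counts the td tags in each row. The html string is a stand-in for the real page: the one-cell header row is hypothetical and the cell contents are trimmed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the page source; the header row is a placeholder.
html = """
<table class="wikitable">
<tr><th>Name</th></tr>
<tr>
<td>Jairamdas Daulatram</td><td></td><td>1948</td><td>1952</td>
<td rowspan="6">Indian National Congress</td>
<td rowspan="6"></td>
<td rowspan="3">Jawaharlal Nehru</td>
<td>[1]</td>
</tr>
<tr>
<td>Panjabrao Deshmukh</td><td></td><td>1952</td><td>1962</td>
<td>[2]</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

counts = []  # number of <td> cells found in each row
names = []   # first cell of every row that has at least one <td>
for row in table.find_all("tr"):
    cells = row.find_all("td")
    counts.append(len(cells))
    if cells:
        names.append(cells[0].get_text(strip=True))

print(counts)  # [0, 8, 5]
print(names)   # ['Jairamdas Daulatram', 'Panjabrao Deshmukh']
```

The header row has no td cells, the first data row has 8, and the second has only 5 because three of its columns are covered by cells spanning down from the row above. Testing for a non-empty list therefore keeps both data rows, while testing for exactly 8 drops the second.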
As already stated in a previous answer, it does not make sense to check for a fixed number of cells. Just check whether the row contains any <td>. The code below is written in Python 3, but should work in Python 2.7 as well with some small adjustments.
from bs4 import BeautifulSoup
from urllib.request import urlopen
wiki = urlopen("https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)")
soup = BeautifulSoup(wiki, "html.parser")
table = soup.find("table", { "class" : "wikitable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    # header rows contain only <th> tags, so cells is empty for them
    if cells:
        name = cells[0].find(text=True)
        print(name)

Python + BeautifulSoup - Limiting text extraction on a specific table (multiple tables on a webpage)

Hello all. I am trying to use BeautifulSoup to pick up the content of "Date of Employment:" on a webpage. The webpage contains 5 tables. The tables are similar and look like the ones below.
<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Design Team</th><th class="top">Top</th></tr></thead><tbody><tr><td style="width:20px;">Designer:</td><td>Michael Linnen</td></tr>
<tr><td style="width:20px;">Date of Employment:</td><td>07 Jan 2012</td></tr>
<tr><td style="width:20px;">No of Works:</td><td>6</td></tr>
<tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>
<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Operation Team</th><th class="top">Top</th></tr></thead><tbody><tr><td style="width:20px;">Manager:</td><td>Nich Sharmen</td></tr>
<tr><td style="width:20px;">Date of Employment:</td><td>02 Nov 2005</td></tr>
<tr><td style="width:20px;">Zones:</td><td>6</td></tr>
<tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>
The text I want is in the 3rd table, whose header is "Design Team".
I am using the code below:
import re
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
aa = soup.find_all(text=re.compile("Date of Employment:"))
bb = aa[2].findNext('td')
print bb.text
The problem is that "Date of Employment:" is sometimes missing from this table. When it is not there, the code picks up the "Date of Employment:" in the next table.
How do I restrict my code to pick up only the value in the table named "Design Team"? Thanks.
Rather than finding every "Date of Employment:" and taking the next td, you can first locate the table whose th is "Design Team" and search only inside that table:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
# find the header cell of the wanted table, then climb to its <table>
header = soup.find("th", text="Design Team")
table = header.find_parent("table")
# look for the label only inside that table
label = table.find("td", text="Date of Employment:")
if label is not None:
    print label.find_next_sibling("td").text
else:
    print "No Date of Employment:"
soup.find("th", text="Design Team") finds the header cell whose text is "Design Team".
header.find_parent("table") climbs to the table containing that header, so the later search is limited to it.
table.find("td", text="Date of Employment:") returns the label cell, or None when that row is missing from this table.
label.find_next_sibling("td").text is the date in the td that immediately follows the label.
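For reference, here is a minimal self-contained sketch in Python 3 of restricting the search to a single table, including the case where the row is missing. It assumes the tables are properly closed, which the excerpt in the question does not show. The sample HTML is trimmed from the question with the "Date of Employment:" row deliberately left out of the Design Team table, and the helper name date_of_employment is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed sample: the Design Team table has no "Date of Employment:" row.
html = """
<table class="table1"><thead><tr><th class="CII">Design Team</th><th class="top">Top</th></tr></thead>
<tbody><tr><td>Designer:</td><td>Michael Linnen</td></tr>
<tr><td>No of Works:</td><td>6</td></tr></tbody></table>
<table class="table1"><thead><tr><th class="CII">Operation Team</th><th class="top">Top</th></tr></thead>
<tbody><tr><td>Manager:</td><td>Nich Sharmen</td></tr>
<tr><td>Date of Employment:</td><td>02 Nov 2005</td></tr></tbody></table>
"""

def date_of_employment(soup, team):
    """Return the date from the table headed `team`, or None if absent."""
    header = soup.find("th", string=team)
    if header is None:
        return None
    # Limit the search to the table that contains the header cell.
    table = header.find_parent("table")
    label = table.find("td", string="Date of Employment:")
    if label is None:
        return None
    return label.find_next_sibling("td").get_text(strip=True)

soup = BeautifulSoup(html, "html.parser")
design_date = date_of_employment(soup, "Design Team")
operation_date = date_of_employment(soup, "Operation Team")
print(design_date, operation_date)
```

Because the lookup starts from the table element rather than from the whole document, a missing row yields None instead of silently falling through to the next table. If the real page leaves its tables unclosed, the result may depend on which parser BeautifulSoup is given.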

Pick up texts from a batch of files, and write them into an Excel file

(Environment: Python 2.7.6 Shell IDLE + BeautifulSoup 4.3.2 + )
I want to pick up some texts from a batch of files (about 50 files), and put them nicely into an Excel file, either row by row, or column by column.
Each file contains text like the sample below:
<tr>
<td width=25%>
Arnold Ed
</td>
<td width=15%>
18 Feb 1959
</td>
</tr>
<tr>
<td width=15%>
男性
</td>
<td width=15%>
02 March 2002
</td>
</tr>
<tr>
<td width=15%>
Guangxi
</td>
</tr>
What I have worked out so far is shown below. The approach is to read the files one by one. The code runs fine up to the text pickup part, but the texts are not written to the Excel file.
from bs4 import BeautifulSoup
import xlwt
list_open = open("c:\\file list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")
for each_file in line_in_list:
    page = open(each_file)
    soup = BeautifulSoup(page.read())
    all_texts = soup.find_all("td")
    for a_t in all_texts:
        a = a_t.renderContents()
        #"print a" here works ok
book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('namelist', cell_overwrite_ok = True)
sheet.write (0, 0, a)
book.save("C:\\details.xls")
Actually, it only writes the last piece of text into the Excel file. How can I get it done correctly?
With laike9m's help, the final version is:
list_open = open("c:\\file list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")
book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('namelist', cell_overwrite_ok = True)
for i,each_file in enumerate(line_in_list):
    page = open(each_file)
    soup = BeautifulSoup(page.read())
    all_texts = soup.find_all("td")
    for j,a_t in enumerate(all_texts):
        a = a_t.renderContents()
        sheet.write (i, j, a)
book.save("C:\\details.xls")
You didn't put the last four lines inside the for loop. I guess that's why only the last piece of text is written to the Excel file.
EDIT
book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('namelist', cell_overwrite_ok = True)
for i, each_file in enumerate(line_in_list):
    page = open(each_file)
    soup = BeautifulSoup(page.read())
    all_texts = soup.find_all("td")
    for j, a_t in enumerate(all_texts):
        a = a_t.renderContents()
        sheet.write(i, j, a)
book.save("C:\\details.xls")
