python - print a csv row into a HTML column - python

I'm trying to print the data from a CSV file into an HTML table.
The CSV file looks like this (sample):
firstname, surname
firstname, surname
firstname, surname
firstname, surname
firstname, surname
firstname, surname
firstname, surname
I can read this data in fine and print it out as a table with the following:
import csv
import sys
from fpdf import FPDF, HTMLMixin
#load in csv file
data = csv.reader(open(sys.argv[1]))
names = ""
#Read column names from first line of the file
fields = data.next()
for row in data:
    names = row[0] + " " + row[1]
    html_row = " <tr> "
    html_col = " <td border=0 width=15%>" + names + "</td></tr>"
    html_out = html_out + html_row + html_col
html = html_header + html_out + html_footer
print html
pdf.write_html(html)
pdf.output('test2.pdf', 'F')
this gives the following:
<tr><td border=0 width=15%>firstname surname</td></tr>
<tr><td border=0 width=15%>firstname surname</td></tr>
<tr><td border=0 width=15%>firstname surname</td></tr>
i.e. every name is on a separate row. What I'd like is to have every name as a cell (column) in a single row:
<tr><td border=0 width=15%>firstname surname</td><td border=0 width=15%>firstname surname</td><td border=0 width=15%>firstname surname</td></tr>
thanks

Some variables in the sample are not set, but I guess this is due to the sample having been reduced to a minimal size. If you fill in these details, the following should work (it is suboptimal for more than 7 entries because of the hardcoded 15% width on the HTML td element).
import csv
import sys
from fpdf import FPDF, HTMLMixin
# load in csv file
data = csv.reader(open(sys.argv[1]))
# Read column names from first line of the file to ignore them
fields = data.next()
Below, the loop from the question has been replaced by a list comprehension.
cell_att holds the given attributes of the table cell elements; it is interpolated, together with row[0] and row[1], into the strings that make up the row_data list.
cell_att = " border=0 width=15%"
row_data = ["<td%s>%s %s</td>" % (cell_att, row[0], row[1]) for row in data]
Here one simply joins all the cells and wraps them in a single HTML table row element:
html_out = "<tr>" + "".join(row_data) + "</tr>"
html = html_header + html_out + html_footer
print html
pdf.write_html(html)
pdf.output('test2.pdf', 'F')
The other answers also give, IMO, useful hints, especially on the styling level of the HTML. In case you later decide to process more than 7 names, the above code might be a good starting point: with minor modifications it can create rows with at most 7 cells each or adapt the value of the width attribute.
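The "rows with at most 7 cells" idea mentioned above could look roughly like the following Python 3 sketch. The helper name rows_of_cells, the constant MAX_CELLS_PER_ROW and the in-memory sample data are assumptions made for illustration; they are not part of the original code.

```python
import csv
import io

MAX_CELLS_PER_ROW = 7  # hypothetical limit, chosen to suit the 15% cell width


def rows_of_cells(names, per_row=MAX_CELLS_PER_ROW):
    """Return HTML <tr> strings holding at most per_row <td> cells each."""
    cell_att = " border=0 width=15%"
    cells = ["<td%s>%s</td>" % (cell_att, name) for name in names]
    # Slice the flat list of cells into consecutive chunks of per_row.
    chunks = [cells[i:i + per_row] for i in range(0, len(cells), per_row)]
    return ["<tr>" + "".join(chunk) + "</tr>" for chunk in chunks]


# In-memory stand-in for the CSV file given on the command line.
sample = io.StringIO(
    "firstname,surname\n"
    + "".join("first%d,last%d\n" % (i, i) for i in range(9))
)
reader = csv.reader(sample)
next(reader)  # skip the header line (Python 3 spelling of data.next())
names = ["%s %s" % (row[0], row[1]) for row in reader]

html_rows = rows_of_cells(names)
print("\n".join(html_rows))
```

With nine names this yields two rows, one with seven cells and one with two. The joined rows would take the place of html_out between html_header and html_footer.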

You only need one <tr> to make a single row.
table, td {
border: solid 1px #CCC;
}
<table>
<tr>
<td>firstname surname</td>
<td>firstname surname</td>
<td>firstname surname</td>
</tr>
</table>
In order to make this work in your code, you need to create your row outside the loop:
html_row = '<tr>' # open row
for row in data:
    names = row[0] + " " + row[1]
    # append columns to the row
    html_row += "<td border=0 width=15%>" + names + "</td>"
html_row += '</tr>' # close row
html_out = html_row

Related

Converting a HTML table to a CSV in Python

I am trying to convert a table in HTML to a csv in Python. The table I am trying to extract is this one:
<table class="tblperiode">
<caption>Dades de període</caption>
<tr>
<th class="sortable"><span class="tooltip" title="Període (Temps Universal)">Període</span><br/>TU</th>
<th><span class="tooltip" title="Temperatura mitjana (°C)">TM</span><br/>°C</th>
<th><span class="tooltip" title="Temperatura màxima (°C)">TX</span><br/>°C</th>
<th><span class="tooltip" title="Temperatura mínima (°C)">TN</span><br/>°C</th>
<th><span class="tooltip" title="Humitat relativa mitjana (%)">HRM</span><br/>%</th>
<th><span class="tooltip" title="Precipitació (mm)">PPT</span><br/>mm</th>
<th><span class="tooltip" title="Velocitat mitjana del vent (km/h)">VVM (10 m)</span><br/>km/h</th>
<th><span class="tooltip" title="Direcció mitjana del vent (graus)">DVM (10 m)</span><br/>graus</th>
<th><span class="tooltip" title="Ratxa màxima del vent (km/h)">VVX (10 m)</span><br/>km/h</th>
<th><span class="tooltip" title="Irradiància solar global mitjana (W/m2)">RS</span><br/>W/m<sup>2</sup></th>
</tr>
<tr>
<th>
00:00 - 00:30
</th>
<td>16.2</td>
<td>16.5</td>
<td>15.4</td>
<td>93</td>
<td>0.0</td>
<td>6.5</td>
<td>293</td>
<td>10.4</td>
<td>0</td>
</tr>
<tr>
<th>
00:30 - 01:00
</th>
<td>16.4</td>
<td>16.5</td>
<td>16.1</td>
<td>90</td>
<td>0.0</td>
<td>5.8</td>
<td>288</td>
<td>8.6</td>
<td>0</td>
</tr>
And I want it to look something like this:
To achieve this, I have parsed the HTML and managed to build a DataFrame with the data as follows:
from bs4 import BeautifulSoup
import csv
import pandas as pd
html = open("table.html").read()
soup = BeautifulSoup(html)
table = soup.select_one("table.tblperiode")
output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
df = pd.DataFrame(output_rows)
print(df)
However, I would also like to have the column names and a column indicating the time interval. In the HTML example above only two intervals appear: 00:00 - 00:30 and 00:30 - 01:00. Therefore my table should have two rows, one with the observations for 00:00 - 00:30 and another with the observations for 00:30 - 01:00.
How could I get this information from my HTML?
Here's a way of doing it, it's probably not the nicest way but it works! You can read through the comments to figure out what the code is doing!
from bs4 import BeautifulSoup
import csv
#read the html
html = open("table.html").read()
soup = BeautifulSoup(html, 'html.parser')
# get the table from html
table = soup.select_one("table.tblperiode")
# find all rows
rows = table.findAll('tr')
# strip the header from rows
headers = rows[0]
header_text = []
# add the header text to array
for th in headers.findAll('th'):
    header_text.append(th.text)
# init row text array
row_text_array = []
# loop through rows and add row text to array
for row in rows[1:]:
    row_text = []
    # loop through the elements
    for row_element in row.findAll(['th', 'td']):
        # append the array with the elements inner text
        row_text.append(row_element.text.replace('\n', '').strip())
    # append the text array to the row text array
    row_text_array.append(row_text)
# output csv
with open("out.csv", "w") as f:
    wr = csv.writer(f)
    wr.writerow(header_text)
    # loop through each row array
    for row_text_single in row_text_array:
        wr.writerow(row_text_single)
With this script:
import csv
from bs4 import BeautifulSoup
html = open('table.html').read()
soup = BeautifulSoup(html, features='lxml')
table = soup.select_one('table.tblperiode')
rows = []
for i, table_row in enumerate(table.findAll('tr')):
    if i > 0:
        periode = [' '.join(table_row.findAll('th')[0].text.split())]
        data = [x.text for x in table_row.findAll('td')]
        rows.append(periode + data)
header = ['Periode', 'TM', 'TX', 'TN', 'HRM', 'PPT', 'VVM', 'DVM', 'VVX', 'RS']
with open('result.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(header)
    w.writerows(rows)
I've managed to generate the following CSV file as output:
Periode,TM,TX,TN,HRM,PPT,VVM,DVM,VVX,RS
00:00 - 00:30,16.2,16.5,15.4,93,0.0,6.5,293,10.4,0
00:30 - 01:00,16.4,16.5,16.1,90,0.0,5.8,288,8.6,0
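The two answers above clean up the cell text in slightly different ways: one removes newlines and strips the ends, the other splits on whitespace and re-joins. A minimal sketch of the difference, assuming a hypothetical cell whose text wraps across lines (raw_cell is made up for illustration):

```python
# Hypothetical cell text in which the interval wraps across two lines.
raw_cell = "\n    00:00 -\n    00:30\n"

# First answer's approach: drop newlines, then strip the ends.
removed = raw_cell.replace("\n", "").strip()

# Second answer's approach: split on any whitespace and re-join with one space.
collapsed = " ".join(raw_cell.split())

print(repr(removed))
print(repr(collapsed))
```

For the HTML excerpt in the question both approaches give the same result; they differ only when indentation ends up inside the text, where the split/join form collapses it to a single space.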
import csv
from bs4 import BeautifulSoup
import pandas as pd
html = open('test.html').read()
soup = BeautifulSoup(html, features='lxml')
#Specify table name which you want to read.
#Example: <table class="queryResults" border="0" cellspacing="1">
table = soup.select_one('table.queryResults')
def get_all_tables(soup):
    return soup.find_all("table")
tbls = get_all_tables(soup)
for i, tablen in enumerate(tbls, start=1):
    print(i)
    print(tablen)
def get_table_headers(table):
    headers = []
    for th in table.find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers
head = get_table_headers(table)
#print(head)
def get_table_rows(table):
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, search for th tags
            # can be found especially in wikipedia tables below the table
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows
table_rows = get_table_rows(table)
#print(table_rows)
def save_as_csv(table_name, headers, rows):
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv")
save_as_csv("Test_table", head, table_rows)

Delete HTML element if it contains a certain amount of numeric characters

For transforming a html-formatted file to a plain text file with Python, I need to delete all tables if the text within the table contains more than 40% numeric characters.
Specifically, I would like to:
identify each table element in an HTML file
calculate the number of numeric and alphabetic characters in the text and the corresponding ratio, not considering characters within any HTML tags. Thus, delete all HTML tags.
delete the table if its text is composed of more than 40% numeric characters. Keep the table if it contains less than 40% numeric characters.
I defined a function that is called when the re.sub command is run. The rawtext variable contains the whole html-formatted text I want to parse. Within the function, I try to process the steps described above and return a html-stripped version of the table or a blank space, depending on the ratio of numeric characters. However, the first re.sub command within the function seems to delete not only tags, but everything, including the textual content.
def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        ratio = numeric / (numeric + alphabetic)
        print('ratio = ' + ratio)
    except ZeroDivisionError as err:
        ratio = 1
    if ratio > 0.4:
        emptystring = re.sub('.*?', ' ', table, flags=re.DOTALL)
        return emptystring
    else:
        return table
rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
If you have an idea on what might be wrong with this code, I would be very happy if you share it with me. Thank you!
As I suggested in the comments, I wouldn't use regex to parse HTML in code. Instead, you could use a Python library built for this purpose, such as BeautifulSoup.
Here is an example of how to use it:
#!/usr/bin/python
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = """<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
</div>
</body>
</html>"""
parsed_html = BeautifulSoup(html, 'html.parser')
print parsed_html.body.find('table').text
So you could end up with code like this (just to give you an idea):
#!/usr/bin/python
import re
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: '+ str(ratio))
    except ZeroDivisionError as err:
        ratio = 1
    if ratio > 0.4:
        return True
    else:
        return False
table = """<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>3241424134213424214321342424214321412</td>
<td>213423423234242142134214124214214124124</td>
<td>213424214234242</td>
</tr>
<tr>
<td>124234412342142414</td>
<td>1423424214324214</td>
<td>2134242141242341241</td>
</tr>
</table>
"""
if tablereplace(table):
    print 'replace table'
    parsed_html = BeautifulSoup(table, 'html.parser')
    rawdata = parsed_html.find('table').text
    print rawdata
UPDATE:
Anyway, this single line of your code already strips away all HTML tags, as you know, because you are using it for the character/digit counting:
table = re.sub('<[^>]*>', ' ', str(table))
But it's not safe, because you could also have < or > characters inside the text of your tags, or the HTML could be malformed.
I left it where it is because it works for the example. But consider using BeautifulSoup for all HTML handling.
Thank you for your replies so far!
After intensive research, I found the solution to the mysterious deletion of the whole match. re.sub passes a match object, not a string, to the replacement function, so str(table) yields the textual representation of that object instead of the matched text, which is why it seemed that the function only considered the first 150 or so characters of the match. If you specify table = table.group(0), the whole match is processed. group(0) accounts for the big difference here.
Below you can find my updated script that works properly (it also includes some other minor changes):
def tablereplace(table):
    table = table.group(0)
    table = re.sub('<[^>]*>', '\n', table)
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        ratio = numeric / (numeric + alphabetic)
    except ArithmeticError:
        ratio = 1
    else:
        pass
    if ratio > 0.4:
        emptystring = ''
        return emptystring
    else:
        return table
rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
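To see the group(0) point in isolation, here is a small self-contained Python 3 sketch. The sample rawtext and the names TABLE_RE and TAG_RE are assumptions made for illustration; the 40% threshold and the tag-stripping pattern are taken from the script above.

```python
import re

TABLE_RE = re.compile(r"<table.+?</table>", flags=re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")


def tablereplace(match):
    """Drop number-heavy tables; otherwise return the table text without tags."""
    text = TAG_RE.sub(" ", match.group(0))  # group(0) is the full matched string
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    total = numeric + alphabetic
    ratio = numeric / total if total else 1.0
    return "" if ratio > 0.4 else text


# Hypothetical sample input: one mostly textual table, one mostly numeric.
rawtext = (
    "<p>Intro</p>"
    "<table><tr><td>Jill</td><td>Smith</td><td>50</td></tr></table>"
    "<p>Middle</p>"
    "<table><tr><td>12345</td><td>67890</td><td>n</td></tr></table>"
)

first = TABLE_RE.search(rawtext)
# str() of a match object describes the object; it is not the matched text.
same = str(first) == first.group(0)
print(same)

cleaned = TABLE_RE.sub(tablereplace, rawtext)
print(cleaned)
```

The first table (2 digits out of 11 alphanumeric characters) is kept as plain text, while the second (10 out of 11) is removed. Text outside the tables is left untouched because only the matched spans are passed to the callback.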

Splitting HTML text by <br> while using beautifulsoup

HTML code:
<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>
I need to get the values 4.5 kn and 7.1 kn as separate list items so that I can append them separately. I tried to split the text string using re.sub, but it does not work. I also tried to use replace to replace the br, but that did not work either. Can anybody provide any insight?
Python code:
def NameSearch(shipLink, mmsi, shipName):
    from bs4 import BeautifulSoup
    import urllib2
    import csv
    import re
    values = []
    values.append(mmsi)
    values.append(shipName)
    regex = re.compile(r'[\n\r\t]')
    i = 0
    with open('Ship_indexname.csv', 'wb')as f:
        writer = csv.writer(f)
        while True:
            try:
                shipPage = urllib2.urlopen(shipLink, timeout=5)
            except urllib2.URLError:
                continue
            except:
                continue
            break
        soup = BeautifulSoup(shipPage, "html.parser") # Read the web page HTML
        #soup.find('br').replaceWith(' ')
        #for br in soup('br'):
            #br.extract()
        table = soup.find_all("table", {"id": "vessel-related"}) # Finds table with class table1
        for mytable in table: #Loops tables with class table1
            table_body = mytable.find_all('tbody') #Finds tbody section in table
            for body in table_body:
                rows = body.find_all('tr') #Finds all rows
                for tr in rows: #Loops rows
                    cols = tr.find_all('td') #Finds the columns
                    for td in cols: #Loops the columns
                        checker = td.text.encode('ascii', 'ignore')
                        check = regex.sub('', checker)
                        if check == ' Speed (avg./max): ':
                            i = 1
                        elif i == 1:
                            print td.text
                            pat=re.compile('<br\s*/>')
                            print pat.sub(" ",td.text)
                            values.append(td.text.strip("\n").encode('utf-8')) #Takes the second columns value and assigns it to a list called Values
                            i = 0
    #print values
    return values
NameSearch('https://www.fleetmon.com/vessels/kind-of-magic_0_3478642/','230034570','KIND OF MAGIC')
Locate the "Speed (avg./max)" label first and then go to the value via .find_next():
from bs4 import BeautifulSoup
data = '<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>'
soup = BeautifulSoup(data, "html.parser")
label = soup.find("label", class_="identifier", text="Speed (avg./max):")
value = label.find_next("td", class_="value").get_text(strip=True)
print(value) # prints 4.5 kn7.1 kn
Now, you can extract the actual numbers from the string:
import re
speed_values = re.findall(r"([0-9.]+) kn", value)
print(speed_values)
Prints ['4.5', '7.1'].
You can then further convert the values to floats and unpack into separate variables:
avg_speed, max_speed = map(float, speed_values)
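If you prefer to get the two values as separate items directly, instead of re-splitting the concatenated string, you can collect the text on each side of the br as its own fragment. The sketch below shows this idea with the standard library's html.parser rather than BeautifulSoup, so it is a swapped-in technique and not the answer's method; the class name ValueCellParser is hypothetical.

```python
from html.parser import HTMLParser


class ValueCellParser(HTMLParser):
    """Collect the text fragments found inside <td class="value"> cells."""

    def __init__(self):
        super().__init__()
        self.in_value_cell = False
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "value":
            self.in_value_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_value_cell = False

    def handle_data(self, data):
        # A <br> ends one run of text and starts the next, so each side
        # of it arrives here as a separate fragment.
        if self.in_value_cell and data.strip():
            self.fragments.append(data.strip())


data = ('<td> <label class="identifier">Speed (avg./max):</label> </td> '
        '<td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>')

parser = ValueCellParser()
parser.feed(data)
parser.close()

speed_values = parser.fragments
print(speed_values)

# Keep only the numeric part of each fragment.
avg_speed, max_speed = (float(value.split()[0]) for value in speed_values)
```

Each fragment keeps its unit, so the list can be appended to values as is, or reduced to floats as in the last line.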

python parsing beautiful soup data to csv

I have written code in Python 3 to parse an HTML/CSS table. I have a few issues with it:
my CSV output file headers are not generated from the HTML (tag: td, class: t1) by my code (on the first run, when the output file is being created)
if the incoming HTML table has a few additional fields (tag: td, class: t1), my code cannot currently capture them and create additional headers in the CSV output file
the data is not written to the output CSV file until ALL the IDs (A001, A002, A003...) from my input file are processed. I want to write to the output CSV file as soon as the processing of each ID from my input file is completed (i.e. A001 should be written to the CSV before A002 is processed).
whenever I rerun the code, the data does not begin on the next line in the output CSV
Being a noob, I am sure my code is very rudimentary and that there is a better way to do this. I would like to learn to write it better and fix the above as well.
I need advice and guidance, please help. Thank you.
My Code:
import csv
import requests
from bs4 import BeautifulSoup
## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
SIDFile = open('SIDs.csv')
SIDReader = csv.reader(SIDFile)
SID = list(SIDReader)
SqID_data = []
#create and open output file
with open('output.csv','a', newline='') as csv_h:
    fields = \
    [
        "ID",
        "Financial Year",
        "Total Income",
        "Total Expenses",
        "Tax Expense",
        "Net Profit"
    ]
    for row in SID:
        col1,col2 = row
        SID ="%s" % (col2)
        url = requests.get("http://.......")
        soup = BeautifulSoup(url.text, "lxml")
        fy = soup.findAll('td',{'class':'tablehead'})
        titles = soup.findAll('td',{'class':'t1'})
        values = soup.findAll('td',{'class':'t0'})
        if titles:
            data = {}
            for title in titles:
                name = title.find("td", class_ = "t1")
            data["ID"] = SID
            data["Financial Year"] = fy[0].string.strip()
            data["Total Income"] = values[0].string.strip()
            data["Total Expenses"] = values[1].string.strip()
            data["Tax Expense"] = values[2].string.strip()
            data["Net Profit"] = values[3].string.strip()
            SqID_data.append(data)
    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting = csv.QUOTE_ALL,
        extrasaction = "ignore",
        dialect = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(SqID_data)
    print("write rows complete")
Excerpt of HTML being processed:
<p>
<TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
<TR>
<TD class=tablehead>Financial Year</TD>
<TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
</TR>
</TABLE>
</p>
<p>
<br>
<table cellpadding=3 cellspacing=1 class=vTable>
<TR>
<TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
<TD class=t0 nowrap>675529.00</td>
</tr>
<TR>
<TD class=t1><b>Total expenses</b></td>
<TD class=t0 nowrap>446577.00</td>
</tr>
<TR>
<TD class=t1>Tax expense</td>
<TD class=t0 nowrap>71708.00</td>
</tr>
<TR>
<TD class=t1><b>Net Profit / (Loss)</b></td>
<TD class=t0 nowrap>157621</td>
</tr>
</table>
</p>
SIDs.csv (no header row)
1,A0001
2,A0002
3,A0003
Expected Output: output.csv (create header row)
ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
A002,....
A003,....
I would recommend looking at pandas.read_html for parsing your web data; on your sample data this gives you:
import pandas as pd
tables=pd.read_html(s, index_col=0)
tables[0]
Out[11]:
                                         1
0
Financial Year  01-Apr-2015 To 31-Mar-2016
tables[1]
                                                  1
0
Total income from operations (net) ( a + b)  675529
Total expenses                               446577
Tax expense                                   71708
Net Profit / (Loss)                          157621
You can then do whatever data manipulation you need (adding IDs, etc.) using pandas functions, and then export with DataFrame.to_csv.
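Independently of how the values are parsed, two of the issues in the question (rows only written at the very end, and reruns not continuing cleanly) come down to how the output file is written. Below is a minimal sketch using only the csv module, assuming a fixed field list; the helper append_record and the sample records are hypothetical and stand in for the values scraped per ID.

```python
import csv
import os
import tempfile

FIELDS = ["ID", "Financial Year", "Total Income", "Total Expenses",
          "Tax Expense", "Net Profit"]


def append_record(path, record, fields=FIELDS):
    """Append one record to the CSV at path, writing the header only once."""
    # The header is needed only when the file is missing or still empty.
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields,
                                extrasaction="ignore")
        if need_header:
            writer.writeheader()
        writer.writerow(record)
    # Leaving the with-block closes the file, so the row is on disk
    # before the next ID is processed.


# Hypothetical records standing in for the values scraped for each ID.
records = [
    {"ID": "A0001", "Financial Year": "01-Apr-2015 To 31-Mar-2016",
     "Total Income": "675529.00", "Total Expenses": "446577.00",
     "Tax Expense": "71708.00", "Net Profit": "157621"},
    {"ID": "A0002", "Financial Year": "01-Apr-2015 To 31-Mar-2016",
     "Total Income": "1.00", "Total Expenses": "2.00",
     "Tax Expense": "3.00", "Net Profit": "4"},
]

out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
for record in records:
    append_record(out_path, record)  # one write per ID

with open(out_path, newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

Because the header is written only when the file is missing or empty, rerunning the script appends new rows below the existing ones. Fields that are not in the fixed list are silently dropped by extrasaction="ignore"; capturing newly discovered fields as extra columns would require rewriting the header, which this sketch does not attempt.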

Parsing html elements using BeautifulSoup

Suppose I have:
<tr>
<td class="prodSpecAtribute">word</td>
<td colspan="5">
another_word
</td>
</tr>
I want to extract the text in the two td elements (word and another_word).
So I used BeautifulSoup:
This is the code Matijn Pieters was asking for:
Basically, it grabs info from an HTML page (from a table) and stores the values in a left and a right column list. Then, I create a dictionary from these details (using the left column list for the keys and the right column list for the values).
def get_data(page):
    soup = BeautifulSoup(page)
    left = []
    right = []
    #Obtain data from table and store into left and right columns
    #Iterate through each row
    for tr in soup.findAll('tr'):
        #Find all table data(cols) in that row
        tds = tr.findAll('td')
        #Make sure there are 2 elements, a col and a row
        if len(tds) >= 2:
            #Find each entry in a row -> convert to text
            right_col = []
            inp = []
            once = 0
            no_class = 0
            for td in tds:
                if once == 0:
                    #Check if of class 'prodSpecAtribute'
                    if check(td) == True:
                        left_col = td.findAll(text=True)
                        left_col_x = re.sub('&\w+;', '', str(left_col[0]))
                        once = 1
                    else:
                        no_class = 1
                        break
                else:
                    right_col = td.findAll(text=True)
                    right_col_x = ' '.join(text for text in right_col if text.strip())
                    right_col_x = re.sub('&\w+;', '', right_col_x)
                    inp.append(right_col_x)
            if no_class == 0:
                inps = '. '.join(inp)
                left.append(left_col_x)
                right.append(inps)
    #Create a Dictionary for left and right cols
    item = dict(zip(left, right))
    return item
You may use HTQL (http://htql.net).
Here is how it works for your example:
import htql
page="""
<tr>
<td class="prodSpecAtribute">word</td>
<td colspan="5">
another_word
</td>
</tr>
"""
query = """
<tr>{
c1 = <td (class='prodSpecAtribute')>1 &tx;
c2 = <td>2 &tx &trim;
}
"""
a=htql.query(page, query)
print(dict(a))
It prints:
{'word': 'another_word'}
