converting text file to html file with python

I have a text file that contains:
JavaScript 0
/AA 0
OpenAction 1
AcroForm 0
JBIG2Decode 0
RichMedia 0
Launch 0
Colors>2^24 0
uri 0
I wrote this code to convert the text file to HTML:
contents = open("C:\\Users\\Suleiman JK\\Desktop\\Static_hash\\test","r")
with open("suleiman.html", "w") as e:
    for lines in contents.readlines():
        e.write(lines + "<br>\n")
The problem is that in the HTML file the spacing between the two columns on each line is lost:
JavaScript 0
/AA 0
OpenAction 1
AcroForm 0
JBIG2Decode 0
RichMedia 0
Launch 0
Colors>2^24 0
uri 0
What should I do to keep the same content and the two-column layout of the text file?

Just change your code to include <pre> and </pre> tags so that your text keeps the formatting it has in the original text file.
contents = open("C:\\Users\\Suleiman JK\\Desktop\\Static_hash\\test","r")
with open("suleiman.html", "w") as e:
    for lines in contents.readlines():
        e.write("<pre>" + lines + "</pre> <br>\n")

This is HTML -- use BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup()
body = soup.new_tag('body')
soup.insert(0, body)
table = soup.new_tag('table')
body.insert(0, table)
with open('path/to/input/file.txt') as infile:
    for line in infile:
        row = soup.new_tag('tr')
        col1, col2 = line.split()
        for coltext in (col2, col1): # important that you reverse order
            col = soup.new_tag('td')
            col.string = coltext
            row.insert(0, col)
        table.insert(len(table.contents), row)
with open('path/to/output/file.html', 'w') as outfile:
    outfile.write(soup.prettify())

That is because browsers collapse runs of whitespace when they render HTML. There are two ways you could do it (well, probably many more).
One would be to flag it as "preformatted text" by putting it in <pre>...</pre> tags.
The other would be a table (and this is what a table is made for):
<table>
<tr><td>Javascript</td><td>0</td></tr>
...
</table>
Fairly tedious to type out by hand, but easy to generate from your script. Something like this should work:
contents = open("C:\\Users\\Suleiman JK\\Desktop\\Static_hash\\test","r")
with open("suleiman.html", "w") as e:
    e.write("<table>\n")
    for lines in contents.readlines():
        # the % operator needs a tuple here, not the list that split() returns
        e.write("<tr><td>%s</td><td>%s</td></tr>\n" % tuple(lines.split()))
    e.write("</table>\n")
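One detail the table approach has to deal with is that the sample data contains characters such as ">" (in "Colors>2^24") that are special in HTML. The following is a minimal sketch using only the standard library, assuming every non-blank line has at least two whitespace-separated fields; the function name text_to_html_table and the sample string are illustrative, not part of the original code.

```python
import html

def text_to_html_table(text):
    """Turn 'name value' lines into an HTML table, escaping special characters."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last run of whitespace so the value is always the final field.
        name, value = line.rsplit(None, 1)
        rows.append("<tr><td>%s</td><td>%s</td></tr>"
                    % (html.escape(name), html.escape(value)))
    return "<table>\n" + "\n".join(rows) + "\n</table>\n"

sample = "JavaScript 0\nColors>2^24 0\n"
table_html = text_to_html_table(sample)
print(table_html)
```

The result can be written to a file with an ordinary open(..., "w") call, as in the code above.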

You can use a standalone template library like mako or jinja. Here is an example with jinja:
from jinja2 import Template
c = '''<!doctype html>
<html>
<head>
<title>My Title</title>
</head>
<body>
<table>
<thead>
<tr><th>Col 1</th><th>Col 2</th></tr>
</thead>
<tbody>
{% for col1, col2 in lines %}
<tr><td>{{ col1 }}</td><td>{{ col2 }}</td></tr>
{% endfor %}
</tbody>
</table>
</body>
</html>'''
t = Template(c)
lines = []
with open('yourfile.txt', 'r') as f:
    for line in f:
        lines.append(line.split())
with open('results.html', 'w') as f:
    f.write(t.render(lines=lines))
If you can't install jinja, then here is an alternative:
header = '<!doctype html><html><head><title>My Title</title></head><body>'
body = '<table><thead><tr><th>Col 1</th><th>Col 2</th></tr></thead><tbody>'
footer = '</tbody></table></body></html>'
with open('input.txt', 'r') as infile, open('output.html', 'w') as outfile:
    # file objects have no writeln(); use write() with an explicit newline
    outfile.write(header + '\n')
    outfile.write(body + '\n')
    for line in infile:
        col1, col2 = line.rstrip().split()
        outfile.write('<tr><td>{}</td><td>{}</td></tr>\n'.format(col1, col2))
    outfile.write(footer)

I have added a title and loop over the input line by line, wrapping each line in <tr> and <td> tags, so the result is a single-column table. There is no need for separate <td> cells for col1 and col2.
Log snippet:
MUTHU PAGE
2019/08/19 19:59:25 MUTHUKUMAR_TIME_DATE,line: 118 INFO | Logger object created for: MUTHUKUMAR_APP_USER_SIGNUP_LOG
2019/08/19 19:59:25 MUTHUKUMAR_DB_USER_SIGN_UP,line: 48 INFO | ***** User SIGNUP page start *****
2019/08/19 19:59:25 MUTHUKUMAR_DB_USER_SIGN_UP,line: 49 INFO | Enter first name: [Alphabet character only allowed, minimum 3 character to maximum 20 chracter]
html source page:
'''
<?xml version="1.0" encoding="utf-8"?>
<body>
<table>
<p>
MUTHU PAGE
</p>
<tr>
<td>
2019/08/19 19:59:25 MUTHUKUMAR_TIME_DATE,line: 118 INFO | Logger object created for: MUTHUKUMAR_APP_USER_SIGNUP_LOG
</td>
</tr>
<tr>
<td>
2019/08/19 19:59:25 MUTHUKUMAR_DB_USER_SIGN_UP,line: 48 INFO | ***** User SIGNUP page start *****
</td>
</tr>
<tr>
<td>
2019/08/19 19:59:25 MUTHUKUMAR_DB_USER_SIGN_UP,line: 49 INFO | Enter first name: [Alphabet character only allowed, minimum 3 character to maximum 20 chracter]
'''
CODE:
from bs4 import BeautifulSoup
soup = BeautifulSoup(features='xml')
body = soup.new_tag('body')
soup.insert(0, body)
table = soup.new_tag('table')
body.insert(0, table)
# every backslash is doubled: a bare "\x" starts a hex escape and is a syntax error
with open('C:\\Users\\xxxxx\\Documents\\Latest_24_may_2019\\New_27_jun_2019\\DB\\log\\input.txt') as infile:
    title_s = soup.new_tag('p')
    title_s.string = " MUTHU PAGE "
    table.insert(0, title_s)
    for line in infile:
        row = soup.new_tag('tr')
        col1 = list(line.split('\n'))
        col1 = [ each for each in col1 if each != '']
        for coltext in col1:
            col = soup.new_tag('td')
            col.string = coltext
            row.insert(0, col)
        table.insert(len(table.contents), row)
with open('C:\\Users\\xxxx\\Documents\\Latest_24_may_2019\\New_27_jun_2019\\DB\\log\\output.html', 'w') as outfile:
    outfile.write(soup.prettify())

Related

How to get numbers from html?

I want to get
the number after: page=
the number after: "new">
the number after: /a>-
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw 4449-4450<br/> </td>
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw 5111-5550<br/> </td>
<td> </td>
...
My code
tables = soup.find_all('a', attrs={'target': 'new'})
gives me only a list (see below), without the third number
[4449,
5111,
...]
Here is how I would try to extract the three numbers from my list, once it has the third number in it.
list_of_number1 = []
list_of_number2 = []
list_of_number3 = []
regex = re.compile("page=(\d+)")
for table in tables:
    number1 = filter(regex.match, tables)
    number2 = table.next_sibling.strip()
    number3 =
    list_of_number1.append(number1)
    list_of_number2.append(number2)
    list_of_number3.append(number3)
Do I use BeautifulSoup for the third number, or is it feasible to run a regex over the whole HTML for any number following "/a>-"?

You can get all three numbers from each matching a element and the text node immediately to its right:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('a', attrs={'target': 'new'})
print([(t.text, t["href"].split('=')[-1], t.next_sibling.replace('-', '')) for t in tables])
# => [('4449', '99', '4450'), ('5111', '77', '5550')]
You may certainly go the harder way with regexps:
import re
#... the same initialization code as above
results = []
for t in tables:
    page = ""
    page_m = re.search(r"[#?]page=(\d+)", t["href"])
    if page_m:
        page = page_m.group(1)
    num = "".join([x for x in t.next_sibling if x.isdigit()])
    results.append((int(t.text), int(page), int(num)))
print(results)
# => [(4449, 99, 4450), (5111, 77, 5550)]
NOTE:
t.text - gets the element text
t["href"] - gets the href attribute value of the t element
t.next_sibling - gets the next node after current one that is on the same hierarchy level.

You can also try:
for b in soup.select('a'):
    print(b.attrs['href'].split('=')[1], b.text, b.nextSibling)
Output:
99 4449 -4450
77 5111 -5550
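On the question of whether a regex over the whole HTML is feasible: for markup this regular it can work, although it is more brittle than walking the parsed tree. The sketch below uses only the standard library. The markup string is an assumption reconstructed from the outputs shown above (the real href values are not given in the question), and the pattern expects the page parameter, the link text and the trailing number to appear in exactly that order.

```python
import re

# Hypothetical markup: the href values and the cell layout are assumptions.
markup = (
    '<td><a href="viewer.php?page=99" target="new">4449</a>-4450<br/></td>'
    '<td><a href="viewer.php?page=77" target="new">5111</a>-5550<br/></td>'
)

# page=<n> inside the tag, <n> as the link text, then "-<n>" right after </a>
pattern = re.compile(r'page=(\d+)[^>]*>(\d+)</a>-(\d+)')

triples = [tuple(int(g) for g in m.groups()) for m in pattern.finditer(markup)]
print(triples)
```

Each tuple holds the page number, the link text and the number after the closing tag, in that order.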

Show URLs inside list with lxml.builder

I have a need to generate HTML with lxml package. Here is the sample main function that shows how I do it:
def main():
    from lxml.builder import E
    p_persons = []
    person = ['1'] #counter
    person.append('ID')
    person.append('0. https://www.youtube.com/watch?v=qLsn5aNaVkI 1. https://www.youtube.com/watch?v=MPbO6P3Vtx8 2. https://www.youtube.com/watch?v=jVKWPaFuNng 3. https://www.youtube.com/watch?v=9HFyB4gCOqY 4. https://www.youtube.com/watch?v=muQGef4Df_8')
    person.append('birthplace')
    p_persons.append(person)
    page = (
        E.html(
            E.body(
                E.table(
                    *[E.tr(
                        *[
                            E.td(split(col)) if ind == 1 and col is not None else
                            E.td(str(col)) for ind, col in enumerate(row)
                        ]
                    ) for row in p_persons ]
                    , border="2"
                )
            )
        )
    )
    with open('result.html', 'w') as f:
        f.write(etree.tostring(page, pretty_print=True).decode('utf-8'))
def split(col):
    from lxml.builder import E
    import re
    muts = re.split('\d\.',col)
    links = []
    for idx, mut in enumerate(muts):
        print(mut)
        links.append(str(idx + 1))
        links.append(E.a(mut, href=mut))
        links.append('\n')
    return links
All is fine with simple structures like the one above, but sometimes I need to analyze the data and output it to E.td depending on the content.
I build a person element, which is a list of fields, and then put it into the p_persons list intended for output. The field holding the URLs (a string in which the URLs are separated by counters) shows the structure to be output. This string has to be split and the URLs shown as a numbered list inside a single E.td cell.
But E.td does not accept the list when I call E.td(split(col)):
Traceback (most recent call last):
File "<stdin>", line 11, in <module>
File "/home/user/functions.py", line 298, in rows_to_html
) for row in rows ]
File "/home/user/functions.py", line 298, in <listcomp>
) for row in rows ]
File "/home/user/functions.py", line 296, in <listcomp>
E.td(str(col)) for ind, col in enumerate(row)
File "src/lxml/builder.py", line 222, in lxml.builder.ElementMaker.__call__
TypeError: bad argument type: list(['1', <Element a at 0x7f1900117c48>, '\n'])
Here is the HTML sample I want to receive:
<!DOCTYPE html>
<html>
<body>
<table border="2">
<tr>
<td>ID</td>
<td><ol>
<li>https://www.youtube.com/watch?v=qLsn5aNaVkI</li>
<li>https://www.youtube.com/watch?v=MPbO6P3Vtx8</li>
<li>https://www.youtube.com/watch?v=jVKWPaFuNng</li>
<li>https://www.youtube.com/watch?v=9HFyB4gCOqY</li>
<li>https://www.youtube.com/watch?v=muQGef4Df_8</li>
</ol>
</td>
<td>birthplace</td>
</tr>
</table>
</body>
</html>
What is the proper way of doing this? Should I wrap the URLs in a DIV or something else? I didn't find similar examples on the web.
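The traceback suggests that the element factory rejects a list passed as a single argument; children are expected as separate positional arguments. One possible approach, sketched here under that reading, is to build the ol element first and unpack the li children with *. The helper name urls_to_ol and the shortened two-URL row are illustrative assumptions, and the splitting regex assumes the counters always look like "0. " followed by a URL.

```python
import re
from lxml import etree
from lxml.builder import E

def urls_to_ol(text):
    """Build an <ol> with one <li> per URL found in a '0. url 1. url' string."""
    # Split on counters such as "0. " and drop the empty leading piece.
    urls = [u.strip() for u in re.split(r'\d+\.\s+', text) if u.strip()]
    # Children are passed as separate positional arguments, so unpack the list.
    return E.ol(*[E.li(u) for u in urls])

row = [
    'ID',
    '0. https://www.youtube.com/watch?v=qLsn5aNaVkI 1. https://www.youtube.com/watch?v=MPbO6P3Vtx8',
    'birthplace',
]
cells = [E.td(urls_to_ol(col)) if ind == 1 else E.td(col) for ind, col in enumerate(row)]
page = E.html(E.body(E.table(E.tr(*cells), border="2")))
result = etree.tostring(page, pretty_print=True).decode('utf-8')
print(result)
```

The same unpacking would apply to the original split() helper: E.td(*split(col)) passes its strings and elements individually instead of as one list.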

python - print a csv row into a HTML column

I'm trying to print the data of a CSV column into an HTML table
CSV file is like this (sample)
firstname, surname
firstname, surname
firstname, surname
firstname, surname
firstname, surname
firstname, surname
firstname, surname
I can read this data in ok - and get it to print out into a table via the following:
import csv
import sys
from fpdf import FPDF, HTMLMixin
#load in csv file
data = csv.reader(open(sys.argv[1]))
names = ""
#Read column names from first line of the file
fields = data.next()
for row in data:
    names = row[0] + " " + row[1]
    html_row = " <tr> "
    html_col = " <td border=0 width=15%>" + names + "</td></tr>"
    html_out = html_out + html_row + html_col
html = html_header + html_out + html_footer
print html
pdf.write_html(html)
pdf.output('test2.pdf', 'F')
this gives the following:
<tr><td border=0 width=15%>firstname surname</td></tr>
<tr><td border=0 width=15%>firstname surname</td></tr>
<tr><td border=0 width=15%>firstname surname</td></tr>
i.e. every name is on a separate row. What I'd like is to have every name as a cell in a single row:
<tr><td border=0 width=15%>firstname surname</td><td border=0 width=15%>firstname surname</td><td border=0 width=15%>firstname surname</td></tr>
thanks

Some variables in the sample are not set, presumably because it was reduced to a minimal size. If you fill in these details, the following should work (it is suboptimal for more than 7 entries because of the hardcoded 15% width on the td elements).
import csv
import sys
from fpdf import FPDF, HTMLMixin
# load in csv file
data = csv.reader(open(sys.argv[1]))
# Read column names from first line of the file to ignore them
fields = next(data)  # next() works on the reader in Python 2.6+ and in Python 3
Below the loop from question has been replaced by a list comprehension.
cell_att holds the given attributes of the table cell elements and it will be interpolated like row[0] and row[1] into the strings making up the html_out list.
cell_att = " border=0 width=15%"
row_data = ["<td%s>%s %s</td>" % (cell_att, row[0], row[1]) for row in data]
Here one simply joins all cells and injects into a html table row element:
html_out = "<tr>" + "".join(row_data) + "</tr>"
html = html_header + html_out + html_footer
print(html)
pdf.write_html(html)
pdf.output('test2.pdf', 'F')
The other answers also give IMO useful hints esp. on the styling level of the HTML. In case you later decide to digest / transform more than 7 names, the above code might be a good start to create rows with at most 7 cells or to adapt the width attribute value by minor modifications.
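The idea of rows with at most 7 cells can be sketched as follows. This is a standard-library illustration only: the function name names_to_rows, the per_row parameter and the in-memory sample are assumptions, and the fpdf part is left out.

```python
import csv
import io

def names_to_rows(csv_text, per_row=7, cell_att=' border=0 width=15%'):
    """Return HTML table rows holding at most per_row name cells each."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the header line, as the code above does
    cells = ["<td%s>%s %s</td>" % (cell_att, row[0], row[1])
             for row in reader if len(row) >= 2]
    # Slice the flat cell list into groups of per_row and wrap each group in <tr>.
    chunks = [cells[i:i + per_row] for i in range(0, len(cells), per_row)]
    return ["<tr>" + "".join(chunk) + "</tr>" for chunk in chunks]

sample = "first,last\n" + "firstname, surname\n" * 9
rows = names_to_rows(sample)
print(len(rows))
```

With nine names and the default of seven cells per row, the first row holds seven cells and the second holds the remaining two.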

You only need one <tr> to make a single row.
table, td {
  border: solid 1px #CCC;
}
<table>
  <tr>
    <td>firstname surname</td>
    <td>firstname surname</td>
    <td>firstname surname</td>
  </tr>
</table>
To make this work in your code, you need to create your row outside the loop:
html_row = '<tr>' # open row
for row in data:
    names = row[0] + " " + row[1]
    # append columns to the row
    html_row += "<td border=0 width=15%>" + names + "</td>"
html_row += '</tr>' # close row
html_out = html_row

Splitting HTML text by <br> while using beautifulsoup

HTML code:
<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>
I need to get the values 4.5 kn and 7.1 kn as separate list items so I can append them separately. I tried to split the text string using re.sub, but it does not work. I also tried to use replace to remove the br, but that did not work either. Can anybody provide any insight?
Python code:
def NameSearch(shipLink, mmsi, shipName):
    from bs4 import BeautifulSoup
    import urllib2
    import csv
    import re
    values = []
    values.append(mmsi)
    values.append(shipName)
    regex = re.compile(r'[\n\r\t]')
    i = 0
    with open('Ship_indexname.csv', 'wb')as f:
        writer = csv.writer(f)
        while True:
            try:
                shipPage = urllib2.urlopen(shipLink, timeout=5)
            except urllib2.URLError:
                continue
            except:
                continue
            break
        soup = BeautifulSoup(shipPage, "html.parser") # Read the web page HTML
        #soup.find('br').replaceWith(' ')
        #for br in soup('br'):
            #br.extract()
        table = soup.find_all("table", {"id": "vessel-related"}) # Finds table with class table1
        for mytable in table: #Loops tables with class table1
            table_body = mytable.find_all('tbody') #Finds tbody section in table
            for body in table_body:
                rows = body.find_all('tr') #Finds all rows
                for tr in rows: #Loops rows
                    cols = tr.find_all('td') #Finds the columns
                    for td in cols: #Loops the columns
                        checker = td.text.encode('ascii', 'ignore')
                        check = regex.sub('', checker)
                        if check == ' Speed (avg./max): ':
                            i = 1
                        elif i == 1:
                            print td.text
                            pat=re.compile('<br\s*/>')
                            print pat.sub(" ",td.text)
                            values.append(td.text.strip("\n").encode('utf-8')) #Takes the second columns value and assigns it to a list called Values
                            i = 0
    #print values
    return values
NameSearch('https://www.fleetmon.com/vessels/kind-of-magic_0_3478642/','230034570','KIND OF MAGIC')

Locate the "Speed (avg./max)" label first and then go to the value via .find_next():
from bs4 import BeautifulSoup
data = '<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>'
soup = BeautifulSoup(data, "html.parser")
label = soup.find("label", class_="identifier", text="Speed (avg./max):")
value = label.find_next("td", class_="value").get_text(strip=True)
print(value) # prints 4.5 kn7.1 kn
Now, you can extract the actual numbers from the string:
import re
speed_values = re.findall(r"([0-9.]+) kn", value)
print(speed_values)
Prints ['4.5', '7.1'].
You can then further convert the values to floats and unpack into separate variables:
avg_speed, max_speed = map(float, speed_values)

python parsing beautiful soup data to csv

I have written code in Python 3 to parse an html/css table. I have a few issues with it:
my csv output file headers are not generated based on the html (tag: td, class: t1) by my code (on the first run, when the output file is being created)
if the incoming html table has a few additional fields (tag: td, class: t1), my code cannot currently capture them and create additional headers in the csv output file
the data is not written to the output csv file until ALL the ids (A001, A002, A003...) from my input file are processed. I want to write to the output csv file when the processing of each id from my input file is completed (i.e. A001 to be written to csv before processing A002).
whenever I rerun the code, the data does not begin from the next line in the output csv
Being a noob, I am sure my code is very rudimentary and that there is a better way to do this; I would like to learn to write this better and fix the above as well.
I need advice and guidance, please help. Thank you.
My Code:
import csv
import requests
from bs4 import BeautifulSoup
## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
SIDFile = open('SIDs.csv')
SIDReader = csv.reader(SIDFile)
SID = list(SIDReader)
SqID_data = []
#create and open output file
with open('output.csv','a', newline='') as csv_h:
    fields = \
    [
        "ID",
        "Financial Year",
        "Total Income",
        "Total Expenses",
        "Tax Expense",
        "Net Profit"
    ]
    for row in SID:
        col1,col2 = row
        SID ="%s" % (col2)
        url = requests.get("http://.......")
        soup = BeautifulSoup(url.text, "lxml")
        fy = soup.findAll('td',{'class':'tablehead'})
        titles = soup.findAll('td',{'class':'t1'})
        values = soup.findAll('td',{'class':'t0'})
        if titles:
            data = {}
            for title in titles:
                name = title.find("td", class_ = "t1")
                data["ID"] = SID
                data["Financial Year"] = fy[0].string.strip()
                data["Total Income"] = values[0].string.strip()
                data["Total Expenses"] = values[1].string.strip()
                data["Tax Expense"] = values[2].string.strip()
                data["Net Profit"] = values[3].string.strip()
                SqID_data.append(data)
    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting = csv.QUOTE_ALL,
        extrasaction = "ignore",
        dialect = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(SqID_data)
    print("write rows complete")
Excerpt of HTML being processed:
<p>
<TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
<TR>
<TD class=tablehead>Financial Year</TD>
<TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
</TR>
</TABLE>
</p>
<p>
<br>
<table cellpadding=3 cellspacing=1 class=vTable>
<TR>
<TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
<TD class=t0 nowrap>675529.00</td>
</tr>
<TR>
<TD class=t1><b>Total expenses</b></td>
<TD class=t0 nowrap>446577.00</td>
</tr>
<TR>
<TD class=t1>Tax expense</td>
<TD class=t0 nowrap>71708.00</td>
</tr>
<TR>
<TD class=t1><b>Net Profit / (Loss)</b></td>
<TD class=t0 nowrap>157621</td>
</tr>
</table>
</p>
SIDs.csv (no header row)
1,A0001
2,A0002
3,A0003
Expected Output: output.csv (create header row)
ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
A002,....
A003,....

I would recommend looking at pandas.read_html for parsing your web data; on your sample data this gives you:
import pandas as pd
tables=pd.read_html(s, index_col=0)
tables[0]
Out[11]:
                                         1
0
Financial Year  01-Apr-2015 To 31-Mar-2016
tables[1]
                                                  1
0
Total income from operations (net) ( a + b)  675529
Total expenses                               446577
Tax expense                                   71708
Net Profit / (Loss)                          157621
You can then do whatever data manipulation you need (adding ids etc.) using pandas functions, and then export with DataFrame.to_csv.
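If you would rather stay with the csv module, the third and fourth issues in the question (writing each id as soon as it is processed, and continuing on a new line on reruns) can be handled by appending one record at a time and writing the header only when the file is new. This is a sketch under those assumptions: append_record is a hypothetical helper, the demo writes to a temporary directory, and the second record is a made-up placeholder.

```python
import csv
import os
import tempfile

FIELDS = ["ID", "Financial Year", "Total Income",
          "Total Expenses", "Tax Expense", "Net Profit"]

def append_record(path, record, fields=FIELDS):
    """Append one record to path, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        if need_header:
            writer.writeheader()
        writer.writerow(record)  # the file is closed, hence flushed, after every record

# Demo in a temporary directory; the records are placeholders.
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
append_record(out_path, {"ID": "A0001",
                         "Financial Year": "01-Apr-2015 To 31-Mar-2016",
                         "Total Income": "675529.00"})
append_record(out_path, {"ID": "A0002", "Net Profit": "157621"})
with open(out_path, newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

Calling append_record inside the loop over ids, right after each page is parsed, means a crash partway through still leaves the earlier ids on disk. Keys missing from a record are written as empty cells.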
