I have two input files: an HTML one, and a CSS file for it. I want to perform some operations on the HTML file based on the contents of the CSS file.
My HTML is like this:
<html>
<head>
<title></title>
</head>
<body>
<p class = "cl1" id = "id1"> <span id = "span1"> blabla</span> </p>
<p class = "cl2" id = "id2"> <span id = "span2"> blablabla</span> <span id = "span3"> qwqwqw </span> </p>
</body>
</html>
Styles for the span ids are defined in the CSS file (individually for each span id!).
Before doing the real work (deleting spans based on their style) I was trying just to print out the ids from the HTML and the style description from the CSS corresponding to each id.
Code:
from lxml import etree

tree = etree.parse("file.html")
filein = "file.css"

def f1():
    with open(filein, 'rU') as f:
        for span in tree.iterfind('//span'):
            for line in f:
                if span and span.attrib.has_key('id'):
                    x = span.get('id')
                    if "af" not in x and x in line:
                        print x, line

def main():
    f1()
So, there are two for-loops, each of which iterates perfectly on its own, but when they are put together in this function the iteration stops after the first pass:
>> span1 #span1 { font-weight: bold; font-size: 11.0pt; font-style: normal; letter-spacing: 0em }
How can I fix this?
If, as I think, the tree is completely loaded in memory, you could try to reverse the loops. That way, you only read the file filein once:
def f1():
    with open(filein, 'rU') as f:
        for line in f:
            for span in tree.iterfind('//span'):
                if span and span.attrib.has_key('id'):
                    x = span.get('id')
                    if "af" not in x and x in line:
                        print x, line
It happens because you have already read all of filein's lines by the time the second iteration of the outer loop begins.
To make it work, you need to add f.seek(0) before starting the inner loop over filein:
with open(filein, 'rU') as f:
    for span in tree.iterfind('//span'):
        f.seek(0)
        for line in f:
            if span and span.attrib.has_key('id'):
                x = span.get('id')
                if "af" not in x and x in line:
                    print x, line
My HTML file has the format shown below
<unit id="2" status="FINISHED" type="pe">
<S producer="Alice_EN">CHAPTER I Down the Rabbit-Hole</S>
<MT producer="ALICE_GG">CAPÍTULO I Abaixo do buraco de coelho</MT>
<annotations revisions="1">
<annotation r="1">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
CAPÍTULO I Descendo pela toca do coelho
</body>
</html></PE>
I need to extract ALL the content from two tags in the entire HTML file. The content of the tag that starts with "<unit id ...>" is on one line, but the content of the other tag, which starts with "<PE producer ..." and ends with "</PE>", is spread over several lines. I need to extract the content within these two tags and write it to a new file, one after the other. My output should be:
<unit id="2" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
CAPÍTULO I Descendo pela toca do coelho
</body>
</html></PE>
My code does not extract the content from all the tags of the file. Does anyone have a clue what is going on and how I can make this code work properly?
import codecs
import re

t = codecs.open('ALICE.per1_replaced.html', 'r')
t = t.read()
unitid = re.findall('<unit.*?"pe">', t)
PE = re.findall('<PE.*?</PE>', t, re.DOTALL)
for i in unitid:
    for j in PE:
        a = i + '\n' + j + '\n'
        with open('PEtags.txt', 'w') as fi:
            fi.write(a)
You have a problem with the code where you loop through the matches and write them to the file.
If your unitid and PE match counts are the same, you may adjust the code to:
import re

with open('ALICE.per1_replaced.html', 'r') as t:
    contents = t.read()

unitid = re.findall('<unit.*?"pe">', contents)
PE = re.findall('<PE.*?</PE>', contents, re.DOTALL)

with open('PEtags.txt', 'w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))
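Note that plain zip silently drops any unpaired matches when the counts differ. A small sketch (using the standard itertools module; `pair_matches` and the `<MISSING>` placeholder are my own names, not part of the original answer) that makes a mismatch visible instead:

```python
from itertools import zip_longest

def pair_matches(unitid, PE, missing="<MISSING>"):
    """Pair unit tags with PE tags; pad the shorter list so nothing is dropped."""
    return [(i, p) for i, p in zip_longest(unitid, PE, fillvalue=missing)]
```

Any `<MISSING>` entries in the output signal that the two findall calls matched a different number of tags.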
I want to add HTML tags to text taken from a .txt file and then save as HTML. I'm trying to find any instances of a particular word, then 'replace' it with the same word inside an anchor tag.
Something like this:
import dominate
from dominate.tags import *

item = 'item1'
text = ['here is item1 in a line of text', 'here is item2 in a line too']

doc = dominate.document()
with doc:
    for i, line in enumerate(text):
        if item in text[i]:
            text[i].replace(item, a(item, href='/item1'))
The above code gives an error:
TypeError: replace() argument 2 must be str, not a.
I can make this happen:
print(doc.body)
<body>
<p>here is item1 in a line of text</p>
<p>here is item2 in a line too</p>
</body>
But I want this:
print(doc.body)
<body>
<p>here is <a href='/item1'>item1</a> in a line of text</p>
<p>here is item2 in a line too</p>
</body>
There is no replace() method in Dominate, but this solution works for what I want to achieve:
Create the anchor tag as a string. Stored here in the variable 'item_atag':
item = 'item1'
url = '/item1'
item_atag = '<a href="{}">{}</a>'.format(url, item)
Use Dominate library to wrap paragraph tags around each line in the original text, then convert to string:
text = ['here is item1 in a line of text', 'here is item2 in a line too']

from dominate import document
from dominate.tags import p

doc = document()
with doc.body:
    for i, line in enumerate(text):
        p(text[i])

html_string = str(doc.body)
Use Python's built-in replace() method for strings to add the anchor tag:
html_with_atag = html_string.replace(item, item_atag)
Finally, write the new string to an HTML file:
with open('html_file.html', 'w') as f:
    f.write(html_with_atag)
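One caveat with a plain str.replace: it would also rewrite 'item1' inside a longer token such as 'item12'. A sketch (using the standard re module; `link_word` is my own helper name, not part of the original answer) that replaces only whole-word occurrences:

```python
import re

def link_word(html_string, word, url):
    """Wrap whole-word occurrences of `word` in an anchor tag."""
    atag = '<a href="{}">{}</a>'.format(url, word)
    # \b keeps 'item1' from matching inside 'item12'
    return re.sub(r'\b{}\b'.format(re.escape(word)), atag, html_string)
```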
I am trying to change the content of an HTML file with BeautifulSoup. This content will be coming from Python-based text, so it will have \n newlines...
newContent = """This is my content \n with a line break."""
newContent = newContent.replace("\n", "<br>")
htmlFile.find_all("div", "product").p.string = newContent
when I do this, the html file <p> text is changed to this:
This is my content <br> with a line break.
How do I change a string within a BeautifulSoup object and keep <br> breaks? if the string just contains \n then it'll create an actual line break.
You need to create separate elements; there isn't one piece of text contained in the <p> tag, but a series of text and <br/> elements.
Rather than replace \n newlines with the text <br/> (which will be escaped), split the text on newlines and insert extra elements in between:
parent = htmlFile.find_all("div", "product")[0].p
lines = newContent.splitlines()

parent.append(htmlFile.new_string(lines[0]))
for line in lines[1:]:
    parent.append(htmlFile.new_tag('br'))
    parent.append(htmlFile.new_string(line))
This uses the Element.append() method to add new elements to the tree, and BeautifulSoup.new_string() and BeautifulSoup.new_tag() to create those extra elements.
Demo:
>>> from bs4 import BeautifulSoup
>>> htmlFile = BeautifulSoup('<p></p>')
>>> newContent = """This is my content \n with a line break."""
>>> parent = htmlFile.p
>>> lines = newContent.splitlines()
>>> parent.append(htmlFile.new_string(lines[0]))
>>> for line in lines[1:]:
...     parent.append(htmlFile.new_tag('br'))
...     parent.append(htmlFile.new_string(line))
...
>>> print htmlFile.prettify()
<html>
<head>
</head>
<body>
<p>
This is my content
<br/>
with a line break.
</p>
</body>
</html>
I have a text file that contains :
JavaScript 0
/AA 0
OpenAction 1
AcroForm 0
JBIG2Decode 0
RichMedia 0
Launch 0
Colors>2^24 0
uri 0
I wrote this code to convert the text file to html :
contents = open("C:\\Users\\Suleiman JK\\Desktop\\Static_hash\\test", "r")
with open("suleiman.html", "w") as e:
    for lines in contents.readlines():
        e.write(lines + "<br>\n")
but the problem I had is that in the html file there is no space between the two columns on each line:
JavaScript 0
/AA 0
OpenAction 1
AcroForm 0
JBIG2Decode 0
RichMedia 0
Launch 0
Colors>2^24 0
uri 0
What should I do to keep the same content, with the two columns aligned like in the text file?
Just change your code to include <pre> and </pre> tags to ensure that your text stays formatted the way you have formatted it in your original text file.
contents = open("C:\\Users\\Suleiman JK\\Desktop\\Static_hash\\test", "r")
with open("suleiman.html", "w") as e:
    for lines in contents.readlines():
        e.write("<pre>" + lines + "</pre> <br>\n")
This is HTML -- use BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup()
body = soup.new_tag('body')
soup.insert(0, body)
table = soup.new_tag('table')
body.insert(0, table)

with open('path/to/input/file.txt') as infile:
    for line in infile:
        row = soup.new_tag('tr')
        col1, col2 = line.split()
        for coltext in (col2, col1):  # important that you reverse order
            col = soup.new_tag('td')
            col.string = coltext
            row.insert(0, col)
        table.insert(len(table.contents), row)

with open('path/to/output/file.html', 'w') as outfile:
    outfile.write(soup.prettify())
That is because HTML parsers collapse all whitespace. There are two ways you could do it (well probably many more).
One would be to flag it as "preformatted text" by putting it in <pre>...</pre> tags.
The other would be a table (and this is what a table is made for):
<table>
<tr><td>Javascript</td><td>0</td></tr>
...
</table>
Fairly tedious to type out by hand, but easy to generate from your script. Something like this should work:
contents = open("C:\\Users\\Suleiman JK\\Desktop\\Static_hash\\test", "r")
with open("suleiman.html", "w") as e:
    e.write("<table>\n")
    for lines in contents.readlines():
        e.write("<tr><td>%s</td><td>%s</td></tr>\n" % tuple(lines.split()))
    e.write("</table>\n")
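One caveat: a value like Colors>2^24 contains a literal >, which ought to be escaped in HTML output. A sketch of the same row-building step using Python 3's standard html module (`row_html` is my own helper name, not part of the original answer):

```python
import html

def row_html(line):
    """Build one table row, escaping HTML-special characters in each cell."""
    col1, col2 = line.split()
    return "<tr><td>{}</td><td>{}</td></tr>\n".format(
        html.escape(col1), html.escape(col2))
```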
You can use a standalone template library like mako or jinja. Here is an example with jinja:
from jinja2 import Template

c = '''<!doctype html>
<html>
<head>
  <title>My Title</title>
</head>
<body>
<table>
  <thead>
    <tr><th>Col 1</th><th>Col 2</th></tr>
  </thead>
  <tbody>
  {% for col1, col2 in lines %}
    <tr><td>{{ col1 }}</td><td>{{ col2 }}</td></tr>
  {% endfor %}
  </tbody>
</table>
</body>
</html>'''

t = Template(c)
lines = []
with open('yourfile.txt', 'r') as f:
    for line in f:
        lines.append(line.split())

with open('results.html', 'w') as f:
    f.write(t.render(lines=lines))
If you can't install jinja, then here is an alternative:
header = '<!doctype html><html><head><title>My Title</title></head><body>'
body = '<table><thead><tr><th>Col 1</th><th>Col 2</th></tr></thead>'
footer = '</table></body></html>'

with open('input.txt', 'r') as input, open('output.html', 'w') as output:
    output.write(header + '\n')
    output.write(body + '\n')
    for line in input:
        col1, col2 = line.rstrip().split()
        output.write('<tr><td>{}</td><td>{}</td></tr>\n'.format(col1, col2))
    output.write(footer)
I have added a title and loop line by line, appending each line in <tr> and <td> tags; it works as a single-column table. There is no need for separate <tr></tr> and <td></td> tags for col1 and col2.
Log snippet:
MUTHU PAGE
2019/08/19 19:59:25 MUTHUKUMAR_TIME_DATE,line: 118 INFO | Logger object created for: MUTHUKUMAR_APP_USER_SIGNUP_LOG
2019/08/19 19:59:25 MUTHUKUMAR_DB_USER_SIGN_UP,line: 48 INFO | ***** User SIGNUP page start *****
2019/08/19 19:59:25 MUTHUKUMAR_DB_USER_SIGN_UP,line: 49 INFO | Enter first name: [Alphabet character only allowed, minimum 3 character to maximum 20 chracter]
html source page:
'''
<?xml version="1.0" encoding="utf-8"?>
<body>
<table>
<p>
MUTHU PAGE
</p>
<tr>
<td>
2019/08/19 19:59:25 MUTHUKUMAR_TIME_DATE,line: 118 INFO | Logger object created for: MUTHUKUMAR_APP_USER_SIGNUP_LOG
</td>
</tr>
<tr>
<td>
2019/08/19 19:59:25 MUTHUKUMAR_DB_USER_SIGN_UP,line: 48 INFO | ***** User SIGNUP page start *****
</td>
</tr>
<tr>
<td>
2019/08/19 19:59:25 MUTHUKUMAR_DB_USER_SIGN_UP,line: 49 INFO | Enter first name: [Alphabet character only allowed, minimum 3 character to maximum 20 chracter]
'''
CODE:
from bs4 import BeautifulSoup

soup = BeautifulSoup(features='xml')
body = soup.new_tag('body')
soup.insert(0, body)
table = soup.new_tag('table')
body.insert(0, table)

with open('C:\\Users\\xxxxx\\Documents\\Latest_24_may_2019\\New_27_jun_2019\\DB\\log\\input.txt') as infile:
    title_s = soup.new_tag('p')
    title_s.string = " MUTHU PAGE "
    table.insert(0, title_s)
    for line in infile:
        row = soup.new_tag('tr')
        col1 = list(line.split('\n'))
        col1 = [each for each in col1 if each != '']
        for coltext in col1:
            col = soup.new_tag('td')
            col.string = coltext
            row.insert(0, col)
        table.insert(len(table.contents), row)

with open('C:\\Users\\xxxxx\\Documents\\Latest_24_may_2019\\New_27_jun_2019\\DB\\log\\output.html', 'w') as outfile:
    outfile.write(soup.prettify())
I have lots of HTML files. I want to replace some elements, keeping all the other content unchanged. For example, I would like to execute this jQuery expression (or some equivalent of it):
$('.header .title').text('my new content')
on the following HTML document:
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
and have the following result:
<div class=header><span class=title>my new content</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
The problem is, all parsers I’ve tried (Nokogiri, BeautifulSoup, html5lib) serialize it to something like this:
<html>
<head></head>
<body>
<div class=header><span class=title>my new content</span></div>
<p>1</p><p>2</p>
<table><tbody><tr><td>1</td></tr></tbody></table>
</body>
</html>
E.g. they add:
html, head and body elements
closing p tags
tbody
Is there a parser that satisfies my needs? It should work in either Node.js, Ruby or Python.
I highly recommend the pyquery package for Python. It is a jQuery-like interface layered on top of the extremely reliable lxml package, a Python binding to libxml2.
I believe this does exactly what you want, with a quite familiar interface.
from pyquery import PyQuery as pq
html = '''
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
'''
doc = pq(html)
doc('.header .title').text('my new content')
print doc
Output:
<div><div class="header"><span class="title">my new content</span></div>
<p>1</p><p>2
</p><table><tr><td>1</td></tr></table></div>
The closing p tag can't be helped. lxml only keeps the values from the original document, not the vagaries of the original. Paragraphs can be made two ways, and it chooses the more standard way when doing serialization. I don't believe you'll find a (bug-free) parser that does better.
Note: I'm on Python 3.
This will only handle a subset of CSS selectors, but it may be enough for your purposes.
from html.parser import HTMLParser

class AttrQuery():
    def __init__(self):
        self.repl_text = ""
        self.selectors = []

    def add_css_sel(self, seltext):
        sels = seltext.split(" ")
        for selector in sels:
            if selector[:1] == "#":
                self.add_selector({"id": selector[1:]})
            elif selector[:1] == ".":
                self.add_selector({"class": selector[1:]})
            elif "." in selector:
                html_tag, html_class = selector.split(".")
                self.add_selector({"html_tag": html_tag, "class": html_class})
            else:
                self.add_selector({"html_tag": selector})

    def add_selector(self, selector_dict):
        self.selectors.append(selector_dict)

    def match_test(self, tagwithattrs_list):
        for selector in self.selectors:
            for condition in selector:
                condition_value = selector[condition]
                if not self._condition_test(tagwithattrs_list, condition, condition_value):
                    return False
        return True

    def _condition_test(self, tagwithattrs_list, condition, condition_value):
        for tagwithattrs in tagwithattrs_list:
            try:
                if condition_value == tagwithattrs[condition]:
                    return True
            except KeyError:
                pass
        return False


class HTMLAttrParser(HTMLParser):
    def __init__(self, html, **kwargs):
        super().__init__(**kwargs)
        self.tagwithattrs_list = []
        self.queries = []
        self.matchrepl_list = []
        self.html = html

    def handle_starttag(self, tag, attrs):
        tagwithattrs = dict(attrs)
        tagwithattrs["html_tag"] = tag
        self.tagwithattrs_list.append(tagwithattrs)
        if debug:
            print("push\t", end="")
            for attrname in tagwithattrs:
                print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
            print("")

    def handle_endtag(self, tag):
        try:
            while True:
                tagwithattrs = self.tagwithattrs_list.pop()
                if debug:
                    print("pop \t", end="")
                    for attrname in tagwithattrs:
                        print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
                    print("")
                if tag == tagwithattrs["html_tag"]:
                    break
        except IndexError:
            raise IndexError("Found a close-tag for a non-existent element.")

    def handle_data(self, data):
        if self.tagwithattrs_list:
            for query in self.queries:
                if query.match_test(self.tagwithattrs_list):
                    line, position = self.getpos()
                    length = len(data)
                    match_replace = (line - 1, position, length, query.repl_text)
                    self.matchrepl_list.append(match_replace)

    def addquery(self, query):
        self.queries.append(query)

    def transform(self):
        split_html = self.html.split("\n")
        self.matchrepl_list.reverse()
        if debug:
            print("\nreversed list of matches (line, position, len, repl_text):\n{}\n".format(self.matchrepl_list))
        for line, position, length, repl_text in self.matchrepl_list:
            oldline = split_html[line]
            newline = oldline[:position] + repl_text + oldline[position + length:]
            split_html = split_html[:line] + [newline] + split_html[line + 1:]
        return "\n".join(split_html)
See the example usage below.
html_test = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td class=hi><div id=there>1</div></td></tr></table>"""

debug = False
parser = HTMLAttrParser(html_test)

query = AttrQuery()
query.repl_text = "Bar"
query.add_selector({"html_tag": "div", "class": "header"})
query.add_selector({"class": "title"})
parser.addquery(query)

query = AttrQuery()
query.repl_text = "InTable"
query.add_css_sel("table tr td.hi #there")
parser.addquery(query)

parser.feed(html_test)
transformed_html = parser.transform()
print("transformed html:\n{}".format(transformed_html))
Output:
transformed html:
<div class=header><span class=title>Bar</span></div>
<p>1<p>2
<table><tr><td class=hi><div id=there>InTable</div></td></tr></table>
Ok, I have done this in a few languages, and the best parser I have seen that preserves whitespace and even HTML comments is Jericho, which is unfortunately Java.
That is, Jericho knows how to parse and preserve fragments.
Yes, I know it's Java, but you could easily make a RESTful service with a tiny bit of Java that would take the payload and convert it. In the Java REST service you could use JRuby, Jython, Rhino JavaScript, etc. to coordinate with Jericho.
You can use Nokogiri HTML Fragment for this:
fragment = Nokogiri::HTML.fragment('<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>')
fragment.css('.title').children.first.replace(Nokogiri::XML::Text.new('HEY', fragment))
fragment.to_s #=> "<div class=\"header\"><span class=\"title\">HEY</span></div>\n<p>1</p><p>2\n</p><table><tr><td>1</td></tr></table>"
The problem with the p tag persists, because it is invalid HTML, but this should return your document without html, head or body and tbody tags.
With Python, using lxml.html is fairly straightforward:
(It meets points 1 and 3, but I don't think much can be done about 2, and it handles the unquoted class attributes.)
import lxml.html

fragment = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
"""

page = lxml.html.fromstring(fragment)
for span in page.cssselect('.header .title'):
    span.text = 'my new content'

print lxml.html.tostring(page, pretty_print=True)
Result:
<div>
<div class="header"><span class="title">my new content</span></div>
<p>1</p>
<p>2
</p>
<table><tr><td>1</td></tr></table>
</div>
This is a slightly separate solution but if this is only for a few simple instances then perhaps CSS is the answer.
Generated Content
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
  <style type="text/css">
    #header.title1:first-child:before {
      content: "This is your title!";
      display: block;
      width: 100%;
    }
    #header.title2:first-child:before {
      content: "This is your other title!";
      display: block;
      width: 100%;
    }
  </style>
</head>
<body>
  <div id="header" class="title1">
    <span class="non-title">Blah Blah Blah Blah</span>
  </div>
</body>
</html>
In this instance you could just have jQuery swap the classes and you'd get the change for free with css. I haven't tested this particular usage but it should work.
We use this for things like outage messages.
If you're running a Node.js app, this module will do exactly what you want, a JQuery style DOM manipulator: https://github.com/cheeriojs/cheerio
An example from their wiki:
var cheerio = require('cheerio'),
    $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

$.html();
//=> <h2 class="title welcome">Hello there!</h2>