How do I draw a border around the blockquote content in the QLabel HTML code below? (I've tried a few ways, html_ contains a variant that should work).
from PyQt4.QtGui import QApplication, QLabel
import sys
html_ = '<p style="border: 1px dotted black;"> hassan </p>'
html = '''
<table cellspacing="5" border="0" cellpadding="0">
<tr valign="top" align="left">
<td style="padding-right: 10px;" width="150">
<p>#%s<br>
<b>User:</b> %s<br>
<b>posted at:</b> %s </p>
</td>
<td width="1" bgcolor="#00FFFF"><BR></td>
<td style="padding-left: 10px;" width="400" valign="top" align="left">
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum. Typi non habent claritatem insitam; est usus legentis in iis qui facit eorum claritatem. Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc putamus parum claram, anteposuerit litterarum formas humanitatis per seacula quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur parum clari, fiant sollemnes in futurum. </p>
<blockquote >
<em>Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat,</em>
<p style="border:1px dotted black;"><b> posted by: </b>hassan</p>
</blockquote>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum. Typi non habent claritatem insitam; est usus legentis in iis qui facit eorum claritatem. Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc putamus parum claram, anteposuerit litterarum formas humanitatis per seacula quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur parum clari, fiant sollemnes in futurum. </p>
</td>
</tr>
</table>
''' % (1, "hassan", "sunday")
app = QApplication(sys.argv)
l = QLabel(html)
l.setWordWrap(True)
l.show()
app.exec_()
sys.exit()
Widgets such as QLabel and QTextBrowser support only a limited subset of HTML/CSS.
In this particular case, borders are directly supported only by the table element.
So you could try wrapping the blockquote in a table, something like:
<table border="1" style="border-style: dotted; border-color: black"><tr><td>
<blockquote>
...
</blockquote>
</td></tr></table>
or:
<table bgcolor="black"><tr><td>
<blockquote style="background-color: palette(window)">
...
</blockquote>
</td></tr></table>
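As a rough sketch of the table-wrapping idea above, the helper below builds the wrapper markup as a plain string, so it can be tried without a running QApplication. The function name and its parameters are hypothetical (they are not part of Qt), and the sketch assumes, as stated above, that the rich-text engine honours the border settings on a table element.

```python
def wrap_in_bordered_table(inner_html, color="black", style="dotted"):
    """Wrap an HTML fragment in a one-cell table that carries the border.

    Hypothetical helper for illustration only; the markup mirrors the
    first variant shown above.
    """
    return (
        '<table border="1" style="border-style: {style}; border-color: {color}">'
        "<tr><td>{inner}</td></tr></table>"
    ).format(style=style, color=color, inner=inner_html)


quote = "<blockquote><em>quoted text</em><p><b>posted by:</b> hassan</p></blockquote>"
wrapped = wrap_in_bordered_table(quote)
print(wrapped)
```

The resulting string would take the place of the bare blockquote inside the label's HTML before it is passed to QLabel.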
I have tried many methods, but none of them has worked for me. I want to split a text file's lines into multiple chunks, specifically 50 lines per chunk.
Like this: [['Line1', 'Line2', ... up to 50], ...] and so on.
data.txt (example):
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Python code:
with open('data.txt', 'r') as file:
    sample = file.readlines()
chunks = []
for i in range(0, len(sample), 3):  # replace 3 with 50 in your case
    chunks.append(sample[i:i+3])  # replace 3 with 50 in your case
chunks (in my example, chunks of 3 lines):
[['Line1\n', 'Line2\n', 'Line3\n'], ['Line4\n', 'Line5\n', 'Line6\n'], ['Line7\n', 'Line8']]
You can apply the string.rstrip('\n') method on those lines to remove the \n at the end.
Alternative:
Without reading the whole file in memory (better):
chunks = []
with open('data.txt', 'r') as file:
    while True:
        chunk = []
        for i in range(3):  # replace 3 with 50 in your case
            line = file.readline()
            if not line:
                break
            chunk.append(line)
            # or chunk.append(line.rstrip('\n')) to remove the '\n' at the end
        if not chunk:
            break
        chunks.append(chunk)
print(chunks)
Produces same result
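The read-a-few-lines-at-a-time loop above can also be sketched with itertools.islice, which pulls at most `size` items from an iterator on each pass. This is only one possible arrangement: the name read_chunks is hypothetical, and the sketch strips the trailing '\n' as suggested earlier.

```python
from itertools import islice


def read_chunks(lines, size):
    """Yield lists of up to `size` lines, with the trailing newline removed."""
    it = iter(lines)
    while True:
        # islice stops early when the iterator runs out, so the last
        # chunk may be shorter than `size`.
        chunk = [line.rstrip("\n") for line in islice(it, size)]
        if not chunk:
            return
        yield chunk


# Any iterable of lines works, including an open file object.
demo_lines = ["Line1\n", "Line2\n", "Line3\n", "Line4\n", "Line5\n",
              "Line6\n", "Line7\n", "Line8"]
chunks = list(read_chunks(demo_lines, 3))
print(chunks)
```

With a real file, `read_chunks(file, 50)` inside the `with open(...)` block would yield one 50-line chunk at a time without loading the whole file.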
A good way to do it would be to create a generic generator function that could break any sequence up into chunks of any size. Here's what I mean:
from itertools import zip_longest
def grouper(n, iterable):  # Generator function.
    "s -> (s0, s1, ...sn-1), (sn, sn+1, ...s2n-1), (s2n, s2n+1, ...s3n-1), ..."
    FILLER = object()  # Unique object
    for group in zip_longest(*([iter(iterable)]*n), fillvalue=FILLER):
        limit = group.index(FILLER) if group[-1] is FILLER else len(group)
        yield group[:limit]  # Sliced to remove any filler.
if __name__ == '__main__':
    from pprint import pprint
    with open('lorem ipsum.txt') as inf:
        for chunk in grouper(3, inf):
            pprint(chunk, width=90)
If the lorem ipsum.txt file contained these lines:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.
In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In
elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,
dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor
facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.
Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras
pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex
arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.
The result will be the following chunks each composed of 3 lines or less:
('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.\n',
'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In\n',
'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,\n')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor\n',
'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.\n',
'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras\n')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex\n',
'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.\n')
Update
If you want to remove the newline characters from the end of the lines of the file, you could do it with the same generic grouper() function by passing it a generator expression to preprocess the lines being read without needing to read them all into memory first:
if __name__ == '__main__':
    from pprint import pprint
    with open('lorem ipsum.txt') as inf:
        lines = (line.rstrip() for line in inf)  # Generator expr - cuz outer parentheses.
        for chunk in grouper(3, lines):
            pprint(chunk, width=90)
Output using generator expression:
('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.',
'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In',
'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor',
'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.',
'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex',
'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.')
You can split the text by each newline using the str.splitlines() method. Then, using a list comprehension, you can use list slices to slice the list at increments of the chunk_size (50 in your case). Below, I used 3 as the chunk_size variable, but you can replace that with 50:
text = '''Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.'''
lines = text.splitlines()
chunk_size = 3
chunks = [lines[i: i + chunk_size] for i in range(0, len(lines), chunk_size)]
print(chunks)
Output:
[['Lorem ipsum dolor sit amet, consectetur adipiscing elit,', 'sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.', 'Ut enim ad minim veniam,'],
['quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.', 'Duis aute irure dolor in reprehenderit in voluptate', 'velit esse cillum dolore eu fugiat nulla pariatur.'],
['Excepteur sint occaecat cupidatat non proident,', 'sunt in culpa qui officia deserunt mollit anim id est laborum.']]
I'm trying to count items in a list of strings using count() function and sorting the results from largest to smallest. Although the function performs reasonably well on small lists, it does not scale up well at all, as can be seen in the small experiment below with just 5 cycles of doubling up the input length (the 6th cycle was taking too long to wait). Is there a way to optimize the first list comprehension or perhaps an alternative to count() that would scale up better?
import nltk
from operator import itemgetter
import time
t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."
unigrams = nltk.word_tokenize(t.lower())
for size in range(1, 6):
    unigrams = unigrams*size
    start = time.time()
    unigram_freqs = [unigrams.count(word) for word in unigrams]
    freq_pairs = set((zip(unigrams, unigram_freqs)))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]
    end = time.time()
    time_elapsed = round(end-start, 3)
    print("Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")
# Runtime: 0.001s for 1x the size
# Runtime: 0.003s for 2x the size
# Runtime: 0.022s for 3x the size
# Runtime: 0.33s for 4x the size
# Runtime: 8.065s for 5x the size
Using Counter from collections and sorting with its most_common() method, I get pretty much 0 seconds regardless of size:
import nltk
nltk.download('punkt')
from operator import itemgetter
from collections import Counter
import time
t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."
unigrams = nltk.word_tokenize(t.lower())
for size in range(1, 5):
    unigrams = unigrams*size
    start = time.time()
    unigram_freqs = [unigrams.count(word) for word in unigrams]
    freq_pairs = set((zip(unigrams, unigram_freqs)))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]
    end = time.time()
    time_elapsed = round(end-start, 3)
    print("Slow Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")
    start = time.time()
    a = Counter(unigrams).most_common()
    #print(a)
    end = time.time()
    time_elapsed = round(end-start, 3)
    print("Fast Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")
Slow Runtime: 0.003s for 1x the size
Fast Runtime: 0.0s for 1x the size
Slow Runtime: 0.006s for 2x the size
Fast Runtime: 0.0s for 2x the size
Slow Runtime: 0.157s for 3x the size
Fast Runtime: 0.0s for 3x the size
Slow Runtime: 1.891s for 4x the size
Fast Runtime: 0.001s for 4x the size
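The gap in the timings above comes from how often the list is scanned: list.count() walks the whole list once for every element, so the work grows quadratically, while Counter tallies everything in a single pass. The following is a minimal sketch on a tiny made-up word list (no nltk needed) showing that both approaches produce the same counts.

```python
from collections import Counter

# Tiny stand-in for the tokenized text; the words are made up for illustration.
words = "the cat and the dog and the bird".split()

# Quadratic: words.count(w) rescans the full list for every word.
slow = sorted({(w, words.count(w)) for w in words},
              key=lambda pair: pair[1], reverse=True)

# Linear: Counter builds all tallies in one pass, and most_common()
# returns (word, count) pairs sorted from largest to smallest count.
fast = Counter(words).most_common()

print(fast)
```

Both results hold the same (word, count) pairs; only the order of words with equal counts may differ between them.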
How do you load and dump YAML using PyYAML, so that it uses the original styling as closely as possible?
I have Python to load and dump YAML data like:
import sys
import yaml
def _represent_dictorder(self, data):
    # Maintains ordering of specific dictionary keys in the YAML output.
    _data = []
    ordering = ['questions', 'tags', 'answers', 'weight', 'date', 'text']
    for key in ordering:
        if key in data:
            _data.append((str(key), data.pop(key)))
    if data:
        _data.extend(data.items())
    return self.represent_mapping(u'tag:yaml.org,2002:map', _data)
yaml.add_representer(dict, _represent_dictorder)
text="""- questions:
  - Lorem ipsum dolor sit amet, consectetur adipiscing elit.
  tags:
    context: curabitur
  answers:
  - weight: 2
    date: 2014-1-19
    text: |-
      1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
      2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.
      3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus, mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.
          a. Aenean consectetur eleifend accumsan.
      4. In erat lacus, egestas ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis maximus dignissim.
          a. Proin nec neque convallis, placerat odio non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.
      5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
          a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.
          b. Nulla facilisi. Pellentesque at pretium nunc.
          c. Ut ipsum nibh, suscipit a pretium eu, eleifend vitae purus.
"""
yaml.dump(yaml.safe_load(text), stream=sys.stdout, default_flow_style=False, indent=4)
but this outputs the YAML in a different style, like:
- questions:
- Lorem ipsum dolor sit amet, consectetur adipiscing elit.
tags:
context: curabitur
answers:
- weight: 2
date: 2014-1-19
text: "1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.\n\
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.\n\
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus,\
\ mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.\n \
\ a. Aenean consectetur eleifend accumsan.\n4. In erat lacus, egestas\
\ ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis\
\ maximus dignissim.\n a. Proin nec neque convallis, placerat odio\
\ non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.\n\
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.\n \
\ a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.\n\
\ b. Nulla facilisi. Pellentesque at pretium nunc.\n c. Ut ipsum\
\ nibh, suscipit a pretium eu, eleifend vitae purus."
As you can see, it's changing the style of the text-block, so that newlines are escaped, making it a lot harder to read.
So I tried specifying the default_style attribute like:
yaml.dump(yaml.safe_load(text), stream=sys.stdout, default_flow_style=False, default_style='|', indent=4)
And that fixed the text-block style, but then it broke other styles by putting quotes around all other strings, adding newlines to single-line strings, and munging integers, like:
- "questions":
- |-
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"tags":
"context": |-
curabitur
"answers":
- "weight": !!int |-
2
"date": |-
2014-1-19
"text": |-
1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus, mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.
a. Aenean consectetur eleifend accumsan.
4. In erat lacus, egestas ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis maximus dignissim.
a. Proin nec neque convallis, placerat odio non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.
b. Nulla facilisi. Pellentesque at pretium nunc.
c. Ut ipsum nibh, suscipit a pretium eu, eleifend vitae purus.
How do I fix this so the output resembles the style of my original input?
How would you determine what string to represent as a block literal (or a folded block for that matter) and what to represent inline?
Under the assumption that you only want block literals used with strings that span over multiple lines, you can write your own string representer to switch between the styles based on the string content:
def selective_representer(dumper, data):
    return dumper.represent_scalar(u"tag:yaml.org,2002:str", data,
                                   style="|" if "\n" in data else None)
yaml.add_representer(str, selective_representer)
Now if you dump your data with default flow style set to False (to prevent dict/list inlining):
yaml.dump(yaml.safe_load(text), stream=sys.stdout, default_flow_style=False, indent=4)
Your scalars will act as you expect them to.
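For a self-contained way to try the selective representer, here is a minimal sketch under a few assumptions: it registers the representer on a Dumper subclass (BlockStyleDumper is a hypothetical name) so the global default Dumper is left untouched, it uses a small made-up document instead of the full text from the question, and it passes sort_keys=False, which needs PyYAML 5.1 or newer.

```python
import yaml  # PyYAML (third-party)


class BlockStyleDumper(yaml.SafeDumper):
    """Hypothetical Dumper subclass that keeps the custom representer local."""


def represent_str(dumper, data):
    # Literal block style only for strings that span several lines.
    style = "|" if "\n" in data else None
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=style)


BlockStyleDumper.add_representer(str, represent_str)

# Made-up sample data: one single-line string and one multi-line string.
doc = {"title": "single line", "text": "1. first\n2. second\n    a. nested"}
out = yaml.dump(doc, Dumper=BlockStyleDumper,
                default_flow_style=False, sort_keys=False)
print(out)
```

Loading the dumped text back with yaml.safe_load should give the original dictionary, which is a quick way to check that only the presentation changed.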
When I submit the following text in my textarea box on the windows GAE launcher at http://localhost:8080 it displays fine.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus a dolor eget diam
condimentum varius. Proin malesuada dictum ante, sed commodo purus vestibulum in.
Sed nibh dui, volutpat eu porta eu, molestie ut lacus. Vivamus iaculis urna ut tellus
blandit eu at nisl. Fusce eros libero, aliquam vitae hendrerit vitae, posuere ac diam.
Vivamus sagittis, felis in imperdiet pellentesque, eros nibh porttitor nisi, id
tristique leo libero a ligula. In in elit et velit auctor lacinia eleifend cursus mauris. Mauris
pellentesque lorem et augue placerat ultrices. Nam sed quam nisl, eget elementum felis.
Integer sapien ipsum, aliquet quis viverra quis, adipiscing eget sapien. Nam consequat
lacinia enim, id viverra nisl molestie feugiat.
When my code is deployed on GAE after I hit the submit button it displays like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus a dolor eg=
et diam condimentum varius. Proin malesuada dictum ante, sed commodo purus =
vestibulum in. Sed nibh dui, volutpat eu porta eu, molestie ut lacus. Vivam=
us iaculis urna ut tellus tempor blandit eu at nisl. Fusce eros libero, ali=
quam vitae hendrerit vitae, posuere ac diam. Vivamus sagittis, felis in imp=
erdiet pellentesque, eros nibh porttitor nisi, id tristique leo libero a li=
gula. In in elit et velit auctor lacinia eleifend cursus mauris. Mauris pel=
lentesque lorem et augue placerat ultrices. Nam sed quam nisl, eget element=
um felis. Integer sapien ipsum, aliquet quis viverra quis, adipiscing eget =
sapien. Nam consequat lacinia enim, id viverra nisl molestie feugiat.
Implementation Description below:
I am using the jinja2 engine. I have autoescape = False:
jinja_env = jinja2.Environment(loader = jinja2.FileSystemLoader(template_dir), autoescape = False)
I get the content from a textarea element. Here is how it is set in my template:
<label>
<div>Information</div>
<textarea name="information">{{r.information}}</textarea>
</label>
I retrieve the string using:
information = self.request.get('information')
I commit the string to the datastore:
r.information = information
r.put()
When displaying it again for editing I use the same template code:
<label>
<div>Information</div>
<textarea name="information">{{r.information}}</textarea>
</label>
Everything works great locally, but when I deploy it to Google App Engine I get some strange results. Where do those = signs come from?
EDIT:
For clarification it is putting =CRLF at the end of every line.
EDIT 2:
Here is the code from comment 21 of the bug:
import base64
import quopri
from webob import multidict  # webob's multidict module, patched on the last line
def from_fieldstorage(cls, fs):
    """
    Create a dict from a cgi.FieldStorage instance
    """
    obj = cls()
    if fs.list:
        # fs.list can be None when there's nothing to parse
        for field in fs.list:
            if field.filename:
                obj.add(field.name, field)
            else:
                # first, set a common charset to utf-8.
                common_charset = 'utf-8'
                # second, check Content-Transfer-Encoding and decode
                # the value appropriately
                field_value = field.value
                transfer_encoding = field.headers.get(
                    'Content-Transfer-Encoding', None)
                if transfer_encoding == 'base64':
                    field_value = base64.b64decode(field_value)
                if transfer_encoding == 'quoted-printable':
                    field_value = quopri.decodestring(field_value)
                if 'charset' in field.type_options and \
                        field.type_options['charset'] != common_charset:
                    # decode with a charset specified in each
                    # multipart, and then encode it again with a
                    # charset specified in top level FieldStorage
                    field_value = field_value.decode(
                        field.type_options['charset']).encode(common_charset)
                # TODO: Should we take care of field.name here?
                obj.add(field.name, field_value)
    return obj
multidict.MultiDict.from_fieldstorage = classmethod(from_fieldstorage)
You might be falling foul of this bug
The workaround in comment 21 has worked for me in the past, and recent comments indicate it still does.
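The stray = signs are consistent with quoted-printable encoding, where a line ending in = followed by CRLF is a "soft line break" that a decoder removes. As a small illustration (the byte string is copied from the garbled output in the question), the standard library's quopri module undoes it:

```python
import quopri

# First wrapped line of the deployed output, with =CRLF at the break.
garbled = b"Vivamus a dolor eg=\r\net diam condimentum varius."

# decodestring() drops the soft line break and rejoins the split word.
decoded = quopri.decodestring(garbled)
print(decoded.decode("utf-8"))
```

This is the same decoding step the workaround applies to each form field whose Content-Transfer-Encoding is quoted-printable.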
I need your help or suggestions.
I started reading some books about Python just because of this problem :) but I see it will take me a long time to learn the whole language. I also skimmed and searched through the lxml.html documentation, but I still can't figure out how to do what I want.
I created two sample HTML files to explain my problem. You can see the code here: http://pzt.me/ltbj
There is also a screenshot of the differences, so it is even easier to see what's going on.
If somebody has tried something like this before, or if you have an idea how I could do this, please let me know.
Thank you.
Best,
Jozsef
OK here is the code:
~~~~~~~~~~~
This:
~~~~~~~~~~~
New Document
<body>
<h2><a name="2" class="class1">2</a></h2> ^ top ^
<p><span class="class3">20</span>Sed imperdiet, lacus eu consectetur tempus, tellus metus vestibulum tortor, nec tincidunt nisl enim non tortor. <span class="class3">21</span>Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. Phasellus neque justo, aliquet non pellentesque vel, dictum non libero. Phasellus vel nulla mi, id molestie purus. Suspendisse orci ante, imperdiet at tempus id, pulvinar eu mi. Aliquam erat volutpat. <span class="class3">22</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Pellentesque pretium, ligula tristique porta fringilla, mauris lectus gravida nibh, consectetur ornare lacus tellus quis sem. <span class="class3">23</span>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor.</p>
<p><span class="class3">24</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. <span class="class3">25</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.</p>
<p><span class="class3">26</span>Sed imperdiet, lacus eu consectetur tempus, "tellus metus vestibulum tortor, nec tincidunt nisl enim non tortor."</p>
<p><span class="class3">27</span></p>
<p>Nunc volutpat lacus;</p>
<p>Etiam sit amet dapibus;</p>
<p>Nunc consequat mauris.</p>
<p><span class="class3">15</span>Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Nunc volutpat lacus a lacus dignissim sed iaculis metus consectetur. <span class="class3">17</span>Nunc consequat mauris nec ligula ullamcorper ut iaculis nibh sodales. "Nulla tincidunt lorem eu odio laoreet facilisis." <span class="class3">18</span>Aliquam erat volutpat. Curabitur sagittis, mauris quis laoreet consectetur, erat urna tincidunt augue, ut eleifend felis mi quis felis. <span class="class3">19</span>Vivamus a elit risus, consequat sagittis ligula. Nunc ut vestibulum ipsum. Curabitur at sapien vitae est egestas aliquam. <span class="class3">20</span> Donec porttitor, ligula vel venenatis posuere, purus nunc adipiscing ante, id pellentesque turpis nulla eu magna. <span class="class3">21</span>Praesent gravida, eros ut scelerisque commodo, magna quam volutpat elit, a aliquet neque ligula a mauris. <span class="class3">22</span>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor. <span class="class3">23</span>Lorem ipsum dolor sit:</p>
<p>Pellentesque pretium, ligula tristique</p>
<p>felis viverra;</p>
<p>justo lobortis ut "l"</p>
<p>unc ut consectetur fermentum.</p>
<p><span class="class3">14</span>Proin et tellus felis:</p>
<p>Suspendisse potenti,</p>
<p>enim non tortor</p>
<p>Donec porttitor.</p>
<p>Morbi eleifend fermentum</p>
<p>Aliquam id ante.</p>
<p><span class="class3">15</span></p>
<p>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor,</p>
<p>etiam ullamcorper.</p>
<p>vivamus interdum nulla,</p>
<p>odio laoreet facilisis.</p>
<p><span class="class3">20</span>Suspendisse potenti. Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. <span class="class3">21</span>Suspendisse potenti. Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. </p>
</body>
~~~~~~~~~~~~~~~~~~~~~
To become this:
~~~~~~~~~~~~~~~~~~~~~
New Document
<body>
<h2><a name="2" class="class1">2</a></h2> ^ top ^
<p><span class="class3">20</span>Sed imperdiet, lacus eu consectetur tempus, tellus metus vestibulum tortor, nec tincidunt nisl enim non tortor. <span class="class3">21</span>Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. Phasellus neque justo, aliquet non pellentesque vel, dictum non libero. Phasellus vel nulla mi, id molestie purus. Suspendisse orci ante, imperdiet at tempus id, pulvinar eu mi. Aliquam erat volutpat. <span class="class3">22</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Pellentesque pretium, ligula tristique porta fringilla, mauris lectus gravida nibh, consectetur ornare lacus tellus quis sem. <span class="class3">23</span>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor.</p>
<p><span class="class3">24</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. <span class="class3">25</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.</p>
<p><span class="class3">26</span>Sed imperdiet, lacus eu consectetur tempus, "tellus metus vestibulum tortor, nec tincidunt nisl enim non tortor."</p>
<p><span class="class3">27</span><br />
Nunc volutpat lacus;<br />
Etiam sit amet dapibus;<br />
Nunc consequat mauris.</p>
<p><span class="class3">15</span>Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Nunc volutpat lacus a lacus dignissim sed iaculis metus consectetur. <span class="class3">17</span>Nunc consequat mauris nec ligula ullamcorper ut iaculis nibh sodales. "Nulla tincidunt lorem eu odio laoreet facilisis." <span class="class3">18</span>Aliquam erat volutpat. Curabitur sagittis, mauris quis laoreet consectetur, erat urna tincidunt augue, ut eleifend felis mi quis felis. <span class="class3">19</span>Vivamus a elit risus, consequat sagittis ligula. Nunc ut vestibulum ipsum. Curabitur at sapien vitae est egestas aliquam. <span class="class3">20</span> Donec porttitor, ligula vel venenatis posuere, purus nunc adipiscing ante, id pellentesque turpis nulla eu magna. <span class="class3">21</span>Praesent gravida, eros ut scelerisque commodo, magna quam volutpat elit, a aliquet neque ligula a mauris. <span class="class3">22</span>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor. <span class="class3">23</span>Lorem ipsum dolor sit:<br />
Pellentesque pretium, ligula tristique<br />
felis viverra;<br />
justo lobortis ut "l"<br />
unc ut consectetur fermentum.</p>
<p><span class="class3">14</span>Proin et tellus felis:<br />
Suspendisse potenti,<br />
enim non tortor<br />
Donec porttitor.<br />
Morbi eleifend fermentum<br />
Aliquam id ante.</p>
<p><span class="class3">15</span><br />
Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor,<br />
etiam ullamcorper.<br />
vivamus interdum nulla,<br />
odio laoreet facilisis.</p>
<p><span class="class3">20</span>Suspendisse potenti. Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. <span class="class3">21</span>Suspendisse potenti. Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. </p>
</body>
I can't include the image, sorry. You have to follow the link at the top if you want to see it.
Thanks.
Use BeautifulSoup to parse the document and recreate it after processing it. It is the easiest thing to do. I wouldn't use lxml for what you are trying to do.
http://www.crummy.com/software/BeautifulSoup/documentation.html
Look at example here on how tags are added and removed:
Extract all <script> tags in an HTML page and append to the bottom of the document
https://stackoverflow.com/questions/tagged/beautifulsoup
If you're really that short on time you may be able to accomplish your task after reading chapter 8 of Dive Into Python ( http://diveintopython.net/html_processing/index.html ).
That said, I strongly suggest that you start from the very beginning of the book.
Regular expressions (chapter 7 of the same book) may also be of great help. I have not quite understood what you're trying to accomplish, though. Replace <p></p> tags with <br/>?
Anyway, look into the sgmllib and re modules.
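Going by the before/after samples in the question, the goal seems to be to fold every paragraph that does not begin with a numbered span into the preceding paragraph, separated by <br />. A rough regular-expression sketch of that idea follows. It assumes the paragraphs are written exactly as <p> with no attributes and that a new numbered paragraph always starts with <span; the function name and the shortened sample are made up for illustration, and a real HTML parser would be more robust on messier input.

```python
import re


def merge_continuation_paragraphs(html):
    """Join each <p> that does not start with <span onto the previous <p>."""
    # Replace the "</p> ... <p>" boundary with a line break, unless the
    # next paragraph opens with a <span> (checked by the negative lookahead).
    return re.sub(r"</p>\s*<p>(?!<span)", "<br />\n", html)


# Shortened sample modelled on the question's input.
sample = (
    '<p><span class="class3">27</span></p>\n'
    "<p>Nunc volutpat lacus;</p>\n"
    "<p>Etiam sit amet dapibus;</p>\n"
    "<p>Nunc consequat mauris.</p>\n"
    '<p><span class="class3">15</span>Vestibulum ante ipsum.</p>\n'
)
merged = merge_continuation_paragraphs(sample)
print(merged)
```

The paragraph starting with span 15 is left as its own paragraph, while the three lines after span 27 end up inside one paragraph joined by <br />.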