When using the html2text Python package to convert HTML to Markdown, it adds '\n' to the text. I also see this behaviour when trying the demo at http://www.aaronsw.com/2002/html2text/
Is there any way to turn this off? Of course I can remove them myself, but there might be occurrences of '\n' in the original text which I don't want to remove.
html2text('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.')
u'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod\ntempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo\nconsequat. Duis aute irure dolor in reprehenderit in voluptate velit esse\ncillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non\nproident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n\n'
In the latest version of html2text do this:
import html2text
h = html2text.HTML2Text()
h.body_width = 0
note = h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!")
Setting body_width to 0 disables the word wrapping that html2text otherwise applies.
Looking at the source to html2text.py, it looks like you can disable the wrapping behavior by setting BODY_WIDTH to 0. Something like this:
import html2text
html2text.BODY_WIDTH = 0
text = html2text.html2text('...')
Of course, resetting BODY_WIDTH globally changes the module's behavior. If I had a need to access this functionality, I'd probably seek to patch the module, creating a parameter to html2text() to modify this behavior per-call, and provide this patch back to the author.
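If you do go the global route, one way to limit the side effect is to change the setting only for the duration of a call and restore it afterwards. Here is a minimal sketch of that idea; temporary_attr is a hypothetical helper, not part of html2text, and fake_module is a stand-in object so the example runs without the package installed. Whether a given html2text version reads BODY_WIDTH at call time is an assumption you would need to check.

```python
import contextlib
import types

@contextlib.contextmanager
def temporary_attr(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore the old value."""
    old = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Restore the previous value even if the block raised.
        setattr(obj, name, old)

# Stand-in for a module with a global setting (hypothetical, not html2text itself).
fake_module = types.SimpleNamespace(BODY_WIDTH=78)

with temporary_attr(fake_module, "BODY_WIDTH", 0):
    inside = fake_module.BODY_WIDTH   # value seen while the block runs
after = fake_module.BODY_WIDTH        # value seen once the block has exited
```

With the real package you would pass the imported module instead of fake_module and call the conversion function inside the with-block.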
Let's say I've scraped this from a website.
PARIS - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2015). Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat 22/05/2015. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
I can just use .replace('PARIS - ', '') and then extract the text with a regex, but what if the place name differs from article to article?
How do I exclude the leading "PARIS" and " - " and get the rest of the text?
Should I separate the location from the content with a regex?
What should I think about or do first when facing a problem like this?
Here's my code to get the first string (for my third question); assume that text is a variable containing the text above:
location = re.findall(r'^\w+', text)
Use a regular expression that matches a sequence of uppercase letters and spaces followed by a hyphen at the beginning, and replaces it with an empty string.
text = re.sub(r'^[A-Z\s]+\s-\s*', '', text)
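If you also want to keep the location rather than throw it away, a capturing variant of the same idea might look like this. It is only a sketch: DATELINE and split_dateline are names I made up, and it assumes the location is written in capitals and separated from the body by " - ".

```python
import re

# Hypothetical pattern: an all-caps location, then " - ", then the article body.
DATELINE = re.compile(r'^([A-Z][A-Z ]*?)\s+-\s+(.*)$', re.DOTALL)

def split_dateline(text):
    """Return (location, body); location is None when no dateline is found."""
    match = DATELINE.match(text)
    if match is None:
        return None, text
    return match.group(1), match.group(2)

location, body = split_dateline("PARIS - Lorem ipsum dolor sit amet.")
```

Text that does not start with such a prefix is returned unchanged, so it should be safe to run over articles that have no location at all.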
I want to print the abstract of a paper in the middle of a Linux terminal screen. The abstract is one long continuous paragraph. I tried:
print(colored(text.center(80), 'blue'))
but since the string is long, it still occupies the whole width of the screen, whereas I want to fit the text between, say, columns 10 and 70 (for an 80-column screen).
You can use the textwrap module:
import textwrap
abstract = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
abstract = "\n".join(textwrap.wrap(abstract, 60)) # wrap at 60 characters
print(textwrap.indent(abstract, " "*10)) # indent with 10 spaces
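If the terminal is not always 80 columns wide, the left margin can be computed instead of hard-coded. A rough sketch, assuming a helper name (center_block) of my own and using shutil.get_terminal_size to read the width; note that the lines are left-aligned inside the block, not fully justified.

```python
import shutil
import textwrap

def center_block(text, width=60, total=None):
    """Wrap text to `width` columns and pad each line so the block sits in the middle."""
    if total is None:
        # Fall back to 80 columns when the size cannot be determined.
        total = shutil.get_terminal_size(fallback=(80, 24)).columns
    margin = " " * max((total - width) // 2, 0)
    return "\n".join(margin + line for line in textwrap.wrap(text, width))

block = center_block("Lorem ipsum dolor sit amet", width=60, total=80)
```

Printing block (optionally through colored(...)) then gives the same layout as the indent-by-10 version on an 80-column screen.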
I am trying to create a program that simulates word wrapping text found in programs like Word or Notepad. If I have a long text, I would like to print out 64 characters (or less) per line, followed by a newline return, without truncating words. Using Windows 10, PyCharm 2018.2.4 and Python 3.6, I've tried the following code:
long_str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit," \
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." \
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris" \
"nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in" \
"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur." \
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui" \
"officia deserunt mollit anim id est laborum."
concat_str = long_str[:64] # The first 64 characters
rest_str = long_str[65:] # The rest of the string
rest_str_len = len(rest_str)
while rest_str_len > 64:
    print(concat_str.lstrip() + " (" + str(len(concat_str)) + ")" + "\n")
    concat_str = rest_str[:64]
    rest_str = rest_str[65:]
    rest_str_len = len(rest_str)
print(concat_str.lstrip() + " (" + str(len(concat_str)) + ")" + "\n")
print(rest_str.lstrip() + " (" + str(len(rest_str)) + ")")
This is so close, but there are two problems. First, the code drops letters at the end or beginning of lines, as in the following output:
# I've added the total len() at the end of each line just to check-sum.
'Lorem ipsum dolor sit amet, consectetur adipiscing elit,sed do e (64)'
'usmod tempor incididunt ut labore et dolore magna aliqua. Ut enim (64)'
'ad minim veniam, quis nostrud exercitation ullamco laborisnisi u (64)'
'aliquip ex ea commodo consequat. Duis aute irure dolor inrepreh (64)'
'nderit in voluptate velit esse cillum dolore eu fugiat nulla par (64)'
'atur. Excepteur sint occaecat cupidatat non proident, sunt in cul (64)'
'a quiofficia deserunt mollit anim id est laborum. (49)'
The second problem is that I need the code to print a newline only after a whole word (or punctuation), instead of chopping up the word at 64 characters.
Use textwrap.wrap:
import textwrap
long_str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit," \
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." \
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris" \
"nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in" \
"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur." \
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui" \
"officia deserunt mollit anim id est laborum."
lines = textwrap.wrap(long_str, 64, break_long_words=False)
print('\n'.join(lines))
This takes a long string and splits it into lines of at most the given width. Setting break_long_words to False prevents words from being split.
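As for the first problem in the question, the dropped letters appear to come from the slice bounds: long_str[:64] stops at index 63 and long_str[65:] starts at index 65, so the character at index 64 is never printed. A small check with a made-up sample string:

```python
sample = "abcdefghij" * 10          # made-up 100-character string

head = sample[:64]                  # indices 0..63
rest_buggy = sample[65:]            # starts at 65, so index 64 is skipped
rest_fixed = sample[64:]            # starts right where head stopped

# Number of characters that fall between the two slices.
lost = len(sample) - len(head) - len(rest_buggy)
```

Using [64:] in place of [65:] keeps every character, although the lines would still be cut in the middle of words, which is what textwrap.wrap takes care of.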
I would like to estimate the height of a div element containing only text as rendered by a browser. For example, consider this HTML file:
<html>
<body>
<div style="position:absolute; top:100pt; left:80pt; width:200pt">
<p style="line-height:16pt; font-size:9pt; font-family:Monospace; font-style:normal; font-weight:normal;">
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
</div>
</body>
</html>
This is my attempt in Python:
from PIL import ImageFont
import re
def getHeightOfTextBox(line_height, font_size, ttfFileName, box_width, text):
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    font = ImageFont.truetype(ttfFileName, font_size)
    numLines = 1
    cursor = 0
    spaceWidth = font.getsize(' ')[0]
    afterLineBreak = False
    for word in text.split(' '):
        wordWidth = font.getsize(word)[0]
        cursor += wordWidth + (spaceWidth if afterLineBreak else 0)
        if cursor > box_width:
            # The word does not fit: print it and start a new line with it.
            print(word)
            numLines += 1
            afterLineBreak = True
            cursor = wordWidth
        else:
            afterLineBreak = False
            cursor += spaceWidth
    return numLines * line_height
text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
'''
print(getHeightOfTextBox(16, 9, "LiberationMono-Regular.ttf", 200, text))
Here is the output of the script:
adipiscing
incididunt
aliqua.
nostrud
nisi
Duis
in
fugiat
occaecat
culpa
id
192
which gives 12 lines in total. In my Firefox browser, however, the div element is rendered with 13 lines.
In this example, Pillow says the width of a character is 5pt, i.e. 0.025 of the width of the div. Using Firefox's Inspect Element feature, Firefox calculates the width of a character to be 7px and the width of the div to be 266.667px, i.e. 0.02625 of the width of the div. At this point, my best guess is that the discrepancy arises because Firefox thinks the width of a character is slightly larger than Pillow does. If I change this example to use a font size of 10pt, then my code and Firefox agree.
Is it possible to correct my code to accurately estimate, using the font metric file, the height of the div as rendered by a browser? I know another option is to use a headless browser like PhantomJS in conjunction with Selenium, but I am hoping to avoid this, as I feel it could be overkill.
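To make the character-width guess concrete: for a monospace font the estimate reduces to how many characters fit on one line, so the two sets of figures can be compared without any font file. The sketch below uses only the numbers quoted above; estimate_lines is a name I made up, and textwrap's greedy wrapping is only an approximation of how a browser breaks lines.

```python
import math
import textwrap

TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim "
    "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea "
    "commodo consequat. Duis aute irure dolor in reprehenderit in voluptate "
    "velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint "
    "occaecat cupidatat non proident, sunt in culpa qui officia deserunt "
    "mollit anim id est laborum."
)

def estimate_lines(text, box_width, char_width):
    """Estimate the wrapped line count for a monospace font with a fixed character width."""
    chars_per_line = math.floor(box_width / char_width)
    return len(textwrap.wrap(" ".join(text.split()), chars_per_line))

pillow_lines = estimate_lines(TEXT, 200, 5)        # Pillow's figures, in pt
firefox_lines = estimate_lines(TEXT, 266.667, 7)   # Firefox's figures, in px
pillow_height = pillow_lines * 16                  # line-height of 16pt
firefox_height = firefox_lines * 16
```

With these inputs the two calls give 12 and 13 lines, the same counts as the script and the browser, which is at least consistent with the idea that the character width is the source of the discrepancy.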
I have a 1000-character text string and I want to split it into chunks shorter than 100 characters without splitting a word (99 characters is fine, 100 is not). The text should only be split on whitespace:
Example:
text = "... this is a test , and so on..."
^
#position: 100
should be split into:
newlist = ['... this is a test ,', ' and so on...', ...]
I want to get a list newlist of the text split properly into readable (not word-cropped) chunks. How would you do this?
Use the textwrap module's wrap function. The example below wraps the text into lines at most 10 characters wide:
In [1]: import textwrap
In [2]: textwrap.wrap("... this is a test , and so on...", 10)
Out[2]: ['... this', 'is a test', ', and so', 'on...']
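For the limit in the question (chunks strictly shorter than 100 characters), the width would be 99. A small sketch, with split_chunks as a made-up wrapper name and a made-up sample string; note that wrap drops the whitespace at the break points, so the chunks will not start with a space as in the question's expected output.

```python
import textwrap

def split_chunks(text, limit=100):
    """Split text on whitespace into chunks strictly shorter than `limit`."""
    # break_long_words=False keeps words whole; a single word longer than
    # the limit would then end up in an over-long chunk of its own.
    return textwrap.wrap(text, width=limit - 1, break_long_words=False)

sample = "word " * 200   # made-up 1000-character input
chunks = split_chunks(sample)
```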
You can use the textwrap module:
In [2]: import textwrap
In [3]: textwrap.wrap("""Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
...: tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
...: quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
...: consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
...: cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
...: proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
""", 40)
Out[3]:
['Lorem ipsum dolor sit amet, consectetur',
'adipisicing elit, sed do eiusmod tempor',
'incididunt ut labore et dolore magna',
'aliqua. Ut enim ad minim veniam, quis',
'nostrud exercitation ullamco laboris',
'nisi ut aliquip ex ea commodo consequat.',
'Duis aute irure dolor in reprehenderit',
'in voluptate velit esse cillum dolore eu',
'fugiat nulla pariatur. Excepteur sint',
'occaecat cupidatat non proident, sunt in',
'culpa qui officia deserunt mollit anim',
'id est laborum.']
Use textwrap, as the other answers said. As an alternative, if fixed-size chunks are enough (note that this cuts through words), a generator works:
def splitter(s, n):
    # Yield consecutive slices of s, each at most n characters long.
    for start in range(0, len(s), n):
        yield s[start:start+n]

data = "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghij"
for splitee in splitter(data, 10):
    print(splitee)