I am building a Python function to input HTML email content and output simple text with very little formatting, but including line breaks for readability. The output is to be posted to a Slack channel.
Currently I take the input text, unescape it (using HTMLParser.HTMLParser.unescape, since this is Python 2.7), and then clean it with BeautifulSoup's get_text(). This produces clean text, but because the output has been stripped of all formatting, the result is almost unreadable.
How can I force BeautifulSoup to include only newlines but strip all else?
My current code is as follows:
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup  # assuming bs4; adjust if using the older BeautifulSoup package

def textify(raw_html):
    parser = HTMLParser()
    unescaped_html = parser.unescape(raw_html)
    soup = BeautifulSoup(unescaped_html)
    # strip all markup and return only the bare text
    text = soup.getText()
    return text
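A minimal sketch of one possible direction, assuming bs4 and the same unescaping as above: get_text() accepts a separator argument that is inserted between text fragments, which keeps line breaks in the output.

def textify_with_breaks(raw_html):
    parser = HTMLParser()
    unescaped_html = parser.unescape(raw_html)
    soup = BeautifulSoup(unescaped_html)
    # '\n' is inserted between the text fragments of adjacent elements
    return soup.get_text(separator='\n')

Note that the separator is also inserted around inline tags (bold, links, etc.), so this can produce more line breaks than the original markup implies.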
Related
I have an HTML string which is guaranteed to contain only text (i.e. no images, videos, or other assets). However, just to note, some of the text may be formatted, e.g. parts of it might be bold.
Is there a way to convert the HTML string output to a .txt file? I don't care about maintaining the formatting but I do want to maintain the spacing of the text.
Is that possible with Python?
#!/usr/bin/env python
import urllib2
import html2text
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())
# find() returns a tag object; html2text expects a (unicode) string
txt = soup.find('div', {'class': 'body'})
print(html2text.html2text(unicode(txt)))
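To write the result out as a .txt file (a minimal sketch; 'output.txt' is just an example name), something like this should work in the same Python 2 setup:

import codecs

plain_text = html2text.html2text(unicode(txt))
# html2text keeps blank lines between blocks, so the text spacing survives in the file
with codecs.open('output.txt', 'w', 'utf-8') as f:
    f.write(plain_text)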
I downloaded webpages using wget. Now I am trying to extract some data I need from those pages. The problem is with the Japanese words contained in this data; the English word extraction was perfect.
When I try to extract the Japanese words and use them in another app, they appear as gibberish. While testing different methods, there was one solution that fixed only half of the Japanese words.
What I tried: I tried
from_encoding="utf-8"
which had no effect. I also tried multiple ways to extract the text from the HTML, such as
section.get_text(strip=True)
section.text.strip()
and others. I also tried to encode the generated text using URL encoding, which did not work, and I tried every snippet I could find on Stack Overflow.
One of the methods that strangely worked (but not completely) was saving the string in a dictionary, dumping it to JSON, and then reading the JSON from ANOTHER script. Just using the dictionary as it is would not work; I have to use JSON as a middleman between two scripts. Strange. (Not all the words worked.)
My question may seem like a duplicate of another question, but that other question is about scraping from the internet, while what I am trying to do is extract from an offline source.
Here is a simple script demonstrating the main problem:
from bs4 import BeautifulSoup

page = BeautifulSoup(open("page1.html"), 'html.parser', from_encoding="utf-8")
word = page.find('span', {'class': "radical-icon"})
wordtxt = word.get_text(strip=True)

# then save the word to a file
with open("text.txt", "w", encoding="utf8") as text_file:
    text_file.write(wordtxt)
When I open the file I get gibberish characters.
Here is the part of the HTML that BeautifulSoup searches:
<span class="radical-icon" lang="ja">亠</span>
The expected result is to get the symbols inside the text file, or to save them properly in any way.
Is there a better web scraper to use to properly get the UTF-8 text?
PS: Sorry for the bad English.
I think I found an answer: just uninstall beautifulsoup4. I don't need it.
Python has a built-in way to search for strings; I tried something like this:
import codecs
import re

with codecs.open("page1.html", 'r', 'utf-8') as myfile:
    for line in myfile:
        if line.find('<span class="radical-icon"') > -1:
            result = re.search('<span class="radical-icon" lang="ja">(.*)</span>', line)
            s = result.group(1)

with codecs.open("text.txt", 'w', 'utf-8') as textfile:
    textfile.write(s)
This is an overcomplicated and non-Pythonic way of doing it, but what works works.
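That said, the BeautifulSoup version would probably also have worked if the file had been opened with an explicit encoding instead of the platform default; a minimal, untested sketch:

from bs4 import BeautifulSoup

# open the saved page as UTF-8 text rather than relying on the platform default encoding
with open("page1.html", 'r', encoding='utf-8') as f:
    page = BeautifulSoup(f, 'html.parser')

word = page.find('span', {'class': "radical-icon"})
with open("text.txt", 'w', encoding='utf-8') as text_file:
    text_file.write(word.get_text(strip=True))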
I am trying to create HTML via yattag. The only problem is that I take the header from another HTML file, so I read that file and try to insert its contents as the header. Here is the problem: even though I pass an unescaped HTML string, yattag escapes it, i.e. it converts '<' to '&lt;' while adding it to the HTML string.
MWE:
from yattag import Doc, indent
import html

doc, tag, text = Doc().tagtext()

h = open(nbheader_template, 'r')  # nbheader_template: path to the header HTML file
h_content = h.read()
h_content = html.unescape(h_content)

doc.asis('<!DOCTYPE html>')
with tag('html'):
    # insert dummy head
    with tag('head'):
        text(h_content)  # just some dummy text to replace later - workaround for now
    with tag('body'):
        # insert as many divs as no of files (counter is defined elsewhere)
        for i in range(counter):
            with tag('div', id='divID_' + str(i)):
                text('Div Page: ' + str(i))

result = indent(doc.getvalue())
# inject raw head - dirty workaround as yattag not doing it
# result = result.replace('<head>headtext</head>', h_content)
with open('test.html', "w") as file:
    file.write(result)
Output:
Context: I am trying to combine multiple Jupyter Python notebooks into a single HTML file, which is why the header is heavy. The header content (nbheader_template) could be found here.
If you want to prevent the escaping you have to use doc.asis instead of text.
The asis method appends a string to the document without any form of escaping.
See also the documentation.
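Applied to the head section of the code above, that would look something like:

with tag('head'):
    # asis() inserts h_content verbatim, so '<' is not converted to '&lt;'
    doc.asis(h_content)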
I'm not fully fluent in 'yattag', but the one thing I see missing is:
with tag('body'):
The code you have quoted (above) is placing the <div> elements and text into your header, where they clearly don't belong.
I’m trying to read HTML content and extract only the data (such as the lines in a Wikipedia article). Here’s my code in Python:
import urllib.request
from html.parser import HTMLParser

urlText = []

# Define HTML parser
class parseText(HTMLParser):
    def handle_data(self, data):
        print(data)
        if data != '\n':
            urlText.append(data)

def main():
    thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
    # Create instance of HTML parser (the above class)
    lParser = parseText()
    # Feed HTML file into parser. The handle_data method is implicitly called.
    with urllib.request.urlopen(thisurl) as url:
        htmlAsBytes = url.read()
    #print(htmlAsBytes)
    htmlAsString = htmlAsBytes.decode(encoding="utf-8")
    #print(htmlAsString)
    lParser.feed(htmlAsString)
    lParser.close()
    #for item in urlText:
    #    print(item)

if __name__ == '__main__':
    main()
I do get the HTML content from the webpage and if I print the bytes object returned by the read() method, it looks like I receive all the HTML content of the webpage. However, when I try to parse this content to get rid of the tags and store only the readable data, I’m not getting the result I expect at all.
The problem is that in order to use the feed() method of the parser, one has to convert the bytes object to a string. To do that you use the decode() method, which receives the encoding with which to do the conversion. If I print the decoded string, the content printed doesn’t contain the data itself (the useful readable data I’m trying to extract). Why does that happen and how can I solve this?
Note: I'm using Python 3.
Thanks for the help.
All right, I eventually used BeautifulSoup to do the job, as Alden recommended, but I still don't know why the decoding process mysteriously gets rid of the data.
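For completeness, a minimal sketch of the BeautifulSoup route (assuming bs4 is installed; the exact code I ended up with may differ slightly):

import urllib.request
from bs4 import BeautifulSoup

thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
with urllib.request.urlopen(thisurl) as url:
    htmlAsBytes = url.read()

# BeautifulSoup handles the decoding itself and strips the tags
soup = BeautifulSoup(htmlAsBytes, "html.parser")
readable = soup.get_text(separator="\n", strip=True)
print(readable)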
My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, then work on the remaining text.
I started with this, which works to remove the XML tags, but only from the entire XML file:
from urllib import urlopen
import re
url = [URL of XML FILE HERE] #the url of the file to search
raw = urlopen(url).read() #open the file and read it into variable
exp = re.compile(r'<.*?>')
text_only = exp.sub('',raw).strip()
I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).
But I can't figure out how to apply the regex to the list resulting from the soup.find_all. I've tried to use text_only = [item for item in text2 if exp.sub('',item).strip()] and variations but I keep getting this error: TypeError: expected string or buffer
What am I doing wrong?
You don't want to regex this. Instead just use BeautifulSoup's existing support for grabbing text:
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
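If you then need a single string to run your later processing on, joining the chunks is one option:

# one quoted-block element per line
text_only = "\n".join(text_chunks)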