Get a clean string from HTML, CSS and JavaScript - python

Currently, I'm trying to scrape 10-K submission text files on sec.gov.
Here's an example text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt
The text document contains things like HTML tags, CSS styles, and JavaScript. Ideally, I'd like to scrape only the content after removing all the tags and styling.
First, I tried the obvious get_text() method from BeautifulSoup. That didn't work out.
Then I tried using a regex to remove everything between < and >. Unfortunately, that didn't work entirely either: it keeps some tags, styles, and scripts.
Does anyone have a clean solution for me to accomplish my goal?
Here is my code so far:
import requests
import re
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)
text = re.sub('<.*?>', '', response.text)
print(text)
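For illustration, here is a short sketch (Python 3 here, standard library only) of why the `<.*?>` regex falls short: it deletes the tags themselves, but script bodies survive, and a bare `<` inside a script can even swallow real text up to the next `>`:

```python
import re

html = '<script>var x = 1 < 2;</script><p>Hello</p>'
# The naive pattern removes the tags, but part of the script body
# remains -- and the bare "<" inside it eats text up to the next ">".
stripped = re.sub('<.*?>', '', html)
print(stripped)  # var x = 1 Hello
```

The leftover "var x = 1 " is exactly the kind of script residue the question describes.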

Let's set a dummy string based on the example:
original_content = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""
Now let's remove all the JavaScript.
from lxml.html.clean import Cleaner  # removes javascript

# Delete javascript tags (some other options are shown for the sake of example).
cleaner = Cleaner(
    comments=True,  # True = remove comments
    meta=True,      # True = remove meta tags
    scripts=True,   # True = remove script tags
    embedded=True,  # True = remove embedded tags
)
clean_dom = cleaner.clean_html(original_content)
(From https://stackoverflow.com/a/46371211/1204332)
And then we can either remove the HTML tags (extract the text) with the HTMLParser library:
from HTMLParser import HTMLParser

# Strip HTML.
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

text_content = strip_tags(clean_dom)
print text_content
(From: https://stackoverflow.com/a/925630/1204332)
Or we could get the text with the lxml library:
from lxml.html import fromstring
print fromstring(original_content).text_content()
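For completeness: the answers above are Python 2. On Python 3, where the module moved to html.parser, the same idea works with the standard library alone. The sketch below skips <script> and <style> bodies while collecting everything else, which is the behaviour the question is after:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text while skipping <script> and <style> bodies."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return ''.join(parser.parts)

print(html_to_text('<script>alert(1)</script><p>Hello <b>world</b></p>'))
# Hello world
```

This keeps only text nodes that are not inside a script or style element, so script residue never reaches the output.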

Related

How to display text with HTML-markup? (I use ckeditor)

I heard about the filter |safe, but if I understood correctly, that's unsafe and creates a backdoor for injections.
What are the alternatives to display full posts with formatted text?
I think when you don't use the |safe filter, the output is returned as plain text with the HTML markup visible (not rendered as HTML).
But if you need to exclude some dangerous tags such as <script>location.reload()</script>, you need to handle that with a custom template-tag filter.
I got good answer from: https://stackoverflow.com/a/699483/6396981, via BeautifulSoup.
from bs4 import BeautifulSoup
from django import template
from django.utils.html import escape
register = template.Library()
INVALID_TAGS = ['script',]
def clean_html(value):
    soup = BeautifulSoup(value)
    for tag in soup.findAll(True):
        if tag.name in INVALID_TAGS:
            # tag.hidden = True  # you can also use this.
            tag.replaceWith(escape(tag))
    return soup.renderContents()
# clean_html('<h1>This is heading</h1> and this one is xss injection <script>location.reload()</script>')
# output:
# <html><body><h1>This is heading</h1> and this one is xss injection <script>location.reload()</script></body></html>
@register.filter
def safe_exclude(text):
    # e.g.: {{ post.description|safe_exclude|safe }}
    return clean_html(text)
Hope it's useful.
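The escape() call is what defuses the dangerous tag. To see what it produces on its own, the standard library offers the same behaviour (html.escape on Python 3; shown here as a standalone sketch outside Django):

```python
import html

payload = '<script>location.reload()</script>'
# Angle brackets become entities, so the browser renders the tag
# as visible text instead of executing it.
print(html.escape(payload))
# &lt;script&gt;location.reload()&lt;/script&gt;
```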

How to filter html tags with Python

I have the html document with an article. I have some amount of tags, that I can use for text formatting. But my text editor uses a lot of unnecessary tags for formatting. I want to write a program in Python for filtering these tags.
What would be the major logic(structure, strategy) of such a program? I'm beginner in Python and want to learn this language through solving real practical task. But I need some general overview to start.
Use BeautifulSoup:
from BeautifulSoup import BeautifulSoup
html_string = # the HTML code
parsed_html = BeautifulSoup(html_string)
print parsed_html.body.find('div', attrs = {attrs inside html code}).text
Here, div is just the tag, you can use any tag whose text you want to filter.
Not so clear on your requirements but you should use ready-made parsers like BeautifulSoup in python.
You can find a tutorial here
Just don't know about what will be missed, but you can use a regex:
re.sub('<[^<]+?>', '', text)
The substitution above removes anything between < and >.
Otherwise you can use HTMLParser:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
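One wrinkle if you port this class to Python 3: html.parser converts entity references before they reach you unless you pass convert_charrefs=False, so handle_entityref never fires by default. A quick sketch showing entities surviving the strip:

```python
from html.parser import HTMLParser

class EntityKeepingStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so handle_entityref still fires,
        # matching the Python 2 behaviour above.
        super().__init__(convert_charrefs=False)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

s = EntityKeepingStripper()
s.feed('<p>Ben &amp; Jerry</p>')
print(s.get_data())  # Ben &amp; Jerry
```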

Python if-statement based on content of HTML title tag

We are trying to write a Python script to parse HTML with the following conditions:
If the HTML title tag contains the string "Record doesn't exist," then continue running a loop.
If NOT, download the page contents.
How do we write an if-statement based on the conditions?
We're aware of Beautiful Soup, unfortunately we don't have permission to install it on the machine we're using.
Our code:
import urllib2

opp1 = 1
oppn = 2
for opp in range(opp1, oppn + 1):
    oppurl = (something.com)
    response = urllib2.urlopen(oppurl)
    html = response.read()
    # syntax error on the next line #
    if Title == 'Record doesn't exist':
        continue
    else:
        oppfilename = 'work/opptest' + str(opp) + '.htm'
        oppfile = open(oppfilename, 'w')
        opp.write(opphtml)
        print 'Wrote ', oppfile
votefile.close()
You can use a regular expression to get the contents of the title tag:
import re
m = re.search('<title>(.*?)</title>', html)
if m:
    title = m.group(1)
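A minimal, self-contained run of that approach (Python 3 syntax here just for illustration):

```python
import re

html = "<html><head><title>Record doesn't exist</title></head></html>"
m = re.search('<title>(.*?)</title>', html)
# Fall back to an empty string when there is no <title> at all.
title = m.group(1) if m else ''
print(title)  # Record doesn't exist
```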
Try Beautiful Soup. It's an amazingly easy to use library for parsing HTML documents and fragments.
import urllib2
from BeautifulSoup import BeautifulSoup

for opp in range(opp1, oppn + 1):
    oppurl = 'www.myhomepage.com'
    response = urllib2.urlopen(oppurl)
    html = response.read()
    soup = BeautifulSoup(html)
    if soup.head.title.string == "Record doesn't exist":
        continue
    else:
        oppfilename = 'work/opptest' + str(opp) + '.htm'
        oppfile = open(oppfilename, 'w')
        oppfile.write(html)
        print 'Wrote ', oppfile
        oppfile.close()
---- EDIT ----
If Beautiful Soup isn't an option, I personally would resort to a regular expression. However, I refuse to admit that in public, as I won't let people know I would stoop to the easy solution. Let's see what's in that "batteries included" bag of tricks.
HTMLParser looks promising; let's see if we can bend it to our will.
from HTMLParser import HTMLParser

def titleFinder(html):
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            self.intitle = tag == "title"
        def handle_data(self, data):
            if self.intitle:
                self.title = data
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.title

>>> print titleFinder('<html><head><title>Test</title></head>'
...                   '<body><h1>Parse me!</h1></body></html>')
Test
That's incredibly painful. That's almost as wordy as Java. (just kidding)
What else is there? There's xml.dom.minidom, a "Lightweight DOM implementation". I like the sound of "lightweight"; means we can do it with one line of code, right?
import xml.dom.minidom
html = '<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>'
title = ''.join(node.data for node in xml.dom.minidom.parseString(html).getElementsByTagName("title")[0].childNodes if node.nodeType == node.TEXT_NODE)
>>> print title
Test
And we have our one-liner!
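One caveat worth noting: minidom parses XML, not HTML, so it chokes on markup that is not well-formed XML (an unclosed <br>, for instance). That's fine for the toy string above, but risky on real pages:

```python
import xml.dom.minidom

try:
    # <br> is valid HTML but not well-formed XML, so expat rejects it.
    xml.dom.minidom.parseString('<html><br></html>')
except Exception as e:
    print('parse failed:', e)
```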
So I heard that these regular expressions things are pretty efficient as extracting bits of text from HTML. I think you should use those.

BeautifulSoup python to parse html files

I am using BeautifulSoup to replace all the commas in an html file with ‚. Here is my code for that:
f = open(sys.argv[1],"r")
data = f.read()
soup = BeautifulSoup(data)
comma = re.compile(',')
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '‚'))
This code works except when there is some JavaScript included in the HTML file. In that case it even replaces the commas within the JavaScript code, which is not wanted. I only want to replace them in the text content of the HTML file.
Every text node knows which tag encloses it, so you can filter the matches from findAll by the parent tag's name:
tags_to_skip = set(["script", "style"])
# Add to this set as needed.

def valid_text(t):
    """Keep a text node only if its enclosing tag
    is not in ``tags_to_skip``."""
    return t.parent.name.lower() not in tags_to_skip

for t in soup.findAll(text=comma):
    if valid_text(t):
        t.replaceWith(t.replace(',', '‚'))

python HTMLParser to replace some strings in the data of the html file

I need to replace some strings in the data content of my html page. I can't use replace function directly because I need to change only the data section. It shouldn't modify any of the tags or attributes. I used HTMLParser for this. But I am stuck on writing it back to file. Using HTMLParser I can parse and get data content on which I will do necessary changes. But how to put it back to my html file ?
Please help. Here is my code:
class EntityHTML(HTMLParser.HTMLParser):
    def __init__(self, filename):
        HTMLParser.HTMLParser.__init__(self)
        f = open(filename)
        self.feed(f.read())
    def handle_starttag(self, tag, attrs):
        """Needn't do anything here"""
        pass
    def handle_data(self, data):
        print data
        data = data.replace(",", "&sbquo")
HTMLParser doesn't construct any representation in memory of your html file. You could do it yourself in handle_*() methods but a simpler way would be to use BeautifulSoup:
>>> import re
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<a title=,>,</a>')
>>> print soup
<a title=",">,</a>
>>> comma = re.compile(',')
>>> for t in soup.findAll(text=comma): t.replaceWith(t.replace(',', '&sbquo'))
>>> print soup
<a title=",">&sbquo</a>
