Python lxml screen scraping?

I need to do some HTML parsing with Python. After some research, lxml seems to be my best choice, but I am having a hard time finding examples that do what I am trying to do; that is why I am here. I need to scrape a page for all of its viewable text: strip out all tags and JavaScript and leave only the text that is visible. It sounds simple enough. I did it with HTMLParser, but it is not handling JavaScript well:
class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()

    def get_text(self):
        return self.output.getvalue()

    def handle_data(self, data):
        self.output.write(data)

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text
Any ideas for a way to do this with lxml, or a better way to do it with HTMLParser? HTMLParser would be best because no additional libraries are needed. Thanks, everyone.
Scott F.

No screen-scraping library I know "does well with Javascript" -- it's just too hard to anticipate all ways in which JS could alter the HTML DOM dynamically, conditionally &c.

scrape.py can do this for you.
It's as simple as:
import scrape
s = scrape.Session()
s.go('yoursite.com')
print s.doc.text
Jump to about 2:40 in this video for an awesome overview from the creator of scrape.py:
pycon.blip.tv/file/3261277

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is often the right answer to python html scraping questions.
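For this kind of "visible text" extraction, a minimal sketch with BeautifulSoup (assuming bs4 is installed; the visible_text helper is just an illustrative name) could drop the script and style tags and then take get_text():
import bs4
from bs4 import BeautifulSoup

def visible_text(source):
    soup = BeautifulSoup(source, 'html.parser')
    # Remove script/style elements so their contents don't end up in the text
    for tag in soup(['script', 'style']):
        tag.decompose()
    return soup.get_text(separator=' ', strip=True)
Note this only removes the JavaScript source from the output; it does not execute it.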

I know of no Python HTML parsing libraries that handle running javascript in the page being parsed. It's not "simple enough" for the reasons given by Alex Martelli and more.
For this task you may need to think about going to a higher level than just parsing HTML and look at web application testing frameworks.
Two that can execute javascript and are either Python based or can interface with Python:
PAMIE
Selenium
Unfortunately I'm not sure if the "unit testing" orientation of these frameworks will actually let you scrape out visible text.
So the only other solution would be to do it yourself, say by integrating python-spidermonkey into your app.
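(For what it's worth, Selenium's Python bindings can hand you the rendered text of a page directly; a rough sketch, assuming the older find_element_by_* API and a working WebDriver installation, might look like this.)
from selenium import webdriver

driver = webdriver.Firefox()   # requires a browser driver on PATH
try:
    driver.get("http://yoursite.com")   # placeholder URL
    # .text is the visible text after the browser has run the page's JavaScript
    print(driver.find_element_by_tag_name("body").text)
finally:
    driver.quit()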

Your code is smart and quite flexible, I think.
How about simply adding handle_starttag() and handle_endtag() to suppress the <script> blocks?
class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()
        self.is_in_script = False

    def get_text(self):
        return self.output.getvalue()

    def handle_data(self, data):
        if not self.is_in_script:
            self.output.write(data)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.is_in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.is_in_script = False

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text
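If you still want to try lxml for this, a rough equivalent sketch (assuming lxml is installed) is to parse with lxml.html, drop the script and style elements, and take text_content():
import lxml.html

def parse_html(source):
    doc = lxml.html.fromstring(source)
    # Remove script/style subtrees so their text doesn't leak into the output
    for el in doc.xpath('//script | //style'):
        el.drop_tree()
    return doc.text_content()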

Related

How to better read and parse an xml file using Python and SAX?

Windows 11/Python 3.8.10 - Using Spyder Python IDE and PyCharm
Hey all, I'm newish to Python app dev and have a big project to parse XML files; I'm trying to write a Python program for it. Below is a very small sample of the XML file data structure I am working with.
<PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" PillCallID="49" VersionNumber="1.2">
</PillCall>
These XML files will be huge. Eventually this will need to process multiple large files with a lot of data, concurrently and 24/7: parsing the data and saving it to a DB, then, after modification, creating a new modified XML file based on the current data in the DB.
Here is my sample program, from Python Spyder IDE: -- I have tried a bunch of other methods but the SAX method has been the best to understand for me personally so far. I am sure there are better ways though.
import xml.sax

class XMLHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.pillcall = ""
        self.pillcallid = ""
        self.vernum = ""

    # Call when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "PillCall":
            print("*****PillCall*****")
            title = attributes["XMLInstanceID"]
            print("XMLInstanceID:=", title)  # How to add multiple values/strings here?
            # print(sorted()

# create an XMLReader
parser = xml.sax.make_parser()
# turn off namespaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
# override the default ContentHandler
Handler = XMLHandler()
parser.setContentHandler(Handler)
parser.parse("xmltest10.xml")
My output is this:
PillCall
XMLInstanceID:= 98089D9A-768A-4FA0-A7CD-DC5966EB5B06
I have tried many different ways to read the whole string with ElementTree and BeautifulSoup but can't get it to work. I also get no output when running this program in PyCharm.
Here is some extra Python/SAX code that I have been messing with as well, but I haven't got it working right either.
For now I just need to be able to clearly read the data and parse it to a new file, and also to know how to loop through it and find all the data to output. Thanks for any and all help!
    # Call when an element ends
    def endElement(self, tag):
        if self.CurrentData != "/PillCall":
            print("End of PillCall:", self.pillcall)
        elif self.CurrentData == "PillCallID":
            print("PillCallID:=", self.pillcallid)
        elif self.CurrentData == "VersionNumber":
            print("VersionNumber:=", self.vernum)
        self.CurrentData = ""

    # Call when a character is read
    def characters(self, content):
        if self.CurrentData == "PillCall":
            self.pillcall = content
        elif self.CurrentData == "qty":
            self.pillcallid = content
        elif self.CurrentData == "company":
            self.vernum = content
Using BeautifulSoup's find_all may be what you're looking for...
Given:
text = """
<PillCall XMLInstanceID="98089D9A-768A-4FA0-A7CD-DC5966EB5B06" PillCallID="49" VersionNumber="1.2">
</PillCall>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'xml')
for result in soup.find_all('PillCall'):
    print(result.attrs)
Output:
{'PillCallID': '49',
'VersionNumber': '1.2',
'XMLInstanceID': '98089D9A-768A-4FA0-A7CD-DC5966EB5B06'}
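Since you mention the files will be huge, a streaming parse may also be worth a look. A rough sketch with lxml's iterparse (assuming lxml is available and that the attributes are all you need from each PillCall element) could be:
from lxml import etree

def pill_calls(path):
    # Stream the document instead of loading it all into memory
    for _, elem in etree.iterparse(path, events=('end',), tag='PillCall'):
        yield dict(elem.attrib)
        elem.clear()  # free the element once it has been handled

for attrs in pill_calls('xmltest10.xml'):
    print(attrs)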

pytest-django: Is this the right way to test view with parameters?

Say I'm testing an RSS feed view in a Django app, is this how I should go about it?
def test_some_view(...):
    ...
    requested_url = reverse("personal_feed", args=[some_profile.auth_token])
    resp = client.get(requested_url, follow=True)
    ...
    assert dummy_object.title in str(resp.content)
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
Should I assert that dummy_object is in the response this way?
I'm testing here using the str representation of the response object. When is it a good practice to do this vs. using selenium? I know it makes it easier to verify that said obj or property (like dummy_object.title) is encapsulated within an H1 tag for example. On the other hand, if I don't care about how the obj is represented, it's faster to do it like the above.
Reevaluating my comment (didn't carefully read the question and overlooked the RSS feed stuff):
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
I would agree with that: from Django's point of view, you are testing your views and don't care about the exact endpoints they are mapped to. Using reverse is thus IMO the clear and correct approach.
Should I assert that dummy_object is in the response this way?
You have to pay attention here. response.content is a bytestring, so asserting dummy_object.title in str(resp.content) is dangerous. Consider the following example:
from django.contrib.syndication.views import Feed
class MyFeed(Feed):
    title = 'äöüß'
    ...
Registered the feed in urls:
urlpatterns = [
    path('my-feed/', MyFeed(), name='my-feed'),
]
Tests:
@pytest.mark.django_db
def test_feed_failing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    assert 'äöüß' in str(resp.content)

@pytest.mark.django_db
def test_feed_passing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    content = resp.content.decode(resp.charset)
    assert 'äöüß' in content
One will fail, the other won't because of the correct encoding handling.
As for the check itself, personally I always prefer parsing the content to some meaningful data structure instead of working with raw string even for simple tests. For example, if you are checking for data in a text/html response, it's not much more overhead in writing
soup = bs4.BeautifulSoup(content, 'html.parser')
assert soup.select_one('h1#title-headliner').text == 'title'
or
root = lxml.etree.parse(io.StringIO(content), lxml.etree.HTMLParser())
assert root.xpath("//h1[@id='title-headliner']")[0].text == 'title'
than just
assert 'title' in content
However, invoking a parser is more explicit (you won't accidentally test for e.g. the title in the page metadata in head) and also makes an implicit check for data integrity (e.g. you know the payload is indeed valid HTML because it parsed successfully).
To your example: in case of RSS feed, I'd simply use the XML parser:
import io

from lxml import etree

def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    root = etree.parse(io.BytesIO(resp.content))
    title = root.xpath('//channel/title')[0].text
    assert title == 'my title'
Here, I'm using lxml, which is a faster implementation of the stdlib's xml. The advantage of parsing the content into an XML tree is also that the parser reads bytestrings and takes care of the encoding handling, so you don't have to decode anything yourself.
Or use something high-level like atoma, which has a nice API specifically for RSS entities, so you don't have to fight with XPath selectors:
import atoma

@pytest.mark.django_db
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    feed = atoma.parse_atom_bytes(resp.content)
    assert feed.title.value == 'my title'
...When is it a good practice to do this vs. using selenium?
Short answer - you don't need it. I hadn't paid much attention when reading your question and had HTML pages in mind when writing the comment. Regarding the selenium remark: this library handles all the low-level stuff, so when the tests start to accumulate (and usually they do, pretty fast), writing
uri = reverse('news-feed')
resp = client.get(uri)
root = parser.parse(resp.content)
assert root.query('some-query')
and dragging the imports along becomes too much hassle, so selenium can replace it with
driver = WebDriver()
driver.get(uri)
assert driver.find_element_by_id('my-element').text == 'my value'
Sure, testing with an automated browser instance has other advantages like seeing exactly what the user would see in real browser, allowing the pages to execute client-side javascript etc. But of course, all of this applies mainly to HTML pages testing; in case of testing against the RSS feed selenium usage is an overkill and Django's testing tools are more than enough.

Using python to get data (text) from wix

I'm making a python project in which I created a test wix website.
I want to get the data (text) from the wix website using urllib, so I did:
url.urlopen(ADDRESS).readlines()
The problem is that it did not give me any of the text on the page, only information about the structure of the page in HTML.
How would I extract the requested text information from the website?
I think you'll need to end up parsing the html for the information you want. Check out this python library:
https://docs.python.org/3/library/html.parser.html
You could potentially do something like this:
from html.parser import HTMLParser

rel_data = []

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        rel_data.append(data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
print(rel_data)
Output
["Test", "Parse me!"]

How to filter html tags with Python

I have an HTML document with an article. I have a set of tags that I can use for text formatting, but my text editor uses a lot of unnecessary tags for formatting. I want to write a program in Python to filter out these tags.
What would be the main logic (structure, strategy) of such a program? I'm a beginner in Python and want to learn the language by solving real practical tasks, but I need some general overview to start.
Use BeautifulSoup:
from BeautifulSoup import BeautifulSoup
html_string = # the HTML code
parsed_html = BeautifulSoup(html_string)
print parsed_html.body.find('div', attrs = {attrs inside html code}).text
Here, div is just the tag, you can use any tag whose text you want to filter.
Not so clear on your requirements, but you should use a ready-made parser like BeautifulSoup in Python.
You can find a tutorial here
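For example, a rough sketch (assuming BeautifulSoup 4; the tag lists are hypothetical and yours to adjust) could unwrap unwanted formatting tags while keeping their text, and drop script/style elements entirely:
from bs4 import BeautifulSoup

UNWANTED = ['span', 'font']   # hypothetical: tags to remove while keeping their text
DROP = ['script', 'style']    # tags to remove together with their contents

def clean(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(DROP):
        tag.decompose()
    for tag in soup.find_all(UNWANTED):
        tag.unwrap()
    return str(soup)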
I just don't know what might be missed, but you can use a regex:
re.sub('<[^<]+?>', '', text)
The above will search for anything that looks like a tag and strip it out.
Otherwise, you can use HTMLParser:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)

    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
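Usage would then look something like:
print(html_to_text('<p>Hello <b>world</b>!</p>'))
# Hello world!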

How to escape or unscape this sequence?

Hi, how should I escape this to make the link render?
The way I write it now is with filter:
{{article.text|striptags|urlize|nl2br|safe}}
Can you recommend how to do it?
Related question: https://stackoverflow.com/questions/8179801/autolinebreaks-filter-in-jinja2
Thank you
Usually I like to use HTMLParser for this kind of processing (overkill, maybe?). Sample code below is for Python 2.7 (in Python 3 the library is renamed html.parser):
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Found Start Tag", attrs

s = ("noivos, convites de casamento <a href=\"http://www.olharcaricato.com.br\">"
     "http://www.olharcaricato.com.br</a> more entries here")

parser = MyHTMLParser()
parser.feed(s)
Outputs: Found Start Tag [('href', 'http://www.olharcaricato.com.br')]
Note: Implement the code above as a filter, tweak the output to your needs. Example of filter is found at Custom jinja2 filter for iterator
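A minimal sketch of wiring that up as a Jinja2 filter (the linkify name and its body are placeholders, not a complete implementation) could look like:
from jinja2 import Environment

def linkify(text):
    # Placeholder: run the HTMLParser-based processing above over `text`
    # and return the adjusted markup.
    return text

env = Environment()
env.filters['linkify'] = linkify
# In a template: {{ article.text|linkify|safe }}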
