Parsing HTML with str.split in Python

I'm parsing a website with the requests module and I'm trying to get specific URLs inside <a> tags (well, a whole table of them, since those tags are used more than once) without using BeautifulSoup. Here's part of the HTML I'm trying to parse:
<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
<div class="thread-link-outer-wrapper">
<div class="thread-link-container notranslate">
Forum Rule: Don't Spam in Any Way
</div>
I'm trying to get the URL inside the href attribute of the <a> tag:
/Forum/ShowPost.aspx?PostID=80631954
The thing is, because I'm parsing a forum site, those wrapper tags are used many times over. I'd like to retrieve a table of post URLs using str.split, with code similar to this:
htmltext.split('<a class="post-list-subject" href="')[1].split('"><div class="thread-link-outer-wrapper">')[0]
There is nothing in the HTML code to indicate a post number on the page, just links.

In my opinion there are better ways to do this. Even if you don't want to use BeautifulSoup, I would lean towards regular expressions. However, the task can definitely be accomplished using the code you want. Here's one way, using a list comprehension:
results = [chunk.split('">')[0] for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]
I tried to model it as closely on your original code as possible, but I did simplify the second split argument to avoid whitespace issues.
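A quick demonstration on the fragment from the question (here htmltext holds just that fragment; in your case it would be the full page source):
htmltext = '''<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
<div class="thread-link-outer-wrapper">'''
results = [chunk.split('">')[0] for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]
print(results)  # ['/Forum/ShowPost.aspx?PostID=80631954']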
In case regular expressions are fair game, here's how you could do it:
import re
target = '<a class="post-list-subject" href="(.*?)">'  # non-greedy, so each match stops at the first closing quote
results = re.findall(target, htmltext)

Consider using Beautiful Soup. It will make your life a lot easier. Pay attention to the choice of parser so that you can get the balance of speed and leniency that is appropriate for your task.
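For reference, a minimal sketch with Beautiful Soup (this assumes bs4 is installed; the class name comes straight from the question's markup):
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltext, 'html.parser')  # stdlib parser: a reasonable speed/leniency trade-off
urls = [a['href'] for a in soup.find_all('a', class_='post-list-subject')]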

It seems really dicey to try to pre-optimize without establishing that HTML parsing is actually going to be your bottleneck. If you're worried about performance, why not use lxml? Module imports are hardly ever the bottleneck, and it sounds like you're shooting yourself in the foot here.
That said, the code below will technically do what you want, but it seriously is not more performant than an HTML parser like lxml in the long run. Explicitly avoiding an HTML parser will also probably drastically increase your development time, as you puzzle out obscure string-manipulation snippets rather than just using the nice tree structure that you get for free with HTML.
# Strip all whitespace so the split markers match regardless of the page's formatting
strcleaner = lambda x: x.replace('\n', '').replace(' ', '').replace('\t', '')
S = strcleaner(htmltext)
url = S.split(strcleaner('<a class="post-list-subject" href="'))[1].split(strcleaner('"><div class="thread-link-outer-wrapper">'))[0]
The problem with the code you posted is that whitespace and newlines are characters too.
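For comparison, a sketch of the same extraction with lxml (the XPath simply matches the class from the question):
from lxml import html
doc = html.fromstring(htmltext)
urls = doc.xpath('//a[@class="post-list-subject"]/@href')  # every post link on the page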

Related

How to perform a Google search and take the text result?

Wondering how to use Python 3 to perform a Google search and build a dictionary of some words (so, say I enter a word, I want Python to take the definition Google gives, then store or display it).
I haven't done much coding, but I know how to manage the words afterwards. I'm just a bit confused by urllib and the like. I have only been able to find help for this on other versions of Python, which I have not been able to replicate on Python 3.3.
EDIT: Yes, I want to use Google because I like the way it defines words and phrases, and I plan to use the define protocol you mentioned, icedtrees.
Edit: it appears that Google Search grabs its definitions using AJAX calls or something. The below solution will not work.
If you are having trouble using urllib, I suggest the nice Python Requests package, which is a lot easier to use.
If you are absolutely committed to getting the Google definition and no other, I would suggest making an HTTP request to a page using the Google Search "define" protocol.
For example:
https://www.google.com.au/search?q=define:test
You would then save the HTML result and parse it for the definitions that you require. Some examples of Python HTML parsers are the HTMLParser module and BeautifulSoup. However, this parsing operation seems pretty simple, so a basic regex should be more than enough. All definitions are stored as follows:
<div style="display:inline" data-dobid="dfn"> <!-- the order of style and data-dobid can change -->
<span>definition goes here</span>
</div>
An example of a regex to grab the definitions of "test" from the HTML page:
import re
definitions = re.findall(r'data-dobid="dfn".*?>.*?<span>(.*?)</span>.*?</div>', html, re.DOTALL)
>>> len(definitions)
18
>>> definitions[0]
'a\n procedure intended to establish the quality, performance, or \nreliability of something, especially before it is taken into widespread \nuse.'
# Looks like you might need to remove the newlines
>>> definitions[5]
'the result of a medical examination or analytical procedure.'
As a sidenote, there also exists a Google Dictionary API, which can give you definition results in JSON format in response to a request.
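To make the flow concrete, here is a sketch of fetching that page with Requests and applying the regex from above. Keep the edit at the top in mind: Google now appears to load definitions via AJAX, so this may well return an empty list today; the User-Agent header is an assumption to avoid the default one being rejected.
import re
import requests

resp = requests.get('https://www.google.com.au/search',
                    params={'q': 'define:test'},
                    headers={'User-Agent': 'Mozilla/5.0'})  # assumed header
definitions = re.findall(r'data-dobid="dfn".*?>.*?<span>(.*?)</span>.*?</div>',
                         resp.text, re.DOTALL)
print([d.replace('\n', '') for d in definitions])  # strip the stray newlines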

Need help parsing html in python3, not well formed enough for xml.etree.ElementTree

I keep getting mismatched tag errors all over the place. I'm not sure why exactly; it's the text of the craigslist homepage, which looks fine to me, but I haven't skimmed it thoroughly enough to tell. Is there perhaps something more forgiving I could use, or is this my best bet for HTML parsing with the standard library?
The mismatched tag errors are likely caused by exactly that: mismatched tags. Browsers are famous for accepting sloppy HTML, and have made it easy for web page coders to write badly formed HTML, so there's a lot of it out there. There's no reason to believe that craigslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it).
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
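If you go the Tidy route, the pytidylib bindings make that a one-liner. A sketch, assuming pip install pytidylib and the HTML Tidy C library installed, with bad_html holding your markup:
from tidylib import tidy_document
cleaned, errors = tidy_document(bad_html)  # cleaned is well-formed HTML; errors is a report of what was fixed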
The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.
Neither does this parser.
However, it isn't well supported on Python 3; there's more information about this at the end of the link.
Parsing HTML is not an easy problem; using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.
lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It's awesome and fast as well, since it uses C libraries under the hood. I highly recommend it.
BeautifulSoup 3.1 supports Python 3, but it is also deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
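A minimal sketch of that on Python 3 (the URL and the attribute extracted are just placeholders for whatever you need from the page):
import urllib.request
from lxml import html

page = urllib.request.urlopen('http://www.craigslist.org/').read()
doc = html.fromstring(page)  # lenient parser: tolerates mismatched tags
links = [a.get('href') for a in doc.iter('a')]  # e.g. collect every link target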

batch script or python program to edit string in xml tags

I am looking to write a program that searches for certain tags in an XML document and changes the string between those tags from localhost to manager. The tag might appear in the XML document multiple times, and the document does have a definite path. Would Python or VBScript make the most sense for this problem? And can anyone provide a template so I can get started? That would be great. Thanks.
If it's a simple thing, like changing a few strings here and there, you might be able to do everything with a python regexp, check here:
http://docs.python.org/library/re.html
For everything more complex, I would suggest using something like Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/documentation.html
It's a bit outdated, but contains everything you would ever need...
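For the simple case, a hypothetical one-liner along those lines (the <host> tag name is only an assumption standing in for whatever tag you're targeting, and xml holds the document text):
import re
xml = re.sub(r'(<host>)localhost(</host>)', r'\1manager\2', xml)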
I agree this belongs to stackoverflow.com, as it's a programming question.
I suggest that you go straight to the lxml library for Python and don't look back. Regex manipulation of XML can have terrible consequences, and BeautifulSoup, although quite popular, is officially abandoned.
lxml is quite powerful, fast and efficient. For your task, it is sufficient to write:
from lxml import etree

doc = etree.fromstring(content)
# './/' searches the whole tree, not just the root's direct children
elements = doc.findall('.//tags_to_modify')
for el in elements:
    el.text = your_replacement_function(el.text)
print(etree.tostring(doc))
You can find a lot of help in lxml's documentation:
http://lxml.de/

Complex HTML parsing with Python

I am already aware of tag based HTML parsing in Python using BeautifulSoup, htmllib etc.
However, I want a powerful engine which can do complex tasks like reading HTML tables, lists, etc. and present them as simple-to-use objects within code. Does Python have such powerful libraries?
BeautifulSoup is a nice library and provides a good way to parse HTML, with some handy ways to extract the data very easily.
What you are trying to do can easily be done using some simple regular expressions. You can write regular expressions to search for a particular pattern of data and extract the data you need.
You might consider lxml which has a powerful HTML processor. There is another complementary module that relies on lxml called pyquery that might be just what you're looking for.
PyQuery has jQuery-like syntax, so if you're used to jQuery you'll be able to jump right in.
Here is a simple example to get the first <ul> item from aol.com:
>>> from pyquery import PyQuery as pq
>>> from urllib.request import urlopen
>>> data = urlopen('http://aol.com').read()
>>> d = pq(data)
>>> first_ul = d('ul:first')
>>> first_ul
[<ul#dhL2>]
>>> print(first_ul)
<ul id="dhL2"><li class="dhL1"><a accesskey="" href="https://new.aol.com/productsweb/?promocode=827693&ncid=txtlnkuswebr00000074" name="om_dirbtn1" class="_o4-0" id="om_dirbtn1">Get Free Mail</a></li>
</ul>
The standard HTML parsers are already pretty good at giving you simple objects (e.g. iterables). Creating anything more complex than a 2D list from a table would likely be dependent on the data that was in the page.
With that said...
Here's a link to a blog post by someone who wrote a script to convert html tables to python lists. The actual file is located here.
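For the simple 2D-list case, a rough sketch with lxml:
from lxml import html

snippet = '<table><tr><th>Name</th><th>Age</th></tr><tr><td>Alice</td><td>30</td></tr></table>'
doc = html.fromstring(snippet)
rows = [[cell.text_content().strip() for cell in row.xpath('./th|./td')]
        for row in doc.xpath('.//tr')]
print(rows)  # [['Name', 'Age'], ['Alice', '30']]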
I've never heard of a standard python library that does these sorts of operations, so your best bet might be Googling each case as you need it. Chances are someone has done what you are trying to do.
Disclaimer: You should always read and understand any code you find online before pasting it into your own applications! Citing who/where it's from is good too!

Filter out HTML tags and resolve entities in python

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
Use lxml, which is the best XML/HTML library for Python.
import lxml.html

t = lxml.html.fromstring("...")  # "..." is your HTML string
t.text_content()  # the text with tags stripped and entities resolved
And if you just want to sanitize the html look at the lxml.html.clean module
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the text nodes, and join them.
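A minimal sketch of that, assuming bs4 (get_text() also resolves the entities):
from bs4 import BeautifulSoup
print(BeautifulSoup('<p>5 &lt; 7</p>', 'html.parser').get_text())  # 5 < 7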
While I agree with Lucas that regular expressions are not all that scary, I still think that you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box (the HTMLParser module).
You should also check out the Python bindings for TidyLib, which can clean up broken HTML, making the success rate of any HTML parsing much higher.
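A sketch of tag stripping with that out-of-the-box parser (html.parser in Python 3; convert_charrefs, on by default, resolves the entities as well):
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
    def handle_data(self, data):  # called only for text between tags
        self.chunks.append(data)

stripper = TagStripper()
stripper.feed('<div>5 &lt; 7</div>')
print(''.join(stripper.chunks))  # 5 < 7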
How about parsing the HTML data and extracting it with the help of the parser?
I'd try something like what the author describes in chapter 8.3 of the Dive Into Python book.
if you use django you might also use
http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags
;)
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with a regex will return the string "5 ", because it treats
< 7</div>
as a single tag and strips it out.
I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html. It can also resolve HTML entities.
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.
