In Python, are there any modules for extracting plain urls from a string like Perl's URI::Find does?
Thanks.
Here is a regex which can help you find URLs in text. There is no well-known package in Python which does what URI::Find does for plain text. The Sphinx documentation project, however, includes a builder called linkcheck which finds all the links and checks them for validity. You can check its source too, but the linked regex is somewhat simpler.
If you simply care about http and https, the answer is here.
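As a rough illustration, here is a minimal, deliberately naive sketch of a regex-based extractor for http/https URLs (real URL grammars are far messier, so expect edge cases):

import re

# Naive pattern: a scheme followed by characters that commonly appear
# in URLs. This will still trip over exotic but legal URLs.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def find_urls(text):
    # Trim trailing punctuation that usually belongs to the sentence,
    # not the URL.
    return [u.rstrip('.,;:!?') for u in URL_RE.findall(text)]

print(find_urls("See http://example.com/a?b=1 and https://foo.org, thanks."))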
I keep getting mismatched tag errors all over the place. I'm not sure why exactly; it's the text of the craigslist homepage, which looks fine to me, but I haven't skimmed it thoroughly enough. Is there perhaps something more forgiving I could use, or is this my best bet for HTML parsing with the standard library?
The mismatched tag errors are likely caused by mismatched tags. Browsers are famous for accepting sloppy HTML, which has made it easy for web page coders to write badly formed HTML, so there's a lot of it out there. There's no reason to believe that craigslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it).
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.
Neither does this parser.
However, it isn't well supported for Python 3; there's more information about that at the end of the linked page.
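For example, here is a minimal sketch (using the modern bs4 package name; BeautifulSoup 3 was imported as "from BeautifulSoup import BeautifulSoup" instead):

from bs4 import BeautifulSoup

# The <b> tag is never closed; BeautifulSoup copes with it anyway.
soup = BeautifulSoup("<p>one</p><p>two <b>bold", "html.parser")
for p in soup.find_all("p"):
    print(p.get_text())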
Parsing HTML is not an easy problem, and using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.
lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It's also fast, since it uses C libraries underneath. I highly recommend it.
BeautifulSoup 3.1 supports Python 3, but it has also been deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
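A minimal sketch of lxml coping with broken markup (the sample HTML is made up for illustration):

import lxml.html

# lxml's HTML parser repairs the unclosed <b> tag as it parses.
doc = lxml.html.fromstring("<p>one</p><p>two <b>bold")
print([p.text_content() for p in doc.xpath("//p")])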
I am currently trying to make a program that given a word will look up its definition and return it. Although I have gotten this to work, I had to resort to using RegEx to search for the text between the tags where the definitions are stored. What is a more efficient way to do this using python 3.x?
lxml works for Python 3. It has an ElementTree-compatible API, but uses C libraries behind the scenes, so it's fast, and it supports XPath, which is sometimes a nice way of extracting data.
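As a sketch of what that might look like for the definition-scraping task (the markup and the class name "definition" are hypothetical; substitute whatever the real page uses):

import lxml.html

# Hypothetical markup; "definition" stands in for the real class name.
page = "<html><body><div class='definition'>a small domesticated feline</div></body></html>"
doc = lxml.html.fromstring(page)
for d in doc.xpath("//div[@class='definition']"):
    print(d.text_content())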
Try BeautifulSoup, a good HTML parser for Python. (It works with Python 3.x too, although unless you are deep into a Python 3 project, consider using 2.7.)
Yours is a pretty simple requirement when it comes to HTML parsing. The Python standard library includes the ElementTree module, which should be helpful for the task you are planning to undertake. Look at the example snippet given on that page.
Also, never make the mistake of parsing HTML/XML using regex. It can get insanely complicated before you know it, and it is a bad idea in just about any situation.
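A minimal ElementTree sketch, assuming the input is well formed (ElementTree will reject broken markup) and with hypothetical tag names:

import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; ElementTree requires valid XML.
root = ET.fromstring("<entry><word>cat</word><definition>a small feline</definition></entry>")
print(root.findtext("definition"))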
I am looking to write a program that searches for the tags in an XML document and changes the string between the tags from localhost to manager. The tag might appear in the XML document multiple times, and the document does have a definite path. Would Python or VBScript make the most sense for this problem? And can anyone provide a template so I can get started? That would be great. Thanks.
If it's a simple thing, like changing a few strings here and there, you might be able to do everything with a Python regexp; check here:
http://docs.python.org/library/re.html
For everything more complex, I would suggest using something like Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/documentation.html
It's a bit outdated, but contains everything you would ever need...
I agree this belongs to stackoverflow.com, as it's a programming question.
I suggest that you go straight to the lxml library for Python and don't look back. Regex manipulation of XML can have terrible consequences, and BeautifulSoup, although quite popular, has been officially abandoned.
lxml is quite powerful, fast, and efficient. For your task, it is sufficient to write:
from lxml import etree

doc = etree.fromstring(content)
# 'tags_to_modify' is a placeholder for the tag you want to change.
elements = doc.findall('tags_to_modify')
for el in elements:
    el.text = your_replacement_function(el.text)
print(etree.tostring(doc))
You can find a lot of help in lxml's documentation:
http://lxml.de/
For example, I want to know how to use Python pickle serialization and deserialization. Since I've never used it, reading the official Python docs would be a great reference, but I prefer snippets/example code, with or without descriptions. Something like sites for Python beginners, someone's blog, or Google Code.
How would you search? Do you go to specific sites, or use particular keywords? Actually, this is a general question, not only for Python but for learning any language. Thanks.
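For reference, basic pickle usage is short enough to show inline (a minimal sketch):

import pickle

data = {"name": "example", "numbers": [1, 2, 3]}

# dumps/loads round-trip the object through a bytes blob;
# dump/load do the same with a file opened in binary mode.
blob = pickle.dumps(data)
restored = pickle.loads(blob)
assert restored == data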
Google Code Search.
From the FAQ:
We're crawling as much publicly accessible source code as we can find, including archives (.tar.gz, .tar.bz2, .tar, and .zip), CVS repositories and Subversion repositories.
Sample search: http://www.google.com/codesearch?q=lang%3Apython+%22cpickle%22
The operators are handy:
The lang: operator, which restricts by programming language (e.g., lang:"c++", -lang:java, or lang:^(c|c#|c++)$)
The license: operator, which restricts by software license (e.g., license:apache, -license:gpl, or license:bsd|mit)
The package: operator, which restricts by package URL (e.g., package:"www.kernel.org" or package:.tgz$)
The file: operator, which restricts by filename (e.g., file:include/linux/$ or -file:.cc$)
You can also look at ActiveState's Python recipes:
http://code.activestate.com/recipes/langs/python/
Here are their recipes for Python pickling:
http://code.activestate.com/search/#q=pickle python
O'Reilly's Python Cookbook is also good. You can read it online with a Safari membership.
There is also Nullege, a search engine especially for Python code.
The GitHub search is pretty good. It's usually used to search for repositories, but its code search works well too.
In general, Google Code Search is a pretty good place to look for code snippets. To look for Python pickle examples, I'd do a search like
lang:python pickle examples
Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
Use lxml, which is the best XML/HTML library for Python.
import lxml.html

# Parse your HTML string (the "..." placeholder) and pull out the text.
t = lxml.html.fromstring("...")
print(t.text_content())
And if you just want to sanitize the HTML, look at the lxml.html.clean module.
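For instance, a minimal sketch (note that in recent lxml releases this module has moved to the separate lxml_html_clean package):

from lxml.html.clean import clean_html

# clean_html drops scripts, event handlers and similar unsafe bits.
print(clean_html('<p onclick="evil()">hello <script>alert(1)</script>world</p>'))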
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the strings, and join them.
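A minimal sketch of that approach (using the modern bs4 package name; BeautifulSoup 3 spelled things differently, e.g. ''.join(soup.findAll(text=True))):

from bs4 import BeautifulSoup

html = "<div><p>Hello &amp; welcome,</p><p>world!</p></div>"
soup = BeautifulSoup(html, "html.parser")
# get_text() joins the text nodes and resolves entities like &amp;.
print(soup.get_text(" ", strip=True))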
While I agree with Lucas that regular expressions are not all that scary, I still think that you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box.
You should also check out the Python bindings for TidyLib, which can clean up broken HTML, making the success rate of any HTML parsing much higher.
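A minimal sketch using the pytidylib bindings (assuming the package and the underlying tidy library are installed):

from tidylib import tidy_document

# tidy_document returns the repaired markup plus tidy's warnings.
fixed, errors = tidy_document("<p>unclosed paragraph <b>bold")
print(fixed)
print(errors)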
How about parsing the HTML and extracting the data with the help of the parser?
I'd try something like the author described in chapter 8.3 in the Dive Into Python book
If you use Django, you might also use
http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags
;)
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with regex will return the string "5 " and treat
< 7</div>
as a single tag and strip it out.
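You can see the failure directly with a naive stripper (a minimal sketch):

import re

html = "<div>5 < 7</div>"
# The pattern swallows "< 7</div>" as if it were one long tag.
print(re.sub(r"<[^>]*>", "", html))  # prints "5 ", not "5 < 7"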
I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html. It can also resolve HTML entities.
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.