How to perform a Google search and get the text result? - Python

I'm wondering how to use Python 3 to query Google and build a dictionary of words (say I enter a word, and I want Python to take the definition that Google gives, then store or display it).
I haven't done much coding, but I know how to manage the words afterwards. I'm just a bit confused about using urllib and the like. I have only been able to find help for this on other versions of Python, which I have not been able to replicate on Python 3.3.
EDIT: Yes, I want to use Google because I like the way it defines words and phrases, and I plan to use the define protocol you mentioned, icedtrees.

Edit: it appears that Google Search grabs its definitions using AJAX calls or something. The below solution will not work.
If you are having trouble using urllib, I suggest the nice Python Requests package, which is a lot easier to use.
If you are absolutely committed to getting the Google definition and no other definition, I would suggest making an HTTP request to a page using the Google Search "define" protocol.
For example:
https://www.google.com.au/search?q=define:test
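A minimal sketch of that request using the Requests package mentioned above (the browser-style User-Agent header is an assumption, and per the edit above the definitions may be loaded via AJAX, so this may not return them at all):
import requests

# hedged sketch: fetch the Google "define:" results page for a word
word = "test"
resp = requests.get(
    "https://www.google.com.au/search",
    params={"q": "define:" + word},
    headers={"User-Agent": "Mozilla/5.0"},  # assumption: a browser-like UA avoids the plain bot page
)
resp.raise_for_status()
html = resp.text  # raw HTML to parse for definitions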
You would then save the HTML result, and then parse it for the definitions that you require. Some examples of Python HTML parsers are the HTMLParser module, and also BeautifulSoup. However, this parsing operation seems pretty simple, so a basic regex should be more than enough. All definitions are stored as follows:
<div style="display:inline" data-dobid="dfn">  <!-- the order of the style and the data-dobid can change -->
    <span>definition goes here</span>
</div>
An example of a regex to grab the definitions of "test" from the HTML page:
>>> import re
>>> definitions = re.findall(r'data-dobid="dfn".*?>.*?<span>(.*?)</span>.*?</div>', html, re.DOTALL)
>>> len(definitions)
18
>>> definitions[0]
'a\n procedure intended to establish the quality, performance, or \nreliability of something, especially before it is taken into widespread \nuse.'
# Looks like you might need to remove the newlines
>>> definitions[5]
'the result of a medical examination or analytical procedure.'
As a sidenote, there also exists a Google Dictionary API, which can give you definition results in JSON format in response to a request.

Related

Convert Wikipedia/MediaWiki's code into HTML using python

I am trying to grab content from Wikipedia and use the HTML of the article. Ideally I would also like to be able to alter the content (e.g. hide certain infoboxes, etc.).
I am able to grab page content using mwclient:
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Samuel_Pepys']
>>> print page.text()
{{Redirect|Pepys}}
{{EngvarB|date=January 2014}}
{{Infobox person
...
But I can't see a relatively simple, lightweight way to translate this wikicode into HTML using python.
Pandoc is too much for my needs.
I could just scrape the original page using Beautiful Soup but that doesn't seem like a particularly elegant solution.
mwparserfromhell might help in the process, but I can't quite tell from the documentation if it gives me anything I need and don't already have.
I can't see an obvious solution on the Alternative Parsers page.
What have I missed?
UPDATE: I wrote up what I ended up doing, following the discussion below.
page="""<html>
your pretty html here
<div id="for_api_content">%s</div>
</html>"""
Now you can grab your raw content with your API and just call
generated_page = page % api_content
This way you can design any HTML you want and just insert the API content in a designed spot.
Those APIs that you are using are designed to return raw content so it's up to you to style how you want the raw content to be displayed.
UPDATE
Since you showed me the actual output you are dealing with, I realize your dilemma. However, luckily for you, there are modules that already parse the wikitext and convert it to HTML for you.
There is one called mwlib that will parse the wiki markup and output HTML, PDF, etc. You can install it with pip using the install instructions. This is probably one of your better options, since it was created in cooperation between the Wikimedia Foundation and PediaPress.
Once you have it installed you can use the writer method to do the dirty work.
def writer(env, output, status_callback, **kwargs): pass
Here are the docs for this module: http://mwlib.readthedocs.org/en/latest/index.html
And you can set attributes on the writer object to set the filetype (HTML, PDF, etc).
writer.description = 'PDF documents (using ReportLab)'
writer.content_type = 'application/pdf'
writer.file_extension = 'pdf'
writer.options = {
    'coverimage': {
        'param': 'FILENAME',
        'help': 'filename of an image for the cover page',
    }
}
I don't know exactly what the rendered HTML looks like, but I would imagine it's close to the actual wiki page. And since it's rendered in code, I'm sure you have control over modifications as well.
I would go with HTML parsing; the page content is reasonably semantic (class="infobox" and such), and there are classes explicitly meant to demarcate content that should not be displayed in alternative views (the first rule of the print stylesheet might be interesting).
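A rough sketch of that HTML-parsing approach (fetching the rendered article directly; infobox is the only class I rely on here, anything else you want to hide would need its own selector):
import requests
from bs4 import BeautifulSoup

# hedged sketch: fetch the rendered article and drop the boxes you don't want
html = requests.get("https://en.wikipedia.org/wiki/Samuel_Pepys").text
soup = BeautifulSoup(html, "html.parser")
for box in soup.find_all(class_="infobox"):
    box.decompose()  # remove the element and all of its children
cleaned_html = str(soup)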
That said, if you really want to manipulate wikitext, the best way is to fetch it, use mwparserfromhell to drop the templates you don't like, and use the parse API to get the modified HTML. Or use the Parsoid API which is a partial reimplementation of the parser returning XHTML/RDFa which is richer in semantic elements.
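A hedged sketch of that wikitext route, assuming you already have the page text from mwclient (dropping infoboxes is just an example of a template you might not like):
import mwparserfromhell
import requests

# hedged sketch: strip unwanted templates from the wikitext, then let MediaWiki render the rest
wikicode = mwparserfromhell.parse(page.text())  # `page` as in the mwclient session above
for template in wikicode.filter_templates():
    if str(template.name).strip().lower().startswith("infobox"):  # example: drop infoboxes
        wikicode.remove(template)

resp = requests.post(
    "https://en.wikipedia.org/w/api.php",
    data={
        "action": "parse",
        "text": str(wikicode),
        "contentmodel": "wikitext",
        "prop": "text",
        "format": "json",
    },
)
html = resp.json()["parse"]["text"]["*"]  # rendered HTML fragment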
At any rate, trying to set up a local wikitext->HTML converter is by far the hardest way you can approach this task.
The MediaWiki API contains a (perhaps confusingly named) parse action that in effect renders wikitext into HTML. I find that mwclient's faithful mirroring of the API structure sometimes actually gets in the way. There's a good example of just using requests to call the API to "parse" (aka render) a page given its title.
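That call looks roughly like this (a hedged sketch, not the linked example):
import requests

# hedged sketch: ask the MediaWiki API to render an existing page straight to HTML
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "parse", "page": "Samuel Pepys", "prop": "text", "format": "json"},
)
html = resp.json()["parse"]["text"]["*"]  # the rendered article body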

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of javascript on it, and I'm interested in parsing through the javascript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script', {'type': 'text/javascript'})
which yields a list of script tags that I can loop over to search for the text I want:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering if there is a better way to use a regex to find the pattern in the soup itself, searching only through the scripts. My implementation works, but it just seems really clunky, so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
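For example, a hedged sketch along those lines (the variable name pageData is purely hypothetical; look at the actual script source to see how the page embeds its data):
import json
import re

# hedged sketch: pull an embedded JSON object out of the page's scripts
# `soup` is the BeautifulSoup object you already have
for script in soup.find_all('script', {'type': 'text/javascript'}):
    if not script.string:
        continue
    # assumes the page contains something like: var pageData = {...};
    match = re.search(r'var\s+pageData\s*=\s*(\{.*?\})\s*;', script.string, re.DOTALL)
    if match:
        data = json.loads(match.group(1))  # a normal Python dict from here on
        print(data)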
Another option is to check with your browser what sort of AJAX requests the application makes. Quite often these also return structured data in JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve their SEO and to integrate better with social network sharing.

Extract text from Webpages with Python 3.x

I am working with Python 3.x
I want to extract text from several webpages. What is a good library to allow me do just that?
Thanks,
Barry.
http://www.crummy.com/software/BeautifulSoup/
and the documentation to get you started
http://www.crummy.com/software/BeautifulSoup/documentation.html
mechanize is a good library, but unfortunately it is not ready for Python 3; you can take a look at lxml.html instead.
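For instance, a minimal lxml.html sketch (the URL is just a placeholder):
from urllib.request import urlopen

import lxml.html

# hedged sketch: pull the visible text out of one page with lxml.html
html = urlopen("https://example.com").read()
doc = lxml.html.fromstring(html)
print(doc.text_content())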
I would suggest using Beautiful Soup, and then it's just a matter of going through the returned structure for the text you need.
You could also just use urllib.request for this, but Beautiful Soup takes care of a lot of syntax issues for you.
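A rough Python 3 sketch of that approach (again, the URL is a placeholder):
from urllib.request import urlopen

from bs4 import BeautifulSoup

# hedged sketch: fetch a page and extract its text with Beautiful Soup
html = urlopen("https://example.com").read()
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(separator="\n", strip=True))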
You don't say what you want to do with the extracted text, and that makes a big difference in how much effort you are willing to go to in order to get it out.
If you are trying to get the body text of a web page minus all of the site-related cruft (a nontrivial task), take a look at boilerpipe. It is written in Java, but it does an amazingly good job at getting essential text out of random web pages.
One of my hobbies over the next few weeks is recreating the core logic of boilerpipe in Python. We need the functionality it provides for a project, but don't want to haul the 10-ton rock that is the JVM around with it. I'm pretty certain we will be releasing it once it is fairly stable.

Confused by which XML processing option to use

I'm fairly new to Python, and I've just started working with XML parsing. I am getting a bit overwhelmed by all the options for working with XML, and I'm hoping an experienced person can give me some advice (and perhaps a code sample??) for the simple problem I'm working on.
I am working on a simple Python contact management application that does not involve a database - each contact's information is stored in a separate text file using XML. For example, assume the following is the contents of the file "1234.xml"
<contact>
    <id>1234</id>
    <name>Johnny Appleseed</name>
    <phone>8145551212</phone>
    <address>
        <street>1234 Main Street</street>
        <city>Hometown</city>
        <state>OH</state>
    </address>
    <address>
        <street>1313 Mockingbird Lane</street>
        <city>White Plains</city>
        <state>NY</state>
    </address>
</contact>
For sake of example, let's assume there can be only one phone number, but multiple address blocks.
For what I'm doing here, I need to be able to parse the XML from the file, make changes to the data, and then update the XML and save it back to the file. Let's assume there are three types of data changes that might occur:
changing the data for one or more items, such as updating a phone number
adding a new address block (and the corresponding data for the street/city/state of the new address)
deleting an existing address block
Given what I'm trying to do here, can you recommend a particular way of doing this? (SAX, DOM, minidom, ElementTree, something else?) Code samples for whatever you suggest would be greatly appreciated.
Thank you!
Ron
The SAX and DOM APIs are older; they were pretty much translated from the Java world into Python. The ElementTree API was designed specifically to be Pythonic, i.e. to fit with the Python way of problem solving, so prefer that.
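A hedged sketch with the standard-library ElementTree, covering the three kinds of change you listed (file and element names follow your example; the new address values are made up):
import xml.etree.ElementTree as ET

# hedged sketch: load the contact file, apply the three kinds of change, write it back
tree = ET.parse("1234.xml")
contact = tree.getroot()

# 1. update an existing value, e.g. the phone number
contact.find("phone").text = "8145559999"

# 2. add a new address block
address = ET.SubElement(contact, "address")
ET.SubElement(address, "street").text = "742 Evergreen Terrace"
ET.SubElement(address, "city").text = "Springfield"
ET.SubElement(address, "state").text = "OR"

# 3. delete an existing address block (here: every address in NY)
for addr in contact.findall("address"):
    if addr.findtext("state") == "NY":
        contact.remove(addr)

tree.write("1234.xml")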
The richest and fastest implementation of ElementTree that I know of is lxml. Its XPath functionality is very useful. Untested example:
from lxml import etree

contacts = etree.parse("1234.xml")
for c in contacts.xpath('//contact'):
    # note the relative paths: 'name' and 'phone' are children of each contact
    if c.xpath('name')[0].text == 'Johnny Appleseed':
        c.xpath('phone')[0].text = NEW_PHONE_NUMBER
contacts.write("1234.xml")
The best solution is to use ElementTree and parse this into a set of classes and manipulate the classes and then serialize them back to XML. You can do this by hand if the XML is really as simple as your example or you can use some tool or library to generate the bindings.
Working directly with XML in most cases ends in tears, or at least hair pulling. It also isn't very maintainable; when the XML changes it usually breaks your hand-coded parsing.
Using a binding solution will be more robust to changes and easier to modify when you do need to manually intervene.

Complex HTML parsing with Python

I am already aware of tag based HTML parsing in Python using BeautifulSoup, htmllib etc.
However, I want a powerful engine which can do complex tasks like read html tables, lists etc. and present these as simple to use objects within code. Does python have such powerful libraries?
BeautifulSoup is a nice library and provides a good way to parse HTML, with some handy helpers for extracting the data very easily.
What you are trying to do can easily be done using some simple regular expressions. You can write regular expressions to search for a particular pattern of data and extract the data you need.
You might consider lxml which has a powerful HTML processor. There is another complementary module that relies on lxml called pyquery that might be just what you're looking for.
PyQuery has jQuery-like syntax, so if you're used to jQuery you'll be able to jump right in.
Here is a simple example to get the first <ul> item from aol.com:
>>> from pyquery import PyQuery as pq
>>> import urllib
>>> data = urllib.urlopen('http://aol.com').read()
>>> d = pq(data)
>>> first_ul = d('ul:first')
>>> first_ul
[<ul#dhL2>]
>>> print first_ul
<ul id="dhL2"><li class="dhL1"><a accesskey="" href="https://new.aol.com/productsweb/?promocode=827693&ncid=txtlnkuswebr00000074" name="om_dirbtn1" class="_o4-0" id="om_dirbtn1">Get Free Mail</a></li>
</ul>
The standard HTML parsers are already pretty good at giving you simple objects (e.g. iterables). Creating anything more complex than a 2D list from a table would likely be dependent on the data that was in the page.
With that said...
Here's a link to a blog post by someone who wrote a script to convert HTML tables to Python lists. The actual file is located here.
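As a rough idea of what such a script does (a hedged sketch, not the code behind that link):
from bs4 import BeautifulSoup

# hedged sketch: turn every <table> in a page into a list of row lists
def tables_to_lists(html):
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            rows.append([cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])])
        tables.append(rows)
    return tables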
I've never heard of a standard python library that does these sorts of operations, so your best bet might be Googling each case as you need it. Chances are someone has done what you are trying to do.
Disclaimer: You should always read and understand any code you find online before pasting it into your own applications! Citing who/where it's from is good too!
