Python Text Extraction from parsed web pages - python

I'm working to develop a small system for extracting content from web pages (I know it has been done, but it is a good exercise and something I need). Basically, I'm looking to extract content-content, i.e. if it is an article, I just want the article text and nothing else.
I've just started, so consider me a dumb blank slate. I'm interested in how you do it, and with what, specifically in python but I'd be interested in any
EDIT:
I've found this rather enlightening and more in tune with what I'm trying to do, so solutions, discussion, and library suggestions along 'this type of thing' appreciated.

I have done a little bit of this and I recommend the combination of Mechanize and BeautifulSoup.
I would recommend parsing the HTML tree with BeautifulSoup and looking for a distinctive tag that identifies the content, perhaps:
<div id="article">
Then you can just take that node from the "soup".
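To make the "take that node" step concrete, here is a minimal sketch. It uses only the standard library's html.parser, swapped in for BeautifulSoup so it runs without extra installs. The class name ArticleTextExtractor, the helper extract_article, and the sample markup are all made up for illustration, and the sketch assumes the content really does sit in a div with id="article".

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect the text found inside <div id="article">...</div>."""

    def __init__(self, target_id="article"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # 0 = outside the target div, >0 = div nesting level inside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1    # a nested div inside the article
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1     # entered the article container

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text seen while inside the target div.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article(html, target_id="article"):
    parser = ArticleTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page used only to exercise the sketch.
sample = """
<html><body>
  <div id="nav">Home | About</div>
  <div id="article"><h1>Title</h1><p>First paragraph.</p><div>Nested note.</div></div>
  <div id="footer">Copyright</div>
</body></html>
"""
article_text = extract_article(sample)
```

With BeautifulSoup installed, the same lookup is usually a one-liner along the lines of soup.find("div", id="article").get_text(), and the library copes with messy markup far better than this hand-rolled parser.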

Related

How can I Scrape Business Email Contact with python?

This morning I set out to write a small script in Python. I started at 6am, it's now 10pm, and I'm about to go crazy because I have nothing that works.
So basically, I want to do this: Given an Instagram Username, scrape the Name, Number of followers and the business contact email.
I found out that going to the page source will give me this info (let's consider only the email for now): https://imgur.com/a/jYQ2FtR
Any idea how I can do that? I have tried many different things and nothing works. I tried downloading the page and searching the text for "business_email", but I have no idea how to implement that and extract the data I'm looking for. I know it's a simple task, but I'm a total noob and I haven't been coding for years.
Can someone tell me how to do it? Or at least point me in the right direction.
There are different ways to approach this problem. If the data you want is visible on the page, then you could scrape that info using Beautiful Soup. If not, it's a little trickier, but you could extract the info from the page source using regular expressions with the re module.
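As a rough sketch of the regular-expression route: the code below assumes the page source embeds JSON in which "business_email" is followed by a quoted string, as the question's screenshot suggests. The pattern, the function name find_business_email, and the sample source are illustrative assumptions, not the site's documented format, which may differ or change.

```python
import json
import re

# Assumed shape: ..."business_email":"someone@example.com"...
# The group captures the quoted JSON string, including any escaped characters.
BUSINESS_EMAIL_RE = re.compile(r'"business_email"\s*:\s*("(?:[^"\\]|\\.)*")')


def find_business_email(page_source):
    """Return the first business_email string in the source, or None."""
    match = BUSINESS_EMAIL_RE.search(page_source)
    if match is None:
        return None
    # json.loads decodes the quoted string, handling escapes such as \u0040.
    return json.loads(match.group(1))


# Hypothetical page source used only to exercise the sketch.
sample_source = (
    '<script>window.data = {"username":"example_user",'
    '"business_email":"contact@example.com"};</script>'
)
email = find_business_email(sample_source)
```

A value of null (no email set) does not match the pattern, so the function returns None in that case rather than raising.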

Parsing multiple News articles

I have built a program for summarization that utilizes a parser to parse from multiple websites at a time. I extract only <p> in each article.
This pulls in a lot of random content that is unrelated to the article. I've seen several people who can parse any article perfectly. How can I do it? I am using Beautiful Soup.
It might be worth trying an existing package like python-goose, which does what it sounds like you're asking for: extracting article content from web pages.
Your solution is really going to be specific to each website you want to scrape. Without knowing the websites of interest, the only thing I can suggest is to inspect the page source of each page and check whether the article is contained in some HTML element with a specific attribute (a unique class, id, or even a summary attribute), and then use Beautiful Soup to get the inner text from that element.

Python- is there a module that will automatically scrape the content of an article off a webpage?

I know there is lxml and BeautifulSoup, but that won't work for my project, because I don't know in advance what the HTML format of the site I am trying to scrape an article off of will be. Is there a python-type module similar to Readability that does a pretty good job at finding the content of an article and returning it?
It's possible to do using PhantomJS (C++) or PyPhantomJS (Python).
They're both headless WebKit-based browsers, which you can fully control from JavaScript. Because you can control them from JavaScript, I find it really easy to do things such as scraping the content of an article.
PyPhantomJS also has a plugin system, so that's definitely a plus. :)
Extracting the real content from a content page cannot be done automatically, at least not with the standard tools. You have to define or identify where the real content is stored (by specifying the related CSS ID or class in your own HTML extraction code).
Using HTQL, the query is:
&html_main_text

Grabbing non-HTML data from a website using python

I'm trying to get the current contract prices on this page to a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a python 2.6 solution.
It was easy to get the page html using urllib, but it seems like this number is live and not in the html. I inspected the element in Chrome and it's some td class thing.
But I don't know how to get at this with python. I tried beautifulsoup (but after several attempts gave up getting a tar.gz to work on my windows x64 system), and then elementtree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.
It looks like the numbers in the table are filled in by JavaScript, so just fetching the HTML with urllib or another library won't be enough, since they don't run the JavaScript. You'll need to use a library like PyQt to simulate the browser rendering the page and executing the JS to fill in the numbers, then scrape the output HTML of that.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are being filled in with an AJAX call (at least for me) to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This returns a JSON response, which is then parsed by JavaScript to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
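A minimal sketch of that approach follows, written for Python 3 with urllib.request and the built-in json module (on Python 2.6 the equivalent modules are urllib2 and json or simplejson). The field names "quotes", "code" and "last" are placeholders: the real response layout is not shown here and has to be read off the actual JSON. The helper names fetch_json and last_prices are also made up.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def last_prices(payload):
    """Map contract code -> last price for a payload shaped like SAMPLE."""
    return {quote["code"]: float(quote["last"]) for quote in payload["quotes"]}


# Hypothetical payload standing in for the real AJAX response.
SAMPLE = json.loads(
    '{"quotes": [{"code": "ESH1", "last": "1240.25"},'
    ' {"code": "ESM1", "last": "1235.50"}]}'
)
prices = last_prices(SAMPLE)
```

In real use you would call fetch_json with the AJAX URL seen in the browser's network panel, print the result once to learn its structure, and then adapt last_prices to the actual keys.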
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the web-site.
It's hard to know what to tell you without knowing where the number is coming from. It could be PHP or ASP also, so you are going to have to figure out which language the number is generated in.

Using Gecko/Firefox or WebKit for HTML parsing in Python

I am using BeautifulSoup and urllib2 for downloading HTML pages and parsing them. The problem is with malformed HTML pages. Though BeautifulSoup is good at handling malformed HTML, it's still not as good as Firefox.
Considering that Firefox and WebKit are more up to date and more resilient at handling HTML, I think it's ideal to use them to construct and normalize the DOM tree of a page and then manipulate it through Python.
However, I can't find any Python binding for this. Can anyone suggest a way?
I ran into some solutions that run a headless Firefox process and manipulate it through Python, but is there a more Pythonic solution available?
Perhaps pywebkitgtk would do what you need.
see http://wiki.python.org/moin/WebBrowserProgramming
there are quite a lot of options - i'm maintaining the page above so that i don't keep repeating myself.
you should look at pyjamas-desktop: see the examples/uitest example because we use exactly this trick to get copies of the HTML page "out", so that the python-to-javascript compiler can be tested by comparing the page results after each unit test.
each of the runtimes supported and used by pyjamas-desktop is capable of allowing access to the "innerHTML" property of the document's body element (and a hell of a lot more).
bottom line: it is trivial to do what you want to do, but you have to know where to look to find out how to do it.
l.
You might like PyWebkitDFB from http://www.gnu.org/software/pythonwebkit/
