BeautifulSoup vs. raw parsing in Python

My question is a generic one rather than about a single piece of code. I'm scraping some websites, and have been doing so with Selenium + BeautifulSoup 4 (the sites are JavaScript based).
Up until now I didn't think of any other way, but browsing through threads I started coming across element locators such as
element(by.id("id"));
element(by.css("#id"));
element(by.xpath("//*[@id='id']"));
So my question is: is it really necessary to get the plain text and use find_all to locate the info you need, when you can do the same with XPath or CSS selectors? What is the difference in terms of coding?
And another one: in terms of speed, which way is faster? Or more robust, for that matter?
Many thanks in advance
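
For concreteness, the two styles being compared look roughly like this; a minimal sketch where the URL and the "price" id are placeholders, not a real site:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://example.com")  # placeholder URL

# Selenium locators: each call queries the live, JavaScript-rendered DOM.
by_id = driver.find_element(By.ID, "price")
by_css = driver.find_element(By.CSS_SELECTOR, "#price")
by_xpath = driver.find_element(By.XPATH, "//*[@id='price']")
print(by_id.text, by_css.text, by_xpath.text)

# BeautifulSoup: one handoff of the rendered source, then searched in memory.
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find(id="price").get_text())

driver.quit()

Broadly: every Selenium locator call is a round-trip to the browser, while BeautifulSoup searches an in-memory copy of the source. For many lookups on a page that has finished rendering, the BeautifulSoup (or lxml) pass is usually faster; the live locators are the more robust choice when the page keeps changing under you.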

Related

BeautifulSoup find returning "None" [duplicate]

Closed as a duplicate of: Web-scraping JavaScript page with Python.
What is the best method to scrape a dynamic website where most of the content is generated by what appear to be AJAX requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an API.
The best solution that I found was to use Firebug to monitor XMLHttpRequests, and then to use a script to resend them.
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavyweight solution, but I've seen people doing this with GreaseMonkey scripts: allow Firefox to render everything and run the JavaScript, and then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (JavaScript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc., as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
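
Exported to the Python WebDriver bindings, a step like ("clickAndWait", "submitButton") comes out roughly as follows; a sketch in which the URL and the waited-for element are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://example.com/form")  # placeholder URL

# "clickAndWait", "submitButton": trigger the click, then block until
# the next page has something we can work with.
driver.find_element(By.ID, "submitButton").click()
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)

html = driver.page_source  # rendered result, ready to scrape
driver.quit()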
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing and, instead of trying to scrape the page, issue the HTTP requests that the JavaScript issues and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
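
As a sketch of that strategy (the endpoint, parameters, and response shape below are hypothetical; the real ones come from watching the network traffic in Firebug or the browser's developer tools):

import requests

# Replay the XHR the page's JavaScript would have made.
resp = requests.get(
    "https://example.com/api/results",            # placeholder endpoint
    params={"state": "IA", "race": "president"},  # placeholder parameters
    headers={"X-Requested-With": "XMLHttpRequest"},
)
resp.raise_for_status()

# Many such endpoints return JSON, which parses in one call.
for row in resp.json()["results"]:  # assumed response shape
    print(row)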
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you hook into the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools, since you don't have to emulate the browser; you just ask the browser for the HTML elements. And it's going to be way easier than reverse engineering the JavaScript/AJAX calls. If needed, you can also use tools like Beautiful Soup in conjunction with Pamie.
Probably the easiest way is to use the IE WebBrowser control in C# (or any other language). You have access to all the stuff inside the browser out of the box, plus you don't need to care about cookies, SSL and so on.
I found the IE WebBrowser control has all kinds of quirks and workarounds that would justify some high-quality software to take care of all those inconsistencies, layered around the shdocvw.dll API and MSHTML, and provide a framework.
This seems like a pretty common problem. I wonder why no one has developed a programmatic browser. I'm envisioning a Firefox you can call from the command line with a URL as an argument: it will load the page, run all of the initial page-load JS events, and save the resulting file.
I mean, Firefox and other browsers already do this, so why can't we simply strip off the UI stuff?

Web scraping when multiple clicks are needed

Kindly have a short look here: https://www.cbp.gov/contact/find-broker-by-port/4901. I'm trying to scrape the list of all brokers, port-wise. My question concerns the approach to take when multiple clicks (forward/back) are needed to arrive at one or more data items. Could you point me to some reading material on this, or any other solution you deem fit? Many thanks.
You can use Selenium to automate multiple clicks (forward/back) as needed, and also to identify specific data items.
Below is a very good example:
https://selenium-python.readthedocs.io/getting-started.html
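
As a rough sketch of the forward/back pattern with Selenium's Python bindings (the CSS class and link text below are placeholders; inspect the actual page for the real ones):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.cbp.gov/contact/find-broker-by-port/4901")

brokers = []
while True:
    # Wait for the current page of results, then collect it.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".broker-row"))
    )
    brokers += [row.text for row in
                driver.find_elements(By.CSS_SELECTOR, ".broker-row")]

    # Follow the "Next" link until there isn't one.
    next_links = driver.find_elements(By.LINK_TEXT, "Next")
    if not next_links:
        break
    next_links[0].click()

driver.quit()
print(len(brokers), "rows collected")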
Update: Another approach, if the website is static, is to use requests with BeautifulSoup; here is an example: https://medium.com/@itylergarrett.tag/learning-web-scraping-with-python-requests-beautifulsoup-936e6445312
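
For the static case, the pattern is just a fetch and a parse; a minimal sketch with a placeholder URL and placeholder markup:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/brokers")  # placeholder URL
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# "td.broker-name" is a hypothetical selector; match the real markup.
for cell in soup.select("td.broker-name"):
    print(cell.get_text(strip=True))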

Scraping information from a flash object on a website using python or any other method

I was just wondering if it is possible to scrape information from this website that is contained in a Flash file (http://www.tomtom.com/lib/doc/licensing/coverage/).
I am trying to get all the text from the different components of this website.
Can anyone suggest a good starting point in Python, or any simpler method?
I believe the following blog post answers your question well. The author had the same need: to scrape Flash content using Python. And the same problem came up. He realized that he just needed to instantiate a browser (even an in-memory one that never displays to the screen) and then scrape its output. I think this could be a successful approach for what you need, and he makes it easy to understand.
http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/
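
The core of the trick in that post is to load the page in an off-screen WebKit view and pull the rendered DOM back out through the document title, since execute_script cannot return a value. Roughly like this, reconstructed from the pywebkitgtk API of that era and untested here:

import gtk
import webkit

view = webkit.WebView()
window = gtk.Window()
window.add(view)

def on_load_finished(view, frame):
    # JavaScript can't hand a value back to Python directly, so stash
    # the rendered DOM in the document title and read it back.
    view.execute_script(
        "document.title = document.documentElement.innerHTML;")
    print(frame.get_title())
    gtk.main_quit()

view.connect("load-finished", on_load_finished)
view.open("http://www.tomtom.com/lib/doc/licensing/coverage/")
gtk.main()

Note that text drawn inside the Flash object itself never enters the DOM, so this captures only what the page exposes as HTML around it.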

Choosing a Python web-scraping framework for handling pure JavaScript-based sites

I'm a Python programmer specializing in web scraping. I had to ask this question as I found nothing relevant.
I want to know what popular, well-documented frameworks are available for Python for scraping pure JavaScript-based sites. Currently I know Mechanize and Beautiful Soup, but they do not interact with JavaScript, so I'm looking for something different. I would prefer something as elegant and simple as Mechanize.
I've done a bit of research, and so far I've heard about Selenium, Selenium 2 and Windmill.
Right now I'm trying to choose among these three, and I do not know of any others.
So can anyone point out the features of these frameworks and what makes them different? I heard that Selenium uses a separate server to do all its work and seems to be feature-rich. Also, what is the core difference between Selenium and Selenium 2? Please enlighten me if I'm wrong, and if you know of any other frameworks, do mention their features and other details.
Thanks.
Before using tools like Selenium, which are designed for front-end testing and not for scraping, you should have a look at where the data on the site comes from. Find out what XHR requests are made, what parameters they take, and what the result is.
For example, the site you mentioned in your comment does a POST request with lots of parameters in JavaScript and displays the result. You probably only need to use the result of this POST request to get your data.
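
Replaying such a POST is usually only a few lines; the endpoint and form fields below are hypothetical stand-ins for whatever you copy out of the network panel:

import requests

resp = requests.post(
    "https://example.com/search",        # placeholder endpoint
    data={"query": "foo", "page": "1"},  # placeholder form fields
)
resp.raise_for_status()
print(resp.json())  # assumes the endpoint returns JSON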

Using Gecko/Firefox or WebKit for HTML parsing in Python

I am using BeautifulSoup and urllib2 for downloading HTML pages and parsing them. The problem is with malformed HTML pages. Though BeautifulSoup is good at handling malformed HTML, it's still not as good as Firefox.
Considering that Firefox and WebKit are more up to date and resilient at handling HTML, I think it's ideal to use them to construct and normalize the DOM tree of a page and then manipulate it through Python.
However, I can't find any Python bindings for this. Can anyone suggest a way?
I ran into some solutions that run a headless Firefox process and manipulate it through Python, but is there a more Pythonic solution available?
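
One lighter-weight option short of embedding a browser engine: BeautifulSoup can hand parsing off to html5lib, which implements the same HTML5 error-recovery algorithm the browsers use, so badly formed markup gets repaired much the way Firefox would repair it. A minimal sketch, assuming html5lib is installed:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://example.com").read()  # placeholder URL

# html5lib builds the tree with the browsers' error-recovery rules,
# so the resulting DOM is close to what Firefox/WebKit would construct.
soup = BeautifulSoup(html, "html5lib")
print(soup.find("title").get_text())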
Perhaps pywebkitgtk would do what you need.
See http://wiki.python.org/moin/WebBrowserProgramming for quite a lot of options; I'm maintaining that page so that I don't keep repeating myself.
You should look at pyjamas-desktop: see the examples/uitest example, because we use exactly this trick to get copies of the HTML page "out", so that the Python-to-JavaScript compiler can be tested by comparing the page results after each unit test.
Each of the runtimes supported and used by pyjamas-desktop is capable of allowing access to the "innerHTML" property of the document's body element (and a hell of a lot more).
Bottom line: it is trivial to do what you want to do, but you have to know where to look to find out how to do it.
l.
You might like PyWebkitDFB from http://www.gnu.org/software/pythonwebkit/
