I want to do automatic searches on a database (in this example www.scopus.com) with a simple Python script, and I need somewhere to start. For example, I would like to run a search, get a list of links, open those links, and extract information from the opened pages. Where do I start?
Technically speaking, scopus.com is not "a database", it's a web site that lets you search / consult a database. If you want to programmatically access their service, the obvious way is to use their API, which mostly requires sending HTTP requests and parsing the HTTP responses. You can do this with the standard lib's modules, but you'll certainly save a lot of time using python-requests instead. And you'll certainly want to get some understanding of the HTTP protocol first...
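As a rough illustration, here is a minimal sketch of that request/parse cycle using python-requests. It assumes Elsevier's Scopus Search API; treat the endpoint, the apiKey parameter, and the response layout as assumptions to verify against their current documentation:

    import requests

    # Assumed endpoint and placeholder key; you must register with Elsevier for a real one.
    API_KEY = "your-api-key"
    SEARCH_URL = "https://api.elsevier.com/content/search/scopus"

    # Send the search request; the API answers with JSON.
    resp = requests.get(
        SEARCH_URL,
        params={"query": "machine learning", "apiKey": API_KEY},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()

    # Walk the response and collect the link to each result.
    entries = resp.json().get("search-results", {}).get("entry", [])
    for entry in entries:
        for link in entry.get("link", []):
            if link.get("@ref") == "scopus":
                print(entry.get("dc:title"), "->", link.get("@href"))

From there you can fetch each link and parse the page (or, better, use the API's own retrieval endpoints) to extract the information you need.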
I'm trying to get some tables with specific filters on this QlikView page, for future analysis: http://transferenciasabertas.planejamento.gov.br/QvAJAXZfc/opendoc.htm?document=painelcidadao.qvw&lang=en-US&host=QVS%40srvbsaiasprd01&anonymous=true
I don't want to do it manually (downloading tables for every filter). Therefore, I searched for Python APIs on the QlikView website, but only found Qlik Sense APIs for SSE (like this https://github.com/qlik-oss/server-side-extension).
Is there any chance that I could automate the retrieving process that I explained using Python?
Server-side extensions are used for something else. They extend Qlik's functionality to process data (for example, running some statistical functions on top of the displayed data if such functions do not exist in Qlik natively).
Interestingly, the portal link (http://transferenciasabertas.planejamento.gov.br) is a QlikView app that later redirects to Qlik Sense app(s). It seems that anonymous users are allowed on the platform (which makes automating data retrieval easier).
Qlik Sense communicates with the browser via web sockets. So the answer to your question is: yes. You can use Python to connect to the underlying Qlik Sense Engine, make some selections, and get the data back.
The not-so-good news is that I don't think there is a dedicated Python library, so you'll have to send the raw web socket requests yourself. The documentation for the Engine API can be found at Qlik's help site.
If you are open to a JS solution then you can use Qlik's enigma.js library for Engine communication.
The web socket traffic can be monitored from the browser (to view what data is being sent/received and its format).
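For the raw web socket route, a bare-bones sketch could look like the following. It uses the third-party websocket-client package and the Engine API's JSON-RPC message format; the server URL and app id are placeholders you would copy from the traffic observed in the browser:

    import json
    import websocket  # pip install websocket-client

    # Placeholder URL: copy the real one from the browser's web socket traffic.
    ENGINE_URL = "wss://example-qlik-server/app/your-app-id"

    ws = websocket.create_connection(ENGINE_URL)
    ws.recv()  # the engine greets new connections with a notification message

    # Engine API calls are JSON-RPC; here we ask the engine to open an app.
    ws.send(json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "handle": -1,
        "method": "OpenDoc",
        "params": ["your-app-id"],
    }))
    print(json.loads(ws.recv()))  # the response contains the handle for further calls

    ws.close()

Once you have the document handle, the same request/response pattern is used to make selections and fetch the table data.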
I have been playing with the requests module in Python for a while as part of studying HTTP requests/responses, and I think I have grasped most of the fundamentals of the topic. By a naive analogy, it basically works on a ping-pong principle: you send a request packet to the server, and it sends another packet back to you. For instance, logging in to a site is simply sending a POST request to the server, and I managed to do that. What I have trouble with, however, is clicking on buttons through an HTTP POST request. I searched for it here and there, but I could not find a valid answer to my inquiry other than using the selenium module, which I would rather avoid if there is a way to do it with the requests module alone. I am also aware that selenium was created precisely for this sort of thing.
QUESTIONS:
1) What kind of parameters do I have to take into account to be able to click on buttons or links from the account I accessed through HTTP requests? For instance, when I watch the network activity for request and response headers with my browser's built-in inspect tool, I see so many parameters being exchanged, e.g. sec-fetch-dest, sec-fetch-mode, etc.
2) Is it too complicated for a beginner, or is there too much advanced stuff going on behind the scenes to do this, and is that why selenium was created?
Theoretically, you could write a program to do this with requests, but you would be duplicating much of the functionality that is already built and optimized in other tools and APIs. The general process would be:
Load the HTML that is normally rendered in your browser using a get request.
Process the HTML to find the button in question.
Then, if it's a simple form:
Determine the request method the button will carry out (e.g. using the formmethod argument, see here).
Perform the specified request with the required information in your request packet (see the sketch after this list).
If it's a complex page (i.e. it uses JavaScript):
Find the button's unique identifier.
Process the JavaScript code to determine what action is performed when the button is clicked.
If possible, perform the JavaScript action using requests (e.g. following a link or something like that). I say if possible because JavaScript can do many things that, to my knowledge, simple HTTP requests cannot, like changing rendered CSS in order to change the background color of a <div> when a button is clicked.
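For the simple-form branch, here is a hedged sketch of those steps with requests and BeautifulSoup (the page URL and the form field names are made up for illustration):

    import requests
    from bs4 import BeautifulSoup

    # Step 1: load the HTML that the browser would normally render.
    page = requests.get("https://example.com/login")  # hypothetical page
    soup = BeautifulSoup(page.text, "html.parser")

    # Step 2: find the form the button submits, and read off its method and target.
    form = soup.find("form")
    action = form.get("action", "/login")
    method = form.get("method", "get").lower()

    # Step 3: perform the request the button click would have triggered.
    payload = {"username": "alice", "password": "secret"}  # assumed field names
    url = requests.compat.urljoin(page.url, action)
    if method == "post":
        response = requests.post(url, data=payload)
    else:
        response = requests.get(url, params=payload)
    print(response.status_code)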
You are much better off using a tool like selenium or Beautiful Soup, as they provide APIs that do a lot of the above for you. If you've used the requests library to learn about the basic HTTP request types and how they work, awesome. Now move on to the plethora of excellent tools that wrap raw HTTP up in a more functional and robust API.
I have zero experience with website development but am working on a group project and am wondering whether it would be possible to create an interaction between a simple html/css website and my python function.
Required functionality:
I have to take a simple string input from a text box on the website, pass it into my Python function, and get a single list of strings as output. This list of strings is then passed back to the website. I would just like a basic tutorial website to achieve this. Please do not give me a link to the Python CGI documentation, as I have already read it and would like a more basic and descriptive view of how this is done. I would really appreciate your help.
First you will need to understand HTTP. It is a text-based protocol.
I assume by "web site" you mean the user agent, like Firefox.
Now, you're talking about an input box, which means you've already handled an HTTP request for your content. In most web applications this would have been several requests (one for the dynamically generated application HTML, and more for the static CSS and JS files).
CGI is the most basic way to programmatically inspect an already-parsed HTTP request and build an HTTP response from objects you've set.
Your application is simple enough that you could probably do all the HTTP parsing yourself to gain a basic understanding of what's going on, but you would still need to understand how to write a server that listens on a socket.
To avoid all that, just pick a Python application server that has already implemented all of the above and much more. There are many Python application servers to choose from; for something as simple as this, use one with a small learning curve. Some are labeled as "micro-frameworks" in this genre.
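To make that concrete, here is a minimal sketch using Flask, one such micro-framework. The process_input function is a placeholder standing in for your own Python function:

    from flask import Flask, request, render_template_string

    app = Flask(__name__)

    # Placeholder for your existing function: string in, list of strings out.
    def process_input(text):
        return text.split()

    PAGE = """
    <form method="post">
      <input type="text" name="query">
      <input type="submit" value="Go">
    </form>
    <ul>{% for item in results %}<li>{{ item }}</li>{% endfor %}</ul>
    """

    @app.route("/", methods=["GET", "POST"])
    def index():
        results = []
        if request.method == "POST":
            results = process_input(request.form["query"])
        return render_template_string(PAGE, results=results)

    if __name__ == "__main__":
        app.run(debug=True)

Run it, open http://127.0.0.1:5000/, type into the text box, and the list your function returns is rendered back into the page.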
Have you considered creating an app on Google App Engine (https://developers.google.com/appengine/)?
Their Handling Forms tutorial seems to describe your problem:
https://developers.google.com/appengine/docs/python/gettingstartedpython27/handlingforms
I have a requirement to build a client for Shopify's API, building it in Python & Django.
I've never done it before and so I'm wondering if someone might advise on a good starting point for the kinds of patterns and techniques needed to get a job like this done.
Here's a link to the Shopify API reference
Thanks.
Your question is somewhat open-ended, but if you're new to Python or API programming, then you should get a feel for how to do network programming in Python, using either the urllib2 or httplib modules that ship with Python 2 (their Python 3 counterparts are urllib.request and http.client). Learn how to initiate a request for a page and read the response into a file.
Here is an overview of the httplib module in Python documentation:
http://docs.python.org/library/httplib.html
After you've managed to make page requests using the GET HTTP verb, learn about how to make POST requests and how to add headers, like Content-Type, to your request. When communicating with most APIs, you need to be able to send these.
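A short sketch of both steps with the standard library (httplib is named http.client in Python 3, which is what the sketch uses; the host and paths are placeholders):

    import http.client  # named httplib in Python 2

    conn = http.client.HTTPSConnection("api.example.com")  # placeholder host

    # A GET request: fetch a resource and read the response body.
    conn.request("GET", "/admin/products.xml")
    resp = conn.getresponse()
    print(resp.status, resp.reason)
    body = resp.read()  # must be fully read before reusing the connection

    # A POST request: include a body plus the Content-Type header the API expects.
    payload = "<product><title>Example</title></product>"
    conn.request("POST", "/admin/products.xml", body=payload,
                 headers={"Content-Type": "application/xml"})
    resp = conn.getresponse()
    print(resp.status, resp.read())

    conn.close()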
The next step would be to get familiar with the XML standard and how XML documents are constructed. Then, play around with the different XML libraries in Python. There are several, but I've always used the xml.dom.minidom module. In order to talk to an API, you'll probably need to know how to create XML documents (to include in your requests) and how to parse content out of them (to make use of the API's responses). The minidom module lets a developer do both. For your reference:
http://docs.python.org/library/xml.dom.minidom.html
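For instance, both directions with minidom might look like this (the element names are invented for illustration):

    from xml.dom.minidom import Document, parseString

    # Building a document to include in a request body.
    doc = Document()
    product = doc.createElement("product")
    doc.appendChild(product)
    title = doc.createElement("title")
    title.appendChild(doc.createTextNode("Example product"))
    product.appendChild(title)
    print(doc.toxml())

    # Parsing a response body the API sent back.
    response_xml = "<products><product><title>Widget</title></product></products>"
    dom = parseString(response_xml)
    for node in dom.getElementsByTagName("title"):
        print(node.firstChild.data)  # prints: Widget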
Your final solution will likely put both of these together, where you create an XML document, submit it as content to the appropriate Shopify REST API URL, and then have your application deal with the XML response the API sends back to you.
If you're sending any sensitive data, be sure to use HTTPS over port 443, and NOT HTTP over port 80.
I have been working on a project for the last few months using Python and Django integrating with Shopify, built on Google App Engine.
Shopify has a valuable wiki resource, http://wiki.shopify.com/Using_the_shopify_python_api. This is what I used to get a good handle on the Shopify Python API that was mentioned, https://github.com/Shopify/shopify_python_api.
It will really depend on what you are building, but these are good resources to get you started. Also, understanding the Shopify API will help when using the Python API for Shopify.
Shopify has now released a Python API client: https://github.com/Shopify/shopify_python_api
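Basic usage, as described on the wiki linked above, looks roughly like the following; the shop name and credentials are placeholders, and the exact calls should be checked against the library's current README:

    import shopify  # pip install ShopifyAPI

    # Placeholder private-app credentials; see the library's README for auth options.
    shop_url = "https://API_KEY:PASSWORD@yourshop.myshopify.com/admin"
    shopify.ShopifyResource.set_site(shop_url)

    shop = shopify.Shop.current()      # fetch the shop's details
    products = shopify.Product.find()  # list products via the REST API
    for product in products:
        print(product.title)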
I think you can find some inspiration by taking a look at this:
http://bitbucket.org/jespern/django-piston/wiki/Home
Although it is the direct opposite of what you want to do (Piston is for building APIs, whereas you want to consume one), it can give you some clues on common topics.
I could mention, of course, reading obvious sources like the Shopify developers forum:
http://forums.shopify.com/categories/9
But I guess you already had it in mind :)
Cheers,
H.
I've had a look at many tutorials regarding cookiejar, but my problem is that the webpage I want to scrape creates the cookie using JavaScript, and I can't seem to retrieve it. Does anybody have a solution to this problem?
If all pages have the same JavaScript then maybe you could parse the HTML to find that piece of code, and from that get the value the cookie would be set to?
That would make your scraping quite vulnerable to changes in the third party website, but that's most often the case while scraping. (Please bear in mind that the third-party website owner may not like that you're getting the content this way.)
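A hedged sketch of that approach, assuming you've read the page source and found a document.cookie assignment whose shape you can match (the URL and the regex are made up for illustration):

    import re
    import requests

    session = requests.Session()
    html = session.get("https://example.com/page").text  # hypothetical page

    # Look for something like: document.cookie = "token=abc123; path=/";
    match = re.search(r'document\.cookie\s*=\s*"(\w+)=([^";]+)', html)
    if match:
        name, value = match.groups()
        session.cookies.set(name, value)  # replay what the JavaScript would have done
        # subsequent requests from this session now send the cookie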
I responded to your other question as well: take a look at mechanize. It's probably the most fully featured scraping module I know: if the cookie is sent, then I'm sure you can get to it with this module.
Maybe you can execute the JavaScript code in a JavaScript engine with Python bindings (like python-spidermonkey or pyv8) and then retrieve the cookie. Or, since the JavaScript code is executed client-side anyway, you may be able to port the cookie-generating code to Python.
You could access the page using a real browser, via PAMIE, win32com or similar; then the JavaScript will be running in its native environment.