I am fetching data from a URL using urllib2.urlopen:
from urllib2 import urlopen
...
conn = urlopen(url)
data = conn.read()
conn.close()
Suppose the data did not "come out" as I had expected.
What would be the best method for me to read it again?
I am currently repeating the whole process (open, read, close).
Is there a better way (some sort of connection-refresh perhaps)?
When you call urlopen on a URL, Python makes an HTTP GET request and returns the response; each of these request-response pairs is, by nature, a separate connection. You have to repeat the process for every URL you want to request, although you don't really have to close your urlopen response.
No, repeating the process is the only way to get new data.
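If you end up re-fetching often, it can be cleaner to wrap the open/read/close cycle in a small helper and simply call it again; a minimal sketch based on the question's code (fetch is just a hypothetical name):
from urllib2 import urlopen

def fetch(url):
    # every call opens a fresh connection and returns whatever the server sends now
    conn = urlopen(url)
    try:
        return conn.read()
    finally:
        conn.close()

data = fetch(url)
if not data:            # stand-in for whatever check tells you the data didn't come out right
    data = fetch(url)   # just fetch it again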
You should close the response after you have used it, then open the URL again to refresh the data. Try:
import json, urllib

while True:
    url = 'http://project/JsonVanner.php'
    response = urllib.urlopen(url)          # re-open the URL on every pass to get fresh data
    data = json.loads(response.read())
    for x in data:
        print x['Etat']
        if x['Etat'] == 'OFF':
            print 'vanne fermer'            # valve closed
            print (int(x['IDVanne']) * 10) + 0
        else:
            print 'vanne ouverte'           # valve open
            print (int(x['IDVanne']) * 10) + 1
    response.close()                        # close before the next iteration re-opens the URL
I want to use the Python 3 module urllib to access an Elasticsearch database at localhost:9200. My script gets a valid request (generated by Kibana) piped to STDIN in JSON format.
Here is what I did:
import json
import sys
import urllib.parse
import urllib.request
er = json.load(sys.stdin)
data = urllib.parse.urlencode(er)
data = data.encode('ascii')
uri = urllib.request.Request('http://localhost:9200/_search', data)
with urllib.request.urlopen(uri) as response:
    response.read()
(I understand that response.read() on its own doesn't accomplish much, but I wanted to keep it simple.)
When I execute the script, I get an
HTTP Error 400: Bad request
I am very sure that the JSON data I'm piping to the script is correct, since I had it printed and fed it via curl to Elasticsearch, and got back the documents I expected to get back.
Any ideas where I went wrong? Am I using urllib correctly? Am I perhaps mangling the JSON data in the urlencode line? Am I querying Elasticsearch correctly?
Thanks for your help.
With requests you can do one of two things:
1) Either you create the string representation of the json object yourself and send it off like so:
payload = {'param': 'value'}
response = requests.post(url, data=json.dumps(payload))
2) Or you have requests do it for you like so:
payload = {'param': 'value'}
response = requests.post(url, json=payload)
Which of the two you use depends on what actually comes out of sys.stdin. Since Kibana would be sending a string representation of a JSON object if the target is Elasticsearch (the equivalent of calling json.dumps on a dictionary), the first form is the likely fit, but you might have to adjust a bit depending on the exact output of sys.stdin.
My guess is that your code could work by doing just this:
import sys
import requests
payload = sys.stdin  # requests will stream the file-like stdin object as the request body
response = requests.post('http://localhost:9200/_search', data=payload)
And if you then want to do some work with the response in Python, requests has built-in support for this too. You just call this:
json_response = response.json()
Hope this helps you on the right track. For further reading on json.dumps/loads, this answer has some good material.
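To illustrate the dumps/loads distinction mentioned above:
import json

payload = {'param': 'value'}
as_text = json.dumps(payload)   # '{"param": "value"}' -- a str
back = json.loads(as_text)      # {'param': 'value'}   -- a dict again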
For anyone who doesn't want to use requests (for example if you're using IronPython, where it's not supported):
import urllib2
import json
req = urllib2.Request(url, json.dumps(data), headers={'Content-Type': 'application/json'})
response = urllib2.urlopen(req)
Where url can be something like this (the example below searches within an index):
http://<elasticsearch-ip>:9200/<index-name>/_search/
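To inspect what comes back, you can decode the body with the json module; a small sketch, assuming the standard layout of an Elasticsearch search response:
result = json.loads(response.read())
# in a standard search response, matching documents live under hits.hits
for hit in result['hits']['hits']:
    print hit['_source']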
I'm using an API that returns JSON over HTTP. Calling the API, however, requires a start page and an end page to be specified, like this:
import requests

def API_request(URL):
    while True:
        try:
            Response = requests.get(URL)
            Data = Response.json()
            return Data['data']
        except Exception as APIError:
            print(APIError)
            continue

def build_orglist(start_page, end_page):
    APILink = ("http://sc-api.com/?api_source=live&system=organizations&action="
               "all_organizations&source=rsi&start_page={0}&end_page={1}&items_"
               "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
               "ormat=json".format(start_page, end_page))
    return API_request(APILink)
The only way to know that you're no longer at an existing page is that the returned JSON is null.
If I wanted to run multiple build_orglist calls asynchronously, going over every page until I reach the end (null JSON), how could I do so?
I went with a mix of @LukasGraf's suggestion of using a session to unify all of my HTTP connections, and grequests for making the group of HTTP requests in parallel.
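A rough sketch of that combination (the 20-page range and the null check on 'data' are assumptions based on the question, not part of the original answer):
import grequests
import requests

session = requests.Session()

def build_url(start_page, end_page):
    return ("http://sc-api.com/?api_source=live&system=organizations&action="
            "all_organizations&source=rsi&start_page={0}&end_page={1}&items_"
            "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
            "ormat=json".format(start_page, end_page))

# one request per page, all sharing the same session, fired in parallel
reqs = [grequests.get(build_url(page, page), session=session) for page in range(1, 21)]
org_pages = []
for response in grequests.map(reqs):
    if response is None:          # grequests returns None for requests that failed
        continue
    data = response.json()['data']
    if data is None:              # a null 'data' field marks a page past the last real one
        break
    org_pages.append(data)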
I'm trying to crawl my college website. I set a cookie and add headers, then:
homepage=opener.open("website")
content = homepage.read()
print content
Sometimes I get the source code, but sometimes I get nothing at all.
I can't figure out what is happening.
Is my code wrong?
Or is it something about the website?
Can a single geturl() call be used to follow two or more redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It usually returns the final URL, but sometimes it gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
r = requests.get('website', allow_redirects=True)
print r.text
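If you also need the intermediate URLs rather than just the final one, the response keeps the whole redirect chain in its history attribute; for example:
import requests

r = requests.get('website', allow_redirects=True)
for hop in r.history:              # each intermediate redirect response, in order
    print hop.status_code, hop.url
print r.url                        # the final URL after all redirects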
I need to set the timeout on urllib2.Request().
I am not using urllib2.urlopen() directly, since I am using the data parameter of Request. How can I set the timeout?
Although urlopen itself does accept a data param for POST, you can also call urlopen on a Request object, like this:
import urllib2
request = urllib2.Request('http://www.example.com', data)
response = urllib2.urlopen(request, timeout=4)
content = response.read()
Alternatively, you can avoid urlopen and use an opener instead:
opener = urllib2.build_opener()
request = urllib2.Request('http://example.com')
response = opener.open(request, timeout=4)
response_result = response.read()
this works too :)
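If you want to react when the timeout actually fires, here is a minimal sketch of catching it (data is the POST body from the question, and the 4-second value just mirrors the snippets above):
import socket
import urllib2

request = urllib2.Request('http://www.example.com', data)
try:
    response = urllib2.urlopen(request, timeout=4)
    content = response.read()
except urllib2.URLError as e:
    # on a connect timeout, e.reason is typically a socket.timeout instance
    print 'request failed:', e.reason
except socket.timeout:
    # the read() itself can also time out after the connection is made
    print 'read timed out'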
Why not use the awesome requests? You'll save yourself a lot of time.
If you are worried about deployment, just copy it into your project.
An example with requests:
>>> requests.post('http://github.com', data={your data here}, timeout=10)
Using urlopen for URLs with query strings also seems obvious. Here is what I tried:
import urllib2
query='http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
f = urllib2.urlopen(query)
s = f.read()
f.close()
However, for this specific URL it fails with HTTP Error 403: Forbidden.
When I enter this URL in my browser, it works.
Also when using http://www.httpquery.com/ to submit the query, it works.
Do you have suggestions on how to use Python correctly to get the right response?
It looks like the site requires cookies (which you can handle with urllib2), but an easier way, if you're doing this, is to use requests:
import requests
session = requests.session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
This is generally a much easier and less-stressful method of retrieving URLs in Python.
requests will automatically store and re-use cookies for you. Creating a session is slightly overkill here, but it is useful when you need to submit data to login pages, or re-use cookies across a site, and so on.
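For instance, a second request on the same session will automatically re-send any cookies the first response set:
import requests

session = requests.session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
print session.cookies   # cookies from the first response, kept on the session
# the same cookies are sent automatically on the next request
r2 = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')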
Using urllib2, it's something like this:
import urllib2, cookielib
cookies = cookielib.CookieJar()
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor(cookies) )
data = opener.open('url').read()
It appears that the urllib2 default user agent is banned by the host. You can simply supply your own user agent string:
import urllib2
url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
request = urllib2.Request(url, headers={"User-Agent" : "MyUserAgent"})
contents = urllib2.urlopen(request).read()
print contents