Does urllib.request.urlopen(url) use cache? - python

I have a long list of URLs whose response codes I need to check, and the links are repeated 2-3 times. I have written this snippet to check the response code of each URL:
connection = urllib.request.urlopen(url)
return connection.getcode()
The URLs come in XML in this format:
<entry key="something">url</entry>
<entry key="somethingelse">url</entry>
and I have to associate the response code with the key attribute, so I don't want to use a set.
Now I definitely don't want to make more than one request for the same URL, so I searched for whether urlopen uses a cache but didn't find a conclusive answer. If it doesn't, what other technique can be used for this purpose?

You can store the URLs in a dictionary (say, urls = {}) as you make each request, and check whether you have already requested that URL before making another call:
urls = {}  # maps each URL to its response code

if url not in urls:
    connection = urllib.request.urlopen(url)
    urls[url] = connection.getcode()
return urls[url]
BTW, if you make requests to the same URLs repeatedly (across multiple runs of the script) and need a persistent cache, I recommend using requests with requests-cache.
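For example, a minimal sketch of a persistent cache with requests-cache (assuming the package is installed and that a SQLite cache file named url_check_cache is acceptable):

import requests
import requests_cache

# creates url_check_cache.sqlite; repeated GETs for the same URL are served from the cache
requests_cache.install_cache("url_check_cache")

def get_code(url):
    return requests.get(url).status_code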

Why don't you create a Python set() of the URLs? That way each URL is included only once.

How are you associating the URL with the key? A dictionary?
You can use a dictionary to map each URL to its response and any other information you need to keep track of. If the URL is already in the dictionary, then you already know the response. So you have one dictionary:
url_cache = {
    "url1": ("response", [key1, key2]),
}
If you need to organize things differently, it shouldn't be too hard with another dictionary.
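For illustration, a minimal sketch that fills such a cache from the question's XML; it assumes the <entry> elements are wrapped in a single root element held in a string variable xml_text:

import urllib.request
import xml.etree.ElementTree as ET

url_cache = {}  # url -> (response code, [keys that point at it])

root = ET.fromstring(xml_text)  # xml_text holds the <entry> document
for entry in root.iter("entry"):
    key, url = entry.get("key"), entry.text.strip()
    if url in url_cache:
        url_cache[url][1].append(key)  # already fetched, just record the extra key
    else:
        code = urllib.request.urlopen(url).getcode()
        url_cache[url] = (code, [key])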

Related

I need to create unit tests for my URLs in a Django project

I have this URL which returns the JSON data of my models, but I don't know how to create a unit test for a URL like this:
path("list/", views.json_list, name="json_list"),
I'm not really sure what is being asked. A test like this
url = reverse('myapp:json_list')
response = client.get(url)
body = response.content.decode()
is going to fail if anything is wrong with the URL definition. (Specifically, reverse will fail if the name doesn't exist, and for a URL with arguments, if what you supply as kwargs isn't accepted by the URL definition.)
As for validating the response, we can't help without knowing a lot more about what is expected. Presumably, you will locate the start of some JSON text in body, feed it to json.loads, and make sure the data is as expected. But I don't think that's what is being asked.
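For what it's worth, a minimal sketch of such a test, assuming the path shown in the question is not namespaced and that the view returns JSON:

import json

from django.test import TestCase
from django.urls import reverse


class JsonListTests(TestCase):
    def test_json_list_returns_json(self):
        # reverse() raises NoReverseMatch if the name is misspelled or missing
        url = reverse("json_list")
        response = self.client.get(url)
        self.assertEqual(response.status_code, 200)
        # json.loads raises ValueError if the body is not valid JSON
        data = json.loads(response.content.decode())
        self.assertIsInstance(data, (list, dict))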

Where to store data, in Flask, while paginating

Respected people:
I need a hint as to where I should hold the data that I need to paginate; I am using Flask.
Should I use session to remember what data I sent earlier, and do the same for subsequent requests?
Also, how should I hold the data sent from the API in JSON format?
data_received_from_the_api = calltoApi()
# How do I make Flask remember/store the above data
# for pagination, if I am not using sessions?
I am thinking of maintaining a list, with session['current-index'] and session['previous-index']. The JSON data has 5 fields, and the number of JSON records sent by the API is 100.
Could it be done without using session?
I used a list approach in a project:
When the page is loaded, if the request is a POST, it checks for previous requests and remembers the current one:
if request.method == "POST" and list(request.form.to_dict().values())[0] in req_indicator_html_names_dict.keys():
    selector_remember = ast.literal_eval(list(request.form.to_dict().keys())[0])
else:
    selector_remember = []
Then append the current request to the list:
selector_remember.append(req_ind_html_name)
It then passes the list to the page so you can keep track of the previous requests.
Hope it helps!
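If you would rather not use the session at all, one session-free approach is to re-slice the API data on every request based on a page query parameter. A minimal sketch, where calltoApi() and the route are stand-ins for the question's real API call:

from flask import Flask, jsonify, request

app = Flask(__name__)
PER_PAGE = 10

def calltoApi():
    # stand-in for the real API call from the question (100 records, 5 fields each)
    return [{"id": i, "a": 1, "b": 2, "c": 3, "d": 4} for i in range(100)]

@app.route("/records")
def records():
    page = request.args.get("page", 1, type=int)
    data = calltoApi()  # or a cached copy if the API call is expensive
    start = (page - 1) * PER_PAGE
    return jsonify(data[start:start + PER_PAGE])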

retrieved URLs, trouble building payload to use requests module

I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01': '910', 'pf7331': '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01': '910', 'pf7331': '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, and the payloads don't all differ only in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I found urisplit, which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python3 and I used uritools because that appears to be the standards-compliant replacement of urltools.
I fell back on a shell-style approach, getting the pages with wget, which does work, but it is so un-Pythonic that I'm asking here what to do instead. I mean, this does work:
import subprocess

for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full URL to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
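So the whole hrefs2 array from the question can be walked with a plain loop; a minimal sketch:

import requests

for href in hrefs2:
    r = requests.get(href)
    print(href, r.status_code)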
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split each pair up
        b = kv.split('=')
        # first part is the key, second is the value
        data[b[0]] = b[1]
    # after converting every kv pair in the parameter, add the result to the list
    urldata.append(data)
You could do this with less code but I wanted to be clear what was going on. I'm sure there is already a module somewhere out there that does this for you too.
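There is indeed a standard-library module for it: urllib.parse can split both the URL and the query string, so the manual parsing above can be reduced to something like this sketch:

from urllib.parse import parse_qsl, urlsplit

import requests

for href in hrefs2:
    parts = urlsplit(href)
    payload = dict(parse_qsl(parts.query))  # e.g. {'PT001F01': '910', 'pf7331': '11'}
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    r = requests.get(base_url, params=payload)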

Handling different response data types of REST APIs

The requirement for my application is to do a GET on the GitHub API
https://api.github.com/repos/{full_name}/commits
In the ideal case, this REST API returns a list of dictionaries. Then the application has to fetch the first element of the result.
However, the REST API might also return a dictionary in the non-ideal case (an empty repository with no commits). In that case, fetching the first element throws a KeyError.
Right now, I have wrapped the code in try...except, so if an exception is raised in the non-ideal case, the application bails out.
Is there a better way to handle the ideal and non-ideal case?
The response for the GitHub API request is in JSON format. It would be better to parse the response as JSON and then use a for loop to traverse the commit data. For example, a good way to print all the commit SHAs from the response is as follows:
import requests

# 'response' is what requests.get() returns; the URL placeholder is from the question
response = requests.get(<<URL with necessary authentication>>)
if response.status_code == 200:
    response_j = response.json()
    for commit in response_j:
        print(commit['sha'])
If the repository has no commits, it should return a dict instead of a list, so you can add a condition to check for that case.
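Building on that, here is a minimal sketch that handles both shapes without a bare try/except; the function name and full_name argument are illustrative, and authentication is omitted:

import requests

def first_commit_sha(full_name):
    # full_name is e.g. "owner/repo"
    response = requests.get(f"https://api.github.com/repos/{full_name}/commits")
    data = response.json()
    # a list means commits came back; a dict means an error or empty-repo message
    if isinstance(data, list) and data:
        return data[0]["sha"]
    return None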

Sending the variable's content to my mailbox in Python?

I have asked this question here about a Python command that fetches a URL of a web page and stores it in a variable. The first thing that I wanted to know then was whether or not the variable in this code contains the HTML code of a web page:
from google.appengine.api import urlfetch

url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
    doSomethingWithResult(result.content)
The answer that I received was "yes", i.e. the variable "result" in the code did contain the HTML code of a web page, and the programmer who was answering said that I needed to "check the Content-Type header and verify that it's either text/html or application/xhtml+xml". I've looked through several Python tutorials, but couldn't find anything about headers. So my question is where is this Content-Type header located and how can I check it? Could I send the content of that variable directly to my mailbox?
Here is where I got this code. It's on Google App Engine.
If you look at the Google App Engine documentation for the response object, the result of urlfetch.fetch() contains the member headers, which holds the HTTP response headers as a mapping of names to values. So, all you probably need to do is:
if result.headers['Content-Type'] in ('text/html', 'application/xhtml+xml'):
    # assuming you want to do something with the content
    doSomethingWithXHTML(result.content)
else:
    # use content for something else
    doTheOtherThing(result.content)
As far as emailing the variable's contents, I suggest the Python email module.
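For example, a minimal sketch using the standard library's email and smtplib modules; the addresses and the localhost SMTP server are placeholders, and on App Engine itself you would likely need its own mail service instead:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Fetched page"
msg["From"] = "me@example.com"   # placeholder addresses
msg["To"] = "me@example.com"
msg.set_content(result.content.decode("utf-8", errors="replace"))

# assumes an SMTP server reachable on localhost
with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)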
For info on sending the Content-Type header, see: http://code.google.com/appengine/docs/python/urlfetch/overview.html#Request_Headers
