retrieved URLs, trouble building payload to use requests module - python

I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01': '910', 'pf7331': '11'}
r = requests.get(url, params=payload)
Then get the second page
payload = {'PT001F01': '910', 'pf7331': '12'}
r = requests.get(url, params=payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, not all of the payloads are different simply in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I find urisplit which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python3 and I used uritools because that appears to be the standards-compliant replacement of urltools.
I fell back on a shell script to get the pages with wget, which does work, but it is so un-Python-ish that I'm asking here what to do instead. I mean, this does work:
import subprocess

for i in hrefs2:
    subprocess.call(["wget", i])

You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split each pair up
        b = kv.split('=')
        # first part is the key, second is the value
        data[b[0]] = b[1]
    # after converting every kv pair in the parameter, add the result to the list
    urldata.append(data)
You could do this with less code, but I wanted to be clear about what's going on. I'm sure there is already a module somewhere out there that does this for you too.
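There is indeed such a module: the standard library's urllib.parse. Its parse_qsl function splits a query string into key/value pairs, and dict() turns those pairs into exactly the payload shape requests expects. A minimal sketch using the hrefs2 list from the question:

from urllib.parse import parse_qsl, urlsplit

hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']

# urlsplit(u).query is the part after '?'; parse_qsl turns
# 'PT001F01=910&pf7331=11' into [('PT001F01', '910'), ('pf7331', '11')]
payloads = [dict(parse_qsl(urlsplit(u).query)) for u in hrefs2]
# [{'PT001F01': '910', 'pf7331': '11'}, {'PT001F01': '910', 'pf7331': '12'}]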

How can I prepend each value of a list with either a loop or a function?

I'm currently making API calls to an application and then parsing the JSON data for certain values. The values I'm parsing are suffixes of a full URL. I need to take this list of suffixes and prepend them with my variable which completes each URL. This list can vary in size and so I will not always know the exact list length nor the exact values of the suffixes.
The list (worker_uri_list) will consist of the following format: ['/api/admin/configuration/v1/vm/***/', '/api/admin/configuration/v1/vm/***/', '/api/admin/configuration/v1/vm/***/']. The asterisks are characters that can vary at any given time when the API call is made. I need to prepend each one of these suffixes with my variable (url_prefix).
I'm having a hard time trying to understand the best method and how to do this correctly. Very new to Python here, so I hope I've explained this well enough. It seems a simple for or while loop could accomplish this, but most of the loop examples I find deal with simple integer values.
Any help is appreciated.
url_prefix = "https://website.com"
url1 = url_prefix + "/api/admin/configuration/v1/config"
payload={}
header = {
'Authorization': 'Basic ********'
}
#Parse response
response = requests.get(url1, headers=header).json()
worker_uri_list = nested_lookup("resource_uri", response)
loop?
function?
You could iterate the worker_uri_list in a loop:
for uri in worker_uri_list:
    url = url_prefix + uri
    # do something with the url
    print(url)
Output:
https://website.com/api/admin/configuration/v1/vm/***/
https://website.com/api/admin/configuration/v1/vm/***/
https://website.com/api/admin/configuration/v1/vm/***/
Or if you want a list of URLs, use a list comprehension:
urls = [url_prefix + uri for uri in worker_uri_list]
print(urls)
Output:
[
'https://website.com/api/admin/configuration/v1/vm/***/',
'https://website.com/api/admin/configuration/v1/vm/***/',
'https://website.com/api/admin/configuration/v1/vm/***/'
]
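If the prefix might ever end with a slash, or a suffix might arrive without a leading one, the standard library's urljoin handles those boundary cases for you; a sketch using the same placeholder values as the question:

from urllib.parse import urljoin

url_prefix = "https://website.com"
worker_uri_list = ['/api/admin/configuration/v1/vm/***/',
                   '/api/admin/configuration/v1/vm/***/']

# urljoin resolves each suffix against the prefix, so a doubled or
# missing slash at the boundary doesn't produce a malformed URL
urls = [urljoin(url_prefix, uri) for uri in worker_uri_list]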

Input variable name as raw string into request in python

I am kind of very new to python.
I tried to loop through a URL request via Python, and I want to change one variable each time it loops.
My code looks something like this:
codes = ["MCDNDF3","MCDNDF4"]
#count = 0
for x in codes:
response = requests.get(url_part1 + str(codes) + url_part3, headers=headers)
print(response.content)
print(response.status_code)
print(response.url)
I want the URL to change on every loop: first url_part1 + code + url_part3, then url_part1 + NEXTcode + url_part3.
Sadly my request badly formats the string from the variable to "%5B'MCDNDF3'%5D".
It should get inserted as a raw string each loop. I don't know if I need url encoding as I don't have any special chars in the request. Just change code to MCDNDF3 and in the next request to MCDNDF4.
Any thoughts?
Thanks!
In your for loop, the first line should be:
response = requests.get(url_part1 + x + url_part3, headers=headers)
This will work assuming url_part1 and url_part3 are regular strings. x is already a string, as your codes list (at least in your example) contains only strings. %5B and %5D are [ and ] URL-encoded, respectively. You got that error because you called str() on a single-membered list:
>>> str(["This is a string"])
"['This is a string']"
If url_part1 and url_part3 are raw strings, as you seem to indicate, please update your question to show how they are defined. Feel free to use example.com if you don't want to reveal your actual target URL. Note that raw string literals still produce ordinary str objects, so no conversion is needed before concatenating them.
You’re putting the whole list in (codes) when you probably want x.
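For completeness, here is a sketch of the corrected loop using an f-string, which makes it harder to accidentally format the whole list; the URL parts here are placeholders, not the asker's real values:

import requests

url_part1 = "https://example.com/lookup?code="  # placeholder
url_part3 = "&format=json"                      # placeholder
codes = ["MCDNDF3", "MCDNDF4"]

for code in codes:
    # interpolates the loop variable, not the codes list
    response = requests.get(f"{url_part1}{code}{url_part3}")
    print(response.status_code, response.url)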

How to print same dictionary object from multiple urls with grequest?

I have a list of URLs that all use the same JSON structure. I am trying to pull specific dictionary objects from all of the URLs at once with grequests. I am able to do it with one URL, though for that I am using requests:
import requests
import json
main_api = 'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-1ST&type=both&depth=50'
json_data = requests.get(main_api).json()
Quantity = json_data['result']['buy'][0]['Quantity']
Rate = json_data['result']['buy'][0]['Rate']
Quantity_2 = json_data['result']['sell'][0]['Quantity']
Rate_2 = json_data['result']['sell'][0]['Rate']
print ("Buy")
print(Rate)
print(Quantity)
print ("")
print ("Sell")
print(Rate_2)
print(Quantity_2)
I want to be able to print what I printed above, for every URL. But I do not know where to begin. This is what I have so far:
import grequests
import json
urls = [
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-1ST&type=both&depth=50',
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-2GIVE&type=both&depth=50',
    'https://bittrex.com/api/v1.1/public/getorderbook?market=BTC-ABY&type=both&depth=50',
]
requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)
I thought it would be something like print(response.json(['result']['buy'][0]['Quantity'] for response in responses)), but that does not work at all; Python returns AttributeError: 'list' object has no attribute 'json'. I am very new to Python, and coding in general, and I would appreciate any help.
Your responses variable is a list of Response objects. If you simply print the list with
print(responses)
it gives you
[<Response [200]>, <Response [200]>, <Response [200]>]
the brackets [] tell you that this is a list, and it contains three Response objects.
When you type responses.json(...) you are telling Python to call the json() method on the list object. The list, however, does not offer such a method; only the objects inside the list have it.
What you need to do is access an element in the list and call the json() method on that element. This is done by specifying the position of the list element you want to access, like this:
print(responses[0].json()['result']['buy'][0]['Quantity'])
This will access the first element in the responses list.
Of course, it is not practical to access each list element individually if you want to output many items. That's why there are loops. Using a loop you can simply say: do this for each element in my list. This looks like this:
for response in responses:
    print("Buy")
    print(response.json()['result']['buy'][0]['Quantity'])
    print(response.json()['result']['buy'][0]['Rate'])
    print("Sell")
    print(response.json()['result']['sell'][0]['Quantity'])
    print(response.json()['result']['sell'][0]['Rate'])
    print("----")
The for-each-loop executes the indented lines of code for each element in the list. The current element is available in the response variable.
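One refinement worth knowing: response.json() re-parses the body on every call, so it is slightly cleaner and cheaper to parse each response once and reuse the result. The same loop, restructured:

for response in responses:
    data = response.json()  # parse the body once per response
    buy = data['result']['buy'][0]
    sell = data['result']['sell'][0]
    print("Buy")
    print(buy['Rate'])
    print(buy['Quantity'])
    print("Sell")
    print(sell['Rate'])
    print(sell['Quantity'])
    print("----")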

trouble scraping from JSONP feed

I asked a similar question earlier
python JSON feed returns string not object
but I am having a little more trouble and don't understand it.
For about half of the dates this works and returns a JSON object. For example, November 9, 2013 works:
import json
import requests

url = 'http://data.ncaa.com/jsonp/scoreboard/basketball-men/d1/2013/11/09/scoreboard.html?callback=c'
r = requests.get(url)
jsonObj = json.loads(r.content[2:-2])
but if I try November 11 2013:
url = 'http://data.ncaa.com/jsonp/scoreboard/basketball-men/d1/2013/11/11/scoreboard.html?callback=c'
r = requests.get(url)
jsonObj = json.loads(r.content[2:-2])
I get this error
ValueError: No JSON object could be decoded
I don't understand why; when I put both URLs into a browser, they look exactly the same.
The JSON in the second feed is, in fact, invalid JSON. I found this by removing the callback wrapper and running the result through http://jsonlint.com/
To see for yourself, search for the following ID: 336252
The lines just above that ID contain two commas in a row, which is disallowed by the JSON spec.
My guess is that the server at data.ncaa.com is trying to generate JSON itself rather than using a JSON library. You should contact the site administrator and make them aware of this error.
Using demjson:

import demjson
demjson.decode(r.content[2:-2])

seems to work.
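As a side note, a JSONP body has the shape callback(payload);, so slicing a fixed number of characters off each end (r.content[2:-2]) breaks as soon as the callback name changes. A more robust sketch, assuming the feed keeps that general shape, strips whatever wrapper is present with a regex:

import json
import re

def strip_jsonp(text):
    # capture everything between the outermost parentheses of callback(...);
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    return match.group(1) if match else text

jsonObj = json.loads(strip_jsonp(r.text))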

Does urllib.request.urlopen(url) use cache?

I have a long list of URLs whose response codes I need to check, where the links are repeated 2-3 times. I have written this script to check the response code of each URL.
import urllib.request

def get_code(url):
    connection = urllib.request.urlopen(url)
    return connection.getcode()
The URLs come in XML in this format:

<entry key="something">url</entry>
<entry key="somethingelse">url</entry>
and I have to associate the response code with the key attribute, so I don't want to use a set.
Now I definitely don't want to make more than one request for the same URL, so I searched for whether urlopen uses a cache, but didn't find a conclusive answer. If it doesn't, what other technique can be used for this purpose?
urlopen itself does not cache responses, but you can store the response codes in a dictionary (urls = {}) as you make each request, and skip the network call for any URL you have already fetched:

urls = {}  # maps each url already fetched to its response code

def get_code(url):
    if url not in urls:
        connection = urllib.request.urlopen(url)
        urls[url] = connection.getcode()
    return urls[url]
BTW, if you make requests to the same URLs repeatedly (across multiple runs of the script) and need a persistent cache, I recommend using requests together with requests-cache.
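A minimal sketch of that approach (requests-cache is a third-party package installed separately; the cache name here is arbitrary):

import requests
import requests_cache

# transparently caches every requests call in a local SQLite file
requests_cache.install_cache('url_check_cache')

url = 'http://example.com/'  # placeholder
code = requests.get(url).status_code  # repeated calls are served from the cache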
Why don't you create a python set() of the URLs? That way each url is included only once.
How are you associating the URL with the key? A dictionary?
You can use a dictionary to map the URL to its response and any other information you need to keep track of. If the URL is already in the dictionary, then you know the response. So you have one dictionary:

url_cache = {
    "url1": ("response", [key1, key2])
}
If you need to organize things differently it shouldn't be too hard with another dictionary.
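If a per-run cache is enough, the standard library can also do the bookkeeping for you: functools.lru_cache memoizes a function by its arguments, so the body runs only once per distinct URL. A sketch wrapping the same get_code helper from the question:

import functools
import urllib.request

@functools.lru_cache(maxsize=None)
def get_code(url):
    # runs once per distinct url; later calls return the cached code
    return urllib.request.urlopen(url).getcode()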
