Python: Running multiple HTTP requests concurrently based on an initial request?

Currently I am trying to fetch from an API which has 2 endpoints:
GET /AllUsers
GET /user_detail/{id}
To get the details of all the users, I would have to call GET /AllUsers and then loop through the IDs, calling the GET /user_detail/{id} endpoint one by one. Is it possible to have multiple GET /user_detail/{id} calls running at the same time? Or is there perhaps a better approach?

This sounds like a great use case for grequests:
import grequests
urls = [f'http://example.com/user_detail/{id}' for id in range(10)]
rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)
Edit: As an example, to process the responses and retrieve the JSON you could do:
data = []
for response in responses:
    data.append(response.json())
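For completeness, here is a rough sketch of the full two-step flow, assuming GET /AllUsers returns a JSON list of user IDs (the base URL and the response shape are placeholders; adjust them to your API):
import grequests
import requests

base = 'http://example.com'
user_ids = requests.get(f'{base}/AllUsers').json()  # e.g. [1, 2, 3, ...]
reqs = (grequests.get(f'{base}/user_detail/{uid}') for uid in user_ids)
# map() runs the requests concurrently; failed requests come back as None
details = [r.json() for r in grequests.map(reqs) if r is not None]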

Related

How to send multiple HTTP requests using Python

I'm trying to create a script that checks all possible top-level domain combinations based on a given root domain name.
I've generated a list of all possible TLD combinations, and now I want to make multiple HTTP requests, analyze the results, and determine whether a root domain name has multiple active top-level domains or not.
Example:
A client of mine has these active domains:
domain.com
domain.com.ar
domain.ar
I've tried using grequests but get this error:
TypeError: 'AsyncRequest' object is not iterable
Code:
import grequests
responses = grequests.get([url for url in urls])
grequests.map(responses)
The problem is that grequests.get([url for url in urls]) passes the whole list as a single URL, so you end up with one AsyncRequest, and grequests.map then fails because it expects an iterable of requests rather than a single one. You need to build one request per URL, which you can do with a comprehension or generator expression:
(grequests.get(url) for url in urls)
List comprehensions and generator expressions are covered here: https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
So the corrected code builds one request per URL and maps over all of them:
import grequests
responses = (grequests.get(u) for u in urls)
grequests.map(responses)
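From there, a rough sketch of how the responses could be analyzed to find which TLD variants are active, assuming urls is the generated list of candidate domains (the timeout and the status-code check are just one way to define "active"):
import grequests

reqs = (grequests.get(u, timeout=5) for u in urls)
responses = grequests.map(reqs)

# unreachable hosts come back as None; treat 2xx/3xx as an active domain
active = [r.url for r in responses if r is not None and r.status_code < 400]
print(active)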

POST request using Python for multiple JSON bodies

I am trying to POST requests to a URL with the input body as a dict; here's the sample code:
import requests
json_ = [{"foo":"bar"},{"foo1":"bar1"},{"foo2":"bar2"}]
for i in json_:
    r = requests.post(url, json=i, auth=auth)
    print(r.text)
but I have around 20k dict bodies, and using a for loop takes a lot of time. Is there any way I can get the response content by passing all of json_ in a single POST?
That depends on the API you're posting to. If it only accepts a single object at a time, no.
You could split the job between threads and speed things up a lot that way. Split the list into, say, five lists, write a function that does the POSTs, then start a thread for each list and join them, as in the sketch below.
https://realpython.com/intro-to-python-threading/
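A minimal sketch of that threaded approach, assuming url, auth, and json_ (the ~20k dicts) are defined as in the question:
import threading
import requests

def post_all(bodies):
    for body in bodies:
        r = requests.post(url, json=body, auth=auth)
        print(r.status_code)

n_threads = 5
chunks = [json_[i::n_threads] for i in range(n_threads)]  # split into 5 lists

threads = [threading.Thread(target=post_all, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
A thread pool (concurrent.futures.ThreadPoolExecutor) would do the same job with less bookkeeping, but the manual split shows what is going on.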

Where to store data in Flask while paginating

Respected people:
I need a hint as to where I should hold the data that I need to paginate; I am using Flask.
Should I use the session to remember what data I sent earlier, and do the same for subsequent requests?
Also, how should I hold the data sent from the API in JSON format?
data_received_from_the_api = calltoApi()
# How do I make Flask remember/store the above data
# for pagination, if I am not using sessions?
I am thinking of maintaining a list, with session[current-index] and session[previous-index]. The JSON data has 5 fields and the number of JSON records sent by the API is 100.
Could it be done without using session?
I used a list approach in a project.
When the page is loaded, if the request is a POST, it checks for previous requests and remembers the current one:
if request.method == "POST" and list(request.form.to_dict().values())[0] in req_indicator_html_names_dict.keys():
    selector_remember = ast.literal_eval(list(request.form.to_dict().keys())[0])
else:
    selector_remember = []
Then append the current request to the list:
selector_remember.append(req_ind_html_name)
Finally, pass the list to the page so you can keep track of the previous requests.
Hope it helps!
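If you would rather keep everything in the session, a minimal sketch of session-based pagination might look like this, assuming the API result is small enough to fit in the session cookie (100 short records here); calltoApi(), PER_PAGE, and the template name are placeholders:
from flask import Flask, session, request, render_template

app = Flask(__name__)
app.secret_key = "change-me"
PER_PAGE = 10

@app.route("/users")
def users():
    if "api_data" not in session:
        session["api_data"] = calltoApi()  # list of ~100 JSON records
    page = int(request.args.get("page", 1))
    start = (page - 1) * PER_PAGE
    records = session["api_data"][start:start + PER_PAGE]
    return render_template("users.html", records=records, page=page)
For anything larger, a server-side store (a database, Flask-Caching, Redis) avoids the roughly 4 KB size limit of the session cookie.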

Does urllib.request.urlopen(url) use a cache?

I have a long list of URLs whose response codes I need to check, and the links are repeated 2-3 times. I have written this script to check the response code of each URL:
connection = urllib.request.urlopen(url)
return connection.getcode()
The URL comes in XML in this format
<entry key="something">url</entry>
<entry key="somethingelse">url</entry>
and I have to associate the response code with the key attribute, so I don't want to use a set.
Now, I definitely don't want to make more than one request for the same URL, so I looked into whether urlopen uses a cache, but didn't find a conclusive answer. If it doesn't, what other technique can be used for this purpose?
You can store the URLs you have already requested in a dictionary (urls = {}), mapping each URL to its response code, and check it before making another request:
# urls = {} is created once, outside this function
if url not in urls:
    connection = urllib.request.urlopen(url)
    urls[url] = connection.getcode()
return urls[url]
By the way, if you make requests to the same URLs repeatedly (across multiple runs of the script) and need a persistent cache, I recommend using requests with requests-cache.
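A minimal sketch of that, with 'url_cache' as an arbitrary cache name (requests-cache writes it to url_cache.sqlite by default):
import requests
import requests_cache

requests_cache.install_cache('url_cache')

def get_code(url):
    # repeated calls for the same URL are answered from the local cache
    return requests.get(url).status_code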
Why don't you create a Python set() of the URLs? That way each URL is included only once.
How are you associating the URL with the key? A dictionary?
You can use a dictionary to map each URL to its response and any other information you need to keep track of. If the URL is already in the dictionary, then you already know the response. So you have one dictionary:
url_cache = {
    "url1": ("response", [key1, key2])
}
If you need to organize things differently it shouldn't be too hard with another dictionary.
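For example, a rough sketch of that dictionary approach, assuming entries is a list of (key, url) pairs parsed from the XML (the parsing itself is omitted):
import urllib.request

url_cache = {}  # url -> response code
key_codes = {}  # key -> response code

for key, url in entries:
    if url not in url_cache:
        url_cache[url] = urllib.request.urlopen(url).getcode()
    key_codes[key] = url_cache[url]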

Why do my data miner threads collect some IDs many times, others not at all?

I'm writing a data miner in python with urllib2 and BeautifulSoup to parse some websites, and in attempting to divide its processes across a few threads, I get the following output:
Successfully scraped ID 301
Successfully scraped ID 301
Empty result at ID 301
"Successful" means I got the data I needed. "Empty" means the page doesn't have what I need. "ID" is an integer affixed to the URL, like site.com/blog/post/.
First off, each thread should be parsing different URLs, not the same URLs many times. Second, I shouldn't be getting different results for the same URL.
I'm threading the processes in the following way: I instantiate some threads, pass each of them shares of a list of URLs to parse, and send them on their merry way. Here's the code:
def constructURLs(settings, idList):
    assert type(settings) is dict
    url = settings['url']
    return [url.replace('<id>', str(ID)) for ID in idList]

def miner(urls, results):
    for url in urls:
        data = spider.parse(url)
        appendData(data, results)

def mine(settings, results):
    (...)
    urls = constructURLs(settings, idList)
    threads = 3  # number of threads
    urlList = [urls[i::threads] for i in xrange(threads)]
    for urls in urlList:
        t = threading.Thread(target=miner, args=(urls, results))
        t.start()
So why are my threads parsing the same results many times, when they should all have unique lists? Why do they return different results, even on the same ID? If you'd like to see more of the code, just ask and I will happily provide. Thank you for whatever insight you can provide!
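For what it's worth, the urls[i::threads] slicing does hand each thread a disjoint chunk; a quick standalone check (with a made-up list of ten URLs) shows that no URL lands in more than one chunk:
urls = ['site.com/blog/post/%d' % i for i in range(10)]
threads = 3
urlList = [urls[i::threads] for i in xrange(threads)]  # xrange, as in the original (Python 2)
for chunk in urlList:
    print(chunk)
# indices per chunk: [0, 3, 6, 9], [1, 4, 7], [2, 5, 8]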
