I have searched through all the requests docs I can find, including requests-futures, and I can't find anything addressing this question. Wondering if I missed something or if anyone here can help answer:
Is it possible to create and manage multiple unique/autonomous sessions with requests.Session()?
In more detail, I'm asking whether there is a way to create two sessions I could call get() on separately (maybe via multiprocessing), each retaining its own unique set of headers, proxies, and server-assigned sticky data.
If I want Session_A to hit someimaginarysite.com/page1.html using specific headers and proxies, get cookied and everything else, and then hit someimaginarysite.com/page2.html with the same everything, I can just create one session object and do it.
But what if I want a Session_B, at the same time, to start on page3.html and then hit page4.html with totally different headers/proxies than Session_A, and be assigned its own cookies and whatnot? In other words, I want to crawl multiple pages consecutively with the same session, not just make a single request followed by another request from a blank (new) session.
Is this as simple as saying:
import requests
Session_A = requests.Session()
Session_B = requests.Session()
headers_a = {A headers here}
proxies_a = {A proxies here}
headers_b = {B headers here}
proxies_b = {B proxies here}
response_a = Session_A.get('https://someimaginarysite.com/page1.html', headers=headers_a, proxies=proxies_a)
response_a = Session_A.get('https://someimaginarysite.com/page2.html', headers=headers_a, proxies=proxies_a)
# --- and on a separate thread/processor ---
response_b = Session_B.get('https://someimaginarysite.com/page3.html', headers=headers_b, proxies=proxies_b)
response_b = Session_B.get('https://someimaginarysite.com/page4.html', headers=headers_b, proxies=proxies_b)
Or will the above just create one session accessible by two names, so the server will see the same cookies and session appearing with two IPs and two sets of headers... which would seem more than a little odd.
Greatly appreciate any help with this, I've exhausted my research abilities.
There is probably a better way to do this, but without more information about the pagination and exactly what you want, it is hard to know exactly what you need. The following creates two threads, each making sequential calls with its own headers and proxies. Again, there may be a better approach, but with the limited information it's a bit murky.
import requests
import concurrent.futures

def session_get_same(urls, headers, proxies):
    lst = []
    with requests.Session() as s:
        for url in urls:
            lst.append(s.get(url, headers=headers, proxies=proxies))
    return lst

def main():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(
                session_get_same,
                urls=[
                    'https://someimaginarysite.com/page1.html',
                    'https://someimaginarysite.com/page2.html'
                ],
                headers={'user-agent': 'curl/7.61.1'},
                proxies={'https': 'https://10.10.1.11:1080'}
            ),
            executor.submit(
                session_get_same,
                urls=[
                    'https://someimaginarysite.com/page3.html',
                    'https://someimaginarysite.com/page4.html'
                ],
                headers={'user-agent': 'curl/7.61.2'},
                proxies={'https': 'https://10.10.1.11:1080'}
            ),
        ]
        flst = []
        for future in concurrent.futures.as_completed(futures):
            flst.append(future.result())
        return flst
Not sure if this is better for the first function; mileage may vary:
def session_get_same(urls, headers, proxies):
    lst = []
    with requests.Session() as s:
        s.headers.update(headers)
        s.proxies.update(proxies)
        for url in urls:
            lst.append(s.get(url))
    return lst
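One thing worth knowing about requests here (general library behavior, not specific to this question): session-level headers and proxies are merged with any per-request values, and the per-request values win on conflicts, so the two variants can even be mixed. A quick sketch:

with requests.Session() as s:
    s.headers.update({'user-agent': 'curl/7.61.1'})   # session default
    # the per-request header overrides the session default for this call only
    r = s.get('https://someimaginarysite.com/page1.html',
              headers={'user-agent': 'curl/7.61.2'})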
I have managed to send multiple requests to a web API at the same time through ThreadPoolExecutor and get the JSON responses, but I can't send requests with a payload.
Would you be kind enough to look at my code and suggest an edit so it sends a payload (data, headers)?
I just don't know how to send the payload.
from concurrent.futures import ThreadPoolExecutor
import requests
from timer import timer

URL = 'whatever.com'
payload = {'aaaaa': '0xxxxxxx'}
headers = {
    'abc': 'xyz',
    'Content-Type': 'application/json',
}

def fetch(session, url):
    with session.post(url) as response:
        print(response.json())

@timer(1, 1)
def main():
    with ThreadPoolExecutor(max_workers=100) as executor:
        with requests.session() as session:
            executor.map(fetch, [session] * 100, [URL] * 100)
            executor.shutdown(wait=True)
Normally you specify a "payload" using the data keyword argument on the call to the post method. But if you want to send it in JSON format, then you should use the json keyword argument:
session.post(url, json=payload, headers=headers)
(If the header specified 'Content-Type': 'application/json', as yours does, and if payload were already a JSON string, which yours is not, then you would be correct to use the data keyword argument, since no JSON conversion would be needed. But here you clearly need requests to first convert a Python dictionary to a JSON string for transmission, and that is why the json argument is used. You do not really need to specify the headers argument explicitly, since requests will provide an appropriate Content-Type for you.)
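In other words, the two calls below should be roughly equivalent (just a sketch, using the payload and headers dictionaries from your snippet):

import json

# serialize yourself and supply the Content-Type header explicitly
session.post(url, data=json.dumps(payload), headers=headers)

# let requests serialize the dict and set the Content-Type for you
session.post(url, json=payload)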
Now I know this is just a "dummy" program fetching the same URL 100 times. In a more realistic version you would be fetching 100 different URLs, but you would, of course, be using the same Session instance for each call to fetch. You could therefore simplify the program in the following way:
from functools import partial

...

def main():
    with ThreadPoolExecutor(max_workers=100) as executor:
        with requests.Session() as session:
            worker = partial(fetch, session)  # first argument will be session
            executor.map(worker, [URL] * 100)
            # remove following line
            #executor.shutdown(wait=True)
Note that I have also commented out your explicit call to the shutdown method, since shutdown is called automatically when the with ... as executor: block terminates.
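For reference, the with-block is roughly equivalent to the following, which is why the explicit call is redundant:

executor = ThreadPoolExecutor(max_workers=100)
try:
    # submit or map work here
    ...
finally:
    executor.shutdown(wait=True)  # what the with-block does on exit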
I'm trying to change a bit of Python code at the moment that is being used as a Lambda function. Currently, the code in question takes a number of parameters that are URLs, and uses the urllib2 library to get the images at those URLs (for example, a URL could be: https://www.sundanceresort.com/wp-content/uploads/2016/08/nature-walk-1600x1400-c-center.jpg).
I'd like to change this so that the code will handle a POST request that has the image in its body. Looking at some tutorials, I thought Flask's request object might do the trick, but I'm confused as to how it would work in this particular case.
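Something along these lines is roughly what I am imagining, though I have no idea if it is right (the route and the way the image is read are just guesses on my part):

from flask import Flask, request

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict_route():
    image = request.get_data()  # raw bytes of the POST body
    # ... pass `image` to the model instead of urllib2.urlopen(req).read()
    return 'ok'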
At the moment, the relevant code that would be replaced is in three sections:
urls = urls_str.split(",")
results = []
for url in urls:
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        req = urllib2.Request(url, None, headers)
        image = urllib2.urlopen(req).read()

...

def get_param_from_url(event, param_name):
    params = event['queryStringParameters']
    return params[param_name]

...

def predict(event, context):
    try:
        param = get_param_from_url(event, 'url')
        result = model.predict(param)
The overall code is here.
Any help or suggestions would be greatly appreciated.
As it says in the title, I am trying to access a URL through several different proxies sequentially (using a for loop). Right now this is my code:
import requests
import json
with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        with open('urls.txt') as urls:
            for line in urls:
                url = line.rstrip()
                data = requests.get(url, proxies={'http': line})
                data1 = data.text
                print data1
and my urls.txt file:
http://api.exip.org/?call=ip
and my proxies.txt file:
{"https": "84.22.41.1:3128"}
{"http":"194.126.181.47:81"}
{"http":"218.108.170.170:82"}
that I got at www.hidemyass.com
for some reason, the output is
68.6.34.253
68.6.34.253
68.6.34.253
as if it is accessing that website through my own router IP address. In other words, it is not going through the proxies I give it; it just loops through and uses my own over and over again. What am I doing wrong?
According to this thread, you need to specify the proxies dictionary as {"protocol" : "ip:port"}, so your proxies file should look like
{"https": "84.22.41.1.3128"}
{"http": "194.126.181.47:81"}
{"http": "218.108.170.170:82"}
EDIT:
You're reusing line for both URLs and proxies. It's fine to reuse line in the inner loop, but you should be using proxies=proxy, since you've already parsed the JSON and don't need to build another dictionary. Also, as abarnert says, you should be doing a check to ensure that the protocol you're requesting matches that of the proxy. The reason the proxies are specified as a dictionary is to allow lookup of the matching protocol.
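Putting those fixes together, the loop would look roughly like this (same files as in the question; untested):

import json
import requests
from urlparse import urlparse  # urllib.parse on Python 3

with open('proxies.txt') as proxies_file:
    for proxy_line in proxies_file:
        proxy = json.loads(proxy_line)
        with open('urls.txt') as urls_file:
            for url_line in urls_file:
                url = url_line.rstrip()
                # skip proxies that don't serve this URL's scheme
                if urlparse(url).scheme not in proxy:
                    continue
                data = requests.get(url, proxies=proxy)
                print data.text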
There are two obvious problems right here:
data=requests.get(url, proxies={'http':line})
First, because you have a for line in urls: inside the for line in proxies:, line is going to be the current URL here, not the current proxy. And besides, even if you weren't reusing line, it would be the JSON string representation, not the dict you decoded from JSON.
Then, if you fix that to use proxy, instead of something like {'https': '83.22.41.1:3128'}, you're passing {'http': {'https': '83.22.41.1:3128'}}. And that obviously isn't a valid value.
To fix both of those problems, just do this:
data=requests.get(url, proxies=proxy)
Meanwhile, what happens when you have an HTTPS URL, but the current proxy is an HTTP proxy? You're not going to use the proxy. So you probably want to add something to skip over them, like this:
if urlparse.urlparse(url).scheme not in proxy:
    continue
Directly copied from another answer of mine.
Well, actually you can. I've done this with a few lines of code and it works pretty well.
import requests

class Client:

    def __init__(self):
        self._session = requests.Session()
        self.proxies = None

    def set_proxy_pool(self, proxies, auth=None, https=True):
        """Randomly choose a proxy for every GET/POST request
        :param proxies: list of proxies, like ["ip1:port1", "ip2:port2"]
        :param auth: if proxy needs auth
        :param https: default is True, pass False if you don't need https proxy
        """
        from random import choice

        if https:
            self.proxies = [{'http': p, 'https': p} for p in proxies]
        else:
            self.proxies = [{'http': p} for p in proxies]

        def get_with_random_proxy(url, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_get(url, **kwargs)

        def post_with_random_proxy(url, *args, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_post(url, *args, **kwargs)

        self._session.original_get = self._session.get
        self._session.get = get_with_random_proxy
        self._session.original_post = self._session.post
        self._session.post = post_with_random_proxy

    def remove_proxy_pool(self):
        self.proxies = None
        self._session.get = self._session.original_get
        self._session.post = self._session.original_post
        del self._session.original_get
        del self._session.original_post

    # You can define whatever operations using self._session
I use it like this:
client = Client()
client.set_proxy_pool(['112.25.41.136', '180.97.29.57'])
It's simple, but actually works for me.
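Once the pool is set, ordinary calls through the wrapped session go out via a random proxy, for example (hypothetical URL):

response = client._session.get('http://example.com/')  # routed through a random proxy from the pool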
I am using gevent to download some HTML pages.
Some websites are way too slow, and some stop serving requests after a period of time. That is why I had to limit the total time for a group of requests I make. For that I use gevent "Timeout".
timeout = Timeout(10)
timeout.start()

def downloadSite():
    # code to download site's url one by one
    url1 = downloadUrl()
    url2 = downloadUrl()
    url3 = downloadUrl()

try:
    gevent.spawn(downloadSite).join()
except Timeout:
    print 'Lost state here'
But the problem with it is that I lose all the state when the exception fires.
Imagine I crawl the site 'www.test.com'. I have managed to download 10 URLs right before the site admins decided to switch the webserver for maintenance. In such a case I will lose the information about the crawled pages when the exception fires.
The question is: how do I save state and process the data even if a Timeout happens?
Why not try something like:
timeout = Timeout(10)

def downloadSite(url):
    with Timeout(10):
        downloadUrl(url)

urls = ["url1", "url2", "url3"]

workers = []
limit = 5
counter = 0
for i in urls:
    # limit to 5 URL requests at a time
    if counter < limit:
        workers.append(gevent.spawn(downloadSite, i))
        counter += 1
    else:
        gevent.joinall(workers)
        # start a new batch with the current url
        workers = [gevent.spawn(downloadSite, i)]
        counter = 1
gevent.joinall(workers)
You could also save a status in a dict or something for every URL, or append the ones that fail in a different array, to retry later.
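A rough sketch of that idea, reusing the downloadUrl placeholder from the question, so nothing is lost when a timeout fires:

results = {}   # url -> downloaded data
failed = []    # urls to retry later

def downloadSite(url):
    try:
        with Timeout(10):
            results[url] = downloadUrl(url)
    except Timeout:
        failed.append(url)

gevent.joinall([gevent.spawn(downloadSite, u) for u in urls])
# everything downloaded before any timeout is still in `results`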
A self-contained example:
import gevent
from gevent import monkey
from gevent import Timeout

gevent.monkey.patch_all()
import urllib2

def get_source(url):
    req = urllib2.Request(url)
    data = None
    with Timeout(2):
        response = urllib2.urlopen(req)
        data = response.read()
    return data

N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]

print contents[5]
It implements one timeout for each request. In this example, contents contains 10 times the HTML source of google.com, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.
I saw your last comment. Defining one timeout per request is definitely not wrong from a programming point of view. If you need to throttle traffic to the website, then just don't spawn 100 greenlets simultaneously. Spawn 5, wait until they have returned, then possibly wait for a given amount of time, and spawn the next 5 (already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean that you would have to repeatedly call
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
where N should not be too high.
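A sketch of that batching idea, reusing get_source from above (a batch size of 5 is arbitrary):

all_urls = ['http://google.com' for _ in xrange(100)]
contents = []
batch_size = 5

for start in xrange(0, len(all_urls), batch_size):
    batch = all_urls[start:start + batch_size]
    getlets = [gevent.spawn(get_source, url) for url in batch]
    gevent.joinall(getlets)
    contents.extend(g.get() for g in getlets)
    # optionally gevent.sleep(...) here to throttle further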
I have a Google AppEngine (in Python) application where I need to perform 4 to 5 url fetches, and then combine the data before I print it out to the response.
I can do this without any problems using a synchronous workflow, but since the URLs that I am fetching are not related or dependent on each other, doing this asynchronously would be ideal (and quickest).
I have read and re-read the documentation here, but I just can't figure out how to read the contents of each URL. I've also searched the web for a small example (which is really what I am in need of). I have seen this SO question, but again, it doesn't mention anything about reading the contents of these individual asynchronous URL fetches.
Does anyone have any simple examples of how to perform 4 or 5 asynchronous url fetches with AppEngine? And then combine the results before I print it to the response?
Here is what I have so far:
rpcs = []
for album in result_object['data']:
    total_facebook_photo_count = total_facebook_photo_count + album['count']
    facebook_albumid_array.append(album['id'])

    # Get the photos in the photo album
    facebook_photos_url = 'https://graph.facebook.com/%s/photos?access_token=%s&limit=1000' % (album['id'], access_token)
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, facebook_photos_url)
    rpcs.append(rpc)

for rpc in rpcs:
    result = rpc.get_result()
    self.response.out.write(result.content)
However, it still looks like the line result = rpc.get_result() is forcing it to wait for the first request to finish, then the second, then the third, and so forth. Is there a way to simply put the results in variables as they are received?
Thanks!
In the example, text = result.content is where you get the content (body).
To do URL fetches in parallel, you could set them up, add them to a list, and check the results afterwards. Expanding on the example already mentioned, it could look something like:
from google.appengine.api import urlfetch

futures = []
for url in urls:
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, url)
    futures.append(rpc)

contents = []
for rpc in futures:
    try:
        result = rpc.get_result()
        if result.status_code == 200:
            contents.append(result.content)
        # ...
    except urlfetch.DownloadError:
        # Request timed out or failed.
        # ...
        pass

concatenated_result = '\n'.join(contents)
In this example, we assemble the body of all the requests that returned status code 200, and concatenate with linebreak between them.
Or with ndb, my personal preference for anything async on GAE, something like:
@ndb.tasklet
def get_urls(urls):
    ctx = ndb.get_context()
    result = yield map(ctx.urlfetch, urls)
    contents = [r.content for r in result if r.status_code == 200]
    raise ndb.Return('\n'.join(contents))
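Calling the tasklet then looks something like this (a sketch; get_result() drives the event loop until all fetches are done):

contents = get_urls(urls).get_result()
self.response.out.write(contents)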
I use this code (implemented before I learned about ndb tasklets):
while rpcs:
    rpc = UserRPC.wait_any(rpcs)
    result = rpc.get_result()
    # process result here
    rpcs.remove(rpc)