I am using private rotating proxy provided by (https://proxy.webshare.io/proxy/rotating?) in which each request to rotating proxy receives a new IP address. when I am using
requests.get('https://httpbin.org/get', headers=headers, proxies=get_proxy())
it returns a new IP each time whenever I make request. but when using
session = requests.Session()
session.headers = headers
session.proxies = get_proxy()
session.get('https://httpbin.org/get')
It returns same IP each time whenever I make request.
How does session object behaves different from requests.get() function in case of proxies.
Session uses previously set up variables/values for each subsequent request, like Cookies. If you want to change the proxy for each request in the session, then use Prepared Requests to set it each time or just put it in a function:
def send(session, url):
return session.get(url, proxy=get_proxy())
sess = requests.Session()
sess.headers = headers
resp = send(sess, 'https://httpbin.org/get')
print(resp.status_code)
But if you're trying to hide your origin IP for scraping or something, you probably don't want to persist cookies, etc. so you shouldn't use sessions.
The following code works, and it take a proxylistfile.txt file to check every proxy:
from requests import *
import bs4
import sys
if len(sys.argv) < 2:
print('Usage: ./testproxy.py <proxylistfile.txt>')
sys.exit()
ifco = 'http://ifconfig.co'
PROXIES_FILE = sys.argv[1]
proxy = dict()
with open(PROXIES_FILE) as file:
for line in file:
if line[0] == '#' or line == "\n":
continue
line_parts = line.replace('\n', '').split(':')
proxy['http'] = f'{line_parts[0]}://{line_parts[1]}:{line_parts[2]}'
try:
i = get(ifco, proxies=proxy, timeout=11)
print(f"{proxy['http']} - successfull - IP ---> ", end='')
zu = bs4.BeautifulSoup(i.text, 'html.parser')
testo = zu.findAll('p', text=True)[0].get_text()
print(testo)
except:
print(f"{proxy['http']} - unsuccessfull")
pass
It connect ot ifconfig.co site and return its real ip to check if the proxy works.
The output will be something like:
http://proxy:port - successfull - IP ---> your.real.ip
the input file format should be like:
http:1.1.1.1:3128
I finally switch to another rotating proxy provider (https://www.proxyegg.com) and the issue has been resolved now.
Related
I'm working with one pretty old server and for some reason it requires to send cookies in next format (RFC2109 section 4.4). Raw Cookie header:
Cookie: $Version="1"; temp="1234567890";$Path="/";$Domain=".example.com"; session="abcdefgh";$Path="/";$Domain=".example.com"; id="00001111";$Path="/";$Domain=".example.com"
I know, that there's a way to implement that formatting manually using prepared request, but maybe there's some other method?
I'm using python requests.
Code which obviously doesn't work as expected:
from requests import Session
from datetime import datetime, timedelta
from http.cookiejar import DefaultCookiePolicy
sess = Session()
sess.cookies.set_policy(DefaultCookiePolicy(rfc2109_as_netscape=True))
sess.cookies.set(name="temp", value="1234567890", domain=".httpbin.org", path="/", expires=int((datetime.now() + timedelta(days=365)).timestamp()))
sess.cookies.set(name="session", value="abcdefgh", domain=".httpbin.org", path="/", expires=int((datetime.now() + timedelta(days=365)).timestamp()))
sess.cookies.set(name="id", value="00001111", domain=".httpbin.org", path="/", expires=int((datetime.now() + timedelta(days=365)).timestamp()))
resp = sess.get("https://httpbin.org/headers")
print(resp.json())
Upd.
I've tried to set cookie policy to this old standard, but it changed nothing.
Unfortunately, I haven't found other way except using prepared request. Sharing my code for future researchers.
Code:
from requests import Session, Request
def dumb_cookies_request(*args, **kwargs):
session = kwargs.pop('session') if 'session' in kwargs else Session()
req = Request(*args, **kwargs)
prepped = session.prepare_request(req)
if len(session.cookies) > 0:
cookies = ['$Version="1"']
for c in session.cookies:
cookies.append(f'{c.name}="{c.value}";$Path="{c.path}";$Domain="{c.domain}"')
prepped.headers['Cookie'] = '; '.join(cookies)
return session.send(prepped)
Usage:
sess = Session()
# some actions
resp = dumb_cookies_request("GET", "https://httpbin.org/headers", session=sess)
this is a test script to request data from Rovi API, provided by the API itself.
test.py
import requests
import time
import hashlib
import urllib
class AllMusicGuide(object):
api_url = 'http://api.rovicorp.com/data/v1.1/descriptor/musicmoods'
key = 'my key'
secret = 'secret'
def _sig(self):
timestamp = int(time.time())
m = hashlib.md5()
m.update(self.key)
m.update(self.secret)
m.update(str(timestamp))
return m.hexdigest()
def get(self, resource, params=None):
"""Take a dict of params, and return what we get from the api"""
if not params:
params = {}
params = urllib.urlencode(params)
sig = self._sig()
url = "%s/%s?apikey=%s&sig=%s&%s" % (self.api_url, resource, self.key, sig, params)
resp = requests.get(url)
if resp.status_code != 200:
# THROW APPROPRIATE ERROR
print ('unknown err')
return resp.content
from another script I import the module:
from roviclient.test import AllMusicGuide
and create an instance of the class inside a mood function:
def mood():
test = AllMusicGuide()
print (test.get('[moodids=moodids]'))
according to documentation, the following is the syntax for requests:
descriptor/musicmoods?apikey=apikey&sig=sig [&moodids=moodids] [&format=format] [&country=country] [&language=language]
but running the script I get the following error:
unknown err
<h1>Gateway Timeout</h1>:
what is wrong?
"504, try once more. 502, it went through."
Your code is fine, this is a network issue. "Gateway Timeout" is a 504. The intermediate host handling your request was unable to complete it. It made its own request to another server on your behalf in order to handle yours, but this request took too long and timed out. Usually this is because of network congestion in the backend; if you try a few more times, does it sometimes work?
In any case, I would talk to your network administrator. There could be any number of reasons for this and they should be able to help fix it for you.
We use a custom scraper that have to take a separate website for a language (this is an architecture limitation). Like site1.co.uk, site1.es, site1.de etc.
But we need to parse a website with many languages, separated by url - like site2.com/en, site2.com/de, site2.com/es and so on.
I thought about MITMProxy: I could redirect all requests this way:
en.site2.com/* --> site2.com/en
de.site2.com/* --> site2.com/de
...
I have written a small script which simply takes URLs and rewrites them:
class MyMaster(flow.FlowMaster):
def handle_request(self, r):
url = r.get_url()
# replace URLs
if 'blabla' in url:
r.set_url(url.replace('something', 'another'))
But the target host generates 301 redirect with the response from the webserver - 'the page has been moved here' and the link to the site2.com/en
It worked when I played with URL rewriting, i.e. site2.com/en --> site2.com/de.
But for different hosts (subdomain and the root domain, to be precise), it does not work.
I tried to replace the Host header in the handle_request method from above:
for key in r.headers.keys():
if key.lower() == 'host':
r.headers[key] = ['site2.com']
also I tried to replace the Referrer - all of that didn't help.
How can I finally spoof that request from the subdomain to the main domain? If it generates a HTTP(s) client warning it's ok since we need that for the scraper (and the warnings there can be turned off), not the real browser.
Thanks!
You need to replace the content of the response and craft the header with just a few fields.
Open a new connection to the redirected url and craft your response :
def handle_request(self, flow):
newUrl = <new-url>
retryCount = 3
newResponse = None
while True:
try:
newResponse = requests.get(newUrl) # import requests
except:
if retryCount == 0:
print 'Cannot reach new url ' + newUrl
traceback.print_exc() # import traceback
return
retryCount -= 1
continue
break
responseHeaders = Headers() # from netlib.http import Headers
if 'Date' in newResponse.headers:
responseHeaders['Date'] = str(newResponse.headers['Date'])
if 'Connection' in newResponse.headers:
responseHeaders['Connection'] = str(newResponse.headers['Connection'])
if 'Content-Type' in newResponse.headers:
responseHeaders['Content-Type'] = str(newResponse.headers['Content-Type'])
if 'Content-Length' in newResponse.headers:
responseHeaders['Content-Length'] = str(newResponse.headers['Content-Length'])
if 'Content-Encoding' in newResponse.headers:
responseHeaders['Content-Encoding'] = str(inetResponse.headers['Content-Encoding'])
response = HTTPResponse( # from libmproxy.models import HTTPResponse
http_version='HTTP/1.1',
status_code=200,
reason='OK',
headers=responseHeaders,
content=newResponse.content)
flow.reply(response)
As it says in the title, I am trying to access a url through several different proxies sequentially (using for loop). Right now this is my code:
import requests
import json
with open('proxies.txt') as proxies:
for line in proxies:
proxy=json.loads(line)
with open('urls.txt') as urls:
for line in urls:
url=line.rstrip()
data=requests.get(url, proxies={'http':line})
data1=data.text
print data1
and my urls.txt file:
http://api.exip.org/?call=ip
and my proxies.txt file:
{"https": "84.22.41.1:3128"}
{"http":"194.126.181.47:81"}
{"http":"218.108.170.170:82"}
that I got at [www.hidemyass.com][1]
for some reason, the output is
68.6.34.253
68.6.34.253
68.6.34.253
as if it is accessing that website through my own router ip address. In other words, it is not trying to access through the proxies I give it, it is just looping through and using my own over and over again. What am I doing wrong?
According to this thread, you need to specify the proxies dictionary as {"protocol" : "ip:port"}, so your proxies file should look like
{"https": "84.22.41.1.3128"}
{"http": "194.126.181.47:81"}
{"http": "218.108.170.170:82"}
EDIT:
You're reusing line for both URLs and proxies. It's fine to reuse line in the inner loop, but you should be using proxies=proxy--you've already parsed the JSON and don't need to build another dictionary. Also, as abanert says, you should be doing a check to ensure that the protocol you're requesting matches that of the proxy. The reason the proxies are specified as a dictionary is to allow lookup for the matching protocol.
There are two obvious problems right here:
data=requests.get(url, proxies={'http':line})
First, because you have a for line in urls: inside the for line in proxies:, line is going to be the current URL here, not the current proxy. And besides, even if you weren't reusing line, it would be the JSON string representation, not the dict you decoded from JSON.
Then, if you fix that to use proxy, instead of something like {'https': '83.22.41.1:3128'}, you're passing {'http': {'https': '83.22.41.1:3128'}}. And that obviously isn't a valid value.
To fix both of those problems, just do this:
data=requests.get(url, proxies=proxy)
Meanwhile, what happens when you have an HTTPS URL, but the current proxy is an HTTP proxy? You're not going to use the proxy. So you probably want to add something to skip over them, like this:
if urlparse.urlparse(url).scheme not in proxy:
continue
Directly copied from another answer of mine.
Well, actually you can, I've done this with a few lines of code and it works pretty well.
import requests
class Client:
def __init__(self):
self._session = requests.Session()
self.proxies = None
def set_proxy_pool(self, proxies, auth=None, https=True):
"""Randomly choose a proxy for every GET/POST request
:param proxies: list of proxies, like ["ip1:port1", "ip2:port2"]
:param auth: if proxy needs auth
:param https: default is True, pass False if you don't need https proxy
"""
from random import choice
if https:
self.proxies = [{'http': p, 'https': p} for p in proxies]
else:
self.proxies = [{'http': p} for p in proxies]
def get_with_random_proxy(url, **kwargs):
proxy = choice(self.proxies)
kwargs['proxies'] = proxy
if auth:
kwargs['auth'] = auth
return self._session.original_get(url, **kwargs)
def post_with_random_proxy(url, *args, **kwargs):
proxy = choice(self.proxies)
kwargs['proxies'] = proxy
if auth:
kwargs['auth'] = auth
return self._session.original_post(url, *args, **kwargs)
self._session.original_get = self._session.get
self._session.get = get_with_random_proxy
self._session.original_post = self._session.post
self._session.post = post_with_random_proxy
def remove_proxy_pool(self):
self.proxies = None
self._session.get = self._session.original_get
self._session.post = self._session.original_post
del self._session.original_get
del self._session.original_post
# You can define whatever operations using self._session
I use it like this:
client = Client()
client.set_proxy_pool(['112.25.41.136', '180.97.29.57'])
It's simple, but actually works for me.
I have seen this thread already - How can I unshorten a URL?
My issue with the resolved answer (that is using the unshort.me API) is that I am focusing on unshortening youtube links. Since unshort.me is used readily, this returns almost 90% of the results with captchas which I am unable to resolve.
So far I am stuck with using:
def unshorten_url(url):
resolvedURL = urllib2.urlopen(url)
print resolvedURL.url
#t = Test()
#c = pycurl.Curl()
#c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
#c.setopt(c.WRITEFUNCTION, t.body_callback)
#c.perform()
#c.close()
#dom = xml.dom.minidom.parseString(t.contents)
#resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue
return resolvedURL.url
Note: everything in the comments is what I tried to do when using the unshort.me service which was returning captcha links.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
one line functions, using requests library and yes, it supports recursion.
def unshorten_url(url):
return requests.head(url, allow_redirects=True).url
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse
def unshorten_url(url):
parsed = urlparse.urlparse(url)
h = httplib.HTTPConnection(parsed.netloc)
resource = parsed.path
if parsed.query != "":
resource += "?" + parsed.query
h.request('HEAD', resource )
response = h.getresponse()
if response.status/100 == 3 and response.getheader('Location'):
return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
else:
return url
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation
without using open (since it is a waste of bandwidth)?
The more efficient way is to not close the connection, keep it open in the background, by using HTTP's Connection: keep-alive.
After a small test, unshorten.me seems to take the HEAD method into account and doing a redirect to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get. Beside tweaking your kernel's TCP/IP stack.
Here a src code that takes into account almost of the useful corner cases:
set a custom Timeout.
set a custom User Agent.
check whether we have to use an http or https connection.
resolve recursively the input url and prevent ending within a loop.
The src code is on github # https://github.com/amirkrifa/UnShortenUrl
comments are welcome ...
import logging
logging.basicConfig(level=logging.DEBUG)
TIMEOUT = 10
class UnShortenUrl:
def process(self, url, previous_url=None):
logging.info('Init url: %s'%url)
import urlparse
import httplib
try:
parsed = urlparse.urlparse(url)
if parsed.scheme == 'https':
h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
else:
h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
resource = parsed.path
if parsed.query != "":
resource += "?" + parsed.query
try:
h.request('HEAD',
resource,
headers={'User-Agent': 'curl/7.38.0'}
)
response = h.getresponse()
except:
import traceback
traceback.print_exec()
return url
logging.info('Response status: %d'%response.status)
if response.status/100 == 3 and response.getheader('Location'):
red_url = response.getheader('Location')
logging.info('Red, previous: %s, %s'%(red_url, previous_url))
if red_url == previous_url:
return red_url
return self.process(red_url, previous_url=url)
else:
return url
except:
import traceback
traceback.print_exc()
return None
import requests
short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)