Why does a URL work in the browser but not with the requests get method - Python

While testing, I just discovered that this
url = ' http://wi312.rockdizfile.com/d/uclf2kr7fp4r2ge47pcuihdpky2chcsjur5nrds2hx53f26qgxnrktew/Kimbra%20-%20Love%20in%20High%20Places.mp3'
works in the browser and the file download begins, but if I try to fetch this file using
requests.get(url)
it gives a massive error ...
Any clue why this is happening? Do I need to decode the URL to make it work?
Update
this is the error I keep getting:
Exception in thread Thread-5:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "python/file_download.py", line 98, in _downloadChunk
stream=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/adapters.py", line 381, in send
raise Timeout(e)
Timeout: (<requests.packages.urllib3.connectionpool.HTTPConnectionPool object at 0x10258de90>, 'Connection to wi312.rockdizfile.com timed out. (connect timeout=0.001)')
There was no space in the URL when I posted; it just appeared on a new line because I posted it as inline code.
Here is the code that makes the requests (also try this new URL: http://archive.org/download/LucyIsabelleMarsh/LucyIsabelleMarsh-ItalianStreetSong.mp3):
import os
import requests
import signal
import sys
import time
import threading
import utils as _fdUtils
from socket import error as SocketError, timeout as SocketTimeout
def _downloadChunk(url, idx, irange, fileName, sizeInBytes):
_log.debug("Downloading %s for first chunk %s " % (irange, idx+1))
pulledSize = irange[-1]
try:
resp = requests.get(url, allow_redirects=False, timeout=0.001,
headers={'Range': 'bytes=%s-%s' % (str(irange[0]), str(irange[-1]))},
stream=True)
except (SocketTimeout, requests.exceptions.RequestException) as e:
_log.error(e)
return
chunk_size = str(irange[-1])
for chunk in resp.iter_content(chunk_size):
status = r"%10d [%3.2f%%]" % (pulledSize, pulledSize * 100. / int(chunk_size))
status = status + chr(8)*(len(status)+1)
sys.stdout.write('%s\r' % status)
sys.stdout.flush()
pulledSize += len(chunk)
dataDict[idx] = chunk
time.sleep(.03)
if pulledSize == sizeInBytes:
_log.info("%s downloaded %3.0f%%", fileName, pulledSize * 100. / sizeInBytes)
class ThreadedFetch(threading.Thread):
""" docstring for ThreadedFetch
"""
def __init__(self, saveTo, queue):
super(ThreadedFetch, self).__init__()
self.queue = queue
self.__saveTo = saveTo
def run(self):
threadLimiter.acquire()
try:
items = self.queue.get()
url = items[0]
split = items[-1]
fileName = _fdUtils.getFileName(url)
# grab split chunks in separate thread.
if split > 1:
maxSplits.acquire()
try:
sizeInBytes = _fdUtils.getUrlSizeInBytes(url)
if sizeInBytes:
byteRanges = _fdUtils.getRangeSegements(sizeInBytes, split)
else:
byteRanges = ['0-']
filePath = os.path.join(self.__saveTo, fileName)
downloaders = [
threading.Thread(
target=_downloadChunk,
args=(url, idx, irange, fileName, sizeInBytes),
)
for idx, irange in enumerate(byteRanges)
]
# start threads, let run in parallel, wait for all to finish
for th in downloaders:
th.start()
# wait for all threads to finish,
# which confirms the dataDict is up-to-date
for th in downloaders:
th.join()
downloadedSize = 0
with open(filePath, 'wb') as fh:
for _idx, chunk in sorted(dataDict.iteritems()):
downloadedSize += len(chunk)
status = r"%10d [%3.2f%%]" % (downloadedSize, downloadedSize * 100. / sizeInBytes)
status = status + chr(8)*(len(status)+1)
fh.write(chunk)
sys.stdout.write('%s\r' % status)
time.sleep(.04)
sys.stdout.flush()
if downloadedSize == sizeInBytes:
_log.info("%s, saved to %s", fileName, self.__saveTo)
self.queue.task_done()
finally:
maxSplits.release()

The traceback shows a Timeout exception, and indeed your code sets a very short timeout; either remove this limit or increase it:
requests.get(url, allow_redirects=False, timeout=0.001, # <-- this is very short
Even if you were accessing localhost (your own computer), such a short timeout would result in a Timeout exception. From the requests documentation:
Note
timeout is not a time limit on the entire response download; rather,
an exception is raised if the server has not issued a response for
timeout seconds (more precisely, if no bytes have been received on the
underlying socket for timeout seconds).
So it's not doing what you might expect.
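For example, a sketch of the same call with a saner timeout (10 seconds here is an arbitrary choice; tune it to your network), using url and irange as in your _downloadChunk:
# Sketch: the same request as in the question, but with the timeout
# measured in whole seconds instead of a single millisecond.
TIMEOUT = 10  # seconds; adjust to taste

resp = requests.get(url, allow_redirects=False, timeout=TIMEOUT,
                    headers={'Range': 'bytes=%s-%s' % (irange[0], irange[-1])},
                    stream=True)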

You have a space before the start of the URL, which causes a requests.exceptions.InvalidSchema error:
url = ' http://wi312.rockdizfile.com/d/uclf2kr7fp4r2ge47pcuihdpky2chcsjur5nrds2hx53f26qgxnrktew/Kimbra%20-%20Love%20in%20High%20Places.mp3'
Change to:
url = 'http://wi312.rockdizfile.com/d/uclf2kr7fp4r2ge47pcuihdpky2chcsjur5nrds2hx53f26qgxnrktew/Kimbra%20-%20Love%20in%20High%20Places.mp3'
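If the URLs come from user input or a file, it is also worth stripping whitespace defensively before making the request, for example:
url = url.strip()  # drop stray leading/trailing spaces or newlines
resp = requests.get(url, stream=True)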

Related

How to speed up async requests in Python

I want to download/scrape 50 million log records from a site. Instead of downloading all 50 million in one go, I was trying to download them in parts, say 10 million at a time, using the following code, but it only handles 20,000 at a time (more than that throws an error), so downloading that much data becomes very time-consuming. Currently it takes 3-4 minutes to download 20,000 records at a rate of 100%|██████████| 20000/20000 [03:48<00:00, 87.41it/s], so how can I speed it up?
import asyncio
import aiohttp
import time
import tqdm
import nest_asyncio
nest_asyncio.apply()
async def make_numbers(numbers, _numbers):
for i in range(numbers, _numbers):
yield i
n = 0
q = 10000000
async def fetch():
# example
url = "https://httpbin.org/anything/log?id="
async with aiohttp.ClientSession() as session:
post_tasks = []
# prepare the coroutines that post
async for x in make_numbers(n, q):
post_tasks.append(do_get(session, url, x))
# now execute them all at once
responses = [await f for f in tqdm.tqdm(asyncio.as_completed(post_tasks), total=len(post_tasks))]
async def do_get(session, url, x):
headers = {
'Content-Type': "application/x-www-form-urlencoded",
'Access-Control-Allow-Origin': "*",
'Accept-Encoding': "gzip, deflate",
'Accept-Language': "en-US"
}
async with session.get(url + str(x), headers=headers) as response:
data = await response.text()
print(data)
s = time.perf_counter()
try:
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch())
except:
print("error")
elapsed = time.perf_counter() - s
# print(f"{__file__} executed in {elapsed:0.2f} seconds.")
Traceback (most recent call last):
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 986, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore[return-value] # noqa
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 1056, in create_connection
raise exceptions[0]
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 1041, in create_connection
sock = await self._connect_sock(
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 955, in _connect_sock
await self.sock_connect(sock, address)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 702, in sock_connect
return await self._proactor.connect(sock, address)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\tasks.py", line 328, in __wakeup
future.result()
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\windows_events.py", line 812, in _poll
value = callback(transferred, key, ov)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\windows_events.py", line 599, in finish_connect
ov.getresult()
OSError: [WinError 121] The semaphore timeout period has expired
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\SGM\Desktop\xnet\x3stackoverflow.py", line 136, in <module>
loop.run_until_complete(fetch())
File "C:\Users\SGM\AppData\Roaming\Python\Python39\site-packages\nest_asyncio.py", line 81, in run_until_complete
return f.result()
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\futures.py", line 201, in result
raise self._exception
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\tasks.py", line 256, in __step
result = coro.send(None)
File "C:\Users\SGM\Desktop\xnet\x3stackoverflow.py", line 88, in fetch
response = await f
File "C:\Users\SGM\Desktop\xnet\x3stackoverflow.py", line 37, in _wait_for_one
return f.result()
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\futures.py", line 201, in result
raise self._exception
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\tasks.py", line 258, in __step
result = coro.throw(exc)
File "C:\Users\SGM\Desktop\xnet\x3stackoverflow.py", line 125, in do_get
async with session.get(url + str(x), headers=headers) as response:
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\client.py", line 1138, in __aenter__
self._resp = await self._coro
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\client.py", line 535, in _request
conn = await self._connector.connect(
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 542, in connect
proto = await self._create_connection(req, traces, timeout)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 907, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 1206, in _create_direct_connection
raise last_exc
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 1175, in _create_direct_connection
transp, proto = await self._wrap_create_connection(
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 992, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host example.com:80 ssl:default [The semaphore timeout period has expired]
Bottleneck: number of simultaneous connections
First, the bottleneck is the total number of simultaneous connections in the TCP connector.
The default for aiohttp.TCPConnector is limit=100. On most systems (tested on macOS), you should be able to double that by passing a connector with limit=200:
# async with aiohttp.ClientSession() as session:
async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=200)) as session:
The time taken should decrease significantly. (On macOS: q = 20_000 decreased 43% from 58 seconds to 33 seconds, and q = 10_000 decreased 42% from 31 to 18 seconds.)
The limit you can configure depends on the number of file descriptors that your machine can open. (On macOS: You can run ulimit -n to check, and ulimit -n 1024 to increase to 1024 for the current terminal session, and then change to limit=1000. Compared to limit=100, q = 20_000 decreased 76% to 14 seconds, and q = 10_000 decreased 71% to 9 seconds.)
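If you prefer to inspect or raise the limit from inside the script rather than the shell, the standard resource module exposes it on Unix-like systems (a small sketch; resource is not available on Windows):
import resource

# current (soft, hard) limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# raise the soft limit (up to the hard limit), then size the connector to match
resource.setrlimit(resource.RLIMIT_NOFILE, (min(1024, hard), hard))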
Supporting 50 million requests: async generators
Next, the reason why 50 million requests appear to hang is simply their sheer number.
Just creating 10 million coroutines in post_tasks takes 68-98 seconds (varies greatly on my machine), and then the event loop is further burdened with that many tasks, 99.99% of which are blocked by the TCP connection pool.
We can defer the creation of coroutines using an async generator:
async def make_async_gen(f, n, q):
async for x in make_numbers(n, q):
yield f(x)
We need a counterpart to asyncio.as_completed() to handle async_gen and concurrency:
from asyncio import ensure_future, events
from asyncio.queues import Queue
def as_completed_for_async_gen(fs_async_gen, concurrency):
done = Queue()
loop = events.get_event_loop()
# todo = {ensure_future(f, loop=loop) for f in set(fs)} # -
todo = set() # +
def _on_completion(f):
todo.remove(f)
done.put_nowait(f)
loop.create_task(_add_next()) # +
async def _wait_for_one():
f = await done.get()
return f.result()
async def _add_next(): # +
try:
f = await fs_async_gen.__anext__()
except StopAsyncIteration:
return
f = ensure_future(f, loop=loop)
f.add_done_callback(_on_completion)
todo.add(f)
# for f in todo: # -
# f.add_done_callback(_on_completion) # -
# for _ in range(len(todo)): # -
# yield _wait_for_one() # -
for _ in range(concurrency): # +
loop.run_until_complete(_add_next()) # +
while todo: # +
yield _wait_for_one() # +
Then, we update fetch():
from functools import partial
CONCURRENCY = 200 # +
n = 0
q = 50_000_000
async def fetch():
# example
url = "https://httpbin.org/anything/log?id="
async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=CONCURRENCY)) as session:
# post_tasks = [] # -
# # prepare the coroutines that post # -
# async for x in make_numbers(n, q): # -
# post_tasks.append(do_get(session, url, x)) # -
# Prepare the coroutines generator # +
async_gen = make_async_gen(partial(do_get, session, url), n, q) # +
# now execute them all at once # -
# responses = [await f for f in tqdm.asyncio.tqdm.as_completed(post_tasks, total=len(post_tasks))] # -
# Now execute them with a specified concurrency # +
responses = [await f for f in tqdm.tqdm(as_completed_for_async_gen(async_gen, CONCURRENCY), total=q)] # +
Other limitations
With the above, the program can start processing 50 million requests but:
it will still take 8 hours or so with CONCURRENCY = 1000, based on the estimate from tqdm.
your program may run out of memory for responses and crash.
For point 2, you should probably do:
# responses = [await f for f in tqdm.tqdm(as_completed_for_async_gen(async_gen, CONCURRENCY), total=q)]
for f in tqdm.tqdm(as_completed_for_async_gen(async_gen, CONCURRENCY), total=q):
response = await f
# Do something with response, such as writing to a local file
# ...
An error in the code
do_get() should return data:
async def do_get(session, url, x):
headers = {
'Content-Type': "application/x-www-form-urlencoded",
'Access-Control-Allow-Origin': "*",
'Accept-Encoding': "gzip, deflate",
'Accept-Language': "en-US"
}
async with session.get(url + str(x), headers=headers) as response:
data = await response.text()
# print(data) # -
return data # +
If it's not bandwidth that limits you (I cannot check this), there is a solution less complicated than Celery and RabbitMQ, but it is not as scalable: it is limited by your number of CPU cores.
Instead of splitting the calls across Celery workers, you split them across multiple processes.
I modified the fetch function like this:
async def fetch(start, end):
# example
url = "https://httpbin.org/anything/log?id="
async with aiohttp.ClientSession() as session:
post_tasks = []
# prepare the coroutines that post
# use start and end arguments here!
async for x in make_numbers(start, end):
post_tasks.append(do_get(session, url, x))
# now execute them all at once
responses = [await f for f in
tqdm.tqdm(asyncio.as_completed(post_tasks), total=len(post_tasks))]
and I modified the main process:
import concurrent.futures
from itertools import count
def one_executor(start, end):
loop = asyncio.new_event_loop()
try:
loop.run_until_complete(fetch(start, end))
except:
print("error")
if __name__ == '__main__':
s = time.perf_counter()
# Change the value to the number of cores you want to use.
max_worker = 4
length_by_executor = q // max_worker
with concurrent.futures.ProcessPoolExecutor(max_workers=max_worker) as executor:
for index_min in count(0, length_by_executor):
# duplicated indexes do not matter because of the use of
# range in the make_numbers function.
index_max = min(index_min + length_by_executor, q)
executor.submit(one_executor, index_min, index_max)
if index_max == q:
break
elapsed = time.perf_counter() - s
print(f"executed in {elapsed:0.2f} seconds.")
Here are the results I get (with q set to 10_000):
1 worker: executed in 13.90 seconds.
2 workers: executed in 7.24 seconds.
3 workers: executed in 6.82 seconds.
I did not work on the tqdm progress bar; with the current solution, two bars will be displayed (but I think tqdm works well with multiple processes).

Python Multi-threaded App does not terminate

This is my code, which basically just takes a list of 94,000+ URLs and collects the HTTP status codes for them:
#!/usr/bin/python3
import threading
from queue import Queue
import urllib.request
import urllib.parse
from http.client import HTTPConnection
import socket
import http.client
#import httplib
url_input = open("urls_prod_sort.txt", "r").read()
urls = url_input[:url_input.rfind('\n')].split('\n')
#urls = urls[:100]
url_502 = []
url_logs = []
url_502_lock = threading.Lock()
print_lock = threading.Lock()
def sendRequest(url_u, http_method = 'GET', data = None):
use_proxy = "http://xxxxxxxx:8080"
proxies = {"http": use_proxy}
proxy = urllib.request.ProxyHandler(proxies)
handler = urllib.request.HTTPHandler()
url = "http://" + url_u
with print_lock:
print(url)
opener = urllib.request.build_opener(proxy,handler)
urllib.request.install_opener(opener)
request = urllib.request.Request(url,data)
request.add_header("User-agent","| MSIE |")
request.get_method = lambda: http_method
try:
response = urllib.request.urlopen(request)
response_code = response.code
except urllib.error.HTTPError as error:
response_code = error.code
except urllib.error.URLError as e2:
response_code = 701
except socket.timeout as e3:
response_code = 702
except socket.error as e4:
response_code = 703
except http.client.IncompleteRead as e:
response_code = 700
if response_code == 502:
with url_502_lock:
#url_502.append(url)
url_502_file = open("url_502_file.txt", "a")
url_502_file.write(url + "\n")
url_502_file.close()
with print_lock:
#url_logs.append(url + "," + str(response_code))
url_all_logs_file = open("url_all_logs.csv", "a")
url_all_logs_file.write(url + "," + str(response_code) + '\n')
url_all_logs_file.close()
#print (url + "," + str(response_code))
#print (response_code)
return response_code
def worker():
while True:
url = q.get()
if url == ":::::":
break
else:
sendRequest(url)
q.task_done()
#======================================
q = Queue()
for threads in range(1000):
t = threading.Thread(target = worker)
t.daemon = True
t.start()
for url in urls:
q.put(url)
q.put(":::::")
q.join()
However, the program never seems to terminate (even though the URLs have all been iterated through), which forces me to Ctrl-C the program, and then I get the following error:
Traceback (most recent call last):
File "./url_sc_checker.py", line 120, in <module>
q.join()
File "/usr/lib/python3.2/queue.py", line 82, in join
self.all_tasks_done.wait()
File "/usr/lib/python3.2/threading.py", line 235, in wait
waiter.acquire()
KeyboardInterrupt
The reason your program doesn't terminate is simple: your worker creates an infinite loop:
def worker():
while True:
...
You need to either raise an exception, break, or have a terminating condition in your while statement. Otherwise your program keeps trying to get the next job from the queue, without knowing that there will never be a next job.
A common way to do this is to put a sentinel value in the queue; when a worker takes a job from the queue, it checks whether it is the sentinel value and breaks out of the loop (see the sketch after this answer).
Another way is to have a global condition variable that you check in the while condition. When the job producer has pushed all items to the queue, it joins the queue, and once all jobs are done it unblocks and terminates the threads or processes.
Another possible reason why your process doesn't terminate is that sendRequest raises an unexpected exception; the thread then dies and you are left with jobs that are never marked as done.
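A small sketch of the sentinel approach combined with exception-safe accounting (note that in your current worker the sentinel branch breaks without calling task_done(), so q.join() can never unblock, and a single sentinel only stops one of the 1000 threads):
SENTINEL = ":::::"

def worker():
    while True:
        url = q.get()
        if url == SENTINEL:
            q.task_done()   # account for the sentinel itself
            break
        try:
            sendRequest(url)
        except Exception as e:
            with print_lock:
                print("unexpected error for %s: %s" % (url, e))
        finally:
            q.task_done()   # always mark the job done, even on failure

# after queuing the real URLs, queue one sentinel per worker thread
for _ in range(1000):
    q.put(SENTINEL)
q.join()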

Gnip issue - while creating a job -- urllib2.URLError

I am trying to create a job using the Gnip Historical PowerTrack API.
I am getting an issue with urllib:
import urllib2
import base64
import json
UN = '' # YOUR GNIP ACCOUNT EMAIL ID
PWD = ''
account = '' # YOUR GNIP ACCOUNT USER NAME
def get_json(data):
return json.loads(data.strip())
def post():
url = 'https://historical.gnip.com/accounts/' + account + '/jobs.json'
publisher = "twitter"
streamType = "track"
dataFormat = "activity-streams"
fromDate = "201510140630"
toDate = "201510140631"
jobTitle = "job30"
rules = '[{"value":"","tag":""}]'
jobString = '{"publisher":"' + publisher + '","streamType":"' + streamType + '","dataFormat":"' + dataFormat + '","fromDate":"' + fromDate + '","toDate":"' + toDate + '","title":"' + jobTitle + '","rules":' + rules + '}'
base64string = base64.encodestring('%s:%s' % (UN, PWD)).replace('\n', '')
req = urllib2.Request(url=url, data=jobString)
req.add_header('Content-type', 'application/json')
req.add_header("Authorization", "Basic %s" % base64string)
proxy = urllib2.ProxyHandler({'http': 'http://proxy:8080', 'https': 'https://proxy:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
try:
response = urllib2.urlopen(req)
the_page = response.read()
the_page = get_json(the_page)
print 'Job has been created.'
print 'Job UUID : ' + the_page['jobURL'].split("/")[-1].split(".")[0]
except urllib2.HTTPError as e:
print e.read()
if __name__=='__main__':
post()
This is the error I am getting:
Traceback (most recent call last):
File "gnip1.py", line 37, in <module>
post()
File "gnip1.py", line 28, in post
response = urllib2.urlopen(req)
File "/home/soundarya/anaconda-new-1/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/home/soundarya/anaconda-new-1/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/home/soundarya/anaconda-new-1/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/home/soundarya/anaconda-new-1/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/home/soundarya/anaconda-new-1/lib/python2.7/urllib2.py", line 1240, in https_open
context=self._context)
File "/home/soundarya/anaconda-new-1/lib/python2.7/urllib2.py", line 1197, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
I even tried it with the curl command:
When I ran the one below in a terminal, I got the error that serviceUsername is not valid.
curl -v -X POST -uname -d '{"title": "HPT_test_job","publisher": "Twitter","streamType":"track","dataFormat":"activity-streams","fromDate":"201401010000","toDate":"201401020000 ","rules":[{"value": "twitter_lang:en (Hillary Clinton OR Donald)","tag": "2014_01_01_snow"}]}' 'https://historical.gnip.com/accounts/account_name/jobs.json'
This is the exact output message:
Error retrieving Job status: {u'serviceUsername': [u'is invalid']} -- Please verify your connection parameters and network connection *
Try this and see if it helps:
import urllib2
from urllib2 import urlopen
u = urlopen('http:// .........')
If you are using Python 3.5, you should use the urllib.request library, which is the newer version of urllib2. Note, however, that this changes a few things in the code, including print (which needs parentheses) and the need to transform some of the string results into bytes; a sketch of the adapted request follows.
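As an illustration, here is a minimal sketch of the request portion rewritten for Python 3 with urllib.request (UN, PWD, account, jobString and the proxy address are assumed to be the same values built in the question's post() function):
import base64
import json
import urllib.request
import urllib.error

url = 'https://historical.gnip.com/accounts/' + account + '/jobs.json'
auth = base64.b64encode(('%s:%s' % (UN, PWD)).encode('utf-8')).decode('ascii')

req = urllib.request.Request(url, data=jobString.encode('utf-8'))  # data must be bytes
req.add_header('Content-type', 'application/json')
req.add_header('Authorization', 'Basic %s' % auth)

proxy = urllib.request.ProxyHandler({'http': 'http://proxy:8080',
                                     'https': 'https://proxy:8080'})
urllib.request.install_opener(urllib.request.build_opener(proxy))

try:
    with urllib.request.urlopen(req) as response:
        the_page = json.loads(response.read().decode('utf-8'))
        print('Job has been created.')  # print is a function in Python 3
        print('Job UUID : ' + the_page['jobURL'].split("/")[-1].split(".")[0])
except urllib.error.HTTPError as e:
    print(e.read())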

Converting python2 code to python3 problems

So I have been trying to convert an Omegle bot, which was written in Python 2, to Python 3. This is the original code: https://gist.github.com/thefinn93/1543082
Now this is my code:
import requests
import sys
import json
import urllib
import random
import time
server = b"odo-bucket.omegle.com"
debug_log = False # Set to FALSE to disable excessive messages
config = {'verbose': open("/dev/null","w")}
headers = {}
headers['Referer'] = b'http://odo-bucket.omegle.com/'
headers['Connection'] = b'keep-alive'
headers['User-Agent'] = b'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Ubuntu/11.10 Chromium/15.0.874.106 Chrome/15.0.874.106 Safari/535.2'
headers['Content-type'] = b'application/x-www-form-urlencoded; charset=UTF-8'
headers['Accept'] = b'application/json'
headers['Accept-Encoding'] = b'gzip,deflate,sdch'
headers['Accept-Language'] = b'en-US'
headers['Accept-Charset'] = b'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
if debug_log:
config['verbose'] = debug_log
def debug(msg):
if debug_log:
print("DEBUG: " + str(msg))
debug_log.write(str(msg) + "\n")
def getcookies():
r = requests.get(b"http://" + server + b"/")
debug(r.cookies)
return(r.cookies)
def start():
r = requests.request(b"POST", b"http://" + server + b"/start?rcs=1&spid=", data=b"rcs=1&spid=", headers=headers)
omegle_id = r.content.strip(b"\"")
print("Got ID: " + str(omegle_id))
cookies = getcookies()
event(omegle_id, cookies)
def send(omegle_id, cookies, msg):
r = requests.request(b"POST","http://" + server + "/send", data="msg=" + urllib.quote_plus(msg) + "&id=" + omegle_id, headers=headers, cookies=cookies)
if r.content == "win":
print("You: " + msg)
else:
print("Error sending message, check the log")
debug(r.content)
def event(omegle_id, cookies):
captcha = False
next = False
r = requests.request(b"POST",b"http://" + server + b"/events",data=b"id=" + omegle_id, cookies=cookies, headers=headers)
try:
parsed = json.loads(r.content)
for e in parsed:
if e[0] == "waiting":
print("Waiting for a connection...")
elif e[0] == "count":
print("There are " + str(e[1]) + " people connected to Omegle")
elif e[0] == "connected":
print("Connection established!")
send(omegle_id, cookies, "HI I just want to talk ;_;")
elif e[0] == "typing":
print("Stranger is typing...")
elif e[0] == "stoppedTyping":
print ("Stranger stopped typing")
elif e[0] == "gotMessage":
print("Stranger: " + e[1])
try:
cat=""
time.sleep(random.randint(1,5))
i_r=random.randint(1,8)
if i_r==1:
cat="that's cute :3"
elif i_r==2:
cat="yeah, guess your right.."
elif i_r==3:
cat="yeah, tell me something about yourself!!"
elif i_r==4:
cat="what's up"
elif i_r==5:
cat="me too"
else:
time.sleep(random.randint(3,9))
send(omegle_id, cookies, "I really have to tell you something...")
time.sleep(random.randint(3,9))
cat="I love you."
send(omegle_id, cookies, cat)
except:
debug("Send errors!")
elif e[0] == "strangerDisconnected":
print("Stranger Disconnected")
next = True
elif e[0] == "suggestSpyee":
print ("Omegle thinks you should be a spy. Fuck omegle.")
elif e[0] == "recaptchaRequired":
print("Omegle think's you're a bot (now where would it get a silly idea like that?). Fuckin omegle. Recaptcha code: " + e[1])
captcha = True
except:
print("Derka derka derka")
if next:
print("Reconnecting...")
start()
elif not captcha:
event(omegle_id, cookies)
start()
The error I get is:
Traceback (most recent call last):
File "p3.py", line 124, in <module>
start()
File "p3.py", line 46, in start
r = requests.request(b"POST", b"http://" + server + b"/start?rcs=1&spid=", data=b"rcs=1&spid=", headers=headers)
File "/usr/lib/python3.4/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python3.4/site-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3.4/site-packages/requests/sessions.py", line 553, in send
adapter = self.get_adapter(url=request.url)
File "/usr/lib/python3.4/site-packages/requests/sessions.py", line 608, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'b'http://odo-bucket.omegle.com/start?rcs=1&spid=''
I didn't really understand what would fix this error, nor what the problem really is, even after looking it up.
UPDATE:
Now after removing all the b's I get the following error:
Traceback (most recent call last):
File "p3.py", line 124, in <module>
start()
File "p3.py", line 47, in start
omegle_id = r.content.strip("\"")
TypeError: Type str doesn't support the buffer API
UPDATE 2:
After putting the b back to r.content, I get the following error message:
Traceback (most recent call last):
File "p3.py", line 124, in <module>
start()
File "p3.py", line 50, in start
event(omegle_id, cookies)
File "p3.py", line 63, in event
r = requests.request("POST","http://" + server + "/events",data="id=" + omegle_id, cookies=cookies, headers=headers)
TypeError: Can't convert 'bytes' object to str implicitly
UPDATE 3:
Every time I try to start it, it hits the except branch and prints "Derka derka derka"; what could be causing this? (It wasn't like that with Python 2.)
requests takes strings, not bytes values for the URL.
Because your URLs are bytes values, requests is converting them to strings with str(), and the resulting string contains the characters b' at the start. That's not a valid scheme like http:// or https://.
The majority of your bytestrings should really be regular strings instead; only the content.strip() call deals with actual bytes.
The headers will be encoded for you, for example. Don't even set the Content-Type header; requests will take care of that for you if you pass in a dictionary (using string keys and values) to the data keyword argument.
You shouldn't set the Connection header either; leave connection management to requests as well.
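For example, a sketch of start() using plain strings and leaving the encoding to requests (headers here is assumed to be the same dictionary but with str keys and values; the other functions need the same str treatment):
server = "odo-bucket.omegle.com"

def start():
    # requests form-encodes the dict passed to data and sets Content-Type itself
    r = requests.post("http://" + server + "/start?rcs=1&spid=",
                      data={"rcs": "1", "spid": ""},
                      headers=headers)
    omegle_id = r.content.strip(b'"')      # r.content is bytes, so strip bytes
    print("Got ID: " + omegle_id.decode("ascii"))
    cookies = getcookies()
    event(omegle_id.decode("ascii"), cookies)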

url fetch gets stuck when multiple urls are passed

In the code below, I am trying to first check the URL status code and then start the relevant thread, and do the same when adding it to the queue;
however, if there are too many URLs, I get a Timeout error.
All the code is added below.
I also just discovered another bug: if I pass an mp3 file along with some jpeg images, the mp3 file downloads at its correct size but opens as one of the images from the URLs passed.
_fdUtils
def getParser():
parser = argparse.ArgumentParser(prog='FileDownloader',
description='Utility to download files from internet')
parser.add_argument('-v', '--verbose', default=logging.DEBUG,
help='by default its on, pass None or False to not spit in shell')
parser.add_argument('-st', '--saveTo', default=None, action=FullPaths,
help='location where you want files to download to')
parser.add_argument('-urls', nargs='*',
help='urls of files you want to download.')
parser.add_argument('-se', nargs='*', default=[1], help='Split each url passed to urls by the'\
" respective split order, if a url doesn't have a split default is taken 1 ")
return parser.parse_args()
def getResponse(url):
return requests.head(url, allow_redirects=True, timeout=10, headers={'Accept-Encoding': 'identity'})
def isWorkingURL(url):
response = getResponse(url)
return response.status_code in [302, 200, 100, 204, 300]
def getUrl(url):
""" gets the actual url to download file from.
"""
response = getResponse(url)
return response.headers.get('location', url)
Error stack trace:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "python/file_download.py", line 181, in run
_grabAndWriteToDisk(self, split, url, self.__saveTo, 0, self.queue)
File "python/file_download.py", line 70, in _grabAndWriteToDisk
resp = requests.get(url, headers={'Range': 'bytes=%s' % irange}, stream=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 505, in send
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 167, in resolve_redirects
allow_redirects=False,
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/adapters.py", line 381, in send
raise Timeout(e)
Timeout: HTTPConnectionPool(host='ia600506.us.archive.org', port=80): Read timed out. (read timeout=<object object at 0x1002b40b0>)
there we go again:
import argparse
import logging
import Queue
import os
import requests
import signal
import socket
import sys
import time
import threading
import utils as _fdUtils
from collections import OrderedDict
from itertools import izip_longest
from socket import error as SocketError, timeout as SocketTimeout
# timeout in seconds
TIMEOUT = 10
socket.setdefaulttimeout(TIMEOUT)
DESKTOP_PATH = os.path.expanduser("~/Desktop")
appName = 'FileDownloader'
logFile = os.path.join(DESKTOP_PATH, '%s.log' % appName)
_log = _fdUtils.fdLogger(appName, logFile, logging.DEBUG, logging.DEBUG, console_level=logging.DEBUG)
queue = Queue.Queue()
STOP_REQUEST = threading.Event()
maxSplits = threading.BoundedSemaphore(3)
threadLimiter = threading.BoundedSemaphore(5)
lock = threading.Lock()
pulledSize = 0
dataDict = {}
def _grabAndWriteToDisk(threadName, url, saveTo, first=None, queue=None, mode='wb', irange=None):
""" Function to download file..
Args:
url(str): url of file to download
saveTo(str): path where to save file
first(int): starting byte of the range
queue(Queue.Queue): queue object to set status for file download
mode(str): mode of file to be downloaded
irange(str): range of byte to download
"""
fileName = _fdUtils.getFileName(url)
filePath = os.path.join(saveTo, fileName)
fileSize = _fdUtils.getUrlSizeInBytes(url)
downloadedFileSize = 0 if not first else first
block_sz = 8192
resp = requests.get(url, headers={'Range': 'bytes=%s' % irange}, stream=True)
for fileBuffer in resp.iter_content(block_sz):
if not fileBuffer:
break
with open(filePath, mode) as fd:
downloadedFileSize += len(fileBuffer)
fd.write(fileBuffer)
mode = 'a'
status = r"%10d [%3.2f%%]" % (downloadedFileSize, downloadedFileSize * 100. / fileSize)
status = status + chr(8)*(len(status)+1)
sys.stdout.write('%s\r' % status)
time.sleep(.01)
sys.stdout.flush()
if downloadedFileSize == fileSize:
STOP_REQUEST.set()
queue.task_done()
_log.debug("Downloaded %s %s%% using %s and saved to %s", fileName,
downloadedFileSize * 100. / fileSize, threadName.getName(), saveTo)
def _downloadChunk(url, idx, irange, fileName, sizeInBytes):
_log.debug("Downloading %s for first chunk %s of %s " % (irange, idx+1, fileName))
pulledSize = irange[-1]
try:
resp = requests.get(url, allow_redirects=False, timeout=TIMEOUT,
headers={'Range': 'bytes=%s-%s' % (str(irange[0]), str(irange[-1]))},
stream=True)
except (SocketTimeout, requests.exceptions.RequestException) as e:
_log.error(e)
return
chunk_size = str(irange[-1])
for chunk in resp.iter_content(chunk_size):
status = r"%10d [%3.2f%%]" % (pulledSize, pulledSize * 100. / int(chunk_size))
status = status + chr(8)*(len(status)+1)
sys.stdout.write('%s\r' % status)
sys.stdout.flush()
pulledSize += len(chunk)
dataDict[idx] = chunk
time.sleep(.03)
if pulledSize == sizeInBytes:
_log.info("%s downloaded %3.0f%%", fileName, pulledSize * 100. / sizeInBytes)
class ThreadedFetch(threading.Thread):
""" docstring for ThreadedFetch
"""
def __init__(self, saveTo, queue):
super(ThreadedFetch, self).__init__()
self.queue = queue
self.__saveTo = saveTo
def run(self):
threadLimiter.acquire()
try:
items = self.queue.get()
url = items[0]
split = items[-1]
fileName = _fdUtils.getFileName(url)
# grab split chunks in separate thread.
if split > 1:
maxSplits.acquire()
try:
sizeInBytes = _fdUtils.getUrlSizeInBytes(url)
byteRanges = _fdUtils.getRangeSegements(sizeInBytes, split)
filePath = os.path.join(self.__saveTo, fileName)
downloaders = [
threading.Thread(
target=_downloadChunk,
args=(url, idx, irange, fileName, sizeInBytes),
)
for idx, irange in enumerate(byteRanges)
]
# start threads, let run in parallel, wait for all to finish
for th in downloaders:
th.start()
# wait for all threads to finish,
# which confirms the dataDict is up-to-date
for th in downloaders:
th.join()
downloadedSize = 0
with open(filePath, 'wb') as fh:
for _idx, chunk in sorted(dataDict.iteritems()):
downloadedSize += len(chunk)
status = r"%10d [%3.2f%%]" % (downloadedSize, downloadedSize * 100. / sizeInBytes)
status = status + chr(8)*(len(status)+1)
fh.write(chunk)
sys.stdout.write('%s\r' % status)
time.sleep(.04)
sys.stdout.flush()
if downloadedSize == sizeInBytes:
_log.info("%s, saved to %s", fileName, self.__saveTo)
self.queue.task_done()
finally:
maxSplits.release()
else:
while not STOP_REQUEST.isSet():
self.setName("primary_%s_thread" % fileName.split(".")[0])
# if downloading the whole file in a single chunk there is no need
# to start a new thread, so download directly here.
_grabAndWriteToDisk(self, url, self.__saveTo, 0, self.queue)
finally:
threadLimiter.release()
def main(appName):
args = _fdUtils.getParser()
saveTo = args.saveTo if args.saveTo else DESKTOP_PATH
# spawn a pool of threads, and pass them queue instance
# each url will be downloaded concurrently
unOrdUrls = dict(izip_longest(args.urls, args.se, fillvalue=1))
ordUrls = OrderedDict([(k, unOrdUrls[k]) for k in sorted(unOrdUrls, key=unOrdUrls.get, reverse=False) if _fdUtils.isWorkingURL(k, _log) and _fdUtils.notOnDisk(k, saveTo)])
print "length: %s " % len(ordUrls)
for i in xrange(len(ordUrls)):
t = ThreadedFetch(saveTo, queue)
t.daemon = True
t.start()
try:
# populate queue with data
for url, split in ordUrls.iteritems():
url = _fdUtils.getUrl(url)
print url
queue.put((url, int(split)))
# wait on the queue until everything has been processed
queue.join()
_log.info('All tasks completed.')
except (KeyboardInterrupt, SystemExit):
_log.critical('! Received keyboard interrupt, quitting threads.')
if __name__ == "__main__":
# change the name of MainThread.
threading.currentThread().setName("FileDownloader")
myapp = threading.currentThread().getName()
main(myapp)
I see two problems in your code. Since it's incomplete, I'm not sure how it's supposed to work, so I can't promise either one is the particular one you're running into first, but I'm pretty sure you need to fix both.
First:
queue.put((_fdUtils.getUrl(url), int(split)))
That's going to call _fdUtils.getUrl(url) in the main thread, and put the result on the queue. Your comments clearly imply that you intended the downloading to happen on the background threads.
If you wanted to pass a function to be called, just pass the function and its argument as separate members of the tuple, or wrap it up in a closure or a partial:
queue.put((lambda: _fdUtils.getUrl(url), int(split)))
Second:
t = ThreadedFetch(saveTo, queue)
t.daemon = True
t.start()
This starts a thread for every URL. That's almost never a good idea. Generally, downloaders don't use more than 4-16 threads at a time, and no more than 2-4 to the same site. You could easily be timing out because you're spamming some site too fast and its server or router is making you back off for a while. Or, with a huge number of requests, you could be flooding your own network and blocking ACKs or even rebooting the router (especially if you have either a cheap home WiFi router or ADSL with a crappy provider).
Also, a much simpler way to do this would be to use a smart pool, like a multiprocessing.dummy.Pool (multiprocessing.dummy means it acts like the multiprocessing module but uses threads) or, even better, a concurrent.futures.ThreadPoolExecutor. In fact, if you look at the docs, a parallel downloader is the first example for ThreadPoolExecutor.
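For illustration, here is a minimal sketch of that ThreadPoolExecutor pattern (Python 3; on Python 2.7 you would need the futures backport, and the URL below is just the archive.org example from the first question):
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def download(url, save_path):
    # stream the body to disk in fixed-size chunks
    resp = requests.get(url, stream=True, timeout=10)
    resp.raise_for_status()
    with open(save_path, 'wb') as fh:
        for chunk in resp.iter_content(8192):
            fh.write(chunk)
    return save_path

urls = {
    "http://archive.org/download/LucyIsabelleMarsh/LucyIsabelleMarsh-ItalianStreetSong.mp3": "song.mp3",
}

with ThreadPoolExecutor(max_workers=4) as pool:   # 4-16 workers is usually plenty
    futures = {pool.submit(download, u, path): u for u, path in urls.items()}
    for fut in as_completed(futures):
        try:
            print("finished", fut.result())
        except Exception as exc:
            print("failed", futures[fut], exc)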
