First of all, I am new to networking (HTTP communication) and Python.
I am currently using the requests and threading modules to periodically send data to or receive data from a specific site. The target site is 'https://api.coinone.co.kr', but I think that does not matter here.
With the example code below, I let Python fetch data every 1 second. At first, it works pretty well. Each request takes about 0.07 s on my computer.
import requests
import time
import threading
url0 = 'https://api.coinone.co.kr/ticker/'
class Fetch:
    def __init__(self):
        self.thread = threading.Thread(target=self.fcn)
        self.t0 = time.perf_counter()
        self.period = 1
        self.req0 = None
    def fcn(self):
        while True:
            # headers = None
            headers = {'Connection': 'close'}
            # requests
            t0 = time.perf_counter()
            req0 = requests.get(url0, headers=headers, params={'currency': 'all'})
            resp0 = req0.json()
            self.req0 = req0
            reqTimeInt0 = time.perf_counter() - t0
            # prints
            print('elapsed time: %0.3f' % t0)
            # print(req0.request.headers)
            print(resp0['result'])
            print('requests time interval: %0.3f' % reqTimeInt0)
            print('')
            # timer
            self.t0 += self.period
            now = time.perf_counter()
            sleepInterval = self.t0 - now
            if sleepInterval > 0:
                time.sleep(sleepInterval)
            else:
                while self.t0 - now <= 0:
                    self.t0 += self.period
f1 = Fetch()
f1.thread.start()
But as time passes, the time needed for each HTTP GET increases. After about 3 hours, one request takes 0.80 s, roughly 10 times longer than at the start. Furthermore, not only do the Python requests get slower, the entire PC's network gets slower too (including internet browsing), without any increase in CPU, RAM, or network usage. Closing the console does not bring the network speed back to normal, and I have to reboot the PC. After rebooting, the network recovers completely and the internet works fine.
It seems that some load on the network connection accumulates with each Python request. So I tried adding 'Connection: close' to the header, but it didn't help. Will 'requests.Session()' fix the problem?
I don't even know how to start diagnosing the problem. I have to make the repeated requests for at least several days without breaking the connection.
Thank you.
Use a session: instead of opening a new network connection for every request, it keeps one connection open and reuses it for all of them.
Here is the suggested modification:
class Fetch:
    def __init__(self):
        # Instantiate the session (note the parentheses) so its connection is reused
        self.session = requests.Session()
        self.thread = threading.Thread(target=self.fcn)
        self.t0 = time.perf_counter()
        self.period = 1
        self.req0 = None
    def fcn(self):
        while True:
            # Do not send 'Connection: close' here: it asks the server to drop the
            # connection after each response, which defeats the session's reuse
            headers = None
            # requests
            t0 = time.perf_counter()
            req0 = self.session.get(url0, headers=headers, params={'currency': 'all'})
            resp0 = req0.json()
            self.req0 = req0
            # ... the rest of the loop goes here ...
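Separately from the session change, the catch-up timer at the end of the question's loop can be expressed as a small pure function, which makes it easier to test in isolation. The following is only a sketch: the name `next_deadline` is hypothetical and not part of the original code, and it assumes the same behaviour as the question's timer (skip any ticks that were missed, and stay on the original time grid).

```python
import math


def next_deadline(deadline: float, period: float, now: float) -> float:
    """Return the first tick strictly after `now` on the grid deadline + k * period.

    Hypothetical helper mirroring the question's timer block: advance by one
    period, and if that tick has already passed, skip the missed ticks.
    """
    deadline += period
    if deadline <= now:
        # Number of additional periods needed to move past `now`
        missed = math.floor((now - deadline) / period) + 1
        deadline += missed * period
    return deadline
```

In the loop, this would be used as `self.t0 = next_deadline(self.t0, self.period, time.perf_counter())`, followed by sleeping for `self.t0 - time.perf_counter()` when that value is positive.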
Related
I am working on a simple web scraper and am currently trying to add multithreading. My code works as intended with some servers (vastly reducing execution time), but my primary goal is to make it work with a few specific ones. When I try it with the ones in the sites list, performance is as if I were still running sequential code. Any guesses as to what could cause this?
import requests, time
from bs4 import BeautifulSoup
from threading import Thread
from random import choice
# Enable to get some logging info
#---------------------------------
# import logging
# import http.client
# http.client.HTTPConnection.debuglevel = 1
# logging.basicConfig()
# logging.getLogger().setLevel(logging.DEBUG)
# requests_log = logging.getLogger("requests.packages.urllib3")
# requests_log.setLevel(logging.DEBUG)
# requests_log.propagate = True
sites = [
    "https://pikabu.ru/community/blackhumour",
    "https://www.pikabu.ru/tag/%D0%9C%D0%B5%D0%BC%D1%8B/hot"
]
class Pikabu_Downloader(Thread):
    def __init__(self, url, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url
        self.name = name
        self.begin = time.time()
    def run(self):
        print("Beginning with thread number",self.name, ",", round(time.time()-self.begin, 4), " seconds has passed")
        html_data = self._get_html()
        print("After requests.get with thread number", self.name, ",", round(time.time()-self.begin, 4), " seconds has passed")
        if html_data is None:
            return
        self.soup = BeautifulSoup(html_data, "html.parser")
        print("After making soup with thread number", self.name, ",", round(time.time() - self.begin, 4), " seconds has passed")
    def _get_html(self):
        try:
            user_agents = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64)', 'AppleWebKit/537.36 (KHTML, like Gecko)', 'Chrome/74.0.3729.169', 'Safari/537.36')
            print(f"Go {self.url}...")
            res = requests.get(self.url, headers={'User-Agent': choice(user_agents)}, stream = True)#, allow_redirects=False)
        except Exception as exc:
            print(exc)
        else:
            return res.text
test = "https://readingbooks.site/read/?name=1984&"
def download():
    pikabu_urls = []
    for url in sites:
        pikabu = [url + "?page=" + str(x) for x in range(1, 10)]
        pikabu_urls = pikabu_urls + pikabu
    pikabu_dls = [Pikabu_Downloader(url=page, name=str(i)) for i, page in enumerate(pikabu_urls)]
    # Comment the string above and enable 2 underlying strings to get result from test server
    # tests = [test + "page=" + str(x) for x in range(1, pages)]
    # pikabu_dls = [Pikabu_Downloader(url=page, name=str(i)) for i, page in enumerate(tests)]
    for pikabu_dl in pikabu_dls:
        pikabu_dl.start()
    for pikabu_dl in pikabu_dls:
        pikabu_dl.join()
download()
And the result is something like
...
After requests.get with thread number 1 , 1.6904 seconds has passed
After making soup with thread number 1 , 1.7554 seconds has passed
After requests.get with thread number 2 , 2.9805 seconds has passed
After making soup with thread number 2 , 3.0455 seconds has passed
After requests.get with thread number 3 , 4.3225 seconds has passed
After making soup with thread number 3 , 4.3895 seconds has passed
...
What can cause such latency between thread executions? I was hoping each thread would finish almost simultaneously and give more asynchronous output, as happens with the test server. If I set a timeout of 5 seconds in requests.get, most of the requests won't even complete.
After investigating your case, I would point out a few issues:
Do not print from parallel tasks; writing to the screen becomes a bottleneck.
A large number of tasks is not always good for performance; it depends on how much your machine can handle. If you have 1000 links, do you have to create 1000 thread objects? No, keep only 5-20 workers by using a ThreadPool.
The server is also part of the problem. Response size, low bandwidth, network conditions, and distance all delay responses, whatever your machine does. Your pages are heavy and seem to take 1-3000 ms per request, so when you test with a small set (20 links), it feels as if they run sequentially.
Your code does run in parallel, because you put each request on its own thread, but that is not quite the right tool: a fully asynchronous stack such as asyncio and aiohttp fits better. aiohttp handles numerous concurrent requests as coroutines, while asyncio provides the syntax and the event loop, which runs on your main thread.
I did a small experiment on Colab. Please note that I did not use asyncio and aiohttp there because they got stuck, but I have used them in several projects before and they were faster than the fastest method below.
The second function is your implementation.
import concurrent.futures  # needed by running_thread_pool
import urllib.request
from threading import Thread
import time, requests
from random import choice
user_agents = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64)', 'AppleWebKit/537.36 (KHTML, like Gecko)', 'Chrome/74.0.3729.169', 'Safari/537.36')
timeout = 5
sites = [
    "https://pikabu.ru/community/blackhumour",
    "https://www.pikabu.ru/tag/%D0%9C%D0%B5%D0%BC%D1%8B/hot"
]
URLS = []
for url in sites:
    pikabu = [url + "?page=" + str(x) for x in range(25)]
    URLS.extend(pikabu)
def convert_to_threads():
    return [Thread(target=load_url, args=(page, timeout)) for page in URLS]
def running_threads():
    threads = convert_to_threads()
    start = time.time()
    for i in threads:
        i.start()
    for i in threads:
        i.join()
    print(f'Finish with {len(URLS)} requests {time.time() - start}')
def load_url(url, timeout):
    # Pass the timeout through so a stalled request cannot block forever
    res = requests.get(url, headers={'User-Agent': choice(user_agents)}, stream = True, timeout=timeout)#, allow_redirects=False)
    return res.text
def running_sequence():
    start = time.time()
    for url in URLS:
        load_url(url, timeout)
    print(f'Finish with {len(URLS)} requests {time.time() - start}')
def running_thread_pool():
    start = time.time()
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=15) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, timeout): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            # else:
            #     print('%r page is %d length' % (url, len(data)))
    print(f'Finish with {len(URLS)} requests {time.time() - start}')
In short, I recommend using a ThreadPool (preferred on Colab), or asyncio and aiohttp (outside Colab), to gain speed.
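The asyncio approach mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the answerer's code: `fake_fetch` is a hypothetical stand-in that only simulates network latency with `asyncio.sleep` (a real version would issue the request with aiohttp), and the example.invalid URLs and the concurrency limit of 5 are assumptions chosen for the demo.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    # Hypothetical stand-in for a real aiohttp request: only simulates latency.
    await asyncio.sleep(delay)
    return f"body of {url}"


async def fetch_all(urls, max_concurrency: int = 5):
    # The semaphore caps how many requests are in flight at once,
    # playing the same role as max_workers in a thread pool.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns the results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo():
    urls = [f"https://example.invalid/?page={i}" for i in range(10)]
    start = time.perf_counter()
    bodies = asyncio.run(fetch_all(urls))
    return bodies, time.perf_counter() - start
```

With 10 simulated requests of 0.05 s each and at most 5 in flight, the whole batch should take roughly two rounds of latency rather than ten.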
I have the following code. The client sends frames to a server asynchronously, and I want to measure two things: how many frames per second the client sends (to derive the frame drop rate), and how quickly the server's responses come back.
I am having trouble calculating the two metrics together. For the response rate I use a longer delay (5 seconds), because network latency and server processing time mean that results take a while to come back to the client. For the frame rate, however, I need to count how many frames the client sends to the server per second. How can I calculate this in the data_rate method? The two metrics need different time delays, so I can't use the same delay for both. Any help on how to define this in the code is highly appreciated.
import base64
import threading
import time
import cv2
import zmq
IMAGE_FOLDER = "videoframe"
FPS = 5
SERVER_A_ADDRESS = "tcp://localhost:5555"
ENDPOINT_HANDLER_ADDRESS = "tcp://*:5553"
SERVER_A_TITLE = "SERVER A"
SERVER_B_TITLE = "SERVER B"
context = zmq.Context()
socket_server_a = context.socket(zmq.PUSH)
socket_server_endpoint = context.socket(zmq.PULL)
socket_server_a.connect(SERVER_A_ADDRESS)
socket_server_endpoint.bind(ENDPOINT_HANDLER_ADDRESS)
destination = {
    "currentSocket": socket_server_a,
    "currentServersTitle": SERVER_A_TITLE,
    "currentEndpoint": SERVER_B_TITLE,}
running = True
endpoint_responses = 0
frame_requests = 0
filenames = [f"{IMAGE_FOLDER}/frame{i}.jpg" for i in range(1, 2522)]
def handle_endpoint_responses():
    global destination, running, endpoint_responses
    while running:
        endpoint_response = socket_server_endpoint.recv().decode()
        endpoint_responses += 1
def data_rate():
    global destination, running, endpoint_responses, frame_requests
    while running:
        before_received = endpoint_responses ###
        time.sleep(5)
        after_received = endpoint_responses
        before_sent = frame_requests
        time.sleep(1)
        after_sent = frame_requests ###
        print(25 * "#")
        print(f"{time.strftime('%H:%M:%S')} ( i ) : receiving model results: {round((after_received - before_received) / 5, 2)} per second.")
        print(f"{time.strftime('%H:%M:%S')} ( i ) : sending frames: {round((after_sent - before_sent) / 1, 2)} per second.")
        print(25 * "#")
def send_frame(frame, frame_requests):
    global destination, running
    try:
        frame = cv2.resize(frame, (224, 224))
        encoded, buffer = cv2.imencode('.jpg', frame)
        jpg_as_text = base64.b64encode(buffer)
        destination["currentSocket"].send(jpg_as_text)
    except Exception as Error:
        running = False
def main():
    global destination, running, frame_requests
    interval = 1 / FPS
    while running:
        for img in filenames:
            frame = cv2.imread(img)
            frame_requests += 1
            threading.Thread(target=send_frame, args=(frame, frame_requests)).start()
            time.sleep(interval)
    destination["currentSocket"].close()
if __name__ == "__main__":
    threading.Thread(target=handle_endpoint_responses).start()
    threading.Thread(target=data_rate).start()
    main()
Server time is not the only cost: opening the image also takes time, so sleeping for interval = 1/FPS can itself cause frame drops, i.e. fewer frames are produced than possible (the same happens when playing offline). If playback is paced with sleep, use a shorter sleep interval and check the current time in the loop: if the next frame is due, send it; if not, just wait. The delay could also be adaptive, and the send time could run ahead of the linear schedule to compensate for transmission delay, if the goal is for the server side to display or process the image at the actual frame time.
I think you have to measure the offset between the client and server clocks, either with an initial handshake or as part of every transaction, and include and log the send and receive times in the transactions.
That way you can measure the average transmission delay.
Another thing that may give hints is to first play the images without sending them. That will show how much time cv2.imread(...) and the preprocessing take.
That is, comment out destination["currentSocket"].send(jpg_as_text), or add another function without it.
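On the question of measuring the two rates over different time windows, one possible approach is to timestamp every event and count how many fall inside each metric's own window, so that a single reporting loop can use a 1-second window for sent frames and a 5-second window for responses. This is only a sketch: the `RateMeter` class is hypothetical and not part of the original code, and in the question's program `record` would be called with `time.monotonic()` wherever `frame_requests` or `endpoint_responses` is incremented.

```python
from collections import deque


class RateMeter:
    """Hypothetical sliding-window event counter (not part of the original code)."""

    def __init__(self, window: float):
        self.window = window    # window length in seconds
        self._stamps = deque()  # timestamps of recorded events, oldest first

    def record(self, timestamp: float) -> None:
        # Call once per event (a frame sent or a response received).
        self._stamps.append(timestamp)

    def rate(self, now: float) -> float:
        # Drop events that fell out of the window, then average over it.
        while self._stamps and self._stamps[0] <= now - self.window:
            self._stamps.popleft()
        return len(self._stamps) / self.window


# Each metric gets its own window length.
sent_meter = RateMeter(window=1.0)
received_meter = RateMeter(window=5.0)
```

If the meters are updated from several threads, as in the question, access to each meter would additionally need a lock.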
I'm trying to make actions with Python requests. Here is my code:
import threading
import resource
import time
import sys
#maximum Open File Limit for thread limiter.
maxOpenFileLimit = resource.getrlimit(resource.RLIMIT_NOFILE)[0] # For example, it shows 50.
# Will use one session for every Thread.
requestSessions = requests.Session()
# Making requests Pool bigger to prevent [Errno -3] when socket stacked in CLOSE_WAIT status.
adapter = requests.adapters.HTTPAdapter(pool_maxsize=(maxOpenFileLimit+100))
requestSessions.mount('http://', adapter)
requestSessions.mount('https://', adapter)
def threadAction(a1, a2):
    global number
    time.sleep(1) # My actions with Requests for each thread.
    print number = number + 1
number = 0 # Count of complete actions
ThreadActions = [] # Action tasks.
for i in range(50): # I have 50 websites I need to do in parallel threads.
    a1 = i
    for n in range(10): # Every website I need to do in 3 threads
        a2 = n
        ThreadActions.append(threading.Thread(target=threadAction, args=(a1,a2)))
for item in ThreadActions:
    # But I can't do more than 50 Threads at once, because of maxOpenFileLimit.
    while True:
        # Thread limiter, analogue of BoundedSemaphore.
        if (int(threading.activeCount()) < threadLimiter):
            item.start()
            break
        else:
            continue
for item in ThreadActions:
    item.join()
But the thing is that after I get 50 threads up, the thread limiter starts waiting for some thread to finish its work. And here is the problem: once the script reaches the limiter, lsof -i|grep python|wc -l shows far fewer than 50 active connections, whereas before the limiter it showed all of the <= 50 processes. Why is this happening? Or should I use requests.close() instead of requests.session() to prevent it from using already opened sockets?
Your limiter is a tight loop that takes up most of your processing time. Use a thread pool to limit the number of workers instead.
import multiprocessing.pool
import threading
import time
import requests
# maxOpenFileLimit is defined as in the question.
# Will use one session for every Thread.
requestSessions = requests.Session()
# Making requests Pool bigger to prevent [Errno -3] when socket stacked in CLOSE_WAIT status.
adapter = requests.adapters.HTTPAdapter(pool_maxsize=(maxOpenFileLimit+100))
requestSessions.mount('http://', adapter)
requestSessions.mount('https://', adapter)
number = 0 # Count of complete actions
numberLock = threading.Lock() # Guards updates to number
def threadAction(a1, a2):
    global number
    time.sleep(1) # My actions with Requests for each thread.
    # The original "print number = number + 1" did not update number and
    # would not have been thread safe if it did, so update it under a lock
    with numberLock:
        number += 1
        print(number)
pool = multiprocessing.pool.ThreadPool(50) # At most 50 worker threads
ThreadActions = [] # Action tasks.
for i in range(50): # I have 50 websites I need to do in parallel threads.
    a1 = i
    for n in range(10): # Every website I need to do in 3 threads
        a2 = n
        ThreadActions.append((a1,a2))
# starmap unpacks each (a1, a2) tuple into the arguments of threadAction
pool.starmap(threadAction, ThreadActions, chunksize=1)
pool.close()
pool.join()
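To check that a pool really caps the number of simultaneously running actions, the concurrency can be measured directly. The sketch below is an illustration only: it uses `concurrent.futures.ThreadPoolExecutor` in place of `multiprocessing.pool.ThreadPool`, the function name `run_limited` is hypothetical, and `time.sleep` stands in for the real request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_limited(task_count: int, limit: int):
    """Run task_count dummy actions with at most `limit` running at once.

    Returns (completed, peak), where peak is the highest number of actions
    observed running at the same time.
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0, "done": 0}

    def action(_):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # stand-in for the real request
        with lock:
            state["active"] -= 1
            state["done"] += 1

    # The pool blocks instead of spinning: idle workers wait on a queue,
    # so no CPU time is burned while the limit is reached.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        list(pool.map(action, range(task_count)))
    return state["done"], state["peak"]
```

Calling `run_limited(500, 50)` would correspond to the 50 sites with 10 tasks each from the question.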
I am working on creating a HTTP client which can generate hundreds of connections each second and send up to 10 requests on each of those connections. I am using threading so concurrency can be achieved.
Here is my code:
def generate_req(reqSession):
    requestCounter = 0
    while requestCounter < requestRate:
        try:
            response1 = reqSession.get('http://20.20.1.2/tempurl.html')
            if response1.status_code == 200:
                client_notify('r')
        except(exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout) as Err:
            client_notify('F')
            break
        requestCounter += 1
def main():
    for q in range(connectionPerSec):
        s1 = requests.session()
        t1 = threading.Thread(target=generate_req, args=(s1,))
        t1.start()
Issues:
It is not scaling above 200 connections/sec with requestRate = 1. I ran other available HTTP clients on the same client machine and against the server, test runs fine and it is able to scale.
When requestRate = 10, connections/sec drops to 30.
Reason: Not able to create targeted number of threads every second.
For issue #2, client machine is not able to create enough request sessions and start new threads. As soon as requestRate is set to more than 1, things start to fall apart.
I suspect it has something to do with the HTTP connection pooling that requests uses.
Please suggest what I am doing wrong here.
I wasn't able to get things to fall apart, however the following code has some new features:
1) extended logging, including specific per-thread information
2) all threads are join()ed at the end to make sure the parent process doesn't leave them hanging
3) multithreaded print tends to interleave the messages, which can be unwieldy. In this version client_notify returns the message data instead of printing it, so a future version can collect the messages and print them clearly.
source
import requests, threading, time
from requests import exceptions  # requests' own exception classes
requestRate = 1
connectionPerSec = 2
def client_notify(msg):
    return time.time(), threading.current_thread().name, msg
def generate_req(reqSession):
    requestCounter = 0
    while requestCounter < requestRate:
        try:
            response1 = reqSession.get('http://127.0.0.1/')
            if response1.status_code == 200:
                print(client_notify('r'))
        except (exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout):
            print(client_notify('F'))
            break
        requestCounter += 1
def main():
    for cnum in range(connectionPerSec):
        s1 = requests.session()
        th = threading.Thread(
            target=generate_req, args=(s1,),
            name='thread-{:03d}'.format(cnum),
        )
        th.start()
    # Wait for every worker thread before the parent exits
    for th in threading.enumerate():
        if th != threading.current_thread():
            th.join()
if __name__=='__main__':
    main()
output
(1407275951.954147, 'thread-000', 'r')
(1407275951.95479, 'thread-001', 'r')
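One way to collect the messages and print them clearly, as suggested in point 3, is to have the workers put their messages on a thread-safe queue and let the main thread print them after joining. This is a sketch under assumptions: the names `worker` and `run` are hypothetical, and the HTTP request is replaced by simply recording an 'r' message so the example runs without a server.

```python
import queue
import threading


def worker(results: queue.Queue, request_count: int) -> None:
    # Stand-in for generate_req: record one 'r' message per simulated request.
    for _ in range(request_count):
        results.put((threading.current_thread().name, 'r'))


def run(connections: int, requests_per_connection: int):
    results = queue.Queue()
    threads = [
        threading.Thread(
            target=worker,
            args=(results, requests_per_connection),
            name='thread-{:03d}'.format(i),
        )
        for i in range(connections)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # All workers are done, so draining the queue here cannot race with puts.
    messages = []
    while not results.empty():
        messages.append(results.get())
    return messages
```

Only the main thread prints (for example `for name, msg in run(2, 1): print(name, msg)`), so lines from different threads can no longer interleave.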
When downloading a large file with python, I want to put a time limit not only for the connection process, but also for the download.
I am trying with the following python code:
import requests
r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', timeout = 0.5, prefetch = False)
print r.headers['content-length']
print len(r.raw.read())
This does not work (the download is not time limited), as correctly noted in the docs: https://requests.readthedocs.org/en/latest/user/quickstart/#timeouts
This would be great if it was possible:
r.raw.read(timeout = 10)
The question is, how to put a time limit to the download?
And the answer is: do not use requests, as it is blocking. Use non-blocking network I/O, for example eventlet:
import eventlet
from eventlet.green import urllib2
from eventlet.timeout import Timeout
url5 = 'http://ipv4.download.thinkbroadband.com/5MB.zip'
url10 = 'http://ipv4.download.thinkbroadband.com/10MB.zip'
urls = [url5, url5, url10, url10, url10, url5, url5]
def fetch(url):
    response = bytearray()
    with Timeout(60, False):
        response = urllib2.urlopen(url).read()
    return url, len(response)
pool = eventlet.GreenPool()
for url, length in pool.imap(fetch, urls):
    if (not length):
        print "%s: timeout!" % (url)
    else:
        print "%s: %s" % (url, length)
Produces expected results:
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
When using Requests' prefetch=False parameter (renamed to stream=True in Requests 1.0 and later), you can pull in arbitrary-sized chunks of the response at a time, rather than all at once.
What you'll need to do is tell Requests not to preload the entire response, and keep track yourself of how much time you've spent reading so far while fetching small chunks at a time. You can fetch a chunk using r.raw.read(CHUNK_SIZE). Overall, the code will look something like this:
import requests
import time
CHUNK_SIZE = 2**12 # Bytes
TIME_EXPIRE = time.time() + 5 # Seconds
r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', prefetch=False)
data = ''
buffer = r.raw.read(CHUNK_SIZE)
while buffer:
    data += buffer
    buffer = r.raw.read(CHUNK_SIZE)
    if TIME_EXPIRE < time.time():
        # Quit after 5 seconds.
        data += buffer
        break
r.raw.release_conn()
print "Read %s bytes out of %s expected." % (len(data), r.headers['content-length'])
Note that this might sometimes use a bit more than the 5 seconds allotted as the final r.raw.read(...) could lag an arbitrary amount of time. But at least it doesn't depend on multithreading or socket timeouts.
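The same chunked-read-with-deadline idea can be sketched independently of Requests, against any file-like object. This is an illustration under assumptions: the function name `read_with_deadline` and its `clock` parameter are hypothetical, and `io.BytesIO` stands in for the response stream (with Requests, `r.raw` would be passed instead).

```python
import io
import time


def read_with_deadline(stream, seconds, chunk_size=4096, clock=time.monotonic):
    """Read chunks from `stream` until it is exhausted or `seconds` have elapsed.

    Returns (data, complete): `complete` is True only if end of stream was reached.
    The deadline is checked between chunks, so one slow read can still overrun it.
    """
    deadline = clock() + seconds
    chunks = []
    while clock() < deadline:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(chunks), True
        chunks.append(chunk)
    return b"".join(chunks), False


# Usage with an in-memory stand-in for the download:
data, complete = read_with_deadline(io.BytesIO(b"x" * 10000), seconds=5)
```

Collecting the chunks in a list and joining once avoids repeatedly copying a growing buffer, and using a monotonic clock keeps the deadline unaffected by system clock adjustments.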
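Run the download in a thread and stop waiting for it if it has not finished in time. Note that join(TIMEOUT) only stops waiting; the thread itself keeps running in the background.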
import requests
import threading
URL='http://ipv4.download.thinkbroadband.com/1GB.zip'
TIMEOUT=0.5
def download(return_value):
    return_value.append(requests.get(URL))
return_value = []
download_thread = threading.Thread(target=download, args=(return_value,))
download_thread.start()
download_thread.join(TIMEOUT)
if download_thread.is_alive():
    print 'The download was not finished on time...'
else:
    print return_value[0].headers['content-length']
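The thread-and-join pattern above can be wrapped in a small reusable helper. This is a sketch, not the answerer's code: the name `run_with_timeout` is hypothetical, and the worker is marked as a daemon thread (an assumption made here) so that an unfinished download does not keep the interpreter alive at exit.

```python
import threading
import time


def run_with_timeout(func, timeout, *args):
    """Run func(*args) in a daemon thread and wait at most `timeout` seconds.

    Returns (finished, result). The thread is not killed on timeout; it keeps
    running in the background, and as a daemon it will not block interpreter exit.
    """
    box = []
    thread = threading.Thread(target=lambda: box.append(func(*args)), daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return False, None
    # box is empty if func raised an exception inside the thread
    return True, (box[0] if box else None)
```

For the download above, the call would look like `run_with_timeout(requests.get, TIMEOUT, URL)`.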