I am using Python with Redis publish/subscribe as a message queue.
publisher:
import redis

rc = redis.Redis(host='127.0.0.1', port=6379)
rc.ping()
ps = rc.pubsub()
ps.subscribe('bdwaf')
r_str = "--8198b507-A--"
for i in range(0, 20000):
    rc.publish('bdwaf', r_str)
subscriber:
import redis

rc = redis.Redis(host='localhost', port=6379)
rc.ping()
ps = rc.pubsub()
ps.subscribe('bdwaf')
num = 0
while True:
    item = ps.get_message()
    if item:
        num += 1
        if item['type'] == 'message':
            # a is my parser object, defined elsewhere
            a.parser(item['data'])
        print(num)
When the publisher's loop range is higher than 20000, the subscriber does not seem to receive all the messages; it only works when I add a sleep call to the publisher.
How can I make this work without a sleep in the publisher, so that the subscriber receives every message no matter how many the publisher sends?
You can persist the messages in a distributed task queue. One that is commonly used with Redis is Celery, a distributed task queue written in Python (http://www.celeryproject.org/).
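Redis pub/sub does not store messages, so a subscriber that falls behind can miss some. A lighter alternative to Celery (not what the answer above proposes, only a sketch) is to keep the messages in a Redis list: the publisher appends with RPUSH and the consumer pops with BLPOP. The helper names enqueue and drain are hypothetical; client is assumed to be a redis.Redis instance such as the rc object in the question.

```python
def enqueue(client, key, message):
    """Append one message to the tail of the Redis list `key`."""
    client.rpush(key, message)


def drain(client, key, timeout=1):
    """Yield messages from the head of the list.

    Stops once the list has stayed empty for `timeout` seconds,
    because blpop() then returns None.
    """
    while True:
        item = client.blpop(key, timeout=timeout)
        if item is None:
            break
        _key, data = item  # blpop returns a (key, value) pair
        yield data


# Assumed usage with redis-py (not executed here):
#   import redis
#   rc = redis.Redis(host='127.0.0.1', port=6379)
#   for i in range(20000):
#       enqueue(rc, 'bdwaf', "--8198b507-A--")
#   for data in drain(rc, 'bdwaf'):
#       ...  # parse the message
```

Because the list keeps every item until a consumer pops it, the publisher does not need a sleep; the trade-off is that each message goes to exactly one consumer rather than to every subscriber.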
I have a use case that consists of loading huge tables from Oracle to Snowflake.
The Oracle server sits far away from the Snowflake endpoint, so we have connection issues when loading tables (views, in fact) bigger than 12 GB with a spool script or cx_Oracle.
I was thinking of using ThreadPoolExecutor with at most 4 threads, to test, together with SessionPool. That way I get one connection per thread, which is the whole point. It also means I would have to split the data fetch into batches, one per thread.
My question is: how can I achieve this? Is it correct to do something like:
"Select * from table where rownum between x and y" (not this syntax, I know, but you get my point), or should I rely on OFFSET, ...?
My idea was that each thread gets a "slice" of the select, fetches the data in batches and writes the rows to CSV in batches as well, because I would rather have small files than one huge file to send to Snowflake.
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime

import cx_Oracle

# QUERY, OFFSET and the config module c are defined elsewhere.

def query(start_off, pool):
    start_conn = datetime.now()
    con = pool.acquire()
    end_conn = datetime.now()
    print(f"Conn/Acquire time: {end_conn-start_conn}")
    with con.cursor() as cur:
        start_exec_ts = datetime.now()
        cur.execute(QUERY, start_pos=start_off, end_pos=start_off+(OFFSET-1))
        end_exec_ts = datetime.now()
        rows = cur.fetchall()
        end_fetch_ts = datetime.now()
        total_exec_ts = end_exec_ts-start_exec_ts
        total_fetch_ts = end_fetch_ts-end_exec_ts
        print(f"Exec time : {total_exec_ts}")
        print(f"Fetch time : {total_fetch_ts}")
        print(f"Task executed {threading.current_thread().name}, {threading.get_ident()}")
    return rows

def main():
    pool = cx_Oracle.SessionPool(c.oracle_conn['oracle']['username'],
                                 c.oracle_conn['oracle']['password'],
                                 c.oracle_conn['oracle']['dsn'],
                                 min=2, max=4, increment=1,
                                 threaded=True,
                                 getmode=cx_Oracle.SPOOL_ATTRVAL_WAIT
                                 )
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(query, d, pool) for d in range(1,13,OFFSET)]
        for future in as_completed(futures):
            # process your records from each thread
            print(repr(future.result()))
            # process_records(future.result())

if __name__ == '__main__':
    main()
Also, if I use fetchmany in the query function, how could I send the results back so that I can process each batch as it arrives?
If you want to transfer the data with a Python script, you can build a producer -> queue -> consumer workflow in which the consumers rely on the IDs of the data.
producer
the producer fetches the IDs of the data
it puts "a slice of IDs" on the queue as a job
consumer
the consumers fetch a job from the queue
they fetch the data for those IDs (e.g. "select * from table where id in ...")
they save the data somewhere
example
A quick example of this concept:
import time
import threading
import queue
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor

@dataclass
class Job:
    ids: list

jobs = queue.Queue()
executor = ThreadPoolExecutor(max_workers=4)
fake_data = [i for i in range(0, 200)]

def consumer():
    try:
        job = jobs.get_nowait()
        # select * from table where id in job.ids
        # save the data
        time.sleep(5)
        print(f"done {job}")
    except Exception as exc:
        print(exc)

def fake_select(limit, offset):
    if offset >= len(fake_data):
        return None
    return fake_data[offset:(offset+limit)]

def producer():
    limit = 10
    offset = 0
    stop_fetch = False
    while not stop_fetch:
        # select id from table limit l offset o
        ids = fake_select(limit, offset)
        if ids is None:
            stop_fetch = True
        else:
            job = Job(ids=ids)
            jobs.put(job)
            print(f"put {job}")
            offset += limit
            executor.submit(consumer)
        time.sleep(0.2)

def main():
    th = threading.Thread(target=producer)
    th.start()
    th.join()
    while not jobs.empty():
        time.sleep(1)
    executor.shutdown(wait=True)
    print("all jobs done")

if __name__ == "__main__":
    main()
Besides, if you want to do more work after a consumer has fetched the data, you can do it in the consumer itself, or add another queue and consumer for the extra operations.
The workflow then becomes:
producer -> queue -> fetch-and-save consumers -> queue -> consumer that does the extra operations
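On the fetchmany part of the question: one way to hand results back batch by batch is to wrap the cursor in a generator. The sketch below uses sqlite3 from the standard library as a stand-in for cx_Oracle, on the assumption that both cursors follow the DB-API fetchmany() contract; fetch_batches, demo and the table t are made-up names for illustration.

```python
import sqlite3


def fetch_batches(cursor, batch_size):
    """Yield lists of at most batch_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def demo():
    # In-memory table with 25 rows, standing in for the Oracle view.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(
        "INSERT INTO t (id, val) VALUES (?, ?)",
        [(i, "row{}".format(i)) for i in range(25)],
    )
    cur = conn.cursor()
    cur.execute("SELECT id, val FROM t ORDER BY id")
    # Each batch could be passed to csv.writer(...).writerows(batch) instead.
    sizes = [len(batch) for batch in fetch_batches(cur, 10)]
    conn.close()
    return sizes
```

Since the generator only holds one batch at a time, each batch can be written to its own small CSV file before the next one is fetched.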
I'm working with faust and would like to leverage its concurrency feature.
The listed example doesn't quite demonstrate the use of concurrency.
What I would like to do is read messages from a Kafka producer and unnest the JSON.
Then the shipments are sent to a process that calculates billing etc. I should send 10 shipments at a time to the function that does the calculation. For this I'm using concurrency so that 10 shipments can be calculated concurrently.
import faust
import time
import json
from typing import List
import asyncio

class Items(faust.Record):
    name: str
    billing_unit: str
    billing_qty: int

class Shipments(faust.Record, serializer="json"):
    shipments: List[Items]
    ship_type: str
    shipping_service: str
    shipped_at: str

app = faust.App('ships_app', broker='kafka://localhost:9092', )
ship_topic = app.topic('test_shipments', value_type=Shipments)

@app.agent(value_type=str, concurrency=10)
async def mytask(records):
    # task that does some other activity
    async for record in records:
        print(f'received....{record}')
        time.sleep(5)

@app.agent(ship_topic)
async def process_shipments(shipments):
    # async for ships in stream.take(100, within=10):
    async for ships in shipments:
        data = ships.items
        uid = faust.uuid()
        for item in data:
            item_uuid = faust.uuid()
            print(f'{uid}, {item_uuid}, {ships.ship_type}, {ships.shipping_service}, {ships.shipped_at}, {item.name}, {item.billing_unit}, {item.billing_qty}')
            await mytask.send(value=("{} -- {}".format(uid, item_uuid)))
            # time.sleep(2)
        # time.sleep(10)

if __name__ == '__main__':
    app.main()
Ok I figured out how it works. The problem with the example you gave was actually with the time.sleep bit, not the concurrency bit. Below are two silly examples that show how an agent would work with and without concurrency.
import faust
import asyncio

app = faust.App(
    'example_app',
    broker="kafka://localhost:9092",
    value_serializer='raw',
)
t = app.topic('topic_1')

# @app.agent(t, concurrency=1)
# async def my_task(tasks):
#     async for my_task in tasks:
#         val = my_task.decode('utf-8')
#         if (val == "Meher"):
#             # This will print out second because there is only one agent instance.
#             # It'll take 5ish seconds and print out right after Waldo
#             print("Meher's a jerk.")
#         else:
#             await asyncio.sleep(5)
#             # Since there's only one agent instance running this will effectively
#             # block the agent.
#             print(f"Where did {val} go?")

@app.agent(t, concurrency=2)
async def my_task2(tasks):
    async for my_task in tasks:
        val = my_task.decode('utf-8')
        if (val == "Meher"):
            # This will print out first even though the Meher message is
            # received second.
            print("Meher's a jerk.")
        else:
            await asyncio.sleep(5)
            # Because this will be sleeping and there are two agent instances available.
            print(f"Where did {val} go?")

# ===============================
# In another process run
from kafka import KafkaProducer
p = KafkaProducer()
p.send('topic_1', b'Waldo'); p.send('topic_1', b'Meher')
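The time.sleep versus asyncio.sleep point can be checked without Kafka or faust. The following is a minimal sketch using only asyncio; the worker and helper names are made up for illustration. Two coroutines that call time.sleep run one after the other because the call blocks the whole event loop, while two that await asyncio.sleep overlap.

```python
import asyncio
import time


async def blocking_worker(delay):
    # time.sleep never yields to the event loop, so nothing else can run.
    time.sleep(delay)


async def cooperative_worker(delay):
    # asyncio.sleep suspends this coroutine and lets the other one run.
    await asyncio.sleep(delay)


async def run_pair(worker, delay):
    """Run two copies of worker concurrently and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(worker(delay), worker(delay))
    return time.perf_counter() - start


def measure(delay=0.2):
    blocking = asyncio.run(run_pair(blocking_worker, delay))
    cooperative = asyncio.run(run_pair(cooperative_worker, delay))
    return blocking, cooperative
```

With the blocking worker the elapsed time is roughly twice the delay; with the cooperative worker it stays close to a single delay, which is the effect the concurrency setting relies on.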
I have a simple problem / question about the below code.
ip = '192.168.0.'
count = 0
while count <= 255:
    print(count)
    count += 1
    for i in range(10):
        ipg=ip+str(count)
        t = Thread(target=conn, args=(ipg,80))
        t.start()
I want to start 10 threads at a time, wait for them to finish, and then continue with the next 10 threads until count reaches 255.
I understand my problem, and why it starts 10 threads for every increase of count, but not how to solve it. Any help would be appreciated.
This can easily be achieved with the concurrent.futures library.
Here's the example code:
from concurrent.futures import ThreadPoolExecutor

ip = '192.168.0.'
count = 0
THREAD_COUNT = 10

def work_done(future):
    result = future.result()
    # work with your result here

def main():
    global count  # count is rebound below, so it must be declared global
    # conn is the asker's connection function, defined elsewhere
    with ThreadPoolExecutor(THREAD_COUNT) as executor:
        while count <= 255:
            ipg = ip + str(count)
            count += 1
            executor.submit(conn, ipg, 80).add_done_callback(work_done)

if __name__ == '__main__':
    main()
Here the executor returns a future for every task it submits.
Keep in mind that a callback registered with add_done_callback() runs in the thread that completed the future (or immediately in the calling thread if the future has already finished), so a slow callback holds up a worker. If you would rather handle the results yourself once the work is done, collect the future objects and wait for them separately. Here's the code snippet for that:
from concurrent.futures import ThreadPoolExecutor, wait

futures = []
with ThreadPoolExecutor(THREAD_COUNT) as executor:
    for count in range(256):
        ipg = ip + str(count)
        futures.append(executor.submit(conn, ipg, 80))
    # wait() returns two sets: the finished futures and the unfinished ones
    done, not_done = wait(futures)
    for future in done:
        result = future.result()  # re-raises any exception raised in conn
        # work with your result here
hope this helps!
There are two viable options: multiprocessing with ThreadPool, as @martineau suggested, and using a queue. Here's an example with a queue that executes requests concurrently in 10 different threads. Note that it doesn't do any kind of batching: as soon as a thread completes a task it picks up the next one, regardless of the status of the other workers:
import queue
import threading

def conn():
    try:
        while True:
            ip, port = que.get_nowait()
            print('Connecting to {}:{}'.format(ip, port))
            que.task_done()
    except queue.Empty:
        pass

que = queue.Queue()
for i in range(256):
    que.put(('192.168.0.' + str(i), 80))

# Start workers
threads = [threading.Thread(target=conn) for _ in range(10)]
for t in threads:
    t.start()

# Wait que to empty
que.join()

# Wait workers to die
for t in threads:
    t.join()
Output:
Connecting to 192.168.0.0:80
Connecting to 192.168.0.1:80
Connecting to 192.168.0.2:80
Connecting to 192.168.0.3:80
Connecting to 192.168.0.4:80
Connecting to 192.168.0.5:80
Connecting to 192.168.0.6:80
Connecting to 192.168.0.7:80
Connecting to 192.168.0.8:80
Connecting to 192.168.0.9:80
Connecting to 192.168.0.10:80
Connecting to 192.168.0.11:80
...
I modified your code so that it has the correct logic to do what you want. Please note that I haven't run it, but I hope you'll get the general idea:
import time
from threading import Thread

ip = '192.168.0.'
count = 0
while count <= 255:
    print(count)
    # a list to keep your threads while they're running
    alist = []
    for i in range(10):
        if count > 255:
            break
        ipg = ip + str(count)
        t = Thread(target=conn, args=(ipg, 80))
        t.start()
        alist.append(t)
        # count must be increased here, once per thread, to count up to 255
        count += 1
    # check if threads are still running
    while len(alist) > 0:
        time.sleep(0.01)
        # keep only the threads that are still alive
        # (is_alive() replaces isAlive(), which was removed in Python 3.9)
        alist = [t for t in alist if t.is_alive()]
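The polling loop above can also be replaced by join(), which blocks until a thread has finished. This is a minimal sketch of strict batching under that assumption; run_in_batches and scan are hypothetical names, and conn here is a stand-in that only records its arguments rather than opening a connection.

```python
import threading


def run_in_batches(func, arg_list, batch_size=10):
    """Start at most batch_size threads, join them all, then start the next batch."""
    for start in range(0, len(arg_list), batch_size):
        batch = [
            threading.Thread(target=func, args=args)
            for args in arg_list[start:start + batch_size]
        ]
        for t in batch:
            t.start()
        for t in batch:
            t.join()  # blocks until this thread has finished


def conn(ip, port, seen):
    # Stand-in for the real connection function: just record the target.
    seen.append((ip, port))


def scan():
    seen = []
    targets = [('192.168.0.' + str(i), 80, seen) for i in range(256)]
    run_in_batches(conn, targets, batch_size=10)
    return seen
```

Every batch has to finish before the next one starts, so one slow host delays its whole batch; the queue and ThreadPoolExecutor answers avoid that by keeping all 10 workers busy.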
I have a ROUTER whose purpose is to accumulate image data from multiple DEALER clients and perform OCR on the complete image. I found that the most efficient way of handling the OCR is Python's multiprocessing library; the accumulated image bytes are put into a Queue for processing in a separate Process. However, I need to ensure that when a client times out, the Process is properly terminated and doesn't linger pointlessly and hog resources.
In my current solution I insert each newly connected client into a dict whose value is my ClientHandler class, which holds all the image data and spawns a Thread that sets a boolean variable named "timeout" to True once 5 seconds have elapsed. Should a new message be received within that 5-second window, bump is called and the timer is reset to 0; otherwise I clean up before the thread terminates, and the reference is deleted from the dict in the main loop:
import threading
import time
import zmq

class ClientHandler(threading.Thread):
    def __init__(self, socket):
        self.elapsed = time.time()
        self.timeout = False
        self.socket = socket
        super(ClientHandler, self).__init__()

    def run(self):
        while time.time() - self.elapsed < 5.0:
            pass
        self.timeout = True
        # CLIENT TIMED OUT
        # HANDLE TERMINATION AND CLEAN UP HERE

    def bump(self):
        self.elapsed = time.time()

    def handle(self, id, header, data):
        # HANDLE CLIENT DATA HERE
        # ACCUMULATE IMAGE BYTES, ETC
        self.socket.send_multipart([id, str(0)])

def server_task():
    clients = dict()
    context = zmq.Context.instance()
    server = context.socket(zmq.ROUTER)
    server.setsockopt(zmq.RCVTIMEO, 0)
    server.bind("tcp://127.0.0.1:7777")
    while True:
        try:
            id, header, data = server.recv_multipart()
            client = clients.get(id)
            if client == None:
                client = clients[id] = ClientHandler(server)
                client.start()
            client.bump()
            client.handle(id, header, data)
        except zmq.Again:
            for id in clients.keys():
                if clients[id].timeout:
                    del clients[id]
    context.term()

if __name__ == "__main__":
    server_task()
But this entire method just doesn't feel right. Am I going about this improperly? If so, I would greatly appreciate it if someone could point me in the right direction.
Figured it out myself, hoping it may be of assistance to others.
I instead have a ROUTER on an assigned port that distributes unique ports to each client, which thereafter connects to the newly-bound socket on said unique port. When a client disconnects, the port is recycled for reassignment.
import sys
import zmq
from multiprocessing import Process, Queue, Value

def server_task():
    context = zmq.Context.instance()
    server = context.socket(zmq.ROUTER)
    server.bind("tcp://127.0.0.1:7777")
    timeout_queue = Queue()
    port_list = [ 1 ]
    proc_list = [ ]
    while True:
        try:
            id = server.recv_multipart()[0]
            # Get an unused port from the list
            # Ports from clients that have timed out are recycled here
            while not timeout_queue.empty():
                port_list.append(timeout_queue.get())
            port = port_list.pop()
            if len(port_list) == 0:
                port_list.append(port + 1)
            # Spawn a new worker task, binding the port to a socket
            proc_running = Value("b", True)
            proc_list.append(proc_running)
            Process(target=worker_task, args=(proc_running, port, timeout_queue)).start()
            # Send the new port to the client (frames must be bytes on Python 3)
            server.send_multipart([id, str(7777 + port).encode()])
        except KeyboardInterrupt:
            break
    # Safely allow our worker processes to terminate
    for proc_running in proc_list:
        proc_running.value = False
    context.term()

def worker_task(proc_running, port, timeout_queue):
    context = zmq.Context.instance()
    worker = context.socket(zmq.ROUTER)
    worker.setsockopt(zmq.RCVTIMEO, 5000)
    worker.bind("tcp://127.0.0.1:%d" % (7777 + port, ))
    while proc_running.value:
        try:
            id, data = worker.recv_multipart()
            worker.send_multipart([id, data])
        except zmq.Again:
            timeout_queue.put(port)
            context.term()
            break
    print("Client on port %d disconnected" % (7777 + port, ))
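Separately from the port-per-client design above, the busy-wait loop in the question's ClientHandler.run could be replaced by a blocking wait on a threading.Event, which uses no CPU while idle. This is only a sketch with made-up names (InactivityTimer, on_timeout); it contains no ZeroMQ code and is not part of the accepted solution.

```python
import threading
import time


class InactivityTimer(threading.Thread):
    """Call on_timeout once no bump() has arrived for `limit` seconds."""

    def __init__(self, limit, on_timeout):
        super().__init__(daemon=True)
        self.limit = limit
        self.on_timeout = on_timeout
        self._bumped = threading.Event()

    def bump(self):
        # Called by the receiving loop whenever the client sends a message.
        self._bumped.set()

    def run(self):
        # wait() returns True if bump() was called before the limit expired,
        # in which case the flag is cleared and the countdown starts again.
        while self._bumped.wait(self.limit):
            self._bumped.clear()
        self.on_timeout()
```

A handler would create one timer per client, call bump() on every received message, and do its cleanup (for example terminating the OCR Process) inside on_timeout.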
I am working on an HTTP client that can open hundreds of connections each second and send up to 10 requests on each of those connections. I am using threading to achieve the concurrency.
Here is my code:
def generate_req(reqSession):
    requestCounter = 0
    while requestCounter < requestRate:
        try:
            response1 = reqSession.get('http://20.20.1.2/tempurl.html')
            if response1.status_code == 200:
                client_notify('r')
        except(exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout) as Err:
            client_notify('F')
            break
        requestCounter += 1

def main():
    for q in range(connectionPerSec):
        s1 = requests.session()
        t1 = threading.Thread(target=generate_req, args=(s1,))
        t1.start()
Issues:
1) It does not scale above 200 connections/sec with requestRate = 1. I ran other available HTTP clients on the same client machine against the same server; those tests run fine and are able to scale.
2) When requestRate = 10, connections/sec drops to 30.
Reason: the client is not able to create the targeted number of threads every second.
For issue #2, the client machine is not able to create enough request sessions and start new threads. As soon as requestRate is set to more than 1, things start to fall apart.
I suspect it has something to do with the HTTP connection pooling that requests uses.
Please suggest what I am doing wrong here.
I wasn't able to get things to fall apart; however, the following code has some new features:
1) extended logging, including specific per-thread information
2) all threads are join()ed at the end to make sure the parent process doesn't leave them hanging
3) multithreaded print tends to interleave the messages, which can be unwieldy. In this version client_notify returns the message instead of printing it, so a future version can collect the messages and print them cleanly.
source
import requests, threading, time
from requests import exceptions

requestRate = 1
connectionPerSec = 2

def client_notify(msg):
    return time.time(), threading.current_thread().name, msg

def generate_req(reqSession):
    requestCounter = 0
    while requestCounter < requestRate:
        try:
            response1 = reqSession.get('http://127.0.0.1/')
            if response1.status_code == 200:
                print(client_notify('r'))
        except (exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout):
            print(client_notify('F'))
            break
        requestCounter += 1

def main():
    for cnum in range(connectionPerSec):
        s1 = requests.session()
        th = threading.Thread(
            target=generate_req, args=(s1,),
            name='thread-{:03d}'.format(cnum),
        )
        th.start()
    for th in threading.enumerate():
        if th != threading.current_thread():
            th.join()

if __name__=='__main__':
    main()
output
(1407275951.954147, 'thread-000', 'r')
(1407275951.95479, 'thread-001', 'r')
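One way the "future version" mentioned in point 3 could collect the messages is to have each thread put its tuple on a queue.Queue and let the main thread print them after the joins. This is a sketch under that assumption; the worker body is a placeholder rather than a real HTTP request, and run is a made-up helper name.

```python
import queue
import threading
import time


def client_notify(messages, msg):
    # Same tuple as before, but queued instead of printed by the worker.
    messages.put((time.time(), threading.current_thread().name, msg))


def worker(messages):
    # Placeholder for the request loop; a real worker would call
    # reqSession.get(...) and report 'r' or 'F' depending on the outcome.
    client_notify(messages, 'r')


def run(n_threads):
    messages = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(messages,),
                         name='thread-{:03d}'.format(i))
        for i in range(n_threads)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # All workers are done, so the queue can be drained without blocking.
    collected = []
    while not messages.empty():
        collected.append(messages.get())
    return sorted(collected)  # ordered by timestamp
```

Printing from a single thread, for example with for entry in run(2): print(entry), keeps the lines from interleaving.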