Raising an OS error when running federated learning in Python

I'm attempting some simulated federated learning using TensorFlow. When I invoke the initialize computation to construct the server state, I get the following OS error. Does anyone know a way around this?
import tensorflow as tf
import tensorflow_federated as tff

# model_fn is defined elsewhere in my notebook
iterative_process = tff.learning.algorithms.build_weighted_fed_avg(
    model_fn,
    # client optimizer
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    # server optimizer
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0)
)
train_state = iterative_process.initialize()
OSError: [Errno 8] Exec format error: '/Users/*****/opt/anaconda3/envs/p3workshop/lib/python3.10/site-packages/tensorflow_federated/python/core/backends/native/../../../../data/worker_binary'

Related

How to use onnxruntime in parallel with Flask?

I created a server that needs to run onnxruntime sessions in parallel.
First question: should this use multiple threads or multiple processes?
I tried multiple threads: app.run(host='127.0.0.1', port='12345', threaded=True).
When running 3 threads, the GPU memory used stays under 8 GB and the program runs; when running 4 threads, the GPU memory needed exceeds 8 GB and the program fails with the error: onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED.
I know the problem is that the GPU runs out of memory, but I would like the program not to crash. So I tried limiting the number of threads and set intra_op_num_threads = 2, inter_op_num_threads = 2, and os.environ["OMP_NUM_THREADS"] = "2", but that didn't work.
I also tried 'gpu_mem_limit', which didn't work either (how these settings are typically applied is sketched after the system information below).
import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction = sess.run(...)

if __name__ == '__main__':
    app.run(host='127.0.0.1', port='12345', threaded=True)
My understanding is that the Flask HTTP server may be using a different sess for each call.
How can I make every call use the same onnxruntime session?
System information
OS Platform and Distribution: Windows 10
ONNX Runtime version: 1.8
Python version: 3.7
GPU model and memory: RTX 3070, 8 GB
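For reference, a minimal sketch of how those limits are typically passed to onnxruntime (illustrative only: model_XXX is the same placeholder as above, and the 2 GB gpu_mem_limit value is an arbitrary example):
import os
import onnxruntime as rt

os.environ["OMP_NUM_THREADS"] = "2"

# limit the threads used inside / across operators
so = rt.SessionOptions()
so.intra_op_num_threads = 2
so.inter_op_num_threads = 2

# cap the CUDA execution provider's memory (2 GB here, value is illustrative)
cuda_provider = ('CUDAExecutionProvider', {'gpu_mem_limit': 2 * 1024 * 1024 * 1024})

sess = rt.InferenceSession(model_XXX, sess_options=so, providers=[cuda_provider])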

PyTorch torchvision dataset download speed is very slow

I have the following code block in a Colab notebook that downloads the EMNIST dataset from torchvision. Sometimes I randomly get an error saying
ConnectionError: HTTPConnectionPool(host='www.itl.nist.gov', port=80): Max retries exceeded with url: /iaui/vip/cs_links/EMNIST/gzip.zip (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f701d0936d0>: Failed to establish a new connection: [Errno 110] Connection timed out'))
So I made a code block with a function that calls itself if the download attempt fails (see the bottom of this post). Sometimes the download actually starts, but it is extremely slow: the progress bar (screenshot not reproduced here) showed over two hours to download less than 1 GB of data from that URL.
Downloading the dataset directly to my machine took about 60 seconds, so it's not a problem with the server that's serving the data. It seems to be a problem with either Colab's connection to the server or the way PyTorch handles the download. I don't really know how to solve this issue. I tried reconnecting the runtime many times, but the same thing keeps happening.
Data Download Code:
from torchvision import transforms, datasets

train_data = None
test_data = None

def load_data():
    global train_data, test_data
    try:
        train_data = datasets.EMNIST("./data", split="balanced", train=True, download=True,
                                     transform=transforms.Compose([
                                         transforms.ToTensor()
                                     ]))
        test_data = datasets.EMNIST("./data", split="balanced", train=False, download=True,
                                    transform=transforms.Compose([
                                        transforms.ToTensor()
                                    ]))
    except:
        # retry on any failure (e.g. the ConnectionError above)
        load_data()

load_data()
print(train_data)
print(test_data)
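For what it's worth, the same retry idea with a bounded number of attempts looks roughly like this (a sketch; max_attempts is an arbitrary illustrative value):
from torchvision import transforms, datasets

def load_emnist(train, max_attempts=5):
    # retry the download a bounded number of times instead of recursing forever
    for attempt in range(max_attempts):
        try:
            return datasets.EMNIST("./data", split="balanced", train=train, download=True,
                                   transform=transforms.Compose([transforms.ToTensor()]))
        except Exception as e:
            print(f"attempt {attempt + 1} failed: {e}")
    raise RuntimeError("EMNIST download failed after all attempts")

train_data = load_emnist(train=True)
test_data = load_emnist(train=False)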

How to free TF/Keras memory in Python after a model has been deleted, while other models are still in memory and in use?

I have a Python server application which provides TensorFlow / Keras model inference services. Multiple different models can be loaded and used at the same time, for multiple different clients. A client can request to load another model, but this has no effect on the other clients (i.e. their models stay in memory and in use as they are), so each client can ask to load another model regardless of the state of any other client.
The logic and implementation work; however, I am not sure how to correctly free memory in this setup. When a client asks for a new model to load, the previously loaded model is simply deleted from memory (via the Python del statement), and the new model is then loaded via tensorflow.keras.models.load_model().
From what I read in the Keras documentation one might want to clear a Keras session in order to free memory via calling tf.keras.backend.clear_session(). However, that seems to release all TF memory, which is a problem in my case, since other Keras models for other clients are still in use at the same time, as described above.
Moreover, it seems I cannot put every model into their own process, since I cannot access the single GPU from different running processes in parallel (or at all).
So in other words: when loading a new TensorFlow / Keras model while other models are also in memory and in use, how can I free the TF memory of the previously loaded model without interfering with the other currently loaded models?
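For context, the per-client swap described above looks roughly like this (a sketch; the models dict, client_id, and model_path are hypothetical names, and gc.collect() is just an extra step sometimes suggested alongside del):
import gc
from tensorflow.keras.models import load_model

# hypothetical registry: one loaded model per client
models = {}

def swap_model(client_id, model_path):
    # drop the client's previous model (as described above), then load the new one
    old = models.pop(client_id, None)
    if old is not None:
        del old
        gc.collect()  # optional: nudge Python to release the object sooner
    models[client_id] = load_model(model_path)
    return models[client_id]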
When a TensorFlow session starts, it tries to allocate all of the available GPU memory. This is what prevents multiple processes from running sessions. The ideal way to stop this is to ensure that the TF session only allocates a part of the memory. From the docs, there are two ways to do this (depending on your TF version).
The simple way (TF 2.2+) is:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
For TF 2.0/2.1:
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
For TF 1.x (allocate about 33% of GPU memory per process):
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The other method is more controlled IMHO and scales better. It requires that you create logical devices and manually control placement for each of them.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Now you have to manually control placement using a with tf.device(...) block:
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
    # Replicate your computation on multiple GPUs
    c = []
    for gpu in gpus:
        with tf.device(gpu.name):
            a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
            b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
            c.append(tf.matmul(a, b))
    with tf.device('/CPU:0'):
        matmul_sum = tf.add_n(c)
    print(matmul_sum)
Using this you won't run out of memory and can run multiple processes at once.
You can fork a new process (kernel) per customer. Each process then executes its operations in an environment separated from the others, which is a safer and more isolated approach.
I created a basic scenario with two parts. The main part is responsible for starting, driving, and killing the client processes. The client part is responsible for executing operations on the server's orders; each client waits for orders via HTTP requests.
main.py
import subprocess
import sys
import requests

class ClientOperator:
    def __init__(self, name, port, model):
        self.name = name
        self.port = port
        self.proc = subprocess.Popen([sys.executable, 'client.py',
                                      f'--port={port}', f'--model={model}'])

    def process(self, a, b):
        response = requests.get(f'http://localhost:{self.port}/process',
                                params={'a': a, 'b': b}).json()
        print(f'{self.name} process {a} + {b} = {response}')

    def close(self):
        print(f'{self.name} is closing')
        self.proc.terminate()


customer1 = ClientOperator('John', 20001, 'model1.hdf5')
customer2 = ClientOperator('Oscar', 20002, 'model2.hdf5')

customer1.process(5, 10)
customer2.process(4, 6)

# stop customer1
customer1.close()
client.py
import argparse
from flask import Flask, request, jsonify

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--port', '-p', type=int)
parser.add_argument('--model', '-m', type=str)
args = parser.parse_args()

model = args.model

app = Flask(__name__)

@app.route('/process', methods=['GET'])
def process():
    result = int(request.args['a']) + int(request.args['b'])
    return jsonify({'result': result, 'model': model})

if __name__ == '__main__':
    app.run(host="localhost", port=args.port)
Output:
$ python main.py
* Serving Flask app "client" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://localhost:20002/ (Press CTRL+C to quit)
* Serving Flask app "client" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://localhost:20001/ (Press CTRL+C to quit)
127.0.0.1 - - [22/Jan/2021 16:31:26] "GET /process?a=5&b=10 HTTP/1.1" 200 -
John process 5 + 10 = {'model': 'model1.hdf5', 'result': 15}
127.0.0.1 - - [22/Jan/2021 16:31:27] "GET /process?a=4&b=6 HTTP/1.1" 200 -
Oscar process 4 + 6 = {'model': 'model2.hdf5', 'result': 10}
John is closing

How to fix problem "Unable to complete the operation against any hosts" in Cassandra?

I have a pretty simple AWS Lambda function in which I connect to an Amazon Keyspaces (for Apache Cassandra) database. The Python code works, but from time to time I get the error below. How do I fix this strange behavior? My assumption is that additional settings are needed when initializing the cluster, for example set_max_connections_per_host. I would appreciate any help.
ERROR:
('Unable to complete the operation against any hosts', {<Host: X.XXX.XX.XXX:XXXX eu-central-1>: ConnectionShutdown('Connection to X.XXX.XX.XXX:XXXX was closed')})
lambda_function.py:
import sessions

cassandra_db_session = None
cassandra_db_username = 'your-username'
cassandra_db_password = 'your-password'
cassandra_db_endpoints = ['your-endpoint']
cassandra_db_port = 9142

def lambda_handler(event, context):
    global cassandra_db_session
    if not cassandra_db_session:
        cassandra_db_session = sessions.create_cassandra_session(
            cassandra_db_username,
            cassandra_db_password,
            cassandra_db_endpoints,
            cassandra_db_port
        )
    result = cassandra_db_session.execute('select * from "your-keyspace"."your-table";')
    return 'ok'
sessions.py:
from ssl import SSLContext
from ssl import CERT_REQUIRED
from ssl import PROTOCOL_TLSv1_2

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import DCAwareRoundRobinPolicy

def create_cassandra_session(db_username, db_password, db_endpoints, db_port):
    ssl_context = SSLContext(PROTOCOL_TLSv1_2)
    ssl_context.load_verify_locations('your-path/AmazonRootCA1.pem')
    ssl_context.verify_mode = CERT_REQUIRED

    auth_provider = PlainTextAuthProvider(username=db_username, password=db_password)

    cluster = Cluster(
        db_endpoints,
        ssl_context=ssl_context,
        auth_provider=auth_provider,
        port=db_port,
        load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='eu-central-1'),
        protocol_version=4,
        connect_timeout=60
    )
    session = cluster.connect()
    return session
There isn't much point setting the max connections on the client side since AWS Lambdas are effectively "dead" between runs. For the same reason, the recommendation is to disable driver heartbeats (with idle_heartbeat_interval = 0) since there is no activity that occurs until the next time the function is called.
This doesn't necessarily cause the issue you are seeing but there's a good chance the connection is being reused by the driver after it has been closed server-side.
With the lack of public documentation on the inner workings of AWS Keyspaces, it's difficult to know what is happening on the cluster. I've always suspected that AWS Keyspaces is a CQL-like API engine in front of DynamoDB, so there are quirks like the one you're seeing that are hard to track down, since doing so requires knowledge only available internally at AWS.
FWIW the DataStax drivers aren't tested against AWS Keyspaces.
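As an illustration of the heartbeat suggestion above, a minimal sketch of the tweak applied to the Cluster() call in sessions.py (idle_heartbeat_interval is an existing argument of the DataStax Python driver's Cluster; the other arguments stay exactly as in the question and are elided here):
cluster = Cluster(
    db_endpoints,
    ssl_context=ssl_context,
    auth_provider=auth_provider,
    port=db_port,
    idle_heartbeat_interval=0,  # disable driver heartbeats for short-lived Lambda invocations
    connect_timeout=60
)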
This is the biggest issue which I see:
result = cassandra_db_session.execute('select * from "your-keyspace"."your-table";')
The code looks fine, but I don't see a WHERE clause. So if there's a lot of data, a single node (chosen as the coordinator) has to build the result set while pulling data from all the other nodes. As this results in (un)predictably bad performance, that could explain why it works sometimes but not others.
Pro-tip: All queries in Cassandra should have a WHERE clause.
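For illustration, a keyed read might look like this (a sketch; 'id' is a hypothetical partition-key column, so substitute the table's actual key):
some_id = 123  # placeholder key value
result = cassandra_db_session.execute(
    'select * from "your-keyspace"."your-table" where id = %s;',
    (some_id,)
)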

How to set the necessary permissions for my code?

I'm writing code for packet sniffing in Python on Windows 10. From my searching, most developers say that Linux is better suited for packet sniffing, but I can't use Linux right now. So how can I fix this error?
I tried print(ctypes.windll.shell32.IsUserAnAdmin()) and it returned 0.
This is my code:
import ctypes
import socket

def main():
    print(ctypes.windll.shell32.IsUserAnAdmin())
    conn = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_IP)
    while True:
        raw_data, addr = conn.recvfrom(65535)
        # ethernet_frame() is a helper defined elsewhere in my script
        dest_mac, src_mac, eth_proto, data = ethernet_frame(raw_data)
        print("\nEthernet Frame:")
        print("Destination: {}, Source: {}, Protocol: {}".format(dest_mac, src_mac, eth_proto))
And I've got this error:
OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions
I know it's something about permissions, but how can I fix that? Thanks in advance :)
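One common pattern (not from the question, just a hedged sketch) is to check IsUserAnAdmin() and, if it returns 0, relaunch the script elevated via the standard ShellExecuteW "runas" verb:
import ctypes
import sys

def ensure_admin():
    # if not elevated, relaunch this script with a UAC "run as administrator" prompt
    if not ctypes.windll.shell32.IsUserAnAdmin():
        ctypes.windll.shell32.ShellExecuteW(
            None, "runas", sys.executable, " ".join(sys.argv), None, 1)
        sys.exit(0)

ensure_admin()
# ... the raw-socket sniffing code from the question would run here with admin rights ...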
