I have the following code block in a Colab notebook that downloads the EMNIST dataset from torchvision. Sometimes I randomly get an error saying:
ConnectionError: HTTPConnectionPool(host='www.itl.nist.gov', port=80): Max retries exceeded with url: /iaui/vip/cs_links/EMNIST/gzip.zip (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f701d0936d0>: Failed to establish a new connection: [Errno 110] Connection timed out'))
So I wrote a code block with a function that calls itself if the download attempt fails (see bottom of post). Sometimes the download actually starts, but it's extremely slow; see the screenshot below for the progress bar.
Over two hours to download less than 1 GB of data from the URL in the screenshot. Downloading the dataset directly to my machine took about 60 seconds, so it's not a problem with the server that's serving the data. It seems to be a problem with either Colab's connection to that server or the way PyTorch handles the download. I don't really know how to solve this issue. I tried reconnecting the runtime many times, but the same thing happened.
Data Download Code:
from torchvision import transforms, datasets
train_data = None
test_data = None
def load_data():
    global train_data, test_data
    try:
        train_data = datasets.EMNIST("./data", split="balanced", train=True, download=True,
                                     transform=transforms.Compose([
                                         transforms.ToTensor()
                                     ]))
        test_data = datasets.EMNIST("./data", split="balanced", train=False, download=True,
                                    transform=transforms.Compose([
                                        transforms.ToTensor()
                                    ]))
    except:
        load_data()

load_data()
print(train_data)
print(test_data)
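As a side note, the recursive retry above has no base case, so a persistent failure will keep recursing until Python's recursion limit is hit. A bounded retry loop with exponential backoff avoids that (a generic sketch; the attempt count and delay values are arbitrary):

```python
import time

def retry(fn, attempts=5, base_delay=2.0):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))
```

Usage would then be `retry(load_data)` in place of the self-calling `except` block.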
I'm attempting some simulated federated learning using TensorFlow. When I invoke the initialize computation to construct the server state, I get the following OSError. Does anyone know a way around this?
iterative_process = tff.learning.algorithms.build_weighted_fed_avg(
    model_fn,
    # client optimizer
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    # server optimizer
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1)
)
train_state = iterative_process.initialize()

OSError: [Errno 8] Exec format error: '/Users/*****/opt/anaconda3/envs/p3workshop/lib/python3.10/site-packages/tensorflow_federated/python/core/backends/native/../../../../data/worker_binary'
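For what it's worth, errno 8 ("Exec format error") when spawning an executable usually means the binary was built for a different CPU architecture or platform than the one running it, e.g. an x86_64 `worker_binary` under an arm64 (Apple Silicon) Python. A quick stdlib check of what your interpreter is actually running on:

```python
import platform

# Compare this against the architecture the bundled worker_binary was
# built for; a mismatch is what produces "Exec format error".
print(platform.system(), platform.machine())
```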
I am new to the Locust load testing framework and in the process of migrating our existing Azure-cloud-based C# performance testing scripts to Locust's Python-based scripts. Our team has almost completed the migration, but during load tests we get the errors below, and the machine fails to create new requests, either due to high CPU utilization or because of the many exceptions in Locust. We are running Locust in web-based mode; details are below. The scripts work fine at smaller loads of 50 to 100 users.
Error 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
Error 2: Connection pool is full, discarding connection
Error 3: urllib3.exceptions.NewConnectionError: Failed to establish a new connection: [Errno 110] Connection timed out
Yes, we are using urllib3 in our utility classes, but the first two errors do seem to come from Locust.
Our load testing configuration is 3500 users at a hatch rate of 5 users per second, running natively (no Docker container) on an 8-core, 16 GB Ubuntu Linux virtual machine on Azure, with ulimit set to 50,000 on the Linux machine.
Please help us with your thoughts.
A sample test is below:
import os
import sys
sys.path.append(os.environ.get('WORKDIR', os.getcwd()))

from locust import HttpLocust, TaskSet, task
from locust.wait_time import between

class ContactUsBehavior(TaskSet):
    wait_time = AppUtil.get_wait_time_function(2)  # helper from our utility classes

    @task(1)
    def post_load_test_contact(self):
        data = {
            "ContactName": "Mane",
            "Email": "someone@someone.com",
            "EmailVerifaction": "someone@someone.com",
            "TelephoneContact": "",
            "PhoneNumber": "",
            "ContactReason": "Other",
            "OtherComment": "TEST Comments 2019-12-30",
            "Agree": "true",
        }
        self.client.post("app/contactform", data=data, name='Contact us submission')

class UnauthenticatedUser(HttpLocust):
    task_set = ContactUsBehavior
    # host is override-able
    host = 'https://app.devurl.com/'
Locust's default HTTP client uses python-requests, which internally uses urllib3.
If you are working on large-scale tests, you should consider another HTTP client. urllib3's connection pool (PoolManager) reuses connections and limits how many connections are allowed per host at any given time, to avoid accumulating too many unused sockets.
So one option is to tweak the pool: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#customizing-pool-behavior
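A minimal sketch of that pool tweak, assuming you construct your own urllib3 PoolManager in the utility classes (the pool sizes here are arbitrary):

```python
import urllib3

# block=True makes workers wait for a free connection instead of opening
# extra sockets that later get discarded, which is what the
# "Connection pool is full, discarding connection" warning reports.
http = urllib3.PoolManager(num_pools=50, maxsize=100, block=True)

# Requests made through this manager then share the larger pool, e.g.:
# resp = http.request("GET", "https://app.devurl.com/")
```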
Or you can try another high-performance HTTP client, e.g. geventhttpclient.
Locust also provides a built-in client which is faster than the default python-requests:
https://docs.locust.io/en/stable/increase-performance.html
If the client still can't handle the load, you should consider running Locust distributed across multiple nodes.
I'm getting a lot of deadline-exceeded errors in a Python gRPC client calling a Scala gRPC server.
I'm reporting metrics from both the client and the server, and there is a large discrepancy between server-reported time and client-reported time that I don't think can be explained by network latency alone (since the variance is big). The returned objects are of similar size, so I would assume serialization time is negligible compared to network time.
I've set the timeout to 20 ms.
My client code is simple:
self.channel = grpc.insecure_channel(...)
self.stub = MyService_pb2_grpc.MyServiceStub(self.channel)

timeout = 0.02  # seconds, i.e. 20 ms
try:
    start = time.time()
    grpc_res = self.stub.getFoo(Request(...), timeout=timeout)
    end = time.time()
    total_duration_ms = int((end - start) * 1000)
    ....
except Exception as e:
    status_code = str(e.code()).split('.')[1]
    logger.error('exception ....: %s', status_code)  # around 20% deadline exceptions
My server code reports 5 ms on average and my client code reports 7 ms on average, but, as mentioned, about 20% of calls hit the 20 ms timeout.
Is there a way to debug the root cause of this, e.g. lower-level logging?
You could try running with these environment variables set:
GRPC_VERBOSITY=DEBUG GRPC_TRACE=all
https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
I get a timeout error when reading my data.
I'm on my company network, so I have to write pip install --proxy=http://ep.threatpulse.net:80 pandas in order to install pandas.
Is it a proxy problem?
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])
and the result looks like this:
urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Yes, this error occurs because the code is unable to establish a connection to the internet: either a network error or a proxy-setting issue. You can check the proxy settings in IE to see what is picked up by default, try another PC on the same network, or ask the system admin at your company to allow access.
You can try setting the proxy something like this:
import io

import pandas as pd
import requests

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
proxy_dict = {"https": "https://xx.xx.x.xx:80"}  # replace with your proxy settings

response = requests.get(url, proxies=proxy_dict).text
df = pd.read_csv(io.StringIO(response), header=None)
df.columns = ['sepal length in cm', 'sepal width in cm',
              'petal length in cm', 'petal width in cm', 'class']
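Alternatively, since pip already works with --proxy=http://ep.threatpulse.net:80, the same proxy can often be supplied through environment variables, which urllib (what pandas.read_csv uses to fetch URLs) picks up automatically. A sketch, assuming that proxy also handles HTTPS traffic:

```python
import os

# Assumption: the corporate proxy from the pip command also proxies HTTPS.
os.environ["HTTP_PROXY"] = "http://ep.threatpulse.net:80"
os.environ["HTTPS_PROXY"] = "http://ep.threatpulse.net:80"

# After this, pd.read_csv(url) should route through the proxy.
```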
I'm trying to get a handle on Docker. I've got a very basic container setup that runs a simple Python script to:
Query a database
Write a CSV file of the query results
Upload the CSV to S3 (using the tinys3 package).
When I run the script from my host, everything works as intended: the query fires, the CSV is created, and it uploads perfectly. But when I run it from within my Docker container, tinys3 fails with the following error:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='my-s3-bucket', port=443): Max retries exceeded with url: /bucket.s3.amazonaws.com/test.csv (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f4f17cf7790>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Everything prior to that works (query and CSV creation). This answer suggests that there's an incorrect endpoint. But that doesn't seem correct, since running the script from my host does not result in an error.
So my question is: am I missing something obvious? Is this an issue with the tinys3 module? Do I need to set something up in my container to allow it to "call out"? Or is there a better way to do this?
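Since the underlying failure is "Name or service not known" (a DNS resolution error rather than a refused connection), one quick sanity check is whether the container can resolve the endpoint hostname at all. A small stdlib sketch (the hostname to test is whichever endpoint tinys3 actually contacts, which you'd confirm from the traceback):

```python
import socket

def can_resolve(host):
    """Return True if DNS resolution succeeds for host."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False

# Run inside the container, e.g.:
# print(can_resolve("s3.amazonaws.com"))
```

If this returns False inside the container but True on the host, the problem is the container's DNS configuration (e.g. Docker's --dns setting) rather than tinys3 itself.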
Alternatively, you can also use the minio-py client library for the same purpose.
Please find example code from fput_object.py:
from minio import Minio
from minio.error import ResponseError

client = Minio('s3.amazonaws.com',
               access_key='YOUR-ACCESSKEYID',
               secret_key='YOUR-SECRETACCESSKEY')

# Put object 'my-objectname-csv' with contents from
# 'my-filepath.csv' as 'application/csv'.
try:
    client.fput_object('my-bucketname', 'my-objectname-csv',
                       'my-filepath.csv', content_type='application/csv')
except ResponseError as err:
    print(err)
Hope it helps.
Disclaimer: I work with Minio