Why does colab disconnect? - python

I am trying to load my dataset. I have used Colab's TPU many times, but the session gets disconnected every time. I have tried all the usual methods to keep it connected, and it still doesn't work. I have been training for more than 10 hours and Colab still disconnects. What do I do?

There could be many reasons why your session is crashing.
There is a time limit for the free tier in Google Colab. If your execution runs over that limit, the session disconnects.
Also check the RAM usage; if it is exceeded, the session will crash.
The storage limit might also be exceeded.
Keep an eye on these factors while running, and try to optimise the code or use AWS for training.
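A quick way to keep an eye on the RAM and disk factors mentioned above is to run something like this in a notebook cell; this is only a minimal sketch using the standard library plus psutil (which, as far as I know, is preinstalled in Colab):

    # Print current RAM and disk usage; the numbers are informational only.
    import psutil
    import shutil

    ram = psutil.virtual_memory()
    print(f"RAM used: {ram.used / 1e9:.1f} GB of {ram.total / 1e9:.1f} GB ({ram.percent}%)")

    total, used, free = shutil.disk_usage("/")
    print(f"Disk used: {used / 1e9:.1f} GB of {total / 1e9:.1f} GB")

Running a cell like this periodically shows whether you are creeping towards the RAM or storage ceiling before the session actually crashes.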

As @HarshitRuwali mentioned, there are several reasons why this can happen.
Regarding the question "what do I do?" - you can purchase a Colab Pro subscription, which removes or relaxes various limitations, including the time limit.

Related

Google cloud run sudden latency spikes and container instance time increase

We are having recurring problems with our container instances running Python on Cloud Run. We currently have 20 services deployed, which run fine for weeks at a time and then suddenly get spikes in request latency, failing ping checks, and increased container instance time. We cannot see any added traffic in our systems during these spells of longer latency. Common access points such as the database and cache all seem normal.
The region is europe-west1
Does anyone have any tips on what to check? Or has anyone experienced similar problems?
Latency and container instance time graphs: (screenshots not included)
I had to buy support for Google Cloud to get a good answer to this. They told me to make adjustments to my Cloud Run service instances, but none of them had any effect. They later admitted that this was due to a problem on their end. It is a shame that you as a user do not get any feedback on problems like these when using Google Cloud Platform; a simple notification in the Google Cloud console for affected users would be a great help, but I suspect they prefer to cover these things up so as not to worsen their service accessibility numbers.

High iowait on sdcard - process stuck for 5-15 seconds

I have IoT devices (arm64) running Ubuntu with SD cards (formatted as ext4 with journaling). My application's logging (Python logging library) is written to files on that SD card; overall the write speed (as reported by iotop) is around 40 KB/s, and the device operates 24/7/365.
What I see is that once in a while (every week or so?) there is a spike in iowait (see the attached screenshot from netdata).
When this happens my process gets stuck for 5-15 seconds, which is a lot!
Now I know that I should change my logging to be non-blocking so my process doesn't get stuck if there is an issue with the disk, but this amount of time seems excessive considering that the write speed is very low.
It has gotten worse since I increased logging, but it is still not a lot of data.
My next steps are:
Use QueueHandler to do logging without blocking
Disable journaling on the SD card
Disable Docker logging, as it also writes to disk.
But I want to understand the underlying issue that causes these kinds of stalls - what can it be?
Not a full solution, but adding a QueueHandler made my app survive these high loads.
It is easy to simulate this with slowpokefs or just by doing a lot of IO (like tarring a big folder) while logging constantly.
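For what it's worth, here is a minimal sketch of that non-blocking setup with the standard library's QueueHandler/QueueListener; the logger name and log file path are placeholders:

    # The application thread only enqueues records; a background listener
    # thread drains the queue and performs the (possibly slow) disk writes.
    import logging
    import logging.handlers
    import queue

    log_queue = queue.Queue(-1)  # unbounded, so logger calls never block

    logger = logging.getLogger("app")          # placeholder logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))

    file_handler = logging.FileHandler("/var/log/app/app.log")  # placeholder path
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger.info("this call returns immediately even if the SD card stalls")
    # call listener.stop() on shutdown to flush any queued records

If the SD card stalls, records pile up in memory instead of blocking the main loop, which matches the behaviour described above.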

PythonAnywhere Issue: Your processes have been killed because a process exceeded RAM Limit

I am getting this warning email from PythonAnywhere on every single request to my website. I am using spaCy and Django and have just upgraded my account. Everything seems to work fine, though, except that I keep receiving these warning emails. I have only 2 GB of RAM on my local machine and it can run my app along with a few other apps without any issues. So why is 3 GB of RAM not enough on PythonAnywhere? (I also have 3 GB of disk space on PythonAnywhere, of which only 27% is used.)
I have tried searching for answers on their forum and on the internet in general, but I have not found any clue about the issue.
If your initial requests to the PythonAnywhere web app work fine (i.e. your code successfully allocates, say, 2 GB of RAM and returns a result) and you see the results correctly, but you still receive emails about processes exceeding the RAM limit, then perhaps you have processes that are left hanging around, not cleaned up, and they accumulate until they slowly get killed. Can you correlate the number of kill messages with the number of times you hit the web app and get a result? My theory would be corroborated if there are significantly fewer kill messages than hits for that particular model endpoint.
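If you want to test that theory, one rough way (assuming psutil is installed in the environment) is to list the Python processes and their resident memory from a PythonAnywhere console and see whether old ones accumulate between requests:

    # List Python processes with their PID, resident memory and command line.
    import psutil

    for proc in psutil.process_iter(["pid", "name", "memory_info", "cmdline"]):
        try:
            if "python" in (proc.info["name"] or "").lower():
                rss_mb = proc.info["memory_info"].rss / 1e6
                print(proc.info["pid"], f"{rss_mb:.0f} MB", " ".join(proc.info["cmdline"] or []))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass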

Google CloudML: Job fails after "Finished tearing down training program" even though the training hasn't completed

I am trying to train a model using Google Cloud Platform (GCP).
I chose the standard-1 scale tier (using the basic tier gives memory exceptions, which I think is due to the size (2.6 GB) of the data), but my job fails after a "Finished tearing down training program" log entry even though it is still downloading the data into the VM from Cloud Storage.
It doesn't provide any traceback as to what the error might be.
I have my data stored in Cloud Storage, and to make it available I use os.system('gsutil -m cp -r location_of_data_in_cloud_storage os.getcwd()') to copy the data onto the assigned VM so that it can be accessed directly by the program. This data is then loaded into the model.fit_generator() method through a generator.
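For reference, a sketch of that copy step with the working directory interpolated explicitly; the bucket path is a placeholder, not the real location:

    # Copy the training data from Cloud Storage into the VM's working directory.
    import os
    import subprocess

    source = "gs://my-bucket/training-data"   # placeholder for the real bucket path
    destination = os.getcwd()

    subprocess.run(
        ["gsutil", "-m", "cp", "-r", source, destination],
        check=True,  # raise if the copy fails so the failure shows up in the job logs
    )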
As can be seen, the 2.6 GB of data hasn't been downloaded completely, but the job fails before that!
For anyone else who stumbles upon this question in the future (possibly me ;) ): the problem above was occurring because the machine couldn't handle the compute, so I had to scale up from the basic scale tier to the standard_p100 scale tier in GCP, which solved the problem!

Google Calendar API: Calendar usage limits exceeded

My task is migrating calendar data, and we're using the Google Calendar API.
The number of target records is 280,000.
The method to execute is as follows (a sketch of the insert-with-backoff step is shown after the list).
・Calendar API V3 Events insert
https://developers.google.com/google-apps/calendar/v3/reference/events/insert
・batch request
https://developers.google.com/google-apps/calendar/batch
・exponential backoff
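For reference, a hedged sketch of the insert step wrapped in exponential backoff, using google-api-python-client; `service` is assumed to be an authorised Calendar API client and `event` a valid event body:

    import random
    import time

    from googleapiclient.errors import HttpError

    def insert_with_backoff(service, calendar_id, event, max_retries=5):
        """Insert one event, retrying with exponential backoff on quota errors."""
        for attempt in range(max_retries):
            try:
                return service.events().insert(calendarId=calendar_id, body=event).execute()
            except HttpError as err:
                # Retry only on rate-limit / quota / transient server errors.
                if err.resp.status in (403, 429, 500, 503):
                    time.sleep(2 ** attempt + random.random())
                else:
                    raise
        raise RuntimeError("Insert still failing after retries")

Note that backoff only helps with short-term rate limits; if the daily quota itself is exhausted, retries will keep failing until the quota resets or is raised.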
This has been run many times for testing, and it used to complete without problems.
However, it currently returns the error "Calendar usage limits exceeded", a state in which it cannot be executed even once, and this has lasted for about a week.
I understand that the cause is described here:
https://support.google.com/a/answer/2905486?hl=en
The QPD (quota per day) had been raised to 2,000,000 by support.
Therefore, I think the quota problem should be solved.
However, I am currently in a state where I cannot execute the API even once when I run the program.
I want to resolve this situation.
I think it is probably necessary for Google to lift the restriction on their side.
Can I borrow your wisdom?
