The loading works great using Jupyter and local files, but when I adapted it to Colab, fetching data from a Drive folder, datasets.DatasetFolder always loads around 9,500 datapoints, never the full 10,000. Has anyone had similar issues?
train_data = datasets.DatasetFolder('/content/drive/My Drive/4 - kaggle/data', np.load, list(('npy')))
print(train_data.__len__)
Returns
<bound method DatasetFolder.__len__ of Dataset DatasetFolder
Number of datapoints: 9554
Root Location: /content/drive/My Drive/4 - kaggle/data
Transforms (if any): None
Target Transforms (if any): None>
Locally I would usually get the full 10,000 elements.
Loading lots of files from a single folder in Drive is likely to be slow and error-prone. You'll probably end up much happier if you either stage the data on GCS, or upload an archive (.zip or .tar.gz) to Drive, copy that one file to your Colab VM, unarchive it there, and then run your code over the local data.
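A minimal sketch of the archive approach, assuming the data has been zipped up as data.zip in the Drive folder from the question (the paths and the ('.npy',) extension tuple are placeholders to adjust):

import shutil
import zipfile

import numpy as np
from google.colab import drive
from torchvision import datasets

drive.mount('/content/drive')

# Copy the single archive from Drive to the Colab VM's local disk.
shutil.copy('/content/drive/My Drive/4 - kaggle/data.zip', '/content/data.zip')

# Unpack it locally, then point DatasetFolder at the local copy.
with zipfile.ZipFile('/content/data.zip') as zf:
    zf.extractall('/content/data')

train_data = datasets.DatasetFolder('/content/data', np.load, ('.npy',))
print(len(train_data))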
Related
I have different SMB shares mounted on my system and I need the storage usage of each mount point. I am using psutil to get the data, but for some reason the numbers do not change across different folders.
For example, I have a folder /mnt/share_Mount1/ with no data inside share_Mount1:
psutil.disk_usage("/mnt/share_Mount1/")
sdiskusage(total=231496658944, used=124153004032, free=95560650752, percent=56.5)
Then another share, /mnt/share_Mount2. Here I have data: 4 files of 100 MB each. But when I query the storage here, it still shows the same values:
psutil.disk_usage("/mnt/share_Mount2/")
sdiskusage(total=231496658944, used=124153004032, free=95560650752, percent=56.5)
How can both be the same when there should be some difference here? Am I going wrong somewhere?
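One thing worth checking: psutil.disk_usage reports statistics for the whole filesystem backing the path you pass in, not the size of that directory's contents, so if a share isn't actually mounted (or both shares resolve to the same underlying volume) the numbers will be identical. A small diagnostic sketch, using the two paths from the question:

import os
import psutil

for path in ("/mnt/share_Mount1", "/mnt/share_Mount2"):
    # True only if the path is a real mount point rather than a plain
    # directory sitting on the root filesystem.
    print(path, "is a mount point:", os.path.ismount(path))

# List what is actually mounted and on which device, to see whether the
# two shares map to the same filesystem.
for part in psutil.disk_partitions(all=True):
    if "share_Mount" in part.mountpoint:
        print(part.device, part.mountpoint, part.fstype)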
I have a 1.2 GB CSV file, and uploading it to Google Colab is taking over an hour. Is that normal, or am I doing something wrong?
Code:
import io
import pandas as pd
from google.colab import files

uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['IF 10 PERCENT.csv']), index_col=None)
Thanks.
files.upload is perhaps the slowest method to transfer data into Colab.
The fastest is syncing using Google Drive. Download the desktop sync client. Then, mount your Drive in Colab and you'll find the file there.
A middle ground that is faster than files.upload but still slower than Drive is to click the upload button in the file browser.
1.2 GB is a huge dataset, and uploading something that size takes time, no question about it. I faced the same problem on one of my previous projects. There are multiple ways to handle it.
Solution 1:
Get your dataset into Google Drive and do your project in Google Colab. In Colab you can mount your Drive and just use the file path, and it works:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/<path to your file>.csv')
Solution 2:
I believe you are using this dataset for a machine learning project. For the initial model, your first task is just to check whether the model works at all, so open your CSV file in Excel, copy the first 500 or 1,000 rows into another sheet, and work with that small dataset. Once you find that everything works, upload your full dataset and train your model on it.
This technique is a little tedious because you have to redo the EDA and feature engineering work once you switch to the full 1.2 GB dataset, but apart from that it works fine.
NOTE: This technique is very helpful when your first priority is running experiments, because loading a huge dataset before you can even start working is a very time-consuming process.
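If you would rather not copy rows around in Excel, pandas can build the same small working set directly; a minimal sketch, assuming the file name from the question:

import pandas as pd

# Read only the first 1,000 rows to develop and debug the initial model.
df_small = pd.read_csv('IF 10 PERCENT.csv', nrows=1000)

# Once the pipeline works end to end, drop nrows and train on the full file.
# df_full = pd.read_csv('IF 10 PERCENT.csv')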
I have data on my local drive spread over a lot of files, and I want to access that data from Google Colab. Since it is spread widely and changes constantly, I don't want to use the upload() option, as it can get tedious and slow.
Uploading to Drive is also something I am trying to avoid, due to the changing data values.
So I was wondering if there is another method to access the local data, something similar to the code presented below.
import os

def list_files(dir):
    r = []
    # Walk the tree and collect the full path of every file found.
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r

train_path = list_files('/home/path/to/folder/containing/data/')
This does not seem to work, since Colab cannot access my local machine, so I always get an empty array (0,) returned from the function.
The short answer is: no, you can't. The long answer is: you can skip the uploading phase each time you restart the runtime. You just need to use the google.colab package in order to get behaviour similar to the local environment. Upload all the files you need to your Google Drive, then just import:
from google.colab import drive
drive.mount('/content/gdrive')
After the authentication step, you will be able to access all the files stored in your Google Drive. They appear just as you uploaded them, so you only have to modify the last line in this way:
train_path = list_files('gdrive/path/to/folder/containing/data/')
or in this way:
train_path = list_files('/content/gdrive/home/path/to/folder/containing/data/')
I have compressed files that are simply too big for me to extract to my local drive.
Thus, I would like to first extract the compressed files into some cloud storage provider, and then access them from Python with something like...
fl = open('Dropbox\\myfile.txt')
for line in fl:
    pass  # do something with each line
fl.close()
I am a little confused as to whether this can even be done using Dropbox or not. I know that I can extract the files to my local hard drive using 7-Zip and then use the above code to access them, but I believe that would use up all my hard drive space.
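For what it's worth, one possible route is Dropbox's official Python SDK, which lets you read the file's contents over the network without keeping an extracted copy on your hard drive. A rough sketch, assuming the dropbox package is installed; the access token and path are placeholders:

import dropbox

# Placeholder access token and Dropbox path.
dbx = dropbox.Dropbox('YOUR_ACCESS_TOKEN')
metadata, response = dbx.files_download('/myfile.txt')

# The whole file still travels over the network (and sits in memory here),
# but nothing has to be extracted onto the local drive first.
for line in response.content.decode('utf-8').splitlines():
    pass  # do something with each line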
Suppose I have a file hosted on GCS in a Python App Engine project. Unfortunately, the file structure is something like:
outer.zip/
  - inner.zip/
    - vid_file
    - png_file
The problem is that the two files inside inner.zip are missing their file extensions, and it's causing all sorts of trouble. How do I rename the files so that the structure looks like this:
outer.zip/
  - inner.zip/
    - vid_file.mp4
    - png_file.png
so that the files inside inner.zip have their extensions?
I keep running into all sorts of limitations, since GCS doesn't allow file renaming, unarchiving, etc.
The files aren't terribly large.
P.S. I'm not very familiar with Python, so any code examples would be greatly appreciated, thanks!
There is absolutely no way to perform any alteration to GCS objects -- full stop. They are exactly the bunch of bytes you decided on at their birth (uninterpreted by GCS itself), and that is how they will stay.
The best you can do is create a new object which is almost like the original, except that it fixes the little errors and oopses you made when creating the original. Then you can overwrite (i.e., completely replace) the original with the new, improved version.
Hopefully it's a one-off mistake that you made just once and now want to fix, so it isn't worth writing a program for. Just download that GCS object, use normal tools to unzip it (and any further zipfiles it contains), make the fixes with your favorite local filesystem tools, zip everything up again, and upload/rewrite the final zip to your desired new GCS object -- phew, you're done.
Alex is right that objects are immutable, i.e., there is no editing in place. The only way to accomplish what you're talking about is to download the current file, unzip it, update the files, re-zip them into a file with the same name, and upload it to GCS. GCS object overwrites are transactional, so the old content remains visible until the instant the upload completes. Doing it this way is obviously not very network-efficient, but at least it wouldn't leave periods of time when the object is invisible (as deleting and re-uploading would).
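A rough sketch of that download / fix / re-upload cycle, using the google-cloud-storage client library; the bucket and object names are placeholders, and the actual fixing of the archive is left to local tools (or to the zipfile sketch at the end of this thread):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')  # placeholder bucket name
blob = bucket.blob('outer.zip')      # placeholder object name

# 1. Download the current object to the local filesystem.
blob.download_to_filename('/tmp/outer.zip')

# 2. Unzip, rename the files inside inner.zip, and re-zip /tmp/outer.zip
#    locally with your usual tools.

# 3. Upload over the same object name. The overwrite is transactional, so
#    readers keep seeing the old content until the upload completes.
blob.upload_from_filename('/tmp/outer.zip')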
"Import zipfile" and you can unzip the file once it's downloaded into gcs storage.
I have code doing exactly this on a nightly basis from a cron job.
I've never tried creating a zip file with GAE, but the docs say you can do it.
https://docs.python.org/2/library/zipfile.html
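Since the files aren't terribly large, the renaming itself can be done in memory with zipfile; a minimal sketch, assuming outer.zip has already been downloaded locally and inner.zip contains the two extension-less files from the question:

import io
import zipfile

# Pull inner.zip out of the downloaded outer archive.
with zipfile.ZipFile('/tmp/outer.zip') as outer:
    inner_bytes = outer.read('inner.zip')

# Rewrite inner.zip with the missing extensions added to its members.
renames = {'vid_file': 'vid_file.mp4', 'png_file': 'png_file.png'}
fixed_inner = io.BytesIO()
with zipfile.ZipFile(io.BytesIO(inner_bytes)) as src, \
        zipfile.ZipFile(fixed_inner, 'w', zipfile.ZIP_DEFLATED) as dst:
    for name in src.namelist():
        dst.writestr(renames.get(name, name), src.read(name))

# Rebuild outer.zip around the fixed inner archive, ready to re-upload.
with zipfile.ZipFile('/tmp/outer_fixed.zip', 'w', zipfile.ZIP_DEFLATED) as fixed_outer:
    fixed_outer.writestr('inner.zip', fixed_inner.getvalue())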