Hi everyone. I use Colab and a big dataset for my task: about 10 GB split into 50 parts (I use Parquet), and I have only 1 GB of free space left. When I try to update my dataset_parts.parquet files on Google Drive, Drive creates a new version of each file. I don't want that, because I don't have the free space for it.
So, how can I update my files without version control?
# An example of my code:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

df = pd.read_parquet('/content/drive/MyDrive/file_0.parquet', columns=columns)
df['amnt'] = df['coral'] / 10
df.to_parquet('/content/drive/MyDrive/file_0.parquet', compression='gzip')
I have hydrated several thousand tweets via the command line, resulting in multiple JSON files of several GB each (between 1 and 9 GB). When I read these files in Google Colab (in Python), pd.read_json works for files smaller than 3 GB; above that size it gives me a memory error and I am not able to load them. I attach a copy of the code used:
from google.colab import drive
drive.mount('/content/drive')
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/drive/My Drive/Kaggle"
%cd /content/drive/My Drive/Kaggle
import pandas as pd
path = "hydrated_tweets_6.json"
df_json = pd.read_json(path, lines = True)
Could someone help me with a code structure for large JSON files? I have read other open questions but have not been able to solve the problem.
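One common way around the memory error with line-delimited JSON is to read the file in chunks and keep only the columns you actually need, instead of loading everything at once. A minimal sketch, assuming line-delimited JSON as in the question; the chunk size and the column names are placeholders, not taken from the original data:

import pandas as pd

path = "hydrated_tweets_6.json"
needed_cols = ["id", "full_text"]  # hypothetical column names; keep only what you need

chunks = []
# lines=True with chunksize makes read_json return an iterator of smaller DataFrames
for chunk in pd.read_json(path, lines=True, chunksize=100_000):
    chunks.append(chunk[needed_cols])

df_json = pd.concat(chunks, ignore_index=True)

If even the reduced frame is too large, the same loop can write each processed chunk back to disk (for example as Parquet) instead of concatenating in memory.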
I have a custom tokenizer and want to use it for prediction in Production API. How do I save/download the tokenizer?
This is my code trying to save it:
import pickle
from tensorflow.python.lib.io import file_io
with file_io.FileIO('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
No error, but I can't find the tokenizer after saving it. So I assume the code didn't work?
Here is the situation, using a simple file to disentangle the issue from irrelevant specificities like pickle, Tensorflow, and tokenizers:
# Run in a new Colab notebook:
%pwd
/content
%ls
sample_data/
Let's save a simple file foo.npy:
import numpy as np
np.save('foo', np.array([1,2,3]))
%ls
foo.npy sample_data/
At this stage, %ls should show tokenizer.pickle in your case instead of foo.npy.
Now, Google Drive & Colab do not communicate by default; you have to mount the drive first (it will ask for authorization):
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
After which, an %ls command will give:
%ls
drive/ foo.npy sample_data/
and you can now navigate (and save) inside drive/ (i.e. actually in your Google Drive), changing the path accordingly. Anything saved there can be retrieved later.
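For the tokenizer in the question, that means writing to a path under the mounted drive. A minimal sketch, assuming the tokenizer object from the question and an example MyDrive location (adjust the sub-path to your own Drive layout):

import pickle

# Save directly into the mounted Drive so the file survives the Colab VM being recycled
drive_path = '/content/drive/MyDrive/tokenizer.pickle'  # example path, not from the original
with open(drive_path, 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)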
How to get the creation date of a parquet file?
I was given the Parquet files and have them stored in my Google Drive for testing. I am using Google Colab with Python and PySpark. What would be the correct command?
Thanks
I haven't tried this on Colab, but on Databricks it works this way:
import os, time

path = '/local/file.parquet'  # on Databricks, a dbfs:/mnt/... path has to be written as /dbfs/mnt/.../file.parquet
stat_info = os.stat(path)
print(stat_info)
print(stat_info.st_mtime)              # last modification time as a Unix timestamp
print(time.ctime(stat_info.st_mtime))  # the same timestamp in human-readable form
Hope that works for you
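Note that st_mtime is the last modification time, not a true creation time; as far as I know the Parquet footer records which library wrote the file, but not when it was written. A small sketch using pyarrow (not part of the original answer) to inspect that footer metadata alongside the filesystem timestamp; the path is just an example:

import os, time
import pyarrow.parquet as pq

path = '/content/drive/MyDrive/file.parquet'  # example path, adjust to your Drive layout

# Footer metadata: row/column counts and the writer library (created_by),
# but no creation date, because the format does not store one.
print(pq.ParquetFile(path).metadata)

# The filesystem modification time is usually the closest available proxy.
print(time.ctime(os.stat(path).st_mtime))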
So I have a 1.2 GB CSV file, and uploading it to Google Colab is taking over an hour. Is that normal, or am I doing something wrong?
Code:
import io
import pandas as pd
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['IF 10 PERCENT.csv']), index_col=None)
Thanks.
files.upload is perhaps the slowest method to transfer data into Colab.
The fastest is syncing using Google Drive. Download the desktop sync client. Then, mount your Drive in Colab and you'll find the file there.
A middle ground that is faster than files.upload but still slower than Drive is to click the upload button in the file browser.
A 1.2 GB dataset is big, and uploading something that size simply takes time, no question about it. I faced the same problem on one of my earlier projects. There are multiple ways to handle it.
Solution 1:
Put your dataset on Google Drive and do your project in Google Colab. In Colab you can mount your Drive and just use the file path, and it works:

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
df = pd.read_csv('Enter file path')  # e.g. '/content/drive/MyDrive/IF 10 PERCENT.csv'
Solution 2:
I assume you are using this dataset for a machine learning project. For the initial model, your first task is just to check whether your pipeline works at all, so open your CSV file in Excel, copy the first 500 or 1,000 rows into another sheet, and work with that small dataset. Once you see that everything works, upload your full dataset and train the model on it.
This technique is a little tedious, because you have to redo the EDA and feature-engineering steps when you switch to the full 1.2 GB dataset. Apart from that, it works fine.
NOTE: This technique is very helpful when your first priority is running experiments, because loading a huge dataset before you can start working is a very time-consuming process.
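As a quicker alternative to copying rows in Excel, pandas can take the sample itself. A minimal sketch, assuming the file from the question is already on a mounted Drive (the path prefix and the 1,000-row cut-off are placeholder choices):

import pandas as pd

# Read only the first 1,000 rows for fast experimentation;
# drop nrows once the pipeline works and you want the full dataset.
sample_df = pd.read_csv('/content/drive/MyDrive/IF 10 PERCENT.csv', nrows=1000)
print(sample_df.shape)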
The loading works great using Jupyter and local files, but when I adapted it to Colab, fetching the data from a Drive folder, datasets.DatasetFolder always loads around 9,500 data points, never the full 10,000. Has anyone had similar issues?
train_data = datasets.DatasetFolder('/content/drive/My Drive/4 - kaggle/data', np.load, list(('npy')) )
print(train_data.__len__)
Returns
<bound method DatasetFolder.__len__ of Dataset DatasetFolder
Number of datapoints: 9554
Root Location: /content/drive/My Drive/4 - kaggle/data
Transforms (if any): None
Target Transforms (if any): None>
Whereas I would usually get the full 10,000 elements.
Loading lots of files from a single folder in Drive is likely to be slow and error-prone. You'll probably end up much happier if you either stage the data on GCS, or upload an archive (.zip or .tar.gz) to Drive, copy that one file to your Colab VM, unarchive it there, and then run your code over the local data.
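A minimal sketch of the archive approach, assuming a hypothetical data.zip has been uploaded to the same Drive folder (all names and paths here are placeholders):

from google.colab import drive
import shutil, zipfile

drive.mount('/content/drive')

# Copy the single archive from Drive to the local VM disk (one large transfer
# instead of thousands of small per-file reads), then extract it locally.
shutil.copy('/content/drive/My Drive/4 - kaggle/data.zip', '/content/data.zip')
with zipfile.ZipFile('/content/data.zip') as zf:
    zf.extractall('/content/data')

# Point datasets.DatasetFolder at the fast local copy ('/content/data') instead of the mounted Drive.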