Uploading a CSV file to Google Colab - python

So I have a 1.2 GB CSV file, and uploading it to Google Colab is taking over an hour. Is that normal, or am I doing something wrong?
Code:
import io
import pandas as pd
from google.colab import files

uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['IF 10 PERCENT.csv']), index_col=None)
Thanks.

files.upload is perhaps the slowest method to transfer data into Colab.
The fastest is syncing using Google Drive. Download the desktop sync client. Then, mount your Drive in Colab and you'll find the file there.
A middle ground that is faster than files.upload but still slower than Drive is to click the upload button in the file browser.
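A minimal sketch of the Drive route, assuming the synced file ends up under MyDrive (adjust the path to wherever the sync client put it):

from google.colab import drive
import pandas as pd

# Mount Google Drive into the Colab filesystem (prompts for authorization).
drive.mount('/content/drive')

# Illustrative path -- substitute the actual location of the synced CSV.
df = pd.read_csv('/content/drive/MyDrive/IF 10 PERCENT.csv', index_col=None)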

1.2 GB is a large dataset, and uploading something that size simply takes time. I previously faced the same problem on one of my projects. There are multiple ways to handle it.
Solution 1:
Get your dataset into Google Drive and do the project in Google Colab. In Colab you can mount your Drive and then just use the file path, and it works.
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')
df = pd.read_csv('Enter file path')  # path to the CSV inside /content/drive
Solution 2:
I believe you are using this dataset for a machine learning project. For the initial model, your first task is just to check whether the model works at all. So open your CSV file in Excel, copy the first 500 or 1,000 rows into another sheet, save that as a small dataset, and work with it. Once you find that everything works, upload the full dataset and train your model on it.
This technique is a little tedious because you have to redo the EDA and feature engineering once you upload the entire 1.2 GB dataset. Apart from that, it works fine.
NOTE: This technique is very helpful when your first priority is running experiments, because loading a huge dataset before you can even start working is a very time-consuming process.
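If you would rather not copy rows around in Excel, pandas can produce the same small sample for you: the nrows argument of read_csv (a standard pandas option, offered here as an alternative to the Excel step above) reads only the first N rows.

import pandas as pd

# Prototype on the first 1,000 rows only.
df_small = pd.read_csv('IF 10 PERCENT.csv', nrows=1000)

# Once the pipeline works end to end, drop nrows and read the full file.
df_full = pd.read_csv('IF 10 PERCENT.csv')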

Related

How to update a Google Drive file in Colab without version history?

Hi everyone. I use Colab with a big dataset for my task: about 10 GB, divided into 50 parts (I use Parquet), and I have about 1 GB of free space left. When I try to update my dataset_parts.parquet files on Google Drive, Drive creates a new version of each file. I don't need that, because I don't have the free space for it.
So, how can I update my files without version control?
# An example of my code
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
df = pd.read_parquet('/content/drive/MyDrive/file_0.parquet', columns=columns)
df['amnt'] = df['coral'] / 10
df.to_parquet('/content/drive/MyDrive/file_0.parquet', compression='gzip')

Best way to read and write data on Google Colab?

I'm making an application that shows the correlation between your daily habits and your mood. Because Python has so many of the components I need and I want this to be web-based (also, I'm not worried about the front end right now), I'm leaning towards using Colab.
The problem is the session storage. I know how to work with pre-existing data, but I'm totally unfamiliar with storing collected data with Python. It's a lightweight app and I'd like to use the pandas library to visualize the data.
The point is: I don't know how I should store the data that will be input on a daily basis in Colab for future use. Of course, every time I run the Colab notebook, the data collected so far will be cleared. What's the best way to store data from each use of Colab? Can I create a CSV file on my Google Drive and read/write to that file, and if so, what's the best method?
If Colab seems like a bad option, I'll use JavaScript to collect the data and d3.js to visualize it, but I'd like to stick to Colab if I can so I don't have to stand up my own webpage.
Since you want it to be web-based, you could use the Heroku Student Plan (with GitHub Education) or PythonAnywhere, because a Colab session stops after about 12 hours and it is a headache to start it again.
If you still want to use Colab, one way is to save the data into a file and keep it in Google Drive. In this case:
Saving the data can be automated, but you'll need to get an access token for Google Drive every session. Check the Example I/O notebook.
Other methods are generally inconvenient.
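A minimal sketch of that Drive-backed approach, assuming the notebook is mounted and the daily entries live in one CSV (the file name and columns below are made up for illustration):

from google.colab import drive
import pandas as pd
import os

drive.mount('/content/drive')

LOG_PATH = '/content/drive/MyDrive/habit_mood_log.csv'  # hypothetical file name

def append_entry(habits, mood):
    # Append one day's record; write the header only if the file is new.
    row = pd.DataFrame([{'date': pd.Timestamp.today().date(),
                         'habits': habits, 'mood': mood}])
    row.to_csv(LOG_PATH, mode='a', header=not os.path.exists(LOG_PATH), index=False)

append_entry('exercise, 7h sleep', mood=8)
df = pd.read_csv(LOG_PATH)  # read it back in a later session to visualize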

How to load a large XML dataset file in Python?

Hi, I am working on a data analysis project in Python where I have an XML file of around 2.8 GB, which is too large to open normally. I downloaded EmEditor, which helped me open the file. The problem is that when I try to load the file in Python on Google Colaboratory like this:
import xml.etree.ElementTree as ET

tree = ET.parse('dataset.xml')  # dataset.xml is the name of my file
root = tree.getroot()
I get the error No such file or directory: 'dataset.xml'. I have the dataset.xml file on my desktop and it can be opened using EmEditor, which gives me the idea that it can be edited and loaded from there, but I don't know how. I would appreciate your help with loading the data in Python on Google Colab.
Google Colab runs remotely on a computer from Google, and can't access files that are on your desktop.
To open the file in Python, you'll first need to transfer it to your Colab instance. There are multiple ways to do this, and you can find them here: https://colab.research.google.com/notebooks/io.ipynb
The easiest is probably this:
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
Keep in mind, though, that every time you start a new Colab session you'll need to re-upload the file. This is because Google would like to use the machine for someone else when you are not using it, and thus wipes all the data on it.
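Once the upload finishes, the returned bytes can be fed straight to ElementTree without writing them to disk first; a small sketch, assuming the file was uploaded as dataset.xml:

import io
import xml.etree.ElementTree as ET

# 'uploaded' is the dict returned by files.upload() above.
tree = ET.parse(io.BytesIO(uploaded['dataset.xml']))
root = tree.getroot()
print(root.tag, len(root))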

Google Colab Download dataframe to csv appears to be losing data

I have imported a 14 GB .csv file from Google Drive into Google Colab and used pandas to sort it and also to delete some columns and rows.
After deleting about a third of the rows and about half the columns of data, df_edited.shape shows:
(27219355, 7)
To save the file, the best method I've been able to find is:
from google.colab import files
df_edited.to_csv('edited.csv')
files.download('edited.csv')
When I run this, after a long time (if it doesn't crash, which happens about one time in two), it opens a dialog box to save the file locally.
I then confirm the save and let it finish. However, what was originally a 14 GB .csv file, which I probably cut roughly in half to about 7 GB, ends up as a CSV file of only about 100 MB.
When I open the file locally it launches Excel, and I only see about 358,000 rows instead of the roughly 27 million there should be. I know Excel only shows a limited number of rows, but the fact that the CSV file has shrunk to about 100 MB suggests a lot of data was lost in the download process.
Is there anything about the code above that would cause all this data to be lost?
Or what else could be causing it?
Thanks for any suggestions.
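One quick way to narrow this down (a sketch, using df_edited from the code above) is to check the size and row count of edited.csv on the Colab VM before calling files.download; if those numbers are already too small, the loss is happening in to_csv, otherwise in the browser download:

import os

df_edited.to_csv('edited.csv')

# Inspect the file on the Colab VM before downloading it.
print('size on disk (GB):', os.path.getsize('edited.csv') / 1e9)
with open('edited.csv') as f:
    print('line count:', sum(1 for _ in f))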

Issues loading from Drive with PyTorch's datasets.DatasetFolder

The loading works great using Jupyter and local files, but when I adapted it to Colab, fetching data from a Drive folder, datasets.DatasetFolder always loads roughly 9,500 datapoints, never the full 10,000. Has anyone had similar issues?
train_data = datasets.DatasetFolder('/content/drive/My Drive/4 - kaggle/data', np.load, list(('npy')) )
print(train_data.__len__)
Returns
<bound method DatasetFolder.__len__ of Dataset DatasetFolder
Number of datapoints: 9554
Root Location: /content/drive/My Drive/4 - kaggle/data
Transforms (if any): None
Target Transforms (if any): None>
Whereas I would usually get the full 10,000 elements.
Loading lots of files from a single folder in Drive is likely to be slow and error-prone. You'll probably end up much happier if you either stage the data on GCS or upload an archive (.zip or .tar.gz) to Drive and copy that one file to your colab VM, unarchive it there, and then run your code over the local data.
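A minimal sketch of that archive approach, with the archive name and paths assumed for illustration:

from google.colab import drive
import shutil
import zipfile

drive.mount('/content/drive')

# Copy the single archive from Drive to the Colab VM's local disk.
shutil.copy('/content/drive/My Drive/4 - kaggle/data.zip', '/content/data.zip')

# Unpack it locally, then point DatasetFolder at the local directory.
with zipfile.ZipFile('/content/data.zip') as zf:
    zf.extractall('/content/data')

# e.g. datasets.DatasetFolder('/content/data', np.load, ['npy'])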
