Google Colab Download dataframe to csv appears to be losing data - python

I have imported a 14 GB .csv file from Google Drive into Google Colab and used pandas to sort it and also delete some columns and rows.
After deleting about a third of the rows and about half of the columns, df_edited.shape shows:
(27219355, 7)
To save the file, the best method I've been able to find is:
from google.colab import files
df_edited.to_csv('edited.csv')
files.download('edited.csv')
When I run this, after a long time (if it doesn't crash, which happens about 1 time out of 2), it opens a dialog box to save the file locally.
I accept the dialog and let the save finish. However, what was originally a 14 GB .csv file, which I probably cut down to about 7 GB, ends up as a .csv file of about 100 MB.
When I open the file locally it launches Excel, and I only see about 358,000 rows instead of the roughly 27 million there should be. I know Excel only shows a limited number of rows, but the fact that the .csv file has shrunk to about 100 MB suggests a lot of data has been lost in the download process.
Is there anything about the code above that would cause all this data to be lost, or what else could be causing it?
Thanks for any suggestions.
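One workaround worth trying (a sketch, not from the original thread; it assumes you are willing to mount Google Drive in the Colab session, and the path under MyDrive is a placeholder) is to write the CSV straight into Drive instead of pushing a multi-gigabyte file through the browser's files.download dialog, which appears to be the step that crashes or truncates:

from google.colab import drive
import pandas as pd

# Mount Google Drive into the Colab filesystem (prompts for authorization).
drive.mount('/content/drive')

# df_edited is the 27-million-row dataframe from the question; a small
# stand-in is used here so the sketch runs on its own.
df_edited = pd.DataFrame({'a': range(1000)})

# Write directly into Drive; the file can then sync to the local machine
# via the Drive client instead of an unreliable browser download.
df_edited.to_csv('/content/drive/MyDrive/edited.csv', index=False)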

Related

Writing dataframe to Excel takes extremely long

I got an Excel file from work which I amended using pandas. It has 735719 rows × 31 columns; I made the necessary changes and allocated them to a new dataframe. Now I need this dataframe in Excel format. I have checked in Jupyter notebooks that ont_dub works and shows a dataframe, so I use the following code, which I always use: ont_dub.to_excel("ont_dub 2019.xlsx")
Normally this only takes a few seconds, but it has now been 40 minutes and it is still running. Side note: I am working in a OneDrive folder from work, but that hasn't caused issues before. Hopefully someone can see the problem.
Usually, if you want to save such a large amount of data to a local folder, you don't use Excel. If I am not mistaken, Excel has a known limit on the number of cells it can display, and it wasn't built to display and query such massive amounts of data (you can use pandas for that). You can either use Feather files (a known fast-save alternative) or CSV files, which are built for this sole purpose.
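As a rough illustration of the suggestion above (a sketch, not from the original answer; the dataframe name ont_dub comes from the question, and a small stand-in is used so the snippet runs on its own), writing the same data to Feather or CSV instead of .xlsx:

import pandas as pd

# Small stand-in for the 735719 x 31 dataframe from the question.
ont_dub = pd.DataFrame({'col_a': range(1000), 'col_b': range(1000)})

# Feather (requires pyarrow) is a fast binary format that preserves dtypes;
# it expects a default index, hence the reset_index.
ont_dub.reset_index(drop=True).to_feather('ont_dub 2019.feather')

# CSV is plain text and readable anywhere, but larger and slower to parse.
ont_dub.to_csv('ont_dub 2019.csv', index=False)

# Reading them back later:
df_feather = pd.read_feather('ont_dub 2019.feather')
df_csv = pd.read_csv('ont_dub 2019.csv')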

Sync Issues Inputting Excel Data with Openpyxl with MS OneDrive

I have a script which scrapes some data from a few websites, and then inputs that data into an excel sheet in the form of a log. The problem I am having is that this excel file is regularly used by many other people within my company, and often someone will be in the file at the time. This is fine if I go in and have 'auto-save' turned on. Everything syncs together and people can make changes.
However, if I use my script to go into the file using openpyxl and input the data which was scraped, it almost always leads to a sync error when I open the file and a requirement to delete the updated version of the file.
Does anyone know a way around this?
Nothing complex in terms of the actual code:
#Put results in Log
ws.cell(column=1, row=newRowLocation, value='=DATEVALUE("' + yesterday + '")')
ws.cell(column=2, row=newRowLocation, value='NAME')
ws.cell(column=3, row=newRowLocation, value=int(SCRAPED_DATA))
wb.save(filename=THE_FILE)
wb.close()

Uploading a CSV file to Google Colab

So I have a 1.2 GB .csv file, and uploading it to Google Colab is taking over an hour. Is that normal, or am I doing something wrong?
Code:
import io
import pandas as pd
from google.colab import files

uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['IF 10 PERCENT.csv']), index_col=None)
Thanks.
files.upload is perhaps the slowest method to transfer data into Colab.
The fastest is syncing using Google Drive. Download the desktop sync client. Then, mount your Drive in Colab and you'll find the file there.
A middle ground that is faster than files.upload but still slower than Drive is to click the upload button in the file browser.
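For reference, mounting Drive and reading the synced file looks roughly like this (a minimal sketch; the path under MyDrive is an assumption about where the file ends up after syncing):

import pandas as pd
from google.colab import drive

# Mount Google Drive at /content/drive (prompts for authorization once).
drive.mount('/content/drive')

# Read the synced file directly from the Drive mount.
df = pd.read_csv('/content/drive/MyDrive/IF 10 PERCENT.csv')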
1.2 GB is a huge dataset, and if you upload a dataset this size it will take time, no question about it. I previously faced this same problem on one of my projects. There are multiple ways to handle it.
Solution 1:
Try to get your dataset into Google Drive and do your project in Google Colab. In Colab you can mount your Drive and just use the file path, and it works.
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('Enter file path')
Solution 2:
I believe you are using this dataset for a machine learning project, so for developing the initial model your first task is to check whether your model works at all. Open your CSV file in Excel, copy the first 500 thousand or 1,000 thousand rows, paste them into another Excel sheet to make a smaller dataset, and work with that. Once you find that everything works, upload your full dataset and train your model on it.
This technique is a little tedious because you have to take care of the EDA and feature engineering work again when you upload the entire 1.2 GB dataset. Apart from that, everything is fine and it works.
NOTE: This technique is very helpful when your first priority is running experiments, because loading a huge dataset before you start working is a very time-consuming process.
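If you would rather not do the sampling by hand in Excel, a rough equivalent (not from the original answer, just a sketch; it assumes the file is already reachable from the notebook, e.g. on a mounted Drive) is to let pandas read only the first chunk of rows with the nrows parameter:

import pandas as pd

# Read only the first 1,000 rows of the large CSV for quick experiments;
# the file name comes from the question, adjust the path as needed.
sample_df = pd.read_csv('IF 10 PERCENT.csv', nrows=1000)
print(sample_df.shape)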

How to use cache while working with a heavy Excel file for extracting data using Python

Hi, I have a rather huge Excel file (.xlsx) with multiple tabs that I need to access for various purposes. Every time I have to read from Excel it slows the process down. Is there any way I can load selected tabs into a cache the first time I read the workbook? Thanks.
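One hedged sketch of a cache (not an answer from the original thread; the file and sheet names are hypothetical, and it reuses the Feather format mentioned in an earlier answer, which needs pyarrow installed): parse each needed tab from the .xlsx once, save it as a Feather file, and read the Feather file on every later run.

import os
import pandas as pd

XLSX_PATH = 'big_workbook.xlsx'   # hypothetical workbook name
SHEETS = ['Sheet1', 'Sheet2']     # hypothetical tab names

def load_sheet(sheet_name):
    """Return one sheet as a DataFrame, using a Feather file as a cache."""
    cache_path = f'{sheet_name}.feather'
    if os.path.exists(cache_path):
        return pd.read_feather(cache_path)                 # fast path on later runs
    df = pd.read_excel(XLSX_PATH, sheet_name=sheet_name)   # slow .xlsx parse, first run only
    df.reset_index(drop=True).to_feather(cache_path)
    return df

dfs = {name: load_sheet(name) for name in SHEETS}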

Python, pandas.read_csv on large csv file with 10 Million rows from Google Drive file

I extracted a .csv file from Google BigQuery with 2 columns and 10 million rows.
I downloaded the file locally as a .csv of about 170 MB, then uploaded it to Google Drive, and I want to use the pandas.read_csv() function to read it into a pandas DataFrame in my Jupyter Notebook.
Here is the code I used, with the specific file ID that I want to read.
# read into pandasDF from .csv stored on Google Drive.
follow_network_df = pd.read_csv("https://drive.google.com/uc?export=download&id=1WqHWdgMVLPKVbFzIIprBBhe3I9faq4HA")
Here is what I got: it seems the 170 MB .csv file is read as an HTML page instead?
Whereas when I tried the same code with another .csv file of 40 MB, it worked perfectly:
# another csv file of 40Mb.
user_behavior_df = pd.read_csv("https://drive.google.com/uc?export=download&id=1NT3HZmrrbgUVBz5o6z_JwW5A5vRXOgJo")
Can anyone give me a hint about the root cause of the difference?
Any ideas on how to read a .csv file of 10 million rows and 170 MB from online storage? I know it's possible to read the 10 million rows into a pandas DataFrame using the BigQuery interface or from my local machine, but I have to include this as part of my submission, so reading from an online source is my only option.
The problem is that your first file is too large for Google Drive to scan for viruses, so there's a user prompt that gets displayed instead of the actual file. You can see this if you access the first file's link.
I'd say click through the user prompt and use the URL it leads to with pd.read_csv.
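The URL from the original answer isn't shown above, but for what it's worth, one common way around the virus-scan prompt (not part of the original answer; it uses the third-party gdown package, installable with pip install gdown, which follows Drive's confirmation page automatically) looks roughly like this:

import gdown
import pandas as pd

# File ID taken from the Drive link in the question.
file_id = '1WqHWdgMVLPKVbFzIIprBBhe3I9faq4HA'
gdown.download(f'https://drive.google.com/uc?id={file_id}', 'follow_network.csv', quiet=False)

follow_network_df = pd.read_csv('follow_network.csv')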
