I have hydrated several thousand tweets via the command line, resulting in multiple JSON files of several GB each (between 1 and 9 GB). When I go to read these files in Google Colab (in Python), pd.read_json works for files smaller than 3 GB; above this size it gives me a memory error and I am not able to load them. I attach a copy of the code I used:
from google.colab import drive
drive.mount('/content/drive')
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/drive/My Drive/Kaggle"
%cd /content/drive/My Drive/Kaggle
import pandas as pd
path = "hydrated_tweets_6.json"
df_json = pd.read_json(path, lines = True)
Could someone help me with some code structure for large JSON files? I have read other open questions and I have not been able to solve the problem.
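In case it helps, one common way around the memory error is to read the file in chunks rather than all at once: with lines=True, pd.read_json accepts a chunksize and returns an iterator of DataFrames. A minimal sketch, assuming newline-delimited JSON (the column names here are placeholders for whichever tweet fields you actually need):
import pandas as pd

path = "hydrated_tweets_6.json"
chunks = []
# With lines=True and chunksize set, read_json returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
for chunk in pd.read_json(path, lines=True, chunksize=100_000):
    # Placeholder columns: keep only the fields you actually need to save memory.
    chunks.append(chunk[["id", "full_text", "created_at"]])
df_json = pd.concat(chunks, ignore_index=True)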
How to get the creation date of a parquet file?
They gave me the parquet files and I have them stored in my Google Drive for testing. I am using Google Colab with Python and Pyspark. What would be the correct command?
Thanks
I haven't tried it on Colab, but on Databricks it works this way:
import os, time
path = '/local/file.parquet'  # on Databricks the address could be dbfs:/mnt/local, so you need to replace it with /dbfs/mnt/local/file.parquet
stat_info = os.stat(path)
print(stat_info)
print(stat_info.st_mtime)
print(time.ctime(stat_info.st_mtime))
Hope that works for you
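On Colab the same approach should work once Drive is mounted; a minimal sketch, assuming the parquet file sits at the top level of My Drive (the file name is a placeholder):
from google.colab import drive
import os, time

drive.mount('/content/drive')
# Placeholder path: point this at wherever the parquet file lives in your Drive.
path = '/content/drive/My Drive/file.parquet'
stat_info = os.stat(path)
print(time.ctime(stat_info.st_mtime))  # last modification time reported for the file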
Hi everyone. I use Colab with a big dataset for my task: about a 10 GB dataset divided into 50 parts (I use parquet), and there is only 1 GB of free space left. When I try to update my dataset_parts.parquet files on Google Drive, Drive creates a new version of each file. I don't need that, because I don't have the free space for it.
So, how can I update my files without version control?
# An example of my code
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
df = pd.read_parquet('/content/drive/MyDrive/file_0.parquet',columns=columns)
df['amnt']=df['coral']/10
df.to_parquet('/content/drive/MyDrive/file_0.parquet', compression='gzip')
I've built a simple app in Python, with a front-end UI in Dash.
It relies on three files:
a small dataframe, in pickle format, 95 KB
a large scipy sparse matrix, in NPZ format, 12 MB
a large scikit-learn KNN model, in joblib format, 65 MB
I have read in the first dataframe successfully by
link = 'https://github.com/user/project/raw/master/filteredDF.pkl'
df = pd.read_pickle(link)
But when I try this with the others, say, the model by:
mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'
filehandler = open(mLink, 'rb')
model_knn = pickle.load(filehandler)
I just get an error
Invalid argument: 'https://github.com/user/project/raw/master/knnModel45Percent.pkl'
I also pushed these files using Github LFS, but the same error occurs.
I understand that hosting large static files on github is bad practice, but I haven't been able to figure out how to use PyDrive or AWS S3 w/ my project. I just need these files to be read in by my project, and I plan to host the app on something like Heroku. I don't really need a full-on DB to store files.
The best case would be if I could read in these large files stored in my repo, but if there is a better approach, I am willing as well. I spent the past few days struggling through Dropbox, Amazon, and Google Cloud APIs and am a bit lost.
Any help appreciated, thank you.
Could you try the following?
from io import BytesIO
import pickle
import requests
mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model_knn = pickle.load(mfile)
Using BytesIO you create a file object out of the response that you get from GitHub. That object can then be used in pickle.load. Note that I have added ?raw=true to the URL of the request.
For those getting a KeyError: 10, try
model_knn = joblib.load(mfile)
instead of
model_knn = pickle.load(mfile)
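For completeness, a self-contained version of that joblib variant with its imports (same URL as in the answer above):
from io import BytesIO
import requests
import joblib

mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
# joblib.load accepts a file-like object, so the BytesIO wrapper works here as well.
mfile = BytesIO(requests.get(mLink).content)
model_knn = joblib.load(mfile)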
Hi, I am working on a data analysis project in Python where I have an XML file of around 2.8 GB, which is too large to open. I downloaded EmEditor, which helped me open the file. The problem is that when I try to load the file in Python on Google Colaboratory like this:
import xml.etree.ElementTree as ET
tree = ET.parse('dataset.xml')  # dataset.xml is the name of my file
root = tree.getroot()
I get the error No such file or directory: 'dataset.xml'. I have my dataset.xml file on my desktop and it can be opened using EmEditor, which gives me the idea that it can be edited and loaded via EmEditor, but I am not sure. I would appreciate your help with loading the data in Python on Google Colab.
Google Colab runs remotely on a computer from Google, and can't access files that are on your desktop.
To open the file in Python, you'll first need to transfer the file to your colab instance. There's multiple ways to do this, and you can find them here: https://colab.research.google.com/notebooks/io.ipynb
The easiest is probably this:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
Although keep in mind that every time you start a new Colab session you'll need to re-upload the file. This is because Google would like to use the computer for someone else when you are not using it, and thus wipes all the data on the computer.
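If re-uploading a 2.8 GB file every session gets tedious, mounting Google Drive (one of the other options in the linked notebook) avoids that; a minimal sketch, assuming dataset.xml has been copied to the top level of My Drive:
from google.colab import drive
import xml.etree.ElementTree as ET

drive.mount('/content/drive')
# Path assumes dataset.xml was copied to the top level of My Drive.
tree = ET.parse('/content/drive/My Drive/dataset.xml')
root = tree.getroot()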
So I have a 1.2 GB CSV file, and uploading it to Google Colab is taking over an hour. Is that normal, or am I doing something wrong?
Code:
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['IF 10 PERCENT.csv']), index_col=None)
Thanks.
files.upload is perhaps the slowest method to transfer data into Colab.
The fastest is syncing using Google Drive. Download the desktop sync client. Then, mount your Drive in Colab and you'll find the file there.
A middle ground that is faster than files.upload but still slower than Drive is to click the upload button in the file browser.
1.2 GB is a huge dataset, and if you upload such a huge dataset it takes time, no question at all. I previously worked on one of my projects and faced this same problem. There are multiple ways to handle it.
Solution 1:
Get your dataset into Google Drive and do your project in Google Colab. In Colab you can mount your Drive and just use the file path, and it works.
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('Enter file path')
Solution 2:
I believe that you are using this dataset for a machine learning project. For developing the initial model, your first task is to check whether your model works at all, so open your CSV file in Excel, copy the first 500 or 1,000 rows, paste them into another Excel sheet to make a small dataset, and work with that. Once you find that everything is working, upload your full dataset and train your model on it.
This technique is a little tedious because you have to redo the EDA and feature engineering work when you upload the entire 1.2 GB dataset. Apart from that, everything is fine and it works.
NOTE: This technique is very helpful when your first priority is running experiments, because loading a huge dataset and then starting to work is a very time-consuming process.
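As an alternative to trimming the file in Excel, pandas can read just the first rows directly via the nrows parameter of read_csv; a small sketch, reusing the file name from the question:
import pandas as pd

# nrows limits the read to the first 1,000 data rows, which is usually enough
# for a quick end-to-end check before training on the full 1.2 GB file.
df_small = pd.read_csv('IF 10 PERCENT.csv', nrows=1000)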