I am working on a project where I have an .rds file containing a trained model, generated by R code, as per my requirements.
Now I need to load the trained model in Python and use it to process records.
Is there any way to do so? If not, what are the alternatives?
Thanks
You can use feather, a fast on-disk format for exchanging dataframes between R and Python (note that it stores data frames, not fitted model objects):
import feather

path = 'my_data.feather'

# write a pandas DataFrame to a feather file (readable from R via the feather package)
feather.write_dataframe(df, path)

# read it back into a pandas DataFrame
df = feather.read_dataframe(path)
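If you need the fitted R model itself rather than just the data, one possible alternative is to call R from Python with rpy2. This is only a minimal sketch, assuming rpy2 and a local R installation are available; the path 'model.rds' and the new_records dataframe are hypothetical names:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # automatic pandas <-> R data.frame conversion

# load the model object that was saved with saveRDS() on the R side
read_rds = robjects.r['readRDS']
model = read_rds('model.rds')  # hypothetical path

# score new records (a pandas DataFrame) with R's generic predict()
r_predict = robjects.r['predict']
predictions = r_predict(model, new_records)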
I'm making an AI web page using Django and TensorFlow, and I wonder how to add a .h5 file to my Django project.
I am writing all the code in the views.py file,
but I want to use a pre-trained model,
not online learning in the web page.
Yes, you can use a .h5 file in Django. You can use h5py for operations on .h5 files. Example:
import h5py

filename = "filename.h5"
h5 = h5py.File(filename, 'r')  # open the HDF5 file read-only
# logic
...
h5.close()
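Since the goal here is a pre-trained TensorFlow model rather than raw HDF5 access, here is a minimal sketch of loading a Keras .h5 model once and reusing it in a view. It assumes the file was saved with model.save('model.h5'); the path, view name, and input features are hypothetical:
from django.http import JsonResponse
from tensorflow.keras.models import load_model
import numpy as np

# load the model once at module import time so every request reuses it
model = load_model('model.h5')  # hypothetical path

def predict_view(request):
    # hypothetical input; in practice, parse the features from the request
    features = np.array([[0.1, 0.2, 0.3]])
    prediction = model.predict(features)
    return JsonResponse({'prediction': prediction.tolist()})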
I am trying to download a pickle file from a web-based API via the requests.post(url) function in Python. I was able to download the pickle file and load it into a dataframe, but I had to save it locally before loading it. I wanted to check if there is a way to load the pickle file directly into the dataframe without having to save it locally. I was able to do it for CSV files (as seen below) but not for pickle files:
import io
import requests
import pandas as pd

r = requests.post(url)
data = r.content.decode()  # response body as text
df = pd.read_csv(io.StringIO(data), header=0, engine=None)
Any help is appreciated, thanks.
Just a guess at something that might work for you, since it looks like the pickle file contains CSV-like text data:
df = pd.read_csv(io.StringIO(pd.read_pickle(url)), header=0, engine=None)
Thanks for your suggestion. I ended up using the following and it worked:
r = requests.post(url)
data = r.content  # raw bytes of the pickled dataframe
df = pd.read_pickle(io.BytesIO(data))
I am looking to use a FastText model in an ML pipeline that I built; the model is saved as a .bin file on S3. My hope is to keep all of this in a cloud-based pipeline, so I don't want local files. I feel like I am really close, but I can't figure out how to make a temporary .bin file. I am also not sure whether I am saving and reading the FastText model in the most efficient way. The code below works, but it saves the file locally, which I want to avoid.
import smart_open
import fasttext

# stream the .bin model bytes from S3
file = smart_open.smart_open(f's3://{bucket_name}/{key}')
listed = b''.join([i for i in file])

# write the bytes to a local file, then load it (this is the local save I want to avoid)
with open("ml_model.bin", "wb") as binary_file:
    binary_file.write(listed)
model = fasttext.load_model("ml_model.bin")
If you want to use the fasttext wrapper for the official Facebook FastText code, you may need to create a local temporary copy - your troubles make it seem like that code relies on opening a local file path.
You could also try the Gensim package's separate FastText support, which should accept an S3 path via its load_facebook_model() function:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
(Note, though, that Gensim doesn't support all FastText functionality, like the supervised mode.)
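A minimal sketch of that Gensim route, assuming an unsupervised FastText .bin and that smart_open's S3 support is installed (the bucket/key placeholders match the question's code):
from gensim.models.fasttext import load_facebook_model

# Gensim opens the path with smart_open under the hood, so an S3 URI should work
model = load_facebook_model(f's3://{bucket_name}/{key}')

# query a word vector from the loaded model
vector = model.wv['example']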
As partially answered by the above response, a temporary file was needed. On top of that, the temporary file path needed to be passed as a string object, which is sort of strange. Working code below:
import tempfile
import fasttext
import smart_open
from pathlib import Path

# stream the model bytes from S3
file = smart_open.smart_open(f's3://{bucket_name}/{key}')
listed = b''.join([i for i in file])

# write the bytes to a temporary file and load the model before the directory is cleaned up
with tempfile.TemporaryDirectory() as tdir:
    tfile = Path(tdir).joinpath('tempfile.bin')
    tfile.write_bytes(listed)
    model = fasttext.load_model(str(tfile))
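As a quick sanity check after loading (these calls are from the fasttext Python API; for a supervised model you would call model.predict instead):
print(model.get_dimension())               # embedding dimensionality
vector = model.get_word_vector("example")  # vector for an arbitrary word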
I've built a simple app in Python, with a front-end UI in Dash.
It relies on three files:
a small dataframe, in pickle format, 95 KB
a large scipy sparse matrix, in NPZ format, 12 MB
a large scikit-learn KNN model, in joblib format, 65 MB
I have read in the first dataframe successfully with:
link = 'https://github.com/user/project/raw/master/filteredDF.pkl'
df = pd.read_pickle(link)
But when I try this with the others, say the model, by:
mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'
filehandler = open(mLink, 'rb')
model_knn = pickle.load(filehandler)
I just get an error
Invalid argument: 'https://github.com/user/project/raw/master/knnModel45Percent.pkl'
I also pushed these files using GitHub LFS, but the same error occurs.
I understand that hosting large static files on GitHub is bad practice, but I haven't been able to figure out how to use PyDrive or AWS S3 with my project. I just need these files to be read in by my project, and I plan to host the app on something like Heroku. I don't really need a full-on DB to store files.
The best case would be if I could read in these large files stored in my repo, but if there is a better approach, I am open to that as well. I spent the past few days struggling through the Dropbox, Amazon, and Google Cloud APIs and am a bit lost.
Any help appreciated, thank you.
Could you try the following?
from io import BytesIO
import pickle
import requests
mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model_knn = pickle.load(mfile)
Using BytesIO, you create a file object out of the response that you get from GitHub. That object can then be used in pickle.load. Note that I have added ?raw=true to the URL of the request.
For anyone getting the KeyError 10, try
model_knn = joblib.load(mfile)
instead of
model_knn = pickle.load(mfile)
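A self-contained sketch of that joblib variant, assuming the model on GitHub was serialized with joblib.dump (the URL is the placeholder one from the question):
from io import BytesIO
import joblib
import requests

mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'  # placeholder URL
mfile = BytesIO(requests.get(mLink).content)

# joblib.load accepts a file-like object, so no local copy is needed
model_knn = joblib.load(mfile)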
I've read about 'pickle'-ing from the pickle library, but does that only save models you've trained and not the actual dataframe you've loaded into a variable from a massive csv file for instance?
This example notebook has some examples of different ways to save and load data.
You can actually use pickle to save any Python object, including pandas dataframes; however, it's more usual to serialize with one of pandas' own methods such as pandas.DataFrame.to_csv, to_feather, etc.
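For example, a quick sketch of saving and reloading a dataframe with pandas' own pickle helpers (the file name is arbitrary):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# serialize the dataframe to disk and load it back
df.to_pickle('/tmp/my_dataframe.pkl')
restored = pd.read_pickle('/tmp/my_dataframe.pkl')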
I would probably recommend the option that uses the GCS command-line tool, which you can run from inside your notebook by prefixing the command with !
import pandas as pd
# Create a local file to upload.
df = pd.DataFrame([1,2,3])
df.to_csv("/tmp/to_upload.txt")
# Copy the file to our new bucket.
# Full reference: https://cloud.google.com/storage/docs/gsutil/commands/cp
!gsutil cp /tmp/to_upload.txt gs://my-bucket/
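And, to get the data back in a later session, a sketch using the same gsutil command in the opposite direction (the bucket name is the placeholder from above; index_col=0 skips the index column that to_csv wrote):
# copy the file back down from the bucket and load it
!gsutil cp gs://my-bucket/to_upload.txt /tmp/downloaded.txt
df = pd.read_csv("/tmp/downloaded.txt", index_col=0)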