Why can't I read a joblib file from my GitHub repo?

I've built a simple app in Python, with a front-end UI in Dash.
It relies on three files:
a small dataframe in pickle format (95 KB),
a large scipy sparse matrix in NPZ format (12 MB),
a large scikit-learn KNN model in joblib format (65 MB).
I have read in the first dataframe successfully with:
link = 'https://github.com/user/project/raw/master/filteredDF.pkl'
df = pd.read_pickle(link)
But when I try the same with the others, say the model:
mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'
filehandler = open(mLink, 'rb')
model_knn = pickle.load(filehandler)
I just get an error
Invalid argument: 'https://github.com/user/project/raw/master/knnModel45Percent.pkl'
I also pushed these files using GitHub LFS, but the same error occurs.
I understand that hosting large static files on GitHub is bad practice, but I haven't been able to figure out how to use PyDrive or AWS S3 with my project. I just need these files to be read in by my project, and I plan to host the app on something like Heroku. I don't really need a full-on DB to store files.
The best case would be if I could read these large files straight from my repo, but if there is a better approach I am open to that as well. I spent the past few days struggling through the Dropbox, Amazon, and Google Cloud APIs and am a bit lost.
Any help appreciated, thank you.

Could you try the following?
from io import BytesIO
import pickle
import requests
mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model_knn = pickle.load(mfile)
Using BytesIO you create a file object out of the response that you get from GitHub. That object can then be used in pickle.load. Note that I have added ?raw=true to the URL of the request.

For those getting a KeyError: 10, try
model_knn = joblib.load(mfile)
instead of
model_knn = pickle.load(mfile)
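Putting the joblib case together, here is a minimal sketch; the URL below is a placeholder for your own raw GitHub link:
from io import BytesIO
import joblib
import requests

mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'  # placeholder URL, replace with your own
response = requests.get(mLink)
response.raise_for_status()  # fail early if the download did not succeed
model_knn = joblib.load(BytesIO(response.content))  # joblib reads from the in-memory file object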

Related

How to read a big dataset into a Pandas dataframe?

I have several files (11) already registered as datasets (mltable) in Azure ML Studio. Loading them into dataframes works in all cases except one. I believe the reason is the size: 1.95 GB. How can I load this dataset into a dataframe? So far I have not managed to load it at all.
Any tips on how to do it efficiently? I tried to figure out a way to do it in parallel with modin but failed. Below is the load script.
from azureml.core import Workspace, Dataset

subscription_id = 'xyz'
resource_group = 'rg-personal'
workspace_name = 'test'

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='buses')
df = dataset.to_pandas_dataframe()  # this is the step that struggles with the 1.95 GB file
You can load the data using an AzureML long-form datastore URI directly into Pandas.
Ensure you have the azureml-fsspec Python library installed:
pip install azureml-fsspec
Next, just load the data:
import pandas as pd
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
As this uses the AzureML datastore, it will automatically handle authentication for you without exposing SAS keys in the URI. Authentication can be either identity-based (i.e. your AAD identity is passed through to storage) or credential-based.
AzureML datastore URIs are an implementation of the filesystem spec (fsspec): a unified Pythonic interface to local, remote and embedded file systems and byte storage.
This implementation leverages the AzureML data runtime: a fast and efficient engine to materialize the data into a Pandas or Spark dataframe. The engine is written in Rust, which is known for high speed and high memory efficiency for data processing tasks.
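If the 1.95 GB file is still too much to materialize in one read, a sketch of a chunked approach follows; it assumes the same placeholder URI as above and that azureml-fsspec is installed:
import pandas as pd

uri = "azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv"

# stream the CSV in 500,000-row chunks instead of loading it in a single read
chunks = pd.read_csv(uri, chunksize=500_000)
df = pd.concat(chunks, ignore_index=True)
df.head()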
I found another solution, easier than the one posted by @DeepDave.
Instead of loading the data from assets, I loaded it directly from the blob with the URL, using the modin library instead of pandas. It worked like a charm.
Code below:
import modin.pandas as pd
url ='URLLINKHERE'
df_bus = pd.read_csv(url, encoding='utf16')
df_bus.head()
To supplement, here is where to find the URL:
Go to the storage account and find the file.
Right-click on the file.
Generate SAS.
Blob SAS URL: that was the link I used.
Hope this helps others.

Is there a function to download a pickle file via requests.post(url) and load it into a dataframe without saving it locally?

I am trying to download a pickle file from a web-based API via the requests.post(url) function in Python. I was able to download the pickle file and load it into a dataframe, but I had to save it locally before loading it. I wanted to check if there is a way to load the pickle file directly into the dataframe without having to save it locally. I was able to do it for CSV files (as seen below) but not for pickle files:
r=requests.post(url)
data=r.content.decode()
df=pd.read_csv(io.StringIO(data),header=0,engine=None)
Any help is appreciated, thanks.
Just a guess at something that might work for you, since it looks like the pickle file contains CSV-like text data.
df=pd.read_csv(io.StringIO(pd.read_pickle(url)),header=0,engine=None)
Thanks for your suggestion. Basically I ended up using the following and it worked:
r=requests.post(url)
data=r.content
df=pd.read_pickle(io.BytesIO(data))
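If you want to reuse it, here is a small sketch of the same idea wrapped in a helper; the url variable and the assumption that the endpoint returns pickled DataFrame bytes are mine:
import io
import pandas as pd
import requests

def fetch_pickled_df(url):
    # POST to the API and unpickle the response straight from memory
    r = requests.post(url)
    r.raise_for_status()  # surface HTTP errors instead of unpickling an error page
    return pd.read_pickle(io.BytesIO(r.content))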

Downloading a zip and parsing a csv file in Django, copying data to a local DB (sqlite)

I have a URL which downloads a zip file containing a csv file. I want to parse the data and import it into a local database (sqlite) in Django.
In brief: input = URL; pre-processing = download the .zip and extract the .csv; output = csv rows in the DB, with the csv columns as fields.
Well, I think if you google a bit you can do it yourself. I will give you the keywords (a rough sketch tying them together follows at the end of this answer):
To download a file you can use requests:
import requests

url = 'https://www.facebook.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
open('facebook.ico', 'wb').write(r.content)
To parse the csv file, use Python's built-in csv module (xlsxwriter is for writing Excel files, not for parsing csv).
To save the data to the database, I suggest loading it into a Django model and then calling model.save().
To use django on standalone script read this
How to import Django Settings to python standalone script
To use it as a service, you should use rest_framework and write a custom viewset like this one. I had quite a hard time getting familiar with it, so good luck, but once you get going, DRF is quite a handy tool for everything, just not fast.
https://medium.com/django-rest-framework/django-rest-framework-viewset-when-you-don-t-have-a-model-335a0490ba6f
You should use Django settings to store the path to the temp file. On Linux you can use /tmp.
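To tie those keywords together, here is a rough sketch rather than a drop-in solution: it assumes a hypothetical MyRecord Django model whose field names match the CSV header, and it keeps everything in memory instead of writing a temp file.
import csv
import io
import zipfile

import requests

from myapp.models import MyRecord  # hypothetical app/model; its fields must match the CSV columns

def import_csv_from_zip(url):
    # download the zip into memory
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()

    # open the archive from the downloaded bytes and pick the first .csv inside
    with zipfile.ZipFile(io.BytesIO(r.content)) as archive:
        csv_name = next(name for name in archive.namelist() if name.endswith('.csv'))
        with archive.open(csv_name) as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding='utf-8'))
            # one model instance per row; bulk_create writes them in a single batch
            MyRecord.objects.bulk_create(MyRecord(**row) for row in reader)
Run it inside Django, or in a standalone script with settings configured as per the link above.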

How to upload files to a Django Model that are created on the fly?

Here's an example:
Suppose a user, let's say 'Avi', is using my web app. Avi uploads a file (say an .xls/.xlsx) to my web app. I use django-storages to upload the files to either S3 or Azure. The problem arises when I create a dataframe from the uploaded file. When I export the dataframe, it does not get uploaded but rather gets exported to the project's location.
Current Solution:
I use the Azure and boto3 APIs to convert the dataframe to a stream of bytes and then upload it to the corresponding container or S3 bucket.
Is there an easier way, where I could pass the newly created dataframe through a serializer so it gets uploaded to the respective location?
After some digging and searching all over the internet, I've found a better solution than the existing one. I'll try to put down a detailed snippet!
Using SimpleUploadedFile, which is part of django.core.files.uploadedfile, solves most of the problem and makes the code much easier to read.
Here's a snippet:
# Other imports
from django.core.files.uploadedfile import SimpleUploadedFile
import pandas as pd
import io

df = pd.DataFrame()
# Do something with the dataframe
# After processing the dataframe, do the following
output_stream = io.BytesIO()
df.to_excel(output_stream, index=False)  # write the dataframe to the in-memory stream
output_file = SimpleUploadedFile('sample.xlsx', output_stream.getvalue())
data = {'record_type': 'OUTPUT_FILE', 'record': output_file}
record_serializer = RecordSerializer(data=data)
if record_serializer.is_valid():
    record_serializer.save()
    # Display data from the serializer / return the serialized data
else:
    # Return an exception or error from the serializer
    pass
Do let me know if it works for any of you who've tried this.
Also, since you're passing the file as a param to the serializer, whether you're using the default local storage or have configured S3, Azure, etc. via django-storages, the file will be uploaded to the corresponding location without the hassle of calling the storage API yourself every time Python/Django creates a file.
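For context, the snippet assumes a Record model and RecordSerializer roughly along these lines (hypothetical names and fields, adjust to your project):
# models.py (hypothetical)
from django.db import models

class Record(models.Model):
    record_type = models.CharField(max_length=50)
    # with django-storages configured, this FileField is written to S3/Azure instead of local disk
    record = models.FileField(upload_to='records/')

# serializers.py (hypothetical)
from rest_framework import serializers
from .models import Record

class RecordSerializer(serializers.ModelSerializer):
    class Meta:
        model = Record
        fields = ['record_type', 'record']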

Where to upload a csv to then read it when coding on Jupyter

I need to upload a few CSV files somewhere on the internet, to be able to use them in Jupyter later using read_csv.
What would be some easy ways to do this?
The CSV contains a database. I want to upload it somewhere and use it in Jupyter using read_csv so that other people can run the code when I send them my file.
The CSV contains a database.
Since the CSV contains a database, I would not suggest uploading it on Github as mentioned by Steven K in the previous answer. It would be a better option to upload it to either Google Drive or Dropbox as rightly said in the previous answer.
To read the file from Google Drive, you could try the following:
Upload the file to Google Drive, click on "Get Shareable Link", and ensure that anybody with the link can access it.
Click on copy link and get the file ID associated with the CSV.
Ex: If this is the URL https://drive.google.com/file/d/108ARMaD-pUJRmT9wbXfavr2wM0Op78mX/view?usp=sharing then 108ARMaD-pUJRmT9wbXfavr2wM0Op78mX is the file ID.
Simply use the file ID in the following sample code
import pandas as pd
gdrive_file_id = '108ARMaD-pUJRmT9wbXfavr2wM0Op78mX'
data = pd.read_csv(f'https://docs.google.com/uc?id={gdrive_file_id}&export=download', encoding='ISO-8859-1')
Here you are opening up the CSV to anybody with access to the link. A better and more controlled approach would be to share the access with known people and use a library like PyDrive which is a wrapper around Google API's official Python client.
NOTE: Since your question does not mention the version of Python that you are using, I've assumed Python 3.6+ and used an f-string in line #3 of the code. If you use any version of Python before 3.6, you would have to use the format method to substitute the value of the variable in the string.
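For reference, the pre-3.6 equivalent of that line would look something like this:
import pandas as pd

gdrive_file_id = '108ARMaD-pUJRmT9wbXfavr2wM0Op78mX'
url = 'https://docs.google.com/uc?id={}&export=download'.format(gdrive_file_id)
data = pd.read_csv(url, encoding='ISO-8859-1')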
You could use any cloud storage provider like Dropbox or Google Drive. Alternatively, you could use GitHub.
To do this in your notebook, import pandas and use read_csv like you normally would for a local file.
import pandas as pd
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)
