Load .npz from an HTTP link - python

I use a web service to train some of my deep learning models through a Jupyter Notebook on AWS. For cost reasons I would like to store my data as .npz files on my own server and load them straight into the memory of the remote machine.
The np.load() function doesn't work with HTTP links, and I wasn't able to make it work with urlretrieve either. I only got it working by downloading the data with wget and then loading the file from a local path. However, this doesn't fully solve my problem.
Any recommendations?

The thing is that if the first argument of np.load is a file-like object, it has to be seekable. From the np.load documentation:
file : file-like object, string, or pathlib.Path
    The file to read. File-like objects must support the seek() and read() methods. Pickled files require that the file-like object support the readline() method as well.
If you are going to serve those files over HTTP and your server supports Range requests, you could employ the (Python 2) implementation presented in this answer, for example:
F = HttpFile('http://localhost:8000/meta.data.npz')
with np.load(F) as data:
    a = data['arr_0']
    print(a)
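For reference, here is a minimal Python 3 sketch of such a seekable HTTP file object. The class body is an assumption on my part (the linked answer's Python 2 code isn't reproduced here), and it relies on the server honoring the Range header:
import io
import requests

class HttpFile(io.RawIOBase):
    """Read-only, seekable file-like object backed by HTTP Range requests."""

    def __init__(self, url):
        self.url = url
        self.pos = 0
        # A HEAD request tells us the total size up front.
        self.size = int(requests.head(url).headers['Content-Length'])

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.size + offset
        return self.pos

    def read(self, size=-1):
        end = self.size if size < 0 else min(self.pos + size, self.size)
        if self.pos >= end:
            return b''
        # Ask the server for just the bytes we need.
        headers = {'Range': 'bytes={}-{}'.format(self.pos, end - 1)}
        chunk = requests.get(self.url, headers=headers).content
        self.pos += len(chunk)
        return chunk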
Alternatively, you could fetch the entire file, create an in-memory file-like object and pass it to np.load:
from io import BytesIO
import numpy as np
import requests
r = requests.get('http://localhost:8000/meta.data.npz', stream=True)
data = np.load(BytesIO(r.raw.read()))
print(data['arr_0'])
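If streaming isn't needed, an equivalent variant simply lets requests buffer the whole body for you via r.content:
r = requests.get('http://localhost:8000/meta.data.npz')
data = np.load(BytesIO(r.content))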

Related

Python - faster library to compress data than ZipFile

I am developing a web application in Python 3.9 using Django 3.0.7 as the framework. In Python I am creating a function that converts a dictionary to JSON and then to a zip file using the zipfile module. Currently this is the code in use:
def zip_dict(data: dict) -> bytes:
    with io.BytesIO() as archive:
        unzipped = bytes(json.dumps(data), "utf-8")
        with zipfile.ZipFile(archive, mode="w", compression=zipfile.ZIP_DEFLATED) as zipFile:
            zipFile.writestr(zinfo_or_arcname="data", data=unzipped)
        return archive.getvalue()
Then I save the zip in Azure Blob Storage. It works, but the problem is that this function is a bit slow for my purposes. I tried the zlib library, but the performance doesn't change and, moreover, the created zip file seems to be corrupt (I can't even open it manually with WinRAR). Is there any other library that increases the compression speed (without touching the compression ratio)?
Try adding a compresslevel=3 or compresslevel=1 argument to the zipfile.ZipFile() constructor and see if that provides a more satisfactory speed with sufficient compression.
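Applied to the function from the question, the suggestion looks roughly like this; note that compresslevel was added to zipfile in Python 3.7, and lower values trade compression ratio for speed:
import io
import json
import zipfile

def zip_dict(data: dict, level: int = 1) -> bytes:
    with io.BytesIO() as archive:
        unzipped = bytes(json.dumps(data), "utf-8")
        # compresslevel=1 is the fastest DEFLATE setting, 9 the smallest output
        with zipfile.ZipFile(archive, mode="w",
                             compression=zipfile.ZIP_DEFLATED,
                             compresslevel=level) as zipFile:
            zipFile.writestr(zinfo_or_arcname="data", data=unzipped)
        return archive.getvalue()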

Why can't I read a joblib file from my github repo?

I've built a simple app in Python, with a front-end UI in Dash.
It relies on three files:
a small dataframe, in pickle format, 95 KB
a large scipy sparse matrix, in NPZ format, 12 MB
a large scikit-learn KNN model, in joblib format, 65 MB
I have read in the first file, the dataframe, successfully by:
link = 'https://github.com/user/project/raw/master/filteredDF.pkl'
df = pd.read_pickle(link)
But when I try the same with the others, say, the model:
mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'
filehandler = open(mLink, 'rb')
model_knn = pickle.load(filehandler)
I just get an error:
Invalid argument: 'https://github.com/user/project/raw/master/knnModel45Percent.pkl'
I also pushed these files using Github LFS, but the same error occurs.
I understand that hosting large static files on github is bad practice, but I haven't been able to figure out how to use PyDrive or AWS S3 w/ my project. I just need these files to be read in by my project, and I plan to host the app on something like Heroku. I don't really need a full-on DB to store files.
The best case would be if I could read in these large files stored in my repo, but if there is a better approach, I am willing as well. I spent the past few days struggling through Dropbox, Amazon, and Google Cloud APIs and am a bit lost.
Any help appreciated, thank you.
Could you try the following?
from io import BytesIO
import pickle
import requests
mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model_knn = pickle.load(mfile)
Using BytesIO you create a file-like object out of the response that you get from GitHub. That object can then be used in pickle.load. Note that I have added ?raw=true to the URL of the request.
For those getting KeyError: 10, try
model_knn = joblib.load(mfile)
instead of
model_knn = pickle.load(mfile)
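For completeness, the same BytesIO trick should work for the other two files as well. A sketch with placeholder file names and repo path (scipy.sparse.load_npz and joblib.load both accept file objects):
from io import BytesIO

import joblib
import requests
from scipy import sparse

base = 'https://github.com/user/project/raw/master/'  # placeholder repo path

# scipy sparse matrix stored as .npz
matrix = sparse.load_npz(BytesIO(requests.get(base + 'matrix.npz').content))

# scikit-learn model serialized with joblib
model_knn = joblib.load(BytesIO(requests.get(base + 'knnModel.joblib').content))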

How to save data you've already loaded and processed in a Google Colab notebook so you don't have to reload it every time?

I've read about pickling with the pickle library, but does that only save models you've trained, and not the actual dataframe you've loaded into a variable from a massive CSV file, for instance?
This example notebook has some examples of different ways to save and load data.
You can actually use pickle to save any Python object, including Pandas dataframes; however, it's more usual to serialize them using one of Pandas' own methods such as pandas.DataFrame.to_csv, to_feather, etc.
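For example, both round trips might look like this (to_feather needs pyarrow installed):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# pickle round-trip: preserves dtypes and the index exactly
df.to_pickle('/tmp/df.pkl')
df = pd.read_pickle('/tmp/df.pkl')

# feather round-trip: fast binary columnar format
df.to_feather('/tmp/df.feather')
df = pd.read_feather('/tmp/df.feather')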
I would probably recommend the option that uses the GCS command-line tool, which you can run from inside your notebook by prefixing the command with !:
import pandas as pd
# Create a local file to upload.
df = pd.DataFrame([1,2,3])
df.to_csv("/tmp/to_upload.txt")
# Copy the file to our new bucket.
# Full reference: https://cloud.google.com/storage/docs/gsutil/commands/cp
!gsutil cp /tmp/to_upload.txt gs://my-bucket/
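To pull the file back down in a later session, the same command works in the other direction:
!gsutil cp gs://my-bucket/to_upload.txt /tmp/to_upload.txt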

Unable to transform Presentation object from python-pptx library to upload to MS Graph API service endpoint

Context
I have a python script that uses the openpyxl and python-pptx libraries to generate Workbooks and PowerPoints respectively. These files need to be uploaded via the Microsoft Graph API as an octet-stream directly from virtual memory.
Obstacle
All works well for the Workbook thanks to the save_virtual_workbook method, which returns an in-memory workbook; but there is no analogous method like save_virtual_presentation that I know of, so it has been a challenge to get the Presentation object into a form that can be streamed via the io.BytesIO.read() method.
The python-pptx documentation says that Presentation.save(file) works "where file can be either a path to a file (a string) or a file-like object."
Except that, while performing a POC, I don't have the option of saving to the local file system, so I've experimented with various approaches using a file-like object. None came close to being accepted by the put request to the MS Graph API endpoint, except for the attempt below.
Nearest Miss
In this case, prs is the Presentation object that is created in previous code, which I have not included here on account of it being 556 lines.
<<Omitted code that generates the Workbook and PowerPoint, of which prs is an output>>
headers = {'Authorization': 'Bearer {0}'.format(access_token),
           'Accept': 'application/json',
           'Content-Type': 'application/octet-stream'}
endpoint_url = 'https://graph.microsoft.com/v1.0/me/drive/items/<<removed id>>:/Test.pptx:/content'
target_stream = io.BytesIO()
prs.save(target_stream)
response = requests.put(url=endpoint_url,
                        headers=headers,
                        data=io.BytesIO.read(target_stream),
                        verify=False,
                        params=None)
The put request succeeds, BUT the file saved to the service endpoint is an empty pptx shell. I've ruled out that prs itself is an empty pptx shell, so I conclude that target_stream is not a valid transformation of prs.
Summary
Can someone please help me by suggesting how to transform the Presentation object prs into something that I can plug into data=io.BytesIO.read(<<input>>) and successfully upload to the MS Graph API endpoint? I would be much obliged!
Everything looks OK up until you read the BytesIO object. After prs.save(target_stream), the stream's position is at the end of the buffer, so a plain read() returns nothing. In the put() call, try using data=target_stream.getvalue() instead of the read() call you have there now. That's the conventional way to get the contents of a BytesIO or StringIO object as bytes, and it works regardless of the current stream position.
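A minimal sketch of that fix, reusing the names from the question:
target_stream = io.BytesIO()
prs.save(target_stream)

# getvalue() returns the whole buffer no matter where the position is;
# after prs.save() the position sits at the end, so a bare read() yields b''.
response = requests.put(url=endpoint_url,
                        headers=headers,
                        data=target_stream.getvalue())

# Equivalent alternative: rewind first, then read.
# target_stream.seek(0)
# data = target_stream.read()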

Upload image with an in-memory stream to input using Pillow + WebDriver?

I'm getting an image from a URL with Pillow and creating a stream (BytesIO/StringIO).
from io import BytesIO
import requests
from PIL import Image

r = requests.get("http://i.imgur.com/SH9lKxu.jpg")
stream = Image.open(BytesIO(r.content))
Since I want to upload this image using an <input type="file" /> with selenium WebDriver, I can do something like this to upload a file:
self.driver.find_element_by_xpath("//input[@type='file']").send_keys("PATH_TO_IMAGE")
I would like to know if it's possible to upload that image from a stream without having to mess with files / file paths. I'm trying to avoid filesystem Read/Write. And do it in-memory or as much with temporary files. I'm also wondering if that stream could be encoded to Base64 and then uploaded by passing the string to the send_keys function you can see above :$
PS: Hope you like the image :P
You seem to be asking multiple questions here.
First, how do you convert a JPEG without downloading it to a file? You're already doing that, so I don't know what you're asking here.
Next, "And do it in-memory or as much with temporary files." I don't know what this means, but you can do it with temporary files with the tempfile library in the stdlib, and you can do it in-memory too; both are easy.
Next, you want to know how to do a streaming upload with requests. The easy way to do that, as explained in Streaming Uploads, is to "simply provide a file-like object for your body". This can be a tempfile, but it can just as easily be a BytesIO. Since you're already using one in your question, I assume you know how to do this.
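A hedged sketch of such a streaming upload, with a placeholder URL (stream is the PIL Image from the question, re-encoded into memory):
from io import BytesIO

buf = BytesIO()
stream.save(buf, format='JPEG')  # re-encode the PIL image into memory
buf.seek(0)                      # rewind so requests reads from the start

# Passing a file-like object as `data` makes requests stream the body.
requests.post('http://example.com/upload', data=buf)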
(As a side note, I'm not sure why you're using BytesIO(r.content) when requests already gives you a way to use a response object as a file-like object, and even to do it by streaming on demand instead of by waiting until the full content is available, but that isn't relevant here.)
If you want to upload it with selenium instead of requests… well then you do need a temporary file. The whole point of selenium is that it's scripting a web browser. You can't just type a bunch of bytes at your web browser in an upload form, you have to select a file on your filesystem. So selenium needs to fake you selecting a file on your filesystem. This is a perfect job for tempfile.NamedTemporaryFile.
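A sketch of that approach, again assuming stream is the PIL Image from the question; delete=False keeps the file on disk so the browser can read it (important on Windows, where an open NamedTemporaryFile can't be reopened):
import tempfile

with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as f:
    stream.save(f, format='JPEG')  # write the in-memory image to a real path
    path = f.name

# selenium can only "type" a filesystem path into a file input
self.driver.find_element_by_xpath("//input[@type='file']").send_keys(path)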
Finally, "I'm also Wondering If that stream could be encoded to Base64".
Sure it can. Since you're just converting the image in-memory, you can just encode it with, e.g., base64.b64encode. Or, if you prefer, you can wrap your BytesIO in a codecs wrapper to base-64 it on the fly. But I'm not sure why you want to do that here.
