Can we load .pkl files from an external url? - python

I have a pkl file of 312 MB. I want to store it on an external server (S3) or a file-hosting service (for example, Google Drive, Dropbox or any other). When I run my model, the pkl file should be loaded from that external URL.
I have checked out this post but was unable to make it work.
Code:
import urllib
import pickle
Nu_SVC_classifier = pickle.load(urllib.request.urlopen("https://drive.google.com/open?id=1M7Dt7CpEOtjWdHv_wLNZdkHw5Fxn83vW","rb"))
Error:
TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.

The second argument of urllib.request.urlopen is the POST data, not a file mode, and it is not needed here.
import urllib.request
import pickle
Nu_SVC_classifier = pickle.load(urllib.request.urlopen("https://drive.google.com/open?id=1M7Dt7CpEOtjWdHv_wLNZdkHw5Fxn83vW"))
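Note that a Google Drive share link of the form open?id=... typically returns an HTML page rather than the raw pickle bytes, so even with the corrected call the unpickling can fail. A minimal sketch, assuming the URL (a placeholder here) serves the raw pickle file, that downloads the bytes first and unpickles them in memory:
import pickle
import urllib.request

url = "https://example.com/model.pkl"  # placeholder; must serve the raw pickle bytes

with urllib.request.urlopen(url) as response:
    raw_bytes = response.read()

# Unpickle from the in-memory bytes rather than a file on disk
Nu_SVC_classifier = pickle.loads(raw_bytes)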

Try joblib instead of pickle. It works for me.
from urllib.request import urlopen
from sklearn.externals import joblib
Nu_SVC_classifier = joblib.load(urlopen("https://drive.google.com/open?id=1M7Dt7CpEOtjWdHv_wLNZdkHw5Fxn83vW"))
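In recent scikit-learn releases the sklearn.externals.joblib shim has been removed, so the standalone joblib package is used instead. A minimal sketch under that assumption, wrapping the response in BytesIO so joblib gets a seekable file object (the URL is a placeholder for one that serves the raw file):
from io import BytesIO
from urllib.request import urlopen

import joblib  # standalone package; replaces sklearn.externals.joblib

url = "https://example.com/model.pkl"  # placeholder; must serve the raw file bytes

with urlopen(url) as response:
    buffer = BytesIO(response.read())  # seekable in-memory copy of the download

Nu_SVC_classifier = joblib.load(buffer)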

Related

Loading a parquet file from a GitHub repository

I tried to read a parquet (.parq) file I have stored in a GitHub project, using the following script:
import pandas as pd
import numpy as np
import ipywidgets as widgets
import datetime
from ipywidgets import interactive
from IPython.display import display, Javascript
import warnings
warnings.filterwarnings('ignore')
parquet_file = r'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
and it gave me this error:
ArrowInvalid: Could not open Parquet input source '': Parquet
magic bytes not found in footer. Either the file is corrupted or this
is not a parquet file.
Does anyone know what this error message means and how I can load the file in my GitHub repository? Thank you in advance.
You should use the URL under the domain raw.githubusercontent.com.
As for your example:
parquet_file = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
You can read parquet files directly from a web URL like this. However, when reading a data file from a GitHub repository, you need to make sure it is the raw file URL:
url = 'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq?raw=true'
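As a small illustration of the same point, a GitHub blob URL can be rewritten into its raw equivalent mechanically; a sketch using the URL from the question:
import pandas as pd

blob_url = 'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq'

# Rewrite the HTML "blob" page URL into the raw-file URL served by raw.githubusercontent.com
raw_url = blob_url.replace('github.com', 'raw.githubusercontent.com').replace('/blob/', '/')

df = pd.read_parquet(raw_url, engine='auto')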

How to parse CSV into pandas dataframe

I am having a couple of issues with setting up a way to automate the download of a CSV. First, when downloading with a simple pandas read_csv(url) call I get an SSL error, so I switched to using requests and parsing the response myself. Second, I am getting errors while parsing the response. I'm not sure whether the reason is that the URL is actually returning a zip file, and if it is, how I can get around that.
Here is the URL: https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/
and here is the code:
import pandas as pd
import numpy as np
import os
import io
import requests
import urllib3
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:@SECLEVEL=1'
url = "https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/"
res = requests.get(url).content
data = pd.read_csv(io.StringIO(res.decode('utf-8')))
If the content is in zip format, you should unzip it and use its contents (CSV, TXT, ...).
I wasn't able to download the file myself because of the slow speed from the host.
Here is the answer I found, although I don't really need to save these files locally, so if anyone knows how to parse zip files without writing them to disk, that would be great (see the in-memory sketch after the code below). Also, I'm not sure why I get that SSL error with pandas but not with requests...
import requests
import zipfile
from io import BytesIO
url = "https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/"
pathSave = "C:/Users/wherever"
filename = url.split('/')[-1]
r = requests.get(url)
zf = zipfile.ZipFile(BytesIO(r.content))
zf.extractall(pathSave)
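To answer the follow-up about skipping the local save: the archive can be parsed entirely in memory. A minimal sketch, assuming the download is a zip archive containing one or more CSV files:
import io
import zipfile

import pandas as pd
import requests

url = "https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/"

r = requests.get(url)
zf = zipfile.ZipFile(io.BytesIO(r.content))

# Read every CSV member of the archive into a dict of DataFrames, keyed by member name,
# without writing anything to disk
frames = {
    name: pd.read_csv(zf.open(name))
    for name in zf.namelist()
    if name.lower().endswith(".csv")
}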

Python script to download PDF not downloading the PDF?

I have a Python 3.10 script to download a PDF from a URL. I get no errors, but when I run the code the PDF does not download. I've done a sanity check to ensure the PDF is actually at the URL (which it is).
I'm not sure if this has something to do with HTTP/HTTPS. The site has an expired HTTPS certificate, but it is a government site and this is only for testing, so I am not worried about that and can ignore the error.
from fileinput import filename
import os
import os.path
from datetime import datetime
import urllib.request
import requests
import urllib3
urllib3.disable_warnings()
resp = requests.get('http:// url domain .org', verify=False)
urllib.request.urlopen('http:// my url .pdf')
filename = datetime.now().strftime("%Y_%m_%d-%I_%M_%S_%p")
save_path = "C:/Users/bob/Desktop/folder"
Or maybe the issue is something to do with urllib3 ignoring the error and urllib downloading the file?
(I have redacted the specific URLs here.)
The urllib.request.urlopen method doesn't save the remote URL to a file -- it returns a response object that can be treated as a file-like object. You could do something like:
response = urllib.request.urlopen('http:// my url .pdf')
with open('filename.pdf', 'wb') as fd:
    fd.write(response.read())
The urllib.request.urlretrieve method, on the other hand, will take care of writing the remote content to a local file. You would use it like this to write the PDF file to a local file named filename.pdf:
response = urllib.request.urlretrieve('http://my url .pdf',
                                      filename='filename.pdf')
See the documentation for information about the return value from the urlretrieve method.
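Since the question already uses requests with verify=False to work around the expired certificate, the whole download can also be done with requests alone. A minimal sketch with a placeholder URL and the save path from the question:
from datetime import datetime

import requests
import urllib3

urllib3.disable_warnings()  # silence the InsecureRequestWarning caused by verify=False

url = "https://example.org/report.pdf"  # placeholder for the real PDF URL
save_path = "C:/Users/bob/Desktop/folder"
filename = datetime.now().strftime("%Y_%m_%d-%I_%M_%S_%p") + ".pdf"

resp = requests.get(url, verify=False)  # skip verification of the expired certificate (testing only)
resp.raise_for_status()

with open(f"{save_path}/{filename}", "wb") as fd:
    fd.write(resp.content)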

How to unpickle a file that has been hosted in a web URL in python

The normal way to pickle and unpickle an object is as follows:
Pickle an object:
import cloudpickle as cp
cp.dump(objects, open("picklefile.pkl", 'wb'))
UnPickle an object: (load the pickled file):
loaded_pickle_object = cp.load(open("picklefile.pkl", 'rb'))
Now, what if the pickled object is hosted on a server, for example on Google Drive? I am not able to unpickle the object if I directly pass the URL of that object as the path. The following is not working; I get an IOError:
UnPickle an object: (load the pickled file):
loaded_pickle_object = cp.load(open("https://drive.google.com/file/d/pickled_file", 'rb'))
Can someone tell me how to load a pickled file into python that is hosted in a web URL?
The following has worked for me when importing Google Drive pickled files into a Python 3 Colab notebook:
from urllib.request import urlopen
loaded_pickle_object = cp.load(urlopen("https://drive.google.com/file/d/pickled_file"))
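One caveat: a Google Drive link of the form file/d/... usually returns an HTML preview page rather than the raw bytes, which is a common cause of the IOError above. A sketch assuming a direct-download style link (FILE_ID is a placeholder), which generally works for small, publicly shared files that do not trigger Drive's virus-scan confirmation page:
import cloudpickle as cp
from urllib.request import urlopen

# Direct-download form of a Drive URL; FILE_ID is a placeholder for the real file id
url = "https://drive.google.com/uc?export=download&id=FILE_ID"

loaded_pickle_object = cp.load(urlopen(url))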

How to open a remote file with GDAL in Python through a Flask application

So, I'm developing a Flask application which uses the GDAL library, where I want to stream a .tif file through a URL.
Right now I have a method that reads a .tif file using gdal.Open(filepath). When run outside of the Flask environment (like in a Python console), it works fine, both when specifying the path to a local file and a URL.
from gdalconst import GA_ReadOnly
import gdal
filename = 'http://xxxxxxx.blob.core.windows.net/dsm/DSM_1km_6349_614.tif'
dataset = gdal.Open(filename, GA_ReadOnly )
if dataset is not None:
    print 'Driver: ', dataset.GetDriver().ShortName, '/', \
        dataset.GetDriver().LongName
However, when the following code is executed inside the Flask environment, I get the following message:
ERROR 4: `http://xxxxxxx.blob.core.windows.net/dsm/DSM_1km_6349_614.tif' does
not exist in the file system,
and is not recognised as a supported dataset name.
If I instead download the file to the local filesystem of the Flask app, and insert the path to the file, like this:
block_blob_service = get_blobservice() #Initialize block service
block_blob_service.get_blob_to_path('dsm', blobname, filename) # Get blob to local filesystem, path to file saved in filename
dataset = gdal.Open(filename, GA_ReadOnly)
That works just fine...
The thing is, since I'm requesting some big files (200 MB), I want to stream the files using the URL instead of a local file reference.
Does anyone have an idea of what could be causing this? I also tried putting "/vsicurl_streaming/" in front of the URL, as suggested elsewhere.
I'm using Python 2.7, 32-bit with GDAL 2.0.2
Please try the following code snippet:
from gzip import GzipFile
from io import BytesIO
import urllib2
from uuid import uuid4
from gdalconst import GA_ReadOnly
import gdal
def open_http_query(url):
    try:
        request = urllib2.Request(url,
                                  headers={"Accept-Encoding": "gzip"})
        response = urllib2.urlopen(request, timeout=30)
        if response.info().get('Content-Encoding') == 'gzip':
            return GzipFile(fileobj=BytesIO(response.read()))
        else:
            return response
    except urllib2.URLError:
        return None

url = 'http://xxx.blob.core.windows.net/container/example.tif'
image_data = open_http_query(url)

mmap_name = "/vsimem/" + uuid4().get_hex()
gdal.FileFromMemBuffer(mmap_name, image_data.read())
dataset = gdal.Open(mmap_name)

if dataset is not None:
    print 'Driver: ', dataset.GetDriver().ShortName, '/', \
        dataset.GetDriver().LongName
This uses a GDAL in-memory file (/vsimem) to open an image retrieved via HTTP directly, without saving it to a temporary file.
Refer to https://gist.github.com/jleinonen/5781308 for more info.
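As an alternative worth trying (a sketch only, not verified on the 32-bit GDAL 2.0.2 setup from the question): GDAL's /vsicurl/ virtual file system reads remote files over HTTP directly, provided GDAL was built with curl support and the server honours range requests:
from gdalconst import GA_ReadOnly
import gdal

# /vsicurl/ tells GDAL to fetch byte ranges over HTTP on demand,
# so only the parts of the large file that are actually needed get downloaded
url = 'http://xxxxxxx.blob.core.windows.net/dsm/DSM_1km_6349_614.tif'
dataset = gdal.Open('/vsicurl/' + url, GA_ReadOnly)

if dataset is None:
    raise RuntimeError('GDAL could not open the remote dataset')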
