How to read a big dataset into a Pandas dataframe? - python

I have several files (11) already registered as datasets (mltable) in Azure ML Studio. Loading them into dataframes works in all cases except one. I believe the reason is the size: 1.95 GB. How can I load this dataset into a dataframe? So far I have not managed to load it at all.
Any tips on how to do it efficiently? I tried to figure out a way to read it in parallel with modin but failed. Below you will find the load script.
from azureml.core import Workspace, Dataset

subscription_id = 'xyz'
resource_group = 'rg-personal'
workspace_name = 'test'

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='buses')
df = dataset.to_pandas_dataframe()

You can load the data directly into Pandas using an AzureML long-form datastore URI.
Ensure you have the azureml-fsspec Python library installed:
pip install azureml-fsspec
Next, just load the data:
import pandas as pd
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
As this uses the AzureML datastore, it will automatically handle authentication for you without exposing SAS keys in the URI. Authentication can be either identity-based (i.e. your AAD identity is passed through to storage) or credential-based.
AzureML Datastore URIs are a known implementation of Filesystem spec (fsspec): A unified pythonic interface to local, remote and embedded file systems and bytes storage.
This implementation leverages the AzureML data runtime: a fast and efficient engine to materialize the data into a Pandas or Spark dataframe. The engine is written in Rust, which is known for high speed and high memory efficiency for data processing tasks.
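If the 1.95 GB file is too large to materialize comfortably in one go, a minimal sketch (assuming azureml-fsspec is installed, and using the same placeholder URI segments as above) is to stream the CSV in chunks:
import pandas as pd

# Placeholder long-form datastore URI, as in the example above
uri = ("azureml://subscriptions/<subid>/resourcegroups/<rgname>"
       "/workspaces/<workspace_name>/datastores/<datastore_name>"
       "/paths/<folder>/<filename>.csv")

# Read in chunks instead of materializing the whole file at once
chunks = pd.read_csv(uri, chunksize=500_000)
df = pd.concat(chunks, ignore_index=True)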

I found another solution, easier than the one posted by @DeepDave.
Instead of loading the data from data assets, I loaded it directly from blob storage via its URL, using the modin library instead of Pandas. Worked like a charm.
Code below:
import modin.pandas as pd
url ='URLLINKHERE'
df_bus = pd.read_csv(url, encoding='utf16')
df_bus.head()
To supplement, here is where to find the URL:
Go to the storage account and find the file.
Right-click on the file.
Select Generate SAS.
Copy the Blob SAS URL -> that is the link I used.
Hope this helps others.
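One caveat worth noting: modin only parallelises the read if an execution engine such as Ray or Dask is available (e.g. pip install "modin[ray]"). A small setup sketch, with 'URLLINKHERE' standing for the Blob SAS URL from the steps above:
import modin.config as cfg
cfg.Engine.put("ray")  # or "dask", depending on which engine is installed

import modin.pandas as pd

df_bus = pd.read_csv('URLLINKHERE', encoding='utf16')  # Blob SAS URL placeholder
df_bus.head()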

Related

Logging a PySpark dataframe into a MLFlow Artifact

I am currently writing an MLflow artifact to DBFS, but I am using pandas, with the code below...
temp = tempfile.NamedTemporaryFile(prefix="*****", suffix=".csv")
temp_name = temp.name
try:
    df.to_csv(temp_name, index=False)
    mlflow.log_artifact(temp_name, "******")
finally:
    temp.close()  # Delete the temp file
How would I write this if 'df' was a spark dataframe?
You just need to use file path URLs with the proper protocols. "dbfs" is the generic Databricks one. For Azure, "abfss" would be needed. (I cannot recall AWS's S3 scheme.)
filepath="dbfs:///filepath"
df # My Spark DataFrame
df.write.csv(filepath)
mlflow.log_artifact(temp_name, filepath)
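For what it's worth, here is a hedged sketch of that idea on Databricks (paths and names are hypothetical): write the Spark dataframe to DBFS, then log the resulting directory through the local /dbfs mount so MLflow gets a local path.
import mlflow

dbfs_path = "dbfs:/tmp/my_spark_export"    # hypothetical DBFS target
df.write.mode("overwrite").csv(dbfs_path)  # df is the Spark DataFrame

# DBFS is mounted locally at /dbfs on Databricks clusters
mlflow.log_artifacts("/dbfs/tmp/my_spark_export", artifact_path="my_spark_export")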
It looks like in your case the problem has to do with how Spark APIs access the filesystem vs. how Python APIs access it (see here for details). This is probably not the recommended way (I'm fairly new to Databricks myself), but if you're on a single node you can write your parquet to the local filesystem, and MLflow can log it from there with something like:
with tempfile.TemporaryDirectory() as tmpdirname:
    df.write.parquet(f'file:{tmpdirname}/my_parquet_table')
    mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')
Keep in mind that a parquet "file" is actually a directory with a whole bunch of files in it, so you need to use log_artifacts, not log_artifact, and if you don't specify artifact_path you'll get all the little files that make up the parquet file (directory) dumped directly into the root of your MLflow artifacts. Also, MLflow doesn't have any previewing capability for parquet files, so depending on your use case, logging parquet artifacts may not be as convenient as it first seems.
HTH
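Not from the answers above, just a hedged alternative sketch assuming the dataframe is small enough to collect to the driver: convert it to pandas and reuse the CSV-logging pattern from the question.
import tempfile
import mlflow

pdf = df.toPandas()  # df is the Spark DataFrame; this collects it to the driver
with tempfile.NamedTemporaryFile(prefix="export_", suffix=".csv") as temp:
    pdf.to_csv(temp.name, index=False)
    mlflow.log_artifact(temp.name, "my_artifact_dir")  # hypothetical artifact path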

Why can't I read a joblib file from my github repo?

I've built a simple app in Python, with a front-end UI in Dash.
It relies on three files:
a small dataframe, in pickle format, 95 KB
a large scipy sparse matrix, in NPZ format, 12 MB
a large scikit-learn KNN model, in joblib format, 65 MB
I have read in the first dataframe successfully with:
link = 'https://github.com/user/project/raw/master/filteredDF.pkl'
df = pd.read_pickle(link)
But when I try this with the others, say the model, by:
mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'
filehandler = open(mLink, 'rb')
model_knn = pickle.load(filehandler)
I just get an error
Invalid argument: 'https://github.com/user/project/raw/master/knnModel45Percent.pkl'
I also pushed these files using Github LFS, but the same error occurs.
I understand that hosting large static files on github is bad practice, but I haven't been able to figure out how to use PyDrive or AWS S3 w/ my project. I just need these files to be read in by my project, and I plan to host the app on something like Heroku. I don't really need a full-on DB to store files.
The best case would be if I could read in these large files stored in my repo, but if there is a better approach, I am willing as well. I spent the past few days struggling through Dropbox, Amazon, and Google Cloud APIs and am a bit lost.
Any help appreciated, thank you.
Could you try the following?
from io import BytesIO
import pickle
import requests
mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model_knn = pickle.load(mfile)
Using BytesIO you create a file object out of the response that you get from GitHub. That object can then be used in pickle.load. Note that I have added ?raw=true to the URL of the request.
For those getting KeyError: 10, try
model_knn = joblib.load(mfile)
instead of
model_knn = pickle.load(mfile)
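Putting the two suggestions together, a hedged end-to-end sketch for a model saved with joblib (the URL is a hypothetical placeholder in the same style as the question):
from io import BytesIO
import requests
import joblib

mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'  # hypothetical raw URL
model_knn = joblib.load(BytesIO(requests.get(mLink).content))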

How to create pandas dataframe from parquet files kept on google storage

I need to create a dataframe using the pandas library from parquet files hosted in a Google Cloud Storage bucket. I have searched the documentation and online examples but can't seem to figure out how to go about it.
Could you please assist me by pointing me in the right direction?
I am not looking for a complete solution, just a pointer to where I could find further information so that I can devise my own solution.
Thank you in advance.
You may use the gcsfs and pyarrow libraries to do so.
import gcsfs
from pyarrow import parquet

url = "gs://bucket_name/.../folder_name"
fs = gcsfs.GCSFileSystem()

# Assuming your parquet files start with the `part-` prefix
files = ["gs://" + path for path in fs.glob(url + "/part-*")]

ds = parquet.ParquetDataset(files, filesystem=fs)
df = ds.read().to_pandas()
You can read it with pandas.read_parquet like this:
df = pandas.read_parquet('gs://bucket_name/file_name')
Additionally, you will need the gcsfs library and either pyarrow or fastparquet installed.
Don't forget to provide credentials in case you access a private bucket.
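A small hedged sketch of the credentials part (the bucket, path and key file are hypothetical): gcsfs accepts a service-account key file via fsspec's storage_options, which pandas passes through.
import pandas as pd

df = pd.read_parquet(
    "gs://bucket_name/folder_name/file_name.parquet",
    storage_options={"token": "service-account.json"},  # path to a GCP key file
)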

What is the preferred way to load data from an API into BigQuery?

I am trying to get data from a REST API into BigQuery on the Google Cloud Platform (GCP). What is the best way to achieve that (without using any third party tools such as Funnel.io or Supermetrics)?
Most tutorials I could find suggest writing the data as CSV files to Cloud Storage and then using Dataflow to load them into BigQuery. This, however, seems a bit cumbersome. There should be a way to do it without the intermediate step of writing to CSV. Can this be achieved (within GCP), and if so, what is the best way?
PS: If the size of the data is relevant for the answer: I'm trying to load a total of about 10,000 rows of data (one-off) with about 100 new columns coming in every day - ideally updating every hour.
Following up on the hint by @Kolban above, loading data from an API into BigQuery without using third-party tools and without writing an intermediate file to Google Cloud Storage is possible, and indeed quite simple, by "streaming" data into BigQuery:
rows_to_insert = [(u"Phred Phlyntstone", 32), (u"Wylma Phlyntstone", 29)]

errors = client.insert_rows(table, rows_to_insert)  # Make an API request.
if errors == []:
    print("New rows have been added.")
(From the BQ documentation)
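The documentation excerpt assumes a client and a table object already exist; a hedged setup sketch (project, dataset and table names are hypothetical) would look like this:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.my_table")  # fetches the schema used by insert_rows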
In order to prepare the JSON data, it has to be turned into tuples. Here's an excerpt from my code to achieve this:
# Turn JSON into tuples
data_tuples = []
for key, value in resp_json[product_id].items():
    data_tuples.append((
        value["product_id"],
        value["downloads"]
    ))

# Insert into BQ
errors = client.insert_rows(table, data_tuples)
if errors == []:
    print("New rows have been added.")
else:
    print(errors)
According to the documentation:
Currently, you can load data into BigQuery only from Cloud Storage or a readable data source (such as your local machine).
Therefore, unless you are loading Datastore or Firestore exports, the files must be in Google Cloud Storage. The following readable formats are available from GCS:
Avro
CSV
JSON (newline delimited only)
ORC
Parquet
Datastore exports
Firestore exports
You should be aware of the limitations of each format. In addition, there are also limitations for load jobs; they are described here.
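For reference, a hedged sketch of such a load job with the official Python client (bucket, dataset and table names are hypothetical), loading newline-delimited JSON from GCS:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the JSON
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/api_export/data.json",  # hypothetical GCS path
    "my_project.my_dataset.my_table",       # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish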
I would advise you to fetch the data from your REST API in one of the readable formats, store it in Google Cloud Storage, and then use the Google Transfer Service to load it into BigQuery. Thus, it would not be necessary to use Dataflow.
Cloud Storage Transfer is used to schedule recurring data loads directly into BigQuery. According to the documentation, the minimum load interval is 1 hour, which I believe suits your need. You can read more about this service here.
I hope it helps.

Using Spectrify to offload data from Redshift to S3 in Parquet format

I'm trying to use Spectrify to unload data from Redshift to S3 in Parquet format, but I'm stuck in the process because I can't understand a few things. Spectrify's documentation isn't great and I can't find any implementation example on the internet. I also found a similar question here on Stack Overflow, but the accepted answer simply recommends using Spectrify, which doesn't help much.
Here is the problem (this is the code from their documentation):
from spectrify.export import RedshiftDataExporter
from spectrify.convert import ConcurrentManifestConverter
from spectrify.utils.schema import SqlAlchemySchemaReader
RedshiftDataExporter(sa_engine, s3_config).export_to_csv('my_table')
csv_path_template = 's3://my-bucket/my-table/csv/{start.year}/{start.month:02d}/{start.day:02d}'
spectrum_path_template = 's3://my-bucket/my-table/spectrum/partition_key={start}'
csv_path = csv_path_template.format(start=start_date)
spectrum_path = spectrum_path_template.format(start=start_date)
s3_config = SimpleS3Config(csv_path, spectrum_path)
sa_table = SqlAlchemySchemaReader(engine).get_table_schema('my_table')
ConcurrentManifestConverter(sa_table, s3_config).convert_manifest()
RedshiftDataExporter is used to export the data to CSV; sa_engine is a SQLAlchemy engine connected to Redshift.
Their documentation is vague on the conversion process. What would the process be to unload the data to CSV and then turn it into Parquet format using Spectrify in a Python 3.x script? How should I modify the above code, and what am I missing?
You can now unload Redshift data to S3 in Parquet format without any third-party application. The feature is now supported natively in Redshift:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET
Documentation can be found at UNLOAD - Amazon Redshift
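For completeness, a hedged sketch of running that statement from Python (cluster endpoint, credentials, bucket and IAM role are all hypothetical placeholders):
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="my-password",
)
conn.autocommit = True  # UNLOAD is run as a standalone statement
with conn.cursor() as cur:
    cur.execute("""
        UNLOAD ('SELECT * FROM my_table')
        TO 's3://my-bucket/my-table/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS PARQUET
    """)
conn.close()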
