I am writing a Kubeflow component which reads an input query and creates a dataframe, roughly as:
from kfp.v2.dsl import component

@component(...)
def read_and_write():
    # read the input query
    # transform the result into a dataframe
    df = sql.to_dataframe()
I was wondering how I can pass this dataframe to the next operation in my Kubeflow pipeline.
Is this possible? Or do I have to save the dataframe to a CSV (or some other format) and then pass its output path along instead?
Thank you
You need to use the concept of an Artifact. Quoting the documentation:
Artifacts represent large or complex data structures like datasets or models, and are passed into components as a reference to a file path.
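In KFP v2 this typically means writing the dataframe to the path of an output Dataset artifact and reading it back from an input artifact in the next component. A rough sketch (the sql.to_dataframe() call is the question's placeholder for however you actually build the dataframe):
from kfp.v2.dsl import component, Dataset, Input, Output

@component(packages_to_install=["pandas"])
def read_and_write(data: Output[Dataset]):
    df = sql.to_dataframe()            # placeholder from the question
    df.to_csv(data.path, index=False)  # persist the dataframe to the artifact's path

@component(packages_to_install=["pandas"])
def next_step(data: Input[Dataset]):
    import pandas as pd
    df = pd.read_csv(data.path)        # the next component reads the same artifact back
In the pipeline definition the wiring is then just next_step(data=read_and_write().outputs["data"]).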
I have written an optimization algorithm that tests some functions on historical stock data and then returns a 2D list of the pandas dataframes generated by each run, together with the function parameters used. This list takes the form [[df, params], [df, params], ..., [df, params]]. After it has been generated, I would like to save this data to be processed in another script, but I am having trouble. Currently I am converting this list to a dataframe and using pandas' to_csv() method, but this mangles my data when I open it in another file: I expect the data types to be [[dataframe, list], [dataframe, list], ..., [dataframe, list]], but they instead become [[str, str], [str, str], ..., [str, str]]. I open the file using pandas' read_csv() method, then convert the resulting dataframe back into a list with df.values.tolist().
To clarify, I save the list to a .csv like this, where out is the list:
out = pd.DataFrame(out)
out.to_csv('optimized_ticker.csv')
And I open the .csv and convert it back from a dataframe to a list like this:
df = pd.read_csv('optimized_ticker.csv')
list = df.values.tolist()
I figured the problem was that my dataframes had commas in them somewhere, so I tried changing the delimiter on the .csv to a few different things, but the issue persisted. How can I fix this issue so that my datatypes aren't mangled into strings? It is not imperative that I use the .csv format, so if there's a filetype more suited to the job I can switch to it. The only purpose of saving the data is so that I can process it with any number of other scripts without having to re-run the simulation each time.
The best way to save a pandas dataframe is not via CSV if its only purpose is to be read by another pandas script. Parquet is a much more robust option: it saves the datatypes of each column, can be compressed, and you won't have to worry about things like commas in values. Just use the following:
out.to_parquet('optimized_ticker.parquet')        # write
df = pd.read_parquet('optimized_ticker.parquet')  # read back, dtypes preserved
EDIT:
As mentioned in the comments, pickle is also a possibility, so the right choice depends on your case. Google will be your best friend in figuring out whether to use pickle, parquet, or feather.
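For the nested [[df, params], ...] structure in the question, a pickle round trip could be as simple as this (a sketch; the file name is arbitrary):
import pandas as pd

pd.to_pickle(out, 'optimized_ticker.pkl')         # keeps the nested DataFrames and lists intact
restored = pd.read_pickle('optimized_ticker.pkl')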
I have multiple data sources of financial data that I want to parse into a common data model.
API retrieval – a single format from a single source (currently)
CSV files – multiple formats from multiple sources
Once cleaned and validated, the data is stored in a database (this is a Django project, but I don’t think that’s important for this discussion).
I have opted to use Pydantic for the data cleaning and validation, but am open to other options.
Where I’m struggling is with the preprocessing of the data, especially with the CSVs.
Each CSV has a different set of headers and a different data structure. Some CSVs contain all information in a single row, while others spread it across multiple rows. As you can tell, there are very specific rules for each data source based on its origin. I have a dict that maps all the header variations to the model fields, and I filter it by source.
Currently, I'm loading the CSV into a pandas dataframe and using groupby to break the data up into blocks. I can then loop through the groups, modify the data based on its origin, and assign the data to the appropriate columns to pass into a Pydantic BaseModel. After doing this, using Pydantic seemed a bit pointless, as all the work was being done beforehand.
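Roughly, the current flow looks like the sketch below (the source names, header mappings, and model fields are simplified placeholders):
import pandas as pd
from pydantic import BaseModel

class Transaction(BaseModel):   # the common data model
    trade_date: str
    amount: float

HEADER_MAP = {                  # per-source header variations -> model fields
    "source_a": {"Trade Date": "trade_date", "Amount": "amount"},
    "source_b": {"dt": "trade_date", "amt": "amount"},
}

def parse_csv(path, source):
    df = pd.read_csv(path).rename(columns=HEADER_MAP[source])
    records = []
    for _, block in df.groupby("trade_date"):   # break the data up into blocks
        row = block.iloc[0]                     # source-specific reshaping happens here
        records.append(Transaction(trade_date=row["trade_date"], amount=row["amount"]))
    return records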
To make things more reusable, I thought of moving all the logic into the Pydantic BaseModel, passing the raw grouped data into a property and processing it into the appropriate data elements. But this just seems wrong.
As with most problems, I’m sure this has been solved before. I’m looking for some guidance on appropriate patterns for this style of processing. All of the examples I’ve found to date are based on a single input format.
I've never used Python before, and I find myself in dire need of the sklearn module in my Node.js project for machine learning purposes.
I have spent all day trying to understand the code examples in said module, and now that I kind of understand how they work, I don't know how to use my own data set.
Each of the built-in data sets has its own function (load_iris, load_wine, load_breast_cancer, etc.), and they all load data from a .csv and an .rst file. I can't find a function that will allow me to load my own data set. (There's a load_data function, but it seems to be for internal use by the three loaders I mentioned, because I can't import it.)
How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could it be programmatically provided data (array, object, etc)?
In case it's important: all those built-in data sets have numeric features, whereas my data set has both numeric and string features to be used in the decision tree.
Thanks
You can load whatever you want and then use sklearn models.
If you have a .csv file, pandas would be the best option.
import pandas as pd
mydataset = pd.read_csv("dataset.csv")
X = mydataset.values[:, 0:10]  # let's assume the first 10 columns are the features/variables
y = mydataset.values[:, 10]    # let's assume the 11th column (index 10) holds the target values/classes
...
sklearn_model.fit(X,y)
Similarly, you can load .txt or .xls files.
The important thing in order to use sklearn models is this:
X should always be a 2D array with shape [n_samples, n_variables]
y should be the target variable.
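One note on the string features mentioned in the question: sklearn's tree models expect numeric input, so string columns need to be encoded first. A minimal sketch, assuming a hypothetical dataset.csv whose target column is named target:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("dataset.csv")
X = pd.get_dummies(df.drop(columns=["target"]))  # one-hot encode the string columns
y = df["target"]

model = DecisionTreeClassifier()
model.fit(X, y)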
I'm using the sample Python machine learning "IRIS" dataset (as the starting point of a project). The data are POSTed into a Flask web service. Thus, the key difference between what I'm doing and all the examples I can find is that I'm trying to load a pandas DataFrame from a variable, not from a file or URL, both of which seem to be easy.
I extract the IRIS data from Flask's POST request.values. All good. But at that point, I can't figure out how to get a pandas dataframe the way "pd.read_csv(....)" does. So far, it seems the only solution is to parse each row and build up several Series I can use with the DataFrame constructor? There must be something I'm missing, since reading this data from a URL is a simple one-liner.
I'm assuming reading a variable into a Pandas DataFrame should not be difficult since it seems like an obvious use-case.
I tried wrapping with io.StringIO(csv_data), then following up with read_csv on that variable, but that doesn't work either.
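The attempt looked roughly like this, where csv_data is assumed to hold the raw CSV text (header row included) taken from the request:
import io
import pandas as pd

df = pd.read_csv(io.StringIO(csv_data))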
Note: I also tried things like ...
data = pd.DataFrame(csv_data, columns=['....'])
but got nothing but errors (for example, "constructor not called correctly!")
I am hoping for a simple method I can call that infers the columns and names and creates the DataFrame for me from a variable, without my needing to know a lot about Pandas (just to read and load a simple CSV data set, anyway).
TL;DR: How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection?
I currently have a proprietary file format that I'm using to feed into dask.DataFrame.
I have a function that accepts a file path and returns a pandas.DataFrame, which dask.DataFrame uses internally (successfully) to load multiple files into the same dask.DataFrame.
Up until recently, I was using my own code to merge several pandas.DataFrames into one, and now I'm working on using dask instead. When parsing the file format, I may encounter errors and certain conditions that I want to log and associate with the dask.DataFrame object as metadata (logs, origin of data, etc.).
It's important to note that, where reasonable, I'm using MultiIndices quite heavily (13 index levels, 3 column levels). For metadata that describes the entire dataframe rather than specific rows, I'm using attributes.
Using a custom function, I could pass the metadata in a tuple with the actual DataFrame. Using pandas, I could add it to the _metadata field and as attributes on the DataFrame objects.
How can I collect metadata from separate pandas.DataFrame objects when using the dask framework?
Thanks!
There are a few potential questions here:
Q: How do I load data from many files in a custom format into a single dask dataframe?
A: You might check out dask.delayed to load the data and dask.dataframe.from_delayed to convert several dask Delayed objects into a single dask dataframe. Or, as you're probably doing now, you can use dask.dataframe.from_pandas and dask.dataframe.concat. See this example notebook on using dask.delayed with custom objects/functions.
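A sketch of the delayed route, where load_file stands in for the existing function that turns one path into a pandas.DataFrame and file_paths for the list of inputs:
import dask
import dask.dataframe as dd

parts = [dask.delayed(load_file)(path) for path in file_paths]  # lazy per-file loads
ddf = dd.from_delayed(parts)                                    # single dask dataframe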
Q: How do I store arbitrary metadata onto a dask.dataframe?
A: This is not supported. Generally I recommend using a different data structure to store your metadata if possible. If there are a number of use cases for this, then we should consider adding it to dask dataframe; if that's the case, please raise an issue. Generally though, it would be good to see better support for this in Pandas before dask.dataframe considers supporting it.
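One way to keep the metadata in a separate structure, assuming the parser returns a (DataFrame, log) tuple per file as the question's custom function does (load_file and file_paths are stand-ins):
import dask
import dask.dataframe as dd

results = [dask.delayed(load_file)(path) for path in file_paths]  # each yields (df, log)
frames = [r[0] for r in results]   # item access on a Delayed is itself lazy
logs = [r[1] for r in results]

ddf = dd.from_delayed(frames)      # the dataframes become the dask dataframe
metadata = dask.compute(*logs)     # the logs are collected separately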
Q: I use multi-indexes heavily in Pandas; how can I integrate this workflow into dask.dataframe?
A: Unfortunately dask.dataframe does not currently support multi-indexes. These would clearly be helpful.