In ML.NET, what are the counterparts of the NumPy and Pandas Python libraries?
Here are all the available .NET counterparts that I know of:
NumPy
There are a few Tensor type proposals in dotnet/corefx:
https://github.com/dotnet/corefx/issues/25779
https://github.com/dotnet/corefx/issues/34527
There is also an implementation of NumPy made by the SciSharp org.
Pandas
On dotnet/corefx there is a DataFrame Discussion issue, which has spawned a dotnet/corefxlab project to implement a C# DataFrame library similar to Pandas.
There are also other DataFrame implementations:
Deedle
SciSharp Pandas.NET
ML.NET
In ML.NET, IDataView is an interface that abstracts the underlying storage for tabular data, e.g. a DataFrame. It doesn't have the rich API of a Pandas DataFrame; instead, it supports reading data from any underlying source, for example a text file, a SQL table, or in-memory objects.
There currently isn't a "data exploration" API in ML.NET v1.0 like you would have with a Pandas DataFrame. The current plan is for the corefxlab DataFrame class to implement IDataView, so that you can use DataFrame to do the data exploration and feed it directly into ML.NET.
UPDATE: For a "data exploration" API similar to Pandas, check out the Microsoft.Data.Analysis package, which is currently in preview. It implements IDataView and can be fed directly into ML.NET to train or make predictions.
It is mostly the regular .NET types plus the IDataView types; the document is a bit out of date.
Related
I have built a Python wrapper around a .NET API. The wrapper is currently very slow at "unpacking" a .NET collection object into the desired pd.Series object to be returned. I would like to accelerate this part of the code by wrapping some C code to do the unpacking.
Detail
This API (specifically the OSIsoft PI AF SDK) is used to retrieve time-series data from a proprietary database. The API call is made through the pythonnet library and returns a .NET collection called an AFValues object. The object is a collection of AFValue objects, each of which contains a timestamp and a value field, among other information. At present I "unzip" each of these objects using a Python list comprehension and combine the results to form the series. Here's a much simplified version:
timestamps = [afvalue.Timestamp for afvalue in afvalues]
# (There is actually some timezone handling etc in the above as well)
values = [afvalue.Value for afvalue in afvalues]
result = pd.Series(index = timestamps, data = values)
These list comprehensions are noticeably slow on very large collections (i.e. millions of values).
Desired outcome
Ideally I would like to:
Call the API using the existing pythonnet code
Pass the AFValues object into some precompiled code written in C (or maybe .NET? open to suggestions)
Have the C code return a NumPy array or similar that I can convert to a Pandas object.
I believe this is how Pandas and NumPy achieve their speed in large operations. Is this the right approach, and does anyone have suggestions on how to go about coding it?
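For context, here is a rough sketch of the pipeline I have in mind; fast_unpack is a hypothetical compiled helper (a C extension or .NET method) that I would still need to write:
import pandas as pd

# afvalues is the .NET AFValues collection obtained via pythonnet, as today.
# fast_unpack is a hypothetical compiled helper that walks the collection once
# and returns two NumPy arrays: timestamps and values.
timestamps, values = fast_unpack(afvalues)

result = pd.Series(index=pd.to_datetime(timestamps), data=values)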
How can I import a CSV file into PySpark as a Dataset? Note that I am NOT asking about how to import it into a DataFrame.
While reading this page from Databricks, I learned some benefits of datasets over dataframes.
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
I want to learn how to work with them instead of RDDs and dataframes.
The linked blog post gives you the answer: it is impossible because of Python:
Note: Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.
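So in PySpark the closest you can get is reading the CSV into an (untyped) DataFrame; a minimal sketch, where the file path and reader options are just an example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# In Python there is no typed Dataset API, so this returns an untyped DataFrame
df = spark.read.csv("data/my_file.csv", header=True, inferSchema=True)
df.printSchema()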
I'm using the sample Python machine learning "IRIS" dataset (as the starting point of a project). The data are POSTed to a Flask web service. Thus, the key difference between what I'm doing and all the examples I can find is that I'm trying to load a Pandas DataFrame from a variable, not from a file or URL, both of which seem to be easy.
I extract the IRIS data from Flask's POST request.values. All good. But at that point, I can't figure out how to get a Pandas DataFrame the way "pd.read_csv(....)" does. So far, it seems the only solution is to parse each row and build up several Series that I can use with the DataFrame constructor. There must be something I'm missing, since reading this data from a URL is a simple one-liner.
I'm assuming reading a variable into a Pandas DataFrame should not be difficult since it seems like an obvious use-case.
I tried wrapping it with io.StringIO(csv_data) and then calling read_csv on that variable, but that doesn't work either.
Note: I also tried things like ...
data = pd.DataFrame(csv_data, columns=['....'])
but got nothing but errors (for example, "constructor not called correctly!")
I am hoping for a simple method I can call that infers the columns and names and creates the DataFrame for me from a variable, without me needing to know a lot about Pandas (just to read and load a simple CSV data set, anyway).
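For reference, this is a minimal sketch of the io.StringIO pattern I was attempting; the csv_data contents here are just an example:
import io
import pandas as pd

# csv_data would come from the Flask request; shown inline here as an example
csv_data = "sepal_length,sepal_width,petal_length,petal_width,species\n5.1,3.5,1.4,0.2,setosa"

df = pd.read_csv(io.StringIO(csv_data))
print(df.head())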
TL;DR: How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection?
I currently have a proprietary file format that I'm using to feed into dask.DataFrame.
I have a function that accepts a file path and returns a pandas.DataFrame, which is used internally by dask.DataFrame to successfully load multiple files into the same dask.DataFrame.
Up until recently, I was using my own code to merge several pandas.DataFrames into one; now I'm working on using dask instead. When parsing the file format I may encounter errors and certain conditions that I want to log and associate with the dask.DataFrame object as metadata (logs, origin of the data, etc.).
It's important to note that, where reasonable, I'm using MultiIndexes quite heavily (13 index levels, 3 column levels). For metadata that describes the entire dataframe and not specific rows, I'm using attributes.
Using a custom function, I could pass the metadata in a tuple with the actual DataFrame. Using pandas, I could add it to the _metadata field and as attributes to the DataFrame objects.
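A minimal sketch of the tuple approach I describe above, where parse_my_format stands in for the proprietary parser:
import pandas as pd

def load_with_metadata(path):
    # parse_my_format is a placeholder for the proprietary parser; it returns
    # a pandas.DataFrame plus any errors/conditions hit while parsing
    df, parse_errors = parse_my_format(path)
    metadata = {"origin": path, "errors": parse_errors}
    return df, metadata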
How can I collect metadata from separate pandas.DataFrame objects when using the dask framework?
Thanks!
There are a few potential questions here:
Q: How do I load data from many files in a custom format into a single dask dataframe?
A: You might check out dask.delayed to load the data and dask.dataframe.from_delayed to convert several Dask Delayed objects into a single dask dataframe. Or, as you're probably doing now, you can use dask.dataframe.from_pandas and dask.dataframe.concat. See this example notebook on using dask.delayed with custom objects/functions.
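A minimal sketch of that pattern, where parse_my_format is a placeholder for your custom parser that returns a pandas.DataFrame:
import dask
import dask.dataframe as dd

@dask.delayed
def load_one(path):
    # parse_my_format is a placeholder for the custom parser
    return parse_my_format(path)

# one lazy pandas.DataFrame per input file
parts = [load_one(path) for path in ["a.custom", "b.custom", "c.custom"]]

# combine the lazy partitions into a single dask dataframe
ddf = dd.from_delayed(parts)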
Q: How do I store arbitrary metadata onto a dask.dataframe?
A: This is not supported. Generally I recommend using a different data structure to store your metadata if possible. If there are a number of use cases for this, then we should consider adding it to dask dataframe; if that is the case, please raise an issue. Generally though, it would be good to see better support for this in Pandas before dask.dataframe considers supporting it.
Q: I use multi-indexes heavily in Pandas, how can I integrate this workflow into dask.dataframe?
A: Unfortunately dask.dataframe does not currently support multi-indexes. These would clearly be helpful.
I am trying to use ConceptNet with the divisi2 package. The divisi2 package is specifically designed for working with knowledge in semantic networks: it takes a graph as input and converts it into SVD form. The package distribution includes the basic ConceptNet data in graph format, but this data seems to be outdated. Divisi can be used in this way: Using Divisi with ConceptNet (link). However, the data needs to be updated with ConceptNet 5 data; is there any way to do that? I have all the ConceptNet data set up locally as described in Running your own copy, so I have all the data in the SQLite database, and I also have the data in CSV format separately. How can I load this data into the Divisi package? Thanks.