How can I import a CSV file into PySpark as a Dataset? Note that I am NOT asking how to import it into a DataFrame.
While reading this page from Databricks, I learned some benefits of datasets over dataframes.
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
I want to learn how to work with them instead of RDDs and dataframes.
The linked blog post itself gives the answer: it is not possible in Python.
Note: Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.
We are using Databricks on Azure with a reasonably large cluster (20 cores, 70GB memory across 5 executors). I have a parquet file with 4 million rows. Spark reads it without trouble; call the resulting DataFrame sdf.
I am hitting the problem that the data must be converted to a Pandas dataframe. Taking the easy/obvious way pdf = sdf.toPandas() causes an out of memory error.
So I want to apply my function separately to subsets of the Spark DataFrame. The sdf itself is in 19 partitions, so what I want to do is write a function and apply it to each partition separately. Here's where mapPartitions comes in.
I was trying to write my own function like
def example_function(sdf):
    pdf = sdf.toPandas()
    # apply some Pandas and Python functions we've written to handle pdf
    output = great_function(pdf)
    return output
Then I'd use mapPartitions to run that.
sdf.rdd.mapPartitions(example_function)
That fails with all kinds of errors.
Looking back at the instructions, I realize I'm clueless! I was too optimistic/simplistic about what mapPartitions expects to get from me. The docs don't seem to imagine that I'm using my own functions to handle the whole chunk of the Spark DataFrame that lives in a partition. They seem to plan only for code that handles the rows of the Spark DataFrame one at a time, and the parameters are iterators.
Can you please share your thoughts on this?
In your example case it might be counterproductive to start from a Spark DataFrame and fall back to an RDD if you're aiming at using pandas.
Under the hood, toPandas() triggers collect(), which retrieves all the data onto the driver node, and that will fail on large data.
If you want to use pandas code on Spark, you can use pandas UDFs which are equivalent to UDFs but designed and optimized for pandas code.
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
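For reference, here is a minimal sketch of the function shape that rdd.mapPartitions works with: it receives an iterator over the rows of one partition and must return an iterable, so the pandas DataFrame has to be built inside the function from those rows. The body of great_function and the column names a and b are hypothetical stand-ins, and plain dicts are used in place of Spark Row objects so the sketch runs without a cluster.

```python
import pandas as pd


def great_function(pdf):
    # Hypothetical stand-in for the pandas logic mentioned in the question.
    return pdf.assign(total=pdf["a"] + pdf["b"])


def process_partition(rows):
    """Iterator of rows in, iterator of results out (the mapPartitions contract)."""
    # Materialise only this partition's rows. Spark Row objects expose asDict();
    # plain dicts are accepted too so the function can be tried locally.
    records = [row.asDict() if hasattr(row, "asDict") else dict(row) for row in rows]
    if not records:
        return iter(())  # empty partitions are possible
    pdf = pd.DataFrame(records)
    result = great_function(pdf)
    return iter(result.to_dict("records"))


# Local check with plain dicts standing in for one partition.
out = list(process_partition(iter([{"a": 1, "b": 2}, {"a": 3, "b": 4}])))
```

On a cluster the call would look like sdf.rdd.mapPartitions(process_partition). On Spark 3.0 and later, DataFrame.mapInPandas follows a similar contract but passes an iterator of pandas DataFrames directly, which avoids the row-by-row conversion.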
I did not find a solution using Spark map or similar. Here is the best option I've found.
The parquet folder has lots of smaller parquet files inside it. As long as default settings were used, these files have the extension .snappy.parquet. Use Python's os.listdir and filter the file list down to the ones with the correct extension.
Use Python and Pandas tools, NOT SPARK, to read the individual parquet files. It is much faster to load a parquet file with a few hundred thousand rows with pandas than it is with Spark.
For the loaded dataframes, run the function I described in the first message, where the dataframe gets put through the wringer.
def example_function(pdf):
    # apply some Pandas and Python functions we've written to handle pdf
    output = great_function(pdf)
    return output
Since the work for each data section has to happen in Pandas anyway, there's no need to keep fighting with Spark tools.
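One other bit worth mentioning is that joblib's Parallel tool can be used to run this work in parallel. By default it spreads the work over the cores of one machine; reaching other cluster nodes requires a distributed backend.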
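The steps above can be sketched roughly as follows. This is only an illustration under assumptions: list_parquet_parts and process_parts are hypothetical helper names, and the reader is passed in as a parameter so the sketch runs with the standard library alone. In real use the reader would be something like pandas.read_parquet and func would be example_function.

```python
import os
import tempfile


def list_parquet_parts(folder, suffix=".snappy.parquet"):
    """Return the part files in a parquet folder, skipping markers such as _SUCCESS."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(suffix)
    )


def process_parts(folder, reader, func):
    """Read each part file with `reader` and apply `func` to the result."""
    return [func(reader(path)) for path in list_parquet_parts(folder)]


# Demo on empty placeholder files; os.path.getsize stands in for a real reader.
with tempfile.TemporaryDirectory() as folder:
    for name in ("part-0.snappy.parquet", "part-1.snappy.parquet", "_SUCCESS"):
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(p) for p in list_parquet_parts(folder)]
    sizes = process_parts(folder, reader=os.path.getsize, func=lambda n: n)
```

Because each part file is handled independently, the list comprehension in process_parts is the place where a parallel map could be swapped in.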
I am using pandas to read CSV file data, but the built-in csv module can also handle CSV files.
So my questions are:
What is the difference between the two?
What are the cons of using pandas over the csv module?
Based upon benchmarks
CSV is faster to load data for smaller datasets (< 1K rows)
Pandas is several times faster for larger datasets
Code to Generate Benchmarks
Benchmarks
csv is a built-in module but pandas is not. If you only want to read a CSV file, you do not need pandas: it has to be installed separately, and adding dependencies to a project without need is not good practice.
If you want to analyze the data in the CSV file with pandas, pandas converts the file into a DataFrame, which is what you need for manipulating the data, so you should not use the csv module in that case.
If you have a large volume of data, you should consider libraries like numpy and pandas.
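As a small illustration of the standard-library route, here is a minimal sketch that reads CSV text with csv.DictReader. The sample data and the read_rows helper are made up for the example. Note that every value comes back as a string, so type conversion is left to you, which is part of what pandas would otherwise do automatically.

```python
import csv
import io

# Hypothetical sample data; a real file would be opened with open(path, newline="").
sample = "name,score\nalice,90\nbob,85\n"


def read_rows(text_stream):
    """Return the rows of a CSV stream as a list of dicts keyed by the header."""
    return list(csv.DictReader(text_stream))


rows = read_rows(io.StringIO(sample))

# Values are strings, so numeric work needs an explicit conversion.
average = sum(int(row["score"]) for row in rows) / len(rows)
```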
Pandas is better than csv for managing data and performing operations on it. The csv module doesn't provide the scientific data manipulation tools that Pandas does.
If you are talking only about reading the file, it depends. You can look up both modules online, but I generally find Pandas more comfortable to work with. It is also easier to read, since its printed output is nicer.
In ML.NET, what are the counterparts of the NumPy/Pandas Python libraries?
Here are all the available .NET counterparts that I know of:
Numpy
There are a few Tensor type proposals in dotnet/corefx:
https://github.com/dotnet/corefx/issues/25779
https://github.com/dotnet/corefx/issues/34527
There is also an implementation of NumPy made by the SciSharp org.
Pandas
On dotnet/corefx there is a DataFrame Discussion issue, which has spawned a dotnet/corefxlab project to implement a C# DataFrame library similar to Pandas.
There are also other DataFrame implementations:
Deedle
SciSharp Pandas.NET
ML.NET
In ML.NET, IDataView is an interface that abstracts the underlying storage for tabular data, e.g. a DataFrame. It doesn't have the full, rich API that a Pandas DataFrame has, but instead it supports reading data from any underlying source, for example a text file, a SQL table, in-memory objects, etc.
There currently isn't a "data exploration" API in ML.NET v1.0, like you would have with a Pandas DataFrame. The current plan is for the corefxlab DataFrame class to implement IDataView, and then you can use DataFrame to do the data exploration, and feed it directly into ML.NET.
UPDATE: For a "data exploration" API similar to Pandas, check out the Microsoft.Data.Analysis package, which is currently in preview. It implements IDataView and can be fed directly into ML.NET to train or make predictions.
It is mostly the regular .NET types + the IDataView types.
The document is a bit out of date.
I'm using the sample Python Machine Learning "IRIS" dataset (for starting point of a project). These data are POSTed into a Flask web service. Thus, the key difference between what I'm doing and all the examples I can find is that I'm trying to load a Pandas DataFrame from a variable and not from a file or URL which both seem to be easy.
I extract the IRIS data from Flask's POST request.values. All good. But at that point, I can't figure out how to get a pandas DataFrame the way "pd.read_csv(....)" does. So far, it seems the only solution is to parse each row and build up several Series that I can use with the DataFrame constructor. There must be something I'm missing, since reading this data from a URL is a simple one-liner.
I'm assuming reading a variable into a Pandas DataFrame should not be difficult since it seems like an obvious use-case.
I tried wrapping with io.StringIO(csv_data), then following up with read_csv on that variable, but that doesn't work either.
Note: I also tried things like ...
data = pd.DataFrame(csv_data, columns=['....'])
but got nothing but errors (for example, "constructor not called correctly!")
I am hoping for a simple method to call that can infer the columns and names and create the DataFrame for me, from a variable, without me needing to know a lot about Pandas (just to read and load a simple CSV data set, anyway).
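For comparison, this is a minimal sketch of the StringIO approach mentioned above, assuming csv_data is CSV text with a header row. The sample rows, the column names, and the frame_from_text helper are hypothetical. A common reason for this approach to fail is that the variable holds bytes rather than str, so the sketch decodes bytes first; whether that is the cause here cannot be told from the question.

```python
import io

import pandas as pd

# Hypothetical payload standing in for what was extracted from the POST request.
csv_data = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)


def frame_from_text(text):
    """Build a DataFrame from CSV held in a variable instead of a file."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")  # io.StringIO only accepts str
    # read_csv accepts any file-like object and infers columns from the header.
    return pd.read_csv(io.StringIO(text))


df = frame_from_text(csv_data)
```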
TL;DR: How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection?
I currently have a proprietary file format that I'm using to feed into dask.DataFrame.
I have a function that accepts a file path and returns a pandas.DataFrame, used internally by dask.DataFrame successfully to load multiple files to the same dask.DataFrame.
Up until recently, I was using my own code to merge several pandas.DataFrames into one, and now I'm working on using dask instead. When parsing the file format I may encounter errors and certain conditions that I want to log and associate with the dask.DataFrame object as metadata (logs, origin of data, etc).
It's important to note that, when reasonable, I'm using MultiIndexes quite heavily (13 index levels, 3 column levels). For metadata that describes the entire dataframe and not specific rows, I'm using attributes.
Using a custom function, I could pass the metadata in a tuple with the actual DataFrame. Using pandas, I could add it to the _metadata field and as attributes to the DataFrame objects.
How can I collect metadata from separate pandas.DataFrame objects when using the dask framework?
Thanks!
There are a few potential questions here:
Q: How do I load data from many files in a custom format into a single dask dataframe
A: You might check out the dask.delayed to load data and dask.dataframe.from_delayed to convert several dask Delayed objects into a single dask dataframe. Or, as you're probably doing now, you can use dask.dataframe.from_pandas and dask.dataframe.concat. See this example notebook on using dask.delayed from custom objects/functions.
Q: How do I store arbitrary metadata onto a dask.dataframe?
A: This is not supported. Generally I recommend using a different data structure to store your metadata if possible. If there are a number of use cases for this then we should consider adding it to dask dataframe. If this is the case then please raise an issue. Generally, though, it'd be good to see better support for this in Pandas before dask.dataframe considers supporting it.
Q: I use multi-indexes heavily in Pandas, how can I integrate this workflow into dask.dataframe?
A: Unfortunately dask.dataframe does not currently support multi-indexes. These would clearly be helpful.