I am currently using Databricks to process data coming from our Azure Data Lake. The majority of the data is read into pySpark dataframes, and the datasets are relatively large. However, I do have to perform some joins against smaller static tables to fetch additional attributes.
Currently, the only way I can do this is by converting those smaller static tables into pySpark dataframes as well. I'm just curious whether using such a small table as a pySpark dataframe is bad practice. I know pySpark is meant for large datasets that need to be distributed, but given that my large dataset is in a pySpark dataframe, I assumed I would have to convert the smaller static table into a pySpark dataframe as well in order to make the appropriate joins.
Any tips on best practices for joining with very small datasets would be appreciated. Maybe I am overcomplicating something which isn't even a big deal, but I was curious. Thanks in advance!
Take a look at broadcast joins. They are wonderfully explained here: https://mungingdata.com/apache-spark/broadcast-joins/
The best practice in your case is to broadcast your small DataFrame and join it to your large DataFrame, like the code below:
import org.apache.spark.sql.functions.broadcast
// join on whatever key column(s) the two tables share
val joinedDF = largeDF.join(broadcast(smallDF), Seq("key"))
We are using Databricks on Azure with a reasonably large cluster (20 cores, 70 GB of memory across 5 executors). I have a parquet file with 4 million rows, which Spark reads without trouble; call that sdf.
I am hitting the problem that the data must be converted to a Pandas dataframe. Taking the easy/obvious way, pdf = sdf.toPandas(), causes an out-of-memory error.
So I want to apply my function separately to subsets of the Spark DataFrame. The sdf itself is in 19 partitions, so what I want to do is write a function and apply it to each partition separately. Here's where mapPartitions comes in.
I was trying to write my own function like
def example_function(sdf):
    pdf = sdf.toPandas()
    # apply some Pandas and Python functions we've written to handle pdf
    output = great_function(pdf)
    return output
Then I'd use mapPartitions to run that.
sdf.rdd.mapPartitions(example_function)
That fails with all kinds of errors.
Looking back at the instructions, I realize I'm clueless! I was too optimistic/simplistic about what they expect to get from me. They don't seem to imagine that I'm using my own functions to handle the whole piece of the Spark DataFrame that exists in each partition. They seem to plan only for code that would handle the rows in the Spark data frame one row at a time, and the parameters are Iterators.
Can you please share your thoughts on this?
In your example case it might be counterproductive to start from a Spark DataFrame and fall back to RDDs if you're aiming to use pandas.
Under the hood, toPandas() triggers collect(), which retrieves all the data on the driver node and will fail on large data.
If you want to use pandas code on Spark, you can use pandas UDFs, which are equivalent to UDFs but designed and optimized for pandas code.
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
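For instance, in Spark 3.x there is also DataFrame.mapInPandas, which hands your function an iterator of pandas DataFrames per partition. A minimal sketch, assuming Spark 3.x and reusing great_function from the question (the output schema string is a placeholder):
from typing import Iterator
import pandas as pd

def apply_per_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:                # each batch is a pandas DataFrame chunk of a partition
        yield great_function(pdf)      # must return a pandas DataFrame matching the schema below

result_sdf = sdf.mapInPandas(apply_per_partition, schema="col1 string, col2 double")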
I did not find a solution using Spark map or similar. Here is the best option I've found.
The parquet folder has lots of smaller parquet files inside it. As long as default settings were used, these files have the extension snappy.parquet. Use Python's os.listdir and filter the file list down to the ones with the correct extension.
Use Python and Pandas tools, NOT SPARK, to read the individual parquet files. It is much faster to load a parquet file with a few hundred thousand rows with pandas than it is with Spark.
For the loaded dataframes, run the function I described in the first message, where the dataframe gets put through the wringer.
def example_function(pdf):
    # apply some Pandas and Python functions we've written to handle pdf
    output = great_function(pdf)
    return output
Since the work for each data section has to happen in Pandas anyway, there's no need to keep fighting with Spark tools.
Another bit worth mentioning is that joblib's Parallel tool can be used to distribute this work, as sketched below.
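A rough sketch of that approach (the folder path below is a placeholder, and example_function is the function from above; by default Parallel fans out over local cores rather than cluster nodes):
import os
import pandas as pd
from joblib import Parallel, delayed

parquet_dir = "/dbfs/mnt/data/my_table"   # hypothetical path to the parquet folder
paths = [
    os.path.join(parquet_dir, f)
    for f in os.listdir(parquet_dir)
    if f.endswith(".snappy.parquet")
]

def process_file(path):
    pdf = pd.read_parquet(path)           # pandas read, not Spark
    return example_function(pdf)

# run the per-file work in parallel
results = Parallel(n_jobs=-1)(delayed(process_file)(p) for p in paths)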
I'm handling some CSV files with sizes in the range of 1 GB to 2 GB. It takes 20-30 minutes just to load the files into a pandas dataframe, and 20-30 minutes more for each operation I perform, e.g. filtering the dataframe by column names, printing dataframe.head(), etc. Sometimes it also lags my computer when I try to use another application while I wait. I'm on a 2019 Macbook Pro, but I imagine it'll be the same for other devices.
I have tried using modin, but the data manipulations are still very slow.
Is there any way for me to work more efficiently?
Thanks in advance for the responses.
The pandas docs on Scaling to Large Datasets have some great tips which I'll summarize here:
Load less data. Read in a subset of the columns or rows using the usecols or nrows parameters to pd.read_csv. For example, if your data has many columns but you only need the col1 and col2 columns, use pd.read_csv(filepath, usecols=['col1', 'col2']). This can be especially important if you're loading datasets with lots of extra commas (e.g. the rows look like index,col1,col2,,,,,,,,,,,). In that case, use nrows first to read in just a small subset of the data and confirm which columns you actually need before loading the full file.
Use efficient datatypes. By default, pandas stores all integer data as signed 64-bit integers, floats as 64-bit floats, and strings as objects or string types (depending on the version). You can convert these to smaller data types with tools such as Series.astype or pd.to_numeric with the downcast option.
Use chunking. Parsing huge blocks of data can be slow, especially if your plan is to operate row-wise and then write the result out, or to cut the data down to a smaller final form. Pass chunksize to pd.read_csv to iterate over the file in pieces (see the sketch after this list), or alternately use the low_memory flag to get pandas to use the chunked iterator on the backend while still returning a single dataframe.
Use other libraries. There are a couple of great libraries listed there, but I'd especially call out dask.dataframe, which specifically works toward your use case by enabling chunked, multi-core processing of CSV files, mirrors the pandas API, and has easy ways of converting the data back into a normal pandas dataframe (if desired) after processing.
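A rough sketch of the chunking and efficient-datatype tips above (the file name, column names, chunk size, and filter condition are all made up for illustration):
import pandas as pd

filtered_chunks = []
for chunk in pd.read_csv("big_file.csv", chunksize=500_000):
    # downcast 64-bit numbers to the smallest type that fits
    chunk["col1"] = pd.to_numeric(chunk["col1"], downcast="integer")
    chunk["col2"] = pd.to_numeric(chunk["col2"], downcast="float")
    # cut each chunk down before concatenating, so the final frame stays small
    filtered_chunks.append(chunk[chunk["col2"] > 0])

df = pd.concat(filtered_chunks, ignore_index=True)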
Additionally, there are some csv-specific things I think you should consider:
Specifying column data types. Especially if chunking, but even if you're not, specifying the column types can dramatically reduce read time and memory usage and highlight problem areas in your data (e.g. NaN indicators or flags that don't meet one of pandas's defaults). Use the dtype parameter, either with a single data type to apply to all columns or with a dict of column name/data type pairs, to indicate the types to read in. Optionally, you can provide converters to format dates, times, or other numerical data if it's not in a format recognized by pandas.
Specifying the parser engine. pandas can read csvs in pure Python (slow) or C (much faster). The Python engine has slightly more features (e.g. currently the C engine can't read files with complex multi-character delimiters and it can't skip footers). Try using the argument engine='c' to make sure the C engine is being used. If you need one of the unsupported features, I'd try fixing the file(s) first manually (e.g. stripping out a footer) and then parsing with the C engine, if possible.
Make sure you're catching all NaNs and data flags in numeric columns. This can be a tough one, and specifying specific data types in your inputs can be helpful in catching bad cases. Use the na_values, keep_default_na, date_parser, and converters arguments to pd.read_csv. Currently, the default list of values interpreted as NaN is ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null']. For example, if your numeric columns have non-numeric values coded as notANumber, then this would be missed and would either cause an error (if you had dtypes specified) or would cause pandas to re-categorize the entire column as an object column (suuuper bad for memory and speed!). A combined sketch of these hints follows this list.
Read the pd.read_csv docs over and over and over again. Many of the arguments to read_csv have important performance considerations. pd.read_csv is optimized to smooth over a large amount of variation in what can be considered a csv, and the more magic pandas has to be ready to perform (determine types, interpret nans, convert dates (maybe), skip headers/footers, infer indices/columns, handle bad lines, etc) the slower the read will be. Give it as many hints/constraints as you can and you might see performance increase a lot! And if it's still not enough, many of these tweaks will also apply to the dask.dataframe API, so this scales up further nicely.
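A combined sketch of the csv-specific hints above; the column names, dtypes, and custom NaN markers are all hypothetical:
import pandas as pd

df = pd.read_csv(
    "big_file.csv",
    engine="c",                                   # make sure the fast C parser is used
    usecols=["col1", "col2", "label", "when"],    # only the columns we need
    dtype={"col1": "int32", "col2": "float32", "label": "category"},
    parse_dates=["when"],
    na_values=["notANumber", "-999"],             # project-specific NaN markers
)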
This may or may not help you, but I found that storing data in HDF files has significantly improved IO speed. If you are ultimately the source of CSV files, I think you should try to store them as HDF instead. Otherwise what Michael has already said may be the way to go.
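If you do go that route, here's a minimal sketch of the one-time conversion (file names are made up; pandas's HDF support requires the PyTables package):
import pandas as pd

# one-time conversion from CSV to HDF
df = pd.read_csv("big_file.csv")
df.to_hdf("big_file.h5", key="data", format="table")

# later loads read the binary HDF store instead of re-parsing the CSV
df = pd.read_hdf("big_file.h5", key="data")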
Consider using polars. It is typically orders of magnitude faster than pandas. Here are some benchmarks backing that claim.
If you really want full performance, consider using the lazy API. All the filters you describe can potentially even be applied at scan level. You can also parallelize over all files easily with pl.collect_all().
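A minimal sketch of the lazy approach (file and column names are made up):
import polars as pl

# build lazy queries; column selection and filters can be pushed down to the CSV scan
lazy_frames = [
    pl.scan_csv(path)
    .select(["id", "value"])
    .filter(pl.col("value") > 0)
    for path in ["file1.csv", "file2.csv"]
]

# evaluate all lazy queries in parallel
dfs = pl.collect_all(lazy_frames)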
Based on your description, you could be fine with processing these csv files as streams instead of fully loading them into memory/swap just to filter them and call head.
There's a Table (docs) helper in convtools library (github), which helps with streaming csv-like files, applying transforms and of course you can pipe the resulting stream of rows to another tool of your choice (polars / pandas).
For example:
import pandas as pd
from convtools import conversion as c
from convtools.contrib.tables import Table
pd.DataFrame(
Table.from_csv("input.csv", header=True)
.take("a", "c")
.update(b=c.col("a") + c.col("c"))
.filter(c.col("b") < -2)
.rename({"a": "A"})
.drop("c")
.into_iter_rows(dict) # .into_csv("out.csv") if passing to pandas is not needed
)
There is already a nice question about this on SO, but the best answer is now 5 years old, so I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for a larger-than-memory dataset (using suitable dtypes).
The initial file is a csv that doesn't fit in memory. Here are my needs:
Create features (mainly using groupby operations on multiple columns.)
Merge the new feature to the previous data (on disk because it doesn't fit in memory)
Use a subset (or all) columns/index for some ML applications
Repeat 1/2/3 (this is an iterative process, e.g. day 1: create 4 features, day 2: create 4 more ...)
Attempt with parquet and dask:
First, I split the big csv file into multiple small parquet files. With this, dask is very efficient for calculating new features, but then I need to merge them into the initial dataset and, at the moment, we cannot add new columns to existing parquet files. Reading the csv by chunks, merging, and re-saving to multiple parquet files is too time-consuming, as feature engineering is an iterative process in this project.
Attempt with HDF and dask:
I then turned to HDF because we can add columns, use special queries, and it is still a binary file storage. Once again I split the big csv file into multiple HDF files with the same key='base' for the base features, so that DASK could write the files concurrently (concurrent writing to a single file is not allowed by HDF).
data = data.repartition(npartitions=10) # otherwise it was saving 8 MB files when using to_hdf
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Side question: specifying data_columns seems useless for dask, as there is no "where" in dask.read_hdf?)
Unlike what I expected, I am not able to merge the new feature into the multiple small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
With dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and creates new files.
What are the most appropriated tools (storage and processing) for this workflow?
I will just make a copy of a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either.
Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. In addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.
I would seriously consider using a database (indexed access) as storage, or even using Apache Spark (for processing data in a distributed / clustered way) with Hive / Impala as a backend ...
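To make the second suggestion concrete, here's a rough sketch of what it could look like in PySpark; the table and column names are invented for illustration:
# rough sketch of the Spark + Hive idea; table and column names are invented
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

base = spark.table("base_features")                    # base data stored as a Hive table
new_feat = base.groupBy("user_id").agg(F.mean("day").alias("day_mean"))
new_feat.write.mode("overwrite").saveAsTable("feat_day_mean")   # each new feature set = its own table

# later, join in only the feature tables needed for the ML step
training = base.join(spark.table("feat_day_mean"), on="user_id", how="left")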
TL;DR: How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection?
I currently have a proprietary file format I'm using to feed into dask.DataFrame.
I have a function that accepts a file path and returns a pandas.DataFrame, which dask.DataFrame uses internally and successfully to load multiple files into the same dask.DataFrame.
Up until recently, I was using my own code to merge several pandas.DataFrames into one, and now I'm working on using dask instead. When parsing the file format I may encounter errors and certain conditions I want to log and associate with the dask.DataFrame object as metadata (logs, origin of data, etc.).
It's important to note that, where reasonable, I'm using MultiIndices quite heavily (13 index levels, 3 column levels). For metadata that describes the entire dataframe and not specific rows, I'm using attributes.
Using a custom function, I could pass the metadata in a tuple with the actual DataFrame. Using pandas, I could add it to the _metadata field and as attributes to the DataFrame objects.
How can I collect metadata from separate pandas.DataFrame objects when using the dask framework?
Thanks!
There are a few potential questions here:
Q: How do I load data from many files in a custom format into a single dask dataframe
A: You might check out dask.delayed to load the data and dask.dataframe.from_delayed to convert several dask Delayed objects into a single dask dataframe. Or, as you're probably doing now, you can use dask.dataframe.from_pandas and dask.dataframe.concat. See this example notebook on using dask.delayed with custom objects/functions.
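For example, a minimal sketch that also keeps the parsing metadata outside the dataframe itself (parse_my_format and file_paths are assumptions standing in for your parser and file list):
import dask
import dask.dataframe as dd

@dask.delayed
def load_one(path):
    pdf, meta = parse_my_format(path)   # your parser: returns (pandas.DataFrame, metadata)
    return pdf, meta

pairs = [load_one(p) for p in file_paths]
frames = [pair[0] for pair in pairs]    # indexing a Delayed yields another Delayed
metas = [pair[1] for pair in pairs]

ddf = dd.from_delayed(frames)           # single dask dataframe built from all files
all_metadata = dask.compute(*metas)     # parsing metadata collected separately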
Q: How do I store arbitrary metadata onto a dask.dataframe?
A: This is not supported. Generally I recommend using a different data structure to store your metadata if possible. If there are a number of use cases for this then we should consider adding it to dask dataframe. If this is the case then please raise an issue. Generally, though, it would be good to see better support for this in Pandas before dask.dataframe considers supporting it.
Q: I use multi-indexes heavily in Pandas, how can I integrate this workflow into dask.dataframe?
A: Unfortunately dask.dataframe does not currently support multi-indexes. These would clearly be helpful.