I would like to use Dask to parallelize a number-crunching task.
This task utilizes only one of the cores in my computer.
As a result of that task I would like to add an entry to a DataFrame via shared_df.loc[len(shared_df)] = [x, 'y']. This DataFrame should be populated by all the (four) parallel workers / threads in my computer.
How do I have to setup dask to perform this?
The right way to do something like this, in rough outline:
make a function that, for a given argument, returns a data-frame of some part of the total data
wrap this function in dask.delayed, make a list of calls for each input argument, and make a dask-dataframe with dd.from_delayed
if you really need a sorted index, or need the data partitioned along different lines than the chunking you applied in the previous step, you may want to call set_index
Please read the docstrings and examples for each of these steps!
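A minimal sketch of that outline, assuming a hypothetical single-core function compute_one(x) that returns a one-row pandas DataFrame per input (the column names and input range are illustrative, not from the question):
import pandas as pd
import dask
import dask.dataframe as dd

def compute_one(x):
    # hypothetical number-crunching step that returns a partial result
    return pd.DataFrame({'value': [x], 'label': ['y']})

inputs = range(100)
parts = [dask.delayed(compute_one)(x) for x in inputs]           # one delayed call per input
meta = pd.DataFrame({'value': pd.Series(dtype='int64'),
                     'label': pd.Series(dtype='object')})        # empty frame describing the schema
ddf = dd.from_delayed(parts, meta=meta)                          # lazy dask dataframe over all parts
result = ddf.compute(scheduler='processes')                      # run across the available workers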
I have a big pyspark dataframe on which I am performing a number of transformations and joins with other dataframes. I would like to check whether the transformations and joins succeed and whether the dataframe looks as intended, but how can I show a small subset of the dataframe?
I have tried numerous things e.g.
df.show(5)
and
df.limit(5).show()
but everything I try requires a large portion of jobs resulting in slow performance.
I could spin up a very large cluster, but is there a way of getting only a small subset of the dataframe, fast?
Try the RDD equivalent of the dataframe:
rdd_df = df.rdd
rdd_df.take(5)
Or, try printing the dataframe schema:
df.printSchema()
First, to show a certain number of rows you can use the limit() method after calling a select() method, like this:
df.select('*').limit(5).show()
Also, the df.show() action will only print the first 20 rows; it will not print the whole dataframe.
Second, and more importantly,
Spark actions:
A Spark dataframe does not contain data; it contains instructions and an operation graph. Since Spark works with big data, it does not allow operations to execute as soon as they are called, in order to prevent slow performance. Instead, methods are separated into two kinds, actions and transformations; transformations are collected and held in the operation graph.
An action is a method that causes the dataframe to execute all the accumulated operations in the graph, which is what causes the slow performance, since it executes everything (note: UDFs are extremely slow).
show() is an action: when you call show(), it has to compute every other transformation in order to show you the true data.
Keep that in mind.
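For instance, a minimal sketch (the column and dataframe names here are assumptions, not from the question):
# transformations return immediately; nothing has been computed yet
filtered = df.filter(df.amount > 100).join(other_df, 'id')

# the action triggers execution of the whole accumulated graph
filtered.show(5)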
To iterate faster you have to understand the difference between actions and transformations.
A transformation is any operation that results in another RDD/Spark dataframe, for example df.filter(...).join(...).groupBy(...). An action is any operation that results in something other than an RDD/dataframe, for example df.write(...), df.count(), or df.show().
Transformations are lazy: unlike plain Python, where df1 = df.filter(...) and df2 = df1.groupby(...) would put df, df1, and df2 in memory, no data flows through memory until you call an action, such as the .show() in your case.
Calling df.limit(5).show() will not speed up your iteration, because that limit only restricts the final dataframe that gets printed, not the original data flowing through memory.
As others have suggested, you should be able to limit your input data size in order to test whether your transformations work faster. To further improve iteration, you can cache dataframes produced by matured transformations instead of recomputing them over and over again.
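A hedged sketch of those two ideas, assuming an existing SparkSession named spark, a hypothetical input path, and a hypothetical status column:
# read only a small sample of the input while developing (path is hypothetical)
df = spark.read.parquet("s3://bucket/events/").limit(1000)

# ...apply the transformations and joins under test...
transformed = df.filter(df.status == "ok")

# cache the result of the matured transformations so repeated actions reuse it
transformed.cache()
transformed.show(5)    # first action materializes and caches the result
transformed.count()    # subsequent actions read from the cache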
I'm processing a bunch of text-based records in CSV format using Dask, which I am learning to use to work around too-large-to-fit-in-memory problems, and I'm trying to filter records within groups that best match a complicated set of criteria.
The best approach I've identified so far is to basically use Dask to group records into bite-sized chunks and then write the applicable logic in Python:
import pandas as pd

def reduce_frame(partition):
    records = partition.to_dict('records')
    shortlisted_records = []
    # Use Python to locate promising looking records.
    # Some of the criteria can be cythonized; one criterion
    # revolves around whether a record is a parent or child
    # of records in shortlisted_records.
    for record in records:
        for other in shortlisted_records:
            if other['path'].startswith(record['path']) \
                    or record['path'].startswith(other['path']):
                ...  # keep one, possibly both
        ...
    return pd.DataFrame.from_dict(shortlisted_records)

df = df.groupby('key').apply(reduce_frame, meta={...})
In case it matters, the complicated criteria revolve around weeding out promising looking links on a web page based on link URL, link text, and CSS selectors across the entire group. Think: given A and B in the shortlist and C a new record, keep all three if each is very promising, else prefer C over A and/or B if it is more promising than either or both, else drop C. The resulting Pandas partition objects above are tiny. (The dataset in its entirety is not, hence my use of Dask.)
Seeing how Pandas exposes inherently row- and column-based functionality, I'm struggling to imagine any vectorized approach to solving this, so I'm exploring writing the logic in plain Python.
Is the above the correct way to proceed, or are there more Dask/Pandas idiomatic ways, or simply better ways, to approach this type of problem? Ideally one that allows me to parallelize the computations across a cluster, for instance by using Dask.bag or Dask.delayed and/or cytoolz or something else I might have missed while learning Python?
I know nothing about Dask, but I can say a little about passing / blocking
some rows using Pandas.
It is possible to use groupby(...).apply(...) to "filter" the
source DataFrame.
Example: df.groupby('key').apply(lambda grp: grp.head(2)) returns the
first 2 rows from each group.
In your case, write a function to be applied to each group, which:
contains some logic, processing the current group,
generates the output DataFrame based on this logic, e.g. returning
only some of the input rows.
The returned rows are then concatenated, forming the result of apply.
Another possibility is to use groupby(...).filter(...), but in this
case the underlying function returns a decision to pass or block
each whole group of rows.
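For example, a minimal sketch with a hypothetical group-level rule that passes only groups containing at least 3 rows:
# filter() passes or blocks each group as a whole
df_filtered = df.groupby('key').filter(lambda grp: len(grp) >= 3)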
Yet another possibility is to define a "filtering function",
say filtFun, which returns True (pass the row) or False (block the row).
Then:
Run: msk = df.apply(filtFun, axis=1) to generate a mask (which rows
passed the filter).
In further processing use df[msk], i.e. only these rows which passed
the filter.
But in this case the underlying function has access only to the current row,
not to the whole group of rows.
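A minimal sketch of that row-wise variant, with a hypothetical filtFun that checks a single column:
def filtFun(row):
    # hypothetical per-row criterion
    return row['path'].startswith('/promising/')

msk = df.apply(filtFun, axis=1)   # boolean mask: True = row passed the filter
df_passed = df[msk]               # keep only the rows that passed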
I have a large collection of entries E and a function f: E --> pd.DataFrame. The execution time of function f can vary drastically for different inputs. Finally all DataFrames should be concatenated into a single DataFrame.
The situation I'd like to avoid is a partitioning (using 2 partitions for the sake of the example) where accidentally all fast function executions happen on partition 1 and all slow executions on partition 2, thus not optimally using the workers.
partition 1:
[==][==][==]
partition 2:
[============][=============][===============]
--------------------time--------------------->
My current solution is to iterate over the collection of entries and create a Dask graph using delayed, aggregating the delayed partial DataFrame results into a final result DataFrame with dd.from_delayed.
from dask import delayed
from dask.dataframe import from_delayed
from dask.dataframe.utils import make_meta

delayed_dfs = []
for e in collection:
    delayed_partial_df = delayed(f)(e, arg2, ...)
    delayed_dfs.append(delayed_partial_df)

result_df = from_delayed(delayed_dfs, meta=make_meta({..}))
I reasoned that the Dask scheduler would take care of optimally assigning work to the available workers.
Is this a correct assumption?
Would you consider the overall approach reasonable?
As mentioned in the comments above, yes, what you are doing is sensible.
The tasks will be assigned to workers initially, but if some workers finish their allotted tasks before others then they will dynamically steal tasks from those workers with excess work.
Also as mentioned in the comments, you might consider using the diagnostic dashboard to get a good sense of what the scheduler is doing. All of the information about worker load, work stealing, etc. is easily viewable.
http://distributed.readthedocs.io/en/latest/web.html
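A minimal sketch of getting at the dashboard, assuming a local dask.distributed cluster and a recent version of distributed:
from dask.distributed import Client

client = Client()                # start a local cluster with the default number of workers
print(client.dashboard_link)     # open this URL to watch task placement and work stealing live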
tl;dr: Is it possible to call the .set_index() method on several Dask DataFrames in parallel concurrently? Alternatively, is it possible to call .set_index() lazily on several Dask DataFrames, which, consequently, would lead to the indexes being set in parallel concurrently?
Here is the scenario:
I have several time series
Each time series is stored in several .csv files. Each file contains data related to a specific day. Also, the files are scattered amongst different folders (each folder contains data for one month).
Each time series has different sampling rates
All time series have the same columns. All have a column which contains DateTime, amongst others.
Data is too large to be processed in memory. That's why I am using Dask.
I want to merge all the time series into a single DataFrame, aligned by DateTime. For this, I need to first resample() each time series to a common sampling rate, and then .join() all of them.
.resample() can only be applied to an index. Hence, before resampling I need to .set_index() on the DateTime column of each time series.
When I call the .set_index() method on one time series, computation starts immediately, which leads to my code blocking and waiting. At this moment, if I check my machine's resource usage, I can see that many cores are being used but the usage does not go above ~15%. This makes me think that, ideally, I could have the .set_index() method applied to more than one time series at the same time.
After reaching the above situation, I've tried some inelegant solutions to parallelize the application of the .set_index() method to several time series (e.g. creating a multiprocessing.Pool), which were not successful. Before giving more details on those, is there a clean way to solve the situation above? Was the above scenario considered at some point when implementing Dask?
Alternatively, is it possible to .set_index() lazily? If the .set_index() method could be applied lazily, I would create a full computation graph with the steps described above and, in the end, everything would be computed in parallel concurrently (I think).
Dask.dataframe needs to know the min and max values of all of the partitions of the dataframe in order to sensibly do datetime operations in parallel. By default it will read the data once in order to find good partitions. If the data is not sorted, it will then do a shuffle (perhaps very expensive) to sort it.
In your case it sounds like your data is already sorted and that you might be able to provide these divisions explicitly. You should look at the last example of the dd.DataFrame.set_index docstring:
A common case is when we have a datetime column that we know to be
sorted and is cleanly divided by day. We can set this index for free
by specifying both that the column is pre-sorted and the particular
divisions along which it is separated
>>> import pandas as pd
>>> divisions = pd.date_range('2000', '2010', freq='1D')
>>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions) # doctest: +SKIP
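A hedged sketch of how that might look for one of the time series, assuming hypothetical file paths, a DateTime column named 'timestamp', data covering 2017, and one partition per daily file so the daily divisions line up with the partition boundaries:
import pandas as pd
import dask.dataframe as dd

# lazily read all daily files for one series (glob pattern is hypothetical)
ts = dd.read_csv('series_a/2017-*/*.csv', parse_dates=['timestamp'])

# one division per day; with sorted=True Dask skips the data pass and the shuffle
divisions = list(pd.date_range('2017-01-01', '2018-01-01', freq='1D'))
ts = ts.set_index('timestamp', sorted=True, divisions=divisions)

# resampling to the common rate stays lazy until compute() is called
ts_common = ts.resample('1min').mean()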
I've written a program in Python and pandas which takes a very large dataset (~4 million rows per month for 6 months), groups it by 2 of the columns (date and a label), and then applies a function to each group of rows. There are a variable number of rows in each grouping - anywhere from a handful of rows to thousands of rows. There are thousands of groups per month (label-date combos).
My current program uses multiprocessing, so it's pretty efficient, and I thought it would map well to Spark. I've worked with map-reduce before, but am having trouble implementing this in Spark. I'm sure I'm missing some concept in the pipelining, but everything I've read appears to focus on key-value processing, or on splitting a distributed dataset by arbitrary partitions, rather than what I'm trying to do. Is there a simple example or paradigm for doing this? Any help would be greatly appreciated.
EDIT:
Here's some pseudo-code for what I'm currently doing:
import multiprocessing as mp
import pandas as pd

reader = pd.read_csv(...)
pool = mp.Pool(processes=4)
labels = <list of unique labels>
for label in labels:
    dates = reader[reader.label == label].date.unique()
    for date in dates:
        df = reader[(reader.label == label) & (reader.date == date)]
        pool.apply_async(process, (df,), callback=callbackFunc)
pool.close()
pool.join()
When I say asynchronous, I mean something analogous to pool.apply_async().
As of now (PySpark 1.5.0) I see only three options:
You can try to express your logic using SQL operations and UDFs. Unfortunately the Python API doesn't support UDAFs (User Defined Aggregate Functions), but it is still expressive enough, especially with window functions, to cover a wide range of scenarios.
Access to external data sources can be handled in a couple of ways, including:
access inside UDF with optional memoization
loading to a data frame and using join operation
using a broadcast variable
Converting the data frame to a PairRDD and using one of the following:
partitionBy + mapPartitions
reduceByKey / aggregateByKey
If Python is not a strong requirement, the Scala API (> 1.5.0) supports UDAFs, which enable something like this:
df.groupBy(some_columns: _*).agg(some_udaf)
Partitioning data by key and using local Pandas data frames per partition
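A hedged sketch of that last option (partitionBy plus mapPartitions with a local Pandas frame per partition); the column names, the aggregation, and the partition count are assumptions standing in for the real process() logic:
import pandas as pd

def process_partition(pairs):
    # pairs is an iterator of (key, row_tuple) for this partition;
    # build a local pandas DataFrame and apply the per-group logic to it
    pdf = pd.DataFrame([v for _, v in pairs], columns=['label', 'date', 'value'])
    for (label, date), group in pdf.groupby(['label', 'date']):
        yield (label, date, group['value'].sum())   # stand-in for the real per-group function

pair_rdd = df.rdd.map(lambda r: ((r.label, r.date), (r.label, r.date, r.value)))
results = (pair_rdd
           .partitionBy(200)                        # co-locate all rows sharing a key
           .mapPartitions(process_partition)
           .collect())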