I am currently using Databricks to process data coming from our Azure Data Lake. The majority of the data is read into PySpark dataframes and the datasets are relatively big. However, I do have to perform some joins against smaller static tables to fetch additional attributes.
Currently, the only way I can do this is by converting those smaller static tables into PySpark dataframes as well. I'm curious whether using such a small table as a PySpark dataframe is bad practice. I know PySpark is meant for large datasets that need to be distributed, but given that my large dataset is in a PySpark dataframe, I assumed I would have to convert the smaller static table into a PySpark dataframe as well in order to make the appropriate joins.
Any tips on best practices for joining with very small datasets would be appreciated. Maybe I am overcomplicating something which isn't even a big deal, but I was curious. Thanks in advance!
Take a look at Broadcast joins. Wonderfully explained here https://mungingdata.com/apache-spark/broadcast-joins/
The best practice in your case is to broadcast your small DataFrame and join it to your large DataFrame using the broadcast hint from pyspark.sql.functions, like the code below (the join key "id" and the join type are placeholders for your own):
from pyspark.sql.functions import broadcast

# broadcast() marks the small DataFrame to be shipped to every executor,
# so the large DataFrame never has to be shuffled for the join
joinedDF = largeDF.join(broadcast(smallDF), on="id", how="left")
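Worth noting: Spark can also broadcast small tables automatically when their estimated size falls below the spark.sql.autoBroadcastJoinThreshold setting (10 MB by default), so the explicit hint mainly matters when the optimizer cannot estimate the size of your static table.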
Say I have two pandas dataframes, one containing data for the general population and one containing the same data for a target group.
I assume this is a very common use case of population segmentation. My first idea to explore the data would be to do some visualization using e.g. seaborn's FacetGrid, barplot, or scatterplot to get a general idea of the trends and differences.
However, I found out that this operation is not as straightforward as I thought, because seaborn is made to analyze one dataset rather than to compare two datasets.
I found this SO answer which provides a solution. But I am wondering how people would go about it if the dataframes were huge and a concat operation were not possible.
Datashader does not seem to provide such features, as far as I have seen.
Thanks for any ideas on how to go about such a task!
I would use the library Dask when data is too big for pandas. Dask comes out of the same PyData ecosystem as pandas and is a little more advanced because it is a big data tool, but it shares many of the same features, including concat. I found Dask easy enough to use and am using it for a couple of projects where I have dozens of columns and tens of millions of rows.
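As a minimal sketch of the idea (the file patterns and the columns group, age_band, and income are made up, and the aggregation is just an example of something small enough to hand to seaborn afterwards):
import dask.dataframe as dd

population = dd.read_csv("population_*.csv")
target = dd.read_csv("target_*.csv")

# label each source so the two groups can be compared after concatenation
population["group"] = "population"
target["group"] = "target"

combined = dd.concat([population, target])

# aggregate down to a small result and materialize it as a pandas DataFrame
summary = combined.groupby(["group", "age_band"])["income"].mean().compute()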
I would like to read multiple parquet files with different schemas into a pandas dataframe with dask, and be able to merge the schemas. By different schemas I mean that there are columns common to all of these files, but some files have columns that are not present in others.
Unfortunately, when I read the files with
dd.read_parquet(my_parquet_files, engine="fastparquet")
I get only the common columns. I know that in Spark there is a read option mergeSchema; I wonder if there is a simple way to do the same in dask?
I recommend reading the different kinds of files individually, and then concatenating them with dd.concat
dfs = [dd.read_parquet(...) for ... in ...]
df = dd.concat(dfs, axis=0)
Then whatever policy pandas uses for concatenating dataframes with mixed columns will take over. If pandas supports this kind of behavior, then Dask DataFrame will likely support it as well.
If it doesn't, then it sounds like you're making a feature request, in which case you should probably raise an issue at https://github.com/dask/dask/issues/new
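As a slightly fuller sketch of that approach, reusing the my_parquet_files list from the question (nothing else is assumed):
import dask.dataframe as dd

# read each file on its own so no columns are dropped up front
dfs = [dd.read_parquet(path, engine="fastparquet") for path in my_parquet_files]

# stack them; columns missing from a given file should come back as NaN,
# following pandas' usual alignment rules for concat
df = dd.concat(dfs, axis=0)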
I had posted a question about memory errors while working with large CSV files using pandas dataframes. To be clearer, I'm asking another question: I get memory errors while merging big CSV files (more than 30 million rows). So, what is the solution for this? Thanks!
Using Python/Pandas to process datasets with tens of millions of rows isn't ideal. Rather than processing a massive CSV, consider warehousing your data into a database like Redshift where you'll be able to query and manipulate your data thousands of times faster than you could do with Pandas. Once your data is in a database you can use SQL to aggregate/filter/reshape your data into "bite size" exports and extracts for local analysis using Pandas if you'd like.
Long term, consider using Spark, which is a distributed data analysis framework built on Scala. It definitely has a steeper learning curve than Pandas but borrows a lot of the core concepts.
Redshift: https://aws.amazon.com/redshift/
Spark: http://spark.apache.org/
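If you go the Spark route, the merge itself is straightforward with the DataFrame API. A rough sketch (the file paths and the join key "id" are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-csv-merge").getOrCreate()

left = spark.read.csv("big_file_1.csv", header=True, inferSchema=True)
right = spark.read.csv("big_file_2.csv", header=True, inferSchema=True)

# Spark distributes the work and spills to disk, so the join is not limited by local RAM
merged = left.join(right, on="id", how="inner")
merged.write.parquet("merged_output.parquet")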
Question for experienced pandas users on approaches to working with DataFrame data.
Invariably we want to use pandas to explore relationships among data elements. Sometimes we use groupby-type functions to get summary-level data on subsets of the data. Sometimes we use plots and charts to compare one column of data against another. I'm sure there are other applications I haven't thought of.
When I speak with other fairly novice users like myself, they generally try to extract portions of a "large" dataframe into smaller dfs that are sorted or formatted properly for a given analysis or plot. This approach has the disadvantage that if you strip a subset of the data out into a smaller df and then want to run an analysis against a column you left in the bigger df, you have to go back and recut everything.
My question is: is it best practice for more experienced users to keep the large dataframe and pull out the data syntactically, so that the effect is the same as (or similar to) cutting out a smaller df? Or is it best to actually cut out smaller dfs to work with? A small illustration of the two approaches follows.
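(Column names here are invented, and df stands for the large dataframe, just to make the two approaches concrete.)
import pandas as pd

# Approach 1: cut out a smaller df and keep working on the copy
east = df[df["region"] == "East"].copy()
east_summary = east.groupby("product")["sales"].sum()

# Approach 2: leave the large df intact and select inline each time
east_summary = df.loc[df["region"] == "East"].groupby("product")["sales"].sum()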
Thanks in advance.
I've written a program in Python and pandas which takes a very large dataset (~4 million rows per month for 6 months), groups it by 2 of the columns (date and a label), and then applies a function to each group of rows. There are a variable number of rows in each grouping - anywhere from a handful of rows to thousands of rows. There are thousands of groups per month (label-date combos).
My current program uses multiprocessing, so it's pretty efficient, and I thought it would map well to Spark. I've worked with map-reduce before, but am having trouble implementing this in Spark. I'm sure I'm missing some concept in the pipelining, but everything I've read appears to focus on key-value processing, or on splitting a distributed dataset by arbitrary partitions, rather than on what I'm trying to do. Is there a simple example or paradigm for doing this? Any help would be greatly appreciated.
EDIT:
Here's some pseudo-code for what I'm currently doing:
import pandas as pd
import multiprocessing as mp

reader = pd.read_csv()  # path omitted
pool = mp.Pool(processes=4)
labels = reader.label.unique()
for label in labels:
    dates = reader[reader.label == label].date.unique()
    for date in dates:
        df = reader[(reader.label == label) & (reader.date == date)]
        pool.apply_async(process, (df,), callback=callbackFunc)
pool.close()
pool.join()
When I say asynchronous, I mean something analogous to pool.apply_async().
As of now (PySpark 1.5.0) I see only a few options:
You can try to express your logic using SQL operations and UDFs. Unfortunately the Python API doesn't support UDAFs (User Defined Aggregate Functions), but it is still expressive enough, especially with window functions, to cover a wide range of scenarios.
Access to external data sources can be handled in a couple of ways, including:
access inside a UDF with optional memoization
loading into a data frame and using a join operation
using a broadcast variable
Converting the data frame to a PairRDD and using one of the following:
partitionBy + mapPartitions
reduceByKey / aggregateByKey
If Python is not a strong requirement, the Scala API (> 1.5.0) supports UDAFs, which enable something like this:
df.groupBy(some_columns: _*).agg(some_udaf)
Partitioning data by key and using local Pandas data frames per partition
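For that last approach (and the partitionBy + mapPartitions route above), a rough sketch of what it could look like, reusing the label/date columns and the process function from the pseudo-code in the question (the partition count is arbitrary):
import pandas as pd

def process_partition(rows):
    # rows is an iterator over the Rows that landed in this partition
    pdf = pd.DataFrame([r.asDict() for r in rows])
    if pdf.empty:
        return
    for (label, date), group in pdf.groupby(["label", "date"]):
        yield process(group)

results = (df.rdd
             .keyBy(lambda r: (r.label, r.date))  # pair RDD keyed by (label, date)
             .partitionBy(200)                    # all rows for a given key end up in one partition
             .values()
             .mapPartitions(process_partition)
             .collect())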