I have a PySpark dataframe containing 1000 columns and 10,000 records (rows).
I need to create 2000 more columns by performing some computation on the existing columns.
df #pyspark dataframe containing 1000 columns and 10,000 records
df = df.withColumn('C1001', ((df['C269'] * df['C285'])/df['C41'])) #existing column names range from C1 to C1000
df = df.withColumn('C1002', ((df['C4'] * df['C267'])/df['C146']))
df = df.withColumn('C1003', ((df['C87'] * df['C134'])/df['C238']))
.
.
.
df = df.withColumn('C3000', ((df['C365'] * df['C235'])/df['C321']))
The issue is, this takes way too long, around 45 minutes or so.
Since I am a newbie, I was wondering what I am doing wrong?
P.S.: I am running spark on databricks, with 1 driver and 1 worker node, both having 16GB Memory and 8 Cores.
Thanks!
A lot of what you're doing is just creating an execution plan. Spark executes lazily until an action triggers it, so the 45 minutes you're seeing is probably from executing all the transformations that you've been setting up.
If you want to see how long a single withColumn takes, trigger an action such as df.count() first, then do a single withColumn followed by another df.count() (to trigger an action again).
Take a closer look at the PySpark execution plan, and at transformations versus actions.
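For example, a rough way to time a single transformation in isolation (just a sketch, reusing the column names from the question):

import time

df.count()  # action: forces the plan built so far to execute first

start = time.time()
df2 = df.withColumn('C1001', (df['C269'] * df['C285']) / df['C41'])
df2.count()  # action: triggers only this one extra transformation
print('one withColumn + count took %.1f seconds' % (time.time() - start))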
Without being too specific, and looking at the observations in the first answer, and knowing that execution plans for many DataFrame columns (aka "very wide data") are costly to compute, a move to RDD processing may well be the path to take.
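As a rough illustration only (not a drop-in solution; the formulas list below is a placeholder for the real 2000 definitions), the derived values could be computed in a single pass over the rows so the plan does not grow by one projection per column:

from pyspark.sql import Row

# (new_column, factor_1, factor_2, divisor) -- placeholders for the real formulas
formulas = [('C1001', 'C269', 'C285', 'C41'),
            ('C1002', 'C4', 'C267', 'C146')]  # ... up to C3000

def add_derived(row):
    d = row.asDict()
    for name, a, b, c in formulas:
        d[name] = (d[a] * d[b]) / d[c]
    return Row(**d)

df_new = df.rdd.map(add_derived).toDF()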
Do this as a single chained expression rather than one assignment at a time:
df = df.withColumn('C1001', col1).withColumn('C1002', col2).withColumn('C1003', col3) ......
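An alternative to chaining (my own sketch, not part of the answer above) is to build the new columns as expressions and add them all in one select, which keeps the plan to a single projection:

from pyspark.sql import functions as F

# Placeholder expressions -- in practice, build the full list for C1001..C3000
new_cols = [
    ((F.col('C269') * F.col('C285')) / F.col('C41')).alias('C1001'),
    ((F.col('C4') * F.col('C267')) / F.col('C146')).alias('C1002'),
]
df = df.select('*', *new_cols)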
Related
I work at a place where I use a LOT of Python (Pandas), and the data keeps getting bigger and bigger: last month I was working with a few hundred thousand rows, weeks after that with a few million rows, and now I am working with 42 million rows. Most of my work is to take a dataframe and, for each row, look up its "equivalent" in another dataframe and process the data; sometimes it is just a merge, but more often I need to apply a function to the equivalent data.
Back when I had a few hundred thousand rows it was fine to just use apply and a simple filter, but now it is EXTREMELY SLOW. Recently I switched to vaex, which is way faster than pandas in every aspect except apply, and after some searching I found that apply is a last resort that should be used only if you have no other option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
import pandas as pd

def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
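Not part of the original post, but for this particular lookup pattern a common vectorized alternative is to build the secondary lists once with a groupby and then map them, instead of filtering cnaes for every row (a sketch, assuming the same empresas/cnaes columns as above):

# Build cnpj -> [cnae, ...] once, instead of filtering cnaes per row
secondary_lists = cnaes.groupby('cnpj')['cnae'].apply(list)

mapped = empresas['cnpj'].map(secondary_lists)
empresas['cnae_secundarios'] = [
    [cnae] + (rest if isinstance(rest, list) else [])
    for cnae, rest in zip(empresas['cnae_fiscal'], mapped)
]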
My dataframe has around 1 million records. Below is the code I am using to write each row of a Spark dataframe to a separate file, but it takes hours to complete. Any suggestion to tweak this would be really helpful.
row_count = df.count()  # roughly 1,000,000 rows
df1 = df.repartition(row_count)
df1.rdd.map(lambda row: row[0]).saveAsTextFile(targetfolder)
Writing one row per file will hamper performance, and you should check whether your logic really needs one row in one file.
Still, if you want to do it, you can try the following (not sure how much of a performance gain you will get):
from pyspark.sql import Window
from pyspark.sql import functions as f
win = Window.orderBy('anyColumn')
df2 = df.withColumn('row', f.row_number().over(win))
df2.write.partitionBy('row').parquet('path')
Not at all recommended though.
I have to add multiple columns to a PySpark Dataframe, based on some conditions. Long story short, the code looks like this dumb example:
from pyspark.sql import functions as F
for col in df.columns:
    df = df.withColumn('{}_without_outliers'.format(col), F.lit(1))
The problem is, when I have not so many columns (for example 15 or 20) it performs well, but when I have 100 columns it takes very long for Spark to start the job, and the DAG looks huge. How can I optimize that? Is there any way I can "force" Spark to perform the operations every 10 columns?
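There is no direct setting for "every 10 columns", but one rough sketch of that idea is to break the lineage periodically with checkpoint() so the plan never grows too large (this assumes spark.sparkContext.setCheckpointDir() has been pointed at a writable location, and it does materialize the data each time):

from pyspark.sql import functions as F

BATCH = 10
for i, col in enumerate(df.columns):
    df = df.withColumn('{}_without_outliers'.format(col), F.lit(1))
    if (i + 1) % BATCH == 0:
        df = df.checkpoint()  # materializes the data and truncates the plan

In many cases, building all the new columns as expressions and adding them in a single select is cheaper still, since the plan then contains only one extra projection.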
This is a follow-on query to my previous one: following the suggestion there I got the row-over-row percentage changes, and since the first row in the df_diff dataframe (df) was all null values, I did:
df_diff = df_diff.dropna()
df_diff.count()
The second statement throws the following error:
Py4JJavaError: An error occurred while calling o1844.count.
: java.lang.OutOfMemoryError: Java heap space
When I try the above code on the toy df posted in the previous post it works fine, but with my actual dataframe (834 rows, 51 columns) the above error occurs. Any guidance as to why this is happening and how to handle it would be much appreciated. Thanks
EDIT:
In my actual dataframe (df) of 834 x 51, the first column is the date and the remaining columns are closing stock prices for the 50 stocks for which I'm trying to get the daily percentage changes. Partitioning the window by the date column made no difference to the previous error in this dataframe in PySpark, and there doesn't seem to be any other natural candidate to partition by.
The only thing that sort of worked was to do this in spark-shell. Here without partitions I was getting warning messages ...
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
... until I called cache() on the dataframe, but this is not ideal for a large df.
Your original code is simply not scalable. The following window definition:
w = Window.orderBy("index")
requires shuffling all of the data to a single partition, and as such is useful only for small, local datasets.
Depending on the data, you can try a more complex approach, like the one shown in Avoid performance impact of a single partition mode in Spark window functions.
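For illustration only (column names here are hypothetical): if the wide price table were reshaped to long format with one row per (date, symbol), the percentage-change window could be partitioned by symbol and run in parallel instead of collapsing everything into one partition:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('symbol').orderBy('date')

df_long = df_long.withColumn(
    'pct_change',
    (F.col('price') - F.lag('price').over(w)) / F.lag('price').over(w)
)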
I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seems ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?
You should build a list of the pieces and concatenate them all in one shot at the end.
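A minimal sketch of that pattern (trials and run_trial are placeholders for whatever drives the experiment):

import pandas as pd

trial_results = []
for trial in trials:
    df_trial = run_trial(trial)     # each piece is a one-row DataFrame
    trial_results.append(df_trial)

# One concatenation at the end instead of growing the DataFrame in the loop
df = pd.concat(trial_results, ignore_index=True)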