I have to add multiple columns to a PySpark DataFrame based on some conditions. Long story short, the code looks like this dumb example:
for col in df.columns:
    df = df.withColumn('{}_without_outliers'.format(col), F.lit(1))
The problem is, when I don't have many columns (for example 15 or 20) it performs well, but when I have 100 columns Spark takes very long to start the job, and the DAG looks huge. How can I optimize that? Is there any way I can "force" Spark to perform the operations every 10 columns?
I have two dataframes and a piece of code that extracts some data from one of the dataframes and adds it to the other:
import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name='sales', header=0)
born = pd.read_excel("data.xlsx", sheet_name='born', header=0)

bornuni = born.number.unique()
for babies in bornuni:
    dataframe = born[born["number"] == babies]
    for i, r in sales.iterrows():
        if r["number"] == babies:
            sales.loc[i, 'ini_weight'] = dataframe["weight"].iloc[0]
            sales.loc[i, 'ini_date'] = dataframe["date of birth"].iloc[0]
This is pretty inefficient with bigger data sets, so I want to parallelize this code, but I don't have a clue how to do it. Any help would be great. Here is a link to a mock dataset.
So before worrying about parallelizing, I can't help but notice that you're using lots of for loops to deal with the dataframes. Dataframes are pretty fast when you use their vectorized capabilities.
I see a lot of inefficient use of pandas here, so maybe we first fix that and then worry about throwing more CPU cores at it.
It seems to me you want to accomplish the following:
For each unique baby id number in the born dataframe, you want to update the ini_weight and ini_date fields of the corresponding entry in the sales dataframe.
There's a good chance that you can use some dataframe merging / joining to help you with that, as well as using the pivot table functionality:
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
I strongly suggest you take a look at those, try the ideas from these articles, and then reframe your question in terms of those operations, because, as you correctly noticed, repeatedly looping over all the rows to find a matching index is very inefficient.
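For instance, a single left merge can replace both loops entirely. A minimal sketch with made-up values but the column names from the question:

```python
import pandas as pd

# Mock stand-ins for the sales / born sheets from the question.
sales = pd.DataFrame({"number": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
born = pd.DataFrame({"number": [1, 2],
                     "weight": [3.2, 2.9],
                     "date of birth": ["2020-01-01", "2020-02-01"]})

# One row per baby number, renamed to the target column names,
# then a single merge replaces the nested for-loops.
first_born = (born.drop_duplicates("number")
                  .rename(columns={"weight": "ini_weight",
                                   "date of birth": "ini_date"}))
sales = sales.merge(first_born[["number", "ini_weight", "ini_date"]],
                    on="number", how="left")
```

The merge matches every sales row to its baby in one vectorized pass; rows with no match simply get NaN in the new columns.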
So, I work at a place where I use Python (pandas) A LOT, and the data keeps getting bigger and bigger: last month I was working with a few hundred thousand rows, weeks after that with a few million rows, and now with 42 million rows. Most of my work is taking a dataframe and, for each row, consulting its "equivalent" in another dataframe and processing the data; sometimes it's just a merge, but more often I need to run a function on the equivalent data. Back in the days of a few hundred thousand rows, it was OK to just use apply and a simple filter, but now it is EXTREMELY SLOW. Recently I switched to vaex, which is way faster than pandas in every aspect except apply, and after some searching I found that apply is a last resort, to be used only if you have no other option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
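One common replacement for a row-wise apply that filters another dataframe on every row is to aggregate the lookup table once and merge it in. A sketch with the column names from the snippet (the data is made up):

```python
import pandas as pd

empresas = pd.DataFrame({"cnpj": [1, 2], "cnae_fiscal": [10, 20]})
cnaes = pd.DataFrame({"cnpj": [1, 1, 2], "cnae": [11, 12, 21]})

# Aggregate the secondary codes once, instead of filtering cnaes
# inside apply() for every single row of empresas.
secondary = (cnaes.groupby("cnpj")["cnae"].agg(list)
                  .rename("secundarios").reset_index())
empresas = empresas.merge(secondary, on="cnpj", how="left")
empresas["cnae_secundarios"] = [
    [fiscal] + (sec if isinstance(sec, list) else [])
    for fiscal, sec in zip(empresas["cnae_fiscal"], empresas["secundarios"])
]
```

The groupby scans cnaes once instead of once per empresas row, which is the difference between O(n+m) and O(n*m) work.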
I have a pyspark dataframe containing 1000 columns and 10,000 records (rows).
I need to create 2000 more columns, by performing some computation on the existing columns.
df #pyspark dataframe contaning 1000 columns and 10,000 records
df = df.withColumn('C1001', ((df['C269'] * df['C285'])/df['C41'])) #existing column names range from C1 to C1000
df = df.withColumn('C1002', ((df['C4'] * df['C267'])/df['C146']))
df = df.withColumn('C1003', ((df['C87'] * df['C134'])/df['C238']))
.
.
.
df = df.withColumn('C3000', ((df['C365'] * df['C235'])/df['C321']))
The issue is, this takes way too long, around 45 minutes or so.
Since I am a newbie, I was wondering what I am doing wrong?
P.S.: I am running spark on databricks, with 1 driver and 1 worker node, both having 16GB Memory and 8 Cores.
Thanks!
A lot of what you're doing is just creating an execution plan. Spark executes lazily until an action triggers it, so the 45 minutes you're seeing is probably spent executing all the transformations you've been setting up.
If you want to see how long a single withColumn takes, trigger an action like df.count() first, then do a single withColumn followed by another df.count() (to trigger an action again).
Take a look more into pyspark execution plan, transformations and actions.
Without being too specific, and looking at the observations in the first answer, and knowing that execution plans for many DataFrame columns (aka "very wide data") are costly to compute, a move to RDD processing may well be the path to take.
Do this in a single chained expression rather than one assignment at a time:
df = df.withColumn('C1001', col1).withColumn('C1002', col2).withColumn('C1003', col3) ......
I am trying to loop through multiple Excel files in pandas. The structure of the files is very similar: the first 10 columns form a key, and the rest of the columns hold the values. I want to group by the first 10 columns and sum the rest.
I have searched and found solutions online for similar cases, but my problem is that:
I have a large number of columns with values (to be aggregated as sum), and
the number/names of the value columns differ for each file (dataframe).
The key columns are the same across all the files.
I can't share the actual data sample but here is the format sample of the file structure
and here is the desired output from the above data
It is like a groupby operation, but the uncertain, large number of columns and the varying header names make it difficult to use groupby or pivot. Can anyone suggest the best possible solution for this in Python?
Edited:
df.groupby(list(df.columns[:11])).agg(sum)
is working, but for some reason it takes 25-30 minutes; MS Access does the same thing in 1-2 minutes. Am I doing something wrong here, or is there another way to do it in Python itself?
Just use df.columns, which holds the list of column names; you can then slice that list to get the 10 leftmost columns.
This should work:
df.groupby(df.columns[:10].to_list()).sum()
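Putting it together for one file (the data below is a toy stand-in with a 2-column key; for the real files the slice just becomes [:10]):

```python
import pandas as pd

# Toy frame standing in for one Excel sheet: 2 key columns + value columns.
df = pd.DataFrame({"k1": ["a", "a", "b"],
                   "k2": ["x", "x", "y"],
                   "v1": [1, 2, 3],
                   "v2": [10, 20, 5]})

# Group on the leftmost key columns and sum every remaining column,
# regardless of how many value columns this particular file has.
out = df.groupby(df.columns[:2].to_list(), as_index=False).sum()
```

Because the slice is positional, the same line works for every file even though the value columns differ in number and name.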
I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seems ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?
You should build a list of the pieces and concatenate them all in one shot at the end.
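A sketch of that pattern (the per-trial function and its return value are made up):

```python
import pandas as pd

def run_trial(i):
    # Stand-in for the function that executes one trial.
    return {"trial": i, "result": i * 2}

# Accumulate the per-trial pieces in a plain list (cheap to append to),
# then concatenate once at the end. Appending to a DataFrame inside the
# loop copies all existing rows every iteration, which is what makes
# df.append / pd.concat per trial so slow.
pieces = [pd.DataFrame([run_trial(i)]) for i in range(5)]
df = pd.concat(pieces, ignore_index=True)
```

This turns the quadratic copy-on-every-append cost into a single linear concatenation.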