Using the melt function with groupby for large data sets in Python

I have a data frame with 1,782,568 distinct groups.
When I melt that data at the grouping level, my kernel hangs.
So I decided to melt the data group by group and then combine all of the results sequentially.
For that I wrote the following function.
def split(df, key):
    df2 = pd.DataFrame()
    for i in range(df[key].drop_duplicates().shape[0]):
        grp_key = tuple(df[key].drop_duplicates().iloc[i, :])
        df1 = (df.groupby(key, as_index=False)
                 .get_group(grp_key)
                 .reset_index()
                 .drop('index', axis=1))
        df2 = df2.append(df1.groupby(key, as_index=False)
                            .apply(pd.melt, id_vars=key)
                            .reset_index()).dropna()
    df2 = df2.drop(grep('level', df2.columns), axis=1)
    return df2
Here grep is a user-defined function of mine; it works like the grep function in R. For df I pass the data frame, and for key I pass the grouping keys as a list.
But this function also takes a very long time to complete.
Can anyone help me improve its performance?
Thanks in advance.
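A rough sketch of the same idea, assuming pandas is imported as pd: the repeated DataFrame.append inside the loop grows quadratically, so collecting the per-group melts in a list and concatenating once at the end is usually much faster (and groupby already yields each group, so the per-group get_group lookups are unnecessary).

def split_fast(df, key):
    pieces = []
    for _, group in df.groupby(key):
        # Melt each group separately, keeping the grouping keys as id columns.
        pieces.append(pd.melt(group, id_vars=key))
    # Concatenate once at the end; dropna mirrors the original behaviour.
    return pd.concat(pieces, ignore_index=True).dropna()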

Related

How to run the same function multiple times simultaneously?

I have a function that takes a dataframe as input and returns a dataframe, like:
def process(df):
    # <all the code for processing>
    return df

# input df has 250K rows and 30 columns
# saving it in a variable
result = process(df)
# transform input df into 10,000K rows and over 50 columns
It does a lot of processing and therefore takes a long time to return the output. I am using a Jupyter notebook.
I have come up with a new approach that filters the original dataframe into 5 chunks of unequal size (between 30K and 100K rows), based on a category filter on a column of the original df, passes each chunk separately as process(df1), process(df2), etc., saves the outputs as result1, result2, etc., and then merges the results together into one single final dataframe.
But I want them to run simultaneously and have the results combined automatically: code that runs the 5 process calls together and, once all are completed, joins them into the same "result" as before, with a lot of run time saved.
Even better would be splitting the original dataframe into equal parts and running each part through process(df) simultaneously, e.g. splitting those 250K rows randomly into 5 dataframes of 50K rows each, sending them to process(df) five times in parallel, and getting the same final output I would get right now without any of this optimization.
I have been reading a lot about multi-threading and found some useful answers on Stack Overflow, but I wasn't able to really get it to work. I am very new to the concept of multi-threading.
You can use the multiprocessing library for this, which allows you to run a function on different CPU cores. The following is an example:
from multiprocessing import Pool

def f(df):
    # Process dataframe
    return df

if __name__ == '__main__':
    dataframes = [df1, df2, df3]
    with Pool(len(dataframes)) as p:
        processed_dfs = p.map(f, dataframes)
    print(processed_dfs)
    # You would join them here
You should also check out Dask (https://dask.org/), since it seems like you mostly have operations on dataframes. A big advantage is that you won't have to worry about the details of manually splitting your dataframe.
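A rough sketch of what that could look like, reusing the process function and df from the question (the partition count is illustrative, and depending on what process returns you may need to pass a meta argument to map_partitions):

import dask.dataframe as dd

# Split the pandas dataframe into partitions that Dask can schedule in parallel.
ddf = dd.from_pandas(df, npartitions=5)

# Apply the existing per-chunk function to every partition, then gather a pandas result.
result = ddf.map_partitions(process).compute()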

Issue in renaming multiple aggregation outcome columns in pandas (Python)

I have a question regarding multiple aggregation in pandas.
Originally I have a dataset which shows the oil price; the details are as follows:
And the head of the dataset is as follows:
What I want to do here is to get the mean and the standard deviation for each quarter of the year 2014. The ideal output is as follows:
In my script, I have already created the quarter column.
However, there is one thing that I do not understand here.
If I try to use this command:
brent[brent.index.year == 2014].groupby('quarter').agg({"average_price": np.mean, "std_price": np.std})
I get an error as follows:
But if I use the following script, then it works:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price','std'))
So the questions are:
What's wrong with the first approach here?
And why do we need to use the second approach here?
Thank you all for the help in advance!
What's wrong with the first approach here?
A dict is passed, so pandas looks for columns named after its keys, average_price and std_price; because those columns do not exist in the DataFrame, an error is raised.
One possible solution is to select the column after groupby and pass a list of tuples that specifies the new column names together with the aggregate functions:
brent[brent.index.year == 2014].groupby('quarter')['Price'].agg([('average_price','mean'),('std_price',np.std)])
This works because, for the single column Price, multiple output column names can be defined.
In later pandas versions, named aggregation is used instead:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
                                                        std_price=('Price',np.std))
The logic is that each aggregation defines a new column name together with the column to aggregate and the aggregate function, so it is also possible to aggregate multiple columns with different functions:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std),
sumQ=('quarter','sum'))
Notice that np.std defaults to ddof=0 while the pandas std defaults to ddof=1, so the outputs differ.
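A quick illustration of that ddof difference (the numbers here are made up for the example):

import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]
print(np.std(values))            # population std (ddof=0) -> 1.118...
print(pd.Series(values).std())   # sample std (ddof=1) -> 1.290...
print(np.std(values, ddof=1))    # matches the pandas default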

Pandas: Preferred order of methods on a part of a table

I have started learning the pandas module with the "Data School" Q&A series, and in his "How do I handle missing values in pandas?" video he wrote the following line of code:
ufo.isna().tail()
If I am not wrong, the following line would be more efficient:
ufo.tail().isna()
My question is not only about this case: in general, does the order of methods applied to a part of a table matter? And if so, when exactly?
In my opinion, the logic here should be:
first filter to reduce the number of rows, then apply the method only to the filtered data,
not:
first apply the method to all the data and then filter.
So for better performance use the first form, filter and then apply the method; here missing values are tested only for the last 5 rows:
ufo.tail().isna()
But here all values are tested and only then are the last 5 rows selected, so with 10M rows the performance is much worse:
ufo.isna().tail()
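A rough way to check this yourself; the DataFrame below is synthetic, not the ufo dataset from the video:

import numpy as np
import pandas as pd
import timeit

# A large synthetic frame standing in for the real data.
big = pd.DataFrame(np.random.rand(1_000_000, 3), columns=['a', 'b', 'c'])

print(timeit.timeit(lambda: big.tail().isna(), number=100))  # tests only the last 5 rows
print(timeit.timeit(lambda: big.isna().tail(), number=100))  # builds a 1M-row boolean frame first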

What is the best way to modify (e.g., perform math functions on) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds another column, using the index number as a string-type column name. Is there something akin to the Pandas method of, say, df.iloc[-1,:] = df.iloc[-1,:]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
First question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
Second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a Dask dataframe that is stored in many pieces, but methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
In the future we recommend asking separate questions separately on Stack Overflow.
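For the second question, a minimal sketch of the .map_partitions approach; the sample data and the negate_positives helper are illustrative, not from the answer:

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3], 'value': [0.5, -1.0, 2.5]})
ddf = dd.from_pandas(pdf, npartitions=2)

def negate_positives(part):
    # Runs on each pandas partition; flips the sign of positive values in the last column.
    col = part.columns[-1]
    part[col] = part[col].where(part[col] <= 0, -part[col])
    return part

ddf = ddf.map_partitions(negate_positives)
print(ddf.compute())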

Running update of Pandas dataframe

I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seems ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?
You should build a list of the pieces and concatenate them all in one shot at the end.
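A minimal sketch of that pattern; run_trial and n_trials are stand-ins for whatever produces each trial's data, not names from the original post:

import pandas as pd

pieces = []
for trial in range(n_trials):
    df_trial = run_trial(trial)   # hypothetical: returns a one-row DataFrame per trial
    pieces.append(df_trial)

# Build the master dataframe in one shot instead of appending row by row.
df = pd.concat(pieces, ignore_index=True)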
