For a current project, I am planning to winsorize a Pandas DataFrame that consists of two columns, df['Policies'] and df['ProCon']. This means that the outliers at the high and the low end of the set shall be clipped.
The winsorizing shall be conducted at 0.05 and 0.95 based on the values in the df['ProCon'] column, and both columns shall be cut in case an outlier is identified there.
However, the code below does not accept the direct reference to the 'ProCon' column in the line def winsorize_series(df['ProCon']):, and raises an invalid syntax error.
Is there any smart way to indicate that ProCon shall be the determining column for the winsorizing?
import pandas as pd
from scipy.stats import mstats
# Loading the file
df = pd.read_csv("3d201602.csv")
# Winsorizing
def winsorize_series(df['ProCon']):
    return mstats.winsorize(df['ProCon'], limits=[0.05,0.95])
# Defining the winsorized DataFrame
df = df.transform(winsorize_series)
Have you tried separating the column name from the table?
def winsorize_series(df, column):
    return mstats.winsorize(df[column], limits=[0.05,0.95])
Can't test it though if there's no sample data.
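Calling it would then look roughly like this (just a sketch, untested without your sample data):
df['ProCon'] = winsorize_series(df, 'ProCon')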
As per the comments, .transform is not the right choice to modify only one or a few selected columns of df. Whatever the function definition and arguments passed, transform will iterate over and pass EVERY column to func and try to broadcast the joined result to the original shape of df.
What you need is much simpler:
limits = [0.05,0.95] # keep limits static for any calls you make
colname = 'ProCon' # you could even have a list of columns and loop... for colname in cols
df[colname] = mstats.winsorize(df[colname], limits=limits)
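If, as the comment above hints, several columns should be treated the same way, a small loop over a list of names does it; the list here simply reuses the two columns from the question, adjust as needed:
for colname in ['ProCon', 'Policies']:   # any list of columns you want winsorized
    df[colname] = mstats.winsorize(df[colname], limits=limits)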
df.transform(func) can be passed *args and **kwargs which will be passed to func, as in
df = df.transform(mstats.winsorize, axis=0, a=df['ProCon'], limits=[0.05,0.95])
So there is no need for
def winsorize_series...
I have a data frame and I want to add a column whose rows are randomly filled with values from a function result, like this:
from random import randint
from pyspark.sql.functions import lit

def getRandomString():
    return "woteva" + str(randint(0,100))
df = df.withColumn("MyNewColumn", lit(getRandomString()))
In the result I am getting a random value, but that first random output is repeated for all rows.
How can I get a new result per row?
lit creates a column of literal (constant) value. This means when your code is executed the function getRandomString() is called once and the return value is used to create a column with a constant value.
To execute getRandomString() once per row, you can turn getRandomString() into a UDF. UDFs will be called by Spark once per row.
By default, UDFs are considered to be deterministic. If this is not the case, the UDF must be marked as nondeterministic:
import random
from pyspark.sql import functions as F
from pyspark.sql import types as T
randomstringudf = F.udf(lambda: "woteva" + str(random.randint(0,100)),
T.StringType()).asNondeterministic()
df.withColumn("MyNewColumn", randomstringudf()).show()
I have code that works quite extensively with an Excel file (an SAP download), performing data transformation and calculation steps.
I need to loop through all the lines (a couple thousand rows) a few times. I had previously written code that added the DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick; however, I had to change the data source, which meant a change in the raw data structure.
In the raw data the first 3 rows are empty, then a title row with the column names follows, then 2 more empty rows, and the 1st column is also empty. I decided to wipe these, assign column names and make them headers (steps below); however, since then, separately adding column names and later calculating everything in one for statement does not fill data into any of these specific columns.
How could I optimize this code?
I have deleted some calculation steps since they are quite long and would make the code even less readable.
#This function adds new column to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = ''   #previously used dfConverter[i] = NaN
#This function creates dataframe from excel file
def DataFrameCreator(path,sheetname):
    excelFile = pd.ExcelFile(path)
    global readExcel
    readExcel = pd.read_excel(excelFile,sheet_name=sheetname)
#calling my function to create dataframe
DataFrameCreator(filePath,sheetName)
dfConverter = pd.DataFrame(readExcel)
#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)
#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis = 1,inplace = True)
#renaming columns from Unnamed 1: etc to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  # etc.
#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
#example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
    #Day column-> floor Entry Date -1, if time is less than 5:00:00
    if(dfConverter['Time'][i] <= time(hour=5,minute=0,second=0)):
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])-timedelta(days=1)
    else:
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
The problem is that there are many columns that build on one another, so I cannot compute them in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue so that I can calculate requirementsValue, so that I can calculate otherReqsValue, but I'm not able to do this within one for loop by assigning the values to the DataFrame column's [i] row, because the value will just be missing, as if nothing had happened.
(dfSorted is the same as dfConverter, just a sorted version of it.)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
    reqsWoSetUpValue[i] = #calculationsteps...
#inserting column with value
dfSorted.insert(49,'Reqs wo SetUp',reqsWoSetUpValue)
#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
    requirementsValue[i] = #calc
dfSorted.insert(50,'Requirements',requirementsValue)
#Calculating Other Reqs value with previously calculated Requirements column.
for i in range(len(dfSorted)):
    otherReqsValue[i] = #calc
dfSorted.insert(51,'Other Reqs',otherReqsValue)
Does anyone have a clue why I can no longer do this in one for loop by first adding all the columns via the function, like:
NewColdfConverter('Reqs wo setup','Requirements','Other reqs')
#then in 1 for loop:
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'] = #calculationsteps
    dfSorted['Requirements'] = #calculationsteps
    dfSorted['Other reqs'] = #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are the most time-consuming.
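If you are not working in an IDE with a built-in profiler, the standard library's cProfile gives similar information; a minimal sketch, where process_dataframe is just a placeholder for your own transformation code:
import cProfile
def process_dataframe():
    pass   # placeholder for the data transformation and calculation steps
cProfile.run('process_dataframe()', sort='cumtime')   # call statistics sorted by cumulative time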
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once
# slow (don't do this):
for i in range(len(dfConverter)):
    dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
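Putting the two branches of the original loop together, the whole Day column can be built without iterating at all; a sketch assuming time and timedelta are imported from datetime as in your code:
entry_dates = pd.to_datetime(dfConverter['Entry Date'])
early = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
# keep the entry date as-is, except subtract one day where the time is 05:00:00 or earlier
dfConverter['Day'] = entry_dates.where(~early, entry_dates - timedelta(days=1))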
Not sure this would improve performance, but you could calculate the dependent columns at the same time row by row with DataFrame.iterrows()
for index, data in dfSorted.iterrows():
    dfSorted['Reqs wo setup'][index] = #calculationsteps
    dfSorted['Requirements'][index] = #calculationsteps
    dfSorted['Other reqs'][index] = #calculationsteps
Alright, I could use a little help on this one. I've created a function that I can feed two multi-index dataframes into, as well as a set of kwargs, and the function will then take values from one dataframe and add them to the other in a new column.
Just to try to make sure that I'm explaining it well enough: the two dataframes are both stock info, where one dataframe is my "universe", or the stocks that I'm analyzing, and the other is a dataframe of market and sector ETFs.
So what my function does is takes kwargs in the form of:
new_stock_column_name = "existing_sector_column_name"
Here is my actual function:
def map_columns(hist_multi_to, hist_multi_from, **kwargs):
    ''' Map columns from the sector multi index dataframe to a new column
    in the existing universe multi index dataframe.
    **kwargs should be in the format newcolumn="existing_sector_column"
    '''
    df_to = hist_multi_to.copy()
    df_from = hist_multi_from.copy()
    for key, value in kwargs.items():
        df_to[key] = np.nan
        for index, val in df_to.iterrows():
            try:
                df_to.loc[index, key] = df_from.loc[(index[0],val.xl_sect),value]
            except KeyError:
                pass
    return df_to
I believe my function works exactly as I intend, except it takes a ridiculously long time to loop through the data. There has got to be a better way to do this, so any help you could provide would be greatly appreciated.
I apologize in advance, but I'm having trouble coming up with two simple example dataframes; the only real requirement is that the stock dataframe has a column that lists its sector ETF, and that that column's value directly corresponds to the level-1 index of the ETF dataframe.
The exception handler is in place simply because the ETFs sometimes do not exist for all the dates of the analysis, in which case I don't mind that the values stay as NaN.
Update:
Here is a revised code snippet that will allow you to run the function to see what I'm talking about. Forgive me, my coding skills are limited.
import pandas as pd
import numpy as np
stock_arrays = [np.array(['1/1/2020','1/1/2020','1/2/2020','1/2/2020']),
np.array(['aapl', 'amzn', 'aapl', 'amzn'])]
stock_tuples = list(zip(*stock_arrays))
stock_index = pd.MultiIndex.from_tuples(stock_tuples, names=['date', 'stock'])
etf_arrays = [np.array(['1/1/2020','1/1/2020','1/2/2020','1/2/2020']),
np.array(['xly', 'xlk','xly', 'xlk'])]
etf_tuples = list(zip(*etf_arrays))
etf_index = pd.MultiIndex.from_tuples(etf_tuples, names=['date', 'stock'])
stock_df = pd.DataFrame(np.random.randn(4), index=stock_index, columns=['close_price'])
etf_df = pd.DataFrame(np.random.randn(4), index=etf_index, columns=['close_price'])
stock_df['xl_sect'] = np.array(['xlk', 'xly','xlk', 'xly'])
def map_columns(hist_multi_to, hist_multi_from, **kwargs):
    ''' Map columns from the sector multi index dataframe to a new column
    in the existing universe multi index dataframe.
    **kwargs should be in the format newcolumn="existing_sector_column"
    '''
    df_to = hist_multi_to.copy()
    df_from = hist_multi_from.copy()
    for key, value in kwargs.items():
        df_to[key] = np.nan
        for index, val in df_to.iterrows():
            try:
                df_to.loc[index, key] = df_from.loc[(index[0],val.xl_sect),value]
            except KeyError:
                pass
    return df_to
Now after running the above in a cell, you can access the function by calling it like this:
new_stock_df = map_columns(stock_df, etf_df, sect_etf_close='close_price')
new_stock_df
I hope this is more helpful. And as you can see, the function works, but with really large datasets it's extremely slow and inefficient.
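Not an authoritative answer, but one way the same lookup could be vectorized (sketched against the example data above, assuming the xl_sect column and the (date, stock) index layout shown) is to build the (date, sector) keys once and reindex the ETF frame; reindexing keeps the NaN-on-missing behaviour of the original try/except:
def map_columns_fast(hist_multi_to, hist_multi_from, **kwargs):
    df_to = hist_multi_to.copy()
    # (date, sector) key for every row of the universe frame
    keys = pd.MultiIndex.from_arrays(
        [df_to.index.get_level_values('date'), df_to['xl_sect']])
    for key, value in kwargs.items():
        # reindex leaves NaN where the ETF row does not exist for that date
        df_to[key] = hist_multi_from[value].reindex(keys).to_numpy()
    return df_to

new_stock_df = map_columns_fast(stock_df, etf_df, sect_etf_close='close_price')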
Aim: To speed up applying a function row-wise across a large data frame (~1.9 million rows)
Attempt: Using dask map_partitions where partitions == number of cores. I've written a function which is applied to each row and creates a dict containing a variable number of new values (between 1 and 55). This function works fine standalone.
Problem: I need a way to combine the output of each function call into a final dataframe. I tried using df.append, where I'd append each dict to a new dataframe and return this dataframe. If I understand the Dask docs, Dask should then combine them into one big DF. Unfortunately this line is tripping an error (ValueError: could not broadcast input array from shape (56) into shape (1)), which leads me to believe it's something to do with the combine step in Dask?
#Function to be applied row-wise down the dataframe. Takes a column (post) and a new, empty df.
def func(post,New_DF):
    post = str(post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    New_DF = New_DF.append(scores, ignore_index=True)
    return(New_DF)
#Dask
dd.from_pandas(dataset,npartitions=nCores).\
map_partitions(
lambda df : df.apply(
lambda x : func(x.post,New_DF),axis=1)).\
compute(get=get)
I am not quite sure I completely understand your code in the absence of an MCVE, but I think there is a bit of a misunderstanding here.
In this piece of code you take a row and a DataFrame and append one row to that DataFrame.
#Function to be applied row-wise down the dataframe. Takes a column (post) and a new, empty df.
def func(post,New_DF):
    post = str(post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    New_DF = New_DF.append(scores, ignore_index=True)
    return(New_DF)
Instead of appending to New_DF, I would recommend just returning a pd.Series which df.apply concatenates into a DataFrame. That is because if you are appending to the same New_DF object in all nCores partitions, you are bound to run into trouble.
#Function to be applied row-wise down the dataframe. Takes a row and returns a row.
def tobsecret_func(row):
    post = str(row.post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    length_adjusted_series = pd.Series(scores).reindex(range(55))
    return(length_adjusted_series)
Your error also suggests that as you wrote in your question, your function creates a variable number of values. If the pd.Series you return doesn't have the same shape and column names, then df.apply will fail to concatenate them into a pd.DataFrame. Therefore make sure you return a pd.Series of equal shape each time. This question shows you how to create pd.Series of equal length and index: Pandas: pad series on top or bottom
I don't know what kind of dict your OtherFUNC.countWords returns exactly, so you may want to adjust the line:
length_adjusted_series = pd.Series(scores).reindex(range(55))
As is, the line would return a Series with an index 0, 1, 2, ..., 54 and up to 55 values (if the dict originally had fewer than 55 keys, the remaining cells will contain NaN values).
This means after applied to a DataFrame, the columns of that DataFrame would be named 0, 1, 2, ..., 54.
Now you take your dataset and map your function to each partition and in each partition you apply it to the DataFrame using apply.
#Dask
dd.from_pandas(dataset,npartitions=nCores).\
map_partitions(
lambda df : df.apply(
lambda x : func(x.post,New_DF),axis=1)).\
compute(get=get)
map_partitions expects a function which takes as input a DataFrame and outputs a DataFrame. Your function is doing this by using a lambda function that basically calls your other function and applies it to a DataFrame, which in turn returns a DataFrame. This works, but I highly recommend writing a named function which takes as input a DataFrame and outputs a DataFrame, as it makes it easier for you to debug your code.
For example with a simple wrapper function like this:
def df_wise(df):
    return df.apply(tobsecret_func, axis=1)
Especially as your code gets more complex, abstaining from using lambda functions that call non-trivial code like your custom func and instead making a simple named function can help you debug because the traceback will not just lead you to a line with a bunch of lambda functions like in your code but will also directly point to the named function df_wise, so you will see exactly where the error is coming from.
#Dask
dd.from_pandas(dataset,npartitions=nCores).\
   map_partitions(df_wise,
                  meta=df_wise(dataset.head())
                  ).\
   compute(get=get)
Notice that we just fed dataset.head() to df_wise to create our meta keyword, which is similar to what Dask would do under the hood.
You are using dask.get, the synchronous scheduler, which is why the whole New_DF.append(...) code could work, since you append to the DataFrame for each consecutive partition.
This does not give you any parallelism and thus will not work if you use one of the other schedulers, all of which parallelise your code.
The documentation also mentions the meta keyword argument, which you should supply to your map_partitions call, so dask knows what columns your DataFrame will have. If you don't do this, dask will first have to do a trial run of your function on one of the partitions and check what the shape of the output is before it can go ahead and do the other partitions. This can slow down your code by a ton if your partitions are large; giving the meta keyword bypasses this unnecessary computation for dask.
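If calling df_wise on a sample frame is inconvenient, meta can also be spelled out by hand as an empty frame that only describes the expected output; a rough sketch, where the 0..54 column names and the float dtype are assumptions based on the reindex(range(55)) above:
# empty frame that only describes the expected output structure
meta = pd.DataFrame(columns=range(55), dtype=float)
dd.from_pandas(dataset,npartitions=nCores).\
   map_partitions(df_wise, meta=meta).\
   compute(get=get)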
I want to group data based on a different dataframe's cuts.
So for instance I cut from a frame:
my_fcuts = pd.qcut(frame1['prices'],5)
pd.groupby(frame2, my_fcuts)
Since the lengths must be same, the above statement will fail.
I know I can easily write a mapper function, but what if the case was
my_fcuts = pd.qcut(frame1['prices'],20)
or some higher number? Surely there must be some built-in statement in pandas to do this very simple thing. groupby should be able to accept "cuts" from different data and reclassify.
Any ideas?
Thanks, I figured out the answer myself:
volgroups = np.digitize(btest['vol_proxy'],np.linspace(min(data['vol_proxy']), max(data['vol_proxy']), 10))
trendgroups = np.digitize(btest['trend_proxy'],np.linspace(min(data['trend_proxy']), max(data['trend_proxy']), 10))
#btest.groupby([volgroups,trendgroups]).mean()['pnl'].plot(kind='bar')
#plt.show()
df = btest.groupby([volgroups,trendgroups]).groups
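For completeness, the same idea can be written with pandas alone by asking qcut for its bin edges and reusing them on the other frame; a sketch using the names from the question, assuming frame2 has a comparable prices column:
# bin edges derived from frame1, then applied to frame2's own values
_, bin_edges = pd.qcut(frame1['prices'], 5, retbins=True)
grouped = frame2.groupby(pd.cut(frame2['prices'], bins=bin_edges)).mean()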