Conditional sum in Python between multiple columns

I have the following script, from a larger analysis of securities data:
returns_columns = []
df_merged[ticker + '_returns'] = df_merged[ticker + '_close'].pct_change(periods=1)
returns_columns.append(ticker + '_returns')
df_merged['applicable_returns_sum'] = (df_merged[returns_columns] > df_merged['return_threshold']).sum(axis=1)
'return_threshold' is a complete series of float numbers.
I've been able to successfully sum each row in the returns_columns array, but cannot figure out how to conditionally sum only the numbers in the returns_columns that are greater than the 'return_threshold' in that row.
This seems like a problem similar to the one shown here, Python Pandas counting and summing specific conditions, but I'm trying to sum based on the changing condition in the returns_columns.
Any help would be much appreciated, thanks as always!
EDIT: ANOTHER APPROACH
This is another approach I tried. The script below has an error associated with the ticker input, even though I think the argument is necessary, and it produces an error:
def compute_applicable_returns(row, ticker):
    if row[ticker + '_returns'] >= row['top_return']:
        return row[ticker + '_returns']
    else:
        return 0

df_merged['applicable_top_returns'] = df_merged[returns_columns].apply(compute_applicable_returns, axis=1)

The [] operator for a dataframe should allow you to filter by an expression df > threshold and return a dataframe. You can then call .sum() on this df.
df[df > threshold].sum()
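For the row-varying threshold in the question, here is a hedged vectorized sketch; it assumes df_merged, returns_columns and the 'return_threshold' column exist exactly as described above:
# Compare every returns column against that row's threshold (gt with axis=0
# broadcasts the threshold Series down the rows), keep only the qualifying
# values with where(), and sum what is left across each row.
mask = df_merged[returns_columns].gt(df_merged['return_threshold'], axis=0)
df_merged['applicable_returns_sum'] = df_merged[returns_columns].where(mask).sum(axis=1)
This avoids the row-wise apply entirely, which matters once df_merged gets large.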

The question was ultimately answered like this:
def compute_applicable_returns(row, ticker):
    if row[ticker + '_returns'] >= row['return_threshold']:
        return row[ticker + '_returns']
    else:
        return 0

for ticker in tickers:
    df_merged[ticker + '_applicable_returns'] = df_merged.apply(compute_applicable_returns, args=(ticker,), axis=1)

Related

Optimization of this Python function

def split_trajectories(df):
    trajectories_list = []
    count = 0
    for record in range(len(df)):
        if record == 0:
            continue
        if df['time'].iloc[record] - df['time'].iloc[record - 1] > pd.Timedelta('0 days 00:00:30'):
            temp_df = df[count:record].reset_index()
            if not temp_df.empty:
                if len(temp_df) > 50:
                    trajectories_list.append(temp_df)
            count = record
    return trajectories_list
This is a Python function that receives a pandas dataframe and divides it into a list of dataframes wherever the time delta is greater than 30 seconds, keeping a dataframe only if it contains more than 50 records. In my case I need to execute this function thousands of times and I wonder if anyone can help me optimize it. Thanks in advance!
I tried to optimize it as far as I could.
You're doing a few things right here, like iterating using a range instead of .iterrows and using .iloc.
Simple things you can do:
Switch .iloc to .iat since you only need 'time' anyway
Don't recalculate Timedelta every time, just save it as a variable
The big thing is that you don't need to actually save each temp_df, or even create it. You can save a tuple (count, record) and retrieve from df afterwards, as needed. (Although, I have to admit I don't quite understand what you're accomplishing with the count:record logic; should one of your .iloc's involve count instead?)
def split_trajectories(df):
    trajectories_list = []
    td = pd.Timedelta('0 days 00:00:30')
    count = 0
    for record in range(1, len(df)):
        if df['time'].iat[record] - df['time'].iat[record - 1] > td:
            if record - count > 50:
                new_tuple = (count, record)
                trajectories_list.append(new_tuple)
            count = record
    return trajectories_list
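If it helps, here is a hedged, fully vectorized sketch of the same idea using diff + cumsum to label trajectories; the gap and min_len parameter names are illustrative (not from the original post), and it assumes 'time' is a sorted datetime64 column:
import pandas as pd

def split_trajectories_vectorized(df, gap='30s', min_len=50):
    # a new trajectory id starts wherever the time gap exceeds the threshold
    traj_id = (df['time'].diff() > pd.Timedelta(gap)).cumsum()
    # keep only trajectories with more than min_len rows
    return [grp for _, grp in df.groupby(traj_id) if len(grp) > min_len]
This trades the explicit Python loop for one diff, one cumsum and one groupby, which is usually the bigger win when the function is called thousands of times.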
Maybe you can try the one below. You could use count if you want to keep track of the number of dataframes, or else the length of trajectories_list will give you that info.
def split_trajectories(df):
    trajectories_list = []
    df_difference = df['time'].diff()
    if not df_difference.empty and df_difference.shape[0] > 50:
        trajectories_list.append(df_difference)
    return trajectories_list

Dividing each column in a pandas df by a value from another df

I have a dataframe of size (44, 44) and another one of size (44,).
I need to divide each item in a column 'EOFx' by a number in a column 'PCx'.
(e.g. All values in 'EOF1' by 'PC1')
I've been trying string and numeric loops but nothing seems to work: either I get an error or I get NaNs.
Last thing I tried was
for k in eof_df.keys():
    for m in pc_df.keys():
        eof_df[k].divide(pc_df[m])
The end result is a modified eof_df.
What did work for 1 column outside the loop is this.
eof_df.iloc[:,0].divide(std_df.iloc[0]).head()
Thank you!
upd1. In response to MoRe:
for eof_df it will be:
{'EOF1': {'8410140.nc': -0.09481700372712784,
'8418150.nc': -0.11842440098461708,
'8443970.nc': -0.1275311990493338,
'8447930.nc': -0.1321116945944401,
'8449130.nc': -0.11649753033608201,
'8452660.nc': -0.14776686151828214,
'8454000.nc': -0.1451132595405897,
'8461490.nc': -0.17032364516557338,
'8467150.nc': -0.20725618455428937,
'8518750.nc': -0.2249648853806308},
'EOF2': {'8410140.nc': 0.051213689088367806,
'8418150.nc': 0.0858110390036938,
'8443970.nc': 0.09029173023479754,
'8447930.nc': 0.05526955432871537,
'8449130.nc': 0.05136680082838883,
'8452660.nc': 0.06105351220962777,
'8454000.nc': 0.052112043784544135,
'8461490.nc': 0.08652511173850089,
'8467150.nc': 0.1137754089944319,
'8518750.nc': 0.10461193696203},
and it goes to EOF44.
For pc_df it will be
{'PC1': 0.5734671652560537,
'PC2': 0.29256502033278076,
'PC3': 0.23586098119374838,
'PC4': 0.227069130368915,
'PC5': 0.1642170373016029,
'PC6': 0.14131097046499339,
'PC7': 0.09837935104899741,
'PC8': 0.0869056762311067,
'PC9': 0.08183389338415169,
'PC10': 0.07467191608481094}
output = pd.DataFrame(index=eof_df.index, data=eof_df.values / pc_df.values)
output.columns = eof_df.columns
data = pd.DataFrame(eof_df.values.T / pc_df.values.T).T
data.columns = ["divided" + str(i + 1) for i in data.columns.to_list()]

I am trying to write a function which divides a column into 3 parts

I am trying to write a function which takes a column as input, divides it into 3 parts (short, medium, long), and returns them as lists.
I tried to do it with the loc function, but it returns a dataframe rather than a list.
def DivideColumns(df, col):
    mean = df[col].mean()
    maxi = df[col].max()
    mini = df[col].min()
    less = mean - (maxi - mini) / 3
    more = mean + (maxi - mini) / 3
    short = df.loc[df[col] < less]
    average = df.loc[df[col].between(less, more)]
    long = df.loc[df[col] > more]
    return short, average, long
What I expected was to get 3 different lists, but unfortunately I got 3 different dataframes.
Since you are using pandas you can use the concept of binning. By using the pandas cut function you can divide into the ranges you like, and it makes your code easier to read. More info is in the pandas.cut documentation.
def DivideColumns(df, col):
    mean = df[col].mean()
    maxi = df[col].max()
    mini = df[col].min()
    less = mean - (maxi - mini) / 3
    more = mean + (maxi - mini) / 3
    # binning
    bins_values = [mini, less, more, maxi]
    group_names = ['short', 'average', 'long']
    bins = pd.cut(df[col], bins_values, labels=group_names, include_lowest=True)
    short = df[col][bins == 'short'].tolist()
    average = df[col][bins == 'average'].tolist()
    long = df[col][bins == 'long'].tolist()
    return short, average, long
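A quick hypothetical usage sketch (the 'duration' column name and values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'duration': [1, 2, 5, 8, 9, 15, 22, 30]})
short, average, long = DivideColumns(df, 'duration')
print(short)    # values below less
print(average)  # values between less and more
print(long)     # values above more
Each of the three returned objects is now a plain Python list, which is what the question asked for.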
Use the tolist() function to transform a pandas dataframe into a list:
short = df.loc[df[col] < less].values.tolist()
average = df.loc[df[col].between(less, more)].values.tolist()
long = df.loc[df[col] > more].values.tolist()

What is the source of this error: python pandas

import pandas as pd

census_df = pd.read_csv('census.csv')
#census_df.head()

def answer_seven():
    census_df_1 = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    census_df_1['highest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].max()
    census_df_1['lowest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].min()
    x = abs(census_df_1['highest'] - census_df_1['lowest']).tolist()
    return x[0]

answer_seven()
This is trying to use the data from census.csv to find the counties that have the largest absolute change in population within 2010-2015 (the POPESTIMATE columns); I wanted to simply find the difference between the absolute values of the max and min for each year/column. You must return a string. Also, [(census_df['SUMLEV'] == 50)] means only counties are taken, as they are set to 50. But the code gives an error that ends with
KeyError: "['POPESTIAMTE2010' 'POPESTIAMTE2011' 'POPESTIAMTE2012'
'POPESTIAMTE2013'\n 'POPESTIAMTE2014' 'POPESTIAMTE2015'] not in index"
Am I indexing the wrong data structure? I'm really new to datascience and coding.
I think the column names in the code have a typo. The pattern is 'POPESTIMATE201?' and not 'POPESTIAMTE201?'.
Any help with shortening the code will be appreciated. Here is the code that works -
census_df = pd.read_csv('census.csv')

def answer_seven():
    cdf = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    columns = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
    cdf['big'] = cdf[columns].max(axis=1)
    cdf['sml'] = cdf[columns].min(axis=1)
    cdf['change'] = cdf[['big']].sub(cdf['sml'], axis=0)
    return cdf['change'].idxmax()
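On the 'shortening the code' point, a hedged sketch of the same computation in fewer lines; it uses the same explicit column list and assumes nothing new about census.csv:
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
        'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']

def answer_seven_short():
    # county rows only (SUMLEV == 50), indexed by county name
    cdf = census_df[census_df['SUMLEV'] == 50].set_index('CTYNAME')
    # county whose max-min population spread over 2010-2015 is largest
    return (cdf[cols].max(axis=1) - cdf[cols].min(axis=1)).idxmax()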

pandas: setting last N rows of multi-index to NaN for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift and, thanks to this previous question and answer (How to speed up Pandas multilevel dataframe shift by group?), I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry of each multi-index group to NaN, so I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
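To close the loop on the original indexing question (turning each single integer position into a range of positions), here is a hedged, purely positional sketch. It assumes the frame is sorted so each level-0 group is contiguous, and it would run in place of the replace_tail step above, before tmpShift is dropped; N is a hypothetical name for the number of trailing rows per group to blank out (e.g. abs(shiftBy)):
import numpy as np

N = 1  # e.g. abs(shiftBy) from the setup above
sizes = df.groupby(level=0).size().values
ends = sizes.cumsum()        # one-past-the-end position of each group
starts = ends - sizes        # start position of each group
# the last N row positions of every group, as one flat array of ranges
tail_rows = np.concatenate([np.arange(max(e - N, s), e) for s, e in zip(starts, ends)])
df.iloc[tail_rows, df.columns.get_loc('tmpShift')] = np.nan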
