I have a dataframe with 3 columns: a_id, b, c (with a_id as a unique key), and I would like to assign a score to each row based on the values in the b and c columns. I have created the following:
def b_score_function(df):
    if df['b'] <= 0:
        return 0
    elif df['b'] <= 2:
        return 0.25
    else:
        return 1

def c_score_function(df):
    if df['c'] <= 0:
        return 0
    elif df['c'] <= 1:
        return 0.5
    else:
        return 1
Normally, I would use something like this:
df['b_score'] = df.apply(b_score_function, axis=1)
df['c_score'] = df.apply(c_score_function, axis=1)
However, this approach gets too long when there are many columns. How can I loop over the selected columns instead? I have tried the following:
ds_cols = df.columns.difference(['a_id']).to_list()
for col in ds_cols:
    df[f'{col}_score'] = df.apply(f'{col}_score_function', axis=1)
but it returned with the following error:
'b_score_function' is not a valid function for 'DataFrame' object
Can anyone please point out what I did wrong?
Also, if anyone can suggest how to make this a reusable approach, that would be appreciated.
Thank you.
IIUC, this should work for you:
import pandas as pd

df = pd.DataFrame({'a_id': range(5), 'b': [0.0, 0.25, 0.5, 2.0, 2.5], 'c': [0.0, 0.25, 0.5, 1.0, 1.5]})

def b_score_function(df):
    if df['b'] <= 0:
        return 0
    elif df['b'] <= 2:
        return 0.25
    else:
        return 1

def c_score_function(df):
    if df['c'] <= 0:
        return 0
    elif df['c'] <= 1:
        return 0.5
    else:
        return 1

ds_cols = df.columns.difference(['a_id']).to_list()
for col in ds_cols:
    # eval() turns the string 'b_score_function' into the function object of that name
    df[f'{col}_score'] = df.apply(eval(f'{col}_score_function'), axis=1)
print(df)
Result:
a_id b c b_score c_score
0 0 0.00 0.00 0.00 0.0
1 1 0.25 0.25 0.25 0.5
2 2 0.50 0.50 0.25 0.5
3 3 2.00 1.00 0.25 0.5
4 4 2.50 1.50 1.00 1.0
For a vectorized way to do this in a single shot, you can use dictionaries to hold the threshold and replacement values, then numpy.select:
import numpy as np
import pandas as pd

# example input
df = pd.DataFrame({'b': [-1, 2, 5],
                   'c': [5, -1, 1]})
# dictionaries (one key:value per column)
thresh = {'b': 2, 'c': 1}
repl = {'b': 0.25, 'c': 0.5}
out = pd.DataFrame(
    np.select([df.le(0), df.le(thresh)],
              [0, pd.Series(repl)],
              1),
    columns=list(thresh),
    index=df.index
).add_suffix('_score')
output:
b_score c_score
0 0.00 1.0
1 0.25 0.0
2 1.00 0.5
The problem with your attempt is that pandas cannot access your functions from strings with the same name. For example, you need to pass df.apply(b_score_function, axis=1), and not df.apply("b_score_function", axis=1) (note the double quotes).
My first thought would be to link the column names to functions with a dictionary:
funcs = {'b': b_score_function,
         'c': c_score_function}
for col in ds_cols:
    foo = funcs[col]
    df[f'{col}_score'] = df.apply(foo, axis=1)
Typing out the dictionary funcs may be tedious or infeasible depending on how many columns/functions you have. If that is the case, you may have to find additional ways to automate the creation and access of your column-specific functions.
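One possible way to automate the creation of the column-specific functions is a small factory that builds each scorer from a table of thresholds. This is only a sketch: make_score_function and the rules dictionary are hypothetical names, not part of the question, and the thresholds simply mirror b_score_function and c_score_function.

```python
import pandas as pd


def make_score_function(col, steps, default=1):
    """Return a row-wise scorer for `col` (hypothetical helper).

    `steps` is a list of (upper_bound, score) pairs checked in order;
    `default` is returned when the value exceeds every bound.
    """
    def score(row):
        for upper_bound, value in steps:
            if row[col] <= upper_bound:
                return value
        return default
    return score


# Hypothetical rule table mirroring b_score_function and c_score_function
rules = {
    'b': [(0, 0), (2, 0.25)],
    'c': [(0, 0), (1, 0.5)],
}

df = pd.DataFrame({'a_id': range(5),
                   'b': [0.0, 0.25, 0.5, 2.0, 2.5],
                   'c': [0.0, 0.25, 0.5, 1.0, 1.5]})

# One scorer per column, built from the rule table instead of typed out by hand
for col, steps in rules.items():
    df[f'{col}_score'] = df.apply(make_score_function(col, steps), axis=1)

print(df)
```

Adding a column then only means adding one entry to rules, and no lookup by function name is needed.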
One somewhat automatic way is to use locals() or globals() - these will return dictionaries which have the functions you defined (as well as other things):
for col in ds_cols:
    key = f"{col}_score_function"
    foo = locals()[key]
    df[f'{col}_score'] = df.apply(foo, axis=1)
This code is dependent on the fact that the function for column "X" is called X_score_function(), but that seems to be met in your example. It also requires that every column in ds_cols will have a corresponding entry in locals().
Somewhat confusingly there are some functions which you can access by passing a string to apply, but these are only the ones that are shortcuts for numpy functions, like df.apply('sum') or df.apply('mean'). Documentation for this appears to be absent. Generally you would want to do df.sum() rather than df.apply('sum'), but sometimes being able to access the method by the string is convenient.
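As a small illustration of the string shortcut, the sketch below (on a made-up frame) checks that passing the name of a built-in reduction to apply gives the same result as calling the method directly:

```python
import pandas as pd

# Made-up example data
df = pd.DataFrame({'b': [1, 2, 3], 'c': [4.0, 5.0, 6.0]})

by_name = df.apply('sum')   # the string is resolved to the DataFrame's own sum method
direct = df.sum()           # the usual spelling

print(by_name.equals(direct))
```

A string such as 'b_score_function' does not name any DataFrame method, which is why the attempt in the question raises the error shown there.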
I'll admit that this question is quite specific. I'm trying to write a function that reads two time columns (same label) in separate dataframes, df1['gps'] and df2['gps']. I want to look for elements in the first column that are close to those in the second column, not necessarily in the same row. When the condition on the time distance is met, I want to save the close elements of df1['gps'] and df2['gps'] in separate columns coinc['gps1'] and coinc['gps2'] of a new dataframe called coinc, in the fastest and most efficient way. This is my code:
def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    index_boolean = False
    if df2 is None:
        df2 = df1.copy()
    coincs = pd.DataFrame()
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps+shift-df2.gps)<tdelta]
        print(r1.gps)
        coincs_single = pd.DataFrame()
        if len(ctrig)>0:
            coincs_single['gps1'] = r1.gps
            coincs_single['gps2'] = ctrig.gps
            coincs = pd.concat((coincs, coincs_single), axis = 0, ignore_index=index_boolean)
            index_boolean=True
        else:
            pass
    return coincs
The script runs fine, but when investigating the output, I find that one column of coinc is all NaN and I don't understand why. Test case with generated data:
a = pd.DataFrame() #define dataframes and fill them
b = pd.DataFrame()
a['gps'] = [0.12, 0.13, 0.6, 0.7]
b['gps'] = [0.1, 0.3, 0.5, 0.81, 0.82, 0.83]
find_coinc(a, b, 0.16, 0)
The output yielded is:
gps1 gps2
0 NaN 0.10
1 NaN 0.10
2 NaN 0.50
3 NaN 0.81
4 NaN 0.82
5 NaN 0.83
How can I write coinc so that both columns turn out fine?
Well, here is another solution. Instead of concatenating two dataframes, just add new rows to the 'coincs' DataFrame, as shown below.
def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    if df2 is None:
        df2 = df1.copy()
    coincs = pd.DataFrame(columns=['gps1', 'gps2'])
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps+shift-df2.gps) < tdelta]
        if len(ctrig)>0:
            for ctrig_value in ctrig['gps']:
                # Add n rows based on 'ctrig' length.
                coincs.loc[len(coincs)] = [r1.gps, ctrig_value]
        else:
            pass
    return coincs
# -------------------
a = pd.DataFrame() # define dataframes and fill them
b = pd.DataFrame()
a['gps'] = [0.12, 0.13, 0.6, 0.7]
b['gps'] = [0.1, 0.3, 0.5, 0.81, 0.82, 0.83]
coins = find_coinc(a, b, 0.16, 0)
print('\n\n')
print(coins.to_string())
Result:
gps1 gps2
0 0.12 0.10
1 0.13 0.10
2 0.60 0.50
3 0.70 0.81
4 0.70 0.82
5 0.70 0.83
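As for why the original version produced NaN in gps1, the following minimal sketch reproduces the effect outside the loop. The values are made up, and the names broken and fixed are only for illustration: assigning a scalar to a column of an empty DataFrame creates a column with no rows, and the later Series assignment gives the frame its index, leaving the first column filled with NaN.

```python
import pandas as pd

# Made-up stand-in for the matching rows of df2 (note the non-zero index)
matches = pd.Series([0.81, 0.82], index=[3, 4], name='gps')

broken = pd.DataFrame()
broken['gps1'] = 0.7       # scalar into an empty frame: the column exists but has no rows
broken['gps2'] = matches   # the frame now takes the index of `matches`; gps1 becomes NaN

# Building both columns at once broadcasts the scalar over the index of `matches`
fixed = pd.DataFrame({'gps1': 0.7, 'gps2': matches})

print(broken)
print(fixed)
```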
I hope I could help! :D
So the issue is that there are multiple elements in df2['gps'] which satisfy the condition of being within a time window of df1['gps']. I think I found a solution, but I am still looking for a better one if possible. The modified line in the original function is marked with a ### FIX UPDATE comment:
def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    index_boolean = False
    if df2 is None:
        df2 = df1.copy()
    coincs = pd.DataFrame()
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps+shift-df2.gps)<tdelta]
        ctrig.reset_index(drop=True, inplace=True)
        coincs_single = pd.DataFrame()
        if len(ctrig)>0:
            coincs_single['gps1'] = [r1.gps]*len(ctrig) ### FIX UPDATE
            coincs_single['gps2'] = ctrig.gps
            print(ctrig.gps)
            coincs = pd.concat((coincs, coincs_single), axis = 0, ignore_index=index_boolean)
            index_boolean=True
        else:
            pass
    return coincs
The solution I chose, since I want to have all the instances of the condition being met, was to write the same element in df1['gps'] into coinc['gps1'] the needed amount of times.
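If a loop-free version is of interest, one option is NumPy broadcasting: compare every time in df1 with every time in df2 at once. This is a sketch, not a drop-in replacement; find_coinc_vectorized is a hypothetical name, and the full len(df1) x len(df2) comparison matrix must fit in memory, which may not hold for very long columns.

```python
import numpy as np
import pandas as pd


def find_coinc_vectorized(df1, df2=None, tdelta=0.25, shift=0):
    """Return all (gps1, gps2) pairs with |gps1 + shift - gps2| < tdelta."""
    if df2 is None:
        df2 = df1
    t1 = df1['gps'].to_numpy()
    t2 = df2['gps'].to_numpy()
    # Boolean matrix: entry [i, j] is True when t1[i] and t2[j] are close enough
    close = np.abs(t1[:, None] + shift - t2[None, :]) < tdelta
    i1, i2 = np.nonzero(close)
    return pd.DataFrame({'gps1': t1[i1], 'gps2': t2[i2]})


a = pd.DataFrame({'gps': [0.12, 0.13, 0.6, 0.7]})
b = pd.DataFrame({'gps': [0.1, 0.3, 0.5, 0.81, 0.82, 0.83]})
coinc = find_coinc_vectorized(a, b, 0.16, 0)
print(coinc)
```

On the test data from the question this yields the same six pairs as the loop-based versions.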
How do I drop all the rows that have a RET with absolute value greater than 0.10?
Tried using this, not sure what to do
df_drop = [data[data[abs(float('RET')) < 0.10]
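The line below keeps the rows whose RET has an absolute value of at least 0.10; to drop the rows above 0.10 instead, as the question asks, change the comparison to <= 0.10: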
data = data[abs(data['RET'].astype('float')) >= 0.10]
I advise you to code it in two steps to understand it better.
import pandas as pd
df = pd.DataFrame({'val':[5, 11, 89, 63],
'RET':[-0.1, 0.5, -0.04, 0.09],
})
# Step 1 : define the mask
m = abs(df['RET']) > 0.1
# Step 2 : apply the mask
print(df[m])
Result
df[m]
val RET
1 11 0.5
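Since the question asks to drop those rows rather than keep them, the same mask can be negated with ~. A minimal sketch on the same sample data (the name kept is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'val': [5, 11, 89, 63],
                   'RET': [-0.1, 0.5, -0.04, 0.09]})

# Step 1: mark the rows whose absolute RET exceeds 0.1
m = df['RET'].abs() > 0.1
# Step 2: keep everything that is NOT marked
kept = df[~m]
print(kept)
```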
You can use abs() and le() or lt() to filter for the wanted values.
df_drop = data[data['RET'].abs().lt(0.10)]
Also consider between(). Select an appropriate policy for inclusive: it can be "neither", "both", "left" or "right".
df_drop = data[data['RET'].between(-0.10, 0.10, inclusive="neither")]
Example
data = pd.DataFrame({
'RET':[0.011806, -0.122290, 0.274011, 0.039013, -0.05044],
'other': [1,2,3,4,5]
})
RET other
0 0.011806 1
1 -0.122290 2
2 0.274011 3
3 0.039013 4
4 -0.050440 5
Both approaches above lead to
RET other
0 0.011806 1
3 0.039013 4
4 -0.050440 5
All rows with an absolute value greater than 0.10 in RET are excluded.
I have this resulting correlation matrix:
id  row  col  corr  target_corr
 0    a    b  0.95          0.2
 1    a    c  0.70          0.2
 2    a    d  0.20          0.2
 3    b    a  0.95          0.7
 4    b    c  0.35          0.7
 5    b    d  0.65          0.7
 6    c    a  0.70          0.6
 7    c    b  0.35          0.6
 8    c    d  0.02          0.6
 9    d    a  0.20          0.3
10    d    b  0.65          0.3
11    d    c  0.02          0.3
After filtering the highly correlated pairs on the "corr" column, I am trying to add a new column that marks the "row" variable as "keep" when it is the less correlated of the pair according to the "target_corr" column, and as "drop" when it is the more correlated one. In other words, from each pair of correlated variables above the cut (> 0.5), select the one with the lowest "target_corr":
Expected result:
id  row  col  corr  target_corr  drop/keep
 0    a    b  0.95          0.2       keep
 1    a    c  0.70          0.2       keep
 2    b    a  0.95          0.7       drop
 3    b    d  0.65          0.7       drop
 4    c    a  0.70          0.6       drop
 5    d    b  0.65          0.3       keep
This approach has to work with very large dataframes: the resulting correlation matrix is, for example, larger than 100k x 100k, and it is generated with pyspark:
import time

import numpy as np
import pandas as pd


def corrwith_matrix_no_save(df, data_cols=None, select_targets=None, method='pearson'):
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    start_time = time.time()
    vector_col = "corr_features"
    # Fall back to every column when no explicit selection is given
    if data_cols is None:
        data_cols = list(df.columns)
    if select_targets is None:
        select_targets = list(data_cols)
    assembler = VectorAssembler(inputCols=data_cols, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    matrix = Correlation.corr(df_vector, vector_col, method)
    # The result column is named after the method, e.g. "pearson(corr_features)"
    result = matrix.collect()[0]["{}({})".format(method, vector_col)].values
    final_df = pd.DataFrame(result.reshape(-1, len(data_cols)), columns=data_cols, index=data_cols)
    final_df = final_df.apply(lambda x: x.abs() if np.issubdtype(x.dtype, np.number) else x)
    # Copy so that adding a column below does not write into a view of final_df
    corr_df = final_df[select_targets].copy()
    #corr_df.columns = [str(col) + '_corr' for col in corr_df.columns]
    corr_df['column_names'] = corr_df.index
    print('Execution time for correlation_matrix function:', time.time() - start_time)
    return corr_df
I then created the dataframe from the upper triangle with numpy.triu and stack, and added the target column by merging the two resulting dataframes (I can provide more code if required, but it would lengthen the post a lot, so I will only do so if clarification is needed).
def corrX_to_ls(corr_mtx):
    # Get correlation matrix and upper triangle
    # (assumes 'target' appears as both a row and a column of corr_mtx)
    df_target = corr_mtx['target']
    corr_df = corr_mtx.drop(index='target', columns='target')
    up = corr_df.where(np.triu(np.ones(corr_df.shape), k=1).astype(bool))
    print('This is triu: \n', up)
    df = up.stack().reset_index()
    df.columns = ['row', 'col', 'corr']
    df_lsDF = df.query("row != col")
    df_target_corr = df_target.reset_index()
    df_target_corr.columns = ['target_col', 'target_corr']
    sample_df = df_lsDF.merge(df_target_corr, how='left', left_on='row', right_on='target_col')
    sample_df = sample_df.drop(columns='target_col')
    return sample_df
Now, after filtering the resulting dataframe on df['corr'] > cut (with cut > 0.50), I am stuck at marking which variable to keep and which to drop
(I only want to mark them first, and then select the variables into a list),
so any help solving this will be greatly appreciated and will also
benefit the community working on distributed systems.
Note: I am looking for an example or solution that scales, so that the
operations can be distributed across executors: working on lists or on
groups/subsets of the dataframe in parallel while avoiding loops.
Approaches based on numpy.vectorize, threading and/or multiprocessing
are what I am after.
Additional thinking off the top of my head: I am considering grouping by
the "row" column so that each group can be processed on a separate executor,
or using lists so that each list generates a job for a thread from a
ThreadPool (I have done this for column vectors, but it can become
inefficient for very large matrices/dataframes, so I think it will work
better for rows).
Given final_df as the sample input, you can try:
# filter
output = final_df.query('corr>target_corr').copy()
# assign drop/keep
output['drop_keep'] = np.where(output['corr']>2*output['target_corr'],
'keep','drop')
Output:
id row col corr target_corr drop_keep
0 0 a b 0.95 0.2 keep
1 1 a c 0.70 0.2 keep
3 3 b a 0.95 0.7 drop
6 6 c a 0.70 0.6 drop
10 10 d b 0.65 0.3 keep
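The expected result in the question can also be reproduced by comparing each variable's target_corr with that of its partner in the pair. The sketch below is pandas-only and rests on assumptions: the column col_target_corr and the lookup target_by_var are hypothetical helpers, ties are marked "drop", and nothing here addresses distribution across executors.

```python
import numpy as np
import pandas as pd

# Sample input from the question
rows = [('a', 'b', 0.95, 0.2), ('a', 'c', 0.70, 0.2), ('a', 'd', 0.20, 0.2),
        ('b', 'a', 0.95, 0.7), ('b', 'c', 0.35, 0.7), ('b', 'd', 0.65, 0.7),
        ('c', 'a', 0.70, 0.6), ('c', 'b', 0.35, 0.6), ('c', 'd', 0.02, 0.6),
        ('d', 'a', 0.20, 0.3), ('d', 'b', 0.65, 0.3), ('d', 'c', 0.02, 0.3)]
final_df = pd.DataFrame(rows, columns=['row', 'col', 'corr', 'target_corr'])

cut = 0.5
pairs = final_df[final_df['corr'] > cut].copy()

# Lookup: variable name -> its correlation with the target
target_by_var = final_df.drop_duplicates('row').set_index('row')['target_corr']
pairs['col_target_corr'] = pairs['col'].map(target_by_var)

# Keep the "row" variable only when it is less correlated with the target than its partner
pairs['drop_keep'] = np.where(pairs['target_corr'] < pairs['col_target_corr'], 'keep', 'drop')
result = pairs.drop(columns='col_target_corr').reset_index(drop=True)
print(result)
```

Because the comparison is column-wise, the same logic should translate to a join plus a conditional column in a distributed dataframe library, though that is not shown here.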
I am a newbie in Python and I want to shift the values of a column by a shift unit that I have in another column.
My data looks like this:
Group Rate
1 0.1
1 0.2
1 0.3
2 0.9
2 0.12
The Shift_Unit of the first group is 2 and that of the second group is 1.
The desired output is the following :
Group Shifted_Rate
1 0
1 0
1 0.1
2 0
2 0.9
I tried to do the following but it is not working :
df['Shifted_Rate'] = df['Rate'].shift(df['Shift_Unit'])
Is there another way to do it without the shift() method ?
I think this might be the first time I've worked with pandas, so this might not be helpful, but from what I've found in the documentation for pandas.DataFrame.shift(), it looks like the periods variable that relates to the "number of periods to shift" is an int. Because of this (that is, because this is an int rather than something like a list or dict), I have the feeling that you might need to approach this type of problem by making individual data frames and then putting these data frames together. I tried this out and used pandas.DataFrame.append() to put the individual data frames together. There might be a more efficient way to do this with pandas, but for now, I hope this helps with your immediate situation.
Here is the code that I used to approach your situation (this code is in a file called q11.py in my case):
import numpy as np
import pandas as pd
# The periods used for the shifting of each group
# (e.g., '1' is for group 1, '2' is for group 2).
# You can add more items here later if need be.
periods = {
'1': 2,
'2': 1
}
# Building the first DataFrame
df1 = pd.DataFrame({
'Rate': pd.Series([0.1, 0.2, 0.3], index=[1, 1, 1]),
})
# Building the second DataFrame
df2 = pd.DataFrame({
'Rate': pd.Series([0.9, 0.12], index=[2, 2]),
})
# Shift
df1['Shifted_Rate'] = df1['Rate'].shift(
periods=periods['1'],
fill_value=0
)
df2['Shifted_Rate'] = df2['Rate'].shift(
periods=periods['2'],
fill_value=0
)
# Append the df2 DataFrame to df1 and save the result to a new DataFrame df3
# ref: https://pythonexamples.org/pandas-append-dataframe/
# ref: https://stackoverflow.com/a/51953935/1167750
# ref: https://stackoverflow.com/a/40014731/1167750
# ref: https://pandas.pydata.org/pandas-docs/stable/reference/api
# /pandas.DataFrame.append.html
# Note: DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result.
df3 = pd.concat([df1, df2], ignore_index=False)
# ref: https://stackoverflow.com/a/18023468/1167750
df3.index.name = 'Group'
print("\n", df3, "\n")
# Optional: If you only want to keep the Shifted_Rate column:
del df3['Rate']
print(df3)
When running the program, the output should look like this:
$ python3 q11.py
Rate Shifted_Rate
Group
1 0.10 0.0
1 0.20 0.0
1 0.30 0.1
2 0.90 0.0
2 0.12 0.9
Shifted_Rate
Group
1 0.0
1 0.0
1 0.1
2 0.0
2 0.9
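If the shift unit is stored in a column, as the question suggests with Shift_Unit, the per-group frames do not have to be built by hand. The sketch below assumes that column exists and holds one constant value per group; it shifts each group by its own unit and lets pandas align the pieces back on the original index.

```python
import pandas as pd

# Data from the question, with the assumed Shift_Unit column
df = pd.DataFrame({'Group': [1, 1, 1, 2, 2],
                   'Rate': [0.1, 0.2, 0.3, 0.9, 0.12],
                   'Shift_Unit': [2, 2, 2, 1, 1]})

# Shift each group's Rate by that group's own unit, filling the gaps with 0
parts = [
    g['Rate'].shift(int(g['Shift_Unit'].iloc[0]), fill_value=0)
    for _, g in df.groupby('Group')
]
# The pieces keep their original row labels, so the assignment aligns on the index
df['Shifted_Rate'] = pd.concat(parts)
print(df[['Group', 'Shifted_Rate']])
```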
I'm attempting to run a for loop through a pandas dataframe and apply a logic expression to a column in each of the elements of the dataframe. My code compiles without error, but there is no output.
Example code:
for i in df:
    if df['value'].all() >= 0.0 and df['value'].all() < 0.05:
        print(df['value'])
Any help would be appreciated! Thanks
If you are looking to see whether all elements in a column are satisfying that logical expression, this can be used:
np.logical_and(df['value'] >= 0.0, df['value'] < 0.05).all()
This will return a single True or False.
By the way, I don't see how the for loop is being used: in its current form, the same code runs in every iteration.
.all() collapses the column to a single True or False before the comparison, so you end up comparing 1 or 0 with your thresholds; whenever every value is non-zero, True < 0.05 is False and nothing is printed. I assume what you actually want is (df['value'] >= 0.0).all() and (df['value'] < 0.05).all().
EDIT: You're also not actually iterating over the columns. Replace 'value' with i.
In [11]: df = pd.DataFrame([np.arange(0, 0.04, 0.01), np.arange(0, 4, 1)]).T
In [12]: df
Out[12]:
0 1
0 0.00 0.0
1 0.01 1.0
2 0.02 2.0
3 0.03 3.0
In [13]: for c in df:
    ...:     if (df[c] >= 0.0).all() and (df[c] < 0.05).all():
    ...:         print(df[c])
    ...:
0 0.00
1 0.01
2 0.02
3 0.03
Name: 0, dtype: float64
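The difference between the two forms of the check can be seen on a small made-up column in which every value lies inside the interval (the names collapsed, wrong and right are only for illustration):

```python
import pandas as pd

# Made-up data: every value is inside [0.0, 0.05)
df = pd.DataFrame({'value': [0.01, 0.02, 0.03, 0.04]})

# Original form: .all() runs first, so a single boolean is compared with the thresholds
collapsed = df['value'].all()                          # True, because no value is zero
wrong = bool(collapsed >= 0.0 and collapsed < 0.05)    # True >= 0.0, but True < 0.05 fails

# Intended form: compare element-wise first, then reduce with .all()
right = bool((df['value'] >= 0.0).all() and (df['value'] < 0.05).all())

print(wrong, right)
```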
Is this the result you are looking for?
print(df.loc[(df['value'] >= 0.0) & (df['value'] < 0.05), 'value'])
For just the values, throw this in:
print(df.loc[(df['value'] >= 0.0) & (df['value'] < 0.05), 'value'].values)
or
print(list(df.loc[(df['value'] >= 0.0) & (df['value'] < 0.05), 'value']))
for a list of the values