Shift data groups - python

I am a newbie in Python and I want to shift the values within each group by a shift unit that I have in a column.
My data looks like this:
Group Rate
1 0.1
1 0.2
1 0.3
2 0.9
2 0.12
The Shift_Unit is 2 for the first group and 1 for the second.
The desired output is the following:
Group Shifted_Rate
1 0
1 0
1 0.1
2 0
2 0.9
I tried to do the following but it is not working :
df['Shifted_Rate'] = df['Rate'].shift(df['Shift_Unit'])
Is there another way to do it without the shift() method ?

I think this might be the first time I've worked with pandas, so this might not be helpful, but from the documentation for pandas.DataFrame.shift(), the periods parameter (the "number of periods to shift") is an int. Because it is an int rather than something like a list or dict, I have the feeling that you need to approach this type of problem by making an individual data frame for each group, shifting each one, and then putting the data frames back together. I tried this out and originally used pandas.DataFrame.append() to combine them; note that append() was removed in pandas 2.0, where pandas.concat() does the same job. There might be a more efficient way to do this with pandas, but for now, I hope this helps with your immediate situation.
Here is the code that I used to approach your situation (this code is in a file called q11.py in my case):
import numpy as np
import pandas as pd
# The periods used for the shifting of each group
# (e.g., '1' is for group 1, '2' is for group 2).
# You can add more items here later if need be.
periods = {
    '1': 2,
    '2': 1
}
# Building the first DataFrame
df1 = pd.DataFrame({
    'Rate': pd.Series([0.1, 0.2, 0.3], index=[1, 1, 1]),
})
# Building the second DataFrame
df2 = pd.DataFrame({
    'Rate': pd.Series([0.9, 0.12], index=[2, 2]),
})
# Shift
df1['Shifted_Rate'] = df1['Rate'].shift(
    periods=periods['1'],
    fill_value=0
)
df2['Shifted_Rate'] = df2['Rate'].shift(
    periods=periods['2'],
    fill_value=0
)
# Concatenate df1 and df2 and save the result to a new DataFrame df3.
# DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it
# and keeps the original index labels when ignore_index=False.
# ref: https://pythonexamples.org/pandas-append-dataframe/
# ref: https://stackoverflow.com/a/51953935/1167750
# ref: https://stackoverflow.com/a/40014731/1167750
df3 = pd.concat([df1, df2], ignore_index=False)
# ref: https://stackoverflow.com/a/18023468/1167750
df3.index.name = 'Group'
print("\n", df3, "\n")
# Optional: If you only want to keep the Shifted_Rate column:
del df3['Rate']
print(df3)
When running the program, the output should look like this:
$ python3 q11.py

       Rate  Shifted_Rate
Group
1      0.10           0.0
1      0.20           0.0
1      0.30           0.1
2      0.90           0.0
2      0.12           0.9

       Shifted_Rate
Group
1               0.0
1               0.0
1               0.1
2               0.0
2               0.9
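If the shift unit is stored in a column, as the question's attempt with df['Shift_Unit'] suggests, the same per-group idea can be sketched without building each data frame by hand. This is only a sketch under assumptions: the Shift_Unit column and its values are reconstructed from the question, and shift_group is a hypothetical helper name.

```python
import pandas as pd

# Sample data from the question; the Shift_Unit column is an assumption
# based on the df['Shift_Unit'] reference in the question's attempt.
df = pd.DataFrame({
    "Group": [1, 1, 1, 2, 2],
    "Rate": [0.1, 0.2, 0.3, 0.9, 0.12],
    "Shift_Unit": [2, 2, 2, 1, 1],
})


def shift_group(group):
    """Shift one group's Rate by that group's (constant) Shift_Unit."""
    periods = int(group["Shift_Unit"].iloc[0])
    return group["Rate"].shift(periods, fill_value=0)


# Shift each group separately, then stitch the pieces back together.
# Each piece keeps its original row labels, so the assignment aligns.
pieces = [shift_group(group) for _, group in df.groupby("Group")]
df["Shifted_Rate"] = pd.concat(pieces)

print(df[["Group", "Shifted_Rate"]])
```

Because shift() still takes a single int per call, the loop over groups is what supplies a different shift to each group.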

Related

How to pass the whole dataframe and the index of the row being operated upon to the apply() method

How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?
Specifically, I have a dataframe correlation_df with the following data:
id scores cosine
1 100 0.8
2 75 0.7
3 50 0.4
4 25 0.05
I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.
My understanding is that I should do this with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.
NB. Problem code:
import numpy as np
import pandas as pd
score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
    {
        "score": score,
        "cosine": cosine,
    }
)
corr = correlation_df.corr().values[0, 1]
[Edit] Roundabout solution that I'm sure can be improved:
# Assumes correlation_df = correlation_df.reset_index() has been run first,
# so that an "index" column holding each row's position exists.
def my_fuct(row):
    i = int(row["index"])
    r = list(range(correlation_df.shape[0]))
    r.remove(i)
    subset = correlation_df.iloc[r, :].copy()
    subset = subset.set_index("index")
    return subset.corr().values[0, 1]
correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
Your problem can be simplified to:
>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
score cosine diff_correlations
0 100 0.80 0.999015
1 75 0.70 0.988522
2 50 0.40 0.977951
3 25 0.05 0.960769
A more sophisticated method, which avoids building the whole correlation matrix every time (the := operator requires Python 3.8+), would be:
df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
The index can be accessed in an apply with .name or .index, depending on the axis:
>>> correlation_df.apply(lambda x: x.name, axis=1)
0 0
1 1
2 2
3 3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
score cosine
0 0 0
1 1 1
2 2 2
3 3 3
Using
correlation_df = correlation_df.reset_index()
gives you a new column named index, holding what was previously your row index. When using DataFrame.apply with axis=1, access it via:
correlation_df.apply(lambda r: r["index"], axis=1)
After you are done you could do:
correlation_df = correlation_df.set_index("index")
to get your previous format back.
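Putting the answers above together, here is a self-contained sketch of the leave-one-out correlation that needs neither reset_index() nor the walrus operator. The helper name corr_without is hypothetical; the data is the sample from the question.

```python
import numpy as np
import pandas as pd

correlation_df = pd.DataFrame({
    "score": np.array([100, 75, 50, 25]),
    "cosine": np.array([0.8, 0.7, 0.4, 0.05]),
})


def corr_without(df, label):
    """Correlation of score and cosine with the row `label` left out."""
    rest = df.drop(index=label)
    return rest["score"].corr(rest["cosine"])


# Compute all values first, then attach them, so the new column never
# takes part in the correlation itself.
leave_one_out = pd.Series(
    [corr_without(correlation_df, label) for label in correlation_df.index],
    index=correlation_df.index,
)
correlation_df["diff_correlations"] = leave_one_out

print(correlation_df)
```

Passing the data frame and the row label explicitly keeps the function reusable for any frame that has score and cosine columns.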

Iterate function using apply for similar column name

I have a dataframe with 3 columns: a_id, b, c (with a_id as a unique key) and I would like to assign a score for each row based on the number in b and c columns. I have created the following:
def b_score_function(df):
    if df['b'] <= 0 :
        return 0
    elif df['b'] <= 2 :
        return 0.25
    else:
        return 1
def c_score_function(df):
    if df['c'] <= 0 :
        return 0
    elif df['c'] <= 1 :
        return 0.5
    else:
        return 1
Normally, I would use something like this:
df['b_score'] = df.apply(b_score_function, axis = 1)
df['c_score'] = df.apply(c_score_function, axis = 1)
However, the above approach will be too long if I have multiple columns. I would like to know how can I create a loop for the selected columns? I have tried the following:
ds_cols = df.columns.difference(['a_id']).to_list()
for col in ds_cols:
    df[f'{col}_score'] = df.apply(f'{col}_score_function', axis = 1)
but it returned with the following error:
'b_score_function' is not a valid function for 'DataFrame' object
Can anyone please point out what I did wrong?
Also, if anyone can suggest how to create a reusable function, that would be appreciated.
Thank you.
IIUC, this should work for you:
df = pd.DataFrame({'a_id': range(5), 'b': [0.0, 0.25, 0.5, 2.0, 2.5], 'c': [0.0, 0.25, 0.5, 1.0, 1.5]})
def b_score_function(df):
    if df['b'] <= 0 :
        return 0
    elif df['b'] <= 2 :
        return 0.25
    else:
        return 1
def c_score_function(df):
    if df['c'] <= 0 :
        return 0
    elif df['c'] <= 1 :
        return 0.5
    else:
        return 1
ds_cols = df.columns.difference(['a_id']).to_list()
for col in ds_cols:
    df[f'{col}_score'] = df.apply(eval(f'{col}_score_function'), axis = 1)
print(df)
Result:
a_id b c b_score c_score
0 0 0.00 0.00 0.00 0.0
1 1 0.25 0.25 0.25 0.5
2 2 0.50 0.50 0.25 0.5
3 3 2.00 1.00 0.25 0.5
4 4 2.50 1.50 1.00 1.0
For a vectorial way in a single shot, you can use dictionaries to hold the threshold and replacement values, then numpy.select:
import numpy as np
import pandas as pd
# example input
df = pd.DataFrame({'b': [-1, 2, 5],
                   'c': [5, -1, 1]})
# dictionaries (one key:value per column)
thresh = {'b': 2, 'c': 1}
repl = {'b': 0.25, 'c': 0.5}
out = pd.DataFrame(
    np.select([df.le(0), df.le(thresh)],
              [0, pd.Series(repl)],
              1),
    columns=list(thresh),
    index=df.index
).add_suffix('_score')
output:
b_score c_score
0 0.00 1.0
1 0.25 0.0
2 1.00 0.5
The problem with your attempt is that pandas cannot access your functions from strings with the same name. For example, you need to pass df.apply(b_score_function, axis=1), and not df.apply("b_score_function", axis=1) (note the double quotes).
My first thought would be to link the column names to functions with a dictionary:
funcs = {'b' : b_score_function,
         'c' : c_score_function}
for col in ds_cols:
    foo = funcs[col]
    df[f'{col}_score'] = df.apply(foo, axis = 1)
Typing out the dictionary funcs may be tedious or infeasible depending on how many columns/functions you have. If that is the case, you may have to find additional ways to automate the creation and access of your column-specific functions.
One somewhat automatic way is to use locals() or globals() - these will return dictionaries which have the functions you defined (as well as other things):
for col in ds_cols:
    key = f"{col}_score_function"
    foo = locals()[key]
    df[f"{col}_score"] = df.apply(foo, axis=1)
This code is dependent on the fact that the function for column "X" is called X_score_function(), but that seems to be met in your example. It also requires that every column in ds_cols will have a corresponding entry in locals().
Somewhat confusingly there are some functions which you can access by passing a string to apply, but these are only the ones that are shortcuts for numpy functions, like df.apply('sum') or df.apply('mean'). Documentation for this appears to be absent. Generally you would want to do df.sum() rather than df.apply('sum'), but sometimes being able to access the method by the string is convenient.
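On the request for something reusable: since both scoring functions in the question share the same shape (0 at or below zero, a middle score up to a threshold, 1 above it), one possible sketch is a small factory that builds the per-column function from its parameters. The name make_score_function and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd


def make_score_function(column, mid_threshold, mid_score):
    """Build a row-wise scorer for `column` (hypothetical helper)."""
    def score(row):
        value = row[column]
        if value <= 0:
            return 0
        if value <= mid_threshold:
            return mid_score
        return 1
    return score


# One entry per column; the thresholds mirror the question's two functions.
score_functions = {
    "b": make_score_function("b", 2, 0.25),
    "c": make_score_function("c", 1, 0.5),
}

df = pd.DataFrame({
    "a_id": range(5),
    "b": [0.0, 0.25, 0.5, 2.0, 2.5],
    "c": [0.0, 0.25, 0.5, 1.0, 1.5],
})

for col, func in score_functions.items():
    df[f"{col}_score"] = df.apply(func, axis=1)

print(df)
```

Adding a column then only means adding one dictionary entry, and no lookup by function name (eval, locals() or globals()) is needed.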

Python Pandas Repeat a value x amount of times according to x value

I am new to Python and Pandas and am trying to write a simple function that will repeat a value x times according to an adjacent value.
For example:
I want to take the first column (weight) and add it to a new column based on the amount next to it (wheels). So the column will have 1.5 repeated 27 times, then immediately after will have 2.4 repeated 177 times, and so on for all values shown. Does anyone know a simple way to do this?
Use Series.repeat:
out = df['Weight'].repeat(df['Wheels'])
print(out)
# Output
0 1.5
0 1.5
1 2.4
1 2.4
1 2.4
Name: Weight, dtype: float64
Setup:
df = pd.DataFrame({'Weight': [1.5, 2.4], 'Wheels': [2, 3]})
print(df)
# Output
Weight Wheels
0 1.5 2
1 2.4 3
Assuming you have a pandas dataframe named df.
import numpy as np
np.repeat(df['Weight'], df['Wheels'])
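Both approaches above keep the original row labels, so the repeated values carry a duplicated index. A short sketch of getting a clean 0..n-1 index, using the small setup frame from the first answer (the variable names repeated and expanded are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Weight": [1.5, 2.4], "Wheels": [2, 3]})

# Repeat each weight Wheels times and renumber the rows from 0.
repeated = df["Weight"].repeat(df["Wheels"]).reset_index(drop=True)
print(repeated)

# The same idea for whole rows: repeat the index labels, then select them.
expanded = df.loc[df.index.repeat(df["Wheels"])].reset_index(drop=True)
print(expanded)
```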

correlation matrix filtering based on high variables correlation with selection of least correlated with target variable at scale using vectors

I have this resulting correlation matrix:
id row col corr target_corr
0 a b 0.95 0.2
1 a c 0.7 0.2
2 a d 0.2 0.2
3 b a 0.95 0.7
4 b c 0.35 0.7
5 b d 0.65 0.7
6 c a 0.7 0.6
7 c b 0.35 0.6
8 c d 0.02 0.6
9 d a 0.2 0.3
10 d b 0.65 0.3
11 d c 0.02 0.3
After filtering highly correlated variables on the "corr" column, I am trying to add a new column that marks the variable in "row" as "keep" if it is the one in the pair that is less correlated with the target (per the "target_corr" column), or as "drop" if it is the more correlated one. In other words, from each pair of correlated variables with corr > 0.5, select the one least correlated with the target:
Expected result:
id row col corr target_corr drop/keep
0 a b 0.95 0.2 keep
1 a c 0.7 0.2 keep
2 b a 0.95 0.7 drop
3 b d 0.65 0.7 drop
4 c a 0.7 0.6 drop
5 d b 0.65 0.3 keep
This approach has to work on very large dataframes: the resulting correlation matrix is, for example, larger than 100k x 100k, and is generated using pyspark:
import time
import numpy as np
import pandas as pd
def corrwith_matrix_no_save(df, data_cols=None, select_targets=None, method='pearson'):
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation
    start_time = time.time()
    vector_col = "corr_features"
    # Default to all columns, both as inputs and as targets
    if data_cols is None:
        data_cols = df.columns
    if select_targets is None:
        select_targets = list(data_cols)
    assembler = VectorAssembler(inputCols=data_cols, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    matrix = Correlation.corr(df_vector, vector_col, method)
    # The single result column is named "<method>(<vector_col>)"
    result = matrix.collect()[0]["{}({})".format(method, vector_col)].values
    final_df = pd.DataFrame(result.reshape(-1, len(data_cols)), columns=data_cols, index=data_cols)
    final_df = final_df.apply(lambda x: x.abs() if np.issubdtype(x.dtype, np.number) else x)
    # .copy() avoids a SettingWithCopyWarning when the column is added below
    corr_df = final_df[select_targets].copy()
    #corr_df.columns = [str(col) + '_corr' for col in corr_df.columns]
    corr_df['column_names'] = corr_df.index
    print('Execution time for correlation_matrix function:', time.time() - start_time)
    return corr_df
I then created the dataframe from the upper triangle with numpy.triu and stack, and added the target column by merging the 2 resulting dataframes (if more code is required I can provide it, but it would increase the content a lot, so I will provide it only if clarification is needed).
def corrX_to_ls(corr_mtx):
    # Split off the target column, then keep the upper triangle of the rest
    df_target = corr_mtx['target']
    # drop() returns a new frame (with inplace=True it returns None).
    # Assumes 'target' may label both a row and a column of the matrix.
    corr_df = corr_mtx.drop(index='target', columns='target', errors='ignore')
    up = corr_df.where(np.triu(np.ones(corr_df.shape), k=1).astype(bool))
    print('This is triu: \n', up)
    df = up.stack().reset_index()
    df.columns = ['row', 'col', 'corr']
    df_lsDF = df.query("row != col")
    df_target_corr = df_target.reset_index()
    df_target_corr.columns = ['target_col', 'target_corr']
    sample_df = df_lsDF.merge(df_target_corr, how='left', left_on='row', right_on='target_col')
    sample_df = sample_df.drop(columns='target_col')
    return sample_df
Now, after filtering the resulting dataframe on df['corr'] > cut with cut > 0.50, I got stuck at marking which variable to keep and which to drop (I want to mark them first and only then select the variables into a list), so help on solving it will be greatly appreciated and will also benefit the community when working on distributed systems.
Note: I am looking for a solution that scales, so that I can distribute the operations on executors: lists or groups/subsets of the dataframe processed in parallel, avoiding loops. Approaches such as numpy.vectorize, threading and/or multiprocessing are what I am looking for.
Additional thoughts off the top of my head: I am thinking of grouping by the "row" column so that each group can be processed on the executors, or of using lists to distribute the processing in parallel so that each list generates a job for a thread from a ThreadPool (I have done this for column vectors, but for very large matrices/dataframes it can become inefficient, so I think it will work for rows).
Given final_df as the sample input, you can try:
# filter
output = final_df.query('corr>target_corr').copy()
# assign drop/keep
output['drop_keep'] = np.where(output['corr']>2*output['target_corr'],
'keep','drop')
Output:
id row col corr target_corr drop_keep
0 0 a b 0.95 0.2 keep
1 1 a c 0.70 0.2 keep
3 3 b a 0.95 0.7 drop
6 6 c a 0.70 0.6 drop
10 10 d b 0.65 0.3 keep
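As an alternative sketch that follows the rule stated in the question (within each pair with corr above the cut, keep the variable less correlated with the target), the target correlation of the "col" variable can be looked up with a vectorized map and compared with the one for "row". The column name col_target_corr and the variable names are assumptions for illustration; the data is the sample table from the question. This is plain pandas on a small frame, not a distributed implementation.

```python
import numpy as np
import pandas as pd

pairs = pd.DataFrame({
    "row": list("aaabbbcccddd"),
    "col": list("bcdacdabdabc"),
    "corr": [0.95, 0.7, 0.2, 0.95, 0.35, 0.65,
             0.7, 0.35, 0.02, 0.2, 0.65, 0.02],
})
# Correlation of each variable with the target, as in the sample table.
target_corr = pd.Series({"a": 0.2, "b": 0.7, "c": 0.6, "d": 0.3})
pairs["target_corr"] = pairs["row"].map(target_corr)

cut = 0.5
high = pairs[pairs["corr"] > cut].copy()

# Look up the target correlation of the other variable in the pair,
# then keep "row" only when it is the less target-correlated of the two.
high["col_target_corr"] = high["col"].map(target_corr)
high["drop_keep"] = np.where(
    high["target_corr"] < high["col_target_corr"], "keep", "drop"
)

print(high[["row", "col", "corr", "target_corr", "drop_keep"]])
```

On the sample data this marks the six pairs listed in the question's expected result, including the b/d pair, and it uses only column-wise operations with no Python-level loop over rows.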

Manual Feature Engineering in Pandas - Mean of 1 Column vs All Other Columns

Hard to describe this one, but for every column in a dataframe, create a new column that contains the mean of the current column vs the one next to it, then get the mean of that first column vs the next one down the line. Running Python 3.6.
For Example, given this dataframe:
I would like to get this output:
That exact order of the added columns at the end isn't important, but it needs to be able to handle every possible combination of means between all columns, with a depth of 2 (i.e. compare one column to another). Ideally, I would like to have the depth set as a separate variable, so I could have a depth of 3, where it would do this but compare 3 columns to one another.
Ideas? Thanks!
UPDATE
I got this to work, but I'm wondering if there's a computationally faster way of doing it. I basically just created 2 of the same loops (a loop within a loop) to compare 1 column vs the rest, skipping the same-column comparisons:
eng_features = pd.DataFrame()
for col in df.columns:
    for col2 in df.columns:
        # Don't compare same columns, or inversed same columns
        if col == col2 or (str(col2) + '_' + str(col)) in eng_features:
            continue
        eng_features[str(col) + '_' + str(col2)] = df[[col, col2]].mean(axis=1)
df = pd.concat([df, eng_features], axis=1)
Use itertools, a built-in Python module of utilities for iterators:
from itertools import permutations
for col1, col2 in permutations(df.columns, r=2):
    df[f'Mean_of_{col1}-{col2}'] = df[[col1,col2]].mean(axis=1)
and you will get what you need:
a b c Mean_of_a-b Mean_of_a-c Mean_of_b-a Mean_of_b-c Mean_of_c-a \
0 1 1 0 1.0 0.5 1.0 0.5 0.5
1 0 1 0 0.5 0.0 0.5 0.5 0.0
2 1 1 0 1.0 0.5 1.0 0.5 0.5
Mean_of_c-b
0 0.5
1 0.5
2 0.5
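Since the question asks to skip inverse pairs (a-b vs b-a) and to make the depth configurable, a variation on the answer above could use itertools.combinations with a depth variable. This is a sketch; the input values are taken from the a, b, c columns shown in the output above, and depth and new_cols are illustrative names.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({"a": [1, 0, 1], "b": [1, 1, 1], "c": [0, 0, 0]})

depth = 2  # number of columns averaged together; 3 would give a-b-c

# combinations() yields each unordered group of columns exactly once,
# so a-b is produced but the inverse b-a is not.
new_cols = {
    "Mean_of_" + "-".join(combo): df[list(combo)].mean(axis=1)
    for combo in combinations(df.columns, depth)
}
result = pd.concat([df, pd.DataFrame(new_cols)], axis=1)

print(result)
```

Building all new columns in a dictionary and concatenating once also avoids inserting columns into the frame one at a time.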
