Pandas: replace list of values from list of columns - python

I've got a many row, many column dataframe with different 'placeholder' values needing substitution (in a subset of columns). I've read many examples in the forum using nested lists or dictionaries, but haven't had luck with variations..
# A test dataframe
df = pd.DataFrame({'Sample':['alpha','beta','gamma','delta','epsilon'],
'element1':[1,-0.01,-5000,1,-2000],
'element2':[1,1,1,-5000,2],
'element3':[-5000,1,1,-0.02,2]})
# List of headings containing values to replace
headings = ['element1', 'element2', 'element3']
And I am trying to do something like this (obviously this doesn't work):
# If any rows have value <-1, NaN
df[headings].replace(df[headings < -1], np.nan)
# If a value is between -1 and 0, make a replacement
df[headings].replace(df[headings < 0 & headings > -1], 0.05)
So, is there possibly a better way to accomplish this using loops or fancy pandas tricks?

You can set the Sample column as index and then replace values on the whole data frame based on conditions:
df = df.set_index('Sample')
df[df < -1] = np.nan
df[(df < 0) & (df > -1)] = 0.05
Which gives:
# element1 element2 element3
# Sample
# alpha 1.00 1.0 NaN
# beta 0.05 1.0 1.00
# gamma NaN 1.0 1.00
# delta 1.00 NaN 0.05
# epsilon NaN 2.0 2.00

Here is the successful answer as suggested by #Psidom.
The solution involves taking a slice out of the dataframe, applying the function, then reincorporates the amended slice:
df1 = df.loc[:, headings]
df1[df1 < -1] = np.nan
df1[(df1 < 0)] = 0.05
df.loc[:, headings] = df1

Related

How to pass the whole dataframe and the index of the row being operated upon to the apply() method

How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?
Specifically, I have a dataframe correlation_df with the following data:
id
scores
cosine
1
100
0.8
2
75
0.7
3
50
0.4
4
25
0.05
I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.
My understanding is that I should do this with with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.
NB. Problem code:
import numpy as np
import pandas as pd
score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
{
"score": score,
"cosine": cosine,
}
)
corr = correlation_df.corr().values[0, 1]
[Edit] Roundabout solution that I'm sure can be improved:
def my_fuct(row):
i = int(row["index"])
r = list(range(correlation_df.shape[0]))
r.remove(i)
subset = correlation_df.iloc[r, :].copy()
subset = subset.set_index("index")
return subset.corr().values[0, 1]
correlation_df["diff_correlations"] = = correlation_df.apply(my_fuct, axis=1)
Your problem can be simplified to:
>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
score cosine diff_correlations
0 100 0.80 0.999015
1 75 0.70 0.988522
2 50 0.40 0.977951
3 25 0.05 0.960769
A more sophisticated method would be:
The whole correlation matrix isn't made every time this way.
df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
The index can be accessed in an apply with .name or .index, depending on the axis:
>>> correlation_df.apply(lambda x: x.name, axis=1)
0 0
1 1
2 2
3 3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
score cosine
0 0 0
1 1 1
2 2 2
3 3 3
Using
correlation_df = correlation_df.reset_index()
gives you a new column index, denoting the index of the row, namely what previously was your index. Now when using pd.apply access it via:
correlation_df.apply(lambda r: r["index"])
After you are done you could do:
correlation_df = correlation_df.set_index("index")
to get your previous format back.

Cleaning outliers inside a column with interpolation

I'm trying to do the following.
I have some data with wrong values (x<=0 or x>=1100) inside a dataframe.
I am trying to change those values to values inside an acceptable range.
For the time being, this is what I do code-wise
def while_non_nan(A, k):
init = k
if k+1 >= len(A)-1:
return A.iloc[k-1]
while np.isnan(A[k+1]):
k += 1
#Calculate the value.
n = k-init+1
value = (n*A.iloc[init-1] + A.iloc[k])/(n+1)
return value
evoli.loc[evoli['T1'] >= 1100, 'T1'] = np.nan
evoli.loc[evoli['T1'] <= 0, 'T1'] = np.nan
inds = np.where(np.isnan(evoli))
#Place column means in the indices. Align the arrays using take
for k in inds[0] :
evoli['T1'].iloc[k] = while_non_nan(evoli['T1'], k)
I transform the outlier values into nan.
Afterwards, I get the position of those nan.
Finally, I modify the nan to the mean value between the previous value and the next one.
Since, several nan can be next to each other, the whie_non_nan search for the next non_nan value and get the ponderated mean.
Example of what I'm hoping to get:
Input :
[nan 0 1 2 nan 4 nan nan 7 nan ]
Output:
[0 0 1 2 3 4 5 6 7 7 ]
Hope it is clear enough. Thanks !
Pandas has a builtin interpolation you could use after setting your limits to NaN:
from numpy import NaN
import pandas as pd
df = pd.DataFrame({"T1": [1, 2, NaN, 3, 5, NaN, NaN, 4, NaN]})
df["T1"] = df["T1"].interpolate(method='linear', axis=0).ffill().bfill()
print(df)
Interpolate is a DataFrame method that fills NaN values with specified interpolation method (linear in this case). Calling .bfill() for backward fill and .ffill() for forward fill ensures the 1st and last item are also replaced if needed, with 2nd and 2nd to last item respectively. If you want some fancier strategy for 1st and last item you need to write it yourself.

Cumulative sum of a pandas column until a maximum value is met, and average adjacent rows

I'm a biology student who is fairly new to python and was hoping someone might be able to help with a problem I have yet to solve
With some subsequent code I have created a pandas dataframe that looks like the example below:
Distance. No. of values Mean rSquared
1 500 0.6
2 80 0.3
3 40 0.4
4 30 0.2
5 50 0.2
6 30 0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the number of values column until I achieve a value >= 100; and then combine the data of the rows of the adjacent columns, taking the weighted average of the distance and mean r2 values, as seen in the example below
Mean Distance. No. Of values Mean rSquared
1 500 0.6
(80*2+40*3)/120 (80+40) = 120 (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110 (30+50+30) = 110 (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has it's .cumsum function, which I might be able to implement into a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent columns.
Any help would be appreciated!
You can use this code snippet to solve your problem.
# First, compute some weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]
min_threshold = 100
indexes = []
temp_sum = 0
# placeholder for final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]
# reseting index to make the 'df' usable in following output
df = df.reset_index(drop=True)
# main loop to check and compute the desired output
for index, _ in df.iterrows():
temp_sum += df.iloc[index]["No. of values"]
indexes.append(index)
# if the sum exceeds 'min_threshold' then do some computation
if temp_sum >= min_threshold:
temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
# create temporary dataframe and concatenate with the 'final_df'
temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
final_df = pd.concat([final_df, temp_df])
# reset the variables
temp_sum = 0
indexes = []
Numpy has a function numpy.frompyfunc You can use that to get the cumulative value based on a threshold.
Here's how to implement it. With that, you can then figure out the index when the value goes over the threshold. Use that to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged #sujanay's idea of calculating the weighted values first.
c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
[4,30,0.2], [5,50,0.2], [6,30,0.1]]
import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)
#calculate the weighted distance and weighted mean squares first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]
#use numpy.frompyfunc to setup the threshold condition
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
#assign value to cumvals based on threshold
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=np.object)
#find out all records that have >= 100 as cumulative values
idx = df.index[df['cumvals'] >= 100].tolist()
#if last row not in idx, then add it to the list
if (len(df)-1) not in idx: idx += [len(df)-1]
#iterate thru the idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
df.loc[j,'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j,'cumvals']).round(2)
df.loc[j,'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j,'cumvals']).round(2)
i = j+1
print (df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post can be found here. Restart cumsum and get index if cumsum more than value
If you figure out the grouping first, pandas groupby-functionality will do a lot of the remaining work for you. A loop is appropriate to get the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
... if cumsum >= 100:
... cumsum = 0
... group = group + 1
... cumsum = cumsum + n
... groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727

The best way to find the win percentage for a pandas DataFrame with win, loss columns

I have a pandas DataFrame with two columns ('win' and 'loss') and I want to find the win percentage ('win%') and pass it into the DataFrame. The thing is, for some rows, the entries are 0, so for those rows, I need to pass np.nan into 'win%'.
The following code does the job:
df=pd.DataFrame([[1,2],[0,0],[2,1],[0,1]],columns=['win','loss'])
df['total'] = df['win'] + df['loss']
x=[]
for i in range(df.shape[0]):
if df['total'].iloc[i] > 0:
x.append(df['win'].iloc[i] / df['total'].iloc[i])
else:
x.append(np.nan)
df['win%'] = x
Therefore, the desired outcome is:
win loss win%
0 1 2 0.333333
1 0 0 NaN
2 2 1 0.666667
3 0 1 0.000000
I was wondering if there is a more efficient (pandas-y) way to do it. Also, I don't want to add an unnecessary column ('total') if I don't have to. Any help is appreciated.
You can set all the zero values to np.nan first (using replace), because:
np.nan / np.nan = np.nan
And:
np.nan + np.nan = np.nan
So:
df = pd.DataFrame(
[[1,2],[0,0],[2,1]],columns=['win','loss']
).replace(0, np.nan)
df["win%"] = df["win"] / (df['win'] + df['loss'])

Manual Feature Engineering in Pandas - Mean of 1 Column vs All Other Columns

Hard to describe this one, but for every column in a dataframe, create a new column that contains the mean of the current column vs the one next to it, then get the mean of that first column vs the next one down the line. Running Python 3.6.
For Example, given this dataframe:
I would like to get this output:
That exact order of the added columns at the end isn't important, but it needs to be able to handle every possible combination of means between all columns, with a depth of 2 (i.e. compare one column to another). Ideally, I would like to have the depth set as a separate variable, so I could have a depth of 3, where it would do this but compare 3 columns to one another.
Ideas? Thanks!
UPDATE
I got this to work, but wondering if there's a more computationally fast way of doing it. I basically just created 2 of the same loops (loop within a loop) to compare 1 column vs the rest, skipping the same column comparisons:
eng_features = pd.DataFrame()
for col in df.columns:
for col2 in df.columns:
# Don't compare same columns, or inversed same columns
if col == col2 or (str(col2) + '_' + str(col)) in eng_features:
continue
else:
eng_features[str(col) + '_' + str(col2)] = df[[col, col2]].mean(axis=1)
continue
df = pd.concat([df, eng_features], axis=1)
Use itertools, a python built in utility package for iterators:
from itertools import permutations
for col1, col2 in permutations(df.columns, r=2):
df[f'Mean_of_{col1}-{col2}'] = df[[col1,col2]].mean(axis=1)
and you will get what you need:
a b c Mean_of_a-b Mean_of_a-c Mean_of_b-a Mean_of_b-c Mean_of_c-a \
0 1 1 0 1.0 0.5 1.0 0.5 0.5
1 0 1 0 0.5 0.0 0.5 0.5 0.0
2 1 1 0 1.0 0.5 1.0 0.5 0.5
Mean_of_c-b
0 0.5
1 0.5
2 0.5

Categories

Resources