I'm trying to do SMOTE oversampling from imblearn. This is my code:
X = data[['a','b','c']]
y = data['targets']
oversampler = SMOTE(random_state=42)
X_over, y_over = oversampler.fit_resample(X,y)
The last line, X_over, y_over = oversampler.fit_resample(X, y), raises the error "setting an array element with a sequence".
I am sure the reason is the shape of my X.
X is a dataframe where each row of column 'a' is a list of length 118, each row of column 'b' is a list of length 15, and column 'c' is an integer column. For example:
a (length 118)     b (length 15)    c
[1,2,3,4,...,0]    [4,7,8,9,...,0]  3
Now, how do I convert this dataframe X into an array of shape (n_samples, n_features), as required by the documentation?
Could someone please help me transform the input dataframe to get rid of this error?
You can expand the list columns; check first that the lengths within each column are all the same:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
data = pd.DataFrame({'targets': np.random.binomial(1, 0.15, 100),
                     'a': np.random.randint(0, 10, (100, 2)).tolist(),
                     'b': np.random.randint(11, 20, (100, 3)).tolist(),
                     'c': np.random.randint(0, 100, 100)
                     })
data['a'].apply(len).value_counts()
2 100
A function to expand the columns; the new columns will be named e.g. a0..aN, and the original list columns will be dropped:
def expand_cols(da, col_list):
    for C in col_list:
        ix = [C + str(i) for i in range(len(da[C].iloc[0]))]
        # build from da (not a global) and align on da's index
        da[ix] = pd.DataFrame(da[C].tolist(), columns=ix, index=da.index)
    da = da.drop(col_list, axis=1)
    return da
Your code, expanding the columns when we fit:
X = data[['a','b','c']]
y = data['targets']
oversampler = SMOTE(random_state=42)
X_over, y_over = oversampler.fit_resample(expand_cols(X,['a','b']),y)
Looks like this:
X_over.head()
c a0 a1 b0 b1 b2
0 67 4 0 19 15 16
1 12 3 7 12 17 19
2 41 8 9 15 18 18
3 35 8 0 11 13 11
4 46 0 5 12 12 12
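If you don't need named columns, a NumPy-only alternative is to stack the lists directly (a minimal sketch, my addition rather than part of the answer above; it assumes every list within a column has the same length):
import numpy as np

# stack each list column into a 2D block, then concatenate side by side
a_arr = np.vstack(X['a'].to_numpy())          # shape (n_samples, 118)
b_arr = np.vstack(X['b'].to_numpy())          # shape (n_samples, 15)
X_flat = np.hstack([a_arr, b_arr, X[['c']].to_numpy()])  # (n_samples, 134)

X_over, y_over = SMOTE(random_state=42).fit_resample(X_flat, y)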
I have a sample data:
df = pd.DataFrame(columns=['X1', 'X2', 'X3'], data=[
[1,16,9],
[4,36,16],
[1,16,9],
[2,9,8],
[3,36,15],
[2,49,16],
[4,25,14],
[5,36,17]])
I want to create two complementary columns in my df based on X2 and X3 and include this step in the pipeline.
I am trying to follow the code:
def feat_comp(x):
    x1 = 100 - x
    return x1
pipe_text = Pipeline([('col_test', FunctionTransformer(feat_comp, 'X2',validate=False))])
X = pipe_text.fit_transform(df)
It gives me an error:
TypeError: 'str' object is not callable
How can I apply the function transformer on selected columns and how can I use them in the pipeline?
If I understand you correctly, you want to add a new column based on a given column, e.g. X2. You need to pass this column as an additional argument to the function using kw_args:
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
df = pd.DataFrame(columns=['X1', 'X2', 'X3'], data=[
[1,16,9],
[4,36,16],
[1,16,9],
[2,9,8],
[3,36,15],
[2,49,16],
[4,25,14],
[5,36,17]])
def feat_comp(x, column):
    x[f'100-{column}'] = 100 - x[column]
    return x
pipe_text = Pipeline([('col_test', FunctionTransformer(feat_comp, validate=False, kw_args={'column': 'X2'}))])
pipe_text.fit_transform(df)
Result:
X1 X2 X3 100-X2
0 1 16 9 84
1 4 36 16 64
2 1 16 9 84
3 2 9 8 91
4 3 36 15 64
5 2 49 16 51
6 4 25 14 75
7 5 36 17 64
(In your example, FunctionTransformer(feat_comp, 'X2', validate=False), the second positional argument 'X2' is interpreted as inverse_func, and the string 'X2' is not callable, hence the error.)
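As an aside (my addition, not part of the original answer): if you would rather let scikit-learn select the columns for you, a ColumnTransformer sketch could look like this. Note that columns consumed by a transformer are not passed through, so the output holds 100 - X2 first, followed by X1 and X3:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('comp', FunctionTransformer(lambda x: 100 - x), ['X2'])],
    remainder='passthrough')  # keep the columns the transformer did not use
ct.fit_transform(df)  # array with columns: 100-X2, X1, X3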
from itertools import product
import pandas as pd
df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
# c1 c2
# 0 0 0
# 1 0 1
# 2 0 2
# 3 0 3
# 4 0 4
# .. .. ..
# 85 9 4
# 86 9 5
# 87 9 7
# 88 9 8
# 89 9 9
#
# [90 rows x 2 columns]
How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?
An example of a symmetric pair is that (0, 1) is equal to (1, 0). The latter should be removed.
The algorithm must be fast, so it is recommended to use numpy. Converting to Python objects is not allowed.
You can sort the values, then groupby:
a = np.sort(df.to_numpy(), axis=1)
df.groupby([a[:, 0], a[:, 1]], as_index=False, sort=False).first()
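Sorting each row makes a pair and its mirror identical, so both land in the same group, and .first() keeps the first occurrence in its original orientation.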
Option 2: If you have a lot of pairs, groupby can be slow. In that case, we can assign the sorted values as new columns and filter with drop_duplicates:
a = np.sort(df.to_numpy(), axis=1)
(df.assign(one=a[:, 0], two=a[:, 1])   # the names 'one' and 'two' are arbitrary
   .drop_duplicates(['one', 'two'])    # sorted values taken from above
   .reindex(df.columns, axis=1)
)
One way is to use np.unique with return_index=True and use the result to index the dataframe:
a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)
print(df.iloc[ix, :])
c1 c2
0 0 0
1 0 1
20 2 0
3 0 3
40 4 0
50 5 0
6 0 6
70 7 0
8 0 8
9 0 9
11 1 1
21 2 1
13 1 3
41 4 1
51 5 1
16 1 6
71 7 1
...
frozenset
mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
df[~mask]
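frozenset is unordered, so (0, 1) and (1, 0) hash to the same set and duplicated() flags the mirror row.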
I will do
df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
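This sorts each row so that a symmetric pair and its mirror become identical, then keeps only the rows whose sorted form has not been seen before.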
Using pandas crosstab and numpy triu:
s = pd.crosstab(df.c1, df.c2)
s = s.mask(np.triu(np.ones(s.shape)).astype(bool) & (s == 0)).stack().reset_index()
Here's one NumPy-based one for integers -
def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    # encode each sorted pair as a single integer id
    idx = np.ravel_multi_index(b.T, (b.max(0) + 1))
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    # mark the first occurrence of each id
    m = np.r_[True, p[:-1] != p[1:]]
    a_out = a[np.sort(sidx[m])]
    df_out = pd.DataFrame(a_out)
    return df_out
If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
For generic numbers (ints/floats, etc.), we will use a view-based one -
# https://stackoverflow.com/a/44999009/ #Divakar
def view1D(a):  # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.
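Spelled out, that substitution gives (a sketch of exactly what the answer describes, with only the idx line changed):
def remove_symm_pairs_generic(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    idx = view1D(b)                      # view-based ids instead of ravel_multi_index
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]
    return pd.DataFrame(a[np.sort(sidx[m])])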
If this needs to be fast, and if your variables are integer, then the following trick may help: let v, w be the columns of your vector; construct [x, y] := [v + w, |v - w|]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x + y)/2, (x - y)/2].
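A minimal sketch of that trick (my illustration, not the commenter's code); because we keep the original rows, the final back-mapping to [v, w] is not needed:
v = df['c1'].to_numpy()
w = df['c2'].to_numpy()
x, y = v + w, np.abs(v - w)              # symmetric pairs map to the same (x, y)
order = np.lexsort((y, x))               # lexicographic sort by x, then y
xy = np.column_stack((x, y))[order]
keep = np.r_[True, (xy[1:] != xy[:-1]).any(axis=1)]  # first of each duplicate run
result = df.iloc[np.sort(order[keep])]   # back to the original row order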
My dataframe:
A B C A_Q B_Q C_Q
27 40 41 2 1 etc
28 39 40 1 5
30 28 29 3 6
28 27 28 4 1
15 10 11 5 4
17 13 14 1 5
16 60 17 8 10
14 21 18 9 1
20 34 23 10 2
21 45 34 7 4
I want to iterate through each row in every column with a _Q suffix, starting with A_Q, and do the following:
1. if the row value is 1, grab the corresponding value in col 'A'
2. assign that value to a variable, call it x
3. keep looping down the col A_Q
4. if the row value is anything from 1 to 9, ignore it
5. if the value is 10, get the corresponding value in col 'A' and assign it to variable y
6. calculate the % change, call it chg, between y and x: ((y/x) - 1) * 100
7. append chg to the dataframe
8. keep going down the column with steps 1-7 above until the end
Then do the same for the other columns B_Q, C_Q etc.
So for example, in the above, the first "1" that appears corresponds to 28 in col A. So x = 28. Then keep iterating, ignoring values 1 through 9, until you get a 10, which corresponds to 20 in col A. Calculate % change = ((20/27)-1)*100 = -25.9% and append that to df in a newly created col A_S. Then resume from that point on with same steps until reach end of the file. And finally, do the same for the rest of the columns.
So then the df would look like:
A B C A_Q B_Q C_Q A_S B_S C_S etc
27 40 41 2 1 etc
28 39 40 1 5
30 28 29 3 6
28 27 28 4 1
15 10 11 5 4
17 13 14 1 5
16 60 17 8 10 50
14 21 18 9 1
20 34 23 10 2 -25.9
21 45 34 7 4
I thought to create a function and then do something like df['_S'] = df.apply(function, axis=1), but am stuck on the implementation of steps 1-8 above. Thanks!
Do you need to append the results as a new column? You're going to end up with nearly empty columns with just one data value. Could you just append all of the results at the bottom of the '_Q' columns? Anyway here's my stab at the function to do all you asked:
def func(col1, col2):
    l = []
    x = None
    for index in range(0, len(col1)):
        if x is None and col1[index] == 1:
            x = col2[index]
            l.append(0)
        elif x is not None and col1[index] == 10:
            y = col2[index]
            l.append(((float(y) / x) - 1) * 100)
            x = None
        else:
            l.append(0)
    return l
You'd then pass A_Q to this function as col1 and A as col2, and it should return what you want. To run it over all the columns, assuming that every A, B, C column has an associated _Q column, you could do something like:
q = [col for col in df.columns if '_Q' in col]
for col in q:
    df[col[:len(col) - 2] + '_S'] = func(df[col], df[col[:len(col) - 2]])
I have a data frame as below.
df = pd.DataFrame({'var1': list('a' * 3) + list('b' * 2) + list('c' * 4),
                   'var2': [i for i in range(9)],
                   'var3': [20, 40, 100, 10, 80, 12, 24, 53, 90]})
End result that I want is the following:
var1 var2 var3 var3_lt_50
0 a 0 20 60
1 a 1 40 60
2 a 2 100 60
3 b 3 10 10
4 b 4 80 10
5 c 5 12 36
6 c 6 24 36
7 c 7 53 36
8 c 8 90 36
I get this result in two steps, through a group-by and a merge, with the code below:
df = df.merge(df[df.var3 < 50][['var1', 'var3']]
                  .groupby('var1', as_index=False)
                  .sum()
                  .rename(columns={'var3': 'var3_lt_50'}),
              how='left', on='var1')
Can someone show me a way of doing this kind of boolean filter + per-group sum, broadcasting the group scalar back to all rows, without the groupby + merge steps I'm doing today? I want a smoother line of code.
Thanks in advance for input,
/Swepab
You can use groupby.transform, which keeps the shape of the transformed variable as well as the index, so you can just assign the result back to the data frame:
df['var3_lt_50'] = df.groupby('var1').var3.transform(lambda g: g[g < 50].sum())
df
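transform evaluates the lambda once per group and broadcasts the returned scalar back to every row of that group, so the result aligns with the original index and can be assigned directly, with no merge needed.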
I have a pandas.DataFrame with multiple columns and I would like to apply a curve_fit function to each of them. I would like the output to be a dataframe with the optimal values fitting the data in the columns (for now, I am not interested in their covariance).
The df has the following structure:
a b c
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 0 1
7 1 1 1
8 1 1 1
9 1 1 1
10 1 1 1
11 1 1 1
12 1 1 1
13 1 1 1
14 2 1 2
15 6 2 6
16 7 2 7
17 8 2 8
18 9 2 9
19 7 2 7
I have defined a function to fit to the data as so:
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, a, x0, k):
    y = a / (1 + np.exp(-k * (x - x0)))
    return y

def fitdata(dataseries):
    popt, pcov = curve_fit(sigmoid, dataseries.index, dataseries)
    return popt
I can apply the function and get an array in return:
result_a=fitdata(df['a'])
In []: result_a
Out[]: array([ 8.04197008, 14.48710063, 1.51668241])
If I try to df.apply the function I get the following error:
fittings=df.apply(fitdata)
ValueError: Shape of passed values is (3, 3), indices imply (3, 20)
Ultimately I would like the output to look like:
a b c
0 8.041970 2.366496 8.041970
1 14.487101 12.006009 14.487101
2 1.516682 0.282359 1.516682
Can this be done with something similar to apply?
Hope my solution works for you.
result = pd.DataFrame()
for i in df.columns:
    frames = [result, pd.DataFrame(fitdata(df[i]))]
    result = pd.concat(frames, axis=1)
result.columns = df.columns
a b c
0 8.041970 2.366496 8.041970
1 14.487101 12.006009 14.487101
2 1.516682 0.282359 1.516682
I think the issue is that the apply of your fitting function returns an array of dim 3x3 (the 3 fit parameters per column, as returned in conner's answer), but something in the shape of your df, 20x3, is expected.
So you have to re-apply your fit function with these parameters to get the fitted y-values.
def fitdata(dataseries):
    # fit the data
    fitParams, fitCovariances = curve_fit(sigmoid, dataseries.index, dataseries)
    # re-apply the sigmoid with the fitted coefficients, evaluated at the
    # index, so that we get fitted data in the shape of the df again
    y_fit = sigmoid(dataseries.index.values, *fitParams)
    return y_fit
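With this version, fittings = df.apply(fitdata) returns a frame shaped like df, holding the fitted y-values of each column.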
Have a look here for more examples
(this post is based on both previous answers and provides a complete example including an improvement in the dataframe construction of the fit parameters)
The following function fit_to_dataframe fits an arbitrary function to each column in your data and returns the fit parameters (covariance ignored here) in a convenient format:
def fit_to_dataframe(df, function, parameter_names):
    popts = {}
    pcovs = {}
    for c in df.columns:
        popts[c], pcovs[c] = curve_fit(function, df.index, df[c])
    fit_parameters = pd.DataFrame.from_dict(popts,
                                            orient='index',
                                            columns=parameter_names)
    return fit_parameters
fit_parameters = fit_to_dataframe(df, sigmoid, parameter_names=['a', 'x0', 'k'])
The fit parameters are available in the following form:
a x0 k
a 8.869996 11.714575 0.844969
b 2.366496 12.006009 0.282359
c 8.041970 14.487101 1.516682
In order to inspect the fit result, you can use the following function to plot the results:
def plot_fit_results(df, function, fit_parameters):
    NUM_POINTS = 50
    t = np.linspace(df.index.values.min(), df.index.values.max(), NUM_POINTS)
    df.plot(style='.')
    for idx, column in enumerate(df.columns):
        plt.plot(t,
                 function(t, *fit_parameters.loc[column]),
                 color='C{}'.format(idx))
    plt.show()
plot_fit_results(df, sigmoid, fit_parameters)
Result: (plot of the raw data points overlaid with the fitted sigmoid curves)
This answer is also available as an interactive jupyter notebook here.