I have some sample data:

import pandas as pd

df = pd.DataFrame(columns=['X1', 'X2', 'X3'], data=[
[1,16,9],
[4,36,16],
[1,16,9],
[2,9,8],
[3,36,15],
[2,49,16],
[4,25,14],
[5,36,17]])
I want to create two complementary columns in my df based on X2 and X3 and include them in the pipeline.
I am trying the following code:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def feat_comp(x):
    x1 = 100 - x
    return x1

pipe_text = Pipeline([('col_test', FunctionTransformer(feat_comp, 'X2', validate=False))])
X = pipe_text.fit_transform(df)
It gives me an error:
TypeError: 'str' object is not callable
How can I apply the function transformer on selected columns and how can I use them in the pipeline?
If I understand you correctly, you want to add a new column based on a given column, e.g. X2. You need to pass this column as an additional argument to the function using kw_args:
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
df = pd.DataFrame(columns=['X1', 'X2', 'X3'], data=[
[1,16,9],
[4,36,16],
[1,16,9],
[2,9,8],
[3,36,15],
[2,49,16],
[4,25,14],
[5,36,17]])
def feat_comp(x, column):
    x[f'100-{column}'] = 100 - x[column]
    return x
pipe_text = Pipeline([('col_test', FunctionTransformer(feat_comp, validate=False, kw_args={'column': 'X2'}))])
pipe_text.fit_transform(df)
Result:
X1 X2 X3 100-X2
0 1 16 9 84
1 4 36 16 64
2 1 16 9 84
3 2 9 8 91
4 3 36 15 64
5 2 49 16 51
6 4 25 14 75
7 5 36 17 64
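Since the question asks for complementary columns based on both X2 and X3, you can chain two such steps in one pipeline (a sketch reusing the feat_comp above; the step names are illustrative):

pipe_both = Pipeline([
    ('comp_x2', FunctionTransformer(feat_comp, validate=False, kw_args={'column': 'X2'})),
    ('comp_x3', FunctionTransformer(feat_comp, validate=False, kw_args={'column': 'X3'}))
])
X = pipe_both.fit_transform(df)  # adds both 100-X2 and 100-X3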
(In your example, FunctionTransformer(feat_comp, 'X2', validate=False), the string 'X2' is passed as the second positional argument, which is inverse_func; a string is not callable, hence the error.)
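You can see this by inspecting the attribute (a sketch using the question's original call):

ft = FunctionTransformer(feat_comp, 'X2', validate=False)
print(ft.inverse_func)  # 'X2' -- the string landed on inverse_func

# With check_inverse=True (the default in scikit-learn >= 0.20), fit tries to
# verify that func and inverse_func round-trip, which calls 'X2'(...) and
# raises TypeError: 'str' object is not callable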
I have two dataframes as such:

import numpy as np
import pandas as pd

df_pos = pd.DataFrame(
    data=[[5, 4, 3, 6, 0, 7, 1, 2], [2, 5, 3, 6, 4, 7, 1, 0]]
)
df_value = pd.DataFrame(
    data=[np.arange(10 + i, 50 + i, 5) for i in range(0, 2)]
)
and I want to have a new dataframe df_final where df_pos notates the position and df_value the corresponding value.
I can do it like this:
df_value_copy = df_value.copy()
for i in range(len(df_pos)):
    df_value_copy.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values
df_final = df_value_copy
However, I have very large dataframes, and this would be way too slow. Therefore I want to see whether there is a smarter way to do it.
We can also try np.put_along_axis to place df_value into df_final based on df_pos:
df_final = df_value.copy()
np.put_along_axis(
    arr=df_final.values,      # destination array
    indices=df_pos.values,    # indices
    values=df_value.values,   # source values
    axis=1                    # along axis
)
The arguments do not need to be keyword arguments; they can be positional:
df_final = df_value.copy()
np.put_along_axis(df_final.values, df_pos.values, df_value.values, 1)
df_final:
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
You can try setting values with numpy advanced indexing:
df_final = df_value.copy()
df_final.values[np.arange(len(df_pos))[:,None], df_pos.values] = df_value.values
df_final
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
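Both vectorized versions should agree with the loop from the question; a quick sanity-check sketch on random data (sizes are arbitrary, and df_pos rows are assumed to be permutations):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, m = 1000, 8
df_pos = pd.DataFrame([rng.permutation(m) for _ in range(n)])
df_value = pd.DataFrame(rng.integers(0, 100, size=(n, m)))

# loop version from the question
expected = df_value.copy()
for i in range(len(df_pos)):
    expected.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values

# vectorized version (note: both answers rely on .values being a writable
# view, which holds for a single-dtype frame without copy-on-write)
result = df_value.copy()
result.values[np.arange(n)[:, None], df_pos.values] = df_value.values

assert result.equals(expected)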
I'm trying to do SMOTE oversampling from imblearn. This is my code:
from imblearn.over_sampling import SMOTE

X = data[['a','b','c']]
y = data['targets']
oversampler = SMOTE(random_state=42)
X_over, y_over = oversampler.fit_resample(X, y)
And the last line, X_over, y_over = oversampler.fit_resample(X, y), raises the error: setting an array element with a sequence.
I am sure the reason is the shape of my X.
X is a dataframe where each row of column 'a' is a list of length 118, each row of column 'b' a list of length 15 and column 'c' is an integer column.
For example:

a (list of length 118)    b (list of length 15)    c
[1, 2, 3, 4, ..., 0]      [4, 7, 8, 9, ..., 0]     3
Now, how do I convert this dataframe X into an array of shape (n_samples, n_features), as required by the documentation?
Could someone please help me transform the input dataframe to get rid of this error?
You can expand the list columns into one column per element; check first that the lengths within each column are consistent:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
data = pd.DataFrame({'targets': np.random.binomial(1, 0.15, 100),
                     'a': np.random.randint(0, 10, (100, 2)).tolist(),
                     'b': np.random.randint(11, 20, (100, 3)).tolist(),
                     'c': np.random.randint(0, 100, 100)
                     })
data['a'].apply(len).value_counts()
2 100
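A programmatic version of that check (a sketch) that fails fast if any list column is ragged:

for col in ['a', 'b']:
    # every list in the column must have one unique length
    assert data[col].apply(len).nunique() == 1, f'column {col} has ragged lists'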
A function to expand the columns; new columns will be named e.g. a0..aN, and the original list columns are dropped:
def expand_cols(da, col_list):
    for C in col_list:
        ix = [C + str(i) for i in range(len(da[C][0]))]
        da[ix] = pd.DataFrame(da[C].tolist(), columns=ix)  # note: da, not the global data
    da = da.drop(col_list, axis=1)
    return da
Your code, expanding the list columns when we fit:
X = data[['a','b','c']]
y = data['targets']
oversampler = SMOTE(random_state=42)
X_over, y_over = oversampler.fit_resample(expand_cols(X,['a','b']),y)
Looks like this:
X_over.head()
c a0 a1 b0 b1 b2
0 67 4 0 19 15 16
1 12 3 7 12 17 19
2 41 8 9 15 18 18
3 35 8 0 11 13 11
4 46 0 5 12 12 12
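If you need the original list-column layout back after resampling, a sketch of the reverse operation (collapse_cols is a hypothetical helper, assuming the a0..aN names produced by expand_cols):

def collapse_cols(da, col_list):
    for C in col_list:
        # find the expanded columns, e.g. a0, a1, ...
        ix = [col for col in da.columns if col.startswith(C) and col[len(C):].isdigit()]
        da[C] = da[ix].values.tolist()  # recombine each row into a list
        da = da.drop(ix, axis=1)
    return da

X_restored = collapse_cols(X_over.copy(), ['a', 'b'])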
I have a dataset containing 3000 columns.
Every column name looks like 'abc_dummy0', 'dfg_dummy0', 'asd_dummy0', and each group is 130 columns long before it moves on to 'dfg_dummy1', and so on until 'lkj_dummy39'.
I can use
cols = [col for col in df.columns if 'dummy1' in col]
And it lists all 130 columns whose names contain 'dummy1'.
My question is, how can I create smaller chunks of data, each containing 'dummy0', 'dummy1', 'dummy2', and so on, all the way to 'dummy39', without doing this 40 times?
I think it has something to do with
f'dummy{i}' for i in range(40)
But I am not quite sure how to approach this, in a memory efficient and code efficient way
(because I would just write 40 lines of code for each respective 'dummy' group)
Here's what I can do, one at the time:
group_0 = [col for col in df.columns if 'dummy0' in col]
group_0 = df[group_0]
but how do I do this for all other 39 groups (both the "_i" at the end of the name and the dummy{i} part) ?
Try with groupby on axis=1 and create a dict entry for each group:
dfs = {group_name: frame
       for group_name, frame in
       df.groupby(df.columns.str.split('_').str[-1], axis=1)}
dfs:
{'dummy0': a_dummy0 b_dummy0 c_dummy0
0 79 62 17
1 28 31 81,
'dummy1': d_dummy1 e_dummy1 f_dummy1
0 74 9 63
1 8 77 16}
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (2, 6)),
                  columns=[f'{chr(3 * i + j + 97)}_dummy{i}'
                           for i in range(2)
                           for j in range(3)])
df:
a_dummy0 b_dummy0 c_dummy0 d_dummy1 e_dummy1 f_dummy1
0 79 62 17 74 9 63
1 28 31 81 8 77 16
groups based on values after the _:
df.columns.str.split('_').str[-1]
Index(['dummy0', 'dummy0', 'dummy0', 'dummy1', 'dummy1', 'dummy1'], dtype='object')
Then each group can be accessed like:
dfs['dummy0']
a_dummy0 b_dummy0 c_dummy0
0 79 62 17
1 28 31 81
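Note: groupby(..., axis=1) is deprecated in recent pandas; a sketch of the same grouping done on the transposed frame instead:

dfs = {group_name: frame.T
       for group_name, frame in
       df.T.groupby(df.columns.str.split('_').str[-1])}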
Alternatively with a loop + filter:
for i in range(0, 2):
    print(f'dummy{i}')
    curr_df = df.filter(like=f'_dummy{i}')
    # Do something with curr_df
    print(curr_df)
Output:
dummy0
a_dummy0 b_dummy0 c_dummy0
0 79 62 17
1 28 31 81
dummy1
d_dummy1 e_dummy1 f_dummy1
0 74 9 63
1 8 77 16
You can also build the column list for each group with a comprehension; matching on the suffix avoids 'dummy1' also matching 'dummy10':

groups = [[col for col in df.columns if col.endswith(f'dummy{i}')] for i in range(40)]
I have a Pandas DataFrame with the following hypothetical data:
ID Time X-coord Y-coord
0 1 5 68 5
1 2 8 72 78
2 3 1 15 23
3 4 4 81 59
4 5 9 78 99
5 6 12 55 12
6 7 5 85 14
7 8 7 58 17
8 9 13 91 47
9 10 10 29 87
For each row (or ID), I want to find the ID with the closest proximity in time and space (X & Y) within this dataframe. Bonus: Time should have priority over XY.
Ideally, in the end I would like to have a new column called "Closest_ID" containing the most proximal ID within the dataframe.
I'm having trouble coming up with a function for this.
I would really appreciate any help or hint that points me in the right direction!
Thanks a lot!
Let's denote df as our dataframe. Then you can do something like:
import numpy as np
from sklearn.metrics import pairwise_distances

space_vals = df[['X-coord', 'Y-coord']]
time_vals = df[['Time']]  # 2-D selection so pairwise_distances accepts it
space_distance = pairwise_distances(space_vals)
time_distance = pairwise_distances(time_vals)
space_distance[space_distance == 0] = 1e9  # arbitrary large number
time_distance[time_distance == 0] = 1e9    # again
closest_space_id = np.argmin(space_distance, axis=0)
closest_time_id = np.argmin(time_distance, axis=0)
Then, you can store the last 2 results in 2 columns, or somehow decide which one is closer.
Note: this code hasn't been checked, and it might have a few bugs...
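For the bonus (time has priority) and a single Closest_ID column, one sketch is to combine the two matrices with a large weight on the time distance (the weight 1000 is an arbitrary choice):

import numpy as np
from sklearn.metrics import pairwise_distances

time_distance = pairwise_distances(df[['Time']])
space_distance = pairwise_distances(df[['X-coord', 'Y-coord']])

combined = 1000 * time_distance + space_distance
np.fill_diagonal(combined, np.inf)  # a row should not pick itself
df['Closest_ID'] = df['ID'].to_numpy()[np.argmin(combined, axis=1)]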
If I generate two columns of data per iteration of a for-loop and want to save them to a CSV file, how can the two columns from each subsequent iteration be stacked side by side in the same file (no overwriting)? I have looked at DataFrame.to_csv(mode='a'), but it only appends vertically (by rows). I have also looked into pd.concat, but I don't know how to use it in a for loop for more than two dataframes. Do you have some sample code or ideas to share?
import numpy as np, pandas as pd

for i in range(0, 4):
    x = pd.DataFrame(np.arange(5).reshape((5, 1)))
    y = pd.DataFrame(np.arange(5).reshape((5, 1)))
    data = np.hstack([x, y])  # stack the two columns side by side
    df = pd.DataFrame(data, columns=['X', 'Y'])
A file is a one-dimensional object that only grows in length; rows are separated only by a \n character, so it is impossible to add columns without rewriting the file.
You can load the file into memory, concatenate using a dataframe, and then write it back to (some other) file. Here:
import numpy as np, pandas as pd
a = pd.DataFrame(np.arange(10).reshape((5,2)))
b = pd.DataFrame(np.arange(20).reshape((5,4)))
pd.concat([a,b],axis=1)
Is that what you want?
In [84]: %paste
df = pd.DataFrame(np.arange(10).reshape((5,2)))
for i in range(0, 4):
    new = pd.DataFrame(np.random.randint(0, 100, (5,2)))
    df = pd.concat([df, new], axis=1)
## -- End pasted text --
In [85]: df
Out[85]:
0 1 0 1 0 1 0 1 0 1
0 0 1 50 82 24 53 84 65 59 48
1 2 3 26 37 83 28 86 59 38 33
2 4 5 12 25 19 39 1 36 26 9
3 6 7 35 17 46 27 53 5 97 52
4 8 9 45 17 3 85 55 7 94 97
An alternative:
def iter_stack(n, shape):
    # build n random blocks and stack them side by side
    frames = [pd.DataFrame(np.random.choice(range(10), shape)).T
              for _ in range(n)]
    return pd.concat(frames).T  # df.append was removed in pandas 2.0

iter_stack(5, (5, 2))
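And if the end goal is the CSV itself, collect the per-iteration frames and write once at the end; a sketch with hypothetical column names and filename:

import numpy as np
import pandas as pd

frames = []
for i in range(4):
    # two columns per iteration, named so they stay unique after concatenation
    xy = pd.DataFrame(np.random.randint(0, 100, (5, 2)),
                      columns=[f'X{i}', f'Y{i}'])
    frames.append(xy)

pd.concat(frames, axis=1).to_csv('stacked.csv', index=False)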