I need to try several category encoders on different columns that contain the same set of values. Every value appears somewhere in the columns, but never more than once in the same row. For example, I could have:
import pandas as pd

dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
    'col2': ["b", "a", "c", "b", "a", "a"],
})
col0 col1 col2
0 a c b
1 b d a
2 a b c
3 c d b
4 b c a
5 d c a
For example, I could not have "a", "c", "c" in the first row, since a value never repeats within a row.
To encode the columns I'm using the Python library category_encoders. The problem is that I need to fit the encoder on one column and then apply the encoding to multiple columns.
For example given a df like this:
dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
})
col0 col1
0 a c
1 b d
2 a b
3 c d
4 b c
5 d c
What I'd like to have is:
col0 col1 a b c d
0 a c 1 0 1 0
1 b d 0 1 0 1
2 a b 1 1 0 0
3 c d 0 0 1 1
4 b c 0 1 1 0
5 d c 0 0 1 1
But with the category_encoders library I have to fit on given column(s) and apply the transform to those same column(s). Using category_encoders on a single column, this happens:
dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
})
import category_encoders as ce

encoder = ce.OneHotEncoder(cols=None, use_cat_names=True)  # one-hot encoding, to make the problem easy to see
encoder.fit(dft['col0'])
encoder.transform(dft['col0'])
Output:
col0_a col0_b col0_c col0_d
0 1 0 0 0
1 0 1 0 0
2 1 0 0 0
3 0 0 1 0
4 0 1 0 0
5 0 0 0 1
Then apply the transform to the other column:
encoder.transform(dft['col1'])
Output:
KeyError: 'col0'
If the fit is done on both columns (col0 and col1 drawing from the same set of values), the output is:
encoder.fit(dft[['col0','col1']])
encoder.transform(dft[['col0','col1']])
col0_a col0_b col0_c col0_d col1_c col1_d col1_b
0 1 0 0 0 1 0 0
1 0 1 0 0 0 1 0
2 1 0 0 0 0 0 1
3 0 0 1 0 0 1 0
4 0 1 0 0 1 0 0
5 0 0 0 1 1 0 0
The example above is just one way to encode my columns; my goal is to try different methods. Are there other libraries that can do this kind of encoding without applying the transform only to the fitted columns (and without writing every category encoding method from scratch)?
You can stack the dataframe to reshape it, then use str.get_dummies to create a dataframe of indicator variables for the stacked frame, and finally sum within level=0 of the index (groupby(level=0).sum(), since Series.sum(level=...) was removed in pandas 2.0):
enc = dft.stack().str.get_dummies().groupby(level=0).sum()
out = dft.join(enc)
>>> out
col0 col1 a b c d
0 a c 1 0 1 0
1 b d 0 1 0 1
2 a b 1 1 0 0
3 c d 0 0 1 1
4 b c 0 1 1 0
5 d c 0 0 1 1
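The same reshape written out end to end, using groupby(level=0).sum() (which also works on pandas 2.x, where Series.sum(level=...) was removed):

```python
import pandas as pd

dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
})

# stack      -> one long Series of values, indexed by (row, column)
# get_dummies -> one indicator column per distinct value
# groupby(level=0).sum() -> collapse back to one row per original row
enc = dft.stack().str.get_dummies().groupby(level=0).sum()
out = dft.join(enc)
```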
I would actually prefer encoding every single column with a separate encoder, and I believe the behavior you've described is intentional: you could have columns car color and phone color both being red, which would collapse into the same feature red=True regardless of whether it was the car or the phone. But if you really want to achieve this, you could do a simple post-processing step like this:
categories = ['a', 'b', 'c', 'd']
columns = ['col0_a', 'col0_b', 'col0_c', 'col0_d', 'col1_c', 'col1_d', 'col1_b']
for category in categories:
    sum_columns = []
    for col in columns:
        if col.endswith(f'_{category}'):
            sum_columns.append(col)
    df[category] = df[sum_columns].sum(axis=1).astype(bool).astype(int)
df = df.drop(columns, axis=1)
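A runnable sketch of this post-processing, starting from the two-column one-hot output shown in the question (the frame below just hard-codes those values):

```python
import pandas as pd

# one-hot output as produced by fitting both col0 and col1
df = pd.DataFrame({
    'col0_a': [1, 0, 1, 0, 0, 0], 'col0_b': [0, 1, 0, 0, 1, 0],
    'col0_c': [0, 0, 0, 1, 0, 0], 'col0_d': [0, 0, 0, 0, 0, 1],
    'col1_c': [1, 0, 0, 0, 1, 1], 'col1_d': [0, 1, 0, 1, 0, 0],
    'col1_b': [0, 0, 1, 0, 0, 0],
})

categories = ['a', 'b', 'c', 'd']
columns = list(df.columns)

for category in categories:
    # collect every per-column dummy that ends with this category name
    sum_columns = [col for col in columns if col.endswith(f'_{category}')]
    # any 1 among them becomes a single shared 0/1 indicator
    df[category] = df[sum_columns].sum(axis=1).astype(bool).astype(int)

df = df.drop(columns, axis=1)
```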
Related
I have a dataframe which contains many pre-defined column names. One column of this dataframe contains the names of those columns. I want to write the value 1 wherever the string value matches the column name.
For example, I have this current situation:
df = pd.DataFrame(0,index=[0,1,2,3],columns = ["string","a","b","c","d"])
df["string"] = ["b", "b", "c", "a"]
string a b c d
------------------------------
b 0 0 0 0
b 0 0 0 0
c 0 0 0 0
a 0 0 0 0
And this is what I would like the desired result to be like:
string a b c d
------------------------------
b 0 1 0 0
b 0 1 0 0
c 0 0 1 0
a 1 0 0 0
You can use get_dummies on df['string'] and update the DataFrame in place:
df.update(pd.get_dummies(df['string']))
updated df:
string a b c d
0 b 0 1 0 0
1 b 0 1 0 0
2 c 0 0 1 0
3 a 1 0 0 0
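One caveat: in pandas 2.x, get_dummies returns boolean columns by default, so casting to int keeps the 0/1 integers shown above (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame(0, index=[0, 1, 2, 3], columns=["string", "a", "b", "c", "d"])
df["string"] = ["b", "b", "c", "a"]

# get_dummies may emit booleans on newer pandas; cast so update writes 0/1 ints
df.update(pd.get_dummies(df["string"]).astype(int))
```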
You can also use this:
df.loc[df["column_name"] == "some_value", "column_name"] = "value"
In your case:
df.loc[df["string"] == "b", "b"] = 1
I have the following df:
a b c d e f
city-1 uc-1 1 1 0 1
city-2 uc-1 0 0 1 0
city-1 uc-2 1 1 1 1
city-2 uc-2 1 0 0 1
My code:
bnu_u = pd.pivot_table(
    df,
    index=["a", "b"],
    values=["b", "c", "d", "f"],
    aggfunc={"b": len, "c": np.sum, "d": np.sum, "e": np.sum, "f": np.sum},
)
bnu_u.iloc[:, 0] = bnu_u.iloc[:, 0].div(bnu_u.b, axis=0)
This gives me the error ValueError: Grouper for 'b' not 1-dimensional, although essentially the same kind of code runs perfectly elsewhere.
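The usual cause of this error is that "b" appears both in index and in values/aggfunc, so pandas tries to group by a column that is already a grouping key. A sketch of the same pivot without that clash (the data is reconstructed from the question, and the row count that "b": len was meant to provide is added separately):

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["city-1", "city-2", "city-1", "city-2"],
    "b": ["uc-1", "uc-1", "uc-2", "uc-2"],
    "c": [1, 0, 1, 1],
    "d": [1, 0, 1, 0],
    "e": [0, 1, 1, 0],
    "f": [1, 0, 1, 1],
})

# "b" is a grouping key, so it must not also appear in values/aggfunc
bnu_u = pd.pivot_table(
    df,
    index=["a", "b"],
    values=["c", "d", "e", "f"],
    aggfunc={"c": "sum", "d": "sum", "e": "sum", "f": "sum"},
)

# per-group row count, the quantity "b": len was trying to compute
bnu_u["n_rows"] = df.groupby(["a", "b"]).size()
```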
I have gene interaction data like the sample below. Any idea how to make an adjacency matrix from this data?
Example data:
cat_type = pd.CategoricalDtype(list("abcdef"))
df = pd.DataFrame(
{
"node1":["a", "b", "a", "c", "b", "f"],
"node2":["b", "d", "c", "e", "f", "e"],
}
).astype(cat_type)
df looks like this:
node1 node2
0 a b
1 b d
2 a c
3 c e
4 b f
5 f e
Solution
adj_mat = pd.crosstab(df["node1"], df["node2"], dropna=False)
results in a dataframe:
a b c d e f
a 0 1 1 0 0 0
b 0 0 0 1 0 1
c 0 0 0 0 1 0
d 0 0 0 0 0 0
e 0 0 0 0 0 0
f 0 0 0 0 1 0
If you need it symmetrical around the diagonal then the following will give you a boolean result
adj_mat.transpose() + adj_mat > 0
which you can then convert to integer with .astype(int) if required
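Putting the crosstab and the symmetrization together, as a runnable sketch of the steps above:

```python
import pandas as pd

cat_type = pd.CategoricalDtype(list("abcdef"))
df = pd.DataFrame(
    {
        "node1": ["a", "b", "a", "c", "b", "f"],
        "node2": ["b", "d", "c", "e", "f", "e"],
    }
).astype(cat_type)

# directed adjacency; dropna=False keeps categories absent from a column
adj_mat = pd.crosstab(df["node1"], df["node2"], dropna=False)

# undirected (symmetric) 0/1 adjacency
sym = (adj_mat.T + adj_mat > 0).astype(int)
```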
I have a DataFrame with both categorical and non-categorical data that I would like to dummy-encode, but not all of the dummy values I know are possible actually appear in the data.
For example let's use the following DataFrame:
>>> df = pd.DataFrame({"a": [1,2,3], "b": ["x", "y", "x"], "c": ["h", "h", "i"]})
>>> df
a b c
0 1 x h
1 2 y h
2 3 x i
Column a has non-categorical values, but both column b and column c are categorical.
Now let's say column b can contain the categories x, y and z, and column c the categories h, i, j and k:
>>> dummy_map = {"b": ["x", "y", "z"], "c": ["h", "i", "j", "k"]}
I want to encode it so that the resulting dataframe is as follows:
>>> df_encoded
a b_x b_y b_z c_h c_i c_j c_k
0 1 1 0 0 1 0 0 0
1 2 0 1 0 1 0 0 0
2 3 1 0 0 0 1 0 0
My current solution is as follows:
df_encoded = pd.get_dummies(df)
for k, v in dummy_map.items():
    for cat in v:
        name = k + "_" + cat
        if name not in df_encoded:
            df_encoded[name] = 0
But it seems to me a bit inefficient and inelegant.
So is there a better solution for this?
Use Index.union with the expected dummy-column names generated by a list comprehension with f-strings, then DataFrame.reindex:
c = [f'{k}_{x}' for k, v in dummy_map.items() for x in v]
print (c)
['b_x', 'b_y', 'b_z', 'c_h', 'c_i', 'c_j', 'c_k']
df_encoded = pd.get_dummies(df)
vals = df_encoded.columns.union(c, sort=False)
df_encoded = df_encoded.reindex(vals, axis=1, fill_value=0)
print (df_encoded)
a b_x b_y c_h c_i b_z c_j c_k
0 1 1 0 1 0 0 0 0
1 2 0 1 1 0 0 0 0
2 3 1 0 0 1 0 0 0
If values should be sorted in union:
df_encoded = pd.get_dummies(df)
vals = df_encoded.columns.union(c)
df_encoded = df_encoded.reindex(vals, axis=1, fill_value=0)
print (df_encoded)
a b_x b_y b_z c_h c_i c_j c_k
0 1 1 0 0 1 0 0 0
1 2 0 1 0 1 0 0 0
2 3 1 0 0 0 1 0 0
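A different approach worth noting (not from the answer above): casting the columns to CategoricalDtype with the full category lists first makes get_dummies emit every known category directly, so no union/reindex step is needed:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"], "c": ["h", "h", "i"]})
dummy_map = {"b": ["x", "y", "z"], "c": ["h", "i", "j", "k"]}

# declare the full category set per column so unseen levels still get a dummy
for col, cats in dummy_map.items():
    df[col] = df[col].astype(pd.CategoricalDtype(cats))

df_encoded = pd.get_dummies(df)
```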
Given the following dataframe:
df = pd.DataFrame({"values": ["a", "a", "a", "b", "b", "a", "a", "c"]})
How could I generate the given output:
values out
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 0
6 a 1
7 c 0
If it allows easier options, I can ensure uniqueness over groups, e.g. having input values like:
df = pd.DataFrame({"values": ["a0", "a0", "a0", "b0", "b0", "a1", "a1", "c0"]})
Use shift and cumsum to create a run key, then use per-group category codes:
df['strkey']=(df['values']!=df['values'].shift()).ne(0).cumsum()
df['values']+=df.groupby('values')['strkey'].apply(lambda x : x.astype('category').cat.codes.astype(str))
df
Out[568]:
values strkey
0 a0 1
1 a0 1
2 a0 1
3 b0 2
4 b0 2
5 a1 3
6 a1 3
7 c0 4
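If the goal is the plain within-run counter from the first expected output (the out column), a consecutive-run key plus groupby(...).cumcount() is a more direct sketch of the same shift/cumsum idea:

```python
import pandas as pd

df = pd.DataFrame({"values": ["a", "a", "a", "b", "b", "a", "a", "c"]})

# a new run starts whenever the value differs from the previous row
run_id = df["values"].ne(df["values"].shift()).cumsum()

# position within each consecutive run
df["out"] = df.groupby(run_id).cumcount()
```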