get_dummies of dataframe with incomplete data using pandas - python

I have a DataFrame with both categorical and non-categorical data that I would like to dummy-encode, but not all of the category values I know are possible actually appear in the data.
For example let's use the following DataFrame:
>>> df = pd.DataFrame({"a": [1,2,3], "b": ["x", "y", "x"], "c": ["h", "h", "i"]})
>>> df
   a  b  c
0  1  x  h
1  2  y  h
2  3  x  i
Column a has non-categorical values, but both columns b and c are categorical.
Now let's say column b can contain the categories x, y and z, and column c the categories h, i, j and k:
>>> dummy_map = {"b": ["x", "y", "z"], "c": ["h", "i", "j", "k"]}
I want to encode it so that the resulting dataframe is as follows:
>>> df_encoded
   a  b_x  b_y  b_z  c_h  c_i  c_j  c_k
0  1    1    0    0    1    0    0    0
1  2    0    1    0    1    0    0    0
2  3    1    0    0    0    1    0    0
My current solution is as follows:
df_encoded = pd.get_dummies(df)
for k, v in dummy_map.items():
    for cat in v:
        name = k + "_" + cat
        if name not in df_encoded:
            df_encoded[name] = 0
But it seems a bit inefficient and inelegant to me.
So is there a better solution for this?

Use Index.union with values generated by a list comprehension with f-strings, and then DataFrame.reindex:
c = [f'{k}_{x}' for k, v in dummy_map.items() for x in v]
print (c)
['b_x', 'b_y', 'b_z', 'c_h', 'c_i', 'c_j', 'c_k']
df_encoded = pd.get_dummies(df)
vals = df_encoded.columns.union(c, sort=False)
df_encoded = df_encoded.reindex(vals, axis=1, fill_value=0)
print (df_encoded)
   a  b_x  b_y  c_h  c_i  b_z  c_j  c_k
0  1    1    0    1    0    0    0    0
1  2    0    1    1    0    0    0    0
2  3    1    0    0    1    0    0    0
If the values should be sorted in the union:
df_encoded = pd.get_dummies(df)
vals = df_encoded.columns.union(c)
df_encoded = df_encoded.reindex(vals, axis=1, fill_value=0)
print (df_encoded)
   a  b_x  b_y  b_z  c_h  c_i  c_j  c_k
0  1    1    0    0    1    0    0    0
1  2    0    1    0    1    0    0    0
2  3    1    0    0    0    1    0    0
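Another option, on reasonably recent pandas, is to declare the full category sets up front with CategoricalDtype: get_dummies then emits a column for every known category, whether or not it appears in the data. A minimal sketch, assuming the df and dummy_map from the question:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"], "c": ["h", "h", "i"]})
dummy_map = {"b": ["x", "y", "z"], "c": ["h", "i", "j", "k"]}

# Cast each column to a categorical dtype listing every possible category;
# get_dummies then creates indicator columns for the unused categories too.
for col, cats in dummy_map.items():
    df[col] = df[col].astype(pd.CategoricalDtype(categories=cats))

df_encoded = pd.get_dummies(df)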

Related

Write value to column if row string contains column name

I have a dataframe which contains many pre-defined column names. One column of this dataframe contains the names of these columns.
I want to write the value 1 where the string name is equal to the column name.
For example, I have this current situation:
df = pd.DataFrame(0, index=[0,1,2,3], columns=["string","a","b","c","d"])
df["string"] = ["b", "b", "c", "a"]
  string  a  b  c  d
0      b  0  0  0  0
1      b  0  0  0  0
2      c  0  0  0  0
3      a  0  0  0  0
And this is what I would like the result to look like:
  string  a  b  c  d
0      b  0  1  0  0
1      b  0  1  0  0
2      c  0  0  1  0
3      a  1  0  0  0
You can use get_dummies on df['string'] and update the DataFrame in place:
df.update(pd.get_dummies(df['string']))
updated df:
  string  a  b  c  d
0      b  0  1  0  0
1      b  0  1  0  0
2      c  0  0  1  0
3      a  1  0  0  0
You can also use this pattern:
df.loc[df["column_name"] == "some_value", "column_name"] = "value"
In your case:
df.loc[df["string"] == "b", "b"] = 1

How do I make an adjacency matrix with the given data

I have gene interaction data like the example below. Any idea how to make an adjacency matrix from this data?
Example data:
cat_type = pd.CategoricalDtype(list("abcdef"))
df = pd.DataFrame(
    {
        "node1": ["a", "b", "a", "c", "b", "f"],
        "node2": ["b", "d", "c", "e", "f", "e"],
    }
).astype(cat_type)
df looks like this:
  node1 node2
0     a     b
1     b     d
2     a     c
3     c     e
4     b     f
5     f     e
Solution
adj_mat = pd.crosstab(df["node1"], df["node2"], dropna=False)
results in a dataframe:
node2  a  b  c  d  e  f
node1
a      0  1  1  0  0  0
b      0  0  0  1  0  1
c      0  0  0  0  1  0
d      0  0  0  0  0  0
e      0  0  0  0  0  0
f      0  0  0  0  1  0
If you need it symmetrical around the diagonal, the following will give you a boolean result:
adj_mat.transpose() + adj_mat > 0
which you can then convert to integers with .astype(int) if required.
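For instance, to get a symmetric integer matrix in one step (a small usage sketch of the expression above):
# Combine both directions and cast the boolean result back to integers.
sym = ((adj_mat + adj_mat.T) > 0).astype(int)
print(sym)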

python category encoders on multiple columns

I need to test several category encoders on different columns containing the same values. All the values appear in the columns, but not in the same row. For example, I could have:
dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
    'col2': ["b", "a", "c", "b", "a", "a"],
})
  col0 col1 col2
0    a    c    b
1    b    d    a
2    a    b    c
3    c    d    b
4    b    c    a
5    d    c    a
For instance, I could not have "a", "c", "c" in the first row.
To encode the columns I'm using the Python library category_encoders. The problem is that I need to fit the encoder with one column and then apply the encoding to multiple columns.
For example given a df like this:
dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"]})
  col0 col1
0    a    c
1    b    d
2    a    b
3    c    d
4    b    c
5    d    c
What I'd like to have is:
  col0 col1  a  b  c  d
0    a    c  1  0  1  0
1    b    d  0  1  0  1
2    a    b  1  1  0  0
3    c    d  0  0  1  1
4    b    c  0  1  1  0
5    d    c  0  0  1  1
But using the category_encoders library, I have to fit column(s) and apply the transform to those same column(s).
Using category encoders on a single column, this happens:
import category_encoders as ce

dft = pd.DataFrame({
    'col0': ["a", "b", "a", "c", "b", "d"],
    'col1': ["c", "d", "b", "d", "c", "c"],
})
encoder = ce.OneHotEncoder(cols=None, use_cat_names=True)  # encoding example, to visualize the problem better
encoder.fit(dft['col0'])
encoder.transform(dft['col0'])
Output:
   col0_a  col0_b  col0_c  col0_d
0       1       0       0       0
1       0       1       0       0
2       1       0       0       0
3       0       0       1       0
4       0       1       0       0
5       0       0       0       1
Then apply transformation to the other column:
encoder.transform(dft['col1'])
Output:
KeyError: 'col0'
If the fit is done on both columns (since col0 and col1 contain the same unique values), the output is:
encoder.fit(dft[['col0','col1']])
encoder.transform(dft[['col0','col1']])
   col0_a  col0_b  col0_c  col0_d  col1_c  col1_d  col1_b
0       1       0       0       0       1       0       0
1       0       1       0       0       0       1       0
2       1       0       0       0       0       0       1
3       0       0       1       0       0       1       0
4       0       1       0       0       1       0       0
5       0       0       0       1       1       0       0
The example above is just one method of encoding my columns; my goal is to try different methods. Are there other libraries that can do this encoding without applying the transform method only to the fitted columns (i.e. without writing every category encoding method from scratch)?
You can stack the dataframe to reshape it, then use str.get_dummies to create a dataframe of indicator variables for the stacked frame, and finally group by level=0 and sum (the older .sum(level=0) shortcut was removed in pandas 2.0):
enc = dft.stack().str.get_dummies().groupby(level=0).sum()
out = dft.join(enc)
>>> out
  col0 col1  a  b  c  d
0    a    c  1  0  1  0
1    b    d  0  1  0  1
2    a    b  1  1  0  0
3    c    d  0  0  1  1
4    b    c  0  1  1  0
5    d    c  0  0  1  1
I would actually prefer encoding every column with a separate encoder, and I believe the behavior you've described is intentional. You could have columns car color and phone color both being red, resulting in the same feature red=True regardless of whether it was the car or the phone that was red. But if you really want to achieve this, you could do some simple post-processing like this:
categories = ['a', 'b', 'c', 'd']
columns = ['col0_a', 'col0_b', 'col0_c', 'col0_d', 'col1_c', 'col1_d', 'col1_b']
for category in categories:
    sum_columns = []
    for col in columns:
        if col.endswith(f'_{category}'):
            sum_columns.append(col)
    df[category] = df[sum_columns].sum(axis=1).astype(bool).astype(int)
df = df.drop(columns, axis=1)
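A more compact variant of the same post-processing (a sketch, assuming the encoder and dft from the example above; the frame name encoded is hypothetical):
# Collapse the per-column indicators into one indicator per category by
# matching the suffix after the underscore.
encoded = encoder.fit_transform(dft[['col0', 'col1']])
for category in ['a', 'b', 'c', 'd']:
    cols = [c for c in encoded.columns if c.endswith(f'_{category}')]
    encoded[category] = encoded[cols].any(axis=1).astype(int)
encoded = encoded.drop(columns=[c for c in encoded.columns if '_' in c])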

Loading columns into multiple DataFrames based on prefix

I want to load columns with specific prefixes into separate DataFrames.
The columns I want have specific prefixes i.e.
A_1  A_2  B_1  B_2  C_1  C_2
  1    0    0    0    0    0
  1    0    0    1    1    1
  0    1    1    1    1    0
I have a list of all the prefixes:
prefixes = ["A", "B", "C"]
I want to do something like this:
for prefix in prefixes:
    f"df_{prefix}" = pd.read_csv("my_file.csv",
                                 usecols=[f"{prefix}_1",
                                          f"{prefix}_2",
                                          f"{prefix}_3"])
So each DataFrame has the prefix in the name, but I'm not quite sure of the best way to do this or the syntax required.
You could try a different approach: load the full csv once, then create three dfs out of it by dropping the columns that don't match each prefix.
x = pd.read_csv("my_file.csv")
notA = [c for c in x.columns if 'A' not in c]
notB = [c for c in x.columns if 'B' not in c]
notC = [c for c in x.columns if 'C' not in c]
a = x.drop(columns=notA)
b = x.drop(columns=notB)
c = x.drop(columns=notC)
Considering you have a big dataframe like this:
In [1341]: df
Out[1341]:
   A_1  A_2  B_1  B_2  C_1  C_2
0    1    0    0    0    0    0
1    1    0    0    1    1    1
2    0    1    1    1    1    0
Have a master list of prefixes:
In [1374]: master_list = ['A','B','C']
Create an empty dictionary to hold multiple subsets of dataframe:
In [1377]: dct = {}
Loop through the master list and store the column names in the above dict:
In [1378]: for i in master_list:
      ...:     dct['{}_list'.format(i)] = [e for e in df.columns if e.startswith('{}'.format(i))]
Now dct holds the following keys and values:
A_list : ['A_1', 'A_2']
B_list : ['B_1', 'B_2']
C_list : ['C_1', 'C_2']
Then, subset your dataframes like below:
In [1381]: for k in dct:
      ...:     dct[k] = df[dct[k]]
Now the dictionary holds the actual subset of the dataframe against every key:
In [1384]: dct['A_list']
Out[1384]:
   A_1  A_2
0    1    0
1    1    0
2    0    1

In [1385]: dct['B_list']
Out[1385]:
   B_1  B_2
0    0    0
1    0    1
2    1    1

In [1386]: dct['C_list']
Out[1386]:
   C_1  C_2
0    0    0
1    1    1
2    1    0
First filter out the non-matching columns with startswith, using boolean indexing with DataFrame.loc (because we are filtering columns):
print (df)
   A_1  A_2  B_1  B_2  C_1  D_2
0    1    0    0    0    0    0
1    1    0    0    1    1    1
2    0    1    1    1    1    0
prefixes = ["A", "B", "C"]
df = df.loc[:, df.columns.str.startswith(tuple(prefixes))]
print (df)
   A_1  A_2  B_1  B_2  C_1
0    1    0    0    0    0
1    1    0    0    1    1
2    0    1    1    1    1
Then create a MultiIndex by splitting the column names, and build a dictionary of DataFrames with groupby:
df.columns = df.columns.str.split('_', expand=True)
print (df)
   A     B     C
   1  2  1  2  1
0  1  0  0  0  0
1  1  0  0  1  1
2  0  1  1  1  1
d = {k: v[k] for k, v in df.groupby(level=0, axis=1)}
print (d['A'])
   1  2
0  1  0
1  1  0
2  0  1
Or use a lambda function with split on the original (unsplit) columns:
d = {k: v for k, v in df.groupby(lambda x: x.split('_')[0], axis=1)}
print (d['A'])
   A_1  A_2
0    1    0
1    1    0
2    0    1
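If you'd rather not load the unneeded columns at all, read_csv also accepts a callable for usecols, so each prefix can be loaded straight into its own frame. A minimal sketch, assuming the my_file.csv and prefixes from the question:
import pandas as pd

prefixes = ["A", "B", "C"]

# usecols may be a callable that receives each column name and returns
# True for the columns that should be kept.
dfs = {
    prefix: pd.read_csv("my_file.csv",
                        usecols=lambda c, p=prefix: c.startswith(f"{p}_"))
    for prefix in prefixes
}

print(dfs["A"])  # only the A_* columns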

Changing values in multiple columns of a pandas DataFrame using known column values

Suppose I have a dataframe like this:
Knownvalue  A  B  C  D  E  F  G  H
   17.3413  0  0  0  0  0  0  0  0
   33.4534  0  0  0  0  0  0  0  0
What I want to do is: when Knownvalue is between 0 and 10, A is changed from 0 to 1; when Knownvalue is between 10 and 20, B is changed from 0 to 1; and so on.
It should be like this after changing:
Knownvalue  A  B  C  D  E  F  G  H
   17.3413  0  1  0  0  0  0  0  0
   33.4534  0  0  0  1  0  0  0  0
Anyone know how to apply a method to change it?
I first bucket the Knownvalue Series into a list of integers equal to its truncated value divided by ten (e.g. 27.87 // 10 = 2). These buckets give the integer of the desired column location. Because the Knownvalue is in the first column, I add one to these values.
Next, I enumerate through these bin values, which effectively gives me tuple pairs of row and column integer indices. I use iat to set the value at these locations equal to 1.
import pandas as pd
import numpy as np
# Create some sample data.
df_vals = pd.DataFrame({'Knownvalue': np.random.random(5) * 50})
df = pd.concat([df_vals, pd.DataFrame(np.zeros((5, 5)), columns=list('ABCDE'))], axis=1)
# Create desired column locations based on the `Knownvalue`.
bins = (df.Knownvalue // 10).astype('int').tolist()
>>> bins
[4, 3, 0, 1, 0]
# Set these locations equal to 1.
for idx, col in enumerate(bins):
    df.iat[idx, col + 1] = 1  # The first column is the `Knownvalue`, hence col + 1
>>> df
   Knownvalue  A  B  C  D  E
0   47.353937  0  0  0  0  1
1   37.460338  0  0  0  1  0
2    3.797964  1  0  0  0  0
3   18.323131  0  1  0  0  0
4    7.927030  1  0  0  0  0
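If you prefer a vectorized version of the same idea, indexing an identity matrix with the bin values produces the one-hot rows directly (a sketch, assuming the df and bins built above):
import numpy as np

# Row i of np.eye(5) has a 1 in column i, so indexing by the bin values
# yields the desired indicator rows in one shot.
df[list('ABCDE')] = np.eye(5, dtype=int)[bins]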
A different approach would be to reconstruct the frame from the Knownvalue column using get_dummies (reindex adds the bins that are absent from the data; selecting missing labels with .loc raises a KeyError in modern pandas):
>>> import string
>>> new_cols = pd.get_dummies(df["Knownvalue"]//10).reindex(columns=range(8), fill_value=0)
>>> new_cols.columns = list(string.ascii_uppercase)[:len(new_cols.columns)]
>>> pd.concat([df[["Knownvalue"]], new_cols], axis=1)
   Knownvalue  A  B  C  D  E  F  G  H
0     17.3413  0  1  0  0  0  0  0  0
1     33.4534  0  0  0  1  0  0  0  0
get_dummies does the hard work:
>>> (df.Knownvalue//10)
0    1.0
1    3.0
Name: Knownvalue, dtype: float64
>>> pd.get_dummies((df.Knownvalue//10))
   1.0  3.0
0    1    0
1    0    1
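Another option is to bin Knownvalue explicitly with pd.cut, whose labels become the get_dummies column names directly. A minimal sketch, assuming bins of width 10 covering 0-80 as in the question:
import pandas as pd

df = pd.DataFrame({"Knownvalue": [17.3413, 33.4534]})

# Label each 10-wide bin with the matching letter, then one-hot encode;
# get_dummies keeps all eight categories, used or not.
binned = pd.cut(df["Knownvalue"], bins=range(0, 90, 10), labels=list("ABCDEFGH"))
encoded = pd.concat([df, pd.get_dummies(binned)], axis=1)
print(encoded)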
