I would like to split a column of a dataframe as follows.
Here is the main dataframe.
import pandas as pd
df_az = pd.DataFrame(list(zip(storage_AZ)), columns=['AZ Combination'])
df_az
Then, I applied this code to split the column.
out_az = (df_az.stack()
               .apply(pd.Series)
               .rename(columns=lambda x: f'a combination')
               .unstack()
               .swaplevel(0, 1, axis=1)
               .sort_index(axis=1))
out_az = pd.concat([out_az], axis=1)
out_az.head()
However, the result is as follows.
Meanwhile, the expected result is:
Could anyone tell me what to change in the code, please? Thank you in advance.
You can apply np.ravel:
>>> pd.DataFrame.from_records(df_az['AZ Combination'].apply(np.ravel))
0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 1
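A self-contained sketch of that approach (the two-row `storage_AZ` sample here is an assumption reconstructed from the output above):

```python
import numpy as np
import pandas as pd

# sample nested data, assumed from the question
storage_AZ = [[[0, 0, 0], [0, 0, 0]],
              [[0, 0, 0], [0, 0, 1]]]
df_az = pd.DataFrame({'AZ Combination': storage_AZ})

# np.ravel flattens each nested list into one flat array,
# and from_records spreads those arrays into columns
out = pd.DataFrame.from_records(df_az['AZ Combination'].apply(np.ravel))
print(out)
```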
Convert the column to a list and reshape it into a 2d array, so it is possible to use the DataFrame constructor.
Then set the column names; to avoid duplicated column names, a counter is added:
import numpy as np
import pandas as pd

storage_AZ = [[[0,0,0],[0,0,0]],
              [[0,0,0],[0,0,1]],
              [[0,0,0],[0,1,0]],
              [[0,0,0],[1,0,0]],
              [[0,0,0],[1,0,1]]]
df_az = pd.DataFrame(list(zip(storage_AZ)), columns=['AZ Combination'])

N = 3
L = ['a combination','z combination']
df = pd.DataFrame(np.array(df_az['AZ Combination'].tolist()).reshape(df_az.shape[0], -1))
df.columns = [f'{L[a]}_{b}' for a, b in zip(df.columns // N, df.columns % N)]
print(df)
a combination_0 a combination_1 a combination_2 z combination_0 \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 1
z combination_1 z combination_2
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1
If you need a MultiIndex:
df = pd.concat({'AZ Combination':df}, axis=1)
print(df)
AZ Combination \
a combination_0 a combination_1 a combination_2 z combination_0
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 1
z combination_1 z combination_2
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1
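If the goal is the two-level header directly, a variant sketch (same assumed sample data) builds the columns with pd.MultiIndex.from_product instead of string labels with counters:

```python
import numpy as np
import pandas as pd

storage_AZ = [[[0, 0, 0], [0, 0, 0]],
              [[0, 0, 0], [0, 0, 1]]]
df_az = pd.DataFrame({'AZ Combination': storage_AZ})

# reshape to 2d: one row per original row, one column per scalar
flat = np.array(df_az['AZ Combination'].tolist()).reshape(len(df_az), -1)

# two-level columns: group label on top, position 0-2 below
cols = pd.MultiIndex.from_product([['a combination', 'z combination'], range(3)])
out = pd.DataFrame(flat, columns=cols)
print(out)
```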
I have a pandas dataframe:
Index 0 1 2
0 0 1 0
1 0 0 1
2 1 0 0
3 0 0 1
How do I create a new dataframe from it, using the column name where the value is 1?
Expected output:
Index type
0 1
1 2
2 0
3 2
Use DataFrame.dot if the columns contain only 1 or 0 values:
#if Index is not column, but index
df['type'] = df.dot(df.columns)
#if Index is column or necessary omit first column
#df['type'] = df.iloc[:, 1:].dot(df.columns[1:])
print (df)
0 1 2 type
Index
0 0 1 0 1
1 0 0 1 2
2 1 0 0 0
3 0 0 1 2
The solution also works correctly if a row has no 1 value; it then returns an empty string:
df['type'] = df.dot(df.columns)
print (df)
0 1 2 type
Index
0 0 0 0
1 0 0 1 2
2 1 0 0 0
3 0 0 1 2
Here's a way using np.nonzero:
import numpy as np

_, df['type'] = np.nonzero(df.values)
print(df)
0 1 2 type
0 0 1 0 1
1 0 0 1 2
2 1 0 0 0
3 0 0 1 2
As it seems like dummies, you can also use pandas.DataFrame.idxmax:
>>> df['type'] = df.idxmax(axis=1)
>>> df
0 1 2 type
0 0 1 0 1
1 0 0 1 2
2 1 0 0 0
3 0 0 1 2
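One caveat worth knowing when choosing between these answers: on an all-zero row, idxmax still returns the first column label, while dot returns an empty string. A small sketch contrasting the two (the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0], 'b': [0, 1], 'c': [0, 0]})

# dot concatenates the labels of the 1-columns; an all-zero row gives ''
dot_type = df.dot(df.columns)

# idxmax reports the first maximum, so an all-zero row still returns 'a'
idx_type = df.idxmax(axis=1)
print(dot_type.tolist(), idx_type.tolist())
```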
I have a pandas dataframe where certain columns hold values of type list, alongside a mix of non-numeric and numeric columns.
Example data
dst_address dst_enforcement fwd_count ...
1 1.2.3.4 [Any,core] 8
2 3.4.5.6 [] 9
3 6.7.8.9 [Any] 10
4 8.10.3.2 [core] 0
So far I've been able to find out which columns are non-numeric with these 2 lines of code:
col_groups = df.columns.to_series().groupby(df.dtypes).groups
non_numeric_cols = col_groups[np.dtype('O')]
Of all these non-numeric columns, I need to figure out which ones have list as their data type, and I want to perform one-hot encoding on all non-numeric columns (including the list-typed ones).
EDIT: my expected output for above example would be something like
1.2.3.4 | 3.4.5.6 | 6.7.8.9 | 8.10.3.2 | empty | Any | core | fwd_count ...
1 1 0 0 0 0 1 1 8
2 0 1 0 0 1 0 0 9
3 0 0 1 0 0 1 0 10
4 0 0 0 1 0 0 1 0
I use 3 steps as follows:
df['dst_enforcement'] = df.dst_enforcement.apply(lambda x: x if x else ['empty'])
dm1 = pd.get_dummies(df[df.columns.difference(['dst_enforcement'])], prefix='', prefix_sep='')
dm2 = df.dst_enforcement.str.join('-').str.get_dummies('-')
pd.concat([dm1, dm2], axis=1)
Out[1221]:
fwd_count 1.2.3.4 3.4.5.6 6.7.8.9 8.10.3.2 Any core empty
1 8 1 0 0 0 1 1 0
2 9 0 1 0 0 0 0 1
3 10 0 0 1 0 1 0 0
4 0 0 0 0 1 0 1 0
Use unnesting to unnest the lists into separate rows, then call pd.get_dummies():
df_new=unnesting(df,['dst_enforcement']).combine_first(df)
df_new.dst_enforcement=df_new.dst_enforcement.apply(lambda y: 'empty' if len(y)==0 else y)
m=pd.get_dummies(df_new,prefix='',prefix_sep='').groupby('fwd_count').first().reset_index()
print(m)
fwd_count 1.2.3.4 3.4.5.6 6.7.8.9 8.10.3.2 Any core empty
0 0.0 0 0 0 1 0 1 0
1 8.0 1 0 0 0 1 0 0
2 9.0 0 1 0 0 0 0 1
3 10.0 0 0 1 0 1 0 0
Adding the function used for convenience:
def unnesting(df, explode):
    # repeat each index label once per element of the exploded list
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
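On pandas 0.25 or newer, the unnesting helper can likely be replaced by the built-in DataFrame.explode. A sketch on a reduced version of the question's data (column values assumed from the example):

```python
import pandas as pd

df = pd.DataFrame({
    'dst_address': ['1.2.3.4', '3.4.5.6'],
    'dst_enforcement': [['Any', 'core'], []],
    'fwd_count': [8, 9],
})

# replace empty lists first so explode keeps those rows
df['dst_enforcement'] = df['dst_enforcement'].apply(lambda x: x or ['empty'])

# one row per list element, then dummy-encode both non-numeric columns
exploded = df.explode('dst_enforcement')
dummies = pd.get_dummies(exploded, columns=['dst_address', 'dst_enforcement'],
                         prefix='', prefix_sep='')

# collapse the exploded rows back to one row per original index
out = dummies.groupby(level=0).max()
print(out)
```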
Give this a go:
non_numeric_cols = col_groups[np.dtype('O')]
for non in non_numeric_cols:
    print(pd.get_dummies(df[non].apply(pd.Series)))
Output:
0_1.2.3.4 0_3.4.5.6 0_6.7.8.9 0_8.10.3.2
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
0_Any 0_core 1_core
0 1 0 1
1 0 0 0
2 1 0 0
3 0 1 0
When you have neither "Any" nor "core", the whole row is zeros.
Good luck.
I have a dataframe column that contains 10 different digits. Through pd.get_dummies I got 10 new columns whose names are numbers. I then tried to rename these number-named columns with df = df.rename(columns={'0':'topic0'}), but it failed because the column labels are integers, not strings. How can I rename these columns from numbers to strings?
Use DataFrame.add_prefix:
df = pd.DataFrame({'col':[1,5,7,8,3,6,5,8,9,10]})
df1 = pd.get_dummies(df['col']).add_prefix('topic')
print (df1)
topic1 topic3 topic5 topic6 topic7 topic8 topic9 topic10
0 1 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0
2 0 0 0 0 1 0 0 0
3 0 0 0 0 0 1 0 0
4 0 1 0 0 0 0 0 0
5 0 0 0 1 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 0 0 0 1 0 0
8 0 0 0 0 0 0 1 0
9 0 0 0 0 0 0 0 1
With the example dataframe you can do:
d = {0: [1, 2], 1: [3, 4]}
df = pd.DataFrame(data=d)
You can do for example:
df = df.rename(index=str, columns={0: "a", 1: "c"})
And then use this method to rename the other columns.
Compactly:
for x in range(3):
    df = df.rename(columns={x: "topic" + str(x)})
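Since rename also accepts a callable, the loop can be collapsed into a single pass that returns the renamed frame (the sample column values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 5, 7]})

# rename with a callable: every integer dummy column n becomes 'topic<n>'
df1 = pd.get_dummies(df['col']).rename(columns=lambda c: f'topic{c}')
print(df1.columns.tolist())
```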
I am trying to return a cumulative count based on other columns. For the df below I want to build the count using Outcome and A,B,C,D. Specifically, whenever X or Y appears in Outcome, I want to record which of A,B,C,D was the most recent to increase, and count that occurrence against that column.
I have attempted this using the following:
import pandas as pd
d = ({
    'Outcome' : ['','','X','','','X','','Y','','Y'],
    'A' : [0,0,0,1,1,1,2,2,2,2],
    'B' : [0,0,0,1,1,1,1,1,2,2],
    'C' : [0,0,0,1,2,3,3,3,3,3],
    'D' : [0,1,2,2,2,2,2,2,2,2],
})
df = pd.DataFrame(data = d)
m = pd.get_dummies(
    df.where(df.Outcome.ne(df.Outcome.shift()) & df.Outcome.str.len().astype(bool)
             ), prefix='Count').cumsum()
df = pd.concat([
    m.where(m.ne(m.shift())).fillna('', downcast='infer'), df], axis=1)
But it's not quite right.
My Intended Output is:
Outcome A B C D A_X A_Y B_X B_Y C_X C_Y D_X D_Y
0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0 0 0
2 X 0 0 0 2 0 0 0 0 0 0 1 0
3 1 1 1 2 0 0 0 0 0 0 1 0
4 1 1 2 2 0 0 0 0 0 0 1 0
5 X 1 1 3 2 0 0 0 0 1 0 1 0
6 2 1 3 2 0 0 0 0 1 0 1 0
7 Y 2 1 3 2 0 1 0 0 1 0 1 0
8 2 2 3 2 0 1 0 0 1 0 1 0
9 Y 2 2 3 2 0 1 0 1 1 0 1 0
Below are 2 snippets:
1) As per the description, which captures additional increases in the A column between the 1st and 2nd X
2) As per the example, capturing the last increase out of all 4 columns
1) As per description
for col in 'ABCD':
    df[col+'_X'] = 0
    df[col+'_Y'] = 0

for i1, i2 in zip(df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index,
                  df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index[1::]):
    for col in 'ABCD':
        if df[col][i2] > df[col][i1]:
            df.loc[i2::, col+'_'+df.Outcome[i2]] = df[col+'_'+df.Outcome[i2]][i2-1] + 1
print(df)
Outcome A B C D A_X A_Y B_X B_Y C_X C_Y D_X D_Y
0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0 0 0
2 X 0 0 0 2 0 0 0 0 0 0 1 0
3 1 1 1 2 0 0 0 0 0 0 1 0
4 1 1 2 2 0 0 0 0 0 0 1 0
5 X 1 1 3 2 1 0 1 0 1 0 1 0
6 2 1 3 2 1 0 1 0 1 0 1 0
7 Y 2 1 3 2 1 1 1 0 1 0 1 0
8 2 2 3 2 1 1 1 0 1 0 1 0
9 Y 2 2 3 2 1 1 1 1 1 0 1 0
2) As per example
for col in 'ABCD':
    df[col+'_X'] = 0
    df[col+'_Y'] = 0

for i1, i2 in zip(df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index,
                  df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index[1::]):
    change_col = ''
    change_pos = -1
    for col in 'ABCD':
        if df[col][i2] > df[col][i1]:
            found_change_pos = df[df[col]==df[col][i2]-1].tail(1).index
            if found_change_pos > change_pos:
                change_col = col
                change_pos = found_change_pos
    if change_pos > -1:
        df.loc[i2::, change_col+'_'+df.Outcome[i2]] = df[change_col+'_'+df.Outcome[i2]][i2-1] + 1
print(df)
Outcome A B C D A_X A_Y B_X B_Y C_X C_Y D_X D_Y
0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0 0 0
2 X 0 0 0 2 0 0 0 0 0 0 1 0
3 1 1 1 2 0 0 0 0 0 0 1 0
4 1 1 2 2 0 0 0 0 0 0 1 0
5 X 1 1 3 2 0 0 0 0 1 0 1 0
6 2 1 3 2 0 0 0 0 1 0 1 0
7 Y 2 1 3 2 0 1 0 0 1 0 1 0
8 2 2 3 2 0 1 0 0 1 0 1 0
9 Y 2 2 3 2 0 1 0 1 1 0 1 0
The columns to test for integer increases and the unique-values column are set as variables so that the routine may be easily adapted to input dataframes with other column names.
This routine is relatively fast even with large input dataframes because it uses fast numpy functions within the loop and throughout.
# this method assumes that only rows where exactly one
# column increases count as an increase in value;
# rows with more than one column increasing are ignored.
# it also assumes that integers always increase by one.
import pandas as pd
import numpy as np
# designate the integer increase columns
tgt_cols = ['A', 'B', 'C', 'D']
unique_val_col = 'Outcome'
# put None in empty string positions within array
# of Outcome column values
oc_vals = df[unique_val_col].where(df[unique_val_col] != '', None).values
# find the unique strings in Outcome
uniques = pd.unique(oc_vals[oc_vals != None])
# use pandas diff to locate integer increases in columns
diffs = df[tgt_cols].diff().fillna(0).values.astype(int)
# add the values in each diffs row (this will help later
# to find rows without any column increase or multiple
# increases)
row_sums = np.sum(diffs, axis=1)
# find the row indexes where a single integer increase
# occurred
change_row_idx = np.where(row_sums == 1)[0]
# find the indexes where a single increase did not occur
no_change_idx = np.where((row_sums == 0) | (row_sums > 1))[0]
# remove row 0 from the index if it exists because it is
# not applicable to previous changes
if no_change_idx[0] == 0:
    no_change_idx = no_change_idx[1:]
# locate the indexes of previous rows which had an integer
# increase to carry forward to rows without an integer increase
# (no_change_idx)
fwd_fill_index = \
[np.searchsorted(change_row_idx, x) - 1 for x in no_change_idx if x > 0]
# write over no change row(s) with data from the last row with an
# integer increase.
# now each row in diffs will have a one marking the last or current change
diffs[no_change_idx] = diffs[change_row_idx][fwd_fill_index]
# make an array to hold the combined output result array
num_rows = diffs.shape[0]
num_cols = diffs.shape[1] * len(uniques)
result_array = np.zeros(num_rows * num_cols) \
.reshape(diffs.shape[0], diffs.shape[1] * len(uniques)).astype(int)
# determine the pattern for combining the unique value arrays.
# (the example has alternating columns for X and Y results)
concat_pattern = np.array(range(len(tgt_cols) * len(uniques))) % len(uniques)
# loop through the uniques values and do the following each time:
# make an array of zeros the same size as the diffs array.
# find the rows in the diffs array which are located one row up from
# to each unique value location in df.Outcome.
# put those rows into the array of zeros.
for i, u in enumerate(uniques):
    unique_val_ar = np.zeros_like(diffs)
    urows = np.where(oc_vals == u)[0]
    if urows[0] == 0:
        urows = urows[1:]
    # shift unique value index locations by -1
    adj_urows = urows - 1
    unique_val_ar[urows] = diffs[adj_urows]
    # put the columns from the unique_val_ar arrays
    # into the combined array according to the concat pattern
    # (tiled pattern per example)
    result_array[:, np.where(concat_pattern == i)[0]] = unique_val_ar
# find the cumulative sum of the combined array (vertical axis)
result_array_cumsums = np.cumsum(result_array, axis=0)
# make the column names for a new dataframe
# which will contain the result_array_cumsums array
tgt_vals = np.repeat(tgt_cols, len(uniques))
u_vals = np.tile(uniques, len(tgt_cols))
new_cols = ['_'.join(x) for x in list(zip(tgt_vals, u_vals))]
# make the dataframe, using the generated column names
df_results = pd.DataFrame(result_array_cumsums, columns=new_cols)
# join the result dataframe with the original dataframe
df_out = df.join(df_results)
print(df_out)
Outcome A B C D A_X A_Y B_X B_Y C_X C_Y D_X D_Y
0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0 0 0
2 X 0 0 0 2 0 0 0 0 0 0 1 0
3 1 1 1 2 0 0 0 0 0 0 1 0
4 1 1 2 2 0 0 0 0 0 0 1 0
5 X 1 1 3 2 0 0 0 0 1 0 1 0
6 2 1 3 2 0 0 0 0 1 0 1 0
7 Y 2 1 3 2 0 1 0 0 1 0 1 0
8 2 2 3 2 0 1 0 0 1 0 1 0
9 Y 2 2 3 2 0 1 0 1 1 0 1 0
I'm trying to understand how I can address columns after using get_dummies.
For example, let's say I have three categorical variables.
first variable has 2 levels.
second variable has 5 levels.
third variable has 2 levels.
df=pd.DataFrame({"a":["Yes","Yes","No","No","No","Yes","Yes"], "b":["a","b","c","d","e","a","c"],"c":["1","2","2","1","2","1","1"]})
I created dummies for all three variable in order to use them in sklearn regression in python.
df1 = pd.get_dummies(df,drop_first=True)
Now I want to create two interactions (multiplications): b*c and b*a.
How can I create the multiplication between each dummy variable and another one without using their specific names, like this:
df1['a_yes_b'] = df1['a_Yes']*df1['b_b']
df1['a_yes_c'] = df1['a_Yes']*df1['b_c']
df1['a_yes_d'] = df1['a_Yes']*df1['b_d']
df1['a_yes_e'] = df1['a_Yes']*df1['b_e']
df1['c_2_b'] = df1['c_2']*df1['b_b']
df1['c_2_c'] = df1['c_2']*df1['b_c']
df1['c_2_d'] = df1['c_2']*df1['b_d']
df1['c_2_e'] = df1['c_2']*df1['b_e']
Thanks.
You can use loops to create the new columns; to filter the column names, it is possible to use boolean indexing with str.startswith:
a = df1.columns[df1.columns.str.startswith('a')]
b = df1.columns[df1.columns.str.startswith('b')]
c = df1.columns[df1.columns.str.startswith('c')]
for col1 in b:
    for col2 in a:
        df1[col2 + '_' + col1.split('_')[1]] = df1[col1].mul(df1[col2])

for col1 in b:
    for col2 in c:
        df1[col2 + '_' + col1.split('_')[1]] = df1[col1].mul(df1[col2])
print (df1)
a_Yes b_b b_c b_d b_e c_2 a_Yes_b a_Yes_c a_Yes_d a_Yes_e c_2_b \
0 1 0 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 1 1 0 0 0 1
2 0 0 1 0 0 1 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0 0 0
4 0 0 0 0 1 1 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0 0 0
6 1 0 1 0 0 0 0 1 0 0 0
c_2_c c_2_d c_2_e
0 0 0 0
1 0 0 0
2 1 0 0
3 0 0 0
4 0 0 1
5 0 0 0
6 0 0 0
But if a and c have only one column each (they do in the sample; in real data they may not), use filter, mul, squeeze and concat:
a = df1.filter(regex='^a')
b = df1.filter(regex='^b')
c = df1.filter(regex='^c')
dfa = b.mul(a.squeeze(), axis=0).rename(columns=lambda x: a.columns[0] + x[1:])
dfc = b.mul(c.squeeze(), axis=0).rename(columns=lambda x: c.columns[0] + x[1:])
df1 = pd.concat([df1, dfa, dfc], axis=1)
print (df1)
a_Yes b_b b_c b_d b_e c_2 a_Yes_b a_Yes_c a_Yes_d a_Yes_e c_2_b \
0 1 0 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 1 1 0 0 0 1
2 0 0 1 0 0 1 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0 0 0
4 0 0 0 0 1 1 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0 0 0
6 1 0 1 0 0 0 0 1 0 0 0
c_2_c c_2_d c_2_e
0 0 0 0
1 0 0 0
2 1 0 0
3 0 0 0
4 0 0 1
5 0 0 0
6 0 0 0
You can convert a dataframe column into a numpy array and then multiply it accordingly. Here's a link where you can find methods to do that:
Convert Select Columns in Pandas Dataframe to Numpy Array
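A sketch of that numpy route on the question's example frame: pull one dummy column out as an array and broadcast it across the b_* block (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['Yes', 'No'], 'b': ['a', 'c'], 'c': ['1', '2']})
df1 = pd.get_dummies(df, drop_first=True)

# a_Yes is a single column; broadcast it across the b_* block
a_vals = df1['a_Yes'].to_numpy()
b_block = df1.filter(regex='^b_')
prods = b_block.to_numpy() * a_vals[:, None]

# name the interaction columns after both factors
inter = pd.DataFrame(prods,
                     columns=['a_Yes_' + c.split('_')[1] for c in b_block.columns],
                     index=df1.index)
df1 = pd.concat([df1, inter], axis=1)
print(df1.columns.tolist())
```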
This solves your problem:
import itertools

import pandas as pd

def get_design_with_pair_interaction(data, group_pair):
    """Get the design matrix with the pairwise interactions.

    Parameters
    ----------
    data (pandas.DataFrame):
        Pandas data frame with the two variables to build the design matrix
        of their two main effects and their interaction
    group_pair (iterator):
        List with the names of the two variables (column names) to build the
        design matrix of their two main effects and their interaction

    Returns
    -------
    x_new (pandas.DataFrame):
        Pandas data frame with the design matrix of their two main effects
        and their interaction
    """
    x = pd.get_dummies(data[group_pair])
    interactions_lst = list(
        itertools.combinations(
            x.columns.tolist(),
            2,
        ),
    )
    x_new = x.copy()
    for level_1, level_2 in interactions_lst:
        # skip pairs of levels that come from the same variable
        if level_1.split('_')[0] == level_2.split('_')[0]:
            continue
        x_new = pd.concat(
            [
                x_new,
                x[level_1] * x[level_2]
            ],
            axis=1,
        )
        # the product Series arrives unnamed (column label 0); rename it
        x_new = x_new.rename(
            columns={
                0: (level_1 + '_' + level_2)
            }
        )
    return x_new
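A condensed, self-contained sketch of the same idea on the question's kind of data (two categorical columns assumed), showing the combinations-and-skip-same-prefix core of the function above:

```python
import itertools
import pandas as pd

df = pd.DataFrame({'a': ['Yes', 'Yes', 'No'], 'b': ['a', 'b', 'c']})
x = pd.get_dummies(df[['a', 'b']])

# pair every two dummy columns that come from different source variables
for lvl1, lvl2 in itertools.combinations(x.columns, 2):
    if lvl1.split('_')[0] == lvl2.split('_')[0]:
        continue  # skip pairs of levels within the same variable
    x[f'{lvl1}_{lvl2}'] = x[lvl1] * x[lvl2]
print(x.columns.tolist())
```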