Cumulative count based on values in another column - python

I am trying to return a cumulative count based on other columns. For the df below I want to return a count using Outcome and the columns A, B, C, D. Specifically, whenever an X or Y appears in Outcome, I want to determine which of A, B, C, D was the most recent to increase, and record the count against that column.
I have attempted this using the following:
import pandas as pd

d = ({
    'Outcome' : ['','','X','','','X','','Y','','Y'],
    'A' : [0,0,0,1,1,1,2,2,2,2],
    'B' : [0,0,0,1,1,1,1,1,2,2],
    'C' : [0,0,0,1,2,3,3,3,3,3],
    'D' : [0,1,2,2,2,2,2,2,2,2],
})
df = pd.DataFrame(data=d)

m = pd.get_dummies(
    df.where(df.Outcome.ne(df.Outcome.shift()) & df.Outcome.str.len().astype(bool)),
    prefix='Count').cumsum()

df = pd.concat([
    m.where(m.ne(m.shift())).fillna('', downcast='infer'), df], axis=1)
But it's not quite right.
My Intended Output is:
  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0

Below are 2 snippets:
1) As per the description, which captures additional increases in the A column between the 1st and 2nd X
2) As per the example, capturing only the last increase out of all 4 columns
1) As per description
for col in 'ABCD':
    df[col + '_X'] = 0
    df[col + '_Y'] = 0

for i1, i2 in zip(df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index,
                  df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index[1:]):
    for col in 'ABCD':
        if df[col][i2] > df[col][i1]:
            df.loc[i2:, col + '_' + df.Outcome[i2]] = df[col + '_' + df.Outcome[i2]][i2-1] + 1
print(df)
  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    1    0    1    0    1    0    1    0
6          2  1  3  2    1    0    1    0    1    0    1    0
7       Y  2  1  3  2    1    1    1    0    1    0    1    0
8          2  2  3  2    1    1    1    0    1    0    1    0
9       Y  2  2  3  2    1    1    1    1    1    0    1    0
2) As per example
for col in 'ABCD':
    df[col + '_X'] = 0
    df[col + '_Y'] = 0

for i1, i2 in zip(df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index,
                  df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index[1:]):
    change_col = ''
    change_pos = -1
    for col in 'ABCD':
        if df[col][i2] > df[col][i1]:
            # index of the last row where this column still held its previous value
            found_change_pos = df[df[col] == df[col][i2] - 1].tail(1).index
            if found_change_pos > change_pos:
                change_col = col
                change_pos = found_change_pos
    if change_pos > -1:
        df.loc[i2:, change_col + '_' + df.Outcome[i2]] = df[change_col + '_' + df.Outcome[i2]][i2-1] + 1
print(df)
  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0

The columns to test for integer increases and the unique-values column are set as variables so that the routine can easily be adapted to input dataframes with other column names.
This routine stays relatively fast even with large input dataframes because it uses vectorised numpy functions within the loop and throughout.
# this method assumes that only rows with an increase in exactly one
# column count as an increase in value; rows with more than one
# increasing column are ignored. it also assumes that integers
# always increase by one.
import pandas as pd
import numpy as np

# designate the integer increase columns
tgt_cols = ['A', 'B', 'C', 'D']
unique_val_col = 'Outcome'

# put None in empty string positions within array
# of Outcome column values
oc_vals = df[unique_val_col].where(df[unique_val_col] != '', None).values

# find the unique strings in Outcome
uniques = pd.unique(oc_vals[oc_vals != None])

# use pandas diff to locate integer increases in columns
diffs = df[tgt_cols].diff().fillna(0).values.astype(int)

# add the values in each diffs row (this will help later
# to find rows without any column increase or multiple
# increases)
row_sums = np.sum(diffs, axis=1)

# find the row indexes where a single integer increase
# occurred
change_row_idx = np.where(row_sums == 1)[0]

# find the indexes where a single increase did not occur
no_change_idx = np.where((row_sums == 0) | (row_sums > 1))[0]

# remove row 0 from the index if it exists because it is
# not applicable to previous changes
if no_change_idx[0] == 0:
    no_change_idx = no_change_idx[1:]

# locate the indexes of previous rows which had an integer
# increase to carry forward to rows without an integer increase
# (no_change_idx)
fwd_fill_index = \
    [np.searchsorted(change_row_idx, x) - 1 for x in no_change_idx if x > 0]

# write over no change row(s) with data from the last row with an
# integer increase.
# now each row in diffs will have a one marking the last or current change
diffs[no_change_idx] = diffs[change_row_idx][fwd_fill_index]

# make an array to hold the combined output result array
num_rows = diffs.shape[0]
num_cols = diffs.shape[1] * len(uniques)
result_array = np.zeros(num_rows * num_cols) \
    .reshape(diffs.shape[0], diffs.shape[1] * len(uniques)).astype(int)

# determine the pattern for combining the unique value arrays
# (the example has alternating columns for X and Y results)
concat_pattern = np.array(range(len(tgt_cols) * len(uniques))) % len(uniques)

# loop through the unique values and do the following each time:
# make an array of zeros the same size as the diffs array,
# find the rows in the diffs array which are located one row up
# from each unique value location in df.Outcome,
# and put those rows into the array of zeros
for i, u in enumerate(uniques):
    unique_val_ar = np.zeros_like(diffs)
    urows = np.where(oc_vals == u)[0]
    if urows[0] == 0:
        urows = urows[1:]
    # shift unique value index locations by -1
    adj_urows = urows - 1
    unique_val_ar[urows] = diffs[adj_urows]
    # put the columns from the unique_val_ar arrays
    # into the combined array according to the concat pattern
    # (tiled pattern per example)
    result_array[:, np.where(concat_pattern == i)[0]] = unique_val_ar

# find the cumulative sum of the combined array (vertical axis)
result_array_cumsums = np.cumsum(result_array, axis=0)

# make the column names for a new dataframe
# which will contain the result_array_cumsums array
tgt_vals = np.repeat(tgt_cols, len(uniques))
u_vals = np.tile(uniques, len(tgt_cols))
new_cols = ['_'.join(x) for x in list(zip(tgt_vals, u_vals))]

# make the dataframe, using the generated column names
df_results = pd.DataFrame(result_array_cumsums, columns=new_cols)

# join the result dataframe with the original dataframe
df_out = df.join(df_results)
print(df_out)
  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0
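As a quick sanity check (my addition, not part of the original routine), a few spot values of df_out can be asserted against the intended output shown in the question:
# sanity check (sketch): assumes the routine above has just run
expected_cols = ['A_X', 'A_Y', 'B_X', 'B_Y', 'C_X', 'C_Y', 'D_X', 'D_Y']
assert list(df_results.columns) == expected_cols
assert df_out.loc[2, 'D_X'] == 1  # first X credited to D
assert df_out.loc[5, 'C_X'] == 1  # second X credited to C
assert df_out.loc[9, 'B_Y'] == 1  # second Y credited to B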

Related

How to split the column of a dataframe

I would like to split the column of a dataframe as follows.
Here is the main dataframe (storage_AZ is a nested list of combinations; sample data is given in the answer below).
import pandas as pd

df_az = pd.DataFrame(list(zip(storage_AZ)), columns=['AZ Combination'])
df_az
Then, I applied this code to split the column.
out_az = (df_az.stack()
               .apply(pd.Series)
               .rename(columns=lambda x: f'a combination')
               .unstack()
               .swaplevel(0, 1, axis=1)
               .sort_index(axis=1))
out_az = pd.concat([out_az], axis=1)
out_az.head()
However, the result is not what I expect. Could anyone help me with what to change in the code, please? Thank you in advance.
You can apply np.ravel:
>>> pd.DataFrame.from_records(df_az['AZ Combination'].apply(np.ravel))
   0  1  2  3  4  5
0  0  0  0  0  0  0
1  0  0  0  0  0  1
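If labelled columns are wanted instead of 0-5, the flattened frame can be renamed afterwards. A small sketch (the label names and the three-values-per-combination layout are assumed from the expected output, not stated in the question):
import numpy as np
import pandas as pd

flat = pd.DataFrame.from_records(df_az['AZ Combination'].apply(np.ravel))
labels = ['a combination', 'z combination']  # assumed target labels
flat.columns = [f'{labels[i // 3]}_{i % 3}' for i in flat.columns]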
Convert the column to a list and reshape it to a 2d array, so it is possible to use the DataFrame constructor. Then set the column names; to avoid duplicated column names, a counter is added:
storage_AZ = [[[0,0,0],[0,0,0]],
              [[0,0,0],[0,0,1]],
              [[0,0,0],[0,1,0]],
              [[0,0,0],[1,0,0]],
              [[0,0,0],[1,0,1]]]
df_az = pd.DataFrame(list(zip(storage_AZ)), columns=['AZ Combination'])

N = 3
L = ['a combination', 'z combination']
df = pd.DataFrame(np.array(df_az['AZ Combination'].tolist()).reshape(df_az.shape[0], -1))
df.columns = [f'{L[a]}_{b}' for a, b in zip(df.columns // N, df.columns % N)]
print(df)
print(df)
   a combination_0  a combination_1  a combination_2  z combination_0  \
0                0                0                0                0
1                0                0                0                0
2                0                0                0                0
3                0                0                0                1
4                0                0                0                1

   z combination_1  z combination_2
0                0                0
1                0                1
2                1                0
3                0                0
4                0                1
If a MultiIndex is needed:
df = pd.concat({'AZ Combination':df}, axis=1)
print(df)
  AZ Combination                                                  \
 a combination_0 a combination_1 a combination_2 z combination_0
0              0               0               0               0
1              0               0               0               0
2              0               0               0               0
3              0               0               0               1
4              0               0               0               1

  AZ Combination
 z combination_1 z combination_2
0              0               0
1              0               1
2              1               0
3              0               0
4              0               1

How to create a new dataframe according to its column name where the value exists?

I have a pandas dataframe:
Index  0  1  2
0      0  1  0
1      0  0  1
2      1  0  0
3      0  0  1
How do I create a new dataframe according to its column names, where the value exists (i.e. where the value = 1)?
Expected output:
Index  type
0         1
1         2
2         0
3         2
Use DataFrame.dot if the columns contain only 1 or 0 values:
# if Index is not a column, but the index
df['type'] = df.dot(df.columns)

# if Index is a column, or it is necessary to omit the first column
#df['type'] = df.iloc[:, 1:].dot(df.columns[1:])

print(df)
       0  1  2 type
Index
0      0  1  0    1
1      0  0  1    2
2      1  0  0    0
3      0  0  1    2
The solution also works correctly if there is no 1 value in a row; it then returns an empty string:
df['type'] = df.dot(df.columns)
print(df)
       0  1  2 type
Index
0      0  0  0
1      0  0  1    2
2      1  0  0    0
3      0  0  1    2
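One caveat worth knowing (my note, not from the answer above): because the column names are strings, a row containing more than one 1 makes df.dot concatenate the matching names instead of raising an error:
import pandas as pd

tmp = pd.DataFrame({'0': [1], '1': [1], '2': [0]})
# two 1s in the row: the matching string names are concatenated
print(tmp.dot(tmp.columns))  # 0    01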
Here's a way using np.nonzero:
_, df['type'] = np.nonzero(df.values)
print(df)
   0  1  2  type
0  0  1  0     1
1  0  0  1     2
2  1  0  0     0
3  0  0  1     2
As it seems like dummies, you can also use pandas.DataFrame.idxmax:
>>> df['type'] = df.idxmax(axis=1)
>>> df
   0  1  2 type
0  0  1  0    1
1  0  0  1    2
2  1  0  0    0
3  0  0  1    2
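A caveat with idxmax (my note): it returns the first maximum, so an all-zero row silently reports the first column instead of an empty value:
import pandas as pd

tmp = pd.DataFrame({'0': [0, 0], '1': [0, 1], '2': [0, 0]})
# row 0 is all zeros but still reports column '0'
print(tmp.idxmax(axis=1))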

new column in pandas DataFrame based on unique values (lists) of an existing column

I have a dataframe where some cells contain lists of multiple values. How can I create new columns based on the unique values of those lists? The lists can contain values already included in previous observations and can also be empty. How do I create new columns (one-hot encoding) based on those values?
CHECK EDIT - Data is within quotation marks:
data = {'tokens': ['["Spain", "Germany", "England", "Japan"]',
                   '["Spain", "Germany"]',
                   '["Morocco"]',
                   '[]',
                   '["Japan"]',
                   '[]']}
my_new_pd = pd.DataFrame(data)
0    ["Spain", "Germany", "England", "Japan"]
1    ["Spain", "Germany"]
2    ["Morocco"]
3    []
4    ["Japan", ""]
5    []
Name: tokens, dtype: object
I want something like:
   tokens_Spain  tokens_Germany  tokens_England  tokens_Japan  tokens_Morocco
0             1               1               1             1               0
1             1               1               0             0               0
2             0               0               0             0               1
3             0               0               0             0               0
4             0               0               1             1               0
5             0               0               0             0               0
Method one, from sklearn, since you already have a list-type column in your df:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
yourdf = pd.DataFrame(mlb.fit_transform(df['tokens']),
                      columns=mlb.classes_, index=df.index)
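Note that per the question's edit the lists arrive as strings, so they need to be parsed into real Python lists before MultiLabelBinarizer (or the explode methods below) can see them. A sketch using ast.literal_eval:
import ast

df = my_new_pd.copy()
df['tokens'] = df['tokens'].apply(ast.literal_eval)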
Method two: we do explode first, then find the dummies:
df['tokens'].explode().str.get_dummies().sum(level=0).add_prefix('tokens_')
   tokens_A  tokens_B  tokens_C  tokens_D  tokens_Z
0         1         1         1         1         0
1         1         1         0         0         0
2         0         0         0         0         1
3         0         0         0         0         0
4         0         0         0         1         1
5         0         0         0         0         0
Method three, kind of like "explode", on axis=0:
pd.get_dummies(pd.DataFrame(df.tokens.tolist()), prefix='tokens', prefix_sep='_').sum(level=0, axis=1)
   tokens_A  tokens_D  tokens_Z  tokens_B  tokens_C
0         1         1         0         1         1
1         1         0         0         1         0
2         0         0         1         0         0
3         0         0         0         0         0
4         0         1         1         0         0
5         0         0         0         0         0
Update:
df['tokens'].explode().str.get_dummies().sum(level=0).add_prefix('tokens_')
   tokens_England  tokens_Germany  tokens_Japan  tokens_Morocco  tokens_Spain
0               1               1             1               0             1
1               0               1             0               0             1
2               0               0             0               1             0
3               0               0             0               0             0
4               1               0             1               0             0
5               0               0             0               0             0
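On newer pandas the level keyword of sum is gone (deprecated in 1.3, removed in 2.0), so the same aggregation needs an explicit groupby. A sketch, assuming tokens already holds parsed lists (see the ast.literal_eval note above):
out = (df['tokens'].explode()
                   .str.get_dummies()
                   .groupby(level=0).sum()
                   .add_prefix('tokens_'))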

Interactions between dummy variables in python

I'm trying to understand how I can address columns after using get_dummies.
For example, let's say I have three categorical variables:
the first variable has 2 levels,
the second variable has 5 levels,
the third variable has 2 levels.
df = pd.DataFrame({"a": ["Yes","Yes","No","No","No","Yes","Yes"], "b": ["a","b","c","d","e","a","c"], "c": ["1","2","2","1","2","1","1"]})
I created dummies for all three variables in order to use them in an sklearn regression in Python.
df1 = pd.get_dummies(df, drop_first=True)
Now I want to create two interactions (multiplications): b*c and b*a.
How can I create the multiplication between each dummy variable and another one without using their specific names, like this:
df1['a_yes_b'] = df1['a_Yes']*df1['b_b']
df1['a_yes_c'] = df1['a_Yes']*df1['b_c']
df1['a_yes_d'] = df1['a_Yes']*df1['b_d']
df1['a_yes_e'] = df1['a_Yes']*df1['b_e']
df1['c_2_b'] = df1['c_2']*df1['b_b']
df1['c_2_c'] = df1['c_2']*df1['b_c']
df1['c_2_d'] = df1['c_2']*df1['b_d']
df1['c_2_e'] = df1['c_2']*df1['b_e']
Thanks.
You can use loops to create the new columns; for filtering the column names it is possible to use boolean indexing with str.startswith:
a = df1.columns[df1.columns.str.startswith('a')]
b = df1.columns[df1.columns.str.startswith('b')]
c = df1.columns[df1.columns.str.startswith('c')]

for col1 in b:
    for col2 in a:
        df1[col2 + '_' + col1.split('_')[1]] = df1[col1].mul(df1[col2])

for col1 in b:
    for col2 in c:
        df1[col2 + '_' + col1.split('_')[1]] = df1[col1].mul(df1[col2])
print (df1)
   a_Yes  b_b  b_c  b_d  b_e  c_2  a_Yes_b  a_Yes_c  a_Yes_d  a_Yes_e  c_2_b  \
0      1    0    0    0    0    0        0        0        0        0      0
1      1    1    0    0    0    1        1        0        0        0      1
2      0    0    1    0    0    1        0        0        0        0      0
3      0    0    0    1    0    0        0        0        0        0      0
4      0    0    0    0    1    1        0        0        0        0      0
5      1    0    0    0    0    0        0        0        0        0      0
6      1    0    1    0    0    0        0        1        0        0      0

   c_2_c  c_2_d  c_2_e
0      0      0      0
1      0      0      0
2      1      0      0
3      0      0      0
4      0      0      1
5      0      0      0
6      0      0      0
But if a and c have only one column each (true in the sample; maybe also in the real data), use filter, mul, squeeze and concat:
a = df1.filter(regex='^a')
b = df1.filter(regex='^b')
c = df1.filter(regex='^c')

dfa = b.mul(a.squeeze(), axis=0).rename(columns=lambda x: a.columns[0] + x[1:])
dfc = b.mul(c.squeeze(), axis=0).rename(columns=lambda x: c.columns[0] + x[1:])
df1 = pd.concat([df1, dfa, dfc], axis=1)
print (df1)
   a_Yes  b_b  b_c  b_d  b_e  c_2  a_Yes_b  a_Yes_c  a_Yes_d  a_Yes_e  c_2_b  \
0      1    0    0    0    0    0        0        0        0        0      0
1      1    1    0    0    0    1        1        0        0        0      1
2      0    0    1    0    0    1        0        0        0        0      0
3      0    0    0    1    0    0        0        0        0        0      0
4      0    0    0    0    1    1        0        0        0        0      0
5      1    0    0    0    0    0        0        0        0        0      0
6      1    0    1    0    0    0        0        1        0        0      0

   c_2_c  c_2_d  c_2_e
0      0      0      0
1      0      0      0
2      1      0      0
3      0      0      0
4      0      0      1
5      0      0      0
6      0      0      0
You can convert a dataframe column into a numpy array and then multiply it accordingly. Here's a link where you can find methods to do that:
Convert Select Columns in Pandas Dataframe to Numpy Array
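For completeness, a minimal sketch of that numpy route, building every a*b and c*b product in a single broadcasting step (the variable names here are mine, not from the linked post):
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["Yes","Yes","No","No","No","Yes","Yes"],
                   "b": ["a","b","c","d","e","a","c"],
                   "c": ["1","2","2","1","2","1","1"]})
df1 = pd.get_dummies(df, drop_first=True)

ac = df1.filter(regex='^[ac]')  # the a_Yes and c_2 dummies
b = df1.filter(regex='^b')      # the b_* dummies
# broadcast (n, n_ac, 1) * (n, 1, n_b) -> one product column per pair
prods = (ac.to_numpy()[:, :, None] * b.to_numpy()[:, None, :]).reshape(len(df1), -1)
names = [f'{i}_{j.split("_")[1]}' for i in ac.columns for j in b.columns]
df1[names] = prods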
This solves your problem:
import itertools

import pandas as pd


def get_design_with_pair_interaction(data, group_pair):
    """Get the design matrix with the pairwise interactions.

    Parameters
    ----------
    data (pandas.DataFrame):
        Pandas data frame with the two variables to build the design matrix
        of their two main effects and their interaction
    group_pair (iterator):
        List with the names of the two variables (names of the columns) to
        build the design matrix of their two main effects and their
        interaction

    Returns
    -------
    x_new (pandas.DataFrame):
        Pandas data frame with the design matrix of their two main effects
        and their interaction
    """
    x = pd.get_dummies(data[group_pair])
    interactions_lst = list(
        itertools.combinations(
            x.columns.tolist(),
            2,
        ),
    )
    x_new = x.copy()
    for level_1, level_2 in interactions_lst:
        # skip pairs of levels that belong to the same variable
        if level_1.split('_')[0] == level_2.split('_')[0]:
            continue
        x_new = pd.concat(
            [
                x_new,
                x[level_1] * x[level_2]
            ],
            axis=1,
        )
        # the product arrives as an unnamed column (0); give it a name
        x_new = x_new.rename(
            columns={
                0: (level_1 + '_' + level_2)
            }
        )
    return x_new
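A quick usage sketch (assuming the helper above is in scope), using the question's sample frame:
import pandas as pd

df = pd.DataFrame({"a": ["Yes","Yes","No","No","No","Yes","Yes"],
                   "b": ["a","b","c","d","e","a","c"],
                   "c": ["1","2","2","1","2","1","1"]})
# main effects of a and b plus one product column per cross pair of levels
design = get_design_with_pair_interaction(df, ["a", "b"])
print(design.columns.tolist())
# ['a_No', 'a_Yes', 'b_a', ..., 'a_No_b_a', 'a_No_b_b', ...]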

Remove several rows with zero values in a dataframe using python

Hi everybody, I need some help with Python.
I'm working with an Excel file with several rows; some of those rows have a zero value in all the columns, so I need to delete those rows.
In:
id  a  b  c  d
a   0  1  5  0
b   0  0  0  0
c   0  0  0  0
d   0  0  0  1
e   1  0  0  1
Out:
id  a  b  c  d
a   0  1  5  0
d   0  0  0  1
e   1  0  0  1
I thought of something like showing the rows that do not contain zeros, but it does not work because it deletes all the rows, both with and without zeros:
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx, 'Sheet1')
df_zero = df[(df.OTC != 0) & (df.TM != 0) & (df.Lease != 0)
             & (df.Maint != 0) & (df.Support != 0) & (df.Other != 0)]
Then I thought of just showing the rows where every value is zero:
In:
id  a  b  c  d
a   0  1  5  0
b   0  0  0  0
c   0  0  0  0
d   0  0  0  1
e   1  0  0  1
Out:
id  a  b  c  d
b   0  0  0  0
c   0  0  0  0
So I made a little change and now have something like this:
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx, 'Sheet1')
df_zero = df[(df.OTC == 0) & (df.TM == 0) & (df.Lease == 0)
             & (df.Maint == 0) & (df.Support == 0) & (df.Other == 0)]
This way I get only the rows with all zeros. I need a way to remove those 2 rows from the original input and receive the output without them. Thanks, and sorry for the bad English; I'm working on that too.
Given your input, you can group by whether all the columns are zero or not, then access each group, e.g.:
groups = df.groupby((df.drop('id', axis= 1) == 0).all(axis=1))
all_zero = groups.get_group(True)
non_all_zero = groups.get_group(False)
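One thing to keep in mind (my note): get_group raises a KeyError when a group is empty, so if the frame might contain no all-zero rows at all, plain boolean masks are a safer sketch:
mask = (df.drop('id', axis=1) == 0).all(axis=1)
all_zero = df[mask]        # may be empty, but never raises
non_all_zero = df[~mask]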
For this dataframe:
df
Out:
  id  a  b  c  d  e
0  a  2  0  2  0  1
1  b  1  0  1  1  1
2  c  1  0  0  0  1
3  d  2  0  2  0  2
4  e  0  0  0  0  2
5  f  0  0  0  0  0
6  g  0  2  1  0  2
7  h  0  0  0  0  0
8  i  1  2  2  0  2
9  j  2  2  1  2  1
Temporarily set the index:
df = df.set_index('id')
Drop rows containing all zeros and reset the index:
df = df[~(df==0).all(axis=1)].reset_index()
df
Out:
  id  a  b  c  d  e
0  a  2  0  2  0  1
1  b  1  0  1  1  1
2  c  1  0  0  0  1
3  d  2  0  2  0  2
4  e  0  0  0  0  2
5  g  0  2  1  0  2
6  i  1  2  2  0  2
7  j  2  2  1  2  1
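The same filtering also works without touching the index; a one-line sketch that keeps id as a regular column:
df = df[df.drop('id', axis=1).ne(0).any(axis=1)].reset_index(drop=True)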
