Remove several rows with zero values in a dataframe using Python

Hi everybody, I need some help with Python.
I'm working with an Excel file with several rows; some of these rows have zero values in all the columns, so I need to delete those rows.
In:
id  a  b  c  d
a   0  1  5  0
b   0  0  0  0
c   0  0  0  0
d   0  0  0  1
e   1  0  0  1
Out:
id  a  b  c  d
a   0  1  5  0
d   0  0  0  1
e   1  0  0  1
My first thought was to show only the rows that do not contain zeros, but that does not work because it deletes every row containing at least one zero, not just the all-zero rows:
import pandas as pd

path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx, 'Sheet1')
df_zero = df[(df.OTC != 0) & (df.TM != 0) & (df.Lease != 0) & (df.Maint != 0) & (df.Support != 0) & (df.Other != 0)]
Then I tried the opposite: show only the rows where every column is zero.
In:
id  a  b  c  d
a   0  1  5  0
b   0  0  0  0
c   0  0  0  0
d   0  0  0  1
e   1  0  0  1
Out:
id  a  b  c  d
b   0  0  0  0
c   0  0  0  0
So I made a small change and got something like this:
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx, 'Sheet1')
df_zero = df[(df.OTC == 0) & (df.TM == 0) & (df.Lease == 0) & (df.Maint == 0) & (df.Support == 0) & (df.Other == 0)]
This way I get only the all-zero rows. I need a way to remove those 2 rows from the original input and get the output without them. Thanks, and sorry for the bad English, I'm working on that too.

Given your input you can group by whether all the columns are zero or not, and then access each group, e.g.:
groups = df.groupby((df.drop('id', axis=1) == 0).all(axis=1))
all_zero = groups.get_group(True)
non_all_zero = groups.get_group(False)

For this dataframe:
df
Out:
  id  a  b  c  d  e
0  a  2  0  2  0  1
1  b  1  0  1  1  1
2  c  1  0  0  0  1
3  d  2  0  2  0  2
4  e  0  0  0  0  2
5  f  0  0  0  0  0
6  g  0  2  1  0  2
7  h  0  0  0  0  0
8  i  1  2  2  0  2
9  j  2  2  1  2  1
Temporarily set the index:
df = df.set_index('id')
Drop rows containing all zeros and reset the index:
df = df[~(df==0).all(axis=1)].reset_index()
df
Out:
  id  a  b  c  d  e
0  a  2  0  2  0  1
1  b  1  0  1  1  1
2  c  1  0  0  0  1
3  d  2  0  2  0  2
4  e  0  0  0  0  2
5  g  0  2  1  0  2
6  i  1  2  2  0  2
7  j  2  2  1  2  1
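Equivalently, without touching the index, a single boolean mask does the same job; a minimal sketch, assuming id is the only non-numeric column of the df above:
df = df[df.drop(columns='id').ne(0).any(axis=1)]
This keeps every row where at least one of the numeric columns is nonzero.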

Related

Create a feature table in Python from a df

I have the following df:
id  step1  step2  step3  step4  ...  stepn-1  stepn  event
1   a      b      c      null   ...  null     null   1
2   b      d      f      null   ...  null     null   0
3   a      d      g      h      ...  l        m      1
Here id is a session, the steps represent a certain path, and event indicates whether something specific happened.
I want to create a feature store where we take all the possible steps (a, b, c, ... all the way up to some arbitrary value) and make them the columns. The id should remain a column, and each step column gets a 1 or a 0 depending on whether that session hit that step. The result is below:
id  a  b  c  d  e  f  g  ...  n  event
1   1  1  1  0  0  0  0  ...  0  1
2   0  1  0  0  0  1  0  ...  0  0
3   1  0  0  1  0  0  1  ...  1  1
I have a unique list of all the possible steps, which I assume will be used to construct the new table, but beyond that I'm struggling to see how to build it.
What you are looking for is often used in machine learning and is called one-hot encoding.
There is a pandas function designed specifically for this purpose, called pd.get_dummies().
# split the step columns from everything else
step_cols = [c for c in df.columns if c.startswith('step')]
other_cols = [c for c in df.columns if not c.startswith('step')]
# stack() melts the step columns into one long Series (dropping NaNs),
# get_dummies one-hot encodes it, and groupby(level=0).max() collapses
# the dummies back to one row per original row
new_df = pd.get_dummies(df[step_cols].stack()).groupby(level=0).max()
new_df[other_cols] = df[other_cols]
Output:
>>> new_df
   a  b  c  d  f  g  h  l  m  id  event
0  1  1  1  0  0  0  0  0  0   1      1
1  0  1  0  1  1  0  0  0  0   2      0
2  1  0  0  1  0  1  1  1  1   3      1
Probably not the most elegant way:
step_cols = [col for col in df.columns if col.startswith("step")]
values = pd.Series(sorted(set(df[step_cols].melt().value.dropna())))
df1 = pd.DataFrame(
    (values.isin(row).to_list() for row in zip(*(df[col] for col in step_cols))),
    columns=values
).astype(int)
df = pd.concat([df.id, df1, df.event], axis=1)
Result for
df =
   id step1 step2 step3 step4  event
0   1     a     b     c   NaN      1
1   2     b     d     f   NaN      0
2   3     a     d     g     h      1
is
   id  a  b  c  d  f  g  h  event
0   1  1  1  1  0  0  0  0      1
1   2  0  1  0  1  1  0  0      0
2   3  1  0  0  1  0  1  1      1
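Both snippets can be tried against a minimal reconstruction of this truncated frame; an assumption here is that missing steps are NaN, and since the frame stops at step4, the first answer's output above would simply lack the l and m columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'step1': ['a', 'b', 'a'],
    'step2': ['b', 'd', 'd'],
    'step3': ['c', 'f', 'g'],
    'step4': [np.nan, np.nan, 'h'],
    'event': [1, 0, 1],
})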

How to create a new column based on values in a set of columns

I have a pandas df like this:
time  a  b  c
1     0  1  0
1     0  1  0
1     1  0  0
1     0  1  0
1     0  0  1
1     0  0  0
I want to create a new column, df.code, based on the following logic:
if df.a == 1, return 4
if df.b == 1, return 2
if df.c == 1, return 1
if none of a, b, or c is 1, return 0
The expected output is:
time  a  b  c  code
1     0  1  0  2
1     0  1  0  2
1     1  0  0  4
1     0  1  0  2
1     0  0  1  1
1     0  0  0  0
How do I do this? I'm essentially trying to compress several dummy columns into a single multiclass column.
We can take the dot product of the dummy columns with their code weights:
df['code'] = df[['a','b','c']].dot([4,2,1])
df
Output
   time  a  b  c  code
0     1  0  1  0     2
1     1  0  1  0     2
2     1  1  0  0     4
3     1  0  1  0     2
4     1  0  0  1     1
5     1  0  0  0     0
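Note that the dot product adds the weights when more than one flag is set in a row (a and c both 1 would give 5). If the rules are meant as an if/elif priority chain instead, np.select preserves that priority; a sketch, assuming a should win over b, and b over c:
import numpy as np

conditions = [df['a'].eq(1), df['b'].eq(1), df['c'].eq(1)]
df['code'] = np.select(conditions, [4, 2, 1], default=0)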
This example should work as-is:
stack.csv
time a b c
1 0 1 0
1 0 1 0
1 1 0 0
1 0 1 0
1 0 0 1
1 0 0 0
main.py
import pandas as pd

df = pd.read_csv('stack.csv', sep=' ', index_col=False)
df['code'] = 0
# the assignments run in order, so if several flags were set the last
# matching one would win; here exactly one flag is 1 per row
df.loc[df['a'] == 1, 'code'] = 4
df.loc[df['b'] == 1, 'code'] = 2
df.loc[df['c'] == 1, 'code'] = 1
print(df)
output:
   time  a  b  c  code
0     1  0  1  0     2
1     1  0  1  0     2
2     1  1  0  0     4
3     1  0  1  0     2
4     1  0  0  1     1
5     1  0  0  0     0

Cumulative count based on values in another column

I am trying to return a cumulative count based on other columns. For the df below I want to build the count from Outcome and the columns A, B, C, D. Specifically, whenever X or Y appears in Outcome, I want to count it against whichever of A, B, C, D was the most recent to increase.
I have attempted this using the following:
import pandas as pd

d = ({
    'Outcome': ['', '', 'X', '', '', 'X', '', 'Y', '', 'Y'],
    'A': [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
    'B': [0, 0, 0, 1, 1, 1, 1, 1, 2, 2],
    'C': [0, 0, 0, 1, 2, 3, 3, 3, 3, 3],
    'D': [0, 1, 2, 2, 2, 2, 2, 2, 2, 2],
})
df = pd.DataFrame(data=d)

m = pd.get_dummies(
    df.where(df.Outcome.ne(df.Outcome.shift()) & df.Outcome.str.len().astype(bool)),
    prefix='Count').cumsum()
df = pd.concat([
    m.where(m.ne(m.shift())).fillna('', downcast='infer'), df], axis=1)
But it's not quite right.
My intended output is:
  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0
Below are 2 snippets:
1. As per the description, which also captures additional increases in the A column between the 1st and 2nd X
2. As per the example, capturing only the last increase out of all 4 columns
1) As per the description
for col in 'ABCD':
    df[col+'_X'] = 0
    df[col+'_Y'] = 0

for i1, i2 in zip(df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index,
                  df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index[1::]):
    for col in 'ABCD':
        if df[col][i2] > df[col][i1]:
            df.loc[i2::, col+'_'+df.Outcome[i2]] = df[col+'_'+df.Outcome[i2]][i2-1] + 1
print(df)
  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    1    0    1    0    1    0    1    0
6          2  1  3  2    1    0    1    0    1    0    1    0
7       Y  2  1  3  2    1    1    1    0    1    0    1    0
8          2  2  3  2    1    1    1    0    1    0    1    0
9       Y  2  2  3  2    1    1    1    1    1    0    1    0
2) As per the example
for col in 'ABCD':
    df[col+'_X'] = 0
    df[col+'_Y'] = 0

for i1, i2 in zip(df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index,
                  df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index[1::]):
    change_col = ''
    change_pos = -1
    for col in 'ABCD':
        if df[col][i2] > df[col][i1]:
            found_change_pos = df[df[col]==df[col][i2]-1].tail(1).index
            if found_change_pos > change_pos:
                change_col = col
                change_pos = found_change_pos
    if change_pos > -1:
        df.loc[i2::, change_col+'_'+df.Outcome[i2]] = df[change_col+'_'+df.Outcome[i2]][i2-1] + 1
print(df)
  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0
The columns to test for integer increases and the unique-values column are set as variables, so the routine can easily be adapted to input dataframes with other column names.
The routine is relatively fast even with large input dataframes because it uses fast numpy functions throughout.
# This method assumes that a row counts as an increase only when exactly
# one column increases; rows where more than one column increases are
# ignored. It also assumes that integers always increase by one.
import pandas as pd
import numpy as np

# designate the integer-increase columns
tgt_cols = ['A', 'B', 'C', 'D']
unique_val_col = 'Outcome'
# put None in empty-string positions within the array of Outcome values
oc_vals = df[unique_val_col].where(df[unique_val_col] != '', None).values
# find the unique strings in Outcome
uniques = pd.unique(oc_vals[oc_vals != None])
# use pandas diff to locate integer increases in the columns
diffs = df[tgt_cols].diff().fillna(0).values.astype(int)
# add the values in each diffs row (this will help later to find rows
# without any column increase, or with multiple increases)
row_sums = np.sum(diffs, axis=1)
# find the row indexes where a single integer increase occurred
change_row_idx = np.where(row_sums == 1)[0]
# find the indexes where a single increase did not occur
no_change_idx = np.where((row_sums == 0) | (row_sums > 1))[0]
# remove row 0 from the index if it exists, because it cannot refer to a
# previous change
if no_change_idx[0] == 0:
    no_change_idx = no_change_idx[1:]
# locate the indexes of previous rows which had an integer increase,
# to carry forward to the rows without one (no_change_idx)
fwd_fill_index = \
    [np.searchsorted(change_row_idx, x) - 1 for x in no_change_idx if x > 0]
# overwrite the no-change row(s) with data from the last row that had an
# integer increase; now each row in diffs has a one marking the last or
# current change
diffs[no_change_idx] = diffs[change_row_idx][fwd_fill_index]
# make an array to hold the combined output result array
num_rows = diffs.shape[0]
num_cols = diffs.shape[1] * len(uniques)
result_array = np.zeros(num_rows * num_cols) \
    .reshape(diffs.shape[0], diffs.shape[1] * len(uniques)).astype(int)
# determine the pattern for combining the unique-value arrays
# (the example has alternating columns for X and Y results)
concat_pattern = np.array(range(len(tgt_cols) * len(uniques))) % len(uniques)
# loop through the unique values and, for each one:
# make an array of zeros the same size as the diffs array,
# find the rows in diffs located one row up from each unique-value
# location in df.Outcome, and put those rows into the array of zeros
for i, u in enumerate(uniques):
    unique_val_ar = np.zeros_like(diffs)
    urows = np.where(oc_vals == u)[0]
    if urows[0] == 0:
        urows = urows[1:]
    # shift unique-value index locations by -1
    adj_urows = urows - 1
    unique_val_ar[urows] = diffs[adj_urows]
    # put the columns from the unique_val_ar arrays into the combined
    # array according to the concat pattern (tiled pattern per example)
    result_array[:, np.where(concat_pattern == i)[0]] = unique_val_ar
# find the cumulative sum of the combined array (vertical axis)
result_array_cumsums = np.cumsum(result_array, axis=0)
# make the column names for a new dataframe which will contain the
# result_array_cumsums array
tgt_vals = np.repeat(tgt_cols, len(uniques))
u_vals = np.tile(uniques, len(tgt_cols))
new_cols = ['_'.join(x) for x in list(zip(tgt_vals, u_vals))]
# make the dataframe, using the generated column names
df_results = pd.DataFrame(result_array_cumsums, columns=new_cols)
# join the result dataframe with the original dataframe
df_out = df.join(df_results)
print(df_out)
  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0

How to apply ffill to 1s?

I have a dataframe like the one below:
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  0  1  0  0
3  0  0  1  0
I want to convert it into this:
   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  1  0  0
3  1  1  1  0
So far I tried:
import numpy as np

df = df.replace('0', np.NaN)
df = df.fillna(method='ffill').fillna('0')  # note: recent pandas prefers df.ffill() over fillna(method='ffill')
My above code works fine, but I think there should be a better approach to this problem.
Use cumsum on the data converted to numeric, and then replace values using DataFrame.mask:
df = df.mask(df.astype(int).cumsum() >= 1, '1')
print (df)
   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  1  0  0
3  1  1  1  0
Detail:
print (df.astype(int).cumsum())
   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  2  0  0
3  1  2  1  0
Or the same principle in numpy with numpy.where:
arr = df.values.astype(int)
df = pd.DataFrame(np.where(np.cumsum(arr, axis=0) >= 1, '1', '0'),
                  index=df.index,
                  columns=df.columns)
print (df)
   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  1  0  0
3  1  1  1  0
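A minimal alternative sketch, assuming the frame holds the string values '0' and '1' as in the question: cummax propagates the first 1 down each column directly.
df = df.astype(int).cummax().astype(str)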

pandas - pivot table to square matrix

I have this simple dataframe in a data.csv file:
I,C,v
a,b,1
b,a,2
e,a,1
e,c,0
b,d,1
a,e,1
b,f,0
I would like to pivot it and then return a square table (as a matrix). So far I've read the dataframe and built a pivot table with:
df = pd.read_csv('data.csv')
d = pd.pivot_table(df, index='I', columns='C', values='v')
d.fillna(0, inplace=True)
correctly obtaining:
C  a  b  c  d  e  f
I
a  0  1  0  0  1  0
b  2  0  0  1  0  0
e  1  0  0  0  0  0
Now I would like to return a square table, with the missing column labels added as rows, so that the result would be:
C  a  b  c  d  e  f
I
a  0  1  0  0  1  0
b  2  0  0  1  0  0
c  0  0  0  0  0  0
d  0  0  0  0  0  0
e  1  0  0  0  0  0
f  0  0  0  0  0  0
reindex can add rows and columns, and fill missing values with 0:
index = d.index.union(d.columns)
d = d.reindex(index=index, columns=index, fill_value=0)
yields
   a  b  c  d  e  f
a  0  1  0  0  1  0
b  2  0  0  1  0  0
c  0  0  0  0  0  0
d  0  0  0  0  0  0
e  1  0  0  0  0  0
f  0  0  0  0  0  0
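Putting it together as a runnable sketch (the CSV is read from an inline string here, an adaptation so the snippet is self-contained):
import io
import pandas as pd

data = """I,C,v
a,b,1
b,a,2
e,a,1
e,c,0
b,d,1
a,e,1
b,f,0"""

df = pd.read_csv(io.StringIO(data))
d = pd.pivot_table(df, index='I', columns='C', values='v').fillna(0)
# union of row and column labels gives the full square axis
index = d.index.union(d.columns)
d = d.reindex(index=index, columns=index, fill_value=0).astype(int)
print(d)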
