Drop specific rows from pandas dataframe according to condition - python

I have a dataframe with several columns [A, B, C, ..., Z]. I want to delete all rows from the dataframe which have the property that their values in columns [B, C, ..., Z] are equal to 0 (integer zero).
Example df:
A B C ... Z
0 3 0 0 ... 0
1 1 0 0 ... 0
2 2 1 2 ... 3 <-- keep only this as it has values other than zero
I tried to do this like so:
df = df[(df.columns[1:] != 0).all()]
I can't get it to work. I am not too experienced with conditions in indexers. I wanted to avoid a solution that chains a zero test for every column. I am sure that there is a more elegant solution to this.
Thanks!
EDIT:
The solution worked for an artificially created dataframe, but when I used it on my df that I got from reading a csv, it failed. The file looks like this:
A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z
0;25310;169;81;0;0;0;12291181;31442;246;0;0;0;0;0;0;0;0;0;251;31696;0;0;329;0;0
1;6252727;20480;82;0;0;0;31088;85;245;0;0;0;0;0;0;0;0;0;20567;331;0;0;329;0;0
2;6032184;10961;82;0;0;0;31024;84;245;0;0;0;0;0;0;0;0;0;11046;330;0;0;329;0;0
3;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
4;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
5;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
6;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
7;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
8;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
9;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
10;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
I read it using the following commands:
import pandas as pd
# retrieve csv file as dataframe
df = pd.read_csv('PATH/TO/FILE',
                 decimal=',',
                 sep=';')
df[list(df)] = df[list(df)].astype('int')
print(df)
df = df[(df.iloc[:, 1:] != 0).all(axis=1)]
print(df)
The first print statement shows that the frame is read correctly, but the second print gives me an empty dataframe. How can this be?

Use iloc to select all columns except the first, then keep only the rows whose values are all non-zero with all(axis=1):
df = df[(df.iloc[:, 1:] != 0).all(axis=1)]
print (df)
A B C Z
2 2 1 2 3
EDIT: Every row of your real data contains at least one zero among the columns after the first, so (df.iloc[:, 1:] != 0).all(axis=1) is False for every row and the result is empty. To drop only the rows in which all of those columns are zero (i.e. keep rows with at least one non-zero value), use any:
df = df[(df.iloc[:, 1:] != 0).any(axis=1)]
print (df)
A B C D E F G H I J ... Q R S T \
0 0 25310 169 81 0 0 0 12291181 31442 246 ... 0 0 0 251
1 1 6252727 20480 82 0 0 0 31088 85 245 ... 0 0 0 20567
2 2 6032184 10961 82 0 0 0 31024 84 245 ... 0 0 0 11046
U V W X Y Z
0 31696 0 0 329 0 0
1 331 0 0 329 0 0
2 330 0 0 329 0 0
[3 rows x 26 columns]
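A minimal, self-contained sketch of the whole flow (a reduced set of columns, with io.StringIO standing in for the real file):
import io
import pandas as pd

csv_data = """A;B;C;D
0;3;0;0
1;0;0;0
2;2;1;3
"""

df = pd.read_csv(io.StringIO(csv_data), sep=';')

# Keep rows with at least one non-zero value outside the first column.
df = df[(df.iloc[:, 1:] != 0).any(axis=1)]
print(df)
#    A  B  C  D
# 0  0  3  0  0
# 2  2  2  1  3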

Related

python pandas group by and aggregate columns

I am using pandas version 0.23.0. I want to use the DataFrame groupby function to generate new aggregated columns using lambda functions.
My data frame looks like
ID Flag Amount User
1 1 100 123345
1 1 55 123346
2 0 20 123346
2 0 30 123347
3 0 50 123348
I want to generate a table which looks like
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM Flag0_User_Count Flag1_User_Count
1 0 2 0 155 0 2
2 2 0 50 0 2 0
3 1 0 50 0 1 0
here:
Flag0_Count is count of Flag = 0
Flag1_Count is count of Flag = 1
Flag0_Amount_SUM is SUM of amount when Flag = 0
Flag1_Amount_SUM is SUM of amount when Flag = 1
Flag0_User_Count is Count of Distinct User when Flag = 0
Flag1_User_Count is Count of Distinct User when Flag = 1
I have tried something like
df.groupby(["ID"])["Flag"].apply(lambda x: sum(x==0)).reset_index()
but it creates a new data frame. This means I would have to do this for every column and then merge the results together into a new data frame.
Is there an easier way to accomplish this?
Use DataFrameGroupBy.agg with a dictionary that maps column names to aggregate functions, then reshape with unstack, flatten the MultiIndex of columns, rename the columns and finally reset_index:
df = (df.groupby(["ID", "Flag"])
        .agg({'Flag': 'size', 'Amount': 'sum', 'User': 'nunique'})
        .unstack(fill_value=0))
# python 3.6+
df.columns = [f'{i}{j}' for i, j in df.columns]
# python below 3.6
# df.columns = ['{}{}'.format(i, j) for i, j in df.columns]
d = {'Flag0': 'Flag0_Count',
     'Flag1': 'Flag1_Count',
     'Amount0': 'Flag0_Amount_SUM',
     'Amount1': 'Flag1_Amount_SUM',
     'User0': 'Flag0_User_Count',
     'User1': 'Flag1_User_Count'}
df = df.rename(columns=d).reset_index()
print (df)
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM \
0 1 0 2 0 155
1 2 2 0 50 0
2 3 1 0 50 0
Flag0_User_Count Flag1_User_Count
0 0 2
1 2 0
2 1 0
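For reference, the sample frame from the question can be built like this, after which the pipeline above reproduces the printed output (a sketch, assuming plain integer dtypes):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3],
                   'Flag': [1, 1, 0, 0, 0],
                   'Amount': [100, 55, 20, 30, 50],
                   'User': [123345, 123346, 123346, 123347, 123348]})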

How to save a pandas dataframe such that there is no delimiter?

I have the following pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.choice([0,1], (6,3)), columns=list('XYZ'))
X Y Z
0 1 0 1
1 1 1 0
2 0 0 0
3 0 1 1
4 0 1 1
5 1 1 1
Let's say I take the transpose and wish to save it
df = df.T
0 1 2 3 4 5
X 1 1 0 0 0 1
Y 0 1 0 1 1 1
Z 1 0 0 1 1 1
So there are three rows. I would like to save it in this format:
X 110001
Y 010111
Z 100111
I tried
df.to_csv("filename.txt", header=None, index=None, sep='')
However, this outputs an error:
TypeError: "delimiter" must be an 1-character string
Is it possible to save the dataframe in this manner, or is there some way to combine all columns into one? What is the most "pandas" solution?
Leave original df alone. Don't transpose.
df = pd.DataFrame(np.random.choice([0,1], (6,3)), columns=list('XYZ'))
df.astype(str).apply(''.join)
X 101100
Y 101001
Z 111110
dtype: object
If you do want to transpose, then you can do something like this:
In [126]: df.T.apply(lambda row: ''.join(map(str, row)), axis=1)
Out[126]:
X 001111
Y 000000
Z 010010
dtype: object
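To actually write either result to disk with nothing between the digits, the joined Series can be passed to to_csv (a sketch; the single space separates the index label from the digit string):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([0, 1], (6, 3)), columns=list('XYZ'))

# Joining each column of the original frame is equivalent to joining
# each row of the transpose.
joined = df.astype(str).apply(''.join)
joined.to_csv('filename.txt', sep=' ', header=False)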

How to use trailing rows on a column for calculations on that same column | Pandas Python

I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0
And now I want to make a new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 1
4 1 0 1
I'm wondering how to do this because I'm trying to recreate this excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70)
where data['a'] = column A and data['b'] = column P.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.
According to your statement, 'a new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column', I can suggest solving it this way:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','3']})
>>> df
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 0
4 1 3 1
But it doesn't look at the 'previous value of that same column'. If you try to reference df['c'].shift(1) inside np.where(), it will raise KeyError: 'c', because column 'c' does not exist yet.
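Because each value of c depends on the previous value of c itself, the original Excel formula =IF(A70=A69,MIN((P70+Q69),1),P70) cannot be fully vectorised with shift; a plain loop is one way to reproduce it (a sketch, assuming numeric rather than string columns):
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 1, 1], 'b': [0, 0, 1, 0, 0]})

# c[i] = min(b[i] + c[i-1], 1) while 'a' is unchanged, else c[i] = b[i]
c = []
for i in range(len(data)):
    if i > 0 and data['a'].iat[i] == data['a'].iat[i - 1]:
        c.append(min(data['b'].iat[i] + c[i - 1], 1))
    else:
        c.append(data['b'].iat[i])
data['c'] = c
print(data)
#    a  b  c
# 0  1  0  0
# 1  1  0  0
# 2  1  1  1
# 3  1  0  1
# 4  1  0  1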

dividing a string into separate columns pandas python

I have seen similar questions, but nothing that really matches my problem. If I have a table of values such as:
value
a
b
b
c
I want to use pandas to add in columns to the table to show for example:
value a b
a 1 0
b 0 1
c 0 0
I have tried the following:
df['a'] = 0
def string_count(indicator):
    if indicator == 'a':
        df['a'] == 1
df['a'].apply(string_count)
But this produces:
0 None
1 None
2 None
3 None
I would like to at least get to the point where the choices are hardcoded in (i.e. I already know that a, b and c appear), but it would be even better if I could look at the column of strings and then insert a column for each unique string.
Am I approaching this the wrong way?
dummies = pd.get_dummies(df.value)
a b c
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
If you only want to display unique occurrences, you can add:
dummies.index = df.value
dummies.drop_duplicates()
a b c
value
a 1 0 0
b 0 1 0
c 0 0 1
Alternatively:
df = df.join(pd.get_dummies(df.value))
value a b c
0 a 1 0 0
1 b 0 1 0
2 b 0 1 0
3 c 0 0 1
Where you could again .drop_duplicates() to only see unique entries from the value column.
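Putting it together (a sketch, assuming the input column is named value as above):
import pandas as pd

df = pd.DataFrame({'value': ['a', 'b', 'b', 'c']})
df = df.join(pd.get_dummies(df.value))
print(df.drop_duplicates())
#   value  a  b  c
# 0     a  1  0  0
# 1     b  0  1  0
# 3     c  0  0  1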

R dcast equivalent in python pandas

I am trying to do the equivalent of the below commands in python:
test <- data.frame(convert_me=c('Convert1','Convert2','Convert3'),
                   values=rnorm(3, 45, 12), age_col=c('23','33','44'))
test
library(reshape2)
t <- dcast(test, values ~ convert_me+age_col, length )
t
That is, this:
convert_me values age_col
Convert1 21.71502 23
Convert2 58.35506 33
Convert3 60.41639 44
becomes this:
values Convert2_33 Convert1_23 Convert3_44
21.71502 0 1 0
58.35506 1 0 0
60.41639 0 0 1
I know that with dummy variables I can get the values of the columns and turn them into column names, but is there a way to merge them (the combination) easily, as R does?
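For reference, a pandas frame equivalent to the R test data can be built like this (a sketch using the same literal values shown above):
import pandas as pd

df = pd.DataFrame({'convert_me': ['Convert1', 'Convert2', 'Convert3'],
                   'values': [21.71502, 58.35506, 60.41639],
                   'age_col': ['23', '33', '44']})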
You can use the crosstab function for this:
In [14]: pd.crosstab(index=df['values'], columns=[df['convert_me'], df['age_col']])
Out[14]:
convert_me Convert1 Convert2 Convert3
age_col 23 33 44
values
21.71502 1 0 0
58.35506 0 1 0
60.41639 0 0 1
or the pivot_table (with len as the aggregating function, but here you have to fill the NaNs with zeros manually using fillna):
In [18]: df.pivot_table(index=['values'], columns=['age_col', 'convert_me'], aggfunc=len).fillna(0)
Out[18]:
age_col 23 33 44
convert_me Convert1 Convert2 Convert3
values
21.71502 1 0 0
58.35506 0 1 0
60.41639 0 0 1
See here for the docs on this: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#pivot-tables-and-cross-tabulations
Most functions in pandas will return a multi-level (hierarchical) index, in this case for the columns. If you want to 'melt' this into one level like in R you can do:
In [15]: df_cross = pd.crosstab(index=df['values'], columns=[df['convert_me'], df['age_col']])
In [16]: df_cross.columns = ["{0}_{1}".format(l1, l2) for l1, l2 in df_cross.columns]
In [17]: df_cross
Out[17]:
Convert1_23 Convert2_33 Convert3_44
values
21.71502 1 0 0
58.35506 0 1 0
60.41639 0 0 1
We can use the pd.get_dummies function. As of pandas 0.22.0, pd.get_dummies is the common way to one-hot encode a DataFrame.
import pandas as pd
df_dummies = pd.get_dummies(
    df[['convert_me', 'age_col']].apply(lambda x: '_'.join(x.astype(str)), axis=1),
    prefix_sep='')
df = pd.concat([df["values"], df_dummies], axis=1)
# Out[39]:
# values Convert1_23 Convert2_33 Convert3_44
# 0 21.71502 1 0 0
# 1 58.35506 0 1 0
# 2 60.41639 0 0 1
