Pandas dataframe: inconsistent data in multiple rows - python

I have something like this:
>>> x = {'id': [1,1,2,2,2,3,4,5,5], 'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']}
>>> df = pd.DataFrame(x)
>>> df
   id value
0   1     a
1   1     a
2   2     b
3   2     b
4   2     c
5   3     d
6   4     e
7   5     f
8   5     g
I want to filter out inconsistent values in this table. For example, the rows with id=2 or id=5 are inconsistent, because the same id is associated with different values. I have read solutions using where or any, but none of them do something like "check whether rows with this id always have the same value".
How can I solve this problem?

You can use groupby and filter. This should give you the ids with inconsistent values.
df.groupby('id').filter(lambda x: x.value.nunique()>1)
   id value
2   2     b
3   2     b
4   2     c
7   5     f
8   5     g

In your case, use groupby + transform with nunique:
unc_df = df[df.groupby('id')['value'].transform('nunique').ne(1)]
   id value
2   2     b
3   2     b
4   2     c
7   5     f
8   5     g
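Putting the transform approach together as a runnable sketch (same toy data as above) that also keeps the consistent rows:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5, 5],
                   'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']})

# Number of distinct values per id, broadcast back onto every row
n_unique = df.groupby('id')['value'].transform('nunique')

inconsistent = df[n_unique.ne(1)]  # rows whose id maps to several values
consistent = df[n_unique.eq(1)]    # rows whose id always has one value
```

The transform result has the same length as df, which is what makes the boolean mask line up row by row.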

I guess you can use drop_duplicates to drop repeated rows based on the id column:
In [599]: df.drop_duplicates('id', keep='first')
Out[599]:
   id value
0   1     a
2   2     b
5   3     d
6   4     e
7   5     f
So the above picks the first value for each duplicated id, and you end up with one row per id in the resulting dataframe.
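If you want to reach the inconsistent ids via drop_duplicates instead, one sketch (an alternative reading, not from the original answer) is to collapse exact duplicate rows first; any id that still appears more than once must map to several distinct values:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5, 5],
                   'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']})

# Keep one copy of each (id, value) pair
pairs = df.drop_duplicates()

# Ids still duplicated after deduplication have multiple distinct values
inconsistent_ids = pairs.loc[pairs['id'].duplicated(keep=False), 'id'].unique()
```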


Sum up multiple columns into one column [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
   a  b   c  d
0  1  2  dd  5
1  2  3  ee  9
2  3  4  ff  1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set param axis=1 to sum across the rows; this will skip non-numeric columns (in recent pandas versions you may need to pass numeric_only=True when dtypes are mixed):
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list = list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
   a  b   c  d  e
0  1  2  dd  5  3
1  2  3  ee  9  5
2  3  4  ff  1  7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8
For longer lists of columns, EdChum's answer is preferred.
Create a list of column names you want to add up.
df['total']=df.loc[:,list_name].sum(axis=1)
If you only want the sum for certain rows, replace ':' with a row selection.
This is a simpler way using iloc to select which columns to sum:
df['f'] = df.iloc[:, 0:2].sum(axis=1)
df['g'] = df.iloc[:, [0, 1]].sum(axis=1)
df['h'] = df.iloc[:, [0, 3]].sum(axis=1)
Produces:
   a  b   c  d   e  f  g   h
0  1  2  dd  5   8  3  3   6
1  2  3  ee  9  14  5  5  11
2  3  4  ff  1   8  7  7   4
I can't find a way to combine a range and specific columns that works e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
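One option that does seem to work for mixing a range with individual positions is NumPy's np.r_ helper (a sketch, not part of the original answer), which concatenates slices and scalar indices into one integer array that iloc accepts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})

# np.r_[0:2, 3] -> array([0, 1, 3]); a range plus a single position
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)
```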
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when my columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
   sum  prod  min  max
0    8    10    1    5
1   14    54    2    9
2    8    12    1    4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
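A caveat worth checking with this one: by default eval returns a new DataFrame rather than modifying the original (a small sketch to illustrate):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})

# eval with an assignment expression returns a copy carrying the new
# column; df itself stays unchanged unless inplace=True is passed.
result = df.eval('e = a + b + d')
```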

Performing operations on certain rows of certain columns in python pandas

I have been trying to solve the following problem for the past while.
I have a dataframe with 7 columns and a variable number of rows, between 10 and 20, that I read in from a csv file. I would like to perform the following operation: divide columns A, B, C, D of the row corresponding to unique_string1 by 4 and add these values to unique_string2's A, B, C, D columns.
            Title Description  A  B  C  D
0  unique_string1              2  1  4  6
1  unique_string2              6  2  4  5
2  unique_string3           B  1  8  8  2
3  unique_string4           B  1  1  2  3
4  unique_string5           C  3  1  2  5
To get values for specific columns in a specific DataFrame row:
vals = df.loc[df['Title']=='unique_string1', ['A', 'B', 'C', 'D']].values
Divide these by 4:
vals /= 4
Add back to 'unique_string2' row in DF:
df.loc[df['Title']=='unique_string2', ['A', 'B', 'C', 'D']] += vals
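Putting the three steps together as a runnable sketch (a trimmed two-row version of the example table, with the columns pre-cast to float so the fractional values assign cleanly):

```python
import pandas as pd

df = pd.DataFrame({'Title': ['unique_string1', 'unique_string2'],
                   'A': [2, 6], 'B': [1, 2], 'C': [4, 4], 'D': [6, 5]})

cols = ['A', 'B', 'C', 'D']
df[cols] = df[cols].astype(float)  # avoid int->float dtype issues on assignment

# Select the source row's values, divide, and add onto the target row
vals = df.loc[df['Title'] == 'unique_string1', cols].values / 4
df.loc[df['Title'] == 'unique_string2', cols] += vals
```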
I would recommend reading the documentation for the DataFrame.loc operator in pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

How to apply a function to ALL columns of a dataframe GROUPWISE? (In python pandas)

How can I apply a function to each column of a dataframe "groupwise"?
I.e. group by the values of one column and calculate e.g. the mean of every other column for each group. The expected output is a dataframe whose index holds the group names and whose values are the per-group, per-column means.
E.g. consider:
df = pd.DataFrame(np.arange(16).reshape(4,4), columns=['A', 'B', 'C', 'D'])
df['group'] = ['a', 'a', 'b','b']
    A   B   C   D group
0   0   1   2   3     a
1   4   5   6   7     a
2   8   9  10  11     b
3  12  13  14  15     b
I want to calculate e.g. np.mean for each column, but groupwise; in this particular example it can be done by:
t = df.groupby('group').agg({'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean })
        A   B   C   D
group
a       2   3   4   5
b      10  11  12  13
However, it requires explicit use of column names 'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean
which is unacceptable for my task, since they can be changed.
As MaxU commented, simpler is groupby + GroupBy.mean:
df1 = df.groupby('group').mean()
print (df1)
        A   B   C   D
group
a       2   3   4   5
b      10  11  12  13
If you need the group names as a regular column rather than the index:
df1 = df.groupby('group', as_index=False).mean()
print (df1)
  group   A   B   C   D
0     a   2   3   4   5
1     b  10  11  12  13
You don't need to explicitly name the columns.
df.groupby('group').agg('mean')
Will produce the mean for each group for each column as requested:
        A   B   C   D
group
a       2   3   4   5
b      10  11  12  13
The below does the job:
df.groupby('group').apply(np.mean, axis=0)
giving back
         A     B     C     D
group
a      2.0   3.0   4.0   5.0
b     10.0  11.0  12.0  13.0
apply takes axis = {0,1} as additional argument, which in turn specifies whether you want to apply the function row-wise or column-wise.
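If you still want agg but without hardcoding column names, one sketch (my own variation, not from the answers above) builds the aggregation dict from whatever columns are present, so renamed or added columns are picked up automatically:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['A', 'B', 'C', 'D'])
df['group'] = ['a', 'a', 'b', 'b']

# Derive the aggregation spec from the dataframe itself
agg_map = {c: 'mean' for c in df.columns if c != 'group'}
t = df.groupby('group').agg(agg_map)
```

This also makes it easy to mix functions later, e.g. mapping some columns to 'mean' and others to 'sum' based on dtype.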

How to groupby column names stored in a list through a loop and store their values in a new variable

I have a dataframe with column names A,B,C,D,e,f,g,h. These column names are stored in a list as cols1=['A','B','C','D','e','f','g','h'].
I have to group by these columns as df.groupby(['A','B','C','D','e']) and store the result in a variable named e,
then again as df.groupby(['A','B','C','D','f']) and store it in a variable named f,
and again as df.groupby(['A','B','C','D','g']), and so on until the end of the list.
This should be done in a loop. Then the groupby.sum() values of columns e, f, g etc. should be stored in new variables so that the values of e, f, g, h can be compared.
Is there any way of doing this in pandas? Thanks in advance.
IIUC you need groupby with sum:
df = pd.DataFrame({'A': [1, 8, 8],
                   'B': [4, 6, 6],
                   'C': [7, 2, 2],
                   'D': [1, 3, 3],
                   'e': [2, 3, 6],
                   'f': [0, 2, 4],
                   'g': [7, 4, 1],
                   'h': [1, 4, 2]})
print (df)
   A  B  C  D  e  f  g  h
0  1  4  7  1  2  0  7  1
1  8  6  2  3  3  2  4  4
2  8  6  2  3  6  4  1  2
cols1=['A','B','C','D','e','f','g','h']
cols11 = cols1[:4]
print (cols11)
['A', 'B', 'C', 'D']
cols12 = cols1[4:]
print (cols12)
['e', 'f', 'g', 'h']
df = df.groupby(cols11)[cols12].sum()
print (df)
         e  f  g  h
A B C D
1 4 7 1  2  0  7  1
8 6 2 3  9  6  5  6
df = df.reset_index(drop=True)
print (df)
   e  f  g  h
0  2  0  7  1
1  9  6  5  6
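If the loop from the question is really needed, a common idiom is to store each grouped sum in a dict keyed by column name instead of separate variables e, f, g, h (a sketch under that assumption, same sample data as above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 8, 8], 'B': [4, 6, 6], 'C': [7, 2, 2],
                   'D': [1, 3, 3], 'e': [2, 3, 6], 'f': [0, 2, 4],
                   'g': [7, 4, 1], 'h': [1, 4, 2]})

key_cols = ['A', 'B', 'C', 'D']
results = {}
for col in ['e', 'f', 'g', 'h']:
    # one grouped sum per trailing column, keyed by its name
    results[col] = df.groupby(key_cols)[col].sum()
```

The dict keeps the results comparable side by side, e.g. pd.concat(results, axis=1) rebuilds the combined table from the answer.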

