Performing operations on certain rows of certain columns in Python pandas

I have been trying to solve the following problem for a while.
I have a dataframe with 7 columns and a variable number of rows (between 10 and 20) that I read in from a CSV file. I would like to perform the following operation: divide columns A, B, C, D of the row corresponding to unique_string1 by 4 and add these values to unique_string2's A, B, C, D columns.
            Title Description  A  B  C  D
0  unique_string1         NaN  2  1  4  6
1  unique_string2         NaN  6  2  4  5
2  unique_string3           B  1  8  8  2
3  unique_string4           B  1  1  2  3
4  unique_string5           C  3  1  2  5

To get the values of specific columns in a specific DataFrame row:
vals = df.loc[df['Title'] == 'unique_string1', ['A', 'B', 'C', 'D']].values
Divide these by 4 (use a plain division rather than /=, since in-place true division fails on integer NumPy arrays):
vals = vals / 4
Add them back onto the 'unique_string2' row in the DataFrame:
df.loc[df['Title'] == 'unique_string2', ['A', 'B', 'C', 'D']] += vals
I would recommend reading the documentation for the DataFrame.loc indexer in pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
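Putting the pieces together, a minimal runnable sketch of the whole operation (the frame is rebuilt inline here rather than read from the asker's CSV, and the numeric columns are cast to float so the divided values fit):

import pandas as pd

# Rebuild a small frame like the one in the question
df = pd.DataFrame({
    'Title': ['unique_string1', 'unique_string2'],
    'A': [2, 6], 'B': [1, 2], 'C': [4, 4], 'D': [6, 5],
})

cols = ['A', 'B', 'C', 'D']
df[cols] = df[cols].astype(float)  # fractional results need a float dtype

# Row for unique_string1, divided by 4
vals = df.loc[df['Title'] == 'unique_string1', cols].values / 4

# Add it onto the unique_string2 row in place
df.loc[df['Title'] == 'unique_string2', cols] += vals
print(df)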

Sum up multiple columns into one column [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out[1]:
   a  b   c  d
0  1  2  dd  5
1  2  3  ee  9
2  3  4  ff  1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set param axis=1 to sum across the rows; this ignores non-numeric columns (at least on the pandas version used here; see the note below the example):
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8
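Note for newer pandas: since pandas 2.0 the numeric_only default changed, and summing a frame that still contains string columns across axis=1 will generally raise a TypeError rather than silently skipping them. A minimal sketch of the adjusted call:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})

# numeric_only=True skips the string column 'c' on recent pandas
df['e'] = df.sum(axis=1, numeric_only=True)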
If you want to sum only specific columns, you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list = list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
   a  b   c  d  e
0  1  2  dd  5  3
1  2  3  ee  9  5
2  3  4  ff  1  7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up, e.g. list_name = ['a', 'b', 'd']:
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum for certain rows, replace the ':' with a row selection.
This is a simpler way using iloc to select which columns to sum:
df['f'] = df.iloc[:, 0:2].sum(axis=1)
df['g'] = df.iloc[:, [0, 1]].sum(axis=1)
df['h'] = df.iloc[:, [0, 3]].sum(axis=1)
Produces:
   a  b   c  d   e  f  g   h
0  1  2  dd  5   8  3  3   6
1  2  3  ee  9  14  5  5  11
2  3  4  ff  1   8  7  7   4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i'] = df.iloc[:, [[0:2], 3]].sum(axis=1)  # invalid syntax
df['i'] = df.iloc[:, [0:2, 3]].sum(axis=1)    # invalid syntax
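One way that does work for combining a range with individual positions (not from the original answers, but a standard NumPy idiom) is np.r_, which concatenates slices and indices into one integer array:

import numpy as np

# np.r_[0:2, 3] evaluates to array([0, 1, 3])
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)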
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
Suppose you have a dataframe awards_frame with columns award_1, award_2 and award_3 (shown as a screenshot in the original answer), and you want a new column with the sum of the awards in each row.
Usage:
Simply pass awards_frame into the function, specifying the name of the new column and a list of the column names to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1', 'award_2', 'award_3'])
The result is the frame with the extra award_sum column.
The following syntax helped me when my columns are in sequence (note that this slices the underlying NumPy array, so it returns an array rather than adding a column):
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
   sum  prod  min  max
0    8    10    1    5
1   14    54    2    9
2    8    12    1    4
The shortest and simplest way here is to use eval; note that by default it returns a new frame, so assign the result back:
df = df.eval('e = a + b + d')

Pandas dataframe inconsistent data in multiple rows

I have something like that:
>>> x = {'id': [1,1,2,2,2,3,4,5,5], 'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']}
>>> df = pd.DataFrame(x)
>>> df
   id value
0   1     a
1   1     a
2   2     b
3   2     b
4   2     c
5   3     d
6   4     e
7   5     f
8   5     g
I want to filter out the inconsistent values in this table. For example, the rows with id=2 or id=5 are inconsistent, because the same id is associated with different values. I have read solutions about where or any, but nothing like "check whether the rows with this id always have the same value".
How can I solve this problem?
You can use groupby and filter. This should give you the ids with inconsistent values:
df.groupby('id').filter(lambda x: x.value.nunique()>1)
   id value
2   2     b
3   2     b
4   2     c
7   5     f
8   5     g
In your case, we do groupby plus transform with nunique:
unc_df = df[df.groupby('id').value.transform('nunique').ne(1)]
   id value
2   2     b
3   2     b
4   2     c
7   5     f
8   5     g
I guess you can use drop_duplicates to drop the repetitive rows based on the id column:
In [599]: df.drop_duplicates('id', keep='first')
Out[599]:
   id value
0   1     a
2   2     b
5   3     d
6   4     e
7   5     f
The above picks the first value for each duplicated id, so you end up with one row per id in the resulting dataframe.
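Coming back to the original goal of filtering the inconsistent rows out, a small sketch building on the groupby/filter answer above (not part of the original answers) keeps only the ids whose rows all share one value:

# Keep only the ids whose rows all have the same value
consistent = df.groupby('id').filter(lambda g: g['value'].nunique() == 1)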

Python: How to pad with zeros?

Assuming we have a dataframe as below:
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
                   'col2': ['0.5', '0.78', '0.78', '0.4', '2', '9', '2', '7']})
I counted the number of rows for each of the unique values in Col1 (a has 4 rows, b and c have 2 rows each) by doing:
df.groupby(['Col1']).size()
and I get the output as
Col1
a 4
b 2
c 2
dtype: int64
After this is done, I would like to check which among a, b and c has the maximum number of rows (in this case, a) and pad the others (b and c) with zeros to cover the difference between the maximum count and their own row counts (b and c have 2 rows each, and since 4 is the maximum, each needs 2 more zeros). The zeros must be added at the end.
I want to pad it with zeros since I want to apply a window of fixed size on all the variables (a, b, c) to plot graphs.
You can create a counter with GroupBy.cumcount, build a MultiIndex, and reindex by all the combinations created by MultiIndex.from_product:
# index by (Col1, position within the group)
df1 = df.set_index(['Col1', df.groupby('Col1').cumcount()])
# every (group, position) combination, padded out to the largest group
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
df2 = df1.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print(df2)
   Col1  col2
0     a   0.5
1     a  0.78
2     a  0.78
3     a   0.4
4     b     2
5     b     9
6     b     0
7     b     0
8     c     2
9     c     7
10    c     0
11    c     0
Same logic as Jez's answer using cumcount, but with an unstack/stack chain:
(df.assign(key2=df.groupby('Col1').cumcount())
   .set_index(['Col1', 'key2'])
   .unstack(fill_value=0)
   .stack()
   .reset_index('Col1'))
Out[1047]:
     Col1  col2
key2
0       a   0.5
1       a  0.78
2       a  0.78
3       a   0.4
0       b     2
1       b     9
2       b     0
3       b     0
0       c     2
1       c     7
2       c     0
3       c     0
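A more explicit alternative sketch (not from the original answers): compute the largest group size first, then reindex each group's values out to that length:

# Length of the largest group
max_len = df.groupby('Col1').size().max()

# Pad each group's col2 values with zeros up to max_len
padded = (df.groupby('Col1')['col2']
            .apply(lambda s: s.reset_index(drop=True)
                              .reindex(range(max_len), fill_value=0))
            .reset_index(level=0)
            .reset_index(drop=True))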

CSV missing columns with Pandas DataFrame

I want to read a CSV file into a pandas dataframe.
My CSV file has the following format:
a b c d
0 1 2 3 4 5
1 2 3 4 5 6
When I read the csv with Pandas I get the following dataframe
a b c d
0 1 2 3 4 5
1 2 3 4 5 6
When I execute print df.columns
I get something like:
Index([u'a', u'b', u'c', u'd'], dtype='object')
And when I execute print df.iloc[0]
I get:
a 2
b 3
c 4
d 5
Name: (0, 1)
I would like to have a dataframe like:
a b c d col1 col2
0 1 2 3 4 5
1 2 3 4 5 6
I don't know in advance how many columns I will have to add, but I need as many columns as there are values in the first line after the header. (Because the header has fewer names than the rows have values, pandas used the extra leading values of each row as a MultiIndex, which is why the row above is named (0, 1).) How can I achieve that?
One way to do this would be to read the data twice: once with the header row skipped (to get all the values), and once reading only the column names with all data rows skipped. The file path below is a placeholder:
df = pd.read_csv('data.csv', header=None, skiprows=1)
columns = pd.read_csv('data.csv', nrows=0).columns.tolist()
columns
Output
['a', 'b', 'c', 'd']
Now find the number of missing columns and use a list comprehension to build the new column names:
num_missing_cols = len(df.columns) - len(columns)
new_cols = ['col' + str(i + 1) for i in range(num_missing_cols)]
df.columns = columns + new_cols
df
   a  b  c  d  col1  col2
0  0  1  2  3     4     5
1  1  2  3  4     5     6
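For reference, a self-contained version of the same approach; the sample data is embedded through io.StringIO so the sketch runs as-is (a real file path would take its place):

import io
import pandas as pd

# Stand-in for the csv: 4 header names, 6 values per data row
raw = 'a,b,c,d\n0,1,2,3,4,5\n1,2,3,4,5,6\n'

df = pd.read_csv(io.StringIO(raw), header=None, skiprows=1)
columns = pd.read_csv(io.StringIO(raw), nrows=0).columns.tolist()

# Name the surplus columns col1, col2, ...
num_missing_cols = len(df.columns) - len(columns)
df.columns = columns + ['col' + str(i + 1) for i in range(num_missing_cols)]
print(df)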

