Python: How to pad with zeros?

Assuming we have a dataframe as below:
import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
                   'col2': ['0.5', '0.78', '0.78', '0.4', '2', '9', '2', '7']})
I counted the number of rows for each unique value in Col1 (a has 4 rows, b and c have 2 rows each) by doing:
df.groupby(['Col1']).size()
and I get the output as
Col1
a 4
b 2
c 2
dtype: int64
After this is done, I would like to check which among a, b, and c has the maximum number of rows (in this case, a does) and pad the others (b and c) with zeros to make up the difference between the maximum and the number of rows they have (both b and c have 2 rows each, and since 4 is the maximum, I want to pad b and c with 2 zeros each). The zeros must be added at the end.
I want to pad it with zeros since I want to apply a window of fixed size on all the variables (a, b, c) to plot graphs.

You can create a counter with GroupBy.cumcount, build a MultiIndex, and then DataFrame.reindex by all the combinations created with MultiIndex.from_product:
df1 = df.set_index(['Col1', df.groupby('Col1').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
df2 = df1.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print (df2)
Col1 col2
0 a 0.5
1 a 0.78
2 a 0.78
3 a 0.4
4 b 2
5 b 9
6 b 0
7 b 0
8 c 2
9 c 7
10 c 0
11 c 0
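If the end goal is a fixed-size window per variable for plotting (as described in the question), the padded frame can be reshaped so that each group becomes its own column. A minimal sketch, continuing from df2 above and assuming col2 should be numeric:
# col2 holds strings in the sample data, so convert before plotting
df2['col2'] = pd.to_numeric(df2['col2'])
# one column per group, one row per position within the group
wide = (df2.assign(pos=df2.groupby('Col1').cumcount())
           .pivot(index='pos', columns='Col1', values='col2'))
print(wide)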

Same logic as Jez's answer using cumcount, but with a stack/unstack chain:
(df.assign(key2=df.groupby('Col1').cumcount())
   .set_index(['Col1', 'key2'])
   .unstack(fill_value=0)
   .stack()
   .reset_index('Col1'))
Out[1047]:
Col1 col2
key2
0 a 0.5
1 a 0.78
2 a 0.78
3 a 0.4
0 b 2
1 b 9
2 b 0
3 b 0
0 c 2
1 c 7
2 c 0
3 c 0
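If you prefer the flat 0..11 index of the first answer, a final reset_index(drop=True) removes the helper key2 level:
(df.assign(key2=df.groupby('Col1').cumcount())
   .set_index(['Col1', 'key2'])
   .unstack(fill_value=0)
   .stack()
   .reset_index('Col1')
   .reset_index(drop=True))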

Related

Sum up multiple columns into one column [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set the param axis=1 to sum across the rows; this will ignore non-numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up:
df['total']=df.loc[:,list_name].sum(axis=1)
If you only want the sum for certain rows, replace the ':' with the row selection you need.
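For example, with the df above (list_name here is just an illustrative name), summing all rows of columns a, b and d:
list_name = ['a', 'b', 'd']                      # columns to add up
df['total'] = df.loc[:, list_name].sum(axis=1)   # row-wise sum of only those columns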
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
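Not in the original answer, but one way to combine a positional range with individual positions is NumPy's np.r_ index helper, if that fits your use case:
import numpy as np

# positions 0 and 1 (the slice 0:2) plus position 3, i.e. columns 'a', 'b' and 'd'
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)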
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) with columns award_1, award_2 and award_3, and I want to create a new column that shows the sum of awards for each row.
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result: the frame now contains the new award_sum column.
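A minimal runnable sketch of that usage, with a small made-up awards_frame (the names and values here are hypothetical):
import pandas as pd

awards_frame = pd.DataFrame({'name': ['ann', 'bob'],
                             'award_1': [1, 0],
                             'award_2': [2, 3],
                             'award_3': [0, 1]})

# adds the 'award_sum' column in place and also returns the frame
sum_frame_by_column(awards_frame, 'award_sum', ['award_1', 'award_2', 'award_3'])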
The following syntax helped me when my columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
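Note that eval returns a new DataFrame by default, so assign the result back (or pass inplace=True) to keep the column:
df = df.eval('e = a + b + d')   # or: df.eval('e = a + b + d', inplace=True)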

Performing operations on certain rows of certain columns in python pandas

I have been trying to solve the following problem for the past while.
I have a dataframe with 7 columns and a variable number of rows, between 10 and 20, that I read in from a CSV file. I would like to perform the following operation: divide columns A, B, C and D of the row corresponding to unique_string1 by 4 and add these values to unique_string2's A, B, C and D columns.
Title Description A B C D
0 unique_string1 2 1 4 6
1 unique_string2 6 2 4 5
2 unique_string3 B 1 8 8 2
3 unique_string4 B 1 1 2 3
4 unique_string5 C 3 1 2 5
To get values for specific columns in a specific DataFrame row:
vals = df.loc[df['Title']=='unique_string1', ['A', 'B', 'C', 'D']].values
Divide these by 4:
vals /= 4
Add back to 'unique_string2' row in DF:
df.loc[df['Title']=='unique_string2', ['A', 'B', 'C', 'D']] += vals
I would recommend reading the documentation for the DataFrame.loc operator in pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
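Putting those steps together on a small frame (a sketch using only the Title and numeric columns; the values are taken from the first two rows of the question, stored as floats so the divided values fit):
import pandas as pd

df = pd.DataFrame({'Title': ['unique_string1', 'unique_string2'],
                   'A': [2.0, 6.0], 'B': [1.0, 2.0],
                   'C': [4.0, 4.0], 'D': [6.0, 5.0]})

vals = df.loc[df['Title'] == 'unique_string1', ['A', 'B', 'C', 'D']].values
vals /= 4
df.loc[df['Title'] == 'unique_string2', ['A', 'B', 'C', 'D']] += vals
print(df)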

Unexpected row getting changed in pandas loc assignment

I want to copy a portion of a pandas dataframe onto a different portion, overwriting the existing values there. I am using .loc but more rows are changing than the ones I am referencing.
My example:
df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D', 'E'],
    'col2': range(1, 6),
    'col3': range(6, 11)
})
print(df)
col1 col2 col3
0 A 1 6
1 B 2 7
2 C 3 8
3 D 4 9
4 E 5 10
I want to write the values of col2 and col3 from the C and D rows onto the A and B rows. Using .loc:
df.loc[0:2, ["col2", "col3"]] = df.loc[2:4, ["col2", "col3"]].values
print(df)
col1 col2 col3
0 A 3 8
1 B 4 9
2 C 5 10
3 D 4 9
4 E 5 10
This does what I want for rows A and B, but row C has also changed. I expect only the first two rows to change, i.e. my expected output is
col1 col2 col3
0 A 3 8
1 B 4 9
2 C 3 8
3 D 4 9
4 E 5 10
Why did the C row also change, and how may I do this with only changing the first two rows?
Unlike list slicing, pandas.DataFrame.loc slicing is inclusive on both ends. As the docs warn:
Warning: Note that contrary to usual python slices, both the start and the stop are included.
So you should do:
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values
In addition, you can pass explicit lists of labels; this way the rows need not be consecutive:
df.loc[[0,1], ["col2", "col3"]] = df.loc[[2,3], ["col2", "col3"]].values
You went too far with the indices:
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values

Pandas dataframe inconsistent data in multiple rows

I have something like this:
>>> x = {'id': [1,1,2,2,2,3,4,5,5], 'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']}
>>> df = pd.DataFrame(x)
>>> df
id value
0 1 a
1 1 a
2 2 b
3 2 b
4 2 c
5 3 d
6 4 e
7 5 f
8 5 g
I want to filter inconsistent values in this table. For example, the rows with id=2 or id=5 are inconsistent, because the same id is associated with different values. I have read solutions involving where or any, but they do not cover something like "checking whether rows with the same id always have the same value".
How can I solve this problem?
You can use groupby and filter. This should give you the ids with inconsistent values.
df.groupby('id').filter(lambda x: x.value.nunique()>1)
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
In your case, we can do groupby + transform with nunique:
unc_df=df[df.groupby('id').value.transform('nunique').ne(1)]
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
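If you instead want to keep only the consistent ids, flip the condition:
df[df.groupby('id').value.transform('nunique').eq(1)]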
I guess you can use drop_duplicates to drop repeated rows based on the id column:
In [599]: df.drop_duplicates('id', keep='first')
Out[599]:
id value
0 1 a
2 2 b
5 3 d
6 4 e
7 5 f
So the above will pick the first value for each duplicated id, and you will have 1 row per id in the resulting dataframe.

How to apply function to ALL columns of dataframe GROUPWISELY ? (In python pandas)

How do I apply a function to each column of a dataframe "groupwisely"?
I.e. group by the values of one column and calculate e.g. the mean of the other columns for each group. The expected output is a dataframe whose index holds the group names and whose values are the per-group, per-column means.
E.g. consider:
df = pd.DataFrame(np.arange(16).reshape(4,4), columns=['A', 'B', 'C', 'D'])
df['group'] = ['a', 'a', 'b','b']
A B C D group
0 0 1 2 3 a
1 4 5 6 7 a
2 8 9 10 11 b
3 12 13 14 15 b
I want to calculate e.g. np.mean for each column, but "groupwisely"; in this particular example it can be done by:
t = df.groupby('group').agg({'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean })
A B C D
group
a 2 3 4 5
b 10 11 12 13
However, it requires the explicit use of the column names ('A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean), which is unacceptable for my task, since they can change.
As MaxU commented, simpler is groupby + GroupBy.mean:
df1 = df.groupby('group').mean()
print (df1)
A B C D
group
a 2 3 4 5
b 10 11 12 13
If you need the group column as a regular column instead of the index:
df1 = df.groupby('group', as_index=False).mean()
print (df1)
group A B C D
0 a 2 3 4 5
1 b 10 11 12 13
You don't need to explicitly name the columns.
df.groupby('group').agg('mean')
Will produce the mean for each group for each column as requested:
A B C D
group
a 2 3 4 5
b 10 11 12 13
The below does the job:
df.groupby('group').apply(np.mean, axis=0)
giving back
A B C D
group
a 2.0 3.0 4.0 5.0
b 10.0 11.0 12.0 13.0
apply forwards axis={0,1} as an additional argument to the function, which in turn specifies whether the function is applied row-wise or column-wise within each group.
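One caution, depending on your pandas version: the non-numeric group column may still be part of each group that np.mean sees, which can raise a TypeError on newer pandas. If you hit that, restricting the built-in mean to numeric columns is a safe alternative:
df.groupby('group').mean(numeric_only=True)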
