Python: How to pad with zeros?

Assuming we have a dataframe as below:
import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
                   'col2': ['0.5', '0.78', '0.78', '0.4', '2', '9', '2', '7']})
I counted the number of rows for each unique value in Col1 (a has 4 rows, b and c have 2 rows each) by doing:
df.groupby(['Col1']).size()
and I get the output as
Col1
a 4
b 2
c 2
dtype: int64
After this is done, I would like to check which among a, b, and c has the maximum number of rows (in this case, a does) and pad the others (b and c) with zeros to make up the difference between the maximum and the number of rows they have (both b and c have 2 rows each, and since 4 is the maximum, I want to pad b and c with 2 zeros each). The zeros must be added at the end.
I want to pad it with zeros since I want to apply a window of fixed size on all the variables (a, b, c) to plot graphs.

You can create a counter with GroupBy.cumcount, build a MultiIndex, and then DataFrame.reindex by all the combinations created with MultiIndex.from_product:
df1 = df.set_index(['Col1', df.groupby('Col1').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
df2 = df1.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print (df2)
Col1 col2
0 a 0.5
1 a 0.78
2 a 0.78
3 a 0.4
4 b 2
5 b 9
6 b 0
7 b 0
8 c 2
9 c 7
10 c 0
11 c 0
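If the end goal is a fixed-size window per variable for plotting (as described in the question), the padded frame can be reshaped so that each group becomes its own column. A minimal sketch, continuing from df2 above and assuming col2 should be numeric:
# col2 holds strings in the sample data, so convert before plotting
df2['col2'] = pd.to_numeric(df2['col2'])
# one column per group, one row per position within the group
wide = (df2.assign(pos=df2.groupby('Col1').cumcount())
           .pivot(index='pos', columns='Col1', values='col2'))
print(wide)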

Same logic as Jez's answer using cumcount, but with a stack/unstack chain:
(df.assign(key2=df.groupby('Col1').cumcount())
   .set_index(['Col1', 'key2'])
   .unstack(fill_value=0)
   .stack()
   .reset_index('Col1'))
Out[1047]:
Col1 col2
key2
0 a 0.5
1 a 0.78
2 a 0.78
3 a 0.4
0 b 2
1 b 9
2 b 0
3 b 0
0 c 2
1 c 7
2 c 0
3 c 0
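If you prefer the flat 0..11 index of the first answer, a final reset_index(drop=True) removes the helper key2 level:
(df.assign(key2=df.groupby('Col1').cumcount())
   .set_index(['Col1', 'key2'])
   .unstack(fill_value=0)
   .stack()
   .reset_index('Col1')
   .reset_index(drop=True))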

Related

Sum up multiple columns into one column [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set the param axis=1 to sum across the rows; this will ignore non-numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up:
df['total']=df.loc[:,list_name].sum(axis=1)
If you only want the sum for certain rows, replace the ':' with the row selection you need.
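For example, with the df above (list_name here is just an illustrative name), summing all rows of columns a, b and d:
list_name = ['a', 'b', 'd']                      # columns to add up
df['total'] = df.loc[:, list_name].sum(axis=1)   # row-wise sum of only those columns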
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
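Not in the original answer, but one way to combine a positional range with individual positions is NumPy's np.r_ index helper, if that fits your use case:
import numpy as np

# positions 0 and 1 (the slice 0:2) plus position 3, i.e. columns 'a', 'b' and 'd'
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)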
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) with columns award_1, award_2 and award_3, and I want to create a new column that shows the sum of awards for each row.
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result: the frame now contains the new award_sum column.
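A minimal runnable sketch of that usage, with a small made-up awards_frame (the names and values here are hypothetical):
import pandas as pd

awards_frame = pd.DataFrame({'name': ['ann', 'bob'],
                             'award_1': [1, 0],
                             'award_2': [2, 3],
                             'award_3': [0, 1]})

# adds the 'award_sum' column in place and also returns the frame
sum_frame_by_column(awards_frame, 'award_sum', ['award_1', 'award_2', 'award_3'])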
The following syntax helped me when my columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
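Note that eval returns a new DataFrame by default, so assign the result back (or pass inplace=True) to keep the column:
df = df.eval('e = a + b + d')   # or: df.eval('e = a + b + d', inplace=True)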

Performing operations on certain rows of certain columns in python pandas

I have been trying to solve the following problem for the past while.
I have a dataframe with 7 columns and a variable number of rows, between 10 and 20, that I read in from a CSV file. I would like to perform the following operation: divide columns A, B, C and D of the row corresponding to unique_string1 by 4 and add these values to unique_string2's A, B, C and D columns.
Title Description A B C D
0 unique_string1 2 1 4 6
1 unique_string2 6 2 4 5
2 unique_string3 B 1 8 8 2
3 unique_string4 B 1 1 2 3
4 unique_string5 C 3 1 2 5
To get values for specific columns in a specific DataFrame row:
vals = df.loc[df['Title']=='unique_string1', ['A', 'B', 'C', 'D']].values
Divide these by 4:
vals /= 4
Add back to 'unique_string2' row in DF:
df.loc[df['Title']=='unique_string2', ['A', 'B', 'C', 'D']] += vals
I would recommend reading the documentation for the DataFrame.loc operator in pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
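Putting those steps together on a small frame (a sketch using only the Title and numeric columns; the values are taken from the first two rows of the question, stored as floats so the divided values fit):
import pandas as pd

df = pd.DataFrame({'Title': ['unique_string1', 'unique_string2'],
                   'A': [2.0, 6.0], 'B': [1.0, 2.0],
                   'C': [4.0, 4.0], 'D': [6.0, 5.0]})

vals = df.loc[df['Title'] == 'unique_string1', ['A', 'B', 'C', 'D']].values
vals /= 4
df.loc[df['Title'] == 'unique_string2', ['A', 'B', 'C', 'D']] += vals
print(df)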

Unexpected row getting changed in pandas loc assignment

I want to copy a portion of a pandas dataframe onto a different portion, overwriting the existing values there. I am using .loc but more rows are changing than the ones I am referencing.
My example:
df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D', 'E'],
    'col2': range(1, 6),
    'col3': range(6, 11)
})
print(df)
col1 col2 col3
0 A 1 6
1 B 2 7
2 C 3 8
3 D 4 9
4 E 5 10
I want to write the values of col2 and col3 from the C and D rows onto the A and B rows. Using .loc:
df.loc[0:2, ["col2", "col3"]] = df.loc[2:4, ["col2", "col3"]].values
print(df)
col1 col2 col3
0 A 3 8
1 B 4 9
2 C 5 10
3 D 4 9
4 E 5 10
This does what I want for rows A and B, but row C has also changed. I expect only the first two rows to change, i.e. my expected output is
col1 col2 col3
0 A 3 8
1 B 4 9
2 C 3 8
3 D 4 9
4 E 5 10
Why did the C row also change, and how may I do this with only changing the first two rows?
Unlike list slicing, pandas.DataFrame.loc slicing is inclusive on both ends. As the docs warn:
Warning: Note that contrary to usual python slices, both the start and the stop are included.
So you should do:
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values
In addition, you can pass explicit lists of labels; this way the rows need not be consecutive:
df.loc[[0,1], ["col2", "col3"]] = df.loc[[2,3], ["col2", "col3"]].values
You went too far with the indices:
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values

Pandas dataframe inconsistent data in multiple rows

I have something like this:
>>> x = {'id': [1,1,2,2,2,3,4,5,5], 'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']}
>>> df = pd.DataFrame(x)
>>> df
id value
0 1 a
1 1 a
2 2 b
3 2 b
4 2 c
5 3 d
6 4 e
7 5 f
8 5 g
I want to filter inconsistent values in this table. For example, the rows with id=2 or id=5 are inconsistent, because the same id is associated with different values. I have read solutions involving where or any, but they do not cover something like "checking whether rows with the same id always have the same value".
How can I solve this problem?
You can use groupby and filter. This should give you the ids with inconsistent values.
df.groupby('id').filter(lambda x: x.value.nunique()>1)
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
In your case, we can do groupby + transform with nunique:
unc_df=df[df.groupby('id').value.transform('nunique').ne(1)]
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
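If you instead want to keep only the consistent ids, flip the condition:
df[df.groupby('id').value.transform('nunique').eq(1)]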
I guess you can use drop_duplicates to drop repeated rows based on the id column:
In [599]: df.drop_duplicates('id', keep='first')
Out[599]:
id value
0 1 a
2 2 b
5 3 d
6 4 e
7 5 f
So the above will pick the first value for each duplicated id, and you will have 1 row per id in the resulting dataframe.

How to apply function to ALL columns of dataframe GROUPWISELY ? (In python pandas)

How do I apply a function to each column of a dataframe "groupwisely"?
I.e. group by the values of one column and calculate e.g. the mean of the other columns for each group. The expected output is a dataframe whose index holds the group names and whose values are the per-group, per-column means.
E.g. consider:
df = pd.DataFrame(np.arange(16).reshape(4,4), columns=['A', 'B', 'C', 'D'])
df['group'] = ['a', 'a', 'b','b']
A B C D group
0 0 1 2 3 a
1 4 5 6 7 a
2 8 9 10 11 b
3 12 13 14 15 b
I want to calculate e.g. np.mean for each column, but "groupwisely"; in this particular example it can be done by:
t = df.groupby('group').agg({'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean })
A B C D
group
a 2 3 4 5
b 10 11 12 13
However, it requires the explicit use of the column names ('A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean), which is unacceptable for my task, since they can change.
As MaxU commented, simpler is groupby + GroupBy.mean:
df1 = df.groupby('group').mean()
print (df1)
A B C D
group
a 2 3 4 5
b 10 11 12 13
If you need the group column as a regular column instead of the index:
df1 = df.groupby('group', as_index=False).mean()
print (df1)
group A B C D
0 a 2 3 4 5
1 b 10 11 12 13
You don't need to explicitly name the columns.
df.groupby('group').agg('mean')
Will produce the mean for each group for each column as requested:
A B C D
group
a 2 3 4 5
b 10 11 12 13
The below does the job:
df.groupby('group').apply(np.mean, axis=0)
giving back
A B C D
group
a 2.0 3.0 4.0 5.0
b 10.0 11.0 12.0 13.0
apply forwards axis={0,1} as an additional argument to the function, which in turn specifies whether the function is applied row-wise or column-wise within each group.
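One caution, depending on your pandas version: the non-numeric group column may still be part of each group that np.mean sees, which can raise a TypeError on newer pandas. If you hit that, restricting the built-in mean to numeric columns is a safe alternative:
df.groupby('group').mean(numeric_only=True)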
