Pandas .iloc indexing coupled with boolean indexing in a DataFrame - Python

I looked into existing threads regarding indexing, but none of them address the present use case.
I would like to alter specific values in a DataFrame based on their position, i.e., set the values in the second column, first through fourth rows, to NaN, and the values in the third column, first and second rows, to NaN. Say we have the following `DataFrame`:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.standard_normal((7, 3)))
print(df)
          0         1         2
0 -1.102888  1.293658 -2.290175
1 -1.826924 -0.661667 -1.067578
2  1.015479  0.058240 -0.228613
3 -0.760368  0.256324 -0.259946
4  0.496348  0.437496  0.646149
5  0.717212  0.481687 -2.640917
6 -0.141584 -1.997986  1.226350
And I want to alter df as below, with the least amount of code:
          0         1         2
0 -1.102888       NaN       NaN
1 -1.826924       NaN       NaN
2  1.015479       NaN -0.228613
3 -0.760368       NaN -0.259946
4  0.496348  0.437496  0.646149
5  0.717212  0.481687 -2.640917
6 -0.141584 -1.997986  1.226350
I tried using boolean indexing with .loc, but it resulted in an error:
df.loc[(:2,1:) & (2:4,1)] = np.nan
# exception message:
#     df.loc[(:2,1:) & (2:4,1)] = np.nan
#              ^
# SyntaxError: invalid syntax
I also thought about converting the DataFrame to a NumPy ndarray, but then I wouldn't know how to use boolean indexing in that case.

One way is to define the requirement in a dict and assign in a loop, to keep it clear:
d = {1: 4, 2: 2}  # column position -> number of leading rows to set to NaN
for col, val in d.items():
    df.iloc[:val, col] = np.nan
print(df)
          0         1         2
0 -1.102888       NaN       NaN
1 -1.826924       NaN       NaN
2  1.015479       NaN -0.228613
3 -0.760368       NaN -0.259946
4  0.496348  0.437496  0.646149
5  0.717212  0.481687 -2.640917
6 -0.141584 -1.997986  1.226350
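For just two rectangular regions, the loop can also be unrolled into two direct positional assignments, which is arguably the least code (a minimal sketch of the same operation):
df.iloc[:4, 1] = np.nan  # second column, first four rows
df.iloc[:2, 2] = np.nan  # third column, first two rows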

Related

How to sum duplicate columns in a dataframe and return NaN if at least one value is NaN

I have a dataframe with duplicate columns (number not known a priori) like this example:
   a    a  a  b  b
0  1  1.0  1  1  1
1  1  NaN  1  1  1
I need to be able to aggregate the columns by summing their values (by rows) and returning NaN if at least one value, in one of the columns among the duplicates, is NaN.
I have tried this code:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()
The result I get is as follows, but it does not return NaN in the second row of column 'a':
     a  b
0  3.0  2
1  2.0  2
In the documentation of pandas.DataFrame.sum there is a skipna parameter that might suit my case. But I am using pandas.core.groupby.GroupBy.sum, which does not have that parameter; it does have min_count, which does what I want, except the count is not known in advance and would be different for each duplicated column.
For example, min_count=3 solves the problem for column 'a', but obviously returns NaN for the whole of column 'b'.
The result I want to achieve is:
     a    b
0  3.0  2.0
1  NaN  2.0
One workaround might be to use apply to get the DataFrame.sum:
df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))
Output:
a b
0 3.0 2.0
1 NaN 2.0
Another possible solution, relying on the fact that the built-in sum() propagates NaN, unlike Series.sum():
cols, ldf = df.columns.unique(), len(df)
pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (ldf, len(cols))),  # row-major: one row per original row
    columns=cols)
Output:
a b
0 3.0 2.0
1 NaN 2.0
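If you would rather go through min_count, as the question contemplates, the count does not have to be known in advance; it can be derived per duplicate group (a sketch of the same idea, assuming df as defined above):
# Each group g holds all columns sharing one label; requiring
# min_count == g.shape[1] makes any NaN in the group yield NaN.
df.groupby(level=0, axis=1).apply(lambda g: g.sum(axis=1, min_count=g.shape[1]))
This produces the same output as the apply/skipna=False workaround.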

Iterate through rows and identify which column is true; assign a new column the name of the column header

I have the following DataFrame:
Index  Time Lost  Cause 1  Cause 2  Cause 3
0      40         x        NaN      NaN
1      15         NaN      x        NaN
2      65         x        NaN      NaN
3      10         NaN      NaN      x
There is only one "X" per row, which identifies the cause of the time lost in that row. I am trying to iterate through each row (and column) to determine which column holds the "X". I would then like to add a "Type" column containing the name of the column header that was true for each row. This is the result I would like:
Index  Time Lost  Cause 1  Cause 2  Cause 3  Type
0      40         x        NaN      NaN      Cause 1
1      15         NaN      x        NaN      Cause 2
2      65         x        NaN      NaN      Cause 1
3      10         NaN      NaN      x        Cause 3
Currently my code looks like this. I am trying to iterate through the DataFrame, but I am not sure whether there is a function or non-iterative approach to assign the proper value to the "Type" column:
cols = ['Cause 1', 'Cause 2', 'Cause 3']
for index, row in df.iterrows():
    for col in cols:
        if df.loc[index, col] == 'X':
            df.loc[index, 'Type'] = col
            continue
        else:
            df.loc[index, 'Type'] = 'Other'
            continue
The issue I get with this code is that it iterates but only identifies rows with the last item in the cols list and the remainder go to "Other".
Any help is appreciated!
You could use idxmax on the boolean array of your data:
df['Type'] = df.drop('Time Lost', axis=1).eq('x').idxmax(axis=1)
Note that this only reports the first cause if there are several.
output:
  Time Lost Cause 1 Cause 2 Cause 3     Type
0        40       x     NaN     NaN  Cause 1
1        15     NaN       x     NaN  Cause 2
2        65       x     NaN     NaN  Cause 1
3        10     NaN     NaN       x  Cause 3
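If some rows could contain no "x" at all, idxmax would still report the first column; a small sketch that falls back to 'Other' in that case (column layout assumed from the question):
mask = df.drop('Time Lost', axis=1).eq('x')
df['Type'] = mask.idxmax(axis=1).where(mask.any(axis=1), 'Other')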

Why do we sometimes have to add .values when doing elementwise operations in pandas?

Suppose I have a dataframe that looks like
A
0 0
1 1
2 2
3 3
and when I run:
a = df.loc[np.arange(0,2)] / df.loc[np.arange(2,4)]
I get
A
0 NaN
1 NaN
2 NaN
3 NaN
I know I could get the right result by writing
a = df.loc[np.arange(0,2)].values / df.loc[np.arange(2,4)]
b = df.loc[np.arange(0,2)] / df.loc[np.arange(2,4)].values
Can anyone explain why?
Because pandas is index- and column-sensitive, the calculation first aligns on those hidden keys. If you only want the values to match positionally, removing the influence of index and columns by adding .values or .to_numpy() does the trick; index alignment brings advantages as well, however.
Example 1: the indexes do not match, so every value returns NaN:
s1=pd.Series([1],index=[1])
s2=pd.Series([1],index=[999])
s1/s2
1 NaN
999 NaN
dtype: float64
s1.values/s2.values
array([1.])
Example 2: the indexes partially match, so pandas returns a value where they align:
s1=pd.Series([1],index=[1])
s2=pd.Series([1,999],index=[1,999])
s1/s2
1 1.0
999 NaN
dtype: float64
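Applied to the frame from the question, a minimal sketch of both behaviors (df reconstructed from the example above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3]})

# Rows 0-1 and rows 2-3 share no index labels, so label
# alignment produces NaN everywhere.
aligned = df.loc[np.arange(0, 2)] / df.loc[np.arange(2, 4)]

# Stripping the labels from one side forces positional division.
positional = df.loc[np.arange(0, 2)] / df.loc[np.arange(2, 4)].values
print(positional)  # A: 0.000000, 0.333333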

Pandas: Sum multiple columns, but write NaN if any column in that row is NaN or 0

I am trying to create a new column in a pandas dataframe that sums the total of other columns. However, if any of the source columns is blank (NaN or 0), I need the new column also to be written as blank (NaN):
   a    b  c  d  sum
   3    5  7  4   19
   2    6  0  2  NaN   (note the 0 in column c)
   4  NaN  3  7  NaN
I am currently using DataFrame.sum, formatted like this:
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs but does not write NaN to the sum column.
Thanks in advance for any advice.
Replace your 0s with np.nan, then pass skipna=False:
df.replace(0, np.nan).sum(axis=1, skipna=False)
0    19.0
1     NaN
2     NaN
dtype: float64

df['sum'] = df.replace(0, np.nan).sum(axis=1, skipna=False)
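If only certain columns should feed the sum, the same idea restricted to a subset (column names taken from the question):
cols = ['a', 'b', 'c', 'd']
# Treat 0 as missing, then let any NaN poison the row sum.
df['sum'] = df[cols].replace(0, np.nan).sum(axis=1, skipna=False)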

Python: Create New Column Equal to the Sum of all Columns Starting from Column Number 9

I want to create a new column called 'test' in my dataframe that is equal to the sum of all the columns starting from column number 9 to the end of the dataframe. These columns are all datatype float.
Below is the code I tried, but it didn't work; it gives me back all NaN values in the 'test' column:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum()
If I'm understanding your question, you want the row-wise sum starting at column 9. I believe you want .sum(axis=1). See an example below using column 2 instead of 9 for readability.
import numpy as np
import numpy.random as npr
from pandas import DataFrame

df = DataFrame(npr.rand(10, 5))
df.iloc[0:3, 0:4] = np.nan  # throw in some NaN values
df.loc[:, 'test'] = df.iloc[:, 2:].sum(axis=1)
print(df)
0 1 2 3 4 test
0 NaN NaN NaN NaN 0.73046 0.73046
1 NaN NaN NaN NaN 0.79060 0.79060
2 NaN NaN NaN NaN 0.53859 0.53859
3 0.97469 0.60224 0.90022 0.45015 0.52246 1.87283
4 0.84111 0.52958 0.71513 0.17180 0.34494 1.23187
5 0.21991 0.10479 0.60755 0.79287 0.11051 1.51094
6 0.64966 0.53332 0.76289 0.38522 0.92313 2.07124
7 0.40139 0.41158 0.30072 0.09303 0.37026 0.76401
8 0.59258 0.06255 0.43663 0.52148 0.62933 1.58744
9 0.12762 0.01651 0.09622 0.30517 0.78018 1.18156
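Applied to the question's frame, the fix is the same pattern (df_UBSrepscomp taken from the question):
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum(axis=1)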
