Python Pandas: creating a column on condition with a dynamic number of columns

Suppose I create a new dataframe based on a user parameter, say a = 2. My dataframe df then shrinks to 4 (a×2) columns in df_new. For example:
import pandas as pd

df_new = pd.DataFrame(data={'col_01_01': [float('nan'), float('nan'), 1, 2, float('nan')],
                            'col_02_01': [float('nan'), float('nan'), 1, 2, float('nan')],
                            'col_01_02': [0, 0, 0, 0, 1],
                            'col_02_02': [1, 0, 0, 1, 1],
                            'output': [1, 0, 1, 1, 1]})
To be more precise about the output column, let's look at the first row: (nan, nan, 0, 1). Apply the notna() function to the first two entries and the comparison '== 1' to the third and fourth entries. This gives (False, False, False, True). Combine these with an OR expression and you get the desired result True -> 1.
In the second row we find (nan, nan, 0, 0), so the output is 0, since there is no valid value in the first two columns and no 1 in the last two.
For a parameter a=3 we find 6 columns.
The result looks like this:
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1

You can use vectorised operations with notnull and eq:
null_cols = ['col_01_01', 'col_02_01']
int_cols = ['col_01_02', 'col_02_02']
df['output'] = (df[null_cols].notnull().any(axis=1) | df[int_cols].eq(1).any(axis=1)).astype(int)
print(df)
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
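Since the number of column pairs depends on the parameter a, the two column lists can also be built dynamically. This is a minimal sketch assuming the naming pattern col_XX_01 / col_XX_02 from the example above:
a = 2
null_cols = [f'col_{i:02d}_01' for i in range(1, a + 1)]
int_cols = [f'col_{i:02d}_02' for i in range(1, a + 1)]
# any valid (non-NaN) value in the first group, or any 1 in the second group, yields 1
df['output'] = (df[null_cols].notna().any(axis=1) | df[int_cols].eq(1).any(axis=1)).astype(int)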

Related

Ignore nan elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save, in tdf['b'], the values from df2 for each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2.iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this with the .loc indexer, like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, then the indices get messed up (for example, in tdf['b'] index 0 has to be NaN, but it will have a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
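To reproduce the layout from the question, the mapped series can simply be assigned to a new column. A self-contained sketch with the data above (tdf is just a fresh result frame):
import pandas as pd

df1 = pd.DataFrame({'a': [10, 2, 3, 1, 7, 6]})
df2 = pd.DataFrame({'a': [1, 2, 4, 3, 20, 5]})

lookback = 3
tdf = pd.DataFrame()
tdf['a'] = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
# look up each found index in df2; NaN indices simply stay NaN
tdf['b'] = tdf['a'].map(df2['a'])
print(tdf)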

Setting the last n non-NaN values per group to NaN

I have a DataFrame with (several) grouping variables and (several) value variables. My goal is to set the last n non-NaN values per group to NaN. So let's take a simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'value': [1, 2, np.nan, 9, 8]})
df
Out[1]:
id value
0 1 1.0
1 1 2.0
2 1 NaN
3 2 9.0
4 2 8.0
The desired result for n=1 would look like the following:
Out[53]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
Use groupby() with cumcount():
N=1
groups = df.loc[df['value'].notna()].groupby('id')
enum = groups.cumcount()
sizes = groups['value'].transform('size')
df['value'] = df['value'].where(enum < sizes - N)
Output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
You can take cumsum after groupby on the reversed notna mask to get, per row, how many non-NaN values remain from that row to the end of its group; keep a value only where that count exceeds N (here 1):
df['value'].where(df['value'].notna().iloc[::-1].groupby(df['id']).cumsum() > 1, inplace=True)
df
Out[86]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
One option: create a reversed cumcount on the non-NA values:
N = 1
m = (df
     .loc[df['value'].notna()]
     .groupby('id')
     .cumcount(ascending=False)
     .lt(N)
     )
df.loc[m[m].index, 'value'] = np.nan
Similar approach with boolean masking:
m = df['value'].notna()
df['value'] = df['value'].mask(m[::-1].groupby(df['id']).cumsum().le(N))
output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
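Wrapped up as a small helper for an arbitrary n, the reversed-cumcount approach could look like this (a sketch; the function name is made up):
import numpy as np
import pandas as pd

def mask_last_n(df, group_col, value_col, n):
    # positions of the last n non-NaN values within each group
    m = (df.loc[df[value_col].notna()]
           .groupby(group_col)
           .cumcount(ascending=False)
           .lt(n))
    out = df.copy()
    out.loc[m[m].index, value_col] = np.nan
    return out

df = pd.DataFrame({'id': [1, 1, 1, 2, 2], 'value': [1, 2, np.nan, 9, 8]})
print(mask_last_n(df, 'id', 'value', 1))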

combining two dataframes giving NaN values [duplicate]

I am curious why a simple concatenation of two dataframes in pandas:
initId.shape # (66441, 1)
initId.isnull().sum() # 0
ypred.shape # (66441, 1)
ypred.isnull().sum() # 0
of the same shape and both without NaN values
foo = pd.concat([initId, ypred], join='outer', axis=1)
foo.shape # (83384, 2)
foo.isnull().sum() # 16943
can result in a lot of NaN values if joined.
How can I fix this problem and prevent NaN values being introduced?
Trying to reproduce it like
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
pd.concat([aaa, bbb], axis=1)
failed, i.e. it worked just fine and no NaN values were introduced.
I think the problem is different index values; where concat cannot align the indexes, you get NaN:
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
print(aaa)
prediction
4 0
5 1
8 0
7 1
10 0
12 0
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 NaN 0.0
1 NaN 0.0
2 NaN 1.0
3 NaN 0.0
4 0.0 1.0
5 1.0 1.0
7 1.0 NaN
8 0.0 NaN
10 0.0 NaN
12 0.0 NaN
The solution is reset_index, if the index values are not needed:
aaa.reset_index(drop=True, inplace=True)
bbb.reset_index(drop=True, inplace=True)
print(aaa)
prediction
0 0
1 1
2 0
3 1
4 0
5 0
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 0 0
1 1 0
2 0 1
3 1 0
4 0 1
5 0 1
EDIT: If you need the same index as aaa and the DataFrames have the same length, use:
bbb.index = aaa.index
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
4 0 0
5 1 0
8 0 1
7 1 0
10 0 1
12 0 1
You can do something like this:
concatenated_dataframes = pd.concat(
    [
        dataframe_1.reset_index(drop=True),
        dataframe_2.reset_index(drop=True),
        dataframe_3.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)
concatenated_dataframes_columns = [
    list(dataframe_1.columns),
    list(dataframe_2.columns),
    list(dataframe_3.columns),
]
flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
This concatenates multiple DataFrames while keeping the column names and avoiding NaN.
As jezrael pointed out, this is due to different index labels. concat matches on the index, so if the indexes are not the same, this problem will occur. For a straightforward horizontal concatenation, you must "coerce" the index labels to be the same. One way is via the set_axis method, which makes the second dataframe's index the same as the first's.
joined_df = pd.concat([df1, df2.set_axis(df1.index)], axis=1)
or just reset the index of both frames
joined_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
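As a quick sanity check before concatenating horizontally, you can compare the two indexes up front; a minimal sketch (the example frames are made up):
import pandas as pd

df1 = pd.DataFrame({'prediction': [0, 1, 0]}, index=[4, 5, 8])
df2 = pd.DataFrame({'groundTruth': [0, 0, 1]})

# concat(axis=1) aligns on the index (an outer join), so mismatched
# labels produce NaN; detect that and force a common index first
if not df1.index.equals(df2.index):
    df2 = df2.set_axis(df1.index)

joined = pd.concat([df1, df2], axis=1)
print(joined)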

How to fill a NaN column with natural numbers in increasing order?

I have a data-frame
Columns
0 Nan
1 Nan
2 Nan
3 Nan
I want to fill all the NaN values in this column with natural numbers, starting from 1 and increasing down the remaining empty rows.
Expected Output
Columns
0 1
1 2
2 3
3 4
Any suggestions to do this?
df['Columns'] = df['Columns'].fillna(??????????)
If you need to replace only missing values, use DataFrame.loc with Series.cumsum; True values are then treated as 1:
m = df['Columns'].isna()
#nice solution from #Ch3steR, thank you
df.loc[m, 'Columns'] = m.cumsum()
#alternative
#df.loc[m, 'Columns'] = range(1, m.sum() + 1)
print (df)
Columns
0 1
1 2
2 3
3 4
Test with different data:
print (df)
Columns
0 NaN
1 NaN
2 100.0
3 NaN
m = df['Columns'].isna()
df.loc[m, 'Columns'] = m.cumsum()
print (df)
Columns
0 1.0
1 2.0
2 100.0
3 3.0
If you need to set the values by range, so the original column values are overwritten, use:
df['Columns'] = range(1, len(df) + 1)
print (df)
Columns
0 1
1 2
2 3
3 4
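If you prefer the fillna form from the question, the same cumulative-count idea can be passed straight to fillna; a small sketch with the second set of data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Columns': [np.nan, np.nan, 100.0, np.nan]})
# number each missing value cumulatively and use that number as the fill
df['Columns'] = df['Columns'].fillna(df['Columns'].isna().cumsum())
print(df)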

Remove duplicate column based on a condition in pandas

I have a DataFrame in which I have a duplicate column, namely Weather.
One of the two columns contains NaN values; that is the one I want to remove from the DataFrame.
I tried this method
data_cleaned4.drop('Weather', axis=1)
It dropped both columns, as it should. I then tried to pass a condition to the drop method, but I couldn't; it shows me an error:
data_cleaned4.drop(data_cleaned4['Weather'].isnull().sum() > 0, axis=1)
Can anyone tell me how to remove this column? Remember that the second-to-last column contains the NaN values, not the last one.
A general solution: df.isnull().any(axis=0).values tells you which columns contain any NaN values, and df.columns.duplicated(keep=False) marks all duplicated column names as True; combining both (and negating) gives the columns you want to retain.
General Solution:
df.loc[:, ~((df.isnull().any(axis=0).values) & df.columns.duplicated(keep=False))]
Input
A B C C A
0 1 1 1 3.0 NaN
1 1 1 1 2.0 1.0
2 2 3 4 NaN 2.0
3 1 1 1 4.0 1.0
Output
A B C
0 1 1 1
1 1 1 1
2 2 3 4
3 1 1 1
Just for column C:
df.loc[:, ~(df.columns.duplicated(keep=False) & (df.isnull().any(axis=0).values)
            & (df.columns == 'C'))]
Input
A B C C A
0 1 1 1 3.0 NaN
1 1 1 1 2.0 1.0
2 2 3 4 NaN 2.0
3 1 1 1 4.0 1.0
Output
A B C A
0 1 1 1 NaN
1 1 1 1 1.0
2 2 3 4 2.0
3 1 1 1 1.0
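Here is a minimal, self-contained sketch of the general solution above, applied to a frame with a duplicated Weather column like the one in the question (the names and data below are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 'Rain', np.nan],
                   [2, 'Sun', np.nan]],
                  columns=['id', 'Weather', 'Weather'])

# keep a column unless its name is duplicated AND it contains NaN
keep = ~(df.columns.duplicated(keep=False) & df.isnull().any(axis=0).values)
df = df.loc[:, keep]
print(df)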
Due to the duplicate names you can first drop one column by position; that's what the first line of the code below does (here i is assumed to be the positional index of the column to exclude), then it should work...
data_cleaned4 = data_cleaned4.iloc[:, [j for j, c in enumerate(data_cleaned4.columns) if j != i]]
checkone = data_cleaned4.iloc[:, -1].isna().any()
checktwo = data_cleaned4.iloc[:, -2].isna().any()
if checkone:
    data_cleaned4 = data_cleaned4.drop(data_cleaned4.columns[-1], axis=1)
elif checktwo:
    data_cleaned4 = data_cleaned4.drop(data_cleaned4.columns[-2], axis=1)
else:
    data_cleaned4 = data_cleaned4.drop(data_cleaned4.columns[-2], axis=1)
Without a testable sample, and assuming you don't have NaNs anywhere else in your dataframe,
df = df.dropna(axis=1)
should work
