Setting values in one dataframe from the boolean values in another - python

I have a minimal working example (MWE) that can be reproduced with the following code:
import pandas as pd
a = pd.DataFrame([[1,2],[3,4]], columns=['A', 'B'])
b = pd.DataFrame([[True,False],[False,True]], columns=['A', 'B'])
This creates the following dataframes:
In [8]: a
Out[8]:
A B
0 1 2
1 3 4
In [9]: b
Out[9]:
A B
0 True False
1 False True
My question is: how can I change the values in dataframe a based on the boolean values in dataframe b?
For example, how could I set the values in a to NaN wherever b holds False?

If you need to replace the values where b is False with NaN:
print (a[b])
A B
0 1.0 NaN
1 NaN 4.0
or:
print (a.where(b))
A B
0 1.0 NaN
1 NaN 4.0
And if you need to replace the values where b is True with NaN:
print (a[~b])
A B
0 NaN 2.0
1 3.0 NaN
or:
print (a.mask(b))
A B
0 NaN 2.0
1 3.0 NaN
You can also pass a scalar replacement value to where or mask:
print (a.where(b, 7))
A B
0 1 7
1 7 4
print (a.mask(b, 7))
A B
0 7 2
1 3 7
print (a.where(b, 'TEST'))
A B
0 1 TEST
1 TEST 4
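A minimal sketch of how the two methods relate, using the a and b frames from the question: where keeps values where the condition is True, and mask is its complement. The in-place variant with a boolean frame as the key and the sentinel value -1 are illustrative choices, not part of the original answer.

```python
import pandas as pd

a = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
b = pd.DataFrame([[True, False], [False, True]], columns=['A', 'B'])

# Keep values where b is True; everything else becomes NaN.
kept_true = a.where(b)

# mask() with the inverted condition selects the same cells.
same_selection = kept_true.equals(a.mask(~b))

# Assigning through a boolean frame changes a copy in place.
# The sentinel -1 is an arbitrary value chosen for illustration.
flagged = a.copy()
flagged[~b] = -1

print(kept_true)
print(same_selection)
print(flagged)
```

Note that where and mask return new frames and switch the columns to float when NaN is introduced, while the assignment form with an integer replacement keeps the integer dtype.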

Related

Pandas: Find the max value in one column containing lists

I have a dataframe like this:
fly_frame:
day plcae
0 [1,2,3,4,5] A
1 [1,2,3,4] B
2 [1,2] C
3 [1,2,3,4] D
I want to find the max value of each list in the day column, so that the result looks like this:
fly_frame:
day plcae
0 5 A
1 4 B
2 2 C
3 4 D
What should I do?
Thanks for your help.
df.day.apply(max)
#0 5
#1 4
#2 2
#3 4
Use apply with max:
#if strings
#import ast
#print (type(df.loc[0, 'day']))
#<class 'str'>
#df['day'] = df['day'].apply(ast.literal_eval)
print (type(df.loc[0, 'day']))
<class 'list'>
df['day'] = df['day'].apply(max)
Or list comprehension:
df['day'] = [max(x) for x in df['day']]
print (df)
day plcae
0 5 A
1 4 B
2 2 C
3 4 D
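The commented-out ast lines above cover the case where the lists arrive as strings, for instance after reading a CSV. A self-contained sketch of that case follows; the helper name to_list is hypothetical, and the column name plcae is spelled as in the question.

```python
import ast

import pandas as pd

# Assumption: the 'day' column holds string representations of lists.
fly_frame = pd.DataFrame({
    'day': ['[1,2,3,4,5]', '[1,2,3,4]', '[1,2]', '[1,2,3,4]'],
    'plcae': ['A', 'B', 'C', 'D'],
})


def to_list(value):
    """Parse a string such as '[1,2]' into a list; pass real lists through."""
    return ast.literal_eval(value) if isinstance(value, str) else value


# Parse each entry first, then take the maximum of each list.
fly_frame['day'] = fly_frame['day'].apply(to_list).apply(max)
print(fly_frame)
```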
Try a combination of pd.concat() and df.apply() with:
import numpy as np
import pandas as pd
fly_frame = pd.DataFrame({'day':[[1,2,3,4,5],[1,2,3,4],[1,2],[1,2,3,4]],'place':['A','B','C','D']})
df = pd.concat([fly_frame['day'].apply(max),fly_frame.drop('day',axis=1)],axis=1)
print(df)
day place
0 5 A
1 4 B
2 2 C
3 4 D
Edit
You can also use df.join() with:
fly_frame.drop('day',axis=1).join(fly_frame['day'].apply(np.max,axis=0))
place day
0 A 5
1 B 4
2 C 2
3 D 4
I suggest bringing your dataframe into a better format first.
>>> df
day plcae
0 [1, 2, 3, 4, 5] A
1 [1, 2, 3, 4] B
2 [1, 2] C
3 [1, 2, 3, 4] D
>>>
>>> df = pd.concat([df.pop('day').apply(pd.Series), df], axis=1)
>>> df
0 1 2 3 4 plcae
0 1.0 2.0 3.0 4.0 5.0 A
1 1.0 2.0 3.0 4.0 NaN B
2 1.0 2.0 NaN NaN NaN C
3 1.0 2.0 3.0 4.0 NaN D
Now everything is easier, for example computing the maximum of numeric values along the columns.
>>> df.max(axis=1)
0 5.0
1 4.0
2 2.0
3 4.0
dtype: float64
edit: renaming the index might also be useful to you.
>>> df.max(axis=1).rename(df['plcae'])
A 5.0
B 4.0
C 2.0
D 4.0
dtype: float64
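As a possible variation on the reshaping idea, the lists can be unpacked into a long Series instead of wide columns. This is a sketch that assumes pandas 0.25 or later, where Series.explode is available; it is not part of the original answers.

```python
import pandas as pd

fly_frame = pd.DataFrame({
    'day': [[1, 2, 3, 4, 5], [1, 2, 3, 4], [1, 2], [1, 2, 3, 4]],
    'plcae': ['A', 'B', 'C', 'D'],
})

# explode() yields one row per list element and repeats the original index.
long_days = fly_frame['day'].explode().astype(int)

# Grouping on that repeated index recovers one maximum per original row.
day_max = long_days.groupby(level=0).max()
print(day_max)
```

Because no NaN padding is needed, the result stays an integer Series rather than being converted to float.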

Drop a column if it exceeds a specific number of NA values

I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column,axis=1)
There is no error when executing the code above, but calling df.apply(check) produces a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, it is not the approach I would recommend. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select the columns to keep.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A':list('abcdef'),
'B':[np.nan,np.nan,np.nan,5,5,np.nan],
'C':[np.nan,8,np.nan,np.nan,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract N from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
I suggest using DataFrame.pipe to apply the function to the input DataFrame, and changing df.column to df[column], because dot notation does not work with a column name held in a variable (it tries to select a column literally named column):
df = pd.DataFrame({'A':list('abcdef'),
'B':[np.nan,np.nan,np.nan,5,5,np.nan],
'C':[np.nan,8,np.nan,np.nan,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column,axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Alternatively, you can use count, which counts non-null values; a column with at most 2 NA values has at least len(df.index) - 2 of them:
In [23]: df.loc[:, df.count().ge(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
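The answers above all express the same rule: keep a column when its NA count does not exceed the limit. A small sketch that wraps the mask approach in a function and checks it against dropna with thresh; the function name drop_sparse_columns and the parameter max_na are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_na):
    """Return frame without the columns that hold more than max_na missing values."""
    keep = frame.isna().sum().le(max_na)
    return frame.loc[:, keep]


df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb'),
})

result = drop_sparse_columns(df, 2)

# Same rule expressed through dropna: require at least len(df) - 2 non-NA values.
via_dropna = df.dropna(thresh=len(df) - 2, axis=1)

print(result.columns.tolist())
print(result.equals(via_dropna))
```

Unlike the inplace=True loop, this version leaves the input frame untouched and returns a new one.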

Why Python Pandas append to DataFrame like this?

I want to add l to column 'A', but it creates a new column and adds l to that one instead. Why is this happening? And how can I get the result I want?
import pandas as pd
l=[1,2,3]
df = pd.DataFrame(columns =['A'])
df = df.append(l, ignore_index=True)
df = df.append(l, ignore_index=True)
print(df)
A 0
0 NaN 1.0
1 NaN 2.0
2 NaN 3.0
3 NaN 1.0
4 NaN 2.0
5 NaN 3.0
Edited
Is this what you want to do:
In[6]:df=df.A.append(pd.Series(l)).reset_index().drop('index',1).rename(columns={0:'A'})
In[7]:df
Out[7]:
A
0 1
1 2
2 3
Then you can add any list of different length.
Suppose:
a=[9,8,7,6,5]
In[11]:df=df.A.append(pd.Series(a)).reset_index().drop('index',1).rename(columns={0:'A'})
In[12]:df
Out[12]:
A
0 1
1 2
2 3
3 9
4 8
5 7
6 6
7 5
Previously
Are you looking for this:
df=pd.DataFrame(l,columns=['A'])
df
Out[5]:
A
0 1
1 2
2 3
You can just pass a dictionary to the DataFrame constructor, if I understand your question correctly.
l = [1,2,3]
df = pd.DataFrame({'A': l})
df
A
0 1
1 2
2 3
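On pandas 2.0 and later, DataFrame.append and Series.append are no longer available, so the appends shown above can be sketched with pd.concat instead. A bare list turned into a frame gets a default column labelled 0, so this sketch names the column 'A' explicitly on both sides. The second list a is taken from the earlier answer.

```python
import pandas as pd

l = [1, 2, 3]
a = [9, 8, 7, 6, 5]

# Build the frame from the first list, naming the column explicitly.
df = pd.DataFrame({'A': l})

# Wrap the new values in a frame with the same column name before concatenating;
# ignore_index=True renumbers the rows from 0.
df = pd.concat([df, pd.DataFrame({'A': a})], ignore_index=True)
print(df)
```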

Value Counts of Column Slice to Contain All Possible Unique Values in Column

I have a df that looks like this:
group val
A 1
A 1
A 2
B 1
B 2
B 3
I want to get the value_counts for each group separately, but want to show all possible values for each value_count group:
> df[df['group']=='A']['val'].value_counts()
1 2
2 1
3 NaN
Name: val, dtype: int64
But it currently looks like this:
> df[df['group']=='A']['val'].value_counts()
1 2
2 1
Name: val, dtype: int64
Does anyone know a way to show value_counts with all possible values represented?
In [185]: df.groupby('group')['val'].value_counts().unstack('group')
Out[185]:
group A B
val
1 2.0 1.0
2 1.0 1.0
3 NaN 1.0
In [186]: df.groupby('group')['val'].value_counts().unstack('group')['A']
Out[186]:
val
1 2.0
2 1.0
3 NaN
Name: A, dtype: float64
This works:
from io import StringIO
import pandas as pd
import numpy as np
data = StringIO("""group,val
A,1
A,1
A,2
B,1
B,2
B,3""")
df = pd.read_csv(data)
print(df, '\n')
res_idx = pd.MultiIndex.from_product([df['group'].unique(), df['val'].unique()])
res = pd.concat([pd.DataFrame(index=res_idx),
df.groupby('group').apply(lambda x: x['val'].value_counts())],
axis=1)
print(res)
Produces:
group val
0 A 1
1 A 1
2 A 2
3 B 1
4 B 2
5 B 3
val
A 1 2.0
2 1.0
3 NaN
B 1 1.0
2 1.0
3 1.0
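If only one group is needed, another option is to reindex the counts against every value that occurs in the column. This is a sketch, not one of the original answers; fill_value=0 is an assumption about the desired output, and leaving it out gives NaN as in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'val': [1, 1, 2, 1, 2, 3],
})

# Every value that appears anywhere in the column, in sorted order.
all_vals = sorted(df['val'].unique())

# Count within group A, then align the result to the full set of values.
counts_a = (
    df.loc[df['group'] == 'A', 'val']
    .value_counts()
    .reindex(all_vals, fill_value=0)
)
print(counts_a)
```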

How can I non-iteratively place a NaN in a DataFrame column if there is a corresponding NaN in any other columns?

Given a 3-column DataFrame, df:
a b c
0 NaN a True
1 1 b True
2 2 c False
3 3 NaN False
4 4 e True
[5 rows x 3 columns]
I would like to place a NaN in column c for each row where a NaN exists in any other column. My current approach is as follows:
for col in df:
    df['c'][pd.np.isnan(df[col])] = pd.np.nan
I strongly suspect that there is a way to do this via logical indexing instead of iterating through columns as I am currently doing.
How could this be done?
Thank you!
If you don't care about the bool/float issue, I propose:
>>> df.loc[df.isnull().any(axis=1), "c"] = np.nan
>>> df
a b c
0 NaN a NaN
1 1 b 1
2 2 c 0
3 3 NaN NaN
4 4 e 1
[5 rows x 3 columns]
If you really do, then starting again from your frame df you could:
>>> df["c"] = df["c"].astype(object)
>>> df.loc[df.isnull().any(axis=1), "c"] = np.nan
>>> df
a b c
0 NaN a NaN
1 1 b True
2 2 c False
3 3 NaN NaN
4 4 e True
[5 rows x 3 columns]
df.loc[df.loc[:, :'c'].apply(lambda r: r.isnull().any(), axis=1), 'c'] = np.nan
Note that you may need to change the type of column c to float or you'll get an error about being unable to assign nan to integer column.
Filter the rows where you have NaN in either 'a' or 'b' and assign NaN to 'c':
In [18]:
df.loc[pd.isnull(df.a) | pd.isnull(df.b), 'c'] = np.nan
In [19]:
df
Out[19]:
a b c
0 NaN a NaN
1 1 b 1
2 2 c 0
3 3 d 0
4 4 NaN NaN
[5 rows x 3 columns]
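The older answers above rely on .ix, which has been removed from recent pandas versions. A sketch of the same idea with Series.mask follows, assuming the frame from the question; converting c to object dtype first is one way to keep True/False alongside NaN, as the first answer suggests.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.nan, 1, 2, 3, 4],
    'b': ['a', 'b', 'c', np.nan, 'e'],
    'c': [True, True, False, False, True],
})

# Rows with a NaN in any column other than 'c'.
has_nan = df.drop(columns='c').isna().any(axis=1)

# mask() replaces the flagged rows with NaN; object dtype keeps the booleans as-is.
df['c'] = df['c'].astype(object).mask(has_nan)
print(df)
```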
