python isnull().sum() handle headers

I have a dataset in which I want to count the missing values for each column. If a column has missing values, I want to print its header name. I use the following code to find the missing values per column:
isnull().sum()
If I print the result everything is OK, but when I try to put the result in a list and then handle the headers, I can't.
newList = pd.isnull(myData).sum()
print(newList)
In this case the output is:
Name 5
Surname 0
Age 3
and I want to print only 'Surname', but I can't find out how to return it to a variable.
newList = pd.isnull(myData).sum()
print(newList[0])
This prints 5 (the number of missing values for column 'Name').

Use boolean indexing with the Series:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4,5,4,5,5,4],
                   'C': [np.nan,8,9,4,2,3],
                   'D': [1,3,5,np.nan,1,0],
                   'E': [5,3,6,9,2,4],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a 4 NaN 1.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 9.0 5.0 6 a
3 d 5 4.0 NaN 9 b
4 e 5 2.0 1.0 2 b
5 f 4 3.0 0.0 4 b
newList = df.isnull().sum()
print (newList)
A 0
B 0
C 1
D 1
E 0
F 0
dtype: int64
# return the columns that contain NaNs
print(newList.index[newList != 0].tolist())
['C', 'D']
# return the columns without NaNs
print(newList.index[newList == 0].tolist())
['A', 'B', 'E', 'F']
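Applied back to the question's data, a minimal sketch (assuming myData has the columns Name, Surname and Age from the question) could be:
newList = pd.isnull(myData).sum()
# columns with no missing values, here only 'Surname'
complete = newList.index[newList == 0].tolist()
print (complete)
['Surname']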

Related

Merge duplicated cells instead of dropping them [duplicate]

I have a pandas dataframe (shown as an Excel screenshot in the original post).
Now I would like to delete all duplicates (1) of a specific column (B).
How can I do it?
For this example, the result would look like the sample output in the answer below.
You can use duplicated to build a boolean mask and then set NaNs via loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan, df['B'])
Alternatively, if you need to remove duplicated rows by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': [1,2,1,3],
    'A': [1,5,7,9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
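Both duplicated and drop_duplicates keep the first occurrence by default; if you need to keep the last occurrence instead, both accept keep='last' (and keep=False marks all duplicates). For example:
# keep the last occurrence of each B value, NaN-out the earlier ones
df.loc[df['B'].duplicated(keep='last'), 'B'] = np.nan
# or drop the earlier duplicate rows entirely
df = df.drop_duplicates(subset=['B'], keep='last')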

Give normal indexing to a dataframe which has been altered by set_index()

I have a dataframe which looks like this:
Priority RID_solve Prob RID_prob Remarks
0 1 5001 34.4% 5040 Caution: FIDs are different
1 1 5001 38.5% 5057 Caution: FIDs are different
2 1 5001 3.3% 5056 Caution: FIDs are different
3 2 5002 74.0% 5057 Caution: FIDs are different
4 2 5002 87.6% 5056 Caution: FIDs are different
5 3 5003 89.4% 5056 Same FID
6 3 5003 89.4% 5056 Caution: FIDs are different
Then I use set_index() to group the similar Priority and RID_solve data so that the repetition could be removed. This is the code I wrote:
df1 = df.set_index(['Priority', 'RID_solve', 'Prob', 'RID_prob', 'Remarks']).sort_values(by=['Priority'], ascending=True)
which gives grouped data (shown as a screenshot in the original post), which is what I want. But I also need the normal index which starts at 0. So far I am not able to figure out how to get it. I tried reset_index(), but that just changes my data back to its original form.
Is there a way to keep the above format intact and get indexes too?
"Then I use set_index() to group the similar Priority and RID_solve data so that the repetition could be removed."
No, you are wrong: the repetition is not removed, only not displayed, so you have to decide whether you need a MultiIndex or the default RangeIndex.
You can check it:
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4,5,4,5,5,4],
    'C': [1] * 6,
    'F': list('aaabbb')
})
df = df.set_index(['C','B', 'A'])
print (df)
F
C B A
1 4 a a
5 b a
4 c a
5 d b
e b
4 f b
with pd.option_context('display.multi_sparse', False):
    print (df)
F
C B A
1 4 a a
1 5 b a
1 4 c a
1 5 d b
1 5 e b
1 4 f b
EDIT:
If necessary, you can replace duplicated values with empty strings:
df = pd.DataFrame({
    'A': [1] * 6,
    'B': [4,5,4,5,5,4],
    'C': list('abcdef'),
    'F': list('aaabbb')
})
cols = ['A','B', 'C']
m = df[cols].apply(lambda x: x.duplicated())
df[cols]= df[cols].mask(m, '')
print (df)
A B C F
0 1 4 a a
1 5 b a
2 c a
3 d b
4 e b
5 f b
But if the duplicates are not in the first column, only in the second or later ones, then you get:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4,5,4,5,5,4],
    'C': [1] * 6,
    'F': list('aaabbb')
})
cols = ['A','B', 'C']
m = df[cols].apply(lambda x: x.duplicated())
df[cols]= df[cols].mask(m, '')
print (df)
A B C F
0 a 4 1 a
1 b 5 a
2 c a
3 d b
4 e b
5 f b
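A possible workaround for this caveat is to blank a value only when the whole prefix of key columns matches the previous row, which mimics the sparse MultiIndex display. A sketch (assuming empty strings are acceptable, so the columns are cast to object first):
cols = ['A','B','C']
# allow mixing '' with the original values
df[cols] = df[cols].astype(object)
# a value is blanked only if all key columns up to and including it
# equal the previous row's values
masks = {c: (df[cols[:i + 1]] == df[cols[:i + 1]].shift()).all(axis=1)
         for i, c in enumerate(cols)}
for c in cols:
    df.loc[masks[c], c] = ''
For the first sample above this blanks A in rows 1-5 and B only in row 4, matching the sparse display; for the second sample nothing is blanked, because A changes in every row.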

Find the column name which has the 2nd maximum value for each row (pandas)

Based on this post, Find the column name which has the maximum value for each row, it is clear how to get the column name with the max value of each row using df.idxmax(axis=1).
The question is, how can I get the 2nd, 3rd and so on maximum value per row?
You need numpy.argsort for the positions and then reorder the column names by indexing:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
arr = np.argsort(-df.values, axis=1)
df1 = pd.DataFrame(df.columns[arr], index=df.index)
print (df1)
0 1 2 3 4
0 A B D E C
1 D B C E A
2 E A B C D
3 C D A E B
4 C A E D B
Verify:
#first column
print (df.idxmax(axis=1))
0 A
1 D
2 E
3 C
4 C
dtype: object
#last column
print (df.idxmin(axis=1))
0 C
1 A
2 D
3 B
4 B
dtype: object
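The column holding the 2nd maximum of each row is then simply the second column of df1:
#second column = column of the 2nd maximum per row
print (df1[1])
0    B
1    B
2    A
3    D
4    A
Name: 1, dtype: object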
While there is no method to find specific ranks within a row, you can rank elements in a pandas dataframe using the rank method.
For example, for a dataframe like this:
df = pd.DataFrame([[1, 2, 4],[3, 1, 7], [10, 4, 2]], columns=['A','B','C'])
>>> print(df)
A B C
0 1 2 4
1 3 1 7
2 10 4 2
You can get the ranks of each row by doing:
>>> df.rank(axis=1,method='dense', ascending=False)
A B C
0 3.0 2.0 1.0
1 2.0 3.0 1.0
2 1.0 2.0 3.0
By default, rank returns float ranks, even with method='dense'. This can be easily fixed just by casting:
>>> ranks = df.rank(axis=1,method='dense', ascending=False).astype(int)
>>> ranks
A B C
0 3 2 1
1 2 3 1
2 1 2 3
Finding the indices is a little trickier in pandas, but it boils down to applying a filter on a condition (i.e. ranks == 2):
>>> ranks.where(ranks==2)
A B C
0 NaN 2.0 NaN
1 2.0 NaN NaN
2 NaN 2.0 NaN
Applying where returns only the elements matching the condition, with the rest set to NaN. We can retrieve the row and column indices by doing:
>>> ranks.where(ranks==2).notnull().values.nonzero()
(array([0, 1, 2]), array([1, 0, 1]))
And for retrieving the column index, i.e. the position within a row, which is the answer to your question:
>>> ranks.where(ranks==2).notnull().values.nonzero()[1]
array([1, 0, 1])
For the third element you just need to change the condition in where to ranks.where(ranks==3) and so on for other ranks.
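Assuming each rank occurs exactly once per row, a shorter variant of the same idea returns the column name of the 2nd maximum directly, because idxmax on a boolean frame gives the first True column:
>>> ranks.eq(2).idxmax(axis=1)
0    B
1    A
2    B
dtype: object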

Pandas: Merge two dataframe columns

Consider two dataframes:
import numpy as np
import pandas as pd

df_a = pd.DataFrame([
    ['a', 1],
    ['b', 2],
    ['c', np.nan],
], columns=['name', 'value'])
df_b = pd.DataFrame([
    ['a', 1],
    ['b', np.nan],
    ['c', 3],
    ['d', 4]
], columns=['name', 'value'])
So looking like
# df_a
name value
0 a 1
1 b 2
2 c NaN
# df_b
name value
0 a 1
1 b NaN
2 c 3
3 d 4
I want to merge these two dataframes and fill in the NaN values of the value column with the existing values from the other dataframe. In other words, I want:
# DESIRED RESULT
name value
0 a 1
1 b 2
2 c 3
3 d 4
Sure, I can do this with a custom .map or .apply, but I want a solution that uses merge or the like, not writing a custom merge function. How can this be done?
I think you can use combine_first:
print (df_b.combine_first(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Or fillna:
print (df_b.fillna(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
A solution with update is not as common as combine_first:
df_b.update(df_a)
print (df_b)
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
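Note that value comes back as float in all three solutions, because the NaNs forced the original columns to float. If you need integers again, one follow-up sketch is to cast after combining:
out = df_b.combine_first(df_a)
out['value'] = out['value'].astype(int)
print (out)
  name  value
0    a      1
1    b      2
2    c      3
3    d      4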

How to change the column order in a pandas dataframe when there are too many columns?

I have a large pandas dataframe that contains many columns.
I would like to change the order of the columns so that only a subset of them appears first. I don't care about the ordering of the rest (and there are too many variables to list them all).
For instance, if my dataframe is like this
a b c d e f g h i
5 8 7 2 1 4 1 2 3
1 4 2 2 3 4 1 5 3
I would like to specify a subset of the columns
mysubset=['d','f'] and reorder the dataframe such that
the order of the columns is now
d,f,a,b,c,e,g,h,i
Is there a pandas-esque way to do that?
You could use a column mask:
>>> mysubset = ["d","f"]
>>> mask = df.columns.isin(mysubset)
>>> pd.concat([df.loc[:,mask], df.loc[:,~mask]], axis=1)
d f a b c e g h i
0 2 4 5 8 7 1 1 2 3
1 2 4 1 4 2 3 1 5 3
or use sorted:
>>> mysubset = ["d","f"]
>>> df[sorted(df, key=lambda x: x not in mysubset)]
d f a b c e g h i
0 2 4 5 8 7 1 1 2 3
1 2 4 1 4 2 3 1 5 3
which works because x not in mysubset will be False for d and f, and False < True.
I usually do something like this:
mysubset = ['d', 'f']
othercols = [c for c in df.columns if c not in mysubset]
df = df[mysubset+othercols]
Use a multi-index to do that:
priority = [0 if x in {'d','f'} else 1 for x in df.columns]
newdf = df.T.set_index([priority, df.columns]).sort_index().T
Then you have:
In [3]: newdf
Out[3]:
0 1
d f a b c e g h i
0 2 4 5 8 7 1 1 2 3
1 2 4 1 4 2 3 1 5 3
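If you do not want the helper level (the 0/1 row above the column names) in the result, you can drop it afterwards:
newdf.columns = newdf.columns.droplevel(0)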
To move an entire subset of columns, you could do this:
#!/usr/bin/python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)
cols = df.columns.tolist()
print(cols)
mysubset = ['B','D']
for idx, item in enumerate(mysubset):
    cols.remove(item)
    cols.insert(idx, item)
print(cols)
df = df[cols]
print(df)
Here I moved B and D to the front and left the others trailing. Output:
A B C D
2013-01-01 0.905122 -0.004839 -0.697663 -1.307550
2013-01-02 0.651998 -1.092546 0.594493 0.341066
2013-01-03 0.355832 -0.840057 0.016989 0.377502
2013-01-04 -0.544407 0.826708 -0.889118 0.871769
2013-01-05 0.190630 0.717418 1.325479 -0.882652
2013-01-06 2.730582 0.195908 -0.657642 1.606263
['A', 'B', 'C', 'D']
['B', 'D', 'A', 'C']
B D A C
2013-01-01 -0.004839 -1.307550 0.905122 -0.697663
2013-01-02 -1.092546 0.341066 0.651998 0.594493
2013-01-03 -0.840057 0.377502 0.355832 0.016989
2013-01-04 0.826708 0.871769 -0.544407 -0.889118
2013-01-05 0.717418 -0.882652 0.190630 1.325479
2013-01-06 0.195908 1.606263 2.730582 -0.657642
For more, read this answer.
You can also reorder the columns in place with sort_index and a key (the key argument requires pandas 1.1+):
a = list('abcdefghi')
b = list('dfabceghi')
ind = pd.Series(range(9), index=b).reindex(a)
df.sort_index(axis=1, inplace=True, key=lambda x: ind)
The benefit of the above approach is inplace=True, which costs less memory and time when df is a large dataframe.
If your dataframe is of ordinary size:
df.filter(b)
may be more pythonic.
