In pandas, how to select the rows that contain NaN? [duplicate] - python

This question already has answers here:
How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?
(6 answers)
Closed 6 years ago.
Suppose I have the following dataframe in df:
  a   |   b   |   c
------+-------+-------
  5   |   2   |   4
 NaN  |   6   |   8
  5   |   9   |   0
  3   |   7   |   1
If I do df.loc[df['a'] == 5] it will correctly return the first and third rows, but if I do df.loc[df['a'] == np.NaN] it returns nothing.
I think this is more a Python thing than a pandas one. If I compare np.nan against anything, even np.nan == np.nan evaluates to False, so the question is: how should I test for np.nan?

Try using isnull like so:
import pandas as pd
import numpy as np

# A one-column frame containing a NaN
a = [1, 2, 3, np.nan, 5, 6, 7]
df = pd.DataFrame(a)

# The boolean mask selects the rows where column 0 is NaN
df[df[0].isnull()]
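Applied to the frame from the question, the same mask selects the NaN row in column a. A minimal sketch reconstructing the question's data (isna is the modern alias for isnull):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [5, np.nan, 5, 3],
                   'b': [2, 6, 9, 7],
                   'c': [4, 8, 0, 1]})

df[df['a'].isna()]           # rows where column 'a' is NaN
df[df.isna().any(axis=1)]    # rows where any column is NaN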


Filter rows based on two columns together [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I am trying to filter out rows based on two columns' values together. Most of the solutions I see use the following approach:
df.loc[(df['A'] != 'yes') & (df['B'] != 'no')]
This filters out rows where A and B each differ from a single value; what I want is to filter only the rows where both columns together have the values I am filtering on. Example:
Player | action | result
   1   |   A    |   B
   2   |   B    |   A
   3   |   C    |   A
   4   |   A    |   B
   5   |   A    |   C
In this example I want to remove only the rows that have action A together with result B (rows 1 and 4), not every row with action A or every row with result B.
Output expected:
Player | action | result
   2   |   B    |   A
   3   |   C    |   A
   5   |   A    |   C
I am probably overcomplicating this and it is straightforward. Anyhow, any help would be appreciated!
Regards
Could you please try the following. The ~ negates the mask, keeping every row except those where both conditions hold:
import pandas as pd

df[~((df["action"] == 'A') & (df["result"] == 'B'))]
Output of the data frame will be as follows.
   Player action result
1       2      B      A
2       3      C      A
4       5      A      C
I think this is what you want:
pd.concat([df[(df['action'] == 'A') & (df['result'] != 'B')],
           df[df['action'] != 'A']])
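For completeness, a minimal runnable sketch of the mask approach, reconstructing the question's sample data:
import pandas as pd

df = pd.DataFrame({'Player': [1, 2, 3, 4, 5],
                   'action': ['A', 'B', 'C', 'A', 'A'],
                   'result': ['B', 'A', 'A', 'B', 'C']})

# Keep every row except those where action == 'A' AND result == 'B'
print(df[~((df['action'] == 'A') & (df['result'] == 'B'))])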

What is the pandas version of tidyr::separate? [duplicate]

This question already has answers here:
Pandas split column into multiple columns by comma
(8 answers)
Closed 3 years ago.
The R package tidyr has a nice separate function to "Separate one column into multiple columns."
What is the pandas version?
For example here is a dataset:
import pandas
from io import StringIO

data = """ i | j | A
AR | 5 | Paris,Green
For | 3 | Moscow,Yellow
For | 4 | New York,Black"""

# Strip the padding spaces (this also collapses "New York" to "NewYork")
df = pandas.read_csv(StringIO(data.replace(' ', '')), sep="|", header=0)
I'd like to separate the A column into 2 columns containing the content of the 2 columns.
This question is related: Accessing every 1st element of Pandas DataFrame column containing lists
The equivalent of tidyr::separate is str.split with expand=True:
df[['Town', 'Color']] = df['A'].str.split(',', n=1, expand=True)
print(df)
#      i  j              A     Town   Color
# 0   AR  5    Paris,Green    Paris   Green
# 1  For  3  Moscow,Yellow   Moscow  Yellow
# 2  For  4  NewYork,Black  NewYork   Black
The equivalent of tidyr::unite is a simple concatenation of the string columns:
df["B"] = df["i"] + df["A"]
df
#      i  j              A                 B
# 0   AR  5    Paris,Green     ARParis,Green
# 1  For  3  Moscow,Yellow  ForMoscow,Yellow
# 2  For  4  NewYork,Black  ForNewYork,Black
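tidyr::unite also accepts a separator; the pandas counterpart is Series.str.cat. A minimal sketch on the same frame (the sep value here is illustrative):
# Join the two string columns with an underscore between them
df["B"] = df["i"].str.cat(df["A"], sep="_")
# 0     AR_Paris,Green
# 1  For_Moscow,Yellow
# 2  For_NewYork,Black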

How to delete many columns in python with one line of code? [duplicate]

This question already has answers here:
Deleting multiple columns based on column names in Pandas
(11 answers)
Closed 3 years ago.
I am trying to delete the following columns from my dataframe: 1, 2, 101:117, 121:124, 126.
So far the two ways I have found to delete columns are:
df.drop(df.columns[2:6],axis=1)
df.drop(df.columns[[0,3,5]],axis=1)
However, if I try
df.drop(df.columns[1,2,101:117,121:124],axis=1)
I get a "too many indices" error.
I also tried this
a=df.drop(df.columns[[1,2]],axis=1)
b=a.drop(a.columns[99:115],axis=1)
c=b.drop(b.columns[102:105],axis=1)
d=c.drop(c.columns[103],axis=1)
but this isn't deleting the columns I want, for some reason.
Use np.r_ to slice:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (2, 130)))
df.drop(columns=df.columns[np.r_[1, 2, 101:117, 121:124, 126]])
#    0  3  4  5  6  ...  120  124  125  127
# 0  6  1  3  7  2  ...    8    7    2    6
# 1  1  9  2  5  3  ...    7    3    9    4
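np.r_ works here because it concatenates scalar indices and slice ranges into a single integer array, which df.columns accepts as a positional indexer:
import numpy as np

np.r_[1, 2, 101:104]
# array([  1,   2, 101, 102, 103])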
This should work:
df.drop(df.columns[[indexes_of_columns_you_want_to_delete]], axis=1, inplace=True)
Please try this:
import numpy as np
import pandas as pd

input_df.drop(input_df.columns[np.r_[0, 2:4]], axis=1, inplace=True)

How do I rename an index row in Python Pandas? [duplicate]

This question already has answers here:
How can I change a specific row label in a Pandas dataframe?
(2 answers)
Closed 5 years ago.
I see how to rename columns, but I want to rename an index (row name) that I have in a data frame.
I had a table with 350 rows in it; I then added a total to the bottom and removed every row except that last row.
-------------------------------------------------
| | A | B | C |
-------------------------------------------------
| TOTAL | 1243 | 423 | 23 |
-------------------------------------------------
So I have the row called 'Total', and then several columns. I want to rename the word 'Total' to something else.
Is this even possible?
Many thanks
You could use a dictionary with rename(), for example:
In [1]: import pandas as pd
   ...: df = pd.Series([1, 2, 3])
   ...: df
Out[1]:
0    1
1    2
2    3
dtype: int64

In [2]: df.rename({1: 3, 2: 'total'})
Out[2]:
0        1
3        2
total    3
dtype: int64
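Applied to the question's single-row frame (assuming the row label is literally 'TOTAL'), the same dictionary form renames that one row:
# 'GrandTotal' is an illustrative replacement label
df = df.rename(index={'TOTAL': 'GrandTotal'})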
Easy as this...
df.index.name = 'Name'
Note that this sets the name of the index itself, not an individual row label; to change the 'TOTAL' label, use rename as above.

Selecting max within partition for pandas dataframe [duplicate]

This question already has answers here:
Python pandas - filter rows after groupby
(4 answers)
Closed 8 years ago.
I have a pandas dataframe. My goal is to select only those rows where column C has the largest value within group B. For example, when B is "one" the maximum value of C is 311, so I would like the row where C = 311 and B = "one."
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'A': pd.Categorical(["test1", "test2", "test3", "test4"]),
                    'B': pd.Categorical(["one", "one", "two", "two"]),
                    'C': np.array([311, 42, 31, 41]),
                    'D': np.array([9, 8, 7, 6])})
df2.groupby('C').max()
Output should be:
test1  one  311  9
test4  two   41  6
You can use idxmax(), which returns the indices of the max values:
maxes = df2.groupby('B')['C'].idxmax()
df2.loc[maxes]
Output:
Out[11]:
       A    B    C  D
0  test1  one  311  9
3  test4  two   41  6
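An equivalent pattern, shown here as an alternative sketch, builds a boolean mask with a groupby transform; unlike idxmax, which returns only the first maximum per group, it also keeps ties:
# Keep rows whose C equals the maximum C within their B group
df2[df2['C'] == df2.groupby('B')['C'].transform('max')]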
