In [3]: import numpy as np
In [4]: b = pd.DataFrame(np.array([
...: [1,np.nan,3,4],
...: [np.nan, 4, np.nan, 4]
...: ]))
In [13]: b
Out[13]:
0 1 2 3
0 1.0 NaN 3.0 4.0
1 NaN 4.0 NaN 4.0
I want to find the column name and index of every NaN value.
For example: "b has NaN values at index 0, col 1; index 1, col 0; and index 1, col 2."
What I've tried:
1.
In [14]: b[b.isnull()]
Out[14]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
=> I don't know why it shows a DataFrame filled with NaN.
2.
In [15]: b[b[0].isnull()]
Out[15]:
0 1 2 3
1 NaN 4.0 NaN 4.0
=> It only shows the part of the DataFrame where a NaN value exists in column 0.
How can I find the index and column name of every NaN value?
You could use np.where to find the indices where pd.isnull(b) is True:
import numpy as np
import pandas as pd
b = pd.DataFrame(np.array([
[1,np.nan,3,4],
[np.nan, 4, np.nan, 4]]))
idx, idy = np.where(pd.isnull(b))
result = np.column_stack([b.index[idx], b.columns[idy]])
print(result)
# [[0 1]
# [1 0]
# [1 2]]
Or use DataFrame.stack to reshape the DataFrame by moving the column labels into the index.
This creates a Series which is True where b is null:
mask = pd.isnull(b).stack()
# 0 0 False
# 1 True
# 2 False
# 3 False
# 1 0 True
# 1 False
# 2 True
# 3 False
and then read off the row and column labels from the MultiIndex:
print(mask.loc[mask])
# 0 1 True
# 1 0 True
# 2 True
# dtype: bool
print(mask.loc[mask].index.tolist())
# [(0, 1), (1, 0), (1, 2)]
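For completeness, the same stack idea can be written as a single chain; a minimal sketch assuming b is the DataFrame above:
nan_locations = b.isna().stack().loc[lambda s: s].index.tolist()
print(nan_locations)
# [(0, 1), (1, 0), (1, 2)]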
The objective is to assign a 1 to every row in a group whose index is higher than the index retrieved from idxmax().
import numpy as np
import pandas as pd
df = pd.DataFrame({'id':[1, 1, 1, 2, 2, 2, 3, 3, 3], 'val':[1, np.nan, 0, np.nan, 1, 0, 1, 0, 0]})
id val
0 1 1.0
1 1 NaN
2 1 0.0
3 2 NaN
4 2 1.0
5 2 0.0
6 3 1.0
7 3 0.0
8 3 0.0
We can use idxmax() to get the index of the highest value in each group:
test = df.groupby('id')['val'].idxmax()
id
1 0
2 4
3 6
The objective is to transform the data to look like the following (every value in a group that has a higher index than the one from idxmax() gets assigned a 1):
id val
0 1 1.0
1 1 1.0
2 1 1.0
3 2 NaN
4 2 1.0
5 2 1.0
6 3 1.0
7 3 1.0
8 3 1.0
This question does not necessarily need to be done with idxmax(). Open to any suggestions.
If I understand the problem correctly, you can use transform and np.where:
nd = df.groupby('id')['val'].idxmax().tolist()
df['val'] = df.groupby('id')['val'].transform(lambda x: np.where(x.index>nd[x.name-1], 1, x))
df
Output:
id val
0 1 1.0
1 1 1.0
2 1 1.0
3 2 NaN
4 2 1.0
5 2 1.0
6 3 1.0
7 3 1.0
8 3 1.0
Considering the comment, it is probably best to use a dictionary in case the df.id column is not sequential:
nd = {k:v for k,v in zip(df.id.unique(),df.groupby('id')['val'].idxmax().tolist())}
df['val'] = df.groupby('id')['val'].transform(lambda x: np.where(x.index>nd[x.name], 1, x))
(the whole thing is significantly slower than the solution offered by not_a_robot)
Try
df = pd.DataFrame({'id':[1, 1, 1, 2, 2, 2, 3, 3, 3], 'val':[1, np.nan, 0, np.nan, 1, 0, 1, 0, 0]})
# cummax fills everything after the first True to True in each group
# mask replaces the 0s that were originally nan by nan
df.val = df.val.eq(1).groupby(df.id).cummax().astype(int).mask(lambda x: x.eq(0) & df.val.isna())
df
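On the sample data this reproduces the desired output from the question:
   id  val
0   1  1.0
1   1  1.0
2   1  1.0
3   2  NaN
4   2  1.0
5   2  1.0
6   3  1.0
7   3  1.0
8   3  1.0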
So I have two columns, for example A & B, and they look like this:
A B
1 4
2 5
3 6
NaN NaN
NaN NaN
NaN NaN
and I want it like this:
A
1
2
3
4
5
6
Any ideas?
I'm assuming your data is in two columns of a DataFrame. You can append the B values to the end of the A values, then drop the NA values with the np.nan != np.nan trick. Here's an example:
import pandas as pd
import numpy as np
d = {
'A': [1,2,3, np.nan, np.nan, np.nan],
'B': [4,5,6, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(d)
>>> df
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
# np.nan != np.nan trick: a value compares equal to itself except when it is NaN
>>> df['A'] == df['A']
0 True
1 True
2 True
3 False
4 False
5 False
Name: A, dtype: bool
x = pd.concat([df['A'], df['B']])
>>> x
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 NaN
0 4.0
1 5.0
2 6.0
3 NaN
4 NaN
5 NaN
dtype: float64
x = x[x == x]
>>> x
0    1.0
1    2.0
2    3.0
0    4.0
1    5.0
2    6.0
dtype: float64
Using numpy, it could be something like:
import numpy as np
A = np.array([1, 2, 3, np.nan, np.nan, np.nan])
B = np.array([4, 5, 6, np.nan, np.nan, np.nan])
# NaN compares False against any number, so A < np.inf keeps only the non-NaN values
C = np.hstack([A[A < np.inf], B[B < np.inf]])
print(C)  # [1. 2. 3. 4. 5. 6.]
What you might want is:
import pandas as pd
a = pd.Series([1, 2, 3, None, None, None])
b = pd.Series([4, 5, 6, None, None, None])
print(pd.concat([a.iloc[:3], b.iloc[:3]]))
And if you are just looking for the non-NaN values, feel free to use .dropna() on a Series.
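For instance, a minimal sketch with the same a and b:
print(pd.concat([a, b]).dropna().reset_index(drop=True))
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    5.0
# 5    6.0
# dtype: float64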
I'm new to pandas and was looking for some advice on how to reshape my dataframe:
Currently, I have a dataframe like this.
panelist_id  type  type_count  refer_sm_count  refer_se_count  refer_non_n_count
1            HP    2           2               1               1
1            PB    1           0               1               0
1            TN    3           0               3               0
2            HP    1           1               0               0
2            PB    2           1               1               0
Ideally, I want my dataframe to look like this:
panelist_id  type_HP_count  type_PB_count  type_TN_count  refer_sm_count_HP  refer_se_count_HP  refer_non_n_count_HP  refer_sm_count_PB  refer_se_count_PB  refer_non_n_count_PB  refer_sm_count_TN  refer_se_count_TN  refer_non_n_count_TN
1            2              1              3              2                  1                  1                     0                  1                  0                     0                  3                  0
2            1              2              0              1                  0                  0                     1                  1                  0                     0                  0                  0
Basically, I need to convert the different row values in the 'type' column into new columns, showing the count for each type. The next three 'refer' columns in the original df need to be broken out for each different 'type', e.g., refer_sm_count_HP. Any help would be much appreciated. Thanks
Try the pivot_table() and rename_axis() methods:
out=(df.pivot_table(index='panelist_id',columns='type',fill_value=0)
.rename_axis(columns=[None,None],index=None))
Finally, use the map() method on the .columns attribute:
out.columns=out.columns.map('_'.join)
Now if you print out you will get your desired output.
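A minimal runnable sketch of this approach, using the sample data that also appears in the pyjanitor answer below:
import pandas as pd

df = pd.DataFrame({
    'panelist_id': [1, 1, 1, 2, 2],
    'type': ['HP', 'PB', 'TN', 'HP', 'PB'],
    'type_count': [2, 1, 3, 1, 2],
    'refer_sm_count': [2, 0, 0, 1, 1],
    'refer_se_count': [1, 1, 3, 0, 1],
    'refer_non_n_count': [1, 0, 0, 0, 0]
})

out = (df.pivot_table(index='panelist_id', columns='type', fill_value=0)
         .rename_axis(columns=[None, None], index=None))
out.columns = out.columns.map('_'.join)
print(out)  # columns are flattened to names such as refer_sm_count_HP, type_count_TN, ...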
A pivot_wider option via pyjanitor:
new_df = df.pivot_wider(index='panelist_id',
names_from='type',
names_from_position='last',
fill_value=0)
new_df:
panelist_id type_count_HP type_count_PB type_count_TN refer_sm_count_HP refer_sm_count_PB refer_sm_count_TN refer_se_count_HP refer_se_count_PB refer_se_count_TN refer_non_n_count_HP refer_non_n_count_PB refer_non_n_count_TN
1 2 1 3 2 0 0 1 1 3 1 0 0
2 1 2 0 1 1 0 0 1 0 0 0 0
Complete Working Example:
import janitor
import pandas as pd
df = pd.DataFrame({
'panelist_id': [1, 1, 1, 2, 2],
'type': ['HP', 'PB', 'TN', 'HP', 'PB'],
'type_count': [2, 1, 3, 1, 2],
'refer_sm_count': [2, 0, 0, 1, 1],
'refer_se_count': [1, 1, 3, 0, 1],
'refer_non_n_count': [1, 0, 0, 0, 0]
})
new_df = df.pivot_wider(index='panelist_id',
names_from='type',
names_from_position='last',
fill_value=0)
print(new_df.to_string(index=False))
Just adding one more option:
df = df.set_index(['panelist_id', 'type']).unstack(-1, fill_value=0)
df.columns = df.columns.map('_'.join)
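A quick sketch of this option on the same sample df as above, assigned to a new variable so the original df is kept:
out2 = df.set_index(['panelist_id', 'type']).unstack(-1, fill_value=0)
out2.columns = out2.columns.map('_'.join)
print(out2.columns.tolist())
# ['type_count_HP', 'type_count_PB', 'type_count_TN', 'refer_sm_count_HP', ...]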
Use pivot_table to create a MultiIndex:
df_p = df.pivot_table(index='panelist_id', columns='type', aggfunc=sum)
refer_non_n_count refer_se_count \
type HP PB TN HP PB TN
panelist_id
1 1.0 0.0 0.0 1.0 1.0 3.0
2 0.0 0.0 NaN 0.0 1.0 NaN
refer_sm_count type_count
type HP PB TN HP PB TN
panelist_id
1 2.0 0.0 0.0 2.0 1.0 3.0
2 1.0 1.0 NaN 1.0 2.0 NaN
If you do want to flatten your columns, then:
df_p.columns = ['_'.join(col) for col in df_p.columns.values]
First, import libs:
import numpy as np
import pandas as pd
Then, read your data:
data = pd.read_excel('base.xlsx')
Reshape your data using pivot_table:
data_reshaped = pd.pivot_table(data, values=['type_count', 'refer_sm_count', 'refer_se_count', 'refer_non_n_count'],
index=['panelist_id'], columns=['type'], aggfunc=np.sum)
But the column index will not be clean. So, rebuild the column names and reset the index:
columns = [data_reshaped.columns[i][0] + '_' + data_reshaped.columns[i][1]
           for i in range(len(data_reshaped.columns))]  # to create new column names
data_reshaped.columns = columns  # to assign the new column names to the dataframe
data_reshaped.reset_index(inplace=True)  # to reset the index
data_reshaped.fillna(0, inplace=True)  # to replace NaN with 0
Then your data will look good.
I have a DataFrame like
A B
1 2
2 -
5 -
4 5
I want to apply a function func() on column B (but the function gives an error if - is passed). I cannot modify the func() function. I need something like:
df['B'] = df['B'].apply(func), but only where the value is not equal to -
Use a custom function to apply on a df column if a condition is satisfied:
import numpy as np
import pandas as pd

def func(a):
    return a + 10

# new pandas DataFrame with four rows and 2 columns; the 3rd row has a NaN in column B
df = pd.DataFrame([[1, 2], [3, 4], [5, np.nan], [7, 8]], columns=["A", "B"])
print(df)
#coerce column named B to numeric
s = pd.to_numeric(df['B'], errors='coerce')
#a mask has true for numeric rows, false for non numeric rows
mask = s.notna()
#mask
print(mask)
#run func only on the numeric rows of the B column
df.loc[mask, 'B'] = s[mask].apply(func)
print(df)
Which prints:
A B
0 1 2.0
1 3 4.0
2 5 NaN
3 7 8.0
0 True
1 True
2 False
3 True
A B
0 1 12.0
1 3 14.0
2 5 NaN
3 7 18.0
Try:
df['B'] = df[df['B']!='-']['B'].apply(func)
Or when the - is actually NaN you can use:
df['B'] = df[pd.notnull(df['B'])]['B'].apply(func)
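A minimal runnable sketch of the first option on data like the question's; func here is a stand-in for the real function, which is assumed to accept numbers:
import pandas as pd

def func(x):  # placeholder for the real func(), which cannot be modified
    return x * 10

df = pd.DataFrame({'A': [1, 2, 5, 4], 'B': [2, '-', '-', 5]})
df['B'] = df[df['B'] != '-']['B'].apply(func)
print(df)  # rows that contained '-' are now NaN; the others hold func's result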
I have a dataframe with ~300K rows and ~40 columns.
I want to find out if any rows contain null values - and put these 'null'-rows into a separate dataframe so that I could explore them easily.
I can create a mask explicitly:
mask = False
for col in df.columns:
mask = mask | df[col].isnull()
dfnulls = df[mask]
Or I can do something like:
df.ix[df.index[(df.T == np.nan).sum() > 1]]
Is there a more elegant way of doing it (locating rows with nulls in them)?
[Updated to adapt to modern pandas, which has isnull as a method of DataFrames.]
You can use isnull and any to build a boolean Series and use that to index into your frame:
>>> df = pd.DataFrame([range(3), [0, np.NaN, 0], [0, 0, np.NaN], range(3), range(3)])
>>> df.isnull()
0 1 2
0 False False False
1 False True False
2 False False True
3 False False False
4 False False False
>>> df.isnull().any(axis=1)
0 False
1 True
2 True
3 False
4 False
dtype: bool
>>> df[df.isnull().any(axis=1)]
0 1 2
1 0 NaN 0
2 0 0 NaN
[For older pandas:]
You could use the function isnull instead of the method:
In [56]: df = pd.DataFrame([range(3), [0, np.NaN, 0], [0, 0, np.NaN], range(3), range(3)])
In [57]: df
Out[57]:
0 1 2
0 0 1 2
1 0 NaN 0
2 0 0 NaN
3 0 1 2
4 0 1 2
In [58]: pd.isnull(df)
Out[58]:
0 1 2
0 False False False
1 False True False
2 False False True
3 False False False
4 False False False
In [59]: pd.isnull(df).any(axis=1)
Out[59]:
0 False
1 True
2 True
3 False
4 False
leading to the rather compact:
In [60]: df[pd.isnull(df).any(axis=1)]
Out[60]:
0 1 2
1 0 NaN 0
2 0 0 NaN
def nans(df): return df[df.isnull().any(axis=1)]
then whenever you need it you can type:
nans(your_dataframe)
If you want to filter rows by a certain number of columns with null values, you may use this:
df.iloc[df[(df.isnull().sum(axis=1) >= qty_of_nuls)].index]
So, here is the example:
Your dataframe:
>>> df = pd.DataFrame([range(4), [0, np.NaN, 0, np.NaN], [0, 0, np.NaN, 0], range(4), [np.NaN, 0, np.NaN, np.NaN]])
>>> df
0 1 2 3
0 0.0 1.0 2.0 3.0
1 0.0 NaN 0.0 NaN
2 0.0 0.0 NaN 0.0
3 0.0 1.0 2.0 3.0
4 NaN 0.0 NaN NaN
If you want to select the rows that have two or more columns with null values, you run the following:
>>> qty_of_nuls = 2
>>> df.iloc[df[(df.isnull().sum(axis=1) >=qty_of_nuls)].index]
0 1 2 3
1 0.0 NaN 0.0 NaN
4 NaN 0.0 NaN NaN
Four fewer characters, but 2 more ms
%%timeit
df.isna().T.any()
# 52.4 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.isna().any(axis=1)
# 50 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'd probably use axis=1
.any() and .all() are great for the extreme cases, but not when you're looking for a specific number of null values. Here's an extremely simple way to do what I believe you're asking. It's pretty verbose, but functional.
import pandas as pd
import numpy as np
# Some test data frame
df = pd.DataFrame({'num_legs': [2, 4, np.nan, 0, np.nan],
'num_wings': [2, 0, np.nan, 0, 9],
'num_specimen_seen': [10, np.nan, 1, 8, np.nan]})
# Helper: counts the NaNs in each row
def row_nan_sums(df):
sums = []
for row in df.values:
sum = 0
for el in row:
if el != el: # np.nan is never equal to itself. This is "hacky", but complete.
sum+=1
sums.append(sum)
return sums
# Returns a list of indices for rows with k+ NaNs
def query_k_plus_sums(df, k):
sums = row_nan_sums(df)
indices = []
i = 0
for sum in sums:
if (sum >= k):
indices.append(i)
i += 1
return indices
# test
print(df)
print(query_k_plus_sums(df, 2))
Output
num_legs num_wings num_specimen_seen
0 2.0 2.0 10.0
1 4.0 0.0 NaN
2 NaN NaN 1.0
3 0.0 0.0 8.0
4 NaN 9.0 NaN
[2, 4]
Then, if you're like me and want to clear those rows out, you just write this:
# drop the rows from the data frame
df.drop(query_k_plus_sums(df, 2),inplace=True)
# Shuffle the rows and reset the index (reset_index is what renumbers the rows)
df = df.sample(frac=1).reset_index(drop=True)
# print data frame
print(df)
Output:
num_legs num_wings num_specimen_seen
0 4.0 0.0 NaN
1 0.0 0.0 8.0
2 2.0 2.0 10.0
df1 = df[df.isna().any(axis=1)]
Refer link: (Display rows with one or more NaN values in pandas dataframe)