How to find "?" values in a pandas dataframe - python

I have been trying to read a CSV file in a dataframe which has "?" values in some of the rows.
I want to find the rows which contain these values (?) over all the columns
I tried using loc, but it returns an empty DataFrame:
test_df.loc[test_df['rbc'] == "?"]
test_df.loc[test_df['rbc'] == None]
Both return an empty DataFrame.
I want to check every column of the dataframe, not just one.
Can someone suggest a way to do this?

To select the columns that contain a value exactly equal to ?:
df1 = df.loc[:, (df.astype(str) == '?').any()]
More generally, to select the columns in which ? appears as a substring (note the raw string, so that \? is passed to the regex engine as an escaped literal):
df2 = df.loc[:, df.apply(lambda x: x.astype(str).str.contains(r'\?')).any()]
EDIT:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,'?',2,3],
'D':['?',3,5,7,1,0],
'E':[5,3,6,9,2,'?'],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 ? 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 ? 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 ? b
You can create a boolean DataFrame first and then use any per row and per column for filtering:
mask = df.apply(lambda x: x.astype(str).str.contains(r'\?'))
df2 = df.loc[mask.any(axis=1), mask.any()]
print (df2)
C D E
0 7 ? 5
3 ? 7 9
5 3 0 ?
Detail:
print (mask)
A B C D E F
0 False False False True False False
1 False False False False False False
2 False False False False False False
3 False False True False False False
4 False False False False False False
5 False False False False True False
print (mask.any(axis=1))
0 True
1 False
2 False
3 True
4 False
5 True
dtype: bool
print (mask.any())
A False
B False
C True
D True
E True
F False
dtype: bool
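The exact-match variant can be put together in one short script. This is a minimal sketch that reuses the sample frame from this answer and assumes the placeholder is exactly the string '?' rather than a substring; the names rows_with_q and cols_with_q are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, '?', 2, 3],
                   'D': ['?', 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, '?'],
                   'F': list('aaabbb')})

# Cast to str so numeric and object columns are compared the same way.
mask = df.astype(str).eq('?')

rows_with_q = df[mask.any(axis=1)]      # rows holding at least one '?'
cols_with_q = mask.columns[mask.any()]  # columns holding at least one '?'
```

Here rows_with_q keeps all columns of the matching rows; combine both masks with df.loc, as in the answer, to trim the columns as well.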

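This will work for a single column. Pass regex=False, because str.contains treats its pattern as a regular expression by default and a bare ? is not a valid pattern:
result = test_df[test_df['rbc'].str.contains("?", regex=False)]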
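A small sketch of the single-column check, in case the column also holds missing values. The contents of the rbc column below are made up for illustration; only the column name comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for the 'rbc' column.
test_df = pd.DataFrame({'rbc': ['normal', '?', np.nan, 'abnormal', '?']})

# regex=False treats "?" as a literal character; na=False makes missing
# values count as "no match", so the result is a clean boolean mask.
hits = test_df['rbc'].str.contains('?', regex=False, na=False)
result = test_df[hits]
```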

Related

How to get column index which is matching with specific value in Pandas?

I have the following dataframe as below.
0 1 2 3 4 5 6 7
True False False False False False False False
[1 rows * 8 columns]
As you can see, there is one True value, in the first column.
So I want to get 0, the index of the column that holds the True element.
In another case, where the True value is in column 4, I would like to get 4, as in the dataframe below.
0 1 2 3 4 5 6 7
False False False False True False False False
[1 rows * 8 columns]
I tried to google it but failed to get what I want.
Assume the columns have no designated names in this case.
Look forward to your help.
Thanks.
IIUC, you are looking for idxmax:
>>> df
0 1 2 3 4 5 6 7
0 True False False False False False False False
>>> df.idxmax(axis=1)
0 0
dtype: object
>>> df
0 1 2 3 4 5 6 7
0 False False False False True False False False
>>> df.idxmax(axis=1)
0 4
dtype: object
Caveat: if all values are False, Pandas returns the first index because index 0 is the lowest index of the highest value:
>>> df
0 1 2 3 4 5 6 7
0 False False False False False False False False
>>> df.idxmax(axis=1)
0 0
dtype: object
Workaround: replace False by np.nan:
>>> df.replace(False, np.nan).idxmax(axis=1)
0 NaN
dtype: float64
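Another way to handle the all-False caveat, sketched here under the assumption that the frame is purely boolean, is to keep idxmax and blank out the rows that contain no True at all. The two-row frame is illustrative, not taken from the question.

```python
import pandas as pd

# Illustrative frame: row 0 has a True in column 2, row 1 has none.
df = pd.DataFrame([[False, False, True, False],
                   [False, False, False, False]])

has_true = df.any(axis=1)
# idxmax reports the first column for an all-False row, so mask those rows.
first_true = df.idxmax(axis=1).where(has_true)
```

Rows without any True come back as NaN, so the result is no longer an integer Series.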
If you want every column that is True:
cols_true = []
for idx, row in df.iterrows():
    for col in df.columns:
        if row[col]:
            cols_true.append(col)
print(cols_true)
Use boolean indexing:
df.columns[df.iloc[0]]
output:
Index(['0'], dtype='object')
Or numpy.where, which returns the positions (not the labels) of the True columns:
np.where(df)[1]
You may want to index the dataframe's index by a column itself (0 in this case), as follows:
df.index[df[0]]
You'll get:
Int64Index([0], dtype='int64')
df.loc[:, df.any()].columns[0]
# 4
If you have several True values, you can get them all with .columns (without the trailing [0]).
Generalization
Imagine we have the following dataframe (several True values in positions 4, 6 and 7):
0 1 2 3 4 5 6 7
0 False False False False True False True True
With the formula above:
df.loc[:, df.any()].columns
# Int64Index([4, 6, 7], dtype='int64')
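When the frame has more than one row, the True columns can be collected per row. The following is a rough sketch using numpy.where; the second row and the name true_cols are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Row 0 matches the generalization example; row 1 is a made-up extra row.
df = pd.DataFrame([[False, False, False, False, True, False, True, True],
                   [True, False, False, False, False, False, False, False]])

# np.where returns the positions of the True cells as (rows, columns).
row_pos, col_pos = np.where(df.to_numpy())

# Translate positions back to labels and group them by row label.
true_cols = {}
for r, c in zip(row_pos, col_pos):
    true_cols.setdefault(df.index[r], []).append(df.columns[c])
```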
df1.apply(lambda ss:ss.loc[ss].index.min(),axis=1).squeeze()
out:
0
or
df1.loc[:,df1.iloc[0]].columns.min()

How to forward propagate/fill a specific value in a Pandas DataFrame Column/Series?

I have a boolean column in a dataframe that looks like the following:
True
False
False
False
False
True
False
False
False
I want to forward propagate/fill the True values n number of times. e.g. 2 times:
True
True
True
False
False
True
True
True
False
ffill does something similar for NaN values, but I can't find anything that works on a specific value as described. Is the easiest way just to loop over the rows and modify the column in question with a counter?
Each row is an equidistant time series entry.
EDIT:
The current answers all solve my specific problem with a bool column, but one answer can be modified to be more general purpose:
>> s = pd.Series([1, 2, 3, 4, 5, 1, 2, 3])
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
>> condition_mask = s == 2
>> s.mask(~(condition_mask)).ffill(limit=2).fillna(s).astype(int)
0 1
1 2
2 2
3 2
4 5
5 1
6 2
7 2
You can still use ffill, but first you have to mask the False values:
s.mask(~s).ffill(limit=2).fillna(s)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
8 False
Name: 0, dtype: bool
For 2 times you could have:
s = s | s.shift(1) | s.shift(2)
You could generalize to n-times from there.
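One possible way to generalize the shift idea to n steps is sketched below. The helper name propagate_true is hypothetical; fill_value=False is used so that the shifted series stay boolean instead of picking up NaN.

```python
from functools import reduce
import operator

import pandas as pd

def propagate_true(s, n):
    """OR the series with its 1..n step shifts (shift 0 is the series itself)."""
    shifted = (s.shift(k, fill_value=False) for k in range(n + 1))
    return reduce(operator.or_, shifted)

s = pd.Series([True, False, False, False, False, True, False, False, False])
filled = propagate_true(s, 2)
```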
Try rolling with a window one larger than the number of fills (n = 3 fills twice):
n = 3
s.rolling(n, min_periods=1).max().astype(bool)
Out[147]:
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
8 False
Name: s, dtype: bool
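The relation between the window size and the number of fills can be made explicit in a small wrapper. This is only a sketch: the function name is hypothetical, and the cast to int is a defensive assumption so that rolling operates on numeric data.

```python
import pandas as pd

def propagate_true_rolling(s, n_fill):
    """Forward-propagate True values n_fill times using a rolling maximum."""
    window = n_fill + 1  # the current row plus the n_fill rows before it
    return s.astype(int).rolling(window, min_periods=1).max().astype(bool)

s = pd.Series([True, False, False, False, False, True, False, False, False])
filled = propagate_true_rolling(s, 2)
```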

Pandas: check if column value is unique

I have a DataFrame like:
value
0 1
1 2
2 2
3 3
4 4
5 4
I need to check if each value is unique or not, and mark that boolean value to new column. Expected result would be:
value unique
0 1 True
1 2 False
2 2 False
3 3 True
4 4 False
5 4 False
I have tried:
df['unique'] = ""
df.loc[df["value"].is_unique, 'unique'] = True
But this throws exception:
cannot use a single bool to index into setitem
Any advice would be highly appreciated. Thanks.
Use Series.duplicated with keep=False and invert the mask with ~:
df['unique'] = ~df['value'].duplicated(keep=False)
print (df)
value unique
0 1 True
1 2 False
2 2 False
3 3 True
4 4 False
5 4 False
Or:
df['unique'] = np.where(df['value'].duplicated(keep=False), False, True)
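For reference, here is a self-contained version of the duplicated approach on the question's data, with a cross-check based on group sizes. The cross-check is an addition for illustration and not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 2, 3, 4, 4]})

# keep=False marks every member of a duplicated group, not just the repeats.
df['unique'] = ~df['value'].duplicated(keep=False)

# Cross-check: a value is unique exactly when its group has size 1.
group_sizes = df.groupby('value')['value'].transform('size')
same_answer = bool((df['unique'] == group_sizes.eq(1)).all())
```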
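Counting the values works as well. Mapping each value to its count avoids relying on the name of the counts column, which differs between pandas versions:
df['unique'] = df['value'].map(df['value'].value_counts()) == 1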

Compare columns of 2 dataframes with a combination of index and row value

There are quite a few similar questions out there, but I am not sure if there is one that tackles both index and row values. (relevant to binary classification df)
What I am trying to do is check that columns with the same name have the same values at the same index. If not, simply return an error.
Let's say DataFrame df has columns a, b and c, and df_original has columns from a to z.
How can we first find the columns that have the same name in both DataFrames, and then check that the contents of those columns match row by row, in both value and index, between a, b and c from df and df_original?
The contents of all the columns are numerical, which is why I want to compare the combination of index and value.
Demo:
In [1]: df
Out[1]:
a b c
0 0 1 2
1 1 2 0
2 0 1 0
3 1 1 0
4 3 1 0
In [3]: df_original
Out[3]:
a b c d e f g ......
0 4 3 1 1 0 0 0
1 3 1 2 1 1 2 1
2 1 2 1 1 1 2 1
3 3 4 1 1 1 2 1
4 0 3 0 0 1 1 1
In the above example, for those columns that have the same column name, compare the combination of index and value and flag an error if the combination of index and value is not correct
common_cols = df.columns.intersection(df_original.columns)
for col in common_cols:
    df1_ind_val_pair = df[col].index.astype(str) + ' ' + df[col].astype(str)
    df2_ind_val_pair = df_original[col].index.astype(str) + ' ' + df_original[col].astype(str)
    if any(df1_ind_val_pair != df2_ind_val_pair):
        print('Found one or more unequal (index, value) pairs in col {}'.format(col))
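A shorter variant of the same check can lean on Series.equals, which compares index labels and values together (and also requires matching dtypes, which may be stricter than needed). This is a sketch: the helper name is hypothetical and df_original is truncated to columns a to d of the sample.

```python
import pandas as pd

def mismatched_common_columns(left, right):
    """Return the shared column names whose (index, value) pairs differ."""
    common = left.columns.intersection(right.columns)
    # Series.equals is True only if index, values and dtype all agree.
    return [col for col in common if not left[col].equals(right[col])]

df = pd.DataFrame({'a': [0, 1, 0, 1, 3],
                   'b': [1, 2, 1, 1, 1],
                   'c': [2, 0, 0, 0, 0]})
df_original = pd.DataFrame({'a': [4, 3, 1, 3, 0],
                            'b': [3, 1, 2, 4, 3],
                            'c': [1, 2, 1, 1, 0],
                            'd': [1, 1, 1, 1, 0]})

bad = mismatched_common_columns(df, df_original)
if bad:
    print('Mismatch in columns:', bad)
```

To flag an error instead of printing, the final branch could raise ValueError with the same message.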
IIUC:
Use pd.DataFrame.align with a join method of inner, then pass the resulting tuple unpacked to pd.DataFrame.eq (dfo below stands for df_original):
pd.DataFrame.eq(*df.align(dfo, 'inner'))
a b c
0 False False False
1 False False False
2 False False False
3 False False False
4 False False True
To flag the rows in which every column matches, reduce the mask with all(1):
pd.DataFrame.eq(*df.align(dfo, 'inner')).all(1)
0 False
1 False
2 False
3 False
4 False
dtype: bool
With the sample data, however, filtering with this mask gives an empty result:
df[pd.DataFrame.eq(*df.align(dfo, 'inner')).all(1)]
Empty DataFrame
Columns: [a, b, c]
Index: []
Same answer but with clearer code:
def eq(d1, d2):
    d1, d2 = d1.align(d2, 'inner')
    return d1 == d2
eq(df, dfo)
a b c
0 False False False
1 False False False
2 False False False
3 False False False
4 False False True
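The align step can be tried end to end with the sample data. In this sketch dfo is df_original truncated to columns a to d, and the join is passed by keyword; the variable names left, right and all_match are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0, 1, 3],
                   'b': [1, 2, 1, 1, 1],
                   'c': [2, 0, 0, 0, 0]})
dfo = pd.DataFrame({'a': [4, 3, 1, 3, 0],
                    'b': [3, 1, 2, 4, 3],
                    'c': [1, 2, 1, 1, 0],
                    'd': [1, 1, 1, 1, 0]})

# An inner join keeps only the row and column labels present in both frames.
left, right = df.align(dfo, join='inner')
equal = left.eq(right)

# True only if every shared cell matches.
all_match = bool(equal.all().all())
```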

compare df1 column 1 to all columns in df2 returning the index of df2

I'm new to pandas so likely overlooking something but I've been searching and haven't found anything helpful yet.
What I'm trying to do is this. I have 2 dataframes. df1 has only 1 column and an unknown number of rows. df2 has an unknown number of rows also and also an unknown number of columns for each index.
Example:
df1:
0 1117454
1 1147637
2 1148945
3 1149662
4 1151543
5 1151545
6 1236268
7 1236671
8 1236673
...
300 1366962
df2:
1 2 3 4 5 6 7
8302813476 1375294 1375297 1375313 1375318 1375325 1375330 1375331
8302813477 1317422 1363270 1363288 1363262 None None None
8302813478 1187269 1187276 1149662 1147843 1147639 1236650 1236656
So what I want is to check all df1 values against df2 column 1 - n and if there is a match with any value in df1 mark the index of df2 as True else it is False.
I think you can use isin to test the Series created from df2 by stack against the Series created from the single column of df1 by squeeze, and finally reshape back with unstack:
df3 = df2.stack().isin(df1.squeeze()).unstack()
print (df3)
1 2 3 4 5 6 7
8302813476 False False False False False False False
8302813477 False False False False False False False
8302813478 False False True False False False False
Then use any to find the rows with at least one True:
a = df3.any(axis=1)
print (a)
8302813476 False
8302813477 False
8302813478 True
dtype: bool
And finally use boolean indexing:
print (a[a].index)
Int64Index([8302813478], dtype='int64')
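The same result can also be reached with DataFrame.isin directly, without the stack and unstack round trip. This sketch uses a shortened df1 built from the question's values and assumes, as in the alternative below, that its column is named col.

```python
import pandas as pd

# Shortened df1; the column name 'col' is an assumption.
df1 = pd.DataFrame({'col': [1117454, 1147637, 1148945, 1149662]})
df2 = pd.DataFrame(
    [[1375294, 1375297, 1375313, 1375318, 1375325, 1375330, 1375331],
     [1317422, 1363270, 1363288, 1363262, None, None, None],
     [1187269, 1187276, 1149662, 1147843, 1147639, 1236650, 1236656]],
    index=[8302813476, 8302813477, 8302813478],
    columns=range(1, 8),
)

# Pass a plain list: DataFrame.isin matches a Series argument by index label.
matched = df2.isin(df1['col'].tolist()).any(axis=1)
matching_ids = matched[matched].index.tolist()
```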
Another solution is to use df1['col'].unique() instead of squeeze (thank you, Ted Petrou):
df3 = df2.stack().isin(df1['col'].unique()).unstack()
print (df3)
1 2 3 4 5 6 7
8302813476 False False False False False False False
8302813477 False False False False False False False
8302813478 False False True False False False False
---
I like squeeze more, but simply selecting the column of df1 gives the same output:
df3 = df2.stack().isin(df1['col']).unstack()
print (df3)
1 2 3 4 5 6 7
8302813476 False False False False False False False
8302813477 False False False False False False False
8302813478 False False True False False False False
As an interesting numpy alternative:
l1 = df1.values.ravel()
l2 = df2.values.ravel()
pd.DataFrame(
    np.equal.outer(l1, l2).any(0).reshape(df2.values.shape),
    df2.index, df2.columns
)
or using set, list and a comprehension:
l1 = set(df1.values.ravel().tolist())
l2 = df2.values.ravel().tolist()
pd.DataFrame(
    np.array([bool(l1.intersection([d])) for d in l2]).reshape(df2.values.shape),
    df2.index, df2.columns
)
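numpy also provides isin, which performs the membership test on the whole array at once and keeps its shape. Below is a small sketch with a reduced, purely numeric sample drawn from the question's values; the column name col and the variable hits are assumptions.

```python
import numpy as np
import pandas as pd

# Reduced sample: two values from df1 and a 2x2 slice of df2.
df1 = pd.DataFrame({'col': [1149662, 1236268]})
df2 = pd.DataFrame([[1187269, 1149662],
                    [1375294, 1375297]],
                   index=[8302813478, 8302813476], columns=[1, 2])

# np.isin returns a boolean array with the same shape as its first argument.
hits = pd.DataFrame(np.isin(df2.to_numpy(), df1.to_numpy().ravel()),
                    index=df2.index, columns=df2.columns)
```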
