Subset pandas dataframe by dtype [duplicate] - python

This question already has answers here:
In Pandas, how to filter a Series based on the type of the values?
(3 answers)
I have a pandas dataframe df with a column, call it A, that contains multiple data types. I want to select all rows of df where A has a particular data type.
For example, suppose that A has types int and str. I want to do something like df[type(df['A']) == int].

Setup
import pandas as pd

df = pd.DataFrame({'A': ['hello', 1, 2, 3, 'bad']})
This entire column will be assigned dtype object. If you just want to find the numeric values:
pd.to_numeric(df.A, errors='coerce').dropna()
1 1.0
2 2.0
3 3.0
Name: A, dtype: float64
However, this would also allow floats, string representations of numbers, etc. into the mix. If you really want to find elements that are of type int, you can use a list comprehension:
df.loc[[isinstance(val, int) for val in df.A], 'A']
1 1
2 2
3 3
Name: A, dtype: object
But notice that the dtype is still object.
If the column has boolean values, these will be kept, since bool is a subclass of int. If you don't want this behavior, you can use type instead of isinstance.
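For example, an exact-type check keeps the ints from this setup but would exclude any bool values (a minimal sketch):
df.loc[[type(val) is int for val in df.A], 'A']
1 1
2 2
3 3
Name: A, dtype: object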

Group by type
dod = dict(tuple(df.groupby(df['A'].map(type), sort=False)))
Setup
df = pd.DataFrame(dict(A=[1, 'one', {1}, [1], (1,)] * 2))
Validation
for t, d in dod.items():
print(t, d, sep='\n')
print()
<class 'int'>
A
0 1
5 1
<class 'str'>
A
1 one
6 one
<class 'set'>
A
2 {1}
7 {1}
<class 'list'>
A
3 [1]
8 [1]
<class 'tuple'>
A
4 (1,)
9 (1,)
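Each value of dod is the sub-DataFrame for one Python type, keyed by the type object itself, so a single group can be pulled out directly, e.g.:
dod[int]
A
0 1
5 1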

Using groupby data from user3483203
for _, x in df.groupby(df.A.apply(lambda x: type(x).__name__)):
print(x)
A
1 1
2 2
3 3
A
0 hello
4 bad
d = {y: x for y, x in df.groupby(df.A.apply(lambda x: type(x).__name__))}
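Unlike dod above, this dict is keyed by the type names as strings (via __name__), so a single group would be accessed like:
d['int']
A
1 1
2 2
3 3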

a = [2, 'B', 3.0, 'c', 1, 'a', 2.0, 'b', 3, 'C', 'A', 1.0]
df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance, args=[int])
df['float'] = df['a'].apply(isinstance, args=[float])
print(df)
a upper lower int float
0 2 NaN NaN True False
1 B True False False False
2 3 NaN NaN False True
3 c False True False False
4 1 NaN NaN True False
5 a False True False False
6 2 NaN NaN False True
7 b False True False False
8 3 NaN NaN True False
9 C True False False False
10 A True False False False
11 1 NaN NaN False True
integer = df[df['int']]['a']
print(integer)
0 2
4 1
8 3
Name: a, dtype: object
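The indicator columns can also be combined with boolean operators; for instance, a small sketch selecting every numeric row (int or float) at once:
numeric = df[df['int'] | df['float']]['a']
which keeps rows 0, 2, 4, 6, 8 and 11, still with dtype object.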

Related

How to get column index which is matching with specific value in Pandas?

I have the following dataframe as below.
0 1 2 3 4 5 6 7
True False False False False False False False
[1 rows x 8 columns]
As you can see, there is one True value, and it is in the first column.
Therefore, I want to get 0, the index of the column that holds the True element.
In another case, where the True is in the column with index 4, I would like to get 4 instead, as in the dataframe below.
0 1 2 3 4 5 6 7
False False False False True False False False
[1 rows x 8 columns]
I tried to Google this but failed to find what I want.
Also, assume that there are no designated column names in this case.
Looking forward to your help. Thanks.
IIUC, you are looking for idxmax:
>>> df
0 1 2 3 4 5 6 7
0 True False False False False False False False
>>> df.idxmax(axis=1)
0 0
dtype: object
>>> df
0 1 2 3 4 5 6 7
0 False False False False True False False False
>>> df.idxmax(axis=1)
0 4
dtype: object
Caveat: if all values are False, Pandas returns the first index because index 0 is the lowest index of the highest value:
>>> df
0 1 2 3 4 5 6 7
0 False False False False False False False False
>>> df.idxmax(axis=1)
0 0
dtype: object
Workaround: replace False by np.nan:
>>> df.replace(False, np.nan).idxmax(axis=1)
0 NaN
dtype: float64
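Another way to sidestep the caveat, as a sketch, is to keep the idxmax result only where the row actually contains a True:
>>> df.idxmax(axis=1).where(df.any(axis=1))
For an all-False row this yields NaN instead of a misleading 0.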
If you want every column label whose value is True:
cols_true = []
for idx, row in df.iterrows():
    for col in df.columns:
        if row[col]:
            cols_true.append(col)
print(cols_true)
Use boolean indexing:
df.columns[df.iloc[0]]
output:
Index(['0'], dtype='object')
Or numpy.where
np.where(df)[1]
You may want to index the dataframe's index by a column itself (0 in this case), as follows:
df.index[df[0]]
You'll get:
Int64Index([0], dtype='int64')
df.loc[:, df.any()].columns[0]
# 4
If you have several True values, you can also get them all with .columns.
Generalization
Imagine we have the following dataframe (several True values in positions 4, 6 and 7):
0 1 2 3 4 5 6 7
0 False False False False True False True True
With the formula above:
df.loc[:, df.any()].columns
# Int64Index([4, 6, 7], dtype='int64')
df1.apply(lambda ss: ss.loc[ss].index.min(), axis=1).squeeze()
Out:
0
Or
df1.loc[:, df1.iloc[0]].columns.min()

How to forward propagate/fill a specific value in a Pandas DataFrame Column/Series?

I have a boolean column in a dataframe that looks like the following:
True
False
False
False
False
True
False
False
False
I want to forward propagate/fill the True values n number of times. e.g. 2 times:
True
True
True
False
False
True
True
True
False
ffill does something similar for NaN values, but I can't find anything for a specific value like this. Is the easiest way just a standard loop that iterates over the rows and modifies the column in question with a counter?
Each row is an equidistant time-series entry.
EDIT:
The current answers all solve my specific problem with a bool column, but one answer can be modified to be more general purpose:
>>> s = pd.Series([1, 2, 3, 4, 5, 1, 2, 3])
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
>>> condition_mask = s == 2
>>> s.mask(~(condition_mask)).ffill(limit=2).fillna(s).astype(int)
0 1
1 2
2 2
3 2
4 5
5 1
6 2
7 2
You can still use ffill, but first you have to mask the False values:
s.mask(~s).ffill(limit=2).fillna(s)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
8 False
Name: 0, dtype: bool
For 2 times you could have:
s = s | s.shift(1) | s.shift(2)
You could generalize to n-times from there.
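A sketch of that generalization for an arbitrary n, assuming a plain boolean Series s and a pandas version where shift supports fill_value (0.24+):
from functools import reduce

n = 2
propagated = reduce(lambda acc, k: acc | s.shift(k, fill_value=False), range(1, n + 1), s)
Each shifted copy pushes the True values one more step down before being OR-ed in.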
Try with rolling
n = 3
s.rolling(n, min_periods=1).max().astype(bool)
Out[147]:
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
8 False
Name: s, dtype: bool

Select rows in Pandas which does not contain a specific character

I need something similar to
.str.startswith()
.str.endswith()
but for the middle part of a string.
For example, given the following pd.DataFrame
str_name
0 aaabaa
1 aabbcb
2 baabba
3 aacbba
4 baccaa
5 ababaa
I need to drop rows 1, 3 and 4, which contain (at least one) letter 'c'.
The position of the specific letter ('c') is not known.
The task is to remove all rows which contain at least one occurrence of a specific letter.
You want df['string_column'].str.contains('c')
>>> df
str_name
0 aaabaa
1 aabbcb
2 baabba
3 aacbba
4 baccaa
5 ababaa
>>> df['str_name'].str.contains('c')
0 False
1 True
2 False
3 True
4 True
5 False
Name: str_name, dtype: bool
Now, you can "delete" like this
>>> df = df[~df['str_name'].str.contains('c')]
>>> df
str_name
0 aaabaa
2 baabba
5 ababaa
>>>
Edited to add:
If you only want to check the first k characters, you can slice. Suppose k=3:
>>> df.str_name.str.slice(0,3)
0 aaa
1 aab
2 baa
3 aac
4 bac
5 aba
Name: str_name, dtype: object
>>> df.str_name.str.slice(0,3).str.contains('c')
0 False
1 False
2 False
3 True
4 True
5 False
Name: str_name, dtype: bool
Note that Series.str.slice takes explicit start and stop arguments rather than Python's [start:stop] slice syntax.
You can use NumPy:
df[np.core.chararray.find(df.str_name.values.astype(str), 'c') < 0]
str_name
0 aaabaa
2 baabba
5 ababaa
You can use str.contains()
str_name = pd.Series(['aaabaa', 'aabbcb', 'baabba', 'aacbba', 'baccaa','ababaa'])
str_name.str.contains('c')
This will return a boolean Series.
The following will return the inverse of the above:
~str_name.str.contains('c')
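So the whole filter can be done in one step; and if the character should be matched literally rather than as a regular expression, str.contains also accepts regex=False:
str_name[~str_name.str.contains('c', regex=False)]
0 aaabaa
2 baabba
5 ababaa
dtype: object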

Pandas transform() vs apply()

I don't understand why apply and transform return different dtypes when called on the same data frame. The way I explained the two functions to myself before went something along the lines of "apply collapses the data, and transform does exactly the same thing as apply but preserves the original index and doesn't collapse." Consider the following.
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})
Let's identify those ids which have a nonzero entry in the cat column.
>>> df.groupby('id')['cat'].apply(lambda x: (x == 1).any())
id
1 True
2 True
3 False
4 True
Name: cat, dtype: bool
Great. If we wanted to create an indicator column, however, we could do the following.
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
I don't understand why the dtype is now int64 instead of the boolean returned by the any() function.
When I change the original data frame to contain some booleans (note that the zeros remain), the transform approach returns booleans in an object column. This is an extra mystery to me since all of the values are boolean, but it's listed as object apparently to match the dtype of the original mixed-type column of integers and booleans.
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [True, True, 0, 0, True, 0, 0, 0, 0, True]})
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
Name: cat, dtype: object
However, when I use all booleans, the transform function returns a boolean column.
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [True, True, False, False, True, False, False, False, False, True]})
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
Name: cat, dtype: bool
Using my acute pattern-recognition skills, it appears that the dtype of the resulting column mirrors that of the original column. I would appreciate any hints about why this occurs or what's going on under the hood in the transform function. Cheers.
It looks like SeriesGroupBy.transform() tries to cast the result dtype to the same one as the original column has, but DataFrameGroupBy.transform() doesn't seem to do that:
In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
# note the double brackets below: [['cat']] selects a one-column DataFrame, so DataFrameGroupBy.transform is used
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
cat
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
In [141]: df.dtypes
Out[141]:
cat int64
id int64
dtype: object
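If you want the boolean dtype back regardless of the original column's dtype, one option (a sketch) is to cast after the transform:
df.groupby('id')['cat'].transform(lambda x: (x == 1).any()).astype(bool)
which gives a bool Series aligned with the original index.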
Just adding another illustrative example with sum, as I find it more explicit:
import numpy as np

df = (
    pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
    .assign(a=lambda df: df.a > 0.5)
)
Out[70]:
a b c
0 False 0.126448 0.487302
1 False 0.615451 0.735246
2 False 0.314604 0.585689
3 False 0.442784 0.626908
4 False 0.706729 0.508398
5 False 0.847688 0.300392
6 False 0.596089 0.414652
7 False 0.039695 0.965996
8 True 0.489024 0.161974
9 False 0.928978 0.332414
df.groupby('a').apply(sum) # drop rows
a b c
a
False 0.0 4.618465 4.956997
True 1.0 0.489024 0.161974
df.groupby('a').transform(sum) # keep dims
b c
0 4.618465 4.956997
1 4.618465 4.956997
2 4.618465 4.956997
3 4.618465 4.956997
4 4.618465 4.956997
5 4.618465 4.956997
6 4.618465 4.956997
7 4.618465 4.956997
8 0.489024 0.161974
9 4.618465 4.956997
However, when applied to a pd.DataFrame rather than a GroupBy object, I was not able to see any difference.

Creating NaN values in Pandas (instead of Numpy)

I'm converting a .ods spreadsheet to a Pandas DataFrame. I have whole columns and rows I'd like to drop because they contain only "None". As "None" is a str, I have:
pandas.DataFrame.replace("None", numpy.nan)
...on which I call: .dropna(how='all')
Is there a pandas equivalent to numpy.nan?
Is there a way to use .dropna() with the string "None" rather than NaN?
You can use float('nan') if you really want to avoid importing things from the numpy namespace:
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> s[1] = float('nan')
>>> s
0 1.0
1 NaN
2 3.0
dtype: float64
>>>
>>> s.dropna()
0 1.0
2 3.0
dtype: float64
Moreover, if you have a string value "None", you can .replace("None", float("nan")):
>>> s[1] = "None"
>>> s
0 1
1 None
2 3
dtype: object
>>>
>>> s.replace("None", float("nan"))
0 1.0
1 NaN
2 3.0
dtype: float64
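If you would rather stay entirely within pandas, recent versions (1.0+) also provide pandas.NA as a missing-value marker, which dropna recognizes too; for example, on the same Series:
>>> s.replace("None", pd.NA).dropna()
0 1
2 3
dtype: object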
If you are trying to directly drop the rows containing a "None" string value (without converting those "None" cells to NaN first), I guess it can be done without using replace + dropna.
Considering a DataFrame like :
In [3]: df = pd.DataFrame({
   ...:     "foo": [1, 2, 3, 4],
   ...:     "bar": ["None", 5, 5, 6],
   ...:     "baz": [8, "None", 9, 10]
   ...: })
In [4]: df
Out[4]:
bar baz foo
0 None 8 1
1 5 None 2
2 5 9 3
3 6 10 4
Using replace and dropna will return
In [5]: df.replace('None', float("nan")).dropna()
Out[5]:
bar baz foo
2 5.0 9.0 3
3 6.0 10.0 4
Which can also be obtained by simply selecting the rows you need:
In [7]: df[df.eval("foo != 'None' and bar != 'None' and baz != 'None'")]
Out[7]:
bar baz foo
2 5 9 3
3 6 10 4
You can also use the drop method of your dataframe, selecting the targeted labels appropriately:
In [9]: df.drop(df[(df.baz == "None") |
(df.bar == "None") |
(df.foo == "None")].index)
Out[9]:
bar baz foo
2 5 9 3
3 6 10 4
These two methods are more or less interchangeable, as you can also do, for example:
df[(df.baz != "None") & (df.bar != "None") & (df.foo != "None")]
(But I guess the comparison df.some_column == "some string" only works if the column dtype allows it. Unlike with eval, before these last two examples I had to do df = df.astype(object), since the foo column was of dtype int64.)
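A more generic sketch of the same idea drops any row that contains the string "None" in any column, without listing the columns by hand:
df[~df.isin(["None"]).any(axis=1)]
bar baz foo
2 5 9 3
3 6 10 4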
