How do I check a Pandas column for "any" row that matches a condition? (in my case, I want to test for type string).
Background: I was using df.columnName.dtype.kind == 'O' to check for strings. But then I encountered columns holding Decimal values, whose dtype is also 'O'. So I am looking for a different way to check, and what I have come up with is:
display(df.col1.apply(lambda x: isinstance(x,str)).any()) #true
But the above code causes isinstance to be evaluated on every row, which seems inefficient if I have a very large number of rows. How can I implement the check so that it stops evaluating as soon as the first True value is encountered?
Here is a more complete example:
from decimal import Decimal
import pandas as pd
data = {
'c1': [None,'a','b'],
'c2': [None,1,2],
'c3': [None,Decimal(1),Decimal(2)]
}
dx = pd.DataFrame(data)
print(dx) #displays the dataframe
print('dx.dtypes')
print(dx.dtypes) #displays the datatypes in the dataframe
print('dx.c1.dtype:',dx.c1.dtype) #'O'
print('dx.c2.dtype:',dx.c2.dtype) #'float64'
print('dx.c3.dtype:',dx.c3.dtype) #'O'!
print('dx.c1.apply(lambda x: isinstance(x,str)).any()')
print(dx.c1.apply(lambda x: isinstance(x,str)).any())#true
print('dx.c2.apply(lambda x: isinstance(x,str)).any()')
print(dx.c2.apply(lambda x: isinstance(x,str)).any())#false
#the following line shows that the apply function applies it to every row
print('dx.c1.apply(lambda x: isinstance(x,str))')
print(dx.c1.apply(lambda x: isinstance(x,str))) #False,True,True
#and only after that is the any function applied
print('dx.c1.apply(lambda x: isinstance(x,str)).any()')
print(dx.c1.apply(lambda x: isinstance(x,str)).any())#true
The above code outputs:
c1 c2 c3
0 None NaN None
1 a 1.0 1
2 b 2.0 2
dx.dtypes
c1 object
c2 float64
c3 object
dtype: object
dx.c1.dtype: object
dx.c2.dtype: float64
dx.c3.dtype: object
dx.c1.apply(lambda x: isinstance(x,str)).any()
True
dx.c2.apply(lambda x: isinstance(x,str)).any()
False
dx.c1.apply(lambda x: isinstance(x,str))
0 False
1 True
2 True
Name: c1, dtype: bool
dx.c1.apply(lambda x: isinstance(x,str)).any()
True
Is there a better way?
More detail: I am trying to fix this line, which breaks when the column has "decimal" values: https://github.com/capitalone/datacompy/blob/8a74e60d26990e3e05d5b15eb6fb82fef62f4776/datacompy/core.py#L273
Copying my comment as an answer:
It seems what you needed was the built-in function any:
any(isinstance(x,str) for x in df['col1'])
That way, rows are only evaluated until the first string is found: the generator short-circuits instead of scanning the whole column.
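For illustration, a minimal sketch reusing the dx frame from the question (the helper name has_any_string is just an illustrative choice):

from decimal import Decimal
import pandas as pd

dx = pd.DataFrame({
    'c1': [None, 'a', 'b'],
    'c2': [None, 1, 2],
    'c3': [None, Decimal(1), Decimal(2)],
})

def has_any_string(series):
    # any() consumes the generator lazily: iteration stops at the first True
    return any(isinstance(x, str) for x in series)

print(has_any_string(dx.c1))  # True  (stops after the first string is seen)
print(has_any_string(dx.c2))  # False (has to scan every row)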
Related
I have a DataFrame like this:
x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data = x, columns = ['id'])
id
0 3.13.1.7-2.1
1 3.21.1.8-2.2
2 4.20.1.6-2.1
3 4.8.1.2-2.0
4 5.23.1.10-2.2
I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional column that is a boolean value (in the above example, index 0 would be TRUE and all others would be FALSE). But I could live with multiple additional columns, where one or more contain individual string parts, and one is for said boolean value.
I first tried to just split the string into parts:
df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))
This worked, however if I try to isolate only the second part of the string like this...
df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])
...I get an error that the list index is out of range.
Yet, if I check any individual index in the DataFrame like this...
df['id_split'][0][1]
...this works, producing only the second item in the list of strings.
I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!
Let's use str.split to get the parts, then you can compare:
parts = df['id'].str.split(r'\.', expand=True)
(parts[[1, 2]] == ['13', '1']).all(axis=1)
Output:
0 True
1 False
2 False
3 False
4 False
dtype: bool
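For completeness, a self-contained version of the above that attaches the result as a new column (the column name 'match' is just an illustrative choice):

import pandas as pd

x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data=x, columns=['id'])

# expand=True spreads the dot-separated parts over separate columns
parts = df['id'].str.split(r'\.', expand=True)
df['match'] = (parts[[1, 2]] == ['13', '1']).all(axis=1)
print(df)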
You can do something like this:
df['flag'] = df['id'].apply(lambda x: x.split('.')[1] == '13' and x.split('.')[2] == '1')
Output
id flag
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
You can do it directly, like below:
df['new'] = df['id'].apply(lambda x: str(x).split('.')[1] == '13' and str(x).split('.')[2] == '1')
>>> print(df)
id new
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
I need to adapt an existing function that essentially performs a Series.str.contains and returns the resulting Series, so that it can handle a SeriesGroupBy as input.
As suggested by the pandas error message
Cannot access attribute 'str' of 'SeriesGroupBy' objects, try using the 'apply' method
I have tried to use apply() on the SeriesGroupBy object, which works in a way, but results in a Series object. I would now like to apply the same grouping as before, to this Series.
Original function
def contains(series, expression):
    return series.str.contains(expression)
My attempt so far
>>> import pandas as pd
... from functools import partial
...
... def _f(series, expression):
...     return series.str.contains(expression)
...
... def contains(grouped_series, expression):
...     result = grouped_series.apply(partial(_f, expression=expression))
...     return result
>>> df = pd.DataFrame(zip([1,1,2,2], ['abc', 'def', 'abq', 'bcq']), columns=['group', 'text'])
>>> gdf = df.groupby('group')
>>> gs = gdf['text']
>>> type(gs)
<class 'pandas.core.groupby.generic.SeriesGroupBy'>
>>> r = contains(gdf['text'], 'b')
>>> r
0 True
1 False
2 True
3 True
Name: text, dtype: bool
>>> type(r)
<class 'pandas.core.series.Series'>
The desired result would be a boolean Series grouped by the same indices as the original grouped_series.
The actual result is a Series object without any grouping.
EDIT / CLARIFICATION:
The initial answers make me think I didn't stress the core of the problem enough. For the sake of the question, let's assume I cannot change anything outside of the contains(grouped_series, expression) function.
I think I know how to solve my problem if I approach it from another angle, and if I don't that would then become another question. The real world context makes it very complicated to change code outside of that one function. So I would really appreciate suggestions that work within that constraint.
So, let me rephrase the question as follows:
I'm looking for a function contains(grouped_series, expression), so that the following code works:
>>> df = pd.DataFrame(zip([1,1,2,2], ['abc', 'def', 'abq', 'bcq']), columns=['group', 'text'])
>>> grouped_series = contains(df.groupby('group')['text'], 'b')
>>> grouped_series.sum()
group
1 1.0
2 2.0
Name: text, dtype: float64
groupby is not needed unless you want to do something with the "group" -- like calculating its sum or checking whether all rows in the group contain the letter b. When you call apply on a GroupBy object, you can pass additional arguments to the applied function as keywords:
def contains(frame, expression):
    return frame['text'].str.contains(expression).all()
df.groupby('group').apply(contains, expression='b')
Result:
group
1 False
2 True
dtype: bool
I like to think that the first parameter to the function being applied (frame) is a smaller view of the original dataframe, being chopped up by the groupby clause.
That said, apply is pretty slow compared to specialized aggregate functions like min, max or sum. Use those as much as possible and save apply for complex cases; a vectorized sketch follows below.
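A minimal sketch of that vectorized route, on the same toy frame as in the question: compute the boolean mask once with the fast .str accessor, then aggregate with built-in reductions:

import pandas as pd

df = pd.DataFrame(zip([1, 1, 2, 2], ['abc', 'def', 'abq', 'bcq']),
                  columns=['group', 'text'])

mask = df['text'].str.contains('b')     # vectorized, no per-group Python calls
print(mask.groupby(df['group']).sum())  # per-group count of matching rows
print(mask.groupby(df['group']).all())  # per-group check that all rows contain 'b'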
Following the advice of the error message, you could use apply:
df.groupby('group').apply(lambda x : x.text.str.contains('b'))
Out[10]:
group
1 0 True
1 False
2 2 True
3 True
Name: text, dtype: bool
If you want to put these indices into your data set and return a DataFrame, use reset_index:
df.groupby('group').apply(lambda x : x.text.str.contains('b')).reset_index()
Out[11]:
group level_1 text
0 1 0 True
1 1 1 False
2 2 2 True
3 2 3 True
_f has absolutely no relationship to the groups. The way to deal with this is to define a column prior to grouping (not a separate function), then group. Now that column (called 'to_sum') is part of your SeriesGroupBy object.
df.assign(to_sum = _f(df['text'], 'b')).groupby('group').to_sum.sum()
#group
#1 1.0
#2 2.0
#Name: to_sum, dtype: float64
If you don't need the entire DataFrame for your subsequent operations, you can sum the Series returned by _f using df to group (as they will share the same index)
_f(df['text'], 'b').groupby(df['group']).sum()
You can just do this; there is no need for group-by when computing the contains mask:
df['eval'] = df['text'].str.contains('b')
eval is the name of the column to add; you can name it whatever you like.
df.groupby('group')['eval'].sum()
Run this after the first line. The result is:
group
1 1.0
2 2.0
I'm trying to find a substring in a frozenset, but I'm a bit out of options.
My data structure is a pandas.DataFrame (it comes from association_rules in the mlxtend package, if you are familiar with that one) and I want to print all the rows where the antecedents (which is a frozenset) include a specific string.
What I tried:
print(rules[rules["antecedents"].str.contains('line', regex=False)])
However whenever I run it, I get an Empty Dataframe.
When I try running only the inner function on my series of rules["antecedents"], I get only False values for all entries. But why is that?
Because the Series.str.* functions are for string data only. Since your data is not strings, the result is always NaN, regardless of the value's string representation. To prove it:
>>> import numpy as np
>>> x = pd.DataFrame(np.random.randn(2, 5)).astype("object")
>>> x
0 1 2 3 4
0 -1.17191 -1.92926 -0.831576 -0.0814279 0.099612
1 -1.55183 -0.494855 1.14398 -1.72675 -0.0390948
>>> x[0].str.contains("-1")
0 NaN
1 NaN
Name: 0, dtype: float64
What can you do:
Use apply:
>>> x[0].apply(lambda x: "-1" in str(x))
0 True
1 True
Name: 0, dtype: bool
So your code should read:
print(rules[rules["antecedents"].apply(lambda x: 'line' in str(x))])
You might want to use 'line' in x instead, if you mean an exact match against the set's elements.
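To illustrate the distinction, a small sketch with a toy frozenset column (the data is invented for illustration):

import pandas as pd

rules = pd.DataFrame({'antecedents': [frozenset({'line item', 'tax'}),
                                      frozenset({'discount'})]})

# Substring test against the printed representation of the whole set:
print(rules['antecedents'].apply(lambda x: 'line' in str(x)))  # True, False

# Exact membership test on the set's elements:
print(rules['antecedents'].apply(lambda x: 'line' in x))  # False, False

# Substring test against each element individually:
print(rules['antecedents'].apply(lambda x: any('line' in e for e in x)))  # True, False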
I am struggling to understand how df.apply() exactly works.
My problem is as follows: I have a dataframe df. Now I want to search several columns for certain strings. If the string is found in any of the columns, I want to add a "label" (in a new column) for each row where it is found.
I am able to solve the problem with map and applymap (see below).
However, I would expect that the better solution would be to use apply, as it applies a function to an entire column.
Question: Is this not possible using apply? Where is my mistake?
Here are my solutions for using map and applymap.
df = pd.DataFrame([list("ABCDZ"),list("EAGHY"), list("IJKLA")], columns = ["h1","h2","h3","h4", "h5"])
Solution using map
def setlabel_func(column):
    return df[column].str.contains("A")
mask = sum(map(setlabel_func, ["h1","h5"]))
df.loc[mask == 1, "New Column"] = "Label"
Solution using applymap
mask = df[["h1","h5"]].applymap(lambda el: True if re.match("A",el) else False).T.any()
df.ix[mask == True, "New Column"] = "Label"
For apply I don't know how to pass the two columns into the function, or maybe I don't understand the mechanics at all ;-)
def setlabel_func(column):
    return df[column].str.contains("A")
df.apply(setlabel_func(["h1","h5"]),axis = 1)
The above gives me the error:
'DataFrame' object has no attribute 'str'
Any advice? Please note that the search function in my real application is more complex and requires a regex, which is why I use .str.contains in the first place.
Another solution is to use DataFrame.any to get at least one True per row:
print (df[['h1', 'h5']].apply(lambda x: x.str.contains('A')))
h1 h5
0 True False
1 False False
2 False True
print (df[['h1', 'h5']].apply(lambda x: x.str.contains('A')).any(axis=1))
0 True
1 False
2 True
dtype: bool
df['new'] = np.where(df[['h1','h5']].apply(lambda x: x.str.contains('A')).any(axis=1),
                     'Label', '')
print (df)
h1 h2 h3 h4 h5 new
0 A B C D Z Label
1 E A G H Y
2 I J K L A Label
mask = df[['h1', 'h5']].apply(lambda x: x.str.contains('A')).any(axis=1)
df.loc[mask, 'New'] = 'Label'
print (df)
h1 h2 h3 h4 h5 New
0 A B C D Z Label
1 E A G H Y NaN
2 I J K L A Label
pd.DataFrame.apply iterates over each column, passing the column as a pd.Series to the function being applied. In your case, the function you're trying to apply doesn't lend itself to being used in apply.
Do this instead to get your idea to work
mask = df[['h1', 'h5']].apply(lambda x: x.str.contains('A').any(), axis=1)
df.loc[mask, 'New Column'] = 'Label'
h1 h2 h3 h4 h5 New Column
0 A B C D Z Label
1 E A G H Y NaN
2 I J K L A Label
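Since the question notes the real search needs a regex via .str.contains, a fully vectorized sketch (no apply at all) that ORs the per-column masks is also possible:

mask = df['h1'].str.contains('A') | df['h5'].str.contains('A')
df.loc[mask, 'New Column'] = 'Label'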
IIUC you can do it this way:
In [23]: df['new'] = np.where(df[['h1','h5']].apply(lambda x: x.str.contains('A'))
    ...:                        .sum(axis=1) > 0,
    ...:                      'Label', '')
In [24]: df
Out[24]:
h1 h2 h3 h4 h5 new
0 A B C D Z Label
1 E A G H Y
2 I J K L A Label
Others have given good alternative methods. Here is a way to use apply 'row wise' (axis=1) to get your new column indicating presence of "A" for a bunch of columns.
The function receives a row, so you can just join the row's strings together into one big string and then use a substring test ("in"); see below. Here I am combining all columns, but you can do it with just h1 and h5 easily.
df = pd.DataFrame([list("ABCDZ"),list("EAGHY"), list("IJKLA")], columns = ["h1","h2","h3","h4", "h5"])
def dothat(row):
    sep = ""
    return "A" in sep.join(row['h1':'h5'])
df['NewColumn'] = df.apply(dothat,axis=1)
This just squashes each row into one string (e.g. ABCDZ) and looks for "A". It is not that efficient, though: if you want to quit the first time you find the string, combining all the columns could be a waste of time. You could easily change the function to look column by column and return True as soon as it finds a hit, as sketched below.
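A sketch of that early-exit variant (dothat_lazy is a hypothetical name; the columns are restricted to h1 and h5 as in the question):

def dothat_lazy(row, columns=('h1', 'h5')):
    # Check one column at a time and stop at the first hit
    for col in columns:
        if 'A' in row[col]:
            return True
    return False

df['NewColumn'] = df.apply(dothat_lazy, axis=1)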
I am in the process of cleaning a dataframe, and I want to check if any values from a list of words appear in the dataframe. If one is present, the value should be replaced by an NA value. For example,
My dataframe looks like this:
p['title']
1 Forest
2 [VIDEO_TITLE]
3 [VIDEO_TITLE]
4 [VIDEO_TITLE]
5 [${title}url=${videourl}]
p.dtypes
title object
dtype: object
and
c = ('${title}', '[VIDEO_TITLE]')
Since rows 2, 3, 4 and 5 contain the words in c, I want them replaced by NA values.
I'm trying the following,
p['title'].replace('|'.join(c), np.nan, regex=True).fillna('NA')
This one runs without error, but I am getting the same input as output. There are no changes at all.
My next try is,
p['title'].apply(lambda x: 'NA' if any(s in x for s in c) else x)
which is throwing an error,
TypeError: argument of type 'float' is not iterable
I am trying several other things without much success. I am not sure what mistake I am making.
My ideal output would be,
p['title']
1 Forest
2 NA
3 NA
4 NA
5 NA
Can anybody help me in solving this?
You can use loc to set them to 'NA'. Since your values are sometimes inside a list, they first need to be extracted. The second line pulls the first string out of the list, if the value is a list. The third line checks for a match.
c = ('${title}', 'VIDEO_TITLE')
string_check = p['title'].map(lambda x: x if not isinstance(x, list) else x[0])
string_check = string_check.map(lambda s: any(c_str in s for c_str in c))
p.loc[string_check, 'title'] = 'NA'
Depending on what you're doing, you may want to consider setting the values to numpy.nan instead of the string 'NA'. This is the usual way pandas handles null values and there's a lot of functionality already built around this.
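A side note on the original replace() attempt, assuming the cells are plain strings (as noted above, some may be lists, in which case replace() cannot see them): the pattern '|'.join(c) contains regex metacharacters ($, {, [), so the strings need escaping before they can match literally:

import re
import numpy as np

pattern = '|'.join(re.escape(s) for s in c)
p['title'] = p['title'].replace(pattern, np.nan, regex=True)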
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'A' : ('a','b','c', 'd', 'a', 'b', 'c')})
>>> restricted = ['a', 'b', 'c']
>>> df[df['A'].isin(restricted)] = np.nan
>>> df
A
0 NaN
1 NaN
2 NaN
3 d
4 NaN
5 NaN