Check if a cell contains a string from another cell (pandas)

I want to create a column "Held?" in my pandas dataframe that flags whenever the character in the "DrpType" cell is contained in the "HeldDrpTypes" cell of the same row.
I've tried using where and in but it didn't work:
df['Held?'] = where(df['DrpType'] in df['HeldDrpTypes'] == True),'Yes','No')
This is what I want to accomplish:
print(df)
  DrpType HeldDrpTypes Held?
0       A            B    No
1       B           BC   Yes
2       C            B    No
3       B           BC   Yes
4       A           BC    No
5       C           BC   Yes
Any ideas how I can go about this?

For a pure pandas way, you can use df.apply()
import pandas as pd
df = pd.DataFrame({
    'DrpType': ['A', 'B', 'C', 'B', 'A', 'C'],
    'HeldDrpTypes': ['B', 'BC', 'B', 'BC', 'BC', 'BC']
})
df['Held?'] = df.apply(lambda row: row['DrpType'] in row['HeldDrpTypes'], axis=1)
print(df)
#   DrpType HeldDrpTypes  Held?
# 0       A            B  False
# 1       B           BC   True
# 2       C            B  False
# 3       B           BC   True
# 4       A           BC  False
# 5       C           BC   True
If you are a stickler for Yes/No rather than True/False, you can use the following, but I'd suggest sticking with the binary True/False to make it easier to check for truthiness rather than having to parse a string.
df['Held?'] = df.apply(
    lambda row: 'Yes' if row['DrpType'] in row['HeldDrpTypes'] else 'No', axis='columns')
print(df)
#   DrpType HeldDrpTypes Held?
# 0       A            B    No
# 1       B           BC   Yes
# 2       C            B    No
# 3       B           BC   Yes
# 4       A           BC    No
# 5       C           BC   Yes

I was curious about the timings of both answers, so I tested them out using a bigger dataframe:
df = pd.concat([df]*100000, ignore_index=True)
print(df.shape)
(600000, 2)
Timings:
Wen-Ben's answer with a list comprehension:
%%timeit
df['Held'] = ['Yes' if x in y else 'No' for x , y in zip(df.DrpType,df.HeldDrpTypes)]
Gives the following:
304 ms ± 17.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Brian Cohan's answer using .apply:
%%timeit
df['Held?'] = df.apply(
    lambda row: 'Yes' if row['DrpType'] in row['HeldDrpTypes'] else 'No', axis='columns')
Gives the following:
23.2 s ± 1.23 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the list comprehension is roughly 75x faster than the .apply approach here, a difference of nearly two orders of magnitude.

You may check
l=['Yes' if x in y else 'No' for x , y in zip(df.DrpType,df.HeldDrpTypes)]
l
Out[196]: ['No', 'Yes', 'No', 'Yes', 'No', 'Yes']
df['Held']=l
Or we can use a method from numpy:
np.core.chararray.find(df.HeldDrpTypes.values.astype(str),df.DrpType.values.astype(str))!=-1
Out[201]: array([False, True, False, True, False, True])
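To turn that boolean mask into the Yes/No column, a minimal follow-up sketch (my own addition, assuming numpy is imported as np and reusing the call from the snippet above):
import numpy as np
mask = np.core.chararray.find(df.HeldDrpTypes.values.astype(str),
                              df.DrpType.values.astype(str)) != -1
# np.where maps the boolean mask to string labels in a single vectorized step
df['Held'] = np.where(mask, 'Yes', 'No')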

Why is getting the reverse of an index in pandas so slow?

I have a pandas dataframe that I'm using to store network data; it looks like:
from_id, to_id, count
X, Y, 3
Z, Y, 4
Y, X, 2
...
I am trying to add a new column, inverse_count, which gets the count value for the row where the from_id and to_id are reversed from the current row.
I'm taking the following approach. I thought that it would be fast but it is much slower than I anticipated, and I can't figure out why.
def get_inverse_val(x):
    # Takes the inverse of the index for a given row
    # When passed to apply with axis=1, the index becomes the name
    try:
        return df.loc[(x.name[1], x.name[0]), 'count']
    except KeyError:
        return 0
df = df.set_index(['from_id', 'to_id'])
df['inverse_count'] = df.apply(get_inverse_val, axis = 1)
Why not do a simple merge for this?
df = pd.DataFrame({'from_id': ['X', 'Z', 'Y'], 'to_id': ['Y', 'Y', 'X'], 'count': [3, 4, 2]})
pd.merge(
    left=df,
    right=df,
    how='left',
    left_on=['from_id', 'to_id'],
    right_on=['to_id', 'from_id']
)
  from_id_x to_id_x  count_x from_id_y to_id_y  count_y
0         X       Y        3         Y       X      2.0
1         Z       Y        4       NaN     NaN      NaN
2         Y       X        2         X       Y      3.0
Here we merge from (from, to) -> (to, from) to get the reversed matching pairs. In general, you should avoid using apply() as it's slow. (To understand why, realize that it is not a vectorized operation.)
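As a rough follow-up sketch (my own addition, not part of the original answer), you can keep only the reversed count from the merge result and attach it to the frame; the suffixes argument controls the names of the overlapping columns:
merged = pd.merge(
    left=df,
    right=df,
    how='left',
    left_on=['from_id', 'to_id'],
    right_on=['to_id', 'from_id'],
    suffixes=('', '_rev'))
# count_rev holds the count of the reversed (to_id, from_id) pair; NaN means no reverse edge exists
df['inverse_count'] = merged['count_rev'].fillna(0).astype(int).values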
You can use .set_index twice to create two dataframes with opposite index orders and assign to create your inverse_count column.
df = (df.set_index(['from_id', 'to_id'])
        .assign(inverse_count=df.set_index(['to_id', 'from_id'])['count'])
        .reset_index())
  from_id to_id  count  inverse_count
0       X     Y      3            2.0
1       Z     Y      4            NaN
2       Y     X      2            3.0
Since the question was regarding speed let's look at performance on a larger dataset:
Setup:
import pandas as pd
import string
import itertools
df = pd.DataFrame(list(itertools.permutations(string.ascii_uppercase, 2)), columns=['from_id', 'to_id'])
df['count'] = df.index % 25 + 1
print(df)
from_id to_id count
0 A B 1
1 A C 2
2 A D 3
3 A E 4
4 A F 5
.. ... ... ...
645 Z U 21
646 Z V 22
647 Z W 23
648 Z X 24
649 Z Y 25
Set_index:
%timeit (df.set_index(['from_id', 'to_id'])
           .assign(inverse_count=df.set_index(['to_id', 'from_id'])['count'])
           .reset_index())
6 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Merge (from Ben's answer):
%timeit pd.merge(
    left=df,
    right=df,
    how='left',
    left_on=['from_id', 'to_id'],
    right_on=['to_id', 'from_id'])
1.73 ms ± 57.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, it looks like the merge approach is the faster option.

Efficient (& perhaps shortest) way of getting a list of column names that satisfy a condition, for each row

I have a dataframe with 5 columns (first column is ID and 4 are country names)
I want, for each row, a list of the column names (countries) that satisfy a particular condition on that respective column.
df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
In the above case, for each ID, I need the columns where the rating is 4 or above.
I'm expecting the output to be, for each ID, the list of companies (columns) where the rating was 4 or above. It can be a dataframe or a dict.
highest_rated_companies = { 'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2'] }
You could try something like this with to_records(), which seems to be the fastest option, as the timings below show:
First Option
import pandas as pd
import numpy as np
data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
df = pd.DataFrame(data)
print(df)
highest_rated_companies = {row[1]: [df.columns[idx] for idx, val in enumerate(list(row)[2:], 1) if val >= 4]
                           for row in df.to_records()}
Second Option
import pandas as pd
data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
df = pd.DataFrame(data)
print(df)
highest_rated_companies = {row[0]: [df.columns[idx] for idx, val in enumerate(row[1:], 1) if val >= 4]
                           for i, row in df.iterrows()}
print(highest_rated_companies)
Both outputs:
df:
   id  c1  c2  c3  c4
0  i1   3   4   4   5
1  i2   2   2   5   1
2  i3   4   5   3   2
3  i4   1   5   3   2
4  i5   4   5   3   2
highest_rated_companies:
{'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2']}
Timings:
First Option:
0.0113047 seconds best case, over 100 executions of the script
1.2424292 seconds best case, over 10000 executions of the script
Second Option:
0.0729236 seconds best case, over 100 executions of the script
7.8219047 seconds best case, over 10000 executions of the script
Edit:
Using df.to_records() seems to be the fastest way. I also tested Ehsan's answer and got 50.64 seconds over 10000 executions of the script and 0.54 seconds over 100 executions. Even though it's faster than the Second Option, the First Option remains the fastest.
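A minimal sketch of how such timings could be reproduced with the standard library (the exact harness is an assumption on my part; the original answer does not show it):
import timeit

setup = """
import pandas as pd
data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
df = pd.DataFrame(data)
"""
stmt = "{row[1]: [df.columns[idx] for idx, val in enumerate(list(row)[2:], 1) if val >= 4] for row in df.to_records()}"
# best case over 5 repeats, each running the comprehension 100 times
print(min(timeit.repeat(stmt, setup=setup, repeat=5, number=100)))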
You can do this:
import numpy as np

df = pd.DataFrame(df)
keys, values = np.where(df[['c1', 'c2', 'c3', 'c4']].ge(4))
highest_rated_companies = pd.DataFrame({'id': df.iloc[keys].id, 'c': df.columns[values + 1]})
output:
id c
0 i1 c2
0 i1 c3
0 i1 c4
1 i2 c3
2 i3 c1
2 i3 c2
3 i4 c2
4 i5 c1
4 i5 c2
You can easily convert it to a dictionary if you prefer.
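A possible sketch of that conversion (my own addition, not part of the original answer), grouping the long-format result back into one list per id:
# collect the column names back into one list per id
highest_rated_dict = highest_rated_companies.groupby('id')['c'].apply(list).to_dict()
print(highest_rated_dict)
# {'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2']}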
Another option is to use the to_dict method. If you set your id column as the index, you can do:
df = df[df>=4]
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}
The last line is to convert the dictionary into the desired format. Time test:
In[0]:
import pandas as pd
df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}
df = pd.DataFrame(df)
df = df.set_index('id',drop=True)
df = df[df>=4]
%%timeit -n 1000
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}
Out[0]
243 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This isn't as fast as what @MrNobody33 answered, though:
135 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Pandas count NAs with a groupby for all columns [duplicate]

This question already has answers here:
Pandas count null values in a groupby function (3 answers)
Groupby class and count missing values in features (5 answers)
Closed 3 years ago.
This question shows how to count NAs in a dataframe for a particular column C. How do I count NAs for all columns (that aren't the groupby column)?
Here is some test code that doesn't work:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})
# result = df.groupby('a').isna().sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
# result = df.groupby('a').transform('isna').sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
result = df.isna().groupby('a').sum()
print(result)
# result:
# b c
# a
# False 2.0 1.0
result = df.groupby('a').apply(lambda _df: df.isna().sum())
print(result)
# result:
# a b c
# a
# 1 0 2 1
# 2 0 2 1
Desired output:
b c
a
1 1 1
2 1 0
It's always best to avoid groupby.apply in favor of the basic functions which are cythonized, as this scales better with many groups. This will lead to a great increase in performance. In this case first check isnull() on the entire DataFrame then groupby + sum.
df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
# b c
#a
#1 1 1
#2 1 0
To illustrate the performance gain:
import pandas as pd
import numpy as np
N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
                   'b': np.random.choice([1, np.nan], N),
                   'c': np.random.choice([1, np.nan], N)})
%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Your question has the answer (you typed df where you meant _df):
result = df.groupby('a')[['b', 'c']].apply(lambda _df: _df.isna().sum())
result
b c
a
1 1 1
2 1 0
Using apply with isna and sum. Plus we select the correct columns, so we don't get the unnecessary a column:
Note: apply can be slow, so it's recommended to use one of the vectorized solutions; see the answers of WenYoBen, Anky or ALollz.
df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
Output
b c
a
1 1 1
2 1 0
Another way would be to set_index() on a, then groupby on the index and sum:
df.set_index('a').isna().groupby(level=0).sum()*1
Or:
df.set_index('a').isna().groupby(level=0).sum().astype(int)
Or without groupby, courtesy of @WenYoBen:
df.set_index('a').isna().sum(level=0).astype(int)
b c
a
1 1 1
2 1 0
I would do count and then subtract from value_counts. The reason I did not use apply is that it usually has bad performance:
df.groupby('a')[['b','c']].count().rsub(df.a.value_counts(dropna=False),axis=0)
Out[78]:
b c
1 1 1
2 1 0
Alternative
df.isna().drop('a',1).astype(int).groupby(df['a']).sum()
Out[83]:
b c
a
1 1 1
2 1 0
You need to drop the column after using apply.
df.groupby('a').apply(lambda x: x.isna().sum()).drop('a',1)
Output:
b c
a
1 1 1
2 1 0
Another quick-and-dirty approach:
df.set_index('a').isna().astype(int).groupby(level=0).sum()
Output:
b c
a
1 1 1
2 1 0
You could write your own aggregation function as follows:
df.groupby('a').agg(lambda x: x.isna().sum())
which results in
b c
a
1 1.0 1.0
2 1.0 0.0
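If you want integer counts instead of floats, a small follow-up is to cast the result:
# the lambda aggregation comes back as float, so cast it to int
df.groupby('a').agg(lambda x: x.isna().sum()).astype(int)
#    b  c
# a
# 1  1  1
# 2  1  0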

How to check if a particular cell in pandas DataFrame isnull?

I have the following df in pandas.
0 A B C
1 2 NaN 8
How can I check if df.iloc[1]['B'] is NaN?
I tried using df.isnan() and I get a table like this:
0 A B C
1 false true false
but I am not sure how to index the table and if this is an efficient way of performing the job at all?
Use pd.isnull; for selection, use loc or iloc:
print (df)
0 A B C
0 1 2 NaN 8
print (df.loc[0, 'B'])
nan
a = pd.isnull(df.loc[0, 'B'])
print (a)
True
print (df['B'].iloc[0])
nan
a = pd.isnull(df['B'].iloc[0])
print (a)
True
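For a single known cell, a compact variant (a small sketch of my own, not from the original answer) is to combine the scalar accessor .at with pd.isna:
# .at is the fast scalar accessor; pd.isna also works on scalar values
print(pd.isna(df.at[0, 'B']))
# True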
jezrael's response is spot on. If you are only concerned with whether any NaN values exist at all, I was exploring whether there's a faster option, since in my experience summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
If you are looking for the indexes of NaN in a specific column you can use
list(df['B'].index[df['B'].apply(np.isnan)])
In case you want to get the indexes of all possible NaN values in the dataframe, you may do the following:
row_col_indexes = list(map(list, np.where(np.isnan(np.array(df)))))
indexes = []
for i in zip(row_col_indexes[0], row_col_indexes[1]):
    indexes.append(list(i))
And if you are looking for a one liner you can use:
list(zip(*[x for x in list(map(list, np.where(np.isnan(np.array(df)))))]))
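An arguably simpler sketch for the same row/column positions (my own variant, assuming numpy is imported as np):
# argwhere returns one [row, col] pair per True entry in the boolean mask
nan_positions = np.argwhere(df.isna().values).tolist()
print(nan_positions)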

Square of each element of a column in pandas

How can I square each element of a column/series of a DataFrame in pandas (and create another column to hold the result)?
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
>>> df
a b
0 1 2
1 3 4
>>> df['c'] = df['b']**2
>>> df
a b c
0 1 2 4
1 3 4 16
Nothing wrong with the accepted answer; there is also:
import numpy as np

df = pd.DataFrame({'a': range(0, 100)})
np.square(df)
np.power(df, 2)
Which is ever so slightly faster:
In [11]: %timeit df ** 2
10000 loops, best of 3: 95.9 µs per loop
In [13]: %timeit np.square(df)
10000 loops, best of 3: 85 µs per loop
In [15]: %timeit np.power(df, 2)
10000 loops, best of 3: 85.6 µs per loop
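Note that np.square(df) squares every column of the frame and returns a new object; to add a single squared column as in the accepted answer, a small sketch of my own (the column name a_squared is just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(0, 100)})
# numpy ufuncs applied to a Series return a Series, so the result can be assigned directly
df['a_squared'] = np.square(df['a'])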
You can also use the pandas.DataFrame.pow() method.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2], [3,4]], columns=list('ab'))
>>> df
a b
0 1 2
1 3 4
>>> df['c'] = df['b'].pow(2)
>>> df
a b c
0 1 2 4
1 3 4 16
