Select rows in Pandas which do not contain a specific character - python

I need something similar to
.str.startswith()
.str.endswith()
but for the middle part of a string.
For example, given the following pd.DataFrame
str_name
0 aaabaa
1 aabbcb
2 baabba
3 aacbba
4 baccaa
5 ababaa
I need to throw away rows 1, 3 and 4, which contain (at least one) letter 'c'.
The position of the specific letter ('c') is not known.
The task is to remove all rows which contain at least one occurrence of a specific letter.

You want df['string_column'].str.contains('c')
>>> df
str_name
0 aaabaa
1 aabbcb
2 baabba
3 aacbba
4 baccaa
5 ababaa
>>> df['str_name'].str.contains('c')
0 False
1 True
2 False
3 True
4 True
5 False
Name: str_name, dtype: bool
Now, you can "delete" like this:
>>> df = df[~df['str_name'].str.contains('c')]
>>> df
str_name
0 aaabaa
2 baabba
5 ababaa
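As an aside (not part of the original example): if the column can contain missing values, str.contains returns NaN for them and the boolean indexing will fail; passing na=False treats missing values as non-matches:
>>> df[~df['str_name'].str.contains('c', na=False)]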
Edited to add:
If you only want to check the first k characters, you can slice. Suppose k=3:
>>> df.str_name.str.slice(0,3)
0 aaa
1 aab
2 baa
3 aac
4 bac
5 aba
Name: str_name, dtype: object
>>> df.str_name.str.slice(0,3).str.contains('c')
0 False
1 False
2 False
3 True
4 True
5 False
Name: str_name, dtype: bool
Note that Series.str.slice takes start and stop as ordinary arguments rather than Python's slice syntax.
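If you prefer slice syntax, the .str indexer accepts an ordinary Python slice and gives the same result here:
>>> df.str_name.str[:3].str.contains('c')
0 False
1 False
2 False
3 True
4 True
5 False
Name: str_name, dtype: bool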

You can use NumPy. np.char.find returns the position of the first occurrence of the substring, or -1 if it is absent:
import numpy as np

df[np.char.find(df.str_name.values.astype(str), 'c') < 0]
str_name
0 aaabaa
2 baabba
5 ababaa

You can use str.contains()
str_name = pd.Series(['aaabaa', 'aabbcb', 'baabba', 'aacbba', 'baccaa','ababaa'])
str_name.str.contains('c')
This will return a boolean Series.
The following returns the inverse of the above:
~str_name.str.contains('c')
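One caveat worth knowing: str.contains treats its pattern as a regular expression by default, so regex metacharacters need regex=False (or escaping) to be matched literally. A minimal sketch with made-up strings:
import pandas as pd

s = pd.Series(['a.c', 'abc', 'aaa'])
s.str.contains('.')               # '.' as regex matches any character: all True
s.str.contains('.', regex=False)  # literal dot: True, False, False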

Related

df.duplicated() not finding duplicates

I am trying to run this code.
import pandas as pd
df = pd.DataFrame({'A':['1','2'],
'B':['1','2'],
'C':['1','2']})
print(df.duplicated())
It gives me this output:
0 False
1 False
dtype: bool
I want to know why it is showing index 1 as False and not True.
I'm expecting this output:
0 False
1 True
dtype: bool
I'm using Python 3.11.1 and Pandas 1.4.4
duplicated works on full rows (or on a subset of the columns if the subset parameter is used).
Here you don't have any duplicates:
A B C
0 1 1 1 # this row is unique
1 2 2 2 # this one is also unique
I believe you might want to check duplication column-wise?
df.T.duplicated()
Output:
A False
B True
C True
dtype: bool
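If you instead want to compare rows on only some of the columns, duplicated takes a subset parameter. A minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'A': ['1', '1'], 'B': ['2', '2'], 'C': ['x', 'y']})
print(df.duplicated())                   # 0 False, 1 False -- full rows differ in C
print(df.duplicated(subset=['A', 'B']))  # 0 False, 1 True  -- A and B repeat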
You are not getting the expected output because you don't have duplicates to begin with. I added duplicate rows to the end of your dataframe, and this is closer to what you are looking for:
import pandas as pd
df = pd.DataFrame({'A':['1','2'],
'B':['1','2'],
'C':['1','2']})
df = pd.concat([df]*2)
df
A B C
0 1 1 1
1 2 2 2
0 1 1 1
1 2 2 2
df.duplicated(keep='first')
Output:
0 False
1 False
0 True
1 True
dtype: bool
And if you want to mark duplicates the other way around:
df.duplicated(keep='last')
0 True
1 True
0 False
1 False
dtype: bool
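And if you want to flag every copy, including the first occurrence, keep=False marks them all:
df.duplicated(keep=False)
0 True
1 True
0 True
1 True
dtype: bool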

Pandas: check if column value is unique

I have a DataFrame like:
value
0 1
1 2
2 2
3 3
4 4
5 4
I need to check whether each value is unique or not, and write that boolean value to a new column. Expected result would be:
value unique
0 1 True
1 2 False
2 2 False
3 3 True
4 4 False
5 4 False
I have tried:
df['unique'] = ""
df.loc[df["value"].is_unique, 'unique'] = True
But this throws an exception:
cannot use a single bool to index into setitem
Any advice would be highly appreciated. Thanks.
Use Series.duplicated with keep=False and invert the mask with ~:
df['unique'] = ~df['value'].duplicated(keep=False)
print (df)
value unique
0 1 True
1 2 False
2 2 False
3 3 True
4 4 False
5 4 False
Or:
import numpy as np
df['unique'] = np.where(df['value'].duplicated(keep=False), False, True)
This works as well, though note that the counts column produced by value_counts().to_frame() is named 0 only in older pandas; from pandas 2.0 on it is named 'count':
df['unique'] = df.merge(df.value_counts().to_frame(), on='value')[0] == 1
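For clarity, keep=False is what makes duplicated flag every occurrence of a repeated value rather than only the later ones; compare on the same column:
print(df['value'].duplicated())  # default keep='first'
0 False
1 False
2 True
3 False
4 False
5 True
Name: value, dtype: bool
print(df['value'].duplicated(keep=False))
0 False
1 True
2 True
3 False
4 False
5 True
Name: value, dtype: bool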

Different index return value types for similar conditionals in pandas

I'm trying to solve some questions in pandas and I had two different functions:
One was coded with the return value as
return df.index[df['Gold']==df['Gold'].max()]
This was to return which index value had the highest value of gold
Other one:
return df.index[(df['Gold']-df['Gold.1']).min()]
This is used to represent the index with maximum difference of gold and gold1
When I return values, they are represented differently.
For the first code my return value is:
Index(['United States'], dtype='object')
For the second one:
'Montenegro'
What am I doing differently here?
Pandas already provides methods for that: idxmin() and idxmax():
>>> df = pd.DataFrame({'gold':[1,2,3,6,3,4,7], 'gold.1':[2,6,3,4,6,2,2]})
>>> df
gold gold.1
0 1 2
1 2 6
2 3 3
3 6 4
4 3 6
5 4 2
6 7 2
>>> df.gold.idxmax()
6
>>> (df['gold'] - df['gold.1']).idxmin()
1
# though you probably need
# >>> abs(df.gold - df['gold.1']).idxmin()
# 2
BTW, in your approaches, one is logical (boolean) indexing and the other is positional indexing; for my example dataframe it would look like:
>>> df['gold']==df['gold'].max()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
Name: gold, dtype: bool
>>> (df['gold']-df['gold.1']).min()
-4
The second one does not do what you intend. It simply finds the minimum difference between gold and gold.1, which is -4 in this case, and then looks that value up as a position in df.index, so it is essentially:
>>> df.index[-4] # i.e. fourth from the end
3
However, the minimum difference actually occurs at index 1, not 3:
>>> df['gold']-df['gold.1']
0 -1
1 -4
2 0
3 2
4 -3
5 2
6 5
dtype: int64
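If the goal is the index label of the row with the largest difference, note that idxmax/idxmin already return that label directly, so wrapping the result in df.index[...] is unnecessary. A sketch using the asker's capitalized column names (assumed from the question):
(df['Gold'] - df['Gold.1']).idxmax()        # largest signed difference
(df['Gold'] - df['Gold.1']).abs().idxmax()  # or largest absolute difference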

Perform operation on corresponding matching grouped Pandas dataframe

I have a Dataframe:
User Numbers
A 0
A 4
A 5
B 0
B 0
C 1
C 3
I want to perform an operation on the data of each group. For example, if I want to remove all Users whose Numbers are all 0, it should look like:
User Numbers
A 0
A 4
A 5
C 1
C 3
since all of User B's Numbers are 0.
Or for example, if I want to find the variance of the Numbers of all the Users, it should look like:
Users Variance
A 7
B 0
C 2
This means only the Numbers of A are calculated for finding the variance of A and so on.
Is there a general way to do all these computations for matching grouped data?
You want two different operations: filtration per group and aggregation per group.
Filtration:
For better performance, build a boolean mask with transform and filter by boolean indexing.
df1 = df[~df['Numbers'].eq(0).groupby(df['User']).transform('all')]
print (df1)
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Steps:
1. First create a boolean Series by comparing Numbers with eq:
print(df['Numbers'].eq(0))
0 True
1 False
2 False
3 True
4 True
5 False
6 False
Name: Numbers, dtype: bool
2. Then group the mask by the User column and apply transform('all') to check whether all values per group are True; transform broadcasts the result back to a mask with the same size as the original DataFrame:
print(df['Numbers'].eq(0).groupby(df['User']).transform('all'))
0 False
1 False
2 False
3 True
4 True
5 False
6 False
Name: Numbers, dtype: bool
3. Invert the boolean mask with ~:
print(~df['Numbers'].eq(0).groupby(df['User']).transform('all'))
0 True
1 True
2 True
3 False
4 False
5 True
6 True
Name: Numbers, dtype: bool
4. Filter:
print(df[~df['Numbers'].eq(0).groupby(df['User']).transform('all')])
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Another, slower solution on a large DataFrame uses filter with the same logic as the first solution:
df2 = df.groupby('User').filter(lambda x: not x['Numbers'].eq(0).all())
print(df2)
User Numbers
0 A 0
1 A 4
2 A 5
5 C 1
6 C 3
Aggregation:
For a simple aggregation of one column with one aggregate function, e.g. GroupBy.var, use:
df3 = df.groupby('User', as_index=False)['Numbers'].var()
print(df3)
User Numbers
0 A 7
1 B 0
2 C 2
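More generally, GroupBy.agg applies one or several aggregate functions in a single pass; a minimal sketch on the same frame:
print(df.groupby('User')['Numbers'].agg(['var', 'mean', 'count']))
var mean count
User
A 7.0 3.0 3
B 0.0 0.0 2
C 2.0 2.0 2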

Return the unmatched rows from the regex pattern

If I have a pandas dataframe that looks like this:
Sequence Rating
0 HYHIVQKF 1
1 YGEIFEKF 2
2 TYGGSWKF 3
3 YLESFYKF 4
4 YYNTAVKL 5
5 WPDVIHSF 6
This is the code that I am using to return the rows that match the following pattern:
\b.[YF]\w+[LFI]\b
pat = r'\b.[YF]\w+[LFI]\b'
new_df.Sequence.str.contains(pat)
new_df[new_df.Sequence.str.contains(pat)]
The above code is returning the rows that match the pattern, but what can I use to return the unmatched rows?
Expected Output:
Sequence Rating
1 YGEIFEKF 2
3 YLESFYKF 4
5 WPDVIHSF 6
You can use ~ for not:
pat = r'\b.[YF]\w+[LFI]\b'
new_df[~new_df.Sequence.str.contains(pat)]
# Sequence Rating
#1 YGEIFEKF 2
#3 YLESFYKF 4
#5 WPDVIHSF 6
You can just do a negation of your existing Boolean series:
df[~df.Sequence.str.contains(pat)]
This will give you the desired output:
Sequence Rating
1 YGEIFEKF 2
3 YLESFYKF 4
5 WPDVIHSF 6
Brief explanation:
df.Sequence.str.contains(pat)
will return a Boolean series:
0 True
1 False
2 True
3 False
4 True
5 False
Name: Sequence, dtype: bool
Negating it using ~ yields
~df.Sequence.str.contains(pat)
0 False
1 True
2 False
3 True
4 False
5 True
Name: Sequence, dtype: bool
which is another Boolean series you can pass to your original dataframe.
Psidom's answer is more elegant, but another way to solve this problem is to modify the regex pattern to use a negative lookahead assertion, and then use match() instead of contains():
pat = r'\b.[YF]\w+[LFI]\b'
not_pat = r'(?!{})'.format(pat)
>>> new_df[new_df.Sequence.str.match(pat)]
Sequence Rating
0 HYHIVQKF 1
2 TYGGSWKF 3
4 YYNTAVKL 5
>>> new_df[new_df.Sequence.str.match(not_pat)]
Sequence Rating
1 YGEIFEKF 2
3 YLESFYKF 4
5 WPDVIHSF 6
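The reason match works with the lookahead is that str.match only tests for a match anchored at the start of each string, while str.contains searches anywhere. A quick sketch with toy data:
s = pd.Series(['abc', 'xabc'])
s.str.match('abc')     # True, False
s.str.contains('abc')  # True, True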
