How can I compare strings in a case insensitive way in Python?
I would like to encapsulate comparison of a regular string to a repository string, using simple and Pythonic code. I would also like to be able to look up values in a dict hashed by strings using regular Python strings.
Assuming ASCII strings:
string1 = 'Hello'
string2 = 'hello'
if string1.lower() == string2.lower():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")
As of Python 3.3, casefold() is a better alternative:
string1 = 'Hello'
string2 = 'hello'
if string1.casefold() == string2.casefold():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")
If you want a more comprehensive solution that handles more complex unicode comparisons, see other answers.
Comparing strings in a case insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.
The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":
>>> "ß".lower()
'ß'
>>> "ß".upper().lower()
'ss'
But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want to compare "BUSSE" and "BUẞE" equal - that's the newer capital form. The recommended way is to use casefold:
str.casefold()
Return a casefolded copy of the string. Casefolded strings may be used for
caseless matching.
Casefolding is similar to lowercasing but more aggressive because it is
intended to remove all case distinctions in a string. [...]
Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).
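To make the difference concrete, here is a small sketch comparing the three approaches mentioned above. The helper names are made up for illustration, and the sharp s is written as an escape so the example does not depend on how the page is encoded.

```python
def equal_lower(left, right):
    # Plain lowercasing: leaves the sharp s untouched.
    return left.lower() == right.lower()

def equal_upper_lower(left, right):
    # Fallback when casefold is not available (helps only somewhat).
    return left.upper().lower() == right.upper().lower()

def equal_casefold(left, right):
    # Recommended: casefold removes case distinctions more aggressively.
    return left.casefold() == right.casefold()

word = "Bu\u00dfe"  # "Busse" spelled with U+00DF LATIN SMALL LETTER SHARP S

print(equal_lower("BUSSE", word))
print(equal_upper_lower("BUSSE", word))
print(equal_casefold("BUSSE", word))
```

Only the first comparison fails here, because lower() keeps the sharp s while the other two expand it to "ss".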
Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê" - but it doesn't:
>>> "ê" == "ê"
False
This is because the accent on the latter is a combining character.
>>> import unicodedata
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E WITH CIRCUMFLEX']
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']
The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does
>>> unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
True
To finish up, here this is expressed in functions:
import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
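A quick way to try these functions out is shown below. The two functions are repeated so the snippet runs on its own, and the test strings are spelled with explicit escapes (my choice, not part of the original answer) so the precomposed and combining forms cannot get mixed up in copy and paste.

```python
import unicodedata

def normalize_caseless(text):
    # Casefold first, then decompose so combining sequences compare equal.
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)

precomposed = "\u00ea"  # LATIN SMALL LETTER E WITH CIRCUMFLEX
combining = "e\u0302"   # LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT

print(precomposed == combining)                # plain comparison
print(caseless_equal(precomposed, combining))  # normalized comparison
print(caseless_equal("BUSSE", "Bu\u00dfe"))    # sharp s
print(caseless_equal("BUSSE", "BU\u1e9eE"))    # capital sharp s
```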
Using Python 2, calling .lower() on each string or Unicode object...
string1.lower() == string2.lower()
...will work most of the time, but indeed doesn't work in the situations @tchrist has described.
Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:
>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True
The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.
However, in Python 3, lower() converts a word-final Σ to ς and any other Σ to σ, so calling lower() on both strings will work correctly:
>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True
So if you care about edge-cases like the three sigmas in Greek, use Python 3.
(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
Section 3.13 of the Unicode standard defines algorithms for caseless matching.
X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).
Casefolding does not preserve the normalization of strings in all instances and therefore the normalization needs to be done ('å' vs. 'å'). D145 introduces "canonical caseless matching":
import unicodedata
def NFD(text):
    return unicodedata.normalize('NFD', text)

def canonical_caseless(text):
    return NFD(NFD(text).casefold())
NFD() is called twice for very infrequent edge cases involving U+0345 character.
Example:
>>> 'å'.casefold() == 'å'.casefold()
False
>>> canonical_caseless('å') == canonical_caseless('å')
True
There is also compatibility caseless matching (D146) for cases such as '㎒' (U+3392), and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.
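As a rough sketch of compatibility caseless matching, D146 can be read as NFKD(toCasefold(NFKD(toCasefold(NFD(X))))). The function name below is my own, and the snippet is an illustration of that definition rather than a vetted implementation.

```python
import unicodedata

def compatibility_caseless(text):
    # NFD, casefold, NFKD, casefold again, NFKD again (after D146).
    nfd = unicodedata.normalize("NFD", text)
    first_pass = unicodedata.normalize("NFKD", nfd.casefold())
    return unicodedata.normalize("NFKD", first_pass.casefold())

square_mhz = "\u3392"  # SQUARE MHZ, a single compatibility character

print(square_mhz.casefold() == "mhz".casefold())
print(compatibility_caseless(square_mhz) == compatibility_caseless("MHZ"))
```

The second casefold is needed because the compatibility decomposition can introduce new cased letters, as it does here with "MHz".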
I saw this solution here using regex.
import re
if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
    pass  # the condition is True
It works well with accents
In [42]: if re.search("ê","ê", re.IGNORECASE):
....: print(1)
....:
1
However, it doesn't handle case-insensitive matching for every Unicode character. Thank you @Rhymoid for pointing that out; my understanding is that it needs the exact symbol for the match to succeed. The output is as follows:
In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....: print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....: print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....: print(1)
....:
You can use the casefold() method. It removes case distinctions, so the folded strings can be compared regardless of case.
firstString = "Hi EVERYONE"
secondString = "Hi everyone"
if firstString.casefold() == secondString.casefold():
    print('The strings are equal.')
else:
    print('The strings are not equal.')
Output:
The strings are equal.
The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:
>>> "hello".upper() == "HELLO".upper()
True
>>>
How about converting to lowercase first? You can use string.lower().
A clean solution that I found, where I'm working with some constant file extensions:
from pathlib import Path

class CaseInsitiveString(str):
    def __eq__(self, __o: str) -> bool:
        return self.casefold() == __o.casefold()
GZ = CaseInsitiveString(".gz")
ZIP = CaseInsitiveString(".zip")
TAR = CaseInsitiveString(".tar")
path = Path("/tmp/ALL_CAPS.TAR.GZ")
GZ in path.suffixes, ZIP in path.suffixes, TAR in path.suffixes, TAR == ".tAr"
# (True, False, True, True)
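One caveat with a str subclass like this: a class that defines __eq__ without __hash__ becomes unhashable, and str's own __ne__ is still inherited, so != stays case sensitive. A hedged sketch that covers both (the class name CaseInsensitiveStr is hypothetical, not from the answer above):

```python
class CaseInsensitiveStr(str):
    """str subclass whose ==, != and hash all ignore case."""

    def __eq__(self, other):
        if isinstance(other, str):
            return self.casefold() == other.casefold()
        return NotImplemented

    def __ne__(self, other):
        # str defines its own __ne__, so it has to be overridden as well.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Equal objects must hash equally, so hash the folded form.
        return hash(self.casefold())

GZ = CaseInsensitiveStr(".gz")
print(GZ == ".GZ", GZ != ".GZ")
print(CaseInsensitiveStr(".GZ") in {GZ})
```

With __hash__ defined, instances can be used in sets and as dict keys, as long as every key involved is a CaseInsensitiveStr.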
You can pass case=False to str.contains() (here on a pandas column):
data['Column_name'].str.contains('abcd', case=False)
def search_specificword(key, stng):
    key = key.lower()
    stng = stng.lower()
    flag_present = False
    if stng.startswith(key+" "):
        flag_present = True
    symb = [',','.']
    for i in symb:
        if stng.find(" "+key+i) != -1:
            flag_present = True
    if key == stng:
        flag_present = True
    if stng.endswith(" "+key):
        flag_present = True
    if stng.find(" "+key+" ") != -1:
        flag_present = True
    print(flag_present)
    return flag_present
Output:
search_specificword("Affordable housing", "to the core of affordable outHousing in europe")
False
search_specificword("Affordable housing", "to the core of affordable Housing, in europe")
True
from re import search, IGNORECASE

def is_string_match(word1, word2):
    # Case-insensitive function that checks if two words are the same
    # word1: string
    # word2: string | list
    # if word1 is in a list of words
    if isinstance(word2, list):
        for word in word2:
            if search(rf'\b{word1}\b', word, IGNORECASE):
                return True
        return False
    # if word1 is the same as word2
    if search(rf'\b{word1}\b', word2, IGNORECASE):
        return True
    return False
is_match_word = is_string_match("Hello", "hELLO")
True
is_match_word = is_string_match("Hello", ["Bye", "hELLO", "#vagavela"])
True
is_match_word = is_string_match("Hello", "Bye")
False
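One thing to watch with a pattern built from an f-string like this: the word is interpreted as a regular expression, so characters such as "." act as wildcards. A small sketch of the difference, using re.escape (the function names are invented for the example):

```python
import re

def word_match_unescaped(word, text):
    # The word is inserted into the pattern as-is.
    return re.search(rf"\b{word}\b", text, re.IGNORECASE) is not None

def word_match_escaped(word, text):
    # re.escape makes every character in the word match literally.
    return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None

print(word_match_unescaped("a.c", "ABC"))  # "." matches any character
print(word_match_escaped("a.c", "ABC"))
print(word_match_escaped("a.c", "see A.C here"))
```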
Consider using FoldedCase from jaraco.text:
>>> from jaraco.text import FoldedCase
>>> FoldedCase('Hello World') in ['hello world']
True
And if you want a dictionary keyed on text irrespective of case, use FoldedCaseKeyedDict from jaraco.collections:
>>> from jaraco.collections import FoldedCaseKeyedDict
>>> d = FoldedCaseKeyedDict()
>>> d['heLlo'] = 'world'
>>> list(d.keys()) == ['heLlo']
True
>>> d['hello'] == 'world'
True
>>> 'hello' in d
True
>>> 'HELLO' in d
True
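If you would rather not add a dependency, the same idea can be sketched with the standard library alone. This is not how jaraco.collections implements it; FoldedKeyDict is a hypothetical name, and the sketch assumes all keys are strings.

```python
from collections.abc import MutableMapping

class FoldedKeyDict(MutableMapping):
    """Mapping that looks keys up by their casefolded form."""

    def __init__(self, *args, **kwargs):
        self._store = {}  # folded key -> (original key, value)
        self.update(dict(*args, **kwargs))

    def __setitem__(self, key, value):
        self._store[key.casefold()] = (key, value)

    def __getitem__(self, key):
        return self._store[key.casefold()][1]

    def __delitem__(self, key):
        del self._store[key.casefold()]

    def __iter__(self):
        # Iterate over the keys as they were originally written.
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)

d = FoldedKeyDict()
d['heLlo'] = 'world'
print(d['hello'], 'HELLO' in d, list(d))
```

MutableMapping supplies __contains__, get, keys and the rest on top of the five methods defined here.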
def insenStringCompare(s1, s2):
    """Method that takes two strings and returns True or False, based
    on if they are equal, regardless of case."""
    try:
        return s1.lower() == s2.lower()
    except AttributeError:
        print("Please only pass strings into this method.")
        print("You passed a %s and %s" % (s1.__class__, s2.__class__))
This is another regex approach. I have learned to love/hate regexes over the last week, so I usually import re under a name that reflects how I'm feeling (in this case, yes)!
Make a normal function, ask for input, then use something = re.compile(r'foo*|spam*', yes.I). re.I (yes.I below) is the same as IGNORECASE, but you can't make as many mistakes writing it!
You then search your message using the regex. Honestly, regexes deserve a few pages of their own, but the point is that foo and spam are piped together and case is ignored.
If either is found, lost_n_found holds the matched text; if neither is found, the search returns None. If there was a match, return it in lower case.
This lets you match up anything that would otherwise be case sensitive much more easily. Lastly, (NCS) stands for "no one cares seriously...!" or "not case sensitive", whichever you prefer.
If anyone has any questions, get in touch with me about this.
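import re as yes

def bar_or_spam():
    # raw_input is Python 2; use input() on Python 3
    message = raw_input("\nEnter FoO for BaR or SpaM for EgGs (NCS): ")
    message_in_coconut = yes.compile(r'foo*|spam*', yes.I)
    # search() returns None when nothing matches, so check before calling group()
    match = message_in_coconut.search(message)
    lost_n_found = match.group() if match else None
    if lost_n_found is not None:
        return lost_n_found.lower()
    else:
        print("Make tea not love")
        return

whatz_for_breakfast = bar_or_spam()
# compare against string literals; foo and spam were undefined names
if whatz_for_breakfast == "foo":
    print("BaR")
elif whatz_for_breakfast == "spam":
    print("EgGs")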
Why do these two functions give different outputs when the logic or idea is the same and they are working with the same string?
def solution(inputString):
    a = ""
    b = a[::-1]
    if a == b:
        return True
    else:
        return False

print(solution("az"))

def ans(something):
    if something == reversed(something):
        print(True)
    else:
        print(False)

ans('az')
This is I think because you are not using your inputString parameter in the function solution(). This may be closer to what you want:
def solution(inputString):
    a = inputString
    b = a[::-1]
    if a == b:
        return True
    else:
        return False
when the logic or idea is the same
No, the solution and ans functions have different logic.
solution uses the common way of reversing a string, and that's fine.
However, the second function uses the reversed() function, which does this:
reversed(seq)
Return a reverse iterator. seq must be an object which has a __reversed__() method or supports the sequence protocol ...
It does not return the reversed string as you'd probably expected.
To put that in perspective, the following code returns False:
print("oof" == reversed("foo"))
Because the return value of reversed("foo") is a <reversed object> and not the reversed string.
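If you do want to build on reversed(), the iterator has to be joined back into a string first. A minimal sketch (is_palindrome is just an illustrative name):

```python
def is_palindrome(text):
    # reversed() yields characters one by one; join them back into a str.
    return text == "".join(reversed(text))

print("oof" == reversed("foo"))           # str compared with an iterator
print("oof" == "".join(reversed("foo")))  # str compared with a str
print(is_palindrome("az"), is_palindrome("aza"))
```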
Assume that x is a string variable that has been given a value. Write an expression whose value is true if and only if x is NOT a letter.
def isLetter(ch):
    import string
    return len(ch) == 1 and ch in string.ascii_letters

print(isLetter('A'))
True
If you want to check the type of the variable x, you can use the following:
if type(x) is str:
    print 'is a string'
In python, String and Char will have the same type and same output unlike languages like java.
type(chr(65)) == str
type('A') == str
EDIT:
As @Kay suggested, you should use isinstance(foo, Bar) instead of type(foo) is Bar, since isinstance takes inheritance into account while type does not.
See this for more details about isinstance vs type
Using isinstance will also support unicode strings.
>>> isinstance(u"A", basestring)
True
>>> # Here is an example of why isinstance is better than type
>>> type(u"A") is str
False
>>> type(u"A") is basestring
False
>>> type(u"A") is unicode
True
EDIT 2:
Using regex to validate ONLY ONE letter
>>> import re
>>> re.match("^[a-zA-Z]$", "a") is not None
True
>>> re.match("^[a-zA-Z]$", "0") is not None
False
Turns out the answer to it is not((x>='A' and x<='Z') or (x>='a' and x<='z'))
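The same test can be wrapped in a function and checked against a few values. This sketch assumes "letter" means a single ASCII letter, which is what the range comparison above tests for one-character strings; is_not_letter is a made-up name.

```python
import string

def is_not_letter(x):
    # True unless x is exactly one ASCII letter.
    return not (len(x) == 1 and x in string.ascii_letters)

for value in ["a", "Z", "0", "ab", ""]:
    print(repr(value), is_not_letter(value))
```

Unlike the bare range comparison, the length check also rejects longer strings such as "ab", which would otherwise fall between 'a' and 'z' in string ordering.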
Use a regular expression. http://www.pythex.org is a good place to learn, and the official docs are here: https://docs.python.org/2/library/re.html
Something like this should work, though (after import re):
if not re.match('^[a-zA-Z]$', x):
I have always thought that using -1 in a condition is always the same as writing False (boolean value). But from my code, I get different results:
Using True and False:
import string

def count(sub, s):
    count = 0
    index = 0
    while True:
        if string.find(s, sub, index) != False:
            count += 1
            index = string.find(s, sub, index) + 1
        else:
            return count

print count('nana', 'banana')
Result: Takes too long for the interpreter to respond.
Using 1 and -1:
import string

def count(sub, s):
    count = 0
    index = 0
    while 1:
        if string.find(s, sub, index) != -1:
            count += 1
            index = string.find(s, sub, index) + 1
        else:
            return count

print count('nana', 'banana')
Result: 1
Why does using -1 and 1 give me the correct result whereas using the bool values True and False do not?
string.find doesn't return a boolean so string.find('banana', 'nana', index) will NEVER return 0 (False) regardless of the value of index.
>>> import string
>>> help(string.find)
Help on function find in module string:
find(s, *args)
find(s, sub [, start [, end]]) -> int
Return the lowest index in s where substring sub is found,
such that sub is contained within s[start,end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
>>>
Your example simply repeats:
index = string.find('banana', 'nana', 0) + 1 # index = 3
index = string.find('banana', 'nana', 3) + 1 # index = 0
The -1 version works because it correctly interprets the return value of string.find!
False is of type bool, which is a sub-type of int, and its value is 0.
In Python, False is similar to using 0, not -1
There's a difference between equality and converting to a boolean value for truth testing, for both historical and flexibility reasons:
>>> True == 1
True
>>> True == -1
False
>>> bool(-1)
True
>>> False == 0
True
>>> bool(0)
False
>>> True == 2
False
>>> bool(2)
True
I have always thought that using -1 in a condition is always the same as writing False (boolean value).
1) No. It is never the same, and I can't imagine why you would have ever thought this, let alone always thought it. Unless for some reason you had only ever used if with string.find or something.
2) You shouldn't be using the string module in the first place. Quoting directly from the documentation:
DESCRIPTION
Warning: most of the code you see here isn't normally used nowadays.
Beginning with Python 1.6, many of these functions are implemented as
methods on the standard string object. They used to be implemented by
a built-in module called strop, but strop is now obsolete itself.
So instead of string.find('foobar', 'foo'), we use the .find method of the str class itself (the class that 'foobar' and 'foo' belong to); and since we have objects of that class, we can make bound method calls, thus: 'foobar'.find('foo').
3) The .find method of strings returns a number that tells you where the substring was found, if it was found. If the substring wasn't found, it returns -1. It cannot return 0 in this case, because that would mean "was found at the beginning".
4) False will compare equal to 0. It is worth noting that Python actually implements its bool type as a subclass of int.
5) No matter what language you are using, you should not compare to boolean literals. x == False or equivalent is, quite simply, not the right thing to write. It gains you nothing in terms of clarity, and creates opportunities to make mistakes.
You would never, ever say "If it is true that it is raining, I will need an umbrella" in English, even though that is grammatically correct. There is no point; it is not more polite nor more clear than the obvious "If it is raining, I will need an umbrella".
If you want to use a value as a boolean, then use it as a boolean. If you want to use the result of a comparison (i.e. "is the value equal to -1 or not?"), then perform the comparison.
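Putting these points together, the counting function from the question could be written with the str.find method and an explicit comparison against -1. This is only one possible sketch, and count_overlapping is a name chosen for the example.

```python
def count_overlapping(sub, s):
    """Count occurrences of sub in s, including overlapping ones."""
    count = 0
    index = s.find(sub)
    while index != -1:  # -1 means "not found"; 0 is a valid position
        count += 1
        index = s.find(sub, index + 1)
    return count

print(count_overlapping('nana', 'banana'))
print(count_overlapping('ana', 'banana'))
print('banana'.count('ana'))  # built-in count skips overlapping matches
```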
I'm looking for a string.contains or string.indexof method in Python.
I want to do:
if not somestring.contains("blah"):
    continue
Use the in operator:
if "blah" not in somestring:
    continue
If it's just a substring search you can use string.find("substring").
You do have to be a little careful with find, index, and in though, as they are substring searches. In other words, this:
s = "This be a string"
if s.find("is") == -1:
    print("No 'is' here!")
else:
    print("Found 'is' in the string.")
It would print Found 'is' in the string. Similarly, if "is" in s: would evaluate to True. This may or may not be what you want.
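If what you actually want is a whole-word match, one option is a regular expression with word boundaries. A hedged sketch (contains_word is an invented helper, and \b follows the regex notion of a word, which may not fit every language):

```python
import re

def contains_word(word, text):
    # \b anchors the match at word boundaries; re.escape keeps the word literal.
    return re.search(rf"\b{re.escape(word)}\b", text) is not None

s = "This be a string"
print("is" in s)                # substring search finds the "is" in "This"
print(contains_word("is", s))   # no standalone word "is"
print(contains_word("be", s))
```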
Does Python have a string contains substring method?
99% of use cases will be covered using the keyword, in, which returns True or False:
'substring' in any_string
For the use case of getting the index, use str.find (which returns -1 on failure, and has optional positional arguments):
start = 0
stop = len(any_string)
any_string.find('substring', start, stop)
or str.index (like find but raises ValueError on failure):
start = 100
end = 1000
any_string.index('substring', start, end)
Explanation
Use the in comparison operator because
the language intends its usage, and
other Python programmers will expect you to use it.
>>> 'foo' in '**foo**'
True
The opposite (complement), which the original question asked for, is not in:
>>> 'foo' not in '**foo**' # returns False
False
This is semantically the same as not 'foo' in '**foo**' but it's much more readable and explicitly provided for in the language as a readability improvement.
Avoid using __contains__
The "contains" method implements the behavior for in. This example,
str.__contains__('**foo**', 'foo')
returns True. You could also call this function from the instance of the superstring:
'**foo**'.__contains__('foo')
But don't. Methods that start with underscores are considered semantically non-public. The only reason to use this is when implementing or extending the in and not in functionality (e.g. if subclassing str):
class NoisyString(str):
    def __contains__(self, other):
        print(f'testing if "{other}" in "{self}"')
        return super(NoisyString, self).__contains__(other)

ns = NoisyString('a string with a substring inside')
and now:
>>> 'substring' in ns
testing if "substring" in "a string with a substring inside"
True
Don't use find and index to test for "contains"
Don't use the following string methods to test for "contains":
>>> '**foo**'.index('foo')
2
>>> '**foo**'.find('foo')
2
>>> '**oo**'.find('foo')
-1
>>> '**oo**'.index('foo')
Traceback (most recent call last):
File "<pyshell#40>", line 1, in <module>
'**oo**'.index('foo')
ValueError: substring not found
Other languages may have no methods to directly test for substrings, and so you would have to use these types of methods, but with Python, it is much more efficient to use the in comparison operator.
Also, these are not drop-in replacements for in. You may have to handle the exception or -1 cases, and if they return 0 (because they found the substring at the beginning) the boolean interpretation is False instead of True.
If you really mean not any_string.startswith(substring) then say it.
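The truthiness pitfall is easy to demonstrate. In this short sketch the variable names are arbitrary:

```python
text = "foobar"

found_at_start = text.find("foo")  # 0: found at the very beginning
missing = text.find("xyz")         # -1: not found

# Using the raw result as a condition gets both cases backwards.
print(bool(found_at_start), bool(missing))

# The in operator gives the intended answers.
print("foo" in text, "xyz" in text)
```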
Performance comparisons
We can compare various ways of accomplishing the same goal.
import timeit

def in_(s, other):
    return other in s

def contains(s, other):
    return s.__contains__(other)

def find(s, other):
    return s.find(other) != -1

def index(s, other):
    try:
        s.index(other)
    except ValueError:
        return False
    else:
        return True

perf_dict = {
    'in:True': min(timeit.repeat(lambda: in_('superstring', 'str'))),
    'in:False': min(timeit.repeat(lambda: in_('superstring', 'not'))),
    '__contains__:True': min(timeit.repeat(lambda: contains('superstring', 'str'))),
    '__contains__:False': min(timeit.repeat(lambda: contains('superstring', 'not'))),
    'find:True': min(timeit.repeat(lambda: find('superstring', 'str'))),
    'find:False': min(timeit.repeat(lambda: find('superstring', 'not'))),
    'index:True': min(timeit.repeat(lambda: index('superstring', 'str'))),
    'index:False': min(timeit.repeat(lambda: index('superstring', 'not'))),
}
And now we see that using in is much faster than the others.
Less time to do an equivalent operation is better:
>>> perf_dict
{'in:True': 0.16450627865128808,
'in:False': 0.1609668098178645,
'__contains__:True': 0.24355481654697542,
'__contains__:False': 0.24382793854783813,
'find:True': 0.3067379407923454,
'find:False': 0.29860888058124146,
'index:True': 0.29647137792585454,
'index:False': 0.5502287584545229}
How can in be faster than __contains__ if in uses __contains__?
This is a fine follow-on question.
Let's disassemble functions with the methods of interest:
>>> from dis import dis
>>> dis(lambda: 'a' in 'b')
1 0 LOAD_CONST 1 ('a')
2 LOAD_CONST 2 ('b')
4 COMPARE_OP 6 (in)
6 RETURN_VALUE
>>> dis(lambda: 'b'.__contains__('a'))
1 0 LOAD_CONST 1 ('b')
2 LOAD_METHOD 0 (__contains__)
4 LOAD_CONST 2 ('a')
6 CALL_METHOD 1
8 RETURN_VALUE
so we see that the .__contains__ method has to be separately looked up and then called from the Python virtual machine - this should adequately explain the difference.
if needle in haystack: is the normal use, as @Michael says -- it relies on the in operator, which is more readable and faster than a method call.
If you truly need a method instead of an operator (e.g. to do some weird key= for a very peculiar sort...?), that would be 'haystack'.__contains__. But since your example is for use in an if, I guess you don't really mean what you say;-). It's not good form (nor readable, nor efficient) to use special methods directly -- they're meant to be used, instead, through the operators and builtins that delegate to them.
in Python strings and lists
Here are a few useful examples that speak for themselves concerning the in method:
>>> "foo" in "foobar"
True
>>> "foo" in "Foobar"
False
>>> "foo" in "Foobar".lower()
True
>>> "foo".capitalize() in "Foobar"
True
>>> "foo" in ["bar", "foo", "foobar"]
True
>>> "foo" in ["fo", "o", "foobar"]
False
>>> ["foo" in a for a in ["fo", "o", "foobar"]]
[False, False, True]
Caveat. Lists are iterables, and the in method acts on iterables, not just strings.
If you want to compare strings in a more fuzzy way to measure how "alike" they are, consider using the Levenshtein package
Here's an answer that shows how it works.
If you are happy with "blah" in somestring but want it to be a function/method call, you can probably do this
import operator
if not operator.contains(somestring, "blah"):
    continue
All operators in Python can be more or less found in the operator module including in.
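A function form is mostly useful when you need to pass the test around, for example to filter. A small sketch, where the sentence and the names are placeholders; note that operator.contains(a, b) means b in a, so the container comes first:

```python
import operator
from functools import partial

sentence = "bob and john"
names = ["bob", "john", "mike"]

# partial fixes the container, leaving a one-argument "is it in sentence?" test.
in_sentence = partial(operator.contains, sentence)

print(operator.contains(sentence, "bob"))
print(list(filter(in_sentence, names)))
```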
So apparently there is nothing similar for vector-wise comparison. An obvious Python way to do so would be:
names = ['bob', 'john', 'mike']
any(st in 'bob and john' for st in names)
>> True
any(st in 'mary and jane' for st in names)
>> False
You can use y.count().
It will return the integer value of the number of times a sub string appears in a string.
For example:
string.count("bah") >> 0
string.count("Hello") >> 1
Here is your answer:
if "insert_char_or_string_here" in "insert_string_to_search_here":
    pass  # DOSTUFF
For checking if it is false:
if not "insert_char_or_string_here" in "insert_string_to_search_here":
    pass  # DOSTUFF
OR:
if "insert_char_or_string_here" not in "insert_string_to_search_here":
    pass  # DOSTUFF
You can use regular expressions to get the occurrences:
>>> import re
>>> print(re.findall(r'( |t)', to_search_in)) # searches for t or space
['t', ' ', 't', ' ', ' ']