How can I compare strings in a case insensitive way in Python?
I would like to encapsulate comparison of a regular strings to a repository string, using simple and Pythonic code. I also would like to have ability to look up values in a dict hashed by strings using regular python strings.
Assuming ASCII strings:
string1 = 'Hello'
string2 = 'hello'
if string1.lower() == string2.lower():
print("The strings are the same (case insensitive)")
else:
print("The strings are NOT the same (case insensitive)")
As of Python 3.3, casefold() is a better alternative:
string1 = 'Hello'
string2 = 'hello'
if string1.casefold() == string2.casefold():
print("The strings are the same (case insensitive)")
else:
print("The strings are NOT the same (case insensitive)")
If you want a more comprehensive solution that handles more complex unicode comparisons, see other answers.
Comparing strings in a case insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.
The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":
>>> "ß".lower()
'ß'
>>> "ß".upper().lower()
'ss'
But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want to compare "BUSSE" and "BUẞE" equal - that's the newer capital form. The recommended way is to use casefold:
str.casefold()
Return a casefolded copy of the string. Casefolded strings may be used for
caseless matching.
Casefolding is similar to lowercasing but more aggressive because it is
intended to remove all case distinctions in a string. [...]
Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).
Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê" - but it doesn't:
>>> "ê" == "ê"
False
This is because the accent on the latter is a combining character.
>>> import unicodedata
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E WITH CIRCUMFLEX']
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']
The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does
>>> unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
True
To finish up, here this is expressed in functions:
import unicodedata
def normalize_caseless(text):
return unicodedata.normalize("NFKD", text.casefold())
def caseless_equal(left, right):
return normalize_caseless(left) == normalize_caseless(right)
Using Python 2, calling .lower() on each string or Unicode object...
string1.lower() == string2.lower()
...will work most of the time, but indeed doesn't work in the situations #tchrist has described.
Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:
>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True
The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.
However, as of Python 3, all three forms will resolve to ς, and calling lower() on both strings will work correctly:
>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True
So if you care about edge-cases like the three sigmas in Greek, use Python 3.
(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
Section 3.13 of the Unicode standard defines algorithms for caseless
matching.
X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).
Casefolding does not preserve the normalization of strings in all instances and therefore the normalization needs to be done ('å' vs. 'å'). D145 introduces "canonical caseless matching":
import unicodedata
def NFD(text):
return unicodedata.normalize('NFD', text)
def canonical_caseless(text):
return NFD(NFD(text).casefold())
NFD() is called twice for very infrequent edge cases involving U+0345 character.
Example:
>>> 'å'.casefold() == 'å'.casefold()
False
>>> canonical_caseless('å') == canonical_caseless('å')
True
There are also compatibility caseless matching (D146) for cases such as '㎒' (U+3392) and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.
I saw this solution here using regex.
import re
if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
# is True
It works well with accents
In [42]: if re.search("ê","ê", re.IGNORECASE):
....: print(1)
....:
1
However, it doesn't work with unicode characters case-insensitive. Thank you #Rhymoid for pointing out that as my understanding was that it needs the exact symbol, for the case to be true. The output is as follows:
In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....: print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....: print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....: print(1)
....:
You can use casefold() method. The casefold() method ignores cases when comparing.
firstString = "Hi EVERYONE"
secondString = "Hi everyone"
if firstString.casefold() == secondString.casefold():
print('The strings are equal.')
else:
print('The strings are not equal.')
Output:
The strings are equal.
The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:
>>> "hello".upper() == "HELLO".upper()
True
>>>
How about converting to lowercase first? you can use string.lower().
a clean solution that I found, where I'm working with some constant file extensions.
from pathlib import Path
class CaseInsitiveString(str):
def __eq__(self, __o: str) -> bool:
return self.casefold() == __o.casefold()
GZ = CaseInsitiveString(".gz")
ZIP = CaseInsitiveString(".zip")
TAR = CaseInsitiveString(".tar")
path = Path("/tmp/ALL_CAPS.TAR.GZ")
GZ in path.suffixes, ZIP in path.suffixes, TAR in path.suffixes, TAR == ".tAr"
# (True, False, True, True)
You can mention case=False in the str.contains()
data['Column_name'].str.contains('abcd', case=False)
def search_specificword(key, stng):
key = key.lower()
stng = stng.lower()
flag_present = False
if stng.startswith(key+" "):
flag_present = True
symb = [',','.']
for i in symb:
if stng.find(" "+key+i) != -1:
flag_present = True
if key == stng:
flag_present = True
if stng.endswith(" "+key):
flag_present = True
if stng.find(" "+key+" ") != -1:
flag_present = True
print(flag_present)
return flag_present
Output:
search_specificword("Affordable housing", "to the core of affordable outHousing in europe")
False
search_specificword("Affordable housing", "to the core of affordable Housing, in europe")
True
from re import search, IGNORECASE
def is_string_match(word1, word2):
# Case insensitively function that checks if two words are the same
# word1: string
# word2: string | list
# if the word1 is in a list of words
if isinstance(word2, list):
for word in word2:
if search(rf'\b{word1}\b', word, IGNORECASE):
return True
return False
# if the word1 is same as word2
if search(rf'\b{word1}\b', word2, IGNORECASE):
return True
return False
is_match_word = is_string_match("Hello", "hELLO")
True
is_match_word = is_string_match("Hello", ["Bye", "hELLO", "#vagavela"])
True
is_match_word = is_string_match("Hello", "Bye")
False
Consider using FoldedCase from jaraco.text:
>>> from jaraco.text import FoldedCase
>>> FoldedCase('Hello World') in ['hello world']
True
And if you want a dictionary keyed on text irrespective of case, use FoldedCaseKeyedDict from jaraco.collections:
>>> from jaraco.collections import FoldedCaseKeyedDict
>>> d = FoldedCaseKeyedDict()
>>> d['heLlo'] = 'world'
>>> list(d.keys()) == ['heLlo']
True
>>> d['hello'] == 'world'
True
>>> 'hello' in d
True
>>> 'HELLO' in d
True
def insenStringCompare(s1, s2):
""" Method that takes two strings and returns True or False, based
on if they are equal, regardless of case."""
try:
return s1.lower() == s2.lower()
except AttributeError:
print "Please only pass strings into this method."
print "You passed a %s and %s" % (s1.__class__, s2.__class__)
This is another regex which I have learned to love/hate over the last week so usually import as (in this case yes) something that reflects how im feeling!
make a normal function.... ask for input, then use ....something = re.compile(r'foo*|spam*', yes.I)...... re.I (yes.I below) is the same as IGNORECASE but you cant make as many mistakes writing it!
You then search your message using regex's but honestly that should be a few pages in its own , but the point is that foo or spam are piped together and case is ignored.
Then if either are found then lost_n_found would display one of them. if neither then lost_n_found is equal to None. If its not equal to none return the user_input in lower case using "return lost_n_found.lower()"
This allows you to much more easily match up anything thats going to be case sensitive. Lastly (NCS) stands for "no one cares seriously...!" or not case sensitive....whichever
if anyone has any questions get me on this..
import re as yes
def bar_or_spam():
message = raw_input("\nEnter FoO for BaR or SpaM for EgGs (NCS): ")
message_in_coconut = yes.compile(r'foo*|spam*', yes.I)
lost_n_found = message_in_coconut.search(message).group()
if lost_n_found != None:
return lost_n_found.lower()
else:
print ("Make tea not love")
return
whatz_for_breakfast = bar_or_spam()
if whatz_for_breakfast == foo:
print ("BaR")
elif whatz_for_breakfast == spam:
print ("EgGs")
I wasn't sure what to title it, but I'm writing a function that checks if a phrase is a palindrome or not. If something is capitalized or not doesn't matter and if there are other characters it will remove them. If the string is the same forwards and backwards (that's what a palindrome is) then it will be a Boolean True, and a Boolean False if it isn't. For example:
is_palindrome('ta.cocat')
#It would return
True
is_palidrome('Tacocat')
#It would return
True
is_palindrome('tacodog')
#It would return
False
I've written code that will take out extra characters, but I can't figure out how to make it that capitalization doesn't matter.
#Strip non-alpha
def to_alphanum(str):
return ''.join([char for char in str if char.isalnum()])
#Palindromes
def is_palindrome(str):
str = to_alphanum(str)
if(str==str[::-1]):
return True
else:
return False
#Here's examples about what it returns
is_palindrome('taco.cat')
True
is_palindrome('Tacocat')
>>> False
just use the lower function on your input string, so that case doesnt matters in your function
str = to_alphanum(str).lower()
You can simply use:
def isPalindrome(s):
s = to_alphanum(s).lower()
rev = s[::-1]
# Checking if both string are equal or not
if (s == rev):
return True
return False
s1 = "malayalam"
>>> isPalindrome(s1)
True
Going off of this link: How to check if a word is an English word with Python?
Is there any way to see (in python) if a string of letters is contained in any word in the English language? For example, fun(wat) would return true since "water" is a word (and I'm sure there are multiple other words that contain wat) but fun(wayterlx) would be false since wayterlx is not contained in any English word. (and it is not a word itself)
Edit: A second example: d.check("blackjack") returns true but d.check("lackjac") returns false, but in the function I am looking for it would return true since it is contained in some english word.
Based on solution to the linked answer.
We can define next utility function using Dict.suggest method
def is_part_of_existing_word(string, words_dictionary):
suggestions = words_dictionary.suggest(string)
return any(string in suggestion
for suggestion in suggestions)
then simply
>>> import enchant
>>> english_dictionary = enchant.Dict("en")
>>> is_part_of_existing_word('wat', words_dictionary=english_dictionary)
True
>>> is_part_of_existing_word('wate', words_dictionary=english_dictionary)
True
>>> is_part_of_existing_word('way', words_dictionary=english_dictionary)
True
>>> is_part_of_existing_word('wayt', words_dictionary=english_dictionary)
False
>>> is_part_of_existing_word('wayter', words_dictionary=english_dictionary)
False
>>> is_part_of_existing_word('wayterlx', words_dictionary=english_dictionary)
False
>>> is_part_of_existing_word('lackjack', words_dictionary=english_dictionary)
True
>>> is_part_of_existing_word('ucumber', words_dictionary=english_dictionary)
True
This is my code. it seems to be checking only the first character. I'd like to know how to fix it.
import string
ASCII_LOWERCASE = "abcdefghijklmnopqrstuvwxyz"
ASCII_UPPERCASE = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DECIMAL_DIGITS = "0123456789"
ALPHABETS = ASCII_LOWERCASE + ASCII_UPPERCASE
def is_alpha(string):
for c in "string":
if c in ALPHABETS:
return False
else:
return True
Do not return in every iteration, only return False when you find something that is not in ALPHABET. What about this?
def is_alpha(your_string):
for c in your_string:
if c not in ALPHABETS:
return False
return True
actually its almost half right:
def is_alpha(string):
for c in string:
if not c in ALPHABETS:
return False
return True
short circuiting on False is ok but you need to finish the loop to determine True
There's several issues with your code here.
In your function definition you say
for c in "string"
which means you're checking the characters in the string literal "string".
You'll need to remove the quotes to have it refer to your parameter and actually check the string you're looking at. Further, you should change the name of your parameter and variable to something less ambiguous than string.
As Two-Bit Alchemist points out, you're returning from inside the loop so if the first character is in ALPHABETS then it returns and doesn't execute any more.
Which goes to the final issue. You're returning False if it is an alpha character instead of true. If you change
if c in ALPHABETS:
return False
to
if c not in ALPHABETS:
return False
then delete the else branch and put the return True after the loop it will work as you intend.
Change your function like this:
def is_alpha(string):
for c in string:
if c not in ALPHABETS:
return False
return True
First of all, you had for c in "string": in your code. There you are creating a new string ("string") and looping through the characters in it. Instead, you want to loop through the parameter passed to your function. So remove the double quotes.
Your function returns after first character because you are returning in both cases. Return true only when all characters are in ALPHABETS
I was looking for a way to write code that checks whether a certain string is a part of another string. I understand that it is easy to do when we have numbers, but I don't know how to do it with strings.
For example, I have this function
a = is_part("motherland", "land")
I need to know that "land" is a part of the word "motherland" (return True or False). Is it possible to check this?
UPDATE: How can I create a restriction when the second word always has to be in the end of the first one. For example, in case when I check whether "eight" is a part of "eighteen" it returns False because "eight" is not at the end of the first word
This should help:
>>> "land" in "motherland"
True
>>> "banana" in "motherland"
False
Here is a function that determines whether a string target is contained within another string some_string.
def is_part(some_string, target):
return target in some_string
>>> is_part('motherland', 'land')
True
>>> is_part('motherland', 'father')
False
>>> is_part('motherland', '')
True
If you don't like an empty string returning true, change the return statement to
return (target in some_string) if target else False
If, on the other hand, you need to implement it yourself:
def is_part(some_string, target):
if target:
target_len = len(target)
for i in range(len(some_string)):
if some_string[i:i+target_len] == target:
return True
return False