Related
How can I compare strings in a case insensitive way in Python?
I would like to encapsulate comparison of a regular strings to a repository string, using simple and Pythonic code. I also would like to have ability to look up values in a dict hashed by strings using regular python strings.
Assuming ASCII strings:
string1 = 'Hello'
string2 = 'hello'
if string1.lower() == string2.lower():
print("The strings are the same (case insensitive)")
else:
print("The strings are NOT the same (case insensitive)")
As of Python 3.3, casefold() is a better alternative:
string1 = 'Hello'
string2 = 'hello'
if string1.casefold() == string2.casefold():
print("The strings are the same (case insensitive)")
else:
print("The strings are NOT the same (case insensitive)")
If you want a more comprehensive solution that handles more complex unicode comparisons, see other answers.
Comparing strings in a case insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.
The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":
>>> "ß".lower()
'ß'
>>> "ß".upper().lower()
'ss'
But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want to compare "BUSSE" and "BUẞE" equal - that's the newer capital form. The recommended way is to use casefold:
str.casefold()
Return a casefolded copy of the string. Casefolded strings may be used for
caseless matching.
Casefolding is similar to lowercasing but more aggressive because it is
intended to remove all case distinctions in a string. [...]
Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).
Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê" - but it doesn't:
>>> "ê" == "ê"
False
This is because the accent on the latter is a combining character.
>>> import unicodedata
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E WITH CIRCUMFLEX']
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']
The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does
>>> unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
True
To finish up, here this is expressed in functions:
import unicodedata
def normalize_caseless(text):
return unicodedata.normalize("NFKD", text.casefold())
def caseless_equal(left, right):
return normalize_caseless(left) == normalize_caseless(right)
Using Python 2, calling .lower() on each string or Unicode object...
string1.lower() == string2.lower()
...will work most of the time, but indeed doesn't work in the situations #tchrist has described.
Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:
>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True
The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.
However, as of Python 3, all three forms will resolve to ς, and calling lower() on both strings will work correctly:
>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True
So if you care about edge-cases like the three sigmas in Greek, use Python 3.
(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
Section 3.13 of the Unicode standard defines algorithms for caseless
matching.
X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).
Casefolding does not preserve the normalization of strings in all instances and therefore the normalization needs to be done ('å' vs. 'å'). D145 introduces "canonical caseless matching":
import unicodedata
def NFD(text):
return unicodedata.normalize('NFD', text)
def canonical_caseless(text):
return NFD(NFD(text).casefold())
NFD() is called twice for very infrequent edge cases involving U+0345 character.
Example:
>>> 'å'.casefold() == 'å'.casefold()
False
>>> canonical_caseless('å') == canonical_caseless('å')
True
There are also compatibility caseless matching (D146) for cases such as '㎒' (U+3392) and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.
I saw this solution here using regex.
import re
if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
# is True
It works well with accents
In [42]: if re.search("ê","ê", re.IGNORECASE):
....: print(1)
....:
1
However, it doesn't work with unicode characters case-insensitive. Thank you #Rhymoid for pointing out that as my understanding was that it needs the exact symbol, for the case to be true. The output is as follows:
In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....: print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....: print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....: print(1)
....:
You can use casefold() method. The casefold() method ignores cases when comparing.
firstString = "Hi EVERYONE"
secondString = "Hi everyone"
if firstString.casefold() == secondString.casefold():
print('The strings are equal.')
else:
print('The strings are not equal.')
Output:
The strings are equal.
The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:
>>> "hello".upper() == "HELLO".upper()
True
>>>
How about converting to lowercase first? you can use string.lower().
a clean solution that I found, where I'm working with some constant file extensions.
from pathlib import Path
class CaseInsitiveString(str):
def __eq__(self, __o: str) -> bool:
return self.casefold() == __o.casefold()
GZ = CaseInsitiveString(".gz")
ZIP = CaseInsitiveString(".zip")
TAR = CaseInsitiveString(".tar")
path = Path("/tmp/ALL_CAPS.TAR.GZ")
GZ in path.suffixes, ZIP in path.suffixes, TAR in path.suffixes, TAR == ".tAr"
# (True, False, True, True)
You can mention case=False in the str.contains()
data['Column_name'].str.contains('abcd', case=False)
def search_specificword(key, stng):
key = key.lower()
stng = stng.lower()
flag_present = False
if stng.startswith(key+" "):
flag_present = True
symb = [',','.']
for i in symb:
if stng.find(" "+key+i) != -1:
flag_present = True
if key == stng:
flag_present = True
if stng.endswith(" "+key):
flag_present = True
if stng.find(" "+key+" ") != -1:
flag_present = True
print(flag_present)
return flag_present
Output:
search_specificword("Affordable housing", "to the core of affordable outHousing in europe")
False
search_specificword("Affordable housing", "to the core of affordable Housing, in europe")
True
from re import search, IGNORECASE
def is_string_match(word1, word2):
# Case insensitively function that checks if two words are the same
# word1: string
# word2: string | list
# if the word1 is in a list of words
if isinstance(word2, list):
for word in word2:
if search(rf'\b{word1}\b', word, IGNORECASE):
return True
return False
# if the word1 is same as word2
if search(rf'\b{word1}\b', word2, IGNORECASE):
return True
return False
is_match_word = is_string_match("Hello", "hELLO")
True
is_match_word = is_string_match("Hello", ["Bye", "hELLO", "#vagavela"])
True
is_match_word = is_string_match("Hello", "Bye")
False
Consider using FoldedCase from jaraco.text:
>>> from jaraco.text import FoldedCase
>>> FoldedCase('Hello World') in ['hello world']
True
And if you want a dictionary keyed on text irrespective of case, use FoldedCaseKeyedDict from jaraco.collections:
>>> from jaraco.collections import FoldedCaseKeyedDict
>>> d = FoldedCaseKeyedDict()
>>> d['heLlo'] = 'world'
>>> list(d.keys()) == ['heLlo']
True
>>> d['hello'] == 'world'
True
>>> 'hello' in d
True
>>> 'HELLO' in d
True
def insenStringCompare(s1, s2):
""" Method that takes two strings and returns True or False, based
on if they are equal, regardless of case."""
try:
return s1.lower() == s2.lower()
except AttributeError:
print "Please only pass strings into this method."
print "You passed a %s and %s" % (s1.__class__, s2.__class__)
This is another regex which I have learned to love/hate over the last week so usually import as (in this case yes) something that reflects how im feeling!
make a normal function.... ask for input, then use ....something = re.compile(r'foo*|spam*', yes.I)...... re.I (yes.I below) is the same as IGNORECASE but you cant make as many mistakes writing it!
You then search your message using regex's but honestly that should be a few pages in its own , but the point is that foo or spam are piped together and case is ignored.
Then if either are found then lost_n_found would display one of them. if neither then lost_n_found is equal to None. If its not equal to none return the user_input in lower case using "return lost_n_found.lower()"
This allows you to much more easily match up anything thats going to be case sensitive. Lastly (NCS) stands for "no one cares seriously...!" or not case sensitive....whichever
if anyone has any questions get me on this..
import re as yes
def bar_or_spam():
message = raw_input("\nEnter FoO for BaR or SpaM for EgGs (NCS): ")
message_in_coconut = yes.compile(r'foo*|spam*', yes.I)
lost_n_found = message_in_coconut.search(message).group()
if lost_n_found != None:
return lost_n_found.lower()
else:
print ("Make tea not love")
return
whatz_for_breakfast = bar_or_spam()
if whatz_for_breakfast == foo:
print ("BaR")
elif whatz_for_breakfast == spam:
print ("EgGs")
I just tried replacing a character in a python string with a null ('') character. Some weird things are happening. Can someone please explain me why is all this happening?
>>> a = "SampleText"
>>> a
'SampleText'
>>> a.replace('a','\0')
'S\x00mpleText'
>>> len(a)
10
>>> a.replace('\0','a')
'SampleText'
>>> len(a)
10
>>> a.replace('a','')
'SmpleText'
>>> len(a)
10
>>> a.replace('','a')
'aSaaamapalaeaTaeaxata'
>>> len(a)
10
The replace function returns the new string and therefore you need to asign it to a variable again. if you write a = a.replace('a','\0') it'll work as you expect it.
This question already has answers here:
Compare if two variables reference the same object in python
(6 answers)
Closed 5 months ago.
The is operator does not match the values of the variables, but the
instances themselves.
What does it really mean?
I declared two variables named x and y assigning the same values in both variables, but it returns false when I use the is operator.
I need a clarification. Here is my code.
x = [1, 2, 3]
y = [1, 2, 3]
print(x is y) # It prints false!
You misunderstood what the is operator tests. It tests if two variables point the same object, not if two variables have the same value.
From the documentation for the is operator:
The operators is and is not test for object identity: x is y is true if and only if x and y are the same object.
Use the == operator instead:
print(x == y)
This prints True. x and y are two separate lists:
x[0] = 4
print(y) # prints [1, 2, 3]
print(x == y) # prints False
If you use the id() function you'll see that x and y have different identifiers:
>>> id(x)
4401064560
>>> id(y)
4401098192
but if you were to assign y to x then both point to the same object:
>>> x = y
>>> id(x)
4401064560
>>> id(y)
4401064560
>>> x is y
True
and is shows both are the same object, it returns True.
Remember that in Python, names are just labels referencing values; you can have multiple names point to the same object. is tells you if two names point to one and the same object. == tells you if two names refer to objects that have the same value.
Another duplicate was asking why two equal strings are generally not identical, which isn't really answered here:
>>> x = 'a'
>>> x += 'bc'
>>> y = 'abc'
>>> x == y
True
>>> x is y
False
So, why aren't they the same string? Especially given this:
>>> z = 'abc'
>>> w = 'abc'
>>> z is w
True
Let's put off the second part for a bit. How could the first one be true?
The interpreter would have to have an "interning table", a table mapping string values to string objects, so every time you try to create a new string with the contents 'abc', you get back the same object. Wikipedia has a more detailed discussion on how interning works.
And Python has a string interning table; you can manually intern strings with the sys.intern method.
In fact, Python is allowed to automatically intern any immutable types, but not required to do so. Different implementations will intern different values.
CPython (the implementation you're using if you don't know which implementation you're using) auto-interns small integers and some special singletons like False, but not strings (or large integers, or small tuples, or anything else). You can see this pretty easily:
>>> a = 0
>>> a += 1
>>> b = 1
>>> a is b
True
>>> a = False
>>> a = not a
>>> b = True
a is b
True
>>> a = 1000
>>> a += 1
>>> b = 1001
>>> a is b
False
OK, but why were z and w identical?
That's not the interpreter automatically interning, that's the compiler folding values.
If the same compile-time string appears twice in the same module (what exactly this means is hard to define—it's not the same thing as a string literal, because r'abc', 'abc', and 'a' 'b' 'c' are all different literals but the same string—but easy to understand intuitively), the compiler will only create one instance of the string, with two references.
In fact, the compiler can go even further: 'ab' + 'c' can be converted to 'abc' by the optimizer, in which case it can be folded together with an 'abc' constant in the same module.
Again, this is something Python is allowed but not required to do. But in this case, CPython always folds small strings (and also, e.g., small tuples). (Although the interactive interpreter's statement-by-statement compiler doesn't run the same optimization as the module-at-a-time compiler, so you won't see exactly the same results interactively.)
So, what should you do about this as a programmer?
Well… nothing. You almost never have any reason to care if two immutable values are identical. If you want to know when you can use a is b instead of a == b, you're asking the wrong question. Just always use a == b except in two cases:
For more readable comparisons to the singleton values like x is None.
For mutable values, when you need to know whether mutating x will affect the y.
is only returns true if they're actually the same object. If they were the same, a change to one would also show up in the other. Here's an example of the difference.
>>> x = [1, 2, 3]
>>> y = [1, 2, 3]
>>> print x is y
False
>>> z = y
>>> print y is z
True
>>> print x is z
False
>>> y[0] = 5
>>> print z
[5, 2, 3]
Prompted by a duplicate question, this analogy might work:
# - Darling, I want some pudding!
# - There is some in the fridge.
pudding_to_eat = fridge_pudding
pudding_to_eat is fridge_pudding
# => True
# - Honey, what's with all the dirty dishes?
# - I wanted to eat pudding so I made some. Sorry about the mess, Darling.
# - But there was already some in the fridge.
pudding_to_eat = make_pudding(ingredients)
pudding_to_eat is fridge_pudding
# => False
is and is not are the two identity operators in Python. is operator does not compare the values of the variables, but compares the identities of the variables. Consider this:
>>> a = [1,2,3]
>>> b = [1,2,3]
>>> hex(id(a))
'0x1079b1440'
>>> hex(id(b))
'0x107960878'
>>> a is b
False
>>> a == b
True
>>>
The above example shows you that the identity (can also be the memory address in Cpython) is different for both a and b (even though their values are the same). That is why when you say a is b it returns false due to the mismatch in the identities of both the operands. However when you say a == b, it returns true because the == operation only verifies if both the operands have the same value assigned to them.
Interesting example (for the extra grade):
>>> del a
>>> del b
>>> a = 132
>>> b = 132
>>> hex(id(a))
'0x7faa2b609738'
>>> hex(id(b))
'0x7faa2b609738'
>>> a is b
True
>>> a == b
True
>>>
In the above example, even though a and b are two different variables, a is b returned True. This is because the type of a is int which is an immutable object. So python (I guess to save memory) allocated the same object to b when it was created with the same value. So in this case, the identities of the variables matched and a is b turned out to be True.
This will apply for all immutable objects:
>>> del a
>>> del b
>>> a = "asd"
>>> b = "asd"
>>> hex(id(a))
'0x1079b05a8'
>>> hex(id(b))
'0x1079b05a8'
>>> a is b
True
>>> a == b
True
>>>
Hope that helps.
x is y is same as id(x) == id(y), comparing identity of objects.
As #tomasz-kurgan pointed out in the comment below is operator behaves unusually with certain objects.
E.g.
>>> class A(object):
... def foo(self):
... pass
...
>>> a = A()
>>> a.foo is a.foo
False
>>> id(a.foo) == id(a.foo)
True
Ref;
https://docs.python.org/2/reference/expressions.html#is-not
https://docs.python.org/2/reference/expressions.html#id24
As you can check here to a small integers. Numbers above 257 are not an small ints, so it is calculated as a different object.
It is better to use == instead in this case.
Further information is here: http://docs.python.org/2/c-api/int.html
X points to an array, Y points to a different array. Those arrays are identical, but the is operator will look at those pointers, which are not identical.
It compares object identity, that is, whether the variables refer to the same object in memory. It's like the == in Java or C (when comparing pointers).
A simple example with fruits
fruitlist = [" apple ", " banana ", " cherry ", " durian "]
newfruitlist = fruitlist
verynewfruitlist = fruitlist [:]
print ( fruitlist is newfruitlist )
print ( fruitlist is verynewfruitlist )
print ( newfruitlist is verynewfruitlist )
Output:
True
False
False
If you try
fruitlist = [" apple ", " banana ", " cherry ", " durian "]
newfruitlist = fruitlist
verynewfruitlist = fruitlist [:]
print ( fruitlist == newfruitlist )
print ( fruitlist == verynewfruitlist )
print ( newfruitlist == verynewfruitlist )
The output is different:
True
True
True
That's because the == operator compares just the content of the variable. To compare the identities of 2 variable use the is operator
To print the identification number:
print ( id( variable ) )
The is operator is nothing but an English version of ==.
Because the IDs of the two lists are different so the answer is false.
You can try:
a=[1,2,3]
b=a
print(b is a )#True
*Because the IDs of both the list would be same
In Scala I could test if a string has a capital letter like this:
val nameHasUpperCase = name.exists(_.isUpper)
The most comprehensive form in Python I can think of is:
a ='asdFggg'
functools.reduce(lambda x, y: x or y, [c.isupper() for c in a])
->True
Somewhat clumsy. Is there a better way to do this?
The closest to the Scala statement is probably an any(..) statement here:
any(x.isupper() for x in a)
This will work in using a generator: from the moment such element is found, any(..) will stop and return True.
This produces:
>>> a ='asdFggg'
>>> any(x.isupper() for x in a)
True
Or another one with map(..):
any(map(str.isupper,a))
Another way of doing this would be comparing the original string to it being completely lower case:
>>> a ='asdFggg'
>>> a == a.lower()
False
And if you want this to return true, then use != instead of ==
There is also
nameHasUpperCase = bool(re.search(r'[A-Z]', name))
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
I have a question. If I have a string, for example:
str1 = 'Wazzup1'
and numbers:
nums = '1234567890'
I need a code that will look into str1 and tell me if it has any number(not all of them). Please help.
Use any and a generator expression:
any(x in nums for x in str1)
Below is a demonstration:
>>> str1 = 'Wazzup1'
>>> nums = '1234567890'
>>> any(x in nums for x in str1)
True
>>>
Note that the above is for when you have a custom set of numbers to test for. However, if you are just looking for digits, then a cleaner approach would be to use str.isdigit:
>>> str1 = 'Wazzup1'
>>> any(x.isdigit() for x in str1)
True
>>>
Use the any function, which returns a Boolean which is true iff at least one of the elements in the iterable is true.
string = 'Wazzup1'
result = any(c.isdigit() for c in string)
print(result) # True
As most Python programmers will tell you, the most concise way to do this is using:
any(x in nums for x in str1)
However, if you're new to Python or need a better grasp of the basics of string manipulation, then you should learn how to do this using more fundamental tools.
You can access the individual elements of a string, list, tuple, or any other iterable in Python using square brackets around an index. The characters of a string are indexed starting from 0 (e.g. "hello"[0] gives "h").
Using a for loop, the solution is easier to understand for a Python newbie than the above-mentioned any solution:
result = False
for i in range(len(str1)):
if str1[i] in nums:
result = True
A Python for loop can also iterate directly over the elements of the string:
result = False
for x in str1:
if x in nums:
result = True
In the first code snippet in this post, the expression x in nums for x in str1 uses Python's list comprehension feature. This goes through every element x of str1 and finds the result of x in nums. any(x in nums for x in str1) returns True if (and only if) at least one of these results is True (meaning a numerical digit is in str1). This is much like the second for loop example given in this post, and many Python programmers choose this option because it is concise and still understandable by other Python programmers.
You can use any() and string.digits to check whether the string contain a digit:
import string
if any(x in string.digits for x in str1):
pass
You could also use a regular expression:
>>> str1 = 'Wazzup1'
>>> import re
>>> bool(re.search(r'\d', str1))
True
Note: there might be a difference in how c.isdigit(), c in nums, int(c) and \d define what is a digit due to locale or Unicode.