Compare strings with unicode characters in python [duplicate]

This question already has answers here:
How do I compare a Unicode string that has different bytes, but the same value?
(3 answers)
Closed 2 years ago.
Here is the substring Ritē
I have two strings, one is from the extracted file name by zipfile. I used filename.encode('cp437').decode('utf-8') to have all the paths extracted correctly. The other one is read from a .plist using plistlib.readPlist(). Both are printed correctly using print(). However, they are not the same in comparison. I tried to encode both of them in utf-8, here is what they look like:
Rite\xcc\x84
Rit\xc4\x93
One represents the character as a plain e followed by a combining macron; the other uses the single code point 'LATIN SMALL LETTER E WITH MACRON'. Does anyone have any advice on how to compare the two strings? Thank you in advance.

Based on the comments it sounds like this is what you're looking for:
import unicodedata
foo = 'Rit\u0113'
bar = 'Rite\u0304'
print(foo, bar)
print(unicodedata.normalize('NFD', foo))
print(unicodedata.normalize('NFD', bar))
assert unicodedata.normalize('NFD', foo) == unicodedata.normalize('NFD', bar)
I selected NFD as the form, but you may prefer NFC.
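As a small sketch of the same idea, the comparison can be wrapped in a helper. The name same_text and its form parameter are made up for illustration; only unicodedata.normalize comes from the standard library. The byte strings are the ones shown in the question.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "Rit\u0113"   # single code point: e with macron
decomposed = "Rite\u0304"   # plain e followed by a combining macron

# The raw strings differ, and so do their UTF-8 bytes.
print(precomposed == decomposed)
print(precomposed.encode("utf-8"))
print(decomposed.encode("utf-8"))

# After normalization they compare equal, whichever form is chosen.
print(same_text(precomposed, decomposed))
print(same_text(precomposed, decomposed, form="NFD"))
```

NFC yields the shorter, precomposed string and NFD the longer, decomposed one, so pick one form and apply it to both sides before comparing or using the strings as dictionary keys.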

Related

How to correctly (re) interpret encoded text? [duplicate]

This question already has answers here:
python - how to apply backspaces to a string [closed]
(2 answers)
Closed 1 year ago.
As a simple use case in Python, I wish to convert some encoded text and set it equal to a variable or dictionary key as would be printed on screen. This issue came about by piping some std out to memory from a command line function where some of the text didn't seem to be properly interpreted in python.
Example:
myVar = "N\x08NA\x08AM\x08ME\x08E"
print(myVar)
="NAME"
When myVar is input as a dictionary key, I get the following result:
myDict = {}
myDict[myVar] = 'foobar'
print(myDict.keys())
=dict_keys(['N\x08NA\x08AM\x08ME\x08E'])
How can I make myDict.keys() return dict_keys(['NAME'])?
Same question for a variable where
myVar = "NAME"
rather than 'N\x08NA\x08AM\x08ME\x08E'
I've tried variants of myVar.encode() and str(myVar) with no success.
You can easily remove every character that would be erased by a backspace (\x08 or \b) with a regular expression.
import re
re.sub('.\x08', '', "N\x08NA\x08AM\x08ME\x08E")
='NAME'
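One caveat: a single pass of that regular expression removes one character per backspace, so a run of consecutive backspaces is not fully applied. A minimal sketch that walks the string instead is shown below. The function name apply_backspaces is hypothetical, and it assumes that a backspace with nothing before it should simply be dropped.

```python
def apply_backspaces(text):
    """Hypothetical helper: return text as it would appear after backspaces are applied."""
    out = []
    for ch in text:
        if ch == "\x08":
            if out:          # assumption: a leading backspace is ignored
                out.pop()
        else:
            out.append(ch)
    return "".join(out)

raw = "N\x08NA\x08AM\x08ME\x08E"
clean = apply_backspaces(raw)

# Clean the text before using it as a dictionary key.
my_dict = {clean: "foobar"}
print(list(my_dict))
```

The cleaned value can also be assigned back to the variable, which covers the second part of the question.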

Python string concatenation of multiple strings separated without comma [duplicate]

This question already has answers here:
String concatenation without '+' operator
(6 answers)
Closed 3 years ago.
Though it might seem a very trivial question, I still want to know the principle behind it. When we write multiple strings together without any comma, Python concatenates them. I was under the impression that it would throw an error. Below is a sample output:
print('hello''world')
# This will output helloworld
Even if I write those multiple strings in the Python REPL, the output is the concatenated form of the strings. Can anyone please explain the logic behind this behavior?
See https://docs.python.org/3.8/reference/lexical_analysis.html#string-literal-concatenation.
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation.
Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings.
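A short sketch of what the quoted passage describes. The variable names are illustrative only. The last case is a common pitfall that follows from the same rule: a forgotten comma between two literals in a list joins them without any error.

```python
# Adjacent literals are joined, even with different quote styles.
greeting = "hello" 'world'

# Splitting a long string across lines, with a comment on each part.
pattern = ("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*")  # letter, digit or underscore

# Pitfall: the missing comma after "beta" silently merges two items.
names = ["alpha", "beta" "gamma"]

print(greeting)
print(pattern)
print(names)
```

This applies only to literals written next to each other. Joining the values of variables still needs + or str.join.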

How to write a string starting with ' and ending with " in Python? [duplicate]

This question already has answers here:
Having both single and double quotation in a Python string
(9 answers)
Closed 5 years ago.
I'd like to save the characters 'bar" as a string variable, but it seems to be more complicated than I thought:
foo = 'bar" is not a valid string.
foo = ''bar"' is not a valid string either.
foo = '''bar"'' is still not valid.
foo = ''''bar"''' actually saves '\'bar"'
What is the proper syntax in this case?
The last string saves '\'bar"' as the representation, but it is the string you're looking for, just print it:
foo = ''''bar"'''
print(foo)
'bar"
When you hit enter in the interactive interpreter, you get its repr, which escapes the second ' (the one that belongs to the string) so that the output is a valid literal.
A triple-quoted literal is the only way to write this as a single literal without explicit escapes. You can get the same result by escaping quotes:
print('\'foo"')
'foo"
print("'foo\"")
'foo"
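To make the difference between the stored value and its representation concrete, here is a small sketch applied to the 'bar" string from the question. The variable names are illustrative. All four spellings produce the same five-character string.

```python
triple = ''''bar"'''        # triple-quoted literal, no escapes
escaped_single = '\'bar"'   # single-quoted, escaping the '
escaped_double = "'bar\""   # double-quoted, escaping the "
adjacent = "'bar" '"'       # two adjacent literals, joined by Python

print(triple)         # the actual characters stored
print(repr(triple))   # what the interactive interpreter echoes
print(len(triple))
print(triple == escaped_single == escaped_double == adjacent)
```

The backslash appears only in the repr; it is not part of the string.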

How to use string slicing inside string.format [duplicate]

This question already has answers here:
Slicing strings in str.format
(6 answers)
Closed 6 years ago.
How can I do variable string slicing inside str.format, like this?
"{0[:2]} Some text {0[2:4]} some text".format("123456")
I want a result like this:
12 Some text 34 some text
You can't. Best you can do is limit how many characters of a string are printed (roughly equivalent to specifying a slice end), but you can't specify arbitrary start or end indices.
Save the data to a named variable and pass the slices to the format method. It's more readable, more intuitive, and easier for the parser to identify errors when they occur:
mystr = "123456"
"{} Some text {} some text".format(mystr[:2], mystr[2:4])
You could move some of the work from that to the format string if you really wanted to, but it's not a huge improvement (and in fact, involves larger temporaries when a slice ends up being needed anyway):
"{:.2s} Some text {:.2s} some text".format(mystr, mystr[2:])
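For comparison, a short sketch that puts the approaches side by side. It also tries the original attempt, which fails because str.format treats the text inside the brackets as a plain string key rather than a slice. The f-string variant is an addition not mentioned in the answer; it requires Python 3.6 or later and accepts arbitrary expressions, including slices.

```python
mystr = "123456"

# The original attempt: ':2' is passed to the string as a key, not a slice.
try:
    "{0[:2]} Some text {0[2:4]} some text".format(mystr)
    slice_in_format_works = True
except TypeError:
    slice_in_format_works = False

with_format = "{} Some text {} some text".format(mystr[:2], mystr[2:4])
with_precision = "{:.2s} Some text {:.2s} some text".format(mystr, mystr[2:])
with_fstring = f"{mystr[:2]} Some text {mystr[2:4]} some text"

print(slice_in_format_works)
print(with_format)
print(with_precision)
print(with_fstring)
```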

format() function in python - Usage of multiple curly brackets {{{}}} [duplicate]

This question already has answers here:
How do I escape curly-brace ({}) characters in a string while using .format (or an f-string)?
(23 answers)
Closed 6 years ago.
again :)
I found this bit of code
col_width=[13,11]
header=['First Name','Last Name']
format_specs = ["{{:{}}}".format(col_width[i]) for i in range(len(col_width))]
lheader=[format_specs[i].format(header[i]) for i in range(len(header))]
How does Python evaluate this statement? Why do we use three { when there is only one element to format in each iteration?
When you write {{ or }}, Python treats it as an escaped literal brace rather than a replacement field, so {{}} becomes the text {}. Here is an example:
>>> '{{}}'.format(3) # with two '{{}}'
'{}' # nothing added to the string, instead made inner `{}` as the part of string
>>> '{{{}}}'.format(3) # with three '{{{}}}'
'{3}' # replaced third one with the number
Similarly, your expression is evaluating as:
>>> '{{:{}}}'.format(3)
'{:3}' # during creation of "format_specs"
For details, refer to the Format String Syntax documentation.
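Putting the two stages together, here is a sketch using the col_width and header lists from the question. The first format call builds a format spec such as '{:13}', and the second applies that spec to a header. The nested-field variant at the end is an alternative not used in the original code; it relies on str.format allowing replacement fields inside a format spec.

```python
col_width = [13, 11]
header = ["First Name", "Last Name"]

# Stage 1: '{{' and '}}' become literal braces, the inner '{}' takes the width.
format_specs = ["{{:{}}}".format(width) for width in col_width]

# Stage 2: each spec pads its header to the requested width.
lheader = [spec.format(title) for spec, title in zip(format_specs, header)]

# Alternative in one step: the width is supplied through a nested field.
nested = ["{:{}}".format(title, width) for title, width in zip(header, col_width)]

print(format_specs)
print(lheader)
print(nested == lheader)
```

Strings are left-aligned by default, so the padding spaces are added on the right.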
